YOLO Series
YOLO V1
You Only Look Once: Unified, Real-Time Object Detection: mAP@VOC07 63.4/45FPS
- Split the image into an SxS (S=7) grid. If an object's center falls into a grid cell, that cell is responsible for detecting it.
- Each grid cell predicts B (B=2) bboxes (x, y, w, h, p), where p is the confidence score Pr(Object) * IOU. Each cell also predicts C class probabilities. The final tensor is 7x7x30. x, y are relative to the grid cell; w, h are relative to the whole image.
- At test time, Pr(class_i | Object) * Pr(Object) * IOU = Pr(class_i) * IOU gives a class-specific confidence score for each box.
- The loss is a sum of squared errors with five terms: the first two are the bounding-box coordinate losses (weighted by lambda_coord = 5), the next two are the confidence losses (the no-object term down-weighted by lambda_noobj = 0.5), and the last is the classification loss.
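The per-cell tensor layout and the test-time score above can be sketched as follows. This is a minimal numpy sketch: `pred` stands in for a real network output, and the assumed layout (B box predictions followed by C class probabilities per cell) follows the paper's 7x7x30 description.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, VOC classes

# `pred` stands in for a real network output; assumed layout per cell:
# B * (x, y, w, h, conf) followed by C conditional class probabilities.
pred = np.random.rand(S, S, B * 5 + C)            # (7, 7, 30)

boxes = pred[..., :B * 5].reshape(S, S, B, 5)     # (7, 7, 2, 5)
class_probs = pred[..., B * 5:]                   # (7, 7, 20) = Pr(class_i | Object)
conf = boxes[..., 4]                              # (7, 7, 2)  = Pr(Object) * IOU

# Class-specific confidence per box:
# Pr(class_i | Object) * Pr(Object) * IOU = Pr(class_i) * IOU
scores = conf[..., None] * class_probs[:, :, None, :]  # (7, 7, 2, 20)
```

In practice these scores are then thresholded and passed through non-maximum suppression.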
Limitations
- Small objects that appear in groups (each cell predicts only B boxes and one class)
- New or unusual aspect ratios
- Poor localization
YOLO V2
YOLO9000: Better, Faster, Stronger: mAP@VOC07 78.6/40FPS
Tricks
- Batch normalization: +2% mAP; allows removing dropout
- High-resolution classifier: fine-tune the classification network at 448x448 for 10 epochs before detection training, +4% mAP
- Convolutional prediction with anchor boxes: recall increases by 7%
- Dimension clusters: get prior anchor boxes with k-means
- Direct location prediction: replace the unconstrained anchor form x = t_x * w_a + x_a with b_x = sigmoid(t_x) + c_x; with dimension clusters, +5% mAP in total
- Fine-grained features: merge high- and low-level features with a pass-through layer, +1% mAP
- Multi-scale training: randomly choose from multiples of 32 {320, 352, ...608} every 10 epochs
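The dimension-clusters trick above can be sketched as plain k-means over box sizes with d = 1 - IOU as the distance, as in the paper. `wh_iou` and `kmeans_anchors` are hypothetical helper names, and the data here is synthetic:

```python
import numpy as np

def wh_iou(wh, anchors):
    """IOU between (w, h) pairs assuming shared centers, so only sizes matter."""
    inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(wh[:, None, 1], anchors[None, :, 1])
    union = (wh[:, 0] * wh[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union                                 # (N, K)

def kmeans_anchors(wh, k, iters=50, seed=0):
    """k-means on box sizes with d = 1 - IOU as the distance metric."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]  # random initial priors
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, anchors), axis=1)  # max IOU = min distance
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors

# Synthetic box sizes: a small cluster and a large cluster
wh = np.array([[1.0, 1.0], [1.2, 1.0], [10.0, 10.0], [9.0, 11.0]])
anchors = kmeans_anchors(wh, k=2)
```

Using 1 - IOU rather than Euclidean distance keeps large boxes from dominating the clustering.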
The backbone is Darknet-19 (19 convolutional layers). The final tensor has (5 + 20) * 5 anchors = 125 channels.
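The direct-location-prediction decode above can be illustrated with a few lines of numpy. All the numeric values here are made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw offsets for one anchor at grid cell (c_x, c_y) on a 13x13 map,
# with prior size (p_w, p_h); all values here are illustrative.
t_x, t_y, t_w, t_h = 0.5, -0.3, 0.2, 0.1
c_x, c_y = 3, 2
p_w, p_h = 3.5, 4.2

# sigmoid bounds the predicted center inside its own cell, which is what makes
# this parameterization stable compared to the unconstrained x = t_x * w_a + x_a.
b_x = sigmoid(t_x) + c_x        # in (3, 4): stays inside cell column 3
b_y = sigmoid(t_y) + c_y        # in (2, 3): stays inside cell row 2
b_w = p_w * np.exp(t_w)         # width scales the anchor prior
b_h = p_h * np.exp(t_h)
```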
YOLO V3
YOLOv3: An Incremental Improvement: mAP COCO@0.5 57.9
Tricks
- Darknet-53: 77.2 top-1, 78 FPS (vs. ResNet-152: 77.5 top-1, 37 FPS). 53 conv layers, no avg-pooling layer.
- Predicts boxes at 3 scales, 3 boxes per scale. The output at each scale is N x N x (3 * (4 + 1 + 80)), where N is the feature map size.
- Sampling: each ground truth is assigned to the single anchor with the best overlap. Anchors that overlap a gt with IOU > 0.5 but are not the best match are ignored; all other unassigned anchors incur no regression or class loss, only objectness loss.
- Use binary cross entropy for class predictions (independent logistic classifiers instead of softmax, allowing multi-label boxes).
- YOLOv3 uses a sum-of-squared-error loss for box regression; YOLOv3-SPP uses CIOU.
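The per-class BCE above can be sketched as follows; `bce_class_loss` is a hypothetical helper name, and the example inputs are synthetic:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_class_loss(logits, targets):
    """Independent logistic classifiers with binary cross entropy, summed over
    classes. Unlike softmax, this allows overlapping labels on a single box."""
    p = sigmoid(logits)
    eps = 1e-9                  # guard against log(0)
    return -np.sum(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

# One box's class logits over 80 COCO classes, two labels active (multi-label)
logits = np.zeros(80)
targets = np.zeros(80)
targets[[0, 5]] = 1.0
loss = bce_class_loss(logits, targets)  # zero logits put every class at p = 0.5
```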
More tricks
With SPP, performance increases from 32.7 to 35.6 mAP COCO@0.5:0.95. The ultralytics implementation further increases this to 42.6 mAP.
- Mosaic image augmentation: piece 4 images together as one input
- Insert an SPP layer before the first feature-map output level
- IOU -> GIOU -> DIOU -> CIOU
  - IOU: loss gives no gradient when the boxes have no intersection
  - GIOU = IOU - (A_c - u) / A_c, where A_c is the area of the smallest enclosing box and u is the union area; -1 <= GIOU <= 1, loss = 1 - GIOU. Degenerates to IOU when one box contains the other.
  - DIOU = IOU - d^2 / c^2, where d is the distance between box centers and c is the diagonal length of the smallest enclosing box.
- CIOU: considers overlap area, center point distance and aspect ratio.
- Focal loss: only a handful of anchors match gt bboxes, so easy negatives dominate the loss; the imbalance is worse for one-stage detectors. CE(p_t) = -alpha_t log(p_t) -> FL(p_t) = -alpha_t * (1 - p_t)^gamma log(p_t), an alternative to hard negative mining.
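The IOU variants and the focal loss formula above can be written out directly. This is a minimal sketch for single box pairs in (x1, y1, x2, y2) form, not a batched training loss:

```python
import numpy as np

def iou(a, b):
    """Plain IOU for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def giou(a, b):
    """GIOU = IOU - (A_c - u) / A_c; nonzero gradient even with no overlap."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    a_c = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (a_c - union) / a_c

def diou(a, b):
    """DIOU = IOU - d^2 / c^2: d = center distance, c = enclosing-box diagonal."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    d2 = (ax - bx) ** 2 + (ay - by) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + \
         (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou(a, b) - d2 / c2

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t): the (1 - p_t)^gamma
    factor shrinks the loss of easy, well-classified examples."""
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

For two disjoint unit boxes, IOU is 0 while GIOU and DIOU are negative, which is exactly why they still provide a training signal without overlap.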