YOLO Series

YOLO V1

You Only Look Once: Unified, Real-Time Object Detection: mAP@VOC07 63.4/45FPS

yolo-v1 framework

  • Split the image into an S x S (S=7) grid. If an object's center falls into a grid cell, that cell is responsible for detecting it.
  • Each cell predicts B (B=2) bboxes (x, y, w, h, p), where p is the confidence score Pr(Object) * IOU. Each cell also predicts C (C=20) class probabilities. The final tensor is 7x7x30 = 7x7x(B*5 + C). x, y are relative to the grid cell and w, h are relative to the whole image.
  • At test time, Pr(class_i | Object) * Pr(Object) * IOU = Pr(class_i) * IOU gives the class-specific confidence for each box.
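The test-time class-specific confidence can be sketched in numpy. The (x, y, w, h, p)-then-classes tensor layout is an assumption for illustration; only S, B, C and the product formula come from the notes above:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, VOC classes

def class_confidences(pred):
    """pred: (S, S, B*5 + C) tensor; layout assumed to be B boxes of
    (x, y, w, h, p) followed by C class probabilities per cell."""
    boxes = pred[..., :B * 5].reshape(S, S, B, 5)
    conf = boxes[..., 4]                 # Pr(Object) * IOU, per box
    cls = pred[..., B * 5:]              # Pr(class_i | Object), per cell
    # class-specific confidence: Pr(class_i | Object) * Pr(Object) * IOU
    return conf[..., None] * cls[:, :, None, :]  # (S, S, B, C)

scores = class_confidences(np.random.rand(S, S, B * 5 + C))
print(scores.shape)  # (7, 7, 2, 20)
```

These scores are what non-maximum suppression is then run over.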

yolo-v1 loss

  • The first two terms are bounding-box coordinate losses.
  • The next two terms are confidence losses (object and no-object).
  • The last term is the classification loss.
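A minimal numpy sketch of this loss, with the paper's weights lambda_coord = 5 and lambda_noobj = 0.5; the flattened-array calling convention and mask names are assumptions for illustration:

```python
import numpy as np

L_COORD, L_NOOBJ = 5.0, 0.5  # weights from the YOLO v1 paper

def yolo_v1_loss(pred_box, true_box, pred_conf, true_conf,
                 pred_cls, true_cls, obj_mask, cell_mask):
    """Arrays are flattened over grid cells/boxes (assumed layout).
    obj_mask: 1 where a box is responsible for an object;
    cell_mask: 1 where a cell contains an object."""
    x, y, w, h = [pred_box[..., i] for i in range(4)]
    tx, ty, tw, th = [true_box[..., i] for i in range(4)]
    # first two terms: coordinate losses (sqrt on w, h)
    xy = np.sum(obj_mask * ((x - tx) ** 2 + (y - ty) ** 2))
    wh = np.sum(obj_mask * ((np.sqrt(w) - np.sqrt(tw)) ** 2 +
                            (np.sqrt(h) - np.sqrt(th)) ** 2))
    # next two terms: confidence losses, object vs. no-object boxes
    conf_obj = np.sum(obj_mask * (pred_conf - true_conf) ** 2)
    conf_noobj = np.sum((1 - obj_mask) * (pred_conf - true_conf) ** 2)
    # last term: classification loss over cells containing objects
    cls = np.sum(cell_mask[..., None] * (pred_cls - true_cls) ** 2)
    return L_COORD * (xy + wh) + conf_obj + L_NOOBJ * conf_noobj + cls
```

The sqrt on w, h down-weights errors on large boxes relative to small ones.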

limitations

  • Cluttered small objects
  • New or unusual aspect ratios
  • Poor localization

YOLO V2

YOLO9000: Better, Faster, Stronger: mAP@VOC07 78.6/40FPS

Tricks

  • Batch normalization: +2% mAP; allows removing dropout
  • High-resolution classifier: fine-tune the classifier for 10 epochs at 448x448, +4% mAP
  • Convolutional with anchor boxes: +7% recall
  • Dimension clusters: obtain prior anchor boxes with k-means (IOU-based distance)
  • Direct location prediction: replace x = t_x * w_a + x_a with b_x = sigmoid(t_x) + c_x; together with dimension clusters, +5% mAP
  • Fine-grained features: merge high/low-level features with a pass-through layer, +1% mAP
  • Multi-scale training: every 10 batches, randomly choose an input size from multiples of 32 in {320, 352, ..., 608}
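The direct location prediction can be sketched as follows; the sigmoid keeps the predicted center inside its grid cell, and w, h scale the anchor prior (the exp parameterization b_w = p_w * exp(t_w) is the paper's, the function shape here is illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t, cell, prior):
    """t = (t_x, t_y, t_w, t_h) raw predictions; cell = (c_x, c_y)
    top-left offset of the grid cell; prior = (p_w, p_h) anchor size."""
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    bx = sigmoid(tx) + cx        # center constrained to its cell
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)         # width/height scale the anchor prior
    bh = ph * np.exp(th)
    return bx, by, bw, bh

print(decode_box((0.0, 0.0, 0.0, 0.0), (3, 4), (2, 2)))  # (3.5, 4.5, 2.0, 2.0)
```

Bounding the center offset to [0, 1] is what stabilizes early training compared with the unconstrained x = t_x * w_a + x_a form.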

The backbone is Darknet-19 (19 convolutional layers). The final tensor has 5 anchors x (5 + 20) = 125 channels.

YOLO V3

YOLOv3: An Incremental Improvement: mAP COCO@0.5 57.9

Tricks

  • Darknet-53: 77.2 top-1 at 78 FPS (ResNet-152: 77.5 top-1 at 37 FPS). 53 conv layers, no avg-pooling layer.
  • Predicts boxes at 3 scales, with 3 boxes per scale. The output at each scale is N x N x (3 * (4 + 1 + 80)), where N is the feature map size.
  • Sampling: each ground truth is assigned the anchor with the best IOU; anchors whose IOU exceeds 0.5 but are not the best are ignored. If an anchor is not assigned to a gt, it incurs no loss for regression or class predictions, only objectness.
  • Use binary cross entropy for class predictions (multi-label, no softmax).
  • YOLOv3 uses sum-of-squared-error loss for regression. YOLOv3-SPP uses CIOU loss.
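The per-class binary cross entropy can be sketched in numpy; the (num_anchors, num_classes) shapes and mean reduction are assumptions for illustration:

```python
import numpy as np

def bce_class_loss(logits, targets):
    """Independent binary cross entropy per class (multi-label),
    instead of a softmax over classes.
    logits, targets: (num_anchors, num_classes)."""
    p = 1.0 / (1.0 + np.exp(-logits))   # per-class sigmoid
    eps = 1e-9                          # numerical stability
    return -np.mean(targets * np.log(p + eps) +
                    (1 - targets) * np.log(1 - p + eps))
```

Treating classes independently lets one box carry overlapping labels (e.g. "woman" and "person"), which a softmax cannot express.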

More tricks

With SPP, performance increases from 32.7 to 35.6 mAP COCO@0.5:0.95. The ultralytics implementation further increases this to 42.6 mAP.
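An SPP block can be sketched as stride-1 max pools at several kernel sizes concatenated with the input along channels; the kernel sizes (5, 9, 13) are the commonly used YOLOv3-SPP choice and are assumed here:

```python
import numpy as np

def max_pool_same(x, k):
    """x: (C, H, W); stride-1 max pool with 'same' padding,
    implemented with a -inf border so edges are handled correctly."""
    c, h, w = x.shape
    p = k // 2
    xp = np.full((c, h + 2 * p, w + 2 * p), -np.inf)
    xp[:, p:p + h, p:p + w] = x
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def spp(x, kernels=(5, 9, 13)):
    """Concatenate the input with its pooled versions: C -> 4C channels."""
    return np.concatenate([x] + [max_pool_same(x, k) for k in kernels], axis=0)
```

Because every pool is stride 1 with same padding, spatial resolution is preserved; only the channel count grows, which is why the block drops in before an output head.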

  • Mosaic image augmentation: stitch 4 images together as one input
  • Insert an SPP layer before the first feature-map output level
  • IOU -> GIOU -> DIOU -> CIOU
    • IOU: when boxes do not intersect, IOU = 0 and the loss provides no gradient
    • GIOU: IOU - (A_c - u) / A_c, where A_c is the area of the smallest enclosing box and u is the union area; -1 <= GIOU <= 1, loss = 1 - GIOU. Degrades to IOU when the enclosing box equals the union.
    • DIOU: IOU - d^2 / c^2, where d is the distance between box centers and c is the diagonal length of the smallest enclosing box.
    • CIOU: considers overlap area, center-point distance, and aspect ratio.
  • Focal loss: only a dozen anchors are matched to gt bboxes, an imbalance further exacerbated in one-stage networks. CE(p_t) = -alpha_t log(p_t) -> FL(p_t) = -alpha_t * (1 - p_t)^gamma log(p_t); acts as a soft form of hard negative mining.
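The IOU-variant and focal-loss formulas can be sketched directly from their definitions; the (x1, y1, x2, y2) box convention and the alpha = 0.25, gamma = 2 defaults are assumptions for illustration:

```python
import numpy as np

def iou_giou_diou(b1, b2):
    """Boxes as (x1, y1, x2, y2). Returns (IOU, GIOU, DIOU)."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = a1 + a2 - inter
    iou = inter / union
    # smallest enclosing box
    ex1, ey1 = min(b1[0], b2[0]), min(b1[1], b2[1])
    ex2, ey2 = max(b1[2], b2[2]), max(b1[3], b2[3])
    ac = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (ac - union) / ac
    # DIOU: squared center distance d^2 over squared enclosing diagonal c^2
    c1 = ((b1[0] + b1[2]) / 2, (b1[1] + b1[3]) / 2)
    c2 = ((b2[0] + b2[2]) / 2, (b2[1] + b2[3]) / 2)
    d2 = (c1[0] - c2[0]) ** 2 + (c1[1] - c2[1]) ** 2
    cdiag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    diou = iou - d2 / cdiag2
    return iou, giou, diou

def focal_loss(pt, alpha=0.25, gamma=2.0):
    """Focal loss on pt, the predicted probability of the true class;
    the (1 - pt)^gamma factor down-weights easy, well-classified examples."""
    return -alpha * (1 - pt) ** gamma * np.log(pt)
```

Note that for disjoint boxes IOU is 0 regardless of distance, while GIOU goes negative as the boxes move apart, which is what restores a useful gradient.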