YOLO Series
YOLO V1
You Only Look Once: Unified, Real-Time Object Detection: mAP@VOC07 63.4/45FPS
- Split the image into an SxS (S=7) grid. If an object's center falls into a grid cell, that cell is responsible for detecting it.
- Each grid cell predicts B (B=2) bboxes (x, y, w, h, p), where p is the confidence score Pr(Object) * IOU. Each cell also predicts C class probabilities. The final tensor is 7x7x30. x, y are relative to the grid cell; w, h are relative to the whole image.
- At test time, Pr(class_i | Object) * Pr(Object) * IOU = Pr(class_i) * IOU gives a class-specific confidence score for each box.
- The loss is a sum of squared errors with five terms: the first two are the bounding-box coordinate losses (weighted by lambda_coord = 5), the next two are the confidence losses (the no-object term down-weighted by lambda_noobj = 0.5), and the last is the classification loss.
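The per-cell tensor layout and the test-time score above can be sketched as follows. This is a minimal numpy sketch: `pred` stands in for a real network output, and the assumed layout (B box predictions followed by C class probabilities per cell) follows the paper's 7x7x30 description.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, VOC classes

# `pred` stands in for a real network output; assumed layout per cell:
# B * (x, y, w, h, conf) followed by C conditional class probabilities.
pred = np.random.rand(S, S, B * 5 + C)            # (7, 7, 30)

boxes = pred[..., :B * 5].reshape(S, S, B, 5)     # (7, 7, 2, 5)
class_probs = pred[..., B * 5:]                   # (7, 7, 20) = Pr(class_i | Object)
conf = boxes[..., 4]                              # (7, 7, 2)  = Pr(Object) * IOU

# Class-specific confidence per box:
# Pr(class_i | Object) * Pr(Object) * IOU = Pr(class_i) * IOU
scores = conf[..., None] * class_probs[:, :, None, :]  # (7, 7, 2, 20)
```

In practice these scores are then thresholded and passed through non-maximum suppression.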
Limitations
- Small objects that appear in groups (each cell predicts only B boxes and one class)
- New or unusual aspect ratios
- Poor localization
YOLO V2
YOLO9000: Better, Faster, Stronger: mAP@VOC07 78.6/40FPS
Tricks
- Batch normalization: +2% mAP; allows removing dropout
- High-resolution classifier: fine-tune the classification network at 448x448 for 10 epochs before detection training, +4% mAP
- Convolutional prediction with anchor boxes: recall increases by 7%
- Dimension clusters: get prior anchor boxes with k-means
- Direct location prediction: replace the unconstrained anchor form x = t_x * w_a + x_a with b_x = sigmoid(t_x) + c_x; with dimension clusters, +5% mAP in total
- Fine-grained features: merge high- and low-level features with a pass-through layer, +1% mAP
- Multi-scale training: randomly choose from multiples of 32 {320, 352, ...608} every 10 epochs
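The dimension-clusters trick above can be sketched as plain k-means over box sizes with d = 1 - IOU as the distance, as in the paper. `wh_iou` and `kmeans_anchors` are hypothetical helper names, and the data here is synthetic:

```python
import numpy as np

def wh_iou(wh, anchors):
    """IOU between (w, h) pairs assuming shared centers, so only sizes matter."""
    inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(wh[:, None, 1], anchors[None, :, 1])
    union = (wh[:, 0] * wh[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union                                 # (N, K)

def kmeans_anchors(wh, k, iters=50, seed=0):
    """k-means on box sizes with d = 1 - IOU as the distance metric."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]  # random initial priors
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, anchors), axis=1)  # max IOU = min distance
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors

# Synthetic box sizes: a small cluster and a large cluster
wh = np.array([[1.0, 1.0], [1.2, 1.0], [10.0, 10.0], [9.0, 11.0]])
anchors = kmeans_anchors(wh, k=2)
```

Using 1 - IOU rather than Euclidean distance keeps large boxes from dominating the clustering.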
The backbone is Darknet-19 (19 convolutional layers). The final tensor has (5 + 20) * 5 anchors = 125 channels.
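The direct-location-prediction decode above can be illustrated with a few lines of numpy. All the numeric values here are made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw offsets for one anchor at grid cell (c_x, c_y) on a 13x13 map,
# with prior size (p_w, p_h); all values here are illustrative.
t_x, t_y, t_w, t_h = 0.5, -0.3, 0.2, 0.1
c_x, c_y = 3, 2
p_w, p_h = 3.5, 4.2

# sigmoid bounds the predicted center inside its own cell, which is what makes
# this parameterization stable compared to the unconstrained x = t_x * w_a + x_a.
b_x = sigmoid(t_x) + c_x        # in (3, 4): stays inside cell column 3
b_y = sigmoid(t_y) + c_y        # in (2, 3): stays inside cell row 2
b_w = p_w * np.exp(t_w)         # width scales the anchor prior
b_h = p_h * np.exp(t_h)
```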
YOLO V3
YOLOv3: An Incremental Improvement: mAP COCO@0.5 57.9
Tricks
- Darknet-53: 77.2 top-1, 78 FPS (vs. ResNet-152: 77.5 top-1, 37 FPS). 53 conv layers, no avg-pooling layer.
- Predicts boxes at 3 scales, 3 boxes per scale. The output at each scale is N x N x (3 * (4 + 1 + 80)), where N is the feature map size.
- Sampling: each ground truth is assigned to the single anchor with the best overlap. Anchors that overlap a gt with IOU > 0.5 but are not the best match are ignored; all other unassigned anchors incur no regression or class loss, only objectness loss.
- Use binary cross entropy for class predictions (independent logistic classifiers instead of softmax, allowing multi-label boxes).
- YOLOv3 uses a sum-of-squared-error loss for box regression; YOLOv3-SPP uses CIOU.
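The per-class BCE above can be sketched as follows; `bce_class_loss` is a hypothetical helper name, and the example inputs are synthetic:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_class_loss(logits, targets):
    """Independent logistic classifiers with binary cross entropy, summed over
    classes. Unlike softmax, this allows overlapping labels on a single box."""
    p = sigmoid(logits)
    eps = 1e-9                  # guard against log(0)
    return -np.sum(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

# One box's class logits over 80 COCO classes, two labels active (multi-label)
logits = np.zeros(80)
targets = np.zeros(80)
targets[[0, 5]] = 1.0
loss = bce_class_loss(logits, targets)  # zero logits put every class at p = 0.5
```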
More tricks
With SPP, performance increases from 32.7 to 35.6 mAP COCO@0.5:0.95. The ultralytics implementation further increases this to 42.6 mAP.
- Mosaic image augmentation: piece 4 images together as one input
- Insert an SPP layer before the first feature-map output level
- IOU -> GIOU -> DIOU -> CIOU
  - IOU: loss gives no gradient when the boxes have no intersection
  - GIOU = IOU - (A_c - u) / A_c, where A_c is the area of the smallest enclosing box and u is the union area; -1 <= GIOU <= 1, loss = 1 - GIOU. Degenerates to IOU when one box contains the other.
  - DIOU = IOU - d^2 / c^2, where d is the distance between box centers and c is the diagonal length of the smallest enclosing box.
- CIOU: considers overlap area, center point distance and aspect ratio.
- Focal loss: only a handful of anchors match gt bboxes, so easy negatives dominate the loss; the imbalance is worse for one-stage detectors. CE(p_t) = -alpha_t log(p_t) -> FL(p_t) = -alpha_t * (1 - p_t)^gamma log(p_t), an alternative to hard negative mining.
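The IOU variants and the focal loss formula above can be written out directly. This is a minimal sketch for single box pairs in (x1, y1, x2, y2) form, not a batched training loss:

```python
import numpy as np

def iou(a, b):
    """Plain IOU for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def giou(a, b):
    """GIOU = IOU - (A_c - u) / A_c; nonzero gradient even with no overlap."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    a_c = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (a_c - union) / a_c

def diou(a, b):
    """DIOU = IOU - d^2 / c^2: d = center distance, c = enclosing-box diagonal."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    d2 = (ax - bx) ** 2 + (ay - by) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + \
         (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou(a, b) - d2 / c2

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t): the (1 - p_t)^gamma
    factor shrinks the loss of easy, well-classified examples."""
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

For two disjoint unit boxes, IOU is 0 while GIOU and DIOU are negative, which is exactly why they still provide a training signal without overlap.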