Overview of Faster-RCNN
Codebase Hierarchy
- backbone: feature extraction
- network_files: Fast-RCNN & RPN
- train_utils: cocotools etc
- dataset class for VOC
- MobileNetV2 as backbone
- resnet50 + FPN as backbone
- multi-gpu training
- inference script
- generate record_mAP.txt on COCO
- pascal_voc_classes.json: Pascal VOC class labels
Code Analysis
Entry Point
As in the framework diagram, Faster-RCNN has three major components: backbone, rpn, and roi_heads.
- backbone: takes images as input and outputs feature maps.
- rpn: takes features and targets as input and is responsible for anchor generation, matching anchors to targets, sampling, computing losses (objectness and box regression), and generating proposals.
- roi_heads: takes features, proposals, image sizes, and targets as input, generates detections, and computes losses. Its role overlaps with that of the rpn; the difference is that it refines the input proposals and computes classification losses over object classes rather than objectness.
class FasterRCNNBase(nn.Module):
    """
    Main class for Generalized R-CNN.

    Arguments:
        backbone (nn.Module):
        rpn (nn.Module):
        roi_heads (nn.Module): takes the features + the proposals from the RPN and computes
            detections / masks from it.
        transform (nn.Module): performs the data transformation from the inputs to feed into
            the model
    """

    def __init__(self, backbone, rpn, roi_heads, transform):
        super(FasterRCNNBase, self).__init__()
        self.transform = transform
        self.backbone = backbone
        self.rpn = rpn
        self.roi_heads = roi_heads
        # used only on torchscript mode
        self._has_warned = False
    def eager_outputs(self, losses, detections):
        # return the losses in training mode and the detections otherwise
        if self.training:
            return losses
        return detections

    def forward(self, images, targets=None):
        # type: (List[Tensor], Optional[List[Dict[str, Tensor]]]) -> Tuple[Dict[str, Tensor], List[Dict[str, Tensor]]]
        """
        Arguments:
            images (list[Tensor]): images to be processed
            targets (list[Dict[Tensor]]): ground-truth boxes present in the image (optional)

        Returns:
            result (list[BoxList] or dict[Tensor]): the output from the model.
                During training, it returns a dict[Tensor] which contains the losses.
                During testing, it returns list[BoxList] contains additional fields
                like `scores`, `labels` and `mask` (for Mask R-CNN models).
        """
        if self.training and targets is None:
            raise ValueError("In training mode, targets should be passed")

        if self.training:
            assert targets is not None
            for target in targets:  # further check that each target's boxes are well-formed
                boxes = target["boxes"]
                if isinstance(boxes, torch.Tensor):
                    if len(boxes.shape) != 2 or boxes.shape[-1] != 4:
                        raise ValueError("Expected target boxes to be a tensor "
                                         "of shape [N, 4], got {:}.".format(boxes.shape))
                else:
                    raise ValueError("Expected target boxes to be of type "
                                     "Tensor, got {:}.".format(type(boxes)))

        original_image_sizes = torch.jit.annotate(List[Tuple[int, int]], [])
        for img in images:
            val = img.shape[-2:]
            assert len(val) == 2  # guard against a 1-D input
            original_image_sizes.append((val[0], val[1]))
        # original_image_sizes = [img.shape[-2:] for img in images]

        images, targets = self.transform(images, targets)  # preprocess the images

        # print(images.tensors.shape)
        features = self.backbone(images.tensors)  # feed the images through the backbone to get the feature maps
        if isinstance(features, torch.Tensor):  # single feature level: wrap it in an ordered dict under key '0'
            features = OrderedDict([('0', features)])  # with multiple feature levels, the backbone already returns an ordered dict

        # pass the feature maps and the annotated targets to the rpn
        # proposals: List[Tensor], Tensor_shape: [num_proposals, 4],
        # each proposal is in absolute coordinates, in (x1, y1, x2, y2) format
        proposals, proposal_losses = self.rpn(images, features, targets)

        # pass the rpn output and the annotated targets to the second half of fast rcnn
        detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)

        # post-process the predictions (mainly mapping the bboxes back to the original image scale)
        detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)

        losses = {}
        losses.update(detector_losses)
        losses.update(proposal_losses)

        return self.eager_outputs(losses, detections)
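The final step of forward maps the predicted boxes from the resized image back to the original image scale. A minimal plain-Python sketch of that rescaling follows. It assumes sizes are (height, width) pairs and boxes are (x1, y1, x2, y2); the helper name rescale_boxes is hypothetical and is not the repository's postprocess implementation.

```python
def rescale_boxes(boxes, resized_size, original_size):
    """Map (x1, y1, x2, y2) boxes from the resized image back to the original image.

    Both sizes are (height, width) tuples, matching the img.shape[-2:] convention.
    """
    ratio_h = original_size[0] / resized_size[0]
    ratio_w = original_size[1] / resized_size[1]
    # x coordinates scale with the width ratio, y coordinates with the height ratio
    return [
        (x1 * ratio_w, y1 * ratio_h, x2 * ratio_w, y2 * ratio_h)
        for x1, y1, x2, y2 in boxes
    ]


# Example: a 480x640 (h, w) image was resized to 600x800 before entering the network.
boxes_on_resized = [(100.0, 150.0, 400.0, 300.0)]
boxes_on_original = rescale_boxes(boxes_on_resized, (600, 800), (480, 640))
```

This is why forward records original_image_sizes before calling the transform: the resize ratio cannot be recovered afterwards from the transformed batch alone.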
The RPN module first generates anchors, together with objectness (cls) predictions and box offsets for each anchor. Proposals are then built by applying the predicted offsets to the anchor templates and are filtered. During training, the RPN loss is computed from the predictions and the targets matched to the anchors.
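Building proposals from anchors and predicted offsets can be sketched as below. This is a plain-Python illustration that assumes the standard Faster R-CNN box parameterization (dx, dy, dw, dh) with unit weights; the function name decode_boxes is hypothetical, and the clamping of dw/dh that real implementations usually apply is left out.

```python
import math


def decode_boxes(anchors, deltas):
    """Apply (dx, dy, dw, dh) offsets to (x1, y1, x2, y2) anchors."""
    proposals = []
    for (x1, y1, x2, y2), (dx, dy, dw, dh) in zip(anchors, deltas):
        width, height = x2 - x1, y2 - y1
        ctr_x, ctr_y = x1 + 0.5 * width, y1 + 0.5 * height
        # the center moves by a fraction of the anchor size
        pred_ctr_x = ctr_x + dx * width
        pred_ctr_y = ctr_y + dy * height
        # width and height are scaled in log space
        pred_w = width * math.exp(dw)
        pred_h = height * math.exp(dh)
        proposals.append((
            pred_ctr_x - 0.5 * pred_w,
            pred_ctr_y - 0.5 * pred_h,
            pred_ctr_x + 0.5 * pred_w,
            pred_ctr_y + 0.5 * pred_h,
        ))
    return proposals


# Zero offsets leave the anchor unchanged; dx = 0.5 shifts it by half its width.
unchanged = decode_boxes([(0.0, 0.0, 10.0, 20.0)], [(0.0, 0.0, 0.0, 0.0)])
shifted = decode_boxes([(0.0, 0.0, 10.0, 20.0)], [(0.5, 0.0, 0.0, 0.0)])
```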
Anchor Template Generation
# iterate over the grid_size, stride and cell_anchors of each predicted feature level
for size, stride, base_anchors in zip(grid_sizes, strides, cell_anchors):
    grid_height, grid_width = size
    stride_height, stride_width = stride
    device = base_anchors.device

    # For output anchor, compute [x_center, y_center, x_center, y_center]
    # shape: [grid_width], the x coordinates (columns) on the original image
    shifts_x = torch.arange(0, grid_width, dtype=torch.float32, device=device) * stride_width
    # shape: [grid_height], the y coordinates (rows) on the original image
    shifts_y = torch.arange(0, grid_height, dtype=torch.float32, device=device) * stride_height

    # compute the original-image coordinates of every point on the feature map (the offsets applied to the anchor templates)
    # torch.meshgrid takes the row coordinates and the column coordinates and returns the grid row-coordinate matrix and the grid column-coordinate matrix
    # shape: [grid_height, grid_width]
    shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
    shift_x = shift_x.reshape(-1)
    shift_y = shift_y.reshape(-1)

    # offsets on the original image for the anchor coordinates (xmin, ymin, xmax, ymax)
    # shape: [grid_width*grid_height, 4]
    shifts = torch.stack([shift_x, shift_y, shift_x, shift_y], dim=1)

    # For every (base anchor, output anchor) pair,
    # offset each zero-centered base anchor by the center of the output anchor.
    # add the anchor templates to the offsets to get the coordinates of all anchors on the original image (broadcasting handles the differing shapes)
    shifts_anchor = shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)
    anchors.append(shifts_anchor.reshape(-1, 4))
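To make the broadcasting above concrete, here is a small plain-Python sketch of the same idea for one feature level. The helper names make_base_anchors and grid_anchors are hypothetical. The zero-centered templates use an assumed convention where the aspect ratio is height / width, and no rounding is applied.

```python
import math


def make_base_anchors(scales, aspect_ratios):
    """Zero-centered anchor templates as (x1, y1, x2, y2); ratio = height / width."""
    base = []
    for ratio in aspect_ratios:
        h_ratio = math.sqrt(ratio)
        w_ratio = 1.0 / h_ratio
        for scale in scales:
            w, h = w_ratio * scale, h_ratio * scale
            base.append((-w / 2, -h / 2, w / 2, h / 2))
    return base


def grid_anchors(grid_size, stride, base_anchors):
    """Shift every template to every feature-map cell, in row-major cell order."""
    grid_height, grid_width = grid_size
    stride_height, stride_width = stride
    anchors = []
    for row in range(grid_height):
        for col in range(grid_width):
            shift_x, shift_y = col * stride_width, row * stride_height
            # all templates for one cell are emitted before moving to the next cell
            for x1, y1, x2, y2 in base_anchors:
                anchors.append((x1 + shift_x, y1 + shift_y, x2 + shift_x, y2 + shift_y))
    return anchors


# A 2x3 feature map with stride 16 and a single 32x32 template gives 6 anchors.
base = make_base_anchors([32.0], [1.0])
anchors = grid_anchors((2, 3), (16, 16), base)
```

The total anchor count per level is grid_height * grid_width * len(base_anchors), which is the first dimension of the reshaped tensor in the loop above.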
RPN Head
RPNHead slides a 3x3 convolution over each feature map. Its two outputs, cls_logits and box_pred, are each produced by a 1x1 convolution applied to the 3x3 features.
Region Proposal Network
This module is a wrapper that chains the following functions:
- assign_targets_to_anchors: find the best-matching gt box for each anchor and label each anchor as positive, negative (background), or ignored.
- filter_proposals: drop boxes that are too small, perform NMS, and keep the post_nms_top_n highest-scoring proposals.
- compute_loss: calculate the RPN loss, consisting of the classification (objectness) loss and the box regression loss.
- forward: chain the functions above and return boxes and losses to the main entry point.
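The filter_proposals step can be illustrated with a plain-Python sketch: drop small boxes, run greedy NMS in score order, and keep the top results. The function below is a simplified stand-in with hypothetical parameter names, not the repository's code. Per-level pre-NMS top-n selection and clipping boxes to the image are left out.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_proposals(boxes, scores, min_size, nms_thresh, post_nms_top_n):
    """Return the indices of the proposals that survive filtering, best score first."""
    # 1. remove boxes whose width or height is below min_size
    candidates = [
        i for i, b in enumerate(boxes)
        if (b[2] - b[0]) >= min_size and (b[3] - b[1]) >= min_size
    ]
    # 2. greedy NMS: visit boxes by descending score, keep a box only if it does
    #    not overlap an already kept box by more than nms_thresh
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in kept):
            kept.append(i)
    # 3. keep at most post_nms_top_n proposals
    return kept[:post_nms_top_n]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 0.5, 0.5)]
scores = [0.9, 0.8, 0.7, 0.95]
kept = filter_proposals(boxes, scores, min_size=1.0, nms_thresh=0.5, post_nms_top_n=10)
```

In the example, the last box is removed for being too small even though it has the highest score, and the second box is suppressed because it overlaps the first with an IoU of about 0.68.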
anchor_generator (AnchorGenerator): module that generates the anchors for a set of feature maps.
head (nn.Module): module that computes the objectness and regression deltas
fg_iou_thresh (float): minimum IoU between the anchor and the GT box so that they can be
    considered as positive during training of the RPN.
bg_iou_thresh (float): maximum IoU between the anchor and the GT box so that they can be
    considered as negative during training of the RPN.
batch_size_per_image (int): number of anchors that are sampled during training of the RPN
    for computing the loss
positive_fraction (float): proportion of positive anchors in a mini-batch during training
    of the RPN
pre_nms_top_n (Dict[str]): number of proposals to keep before applying NMS. It should
    contain two fields: training and testing, to allow for different values depending
    on training or evaluation
post_nms_top_n (Dict[str]): number of proposals to keep after applying NMS. It should
    contain two fields: training and testing, to allow for different values depending
    on training or evaluation
nms_thresh (float): NMS threshold used for postprocessing the RPN proposals
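The roles of batch_size_per_image and positive_fraction are easiest to see in a sampler. The sketch below is a plain-Python illustration under stated assumptions: labels use 1 for positive, 0 for negative, and -1 for ignored anchors, and the function name sample_pos_neg is hypothetical. It is not the repository's sampler.

```python
import random


def sample_pos_neg(labels, batch_size_per_image, positive_fraction, rng=random):
    """Pick anchor indices for the loss: capped positives, negatives fill the rest."""
    positive = [i for i, label in enumerate(labels) if label >= 1]
    negative = [i for i, label in enumerate(labels) if label == 0]
    # positives are capped at positive_fraction of the batch
    num_pos = min(len(positive), int(batch_size_per_image * positive_fraction))
    # negatives take the remaining slots, so few positives means more negatives
    num_neg = min(len(negative), batch_size_per_image - num_pos)
    return rng.sample(positive, num_pos), rng.sample(negative, num_neg)


# 3 positives, 100 negatives, 5 ignored anchors; sample a batch of 8.
labels = [1] * 3 + [0] * 100 + [-1] * 5
pos_inds, neg_inds = sample_pos_neg(labels, batch_size_per_image=8, positive_fraction=0.5)
```

Ignored anchors (label -1) are never sampled, so they contribute to neither loss term.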
How is matching performed for anchors?
# iterate over the anchors and targets of each image
for anchors_per_image, targets_per_image in zip(anchors, targets):
    gt_boxes = targets_per_image["boxes"]
    if gt_boxes.numel() == 0:
        device = anchors_per_image.device
        matched_gt_boxes_per_image = torch.zeros(anchors_per_image.shape, dtype=torch.float32, device=device)
        labels_per_image = torch.zeros((anchors_per_image.shape[0],), dtype=torch.float32, device=device)
    else:
        # compute the iou between the anchors and the ground-truth bboxes
        # set to self.box_similarity when lands
        match_quality_matrix = box_ops.box_iou(gt_boxes, anchors_per_image)
        # for each anchor, the index of the gt with the highest iou (index set to -1 if iou<0.3, to -2 if 0.3<iou<0.7)
        matched_idxs = self.proposal_matcher(match_quality_matrix)
        # get the targets corresponding GT for each proposal
        # NB: need to clamp the indices because we can have a single
        # GT in the image, and matched_idxs can be -2, which goes
        # out of bounds
        # clamping to a lower bound of 0 makes it easy to look up the gt_boxes for every anchor;
        # negatives and discarded samples have negative indices, so they are set to 0 to stay in bounds.
        # Positive positions are tracked later through labels_per_image,
        # so the gt_boxes looked up for negatives and discarded samples carry no meaning:
        # only positives are used when computing the box regression loss.
        matched_gt_boxes_per_image = gt_boxes[matched_idxs.clamp(min=0)]

        # record the label of every anchor after matching (1 for positives, 0 for negatives, -1 for discarded samples)
        labels_per_image = matched_idxs >= 0
        labels_per_image = labels_per_image.to(dtype=torch.float32)

        # background (negative examples)
        bg_indices = matched_idxs == self.proposal_matcher.BELOW_LOW_THRESHOLD  # -1
        labels_per_image[bg_indices] = 0.0

        # discard indices that are between thresholds
        inds_to_discard = matched_idxs == self.proposal_matcher.BETWEEN_THRESHOLDS  # -2
        labels_per_image[inds_to_discard] = -1.0
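The matcher's thresholding can be shown in isolation. The sketch below is a plain-Python illustration, assuming an IoU matrix indexed as [gt][anchor] and the -1 / -2 sentinel values used above; the function names are hypothetical, and any fallback the real matcher may apply for gt boxes with only low-quality matches is omitted.

```python
BELOW_LOW_THRESHOLD = -1
BETWEEN_THRESHOLDS = -2


def match_anchors(iou_matrix, low_thresh, high_thresh):
    """For each anchor, return the best gt index, or a negative sentinel."""
    num_anchors = len(iou_matrix[0])
    matched = []
    for a in range(num_anchors):
        column = [row[a] for row in iou_matrix]  # IoU of this anchor with every gt
        best_iou = max(column)
        if best_iou < low_thresh:
            matched.append(BELOW_LOW_THRESHOLD)
        elif best_iou < high_thresh:
            matched.append(BETWEEN_THRESHOLDS)
        else:
            matched.append(column.index(best_iou))
    return matched


def labels_from_matches(matched):
    """1.0 for positives, 0.0 for background, -1.0 for discarded anchors."""
    labels = []
    for idx in matched:
        if idx >= 0:
            labels.append(1.0)
        elif idx == BELOW_LOW_THRESHOLD:
            labels.append(0.0)
        else:
            labels.append(-1.0)
    return labels


# Two gt boxes (rows) and four anchors (columns), thresholds 0.3 and 0.7.
iou_matrix = [
    [0.80, 0.10, 0.50, 0.20],
    [0.20, 0.05, 0.60, 0.75],
]
matched = match_anchors(iou_matrix, low_thresh=0.3, high_thresh=0.7)
labels = labels_from_matches(matched)
```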
How is the loss computed for RPN?
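Compute the RPN loss, including the classification loss (foreground vs. background) and the bbox regression loss
Arguments:
    objectness (Tensor): predicted foreground scores (logits)
    pred_bbox_deltas (Tensor): predicted bbox regression
    labels (List[Tensor]): ground-truth labels 1, 0, -1 (one list element per image in the batch)
    regression_targets (List[Tensor]): ground-truth bbox regression
Returns:
    objectness_loss (Tensor): classification loss
    box_loss (Tensor): bounding-box regression loss

# pick positive and negative samples according to batch_size_per_image and positive_fraction
sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)

# concatenate the per-image positive/negative masks List(Tensor) across the batch and get the indices of the non-zero positions
# sampled_pos_inds = torch.nonzero(torch.cat(sampled_pos_inds, dim=0)).squeeze(1)
sampled_pos_inds = torch.where(torch.cat(sampled_pos_inds, dim=0))[0]
# sampled_neg_inds = torch.nonzero(torch.cat(sampled_neg_inds, dim=0)).squeeze(1)
sampled_neg_inds = torch.where(torch.cat(sampled_neg_inds, dim=0))[0]

# concatenate all positive and negative sample indices
sampled_inds = torch.cat([sampled_pos_inds, sampled_neg_inds], dim=0)
objectness = objectness.flatten()

labels = torch.cat(labels, dim=0)
regression_targets = torch.cat(regression_targets, dim=0)

# compute the bounding-box regression loss (positives only, normalized by the number of sampled anchors)
box_loss = det_utils.smooth_l1_loss(
    pred_bbox_deltas[sampled_pos_inds],
    regression_targets[sampled_pos_inds],
    beta=1 / 9,
    size_average=False,
) / (sampled_inds.numel())

# compute the objectness loss
objectness_loss = F.binary_cross_entropy_with_logits(
    objectness[sampled_inds], labels[sampled_inds]
)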
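The two loss terms can be written out in plain Python to show the arithmetic. This sketch assumes the usual smooth L1 definition with a beta parameter and the numerically stable form of binary cross-entropy on logits; the function names are hypothetical and the code works on Python lists rather than tensors.

```python
import math


def smooth_l1_sum(pred, target, beta=1 / 9):
    """Summed smooth L1: quadratic below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        n = abs(p - t)
        total += 0.5 * n * n / beta if n < beta else n - 0.5 * beta
    return total


def bce_with_logits_mean(logits, targets):
    """Mean binary cross-entropy on raw logits, in a numerically stable form."""
    losses = [
        max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
        for x, t in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def rpn_loss(objectness, deltas, labels, reg_targets, pos_inds, neg_inds):
    sampled = pos_inds + neg_inds
    # regression loss: positives only, normalized by the number of sampled anchors
    pred = [v for i in pos_inds for v in deltas[i]]
    target = [v for i in pos_inds for v in reg_targets[i]]
    box_loss = smooth_l1_sum(pred, target) / len(sampled)
    # objectness loss: all sampled anchors, positives and negatives
    objectness_loss = bce_with_logits_mean(
        [objectness[i] for i in sampled], [labels[i] for i in sampled]
    )
    return objectness_loss, box_loss


# One positive anchor (index 0) and one negative anchor (index 1).
obj_loss, box_loss = rpn_loss(
    objectness=[0.0, 0.0],
    deltas=[(1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
    labels=[1.0, 0.0],
    reg_targets=[(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
    pos_inds=[0],
    neg_inds=[1],
)
```

Note the asymmetry: the regression loss only sees positive anchors, but it is divided by the count of all sampled anchors, which keeps its scale comparable to the mean-reduced objectness loss.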
ROI Heads
As the term "two-stage" suggests, the ROI head overlaps in function with the RPN head. Instead of predicting offsets for the anchor templates, the ROI head refines the proposals from the previous stage by predicting class labels and box offsets. The supervision signals come from matching targets to proposals.
Assign Targets to Proposals
This step prepares the supervision for the ROI heads.
matched_idxs = []
labels = []
# iterate over the proposals, gt_boxes and gt_labels of each image
for proposals_in_image, gt_boxes_in_image, gt_labels_in_image in zip(proposals, gt_boxes, gt_labels):
    if gt_boxes_in_image.numel() == 0:  # no gt boxes in this image: it is all background
        # background image
        device = proposals_in_image.device
        clamped_matched_idxs_in_image = torch.zeros(
            (proposals_in_image.shape[0],), dtype=torch.int64, device=device
        )
        labels_in_image = torch.zeros(
            (proposals_in_image.shape[0],), dtype=torch.int64, device=device
        )
    else:
        # set to self.box_similarity when lands
        # compute the iou overlap between the proposals and each gt_box
        match_quality_matrix = box_ops.box_iou(gt_boxes_in_image, proposals_in_image)

        # for each proposal, find the gt_box with the highest iou and record its index;
        # the index is -1 when iou < low_threshold and -2 when low_threshold <= iou < high_threshold
        matched_idxs_in_image = self.proposal_matcher(match_quality_matrix)

        # clamp the minimum to avoid out-of-bounds indexing when looking up labels
        # note that gt indices -1 and -2 are moved to 0, so the label fetched is that of gt 0 (not the true label); this is corrected below
        clamped_matched_idxs_in_image = matched_idxs_in_image.clamp(min=0)
        # fetch the label of the gt matched to each proposal
        labels_in_image = gt_labels_in_image[clamped_matched_idxs_in_image]
        labels_in_image = labels_in_image.to(dtype=torch.int64)

        # label background (below the low threshold)
        # set the class of proposals with gt index -1 to 0, i.e. background, negative samples
        bg_inds = matched_idxs_in_image == self.proposal_matcher.BELOW_LOW_THRESHOLD  # -1
        labels_in_image[bg_inds] = 0

        # label ignore proposals (between low and high threshold)
        # set the class of proposals with gt index -2 to -1, i.e. discarded samples
        ignore_inds = matched_idxs_in_image == self.proposal_matcher.BETWEEN_THRESHOLDS  # -2
        labels_in_image[ignore_inds] = -1  # -1 is ignored by sampler

    matched_idxs.append(clamped_matched_idxs_in_image)
    labels.append(labels_in_image)
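The clamp-then-overwrite pattern above can be traced on a tiny example. This plain-Python sketch assumes the matcher has already produced indices with -1 for background and -2 for ignored proposals; the function name assign_class_labels is hypothetical.

```python
def assign_class_labels(matched_idxs, gt_labels):
    """Turn matched gt indices into class labels: 0 background, -1 ignored."""
    # step 1: clamp so every index is valid, then look up a provisional label
    clamped = [max(idx, 0) for idx in matched_idxs]
    labels = [gt_labels[idx] for idx in clamped]
    # step 2: overwrite the provisional labels of unmatched proposals
    for i, idx in enumerate(matched_idxs):
        if idx == -1:
            labels[i] = 0    # background
        elif idx == -2:
            labels[i] = -1   # ignored by the sampler
    return clamped, labels


# Two gt boxes with class ids 7 and 15; four proposals.
clamped, labels = assign_class_labels([1, -1, -2, 0], gt_labels=[7, 15])
```

After step 1 the second and third proposals temporarily carry class 7, the label of gt 0; step 2 replaces those with 0 and -1, which is why the clamped lookup is harmless.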
