Research Article

Fast Vehicle and Pedestrian Detection Using Improved Mask R-CNN

Table 2

Feature extraction network data comparison.

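| Network structure | Feature | Anchors | Lateral | Top-down | AR | AR_s | AR_m | AR_l |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Featurized image pyramid | C4 | 47k | No | No | 48.3 | 32.0 | 58.7 | 62.2 |
| Featurized image pyramid | C5 | 12k | No | No | 44.9 | 25.3 | 55.5 | 64.2 |
| Bottom-up pyramid | | 200k | Yes | No | 49.5 | 30.5 | 59.9 | 68.0 |
| Top-down pyramid | | 200k | No | Yes | 46.1 | 26.5 | 57.4 | 64.7 |
| Finest level only | P2 | 200k | Yes | Yes | 51.3 | 35.1 | 59.7 | 67.6 |
| FPN | | 200k | Yes | Yes | 56.3 | 44.9 | 63.4 | 66.2 |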

The first column lists the feature extraction algorithm: the second and third rows are the two-layer outputs of the algorithm, and the fourth row contains the single feature map and pyramidal feature hierarchy. The second column gives the name of the output layer, where the “{}” symbol indicates that predictions are made independently at each layer. The evaluation metric is average recall (AR). The superscript on AR indicates the number of anchors generated for each image. The subscripts “s,” “m,” and “l” denote small, medium, and large objects, respectively.
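To make the AR metric concrete, the following is a minimal sketch of how average recall and the small/medium/large split might be computed. The article does not state the IoU thresholds or the size boundaries, so this sketch assumes the COCO-style convention (IoU thresholds from 0.50 to 0.95 in steps of 0.05; area boundaries at 32² and 96² pixels). The function names `iou`, `size_bucket`, and `average_recall` are hypothetical and are not taken from the article.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def size_bucket(box):
    """Assign a box to 's', 'm', or 'l' (assumed COCO-style area boundaries)."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "s"
    if area < 96 ** 2:
        return "m"
    return "l"


def average_recall(gt_boxes, proposals, thresholds=None):
    """Mean, over IoU thresholds, of the fraction of ground-truth boxes
    covered by at least one proposal at that threshold."""
    if thresholds is None:
        # Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
        thresholds = [0.5 + 0.05 * i for i in range(10)]
    if not gt_boxes:
        return 0.0
    # Best IoU achieved by any proposal for each ground-truth box.
    best = [max((iou(g, p) for p in proposals), default=0.0) for g in gt_boxes]
    recalls = [sum(b >= t for b in best) / len(best) for t in thresholds]
    return sum(recalls) / len(recalls)
```

Under these assumptions, a per-size score such as AR_s would be obtained by filtering `gt_boxes` with `size_bucket(box) == "s"` before calling `average_recall`.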