Research Article

Fast Vehicle and Pedestrian Detection Using Improved Mask R-CNN

Table 4

Experimental data.

Backbone_class          FPN + resnet101_81   FPN + resnet86_81   FPN + resnet50_81   FPN + resnet101_3   FPN + resnet86_3   FPN + resnet50_3
Epoch_steps             160_1000             160_1000            160_1000            160_1000            160_1000           160_1000
Total params            64,158,584           58,549,624          45,088,120          63,744,170          58,135,210         44,673,706
Trainable params        64,047,096           58,453,496          45,028,856          63,632,682          58,039,082         44,614,442
FLOPs                   130,205,828          118,834,293         91,542,609          129,377,924         118,006,389        90,714,705
Memory_size             257.6 M              235.0 M             180.9 M             255.9 M             233.3 M            179.2 M
Train_time              27.98 h              21.77 h             18.25 h             23.02 h             20.43 h            18.73 h
Test_avg_time (M4952)   2.14 s               2.014 s             1.39 s              2.10 s              1.97 s             1.36 s

The first row gives the network structure and the number of detection categories; for example, FPN + resnet101_81 combines an FPN with the ResNet-101 residual network and detects 81 categories. The second row indicates that every network structure is trained for 160 × 1000 = 160,000 steps. Memory_size refers to the size of the weight file saved after each network structure is trained. Train_time refers to the time required to train each network structure. Total params and Trainable params represent the total and trainable parameter counts, respectively, of each network structure. Floating-point operations (FLOPs) indicate the number of floating-point operations performed by each network structure, that is, its computational cost. Test_avg_time (M4952) refers to the average test time per image over the 4952 test images for each network structure. Min_train_loss refers to the minimum training loss reached by each network structure after the 160,000 training steps.
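As a rough illustration of how the parameter counts and the average test time in Table 4 can be obtained, the following is a minimal sketch assuming the detector is implemented as a built Keras/TensorFlow model (as is common for Mask R-CNN); the function names count_parameters and average_test_time are illustrative placeholders rather than the authors' code, and FLOPs, which are usually measured with a separate profiler, are not reproduced here.

import time

import numpy as np
from tensorflow import keras


def count_parameters(model):
    """Return (total, trainable) parameter counts for a built Keras model."""
    total = model.count_params()
    trainable = int(np.sum([keras.backend.count_params(w)
                            for w in model.trainable_weights]))
    return total, trainable


def average_test_time(model, images):
    """Average single-image forward-pass time, analogous to Test_avg_time."""
    start = time.time()
    for img in images:  # e.g. the 4952 preprocessed test images
        model.predict(np.expand_dims(img, axis=0), verbose=0)
    return (time.time() - start) / len(images)

In this sketch, the trainable count sums only the weights updated during training, so it is slightly smaller than the total count, which also includes fixed quantities such as batch-normalization statistics; this matches the gap between Total params and Trainable params in Table 4.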