Abstract

Focusing on DOTA, a multidirectional object dataset of vehicles in aerial images, a cascaded multidirectional object detection algorithm (CMDTD) is proposed. This paper first analyzes why general object detection algorithms are difficult to apply to multidirectional object detection. On this basis, the detection principle of CMDTD, including its backbone network and its multidirectional multi-information detection end module, is studied. In addition, in view of the complexity of the scenes found in aerial vehicle images, a unique data expansion method is proposed. Finally, experiments on three datasets show that the proposed cascaded multidirectional object detection algorithm is highly effective and superior to other methods.

1. Introduction

With the development of deep learning, rapid progress has been made in remote sensing and aerial image processing and analysis [15]. However, existing methods cannot handle multidirectional object detection. Unlike traditional object detection, in which the detection box is generally a horizontal or vertical rectangle, the detection box in multidirectional object detection can be a rectangle in any direction. Recently, several popular general object detection algorithms have proved ineffective on vehicle object detection datasets, with the best result reaching only 52.93%. Judging from the experimental results on the DOTA dataset [6], the rotation of object bounding boxes, the large number of small vehicle objects in aerial images, and the insufficient use of data information are the main causes of this poor performance.
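To make the distinction concrete, the following sketch (an illustration of ours, not part of any method discussed here) converts an oriented box given as center, size, and angle into its four vertices; the angle is exactly the degree of freedom that a horizontal detector's (x1, y1, x2, y2) output cannot express:

```python
import numpy as np

def rotated_box_to_vertices(cx, cy, w, h, angle_rad):
    """Return the 4 corners of an oriented rectangle as an array of (x, y).

    A horizontal detector predicts only (x1, y1, x2, y2); an oriented box
    additionally needs the rotation angle (or, equivalently, the vertices).
    """
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    # Corner offsets of the axis-aligned rectangle, starting at top-left.
    offsets = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                        [w / 2,  h / 2], [-w / 2,  h / 2]])
    rot = np.array([[c, -s], [s, c]])
    return offsets @ rot.T + np.array([cx, cy])

# A 40 x 10 px vehicle rotated 30 degrees: its axis-aligned bounding box
# would cover far more background than the oriented one.
print(rotated_box_to_vertices(100, 100, 40, 10, np.pi / 6))
```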

To improve detection on multidirectional aerial vehicle object datasets, this paper proposes CMDTD, a cascaded multidirectional object detection algorithm that takes Faster R-CNN as its baseline and draws on the cascade idea of Cascade R-CNN. Through a multi-information cascade output end, vehicle objects are classified in a coarse-to-fine manner while the boundary prediction is refined. To address sample imbalance, data augmentation training is performed on the classes with few samples in the statistics of the training data. This method effectively improves multidirectional object detection and therefore has research value for multidirectional vehicle object detection in complex scenes.

2. Research Models

2.1. ResNeXt Backbone Network

Compared with ground-level vehicle detection, aerial vehicle objects fall into few classes, but the differences between those classes are large. In order to extract superior object features, ResNeXt [7], which is more powerful than ResNet, is selected as the backbone network of CMDTD, together with the feature fusion method of FPN [8]. The submodule structures of ResNet and ResNeXt are shown in Figure 1. The submodule of ResNet is composed of three convolutional layers. Given an input feature map X, the three convolution operations extract a new feature F(X), which is then superimposed on the input to produce the output of the residual block. The process can be expressed as

Y = F(X) + X. (1)

According to equation (1), the feature F(X) is the difference between the output Y and the input X; it is therefore called the residual feature, and the submodule is called a residual block. ResNeXt extends this residual structure.

As shown in Figure 1, the three convolutional layers are divided into 32 groups of convolution combinations with consistent parameter sizes, whose total equals the size of the three original convolutional layers. The input feature map X is transformed by each of the 32 convolution groups, their outputs are summed, and the sum is superimposed on the old features to acquire the output features:

Y = X + ∑_{i=1}^{32} F_i(X), (2)

where F_i(X) denotes the feature extracted by the i-th group.

Although the new combination does not increase the number of parameters, it increases the complexity of the feature transformation and thereby strengthens the network's ability to represent features.
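The following is a minimal sketch of the submodules in Figure 1, assuming a PyTorch implementation with illustrative channel widths: setting groups=1 recovers the ResNet block of equation (1), while groups=32 gives the grouped transformation of equation (2).

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual bottleneck; groups=1 is a ResNet block (equation (1)),
    groups=32 is the ResNeXt variant (equation (2))."""

    def __init__(self, channels: int, width: int = 128, groups: int = 32):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            # The 3x3 convolution is split into `groups` parallel paths
            # of equal size, keeping the total parameter count unchanged.
            nn.Conv2d(width, width, kernel_size=3, padding=1,
                      groups=groups, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.f(x))   # Y = X + F(X)

y = Bottleneck(256)(torch.randn(1, 256, 64, 64))
```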

2.2. Multi-Information Cascade Output

The output end of a general detector and the multidirectional, multi-information detection output end proposed in this paper are illustrated in Figure 2. As shown in Figure 2(a), a general detector performs class determination and border prediction after the RPN (region proposal network) produces region proposals. The multi-information cascade output end proposed in this paper is consistent with the general detector in its first half: it regresses the horizontal outer boundary of the object, but in this first stage it only determines whether a region is foreground or background. From the resulting horizontal box, the length and width information of the object (length, width, and aspect ratio) can be computed. Next, RoI pooling extracts new object features based on the acquired horizontal box. Finally, a second FCN (fully connected network) performs fine classification of the object based on the extracted features together with the length and width information, while a third FCN predicts the positions of the object's four vertices from the extracted features, yielding the quadrilateral bounding box of the object.
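The following is a simplified sketch of this cascade head, assuming PyTorch; the module names, feature dimensions, and the use of torchvision's roi_align are our assumptions, and the decoding and clipping of box regressions is omitted for brevity:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class CascadeHead(nn.Module):
    """Sketch of the multi-information cascade output end (Figure 2(b))."""

    def __init__(self, feat_dim=256, pool=7, num_classes=15):
        super().__init__()
        d = feat_dim * pool * pool
        self.pool = pool
        self.stage1 = nn.Linear(d, 2 + 4)              # fg/bg + horizontal box
        self.fine_cls = nn.Linear(d + 3, num_classes)  # features + geometry
        self.vertex_reg = nn.Linear(d, 8)              # 4 vertices (x, y)

    def forward(self, fmap, proposals):
        # Stage 1: foreground/background score and horizontal bounding box
        # (in practice the 4 box values are deltas decoded against the
        # proposal; that decoding is omitted here).
        feat = roi_align(fmap, [proposals], self.pool).flatten(1)
        out1 = self.stage1(feat)
        fg_score, hboxes = out1[:, :2], out1[:, 2:]    # (x1, y1, x2, y2)

        # Stage 2: re-pool features from the regressed horizontal box.
        feat2 = roi_align(fmap, [hboxes], self.pool).flatten(1)
        w = hboxes[:, 2] - hboxes[:, 0]
        h = hboxes[:, 3] - hboxes[:, 1]
        geom = torch.stack([w, h, w / h.clamp(min=1)], dim=1) / 1000.0
        cls = self.fine_cls(torch.cat([feat2, geom], dim=1))
        verts = self.vertex_reg(feat2)                 # quadrilateral offsets
        return fg_score, hboxes, cls, verts
```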

Compared with the detection of land vehicles, objects in aerial vehicle images vary widely in size, which makes position regression more difficult. In the second stage, a more precise object boundary can be obtained than the object proposal region provides, which aids subsequent boundary positioning. Regarding class determination, objects of different classes may share texture and color features yet differ in size. For example, small vehicles and large vehicles with similar color characteristics can be distinguished by their size and aspect ratio. Hence, introducing the aspect ratio information of the object during fine classification can improve classification accuracy. In implementation, the length, width, and aspect ratio computed from the horizontal box are scaled down by a factor of 1000 and then fed into the FCN together with the pooled features.
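As a concrete illustration (the box sizes below are hypothetical), the division by 1000 keeps the geometric inputs on a scale comparable to normalized network activations while preserving the contrast between a small and a large vehicle:

```python
def geometry_features(w: float, h: float) -> list[float]:
    """Length, width, and aspect ratio, scaled down by 1000 so they are
    comparable in magnitude to the pooled CNN features."""
    return [w / 1000.0, h / 1000.0, (w / h) / 1000.0]

# A 48 x 20 px small vehicle vs. a 120 x 30 px large vehicle:
print(geometry_features(48, 20))    # [0.048, 0.02, 0.0024]
print(geometry_features(120, 30))   # [0.12, 0.03, 0.004]
```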

During training, the output end is supervised by four losses: two position losses and two classification losses. The position losses are the prediction losses of the horizontal boundary and of the vertices, while the classification losses are the foreground classification loss and the fine classification loss. Smooth L1 loss is applied to the position losses and cross-entropy loss to the classification losses.
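A minimal sketch of this four-part training objective, assuming PyTorch and equal loss weights (the paper does not state the weighting):

```python
import torch.nn.functional as F

def detection_loss(fg_score, hbox_pred, cls_score, vert_pred,
                   fg_label, hbox_target, cls_label, vert_target):
    """Sum of the four losses described above; equal weighting is an
    assumption of this sketch."""
    loss_fg   = F.cross_entropy(fg_score, fg_label)        # stage-1 fg/bg
    loss_hbox = F.smooth_l1_loss(hbox_pred, hbox_target)   # horizontal box
    loss_cls  = F.cross_entropy(cls_score, cls_label)      # fine classification
    loss_vert = F.smooth_l1_loss(vert_pred, vert_target)   # 4 vertices
    return loss_fg + loss_hbox + loss_cls + loss_vert
```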

2.3. Prediction of Vertex Information

The positioning of the object boundary at the detection end is presented in Figure 3. In this paper, the object box is regressed three times at the detection end, producing in order the object candidate box, the horizontal bounding box, and the rotated rectangular box, which correspond to the light-blue dashed box, the blue dotted box, and the green rectangular box.

The light-blue dashed box is far from the real boundary of the object because the region proposal network regresses the object boundary from the anchor: if the anchor position deviates from the object boundary, the regression is poor. In the experiment, the anchor aspect ratios are [1 : 1, 1 : 2, 2 : 1, 1 : 3, 3 : 1] with a base size of 6 pixels, so as to adapt to dense small objects and elongated objects. The blue dotted box is the horizontal bounding box regressed in the second stage. This regression comes close to the real boundary of the object, since the regression target is the horizontal rectangle circumscribing the object. The green rectangular box is obtained by regressing the four vertices from the horizontal bounding box, with the prediction equation

x_1 = x + w · t_{x1}, y_1 = y + h · t_{y1}, (3)

where x, y, w, and h are the center coordinates, width, and height of the horizontal predicted box, respectively, and the fully connected network obtains the first vertex of the object by regressing t_{x1} and t_{y1}. During training, the point closest to the upper-left corner of the horizontal bounding box of the object is taken as the first vertex, and the second, third, and fourth vertices follow in clockwise order. The regression targets are calculated as

t*_{x1} = (x*_1 − x)/w, t*_{y1} = (y*_1 − y)/h, (4)

where x*_1 and y*_1 are the real coordinates of the vertex and t*_{x1} and t*_{y1} are the real regression targets; the smooth L1 loss between predictions and targets is used during training. The other vertices are handled in the same manner.
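Equations (3) and (4) amount to the following encode/decode pair, shown here as a sketch under the normalization the text describes:

```python
import math

def encode_vertex(vx, vy, box):
    """Equation (4): real vertex -> regression target, normalized by box size."""
    x, y, w, h = box                      # center and size of the horizontal box
    return (vx - x) / w, (vy - y) / h

def decode_vertex(tx, ty, box):
    """Equation (3): regression output -> vertex coordinates."""
    x, y, w, h = box
    return x + w * tx, y + h * ty

box = (100.0, 60.0, 40.0, 20.0)           # hypothetical horizontal box (x, y, w, h)
t = encode_vertex(85.0, 52.0, box)        # the vertex closest to the upper left
vx, vy = decode_vertex(*t, box)
assert math.isclose(vx, 85.0) and math.isclose(vy, 52.0)
```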

2.4. Unbalanced Data Distribution and Data Augmentation

As can be observed from Figure 4, dense vehicles and ships appear frequently in aerial vehicle images owing to the abundance of scenes such as ports and parking lots. In the training set after cropping, ships and vehicles account for high proportions of the instances, while other classes account for extremely low proportions.

Images of the 5 classes with the smallest proportions in the data (football field, athletic field, rugby field, baseball field, and roundabout) are expanded to address the imbalance in the training data. Following the expansion process in Figure 5, a horizontal flip, a vertical flip, and a simultaneous horizontal and vertical flip are applied to each image, so three additional flipped copies are produced and the data for the minority classes is expanded by 3 times. Since instances of other classes also appear in the augmented images, those classes are augmented to some degree as well; the augmentation ratios of all classes are shown in Figure 6.
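A sketch of the three flips and the corresponding label transforms, assuming quadrilateral labels stored as N × 4 × 2 pixel coordinates; note that a single flip reverses the clockwise vertex order described in Section 2.3, so re-ordering (omitted here) would be needed in practice:

```python
import numpy as np

def flip_image_and_quads(img: np.ndarray, quads: np.ndarray,
                         horizontal: bool, vertical: bool):
    """Flip an image and its quadrilateral labels (N x 4 x 2, in pixels)."""
    h, w = img.shape[:2]
    out_img, out_quads = img, quads.astype(np.float64).copy()
    if horizontal:
        out_img = out_img[:, ::-1]
        out_quads[..., 0] = w - 1 - out_quads[..., 0]
    if vertical:
        out_img = out_img[::-1, :]
        out_quads[..., 1] = h - 1 - out_quads[..., 1]
    return out_img, out_quads

# The three augmented copies used for the minority classes:
flips = [(True, False), (False, True), (True, True)]
```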

3. Experimental Results

3.1. Experimental Dataset and Evaluation Indicators

In this paper, verification is performed on two vehicle-oriented datasets, DOTA [6] and HRSC2016 [9]. In addition, CMDTD is verified on the multidirectional scene text dataset ICDAR2015 [10]. As a large-scale aerial vehicle image dataset, DOTA contains 1,411 training images, 937 test images, and 458 validation images. Image sizes range from 800 × 800 to 4000 × 4000 pixels, covering 15 classes and 188,282 object instances. The dataset provides both horizontal-rectangle labels and vertex labels, with two open detection tasks: the first is a multidirectional object detection task, and the second is a horizontal object detection task.

HRSC2016 is a dataset of aerial maritime vehicle images collected from 6 major ports. Specifically, its training, validation, and test sets contain 436, 181, and 444 images, respectively, with image sizes ranging from 300 × 300 to 500 × 900. The ICDAR2015 dataset originates from a detection task of the ICDAR 2015 Robust Reading Competition and consists of images captured in real scenes. Of its 1,500 images, 1,000 are training images and the remaining 500 are test images, each with a size of 720 × 1280.

To compare with other methods, CMDTD adopts the standard mAP evaluation on DOTA and HRSC2016 and evaluates with the F-measure on ICDAR2015. The F-measure is calculated from the recall rate R and the precision P as

F = (P × R) / (α × P + (1 − α) × R),

where α is generally set to 0.5, which reduces the index to the balanced form F = 2PR/(P + R).
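For reference, the index can be computed as follows (a direct transcription of the formula above):

```python
def f_measure(precision: float, recall: float, alpha: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall; alpha = 0.5
    gives the balanced F-measure 2PR / (P + R)."""
    return (precision * recall) / (alpha * precision + (1 - alpha) * recall)

# e.g. precision 0.85, recall 0.83 -> about 0.8399
print(f_measure(0.85, 0.83))
```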

3.2. Experimental Setup

The experimental environment configured for CMDTD is shown in Table 1.

Concerning the ICDAR2015 dataset, CMDTD cuts the images into 1088 × 1088 crops, in the same way as for the DOTA dataset. The network input size is set to 1088 × 1088 during training, with the training schedule and parameter details consistent with those used for HRSC2016. For testing, the network input size is set to 720 × 1280, so each image can be fed in at its original size.
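A sketch of the kind of tiling this implies, with an assumed overlap so that objects on tile borders are not lost (the paper does not state the overlap used):

```python
import numpy as np

def tile_image(img: np.ndarray, tile: int = 1088, overlap: int = 200):
    """Cut a large aerial image into overlapping tile x tile crops.

    Overlapping crops reduce the chance of slicing objects at tile
    borders; the 200 px overlap here is an assumed value.
    """
    h, w = img.shape[:2]
    step = tile - overlap
    crops = []
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            y0 = min(y, max(h - tile, 0))
            x0 = min(x, max(w - tile, 0))
            crops.append(((x0, y0), img[y0:y0 + tile, x0:x0 + tile]))
    return crops  # list of (top-left offset, crop) pairs

crops = tile_image(np.zeros((4000, 4000, 3), dtype=np.uint8))
print(len(crops))  # 25 tiles for a 4000 x 4000 image
```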

3.3. Result Comparison

The comparison results of five methods, including CMDTD, RRPN [11], ICN [12], and SCRDet [13], on DOTA task 1 (multidirectional detection) are displayed in Table 2. The mAP of CMDTD is higher than that of the other methods, reaching 72.81%. In addition, concerning small object detection, CMDTD achieves its first- and second-best per-class accuracies on ships and cars, at 85.5% and 69.51%, respectively.

The comparison results of five methods, including CMDTD, RFCN [6], ICN [12], and SCRDet [13], on DOTA task 2 (horizontal detection) are shown in Table 3. The horizontal detection result of CMDTD is obtained by taking the smallest enclosing horizontal rectangle of each rotated-rectangle result. The method proposed in this paper ranks second with an mAP of 73.7%, surpassing most of the other methods.

The comparison results of six methods, including CMDTD, CTPN [18], and RRPN [11], on the ICDAR2015 dataset are shown in Table 4. These are one-stage or two-stage detection methods. The comprehensive score of the method proposed in this paper ranks second, reaching 83.88%, which is an excellent performance.

The comparison results of seven methods, including CMDTD, RRD [22], and R2CNN [20], on the HRSC2016 dataset are demonstrated in Table 5. The mAP of the method proposed in this paper ranks first, reaching 89.68%.

3.4. The Influence of Different Modules on Model Results

The influences of different settings on the model results, including the effect of the cascade, the effect of location information on classification [23], and the effect of data augmentation, are investigated with CMDTD on DOTA task 1. The results are shown in Table 6 and are summarized as follows:

Regarding the influence of data augmentation, the model trained by CMDTD on the dataset without data augmentation achieves an mAP of only 72.48%, and the detection accuracies for football field, rugby field, and baseball field drop by 5.8%, 3.33%, and 3.29%, respectively.

Detection results of CMDTD on the two aerial image datasets are shown in Figures 7 and 8. As can be observed, the DOTA dataset contains large numbers of objects with significant differences in size and aspect ratio, while objects in the HRSC2016 dataset are long, narrow rectangles. CMDTD effectively captures objects in various directions [24] and also presents a satisfactory detection effect on small objects [25].

4. Conclusion

Firstly, the difficulty of applying general object detection algorithms to multidirectional object detection in aerial vehicle images is analyzed. On this basis, CMDTD is proposed, and its detection principle, including the backbone network, the multidirectional multi-information detection end module, and the data augmentation method for aerial vehicle images, is studied. Finally, experiments on three datasets demonstrate that the proposed cascaded multidirectional object detection algorithm is highly effective and superior to other methods.

Data Availability

The data used to support the findings of this study are included within [1, 2].

Conflicts of Interest

The authors declare that they have no conflicts of interest.