Abstract

Recently, deformable convolutional networks have shown superior performance in object detection due to their ability to adapt to the geometric variations of objects. These methods learn offset fields under the supervision of localization and recognition. Nevertheless, the spatial support of these networks may be inexact because the offsets are learned implicitly via an extra convolutional layer. In this work, we present curvature-driven deformable convolutional networks (C-DCNets), which adopt explicit geometric properties of the preceding feature maps to enhance the deformability of the convolution operation and make it easier for the networks to focus on pertinent image regions. To be consistent with the postprocessing stage of object detection, we multiply the class prediction probability by the similarity of the predicted boxes and ground truth boxes to form the final class prediction probability and substitute it into the binary cross entropy loss function. The resulting loss function couples bounding box regression and classification. Experimental results on the PASCAL VOC and COCO data sets show that C-DCNets-based YOLOv4 with the proposed loss function outperforms state-of-the-art algorithms.

1. Introduction

Attention mechanisms make a neural network pay more attention to relevant parts of the image than to irrelevant parts; they can therefore model long-range dependencies. The spatial transformer module [1] is a dynamic mechanism that actively spatially transforms an image (or a feature map) to enhance the representations produced by CNNs. “Squeeze-and-Excitation Networks” (SENet) [2] improve the network representation by explicitly modeling the interdependencies between the channels of the network’s convolutional features. The “Convolutional Block Attention Module” (CBAM) [3] applies channel attention modules and spatial attention modules sequentially so that each branch can learn “what to focus on” and “where to focus” along the channel axis and spatial axis, respectively. “Selective Kernel Networks” (SKNets) [4] focus on the adaptive receptive field (RF) size of neurons by introducing attention mechanisms. As a particular instantiation of spatial attention mechanisms [5], deformable convolutional networks can capture spatial transformations since they exploit query content and relative position effectively. The current state-of-the-art methods for modeling geometric transformations are Deformable Convolutional Networks (DCNv1) [6], Deformable ConvNets v2 (DCNv2) [7], and Point Set Representation for object detection (RepPoints) [8]. In DCNv1, two modules aid CNNs in modeling geometric variations. One is deformable convolution, in which the grid sampling positions of standard convolution are shifted by 2D offsets learned via an extra convolutional layer. The other is deformable RoI pooling, which adds 2D offsets to each bin position in the regular bin partition of previous RoI pooling [6]. Incorporating these modules into a neural network gives it the ability to adjust its feature representation to the object configuration, specifically by deforming the network’s sampling and pooling patterns to fit the object’s structure. In DCNv2 [7], the learned offset fields and a modulated amplitude control the sampling positions together. However, their spatial support may exceed the region of interest because the offsets and the modulation scalar are learned implicitly by an additional convolutional layer. In RepPoints [8], the point distance loss and the object recognition loss are adopted to learn object localization, as deformable convolutions operate on irregular grid points and the recognition feedback can guide the training of their positions. Compared with DCNv1 and DCNv2, RepPoints imposes more constraints on the classification module, but its offset fields are still learned implicitly by a convolutional layer. To further improve the deformation ability of deformable convolutional networks, we introduce the intrinsic geometric properties of the input feature maps and propose curvature-driven deformable convolutional networks (C-DCNets), which use offset learning guided by the curvature fields of the preceding feature maps to focus the network on pertinent image regions. The proposed method produces leading results on the PASCAL VOC and COCO data sets for object detection.

The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest. However, there are many near-duplicate predictions because of the anchor sets and the heuristics that assign target boxes to anchors. Traditional object detection pipelines [9, 10] assign foreground/background scores of each class to multiscale sliding windows based on the features calculated in each window, while deep-learning-based object detectors employ region proposals [11, 12] generated by convolutional neural networks to replace sliding windows. Deep-learning-based one-stage detectors (e.g., SSD [13], YOLOv1–YOLOv4 [14–17]) also use nonunique assignment rules between ground truth boxes and prediction boxes even though there are no region proposals. Hence, almost all state-of-the-art detectors [12–19] need postprocessing. Besides, considering the imbalance of positive and negative samples, feature imbalance, target imbalance, and image scene imbalance in object detection, researchers [20, 21] propose preprocessing methods similar to sample augmentation to obtain a balanced learning representation. Nonmaximum suppression (NMS) [9–12] is the postprocessing part of the object detection framework that avoids near-identical boxes. Its evaluation score is the product of the Intersection over Union (IoU) and the class prediction probability, yet during training these two terms are handled by two separate loss functions for box regression and classification. To be consistent with NMS, we multiply the class prediction probability by the IoU of the predicted boxes and targets to form the final class prediction distribution and substitute it into the binary cross entropy loss function. The obtained loss function can achieve the best bipartite matching between the predicted boxes and ground truth boxes. The main contributions of this work are summarized as follows:
(1) Curvature-driven deformable convolutional networks (C-DCNets) are proposed, which make the spatial support of the networks adapt much better to the saliency region.
(2) A new loss function associating bounding box regression and classification is proposed, in which the class prediction probability in the binary cross entropy loss function integrates the similarity of the predicted boxes and targets.
(3) We evaluate C-DCNets-based detection frameworks with the proposed loss function on the PASCAL VOC and COCO data sets against very competitive Faster R-CNN [12], YOLOv4 [17], DETR [22], and deformable DETR [23] baselines.

The rest of this paper is organized as follows: Section 2 reviews related work on attention mechanisms and postprocessing techniques. Section 3 explains the curvature-driven deformable convolutional networks and the loss function associating bounding box regression and classification. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

2. Related Work

2.1. Attention Mechanisms

Attention mechanisms were first studied in natural language processing (NLP) [24–28], where encoder-decoder attention modules were developed to facilitate neural machine translation: certain key elements are given priority according to a given query in order to compute the output for the query element. Self-attention modules were then utilized for modeling intrasentence relations. When assigning the attention weight to a certain key for a given query, both the content of the query and the content of the key need to be considered; the query content may be the features of a word in a sentence, and a key may be another word within the sentence. Besides, the relative position of the query and key should be considered. Shortly afterwards, attention mechanisms became popular in computer vision [29–32]. Some works have even applied attention mechanisms to SAR images [33–37]. In particular, [35] applies channel attention modules to ship classification in SAR images, [36] combines traditional hand-crafted HOG features and CNN features to improve classification accuracy, and a polarization fusion network with geometric feature embedding (PFGFE-Net) [37] is proposed for SAR ship classification. References [30–32] successfully extend relation networks and attention modules to the image domain, where long-range object-object and pixel-pixel relations are modeled. Reference [30] establishes the relationship between objects through the interaction of their appearance features and geometry. In [31], the response at a pixel is calculated as a weighted sum of the features of all pixels. Reference [32] proposes a learnable region feature extractor and unifies the previous region feature extraction modules from the perspective of pixel-object relations. A common problem with such methods is that the aggregation weights and the aggregation operation need to be calculated on the elements in a pairwise fashion, which brings a high cost since the amount of calculation is quadratic in the number of elements. In contrast to methods with a huge amount of calculation [6, 30–32], [7, 8] can be perceived as a special attention mechanism in which only a sparse set of elements have nonzero aggregation weights. According to Zhu’s [5] distinction of attention factors based on how the attention weight for a key is obtained once a query is determined, deformable convolutional networks [6–8] utilize an attention mechanism based on the query content and a relative position term; they operate more effectively and efficiently on object detection and semantic segmentation. The attended elements are specified by the learnable offsets [6], and the computational overhead is only linear in the number of elements. A modulation scalar is further introduced in [7]. RepPoints [8] is an object detection method that simultaneously models fine-grained localization information and identifies local areas significant for object classification; it can learn a geometric representation of objects. However, its deformability also depends only on implicit learning through an additional convolutional layer, and its spatial support may exceed the region of interest.
To strengthen the deformability of the convolution operation under an irregular sampling grid, we propose curvature-driven deformable convolutional networks (C-DCNets) based on explicit geometric properties of the preceding feature maps, where the curvature fields are utilized to guide offset learning, and the proposed C-DCNets modules are learned under the supervision of a loss function that correlates the position accuracy and the class prediction probability.

2.2. Postprocessing Technology of the Object Detector

Most deep-learning-based detectors use postprocessing steps such as non-maximum suppression (NMS) to avoid near-duplicate boxes. The original NMS does not consider context information. Greedy NMS [38] proceeds from high confidence scores to low confidence scores. Soft NMS [39] solves the problem of confidence score degradation caused by object occlusion. DIoU NMS [40] adds center point distance information to the bounding box screening process on the basis of Soft NMS. Learnable NMS methods [41] and relation networks [30] explicitly model relations between different prediction boxes with attention. In Fast NMS [42], each instance can be kept or discarded in parallel, but it removes slightly too many boxes. Some algorithms use a global inference scheme to model interactions between all predicted bounding boxes. For constant-size set prediction, [43] uses deep neural networks to predict a set of class-agnostic bounding boxes along with a single score for each box, and [44] uses recurrent neural networks. End-to-end object detection with transformers (DETR) [22] is the first combination of a bipartite matching loss and transformers with parallel decoding. It uses the Hungarian algorithm [45] to find a bipartite matching between prediction boxes and ground truth boxes, which enforces permutation invariance and guarantees that each target box has a unique match. However, it suffers from high computational complexity and low performance on small object detection. In [23], the deformable transformer (Deformable DETR) is proposed for end-to-end object detection; its attention modules only attend to a small set of key sampling points around a reference point. It combines the advantage of the sparse spatial sampling of deformable convolution with the relation modeling capability of transformers. For each query, multiscale deformable attention samples multiple points from multiscale inputs, which yields superior performance on small object detection without the help of FPN [46]. However, due to the lack of a global inference mode, deformable DETR still uses traditional NMS to improve its performance, and the complexity of the deformable transformer is very high when the number of object queries is large. Regardless of the NMS variant, its evaluation score is the product of the class prediction probability and the IoU.
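To make this evaluation score concrete, the snippet below is a minimal greedy-NMS sketch in PyTorch: the highest-scoring box is kept and every remaining box that overlaps it beyond a threshold is suppressed. The (x1, y1, x2, y2) box format, the helper names, and the default threshold are illustrative assumptions rather than code from any of the cited detectors; the scores passed in would be the product of class probability and IoU described above.

```python
import torch

def box_iou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise IoU between boxes in (x1, y1, x2, y2) format."""
    area_a = (a[:, 2] - a[:, 0]).clamp(min=0) * (a[:, 3] - a[:, 1]).clamp(min=0)
    area_b = (b[:, 2] - b[:, 0]).clamp(min=0) * (b[:, 3] - b[:, 1]).clamp(min=0)
    lt = torch.max(a[:, None, :2], b[None, :, :2])  # intersection top-left
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def greedy_nms(boxes: torch.Tensor, scores: torch.Tensor, iou_thresh: float = 0.5):
    """Keep the highest-scoring box, suppress its near-duplicates, repeat."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[i].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_thresh]  # drop boxes overlapping box i
    return keep
```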

3. Curvature-Driven Deformable Convolutional Networks for End-To-End Object Detection

3.1. Curvature-Driven Deformable Convolutional Networks

Geometric priors [47–49] play an essential role in Bayesian theory. Gradient priors and curvature flow are widely used in image denoising, restoration, super resolution, and other fields. The gradient reflects the first derivative of the image and is easily affected by noise, whereas the curvature describes how much an image level set bends, reflecting the change of the first derivative. Compared with gradient information, curvature fields reflect the tendency of sampling points to move towards the salient region of the image. In this paper, we apply the curvature of the preceding feature maps to the learnable offsets to obtain the final offsets; the larger the curvature, the larger the final displacement.

3.2. Curvature-Driven Deformable Convolution

The curvature of an image level set is defined as

$$\kappa = \operatorname{div}\!\left(\frac{\nabla u}{|\nabla u|}\right) = \frac{u_{xx}u_y^2 - 2u_xu_yu_{xy} + u_{yy}u_x^2}{\left(u_x^2 + u_y^2\right)^{3/2}}, \tag{1}$$

and the curvature vector is

$$\mathbf{c} = \kappa\,\mathbf{n}, \tag{2}$$

where

$$\mathbf{n} = \frac{\nabla u}{|\nabla u|} = \frac{(u_x, u_y)}{\sqrt{u_x^2 + u_y^2}}. \tag{3}$$
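As a minimal sketch of equations (1)–(3), the following PyTorch function computes the curvature and curvature vector of a single-channel map u by central finite differences; the replicate padding at the border and the epsilon guarding the division are our implementation choices.

```python
import torch
import torch.nn.functional as F

def curvature_field(u: torch.Tensor, eps: float = 1e-8):
    """Level-set curvature of a map u of shape (B, 1, H, W), following
    kappa = div(grad u / |grad u|), via central finite differences."""
    pad = lambda t: F.pad(t, (1, 1, 1, 1), mode="replicate")
    up = pad(u)
    # First derivatives (central differences).
    ux = 0.5 * (up[..., 1:-1, 2:] - up[..., 1:-1, :-2])
    uy = 0.5 * (up[..., 2:, 1:-1] - up[..., :-2, 1:-1])
    # Second derivatives.
    uxx = up[..., 1:-1, 2:] - 2 * u + up[..., 1:-1, :-2]
    uyy = up[..., 2:, 1:-1] - 2 * u + up[..., :-2, 1:-1]
    uxp = pad(ux)
    uxy = 0.5 * (uxp[..., 2:, 1:-1] - uxp[..., :-2, 1:-1])
    grad_sq = ux ** 2 + uy ** 2
    # Equation (1): level-set curvature.
    kappa = (uxx * uy ** 2 - 2 * ux * uy * uxy + uyy * ux ** 2) / (grad_sq ** 1.5 + eps)
    # Equations (2)-(3): unit normal n and curvature vector c = kappa * n.
    n = torch.stack((ux, uy), dim=-1) / (grad_sq.sqrt().unsqueeze(-1) + eps)
    return kappa, kappa.unsqueeze(-1) * n  # shapes (B,1,H,W) and (B,1,H,W,2)
```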

Given a convolutional kernel of $K$ sampling locations, let $w_k$ and $p_k$ denote the weight and prespecified offset for the $k$-th location, respectively. For example, $K = 9$ and $p_k \in \{(-1, -1), (-1, 0), \ldots, (1, 1)\}$ define a $3 \times 3$ convolutional kernel of dilation 1. Let $x(p)$ and $y(p)$ denote the features at location $p$ in the input feature maps $x$ and output feature maps $y$, respectively. The curvature-driven deformable convolution can be expressed as

$$y(p) = \sum_{k=1}^{K} w_k \cdot x\big(p + p_k + \Delta p_k \odot c_k\big), \tag{4}$$

where $\Delta p_k$ is the learnable offset for the $k$-th location and $c_k$ is the curvature vector at the $k$-th location. As the sampling positions are fractional, we use bilinear interpolation to compute $x(p)$:

$$x(p) = \sum_{q} G(q, p) \cdot x(q), \tag{5}$$

where $q$ enumerates all integral spatial locations in the feature maps $x$ and $G(\cdot, \cdot)$ is the two-dimensional bilinear interpolation kernel

$$G(q, p) = g(q_x, p_x) \cdot g(q_y, p_y), \tag{6}$$

where $g(a, b) = \max(0, 1 - |a - b|)$. The curvature fields are generated from the input feature maps, and $\Delta p_k$ is obtained via a convolutional layer applied over the same feature maps. The convolution kernel has the same spatial resolution and dilation as those of the current convolutional layer. The output offset fields have the same spatial resolution as the input feature maps, and the product of the point-by-point multiplication of the offset fields and the curvature fields is superimposed on the regular grid sampling positions of the standard convolution. The channel dimension $2K$ corresponds to the $K$ offsets $\{\Delta p_k\}_{k=1}^{K}$. During training, both the convolutional kernels and the offsets are learned simultaneously. To learn the offsets, the gradients are backpropagated through the bilinear operations in equations (5) and (6). The added convolutional layer and fully connected layer for offset learning are initialized with zero weights, and their learning rates are set to 0.1 times the learning rate of the existing layers.
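A sketch of how equation (4) might be prototyped on top of torchvision's deform_conv2d, reusing curvature_field from the sketch above: the learned offsets are multiplied point-wise by the curvature vector before sampling. Averaging channels to form a single level-set map, and tiling the curvature vector over all K kernel locations, are our assumptions about how the curvature field enters the computation; this is not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class CurvatureDeformConv2d(nn.Module):
    """Sketch of curvature-driven deformable convolution, equation (4)."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, padding: int = 1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, k, k))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        # Extra conv predicting the 2K offset channels, zero-initialized
        # (as in DCNv1) so training starts from the regular sampling grid.
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=padding)
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)
        self.k, self.padding = k, padding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_conv(x)                            # (B, 2K, H, W)
        # Curvature vector of the channel-averaged input (our assumption
        # for how a single level-set map u is formed from the features).
        _, cvec = curvature_field(x.mean(dim=1, keepdim=True))  # (B, 1, H, W, 2)
        cvec = cvec.squeeze(1).permute(0, 3, 1, 2)              # (B, 2, H, W), (x, y)
        cvec = cvec.flip(1)                                     # (y, x), torchvision's order
        cvec = cvec.repeat(1, self.k * self.k, 1, 1)            # tile over K locations
        # Point-by-point product of offset and curvature fields, equation (4).
        return deform_conv2d(x, offset * cvec, self.weight, padding=self.padding)
```

Because the offset branch is zero-initialized, the module behaves exactly like a standard convolution at the start of training and deforms its sampling grid as the offsets are learned.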

Figure 1 shows the deformable convolution. Figure 2 shows the sampling locations ($9^3 = 729$ red points in each image) in three levels of deformable filters for the activation units (green points). The receptive field and the sampling locations of the standard convolution are fixed no matter how many convolution levels are stacked. The sampling locations in DCNv1 and DCNv2 (shown in the middle) are adaptively adjusted according to the scale and shape of the object. The normalized modulation amplitudes in DCNv2 are learned by additional convolutional layers; therefore, the offsets and the modulation scalar in DCNv2 are entangled with each other. Compared with DCNv1, there is no big difference in the sampling positions because the modulation scalar acts on the whole convolution term. The sampling locations of our C-DCNets are shown at the bottom. It can be seen from the figure that the sampling locations of our C-DCNets are more concentrated in the salient region of the image.

3.3. Curvature-Driven Deformable RoI Pooling

Given the input feature map $x$ and a RoI of size $w \times h$ with top-left corner $p_0$, RoI pooling divides the RoI into $k \times k$ bins ($k$ is a free parameter) and outputs a $k \times k$ feature map $y$. For the $(i, j)$-th bin ($0 \le i, j < k$), we have

$$y(i, j) = \sum_{p \in \mathrm{bin}(i, j)} \frac{x(p_0 + p)}{n_{ij}}, \tag{7}$$

where $n_{ij}$ is the number of pixels in the bin. In curvature-driven deformable RoI pooling, offsets $\Delta p_{ij} \odot c_{ij}$ are added to the spatial binning positions:

$$y(i, j) = \sum_{p \in \mathrm{bin}(i, j)} \frac{x(p_0 + p + \Delta p_{ij} \odot c_{ij})}{n_{ij}}. \tag{8}$$

Equation (8) is implemented by the bilinear interpolation of equations (5) and (6). Figure 3 shows the process of obtaining the offsets. First, RoI pooling (equation (7)) generates the pooled feature maps. Second, the curvature fields are obtained from the input feature maps, and a fully connected layer generates the normalized offsets $\Delta\hat{p}_{ij}$, which are then transformed to the offsets $\Delta p_{ij}$ in (8) by element-wise product with the RoI's width and height, as $\Delta p_{ij} = \gamma \cdot \Delta\hat{p}_{ij} \odot (w, h)$, where $\gamma$ is a predefined scalar. The effect of curvature-driven deformable RoI pooling is shown in Figure 4. The regular grid structure of standard RoI pooling is no longer maintained, and the deformation ability of the sampling grid is enhanced.
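A compact sketch of equations (7) and (8) follows: for brevity it samples each bin at its offset-shifted center via bilinear interpolation (grid_sample) instead of averaging all $n_{ij}$ pixels in the bin, and the tensor layouts plus the assumption that the offsets arrive already scaled by the curvature field and by (w, h) are ours.

```python
import torch
import torch.nn.functional as F

def curvature_deform_roi_pool(x: torch.Tensor, roi, k: int, offsets: torch.Tensor):
    """Sketch of curvature-driven deformable RoI pooling, equation (8).
    x: (1, C, H, W) feature map; roi: (x0, y0, w, h) in pixels;
    offsets: (k, k, 2) per-bin shifts, assumed pre-multiplied by the
    curvature field and scaled by (w, h) as described in the text."""
    _, _, H, W = x.shape
    x0, y0, w, h = roi
    # Centers of the regular k x k bin partition, then add the offsets.
    ys = y0 + (torch.arange(k, dtype=x.dtype) + 0.5) * h / k
    xs = x0 + (torch.arange(k, dtype=x.dtype) + 0.5) * w / k
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    px = cx + offsets[..., 0]
    py = cy + offsets[..., 1]
    # Normalize to [-1, 1] for grid_sample (bilinear kernel, eqs. (5)-(6)).
    grid = torch.stack((2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1), dim=-1)
    return F.grid_sample(x, grid.unsqueeze(0), align_corners=True)  # (1, C, k, k)
```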

3.4. Loss Function Associated with Bounding Box Regression and Classification

Traditional object detection pipelines employ nonmaximum suppression (NMS) to select the prediction bounding box with the maximum score and remove spurious neighboring detection boxes. First, all detection boxes are sorted on the basis of their scores. The detection box M with the maximum score is selected, and all other detection boxes with a significant overlap (using a predefined threshold) with M are suppressed. The evaluation score of NMS is the product of the class prediction probability and the IoU. However, the training loss function in object detection networks uses the loss of position accuracy for bounding box regression and the loss of classification for recognition separately, which is not consistent with the evaluation score of NMS.

A linear combination of the $\ell_1$ loss and the generalized IoU loss [50],

$$\mathcal{L}_{\mathrm{box}}(b_i, \hat{b}_i) = \lambda_{\mathrm{iou}} \mathcal{L}_{\mathrm{iou}}(b_i, \hat{b}_i) + \lambda_{L1} \|b_i - \hat{b}_i\|_1, \tag{9}$$

is used in DETR, where $\lambda_{\mathrm{iou}}, \lambda_{L1} \in \mathbb{R}$ are hyperparameters. These two losses are normalized by the number of objects inside the batch. The generalized IoU loss is $\mathcal{L}_{\mathrm{iou}}(b_i, \hat{b}_i) = 1 - \big(|b_i \cap \hat{b}_i|/|b_i \cup \hat{b}_i| - |B(b_i, \hat{b}_i) \setminus (b_i \cup \hat{b}_i)|/|B(b_i, \hat{b}_i)|\big)$, where $|\cdot|$ means area. The union and intersection of box coordinates are used as shorthands for the boxes themselves. The areas of unions or intersections are computed by min/max of the linear functions of $b_i$ and $\hat{b}_i$, which makes the loss sufficiently well behaved for stochastic gradients. $B(b_i, \hat{b}_i)$ means the largest box containing $b_i$ and $\hat{b}_i$ (the areas involving $B$ are also computed based on min/max of linear functions of the box coordinates). Different from DETR, our problem is not N-to-N matching; generally speaking, the number of prediction bounding boxes is much larger than the number of ground truth boxes. For YOLOv3/YOLOv4, the last several convolutional layers predict an $S \times S \times [3 \cdot (4 + 1 + C)]$ 3-d tensor, where the input image is divided into an $S \times S$ grid and $C$ is the number of classes.

To correlate the bounding box regression and classification, we multiply the class prediction probability by the IoU of the predicted boxes and ground truth boxes to form the final class prediction distribution and substitute it into the binary cross entropy loss function:

$$\mathcal{L}_{\mathrm{cls}} = -\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \Big[\hat{p}_i(c) \log\big(\mathrm{IoU} \cdot p_i(c)\big) + \big(1 - \hat{p}_i(c)\big) \log\big(1 - \mathrm{IoU} \cdot p_i(c)\big)\Big], \tag{10}$$

where $\mathbb{1}_i^{\mathrm{obj}}$ denotes whether an object appears in cell $i$ and $\mathbb{1}_{ij}^{\mathrm{obj}}$ denotes that the $j$-th bounding box predictor in cell $i$ is responsible for that prediction. We replace the original binary cross entropy loss function in YOLOv4 with (10) and the original class prediction probability $p_i(c)$ with $\mathrm{IoU} \cdot p_i(c)$; therefore, the whole loss function of YOLOv4 is

$$\mathcal{L} = \mathcal{L}_{\mathrm{CIoU}} + \mathcal{L}_{\mathrm{conf}} + \mathcal{L}_{\mathrm{cls}}, \tag{11}$$

and

$$\mathcal{L}_{\mathrm{conf}} = -\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \big[\hat{C}_i \log C_i + (1 - \hat{C}_i) \log(1 - C_i)\big] - \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \big[\hat{C}_i \log C_i + (1 - \hat{C}_i) \log(1 - C_i)\big] \tag{12}$$

is the loss of confidence predictions for boxes. The best matching between the prediction bounding boxes and ground truth boxes can be obtained by the loss function associating the class prediction probability and position accuracy, because this loss function directly reflects the core evaluation metric. The C-DCNets modules in the detection network are also supervised by this loss function. Figure 5 shows example results on COCO validation using YOLOv4 trained with (left to right) the original loss function and the proposed loss function (11).
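As a sketch of the classification term (10), the function below scales the predicted class probabilities by the IoU of each responsible predicted box with its matched ground truth box before the binary cross entropy; it reuses box_iou from the NMS sketch in Section 2.2, and the flat (N, ...) tensor layout with one matched ground truth per predictor is a simplifying assumption.

```python
import torch

def iou_weighted_bce(pred_cls: torch.Tensor, target_cls: torch.Tensor,
                     pred_boxes: torch.Tensor, gt_boxes: torch.Tensor,
                     obj_mask: torch.Tensor) -> torch.Tensor:
    """Sketch of equation (10). pred_cls/target_cls: (N, C) probabilities
    and one-hot targets for the responsible predictors; pred_boxes/gt_boxes:
    (N, 4) matched boxes in (x1, y1, x2, y2); obj_mask: (N,) indicator
    playing the role of 1^{obj}_{ij} in the text."""
    iou = box_iou(pred_boxes, gt_boxes).diagonal().clamp(0, 1)  # per-pair IoU
    p = (pred_cls * iou.unsqueeze(-1)).clamp(1e-7, 1 - 1e-7)    # IoU-scaled prob
    bce = -(target_cls * p.log() + (1 - target_cls) * (1 - p).log())
    return (bce.sum(dim=-1) * obj_mask).sum()
```

Because the IoU factor enters the loss directly and box_iou is differentiable, gradients from this single term flow into both the classification branch and the box regression branch, which is what couples the two tasks.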

4. Experimental Results

4.1. Ablation Study

We use the PASCAL VOC 2007 [51] and COCO 2017 [52] data sets and follow their original protocols. For PASCAL VOC 2007, training is performed on the union of VOC 2007 trainval and VOC 2012 trainval, and evaluation is performed on the VOC 2007 test set. For COCO, our models are trained on the 120k images of the COCO 2017 trainval set and evaluated on the 20k images of the COCO 2017 test-dev set. For evaluation, we use AP, AP50, and AP95, where AP50 means mean average precision (mAP) with an IoU threshold of 0.50 and AP95 means mAP with an IoU threshold of 0.95, as performance measurements. We do not deliberately distinguish between small, middle, and large objects because Faster R-CNN and YOLOv3/YOLOv4 are strong at small object detection, and the latest transformer-based detector [53] shows excellent small object detection performance. ImageNet [54] pretrained ResNet-50 [55] is utilized as the backbone. For comparison under the same standard, we report the results of Faster R-CNN/YOLOv4 with the ResNet-50 backbone even though YOLOv4 recommends the CSPDarknet53 backbone. In training and inference, the parameter setting and training strategy mainly follow DCNv1 [6] and DCNv2 [7] except for the image resolution, iterations, and learning rates. The images are resized to have a shorter side of 600 pixels. A total of 30k and 50k iterations are performed on PASCAL VOC and COCO, respectively. The learning rates are . DCNv1 [6] and DCNv2 [7] show that the more regular convolutions are replaced, the better the final result is. According to DCNv2, employing deformable layers in the conv3–conv5 stages achieves the best tradeoff between accuracy and efficiency for object detection on COCO. To construct different deformable convolutional networks, we replace the convolution layers in the conv3–conv5 stages of YOLOv4 and Faster R-CNN with deformable/modulated-deformable/curvature-driven-deformable conv layers, and aligned RoI pooling is replaced by deformable/modulated-deformable/curvature-driven-deformable RoI pooling. In our experiments, the models are trained with 2 Nvidia RTX 2080 Ti GPUs.
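One way the layer replacement described above could be scripted is the recursive swap below; the helper name, the stride-1 restriction, and the criterion of replacing every 3×3 convolution are our assumptions, and selecting only the conv3–conv5 stages is left to the caller.

```python
import torch.nn as nn

def replace_with_cdconv(module: nn.Module) -> None:
    """Recursively swap 3x3, stride-1 convolutions for CurvatureDeformConv2d
    (the sketch from Section 3.2). Apply to the conv3-conv5 stages only."""
    for name, child in module.named_children():
        if (isinstance(child, nn.Conv2d) and child.kernel_size == (3, 3)
                and child.stride == (1, 1)):
            setattr(module, name, CurvatureDeformConv2d(
                child.in_channels, child.out_channels, k=3,
                padding=child.padding[0]))
        else:
            replace_with_cdconv(child)
```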

The comparison results of DCNv1, DCNv2, and our curvature-driven deformation modeling on the PASCAL VOC data set are shown in Table 1, and the comparison results on the COCO data set are shown in Table 2. On PASCAL VOC, DCNv1 obtains an increase of in scores compared to the baseline, and the DCNv2 module obtains further gains of about on the basis of DCNv1. On COCO, DCNv1 obtains an score of for Faster R-CNN and for YOLOv4 when the convolution layers in the conv3–conv5 stages and the aligned RoI pooling layer are replaced by their deformable counterparts, which are higher than the baseline by about , respectively. DCNv2 obtains further gains of about in scores with a small increase in parameters and FLOPs. The accuracies of DCNv1 and DCNv2 are lower than those reported in [6, 7]; the main reason is that the models we trained are slightly worse. Compared with the significant improvements on PASCAL VOC by DCNv1 and DCNv2, the improvements on COCO are not very significant. The reason is that COCO is larger and more challenging, which makes it more difficult to learn the offsets and modulation scalar implicitly. As shown in the tables, our C-DCNets achieve better results. On PASCAL VOC, our curvature-driven deformation model yields for Faster R-CNN and for YOLOv4, which is and higher than DCNv1 for Faster R-CNN and YOLOv4, respectively. On COCO, our curvature-driven deformation model yields for Faster R-CNN and for YOLOv4, which is and higher than DCNv1 for Faster R-CNN and YOLOv4, respectively. Note that the parameter quantity of our C-DCNets is the same as that of the DCNv1 model, the FLOPs are slightly increased compared with the DCNv1 model, and the performance is better than that of DCNv2. The improvement in performance is mainly due to the stronger deformation ability of our model.

Extensive ablation studies on object detection are performed to validate the efficacy and efficiency of the combination of the C-DCNets and the proposed loss function (11). We apply the YOLOv4 model with C-dconv@c3–c5 + C-dpool and replace YOLOv4's original loss function with the proposed loss function (11). For DETR, we choose the ResNet-50-based DETR model with 3 encoder layers, 3 decoder layers, and width 256 because of the limitation of our GPU configuration. The DETR model is trained for 100 epochs on 2 Nvidia RTX 2080 Ti GPUs, and the batch size is set to 4. Other parameters mainly follow DETR [22]. For deformable DETR, we just run the one-stage mechanism with single-scale inputs, the backbone is ResNet-50, and the number of object queries is set to 100. Other hyperparameter settings and the training strategy mainly follow deformable DETR [23]. Our DETR has a lower performance than published results because the baseline DETR has 6 encoder layers, 6 decoder layers, and width 256 with a long training schedule. Deformable DETR achieves the best performance with fewer training epochs compared with DETR. Even though deformable DETR greatly reduces the amount of computation, its complexity is much greater than that of YOLOv4, YOLOv4 with C-DCNets, and YOLOv4 with C-DCNets and the proposed loss function (11). Actually, the deformable transformer degenerates to a deformable convolution when multiscale attention is not applied and K = 1. The comparison results of different end-to-end detectors are shown in Table 3. It can be seen from the table that YOLOv4 with C-DCNets and the proposed loss function achieves better results with low complexity.

5. Conclusion

In this paper, curvature-driven deformable convolutional networks (C-DCNets) are proposed, which further enhance the deformation ability of the convolution operation under an irregular grid. The final offset fields are not only driven by the task goal but also guided by the curvature fields of the preceding feature maps, which deforms the network's sampling and pooling patterns to fit the object's structure. To be consistent with the evaluation score of the postprocessing of the detection network, a new loss function associating bounding box regression and classification is proposed, in which the class prediction probability in the binary cross entropy loss function integrates the similarity of the predicted boxes and targets. Experimental results on the PASCAL VOC 2007 and COCO 2017 data sets show that C-DCNets-based YOLOv4 with the proposed loss function outperforms state-of-the-art detectors without bells and whistles.

Data Availability

The experimental data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (No. 61701201).