Abstract

Convolutional neural network (CNN) models based on deep learning perform well in object detection. However, detection quality drops when the object is circular or tubular, because most existing object detection methods use the traditional rectangular box to detect and recognize objects. To solve this problem, we introduce a circular representation structure and the RepVGG module on the basis of CenterNet and expand the network prediction structure, yielding RebarDet, a high-precision, high-efficiency, lightweight circular object detection method. First, we optimize for circular tubular objects by replacing the traditional rectangular box with a circular box. Second, we increase the resolution of the network feature map and the upper limit on the number of objects detected in a single pass, expanding the network prediction structure to handle the dense arrangements in which circular tubular objects often occur. Finally, we introduce the multibranch topology of RepVGG, which sums the feature information extracted by different convolution modules and improves the ability of the convolution module to extract information. We conducted extensive experiments on a rebar dataset and used AB-Score as a new evaluation method to evaluate RebarDet. The experimental results show that RebarDet achieves an AB-Score of up to 0.8114 and a model inference speed of 6.9 fps while maintaining a moderate number of parameters, which is superior to other mainstream object detection models and verifies the effectiveness of our proposed method. RebarDet's high-precision detection of circular tubular objects can also facilitate intelligent manufacturing in enterprises.

1. Introduction

The detection and recognition of circular tubular objects has long been a basic and difficult problem in image processing. Circular tubular objects are common in traditional manufacturing, so their intelligent detection is an important part of making enterprises intelligent. Many traditional feature-based image processing methods have been proposed for the detection of circular tubular objects, such as genetic algorithms [1, 2], gradient pair vectors [3], and Hough transform filters [4–6]. Such methods have low detection accuracy and rely strongly on “hand-crafted” features from feature engineering, which greatly limits them.

In recent years, data-driven deep learning has shown superior performance in the field of artificial intelligence, and the convolutional neural network is one of the most popular deep learning structures [7, 8]. In computer vision, detection methods based on convolutional neural networks have made great breakthroughs in many areas, such as image classification [9], target detection [10, 11], semantic segmentation [12], instance segmentation [13], and gesture recognition [14]. In 2012, Krizhevsky et al. constructed AlexNet [15], an 8-layer convolutional neural network. AlexNet used ReLU as the activation function of a CNN for the first time, which mitigated the vanishing gradient problem that Sigmoid suffers from in deep networks, and applied Dropout to randomly ignore some neurons and avoid overfitting. Since then, more and more neural network models have been proposed. The VGG (Visual Geometry Group) [16] network designed by Simonyan and Zisserman improved performance mainly by increasing the depth of the network. Its convolutional layers use small convolution kernels. Compared with a large-kernel convolutional layer, a small-kernel convolutional layer has fewer parameters and can increase the nonlinearity of the mapping function. As the network deepens, VGG reaches a performance bottleneck at the 16th layer and then tends to saturate. Increasing the depth of a network as VGG does can improve performance to a certain extent, but this approach has two drawbacks. On the one hand, a deeper network needs to learn more parameters, which makes it prone to overfitting. On the other hand, a network with more layers requires more computing resources.

The GoogLeNet [17] network model designed by Szegedy et al. used a novel Inception structure as the basic module for cascading, and the network reaches a depth of 22 layers. Inception uses 3 convolution kernels of different sizes to extract feature information at different scales from the previous layer. The 1 × 1 convolution kernel is used to reduce the data dimension of the previous layer and the amount of computation in the subsequent 3 × 3 and 5 × 5 convolutional layers. This greatly reduces the number of network parameters while increasing the depth of the network, and makes full use of computing resources to improve computational efficiency. Although methods such as ReLU and batch normalization can alleviate vanishing or exploding gradients in deep neural networks to a certain extent, the problem remains serious when training a very deep network. In 2015, He et al. proposed the 152-layer residual network (ResNet) [18], which adds shortcut connections so that the output of a block is no longer only the mapping of its input, as in a traditional neural network, but the sum of that mapping and the input itself. This addresses the vanishing or exploding gradients encountered when training deep neural networks.

Detection methods based on these deep learning techniques have high accuracy, strong robustness, and strong transferability, but they also suffer from slow inference because the models are large. At the same time, most of these detection methods use the traditional rectangular box to detect and recognize objects, so they are not optimized for circular or tubular objects, which results in poor recognition and detection of such objects.

Several methods target circular objects specifically. Zelniker et al. [19] used a maximum likelihood estimation method based on convolutional neural network to better estimate the center and radius of circles in digital images as a method of circular object detection. Ayala-Ramirez et al. [20] proposed a circle detection method based on a genetic algorithm (GA), which can detect subpixel circles in composite images but performs poorly on small circular targets. Yang et al. proposed CircleNet [21], a circular object detection algorithm based on a convolutional neural network, which is used to identify and detect glomeruli, a spherical biomedical object. CircleNet improves the detection accuracy for circular tubular objects to a certain extent by adding a circular detection head to the network, but the model is complex and its detection efficiency is low. Because circular tubular objects often appear in large clusters, another difficulty in detecting and identifying them is dense scene detection. Recognizing and detecting objects in dense scenes is a difficult problem in object detection, for roughly two reasons. First, highly overlapping instances are likely to have very similar features, making it difficult for the detector to generate distinguishable predictions for each instance. Second, when instances overlap heavily, correct predictions may be incorrectly suppressed by non-maximum suppression (NMS). For these reasons, mainstream object detection algorithms, such as R-CNN [22–24], YOLO [25–27], and SSD [28], perform poorly in dense scenes.

To solve the above problems, we design RebarDet, a high-precision, high-efficiency, lightweight circular object detection method based on the network architecture of the object detection algorithm CenterNet [29], taking into account the characteristics of images of circular tubular objects and the particular demands of detecting them. First, a circular representation structure is introduced into the CenterNet architecture, and a circular box representation replaces the traditional rectangular box representation. The circular box representation is matched to the round cross-section of the circular tubular target and has better rotation consistency than the rectangular box representation. At the same time, the number of parameters in the detection representation is reduced from 4 to 3, which improves the detection effect of the network while reducing the model parameters. Second, we expand the prediction structure of the CenterNet network: we increase the resolution of the network feature map and the upper limit on the number of objects detected in a single pass, which addresses the dense arrangement typical of circular tubular objects. Third, to exploit the strengths of the RepVGG [30] block in information fusion and information extraction, we introduce the RepVGG block into the CenterNet network structure, which reduces the amount of model computation and improves the feature extraction capability of the network. Combining these improvements, we obtain RebarDet. Extensive experiments on the rebar dataset provided by Glodon verify the performance of RebarDet in the rebar detection setting. The experimental results show that our work brings a significant improvement in detection performance. We also compare RebarDet with current mainstream object detection networks on three performance indicators: AB-Score, inference speed, and number of parameters. The experimental results show that the accuracy and inference speed of RebarDet exceed those of existing mainstream object detectors at a moderate number of parameters.

In general, the main contributions of this work are as follows:
(1) A circular representation structure is introduced on the basis of the CenterNet network structure. It is matched to the round shape of circular tubular objects and improves the detection effect of the network while reducing the model parameters.
(2) The resolution of the network feature map and the upper limit on the number of objects detected in a single pass are increased. This expands the prediction structure of the CenterNet network and addresses the dense arrangement typical of circular tubular object detection.
(3) The RepVGG block is introduced into the CenterNet network structure to exploit its strengths in information fusion and information extraction, which reduces the amount of model computation and improves the feature extraction ability of the network.
(4) With a moderate number of parameters, the proposed RebarDet achieves higher accuracy and inference speed than existing mainstream object detectors. It reaches an AB-Score of 0.8114 and 6.9 fps on the rebar dataset provided by Glodon, significantly better than mainstream object detection models.

2. Methods

This work follows the general process of keypoint detection. Suppose the input image is $I \in \mathbb{R}^{W \times H \times 3}$, where $W$ and $H$ are the width and height of the image, respectively. In the prediction phase, the network generates a keypoint heat map $\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, where $R$ is the output stride relative to the original image and $C$ is the number of keypoint types in the object detection task. For example, in the COCO [31] object detection task, the value of $C$ is 80, representing 80 categories. In this way, $\hat{Y}_{x,y,c} = 1$ indicates that an object of category $c$ is detected at the coordinate $(x, y)$, and $\hat{Y}_{x,y,c} = 0$ indicates that there is no object of category $c$ at that coordinate. For training, the real keypoint $p$ of each object of a category $c$ is computed from its label. For an annotated box $(x_1, y_1, x_2, y_2)$, the center point is $p = \left(\frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2}\right)$, and the coordinate after downsampling is $\tilde{p} = \left\lfloor \frac{p}{R} \right\rfloor$, where $R$ is the downsampling factor 4, so the final center point is the one at low resolution. Next, a ground-truth heat map $Y \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$ is used to mark the image, and a Gaussian kernel

$$Y_{x,y,c} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$$

distributes the keypoints over the downsampled feature map, where $\sigma_p$ is a standard deviation related to the object size. If two Gaussian distributions of the same class overlap, the larger value is taken at each element.
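The ground-truth heat map construction described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `draw_heatmap` and its arguments are hypothetical, a single class is assumed, and the per-object standard deviations are passed in directly because the source does not state how they are derived from object size.

```python
import math

def draw_heatmap(centers, sigmas, width, height, stride=4):
    """Render a single-class ground-truth heat map at 1/stride resolution.

    centers: list of (x, y) object centers in input-image pixels.
    sigmas:  one standard deviation per object (assumed given).
    """
    out_w, out_h = width // stride, height // stride
    heatmap = [[0.0] * out_w for _ in range(out_h)]
    for (cx, cy), sigma in zip(centers, sigmas):
        # Low-resolution center: floor(p / R).
        px, py = int(cx // stride), int(cy // stride)
        for y in range(out_h):
            for x in range(out_w):
                dist_sq = (x - px) ** 2 + (y - py) ** 2
                value = math.exp(-dist_sq / (2 * sigma ** 2))
                # Where Gaussians overlap, keep the larger element.
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap
```

For a 32 × 32 image with one object centered at (16, 16) and a stride of 4, the peak value 1.0 lands at cell (4, 4) of the 8 × 8 heat map and decays with distance from it.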

2.1. Overview

Figure 1 shows the overall network structure of RebarDet. We introduce the circular representation structure and the RepVGG module into the network and expand the network prediction structure. Compared with the usual object detection backbone networks such as VGG and ResNet, Hourglass [32] is better suited as the feature extraction network of the model because it is more conducive to keypoint detection. The Hourglass network structure includes convolutional layers, deconvolutional layers, fully connected layers, etc. It repeatedly applies top-down and bottom-up processing, continuously encoding and decoding to infer the location of detection points. The overall network is built by stacking submodules and subnetworks, which gives Hourglass a high degree of flexibility despite its complex structure and makes it excellent at describing complex features. The repeated encoding and decoding give the network stronger representational ability and let it mix global and local information better. Compared with other object detection networks, the advantage of the Hourglass network is that, although the feature points of an object may appear in different layers of the network, the final feature map of Hourglass can better detect all the keypoints of the object. We introduce the RepVGG module at the beginning of the network. After the input image enters the network, it first passes through two RepVGG modules for feature extraction. The multibranch topology of RepVGG sums the information extracted by different convolution modules, which improves the ability of the shallow network to extract image features, reduces the loss of image detail, and helps the subsequent deep network extract high-level semantic information from the image. Then, the image passes through the residual module of the network and is downsampled. Here RebarDet reduces the original two residual modules to one, and the number of downsampling operations is also reduced from 2 to 1, which increases the resolution of subsequent feature maps and favors the detection of dense objects. Next, the image enters the Hourglass backbone network for feature extraction, and the extracted keypoint feature information is passed to the final circular detection head. As can be seen from Figure 1, the network has three detection heads, the heat map head, the local offset head, and the circle radius head, which detect the object category, the center point coordinates, and the radius, respectively, to complete the recognition and detection of circular objects.

2.2. Circle Representation

Existing detectors in the field of object detection, such as Faster R-CNN, YOLO, and SSD, generally use rectangular boxes parallel to the horizontal axis to represent the objects that the network predicts and recognizes. This choice reflects the fact that rectangular boxes fit most objects well. However, rectangular boxes fit poorly in some specific scenes, such as the recognition of circular tubular objects. As shown in Figure 2(a), using a rectangular box to fit rebar objects does not work well, and the number of missed detections is 5. The likely reason is that the cross-section of a circular tubular object is round and such objects are mostly densely distributed, so they are difficult to fit with rectangular boxes. RebarDet introduces a circular representation structure, using a circular box instead of a traditional rectangular box, to match the round cross-section of circular tubular objects. Specifically, we build a circular detection head that enables the convolutional neural network to regress the center point (px, py) and the radius of the circular box and thereby obtain the circular representation of the recognized object. As shown in Figure 1, the network uses the heat map head, the local offset head, and the circle radius head to complete the identification and detection of circular objects. Figure 2(b) shows the detection of the rebar objects after applying the circular box. The circular box detects the rebar objects better than the rectangular box and achieves the zero missed detections required for industrial accuracy. Figure 3 compares the rotation consistency of the rectangular box and the circular box experimentally. After the original image is rotated by 90 degrees, the number of objects missed with the rectangular box increases significantly, while the circular box misses only 2 objects. These experimental results show that the circular box has better rotation consistency than the rectangular box.
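To make the role of the three heads concrete, the following sketch shows one plausible way to turn their outputs into circle detections. It is not RebarDet's actual post-processing: the function name `decode_circles`, the score threshold, the 3 × 3 local-maximum test used to pick peaks, and the assumption that radii are predicted in feature-map units are all illustrative choices.

```python
def decode_circles(heatmap, offsets, radii, stride=4,
                   max_objects=256, threshold=0.3):
    """Combine heat map, offset, and radius maps into circles.

    heatmap[y][x] -> center score in [0, 1]
    offsets[y][x] -> (dx, dy) sub-cell correction of the center
    radii[y][x]   -> radius in feature-map units (assumption)
    Returns (cx, cy, r, score) tuples in input-image pixels.
    """
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            score = heatmap[y][x]
            if score < threshold:
                continue
            # Keep only cells that are maximal in their 3x3 neighborhood.
            neighborhood = [heatmap[ny][nx]
                            for ny in range(max(0, y - 1), min(h, y + 2))
                            for nx in range(max(0, x - 1), min(w, x + 2))]
            if score >= max(neighborhood):
                peaks.append((score, x, y))
    # Highest scores first, capped at the per-image object limit.
    peaks.sort(reverse=True)
    circles = []
    for score, x, y in peaks[:max_objects]:
        dx, dy = offsets[y][x]
        circles.append(((x + dx) * stride, (y + dy) * stride,
                        radii[y][x] * stride, score))
    return circles
```

Because every detection is a center plus a radius, each object is described by three numbers instead of the four needed for a rectangular box.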

2.3. Expansion of Prediction Structure

The CenterNet proposed by Zhou et al. has a shortcoming in actual training: if the center points of multiple objects of the same category coincide after the network downsamples the image, CenterNet cannot separate them, because only one center point remains and the objects can only be trained as one object. Circular tubular objects are densely and compactly distributed in most cases, so overlapping object centers are a more serious problem for them. To alleviate this, we expand the prediction structure of the network in RebarDet. Specifically, we increase the resolution of the feature map before the Hourglass subnetwork, so that the network detects object center points on a larger feature map, which reduces the overlap of center points. Figure 4(a) shows the enlarged feature map relative to the original. Figure 4(b) is a simplified schematic diagram of the detection effect after the feature map is enlarged. Compared with the small feature map, the large feature map separates the objects to be detected more widely, which improves the detection ability of the network in dense scenes and reduces the risk of missed detections. In addition, we raise the upper limit on the number of objects that the network can detect at a time from 128 to 256, which makes the network perform better in dense multitarget recognition scenarios.
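The effect of feature map resolution on center overlap can be illustrated with a small counting experiment. The sketch below is only an illustration: the helper `count_center_collisions` is hypothetical, and the strides 4 and 2 are assumed values chosen to represent a coarser and a finer feature map.

```python
from collections import Counter

def count_center_collisions(centers, stride):
    """Count objects whose downsampled center cell is already taken.

    centers: list of (x, y) object centers in input-image pixels.
    stride:  downsampling factor between image and feature map.
    """
    cells = Counter((int(cx // stride), int(cy // stride))
                    for cx, cy in centers)
    # Every object beyond the first in a cell cannot be told apart.
    return sum(n - 1 for n in cells.values() if n > 1)

# Three tightly packed objects and one isolated object.
centers = [(10, 10), (11, 9), (13, 10), (40, 40)]
coarse = count_center_collisions(centers, stride=4)
fine = count_center_collisions(centers, stride=2)
```

In this toy layout the first two objects share a cell at stride 4 but fall into different cells at stride 2, which mirrors the argument that a larger feature map separates dense objects.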

2.4. RepVGG Block

The architecture of RepVGG is simple and effective. It is equivalent to adding identity and residual branches to the blocks of the VGG network, and the entire network is composed only of convolution, BN, and ReLU modules. Figure 5 is a schematic diagram of the structure of the RepVGG block. In the training phase, RepVGG has a multibranch topology in which a 1 × 1 convolution (followed by a BN layer) and a BN layer are connected in parallel on either side of a 3 × 3 convolution (followed by a BN layer). This structure sums the feature information extracted by the different convolution modules, which improves the information extraction ability of a single convolution module. The shallow convolutions of the network are responsible for extracting low-level semantic features, and the richness of these low-level features directly determines the effectiveness of the high-level semantic features in subsequent convolutional layers, so we set the first two layers of the backbone network to be RepVGG blocks. The parallel branches improve the information extraction ability of the shallow convolutions and thus the feature extraction ability of the whole network, while reducing the amount of model computation [33]. The number of RepVGG blocks and the number of channels are discussed, with experimental results, in Section 3.7.
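The branch summation in a RepVGG block can be illustrated on a single-channel feature map. The sketch below is a simplification under stated assumptions: BN layers are omitted, there is one input and one output channel, and all function names are hypothetical. It also demonstrates the known RepVGG property that the three branches can be folded into one 3 × 3 kernel, which the source does not discuss in detail.

```python
def conv3x3(image, kernel):
    """Single-channel 3x3 cross-correlation, zero padding, stride 1."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for ky in range(3):
                for kx in range(3):
                    iy, ix = y + ky - 1, x + kx - 1
                    if 0 <= iy < h and 0 <= ix < w:
                        total += kernel[ky][kx] * image[iy][ix]
            out[y][x] = total
    return out

def relu(feature_map):
    return [[max(0.0, v) for v in row] for row in feature_map]

def repvgg_branches(image, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch + identity, then ReLU."""
    y3 = conv3x3(image, k3)
    summed = [[y3[y][x] + k1 * image[y][x] + image[y][x]
               for x in range(len(image[0]))]
              for y in range(len(image))]
    return relu(summed)

def merge_kernels(k3, k1):
    """Fold the 1x1 and identity branches into the 3x3 center tap."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1 + 1.0
    return merged

def repvgg_merged(image, k3, k1):
    """Equivalent single-branch form using the merged kernel."""
    return relu(conv3x3(image, merge_kernels(k3, k1)))
```

Both forms give the same output up to floating-point rounding, so the multibranch structure enriches training without requiring extra branches when the merged kernel is used.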

3. Experiments

3.1. Rebar Dataset

To verify the effectiveness of RebarDet on circular tubular object detection, we evaluate it in the context of rebar detection. All experiments in this paper are based on the rebar dataset released by Glodon on the DataFountain platform in 2019, which is also the only open-source rebar dataset so far. The dataset contains a total of 250 cross-sectional images of rebars and provides rectangular box annotations for training object detection models. We use 200 images containing 24,442 rebar samples as training data and the remaining 50 images containing 6,499 rebar samples as test data.
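Since the dataset ships rectangular box annotations while the detector predicts circles, the labels have to be converted at some point. The source does not describe this conversion, so the following is purely a hypothetical sketch: `box_to_circle` is an invented helper, and taking the radius as the mean of the half-width and half-height is an assumption.

```python
def box_to_circle(x1, y1, x2, y2):
    """Convert a box (x1, y1, x2, y2) into (cx, cy, radius).

    The center is the box center. The radius is the average of the
    half-width and half-height (an assumed convention).
    """
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    radius = ((x2 - x1) + (y2 - y1)) / 4
    return cx, cy, radius
```

For a square box this reduces to the inscribed circle; for slightly non-square boxes it averages the two extents.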

3.2. Score Metrics

The rebar counting competition published by Glodon on the DataFountain platform uses the F1-score as its scoring algorithm. However, the F1-score is a measure for classification problems and is not necessarily suitable for counting problems such as rebar counting. In our research, we found that a model with a high F1-score is not necessarily effective for actual rebar counting on a construction site. In this paper, we therefore evaluate RebarDet with AB-score, the scoring algorithm of a counting competition on the iFLYTEK OPEN PLATFORM. AB-score consists of two parts, A and B. Part A is the mAP value at an IoU of 0.5. For part B, as shown in Table 1, a score is assigned to the counting result obtained from each picture. If the predicted number is exactly correct, the score is 1. If the predicted number is off by 1, the score is 0.5. If it is off by 2, the score is 0.1. Otherwise, the score is 0. The final score of part B is the mean of all the image scores. The final AB-score combines the scores of parts A and B.

Figure 6 compares the detection results of two trained circular tubular object detection models on the test set. The abscissa is the difference between the predicted count and the real count, and the ordinate is the number of pictures. The F1-score and AB-score of the model in Figure 6(a) are 0.9872 and 0.7983, and those of the model in Figure 6(b) are 0.9476 and 0.8114. The figure shows that the latter model predicts more pictures exactly right than the former, and its errors are also concentrated in a smaller range. A model with a high F1-score is thus not as effective in the actual test as a model with a high AB-score, which indicates that AB-score is more suitable for the detection of circular tubular objects.
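Part B of the AB-score, as described in Section 3.2, is simple enough to state in code. The sketch below implements only the per-image counting score and its mean; the function names are hypothetical, and the way parts A and B are combined into the final AB-score is not reproduced here because the formula is not available in the text.

```python
def count_score(predicted, actual):
    """Score one image by how far the predicted count is from the truth."""
    diff = abs(predicted - actual)
    if diff == 0:
        return 1.0
    if diff == 1:
        return 0.5
    if diff == 2:
        return 0.1
    return 0.0

def part_b(predicted_counts, actual_counts):
    """Mean of the per-image counting scores over the test set."""
    scores = [count_score(p, a)
              for p, a in zip(predicted_counts, actual_counts)]
    return sum(scores) / len(scores)
```

For example, four images with count errors of 0, 1, 2, and 10 receive scores of 1, 0.5, 0.1, and 0, so part B averages to 0.4. The steep drop after an error of 2 is what makes this metric sensitive to counting accuracy rather than to per-object classification quality.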

3.3. Loss Function

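The loss function of network training follows CenterNet and consists of three parts: $L_k$, $L_{radius}$, and $L_{off}$. The predicted heat map $\hat{Y}$ is optimized against the ground-truth heat map $Y$ by pixel-wise regression with focal loss:

$$L_k = -\frac{1}{N} \sum_{x,y,c} \begin{cases} \left(1 - \hat{Y}_{x,y,c}\right)^{\alpha} \log\left(\hat{Y}_{x,y,c}\right) & \text{if } Y_{x,y,c} = 1 \\ \left(1 - Y_{x,y,c}\right)^{\beta} \left(\hat{Y}_{x,y,c}\right)^{\alpha} \log\left(1 - \hat{Y}_{x,y,c}\right) & \text{otherwise} \end{cases}$$

where $\alpha$ and $\beta$ are the hyperparameters of the focal loss, and $N$ is the number of keypoints in the image $I$. The radius of the circular box is optimized with $L_{radius}$:

$$L_{radius} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{R}_{p_k} - r_k \right|$$

where $\hat{R}_{p_k}$ is the radius predicted at the center point $p_k$ and $r_k$ is the true radius of each circular object $k$. CenterNet performs a downsampling operation on the image in the network. When the feature map is mapped back to the original image, this causes an accuracy error. Therefore, an additional local offset $\hat{O} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$ is used for each center point to compensate, where $W$ and $H$ are the width and height of the image and $R$ is the downsampling factor. All classes $c$ share the same offset prediction, and the offset is trained with an L1 loss:

$$L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{R} - \tilde{p} \right) \right|$$

where $p$ is a ground-truth center point and $\tilde{p} = \left\lfloor \frac{p}{R} \right\rfloor$ is its low-resolution equivalent.

Finally, the overall objective is

$$L_{det} = L_k + \lambda_{radius} L_{radius} + \lambda_{off} L_{off},$$

where the weights $\lambda_{radius}$ and $\lambda_{off}$ are set following CenterNet.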
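The three loss terms can be sketched in plain Python as follows. This is an illustration under assumptions, not the training code: the function names are hypothetical, the heat map is a single-class nested list, and the default values `alpha=2`, `beta=4`, `lambda_radius=0.1`, and `lambda_off=1.0` are the settings published for CenterNet, assumed here because the source only states that its settings follow CenterNet.

```python
import math

def focal_loss(pred, target, alpha=2, beta=4, eps=1e-12):
    """Pixel-wise focal loss between predicted and true heat maps."""
    total, num_keypoints = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1 - eps)  # keep log() finite
            if t == 1:
                total += (1 - p) ** alpha * math.log(p)
                num_keypoints += 1
            else:
                # Cells near a center (t close to 1) are penalized less.
                total += (1 - t) ** beta * p ** alpha * math.log(1 - p)
    return -total / max(num_keypoints, 1)

def l1_loss(pred, target):
    """Mean absolute error over per-object values (radius or offset)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / max(len(target), 1)

def total_loss(l_k, l_radius, l_off, lambda_radius=0.1, lambda_off=1.0):
    """Weighted sum of heat map, radius, and offset losses."""
    return l_k + lambda_radius * l_radius + lambda_off * l_off
```

The same `l1_loss` helper serves both the radius term and the offset term if the two offset components of each center are flattened into one list.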

3.4. Ablation Study

To verify the effectiveness of the proposed RebarDet model on the detection of circular tubular objects, we conduct ablation experiments on the rebar dataset provided by Glodon. In addition, we run comparative experiments against mainstream object detection models and compare their performance indicators to demonstrate the superiority of RebarDet. Specifically, we resize all training set images to the same resolution and keep the relevant training hyperparameters consistent to ensure a fair comparison. All models are evaluated using AB-score; see Section 3.2 for details.

3.5. Circle Representation

To verify the effectiveness of the circular representation structure, we compare the original CenterNet model with our improved CenterNet model that uses the circular representation structure. The experimental results are shown in the fifth and sixth columns of Table 2. We obtain an AB-score of 0.6984 after adding the circular representation structure to the basic CenterNet model. Compared with the basic CenterNet model, this is an improvement of 0.2353 AB-score, which shows that adding a circular representation structure to the network improves the performance of the detector. We see several reasons for such a large improvement. First, in the detection of circular tubular objects, the circular box fits better than the traditional rectangular box. Second, using the circular detection head to detect and represent the object reduces the number of parameters needed for detection and representation. Third, the circular box has better rotation consistency than the traditional rectangular box.

3.6. Expansion of Prediction Structure

Max objects denotes the upper limit on the number of objects detected by the network at a time. As shown in the third and fourth columns of Table 2, increasing max objects and increasing the feature map resolution yield AB-scores of 0.7568 and 0.7751, respectively. The entire expansion of the prediction structure improves the model by 0.0767 AB-score, which demonstrates the effectiveness of the design. We believe that increasing the size of the feature map separates the objects to be detected more widely, which greatly improves the detection ability of the network in dense scenes. Increasing max objects raises the network's capacity for detecting dense objects to a certain extent.

3.7. RepVGG Block

To verify the effectiveness of the RepVGG block, we examine how many RepVGG blocks to apply in the network and how many channels to use, and we conduct the corresponding comparative experiments to find the optimal combination. The experimental results are shown in Table 3. The model achieves the best accuracy with 2 RepVGG blocks and 128 channels. As shown in the second and third columns of Table 2, applying the RepVGG block improves the model by 0.0363 AB-score, which demonstrates the effectiveness of the multibranch topology of RepVGG. We think that this topology improves the extraction ability of the shallow convolutions at the beginning of the network: it places a 1 × 1 convolution (followed by a BN layer) and a BN layer in parallel on either side of a 3 × 3 convolution (followed by a BN layer) and sums the feature information extracted by the different branches, thus improving the information extraction ability of a single convolution module. The richness of the low-level semantic information extracted by this part of the network strongly influences the high-level semantic features of the subsequent convolutional layers. Therefore, improving the extraction ability of these shallow convolutions improves the performance of the entire network.

3.8. Network Performance

To demonstrate the superiority of the model, we compare it with mainstream object detection models on three performance indicators: AB-score, inference speed, and number of parameters. The experimental results are shown in Table 4. In terms of detection accuracy, RebarDet obtains the highest AB-score, 0.8114. While obtaining the highest detection accuracy, RebarDet also maintains a high inference speed of 6.9 fps, only slightly lower than the 7.1 fps of YOLOv5. In terms of model parameters, the SSD model with VGG-16 as the backbone network has the fewest parameters, but its AB-score and inference speed are not satisfactory. The RebarDet model based on Hourglass-104 maintains a moderate number of parameters, achieves the highest detection accuracy, and is faster than every other compared mainstream object detection model except YOLOv5.

4. Conclusion

In this paper, we introduce RebarDet, a high-accuracy, high-efficiency, lightweight method for detecting circular tubular objects. By introducing the circle representation structure and the RepVGG module on the basis of CenterNet and expanding the network prediction structure, we greatly improve the circle detection performance of the network. RebarDet is tested on a rebar dataset for object detection. The experimental results show that, with a moderate number of parameters, RebarDet's detection accuracy and inference speed are better than those of existing mainstream object detectors.

Data Availability

The datasets used and analysed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that there are no conflicts of interest associated with the manuscript.

Acknowledgments

Cong Lin and Zhoujian Chen are co-first authors and contributed equally to this work. Qiong Chen and Wencai Du are co-corresponding authors. This work was supported by the National Natural Science Foundation of China under Grant 62072121 and the Natural Science Foundation of Guangdong Province under Grant 2021A1515011847.