Abstract

Current recognition models for Zhuang minority pattern symbols often incur high computational overhead and low accuracy because the symbols have large feature vectors and complex features. In this paper, we present the efficient attention receptive field you only look once (Earf-YOLO), a new scheme to address these problems. Firstly, a global-local-transformer (GLocalT) structure is proposed, which introduces gating mechanisms into the axial self-attention module and designs a global-local training strategy. The gating mechanisms compensate for feature information lost along the height, width, and channel axes, and the global-local training strategy encodes long-term dependencies between features and reduces local information loss, giving the structure strong feature-expression ability. Besides, the strength receptive field block (SRFB) is proposed, which uses dilated convolution to control the eccentricity of the receptive field and enrich its feature information during training. With more branches, it can better extract multiscale features and enrich the feature space of the convolution block, and the branches are reparameterized and fused into the main branch during prediction, all of which improve the model's performance. Finally, some advanced training techniques are adopted to further enhance detection. Comparative experiments on the Zhuang pattern symbol and PASCAL VOC datasets show that the proposed model achieves the highest AP and FPS, demonstrating its efficiency.

1. Introduction

Ethnic minorities integrate their religious and totem culture into the pattern symbols of clothing and architectural decoration, which usually carry profound connotations and serve as the basis for classifying ethnic images [1]. Because minority patterns often feature exquisite colors and distinctive structural and artistic styles, they are significant for tracing the origin, distribution, and development of ethnic groups. With globalization and modernization, ethnic pattern culture is gradually disappearing, so inheriting, protecting, spreading, and utilizing China's traditional culture of pattern symbols deserves attention. Therefore, recognizing minority pattern symbols correctly and efficiently is vital to digital protection and the inheritance of ethnic culture.

Different from modern symbols, minority symbols have the following characteristics: (1) complex pattern structures; (2) bright colors; (3) rich accessories with different visual styles; and (4) rich connotations, especially texture details, often with strong ethnic characteristics. For example, Zhuang pattern symbols are bright in color, evident in color gradation, and varied in components. Besides, different branches of the same nationality have different pattern symbols; taking the Zhuang patterns as an example, the pattern symbols of different branches reflect their unique esthetics and styles.

Object detection, an essential branch of AI and pattern recognition, has been successfully applied to many areas, such as transportation [2, 3] and rescue [4, 5], and the demand in these areas is still growing. Images of minority patterns often contain many totems, patterns, and designs that serve as the basis for feature extraction and detection; without object detection, symbol extraction would fail to capture all of these features.

Recently, Huo et al. [6] classified ethnic costumes in natural settings into 11 representative ethnic categories, including Miao, Mongolian, and Korean, based on component detection and feature fusion of costume pattern symbols. Sun et al. [7] classified ethnic costumes by using Faster R-CNN to extract attribute features from the symbols of costume patterns. However, the large feature vectors of the pattern symbols extracted from ethnic costumes increase data storage and computational overhead. Besides, the semantic gap between low-level features and high-level attributes presents the following difficulties: (1) the symbols of ethnic patterns have distinct colors, various styles, and distinctive texture patterns, so how to divide the visual styles of ethnic pattern symbols and bridge the semantic gap between high-level visual attributes and geometric features is critical to improving recognition accuracy; (2) some small ethnic symbols with small coverage, low resolution, and inconspicuous features decrease detection efficiency; (3) current object detection models, such as the YOLO series [8–11], often require high computational overhead.

The YOLO series [8–11] plays a vital role in single-stage object detection. We propose an improved model, Earf-YOLO, based on YOLOv4 [11] to solve the above three problems. Earf-YOLO can extract global and local features and enlarge the model's receptive field, improving detection accuracy at a relatively fast detection speed. The overview of Earf-YOLO is shown in Figure 1. The main contributions are as follows:

(1) A new transformer architecture is designed to better describe the feature information of the pattern symbols. It adopts a gating self-attention mechanism to better aggregate features from the height, width, and channel axes, and it divides the feature map into patches, inputting both the patches and the original feature map into the transformer to learn long-distance dependencies between features and reduce local information loss.

(2) To increase the receptive field of pattern symbol extraction, enhance the ability to handle complex pattern symbols, and reduce the computational overhead of the recognition model, the strength receptive field block (SRFB) structure is designed to replace the redundant convolution layers in the feature pyramid of the model. It improves the ability of the convolutional neural network to extract deeper features and reduces the computational overhead of the model, accelerating training and recognition.

(3) Some advanced techniques, including Soft-NMS [12], GIoU loss [13], and focal loss [14], are integrated into Earf-YOLO, and their effects during training are verified. Experimental results demonstrate that these techniques improve detection performance.

(4) The frames per second (FPS) and average precision (AP) of previous models and the proposed model are compared on the Zhuang pattern symbol datasets, as shown in Figure 2. The results illustrate that Earf-YOLO achieves high performance in detecting pattern symbols.

2. Related Work

2.1. Traditional Object Detection Model

The detection task of Zhuang pattern symbols is to extract the style element features of Zhuang pattern symbols through the model, so as to locate and classify the symbols. In recent years, many researchers have studied object recognition models. Ribeiro et al. [15] proposed an end-to-end dual neural network architecture to recognize expiration dates on snack packaging, using neural networks to fuse global and local features. In recognizing Zhuang pattern symbols, we must attend to both the classification and the shape of multiple pattern symbols; since symbol classification and object positioning are different tasks within detection, a new detection network is needed. In our model, classification focuses on judging local features, while positioning focuses on judging the global feature region. Nguyen et al. [16] demonstrated an object frame generation method based on a deep convolutional neural network (DCNN), which trained an object positioning detector to learn deep feature information from the candidate bounding frames detected in the image. When recognizing pattern symbols, however, colors of geometric features and background features overlap, making such a model unable to explore the deep feature information of the relevant graphic primitives and background. Erhan et al. [17] focused on processing similar instance objects in an image and proposed a saliency-inspired neural network to detect objects of unknown categories. Although Zhuang pattern symbol images also contain multiple similar objects and likewise require multiobject detection, the detection accuracy of such a model is low because of the complex background of the Zhuang patterns.

2.2. Two-Stage Object Detection Model

Because of the low accuracy of traditional object detection algorithms, Girshick et al. [18] proposed the two-stage detection model R-CNN. Firstly, R-CNN uses a selective search algorithm to extract 2000 candidate frames from the image to be detected. Then, it scales the 2000 candidate frames to 227 × 227 and uses a convolutional neural network to extract features from the candidate frames to obtain feature vectors. Finally, the model inputs the feature vectors into a support vector machine and a fully connected network: the support vector machine classifies the feature vectors to obtain category information, and the fully connected network performs regression on the feature vectors to obtain the corresponding coordinates. Although R-CNN is cleverly designed, its detection is divided into multiple stages, resulting in a significant decrease in detection efficiency. Therefore, Girshick [19] proposed Fast R-CNN, which does not input every candidate frame into the deep learning model; instead, it selects the candidate frames, maps them onto the network's feature map, and obtains the predicted category and the position of the prediction frame. The model improves detection speed but still spends much time performing selective search for candidate frames. To solve this problem, Faster R-CNN [20] and Mask R-CNN [21] added a region proposal network on top of Fast R-CNN, which extracts candidate frames by setting anchors of different scales and replaces traditional candidate frame generation methods such as selective search, improving the computing speed of the network. With the development of deep learning, however, affected by the complexity of the backbone network, the number of candidate frames, and the complexity of the classification and regression subnetworks, among other factors, the above techniques require high computational overhead, which seriously degrades prediction and training performance.

2.3. One-Stage Object Detection Model

To address the low efficiency of two-stage object detection algorithms, YOLOv1 [8] removes the candidate frame extraction branch of the two-stage pipeline and directly performs feature extraction, candidate frame classification, and regression in the same deep convolutional network, allowing a single network to complete both classification and location regression. By abandoning the candidate frame stage, YOLOv1 speeds up detection; however, it is not accurate enough in locating objects and has a low recall rate, resulting in low detection accuracy. Farhadi et al. [9] proposed the YOLOv2 model to address this problem, mainly using a multiscale classifier and a multiscale object frame position detector to improve accuracy. Although YOLOv2 improves accuracy considerably, it is still not ideal for subsequent industrial applications. YOLOv3 [10] designs the Darknet53 residual network and a feature pyramid network, drawing on residual networks and the RPN of Faster R-CNN to improve network depth and spatial representational ability. Therefore, many scholars have conducted studies based on YOLOv3. Li et al. [22] performed rapid detection of cracks in aircraft fuselages and engine blades by using depthwise separable convolution and a feature pyramid on top of YOLOv3. Shi et al. [4] optimized the YOLOv3 model by reducing its parameters, improving its detection speed for underwater objects, optimizing the residual network, and strengthening its feature extraction ability. Although the YOLOv3-based methods mentioned above can identify large objects well, they easily neglect hard-to-detect and overlapping features. Bochkovskiy et al. [11] proposed the YOLOv4 model to solve those problems by applying advanced bag-of-freebies and bag-of-specials methods to achieve better detection results. However, the model is difficult to deploy on resource-constrained platforms because of its large number of network parameters and its large computational overhead. To address the massive overhead of neural networks that limits detection and inference on mobile devices, Zhou et al. [23] proposed the RSANET model, which introduced lightweight convolution (LCNet) and residual attentional pyramid networks as the prediction head; their experiments proved that the model can reduce computational overhead effectively. John and Mita [24] proposed a residual semantic-guided attention feature pyramid network with input and output branches: the model uses the input branch to extract the features of a single sensor and then uses residual connections to integrate the extracted features into the output perception branch. Although both models perform well on certain experimental datasets, they show low detection accuracy, high detection error, and a high missed detection rate when detecting small specific objects in Zhuang patterns. Based on the previous research, we present the improved Earf-YOLO model in this paper, which optimizes YOLOv4 to address the above problems.

3. Methods

This section details the Earf-YOLO, the proposed Zhuang pattern symbol recognition model, including the introduction of its structure and its contributions.

3.1. Network Structure

Accuracy and computational overhead are essential indexes for judging the performance of an object detection model. YOLOv4, one of the classical object detection models, requires a high computational overhead to ensure accuracy. Therefore, we focus on detecting symbols of Zhuang patterns accurately with minimal computational overhead. Based on previous studies, we propose the Earf-YOLO model built on YOLOv4, as shown in Figure 3, which mainly consists of the backbone, the neck, and the transformer prediction part (transformer predict). First, the backbone uses CSPDarkNet53, which introduces the CSPNet structure [25] to reduce computational overhead, eliminate redundant gradient information during backward optimization, and enhance the learning ability of the convolutional network, ensuring accuracy while keeping the network lightweight. Second, the neck adopts the strength receptive field block (SRFB), the global-local-transformer (GLocalT), and the path aggregation network (PANet) [26]. The SRFB structure effectively enlarges the receptive field of the network and extracts important context features, and GLocalT extracts the local and global features of Zhuang pattern symbols. PANet is an improved version of the feature pyramid network (FPN) [27] that adds a bottom-up path augmentation structure to avoid losing shallow feature information during transmission, improving prediction accuracy. Finally, transformer predict is used for regression and classification. Unlike YOLOv4, Earf-YOLO uses the global transformer to predict on three feature maps of different sizes to detect small, medium, and large objects. The sizes of the prior frames are obtained by clustering the sample objects with the k-means algorithm, based on which the size and position of each prediction frame can be calculated by relative offsets.
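To make the prior-frame step concrete, the following minimal sketch (our illustration, not the authors' code) clusters ground-truth box widths and heights with k-means using the common 1 - IoU distance; the number of clusters and the normalization of the box sizes are assumptions.

```python
# Illustrative k-means anchor clustering with the 1 - IoU distance; box widths
# and heights are assumed to be scaled to the 416 x 416 network input.
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes (N, 2) and anchors (K, 2) given widths/heights only."""
    inter_w = np.minimum(boxes[:, None, 0], anchors[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    inter = inter_w * inter_h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign every box to the anchor with the highest IoU (lowest 1 - IoU).
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new_anchors = np.array([
            boxes[assign == j].mean(axis=0) if np.any(assign == j) else anchors[j]
            for j in range(k)
        ])
        if np.allclose(new_anchors, anchors):
            break
        anchors = new_anchors
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sorted by area
```

The nine resulting anchors would typically be split into three groups of three and assigned to the small, medium, and large prediction scales.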

3.2. Global-Local-Transformer

With the wide application of the transformer [28] in natural language processing, transformers [29] have also been used for computer vision tasks. Recent studies show that transformer-based models achieve good detection results only when trained on datasets with rich features. Because the Zhuang pattern symbol datasets contain various pattern symbols and small textures, it is challenging for a traditional transformer architecture to learn the positional encoding of the image. Therefore, following the gated attention mechanism for medical segmentation proposed by Valanarasu et al. [30], we design a global-local-transformer (GLocalT) structure to detect symbols of Zhuang patterns. It inputs global and local features into the transformer to extract and fuse features, respectively, and it adopts a gating axial attention layer as the basic building block of the transformer. The architecture of GLocalT is shown in Figure 4(a).

Gating axial attention layer. In a traditional transformer, the affinity $q^{\top}k$ is used to compute global attention, and the value matrix $v$ is aggregated accordingly, where $q$, $k$, and $v$ denote the query, key, and value matrices, respectively. They are all obtained by multiplying the input $x$ with projection matrices learned by the model. This approach enables the model to capture nonlocal information from the global feature map, but it requires a large amount of computation, and the overhead grows as the feature map becomes larger. Moreover, such a self-attention layer does not exploit positional feature information when computing nonlocal context, while positional information is crucial for recognizing Zhuang pattern symbols and is usually used to locate objects. Researchers [30–32] decompose the self-attention module into two modules to reduce the complexity of computing the affinity: the first performs self-attention along the height axis and the second along the width axis. Axial attention along the height and width axes effectively approximates the original self-attention mechanism with better computational efficiency. In addition, to make the self-attention mechanism more sensitive to positional information when computing the affinity, they attach positional bias terms and a gating mechanism to $q$, $k$, and $v$, enabling the model to capture long-range interactions with precise positional information. Therefore, based on the above discussion, for any input feature map $x \in \mathbb{R}^{C \times H \times W}$ with height $H$, width $W$, and channel $C$, a gated attention mechanism with positional encodings along the height, width, and channel axes is used to improve the model's ability to compute the affinity between features. To promote computational efficiency, the three gated attention mechanisms are applied in parallel. The gated attention mechanism is computed as shown in Figure 4(b); along the width axis, for example, it can be expressed as

$$y_{ij} = \sum_{w=1}^{W}\operatorname{softmax}\left(q_{ij}^{\top}k_{iw} + G_{Q}\,q_{ij}^{\top}r^{q}_{iw} + G_{K}\,k_{iw}^{\top}r^{k}_{iw}\right)\left(G_{V_1}\,v_{iw} + G_{V_2}\,r^{v}_{iw}\right),$$

where $r^{q}$, $r^{k}$, and $r^{v}$ represent the additional positional codes for $q$, $k$, and $v$, respectively, and $G_{Q}$, $G_{K}$, $G_{V_1}$, and $G_{V_2}$ represent the weights calculated by the gating mechanism; they are learnable parameters. The amount of information carried by the key, query, and value can be controlled by gating the positional embedding codes of the feature map. Generally, the gate values become large when the model accurately learns the relative positional coding. Finally, the feature maps obtained from the width, height, and channel axes are added and passed through a convolution in the gated axial transformer, as shown in Figure 4(c). The calculation is

$$y = \operatorname{Conv}_{1\times1}\left(y_{W} + y_{H} + y_{C}\right),$$

where $y_{W}$, $y_{H}$, and $y_{C}$ denote the outputs of the width-, height-, and channel-axis attention branches, respectively.
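For illustration, the following PyTorch sketch implements a simplified single-head gated axial attention along the width axis; the height- and channel-axis branches follow the same pattern. The module name, single-head setup, and gate initialization are our assumptions rather than the paper's implementation.

```python
# Minimal single-head gated axial attention along the width axis (sketch).
# r_q, r_k, r_v are relative position codes; g_q, g_k, g_v1, g_v2 are the
# learnable gates that control how much positional information is used.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAxialAttentionW(nn.Module):
    def __init__(self, dim, width):
        super().__init__()
        self.to_qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        self.r_q = nn.Parameter(torch.randn(width, dim))
        self.r_k = nn.Parameter(torch.randn(width, dim))
        self.r_v = nn.Parameter(torch.randn(width, dim))
        self.g_q = nn.Parameter(torch.ones(1))
        self.g_k = nn.Parameter(torch.ones(1))
        self.g_v1 = nn.Parameter(torch.ones(1))
        self.g_v2 = nn.Parameter(torch.ones(1))

    def forward(self, x):                                      # x: (B, C, H, W)
        q, k, v = self.to_qkv(x).chunk(3, dim=1)
        # Treat each row (fixed height index) as an independent sequence.
        q, k, v = [t.permute(0, 2, 3, 1) for t in (q, k, v)]   # (B, H, W, C)
        logits = torch.einsum('bhwc,bhuc->bhwu', q, k)          # content affinity
        logits = logits + self.g_q * torch.einsum('bhwc,uc->bhwu', q, self.r_q)
        logits = logits + self.g_k * torch.einsum('bhuc,uc->bhu', k, self.r_k).unsqueeze(2)
        attn = F.softmax(logits, dim=-1)
        out = self.g_v1 * torch.einsum('bhwu,bhuc->bhwc', attn, v) \
            + self.g_v2 * torch.einsum('bhwu,uc->bhwc', attn, self.r_v)
        return out.permute(0, 3, 1, 2)                          # back to (B, C, H, W)
```

The three axial branches would be run in parallel and summed before the 1 × 1 convolution, as in Figure 4(c).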

Global-local training. Dividing the image into multiple patches for training accelerates the transformer's convergence and helps the model extract finer texture details. However, when recognizing Zhuang pattern symbols, if the image is divided into multiple patches to train the model, the object frame may be larger than a patch, limiting the information dependence between pixels. To improve the overall understanding of image features, we feed the feature maps into a global branch and a local branch, as shown in Figure 4(a). The global branch uses the transformer to learn long-distance dependencies between features, while the local branch uses the transformer to make up for the local detail features lost by patching. In the global branch, the whole feature map is input into GLocalT to model long-range dependencies and extract the global features of the pattern symbols. In the local branch, the feature map is divided into 16 patches of size $\frac{H}{4}\times\frac{W}{4}$, where $H\times W$ is the size of the original feature map; the 16 patches are then input into GLocalT, which focuses on the finer details of local features. The feature maps output by the two branches are finally added and passed through a 1 × 1 convolution layer to produce the final output.
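The two-branch scheme can be sketched as follows under stated assumptions: any GLocalT-style block that preserves the feature-map shape can be plugged in, the local branch uses a 4 × 4 grid of patches, and the fusion is a 1 × 1 convolution as described above.

```python
# Sketch of the global-local branches; `block` is a placeholder transformer
# block (e.g., a stack of gated axial attention layers) that keeps (B, C, H, W).
import torch
import torch.nn as nn

class GlobalLocalBranches(nn.Module):
    def __init__(self, dim, block: nn.Module):
        super().__init__()
        self.global_block = block   # whole feature map: long-range dependencies
        self.local_block = block    # shared here only to keep the sketch short
        self.fuse = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):           # x: (B, C, H, W), H and W divisible by 4
        g = self.global_block(x)
        b, c, h, w = x.shape
        # Split into 16 patches of size H/4 x W/4 and run the block on each.
        patches = x.unfold(2, h // 4, h // 4).unfold(3, w // 4, w // 4)
        patches = patches.contiguous().view(b, c, 16, h // 4, w // 4)
        outs = [self.local_block(patches[:, :, i]) for i in range(16)]
        # Stitch the 16 patch outputs back into an H x W map.
        l = torch.stack(outs, dim=2).view(b, c, 4, 4, h // 4, w // 4)
        l = l.permute(0, 1, 2, 4, 3, 5).contiguous().view(b, c, h, w)
        return self.fuse(g + l)     # add the two branches, then 1 x 1 convolution
```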

3.3. Strength Receptive Field Block

The research of Ren et al. [33] and Chen et al. [34] proved that expanding the receptive field of the model can improve accuracy. He et al. [35] proposed the spatial pyramid pooling (SPP) block, which uses max pooling with multiple parallel k × k kernels to obtain receptive fields and extract feature information. The SPP structure can increase the receptive field of the model and obtain multiscale feature information of Zhuang pattern symbols. However, it does not consider the eccentricity of the receptive field when recognizing Zhuang pattern symbols, so every pixel of a pattern symbol image contributes equally within the receptive field, the vital information in the receptive field is not emphasized, and the inference time during prediction increases. Based on the above discussion, we put forward the strength receptive field block (SRFB) structure, which adopts multiple convolution kernels of different sizes to build multibranch pooling; within the branches, the SRFB structure uses the dilation rate of the convolution layers to control the eccentricity of the receptive field, and during prediction it fuses the branch matrices into a single convolution to optimize the network structure of YOLOv4 [11]. Compared with the SPP structure, the SRFB structure has more "microstructures" with rich feature information, increasing the receptive field of the model's feature extraction. Each feature extracted by convolution contains extensive feature information, which reduces the computational overhead during prediction. The SRFB structure, as shown in Figure 5(a), uses parallel layers with kernel sizes of 3 × 3, 1 × 3, 1 × 1, and 3 × 1, each of which is batch normalized.

During training, the SRFB structure uses parallel convolution layers with kernels of 3 × 3, 1 × 3, 1 × 1, and 3 × 1 to increase the receptive field of the structure, enhance the model's feature-aggregation ability, and deepen the nonlinear expression ability of the network. The SRFB applies batch normalization to reduce overfitting and speed up training. The batch normalization of a convolution branch is shown in (5):

$$O_{:,:,j} = \left(\sum_{k=1}^{C} M^{(k)} \ast F^{(k)}_{j} - \mu_{j}\right)\frac{\gamma_{j}}{\sigma_{j}} + \beta_{j},$$

where $M^{(k)}$ is the k-th channel of the input feature map, $F^{(k)}_{j}$ is the k-th channel of the j-th convolution kernel, $j$ indexes the output feature channel corresponding to the j-th convolution kernel, $\mu_{j}$ and $\sigma_{j}$ are the channel-wise mean and standard deviation of batch normalization, and $\gamma_{j}$ and $\beta_{j}$ denote learnable parameters obtained by gradient descent.
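A minimal training-time sketch of the SRFB branches, assuming PyTorch, is shown below; the channel counts are illustrative, and the optional dilation rate on the 3 × 3 branch mirrors the eccentricity control described above (with a dilation rate greater than 1, the branch layouts would have to match before they could be merged at inference).

```python
# Training-time SRFB sketch: four parallel branches (3x3, 1x3, 3x1, 1x1),
# each followed by batch normalization; their outputs share one resolution
# and are summed.
import torch
import torch.nn as nn

class SRFBTrain(nn.Module):
    def __init__(self, channels, dilation=1):
        super().__init__()
        def branch(kh, kw, dil=1):
            pad = ((kh - 1) // 2 * dil, (kw - 1) // 2 * dil)
            return nn.Sequential(
                nn.Conv2d(channels, channels, (kh, kw), padding=pad,
                          dilation=(dil, dil), bias=False),
                nn.BatchNorm2d(channels))
        self.b3x3 = branch(3, 3, dilation)   # dilation controls the eccentricity
        self.b1x3 = branch(1, 3)
        self.b3x1 = branch(3, 1)
        self.b1x1 = branch(1, 1)

    def forward(self, x):
        return self.b3x3(x) + self.b1x3(x) + self.b3x1(x) + self.b1x1(x)
```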

The additivity of convolution states that two-dimensional convolution kernels of different sizes applied to the same input with the same stride produce outputs of the same resolution, and those outputs can be added. Equivalently, the kernels can be added at corresponding positions to produce a single kernel with the same output, as shown in (6). During prediction, the SRFB uses this additivity to convert the 3 × 3, 1 × 3, 1 × 1, and 3 × 1 convolution kernels into a new 3 × 3 convolution kernel to enrich the convolution feature information, as shown in Figure 6:

$$I \circledast K^{(1)} + I \circledast K^{(2)} = I \circledast \left(K^{(1)} \oplus K^{(2)}\right),$$

where $I$ signifies the input feature matrix, $K^{(1)}$ and $K^{(2)}$ represent two-dimensional convolution kernels with compatible sizes, $\oplus$ represents addition at corresponding kernel positions, and $\circledast$ represents the two-dimensional convolution operator. Compatibility means that the smaller kernel can be padded to the size of the larger kernel.

The homogeneity of convolution implies that the batch normalization applied to the feature space of the neural network can be equivalently folded into the convolution during prediction. According to the homogeneity of convolution, a new kernel plus a bias can be constructed on each branch, as shown in the following equations:

$$F'_{j} = \frac{\gamma_{j}}{\sigma_{j}}\,F_{j},\qquad b_{j} = \beta_{j} - \frac{\mu_{j}\,\gamma_{j}}{\sigma_{j}},$$

where $F'_{j}$ and $b_{j}$ are the equivalent kernel and bias of the j-th output channel.

By adding the parallel convolution kernels and the asymmetric convolution kernels, the four normalized 3 × 3, 1 × 1, 1 × 3, and 3 × 1 convolution branches are merged into a standard convolution layer. This new structure obtains rich feature information without additional computational overhead. The result after merging is

$$O = O_{3\times3} + O_{1\times3} + O_{3\times1} + O_{1\times1},$$

where $O_{3\times3}$, $O_{1\times3}$, $O_{3\times1}$, and $O_{1\times1}$ represent the outputs of the 3 × 3, 1 × 3, 3 × 1, and 1 × 1 convolution layers, respectively.
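The prediction-time fusion can be sketched as follows, assuming the `SRFBTrain` module above with a dilation rate of 1: the batch normalization of each branch is folded into its kernel and bias, the smaller kernels are zero-padded to 3 × 3, and the kernels and biases are summed into one standard convolution, following the additivity of convolution in (6).

```python
# Fuse the SRFB branches into a single 3x3 convolution for inference (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

def fold_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Return an equivalent kernel and bias with the BN statistics absorbed."""
    std = torch.sqrt(bn.running_var + bn.eps)
    kernel = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    bias = bn.bias - bn.running_mean * bn.weight / std
    return kernel, bias

def fuse_srfb(block) -> nn.Conv2d:
    kernels, biases = [], []
    for seq in (block.b3x3, block.b1x3, block.b3x1, block.b1x1):
        k, b = fold_bn(seq[0], seq[1])
        kh, kw = k.shape[-2:]
        # Zero-pad smaller kernels so their centres align with the 3x3 centre.
        k = F.pad(k, ((3 - kw) // 2, (3 - kw) // 2, (3 - kh) // 2, (3 - kh) // 2))
        kernels.append(k)
        biases.append(b)
    fused = nn.Conv2d(block.b3x3[0].in_channels, block.b3x3[0].out_channels,
                      kernel_size=3, padding=1, bias=True)
    fused.weight.data = sum(kernels)
    fused.bias.data = sum(biases)
    return fused

# Sanity check (in eval mode the fused layer reproduces the multi-branch output):
# block = SRFBTrain(64).eval(); x = torch.randn(1, 64, 32, 32)
# assert torch.allclose(block(x), fuse_srfb(block)(x), atol=1e-5)
```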

Note that the kernels of the SRFB structure can only be equivalently converted at inference time, as shown in Figure 5(b). During training, the branches are randomly initialized and obtain their gradients through different computations, so they cannot be merged equivalently at that stage.

3.4. Bag of Freebies

Generally, strategies that only increase training cost without increasing inference cost are called the "bag of freebies" in the object detection field. The bag of freebies mainly optimizes the loss function so that the model fits the data better. An image may contain thousands of candidate objects, but only a small part needs to be detected. Compared with two-stage detectors, a one-stage detector does not use a region proposal network, which results in an imbalanced distribution of positive and negative samples during training and makes the detection loss susceptible to the loss of the negative samples. Lin et al. [14] proposed focal loss, obtained by modifying the cross-entropy loss function, to reduce the influence of negative samples. In this paper, focal loss replaces the classification loss function of YOLOv4 to decrease the background influence when recognizing Zhuang pattern symbols. Focal loss weights the cross-entropy loss function, as shown in (10), thereby addressing the imbalance between positive and negative samples and between easy and hard samples. It defines a weighting factor $\alpha_{t}$ applied to the cross-entropy loss to balance positive and negative samples: when the number of positive samples is small, $\alpha_{t}$ is large and the loss of positive samples increases. It further introduces a modulating factor $(1 - p_{t})^{\gamma}$ to down-weight easy samples so that the model focuses on training hard samples:

$$\mathrm{FL}(p_{t}) = -\alpha_{t}\,(1 - p_{t})^{\gamma}\log(p_{t}),\qquad p_{t} = \begin{cases} p, & y = 1,\\ 1 - p, & y = 0,\end{cases}$$

where $p$ represents the predicted classification probability of a sample and $y$ indicates the positive or negative label: $y = 0$ denotes a negative sample and $y = 1$ a positive sample.
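A minimal sketch of the focal loss in (10) for sigmoid outputs is given below; the α and γ values are the common defaults from Lin et al. [14], not values tuned in this paper.

```python
# Focal loss sketch: p_t is the probability of the true class, alpha_t balances
# positives/negatives, and (1 - p_t)^gamma down-weights easy samples.
import torch

def focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """pred: probabilities in (0, 1); target: 0/1 labels of the same shape."""
    pred = pred.clamp(eps, 1.0 - eps)
    p_t = torch.where(target == 1, pred, 1.0 - pred)
    alpha_t = torch.where(target == 1, torch.full_like(pred, alpha),
                          torch.full_like(pred, 1.0 - alpha))
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t)).mean()
```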

At present, many object detection models [36, 37] use the L1 and L2 norms to compute the regression loss. The L1 and L2 norms calculate the loss of the four coordinate variables of the prediction frame independently, as if the coordinates were unrelated, whereas in real situations the coordinate variables are correlated. When the model's performance is evaluated, IoU is used to decide whether an object is detected, so if L1 and L2 norm regression is used directly to compute the coordinate frame, the evaluation indexes are also affected. Yu et al. [38] proposed using IoU as the regression loss to compute the coordinate frame, which solves the above problems. However, if IoU is used directly as the boundary loss, then when the prediction frame and the ground-truth frame do not overlap, both the IoU and its gradient become 0, and the boundary loss cannot be optimized. Rezatofighi et al. [13] proposed the GIoU loss as a boundary loss. It retains the scale invariance of IoU as a loss function and adds a term about the distance between the two frames, which solves the zero-gradient problem when the prediction frame and the ground-truth frame do not overlap. GIoU is calculated as

$$\mathrm{GIoU} = \mathrm{IoU} - \frac{\left|C \setminus (A \cup B)\right|}{\left|C\right|},$$

where A and B are the prediction frame and the ground-truth frame, respectively, and C is the smallest closed frame containing both. As GIoU becomes larger, the GIoU loss becomes smaller, and the network is optimized so that the prediction frame and the ground-truth frame overlap closely. The boundary loss of YOLOv4 optimized by GIoU is

$$L_{\mathrm{GIoU}} = \sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\mathrm{obj}}\left(2 - w_{i}\times h_{i}\right)\left(1 - \mathrm{GIoU}\right),$$

where $w_{i}$ and $h_{i}$ represent the width and height of the boundary frame, respectively, and $\mathbb{1}_{ij}^{\mathrm{obj}}$ indicates the presence of an object inside the current boundary frame.
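The GIoU of (11) can be computed as in the following sketch for boxes in (x1, y1, x2, y2) format; the corresponding loss term is 1 - GIoU.

```python
# GIoU sketch: IoU minus the fraction of the smallest enclosing box C that is
# not covered by the union of A and B.
import torch

def giou(a, b, eps=1e-7):
    """a, b: (N, 4) boxes in (x1, y1, x2, y2) format."""
    inter_w = (torch.min(a[:, 2], b[:, 2]) - torch.max(a[:, 0], b[:, 0])).clamp(min=0)
    inter_h = (torch.min(a[:, 3], b[:, 3]) - torch.max(a[:, 1], b[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a + area_b - inter
    iou = inter / (union + eps)
    c_w = torch.max(a[:, 2], b[:, 2]) - torch.min(a[:, 0], b[:, 0])
    c_h = torch.max(a[:, 3], b[:, 3]) - torch.min(a[:, 1], b[:, 1])
    c_area = c_w * c_h
    return iou - (c_area - union) / (c_area + eps)
```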

In the Earf-YOLO model, the prediction results include the predicted category, the confidence, and the position of each prediction frame. Therefore, the loss function of the model in this paper is

$$L = \lambda_{\mathrm{coord}}\,L_{\mathrm{GIoU}} - \sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\mathrm{obj}}\log\hat{C}_{i}^{j} - \lambda_{\mathrm{noobj}}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\mathrm{noobj}}\log\left(1-\hat{C}_{i}^{j}\right) + \sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\max}\,\mathrm{FL}\left(p_{t}\right),$$

where $S^{2}$ denotes the number of grids and $B$ the number of prediction frames in each grid. $\mathbb{1}_{ij}^{\mathrm{obj}} = 1$ if the IoU between the j-th prediction frame and the ground-truth frame in the i-th grid is greater than the threshold and 0 otherwise; $\mathbb{1}_{ij}^{\mathrm{noobj}} = 1$ if that IoU is less than the threshold and 0 otherwise; $\mathbb{1}_{ij}^{\max} = 1$ if that IoU is the greatest and 0 otherwise. $\hat{C}_{i}^{j}$ is the confidence score of an object existing in the j-th prediction frame of the i-th grid. $\lambda_{\mathrm{coord}}$ and $\lambda_{\mathrm{noobj}}$ represent the penalty weights of the loss function.

3.5. Bag of Specials

Postprocessing is a method of screening the prediction results of a model and belongs to the bag of specials. It can significantly improve the model's prediction accuracy while adding only a small prediction overhead. The postprocessing step applies the NMS algorithm to the output to delete wrong prediction frames and find the most appropriate position for each prediction frame. The Hard-NMS algorithm sorts the prediction frames from high score to low score, selects the frame with the highest score, sets a threshold, deletes the prediction frames whose overlap rates with the highest-scored frame exceed the threshold, and repeats these steps on the remaining frames until none are left. When the overlap rate of two objects in the image is larger than the fixed threshold, Hard-NMS sets the score of the overlapping prediction frame to 0 and deletes it, which may leave low-scored objects undetected and lose accuracy.

The Soft-NMS [12] addresses, from a new perspective, the problem that Hard-NMS mistakenly deletes prediction frames when two objects overlap. As formula (14) indicates, Soft-NMS does not delete low-scored prediction frames directly; it lowers their scores further and then applies a threshold to remove them. Soft-NMS can also use the Gaussian weighting function in formula (15) to multiply the score of the current prediction frame by a weight. This function attenuates the scores of adjacent prediction frames that overlap the highest-scored prediction frame $M$: the more a prediction frame overlaps the highest-scored one, the more severely its score is attenuated. The two decay rules are

$$s_{i} = \begin{cases} s_{i}, & \mathrm{IoU}(M, b_{i}) < N_{t},\\ s_{i}\left(1 - \mathrm{IoU}(M, b_{i})\right), & \mathrm{IoU}(M, b_{i}) \geq N_{t},\end{cases}$$

$$s_{i} = s_{i}\,e^{-\frac{\mathrm{IoU}(M, b_{i})^{2}}{\sigma}},$$

where $s_{i}$ is the score of the i-th prediction frame $b_{i}$, $N_{t}$ is the threshold, and $M$ is the prediction frame with the highest score.
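A sketch of Soft-NMS with the Gaussian decay of formula (15) is shown below; σ and the score threshold are illustrative defaults rather than the settings used in the paper.

```python
# Soft-NMS sketch: repeatedly pick the highest-scored box M, decay the scores of
# its overlapping neighbours with a Gaussian weight, and drop boxes whose scores
# fall below a small threshold.
import torch

def pairwise_iou(a, b, eps=1e-7):
    """IoU between every box in a (N, 4) and every box in b (M, 4)."""
    tl = torch.max(a[:, None, :2], b[None, :, :2])
    br = torch.min(a[:, None, 2:], b[None, :, 2:])
    inter = (br - tl).clamp(min=0).prod(dim=-1)
    area_a = (a[:, 2:] - a[:, :2]).prod(dim=-1)
    area_b = (b[:, 2:] - b[:, :2]).prod(dim=-1)
    return inter / (area_a[:, None] + area_b[None, :] - inter + eps)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    scores = scores.clone()
    keep = []
    idxs = torch.arange(len(scores))
    while len(idxs) > 0:
        m = idxs[scores[idxs].argmax()]        # highest-scored remaining box M
        keep.append(m.item())
        idxs = idxs[idxs != m]
        if len(idxs) == 0:
            break
        ious = pairwise_iou(boxes[m].unsqueeze(0), boxes[idxs]).squeeze(0)
        scores[idxs] *= torch.exp(-(ious ** 2) / sigma)   # decay, do not delete
        idxs = idxs[scores[idxs] > score_thresh]
    return keep
```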

4. Experimental Results and Analysis

In this section, we introduce the experimental datasets and the parameters of the experimental settings, and then verify the performance of Earf-YOLO in experiments.

4.1. Zhuang Pattern Symbol Datasets

The datasets used in the experiment are symbols of Zhuang patterns. The Zhuang people have incorporated their wisdom and culture into Zhuang patterns, usually reflecting their yearning for a better life. For example, the delicate and beautiful flowers on Zhuang patterns are believed to represent natural beauty and colorful life; the birds on Zhuang patterns can arouse people’s longing for a happy life, as birds usually lead happy and free lives in the forest.

So far, there is no specific dataset composed of Zhuang pattern symbols. The datasets used in this research are images taken by the researchers in the Zhuang tribes. The datasets contain 19,199 images of Zhuang patterns, divided into 20 classes. To ensure fairness when the model is trained, we try to balance the number of images in each class. We selected 10,592 images as training samples and 8,607 images as testing samples. The sample distribution is shown in Figure 7.

4.2. Experimental Settings

The experiments are conducted with Python 3.6 and Keras 2.3.1 on a GTX 2070 (8 GB) GPU under Windows 10. The number of training iterations is 500, and the input image size is fixed at 416 × 416. The optimizer is Adam. The learning rate follows a cosine annealing decay schedule, with an initial learning rate of 0.001, a maximum of 0.01, and a minimum of 0.0001. For the first 400 iterations, the first 170 convolution layers of the network are frozen and only the remaining convolution layers are trained; for the last 100 iterations, all convolution layers are unfrozen and trained. Average precision (AP), frames per second (FPS), and Param are used as evaluation indexes. AP is the average precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05; AP50 and AP75 are the average precision at IoU thresholds of 0.5 and 0.75, respectively; and APS, APM, and APL are the average precision for small, medium, and large objects, respectively. The larger the AP, the better the detection effect; the larger the FPS, the higher the detection efficiency; and the smaller the Param, the lower the network memory consumption.

4.3. Zhuang Nationality Pattern Symbols Contrast Experiment and Result Visualization

In this section, the suggested Earf-YOLO is evaluated on the datasets of Zhuang pattern symbols. To make the comparison on the Zhuang pattern symbol datasets straightforward, we compare the improved Earf-YOLO model with the latest one-stage and two-stage models, using ResNet101 and CSPDarkNet53 as backbones, respectively, as shown in Table 1. Table 1 indicates that the AP of Earf-YOLO reaches 39.1% with ResNet101 and 41.0% with CSPDarkNet53. Figure 2 and Table 1 show that the Earf-YOLO model achieves the best results in both speed and accuracy compared with other models.

In addition, on the testing set of Zhuang pattern symbols, an experiment was conducted between Earf-YOLO (with CSPDarkNet53 as the backbone) and the original YOLOv4 model to compare classification accuracy; the results, shown in Figure 8, demonstrate that the average classification accuracy of Earf-YOLO is higher than that of YOLOv4. Besides, for some small and complex pattern symbols, such as the Zhuang two lions pattern, the Zhuang copper coin pattern, and the Zhuang bird pattern, the average classification accuracy of Earf-YOLO remains high.

Meanwhile, some images are randomly selected from the datasets of Zhuang pattern symbols for visualization. This paper selects four pairs of representative detection results for comparison. Figure 9(a) shows the visualized result of YOLOv4, and Figure 9(b) shows the visualized result of Earf-YOLO (with CSPDarknet as the backbone). The visualized results suggest that Earf-YOLO is more accurate in detecting complex and small pattern symbol frames.

4.4. Contrast Experiments on PASCAL VOC Dataset

In the previous section, the evaluation indexes of the Earf-YOLO model were obtained on the Zhuang pattern symbol dataset, which alone does not prove the general efficiency of the model. Therefore, we conduct experiments on the public PASCAL VOC2007 and VOC2012 datasets to further verify the model's efficiency. The VOC dataset contains 20 categories, and each image is annotated with the ground-truth positions and the corresponding category information. On PASCAL VOC2007 and VOC2012, we compare the proposed model with other advanced object detection models. The experimental results are shown in Table 2. Compared with YOLOv1, YOLOv2, YOLOv3, and YOLOv4, the AP of Earf-YOLO increases by 25.3%, 13.1%, 12.4%, and 2.8%, respectively. Compared with Faster RCNN, RefineDet512, and R-FCN-3000, the AP of Earf-YOLO increases by 15.3%, 11.6%, and 11.2%, respectively. Compared with DES512, DSSD, and ASSD, the AP of Earf-YOLO increases by 11.4%, 10.2%, and 8.7%, respectively. Table 2 shows that Earf-YOLO performs best, which illustrates the general efficiency of Earf-YOLO on other datasets.

4.5. Ablation Experiments

All ablation experiments in this section are first conducted on the Zhuang pattern symbol dataset. The experimental results are compared with the baseline, the YOLOv4 algorithm, with the backbone network of CSPDarkNet53. Finally, we integrate GLocalT and SRFB, the main contribution points of this article, into YOLOv3, YOLOv4, and YOLOv5 for comparative experiments on PASCAL VOC2007 and VOC2012.

The performance of the global-local-transformer (GLocalT) and the strength receptive field block (SRFB) is analyzed below.

4.5.1. Baseline + GLocalT

As shown in line 2 of Table 3, compared with the baseline model, the FPS of YOLOv4 with GLocalT decreases by 2 and its Param increases by 2.093 M, but its AP increases by 1.6%, demonstrating that YOLOv4 with GLocalT can better extract the features of complex and small pattern symbols.

4.5.2. Baseline + SRFB

As illustrated in line 3 of Table 3, compared with the baseline model, the AP of YOLOv4 with SRFB increases by 0.7%, its Param decreases by 6.148 M, and its FPS increases by 10, which proves that replacing redundant convolution with SRFB can improve recognition accuracy, significantly reduce computational overhead, and improve computational efficiency.

4.5.3. Baseline + SRFB + GLocalT

As illustrated in line 4 of Table 3, the AP, Param, and FPS of YOLOv4 with SRFB and GLocalT reach 40.1%, 57.623 M, and 27, respectively. The results demonstrate that YOLOv4 with SRFB and GLocalT achieves the highest performance, as the combination addresses the difficulty of thoroughly extracting and fusing multilayer features while keeping the computational overhead low.

Then the detection results of YOLOv4 integrated with individual techniques are compared, as shown in Table 4. When Soft-NMS is involved, AP increases by 0.2% compared with YOLOv4, because when two object frames are close to each other, Soft-NMS does not directly delete the prediction frame that heavily overlaps the highest-scored frame but lowers its score instead. When focal loss is involved, AP increases by 0.6% compared with YOLOv4, as focal loss alleviates the imbalance between positive and negative samples and between easy and hard samples. When the GIoU loss is used as the boundary loss, AP reaches 38.7%. Finally, the model achieves the highest AP when the bag of freebies and the bag of specials are combined.

To prove the general efficiency of our main contributions, we incorporate GLocalT and SRFB into YOLOv3, YOLOv4, and YOLOv5 on the VOC dataset. Table 5 shows that the AP values of YOLOv3 and YOLOv5 increase when GLocalT and SRFB are added, and that YOLOv4 with GLocalT and SRFB reaches the highest AP, indicating that the optimization based on YOLOv4 can handle more complex scenes.

5. Conclusions

Since present object detection models cannot fully extract features at different stages and their computational overhead is too high when recognizing Zhuang pattern symbols, an object detection model, Earf-YOLO, is proposed in this paper. Specifically, we first propose the global-local-transformer structure. This structure uses a gating axial attention layer so that the model better aggregates features along the height, width, and channel axes and becomes more sensitive to positional information, and it uses the global-local training strategy to help the model capture the global dependence between features and reduce local information loss. We then design the strength receptive field block (SRFB), which uses dilated convolutions in multiscale branches to enhance the model's feature-extraction ability and fuses the convolution branches to reduce inference time. Finally, we incorporate some advanced techniques to optimize the model. The structures and techniques mentioned above effectively address the problems YOLOv4 faces and improve its detection performance in recognizing Zhuang pattern symbols. However, Earf-YOLO has not yet been applied to two-stage object detection models, and its detection effect on a wider range of datasets has not been discussed. Therefore, further improvement of the proposed structures and techniques will be the future focus, making them applicable to two-stage object detection models and to various datasets.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest with this study.

Acknowledgments

This work is supported by the General Project of Guangxi Natural Science Foundation (2019GXNSFAA245053), the Guangxi Science and Technology Major Project (AA19254016), the National Natural Science Foundation of China (61862018), the Guangxi Natural Science Foundation Project (2018GXNSFAA138084), the Beihai City Science and Technology Planning Project (202082033), the Beihai City Science and Technology Planning Project (202082023), the Translation and Introduction of Guangxi Marine Culture under the Strategy of Maritime Power (2021KY0184), and the Guangxi Graduate Student Innovation Project (YCSW2021174). The authors would like to thank Peng Xie from Southwest Jiaotong University and Jiaqi Xu from the University of Science and Technology of China for their insights and feedback on the first draft of this article.