Since the advent of deep learning, artificial intelligence has entered a period of rapid development, producing one achievement after another. When deep learning is applied to fruit target detection, complex recognition backgrounds, high similarity between targets, severe texture interference, and partial occlusion of fruits mean that detection rates based on traditional methods are low. To solve these problems, a BCo-YOLOv5 network model is proposed to recognize and detect fruit targets in orchards. We use YOLOv5s as the basic model for image feature extraction and target detection. This paper introduces a bidirectional cross attention mechanism (BCAM) into the network, adding it between the backbone network and the neck network of the YOLOv5s base model. BCAM uses a weight multiplication strategy and a maximum weight strategy to build deeper positional feature relationships, which better assist the network in detecting fruit targets in images. After training and testing, the mAP of the BCo-YOLOv5 network model reaches 97.70%. To verify the detection ability of BCo-YOLOv5 on citrus, apple, grape, and other fruit targets, we conducted extensive experiments with the network. The experimental results show that this method can effectively detect citrus, apple, and grape targets in fruit images, and that fruit target detection based on the BCo-YOLOv5 network outperforms most orchard fruit detection methods.

1. Introduction

In recent years, the development of artificial intelligence has provided new processing methods and advanced computer vision technology for various fields [1–3], and new directions in agricultural technology are constantly being explored. With the popularization of intelligent agriculture, agricultural automation is developing rapidly. Among these developments, robots for picking citrus, apples, grapes, and other fruits, together with fruit disease detection, have become research hotspots [4, 5]. Rapid and accurate identification and detection of fruit targets are of great significance for automatic fruit picking, fruit yield prediction, and intelligent management of the orchard industry. The recognition and detection of fruit targets are the basic guarantee of the effectiveness of a picking robot's operation. Therefore, effectively identifying citrus fruit in the natural environment and accurately locating its spatial position is the key to improving the availability of an automatic citrus picking system.

In recent years, many solutions based on traditional machine learning have been proposed for fruit target detection [6]. Xiong et al. [7] used the K-means clustering algorithm to segment citrus fruit and determined the fruit position from the segmentation result. However, the segmentation effect of this method is not ideal when the environment is complex; traditional machine vision algorithms often lack robustness, and the localization of fruit targets in complex scenes is often inaccurate. More recently, convolutional neural networks have gradually been adopted in agricultural detection and have shown great algorithmic advantages. Landmark algorithms include R-CNN [8], Faster R-CNN [9], SSD [10], and YOLO [11]. Jia et al. [12] proposed a Mask R-CNN network based on mask regions to identify and segment overlapping apples, using a fully convolutional network (FCN) to generate masks for locating the apples. Hu et al. [13] proposed PDAM–STPNET, a forest smoke target detection network based on a spatial-domain attention mechanism with YOLOXL as the basic model. Hu et al. [14] used Fast R-CNN to extract tomato features, identify and locate candidate mature tomato areas in the complex greenhouse environment, and judge the maturity of tomato fruit. Xue et al. [15] improved YOLOv2 [16] and proposed a Tiny-YOLO network with dense blocks to detect and identify immature mangoes. Tian et al. [17] improved the YOLOv3 network and used the resulting YOLOv3-dense network to detect images of immature, expanding, and mature apples. Their experiments showed that the YOLOv3-dense model can effectively detect apple fruit targets in different states and handle overlapping apple targets to a certain extent.

To address the problem of low recognition accuracy, many scholars use attention mechanisms with different characteristics to improve network recognition accuracy. Chen et al. [18] added a two-channel residual attention network model (B-ARNet) to a tomato leaf disease identification network to identify disease spots in images of diseased tomatoes. Applied to 8616 tomato images, the network achieved an overall detection accuracy of about 89% after adding the B-ARNet attention mechanism, demonstrating that B-ARNet played a positive role in tomato disease identification.

However, in the complex orchard environment there are intricate spatial relationships between fruits and between fruits and the branches and leaves of the tree, resulting in random occlusion among fruits, branches, and leaves. At the same time, lighting uncertainty caused by changes in time and weather also alters the appearance of the fruit. To overcome the slow detection speed and strict detection-condition requirements of other models, this study improves the lightweight YOLOv5s network model and integrates the BCAM attention module to enhance the network's ability to extract image features and improve the accuracy of fruit target detection. This paper therefore proposes a BCo-YOLOv5 network model to solve the above problems and achieve better recognition and detection of target fruits.

The contributions of this paper are as follows.

We use YOLOv5s as the basic model for image feature extraction and target detection. BCAM is added between the backbone network and the neck network of the YOLOv5s base model. BCAM uses the weight multiplication strategy and the maximum weight strategy to build deeper positional feature relationships and improve the network's detection accuracy for citrus, apple, grape, and other fruit targets. Focusing on feature responses in different directions of the image reduces feature redundancy and enhances the network's feature learning across dimensions.

We trained our BCo-YOLOv5 network on public data sets and verified it on a self-made data set. The verification results show that the detection accuracy of the BCo-YOLOv5 network model reaches 97.70%. The proposed BCo-YOLOv5 target detection network for fruit targets in complex environments effectively improves on the detection accuracy of the YOLOv5s base network and can also effectively detect occluded fruits and small fruit targets.

2. Materials and Methods

2.1. Data Acquisition

The data set used in the experiment comes from public data set websites and orchard collection. The public portion includes apple and citrus images downloaded from the COCO data set [19]: 254 tagged apple images and 282 tagged citrus images were carefully selected from the COCO website, and 266 labeled grape images were selected from the Winegrape [20] public data set.

The other part was collected in cooperation with Central South University of Forestry Science and Technology. The image data were gathered from the economic forest fruit production, study, and research base jointly built by Central South University of Forestry Science and Technology and the Changsha Forestry Bureau. The camera model is a Canon EOS R, with an image resolution of 2400 × 1600 pixels. We used the camera to take optical images of different kinds of orchard crops against the complex background of the orchard; such photos reflect the many complex situations of fruit growth and ensure that the collected images are representative. In total, 155 apple, 123 orange, and 149 grape images were collected in the real scene. Because the fruit targets in these images were unlabeled, we marked the fruits in each image with a labeling tool. A total of 1229 labeled images were obtained for the initial data set. Because network training requires a large amount of data, we expanded the labeled data set in four ways: rotation, flipping, random cropping, and brightness transformation. A total of 4916 labeled images were finally obtained. The annotated database is used in the subsequent training and testing of the network. Table 1 shows our data sources and quantities.
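The four expansion operations above can be sketched directly on raw image arrays with NumPy; the specific parameter ranges (crop fraction, brightness factors) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def augment(image, rng):
    """Apply the four expansion operations to one H x W x 3 uint8 image,
    returning a list of augmented copies. Parameter ranges are illustrative."""
    out = []
    # Rotation: 90-degree multiples keep bounding boxes easy to remap.
    out.append(np.rot90(image, k=rng.integers(1, 4)))
    # Flipping: horizontal mirror.
    out.append(image[:, ::-1, :])
    # Random cropping: take a window covering 80% of each dimension.
    h, w = image.shape[:2]
    ch, cw = int(h * 0.8), int(w * 0.8)
    y0 = rng.integers(0, h - ch + 1)
    x0 = rng.integers(0, w - cw + 1)
    out.append(image[y0:y0 + ch, x0:x0 + cw, :])
    # Brightness transformation: scale intensities and clip back to [0, 255].
    factor = rng.uniform(0.7, 1.3)
    out.append(np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8))
    return out
```

Applying all four operations to each of the 1229 labeled images yields the roughly fourfold expansion reported above (bounding-box labels must be remapped alongside each geometric transform).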

2.2. BCo-YOLOv5 Network

This paper improves on YOLOv5s to address the low detection accuracy that arises when fruit is covered by leaves or the fruit target is too small. First, we fuse BCAM after the backbone and before the neck's feature fusion; BCAM is used to reconstruct attention between the backbone and the neck. Finally, the improved BCo-YOLOv5 network is trained, validated, and tested on the self-made data set.

2.2.1. YOLOv5s

As shown in Figure 1, the structure of YOLOv5s is divided into four parts: the input end, backbone network, neck network, and head output end. The adaptive anchor box calculation module can adapt to different data sets and automatically set the size of the initial anchor boxes. The backbone extracts features at different levels from the image through deep convolution. The neck network layer includes a feature pyramid network (FPN) [21] and a path aggregation structure (PAN). The head output is the final detection part of YOLOv5s and predicts targets of different sizes. The backbone and neck parts are introduced below.

2.2.2. YOLOv5s Network Convergence BCAM

Attention mechanisms [22–24] are a data processing method in machine learning and are now added to various classical image classification networks and object detection models. Their most intuitive purpose is to increase the network's focus on key feature targets and reduce its focus on low-sensitivity areas, filtering out image features that are not needed. The attention mechanism originates from the way the human brain processes observed image signals: when people observe and identify a target, its prominent parts often become the object of attention while global and background information is ignored. The attention mechanism adopts this observation strategy and applies it to machine learning, opening up a wider range of improvement strategies. Many scholars have refined attention mechanisms to better achieve classification, recognition, and detection of objects in images.

Therefore, in order to more completely identify and detect fruit targets in a given image, we introduce a new attention mechanism, the bidirectional cross attention mechanism (BCAM), into the YOLOv5s base network. We fuse BCAM after the backbone and before the neck's feature fusion; using it to reconstruct the attention of the backbone and the neck serves as a connecting link between the two. In the bidirectional cross sensing module, the first two layers of the BCAM model first convolve the whole image to effectively extract and mine its shallow features. BCAM then assigns horizontal and vertical weighted attention coefficients to each feature. Next, BCAM uses the weight multiplication strategy and the maximum matching strategy to expand the horizontal and vertical weight coefficients allocated in the previous step. Finally, BCAM generates deeper image features through a convolution layer and average pooling. In summary, by expanding the relationships between feature weight coefficients and structural features, BCAM helps the YOLOv5s base network detect and locate fruit targets in the overall image. It is also very helpful for improving the fine-grained detection of the network and raising the target detection rate of BCo-YOLOv5 on citrus, grape, and apple fruits.
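As a rough illustration of how such a module might be wired into a PyTorch network, the sketch below follows the steps described above (shallow convolutions, per-direction attention weights, weight multiplication, and maximum-weight fusion). This is an assumption-laden reconstruction, not the authors' implementation: the layer sizes, pooling choices, and tensor layout are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCAMSketch(nn.Module):
    """Illustrative sketch of a bidirectional cross attention module
    (hypothetical reconstruction, not the paper's exact BCAM). It pools the
    feature map along the horizontal and vertical directions, derives
    softmax-normalized attention weights per direction, then fuses them with
    weight-multiplication and maximum-weight strategies."""

    def __init__(self, channels, alpha=0.3):
        super().__init__()
        self.alpha = alpha  # mixing coefficient for the maximum-weight strategy
        # "The first two layers convolve the whole image": two 3x3 convs.
        self.stem = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        f = self.stem(x)
        # Horizontal / vertical attention coefficients, softmax-regularized.
        w_h = F.softmax(f.mean(dim=2, keepdim=True), dim=3)  # (B, C, 1, W)
        w_v = F.softmax(f.mean(dim=3, keepdim=True), dim=2)  # (B, C, H, 1)
        # Weight multiplication strategy: jointly small weights shrink fastest.
        w_mul = w_h * w_v                                    # broadcasts to (B, C, H, W)
        # Maximum weight strategy: keep the dominant direction, add back
        # alpha times the weaker one to recover small features.
        w_max = torch.maximum(w_h, w_v) + self.alpha * torch.minimum(w_h, w_v)
        # Fuse the two secondary weights and re-weight the input features.
        return x * (w_mul + w_max)
```

In this sketch the module preserves the input shape, so it can be dropped between backbone and neck stages without changing the surrounding architecture.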

BCAM is a three-step process. First, BCAM generates attention weights for the image in the horizontal and vertical directions. Second, BCAM amplifies the obtained first-level weights to obtain second-level weights. Finally, the two levels of weights obtained in the above steps are fused, matched, and supplemented to obtain the final fused weight coefficients. Figure 2 is the algorithm block diagram of BCAM. The specific steps are as follows.

First, adjacent attention in the horizontal and vertical directions is extracted to generate a bidirectional first-order image weight feature. When each primary image weight feature is generated, the set of feature vectors in the image is T, and a corresponding attention coefficient is obtained for each pixel. The weight that each pixel i in each direction assigns to pixel j in the feature sequence is then obtained. Finally, the softmax [25] function is introduced to regularize the attention coefficients of the pixels in each direction.
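The softmax regularization in the last step can be shown directly: raw coefficients along one direction (illustrative values below) are mapped to positive weights that sum to 1, so each pixel's attention over the others in that direction forms a proper distribution.

```python
import torch
import torch.nn.functional as F

# Raw attention coefficients for one row of pixels (illustrative values).
raw = torch.tensor([[2.0, 1.0, 0.5, -1.0]])

# Softmax regularizes them into positive weights summing to 1 along the row,
# so larger raw coefficients keep proportionally larger attention weights.
weights = F.softmax(raw, dim=1)
```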

Each resulting value is a weight coefficient assigned to a pixel by the attention mechanism. The deep features extracted using the attention weights produced by BCAM not only effectively reflect the interaction between pixels; the directional features obtained by dimension reduction also have high symmetry, which makes the network more conducive to extracting complete and effective features from citrus, apple, grape, and other fruit images.

Second, the horizontal and vertical weight features are multiplied to obtain the first secondary weight, following the weight multiplication strategy in equation (2). This strategy can mine deep feature information: the weight coefficient is extended by multiplication with a minimal penalty, further amplifying the influence of the weight coefficients. Since each coefficient is less than 1, the smaller the coefficients, the smaller the product; this further reduces the weight of small coefficients and suppresses obscure features. Furthermore, we use the horizontal and vertical weight features to obtain the second secondary image weight through the maximum weight strategy, shown in equation (3). The maximum feature is treated as the valid feature, and the minimum feature, scaled by α, is added to it to supplement missing small features. The value of α ranges from 0 to 1; experiments show that the maximum matching strategy works best when α is 0.3. In the maximum matching strategy the maximum value is the main factor, but the minimum feature is still combined to obtain a comprehensive feature.
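A worked scalar example of the two strategies above, with illustrative coefficient values (only the α = 0.3 setting comes from the text):

```python
def weight_multiplication(w_h, w_v):
    """Equation (2), scalar form: both coefficients are below 1, so the
    product shrinks fastest when either direction assigns a small weight."""
    return w_h * w_v

def maximum_weight(w_h, w_v, alpha=0.3):
    """Equation (3), scalar form: keep the dominant direction and add back
    a fraction alpha of the weaker one so small features are not discarded."""
    return max(w_h, w_v) + alpha * min(w_h, w_v)

# Illustrative horizontal/vertical weights for one pixel:
w2 = weight_multiplication(0.8, 0.5)   # 0.8 * 0.5, smaller than either input
w3 = maximum_weight(0.8, 0.5)          # 0.8 + 0.3 * 0.5
```

The multiplication strategy suppresses jointly weak responses, while the maximum strategy preserves whichever direction responded strongly; fusing both gives BCAM its complementary behavior.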

Finally, the vertical feature is fused and matched with the horizontal weight to obtain the maximum value, and the image weight coefficients in the different directions are fused with the integrated feature information through the concatenate function in BCAM.

3. Results

To analyze the effectiveness of the BCo-YOLOv5 network in fruit target detection, we designed experiments comparing the effectiveness of different models. However, as there are no clear standards or descriptions of the specific code and data splits of the other models, we reproduced their models independently and ran comparative experiments on the data sets we collected. In the comparative experiments, the test sets of the different models are completely consistent. This part covers the experimental environment, the experimental settings, and the comparative experiments between the BCo-YOLOv5 network and other models.

3.1. Experimental Environment

The BCo-YOLOv5 network proposed in this paper was edited, compiled, and run on Google Colaboratory. The programming environment is Python 3.7 and PyTorch. The hardware environment of the simulation experiment is a Google Cloud GPU and Windows 10 (64-bit), with 32 GB of system memory.

3.2. Experimental Setting

The self-made dataset used in this paper contains three kinds of fruit: citrus, apple, and grape. We validate the detection efficiency of our network on the three fruit datasets. The size of the input image is 224 × 224, which improves the efficiency of image processing and reduces the time needed for training and classification. In network training, the selection of hyperparameters is difficult and time-consuming; the hyperparameters of the BCo-YOLOv5 network are shown in Table 2. The Adam [26] optimizer is used. The batch size is set to 64, the momentum parameter to 0.9, and the number of epochs to 200, with 200 iterations per epoch and one validation pass every 1000 iterations. The weight attenuation value is . The initial learning rate of the first 50 epochs is set to 0.001, and that of the last 10 epochs is set to 0.005 to improve the fitting speed.
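Under the stated settings, the optimizer and learning-rate schedule might be configured as sketched below. The model and the weight-decay value are placeholders (the text omits the latter), and the rate between epochs 50 and 190 is an assumption, since only the first 50 and last 10 epochs are specified.

```python
import torch

# Stand-in model: the real network would be BCo-YOLOv5.
model = torch.nn.Linear(10, 2)

# Adam optimizer per Table 2; beta1 = 0.9 matches the momentum parameter.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                 # initial learning rate for the early epochs
    betas=(0.9, 0.999),
    weight_decay=5e-4,       # placeholder: the text does not give this value
)

def learning_rate(epoch, total_epochs=200):
    """Piecewise schedule described above: 0.001 early, raised to 0.005
    for the last 10 epochs to speed up fitting (middle epochs assumed 0.001)."""
    return 5e-3 if epoch >= total_epochs - 10 else 1e-3
```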

3.3. Experiments and Analysis of Results
3.3.1. Self-Comparison Test

This paper selects recall, precision, and F1 score as the evaluation indicators to verify the performance of the model. The F1 score combines precision and recall, and its calculation formula is as follows:

In equation (5), P represents the target detection precision of the network (here, the recognition accuracy for citrus, apple, grape, and other fruit targets) and R represents the recall of the network. Precision refers to the average of the correct detection rates over the fruit target detection samples. TP in equation (5) denotes cases where the predicted value is positive and the actual value is positive. From the calculation formula, the F1 value is the harmonic mean of the model's precision and recall; its upper limit is 1 and its lower limit is 0. The higher the F1 value of the BCo-YOLOv5 network, the better the performance of the BCo-YOLOv5 model.
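The relationship between P, R, and F1 can be sketched from the detection counts; the example counts below are illustrative, not from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision P, recall R, and F1 score from counts of true positives,
    false positives, and false negatives, per the formulas above."""
    p = tp / (tp + fp)           # fraction of detections that are correct
    r = tp / (tp + fn)           # fraction of real fruits that were detected
    f1 = 2 * p * r / (p + r)     # harmonic mean of precision and recall
    return p, r, f1

# Example: 90 fruits detected correctly, 10 false detections, 10 misses.
p, r, f1 = precision_recall_f1(90, 10, 10)
```

Because F1 is a harmonic mean, it is dragged toward whichever of P and R is lower, which is why it is a stricter summary than the arithmetic average.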

This paper compares YOLOv5s with the improved BCo-YOLOv5 network on recall, precision, and F1 score.

Table 3 compares the recall, precision, and F1 score of the YOLOv5s model (the original network before adding the BCAM module) and the BCo-YOLOv5 network after adding BCAM. The comparative results in Table 3 show that, compared with the original YOLOv5s network, the detection precision and recall of the BCo-YOLOv5 network increase by 6.83% and 6.67%, respectively, and the F1 value increases by 6.77%. The mAP of the improved BCo-YOLOv5 model increases by 7.39% to 97.70%. These data show that the improved BCo-YOLOv5 model based on YOLOv5s reaches the expected level of the experiment and surpasses the base network before improvement, proving that adding the BCAM attention mechanism can effectively improve fruit target detection.

3.3.2. Comparison Experiment between BCo-YOLOv5 Model and Other Networks

To compare the detection performance of the BCo-YOLOv5 model with existing target detection models, we compare BCo-YOLOv5 with R-CNN, Fast R-CNN, and Faster R-CNN, and with the YOLO-series models YOLOv3, YOLOv5s, and YOLOv5l. The numerical results of each model's evaluation indices are recorded in Table 4.

Comparing the BCo-YOLOv5 model with the other models, the mAP of the proposed BCo-YOLOv5 model is generally higher than that of the other networks. Compared with the R-CNN, Fast R-CNN, YOLOv3, Faster R-CNN, and YOLOv5 [27] networks, the mAP of our BCo-YOLOv5 model is higher by 18.39%, 15.16%, 10.56%, 7.67%, and 3.07%, respectively. This shows that the fruit target detection ability of the BCo-YOLOv5 model constructed in this paper is higher than that of other common detection networks. The BCo-YOLOv5 model is smaller than YOLOv5l, and with BCAM added, its recognition accuracy and recognition speed are higher than those of YOLOv5l. The BCo-YOLOv5 model is larger than YOLOv5s: its structure is more complex, its feature extraction ability is stronger, and its recognition accuracy is correspondingly higher. This verifies the value of the BCo-YOLOv5 model among current neural network models.

3.3.3. Visualization of Partial Test Results

Part of the test results on the self-built dataset are visualized in Figure 3. Because the data set contains occluded and densely clustered fruit targets, a network is prone to missed and false detections on such targets. According to the visualized results, the detection effect of the BCo-YOLOv5 network model is good, with no missed or false detections. This effectively demonstrates that BCo-YOLOv5 can solve the missed and false detections caused by occluded and overly dense fruit targets, showing the advantages of the BCo-YOLOv5 network.

4. Conclusion

The BCo-YOLOv5 network model proposed in this paper uses YOLOv5s as the basic model for feature extraction and target detection. The BCAM attention mechanism is added between the backbone of the YOLOv5s model and the neck network to enhance the extraction of local correlation features and directional features, allowing the network to attend to the fruit regions that remain visible after occlusion and avoid missing occluded targets. The comparative experimental results between the BCo-YOLOv5 network and other networks show that this method effectively improves the accuracy of fruit target detection and outperforms the other networks. Furthermore, the generalization experiments on public data sets show that the BCo-YOLOv5 network not only detects targets well on our self-made data sets but also achieves an ideal detection effect on other kinds of items in the COCO data set, proving that the network has good generalization ability and good development prospects.

Data Availability

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to partial authors’ disagreement.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Authors’ Contributions

Ruoli Yang and Yaowen Hu contributed equally to this work. Ruoli Yang proposed the methodology, wrote the original draft, conceptualized the study, was responsible for data curation, provided software, was responsible for data acquisition, investigated the study, and provided model guidance. Ye Yao validated the study and was responsible for project administration. Runmin Liu was responsible for formal analysis. Ming Gao visualized the study and reviewed and edited the manuscript.

Acknowledgments

The authors are grateful to all members of the Food College of Central South University of Forestry and Technology for their advice and assistance in the course of this research. This work was supported by the Natural Science Foundation of Hunan Province (Grant no. 2021JJ31164).