Abstract

Accurate identification of apple leaf diseases is of great significance for improving apple yield. The lesion area in apple leaf disease images is small and vulnerable to background interference, which easily leads to low recognition accuracy. To solve this problem, a lightweight bilinear convolutional neural network (CNN) model based on an attention mechanism, named BLSENet, is designed. The model consists of two subnetworks, each embedded with a Squeeze-and-Excitation (SE) module. The features extracted by the two subnetworks are combined through a bilinear feature CONCAT operation to obtain multiscale features of the image. BLSENet achieves an accuracy of 93.58% on the test set, compared with 84.63% for the unimproved LeNet-5, which indicates that the SE module and bilinear feature fusion have a positive effect on model performance and that BLSENet is able to identify apple leaf diseases. The model achieves the expected goal and can provide technical support for accurate identification and real-time monitoring of apple disease images.

1. Introduction

Apples are often attacked by diseases during growth [1]. Accurate identification of disease types, together with timely prevention and control, is essential to improving apple yield [2–4]. At present, plant disease recognition has become an important research direction in image recognition and intelligent agriculture.

Traditional machine learning algorithms classify images only after features have been extracted [5]. The feature extraction process is time-consuming and labor-intensive, and the resulting classification models have weak generalization ability and poor recognition performance [6]. Zhang et al. proposed an apple leaf disease identification method based on image processing and pattern recognition. The RGB model was transformed into HSI, YUV, and grayscale, and the background was removed using a specific threshold. Using a region growing algorithm (RGA), a genetic algorithm (GA), correlation-based feature selection (CFS), and a support vector machine (SVM), the approach achieved over 90% accuracy in recognizing various apple leaf diseases [7]. Bracino et al. proposed a machine learning model that detects and classifies the three most common apple diseases. Color and texture features were extracted and selected from single apple leaf images. A comparison of KNN, ANN, and GPR showed that the GPR model with the ARD squared kernel function performed best [8]. Khan et al. employed contrast enhancement and a strong correlation-based segmentation method to segment images, optimizing the segmentation results through expectation maximization (EM). They used GA to extract features from the fused images and achieved significant classification accuracy using One-vs-All M-SVM [9]. Al-bayati et al. proposed a method for detecting apple leaf diseases using a deep neural network (DNN). They employed Speeded Up Robust Features (SURF) for feature extraction and the Grasshopper Optimization Algorithm (GOA) for feature optimization, and achieved good results [10]. However, feature selection in these methods relies heavily on human experience and carries great uncertainty, and specific data preprocessing is required to obtain good experimental results [11]. Moreover, because traditional feature extraction and recognition methods are not end-to-end, they are not conducive to rapid real-time detection in practical applications.

In recent years, because deep learning can extract disease features automatically and avoids manual dependence, a series of research results have been achieved in crop disease recognition. Rohini et al. proposed a CNN-based model to classify apple leaf images as diseased or healthy, which is a binary classification problem. The CNN was built from combinations of convolution, ReLU, and max-pooling layers, and it achieved an accuracy of 91.11% on the dataset considered [12]. Singh et al. used three pretrained CNN models to identify diseases in the Beans Leaf image dataset and applied different optimization techniques to highlight the performance differences between the models. EfficientNetB6 outperformed the other models, with an accuracy of 91.74% [13]. Kumar et al. proposed a transfer learning strategy that uses pretrained VGG-16/VGG-19 networks to estimate the severity of tomato leaf disease. The authors also tuned the hyperparameters of the pretrained CNN models to improve their effectiveness. To evaluate the fine-tuned models, the study measured accuracy and loss over multiple iterations on the training and validation datasets. Compared with another CNN model evaluated on the same dataset, VGG-16 showed higher classification accuracy (92.46%) [14]. Ding et al. proposed a new apple leaf disease recognition model named RFCA ResNet. The model has a dual attention mechanism and multiscale feature extraction ability, which reduce the adverse effects of complex backgrounds on recognition results. In addition, combining a class balance technique with focal loss effectively reduces the adverse effects of imbalanced datasets on classification accuracy. 
The RFB module expands the receptive field and realizes multiscale feature extraction. RFCA ResNet reaches an accuracy of 89.61%, outperforms the other methods it was compared with, and generalizes well, which gives it theoretical significance and practical value [15]. Gaikwad et al. used CNNs to classify leaf diseases. The authors collected a dataset in a real-time environment, with a total of 14181 images and 10 class labels. The experiment used three versions of the dataset: color, black and white, and grey images. AlexNet and SqueezeNet were trained on these datasets with the same hyperparameters. The recognition accuracy of the two models was nearly the same, with classification accuracies on color images of 86.8% and 86.6%, respectively, indicating that color images are effective for classification [16]. Researchers have experimented with a variety of deep learning networks and frameworks, and as research has deepened, deep learning has become the preferred choice for classifying and identifying apple leaf diseases [17].

The literature above shows that the diversity and complexity of the shape and color of diseases pose a challenge for high-precision disease identification. Although existing research covers many methods, including traditional feature extraction and deep learning techniques, the considerable variability of diseases has a notable impact on recognition accuracy. Because of this diversity, existing models may fail to capture and distinguish different disease features under certain circumstances, which limits their practical applicability. To address this challenge, we focus on the multiscale extraction of disease features, incorporating methods such as multiscale feature fusion and more sophisticated deep learning architectures. 
These approaches aim to enhance the robustness of disease recognition systems by comprehensively capturing the complex characteristics of diseases. Therefore, this paper proposes a new CNN model, which can provide technical support for accurate identification and real-time monitoring of apple disease images.

In this paper, a bilinear classification model named BLSENet, based on an attention mechanism and a feature fusion strategy, is proposed for the classification of apple leaf diseases. The remainder of this article is organized as follows. Section 2 presents the apple leaf disease dataset and introduces the proposed network model BLSENet. Section 3 describes and analyzes the experimental results and verifies the feasibility of the proposed model by adjusting the model parameters and conducting ablation experiments. Finally, the advantages and disadvantages of the proposed model are analyzed, and future research directions are determined on this basis.

2. Methodology

2.1. Dataset

The dataset was collected from the College of Artificial Intelligence, Southwest University (as shown in Figure 1) [18–21]. The collected images cover diverse diseases and were captured by skilled professionals using high-resolution cameras under appropriate lighting conditions to ensure image quality and clarity. After collection, the images underwent initial screening, and samples with representative disease features were retained. To ensure precise and consistent labeling, each image was annotated with its disease type by plant pathology specialists. The dataset contains nine classes of apple leaf images: Health, Alternaria leaf spot, Brown spot, Frogeye leaf spot, Grey spot, Mosaic, Powdery mildew, Rust, and Scab.

The number of images used for the experiment is shown in Table 1. The dataset contains a total of 14582 images. Of these, 8754 images (about 60% of the dataset) were randomly selected as the training set, 2913 images (about 20%) as the validation set, and the remaining 2915 images (about 20%) as the test set, as shown in Table 2.
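
The random partition described above can be sketched as follows. This is a minimal illustration that uses only the split sizes reported in the text; the function name `split_indices` and the fixed seed are assumptions introduced here for reproducibility and are not taken from the original experiments.

```python
import random


def split_indices(n_total, n_train, n_val, seed=0):
    """Randomly partition range(n_total) into train/validation/test index lists.

    The seed is a hypothetical choice made only so the sketch is reproducible.
    """
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # everything left over
    return train, val, test


# Split sizes reported in the text: 8754 / 2913 / 2915 out of 14582 images.
train_idx, val_idx, test_idx = split_indices(14582, 8754, 2913)
print(len(train_idx), len(val_idx), len(test_idx))  # 8754 2913 2915
```

Because the three index lists are slices of one shuffled list, no image can appear in more than one subset.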

2.2. Model of Deep Convolutional Neural Network with Improvements
2.2.1. Multiscale Information Fusion Strategy

BLSENet is a bilinear CNN model. The bilinear CNN is a recent technique in fine-grained image recognition [22], and it classifies well on categories that are hard to distinguish because their visual differences are subtle [23]. The structure of BLSENet is shown in Figure 2. The input image passes through multiple convolution [24], pooling [25], and batch normalization (BatchNormal) [26] operations in two improved LeNet-5 CNNs, which yields two sets of image features. These features are then combined with the CONCAT operation to form the bilinear feature vector of the image [27]. Finally, a fully connected layer classifier maps this feature vector to the probability of each category.

LeNet, also known as LeNet-5, is a classical CNN proposed by LeCun [28] and one of the origins of modern CNNs. It has an input layer, two convolutional layers, two pooling layers, and three fully connected layers [29]. BLSENet uses two improved versions of LeNet-5. In the first, named the A model, two fully connected layers are removed and replaced with SE modules. In the second, named the B model, a BatchNormal layer is additionally placed after the first convolutional layer of the A model. The B model serves as the upper branch network and the A model as the lower branch network, and the two branches output two feature vectors, FC11 and FC21, each with a dimension of 1 × 120. Cascading FC11 and FC21 gives the vector FC31, with a size of 1 × 240. FC31 is then reduced in dimension to a vector FC32 with a dimension of 1 × 50. Finally, the output size of the last fully connected layer is set to 9, the number of leaf categories.
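
The dimension bookkeeping of the fusion head (120 + 120 → 240 → 50 → 9) can be traced with a small sketch. It is written in plain Python rather than PyTorch so that it runs without dependencies. The random branch outputs, the weight initialization, and the helper names `linear` and `init_linear` are illustrative assumptions, and any activation between FC32 and the output layer is omitted because the text does not specify one.

```python
import random


def linear(x, weight, bias):
    """Fully connected layer y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(n_in, n_out, rng):
    """Hypothetical small random initialization, for illustration only."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


rng = random.Random(0)

# Stand-ins for the two branch outputs (each 1 x 120 in the text).
fc11 = [rng.random() for _ in range(120)]
fc21 = [rng.random() for _ in range(120)]

# CONCAT: cascading the two vectors gives FC31 (1 x 240).
fc31 = fc11 + fc21

# Dimension reduction to FC32 (1 x 50), then the 9-way output layer.
w_reduce, b_reduce = init_linear(240, 50, rng)
fc32 = linear(fc31, w_reduce, b_reduce)
w_out, b_out = init_linear(50, 9, rng)
logits = linear(fc32, w_out, b_out)

print(len(fc31), len(fc32), len(logits))  # 240 50 9
```

Note that this fusion is a concatenation of the two branch vectors, so the fused dimension grows linearly (240) rather than as an outer product of the two feature vectors.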

2.2.2. Attention Mechanism Based on SE Module

The SE (Squeeze-and-Excitation) module is a computing unit that recalibrates the weights of feature channels [30]. In prior work, it has been used to adaptively enhance the feature channels carrying contrast information in infrared images while suppressing irrelevant feature channels [31]. In this network, the SE module performs its computation in two stages: the first stage is Squeeze and the second stage is Excitation.

Figure 3 shows the structure of the SE module. The aim is to strengthen the learning of convolutional features by modeling the interdependence between channels, so that the network becomes more sensitive to informative features that subsequent transformations can exploit. The squeeze step gives each channel access to global information, the excitation step reweights the channels accordingly, and the recalibrated features are passed to the next transformation, which further improves the accuracy of the network. In recent years, SE modules have been widely used in deep learning, and network architectures in many research fields embed them to improve the performance of the original network [32–35]. The module is simple and easy to embed into a CNN framework, and it yields better results with only a small increase in computational complexity.
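
The two stages can be illustrated with a dependency-free sketch. It follows the standard SE formulation (global average pooling, then a bottleneck of two fully connected layers with ReLU and sigmoid), which is assumed here because the text does not give the exact configuration used in BLSENet. The feature-map size (16 × 5 × 5), the reduction ratio of 4, and the random weights are hypothetical values chosen only for the example.

```python
import math
import random


def squeeze(feature_map):
    """Squeeze: global average pooling, one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def dense(x, weight):
    """Bias-free fully connected layer, W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def excite(z, w_reduce, w_expand):
    """Excitation: FC -> ReLU -> FC -> sigmoid, giving one weight per channel."""
    hidden = [max(0.0, h) for h in dense(z, w_reduce)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w_expand)]


def se_block(feature_map, w_reduce, w_expand):
    """Rescale every channel of the feature map by its learned weight."""
    scales = excite(squeeze(feature_map), w_reduce, w_expand)
    scaled = [[[v * s for v in row] for row in ch] for ch, s in zip(feature_map, scales)]
    return scaled, scales


rng = random.Random(0)
channels, height, width, reduction = 16, 5, 5, 4  # illustrative sizes

x = [[[rng.random() for _ in range(width)] for _ in range(height)] for _ in range(channels)]
w_reduce = [[rng.uniform(-0.5, 0.5) for _ in range(channels)]
            for _ in range(channels // reduction)]
w_expand = [[rng.uniform(-0.5, 0.5) for _ in range(channels // reduction)]
            for _ in range(channels)]

y, scales = se_block(x, w_reduce, w_expand)
print(len(y), len(y[0]), len(y[0][0]))  # 16 5 5
```

The output has the same shape as the input, which is why the module can be inserted into an existing CNN without changing the surrounding layers; only the relative magnitude of the channels changes.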

2.3. Model Training Details

The CNN model proposed in this paper is implemented in PyTorch, an open-source deep learning library. The experiments were carried out on a workstation equipped with an Intel(R) Core(TM) i9-10980XE CPU @ 3.00 GHz and a 24 GB NVIDIA GeForce RTX 3090 GPU. The experimental environment is shown in Table 3.

In the experiment, we set epoch = 200, batch size = 16, and initial learning rate = 0.0001 based on experience, and trained the network with Adam as the optimizer and cross-entropy as the loss function. The full set of training parameters is shown in Table 4.
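
For reference, the settings above and the cross-entropy loss can be written out as a short sketch. The dictionary `TRAIN_CONFIG` and the function `cross_entropy` are illustrative names introduced here, and the loss is computed in plain Python for a single sample. For class-index targets, this is the same quantity that PyTorch's `torch.nn.CrossEntropyLoss` averages over a batch by default.

```python
import math

# Hyperparameters as stated in the text (the key names are hypothetical).
TRAIN_CONFIG = {
    "epochs": 200,
    "batch_size": 16,
    "learning_rate": 1e-4,
    "optimizer": "Adam",
    "loss": "cross-entropy",
}


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample; target is the true class index."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


# With 9 classes and uninformative (all-equal) logits, the loss equals ln(9).
uniform_loss = cross_entropy([0.0] * 9, 3)
# A confident, correct prediction gives a loss close to zero.
confident_loss = cross_entropy([10.0] + [0.0] * 8, 0)
print(round(uniform_loss, 4))
```

The value ln(9) ≈ 2.197 is a useful sanity check: it is the loss of a 9-class classifier that has not yet learned anything, so a training loss well below it indicates that learning is taking place.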

3. Experimental Results and Analysis

3.1. Training Results

The training curves of the BLSENet network are shown in Figure 4. Accuracy is defined as the proportion of samples that the model classifies correctly among all predictions. Loss measures the difference between the model's predictions and the actual labels during training, and training aims to minimize this difference. As Figure 4 shows, the experiment ran for a total of 200 epochs. The final accuracy on the test set is 93.58%, which indicates a good training effect.

On the apple leaf disease dataset, the relationship between the training loss and the number of epochs is shown in Figure 4(a). The loss on the training set decreases as the number of epochs increases, from about 1.26 to about 0.34. The relationship between accuracy and the number of epochs is shown in Figure 4(b). As the number of epochs increases, the accuracy on the validation set gradually increases. Training tends to stabilize after about 100 epochs, at an accuracy of about 93%.

As shown in Figure 5, a confusion matrix was built from the results of the BLSENet model on the test dataset. The model recognizes Brown spot, Frogeye leaf spot, Powdery mildew, Rust, and Scab well, with accuracies of 95%, 90%, 94%, 93%, and 95%, respectively. For Frogeye leaf spot, Powdery mildew, Rust, and Scab, this may be because the large number of images gives the model sufficient training, resulting in a higher recognition rate. Brown spot, in contrast, has relatively few images, comparable in number to Health, Alternaria leaf spot, Grey spot, and Mosaic, yet it still reaches high accuracy. This may be because lesions with a relatively large area are less affected by the background and are therefore easier for the model to recognize correctly.
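
The per-class figures quoted above can be read off a confusion matrix as in the following sketch. It assumes that rows index the true class and columns the predicted class, and that the per-class accuracy is the row-normalized diagonal entry. Figure 5 is not reproduced here, so this convention is an assumption, and the three-class toy labels are invented purely for illustration.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count matrix with rows = true class, columns = predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy example with 3 classes (not data from the paper).
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2, 2, 2, 1]

cm = confusion_matrix(y_true, y_pred, 3)
print(cm)                    # [[1, 1, 0], [1, 2, 0], [0, 1, 4]]
print(overall_accuracy(cm))  # 0.7
```

In this toy example, class 2 has the most samples and the highest per-class accuracy (0.8), which mirrors the kind of relationship between class size and recognition rate discussed above.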

3.2. Comparison and Analysis of Experimental Results

After the model was established, epoch parameters were set to 100, 200, and 300 to select the appropriate value. The training results are shown in Figure 6. As can be seen from the figure, as epoch increases, both loss and accuracy show better results on the training set, but there is no significant difference between the three values. Then, the three parameter values were tested on the test dataset, and their accuracies were 90.22%, 93.58%, and 93.48%, respectively, as shown in Table 5. It can be observed that epoch = 200 has the best performance on the test dataset, and it does not consume a lot of training time. Therefore, considering the accuracy of the model based on the above analysis, the value of 200 was selected as the epoch.

To select the appropriate batch size parameter, batch size was set to 8, 16, 32, and 64. The training results are shown in Figure 7. As the batch size decreases, the network has better training results on the training set. The training results for batch size = 8 and batch size = 16 are similar. On the test dataset, their accuracies were 92.97%, 93.58%, 91.63%, and 90.57%, respectively, as shown in Table 6. It can be observed that batch size = 16 has the best performance on the test dataset. Therefore, considering the accuracy of the model based on the above analysis, the value of 16 was selected as the batch size.

To select the appropriate learning rate, four learning rates of 0.01, 0.001, 0.0001, and 0.00001 were compared. The training results are shown in Figure 8. The training effect with a learning rate of 0.00001 is the worst because the learning rate is too small, which slows down training. With a learning rate of 0.01, the accuracy curve in Figure 8(b) is very unstable, which may be because the learning rate is too large, making it difficult to find appropriate weight parameters. The training results with learning rates of 0.001 and 0.0001 are similar. The four learning rates were then evaluated on the test dataset, and the experimental results are shown in Table 7. Their accuracies were 91.60%, 91.34%, 93.58%, and 92.08%, respectively. Therefore, a learning rate of 0.0001 was selected based on the above analysis.
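
The three one-at-a-time sweeps (epochs, batch size, and learning rate) can be summarized in a short sketch that picks the setting with the highest reported test accuracy. The accuracies are the values quoted in the text, paired with the settings in the order in which they are listed. The names `sweeps` and `best_setting` are introduced here for illustration.

```python
# Test accuracies (%) reported in the text for each one-at-a-time sweep.
sweeps = {
    "epochs": {100: 90.22, 200: 93.58, 300: 93.48},
    "batch_size": {8: 92.97, 16: 93.58, 32: 91.63, 64: 90.57},
    "learning_rate": {0.01: 91.60, 0.001: 91.34, 0.0001: 93.58, 0.00001: 92.08},
}


def best_setting(results):
    """Return the setting whose reported accuracy is highest."""
    return max(results, key=results.get)


selected = {name: best_setting(results) for name, results in sweeps.items()}
print(selected)  # {'epochs': 200, 'batch_size': 16, 'learning_rate': 0.0001}
```

The selected values coincide with the training configuration used for the final model (epoch = 200, batch size = 16, learning rate = 0.0001).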

The optimization algorithm is very important for the performance of the model. The SGD [36], AdaGrad [37], and Adam [38] optimization algorithms were used to train the BLSENet in this paper, and their convergence speeds were compared. Figure 9 shows the training results of these three optimization algorithms. From the figure, the loss values using SGD and AdaGrad converge around 1.1 and the convergence effect is relatively poor as epoch increases. The accuracy values converge around 60%, which does not achieve the target accuracy. The results indicate that the model using the Adam algorithm has the fastest convergence speed and the best recognition effect. Then, the three optimizers were tested on the test dataset, and the experimental results are shown in Table 8. Therefore, Adam was selected as the optimization algorithm based on the above analysis.

BLSENet is an improved model based on LeNet-5. To verify the gain over the original LeNet-5, the two models were compared on the test dataset. As Table 9 shows, BLSENet recognizes better than LeNet-5, with accuracies of 93.58% and 84.63%, respectively. Combining the bilinear LeNet-5 structure with the SE module therefore raises the accuracy of LeNet-5 by 8.95 percentage points and improves the performance of the model.

In addition, we conducted ablation experiments to analyze BLSENet. As mentioned earlier, BLSENet is derived from LeNet-5 and incorporates the SE module. We therefore compared five configurations: LeNet-5; LeNet-5 and LeNet-5; LeNet-5 + SE and LeNet-5; LeNet-5 and LeNet-5 + SE; and Double LeNet-5 + SE (BLSENet). The training results are shown in Figure 10, and the training curves all reach similar results as the number of epochs increases. To test generalization ability, the models were compared on the same test dataset, and the results are shown in Table 9. The single LeNet-5 performs poorly (84.63%), whereas LeNet-5 and LeNet-5, LeNet-5 and LeNet-5 + SE, and Double LeNet-5 + SE (BLSENet) perform similarly, with accuracies between 92% and 94%. The combination of LeNet-5 + SE and LeNet-5 has the lowest recognition rate, 81.54%. This is an interesting phenomenon, and we will continue to investigate it in the future.

Finally, Table 10 lists the accuracy achieved by the proposed model in our experiments and compares it with models from the relevant literature. The proposed model achieves the highest accuracy in the table, 93.58%, outperforming the models from the other literature. This result highlights the advantage of the proposed model and supports further work in this research field.

4. Conclusion

In this paper, we proposed an apple leaf disease recognition method called BLSENet based on attention mechanism, lightweight CNN, and bilinear CNN framework. By embedding the SE module into the end of LeNet-5 and combining it with bilinear pooling, BLSENet was constructed to extract image features of apple leaf diseases. BLSENet has higher accuracy in the test dataset compared with the unimproved model LeNet-5 (84.63%), which indicates that the SE module and bilinear feature fusion have a positive effect on the performance of the model and BLSENet has the ability to recognize apple leaf diseases accurately. The model has achieved the expected goal, which can provide technical support for accurate identification and real-time monitoring of apple disease images. In our future work, we will continue to focus on deep learning models capable of assessing the severity of apple leaf diseases. Simultaneously, we aim to deploy the model on devices such as unmanned aerial vehicles (UAVs) to achieve precise remote sensing for agricultural monitoring. This is a challenging task but is a pressing demand in the field.

Data Availability

The datasets, codes, and weight files used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) under Contract no. 31570712.