Abstract

Monitoring the wheat growth process by machine vision with handcrafted feature extraction suffers from poor objectivity and low efficiency. To solve this problem, this paper introduces deep learning into monitoring the wheat growth process. The convolutional neural network (CNN), a common deep learning algorithm, has been widely used in image classification, and deep feature extraction networks can automatically recognize and extract image features. However, the large number of parameters and the computational overhead of regular deep CNNs make it difficult to deploy these models on embedded devices with limited storage space and computing power. Therefore, the knowledge distillation method is applied to the feature extraction network to improve the performance of shallow feature extraction networks and maintain recognition accuracy while reducing the computational burden and model size. This paper took ResNet50 and VGG16 as two different teacher networks to guide the training of the student model MobileNet. The experimental results showed that when ResNet50 was used as the teacher model, MobileNet achieved the best recognition effect. The average recognition accuracy of the student model MobileNet was 97.3%, and the model size was compressed to only 19.71 MB, 88.9% smaller than that of ResNet50. Knowledge distillation improves the accuracy of the obtained model, reduces the number of network parameters, and greatly shortens the model running time and deployment cost. The application of the knowledge distillation method could provide further technical support for intelligent wheat production in the field.

1. Introduction

Wheat is the second largest food crop in the world, and about 35%–40% of the world's population rely on wheat as their main food [1]. In China, the wheat planting area accounts for more than 70% of the arable land, and the winter wheat yield in Henan Province accounts for more than 25% of the total wheat yield in China. A stable and high yield of winter wheat in Henan Province is therefore crucial to ensuring national food security. As one of the main elements of crop growth information, growth period information is widely used in automatic crop monitoring, for example, to schedule farming activities in specific growth stages. At present, crop growth period information is mainly acquired by manual observation. However, manual observation is time-consuming, laborious, prone to errors, and may damage the crops. Using an intelligent method to monitor the growth period of wheat can greatly improve the accuracy and timeliness of the observations and reduce the input cost. Therefore, intelligent technology is urgently needed to solve the problem of monitoring the wheat growth process.

In recent years, machine learning methods have been extensively used in classification and regression problems and have gradually been applied to the extraction of crop growth period information. However, these studies mainly focus on corn [2, 3], rice [4], cotton [5], and other crops whose images are relatively easy to collect and process. Research on the wheat growth period is still at a preliminary stage. Zhang Mingwei et al. monitored the growth of winter wheat at the green-turning stage and heading stage based on MODIS-EVI time-series data [6]. Huang Qing et al. monitored the growth of winter wheat using a MODIS-NDVI difference model [7]. Both methods require constructing a large number of characteristic standards for each crop type, so the workload is heavy and the generalization ability is insufficient.

As a popular pattern analysis method in recent years, deep learning has been widely used in voice recognition [8], face recognition [9], image classification [10], behavior analysis [11, 12], and other industrial fields. It has also been gradually applied to the agricultural field [13, 14]. Some researchers have applied deep learning to the detection of corn [15], fruits [16, 17], vegetables [18, 19], and other crops. Using image processing technology, Wu Qian realized automatic observation of three cotton growth periods [20]. However, this method only targets fixed cotton planting areas and fixed cotton varieties; applying it to other planting areas or varieties leads to errors. Yu Zhenghong used computer vision technology for automatic observation of the corn growth period [21]. Specifically, the seedling stage of corn was detected by spatial uniformity. Meanwhile, a probability model of the number of seedlings and plants in each connected domain, the number of endpoints, and the probability of reaching the three-leaf stage was established to detect the three-leaf stage. Then, coverage and canopy height were used to determine whether the jointing stage had been reached. Finally, the tasseling stage of maize was detected by identifying the male tassels through spatio-temporal saliency. Based on image processing methods, Wu Lanlan studied the classification of corn and weeds at the seedling stage in fields with complex backgrounds using a support vector machine and a BP neural network [22], providing technical support for real-time weed identification systems in the field. Although the performance of modern deep neural networks has improved significantly, feature structures must still be extracted from huge and redundant data during model training. In general, real-time requirements are not considered, and the number of parameters in the finally trained model is large. Due to limited computing resources and latency constraints, deep neural networks are difficult to apply in practice.

To address the above issues, different methods of model simplification have been proposed in recent years. Hinton et al. were the first to propose the knowledge distillation method [23], whose core idea is to use a conventional deep convolutional network as a teacher model to guide a simplified student network. Based on this method, this study uses the conventional ResNet50 [24] and VGG16 models as teacher models to guide the training of the simplified MobileNet model. The trained model is then applied for the first time to monitor the wheat growth period. Through knowledge distillation, the accuracy of the MobileNet model is improved, the number of model parameters is reduced, the running time of the model is greatly shortened, and the deployment cost of the model is decreased.

2. Data and Methods

2.1. Data Material

The experiments of this study were conducted in the experimental area of the Xuchang Campus of Henan Agricultural University, Changge City, Henan Province (113°58′26″E, 34°12′06″N). The experimental area belongs to the continental monsoon climate of the north temperate zone, with an average annual temperature of 14.3 °C, annual rainfall of 711.1 mm, and a frost-free period of 217 days.

Due to the complex field environment, it is usually necessary to fix the camera equipment in the field and shoot at the same distance. Image collection was performed from October 2019 to June 2020. Three wheat varieties, i.e., “Yumai 49,” “Zhoumai 27,” and “Xinong 509,” were selected as experimental objects. These three varieties maintain a stable planting area in eastern and central Henan, are all semi-winter types, and are therefore reasonably representative of wheat production in Henan Province.

In the image collection process, plots with planting densities of 300 (about 300 plants per square meter) and 350 plants per square meter and nitrogen applications of 15 kg and 0 kg of pure nitrogen per mu were selected. Images were taken every other day between 8 AM and 3 PM using a camera (Nikon D3100; sensor: CMOS; maximum aperture: f/5.6; 14.2 megapixels; maximum resolution: 4608 × 3072). A tripod was used for fixed-height shooting, and all images were collected under natural light conditions.

The effective accumulated temperature of winter wheat at the Xuchang station was adopted as the reference for labeling the samples.

From the obtained image data, a total of 12750 images of wheat in five growth stages were screened, namely, the emergence-tillering stage (the first day of emergence to the day before tillering), tillering-overwintering stage (the first day of tillering to the day before overwintering), overwintering-greening stage (the first day of overwintering to the day before greening), greening-jointing stage (the first day of greening to the day before jointing), and jointing-heading stage (the first day of jointing to the day before heading). Some of the collected images are shown in Figure 1.

The images of the five wheat growth stages were then preprocessed. The resolution of the growth period images collected by the camera was 4608 × 3072, and each image was about 16 MB. Directly feeding such images into the model would place a great burden on training. Therefore, an image normalization operation was first carried out on the images of the five growth periods by adjusting the image resolution.

All the images of the wheat growth period were collected against complex backgrounds, and a large amount of field background information unrelated to the wheat plants surrounded the perimeter of each collected image. Therefore, the irrelevant information was removed by center cropping, and all images were then uniformly scaled to a size of 224 × 224 × 3, completing the size normalization of the dataset.
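The following is a minimal sketch of this preprocessing step (center cropping followed by resizing to 224 × 224), assuming the Pillow library; the crop fraction and function name are illustrative assumptions rather than the authors' exact settings.

```python
from PIL import Image

def preprocess_wheat_image(path: str, crop_fraction: float = 0.8) -> Image.Image:
    """Center-crop the field border and resize to the network input size."""
    img = Image.open(path)
    w, h = img.size
    cw, ch = int(w * crop_fraction), int(h * crop_fraction)
    left, top = (w - cw) // 2, (h - ch) // 2
    cropped = img.crop((left, top, left + cw, top + ch))  # center clipping
    return cropped.resize((224, 224))                     # size normalization
```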

As the depth of a model increases, the number of parameters also increases. If the dataset is not large enough, the model will overfit: it achieves high accuracy on the training set but performs poorly on the test set, and the generalization ability of the whole model becomes very weak. To prevent overfitting, this study increased the amount of image data. In general, image translation, image scaling, image cropping, and other methods can be used to enlarge a dataset. In this paper, image rotation was selected to enlarge the dataset of the five wheat growth stages: each image was rotated by 90°, 180°, and 270° to provide wheat image data from different angles.
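A minimal sketch of this rotation-based augmentation is given below, assuming Pillow; the directory layout and function name are illustrative.

```python
from pathlib import Path
from PIL import Image

def augment_with_rotations(src_dir: str, dst_dir: str) -> None:
    """Rotate each wheat image by 90, 180, and 270 degrees and save the copies."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.jpg"):
        img = Image.open(path)
        for angle in (90, 180, 270):
            rotated = img.rotate(angle, expand=True)
            rotated.save(out / f"{path.stem}_rot{angle}{path.suffix}")
```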

In this way, the dataset of five wheat growth stage images was expanded threefold. Finally, the five classes of wheat canopy image data were expanded to a total of 38,250 balanced samples, including 16,050 images of “Yumai 49,” 15,450 images of “Zhoumai 27,” and 6,750 images of “Xinong 509.” In each wheat growth stage, 20% of the samples were randomly selected to form the test set. All the comparative experiments in this paper were completed on this dataset. The data sample distribution is shown in Table 1.
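As an illustration, a per-stage 20% hold-out of this kind could be produced with a stratified split as sketched below, assuming scikit-learn; the placeholder lists stand in for the real image paths and growth-stage labels.

```python
from sklearn.model_selection import train_test_split

# Illustrative placeholders: one entry per image and its growth-stage label (0-4)
paths = [f"wheat_{i:05d}.jpg" for i in range(38250)]
stages = [i % 5 for i in range(38250)]

# Hold out 20% of the samples within each growth stage as the test set
train_paths, test_paths, train_y, test_y = train_test_split(
    paths, stages, test_size=0.2, stratify=stages, random_state=42
)
```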

2.2. Research Method

Knowledge distillation was first applied to image recognition and classification tasks. The core idea is to integrate the knowledge and capabilities extracted by multiple models into a single model. It is generally believed that the parameters of a model retain the knowledge learned during training. The commonly used transfer learning approach is to pretrain on a large dataset and then fine-tune the pretrained parameters for the task requirements and the target data. Knowledge can thus be regarded as a mapping relationship between the input and the output. In this study, a teacher network was trained first. Then, the output result q of the teacher network was used to train the student network, whose output result p should be as close to q as possible. The corresponding loss function is defined below.
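Written in the standard Hinton-style form [23], with α denoting the ratio that weights the soft term against the hard term (see the distillation steps later in this section), the loss is assumed to be

$$ L = \alpha \, CE(q, p) + (1 - \alpha) \, CE(y, p). $$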

In the function, CE represents cross-entropy, y is the one-hot encoding of real labels, q is the output result of the teacher network, and p is the output result of the student network.

As conventional large convolutional neural networks (CNNs), both the VGG16 and ResNet50 models achieve good classification results in many classification experiments. However, because these two models have an excessive number of parameters, running the standard models consumes huge resources and places too heavy a burden on the operating equipment. To solve the problem that such models have too many parameters and cannot run on devices with low computing power, MobileNet adopts depthwise separable convolution to greatly reduce the model size [25]. In this paper, standard VGG16 and ResNet50 models were used as teacher networks, and MobileNet was used as the student model to learn from the teacher networks and improve its identification accuracy for the various growth stages of wheat.
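The building block behind this parameter saving is sketched below in Keras; the layer arrangement follows the standard MobileNet v1 design, and the channel count and stride arguments are illustrative.

```python
import tensorflow as tf

def depthwise_separable_block(x, filters: int, stride: int = 1):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    # Depthwise convolution: one 3x3 filter per input channel
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    # Pointwise convolution: recombines channels with far fewer parameters
    # than a full 3x3 convolution over all channels
    x = tf.keras.layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)
```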

In the knowledge distillation method, Hinton et al. first proposed using generated soft (fuzzy) labels to assist training with the normal labels [26]. Soft labels are the softened outputs of the teacher network. The softening is performed by adding a temperature parameter T to the Softmax output function of the teacher network. By enlarging the probability values of the other categories in the soft label, the parameter T allows the student network to learn information more comprehensively. The softened output function q_i is given by the following formula.
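In the standard temperature-scaled Softmax form of Hinton et al. [26], this is

$$ q_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}. $$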

In the formula, z_i is the logit of class i, i.e., the output of the convolutional neural network before the Softmax function, and it serves as the learning object of the student network. The parameter T is needed because the Softmax function produces a sharp distribution: the probability values of the incorrect categories become extremely low after the Softmax output, so the student network would learn little information from the teacher network during knowledge transfer. A complex network can achieve a good classification effect, and for a target image the probability of the correct output is much higher than that of the errors; a smaller network cannot achieve this on its own. Therefore, in this study, the parameter T was also added to the Softmax of the small network. When T is set to a larger value, the output values of the incorrect classes increase after passing through Softmax, and the probability distribution of the output becomes smoother, which is equivalent to a smoothing effect and preserves the similarity information between classes. When the small network is trained with the same T value, the target output has a more uniform distribution. If T is then reset to 1, the classification probability of the small network comes closer to the high recognition accuracy of the large network.
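The smoothing effect of T can be illustrated numerically, as sketched below; the logits are made-up values for the five growth stages.

```python
import numpy as np

def softmax_with_temperature(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-scaled Softmax: larger T yields a smoother distribution."""
    e = np.exp(z / T)
    return e / e.sum()

logits = np.array([8.0, 2.0, 1.0, 0.5, 0.2])    # hypothetical logits for 5 stages
print(softmax_with_temperature(logits, T=1.0))  # sharp: nearly all mass on one class
print(softmax_with_temperature(logits, T=5.0))  # softened: other classes keep visible probability
```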

Knowledge distillation can transfer knowledge from one network to another, and the two networks may have the same or different architectures. In particular, knowledge distillation can compress a large network into a small network while retaining its performance. The specific process is illustrated in Figure 2.

The basic steps of knowledge distillation are as follows (a code sketch of the combined loss is given after this list):

(1) A large amount of data is used to train a large model with good performance, and this model is taken as the teacher network.

(2) The hyperparameter T is set, and the student network is trained under the guidance of the pretrained teacher network. First, L(soft) is obtained as the cross-entropy between the softened predictions of the student network and the softened predictions of the teacher network. Then, L(hard) is obtained as the cross-entropy between the unsoftened predictions of the student network and the normal labels. Finally, the two cross-entropy terms are combined according to the ratio α to form the final loss function. In this way, the teacher network and the student network can be trained together.

(3) In the separate prediction stage, the trained student network classifies samples using the normal Softmax function.
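A minimal sketch of this combined loss is given below, assuming Keras/TensorFlow 2.x; the variable names, the values of T and α, and the T² scaling factor (taken from Hinton et al. [23]) are illustrative assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf

T = 5.0       # softening temperature (hyperparameter T)
alpha = 0.5   # ratio between the soft and hard loss terms

def distillation_loss(y_true, student_logits, teacher_logits):
    """Combined loss: alpha * L(soft) + (1 - alpha) * L(hard)."""
    # L(soft): cross-entropy between softened teacher and softened student outputs
    soft_teacher = tf.nn.softmax(teacher_logits / T)
    soft_student = tf.nn.softmax(student_logits / T)
    l_soft = tf.keras.losses.categorical_crossentropy(soft_teacher, soft_student)

    # L(hard): cross-entropy between the one-hot labels and the unsoftened student output
    l_hard = tf.keras.losses.categorical_crossentropy(y_true, tf.nn.softmax(student_logits))

    # T**2 rescales the soft term so its gradients stay comparable to the hard term
    return alpha * (T ** 2) * l_soft + (1.0 - alpha) * l_hard
```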

3. Experiment and Result Analysis

The experiments were conducted on a personal computer equipped with an Intel i7-8750H CPU, 16 GB of main memory, an NVIDIA GTX 1060 Ti GPU, and the Windows 10 Professional operating system. The code was written in Python using the Keras package.

3.1. Distillation Model Training Results

To ensure that the whole experimental process and results were reasonable and effective, all the hyperparameters were unified across experiments. The batch size was set to 64 samples, and the L2 regularization term was set to 0.0005. To avoid overfitting, the learning rate was set to 0.001.
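A minimal sketch of this shared configuration is shown below for the MobileNet student, assuming Keras/TensorFlow 2.x; the optimizer choice and the decision to attach the L2 term only to the classification head are illustrative assumptions.

```python
import tensorflow as tf

def build_student(num_classes: int = 5) -> tf.keras.Model:
    """MobileNet over 224 x 224 x 3 wheat canopy images with the shared hyperparameters."""
    inputs = tf.keras.Input(shape=(224, 224, 3))
    backbone = tf.keras.applications.MobileNet(
        include_top=False, weights=None, input_tensor=inputs, pooling="avg"
    )
    outputs = tf.keras.layers.Dense(
        num_classes,
        activation="softmax",
        kernel_regularizer=tf.keras.regularizers.l2(5e-4),  # L2 term 0.0005
    )(backbone.output)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # learning rate 0.001
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# model = build_student()
# model.fit(x_train, y_train, batch_size=64, validation_data=(x_val, y_val), epochs=...)
```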

To select the best teacher and student models, the experiment compared three mainstream deep convolutional neural networks, i.e., VGG16, MobileNet, and ResNet50, in terms of growth stage identification accuracy, model running time, and the number of model parameters on 224 × 224 three-channel RGB images. Figure 3 shows the accuracy variations of these three models on the validation sets of the five wheat growth stages during the iterative training process.

It can be seen from Figure 3 that, during the iteration process, the accuracy curves of the three models on the wheat growth stage validation set showed different characteristics.

Specifically, the accuracy of all three models showed obvious jitter during the iterations. For the lightweight MobileNet model, the accuracy on the wheat growth stage validation set was relatively low early in training; as training proceeded, the overall accuracy gradually increased.

Table 2 compares the accuracy of the three models on the validation sets. It can be seen from the table that the identification accuracy of all three models was high for the five wheat growth stages. In particular, the classification accuracy for the emergence-tillering stage reached 96.8%, and even for the greening-jointing stage it reached 89.9%, indicating that the models were effective. Analysis of the misjudged samples showed that misjudgments mainly occurred in the late period of one growth stage and the early period of the next, mainly because the phenological characteristics are not obvious at those times.

To verify the influence of teacher models with different network depths on the accuracy of the student model, the experiment used VGG16 and ResNet50 as teacher models to guide the training of the student model MobileNet. The recognition results before and after knowledge distillation with the two teacher models are shown in Table 3.

Comparing the experimental results before and after knowledge distillation, the recognition accuracy for each single wheat growth stage and the average recognition accuracy over the five stages were both improved after distillation with VGG16 and with ResNet50. Among them, the accuracy of the seedling stage was the highest, increasing by 1.5% compared with that before knowledge distillation. The average accuracy over the five wheat growth stages was highest with the ResNet50 teacher model, with an improvement of 1.2%. With both VGG16 and ResNet50 as teachers, the accuracy of identifying wheat growth stages after knowledge distillation was slightly higher. Although the improvement was not large, it was achieved on top of the already high accuracy of the two models before knowledge distillation. The experimental results show that knowledge distillation with VGG16 and ResNet50 was effective: by learning knowledge from the VGG16 and ResNet50 models, MobileNet gained some of the information it was missing, improving its accuracy to the point of exceeding that of the VGG16 and ResNet50 teacher models.

It can be seen from the experimental results that the recognition result of the student model was related to the network depth of the teacher model. When ResNet50 was used as the teacher network, the identification accuracy of the student model MobileNet was the highest, indicating that the greater the network depth of the teacher model, the greater its influence on the student model and the higher the accuracy of the final identification result. With ResNet50 as the teacher network, the recognition accuracy of the student model MobileNet reached 98.3%. To quantify how much performance improvement knowledge distillation could bring, the student model MobileNet was compared with the teacher models VGG16 and ResNet50 in terms of recognition accuracy, the number of model parameters, the memory occupied by the model, and the model training time. It can be seen from Table 4 that, after knowledge distillation, the number of parameters in the VGG16 teacher model was 43.4 times that of the MobileNet model, and the number of parameters in the ResNet50 teacher model was 9 times that of the MobileNet model.

The number of parameters reflects the complexity of a model and directly affects its deployment on devices with relatively little computing power. In terms of training time, the student model MobileNet converged in about 20 h after distillation. The accuracy of the model distilled from VGG16 and the model distilled from ResNet50 was 94.8% and 96.4%, respectively. Compared with the conventional deep convolutional neural network models, the distilled MobileNet model slightly improved the accuracy while greatly reducing the model size and training time, thus speeding up model convergence.

3.2. Visualization Results of the Feature Image

By visualizing the features of the convolutional layers in the teacher network, the running process of the model can be better understood. Convolutional layers from shallow to deep were selected for visualization, and the results are shown in Figure 4.

The front convolutional layers of a neural network are shallow layers, and the back convolutional layers are deep layers; the two emphasize different information. In Figure 4, the small images are the visualizations of different convolutional layers arranged in increasing order from shallow to deep. It can be seen from the figure that the shape of the object becomes less and less obvious. This is because the features extracted by the shallow convolutional layers emphasize texture and detail information, so the basic shape of the object can still be clearly seen; as the number of layers increases, more abstract features are extracted and more transformation operations are performed, describing the object more comprehensively. Generally, the deeper the layer, the more representative the extracted features. Through such network visualization [26], the focus of the network during object recognition can be better understood: the brighter an area of the feature image, the more obvious the corresponding features.
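A minimal sketch of this kind of feature map visualization is given below, assuming Keras and Matplotlib; the layer name and the number of displayed maps are illustrative.

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

def show_feature_maps(model: tf.keras.Model, image: np.ndarray, layer_name: str, n: int = 16):
    """Display the first n feature maps of one convolutional layer for a single image."""
    # Sub-model that outputs the activations of the chosen layer
    extractor = tf.keras.Model(model.input, model.get_layer(layer_name).output)
    maps = extractor(image[np.newaxis, ...])[0].numpy()  # shape (H, W, channels)
    for i in range(min(n, maps.shape[-1])):
        plt.subplot(4, 4, i + 1)
        plt.imshow(maps[:, :, i], cmap="viridis")
        plt.axis("off")
    plt.show()
```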

4. Conclusion

To address the problems of huge model parameters and long training times of neural networks, this paper proposes using knowledge distillation technology to establish a classification model that identifies five different wheat growth periods. The main conclusions are as follows:

(1) The standard CNN models VGG16 and ResNet50 and the lightweight CNN model MobileNet were used to predict the wheat growth periods on the test set. The average accuracy of these models was 93.5%, 94.6%, and 92.6%, respectively, which indicates that convolutional neural networks are feasible and efficient for wheat growth period recognition.

(2) When ResNet50 was used as the teacher model, the accuracy of the student model MobileNet was improved to 98.3%. The trained student model MobileNet was then evaluated in terms of the accuracy for each single wheat growth period, the number of model parameters, the memory occupied by the model, and the training time. The results showed that the average accuracy of the MobileNet model after distillation was 97.3%, the number of model parameters was only about 1.3 million, the model size was compressed to only 19.7 MB (88.9% smaller than that of ResNet50), and the convergence time of the model was only 20 hours (only half that of the teacher model). The model after knowledge distillation has high accuracy and can meet real-time performance requirements.

(3) Knowledge distillation can bring the accuracy of a small model close to that of a large model while greatly reducing the number of parameters, model size, and running time of the wheat growth period identification model. The trained model is therefore easy to deploy and run on equipment with low computing power, which provides technical support for intelligent wheat production in the field.

Data Availability

Due to the limitations of the funded projects, the research data related to this study are not suitable for public sharing at present. When the project is completed, relevant research data can be provided.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This paper was sponsored by the Henan Province Key Scientific and Technological Project (222102110234).