Abstract

Traditionally, the classification of seed defects has relied mainly on color, shape, and texture features. This approach requires repeated extraction of a large amount of feature information, much of which is used inefficiently during detection. In recent years, deep learning has performed well in the field of image recognition. We introduced convolutional neural networks (CNNs) and transfer learning into the quality classification of seeds and compared them with traditional machine learning algorithms. Experiments showed that the deep learning algorithm was significantly better than the machine learning algorithm, with an accuracy of 95% (GoogLeNet) versus 79.2% (SURF+SVM). We used the three classifiers in GoogLeNet to demonstrate that accuracy increases with network depth. We applied visualization techniques to obtain the feature map of each layer of the network and used heat maps to represent the probability distribution of the inference results. As an end-to-end architecture, CNNs can be readily applied to automated seed sorting.

1. Introduction

Maize is one of the most important crops worldwide: about one-third of the world's population consumes maize as a major food source. Owing to urbanization, the area of cultivated land has been shrinking, a particularly prominent issue in China, so seed quality has become a growing concern. Phenotypic defects are one of the criteria for judging seed quality. The traditional method of detecting seed defects relies on manual inspection, which is inefficient and subjective. Therefore, an objective and automated seed screening method is required.

Researchers have applied machine vision technology to seed quality testing [1–4]. Features such as color, texture, size, and shape can be extracted from seed images, and seed defects can be identified by various computer-vision-based classifiers. This procedure can be easily automated and thus provides a significantly more efficient method for seed sorting than manual inspection.

In recent years, deep learning has developed rapidly: search engines, recommendation systems, image recognition, and speech recognition have all adopted deep learning techniques and achieved good results [5]. As GPU performance and parallel computing power continue to improve, it has become possible to process image data in real time. Convolutional neural networks (CNNs) have achieved excellent results in image recognition. CNNs have clear advantages over ordinary machine learning algorithms: a deep CNN depends less on an explicit, hand-crafted representation of the data, and weight sharing, downsampling, and local receptive fields can be used in the network [6]. This data processing scheme makes CNNs highly efficient. When using CNNs, we only need to set a few parameters of the optimization algorithm, such as the learning rate, weight decay, and batch size. At the same time, the accuracy of the algorithm increases with the amount of data available. However, CNNs also have disadvantages: (1) there are millions of parameters in a model, and tuning them is a long process; (2) the large number of feature maps contains much redundant information, and we cannot manually intervene to filter out the useful part. Researchers have also applied CNNs to seed identification tasks. Heo et al. used CNNs to filter weed seeds from high-quality seeds [7], Veeramani et al. used CNNs to distinguish haploid from diploid maize seeds [8], Ravindran et al. used transfer learning to classify wood species [9], and Uzal et al. used CNNs to estimate the number of seeds per soybean pod [10].

The purpose of this study was to use CNNs and transfer learning to identify the appearance defects of maize seeds. We compared CNNs with traditional machine learning algorithms and also studied the relationship between network depth and CNN accuracy.

2. Materials and Methods

2.1. Data Collection

We used a 1/2.5-inch CMOS camera (MindVision MD-U500) with a 28 mm lens (MindVision ML15). The camera has a maximum resolution of 2592 × 1944 pixels at 10 frames per second, with a depth of 8 bits per color channel. A white LED ring light source (MindVision MD-HX24) was used, and a black background made the maize seeds more distinguishable from their surroundings.

The seed image library was acquired with our custom-made image acquisition equipment (Figure 1). In this study, 4000 maize seeds were used and divided into training and testing sets; 20% of the seeds were randomly selected as the test set. Table 1 shows the resulting data sets. Both the training and testing sets comprised two groups of seeds: one group was defect free in appearance, and the other had defects including mold, worm damage, breakage, and discoloration (see Figure 2).

2.2. Image Segmentation

In actual testing, it is unrealistic to test only one seed at a time, so we imaged multiple seeds simultaneously. The problem was that seeds might touch one another, which greatly affected single-seed classification. Many researchers have open-sourced object detection projects; these frameworks are based on CNNs and can identify multiple targets in a single image simultaneously. The best-performing frameworks are Faster R-CNN [11], SSD [12], and YOLO [13], and Faster R-CNN has shown excellent results in the agricultural sector [14]. However, because of the large amount of computation in the training process, a high-performance GPU is required. Therefore, we used image processing algorithms for seed singulation instead.

We used classical image processing techniques with OpenCV [15] to segment the seeds from the images. First, the original color image was converted to grayscale, and Otsu's maximum between-class variance method [16] was used to obtain a binary image of the approximate contour of each seed. Morphological operations were then used to remove noise in the background and fill holes in the seed regions.

The next step was to dilate the binary image so that the true seed regions were a subset of the dilated image. We used the distance transform to find the center region of each seed and then subtracted these center regions from the dilated image to obtain the indeterminate seed edges. Lastly, we used the watershed algorithm to obtain the exact position of each seed edge, with a different color representing each seed for easy observation. The process is shown in Figure 3(a).
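A minimal OpenCV sketch of this segmentation pipeline is shown below. The file name, kernel size, iteration counts, and the 0.5 distance-transform threshold are illustrative assumptions, not the exact values used in this study.

```python
import cv2
import numpy as np

img = cv2.imread("seeds.png")  # hypothetical multi-seed image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Otsu threshold: seeds are bright against the black background
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# morphological opening removes background noise and small holes
kernel = np.ones((3, 3), np.uint8)
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel, iterations=2)

# dilation gives a "sure background" region containing every true seed
sure_bg = cv2.dilate(opened, kernel, iterations=3)

# distance transform + threshold finds the "sure foreground" seed centers
dist = cv2.distanceTransform(opened, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)
sure_fg = np.uint8(sure_fg)

# subtracting the centers from the dilated image leaves the uncertain edges
unknown = cv2.subtract(sure_bg, sure_fg)

# label each center, reserve 0 for the unknown region, then run watershed
_, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1            # background label becomes 1
markers[unknown == 255] = 0      # watershed resolves these pixels
markers = cv2.watershed(img, markers)
img[markers == -1] = (0, 0, 255)  # draw the recovered seed edges in red
```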

The final step was to extract each seed from the original image. We calculated the position coordinates of each seed, and to prevent interference from adjacent seeds during segmentation, we kept only one maize seed in the original image before each crop (Figure 3(b)); the final results are shown in Figure 3(c).
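Continuing the sketch above, each labeled seed can be isolated and cropped by its bounding box; the output file naming is illustrative.

```python
# labels >= 2 in 'markers' correspond to individual seeds (1 is background)
for label in range(2, markers.max() + 1):
    mask = np.uint8(markers == label) * 255
    x, y, w, h = cv2.boundingRect(mask)
    single = img.copy()
    single[markers != label] = 0          # keep only the current seed
    cv2.imwrite(f"seed_{label:03d}.png", single[y:y + h, x:x + w])
```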

2.3. Data Augmentation

Since the training set was limited, and since increasing the number of images before training improves CNN accuracy [17], we accounted for uncertain factors such as the placement angle and position of the seeds during recognition by randomly flipping, shifting, and rotating the images. The transformed images were used in training along with the originals to improve the robustness of the model. After augmentation, the total number of seed images was 20,000.
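Since the model was later trained with Keras [27], a sketch of this augmentation step with Keras's ImageDataGenerator is given below; the transformation ranges, directory layout, and image size are assumptions for illustration.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# illustrative ranges: random flips, shifts, and rotations as described above
datagen = ImageDataGenerator(
    rotation_range=180,      # seeds may lie at any angle
    width_shift_range=0.1,   # random horizontal shift (fraction of width)
    height_shift_range=0.1,  # random vertical shift (fraction of height)
    horizontal_flip=True,
    vertical_flip=True,
)
train_flow = datagen.flow_from_directory(
    "data/train",            # hypothetical directory of labeled seed images
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)
```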

2.4. Machine Learning Algorithm

A traditional computer vision method requires extracting features such as shape, color, and texture from the original image for training [3], which is usually overly complicated; we therefore adopted the Speeded Up Robust Features (SURF) algorithm [18] and compared a variety of classifiers in this study.

Because the location and orientation of the seeds could be random during classification, the SURF algorithm was used to extract features from each seed image. The algorithm is scale and rotation invariant: it uses an integral image to accelerate convolution, uses the Hessian response to decide whether a point is a feature point, and creates a descriptor for each feature.
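The sketch below shows SURF feature extraction with OpenCV; note that SURF lives in the opencv-contrib xfeatures2d module, and the Hessian threshold here is an illustrative assumption.

```python
import cv2

# SURF requires an opencv-contrib build (the algorithm is patented and is
# excluded from default OpenCV packages); hessianThreshold = 400 is assumed
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

gray = cv2.imread("seed_001.png", cv2.IMREAD_GRAYSCALE)
keypoints, descriptors = surf.detectAndCompute(gray, None)
# descriptors holds one 64-dimensional vector per detected keypoint
print(len(keypoints), descriptors.shape)
```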

We used logistic regression, support vector machine (SVM), K-nearest neighbor (KNN), and ensemble learning to classify the extracted features: (1) logistic regression is a simple classifier which models the mean response as a function of the linear combination of predictors. (2) Medium Gaussian SVM is a support vector machine that makes fewer distinctions than a fine Gaussian SVM, using the Gaussian kernel with kernel scale set as the square root of the number of predictors. (3) Coarse KNN is a nearest-neighbor classifier that makes coarse distinctions between classes. (4) Bagged trees is a bootstrap-aggregated ensemble of fine decision trees. For each method, we used cross-validation during the training. These classifiers were implemented using the Classification Learner App [19] in MATLAB 2018a, which has the advantage of integrating multiple classifiers and eliminating the necessity to set any complex parameters.
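For readers without MATLAB, the sketch below shows an equivalent Gaussian-kernel SVM with cross-validation in scikit-learn. Because SURF yields a variable number of descriptors per image, a fixed-length encoding (e.g., a bag-of-visual-words histogram) is assumed as a preprocessing step; the file names and encoding are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# X: one fixed-length feature vector per seed (assumed to be SURF descriptors
# pooled into a bag-of-visual-words histogram); y: 0 = defect free, 1 = defect
X = np.load("surf_bovw_features.npy")  # hypothetical precomputed features
y = np.load("labels.npy")

# RBF (Gaussian) kernel; gamma="scale" plays a role similar to MATLAB's
# "medium" kernel scale of sqrt(number of predictors)
clf = SVC(kernel="rbf", gamma="scale")
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```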

2.5. Convolutional Neural Network

There are two ways to train a CNN: training from scratch with random initial values or using transfer learning. In many practical tasks it is not feasible to train a network from scratch, since obtaining a large amount of training data and computing time is costly, and training from scratch on a small dataset causes the network to overfit. This problem can be easily mitigated by transfer learning [20], because the network is initialized from a well-optimized pretrained model (trained on the ImageNet dataset [21]) and only light tuning is required during training.

There are two ways to use transfer learning. One is to freeze the weight parameters of certain layers so that only the parameters of the remaining layers can change during training. The other is to leave all the initial weight parameters alterable during training; this method is called fine-tuning. The first method was used in this study.

We used a relatively shallow network, VGG19 [22], as the base network. First, the weight parameters of the entire network were initialized from the pretrained model, the first 22 layers were frozen, and three custom layers (a global average pooling layer and two dense layers) were appended to the end of the network. The network structure is shown in Table 2; the output shape column indicates the dimensions of the feature map produced by each layer.

The entire network was composed of 25 layers, including an input layer, convolution layers, activation layers, pooling layers, and fully connected layers. The input layer accepted RGB images of size 224 × 224 × 3. Each convolution block contained a convolution layer and an activation layer. The convolution kernels were all 3 × 3, with padding preserving the feature map size. The ReLU function [23] was used for the activation layers, and max pooling was adopted for the pooling layers. The global average pooling layer was used to prevent the network from overfitting. The last layer was a 2-way dense layer with softmax. The total number of network parameters was 20,551,746, of which only the 527,362 parameters of the last three layers needed to be tuned during training. The cross-entropy loss function and Adam [24] optimization algorithm were used, and the initial learning rate was set to 0.001.
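A Keras sketch consistent with this description is shown below. The width of the hidden dense layer is not stated explicitly in the text, but 1024 units is consistent with the reported 527,362 trainable parameters (512 × 1024 + 1024 for the first dense layer plus 1024 × 2 + 2 for the output layer).

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG19

# pretrained VGG19 body (22 layers without its original classifier head)
base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False          # freeze the first 22 layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),        # guards against overfitting
    layers.Dense(1024, activation="relu"),  # width inferred, see lead-in
    layers.Dense(2, activation="softmax"),  # 2-way output: defect free / defect
])
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()  # reports ~20.55M total and 527,362 trainable parameters
```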

Szegedy et al. studied the effect of network depth on training accuracy [25]. We also devised a study to improve model performance by using a deeper network, GoogLeNet, again with transfer learning. The reason for choosing this network was that its two auxiliary classifiers allow the relationship between network depth and recognition rate to be compared. The network uses the inception structure, which is composed of 1 × 1, 3 × 3, and 5 × 5 convolution kernels in parallel. The purpose is to extract richer features than a single large convolution kernel would when convolving the same receptive field [26]. The 1 × 1 convolution kernel is in effect a convolution over each pixel of the feature map, equivalent to a fully connected layer across channels; in practice it acts as a dimensionality reduction and significantly reduces computational complexity. At the end of the structure, the outputs of the four parallel branches are concatenated to enrich the feature map (following the Hebbian principle); another function is to approximate a sparse structure with dense matrices, so redundant information in the original feature map is not processed, which speeds up computation.
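The study trained GoogLeNet in Caffe (see Section 3), but for illustration a minimal Keras-style inception block with the four parallel branches described above can be sketched as follows; the branch widths are left as parameters since the per-module filter counts vary through the network.

```python
from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    """Sketch of a GoogLeNet-style inception block."""
    # branch 1: plain 1x1 convolution
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    # branch 2: 1x1 dimensionality reduction, then 3x3 convolution
    b3 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    # branch 3: 1x1 dimensionality reduction, then 5x5 convolution
    b5 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    # branch 4: 3x3 max pooling, then 1x1 projection
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(pool_proj, 1, padding="same", activation="relu")(bp)
    # concatenate the four branches along the channel axis
    return layers.Concatenate(axis=-1)([b1, b3, b5, bp])
```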

In addition to the inception structure, we added dropout layers that randomly discarded 40% of the neurons to prevent overfitting. Two auxiliary classifiers were added after Inception (4a) and Inception (4d) to achieve better results: during training, their losses were added to the total loss in a certain proportion, but they were not used during inference. We used this structure to evaluate the effect of network depth on performance.

3. Results

Among the traditional classifiers, the SVM with Gaussian kernel performed best, with an accuracy of 79.2% (Table 3).

We used Keras [27] with TensorFlow as the backend to build the VGG19 model. Figure 4 shows the training loss, testing loss, training accuracy, and testing accuracy for VGG19. The test accuracy finally stabilized at around 88%, a significant improvement over the machine learning algorithms. We then selected 100 maize seeds as a validation set to verify the algorithm; the confusion matrix is shown in Table 4.

Since Keras provides no integrated GoogLeNet pretrained model, we used the Caffe deep learning framework. Figure 5 shows the training loss, testing loss, training accuracy, and testing accuracy of the three classifiers (the blue and red lines represent the two auxiliary classifiers, and the green line represents the main classifier). Accuracy was reported every 500 iterations, and it can be seen that accuracy exceeded 90% within the early iterations. Table 5 shows the accuracy of the three classifiers after 8000 iterations: the main (deepest) classifier reached 96.3%, while the shallowest auxiliary classifier reached 95.7%. We used the same 100 seeds for verification; as shown in Table 6, the accuracy on the verification set was 94%, slightly lower than on the test set.

3.1. Visualizing Feature Maps and Heat Maps

A CNN is an end-to-end architecture: the recognition result is obtained automatically by simply feeding the image to be recognized into the network. The intermediate process is usually a black box and not interpretable. We used visualization techniques [28, 29] to extract the feature map of each layer in the network. To facilitate observation, we selected seeds with obvious damage (Figure 6(a)). Taking the first convolution layer as an example (Figure 6(b)), there were 64 feature maps, and it can be seen that this layer retained the color and texture information of the original image. We also observed that the features extracted by the network become more abstract as the depth of the layer increases.
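A sketch of this feature map extraction is shown below, reusing the VGG19 body (`base`) from the transfer learning sketch; the random input is a stand-in for a real preprocessed seed image.

```python
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras import Model

# replace this stand-in with a real preprocessed seed image batch
img = np.random.rand(1, 224, 224, 3).astype("float32")

# truncate the network at its first convolution layer (64 filters)
viz = Model(inputs=base.input,
            outputs=base.get_layer("block1_conv1").output)
feature_maps = viz.predict(img)          # shape (1, 224, 224, 64)

# tile the 64 feature maps into an 8 x 8 grid for inspection
fig, axes = plt.subplots(8, 8, figsize=(12, 12))
for i, ax in enumerate(axes.flat):
    ax.imshow(feature_maps[0, :, :, i], cmap="viridis")
    ax.axis("off")
plt.show()
```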

We also visualized the heat map, which indicates which part of the image drove the network's classification decision. The heat map is a two-dimensional matrix representing the CNN's score for each position of the input image. As an example, Figure 7 shows that the worm-damaged part had the greatest influence on the classification result.
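The exact heat map method is not specified in the text; a common choice consistent with this description is a Grad-CAM-style map, sketched below. `full_model` is assumed to be a trained functional CNN whose last convolution layer is named `last_conv` (a hypothetical name), and `img` is a preprocessed input batch.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model

# expose both the last conv feature maps and the final class predictions
grad_model = Model(inputs=full_model.input,
                   outputs=[full_model.get_layer("last_conv").output,
                            full_model.output])

with tf.GradientTape() as tape:
    conv_out, preds = grad_model(img)          # img: (1, H, W, 3) batch
    top_class = tf.argmax(preds[0])
    class_score = tf.gather(preds, top_class, axis=1)

grads = tape.gradient(class_score, conv_out)     # d(score) / d(feature maps)
weights = tf.reduce_mean(grads, axis=(0, 1, 2))  # mean gradient per channel
heatmap = tf.reduce_sum(conv_out[0] * weights, axis=-1)
heatmap = tf.nn.relu(heatmap).numpy()            # keep positive evidence only
heatmap /= heatmap.max() + 1e-8                  # normalize to [0, 1]
# upsample 'heatmap' to the input size and overlay it on the seed image
```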

4. Discussions

In the actual classification process, the orientation of randomly placed seeds is indefinite, so the SURF algorithm (with its scale and rotation invariance) was used to extract the seed features, and no additional algorithm was needed to correct seed rotation.

Unlike traditional machine learning algorithms, deep learning does not require complex feature extraction. We only need to choose an appropriate network structure with a corresponding optimization algorithm, and adjusting the learning rate and other parameters during training can bring the algorithm to an optimal state. In an initial experiment, we tried to train the data from scratch using smaller networks such as AlexNet, but these networks suffered severe overfitting.

Transfer learning performed well because the model parameters started from a well-optimized set of initial values rather than from completely random numbers. With the shallower network (VGG19), the training loss dropped significantly, but the testing loss decreased only at the beginning and was essentially unchanged in the later stage. The training accuracy approached 95% and the testing accuracy stabilized at 88%, which indicated that this network had reached its optimal state. Although it performed better than the SVM, its accuracy was still not ideal in practice: VGG19 had to identify multiple defect types, and the difference between some defective and defect-free seeds was not obvious, so the classification error was large and deeper features needed to be extracted. We therefore adopted the more complex GoogLeNet, whose inception structure gives it an excellent ability to extract features. The accuracy of the shallowest classifier was 95.7%, and the accuracy of the entire network was 96.5%; these results showed that network depth played an important role. Although we achieved good results with a deep network, overfitting appeared during the iterative process (the training loss kept decreasing while the testing loss rose), which indicated that the amount of data available was still relatively small for a network as complex as GoogLeNet.

5. Conclusions

In this paper, we used CNNs and transfer learning to classify defects of maize seeds. The experiments demonstrated the applicability of CNNs to seed defect classification tasks: the CNNs were significantly better than machine learning algorithms in maize seed defect evaluation, and the accuracy of the model increased as the depth of the network increased. Since appearance defects are one of the indicators of seed quality, such a classifier is directly useful for seed quality assessment.

In this research we applied CNNs only to RGB images, but they can also be applied to multispectral or hyperspectral images. Multispectral images would allow not only the phenotypic characteristics of seeds but also different varieties to be recognized, enhancing the generalization ability and practicality of the model.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

Authors’ Contributions

Sheng Huang and Xiaofei Fan contributed equally to this work.

Acknowledgments

This work was supported by the introduction of talent research projects in Hebei Agricultural University.