Abstract

Recognizing and authenticating wheat varieties is critical for quality evaluation in the grain supply chain, particularly for seed inspection. At present, grains are recognized and verified manually through direct visual examination. Automatic categorization techniques based on machine learning and computer vision offer fast, high-throughput solutions. Even so, categorization at the varietal level remains a complicated process. This paper applies machine learning approaches to wheat seed classification. Classification is performed on 7 physical features: area, perimeter, compactness, kernel length, kernel width, asymmetry coefficient, and kernel groove length. The dataset, collected from the UCI library, contains 210 occurrences of wheat kernels from three varieties, Kama, Rosa, and Canadian, with 70 elements of each chosen at random for the experiment. In the first phase, the K-nearest neighbor, classification and regression tree, and Gaussian Naïve Bayes algorithms are implemented for classification. The results of these algorithms are compared with an ensemble machine learning approach. The results reveal that the accuracies calculated for the KNN, decision tree, and Naïve Bayes classifiers are 92%, 94%, and 92%, respectively. The highest accuracy, 95%, is achieved through the ensemble classifier, in which the decision is made based on hard voting.

1. Introduction

In many developing nations, farming is the most significant economic sector, yet most of its tasks are carried out without modern technology. Seed categorization is usually done based on human expertise. Seed purification plays an important role in this process and must be enhanced. Manually determining the type of wheat requires expert judgment and takes time. When an array of seeds appears very similar, distinguishing them manually becomes a challenging process [14].

A quality evaluation method for wheat crops is required as the grain chain evolves [5]. The goal is a high-quality wheat product in greater quantities. Seed germination tests are required for seed labeling [6]. Seeds are also subjected to a purity test, which is required to ascertain a seed sample's physical and varietal purity. Mechanical mixing and improper labeling may compromise the genetic integrity of the original wheat cultivar [7].

Classification testing is accomplished via a taxonomic categorization method and nondestructive grain feature analysis [8, 9]. Seed testers classify seeds at two levels: the species level and, for varietal purity, the varietal level. The varietal level may be challenging due to the great degree of resemblance in the characteristics of different kinds of wheat seed; growth conditions may also affect the grain's properties [5]. The actual classification test procedure is still a low-throughput process, and its correctness is contingent on the expert's performance and cumulative experience.

Machine learning techniques are now the subject of study in a variety of disciplines, particularly with the expansion of the Internet and the availability of larger datasets [4]. Without specialized tools or automated software procedures, it is difficult for a human operator to interpret or handle such data. Machine learning is therefore frequently used in applications such as categorization, regression, and forecasting [10]. A single classifier does not give good accuracy for objects that differ only minutely in physical characteristics such as color, texture, and morphology [11]. To address this problem, an ensemble approach is used in the present work. By combining many models into a single, highly reliable model, ensemble methods seek to improve model prediction. The most common ensemble approaches are boosting, bagging, and stacking. Ensemble techniques are particularly well suited to regression and classification, where they minimize bias and variance while increasing model accuracy. The major findings of this paper are as follows:

(1) An optimized classifier is designed for wheat seed classification by utilizing an ensemble machine learning approach with bagging.
(2) The model is compared with three machine learning classifiers: the K-nearest neighbors (KNN) classifier, the decision tree classifier (CART), and Gaussian NB (NB).
(3) The highest accuracy achieved is 95%, obtained with the ensemble method.

2. Related Work

Seed categorization using machine learning algorithms has been the subject of prior research. These studies employed a variety of machine classifiers and achieved a high degree of accuracy. Machine learning techniques had previously been applied successfully to seed and cereal classification in a variety of production chains [12–14]. In [15], the study shows the capability and possibilities of machine vision for recognizing shapes, sizes, and varietal types using well-trained multilayer neural network classifiers; the authors utilized Weka classification tools such as function, Bayes, meta, and lazy approaches to categorize the seeds. In [16], the authors proposed a fuzzy theory-based approach for recognizing wheat seed types that takes into account the features of the seed; the tabu search technique was used. In [17], the authors used artificial neural networks to classify wheat seeds based on VLC and obtained accuracies of 92.1 percent and 85.72 percent, respectively. In [18], the authors discussed the morphological, color, and textural characteristics of the seed; when the differences in morphological features are very minute, seed classification is very difficult. Cereal yield is determined by the number of grains per ear and the size of the grains. Counting and measuring seeds by sight is time-consuming. As a result, different ways of performing efficient grain morphometry using image processing techniques have been proposed [19, 20].

In [21], the authors created a workstation to aid in grain analysis for classification, and a video colorimetry approach is presented to support determining cereal grain color. In [22], chickpea seed varieties were categorized based on the morphological characteristics of chickpea seeds, using 400 samples from four types: Kaka, Piroz, Ilc, and Jam. From a commercial point of view, a machine vision system built on existing neural network models may be utilized for rice quality assessment [23]. That work used neural networks to categorize a total of nine separate varieties of rice, employing seed image acquisition, and developed a method for extracting 13 morphological features, 6 color features, and 15 texture features from color photographs of seeds; the model produced an overall classification accuracy of 92%. The k-nearest neighbors classifier necessitates storing the entire training set, which can be prohibitively expensive when the set is huge, and several researchers have attempted to eliminate the training set's redundancy to relieve this problem [24, 25].

For plant categorization, deep learning models have also been utilized [26]. Two tendencies may be seen in the current state of the art. The first is linked to high-throughput phenotyping and plant identification, as evidenced by Ubbens and Stavness's work in this area [27]. The second is plant disease identification and monitoring [28, 29]. In [30], the authors present several voting techniques for testing ensembles of classifiers learned using the bagging approach, with multilayer perceptrons used as the classifiers. Using groups of classifiers rather than individual ones is one option: bagging [31] and boosting [32] are two of the most well-known ensemble techniques, in which many classifiers are combined to generate a single, more accurate result. In [33], the authors studied the performance of several voting techniques, with bagging utilized for the reconciliation model, a process of merging classification models. Table 1 lists the features considered in machine vision systems for food grain quality evaluation.

3. Methodology

The methodology adopted for this work includes the collection of the dataset, feature identification, data augmentation, classification using machine learning algorithms (KNN, Naive Bayes, and CART), implementation of the ensemble approach with fine-tuning for better accuracy, and comparison of the results.

3.1. Dataset

In this study, the seeds dataset was obtained from the UCI library [42]. There are 210 occurrences of wheat kernels in the collection. In addition to the class attribute, each instance contains 7 other attributes, and all samples share the same 7 characteristics: area, perimeter, compactness, kernel length, kernel width, asymmetry coefficient, and kernel groove length. All of the attributes are continuous and real-valued. Figure 1 shows the features of the dataset, the number of classes present, and the selected machine learning algorithms.
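As an illustration, the dataset can be loaded directly from the repository. The following is a minimal sketch assuming the whitespace-separated layout of the repository copy; the file URL and the column names are our own labels, not part of the paper.

```python
import pandas as pd

# Load the UCI "seeds" dataset: 210 kernels, 7 real-valued features plus a class label.
# The repository copy is whitespace-separated; column names are our own labels.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt"
columns = ["area", "perimeter", "compactness", "kernel_length",
           "kernel_width", "asymmetry", "groove_length", "variety"]
seeds = pd.read_csv(url, sep=r"\s+", header=None, names=columns)

X = seeds[columns[:-1]]   # the 7 physical features
y = seeds["variety"]      # 1 = Kama, 2 = Rosa, 3 = Canadian
print(seeds.shape)        # expected: (210, 8)
```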

The examined set contained kernels from three wheat varieties, Kama, Rosa, and Canadian, with 70 elements of each. To obtain high-quality visualization of the internal kernel structure, a soft X-ray technique was employed; it is nondestructive and more affordable than other sophisticated imaging techniques such as scanning microscopy or laser technology. Images were recorded on X-ray KODAK plates measuring 13 × 18 cm. The studies were conducted on combine-harvested wheat grain originating from experimental fields at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin. In Table 2, the classes represent the 3 varieties of wheat: Kama (1), Rosa (2), and Canadian (3).

3.2. Machine Learning Models

K-nearest neighbor: the k-nearest-neighbor algorithm is a data categorization method that predicts the group a data point belongs to based on the groups of the data points closest to it [43]. Any common metric can be used to determine the distance; Euclidean distance is one example. We then collect a certain feature value from all the training samples in the immediate vicinity and use the majority of these values as the prediction for the new test data [44].

Classification and regression trees (CART): CART is a nonparametric supervised learning method [45, 46]. The objective is to build a model of the value of a target variable by learning basic decision rules from the data characteristics; the tree approximates a piecewise constant function. Nonstatisticians can interpret CART quite well [45, 47].

Gaussian Naïve Bayes: the basis for Naive Bayes is Bayes decision theory, with the classification decision made from probabilities. The posterior probability is calculated from the likelihood, the prior probabilities, and the evidence; the evidence is merely a scalar that ensures the posterior probabilities sum to one. The resulting class for given test data is the category with the highest posterior probability.

Ensemble methods: ensemble methods are strategies for increasing the accuracy of model outputs by integrating many models rather than utilizing just one. The integrated models greatly improve the accuracy of the outcomes, which is why ensemble techniques have risen in popularity in machine learning. Ensemble-based solutions can be unexpectedly effective when dealing with enormous amounts of data or with a lack of appropriate data. When the quantity of training data is too large to train a single classifier, the data can be divided deliberately into smaller groups; each division may then be used to train a distinct classifier, and the classifiers merged using a suitable combination algorithm. If there is not enough data, bootstrapping may be used to train alternative classifiers on distinct bootstrap samples of the data, each of which is a random sample taken with replacement and handled as if it were drawn independently from the underlying distribution [48].
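The three base classifiers above can be sketched with scikit-learn as follows. The hyperparameters (e.g., n_neighbors=5) are illustrative library defaults, not values reported in this work, and X and y are the feature matrix and labels from the loading sketch in Section 3.1.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# The three base classifiers; hyperparameters are library defaults, for illustration only.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),      # majority vote among the 5 nearest points
    "CART": DecisionTreeClassifier(random_state=0),  # CART-style binary decision tree
    "NB": GaussianNB(),                              # Bayes rule with Gaussian likelihoods
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)      # X, y from the dataset sketch above
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```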

Bagging is one of the oldest, most intuitive, and probably simplest ensemble-based methods, yet it often performs remarkably well. Diversity among the classifiers in bagging is obtained from bootstrapped copies of the training data: different chunks of training data are chosen at random, with replacement, from the entire training dataset, and each subset is used to train one classifier of a particular type. The individual classifiers are then merged using a simple majority vote. Boosting also produces an ensemble of classifiers by resampling the data and merging them by majority vote; in boosting, however, resampling is used deliberately to supply the most informative training data for each subsequent classifier.
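The bagging procedure described here can be sketched with scikit-learn's BaggingClassifier; the tree base learner and the ensemble size below are illustrative assumptions, not settings reported by the paper.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Bagging: each tree is trained on a bootstrap sample (drawn with replacement)
# and the individual predictions are combined by plurality vote.
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=0),  # "base_estimator" in scikit-learn < 1.2
    n_estimators=50,      # number of bootstrapped classifiers (illustrative choice)
    bootstrap=True,       # resample the training data with replacement
    random_state=0,
)
print(cross_val_score(bagged_trees, X, y, cv=5).mean())
```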

We applied a hard voting classifier, which means that the predicted output class is the one that obtains the most votes. For instance, if the three classifiers predict among the wheat classes Kama, Rosa, and Canadian and most of them predict the Kama variety, then the final prediction will be Kama wheat.
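A minimal sketch of this hard voting scheme using scikit-learn's VotingClassifier; again, the base-classifier hyperparameters are illustrative assumptions rather than the paper's reported settings.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hard voting: each base classifier casts one vote per sample, and the class
# with the most votes (e.g., Kama) becomes the final prediction.
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("cart", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```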

3.3. Experimental Setup

Python 3 libraries, namely NumPy, SciPy, scikit-learn, Keras, pandas, and Matplotlib, are utilized to perform the categorization with the ML models. Scikit-learn is among the most user-friendly and reliable machine learning libraries [49, 50]; it is built on NumPy, SciPy, and Matplotlib. The results of the dataset analysis, as well as the models' training and testing using the extracted features, are presented in this section. Figure 2 shows the classification process.

3.4. Performance Evaluation Matrix

To measure the effectiveness of a classifier, a confusion matrix is required, which gives the number of correct and incorrect predictions against the known true values. It records true positives (TP), where the actual value is true and the model predicted true; true negatives (TN), where both the actual and predicted values are false; false positives (FP), where the actual value is false but the model predicted true; and false negatives (FN), where the actual value is true but the model predicted false.

Accuracy: accuracy measures how often a model predicts the right value for the given input. It gives no detailed information regarding FP and FN, however; for applications where FP and FN matter, the F1-score and recall play a very important role. Accuracy is calculated by
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Precision: this evaluation parameter tells how frequently the model's positive predictions are actually true. A low precision implies a high number of false positives. Precision is calculated by
$$\text{Precision} = \frac{TP}{TP + FP}.$$

Recall: this parameter indicates how often the model misses true cases. A low recall means the model produced a high number of false negatives. Recall is calculated by
$$\text{Recall} = \frac{TP}{TP + FN}.$$

F1-score: the F1-score combines both precision and recall. A high F1-score indicates low numbers of false positives and false negatives, meaning the model accurately detects actual instances without raising false alarms. The F1-score is calculated by
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
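These quantities need not be computed by hand. As a sketch, scikit-learn can derive the confusion matrix and the per-class metrics, assuming y_test and y_pred come from a fitted classifier and a held-out split (one such split is shown in the next section).

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# y_test / y_pred are assumed to come from any fitted classifier on a held-out split.
print(confusion_matrix(y_test, y_pred))   # rows: true class, columns: predicted class
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=["Kama", "Rosa", "Canadian"]))
```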

4. Result

The results present the evaluation of the four considered ML models: KNN, Naive Bayes, CART, and the ensemble method. For evaluating the models, we divided the collected data into a 70% training set and a 30% testing set. The wheat classes Kama, Rosa, and Canadian have been assigned the numbers 1, 2, and 3, respectively. The results of all considered algorithms are evaluated in terms of recall, precision, F1-score, and accuracy. Tables 3–7 show the values of these parameters for KNN, Naive Bayes, CART, and the ensemble method, respectively. Figures 3–6 show the confusion matrices for KNN, Naive Bayes, CART, and the ensemble method, respectively. The accuracies determined for the KNN, decision tree, and Naive Bayes classifiers are 92 percent, 94 percent, and 92 percent, respectively. The ensemble classifier, which makes decisions based on hard voting, has the best accuracy at 95 percent.
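A sketch of this evaluation protocol, reusing the models and ensemble objects from the Section 3.2 sketches; the random seed and the stratified split are our assumptions, so the resulting accuracies will only approximate the reported 92–95% figures.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 70/30 split as described above; stratify=y (our addition) keeps the
# three varieties balanced across the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y)

# `models` and `ensemble` come from the sketches in Section 3.2.
for name, model in {**models, "Ensemble": ensemble}.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    print(f"{name}: accuracy = {accuracy_score(y_test, y_pred):.2%}")
```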

5. Discussion

To implement KNN, we use the scikit-learn K-neighbors classifier, which requires the number of neighbors as an input parameter; the wheat seed category can be determined by simply varying the number of neighbors. With this procedure, 92% accuracy is attained. Building a CART with scikit-learn yields 94% accuracy, and the scikit-learn Gaussian NB classifier attains 92% accuracy. We achieved 95% accuracy with the ensemble method. A machine-learning model for predicting the determining factors of seeds has thus been created. Table 8 compares the precision, recall, and F1-score of wheat seed dataset categorization using the K-neighbors classifier (KNN), classification and regression trees (CART), Gaussian NB (NB), and the ensemble method. Table 9 summarizes the accuracies of the various classifiers, and Figure 7 presents a chart comparing the accuracy of the various methods.

6. Conclusion

Machine learning approaches play a very important role in grain seed analysis and classification. The major challenge in seed classification is the very minute differences between different categories of seeds. Under this challenge, the accuracy of predictions is improved by utilizing the concepts of ensemble learning. This paper presented wheat seed classification based on seven independent features: area, perimeter, compactness, kernel length, kernel width, asymmetry coefficient, and kernel groove length. An ensemble machine learning approach with bagging and hard voting is utilized to best fit the classifier. Three machine learning algorithms, the K-nearest neighbors classifier (KNN), classification and regression trees (CART), and Gaussian NB (NB), are also implemented to compare the results. The results reveal that the accuracies calculated for the KNN, decision tree, and Naïve Bayes classifiers are 92%, 94%, and 92%, respectively. The highest accuracy, 95%, is achieved through the ensemble classifier, in which the decision is made based on hard voting. In the future, other classification algorithms can be explored to improve accuracy further.

Data Availability

Datasets related to this article can be found at “https://archive.ics.uci.edu/ml/datasets/seeds,” an open-source online data repository hosted at UCI Machine Learning Repository [42].

Conflicts of Interest

The authors declare that they have no conflicts of interest.