Abstract

Dry beans are the most widely grown edible legume crop worldwide, with high genetic diversity. Crop production is strongly influenced by seed quality. So, seed classification is important for both marketing and production because it helps build sustainable farming systems. The major contribution of this research is to develop a multiclass classification model using machine learning (ML) algorithms to classify the seven varieties of dry beans. The balanced dataset was created using the random undersampling method to avoid classification bias of ML algorithms towards the majority group caused by the unbalanced multiclass dataset. The dataset from the UCI ML repository is utilised for developing the multiclass classification model, and the dataset includes the features of seven distinct varieties of dried beans. To address the skewness of the dataset, a Box-Cox transformation (BCT) was performed on the dataset’s attributes. The 22 ML classification algorithms have been applied to the balanced and preprocessed dataset to identify the best ML algorithm. The ML algorithm results have been validated with a 10-fold cross-validation approach, and during validation, the CatBoost ML algorithm achieved the highest overall mean accuracy of 93.8 percent, with a range of 92.05 percent to 95.35 percent.

1. Introduction

Dry beans are self-pollinated legumes grown for human consumption. Beans are a significant crop on a global scale and are popular with both farmers and consumers. Dry beans account for nearly 50 percent of the grain legumes consumed directly by humans in the majority of developing countries [1]. Beans are a staple food in Sub-Saharan Africa, where they are consumed by more than 200 million people [2]. A system of quality control makes sure that approved seed meets national and global quality benchmarks. For the majority of food products, visual characteristics are the primary criterion used by consumers when making purchasing decisions [3]. Like other legume species, common beans show wide variation in growth patterns, physical features (size, shape, and shading), maturity, and ability to grow and adapt [4, 5]. Sorting and classifying bean seeds manually is time-consuming, inefficient, and tedious, particularly when working with large production volumes. Human inspectors are usually in charge of checking raw materials, and it is difficult to standardise the inspectors’ findings. These considerations reaffirm the importance of objective measurement systems. As a result, automatic grading and classification methods are required.

Recent technological advances have greatly helped researchers in this field. Computer vision systems (CVSs) are being used for quality control and have recently begun to be used as objective measurement and evaluation systems [6–9]. CVS technology, which is primarily camera and computer based, has been considered for assessing the sensory characteristics of agricultural products. Such a system consists of a light source, an image acquisition device, and computer peripherals and software. Digital repositories make the resulting data widely available, together with various attributes.

A dataset in which each output class (or target class) is represented by an equal number of input samples is known as a balanced dataset. Imbalanced training data has a major negative impact on real-time performance [10]. The majority of the reported studies used a target class with an uneven distribution of observations, i.e., an imbalanced dataset.

The main contribution of this research is to develop an unbiased ML-based multiclass classification model that identifies the dry bean variety with the best accuracy using the balanced dry bean dataset available at the UCI ML digital repository [11]. Using the preprocessed balanced dataset, the dry bean types “Dermason,” “Sira,” “Seker,” “CAli,” “Bombay,” “Horoz,” and “Barbunya” have been identified without losing any features available in the dataset. The 22 ML algorithms have been evaluated with 10-fold cross-validation to identify the best ML multiclass dry bean classification model. To make the model more accurate, the BCT was used to reduce the skewness of the dataset’s attributes, bringing their distributions close to a normal distribution.

Kilic et al. [12] used computer vision to develop a classification system for bean varieties. The system consisted of hardware and software. The hardware was developed to capture a standard image of the samples. The software performed segmentation, morphological operations, and colour quantification of the samples. A total of 69 samples were used in their artificial neural network (ANN) model. The system’s overall performance in classifying beans was 90.56 percent.

Using an infrared hyperspectral imagery method that works in the wavelength range of 390–1050 nm, Sun et al. [13] examined a quick and nondestructive method for categorising black bean variants. The primary component of the image was used to extract 16 textural and 6 morphological features by using gray level co-occurrence matrix analysis. Hasan et al. [14] examined various categories of dry beans and used a deep neural network-based method to categorise them. The outcomes indicate that their approach was 93.44 percent accurate and had an F1-score of 94.57 percent when applied to a dataset of seven varieties of dry beans.

Giza3, Giza461, Misr1, Nobarya1, and Sakha1 are the five varieties of Egyptian faba-bean seeds studied by Abdulwahed et al. [15]. This method uses morphological features and an ANN to grade and classify the quality of Egyptian faba-bean seeds. Based on 15 physical traits of the seeds, artificial neural networks separated faba beans into different types.

Araújo et al. [16] presented a computer-based visual inspection system for beans that used correlation-based multishape granulometry to locate each grain in an image and to determine its size and eccentricity. Using this method, their system correctly located 29,993 out of 30,000 grains, even when there were a lot of “glued” grains in the image.

De Oliveira et al. [17] used an ANN as the transformation model and a Bayes classifier to identify coffee bean types such as whitish, cane green, green, and bluish-green. The neural network models achieved a generalisation error of 1.15 percent, and the Bayesian classifier identified all samples.

Gope and Fukai [18] discussed the assessment of the Raspberry Pi 3 system’s capacity in low-income countries for classifying peaberries and normal beans. They discovered that due to hardware constraints in the case of large-sized images, the Raspberry Pi 3 could not complete computation with linear support vector machines (SVMs) and k-nearest neighbors (kNNs).

Arboleda et al. [19] created the classification model for identifying coffee bean species. From 195 training images and 60 testing images, significant coffee bean morphology attributes such as bean area, perimeter, equivalent diameter, and percentage of roundness were extracted. The coffee beans were automatically classified using ANN and kNN. ANN obtained classification scores of 96.66 percent.

Koklu and Ozkan [11] used CVS to develop a multiclass classification of dry beans. The CVS-derived bean images were subjected to segmentation and feature extraction stages, yielding a total of 16 features from the grains: 12 dimensional features and 4 shape forms. With 10-fold cross validation, multilayer perceptron (MLP), SVM, kNN, and decision tree (DT) classification models were developed, achieving overall classification rates of 91.73 percent, 93.13 percent, 87.92 percent, and 92.52 percent for MLP, SVM, kNN, and DT, respectively. Table 1 shows the methodology and performance of various classification approaches for bean variety classification.

In this article, the proposed multiclass classification model uses the balanced dataset with 16 features and 7 varieties of dry beans. To avoid classification bias of ML algorithms towards the majority group due to the unbalanced multiclass dataset, each dry bean type has 522 instances (522 × 7 = 3654 in total) with 16 features in the processed dataset.

3. Exploratory Data Analysis and Methodology

The proposed multi-class classification model is depicted in Figure 1. The model’s initial stage is data preprocessing. The second stage of the model is the Box-Cox transformation, and the final stage is ML model development.

3.1. Data

The data science process is a methodical way to address a data problem. In most scenarios, a data science project will have to go through five critical stages: problem definition, data processing, modelling, evaluation, and implementation. The dry bean dataset for this research was obtained from the UCI ML repository, which is accessible at [11]. It is also available as a supplementary file with this article. The dataset contains information about the images taken with a high-resolution camera of 13,611 grains of seven different registered dry beans. From the grains, a total of 16 features were extracted. This study examined seven distinct varieties of dried beans, with market conditions dictating features such as aspect, shape, category, and structure. The dataset is available in .csv format for the dry bean varieties “Dermason,” “Sira,” “Seker,” “CAli,” “Bombay,” “Horoz,” and “Barbunya” with a total of 13,611 instances. Table 2 shows quantile and descriptive statistics for 16 features of the dry bean dataset.

3.2. Data Preprocessing

Preprocessing strategies improve the performance of classifiers [20]. Information extraction (IE) is the process of extracting structured content such as entities, interactions, facts, and terms, as well as other kinds of information, which aids the data analysis pipeline in preparing the data for the study [21]. The distribution of the dry bean varieties in the dry bean dataset is shown in Figure 2. Figure 2(a) shows the percentage distribution of the seven dry bean varieties, and Figure 2(b) shows the individual dry bean variety count in the raw dataset. It is observed that the dry bean type “DERMASON” has the maximum share of 26.1 percent and the dry bean type “BOMBAY” has the minimum share of 3.84 percent. The most frequently encountered problem in data quality is the absence of feature values in some entries. The missing values for each instance have been checked. After dropping the duplicate instances, the total number of instances in the dataset falls from 13,611 to 13,543. Classification is a process that can be applied to structured or unstructured data. After dropping the duplicate instances, the class-wise counts of the dry bean dataset are 3546, 2636, 2027, 1860, 1630, 1322, and 522 for DERMASON, SIRA, SEKER, HOROZ, CALI, BARBUNYA, and BOMBAY, respectively. Except for the target “Class,” all feature data types have been converted to float.
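The duplicate removal and class-wise counting described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the toy rows and the helper name `drop_duplicates` are assumptions made for the example and are not taken from the article, which does not state the tooling used. The reported class counts are copied from the text to check that they sum to 13,543.

```python
from collections import Counter


def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Toy rows with hypothetical values: (feature_1, feature_2, Class)
rows = [
    (28395.0, 610.291, "SEKER"),
    (28734.0, 638.018, "SEKER"),
    (28395.0, 610.291, "SEKER"),  # exact duplicate of the first row
    (44830.0, 814.955, "BARBUNYA"),
]
unique_rows = drop_duplicates(rows)
class_counts = Counter(row[-1] for row in unique_rows)

# Class-wise counts reported in the text after dropping duplicates.
reported = {
    "DERMASON": 3546, "SIRA": 2636, "SEKER": 2027, "HOROZ": 1860,
    "CALI": 1630, "BARBUNYA": 1322, "BOMBAY": 522,
}
total = sum(reported.values())
print(len(unique_rows), dict(class_counts), total)
```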

3.2.1. Creation of a Balanced Dataset

Because of the unbalanced multiclass dataset, learning algorithms will be biased towards the majority class. In contrast, the minority class is typically more significant from the perspective of data mining, as it may contain valuable information despite its rarity. When encountering such disparities, researchers should design an effective model capable of handling the bias. This is referred to as learning from unbalanced data [22]. In terms of balancing distributions, there are methods that create new objects for the minority group (oversampling) and methods that eliminate instances from the majority group (undersampling) [23]. Overfitting may result from the creation of new instances for the minority group. As a result, the random undersampling method is used in this article to reduce the majority groups of instances in the dry beans dataset to the size of the minority dry bean group. All of the dry bean types were brought uniformly to 522 instances using the random undersampling method. This can be observed in Figure 3. To develop the model, a balanced dataset with 3654 instances has been considered. Each bean variety has 522 instances.

The steps followed in the creation of a balanced dataset are as follows:
(i) Step 1: The majority and minority classes in the dataset have been identified. The majority class in the preprocessed dataset is “DERMASON,” and the minority class is “BOMBAY,” with 3546 and 522 instances, respectively.
(ii) Step 2: The number of instances of “BOMBAY” is the lowest of all classes. It is decided that the maximum number of instances is 522 for each variety of bean.
(iii) Step 3: The random samples of other bean varieties have been chosen.
(iv) Step 4: All the samples have been concatenated, and the balanced dataset has been created.
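The four steps above can be sketched in a few lines. This is a minimal standard-library illustration, not the authors' implementation: the function name `random_undersample`, the fixed seed, and the synthetic label list (built from the class counts reported in Section 3.2) are assumptions made for the example.

```python
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Keep n_min randomly chosen samples per class, where n_min is the
    size of the smallest class. Returns the sorted indices that are kept."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Steps 1 and 2: the minority class fixes the per-class sample size.
    n_min = min(len(idxs) for idxs in by_class.values())
    kept = []
    # Step 3: draw n_min random samples (without replacement) from each class.
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_min))
    # Step 4: concatenate the per-class samples into one balanced index list.
    return sorted(kept)


counts = {
    "DERMASON": 3546, "SIRA": 2636, "SEKER": 2027, "HOROZ": 1860,
    "CALI": 1630, "BARBUNYA": 1322, "BOMBAY": 522,
}
labels = [name for name, n in counts.items() for _ in range(n)]
kept = random_undersample(labels)
balanced = Counter(labels[i] for i in kept)
print(len(kept), dict(balanced))
```

Because the minority class is sampled at its full size, all 522 “BOMBAY” instances are retained, while the other classes lose instances at random.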

3.3. Box-Cox Transformation

When handling a skewed outcome, investigators often use a log transformation to normalise the data before applying standard statistical tests, such as the t-test, linear regression, etc. Nevertheless, log-transformed data will not always be normal. In such instances, BCT can be implemented to normalise skewed data [24]. Initially, a log transformation was applied to the dry bean dataset features, but it failed to reduce the negative skewness. As shown in Figures 4(a)–4(p), the BCT was then applied to all of the features of the dataset to transform the skewed data into a normal distribution. For each attribute, the figure on the left shows the distribution before BCT, and the figure on the right shows the distribution after BCT. The skewness can be found at the top right corner of the figure. Y represents the dependent (continuous) variable, while X represents the independent variables (1, x1, x2,…, xk). The BCT [24] used to transform the skewed distribution into a normal distribution, without preserving the original scale, is given in equation (1). The maximum likelihood technique is commonly used to determine the parameter lambda (λ).

$$Y^{(\lambda)} = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^{2}), \qquad Y^{(\lambda)} = \begin{cases} \dfrac{Y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log Y, & \lambda = 0 \end{cases} \tag{1}$$

where X is the covariate matrix, which includes the intercept, $\beta$ is a regression coefficient vector, $\sigma^{2}$ is the variance of the random error, and $\varepsilon$ is a random error.
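To make the transformation concrete, the sketch below applies a Box-Cox transform to a synthetic right-skewed sample and chooses λ by maximising the profile log-likelihood over a coarse grid. It is a simplified illustration under stated assumptions: the model is intercept-only, the grid search stands in for a proper optimiser, and the lognormal sample is synthetic rather than taken from the dry bean features. In practice a library routine such as `scipy.stats.boxcox` would normally be used.

```python
import math
import random


def boxcox(y, lam):
    """Box-Cox transform of strictly positive values y for a given lambda."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def log_likelihood(y, lam):
    """Profile log-likelihood of lambda (intercept-only model, up to a constant)."""
    n = len(y)
    z = boxcox(y, lam)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in y)


def skewness(y):
    """Sample skewness: third central moment over variance to the power 1.5."""
    n = len(y)
    mean = sum(y) / n
    m2 = sum((v - mean) ** 2 for v in y) / n
    m3 = sum((v - mean) ** 3 for v in y) / n
    return m3 / m2 ** 1.5


def fit_lambda(y):
    """Pick the lambda in [-3, 3] (step 0.05) with the highest log-likelihood."""
    grid = [i / 20 for i in range(-60, 61)]
    return max(grid, key=lambda lam: log_likelihood(y, lam))


rng = random.Random(42)
y = [rng.lognormvariate(0.0, 0.75) for _ in range(2000)]  # synthetic, right-skewed
lam = fit_lambda(y)
z = boxcox(y, lam)
print(f"lambda={lam:.2f} skew before={skewness(y):.2f} after={skewness(z):.2f}")
```

The Box-Cox transform requires strictly positive inputs, so any feature containing zeros or negative values would need to be shifted first.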

3.4. Machine Learning Model
3.4.1. Training and Test Dataset

The training dataset is the set of data used to construct the model, which contains known features and a target. The created model will also need to be validated against another well-known dataset known as the test dataset or validation dataset. To meet this challenge, the entire known dataset can be divided into training and a test set [25]. The dry bean categorical classes, namely “SIRA,” “BOMBAY,” “DERMASON,” “BARBUNYA,” “HOROZ,” “CALI,” and “SEKER” were converted into integer types as 1–7, respectively. The training and test sets have been split in an 80 : 20 ratio, with 2923 and 731 instances with 16 features, respectively.
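The label encoding and the 80 : 20 split can be sketched as follows. This is a minimal standard-library illustration; the helper name `train_test_split_indices`, the seed, and the choice to round the test-set size up are assumptions, although rounding up does reproduce the 2923 and 731 instances reported above.

```python
import math
import random

# Class order and integer codes 1-7 as stated in the text.
CLASS_ORDER = ["SIRA", "BOMBAY", "DERMASON", "BARBUNYA", "HOROZ", "CALI", "SEKER"]
LABEL_TO_INT = {name: code for code, name in enumerate(CLASS_ORDER, start=1)}


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * n_samples)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(3654)
print(len(train_idx), len(test_idx))
```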

3.4.2. Machine Learning Algorithm (MLA) Selection

A model built with a single method may not offer the best prediction for a specific dataset. Each machine learning technique has its own constraints, and creating a model with significant accuracy is difficult. The 22 MLAs were therefore applied to the balanced dataset to compare their accuracy, which helps to bring out a better predictive model. The 10-fold cross validation has been performed, and the mean accuracy of 19 MLAs has been listed in Table 3. Ensemble methods [26] such as AdaBoost classifier, Bagging classifier, and extra tree classifier, generalised linear models [27] like logistic regression, passive aggressive classifier, Ridge classifier, stochastic gradient descent classifier, and perceptron, Naive Bayes models [28] like Bernoulli and Gaussian MLA, kNN, and SVM algorithms [29], tree-based methods [30] such as DT classifier and extra tree classifier, discriminant analysis methods [31] such as linear and quadratic discriminant analysis, and Gaussian process MLAs have been evaluated with 10-fold cross-validation. Figure 5 displays the mean accuracy of MLA performance with 10-fold cross validation in descending order. Among the 19 MLAs, logistic regression provides the highest test accuracy of 92.69 percent, and the Bernoulli Naive Bayes ML classifier gives the lowest accuracy of 12.77 percent.

From the initial screening during validation, it is observed that the XGBoost, RF, and CatBoost algorithms offer greater precision. Therefore, in the following sections, the performance of these three algorithms with an 80 : 20 balanced dry bean dataset and with 10-fold cross validation is described.

3.4.3. Random Forest Algorithm

DT modelling is an important part of RF. It is applied to several samples of the original data obtained by the bootstrap method. The bootstrap samples are drawn from the original data, and each sample has the same number of data points as the original data. The RF [32] constructs multiple DTs and merges them to produce more precise and stable predictions. The node’s importance is calculated as follows:

$$ni_{j} = w_{j} C_{j} - w_{\mathrm{left}(j)} C_{\mathrm{left}(j)} - w_{\mathrm{right}(j)} C_{\mathrm{right}(j)}$$

where $ni_{j}$ = the importance of node j, $C_{j}$ = node j’s impurity value, $w_{j}$ = the weighted sample size arriving at the node j, and right(j) and left(j) are the child nodes from the right and left split on node j, respectively.

An individual attribute’s feature importance is

$$fi_{i} = \frac{\sum_{j:\ \text{node } j \text{ splits on attribute } i} ni_{j}}{\sum_{k \in \text{all nodes}} ni_{k}}$$
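The node importance and feature importance calculations can be illustrated on a tiny hand-built tree. The sketch below assumes Gini impurity as the impurity measure and uses a hypothetical two-split tree over 10 samples and 3 classes; neither the tree nor the class counts come from the article. In a random forest, the per-tree importances would additionally be averaged over all trees.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def node_importance(counts, left, right, n_total):
    """Weighted impurity of a node minus the weighted impurity of its children."""
    w = sum(counts) / n_total
    w_left = sum(left) / n_total
    w_right = sum(right) / n_total
    return w * gini(counts) - w_left * gini(left) - w_right * gini(right)


# Hypothetical tree: each internal node is
# (split feature index, class counts at node, left child counts, right child counts).
N_TOTAL = 10
nodes = [
    (0, [4, 3, 3], [4, 0, 0], [0, 3, 3]),  # root splits on feature 0
    (1, [0, 3, 3], [0, 3, 0], [0, 0, 3]),  # right child splits on feature 1
]

raw = {}
for feature, counts, left, right in nodes:
    raw[feature] = raw.get(feature, 0.0) + node_importance(counts, left, right, N_TOTAL)

total = sum(raw.values())
feature_importance = {feature: value / total for feature, value in raw.items()}
print(feature_importance)
```

Here the root contributes 0.66 − 0.6 × 0.5 = 0.36 and the second split contributes 0.6 × 0.5 = 0.30, so the two features receive importances of 0.36/0.66 and 0.30/0.66.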

3.4.4. Extreme Gradient Boost

XGBoost [33] is an implementation of the gradient boosting machine (GBM), a well-known algorithm for supervised learning. It is suitable for both classification and regression tasks.

If DS is the set of data containing “m” attributes, then for “n” occurrences

$$DS = \{(x_{i}, y_{i})\}, \quad |DS| = n, \quad x_{i} \in \mathbb{R}^{m}, \quad y_{i} \in \mathbb{R}$$

Let ŷi be the ensemble tree model’s target value constructed using the equation

$$\hat{y}_{i} = \sum_{k=1}^{K} f_{k}(x_{i}), \quad f_{k} \in F$$

Here K denotes the model’s total number of trees and $f_{k}$ denotes the model’s kth tree. Classification and Regression Trees (CART) serve as the base learner for Gradient Boosted Trees, which is a popular machine learning algorithm for both classification and regression problems. Each $f_{k}$ belongs to the functional space F, which is the set of feasible CARTs.
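The additive form of the prediction, a sum over K trees, can be illustrated with a deliberately simplified gradient boosting loop. The sketch below is not XGBoost: it uses one-split regression trees (stumps), squared-error loss, and a fixed shrinkage factor on a one-dimensional toy problem, and it omits the regularisation and second-order terms that XGBoost adds. All names and data are assumptions made for the example.

```python
def fit_stump(x, residuals, shrinkage):
    """Fit a one-split regression tree to the residuals by least squares.
    Leaf values are scaled by the shrinkage factor before being returned."""
    best = None
    for t in sorted(set(x))[:-1]:  # candidate thresholds keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda xi: shrinkage * (left_mean if xi <= t else right_mean)


def boost(x, y, n_trees=20, shrinkage=0.5):
    """Build the trees f_1..f_K sequentially, each fitted to current residuals."""
    trees = []
    pred = [0.0] * len(x)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals, shrinkage)
        trees.append(tree)
        pred = [pi + tree(xi) for xi, pi in zip(x, pred)]
    return trees


def predict(trees, xi):
    """Ensemble prediction: the sum of the outputs of all trees."""
    return sum(f(xi) for f in trees)


def mse(trees, x, y):
    return sum((predict(trees, xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


x = [float(i) for i in range(10)]
y = [xi ** 2 for xi in x]  # toy regression target
trees = boost(x, y)
print(mse(trees[:1], x, y), mse(trees, x, y))
```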

3.4.5. Cat Boost Classifier

Categorical boosting (CatBoost) is a Yandex-developed open-source boosting library [34]. CatBoost implements oblivious DTs (binary trees in which the same feature is used to create the left and right splits at every node of a given level), thereby limiting the split at each level to a single feature, which aids in reducing prediction time. In the dataset “D” of dry beans, every instance has “m” features in a vector “x” and a target dry bean class type, y.

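Mathematically, the target estimate of the ith categorical feature of the kth element of the dry bean dataset D can be expressed as follows:

$$\hat{x}^{i}_{k} = \frac{\sum_{j:\ \sigma(j) < \sigma(k)} \mathbb{1}\left[x^{i}_{j} = x^{i}_{k}\right] \cdot y_{j} + a\,p}{\sum_{j:\ \sigma(j) < \sigma(k)} \mathbb{1}\left[x^{i}_{j} = x^{i}_{k}\right] + a}$$

when a > 0. The indicator function $\mathbb{1}[\cdot]$ returns the value 1 when the ith component of CatBoost’s input vector xj is equal to the ith component of input vector xk. The parameters “a” and “p” (prior) prevent underflow in the equation, and $\sigma$ is a random permutation.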
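A small worked example may help to show how such an ordered target statistic behaves. The sketch below is a plain-Python illustration, not CatBoost’s internal code: the function name, the toy categorical feature, the binary targets, and the values a = 1 and p = 0.5 are all hypothetical. Each sample is encoded using only the samples that precede it in the permutation.

```python
def ordered_target_statistic(categories, targets, sigma, a=1.0, p=0.5):
    """Encode each sample's category from the targets of earlier samples only.

    sigma[k] is the position of sample k in the random permutation; sample j
    contributes to sample k when sigma[j] < sigma[k] and the categories match.
    """
    n = len(categories)
    encoded = []
    for k in range(n):
        numerator = a * p
        denominator = a
        for j in range(n):
            if sigma[j] < sigma[k] and categories[j] == categories[k]:
                numerator += targets[j]
                denominator += 1.0
        encoded.append(numerator / denominator)
    return encoded


categories = ["a", "b", "a", "a", "b"]  # toy categorical feature
targets = [1, 0, 1, 0, 1]               # toy binary targets
sigma = [0, 1, 2, 3, 4]                 # identity permutation for readability
encoded = ordered_target_statistic(categories, targets, sigma)
print(encoded)
```

The first occurrence of each category has no preceding samples, so it falls back to the prior p; later occurrences move towards the running mean of the earlier targets in the same category.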

3.5. Results and Discussion

The use of diverse bean varieties in dry bean cultivation actually inhibits the production of uniform crops. As a result, the resulting product, which includes a set of dried bean species, incurs economic losses. To address this issue, the purpose of this study is to distinguish the seven classes of dry beans cultivated in Turkey, as determined by the Turkish Standards Institute (TSE). The dry beans dataset has been processed through the developed model. The confusion matrices of the three MLAs, namely RF, XGBoost, and CatBoost, are shown in Figure 6. Confusion matrices enable a more detailed visualisation of results and a comparison of actual and predicted values. In Figure 6, “SIRA,” “BOMBAY,” “DERMASON,” “BARBUNYA,” “HOROZ,” “CALI,” and “SEKER” are denoted as 0, 1, 2, 3, 4, 5, and 6. The correctly predicted sample numbers can be found on the diagonal of the confusion matrix. The misclassified instances appear in the off-diagonal cells of the confusion matrix. For example, in Figure 6(c), for the dry bean variety “SIRA,” the correctly identified test set instances were 84. Seven test instances were identified as “DERMASON,” two instances were identified as “BARBUNYA,” three instances were identified as “HOROZ,” two instances were identified as “CALI,” and two instances were identified as “SEKER.” Figure 7 shows the receiver operating characteristic (ROC) curves for the RF, XGBoost, and CatBoost ML classification algorithms. The ROC curve plots the true positive rate against the false positive rate. In ROC, the area under the curve (AUC) represents the degree or measure of separability. It shows the model’s capability of distinguishing between dry bean classes. It is observed that the CatBoost algorithm provides an AUC value of 0.99 for the “SIRA” dry bean type and an AUC value of 1 for the other dry bean types, namely “BOMBAY”, “DERMASON”, “BARBUNYA”, “HOROZ”, “CALI”, and “SEKER”.
Table 4 provides performance metrics such as precision, recall, and F1-score for the three ML algorithms, and Table 5 provides the ML model accuracy with an 80 : 20 dataset. The accuracy of the model has been improved by about 1.49 percent using the balanced dataset and the CatBoost ML algorithm.
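The relation between a confusion matrix and the per-class precision, recall, and F1-score can be sketched as follows. The “SIRA” row uses the counts quoted in the text for Figure 6(c), with the “BOMBAY” entry assumed to be zero because the text does not list it; only the recall for “SIRA” can be derived from that single row. The 3-class matrix is hypothetical and serves only to exercise the full calculation.

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1) per class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        actual = sum(cm[c])                         # row sum: true members of c
        predicted = sum(cm[r][c] for r in range(n))  # column sum: predicted as c
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# "SIRA" row from the text, columns in the order SIRA, BOMBAY, DERMASON,
# BARBUNYA, HOROZ, CALI, SEKER (BOMBAY count assumed to be 0).
sira_row = [84, 0, 7, 2, 3, 2, 2]
sira_recall = sira_row[0] / sum(sira_row)

# Hypothetical 3-class confusion matrix.
toy_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
toy_metrics = per_class_metrics(toy_cm)
print(sira_recall, toy_metrics[0])
```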

Among the 22 MLAs tested, it is observed that the CatBoost ML classifier provides the best performance. Table 6 shows the performance comparison with the existing method. The CatBoost ML classifier performs well as compared to the existing method under balanced instances for seven dry bean types. CatBoost ML excels at solving classification problems with heterogeneous data.

3.5.1. Model Performance with Cross-Validation (CV)

The three algorithms RF, XGBoost, and CatBoost have been validated with 10-fold cross validation with 90 : 10 data split. In cross-validation with k folds, the original dataset is randomly subdivided into “k” mutually exclusive subgroups or “folds” (F1, F2, …Fk) of roughly equal size. There are k training and testing iterations. In iteration “i” the test set is partition Fi, while the remaining segments, subgroups, or folds are used to train the model collectively [29]. Table 7 and Figure 8 show the 10-fold cross validation accuracy of the three MLAs. In 10-fold cross validation, the CatBoost ML algorithm achieves the highest overall mean accuracy of 93.8 percent, with a range of 92.05 percent to 95.35 percent.
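The fold construction described above can be sketched as follows. This is a minimal standard-library illustration: the function names, the seed, and the placeholder `evaluate` callback are assumptions, and no classifier is trained here. For the 3654 balanced instances and k = 10, the folds contain 365 or 366 instances each, which corresponds to the 90 : 10 split per iteration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices and partition them into k mutually exclusive folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        folds.append(indices[start:start + size])
        start += size
    return folds


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Run k train/test iterations; return (mean, min, max) of the scores.

    evaluate(train_idx, test_idx) is a user-supplied callback that trains a
    model on train_idx and returns its score on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k, min(scores), max(scores)


folds = k_fold_indices(3654)
fold_sizes = [len(fold) for fold in folds]

# Placeholder callback: returns the test share of each iteration, not an accuracy.
mean_share, min_share, max_share = cross_validate(
    3654, lambda train, test: len(test) / (len(train) + len(test))
)
print(fold_sizes, round(mean_share, 3))
```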

4. Conclusion

Classification of dry bean seed varieties is critical for seed uniformity and quality assurance. Compared to human inspectors, the system possesses two significant advantages: it produces reproducible and objective sample classification, and it excludes the possibility of human inspectors misclassifying specimens. Initially, a log transformation was applied to the dry bean dataset features, but it failed to reduce the negative skewness. The BCT was then applied to all of the features of the dataset to transform the skewed data into a normal distribution. A model constructed using a single method may not provide the best forecast for a given dataset. Each machine learning technique has its own set of restrictions, making it challenging to create a model with substantial accuracy. The accuracy of the 22 MLAs was therefore determined on the balanced dataset, which supports the development of a more accurate predictive model. The accuracy of the model has been improved by about 1.49 percent using the balanced dataset and the CatBoost ML algorithm. The developed models’ high success rates across all metrics indicate that they are effective at classification. The overall mean accuracy of the system on the balanced dataset is 93.8 percent for the CatBoost ML model. The results indicate that the proposed CatBoost ML classifier can be used effectively to classify a variety of dry bean variants. Additionally, the developed framework can be applied to various kinds of dry beans from various regions. The model is developed without losing any features from the dataset. The ML model can be upgraded further by combining ML, deep learning, and novel algorithms.

Data Availability

The dataset is available in a publicly accessible database.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Supplementary Materials

Dry Bean Data Set. (Supplementary Materials)