Abstract

Classification of imbalanced data is a challenging task that has captured considerable interest in numerous scientific fields by virtue of the great practical value of minority-class accuracy. Various methods for improving generalization performance have been developed to address this classification situation. Here, we propose a cost-sensitive ensemble learning method using a support vector machine as the base learner of AdaBoost for classifying imbalanced data. Because existing methods offer little guidance on how to precisely control the classification accuracy of the minority class, we developed a novel way to rebalance the weights of AdaBoost, which in turn influence the training of the base learner. This weighting strategy increases the sample weights of the misclassified minority while decreasing the sample weights of the misclassified majority until their weight sums are even in each round. Furthermore, we included the P-mean as one of the assessment metrics and discuss why it is necessary. Experiments were conducted to compare the proposed model with 10 comparison models on 18 datasets in terms of six different metrics. A statistical study of the comprehensive experimental findings verifies the efficacy and usability of the proposed model.

1. Introduction

Classification research is an essential field of study in data science. In balanced data classification, the support vector machine (SVM) [1] and other classification modeling approaches have been broadly discussed and used successfully in a variety of applications [2]. In situations involving imbalanced datasets, however, the traditional methods reach their limits in the practical application of classification [3]. They are designed to produce a model that fits the training data well; on an imbalanced dataset, this strategy ignores rare cases. For this reason, standard classifiers generally perform poorly. With the growth of data mining and data analysis, imbalanced class learning has become a hot topic, and many academics have conducted comprehensive studies on the subject [3, 4]. They characterized the nature of imbalanced data classification along three dimensions: concept complexity, training set size, and degree of imbalance between the two classes. They also show how imbalanced datasets can invalidate many conventional classifiers and offer some instances of how to tackle these difficulties. Ghosh et al. attempted to tackle the problem using deep learning systems in 2021, and they discovered that deeper architectures are useful on some data with specific structures in both artificial and real image datasets [5]. Thus far, many excellent research results and improved algorithms have been proposed to solve the problems in this field, covering data streams and big data analytics. Imbalance classification problems include binary imbalance classification and multiclass imbalance classification, the latter of which can be transformed into binary problems for solution. Binary imbalance classification problems are not only frequently encountered in real life but are also interesting problems in machine learning (ML). This class of problems is characterized by the fact that the number of samples on one side of the dichotomous dataset, called the minority class, is smaller than that on the other side, called the majority class. The minority class is of greater interest in classification tasks such as medical diagnosis [6], where the identification targets, patients with diseases, belong to the minority, and the consequences of misclassifying the minority are more severe than in the reverse case. The same holds for detecting images [7], fraud [8], managing risk [9], classifying text [10], and recognizing faces [11]. Binary classification problems with imbalanced data are prevalent compared with other issues in real life. As a basis for classification problems, solutions to binary classification problems can be extended to other classification problems, such as multiclass problems. Therefore, it is vital to study binary classification in rare class classification.

In the literature, operating on data or on algorithms are the two leading solutions to the problem of imbalanced datasets [12]. The analysis of minority class structure involves determining the imbalance rate and whether the distribution is of the overlapping type; this is the critical reason why the study is complicated. The classification of extremely imbalanced datasets is even more complicated [4]. Resampling techniques for data processing have been adopted to renew the class distribution, such as sampling less of the prevalent class, sampling more of the minority, or more complex techniques [13, 14]. The popular synthetic minority oversampling technique (SMOTE) is a simple and effective resampling method, which is also widely studied and used in combination with algorithms in the binary imbalance problem. However, resampling runs the risk of losing essential information of the majority, overfitting the minority, and making the preprocessed dataset unlike the raw data. Thus, in many cases, data-level methods are not studied alone but in combination with algorithm-level methods [15–17]. Algorithm-level methods train the classifier on the data without changing its distribution. Weighting or thresholding support functions or class likelihood estimates can produce better results than resampling the data and can be applied to any conventional classifier [4]. In addition, cost-sensitive learning and ensemble schemes are popular algorithms. A series of cost-sensitive versions of SVM have been proposed: for example, cost-sensitive SVM (CS-SVM), which extends the SVM loss function to achieve the objective [18]; SVM based on density weight (DSVM) and improved 2-norm-based density-weighted least squares SVM (IDLSSVM) for binary class imbalanced learning problems [19]; entropy-based fuzzy twin SVM (EFTWSVM), where fuzzy membership values are assigned based on the entropy values of samples [20]; entropy-based fuzzy least squares SVM (EFLSSVM) and entropy-based fuzzy least squares twin SVM (EFLSTWSVM) for class imbalanced datasets [21]; and other cost-sensitive algorithms [22–26]. However, an appropriate decision boundary cannot be found when the minority samples are sparse [27]. Boosting ensemble approaches overcome the challenges of learning from imbalanced data classes effectively [28, 29]. AdaBoost [30], the representative boosting strategy, enhances the classification performance of a model by minimizing the error probability. AdaBoost follows the output of the current classifier to modify the sample weight distribution for the next round: it reduces the weights of correctly classified instances and increases the weights of misclassified ones. However, it does not further distinguish between instances of different classes, so the weights of instances from different classes are increased or decreased in the same way, which is clearly an unsuitable strategy for the imbalanced classification problem. In an imbalanced problem, a good weighting strategy is one that can distinguish between instance categories, giving more weight to the relevant instances with high recognition importance. For this reason, researchers have developed a series of algorithms that adjust the weights of instances according to their category labels, such as AdaC [31], CSB1 and CSB2 [32], and AdaCost [33].

SVM and other algorithms can be embedded into a boosting process and have demonstrated exceptional capabilities [34–36]. Considering the binary classification of imbalanced data, we propose an SVM-based ensemble method that increases the focus on minority-class accuracy. Our approach operates in two aspects: (1) We ensure that the weights of the misclassified minority and majority samples are not changed in the same way, because we are concerned with whether the misclassified minority samples can be corrected in the next round. We propose a novel way to rebalance the instances' weights in each boosting round, although several methods have been proposed for implementing weight updating in AdaBoost [37]. Thus, the sums of the weights of the misclassified minority and majority samples are balanced with each other. (2) We use a different approach to vary the parameter C, a crucial parameter of the SVM algorithm, so that each sample receives a different cost related to its probability of misclassification, rather than simply dividing the samples into minority and majority categories. The key to this operation is to determine the cost items as a function of the weights of the AdaBoost framework at each iteration, retraining the SVM algorithm for the current iteration by changing the cost of each sample to affect the next iteration positively. Combining these two aspects establishes a link between the AdaBoost framework's weights and the SVM-based classifier's cost items. The proposed method uses the AdaBoost weight-adjustment process to solve the data imbalance problem. The rebalanced weights determine the cost items of the SVM learners at each iteration. Different SVM-based learners can be generated to improve the generalization performance based on the previous self-adjusting weights of the instances during the boosting process. Additionally, we have conducted experiments on various UCI ML repository [38] datasets. Apart from some routine evaluation indicators used for binary classification tasks, we use the P-mean to highlight the accuracy of minority classification and display the high efficiency of the presented algorithm.

The remainder of this paper is organized as follows. Section 2 summarizes the related literature on classifying imbalanced data in recent years. Section 3 introduces the background models, including the SVM model, the enhanced AdaBoost model, and AdaBoost with SVM-based and cost-sensitive SVM. Section 4 presents our method and procedure. Section 5 presents the results and comparisons with other methods based on different datasets and metrics. Section 6 presents the conclusions.

2. Related Work

The topic of imbalanced data classification remains a difficulty for researchers, and in addition to a number of successful strategies that have been investigated, researchers are still attempting to address this obstacle with the most recent methodologies. The binary imbalanced task, as the cornerstone of the imbalanced classification problem, arises in numerous real-life applications. Hazarika and Gupta presented a novel density-weighted twin SVM (DWTWSVM) for binary imbalanced data classification and used a density-weighted least squares twin SVM (DWLSTSVM) to boost the computational speed; the optimization problem is then turned into solving the 2-norm of slack variables with equality constraints [39]. Ensemble techniques have also been used to handle imbalanced binary classification [4]. We intend to address the imbalanced binary classification problem by altering the base classifier of the AdaBoost algorithm, for example, to SVM. The SVM algorithm has been considered one of the most effective classification methods since its introduction, and many excellent research results and improved algorithms have been proposed to solve the classification problem thus far [40]. Because of this vast potential, using SVMs as component classifiers is not a new attempt, and many researchers have long focused on combining SVM with ensemble learning methods. Sun et al. [2] analyzed the AdaBoost algorithm and developed three forms of inputting cost terms into the AdaBoost framework to achieve cost-sensitive purposes. The costs marked the uneven importance of identification between classes and participated in the weight update of AdaBoost. Based on the research of Sun and Kamel, Tao et al. [41] employed cost-sensitive SVMs as base classifiers, while the normal boosting process was modified into a cost-sensitive classifier by presenting a self-adaptive method for determining the cost weight sequences of misclassification. This method adapted the various contributions of minority samples in the SVM classifier at each round according to the previously obtained classifiers. In this way, different classifiers were generated, thus improving the generalization performance. Lee et al. [28] introduced a weight adjustment factor mechanism for a weighted SVM, used as the weak learner, for the imbalanced data classification target. Instances were classified into four categories based on location: bounded support vector (BSV), support vector (SV), positive noise, and others. They gave different adjustment factors to BSV, SV, and positive noisy instances. In the process of learning a weighted SVM, the weights of instances in the AdaBoost algorithm were multiplied by the adjustment factors. Wang and Sun [42] proposed an alternative method to improve AdaBoost based on the AD AdaBoost [43] algorithm, in which the imbalance ratio of the data, defined as b = Np/Nn, the ratio between the majority and minority class sizes, was taken as a factor. In addition to manipulating the parameter C and the weights, SVM-based ensemble algorithms have also been implemented by changing σ. Li et al. [44] proposed an AdaBoost-SVM method in which the sequence of trained radial basis function SVM (RBFSVM) component classifiers was inserted into the AdaBoost framework. The large σ-values at the beginning were reduced as the boosting iterations proceeded.
This allowed a range of RBFSVM component learners with adaptively different parameters, which would have better generalization than the AdaBoost method using SVM component classifiers with fixed σ-values.

Additionally, the high performance of SVM ensemble models has encouraged researchers to develop applications in different fields. For example, Sun et al. [16] proposed a dynamic approach that proved the efficiency of financial distress forecasting with two types of sample imbalance. This forecasting method combined the time-weighted strategy and AdaBoost with both the SVM-based integration algorithm and an oversampling technique. The results showed that the embedded integration model had significant advantages over the simple base classification model, although both the simple and embedded integration models improved the identification of rare financially distressed samples. In recent years, Liu et al. [45] presented an AdaBoost algorithm that shared SVM with a series of parameter methods to transfer knowledge from the source-task positive and unlabeled learning problem to the target task. The method combined the weak classifiers into a strong AdaBoost model for prediction. In addition, they considered the similarity of fuzzy examples in terms of minority and majority classes to refine the classifier's decision boundary. Yao et al. [46] solved the class imbalance problem in forecasting corporate credit risk in the supply chain context using a hybrid model that combined SVM and AdaBoost ensemble models with an artificial imbalance rate model and distinct feature selection approaches. They claimed that the proposed model mitigated the problem of class imbalance, which not only enhanced the variety of the sample distribution but also made the AdaBoost integration more stable and generalizable. Wei et al. [47] proposed a fault diagnosis algorithm to address the problem of poor accuracy of actuator failure identification under airplane closed-loop control. The algorithm extracted failure features using the ensemble empirical mode decomposition method and principal component analysis (PCA). Simultaneously, an adaptive SVM method was embedded in the AdaBoost framework to perform the classification operations. SVM-based ensemble methods have thus found wide application in various areas in recent years [48–52].

3. Background Models

The following is the basic form of the binary classification model. Suppose a binary classification training dataset is given in which each sample consists of an instance and a label: S = {(x1, y1), (x2, y2), ..., (xN, yN)}, with instance xi ∈ X ⊆ R^n and label yi ∈ Y = {−1, +1}, where X is the instance space and Y is the label collection. For consistency with the mathematical notation in the derivations and implementations of the algorithms, we use the positive and negative classes to refer to the minority and majority classes, respectively.

3.1. SVM Model

The SVM learning strategy maximizes the interval between the classes, which is called the margin. A wider margin corresponds to a more significant difference between the two types, making it easier to distinguish between them. Therefore, finding the optimal decision hyperplane corresponds to maximizing the margin between the two types of samples. The SVM model constructs a hyperplane of dimension m − 1 that can divide the data of N samples in m dimensions into two categories. For nonlinearly separable data, which cause problems for the algorithm, two approaches can solve this issue. The first is to map the low-dimensional data into a higher-dimensional space through a kernel function and use the SVM model in high dimensions to find an appropriate decision hyperplane. The second introduces slack variables that allow the margin constraints to be violated slightly. The soft-margin SVM can then be converted into the following optimization problem:

min_{w,b,ξ} (1/2)‖w‖² + C ∑_{i=1}^{N} ξ_i,
s.t. y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., N.  (1)

Equation (1) is generally transformed into its dual problem and then solved. After the learning problem of SVM is transformed into convex quadratic programming, it has a globally optimal solution, and several optimization algorithms exist for solving this problem quickly. Here, we use the sequential minimal optimization (SMO) algorithm. The essential principle behind this approach is that if all variable solutions fulfill the Karush-Kuhn-Tucker (KKT) conditions of the optimization problem, the solution is attained, because the KKT conditions are both necessary and sufficient here; otherwise, two variables are selected, the other variables are fixed, and a quadratic programming subproblem is created for these two variables. The subproblem has two variables: one that violates the KKT conditions most severely, while the other is determined automatically by the constraints. In this way, the SMO algorithm repeatedly decomposes the original problem into subproblems and solves them. From the solution, we can build the decision function as follows:

f(x) = sign(∑_{i=1}^{N} α_i* y_i K(x, x_i) + b*).  (2)
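For illustration, the following is a minimal sketch of training a soft-margin SVM with a Gaussian (RBF) kernel in Python using scikit-learn, whose SVC estimator solves the dual with an SMO-type solver; the dataset and the values of C and gamma are placeholders, not the paper's settings.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder imbalanced dataset (roughly 9:1).
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # C and gamma chosen for illustration
clf.fit(X_tr, y_tr)                            # dual solved by an SMO-type solver
print(clf.decision_function(X_te[:5]))         # the value inside sign(.) in (2)
print(clf.predict(X_te[:5]))                   # sign of the decision function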

3.2. Cost-Sensitive Support Vector Machine

Cost-sensitive classification learning approaches eschew the common strategy of assuming that all misclassifications cost the same and then designing classification algorithms that minimize the probability of error. In some of the aforementioned cases, this strategy is suboptimal; for example, one type of error is costlier than the others, or examples from different categories occur with different probabilities. Consequently, it is important to develop cost-sensitive extensions. Veropoulos et al. [53] presented a penalized, regularized, cost-sensitive SVM model to reduce the overwhelming impact of the negative class. According to the class labels of the training data, the samples are partitioned into two index sets: positive S+ = {i | (xi, yi) ∈ S, yi = 1, i = 1, ..., N} and negative S− = {i | (xi, yi) ∈ S, yi = −1, i = 1, ..., N}. As the set S is divided into the index sets S+ and S−, this model also introduces penalty factors C+ and C− for the positive and negative slack variables, respectively. In the optimization process, the positive samples retain higher penalty values than the negative samples. The SVM problem is implemented as follows:

min_{w,b,ξ} (1/2)‖w‖² + C+ ∑_{i∈S+} ξ_i + C− ∑_{i∈S−} ξ_i,
s.t. y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., N.  (3)
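As a rough sketch (not the authors' implementation), this idea is exposed in scikit-learn through per-class penalty weights: class_weight rescales C into the C+ and C− of equation (3). The penalty values below are assumptions for illustration only.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
y = 2 * y - 1  # relabel to {-1, +1} to match the notation in the text

# class_weight multiplies C per class: here C+ = 10 * C and C- = 1 * C,
# i.e., a higher penalty on positive (minority) slack variables.
clf = SVC(kernel="rbf", C=1.0, class_weight={1: 10.0, -1: 1.0})
clf.fit(X, y)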

3.3. Enhanced AdaBoost Model

The accuracy-oriented nature of the AdaBoost algorithm prevents it from achieving the desired results if it is applied directly to the classification of imbalanced data. Thus, several researchers have made various improvements to the AdaBoost framework. The enhanced AdaBoost model [42] improves AdaBoost through the weighted voting parameters α, which are determined by the overall error rate together with the accuracy on the minority class of primary interest. Let Km be the sum of the weights of all positive samples and K′m be the sum of the weights of the samples that are labeled positive and predicted to be positive; the ratio of these two quantities is denoted as γm.

After initializing the sample weights D1, we find the base classifier Gm(x) that minimizes the weighted classification error:

e_m = ∑_{i=1}^{N} ω_{mi} I(G_m(x_i) ≠ y_i).

We then compute the weight αm of the weak classifier:

Then, we update and renormalize ω_{m+1,i}, repeating the process until m reaches M, the preset number of iterations. The parameters k and β in the above expression are crucial to ensure that the enhanced AdaBoost boosts the classification efficiency on S+ while maintaining a low global error rate. The final output G(x) is a linear combination of the series of weak classifiers.

3.4. SVM-Based AdaBoost

In imbalanced classification problems, the generalization achieved by the SVM-based AdaBoost method is superior to that of a single SVM [44]. In this section, we describe the SVM-based AdaBoost model. The AdaBoost model is a forward stagewise additive model consisting of basic classifiers, with the exponential loss function L(y, f(x)) = exp[−y f(x)]. Let f_{m−1}(x) denote the additive combination of the first m − 1 base classifiers. The α and G(x) that minimize (7) are the αm and Gm(x) obtained by the AdaBoost algorithm:

(α_m, G_m(x)) = arg min_{α, G} ∑_{i=1}^{N} exp[−y_i(f_{m−1}(x_i) + αG(x_i))].  (7)

The learned additive model is equivalent to the final classifier of AdaBoost when the base classifiers are Gm(x):

G(x) = sign(f_M(x)) = sign(∑_{m=1}^{M} α_m G_m(x)).  (8)

Because multiple parameters are involved in both the SVM and AdaBoost algorithms, there are various combinations of SVM with ensemble learning. These include the cost-sensitive AdaBoost algorithms [2, 31], among which AdaC2 has shown good and relatively stable performance [2]. AdaBoost with heterogeneous SVMs can also work well [54]. The following is a brief example of an SVM cost-sensitive ensemble based on adaptive cost weights [41], a method that adaptively considers the various contributions of positives to the SVM classifier during boosting based on the previously obtained classifiers. This study did more work on positive class samples because of their greater importance, giving a higher cost value to misclassified positive instances than to all correctly classified positive instances. The approach also accounts for instance location, assigning larger cost values to borderline instances than to instances far from the boundary whenever correct classification of the positives is not at stake. By incorporating the costs into the weight update, the weights of positives with higher costs increase further when they are misclassified; otherwise, they decrease further. Initialize D1 = C+/Z0 for all positive instances and D1 = C−/Z0 for all negative instances, with Z0 = pC+ + nC−, where p and n are the numbers of instances in the positive and negative classes, respectively. C defaults to one. We calculate the weight-updating parameter as follows:

Further, we update and normalize sample weights:

In the update, the quantity associated with xi is calculated by the decision function of the SVM, and Zm represents the normalization value that keeps D_{m+1} a distribution; the final classifier is the sign of the weighted combination of the base classifiers, as in equation (8).

4. Proposed SVM-Based AdaBoost Ensemble

We use SVM as the fundamental weak learner and extend the weight design of the AdaBoost framework to cost-sensitive classification problems. This extension means rebalancing the sums of the weights of the misclassified positive and negative samples. We derive a cost-sensitive ensemble classification algorithm in which the weights of misclassified samples from the positive class are increased and the weights of misclassified samples from the prevalent class are reduced. This guarantees that more weight accumulates in the positive class to influence the training. Moreover, the update of the sample cost vector used by the SVM learner is indirectly determined by the associated weight term. With this strategy, the SVM classifier can adaptively consider the different contributions of each instance in each iteration based on the previous boosting process. Instead of focusing only on samples from the positive class, as other cost-sensitive algorithms do, our algorithm assigns a higher cost value to all misclassified cases during the SVM training process and further handles the misclassified samples from the positive class in the rebalancing phase. This allows our algorithm not only to significantly influence the formation of the classifier but also to address the cost-sensitivity problem. In this section, we detail the two improvements we made to AdaBoost with SVM and the theoretical justification for each part, giving the algorithmic description at the end. Figure 1 shows the procedure of the proposed approach. The specific procedure and the detailed computational equations for solving our dual problem are shown in Algorithm 1; an illustrative code sketch follows the listing.

(1) Input: Training samples S = {(x1, y1), (x2, y2), ..., (xN, yN)}.
(2) Initialize D1 = (ω11, ..., ω1i, ..., ω1N) with ω1i = 1/N, and C1 = (C11, ..., C1i, ..., C1N).
(3) For m = 1, 2, ..., M:
    (a) Use Cm to obtain an SVM classifier Gm(x) on the training dataset.
    (b) Calculate the coefficient αm according to equation (15).
    (c) Update the rebalanced Dm+1 using equations (16) and (17).
    (d) Update the value of Cm+1 using equation (14).
(4) Build the linear combination of basic classifiers.
(5) Output: The final classifier.
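As a schematic illustration of this loop, the Python sketch below follows the control flow of Algorithm 1. The paper's specific update rules, equations (14)-(17), are referenced above but not reproduced here; the sketch therefore substitutes the standard AdaBoost error and coefficient formulas and a simple rebalancing step that equalizes the weight sums of the misclassified positive and negative samples, as described in Section 1. It illustrates the structure, not the exact proposed updates.

import numpy as np
from sklearn.svm import SVC

def fit_rebalanced_adaboost_svm(X, y, M=25, gamma=1.0):
    # y takes values in {-1, +1}; initial costs are one (see Section 5.3).
    N = len(y)
    w = np.full(N, 1.0 / N)           # D_1
    cost = np.ones(N)                 # C_1
    learners, alphas = [], []
    for m in range(M):
        # (a) Train an SVM whose per-sample penalties follow the current costs.
        clf = SVC(kernel="rbf", gamma=gamma)
        clf.fit(X, y, sample_weight=cost)
        pred = clf.predict(X)
        err = np.sum(w[pred != y])
        if err <= 0 or err >= 0.5:    # standard guard, an assumption here
            break
        # (b) Standard AdaBoost coefficient as a stand-in for equation (15).
        alpha = 0.5 * np.log((1 - err) / err)
        learners.append(clf)
        alphas.append(alpha)
        # (c) Standard exponential update, then rebalance: equalize the weight
        # sums of misclassified positives and negatives (stand-in for (16)-(17)).
        w = w * np.exp(-alpha * y * pred)
        mis_pos = (pred != y) & (y == 1)
        mis_neg = (pred != y) & (y == -1)
        s_pos, s_neg = w[mis_pos].sum(), w[mis_neg].sum()
        if s_pos > 0 and s_neg > 0:
            target = 0.5 * (s_pos + s_neg)
            w[mis_pos] *= target / s_pos
            w[mis_neg] *= target / s_neg
        w = w / w.sum()
        # (d) Let the next round's costs follow the weights (stand-in for (14)).
        cost = w * N
    return learners, alphas

def predict_ensemble(learners, alphas, X):
    # Final classifier: sign of the weighted vote, as in equation (8).
    votes = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(votes)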
4.1. Cost Weights AdaBoost-SVM Model

One way to combat the problem of skewed datasets is to work on the penalty factor, that is, to give a larger penalty factor to the positive class with its small sample size, indicating that we value this part of the sample. The penalty factor C is not an optimization variable; the whole optimization problem is solved with a value of C that must be specified beforehand. After specifying this value, one obtains a classifier and then evaluates it with the test data; if the result is not satisfactory, the value of C is changed and the process is repeated. This is a parameter search process, separate from the optimization problem itself. Here, we determine C automatically from the updated weight values for a principled reason, not by random guessing.

Compared with the cost-sensitive SVM, we assign a different cost to every misclassified instance instead of one cost per class. To solve the problem efficiently and to apply the kernel technique more conveniently, we convert the primal problem into its dual problem and then solve it. The Veropoulos-style soft-margin SVM model [1, 55] is developed as follows:

min_{w,b,ξ} (1/2)‖w‖² + ∑_{i=1}^{N} C_i ξ_i,
s.t. y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., N.

C is a column vector, C = [C1, C2, ..., CN]^T, and ξ denotes the vector of corresponding slack variables of the training data, ξ = [ξ1, ξ2, ..., ξN]^T. We take the Gaussian kernel function, and the dual, kernelized formulation can be derived as follows:

max_λ ∑_{i=1}^{N} λ_i − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} λ_i λ_j y_i y_j K(x_i, x_j),
s.t. ∑_{i=1}^{N} λ_i y_i = 0, 0 ≤ λ_i ≤ C_i, i = 1, ..., N.

From the optimal solution λ*, we choose a component λ_j* with 0 < λ_j* < C_j and compute b* = y_j − ∑_{i=1}^{N} λ_i* y_i K(x_i, x_j) to construct the decision function f(x) = sign(∑_{i=1}^{N} λ_i* y_i K(x, x_i) + b*).

We divide all the classified samples into three types. The first type is the collection M1 = {i | Gm(xi) yi = 1}; samples in this set are all correctly classified and labeled as i ∈ M1. The second type is the collection M2 = {i | Gm(xi) = −1, yi = 1}, the positive samples classified incorrectly as negative, labeled as i ∈ M2. These samples are of primary concern, because we consider correctly classifying the positives a superior task, and we need to rebalance these instances. The third type is the collection M3 = {i | Gm(xi) = 1, yi = −1}, the negative samples classified incorrectly as positive, labeled as i ∈ M3. We consider them less important than the second type of misclassification; therefore, they are not reweighted again. M2 and M3 together contain all misclassified samples, and M1, M2, and M3 combine into the complete index set. Instead of using a constant parameter C, as in the standard SVM, to control the trade-off between maximizing the hyperplane margin and minimizing the deviation of instances in the objective function, we use a parameter vector C so that each slack variable ξi has its own weight Ci. Using a different Ci for each outlier means that we value each sample differently: we assign a smaller Ci to inconsequential samples than to instances that must not be misclassified. We update the cost terms continuously as the weights change during the boosting process. We call this cost-weighted AdaBoost-SVM model Ada-SVM. The cost item of the i-th sample at the (m + 1)-th round is given by equation (14) and follows the updated weights.
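For concreteness, a small sketch (with assumed illustrative values) of how the three sample types reduce to boolean masks over the base learner's predictions:

import numpy as np

# Illustrative labels and round-m predictions in {-1, +1} (values assumed).
y    = np.array([ 1,  1, -1, -1, -1,  1])
pred = np.array([ 1, -1, -1,  1, -1, -1])

M1 = pred == y                    # correctly classified samples
M2 = (pred == -1) & (y == 1)      # positives misclassified as negative
M3 = (pred == 1) & (y == -1)      # negatives misclassified as positive
assert np.array_equal(M1 | M2 | M3, np.ones_like(y, dtype=bool))  # full index set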

4.2. Rebalance Weights Model for Imbalanced Data

Unlike other cost-sensitive classification algorithms, we construct a new computational rule to assign weights to the instances based on the weight adjustment of the original AdaBoost framework. Here, Gm(xi) is the category of the i-th sample predicted by the base learner SVM at the m-th iteration, αm is the coefficient of the m-th base classifier, and ωmi is the weight of the i-th sample at the m-th iteration. Equation (15) is used to calculate αm.

At the (m +1)-th iteration, the weight of misclassified positive and negative instances is formulated as follows:

The weights of the other samples are formulated as follows, where Zm is the normalization factor:

Theorem 1. The training error bound for the final classifier of the rebalancing AdaBoost is as follows:

(1/N) ∑_{i=1}^{N} I(G(x_i) ≠ y_i) ≤ (1/N) ∑_{i=1}^{N} exp(−y_i f(x_i)) = ∏_{m=1}^{M} Z_m.

The proof is shown in Appendix A.

Theorem 2. The training error bound of the rebalancing AdaBoost for the binary classification problem is as follows:

∏_{m=1}^{M} Z_m = ∏_{m=1}^{M} [2√(e_m(1 − e_m))] = ∏_{m=1}^{M} √(1 − 4γ_m²) ≤ exp(−2 ∑_{m=1}^{M} γ_m²),

where γ_m = 1/2 − e_m.

The proof is shown in Appendix B.

5. Experiments

Our cost-sensitive ensemble learning method was driven by the goal of classifying all positive samples correctly while dealing effectively with imbalanced datasets. We illustrate the effectiveness of our method for the imbalanced data classification problem using experimental data. The proposed method was compared with ten other approaches along six metric dimensions on selected datasets with different imbalance ratios. Because our method is based on changing the AdaBoost base classifier to SVM, we compare it with the original AdaBoost and SVM algorithms, each without any modification, to show that combining the two methods yields better results than either of them alone. To compare performance with other improved algorithm-level classification methods, SVM-based cost-sensitive methods were chosen: the prorated cost method, which uses the inverse of the sizes of the positive and negative classes as the penalty constants for the different classes; CS-SVM; and SMOTE + SVM. We also compare our approach with other state-of-the-art ensemble and data-level resampling strategies, namely, Easy Ensemble [56], SMOTEBoost [57], and SMOTEBagging [57]. Ada-SVM is included to show that the rebalancing strategy works. The decision tree algorithm was used as a comparison algorithm to illustrate that our approach improved the recall of the positive samples without trading it for sacrifices in other metrics.

5.1. Description of Datasets

Fourteen datasets with different numbers of attributes and sample sizes were selected from the UCI repository to assess the behavior of the suggested method in handling classification tasks on imbalanced datasets. Several datasets contained missing attribute values, which were imputed with KNN. Table 1 lists a detailed description of the datasets used. All datasets have two output labels, denoting the positive and negative categories. The attributes column indicates the number of features in each dataset. The imbalance ratio (IR) is the ratio of the number of samples in the negative class to the number of samples in the positive class; generally, the larger the IR value, the more harmful it is to the performance of traditional classifiers. We selected datasets with IRs in the range of 1 to 50 for our experiments. We believe that the ratio of the number of samples to the number of features may also be an influential factor in the classification of imbalanced data, and we chose datasets with this ratio in the range of 6 to 547. For the SVM-based algorithms, the training set was normalized before training by rescaling each feature to a common scale, and the same rescaling, which does not distort the original distribution, was applied to the testing dataset. Preprocessing eliminated the numerical differences between the features; controlling the magnitude of each feature column within a specific range made the model predictions more accurate.
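A minimal preprocessing sketch along these lines, assuming scikit-learn with placeholder data (the paper does not specify the exact imputation and scaling settings):

import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix with one missing attribute value.
X_train = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 240.0], [2.5, 210.0]])
X_test  = np.array([[1.5, 230.0]])

X_train = KNNImputer(n_neighbors=2).fit_transform(X_train)  # KNN imputation
scaler  = StandardScaler().fit(X_train)   # rescaling fitted on training data only
X_train = scaler.transform(X_train)
X_test  = scaler.transform(X_test)        # same rescaling applied to the test set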

5.2. Evaluation Metrics

Specific evaluation metrics were introduced to observe the model's performance in each category and to evaluate the classifier's overall classification performance.

5.2.1. Confusion Matrix

The confusion matrix in Table 2 shows how the classification model makes mistakes when predicting. All cases are recorded in four categories: true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN).

5.2.2. Accuracy, Precision, Recall, and Specificity

Classification accuracy is the standard index for evaluating a classifier's performance and is suitable for datasets with balanced categories. For lopsided datasets, accuracy becomes unreliable. If the ratio of positive to negative examples is 1:99, a classifier that incorrectly predicts all positive examples as negative still attains an accuracy of 0.99, yet it cannot identify a single positive example. Precision, recall, and specificity are the metrics generally used for binary classification problems.

5.2.3. Receiver Operating Characteristic (ROC) and Area under the ROC Curve (AUC), G-Mean, F1-Score, and P-Mean

The ROC curve is a common evaluation index for the two-class classification problem; it visually compares the performance of different models on the same dataset. When comparing two models, if the ROC curve of one completely encloses that of the other, the former is superior in classification performance. The area under the ROC curve (AUC) can then be used to select the better model. AUC inherits, as a scalar, the insensitivity of the ROC curve to class distribution and is often used in these classification tasks [58].

In the imbalanced classification problem, no single indicator is comprehensive on its own; we need to combine metrics to measure the model's efficiency. The G-mean [7] indicator considers the accuracy of both the positive and negative samples. It differs from the overall accuracy and avoids the dominant influence of the negative samples on the classification performance. The F1-score, as a harmonic mean, is biased toward the smallest element in the list, allowing us to focus on the smaller of precision and recall. We paid attention to the positive samples and to metrics biased toward them, such as the recall of the positive samples, when comparing results and evaluating an algorithm. F1-scores are occasionally unreliable because they are influenced by the minimum value rather than the maximum value. Hence, we used the P-mean, the geometric mean of the recall and precision of the positive class, which is not as biased toward the minimum as the F1-score [7]. The reason for using the P-mean is to examine how well an algorithm performs in classifying positive samples, an aspect that has not been widely studied. We consider the P-mean an essential index for discriminating whether the classification of the positive samples is sufficient. Given the increasing emphasis on recall, the P-mean reconciles the values of precision and recall, evaluating the model's performance more comprehensively.
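For concreteness, a sketch of these metrics computed from confusion matrix counts; the P-mean here is taken as the geometric mean of the positive-class precision and recall, as described above, and the counts are illustrative:

import numpy as np

def imbalance_metrics(tp, fn, fp, tn):
    recall      = tp / (tp + fn)                  # positive-class accuracy
    precision   = tp / (tp + fp)
    specificity = tn / (tn + fp)                  # negative-class accuracy
    accuracy    = (tp + tn) / (tp + fn + fp + tn)
    f1     = 2 * precision * recall / (precision + recall)  # harmonic mean
    g_mean = np.sqrt(recall * specificity)        # balances both class accuracies
    p_mean = np.sqrt(precision * recall)          # geometric mean, positive class
    return dict(recall=recall, precision=precision, specificity=specificity,
                accuracy=accuracy, f1=f1, g_mean=g_mean, p_mean=p_mean)

print(imbalance_metrics(tp=8, fn=2, fp=5, tn=85))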

Figure 2 shows that in some circumstances the use of the P-mean is required. The Hepatitis dataset was divided by stratified sampling into two sections, with 80% of the data serving as the training set and 20% as the testing set. Multiple trials were run with our model and the SMOTEBoost model, and the results were displayed as box plots, with the performance of the two models under the F1-score and P-mean assessment criteria highlighted in magenta and blue, respectively. Evidently, the median of our proposed method is slightly lower than that of SMOTEBoost under the F1-score criterion but significantly higher under the P-mean criterion; in fact, our method outperforms SMOTEBoost on the goal of increasing the accuracy of small-class samples. Under all assessment criteria, the results produced by our technique are clearly more concentrated than those of the SMOTEBoost method, reflecting the improved stability of our suggested classification model.

5.3. Experimental Settings

All methods were compared with the proposed approach to produce a comprehensive evaluation. The results depicted in the figures and tables are all averages over the testing data following five-fold cross-validation. To avoid effects caused by the numbers of two-class instances available for training and to eliminate randomness before training, we adopted stratified sampling to ensure that the samples after the five-fold split had the same imbalance ratio as the entire sample set; four folds were then used as training data, with the remaining fold as the testing data. Cross-validation may result in a skewed test set in severe circumstances of data imbalance, leading to an inaccurate evaluation [5]. As a result, we also carried out balanced testing, with all assessments based on five-fold stratified cross-validation. We consider data with an IR greater than or equal to five to be highly imbalanced.
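A sketch of the stratified split, assuming scikit-learn and an illustrative 1:9 imbalance:

import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 10 + [-1] * 90)                  # illustrative 1:9 imbalance
X = np.random.default_rng(0).normal(size=(100, 4))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each fold preserves the imbalance ratio of the entire sample set.
    assert np.isclose((y[test_idx] == 1).mean(), 0.1)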

For a more comprehensive comparison of our method with other methods, we ran two independent groups of tests [3, 4]: (1) 17 datasets to compare our method with others in terms of the structure of classes and the sample-to-feature ratio, and (2) 12 new datasets, each with 1000 examples, derived by random undersampling from the Pageblock dataset with different IRs, aiming to explore the effect of IR on our method compared with others while keeping the remaining conditions constant. We examine the same IR levels as the datasets in the first group as well as higher IRs of 36, 40, and 50 at the same data size.

In the proposed SVM-based AdaBoost ensemble, three parameters must be prespecified: the parameter σ of the Gaussian kernel, the initial penalty vector C for the instances, and the number of iterations T. We employed a grid search to predefine the Gaussian width parameter of the SVM classifier to avoid confounding effects of the parameter choice on performance. We use σ = 1 for most datasets and σ = 10 for the others, including the Ecoli, Breast Cancer, Ionosphere, Teaching, and Pageblock datasets.

Conventionally, the optimization problem is solved with C held at a fixed value, and the cost factor C in cost-sensitive learning is determined for a specific reason, such as using the ratio of the numbers of samples between categories as the cost, as in the prorated cost method. Under some circumstances, the relevance of distinguishing distinct samples is described by cost items, and the cost of a specific sample depends on the properties of the particular situation; for example, in fraud detection, the cost of missing a specific fraud case is determined by the amount of money involved [8]. Here, we set the initial value to one, and the adjustment scale updates itself according to the changes in the iteration process.

The experiments on the datasets presented in Table 1 were conducted to investigate the influence of the iteration parameter T on the performance of the proposed approach, a critical factor for improving classification performance. The G-mean, F1-score, P-mean, AUC, accuracy, and recall in the panels of Figure 3 demonstrate that the algorithm converged and typically stabilized after T reached 25 for a fixed σ. Therefore, we set the parameter T for all ensemble learning methods to a constant value of 25.

To ensure consistency across experiments, all classification algorithms involving SVM components were implemented in uniform hand-written code. The decision tree, Easy Ensemble, SMOTEBoost, SMOTEBagging, SMOTE + SVM, and AdaBoost algorithms call packages in Python directly and use default settings. Because the base classifier SVM has uncertainties in the selection of support vectors, the results differ slightly between runs of our model. The implementation of the proposed method is publicly available in a GitHub repository (https://github.com/PChunyu/SVM-Adaboost-C). All methods in this paper were run on an Intel(R) Core(TM) i5-7500 CPU @ 3.40 GHz.

5.4. Experimental Results and Statistical Tests

In this section, we first show the effectiveness of our proposed rebalancing strategy and the proposed model. We then present the experimental results of the two groups along three dimensions: the structure of classes, the sample-to-feature ratio, and the IR. Finally, we performed hypothesis tests on all metrics between the proposed method and the others to illustrate the validity of our model.

To confirm the validity of our proposed rebalancing weights approach, we compared the classification performance of Ada-SVM and the proposed method, which differ only in whether the weights are rebalanced. Figure 4 shows the trends for accuracy and recall. Our approach's performance was superior to that of Ada-SVM in terms of recall. After almost 20 iterations, the proposed method stabilized the recall rate at one, with an accuracy rate the same as or lower than that of Ada-SVM. Our approach improved recall at the expense of accuracy on some datasets, an inevitable consequence of enhancing the recall rate. The rebalancing scheme thus contributed to the proposed method's better generalization performance than Ada-SVM's in terms of recall.

The Yeast 5 dataset, which has multiple feature dimensions, was projected into two-dimensional space using PCA to observe the decision boundaries and demonstrate the effectiveness of the proposed ensemble. Figure 5 shows a set of classifications using the different methods. Among the classification boundaries produced by the presented techniques, the proposed model classifies best, with the highest overall accuracy while ensuring a recall value of one. The decision boundary of simple non-cost-sensitive classifiers, such as the no-cost SVM, AdaBoost, and SMOTE + SVM, is essentially a flat plane; although SMOTE + SVM is used as a classification method for minority class accuracy, a cost-insensitive regular SVM classifier is still trained after SMOTE generates the new data. These classifiers perform better either in accuracy alone or in recall alone while ignoring the other. Most of the ensemble algorithms have irregularly delineated boundaries, which can more accurately identify regions with minority samples. However, the more detailed delineation of the minority by SMOTEBoost and SMOTEBagging evidently does not guarantee the desired results on the test set, whereas Easy Ensemble achieves them. Under normal circumstances, without any adjustment, the decision boundary of a cost-sensitive classifier on lopsided data is curved toward the positive class [55]. The CS-SVM and prorated cost panels demonstrate this warp of the decision boundary toward the positive class under the influence of the imbalanced training data, particularly the positives. This is the reason CS-SVM and prorated cost lead to poor generalization performance on the testing data. Ada-SVM, No-Cost SVM, decision tree, SMOTEBoost, SMOTEBagging, and AdaBoost, regardless of accuracy, do not achieve high recall values compared with the other techniques. The remaining methods could classify the positive samples correctly, but with lower accuracy than the proposed method. Our method produced decision boundaries that do not favor the positive class as in other cost-sensitive classification algorithms; they tend toward the negative class. This is because our proposed method aggravates the misclassified positive samples and improves the base classifier after readjusting the cost terms of all misclassified samples. In general, the panel of the suggested approach showed the best fit among all the techniques: the maximum accuracy was achieved while the positive recall reached one.

5.4.1. First Group Datasets

Table 3 displays the experimental results of the first group; the bolded values reflect the best values across all techniques. Except on the WDBC, Breast Cancer, Ionosphere, and Yeast5 datasets, where it does not achieve the greatest recall value, our technique has the highest recall values on all of the remaining 13 datasets. On these four datasets, the gap between our technique's recall and the greatest recall value is not large: 0.9297 (vs. 0.9576 for Easy Ensemble), 0.9713 (vs. 0.9834 for CS-SVM), 0.9603 (vs. 0.9923 for CS-SVM), and 0.8194 (vs. 0.9750 for prorated cost), respectively. It is also worth noting that our technique obtains a recall of one on the Wine3, SPECT, and Ecoli_im datasets as an average value after five-fold cross-validation, which indicates that the greatest recall value was obtained in each of the five cross-validation trials, demonstrating the excellent efficiency of our technique. Some algorithms also showed significant accuracy orientation in the experiments: the more lopsided the data, the worse the performance of the no-cost SVM, decision tree, and AdaBoost algorithms. As one of the most classic and simple methods, Easy Ensemble has the best performance on every metric on the WDBC dataset and the highest performance on the G-mean, F1-score, accuracy, and AUC metrics on the Ecoli_cp and Breast Cancer datasets. The algorithms combined with SMOTE post excellent figures in all aspects except the recall metric. In some circumstances, CS-SVM obtained very impressive results as one representative of cost-sensitive classifiers at the algorithm level. However, CS-SVM trains the classifier with fixed positive and negative category costs, causing it to perform better only on specific datasets, as does the prorated cost method. This restricts them to merely marginally improved outcomes on particular datasets.

Although our method achieves the best level among all methods on the WPBC, Hepatitis, and German datasets, the recall levels on these datasets are low overall compared with the other datasets, where the recall values are all close to one. This has much to do with the structure of classes: a good classification can be obtained from nonoverlapping distributions even with canonical classifiers [4]. We selected the Ecoli_pp and German Credit datasets, which have consistent imbalance levels and close sample-to-feature ratios, as illustrated in Figure 6. After projecting the training and test set sample points of these two datasets into the two-dimensional plane separately, we find that the two datasets differ significantly in the distribution of sample points. On both the testing and training sets, the distributions of the positive and negative class samples of the Ecoli_pp dataset overlap less, while the opposite is true for the German Credit dataset, which is why all methods classify Ecoli_pp much better than German Credit on almost all metrics. Clearly, the lower the sample-to-feature ratio, the more features there are relative to the sample size, and too many features can easily lead to overfitting; additionally, in some models, prediction becomes worse when the sample-to-feature ratio is too low. However, this is not absolute, because it is difficult to hold the other characteristics of a dataset at exactly the same level when discussing the sample-to-feature ratio. Hepatitis, for example, has a high ratio, but because of structural overlap, it is more impacted by the category structure, and even with a high sample-to-feature ratio, its classification scores remain rather low. By comparing the results, we found that a sample-to-feature ratio between 10 and 20 is a desirable range; for example, for the Wine3 and SPECT datasets with the highest recall, the ratios are 14 and 12, respectively.

Most importantly, our method obtained a considerable advantage in positive-class recall at various IRs. The recall results are depicted in Figure 7. The value on each bar is the sum of the recall values of the corresponding algorithm over all datasets. The proposed approach had the maximum recall sum (15.93), followed by Easy Ensemble (14.57).

Table 4 displays the Shapiro-Wilk test findings; not all data followed the normal distribution at the 5% level. For the statistical analysis, we therefore used the parametric t-test and the nonparametric Wilcoxon test independently to examine whether there were significant differences between the proposed strategy and the other methods in the experiment.
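A sketch of this testing procedure with SciPy, assuming paired per-dataset scores for two methods (the values below are placeholders):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ours  = rng.uniform(0.9, 1.0, size=17)   # illustrative per-dataset recall values
other = rng.uniform(0.7, 1.0, size=17)

print(stats.shapiro(ours - other))       # Shapiro-Wilk normality check
print(stats.ttest_rel(ours, other))      # parametric paired t-test
print(stats.wilcoxon(ours, other))       # nonparametric Wilcoxon signed-rank test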

The recall of the proposed model does not merely look higher than that of the other methods in the visualizations; the statistical tests show that it is significantly higher than that of each counterpart. The results are listed in Table 5, where bold indicates significance at the 0.05 level. The statistical results demonstrate that the proposed SVM-based ensemble learning method performed better than a single no-cost standard SVM in all metrics except AUC and accuracy. The t-test and Wilcoxon test results show that our model was significantly better than the other models in recall at the 5% level, while it was not statistically significantly different from them in the other metrics; the only exception is accuracy, where it differs significantly from SMOTEBoost and SMOTEBagging, because our method sacrifices accuracy to some extent in exchange for higher recall.

5.4.2. Second Group Dataset

To observe how our method performs under different IRs, Table 6 presents the second group of experimental findings. These datasets come from resampling the same dataset, differing only in IR while the other characteristics are held the same. Our method maintains the highest level even at IRs of 36 and 50; however, the recall value then only stays at approximately 0.9.

The higher the IR of a dataset, the worse the classification performance tends to be for all methods. Our method does not perform as well on the second group of datasets as on the first, even at the same IRs. Overall, the classical Easy Ensemble and SMOTEBagging algorithms show a series of better performances on the Pageblock dataset. This also implies that our approach seems to be more effective on small datasets. However, we found that the recall of the proposed model presented in Table 7 is significantly better than that of accuracy-oriented algorithms such as the no-cost SVM, decision tree, and AdaBoost. It also outperforms the recall of cost-sensitive classifiers such as Ada-SVM, prorated cost, and SMOTE + SVM. The proposed model shows no statistically significant differences in recall from the state-of-the-art algorithms CS-SVM, Easy Ensemble, SMOTEBoost, and SMOTEBagging.

The proposed, Ada-SVM, CS-SVM, prorated cost, no cost, decision tree, easy ensemble, SMOTEBoost, SMOTEBagging, SMOTE + SVM, and AdaBoost algorithms used to classify the Ecoli series dataset take approximately 100 s, 100 s, 30 s, 30 s, 5 s, 0.03 s, 50 s, 3 s, 3 s, 20 s, and 1 s, respectively. The time required for the Pageblock dataset of size 1000 is 1700 s, 1700 s, 200 s, 200 s, 40 s, 6 s, 160 s, 7 s, 7 s, 120 s, and 3 s.

6. Discussion and Conclusion

Unlike other classification methods that only assign different costs to different categories to achieve cost sensitivity, the proposed model builds a process that enables the SVM to self-adaptively update the cost value of each sample across iterations. The misclassified positive samples are assigned higher cost values, while the other misclassified negative and correctly classified instances are given lower cost items to decrease their influence on training. The optimization of each base classifier is reached through these automatically updated cost values, which follow the self-adaptive weight vector determined by our new weighting mechanism.

Through theoretical justification and empirical study, the proposed approach achieved higher recall than the others, with no differences in the other classification measures at the 0.05 level. The findings demonstrate that, when dealing with imbalanced datasets, the suggested method outperforms the alternatives statistically significantly. Through extensive experiments on datasets with different IRs, our method guarantees good results in classifying the minority class on both high- and low-IR datasets. On some datasets, the mean recall of the proposed method reached one after five-fold cross-validation, while the set of other metrics was maintained at an average level. We also found that our model underperformed some models in overall accuracy. This phenomenon stems from the goal we pursue: we need higher recall rather than overall accuracy. This is momentous for the practical issue of reducing the identification overhead when working with minority classes. Because our method can achieve a recall value of one on some datasets, this property can be an effective aid in practical work, greatly reducing the burden of manual work in identifying minority instances, such as in medical diagnosis, where overall accuracy becomes less important than recall.

The advantage of the P-mean in assessing the classification impact on a skewed dataset is not fully evident in this study; however, the P-mean can be used to appraise cost-sensitive classifiers, and it is worth investigating whether the advantage of this assessment metric can be proved in additional experimental instances. Furthermore, we analyzed and confirmed the results of previous studies on class structure and imbalance ratio: they can indeed have a serious impact on classification performance. The classification effectiveness of all classifiers is reduced when the sample overlap is excessively high; even so, our method retains the best recall performance among the many classifiers. As an important cause of classification failure on imbalanced datasets, the structure of classes can be studied in more depth in future work. A high IR likewise brings bad classification performance.

Our model has its limitations. For instance, it works better on small datasets; this may be attributed to the fact that SVM, the base classifier of our proposed model, is more suitable for classification problems on small datasets. As for the longer time that our algorithm needs for classification compared with others, continued updates to the program code, without changing the model, could considerably reduce the computational cost of our method. Substantial research can be conducted in the future, including parameter evaluation and improvement. The impact of kernel functions on imbalanced classification and on imbalanced multiclass classification problems is also worth investigating. Because of its enormous application potential, this challenging topic will continue to receive extensive attention.

Appendix

A. Training Error Bounds for the AdaBoost

Proof.

B. Training Error Bounds for the Binary Classification Problem AdaBoost

Proof. According to the Taylor expansions of exp(x) and (1 − x)^{1/2} at x = 0, we have √(1 − 4γ_m²) ≤ exp(−2γ_m²). If there exists γ > 0 with γ_m ≥ γ for all m, we obtain the bound exp(−2Mγ²) on the training error.

Data Availability

The datasets used in this study are from the UCI ML repository (http://archive.ics.uci.edu/ml/datasets).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the 2022 Heilongjiang University Graduate Student Innovative Research Project (No. YJSCX2022-250HLJU).