Abstract

The objective of this work is to propose ten efficient scaling techniques for the Wisconsin Diagnosis Breast Cancer (WDBC) dataset using the support vector machine (SVM). These scaling techniques are efficient for the linear programming approach. The SVM with the proposed scaling techniques was applied on the WDBC dataset. The scaling techniques are, namely, arithmetic mean, de Buchet for three cases (p = 1, 2, ∞), equilibration, geometric mean, IBM MPSX, and Lp-norm for three cases (p = 1, 2, ∞). The experimental results show that the equilibration scaling technique outperforms the benchmark normalization scaling technique used in many commercial solvers. Finally, the experimental results also show the effectiveness of the grid search technique, which finds the optimal parameters (C and gamma) for the SVM classifier.

1. Introduction

Scaling techniques play an important role in the convergence speed of machine learning algorithms, especially in classification and regression tasks. Using an efficient scaling technique makes the training of algorithms faster. There is an integrative relationship between the linear programming approach and the metaheuristic approach with respect to scaling techniques. A scaling technique is defined as a mathematical formula that transforms the elements of a matrix so that they have similar magnitudes. In linear programming, the scaling techniques are applied to the objective function, the coefficient matrix of the inequalities, and the coefficients of the constants. In the metaheuristic approach, in contrast, the scaling techniques are applied to the matrix whose rows represent the observations and whose columns are the attributes of the dataset.

A dataset mostly contains nonzero elements whose values may differ by several orders of magnitude; such a matrix is said to be badly scaled. Scaling techniques can be used to handle this issue. They are applied before the classifier in order to improve the classification accuracy on the dataset.

A comparison among the following scaling techniques, the Curtis and Reid [1] scaling technique, the arithmetic mean scaling technique, the Wolfe [2] scaling technique, the geometric mean scaling technique, and the equilibration scaling technique, was proposed by Tomlin [3] on 6 test linear programming problems of different sizes. Another study was conducted by Larsson [4], who proposed and compared the entropy, de Buchet [5], and Lp-norm [6] scaling techniques on 135 randomly generated problems of different dimensions. He deduced that the entropy scaling method outperforms the other scaling techniques. Elble and Sahinidis [7] presented new experimental results from a comparison among the following scaling techniques: IBM MPSX, entropy, arithmetic mean, binormalization, geometric mean, Lp-norm, equilibration, and de Buchet on benchmark problems from Netlib. Scaling and solution times, the number of iterations to the solution, and the maximum condition number were the evaluation metrics of their study. They deduced that the equilibration method outperformed the other techniques. Ploskas and Samaras [8] introduced experimental results for three algorithms, MATLAB’s revised simplex method, the exterior point simplex method, and the interior point algorithm, using the geometric mean, equilibration, and arithmetic mean scaling techniques. They deduced that the equilibration scaling technique outperformed the other techniques and that scaling is important to both the interior point algorithm and the revised simplex method; the exterior point simplex method, on the contrary, is scaling invariant [9]. Ploskas and Samaras [10] proposed new experimental results comparing arithmetic mean, de Buchet for three cases (p = 1, 2, ∞), equilibration, geometric mean, IBM MPSX, and Lp-norm for three cases (p = 1, 2, ∞). They deduced that arithmetic mean, equilibration, and geometric mean outperformed the other scaling techniques according to the execution time. In [11], Ploskas and Samaras present, in the chapter “Scaling Techniques,” a complete list of the scaling techniques together with illustrative examples. Ploskas and Samaras [12] state clearly that MATLAB’s GPU environment (in 2014) did not offer sparse utilities. They were also the first to present a GPU-based simplex implementation that showed speedups on benchmark instances; in their implementation, they used the most efficient scaling techniques.

In this work, ten efficient scaling techniques are proposed for the Wisconsin Diagnosis Breast Cancer (WDBC) dataset using the support vector machine (SVM). The SVM with the proposed scaling techniques was applied on the WDBC dataset. The experimental results show that the equilibration scaling technique outperforms the benchmark normalization scaling technique.

The rest of this paper is organized as follows. The support vector machine classifier is described in Section 2. In Section 3, detailed descriptions of the new scaling techniques are presented. The experimental design, which includes the data description, experimental setup, measure for performance evaluation, and grid search method, is introduced in Section 4. In Section 5, the experimental results are presented and discussed. In Section 6, conclusions and future work are given.

2. Support Vector Machine Classifier

The support vector machine (SVM) is a machine learning model originally developed by Vapnik [13, 14]. The SVM is based on the Vapnik–Chervonenkis (VC) theory and the structural risk minimization (SRM) principle [13, 15]. The main objective of the SVM is to find a hyperplane in an N-dimensional space (N: the number of features) that distinctly classifies the data points, as shown in Figure 1. Convex quadratic programming is used for the SVM in order to avoid local minima [13, 16].

In linear classification, the hyperplane is placed at the largest possible distance from the nearest vectors of the two classes. In the case of nonlinear classification, the problem is mapped to a linear classification problem in a high-dimensional space [17], as shown in Figure 2.

Let us consider a binary classification task: suppose that {(x_i, y_i)}, i = 1, …, n, is a labeled training dataset such that x_i ∈ R^N is the feature vector and y_i ∈ {−1, +1} is the class label (negative or positive) of training compound i. The optimal hyperplane can then be defined by equation (1): w^T x + b = 0.

Here, w is the weight vector, x is the input feature vector, and b is the bias. w and b must satisfy both inequality (2) and inequality (3) for all elements of the training set: w^T x_i + b ≥ +1 for y_i = +1 (inequality (2)) and w^T x_i + b ≤ −1 for y_i = −1 (inequality (3)).

The aim of training an SVM classifier is to determine w and b so that the hyperplane separates the data and maximizes the margin 2/‖w‖. Vectors x_i for which |w^T x_i + b| = 1 are termed the support vectors.

There are cases in which the two classes can be linearly separated and other cases in which they cannot. We can overcome this problem by transforming the original input space into a higher-dimensional feature space in which the two classes become linearly separable. An alternative use for the SVM is the kernel method, which enables us to model higher-dimensional, nonlinear relationships [18]. In a nonlinear problem, a kernel function can be used to add dimensions to the raw data, thus making it a linear problem in the resulting higher-dimensional space. Moreover, kernel functions allow certain calculations to be performed faster than they would be if computed explicitly in the high-dimensional space. There are many kernel functions, for example, but not limited to, the linear kernel K(x_i, x_j) = x_i^T x_j and the Gaussian kernel K(x_i, x_j) = exp(−‖x_i − x_j‖^2 / (2σ^2)), where σ is the predefined parameter controlling the width of the Gaussian kernel (and, for a polynomial kernel, p denotes the order of the polynomial). The SVM classification accuracy is improved by a proper setting of the model parameters [19]. It is important to choose these parameters in advance: C, the kernel parameter (σ or p), and the kernel function.
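As a small illustration (not taken from the paper), the following sketch evaluates the linear and Gaussian kernels on two toy feature vectors; the width parameter sigma is a hypothetical choice used only for demonstration.

```python
import numpy as np

def linear_kernel(x_i, x_j):
    # K(x_i, x_j) = x_i . x_j
    return np.dot(x_i, x_j)

def gaussian_kernel(x_i, x_j, sigma=1.0):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x_i - x_j) ** 2) / (2.0 * sigma ** 2))

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 0.5, 1.0])
print(linear_kernel(x1, x2))    # 6.0
print(gaussian_kernel(x1, x2))  # ~0.027
```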

The C parameter is considered a regularization or generalization parameter. It governs the trade-off between achieving a minimum training error and minimizing the norm of the weight vector. Tuning C is a very important step in optimizing the SVM. The parameter C imposes an upper bound on the norm of the weight vector, which implies that there are multiple hypothesis classes indexed by C. Increasing the C parameter increases the complexity of the hypothesis class; if we increase C slightly, we can still form all of the linear models that were available before [19]. Theory for choosing C is not well developed, so most researchers use cross-validation.
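A minimal sketch (my own illustration, not the authors' code) of selecting C by cross-validation, as the text suggests most researchers do. It uses scikit-learn's bundled copy of the WDBC data for convenience; the candidate C values are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # sklearn's copy of WDBC
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    scores = cross_val_score(SVC(kernel="rbf", C=C, gamma="scale"), X, y, cv=10)
    print(f"C={C:>7}: mean 10-fold accuracy = {scores.mean():.4f}")
```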

3. Scaling Techniques

Here, we introduce the mathematical notation of ten scaling techniques in addition to the normalization scaling techniques with ranges [0, 1] and [−1, 1]. First of all, we introduce the mathematical preliminaries shown in Table 1.

The scaled matrix is expressed as RAS, where R and S are diagonal matrices holding the row and column scaling factors, respectively. All scaling techniques proposed in this section first apply row scaling and then column scaling, so the matrix after full scaling (row and column) is RAS.
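The following sketch (illustrative values only) shows that multiplying by the diagonal matrices R and S is equivalent to multiplying each row and column elementwise by its scaling factor.

```python
import numpy as np

A = np.array([[1.0, 200.0, 0.003],
              [50.0, 0.1, 7.0]])
r = np.array([0.5, 2.0])           # hypothetical row factors
s = np.array([1.0, 0.01, 100.0])   # hypothetical column factors

R, S = np.diag(r), np.diag(s)
scaled = R @ A @ S                       # matrix form R A S
same = (A * r[:, None]) * s[None, :]     # equivalent elementwise form
print(np.allclose(scaled, same))         # True
```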

3.1. Arithmetic Mean Scaling Technique [11]

First, each row (instance) is divided by the arithmetic mean of the absolute values of the nonzero elements in that row (instance), as formulated in equation (7).

Second, each column (attribute) of the row-scaled matrix is divided by the arithmetic mean of the absolute values of the nonzero elements in that column (attribute), as formulated in equation (8).
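A sketch of arithmetic mean scaling as described above (my own implementation, not the authors' code): each row, and then each column of the row-scaled matrix, is divided by the arithmetic mean of the absolute values of its nonzero elements.

```python
import numpy as np

def nonzero_abs_mean(v):
    nz = np.abs(v[v != 0])
    return nz.mean() if nz.size else 1.0   # leave all-zero rows/columns unchanged

def arithmetic_mean_scaling(A):
    A = A.astype(float).copy()
    r = np.array([1.0 / nonzero_abs_mean(row) for row in A])
    A = A * r[:, None]                                        # row scaling
    s = np.array([1.0 / nonzero_abs_mean(col) for col in A.T])
    return A * s[None, :], r, s                               # column scaling on the row-scaled matrix
```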

3.2. de Buchet Scaling Technique [4]

Equation (9) formulates the de Buchet scaling method, which is based on the relative divergence; it is computed over the nonzero elements of A, and the parameter p is a positive integer. Here, there are the following three cases:

Case p = 1: in this case, equation (9) reduces to equation (10). Equation (11) represents the row scaling factor of the matrix A, and equation (12) represents the column scaling factor of the matrix A scaled by the row factors.

Case p = 2: in this case, equation (9) reduces to equation (13). Equation (14) represents the row scaling factor of the matrix A, and equation (15) represents the column scaling factor of the matrix A scaled by the row factors.

Case p = ∞: in this case, equation (9) reduces to equation (16).

Equation (17) represents the row scaling factor of the matrix A.

Equation (18) represents the column scaling factor of the matrix A scaled by the row factors.

The last case of the de Buchet (p = ∞) scaling technique is equivalent to the geometric mean scaling method that will be introduced later.
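As a sketch, the de Buchet row factors for the three cases can be written in the closed forms below. These expressions are my own derivation from minimizing the de Buchet measure over the nonzero absolute values of each row (they are not copied from equations (11)–(18)); column factors follow by applying the same formulas to the columns of the row-scaled matrix.

```python
import numpy as np

def de_buchet_row_factors(A, p):
    factors = []
    for row in A:
        a = np.abs(row[row != 0])
        if a.size == 0:
            factors.append(1.0)                        # leave all-zero rows unchanged
        elif p == 1:
            factors.append(np.sqrt(np.sum(1.0 / a) / np.sum(a)))
        elif p == 2:
            factors.append((np.sum(a ** -2) / np.sum(a ** 2)) ** 0.25)
        else:  # p = infinity: geometric mean of the largest and smallest magnitudes
            factors.append(1.0 / np.sqrt(a.max() * a.min()))
    return np.array(factors)
```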

3.3. Equilibration Scaling Technique [11]

The largest element in absolute value is the cornerstone of this scaling method. Each row of the matrix A is divided by the largest absolute value in that row. Then, each column of the row-scaled matrix is divided by the largest absolute value in that column. The range of the final scaled matrix is [−1, 1].
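A sketch of equilibration scaling as described above (my own implementation): divide each row by its largest absolute value, then divide each column of the row-scaled matrix by its largest absolute value, so every entry ends up in [−1, 1].

```python
import numpy as np

def equilibration_scaling(A):
    A = A.astype(float).copy()
    row_max = np.max(np.abs(A), axis=1)
    row_max[row_max == 0] = 1.0          # keep all-zero rows unchanged
    A = A / row_max[:, None]
    col_max = np.max(np.abs(A), axis=0)
    col_max[col_max == 0] = 1.0
    return A / col_max[None, :]
```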

3.4. Geometric Mean Scaling Technique [11]

First, each row (instance) is divided by the geometric mean of the absolute values of the nonzero elements in that row (instance), as formulated in equation (19).

Second, each column (attribute) of the row-scaled matrix is divided by the geometric mean of the absolute values of the nonzero elements in that column (attribute), as formulated in equation (20).
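A sketch of geometric mean scaling as described above (my own implementation): each row, and then each column of the row-scaled matrix, is divided by the geometric mean of the absolute values of its nonzero elements.

```python
import numpy as np

def nonzero_geometric_mean(v):
    nz = np.abs(v[v != 0])
    return np.exp(np.mean(np.log(nz))) if nz.size else 1.0  # 1.0 leaves all-zero rows/columns unchanged

def geometric_mean_scaling(A):
    A = A.astype(float).copy()
    A = A / np.array([nonzero_geometric_mean(row) for row in A])[:, None]    # row scaling
    A = A / np.array([nonzero_geometric_mean(col) for col in A.T])[None, :]  # column scaling
    return A
```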

3.5. IBM MPSX Scaling Technique [11]

The IBM MPSX scaling method is a combination of the geometric mean and equilibration scaling methods. First, geometric mean scaling is performed four times or until the convergence criterion in relation (21) holds; the criterion is computed over the nonzero elements of A and is controlled by a positive integer parameter. Then, the equilibration scaling method is applied. The IBM MPSX scaling method was introduced by Benichou et al. [20].
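A sketch of the combination described above (it reuses the geometric_mean_scaling helper from the Section 3.4 sketch). Relation (21) is not reproduced in this text, so the stopping test below, the variance of the logarithms of the nonzero entries falling below a tolerance, is only a hypothetical stand-in for it.

```python
import numpy as np

def ibm_mpsx_scaling(A, tol=0.1, max_passes=4):
    A = A.astype(float).copy()
    for _ in range(max_passes):
        A = geometric_mean_scaling(A)      # from the Section 3.4 sketch
        logs = np.log(np.abs(A[A != 0]))
        if np.var(logs) < tol:             # hypothetical stand-in for relation (21)
            break
    # final equilibration pass
    row_max = np.max(np.abs(A), axis=1, keepdims=True)
    A = A / np.where(row_max == 0, 1.0, row_max)
    col_max = np.max(np.abs(A), axis=0, keepdims=True)
    return A / np.where(col_max == 0, 1.0, col_max)
```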

3.6. Lp-Norm Scaling Technique [11]

Equation (22) formulates the Lp-norm scaling method, which is computed over the nonzero elements of A. Here, there are the following three cases:

Case p = 1: in this case, equation (22) reduces to equation (23). Equation (24) represents the row scaling factor of the matrix A, and, similarly, equation (25) represents the column scaling factor of the matrix A.

Case p = 2: in this case, equation (22) reduces to equation (26). Equation (27) represents the row scaling factor of the matrix A, and, similarly, equation (28) represents the column scaling factor of the matrix A.

Case p = ∞: the last case of the Lp-norm (p = ∞) scaling technique is equivalent to the geometric mean scaling method.

3.7. Normalization Scaling Technique [−1, 1] [21]

Equation (29) defines the normalization scaling method with range [−1, 1], x'_k = 2(x_k − min_k)/(max_k − min_k) − 1, where x_k, x'_k, max_k, and min_k are the original value, the scaled value, the maximum value, and the minimum value of feature k, respectively.

The normalization scaling method avoids the numerical difficulties during the calculation.

3.8. Normalization Scaling Technique [0, 1] [21]

Another normalization scaling technique, with range [0, 1], is obtained by modifying equation (29) to x'_k = (x_k − min_k)/(max_k − min_k).
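A sketch of the two min-max normalization variants above (my own implementation): the [−1, 1] form of equation (29) and its [0, 1] modification, applied feature by feature.

```python
import numpy as np

def normalize(X, low, high):
    """Min-max normalize each column (feature) of X to the range [low, high]."""
    X = X.astype(float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # avoid division by zero for constant features
    return low + (high - low) * (X - mins) / span

X = np.array([[1.0, 10.0], [2.0, 30.0], [4.0, 20.0]])
print(normalize(X, -1.0, 1.0))   # feature-wise range [-1, 1]
print(normalize(X, 0.0, 1.0))    # feature-wise range [0, 1]
```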

4. Experimental Design

In this section, we introduce data description, measure for performance evaluation, and the grid search method.

4.1. Data Description

In this work, we ran the proposed model on the Wisconsin Diagnosis Breast Cancer (WDBC) dataset, which is available from the UCI Machine Learning Repository [22]. The dataset consists of 569 instances divided into two classes: benign and malignant, with 357 and 212 cases, respectively. Every observation in the database has thirty-three attributes, whose values differ between benign and malignant samples.

The MATLAB platform is used to implement the SVM diagnostic system, together with LIBSVM, which was developed by Chang and Lin [23]. Table 2 describes the computing environment.

Salzberg [24] introduced the k-fold cross-validation (CV) procedure, which is used to guarantee valid results. In this paper, we set k to 10, i.e., the data is divided into 10 subsets. The most commonly used (default) value in k-fold CV is k = 10, which is often a good choice [25]. Each time, one of the 10 subsets is used as the test set and the remaining 9 subsets are used as the training set.
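A sketch of the 10-fold cross-validation protocol described above. It uses scikit-learn's bundled copy of the WDBC data and stratified folds so that each subset preserves the class proportions; stratification and the fixed RBF parameters are my own assumptions, not stated in the paper.

```python
from sklearn.datasets import load_breast_cancer   # sklearn's copy of WDBC
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
accs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))     # accuracy on the held-out fold
print(f"mean 10-fold accuracy: {sum(accs) / len(accs):.4f}")
```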

4.2. Measure for Performance Evaluation

In order to test the performance of the SVM model, we use accuracy (ACC). Table 3 shows the confusion matrix. TP, FN, TN, and FP are the number of true positives, the number of false negatives, the number of true negatives, and the number of false positives, respectively. According to the confusion matrix, the total classification accuracy (ACC) is defined as follows: ACC = (TP + TN)/(TP + FP + TN + FN) × 100%.
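The total classification accuracy defined above can be computed directly from the confusion matrix counts, as in the small sketch below (the counts passed in the example are hypothetical).

```python
def accuracy(tp, tn, fp, fn):
    # ACC = (TP + TN) / (TP + FP + TN + FN) * 100%
    return (tp + tn) / (tp + fp + tn + fn) * 100.0

print(accuracy(tp=350, tn=205, fp=7, fn=7))   # hypothetical counts -> ~97.5%
```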

4.3. Grid Search Method

The grid search method is used to determine the optimal parameters C and γ of the SVM. Figure 3 shows the flowchart of SVM training using the grid search. We search over predefined ranges of C and γ.
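A sketch of a grid search over C and γ with 10-fold cross-validation. The exact search ranges used in the paper are not reproduced here, so the exponentially spaced grids below are illustrative values only; scikit-learn's copy of the WDBC data is used for convenience.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],       # illustrative range
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],   # illustrative range
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)       # best (C, gamma) and its CV accuracy
```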

5. Experimental Results and Discussion

Here, the experimental results are presented, and the validity of the proposed scaling techniques is assessed. The experiments were performed on the WDBC dataset using the SVM to estimate the efficiency of the proposed scaling techniques for breast cancer diagnosis.

Table 4 shows the effectiveness of the grid search method, which finds the best parameters C and γ for the SVM. The accuracy with the normalization scaling technique (S1) is better than that without scaling (S0). Also, using the scaling technique (S1) speeds up the search and achieves a dramatic decrease in CPU time. This result shows the effectiveness of the grid search method when combined with the scaling technique (S1).

Tables 5 and 6 show the average classification accuracy rates and CPU times of the SVM with four scaling techniques: normalization to [−1, 1] (S2), equilibration scaling (S3), geometric mean scaling (S4), and arithmetic mean scaling (S5). One can easily notice that S3 achieved the best accuracy, 98.95%, outperforming the compared scaling techniques. S3 also achieved the lowest CPU time, about 10.2 seconds.

Tables 7 and 8 show the average accuracy rates of the SVM with the de Buchet scaling technique with p = 1 (S6), the de Buchet scaling technique with p = 2 (S7), and the IBM MPSX scaling technique (S8). It is clear that the S6 and S8 scaling techniques have the same accuracy of 98.59%, which is better than the accuracy of the S7 scaling technique.

Table 9 shows the average classification accuracy rates of the SVM with the Lp-norm scaling technique with p = 1 (S9) and the Lp-norm scaling technique with p = 2 (S10). One can notice that S9 achieved 98.25% accuracy, outperforming the S10 scaling technique. However, the CPU time of S10 is slightly lower than that of S9.

Tables 10 and 11 summarize the accuracy and CPU time of all compared scaling techniques. The equilibration scaling technique (S3) achieved the best accuracy and the lowest CPU time outperforming all compared scaling techniques. Figures 4 and 5 show the superiority of S3 according to the accuracy rate and CPU time, respectively.

Figure 6 shows the superiority of the equilibration scaling technique (S3), which achieved the best accuracy in every fold of the 10-fold cross-validation.

6. Conclusions

In this work, we proposed ten efficient scaling techniques for the Wisconsin Diagnosis Breast Cancer (WDBC) dataset using the support vector machine (SVM). These scaling techniques can enhance classification accuracy, reduce CPU time, and make training faster. Grid search was also used to select the best free parameters of the SVM (C and gamma). Simulation results showed that the equilibration scaling technique (S3) achieved the best accuracy, 98.95%, outperforming all compared scaling techniques. S3 also achieved the lowest CPU time, about 10.2 seconds. Eight efficient scaling techniques outperformed the two benchmark scaling techniques according to the accuracy rate; these include S3, S4, S5, S6, S7, and S8. Seven of the ten efficient scaling techniques outperformed the two benchmark scaling techniques according to the CPU time; these include S3, S5, S6, S7, and S8.

In future work, the proposed scaling techniques will be applied to other datasets with other classifiers in order to demonstrate the superiority of these techniques over the benchmark normalization scaling technique used in MATLAB software. This work can also be improved by using different metaheuristic algorithms with other mathematical models [26–30]. In addition, swarm intelligence techniques will be used to optimize the SVM instead of the grid search [31–33].

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors extend their appreciation to the Deanship of Scientific Research at Majmaah University for funding this work under project no. RGP-2019-29 and to the Deputyship for Research and Innovation, Ministry of Education in Saudi Arabia, for funding this research work through project no. IFP-2020-17.