Abstract
The support vector machine (SVM) and deep learning (e.g., convolutional neural networks (CNNs)) are the two most famous algorithms for small and big data, respectively. Nonetheless, smaller datasets may be very important, costly, and not easy to obtain in a short time. This paper proposes a novel convolutional SVM (CSVM) that has the advantages of both CNN and SVM to improve the accuracy and effectiveness of mining smaller datasets. The proposed CSVM adapts the convolution product from CNN to learn new information hidden deeply in the datasets. In addition, it uses a modified simplified swarm optimization (SSO) to help train the CSVM to update classifiers, and the traditional SVM is implemented as the fitness function for the SSO to estimate the accuracy. To evaluate the performance of the proposed CSVM, experiments were conducted on five well-known benchmark databases for the classification problem. The numerical results compared favorably with those obtained using SVM, a 3-layer artificial NN (ANN), and a 4-layer ANN. The results of these experiments verify that the proposed CSVM with the proposed SSO can effectively increase classification accuracy.
1. Introduction
Data mining is an effective method for examining and learning from extensive compound datasets of varying quality [1] and has been broadly applied to numerous practical problems in medicine [2–4], engineering [5], time series data [6], image classification [7], speech recognition [8], handwriting recognition [9], management [10], and social sciences [11], with classification being one of the most popular topics in data mining. Numerous classifiers for data mining have been established such as support vector machines (SVMs) [3, 4, 7] and deep learning algorithms [8, 9, 12].
Deep learning based on artificial neural networks (ANNs) is made up of neurons that have learnable weights and biases such that the neural network, a special mathematical function, fits the data in the dataset as closely as possible [12, 13]. Deep learning techniques include convolutional neural networks (CNNs) for the continuous space data types (e.g., image and speech recognition) [7, 8, 14], recurrent neural networks (RNNs) for the time series data types (e.g., stock markets and language modeling) [12], and generative adversarial networks (GANs) for generating new examples and classifying examples [15]. Deep learning is an adequate and straightforward data-mining method for big data [12, 13]. However, since deep learning techniques need big data to learn the classification rules, that is, they only work well for large datasets, they pose an enormous challenge to the many applications in which sufficiently large datasets are hard to obtain [12, 13]. Furthermore, deep learning relies on good hardware, especially the graphics processing unit (GPU), to perform well, but such hardware is still expensive [16, 17].
The SVM is another well-known and effective supervised learning model for selecting attributes and classifying data. Before the rise of deep learning, the SVM outperformed ANNs in various real-life applications in medicine [3, 4], semiconductor industry [18], online analysis [19], spectral unmixing resolution [20], imbalanced datasets [21], mining financial distress [22], data classification [23], and so forth [24–26]. In comparison with deep learning techniques that try to connect data in terms of ANNs, the SVM separates (rather than connects) different classes of data based on the kernels through mathematical optimization [27, 28]. In addition, an SVM achieves high accuracy with less computation power and small data, which addresses two shortcomings of deep learning [24–26]. Therefore, besides the original SVM, various enhanced SVMs were developed before the development of deep learning [21, 24–26]. SVMs are discussed in detail in Section 5.1.
Small data are well-formatted data with small volumes that are accessible, understandable, and actionable for decision makers [29]. The value of data lies in the information content but not the volume of data [30]. For some cases such as the marketing strategies of targeting campaigns or delivering personalized experiences, big data might not be appropriate because these cases do not require full-on big data [31]. Conversely, small data extract an individual’s data and provide valuable information to help decision makers formulate strategies. Moreover, the occurrence of small data is rare, with the process of collecting them being expensive and strenuous [4, 21]. Hence, if the data mining of small data is improved, it will aid in making useful, cost-efficient, and timely decisions in small data applications.
Deep learning techniques and SVMs belong to a broader family of machine learning algorithms. Deep learning techniques (e.g., convolutional neural networks (CNNs)) based on neural networks are powerful for mining big data but less effective in smaller datasets. On the contrary, SVMs generally outperform neural networks in smaller datasets but are less effective in mining big data. This paper proposes a novel convolutional SVM (CSVM) that has the advantages of both SVM and deep learning to enhance SVM by maximizing its prediction accuracy, and tests it on the classification of two-class datasets.
The proposed CSVM employs a supervised learning technique that is based on simplified swarm optimization (SSO), which is another powerful machine learning algorithm [2, 6, 32–35]. Numerical experiments and comparative research with ANNs and the traditional SVM show the accuracy and effectiveness of the proposed CSVM tested on five two-class datasets.
To summarize the above, the theoretical contribution of this paper is twofold: a novel convolutional SVM (CSVM) that combines the advantages of SVM and deep learning, that is, the SVM together with the vital operation techniques of CNNs including stride and convolution, to enhance the SVM; and a one-solution, one-filter, one-variable greedy SSO update mechanism that prevents near-optimal solutions from being moved away from their current positions and reduces the runtime.
2. Proposed and Traditional Convolution Products
The major difference between the proposed CSVM and the traditional SVM is the convolution product. Hence, the traditional and the proposed new convolution products are introduced and discussed in Section 2.
2.1. Convolution-Related Concept
CNNs represent some of the most significant models of deep learning, and their performance has been verified in numerous recognition research areas. Among the vital operation techniques of CNNs, we introduce some that are used in this paper [12, 13].
(1) Padding: to prevent the reduction in data size caused by the convolution process in the next layer, zeros are added around the input image; this action is called padding.
(2) Stride: the horizontal or vertical distance that a kernel moves each time is called the stride. The greater the stride is, the more independent the neighboring values in the convolution process are.
(3) Convolution: in each convolution operation, the kernel (filter) moves across the padded input according to the given stride, and the input values it covers are multiplied by the corresponding kernel values. These products are then summed up and filled in the corresponding position on the next layer.
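The three operations above can be illustrated in one dimension. The following is a minimal sketch, not the implementation used in this paper; the function name `conv1d` and the sample values are hypothetical. As in most CNN libraries, the kernel is applied without flipping.

```python
def conv1d(values, kernel, stride=1, pad=0):
    """Slide `kernel` over `values`, zero-padded by `pad` entries on each side."""
    padded = [0.0] * pad + [float(v) for v in values] + [0.0] * pad
    outputs = []
    # Each start position is one placement of the kernel;
    # `stride` is the distance between consecutive placements.
    for start in range(0, len(padded) - len(kernel) + 1, stride):
        window = padded[start:start + len(kernel)]
        # Multiply element-wise and sum the products.
        outputs.append(sum(w * k for w, k in zip(window, kernel)))
    return outputs


signal = [1, 2, 3, 4, 5]          # hypothetical input
kernel = [-1, 0, 1]               # hypothetical filter

no_pad = conv1d(signal, kernel)             # output is shorter than the input
same = conv1d(signal, kernel, pad=1)        # padding keeps the input length
strided = conv1d(signal, kernel, stride=2)  # kernel moves two positions at a time
```

Without padding the five inputs shrink to three outputs, which is the size reduction that padding is meant to prevent.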
2.2. Proposed Convolution Product with Repeated Attributes
Suppose that N_{att}, N_{sol}, N_{filter}, and N_{var} are the numbers of attributes, solutions, filters constructed in each solution, and variables contained in each filter, respectively. Let v_{r, a} be the value of the ath attribute in the rth record; applying the fth filter yields an updated value of the same attribute, where a = 1, 2, …, N_{att}, s = 1, 2, …, N_{sol}, and f = 1, 2, …, N_{filter}. For example, nine attributes are used in the breast cancer dataset of the University of California Irvine (UCI) [36], and the vector representing the first normalized record is listed as follows:
The updated value is calculated using the convolution product of the current attribute values and the filter X_{s, f} as follows:
From equation (2), N_{var} attributes contribute to the new ath attribute if a + N_{var} − 1 ≤ N_{att}. However, the attributes v_{r, N_{att} + 1}, v_{r, N_{att} + 2}, …, v_{r, N_{att} + N_{var} − 1} do not exist when equation (2) is used to update the last N_{var} − 1 attributes, for which a + N_{var} − 1 > N_{att}.
Let filter X_{s, 1} = [x_{s, 1, 1}, x_{s, 1, 2}, x_{s, 1, 3}] = [−1, 0, 1]. The procedures for generating the new attributes using the convolution product are listed as follows:
From the above, the first and second original attributes (i.e., v_{1, 1} and v_{1, 2}) are used only once, as shown in equation (3), and twice, as shown in equations (3) and (4), respectively, when the new attributes are generated. Similarly, the last and the second last attributes in I_{1} (i.e., v_{1, 9} and v_{1, 8}) appear only in equations (8) and (9). Moreover, based on equation (2), there are no new eighth and ninth attributes.
There is no padding in the proposed CSVM. However, to fix the above problems, we need to guarantee that the two following conditions are satisfied:
(1) Each attribute is included in the same number (i.e., N_{var}) of convolution products
(2) The last j attributes still exist after each convolution product
The first (N_{var} − 1) attributes are repeated and appended after the last attribute of the same record such that the total number of attributes is an integer multiple of N_{var}; that is, v_{r, N_{att} + k} = v_{r, k} for k = 1, 2, …, N_{var} − 1 and f = 1, 2, …, N_{filter}. Hence, following the same example discussed above, we have
Thus, each new attribute is generated by three convolution products, and we have new I_{1} accordingly.
Let I_{i, j} be the updated ith record after using the jth filter, I_{i} = I_{i, 0}. The next example demonstrates updated I_{1} after two filters are used, with each having N_{var} = 3 variables; X_{s, 2} = [x_{s, 2, 1}, x_{s, 2, 2}, x_{s, 2, 3}] = [1.8, −0.9, 0.7].
Thus, after using the two filters X_{s,1} and X_{s,2}, we have
The basic idea of the proposed convolution product with repeated attributes is that the first (N_{var} − 1) attributes are repeated and appended after the last attribute of each (updated) record such that the total number of attributes is an integer multiple of N_{var}; that is, v_{r, N_{att} + k} = v_{r, k} for k = 1, 2, …, N_{var} − 1. The pseudocode of the proposed convolution product with repeated attributes is listed in Algorithm 1.
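The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a transcription of Algorithm 1: it assumes that the new ath attribute is the sum of x_{s, f, k} times the (a + k − 1)th attribute of the extended record, and the names `conv_repeated` and `apply_filters` as well as the sample record are hypothetical.

```python
def conv_repeated(record, filt):
    """Convolve one record with one filter after repeating its first N_var - 1 attributes."""
    n_var = len(filt)
    # Append the first (N_var - 1) attributes so that the last attributes
    # also have N_var values to be combined with (no zero padding).
    extended = list(record) + list(record[:n_var - 1])
    return [
        sum(filt[k] * extended[a + k] for k in range(n_var))
        for a in range(len(record))
    ]


def apply_filters(record, filters):
    """Apply the filters one after another; each filter sees the previous output."""
    for filt in filters:
        record = conv_repeated(record, filt)
    return record


record = [1.0, 2.0, 3.0, 4.0]                      # hypothetical normalized record
after_one = conv_repeated(record, [-1, 0, 1])      # filter values taken from the text
after_two = apply_filters(record, [[-1, 0, 1], [1.8, -0.9, 0.7]])
scaled = conv_repeated([1, 2, 3], [2, 0, 0])       # only the first variable is nonzero
```

The number of attributes is preserved by every filter, and a filter whose only nonzero variable is the first one simply scales the record.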

Additionally, we obtain the following properties after employing the proposed convolution product with repeated attributes.
Property 1. If x_{s, f, 1} = α and x_{s, f, k} = 0 for all k = 2, …, N_{var} and all f = 1, …, N_{filter}, thenfor all a = 2, …, N_{att} and f = 1, …, N_{filter}.
3. Proposed and Traditional SSO
In the proposed CSVM, all values in filters of the proposed convolution product with repeated attributes are updated based on the proposed new SSO. The traditional SSO is introduced briefly, and the proposed SSO including the new self-adaptive solution structure with pFilter, the novel one-solution, one-filter, one-variable greedy update mechanism, and the fitness function are presented in this section.
3.1. Traditional SSO
The SSO is one of the simplest machine-learning methods [2, 6, 32–35] in terms of its update mechanism. It was first proposed by Yeh and has been shown to be a very useful and efficient algorithm for optimization problems [33, 34], including data mining [2, 6]. Owing to its simplicity and efficiency, SSO is used here to find the best values in filters of the proposed CSVM.
The basic idea of SSO is that each variable, such as the jth variable in the ith solution x_{i, j}, needs to be updated based on the following stepwise function [2, 6, 32–35]:
where the value ρ_{[0, 1]} ∈ [0, 1] is generated randomly and the parameters C_{g}, C_{p}, and C_{w}, all in [0, 1], determine the probabilities that the current variable is copied from the best of all solutions, the best ith solution, the current solution, and a randomly generated feasible value, respectively.
There are different variants of the traditional SSO, which are customized to different problems in keeping with the no-free-lunch theorem; for example, the four items in equation (15) are reduced to three items to increase the efficiency; the parameters C_{g}, C_{p}, and C_{w} are all self-adapted; special values or equations are implemented to replace the gBest value, the pBest value, x_{i, j}, and x; or only a certain number of variables are selected to be updated, and so forth. However, the SSO update mechanism is always based on the stepwise function.
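A minimal sketch of the stepwise update is given below. It assumes the cumulative-threshold form commonly used for SSO, in which 0 ≤ Cg ≤ Cp ≤ Cw ≤ 1 split [0, 1] into four intervals; the function name `sso_update`, its argument names, and the sample values are hypothetical.

```python
import random


def sso_update(x, pbest, gbest, Cg, Cp, Cw, lower, upper, rng=random):
    """Return an updated copy of solution `x` using the SSO stepwise rule."""
    new_x = []
    for j in range(len(x)):
        rho = rng.random()              # uniform random number in [0, 1)
        if rho < Cg:
            new_x.append(gbest[j])      # copy from the best of all solutions
        elif rho < Cp:
            new_x.append(pbest[j])      # copy from the best of this solution
        elif rho < Cw:
            new_x.append(x[j])          # keep the current value
        else:
            new_x.append(rng.uniform(lower, upper))  # random feasible value
    return new_x


x = [0.1, 0.2, 0.3]        # current solution (hypothetical values)
pbest = [1.0, 1.0, 1.0]
gbest = [2.0, 2.0, 2.0]
updated = sso_update(x, pbest, gbest, Cg=0.4, Cp=0.7, Cw=0.9, lower=-2.0, upper=2.0)
```

With these thresholds, each variable is copied from gBest with probability 0.4, from pBest with probability 0.3, kept with probability 0.2, and regenerated with probability 0.1.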
3.2. Fitness Function
Fitness functions guide solutions toward the optimum in artificial intelligence methods such as the proposed CSVM, the traditional SVM, and the CNN. The accuracy obtained by the SVM, based on the records transformed by the proposed convolutions, is adopted here as the fitness to be maximized in the CSVM:
Input: all records I_{r} for r = 1, 2, …, N_{rec} and the sth solution X_{s}.
Output: F (X_{s}).
STEP F0: update I_{r} with the convolution product of I_{r} and X_{s}, based on the pseudocode provided in Section 2.2, for r = 1, 2, …, N_{rec}.
STEP F1: classify {I_{1}, I_{2}, …, I_{Nrec}} using the SVM and let the accuracy be F (X_{s}).
3.3. Self-Adaptive Solution Structure and pFilter
In the proposed CSVM, each variable of all filters in each solution is initialized randomly from [−2, 2]. Each filter and each solution are represented with dimensions N_{var} × 1 and N_{filter} × N_{var}, respectively, since the number of filters may be more than one. For example, the sth solution X_{s} and the fth filter X_{s, f} in X_{s} are denoted as follows:
Overall, the number of filters is the same, namely N_{filter}, for each solution and all generations. However, a greater number of filters does not always guarantee a better fitness value. Hence, we need to record the best number of filters for each solution. Let filter j be the best filter of solution s = 1, 2, …, N_{sol}, and define pFilter[s] = j if F[X_{s, f}] ≤ F[X_{s, j}] for all f = 1, 2, …, N_{filter}. Note that X_{h, i} with pFilter[h] = i is the best solution among all existing solutions if F[X_{s, f}] ≤ F[X_{h, i}] for all s = 1, 2, …, N_{sol} and f = 1, 2, …, N_{filter}.
In the end, only the best solution (e.g., X_{s}) and its best number of filters, namely, X_{s, 1}, X_{s, 2}, …, X_{s, j}, where pFilter[s] = j, are reported. In addition, the update mechanism is based on the best filter in the proposed CSVM. Hence, the solution is self-adapted by the best number of filters.
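The bookkeeping behind pFilter can be sketched as follows. The fitness is evaluated after each successive filter, and the number of filters that gives the highest value is recorded. This is only an illustration: `accuracy_fn` is a caller-supplied placeholder for the SVM accuracy, `toy_accuracy` is a stand-in classifier (not an SVM), the convolution uses the assumed repeated-attribute form, and ties are resolved in favor of fewer filters, which is a choice made here.

```python
def conv_repeated(record, filt):
    """Convolution product with the first N_var - 1 attributes repeated (assumed form)."""
    n_var = len(filt)
    extended = list(record) + list(record[:n_var - 1])
    return [sum(filt[k] * extended[a + k] for k in range(n_var))
            for a in range(len(record))]


def best_filter_count(records, labels, filters, accuracy_fn):
    """Return (pFilter, best fitness) over the filter prefixes X_{s,1}, ..., X_{s,j}."""
    current = [list(r) for r in records]
    best_j, best_fit = 0, float("-inf")
    for j, filt in enumerate(filters, start=1):
        current = [conv_repeated(r, filt) for r in current]   # STEP F0
        fit = accuracy_fn(current, labels)                    # STEP F1
        if fit > best_fit:                                    # strict: keep fewer filters on ties
            best_j, best_fit = j, fit
    return best_j, best_fit


def toy_accuracy(records, labels):
    """Stand-in for SVM accuracy: predicts class 1 when attribute 1 > attribute 2."""
    hits = sum((r[0] > r[1]) == bool(y) for r, y in zip(records, labels))
    return 100.0 * hits / len(records)


records = [[1, 2], [3, 1]]          # hypothetical two-attribute records
labels = [1, 0]
p_filter, fitness = best_filter_count(records, labels, [[1, 0], [0, 1]], toy_accuracy)
```

Here the first filter leaves the records unchanged and the second swaps the two attributes, so the best fitness is reached with two filters.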
3.4. One-Solution, One-Filter, One-Variable Greedy SSO Update Mechanism
The proposed new one-solution, one-filter, one-variable greedy SSO update mechanism is discussed in this subsection.
3.4.1. One Solution Is Selected Randomly to be Updated in Each Generation
In the proposed CSVM, all values in filters are variables that must be determined to implement convolution products. Without the help of a GPU, it takes a long time to update variables to deepen the SVM. Hence, unlike traditional algorithms such as SSO, the genetic algorithm (GA), and particle swarm optimization (PSO), in which all solutions need to be updated, only one solution is selected randomly for updating in each generation of the proposed new SSO update mechanism. Let solution s be selected to be updated based on the following equations:
where ρ_{[0, 1]} is a random floating-point number generated from the interval [0, 1], ρ_{[1, Nsol]} ∈ {1, 2, …, N_{sol}} is the index of the solution selected randomly, gBest is the index of the best solution found, and 0 denotes a new solution generated randomly. The newly updated solution X_{s} either replaces the old X_{s} or is discarded, based on the process described next.
3.4.2. One-Filter One-Variable Greedy Update Mechanism
In the traditional SSO, all variables need to be updated (the all-variable update mechanism), which gives a higher probability of escaping local traps than updating only some variables. However, the all-variable update mechanism may move near-optimal solutions away from their current positions. Additionally, its runtime is N_{sol} times that of the one-variable update, which selects one variable randomly to be updated. Hence, to reduce the runtime, only one variable in one filter in the solution selected in Section 3.4.1 is updated.
Let s be the solution selected to be updated. In the proposed new SSO, only one filter, for example, f, where f = 1, 2, …, j and pFilter[s] = j, in solution s is chosen randomly. Moreover, one variable, for example, x_{s, f, k}, where k = 1, 2, …, N_{var}, in filter X_{s, f} is also selected randomly to be updated based on the following simple process:
where ρ is the random number generated in the update mechanism, and its subscript gives the lower and upper bounds of that random number. The interval of ρ was derived from multiple randomized trial-and-error runs. The value 0.05 is the step size of the local search, chosen so that the local search is fine enough to locate a good solution. After resetting all variables in the filters X_{s, h} to random numbers generated from [−2, 2] for all h > f, we have
Also, F[X_{s, l}] = F[X_{s, f − 1}] for all l < f.
Moreover, the updated solution X_{s}, including the newly updated variables and filters, is discarded if its fitness value is not better than that of the old X_{s}.
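One generation of this greedy step, for a solution that has already been selected, can be sketched as follows. Several details are assumptions made only for illustration: the perturbation is taken to be additive (step size 0.05 times a bounded random number), the bounds `rho_bounds` are hypothetical, and `toy_fitness` stands in for the SVM accuracy.

```python
import random


def greedy_one_variable_step(filters, p_filter, fitness_fn, current_fit,
                             rho_bounds=(-1.0, 1.0), step=0.05, rng=random):
    """Perturb one variable of one filter; keep the change only if fitness improves."""
    f = rng.randrange(p_filter)              # one filter among the first pFilter[s]
    k = rng.randrange(len(filters[f]))       # one variable in that filter
    candidate = [list(flt) for flt in filters]
    candidate[f][k] += step * rng.uniform(*rho_bounds)   # assumed additive local move
    # Filters after f are reset to random values in [-2, 2].
    for h in range(f + 1, len(candidate)):
        candidate[h] = [rng.uniform(-2.0, 2.0) for _ in candidate[h]]
    cand_fit = fitness_fn(candidate)
    if cand_fit > current_fit:               # greedy acceptance
        return candidate, cand_fit
    return filters, current_fit              # otherwise discard the update


def toy_fitness(filters):
    """Stand-in objective (not SVM accuracy): larger is better, maximum at all zeros."""
    return -sum(v * v for flt in filters for v in flt)


rng = random.Random(0)
filters = [[1.0, -1.0, 0.5], [0.3, 0.2, -0.7]]
fit = toy_fitness(filters)
history = [fit]
for _ in range(200):
    filters, fit = greedy_one_variable_step(filters, 2, toy_fitness, fit, rng=rng)
    history.append(fit)
```

Because a candidate is accepted only when it is strictly better, the recorded fitness never decreases from one generation to the next.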
3.5. Pseudocode of the Proposed SSO
The pseudocode of the proposed SSO based on the new self-adaptive solution structure, pFilter, and the new update mechanism is listed in Algorithm 2.

4. Proposed Small-Sample OA to Tune Parameters
It is important to select the most representative combination of parameters to find good results for all algorithms, such as the three parameters C_{g}, C_{p}, and C_{w} in SSO. To reduce the computational burden, a novel concept called the small-sample orthogonal array (OA), based on the OA test, is proposed in this section to tune parameters.
4.1. OA
The design of experiment (DOE) adopts an array design that arranges the tests and factors in rows and columns, respectively, such that rows and columns are independent of each other, and there is only one test level in each factor level [37]. The DOE is able to select better parameters from some representative predefined combinations to reduce test numbers [2, 38].
The Taguchi OA test, first developed by Taguchi [37], is a DOE that is implemented to achieve the objective of this study. OA is denoted by L_{n} (a^{b}), where n, a, and b are the numbers of tries, levels of each factor, and factors, respectively. For example, Table 1 represents an OA denoted by L_{9} (3^{4}).
From Table 1, we can see that the OA is orthogonal in the following sense:
(1) Each level appears the same number of times in each column; for example, numbers 1, 2, and 3 each appear three times in each column in Table 1.
(2) All ordered pairs of levels of any two factors appear exactly once across the tests, for example, (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), and (3, 3) in columns 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, and 3 and 4, which ensures that each level is dispersed evenly in the complete combination of each level of factors.
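Both characteristics can be checked mechanically. The sketch below uses the standard Taguchi L_{9} (3^{4}) layout, which is assumed here and may list its rows in a different order from Table 1; the function name `is_orthogonal` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Standard L9(3^4) orthogonal array: 9 tries (rows), 4 factors (columns), 3 levels.
L9 = [
    (1, 1, 1, 1),
    (1, 2, 2, 2),
    (1, 3, 3, 3),
    (2, 1, 2, 3),
    (2, 2, 3, 1),
    (2, 3, 1, 2),
    (3, 1, 3, 2),
    (3, 2, 1, 3),
    (3, 3, 2, 1),
]


def is_orthogonal(array, levels):
    """Check level balance per column and pair balance for every two columns."""
    n_cols = len(array[0])
    for c in range(n_cols):
        counts = Counter(row[c] for row in array)
        # (1) every level appears, and all levels appear equally often
        if set(counts) != set(levels) or len(set(counts.values())) != 1:
            return False
    for c1, c2 in combinations(range(n_cols), 2):
        pairs = Counter((row[c1], row[c2]) for row in array)
        # (2) every ordered pair of levels appears, and equally often
        if len(pairs) != len(levels) ** 2 or len(set(pairs.values())) != 1:
            return False
    return True


orthogonal = is_orthogonal(L9, (1, 2, 3))
```

Nine tries therefore cover every pair of levels of any two factors once, whereas the full factorial design would need 3^4 = 81 tries.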
4.2. Proposed Small-Sample OA
There are three major methods for tuning parameters:
(1) The trial-and-error method: It implements the tests exhaustively by trying all possible cases to find the one with the better results. It is the simplest and the most inefficient one.
(2) The parameter-adapted method: It selects and tests some set of parameters from the existing parameters, which are already used in some applications. This method may have some issues with respect to identifying the characteristics of new problems.
(3) The DOE: It selects the parameters from the experiment design. Compared to the two aforementioned methods, this method is the most efficient and effective one. However, this method becomes inefficient when the dataset is large or when it needs to be repeated very often.
Hence, to overcome these aforementioned problems, a novel method called the small-sample OA test is proposed to improve the OA method for tuning parameters. To reduce the runtime, the proposed small-sample OA test samples only a few records randomly from the dataset and conducts the OA test on the subsets of such small-sample data to find the best parameters, that is, those that result in the highest accuracy, the shortest runtime, and/or the largest number of solutions reaching the highest accuracy, based on the three following rules:
Rule 1. The one with the highest accuracy among all others;
Rule 2. The one with the shortest runtime, with a big gap between such runtime and others, if there is a tie based on Rule 1;
Rule 3. The one with the largest number of solutions that have the highest accuracy if there is a tie based on Rule 2.
Then, this selected parameter set is applied to the rest of the unsampled dataset. The example for this proposed test is provided in Section 6.
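The three rules can be read as a tie-breaking cascade, sketched below. The threshold `runtime_gap`, which decides what counts as a big gap in Rule 2, is a hypothetical parameter, as are the accuracy and solution-count values in the example; only the two runtimes (43.43 and 149.28) are taken from the results reported later for Dataset A.

```python
def select_try(tries, runtime_gap=0.5):
    """Pick a try by Rule 1 (accuracy), Rule 2 (runtime), then Rule 3 (count of best solutions)."""
    # Rule 1: keep only the tries with the highest accuracy.
    best_acc = max(t["accuracy"] for t in tries)
    tied = [t for t in tries if t["accuracy"] == best_acc]
    # Rule 2: keep the tries whose runtime is close to the shortest one;
    # a single survivor means the fastest try wins by a big gap.
    fastest = min(t["runtime"] for t in tied)
    close = [t for t in tied if t["runtime"] <= fastest * (1.0 + runtime_gap)]
    # Rule 3: among the remaining tries, prefer the most solutions at the best accuracy.
    return max(close, key=lambda t: t["n_best"])["id"]


tries = [
    {"id": 1, "accuracy": 85.0, "runtime": 10.00, "n_best": 15},
    {"id": 5, "accuracy": 90.0, "runtime": 43.43, "n_best": 3},
    {"id": 9, "accuracy": 90.0, "runtime": 149.28, "n_best": 10},
]
chosen = select_try(tries)
```

In this example, Tries 5 and 9 tie on accuracy, and Try 5 is chosen because its runtime is far shorter.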
5. Proposed CSVM and Traditional SVM
The proposed CSVM is a convolutional SVM modified by employing a new convolution product, which is updated based on the proposed new SSO. The traditional SVM is introduced briefly, and then the proposed pseudocode of the proposed CSVM is presented.
5.1. Traditional SVM
SVMs are excellent machine learning tools for binary classification [27, 28]. The purpose of an SVM is to maximize the margin between two supporting hyperplanes that separate two classes of data. Let X = {z_{1} = (x_{1}, y_{1}), z_{2} = (x_{2}, y_{2}), …, z_{n} = (x_{n}, y_{n})} be a two-class dataset for training. For example, in a linear SVM on two-dimensional data, a hyperplane is a line, and we want to find the best hyperplane W^{T}X + b = 0 to separate these two classes of data in X, where W is the weight vector normal to the hyperplane and b is the bias, such that the margin between the two classes is as large as possible. The above linear SVM is a constrained optimization model and it can be written as follows [27, 28]:
After applying the Lagrange multiplier method to the constrained optimization model, the SVM problem is a convex quadratic programming problem that can be presented as follows [27]:
where λ_{i} is the Lagrange multiplier.
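For reference, the textbook hard-margin form of the linear SVM and its Lagrangian dual are given below. This is the standard formulation, assumed rather than confirmed to be the variant intended above (a soft-margin version would add slack variables and an upper bound on each λ_{i}).

```latex
% Primal problem: maximizing the margin 2/||W|| is equivalent to minimizing ||W||^2 / 2
\min_{W,\,b}\ \frac{1}{2}\,\lVert W \rVert^{2}
\quad \text{subject to} \quad
y_{i}\left(W^{T}x_{i} + b\right) \ge 1, \qquad i = 1, 2, \ldots, n.

% Dual problem obtained with Lagrange multipliers \lambda_i
\max_{\lambda}\ \sum_{i=1}^{n} \lambda_{i}
 - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
   \lambda_{i}\lambda_{j}\, y_{i} y_{j}\, x_{i}^{T} x_{j}
\quad \text{subject to} \quad
\sum_{i=1}^{n} \lambda_{i} y_{i} = 0, \qquad \lambda_{i} \ge 0 .
```

Only the inner products x_{i}^{T}x_{j} appear in the dual, which is what allows a kernel function to be substituted for them.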
For high-dimensional data that are not linearly separable, it is very difficult to find a single linear boundary to separate the two sets. Hence, these data are mapped into a higher-dimensional space using a function that is called the kernel in the SVM. Then, a hyperplane can be found to separate the mapped data. Here, we list some popular kernel functions [27, 28]:
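As an illustration, four kernels commonly offered by SVM libraries (linear, polynomial, radial basis function, and sigmoid) can be written as plain functions. The parameter names `gamma`, `coef0`, and `degree` follow the usual libsvm naming; the default values below are arbitrary, and this set is assumed rather than confirmed to be the one listed above.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    # Radial basis function: depends only on the squared distance between u and v.
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]      # hypothetical feature vectors
k_lin = linear_kernel(u, v)
k_poly = polynomial_kernel(u, v, gamma=1.0, coef0=1.0, degree=2)
k_rbf_same = rbf_kernel(u, u)
```

Each function returns the inner product of the two inputs in the mapped space without constructing that space explicitly.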
For more details of SVM and its development, the reader is referred to [25, 26].
5.2. Pseudocode of the Proposed CSVM
The pseudocode of the proposed CSVM is described below together with the integration of the proposed convolution product discussed in Subsection 2.2, the proposed SSO introduced in Section 3, and the proposed small-sample OA presented in Section 4.2 (Algorithm 3).

6. Experimental Results and Summary
There are two experiments, Ex1 and Ex2, in this study. Ex1 is based on the proposed small-sample OA concept to find the parameters C_{g}, C_{p}, C_{w}, N_{gen}, N_{filter}, and N_{var} in the proposed CSVM. Then, these parameters are employed in Ex2 to conduct an extension test and compare the results with those obtained from the DSCM, SVM, 3-layer ANN, and 4-layer ANN.
6.1. Simulation Environment
Four algorithms are developed and adapted in this study including the proposed CSVM, SVM, the 3-layer ANN, and the 4-layer ANN. The proposed CSVM is implemented using Dev C++ Version 5.11 C/C++, and the SVM part is integrated by calling the libsvm library [28] with all default setting parameters. The codes of both the 3-layer and 4-layer ANNs are modified using the source code provided in [39], which is coded in Python and run in Anaconda with epochs = 150, batch_size = 10, loss = “binary_crossentropy,” optimizer = “Adam,” activation = “ReLU” and 12 neurons in the first hidden layer, and activation = “sigmoid” in the second hidden layer of the 4-layer ANN. The test environment is Intel (R) Core (TM) i9-9900K CPU @ 3.60 GHz, 32.0 GB memory, and 64-bit Windows 10.
To validate the proposed CSVM, it was compared with the traditional SVM and the 3-layer and 4-layer ANNs on five well-known datasets: “Australian Credit Approval” (A), “breast-cancer” (B), “diabetes” (D), “fourclass” (F), and “Heart Disease” (H) [34], based on a tenfold cross-validation in Ex2. A summary of the five datasets is provided in Table 2. A brief introduction of the datasets is as follows:
“Australian Credit Approval” (A): this file concerns credit card applications. This database exists elsewhere in the repository (Credit Screening Database) in a slightly different form. This dataset is interesting because there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.
“breast-cancer” (B): the term “breast cancer” refers to a malignant tumor that has developed from cells in the breast. It is the most common cancer among women in almost all parts of the world. The dataset used consists of 699 instances that were classified as benign or malignant. Also, the dataset has 11 integer-valued attributes.
“diabetes” (D): diabetes mellitus is one of the most serious health challenges in both developing and developed countries. The diabetes dataset that we used contains 8 categories and 768 instances, with records on diabetes patients (several weeks to months worth of glucose, insulin, and lifestyle data per patient and a description of the problem domain), gathered from larger databases belonging to the National Institute of Diabetes and Digestive and Kidney Diseases.
“fourclass” (F): the dataset spreads irregularly over the space, including disconnected regions, and is not linearly separable. This nonlinearly separable dataset consists of 862 pieces of data in 2 dimensions.
“Heart Disease” (H): heart attack diseases remain the main cause of death worldwide, including South Africa, and possible detection at an earlier stage will prevent the attacks. This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date.
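A tenfold cross-validation split can be sketched as below. The function name `k_fold_indices`, the shuffling, and the seed are assumptions; the text does not state how the folds were formed or whether they were stratified.

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Return k (train_indices, test_indices) pairs that partition range(n_records)."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)         # assumed: random, unstratified folds
    folds = [indices[i::k] for i in range(k)]    # k folds of nearly equal size
    splits = []
    for i, test_idx in enumerate(folds):
        # Train on the other k - 1 folds, test on fold i.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


# 699 is the number of instances reported for the breast-cancer dataset.
splits = k_fold_indices(699, k=10, seed=1)
```

Each record is used for testing exactly once and for training in the other nine folds, so the per-fold accuracies can be averaged into one estimate.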
Let F, T, G, f, and N be the highest accuracy obtained in the end, the runtime, the earliest generation that obtained F, the number of filters generating F, and the number of solutions that have F, respectively. For ease of reference, the subscripts 25, 50, 75, 100, avg, max, min, and std denote the values obtained at the end of the 25^{th}, 50^{th}, 75^{th}, and 100^{th} generations, the average, the maximum, the minimum, and the standard deviation, respectively.
6.2. Ex1: Small-Sample OA Test
The orthogonal array used in this study is called L_{9} (3^{4}) as shown in Table 3.
In L_{9} (3^{4}), there are nine tries and four factors: C = (C_{g}, C_{p}, C_{w}), N_{sol}, N_{var}, and N_{filter}; each factor has three levels as shown in Table 4. The higher the level, the larger the related values, with the exception of C; for example, N_{sol} = 25 in level 1 is smaller than that in level 2. The most distinguishable difference among the three levels of C in Table 3 is that level 2 has a higher c_{r}, which increases the global search ability, while level 3 has a lower c_{r}, which enhances the local search ability.
The results obtained from the proposed CSVM in terms of the proposed small-sample OA test are listed in Table 5, in which each try is run fifteen times. The larger N_{filter}, N_{sol}, N_{var}, and/or N_{gen} are, the longer the runtime is. However, as Table 5 shows, larger values do not necessarily yield better fitness values. For example, the best fitness value has already been found in G_{25}, namely, F_{25} = F_{50} = F_{75}, in all datasets except Dataset D, whose best fitness value is found in G_{75}.
Adhering to Rule 1 listed in Section 4, only the try with the highest accuracy is selected to be used for the rest of the unsampled dataset. In this case, Try 7 is selected for Dataset D, since the greatest accuracy is obtained from Try 7 in G_{75}. From Rule 2, the runtime T must be considered if there are two tries tied in accuracy. For example, both Try 5 and Try 9 have the highest accuracy in Dataset A, but Try 5 is selected, since its runtime is only 43.43, which is considerably less than the runtime (149.28) of Try 9. Similarly, Try 7 and Try 1 are selected for Datasets B and H, respectively. The parameter setting for the rest of datasets, namely, Dataset F, is based on Rule 3, and Try 5 is selected in accordance with Rule 2.
Hence, we obtain the parameter settings listed in Table 6.
In Dataset F, there are only two attributes, which also results in two variables in each filter in Table 6. Another observation is that the values N_{filter}, N_{var}, N_{sol}, and N_{gen} are always the smallest, since all the best final fitness values are equal to 88.00000 regardless of the generation number. Then, the parameter setting with less runtime is selected, which is reasonable. This is similar to Dataset B, whose solution number is only 25, with less local search ability.
In Table 5, the accuracy levels obtained from SVM for the first fold of each dataset are listed in the second-to-last column, named 100F_{SVM}. From Table 5, all values in F_{Ngen} are better than those in the corresponding F_{SVM}. Moreover, also from Table 5, all fitness values obtained from G_{25}, namely, F_{25}, are already at least equal to F_{SVM}; that is, F_{SVM} ≤ F_{25} ≤ F_{50} ≤ F_{75} ≤ F_{100}. Hence, the proposed CSVM outperforms the traditional SVM in the small-sample OA, and the wide discrepancy between the final performances of the CSVM and the SVM is further reinforced in Subsection 6.3 using the parameter settings from the proposed small-sample OA.
6.3. Ex2
The results for G_{100} are collected to evaluate the effectiveness of the concept of the proposed small-sample OA and verify any possible effects on the average and the best fitness values of higher generation numbers. The complete data including the average, best, worst, and standard deviation of fitness of each fold for each dataset are listed in Tables 7–11.
6.3.1. Boxplots of the Experimental Results from Ex2
The results obtained from both the 3-layer and 4-layer ANNs are the least favorable, with a big gap separating them from the proposed CSVM and the traditional SVM. Hence, these two ANN-based methods are not discussed further, and we only focus on the proposed CSVM and the traditional SVM.
We determined that the higher the generation number, the better the average fitness value. However, it can be observed that the best fitness value remains unchanged from G_{75} to G_{100} except for the 8^{th} fold in Dataset A, the 1^{st} and 10^{th} folds in Dataset B, and the 5^{th} fold in Dataset F. Therefore, N_{gen} = 75 is acceptable and there is no need for N_{gen} = 100 to increase the fitness value of the best solution. In most boxplots, the box in G_{100} is positioned higher (higher fitness values) and is shorter (a narrower range of fitness values) than that of G_{25}. Hence, a larger generation number has a higher probability of enhancing the average solution quality at the cost of a longer runtime but ultimately does little to improve the best fitness value.
6.3.2. Number of Folds for Finding the Final Best Fitness Values
Table 12 lists the number of folds that have found the final best fitness values. The subscripts of dataset ID in the first column of Table 12 indicate the generation number used in Ex 2; for example, B_{25} indicates that 25 generations are used for Dataset B based on the parameters obtained with small-sample OA in Ex 1. Folds 7, 8, 10, 8, and 10 (see bold numbers in Table 12) under G_{25}, G_{25}, G_{75}, G_{25}, and G_{25} in Datasets A, B, D, F, and H, respectively, have found the best final fitness values.
The folds written as subscripts indicate the final best fitness values that have failed to be found. For example, 7_{1,4,7} in G_{25}, A_{25} represents that there are seven folds (from the ten folds) that have already found the best fitness values after 25 generations with the remaining three folds 1, 4, and 7 failing to do so in Dataset A.
To calculate the probability of the best final fitness value in Table 12, we add the folds (7 + 8 + 10 + 8 + 10) and divide the product by the total number of folds (50) in Table 6 to get 86%, which informs us that the probability of finding the best final fitness value without reaching G_{100}, which entails a significantly longer runtime, is 86%.
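The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are ours, and the assumption of ten folds per dataset is taken from the "seven of the ten folds" example in the text.

```python
# Number of folds (out of ten per dataset) that reached the final best
# fitness value early, as quoted in the text for Datasets A, B, D, F, and H.
folds_found = {"A": 7, "B": 8, "D": 10, "F": 8, "H": 10}
folds_per_dataset = 10  # assumption: ten folds per dataset

total_found = sum(folds_found.values())
total_folds = folds_per_dataset * len(folds_found)
share = total_found / total_folds

print(f"{total_found}/{total_folds} = {share:.0%}")  # prints "43/50 = 86%"
```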
Hence, the proposed small-sample OA is effective in setting parameters to increase the efficiency and solution quality of the proposed CSVM. The above observation further confirms that having better parameters ultimately negates the need for a greater generation number to increase the fitness of the best solution.
6.3.3. ANOVA of the Experimental Results
To investigate the small-sample OA, an analysis of variance (ANOVA) is carried out on the average fitness obtained from the proposed CSVM with the parameters set by the small-sample OA, as shown in Table 13. A cell marked with “v” indicates that, in the fold denoted by the column, there is a significant difference between the pair of generation numbers listed in the row. For example, the average fitness values obtained from G_{25} and G_{75} differ significantly in all folds of Dataset A.
According to Table 13, only Datasets A and F call for a minimum generation number, 75 and 50, respectively, to make the gap between the fitness values in each fold insignificant. Hence, the proposed small-sample OA is still effective in determining the generation number that removes the significant differences among the fitness values, even though it focuses only on the best fitness value and not on the average fitness value.
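As an illustration of the kind of pairwise test summarized in Table 13, the following sketch computes a one-way ANOVA F statistic in plain Python. The function name and the fitness values are hypothetical and invented for the example; they are not the data behind Table 13, and the exact test settings used in the experiments are not stated here.

```python
from statistics import mean

def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    Each group is a sequence of observations. The within-group variance
    must be nonzero, otherwise the ratio is undefined.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical fitness values (percent accuracy) from repeated runs of one
# fold under two generation numbers; these numbers are made up.
runs_g25 = [84.1, 84.6, 83.9, 84.3, 84.0]
runs_g75 = [85.2, 85.0, 85.4, 85.1, 85.3]

f_stat, df_b, df_w = one_way_anova_f(runs_g25, runs_g75)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The resulting F value would then be compared with the critical value of the F distribution at the chosen significance level; a larger F corresponds to a cell marked “v”.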
6.3.4. MPI of the Experimental Results
To further investigate the performance of the proposed CSVM, two other indices, the average maximum possible improvement (MPI_{avg}%) and the best maximum possible improvement (MPI_{max}%), are introduced and defined as
The MPI_{avg}% and MPI_{max}% results are listed in Tables 14 and 15, respectively, where the cells marked “” indicate that both the related F_{svm} and the average and/or the best fitness obtained are 100% correct, for example, the 2^{nd}, 4^{th}, and 5^{th} folds in Dataset B in Table 14. The bold numbers denote the best values among all folds for each dataset under the same generation number. Note that a value of 100, as in the 7th fold of Dataset B in Table 15, indicates that the related accuracy is 100%.
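The formula itself does not appear in this text. The description above (a value of 100 when the accuracy reaches 100%, and no value when F_{svm} is already 100%) is consistent with measuring the gain over F_{svm} as a share of the remaining headroom, that is, MPI% = 100 × (F − F_{svm}) / (100 − F_{svm}), where F is the average fitness for MPI_{avg}% and the best fitness for MPI_{max}%. The sketch below assumes that form; the function name and the sample inputs are hypothetical.

```python
def mpi_percent(fitness, f_svm):
    """Assumed MPI: gain over the SVM accuracy as a share of the headroom.

    Both arguments are accuracies in percent. Returns None when the SVM
    accuracy is already 100%, since no improvement is possible and the
    ratio is undefined (the marked cells in Tables 14 and 15).
    """
    headroom = 100.0 - f_svm
    if headroom == 0:
        return None
    return 100.0 * (fitness - f_svm) / headroom

# Hypothetical accuracies, chosen only to exercise the three cases.
print(mpi_percent(90.0, 80.0))    # half of the possible gain: 50.0
print(mpi_percent(100.0, 80.0))   # accuracy reaches 100%: 100.0
print(mpi_percent(100.0, 100.0))  # SVM already at 100%: None
```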
As shown in Tables 14 and 15, the proposed CSVM achieves improvements of at least 14.96% and 20.17%, and at most 50.68% and 63.10%, in MPI_{avg}% and MPI_{max}%, respectively. These results demonstrate the effectiveness of the proposed CSVM in comparison with the traditional SVM. It can also be observed that the more attributes a dataset has, the greater the improvement obtained by the proposed CSVM, regardless of the number of records. Ultimately, compared to the traditional SVM, the proposed CSVM is more suitable for small data.
7. Conclusions and Future Work
Classification is of utmost importance in data mining. The proposed new classifier, CSVM, is a convolutional SVM modified with a new repeated-attribute convolution product, in which all variables in each filter are updated and trained by the proposed novel SSO. Equipped with a self-adaptive structure and pFilter, this greedy SSO is of the one-solution, one-filter, one-variable type, and its parameters are set by the proposed small-sample OA.
According to the experimental results of Ex 2 in Section 6 for the five UCI datasets, namely, Australian Credit Approval, breastcancer, Diabetes, fourclass, and Heart Disease [36], the proposed CSVM with the parameter setting selected in Ex 1 outperforms the traditional SVM, the 3-layer ANN, and the 4-layer ANN, with an improvement over the traditional SVM of at least 14.96% and up to 50.68% in MPI_{avg}%. Hence, the proposed small-sample OA discussed in Subsection 4.2 enables the CSVM to improve its overall performance, and the proposed CSVM successfully combines the advantages of SVM, the convolution product, and SSO.
The classifier design method is a crucial element in providing useful information in the modern world. The experimental comparisons indicate that further research on the proposed CSVM is worthwhile: in future work, it will be applied to multiclass datasets with more attributes, classes, and records, based on several references [40, 41], and combined with particular feature selection methods.
Data Availability
To validate the proposed CSVM, it was compared with the traditional SVM and the 3-layer and 4-layer ANNs on five well-known datasets: “Australian Credit Approval” (A), “breastcancer” (B), “Diabetes” (D), “fourclass” (F), and “Heart Disease” (H). The datasets are available at http://archive.ics.uci.edu/ml/.
Disclosure
An earlier version of this article was submitted to arXiv as a temporary submission for reference only, without any transfer of copyright.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported in part by the Ministry of Science and Technology, R.O.C., under Grants MOST 102-2221-E-007-086-MY3 and MOST 104-2221-E-007-061-MY3 and the National Natural Science Foundation of China under Grant 61702118.