Abstract

The support vector machine (SVM) and deep learning (e.g., convolutional neural networks (CNNs)) are the two most famous algorithms for small and big data, respectively. Nonetheless, small datasets are often very important, costly, and difficult to obtain in a short time. This paper proposes a novel convolutional SVM (CSVM) that has the advantages of both CNN and SVM to improve the accuracy and effectiveness of mining smaller datasets. The proposed CSVM adapts the convolution product from CNN to learn new information hidden deeply in the datasets. In addition, it uses a modified simplified swarm optimization (SSO) to train the CSVM and update the classifiers, and the traditional SVM is implemented as the fitness function of the SSO to estimate the accuracy. To evaluate the performance of the proposed CSVM, experiments were conducted on five well-known benchmark datasets for the classification problem. The numerical results compared favorably with those obtained using the SVM, a 3-layer artificial NN (ANN), and a 4-layer ANN. The results of these experiments verify that the proposed CSVM with the proposed SSO can effectively increase classification accuracy.

1. Introduction

Data mining is an effective method for examining and learning from extensive compound datasets of varying quality [1] and has been broadly applied to numerous practical problems in medicine [2–4], engineering [5], time series data [6], image classification [7], speech recognition [8], handwriting recognition [9], management [10], and social sciences [11], with classification being one of the most popular topics in data mining. Numerous classifiers for data mining have been established, such as support vector machines (SVMs) [3, 4, 7] and deep learning algorithms [8, 9, 12].

Deep learning based on artificial neural networks (ANNs) is made up of neurons with learnable weights and biases such that the neural network, a special mathematical function, fits the data in the dataset as closely as possible [12, 13]. Deep learning techniques include convolutional neural networks (CNNs) for continuous space data types (e.g., image and speech recognition) [7, 8, 14], recurrent neural networks (RNNs) for time series data types (e.g., stock markets and language modeling) [12], and generative adversarial networks (GANs) for generating new examples and classifying examples [15]. Deep learning is an adequate and straightforward data-mining method for big data [12, 13]. However, deep learning techniques need big data to learn classification rules, that is, they work well only for large datasets, which poses an enormous challenge to many applications with respect to obtaining large enough datasets [12, 13]. Furthermore, deep learning relies on good hardware, especially the graphics processing unit (GPU), to achieve better performance, but such hardware is still expensive [16, 17].

The SVM is another well-known and effective supervised learning model for selecting attributes and classifying data. Before the rise of deep learning, the SVM outperformed ANNs in various real-life applications in medicine [3, 4], the semiconductor industry [18], online analysis [19], spectral unmixing resolution [20], imbalanced datasets [21], mining financial distress [22], data classification [23], and so forth [24–26]. In comparison with deep learning techniques that try to fit data in terms of ANNs, the SVM separates (rather than fits) different classes of data based on kernels through mathematical optimization [27, 28]. In addition, an SVM achieves high accuracy with less computational power and small data, which addresses two shortcomings of deep learning [24–26]. Therefore, besides the original SVM, various enhanced SVMs were developed before the advent of deep learning [21, 24–26]. SVMs are discussed in detail in Section 5.1.

Small data are well-formatted data with small volumes that are accessible, understandable, and actionable for decision makers [29]. The value of data lies in the information content, not the volume of data [30]. For some cases, such as targeted marketing campaigns or delivering personalized experiences, full-on big data might not be appropriate because it is not required [31]. Conversely, small data extract an individual's data and provide valuable information to help decision makers formulate strategies. Moreover, the occurrence of small data is rare, and the process of collecting them is expensive and strenuous [4, 21]. Hence, improving the data mining of small data will aid in making useful, cost-efficient, and timely decisions in small data applications.

Deep learning techniques and SVMs belong to a broader family of machine learning algorithms. Deep learning techniques (e.g., CNNs), which are based on neural networks, are powerful for mining big data but less effective on smaller datasets. Conversely, SVMs outperform all neural network types on smaller datasets but are less effective in mining big data. This paper proposes a novel convolutional SVM (CSVM) that has the advantages of both the SVM and deep learning to enhance the SVM by maximizing its prediction accuracy, and tests it on two-class classification datasets.

The proposed CSVM employs a supervised learning technique that is based on simplified swarm optimization (SSO), which is another powerful machine learning algorithm [2, 6, 32–35]. Numerical experiments and comparative research with ANNs and the traditional SVM show the accuracy and effectiveness of the proposed CSVM tested on five two-class datasets.

To summarize the above, the theoretical contributions of this paper are (1) a novel convolutional SVM (CSVM) that combines the advantages of both the SVM and deep learning, that is, the SVM together with the vital operation techniques of CNNs, including stride and convolution, to enhance the SVM, and (2) a one-solution, one-filter, one-variable greedy SSO update mechanism that prevents solutions that are near optima from being pushed away from their current positions and reduces the runtime.

2. Proposed and Traditional Convolution Products

The major difference between the proposed CSVM and the traditional SVM is the convolution product. Hence, the traditional and the proposed new convolution products are introduced and discussed in Section 2.

2.1. Convolution-Related Concept

CNNs represent some of the most significant models of deep learning, and their performance has been verified in numerous recognition research areas. Among the vital operation techniques of CNNs, we introduce those that are used in this paper [12, 13] (a short numerical sketch follows this list).
(1) Padding: to prevent the reduction in data size caused by the convolution process in the next layer, zeros are added around the input image; this action is called padding.
(2) Stride: the stride is the horizontal or vertical distance the kernel moves in each step. The greater the stride, the more independent the neighboring values in the convolution process.
(3) Convolution: in each convolution operation, the values of the input and of the kernel (filter), which moves according to the given stride after padding, are multiplied. These products are then summed and filled into the corresponding position of the next layer.
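As a concrete illustration of these three operations, the following short NumPy sketch (our own illustrative example, not the authors' code) applies a 1-D kernel with a chosen stride, with and without zero padding; padding by one zero on each side keeps the output the same length as the input.

import numpy as np

def conv1d(signal, kernel, stride=1, pad=0):
    """Plain 1-D convolution (correlation form) with zero padding and stride."""
    x = np.pad(signal, pad)                          # padding: add zeros around the input
    k = len(kernel)
    out = []
    for start in range(0, len(x) - k + 1, stride):   # the kernel moves by `stride` each step
        window = x[start:start + k]
        out.append(float(np.dot(window, kernel)))    # convolution: multiply and sum
    return np.array(out)

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([-1.0, 0.0, 1.0])
print(conv1d(signal, kernel, stride=1, pad=0))   # -> [2. 2. 2.]       (output shrinks)
print(conv1d(signal, kernel, stride=1, pad=1))   # -> [2. 2. 2. 2. -4.] (size preserved)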

2.2. Proposed Convolution Product with Repeated Attributes

Suppose that Natt, Nsol, Nfilter, and Nvar are the numbers of attributes, solutions, filters constructed in each solution, and variables contained in each filter, respectively. Let vr,a be the value of the ath attribute in the rth record, and let v(f)r,a be the value of the ath attribute in the rth record after using the fth filter, with v(0)r,a = vr,a, where a = 1, 2, …, Natt, s = 1, 2, …, Nsol, and f = 1, 2, …, Nfilter. For example, nine attributes are used in the breast cancer dataset of the University of California Irvine (UCI) [36], and the vector representing the first normalized record is listed as follows:

The new value v(f)r,a is calculated using the convolution product of the record and the filter Xs,f as follows:

v(f)r,a = xs,f,1 · v(f−1)r,a + xs,f,2 · v(f−1)r,a+1 + ⋯ + xs,f,Nvar · v(f−1)r,a+Nvar−1. (2)

From equation (2), Nvar attributes are included in the convolution product for the ath attribute when a + Nvar − 1 ≤ Natt. However, the attributes vr,Natt+1, vr,Natt+2, …, vr,Natt+Nvar−1 do not exist when equation (2) is used to update the last Nvar − 1 attributes, that is, when a + Nvar − 1 > Natt.

Let filter Xs, 1 = [xs, 1, 1, xs, 1, 2, xs, 1, 3] = [−1, 0, 1]. The procedures for generating the new attributes using the convolution product are listed as follows:

From the above, the first and second old attributes (i.e., v1,1 and v1,2) are used only once, as shown in equation (3), and twice, as shown in equations (3) and (4), in generating the new attributes, respectively. Similarly, the last and the second-last attributes in I1 (i.e., v1,9 and v1,8) appear only in equation (9) and in equations (8) and (9), respectively. Moreover, no new attributes v(1)1,8 and v(1)1,9 can be generated based on equation (2).

There is no padding in the proposed CSVM. However, we need to guarantee that the two following situations are satisfied to fix the above problems:
(1) Each attribute is included in the same number (i.e., Nvar) of convolution products.
(2) The last Nvar − 1 attributes still exist after each convolution product.

The first (Nvar − 1) attributes are repeated and appended after the last attribute of the same record such that the total number of attributes is an integer multiple of Nvar; that is, vr,Natt+k = vr,k for k = 1, 2, …, Nvar − 1 and f = 1, 2, …, Nfilter. Hence, following the same example discussed above, we have

Thus, each old attribute is now included in exactly three convolution products, and we obtain the new I1 accordingly.

Let Ii,j be the updated ith record after using the jth filter, with Ii = Ii,0. The next example demonstrates the updated I1 after two filters are used, each having Nvar = 3 variables; Xs,2 = [xs,2,1, xs,2,2, xs,2,3] = [1.8, −0.9, 0.7].

Thus, after using the two filters Xs,1 and Xs,2, we have

The basic idea of the proposed convolution product with repeated attributes is that the first (Nvar − 1) attributes are repeated and appended after the last attribute of each (updated) record such that the total number of attributes is an integer multiple of Nvar; that is, vr,Natt+k = vr,k for k = 1, 2, …, Nvar − 1. The pseudocode of the proposed convolution product with repeated attributes is listed in Algorithm 1.

Input: The rth record Ir = (vr,1, vr,2, …, vr,Natt) and the sth solution Xs.
Output: The transformed record Ir ⊗ Xs = (v1, v2, …, vNatt).
STEP C0. Let f = 1 and vi = vr,i for i = 1, 2, …, Natt.
STEP C1. Let a = 1 and vi = vk for i = Natt + 1, Natt + 2, …, Natt + Nvar − 1, where k = i − Natt.
STEP C2. Let b = 0, i = a, and j = 1.
STEP C3. Let b = b + xs,f,j · vi.
STEP C4. If j < Nvar, let i = i + 1, j = j + 1, and go to STEP C3.
STEP C5. Let va = b. If a < Natt, let a = a + 1 and go to STEP C2.
STEP C6. If f < Nfilter, let f = f + 1 and go to STEP C1.
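The following Python sketch mirrors Algorithm 1: for each filter, the first Nvar − 1 attributes are appended after the last attribute and every attribute position receives exactly one convolution product, so the record keeps its Natt attributes and each old attribute contributes to exactly Nvar products. The record values below are illustrative only (the actual first breast-cancer record is not reproduced here).

import numpy as np

def csvm_convolution(record, filters):
    """Convolution product with repeated attributes (a sketch of Algorithm 1).

    record  : 1-D array of the Natt (normalized) attribute values of one record.
    filters : list of 1-D arrays, one per filter, each of length Nvar.
    Returns the transformed record, whose length is still Natt.
    """
    values = np.asarray(record, dtype=float)
    n_att = len(values)
    for filt in filters:                                   # filters are applied one after another
        n_var = len(filt)
        # repeat the first (Nvar - 1) attributes after the last attribute
        extended = np.concatenate([values, values[:n_var - 1]])
        # every attribute position a gets exactly one convolution product of length Nvar
        values = np.array([np.dot(extended[a:a + n_var], filt) for a in range(n_att)])
    return values

# Nine attributes and the filter Xs,1 = [-1, 0, 1], as in the example of Section 2.2.
record = np.array([0.5, 0.1, 0.1, 0.1, 0.2, 0.1, 0.3, 0.1, 0.1])   # illustrative values only
print(csvm_convolution(record, [np.array([-1.0, 0.0, 1.0])]))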

Additionally, we obtain the following properties after employing the proposed convolution product with repeated attributes.

Property 1. If xs,f,1 = α and xs,f,k = 0 for all k = 2, …, Nvar and all f = 1, …, Nfilter, then v(f)r,a = α^f · vr,a for all a = 2, …, Natt and f = 1, …, Nfilter.

3. Proposed and Traditional SSO

In the proposed CSVM, all values in the filters of the proposed convolution product with repeated attributes are updated based on the proposed new SSO. The traditional SSO is introduced briefly, and the proposed SSO, including the new self-adaptive solution structure with pFilter, the novel one-solution, one-filter, one-variable greedy update mechanism, and the fitness function, is presented in Section 3.

3.1. Traditional SSO

The SSO is one of the simplest machine-learning methods [2, 6, 32–35] in terms of its update mechanism. It was first proposed by Yeh and has proven to be a very useful and efficient algorithm for optimization problems [33, 34], including data mining [2, 6]. Owing to its simplicity and efficiency, SSO is used here to find the best values in the filters of the proposed CSVM.

The basic idea of SSO is that each variable, such as the jth variable xi,j in the ith solution, needs to be updated based on the following stepwise function [2, 6, 32–35]:

xi,j =
  gBestj    if ρ ∈ [0, Cg),
  pBesti,j  if ρ ∈ [Cg, Cp),
  xi,j      if ρ ∈ [Cp, Cw),
  x         if ρ ∈ [Cw, 1],   (15)

where the value ρ ∈ [0, 1] is generated randomly and the parameters Cg, Cp − Cg, Cw − Cp, and 1 − Cw are all in [0, 1] and are the probabilities that the current variable is copied and pasted from the best of all solutions (gBest), the best ith solution (pBesti), the current solution, and a randomly generated feasible value x, respectively.
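A minimal sketch of this stepwise update for one variable is shown below; the thresholds Cg = 0.4, Cp = 0.7, Cw = 0.9 and the feasible range [−2, 2] are illustrative assumptions, not the settings tuned in this paper.

import random

def sso_update(x, pbest, gbest, cg=0.4, cp=0.7, cw=0.9, lo=-2.0, hi=2.0):
    """One SSO update of a single variable via the stepwise function of equation (15)."""
    rho = random.random()
    if rho < cg:
        return gbest                     # copy from the best of all solutions
    if rho < cp:
        return pbest                     # copy from the best value found by this solution
    if rho < cw:
        return x                         # keep the current value
    return random.uniform(lo, hi)        # otherwise, a randomly generated feasible value

print(sso_update(x=0.3, pbest=0.5, gbest=-0.2))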

There are different variants of the traditional SSO that are customized to different problems in light of the no-free-lunch theorem; for example, the four cases in equation (15) are reduced to three to increase efficiency; the parameters Cg, Cp, and Cw are made self-adaptive; special values or equations are implemented to replace gBestj, pBesti,j, xi,j, and x; or only a certain number of variables are selected to be updated, and so forth. However, the SSO update mechanism is always based on a stepwise function.

3.2. Fitness Function

Fitness functions guide solutions toward optimization goals in artificial intelligence methods such as the proposed CSVM, the traditional SVM, and the CNN. The accuracy obtained by the SVM on the records transformed by the proposed convolution products is adopted here as the fitness F(Xs) to be maximized in the CSVM:
Input: all records Ir for r = 1, 2, …, Nrec and the sth solution Xs.
Output: the fitness F(Xs).
STEP F0. Calculate Ir = Ir ⊗ Xs based on the pseudocode provided in Section 2.2 for r = 1, 2, …, Nrec.
STEP F1. Classify {I1, I2, …, INrec} using the SVM and let the resulting accuracy be F(Xs).
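A minimal sketch of this fitness evaluation is given below. Scikit-learn's SVC (which wraps libsvm, the library called by the paper) with default settings stands in for the SVM; `transform` is assumed to be the convolution-product sketch from Section 2.2; and, since the paper does not spell out the evaluation protocol inside STEP F1, the resubstitution accuracy on the transformed records is used for simplicity.

import numpy as np
from sklearn.svm import SVC

def fitness(records, labels, filters, transform):
    """F(Xs): SVM accuracy on the records transformed by the filters of solution Xs."""
    transformed = np.array([transform(r, filters) for r in records])   # STEP F0
    clf = SVC()                      # default RBF SVM (libsvm via scikit-learn) as a stand-in
    clf.fit(transformed, labels)
    return clf.score(transformed, labels)                              # STEP F1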

3.3. Self-Adaptive Solution Structure and pFilter

In the proposed CSVM, each variable of all filters in each solution is initialized randomly from [−2, 2]. Each filter and each solution are represented by an Nvar × 1 vector and an Nfilter × Nvar matrix, respectively, since the number of filters may be more than one. For example, the sth solution Xs and the fth filter Xs,f in Xs are denoted as follows:

The number of filters is the same, that is, Nfilter, for each solution in all generations. However, a greater number of filters does not always guarantee a better fitness value. Hence, we need to record the best number of filters for each solution. Let filter j be the best filter of solution s = 1, 2, …, Nsol, and define pFilter[s] = j if F[Xs,f] ≤ F[Xs,j] for all f = 1, 2, …, Nfilter. Note that Xh,i is the best solution, with pFilter[h] = i, among all existing solutions if F[Xs,f] ≤ F[Xh,i] for all s = 1, 2, …, Nsol and f = 1, 2, …, Nfilter.

In the end, only the best solution (e.g., Xs) and its best number of filters, namely, Xs, 1, Xs, 2, …, Xs, j, where pFilter[s] = j, are reported. In addition, the update mechanism is based on the best filter in the proposed CSVM. Hence, the solution is self-adapted by the best number of filters.
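A small sketch of how pFilter[s] could be computed is given below, under the assumption that F(Xs,f) denotes the fitness of solution s when only its first f filters are applied (our reading of the definition above); the helper name `fitness_of_prefix` is hypothetical, e.g. the fitness sketch of Section 3.2 applied to a prefix of the filters.

import numpy as np

def best_filter_count(solution_filters, fitness_of_prefix):
    """pFilter[s]: the number of leading filters of solution s that gives the best fitness."""
    scores = [fitness_of_prefix(solution_filters[:f + 1])
              for f in range(len(solution_filters))]
    return int(np.argmax(scores)) + 1      # filters are counted from 1 in the paper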

3.4. One-Solution, One-Filter, One-Variable Greedy SSO Update Mechanism

The proposed new one-solution, one-filter, one-variable greedy SSO update mechanism is discussed in this subsection.

3.4.1. One-Solution Is Selected Randomly to be Updated in Each Generation

In the proposed CSVM, all values in the filters are variables that must be determined to implement the convolution products. Without help from a GPU, it takes a long time to update all variables to deepen the SVM. Hence, unlike the traditional algorithms, including SSO, the genetic algorithm (GA), and particle swarm optimization (PSO), in which all solutions need to be updated, only one solution is selected randomly for updating in each generation of the proposed new SSO update mechanism. Let solution s be selected to be updated based on the following equations, where ρ[0, 1] is a random floating-point number generated from the interval [0, 1], ρ[1, Nsol] ∈ {1, 2, …, Nsol} is the index of a solution selected randomly, gBest is the index of the best solution found, and 0 denotes a new solution generated randomly. The new updated solution Xs will be either discarded or kept in place of the old Xs based on the process described next.

3.4.2. One-Filter One-Variable Greedy Update Mechanism

In the traditional SSO, all variables need to be updated (the all-variable update mechanism), which gives a higher probability of escaping local traps compared with updating only some variables. However, the all-variable update mechanism may push solutions that are near optima away from their current positions. Additionally, its runtime is Nsol times that of the one-variable update, which selects one variable randomly to be updated. Hence, to reduce the runtime, only one variable in one filter of the solution selected in Section 3.4.1 is updated.

Let s be the solution selected to be updated. In the proposed new SSO, only one filter, say f ∈ {1, 2, …, pFilter[s]}, in solution s is chosen randomly. Moreover, one variable, say xs,f,k with k ∈ {1, 2, …, Nvar}, in such filter Xs,f is also selected randomly to be updated based on the following simple process, where ρ denotes a random number generated in the update mechanism and its subscripts give the lower and upper bounds of the interval from which it is drawn. The interval of ρ is derived from multiple randomized trial-and-error results, and 0.05 is the step size of the local search, chosen to ensure that the local search finds a sufficiently fine optimal solution. After resetting all variables in the filters Xs,h to random numbers generated from [−2, 2] for all h > f, we have

Also, F[Xs, l] = F[Xs, f − 1] for all l < f.

Moreover, the updated solution X, including the newly updated variables and filters, is discarded if its fitness value is not better than that of Xs; that is, we let Xs = X only if F(Xs) < F(X).

3.5. Pseudocode of the Proposed SSO

The pseudocode of the proposed SSO based on the new self-adaptive solution structure, pFilter, and the new update mechanism are listed in Algorithm 2.

Input: A random selected solution (e.g., Xs) with its pFilter.
Output: The updated Xs.
STEP U0. Generate a random number ρ[0,1] from [0, 1] and select a solution, say Xs with s ∈ {1, 2, …, Nsol}, based on equation (19).
STEP U1. Select a filter, say Xs,j with j ∈ {1, 2, …, pFilter[s]}.
STEP U2. Update Xs to X based on equation (21).
STEP U3. Decide whether to let Xs = X or discard X based on equation (22).
STEP U4. If Xs = X, let pFilter[s] = f, where F (Xs,i) ≤ F (Xs,f) for all i = 1, 2, …, Nfilter. Otherwise, halt.
STEP U5. If F (XgBest, pFilter[gBest]) ≤ F (Xs, pFilter[s]), let gBest = s.
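The sketch below illustrates one generation of Algorithm 2 under stated assumptions: the solution is picked uniformly at random (the exact selection probabilities of equation (19) are not reproduced), the chosen variable is nudged by a random multiple of the 0.05 step size in place of equation (21), and `fitness_of(filters)` plays the role of F(Xs); it is not the authors' implementation.

import random
import numpy as np

def greedy_sso_step(solutions, pfilter, gbest, fitness_of, n_var, lo=-2.0, hi=2.0, step=0.05):
    """One one-solution, one-filter, one-variable greedy update (a sketch of Algorithm 2)."""
    s = random.randrange(len(solutions))                    # STEP U0: choose one solution
    f = random.randrange(pfilter[s])                        # STEP U1: choose one used filter
    k = random.randrange(n_var)                             # ... and one variable in it
    candidate = [flt.copy() for flt in solutions[s]]
    candidate[f][k] += step * random.uniform(-1.0, 1.0)     # STEP U2: small local move
    for h in range(f + 1, len(candidate)):                  # reset the filters after f
        candidate[h] = np.random.uniform(lo, hi, size=n_var)
    if fitness_of(candidate) > fitness_of(solutions[s]):    # STEP U3: greedy acceptance
        solutions[s] = candidate
        # STEP U4: pFilter[s] would be refreshed here, e.g. with best_filter_count (Section 3.3).
        if fitness_of(solutions[s]) > fitness_of(solutions[gbest]):   # STEP U5
            gbest = s
    return gbest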

4. Proposed Small-Sample OA to Tune Parameters

It is important to select the most representative combination of parameters to obtain good results for all algorithms, such as the three parameters Cg, Cp, and Cw in SSO. To reduce the computation burden, a novel concept called the small-sample orthogonal array (OA) is proposed in terms of the OA test to tune parameters in Section 4.

4.1. OA

The design of experiments (DOE) adopts an array design that arranges the tests and factors in rows and columns, respectively, such that rows and columns are independent of each other and each factor takes exactly one level in each test [37]. The DOE is able to select better parameters from representative predefined combinations to reduce the number of tests [2, 38].

The Taguchi OA test, first developed by Taguchi [37], is a DOE that is implemented to achieve the objective of this study. An OA is denoted by Ln(a^b), where n, a, and b are the numbers of tries, levels of each factor, and factors, respectively. For example, Table 1 represents an OA denoted by L9(3^4).

From Table 1, we can see that the characteristics of the OA are orthogonal as follows (the short check after this list verifies both properties):
(1) The number of occurrences of each level in each column is equal; for example, the levels 1, 2, and 3 each appear three times in every column of Table 1.
(2) All ordered pairs of levels of any two factors also appear exactly once over the tests, for example, (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), and (3, 3) in columns 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, and 3 and 4, which ensures that each level is dispersed evenly over the complete combinations of the levels of the factors.
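The snippet below verifies both orthogonality properties on one standard form of the L9(3^4) array (the paper's Table 1 may list a permuted but equivalent form).

import numpy as np
from itertools import combinations

# A standard L9(3^4) orthogonal array: 9 tries (rows), 4 factors (columns), 3 levels.
L9 = np.array([
    [1, 1, 1, 1], [1, 2, 2, 2], [1, 3, 3, 3],
    [2, 1, 2, 3], [2, 2, 3, 1], [2, 3, 1, 2],
    [3, 1, 3, 2], [3, 2, 1, 3], [3, 3, 2, 1],
])

# Property (1): each level appears the same number of times in every column.
for col in range(4):
    assert np.bincount(L9[:, col])[1:].tolist() == [3, 3, 3]

# Property (2): every ordered pair of levels appears exactly once for any two columns.
for c1, c2 in combinations(range(4), 2):
    assert len({(a, b) for a, b in zip(L9[:, c1], L9[:, c2])}) == 9

print("L9(3^4) satisfies both orthogonality properties.")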

4.2. Proposed Small-Sample OA

There are three major methods for tuning parameters:
(1) The trial-and-error method: it implements the tests exhaustively by trying all possible cases to find the one with the best results. It is the simplest and the most inefficient method.
(2) The parameter-adapted method: it selects and tests some set of parameters from existing parameters that have already been used in other applications. This method may have issues in identifying the characteristics of new problems.
(3) The DOE: it selects the parameters from the experiment design. Compared with the two aforementioned methods, it is the most efficient and effective one. However, it faces an efficiency problem in large datasets or when it needs to be repeated very often.

Hence, to overcome the aforementioned problems, a novel method called the small-sample OA test is proposed to improve the OA method for tuning parameters. To reduce the runtime, the proposed small-sample OA test samples only a few records randomly from the dataset and conducts the OA test on this small sample to find the best parameters, that is, those that result in the highest accuracy, the shortest runtime, and/or the largest number of solutions attaining the highest accuracy, based on the three following rules:
Rule 1. Select the try with the highest accuracy among all others.
Rule 2. If there is a tie under Rule 1, select the try with the shortest runtime, with a big gap between such runtime and the others.
Rule 3. If there is still a tie under Rule 2, select the try with the largest number of solutions that reach the highest accuracy.

Then, this selected parameter set is applied to the rest of the unsampled dataset. The example for this proposed test is provided in Section 6.
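A minimal sketch of how Rules 1–3 pick one try is given below; the dictionary keys are our own names (not from the paper), and the "big gap" qualifier of Rule 2 is omitted for simplicity.

def select_best_try(tries):
    """Pick one OA try by Rules 1-3 of the proposed small-sample OA."""
    best_acc = max(t['accuracy'] for t in tries)                       # Rule 1: highest accuracy
    tied = [t for t in tries if t['accuracy'] == best_acc]
    if len(tied) > 1:
        best_rt = min(t['runtime'] for t in tied)                      # Rule 2: shortest runtime
        tied = [t for t in tied if t['runtime'] == best_rt]
    if len(tied) > 1:
        tied = [max(tied, key=lambda t: t['num_best_solutions'])]      # Rule 3: most best solutions
    return tied[0]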

5. Proposed CSVM and Traditional SVM

The proposed CSVM is a convolutional SVM modified by employing a new convolution product, which is updated based on the proposed new SSO. The traditional SVM is introduced briefly, and then the pseudocode of the proposed CSVM is presented.

5.1. Traditional SVM

SVMs are excellent machine learning tools for binary classification cases [27, 28]. The purpose of an SVM is to maximize the margin between two support hyperplanes that separate two classes of data. Let X = {z1 = (x1, y1), z2 = (x2, y2), …, zn = (xn, yn)} be a two-class dataset for training. For example, in a linear SVM, a hyperplane is a line, and we want to find the best hyperplane WTX + b = 0 to separate the two classes of data in X, where W is the weight vector perpendicular to the hyperplane and b is the bias, such that the margin 2/||W|| is as large as possible (equivalently, ||W|| is as small as possible). The above linear SVM is a constrained optimization model that can be written as follows [27, 28]:

min (1/2)||W||^2 subject to yi(WTxi + b) ≥ 1 for i = 1, 2, …, n.

After applying the Lagrange multiplier method to the constrained optimization model, the SVM problem becomes a convex quadratic programming problem that can be presented as follows [27]:

max Σi λi − (1/2) Σi Σj λi λj yi yj xiTxj subject to Σi λi yi = 0 and λi ≥ 0 for i = 1, 2, …, n,

where λi is the Lagrange multiplier.

For high-dimensional data, it is very difficult to find a single linear hyperplane to separate two different sets. Hence, the data are mapped into a higher-dimensional space using a function that is called the kernel in the SVM. Then, a hyperplane can be found to separate the mapped data. Some popular kernel functions are the linear kernel K(xi, xj) = xiTxj, the polynomial kernel K(xi, xj) = (γ xiTxj + c)^d, the radial basis function (RBF) kernel K(xi, xj) = exp(−γ||xi − xj||^2), and the sigmoid kernel K(xi, xj) = tanh(γ xiTxj + c) [27, 28].
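The short demo below contrasts these kernels on a two-class dataset that is not linearly separable; it uses scikit-learn's SVC, which wraps the same libsvm library used in this paper, and the synthetic moons data is our own illustrative choice.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two-class data that is not linearly separable in the input space.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):   # the popular kernels listed above
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    print(f"{kernel:8s} test accuracy: {clf.score(X_te, y_te):.3f}")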

For more details of SVM and its development, the reader is referred to [25, 26].

5.2. Pseudocode of the Proposed CSVM

The pseudocode of the proposed CSVM is described below; it integrates the proposed convolution product discussed in Subsection 2.2, the proposed SSO introduced in Section 3, and the proposed small-sample OA presented in Subsection 4.2 (Algorithm 3).

 PROCEDURE CSVM0
Input: A dataset.
Output: The accuracy of the classifier CSVM.
STEP 0. Separate the dataset into k folds randomly, and then select one fold (e.g., the kth fold of the dataset); the small-sample OA on this fold has N tries.
STEP 1. Implement CSVM0 (i, k) using the ith parameter setting on the kth fold of the dataset for i = 1, 2, …, N, and then select the parameter setting of the try (e.g., i) with the highest accuracy among all N tries.
STEP 2. Implement CSVM0 (i, j) on the jth fold of the dataset using the parameter setting of the ith try for j = 1, 2, …, k.
PROCEDURE CSVM0 (α, β)
Input: The parameter setting in the αth try of the small-sample OA and the βth fold of the dataset.
Output: The accuracy.
STEP W0. Generate solutions Xs randomly, then calculate F (Xs, f) based on the proposed convolution product and the SVM. Find pFilter[s] and gBest such that F (XgBest, pFilter[gBest]) ≥ F (Xs, f), where s = 1, 2, …, Nsol and f = 1, 2, …, Nfilter.
STEP W1. Let t = 1.
STEP W2. Update a randomly selected solution based on the pseudocode of the new SSO provided in Subsection 3.5 and the parameter setting in the αth try of OA.
STEP W3. Increase the value of t by 1, that is, let t = t + 1, and then go to STEP W2 if t < Ngen.
STEP W4. Halt, F (XgBest, pFilter[gBest]) is the accuracy, and XgBest, pFilter[gBest] is the classifier.

6. Experimental Results and Summary

There are two experiments, Ex1 and Ex2, in this study. Ex1 is based on the proposed small-sample OA concept to find the parameters Cg, Cp, Cw, Ngen, Nfilter, and Nvar in the proposed CSVM. Then, these parameters are employed in Ex2 to conduct an extended test to compare the results of the proposed CSVM with those obtained from the SVM, the 3-layer ANN, and the 4-layer ANN, respectively.

6.1. Simulation Environment

Four algorithms are developed and adapted in this study including the proposed CSVM, SVM, the 3-layer ANN, and the 4-layer ANN. The proposed CSVM is implemented using Dev C++ Version 5.11 C/C++, and the SVM part is integrated by calling the libsvm library [28] with all default setting parameters. The codes of both the 3-layer and 4-layer ANNs are modified using the source code provided in [39], which is coded in Python and run in Anaconda with epochs = 150, batch_size = 10, loss = “binary_crossentropy,” optimizer = “Adam,” activation = “ReLU” and 12 neurons in the first hidden layer, and activation = “sigmoid” in the second hidden layer of the 4-layer ANN. The test environment is Intel (R) Core (TM) i9-9900K CPU @ 3.60 GHz, 32.0 GB memory, and 64-bit Windows 10.

To validate the proposed CSVM, it was compared with the traditional SVM and the 3-layer and 4-layer ANNs on five well-known datasets: "Australian Credit Approval" (A), "breast-cancer" (B), "diabetes" (D), "fourclass" (F), and "Heart Disease" (H) [36], based on a tenfold cross-validation in Ex2. A summary of the five datasets is provided in Table 2. A brief introduction of the datasets is as follows:
"Australian Credit Approval" (A): this file concerns credit card applications. This database also exists elsewhere in the repository (Credit Screening Database) in a slightly different form. The dataset is interesting because there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.
"breast-cancer" (B): the term "breast cancer" refers to a malignant tumor that has developed from cells in the breast. It is the most common cancer among women in almost all parts of the world. The dataset used consists of 699 instances classified as benign or malignant and has 11 integer-valued attributes.
"diabetes" (D): diabetes mellitus is one of the most serious health challenges in both developing and developed countries. The diabetes dataset used here contains 8 attributes and 768 instances and records on diabetes patients (several weeks' to months' worth of glucose, insulin, and lifestyle data per patient and a description of the problem domain), gathered from larger databases belonging to the National Institute of Diabetes and Digestive and Kidney Diseases.
"fourclass" (F): the data spread irregularly over the space, including disconnected regions, and are not linearly separable. This nonlinearly separable dataset (originally a four-class problem) consists of 862 records and 2 dimensions.
"Heart Disease" (H): heart attack diseases remain the main cause of death worldwide, including in South Africa, and detection at an earlier stage will prevent attacks. This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date.

Let F, T, G, f, and N be the highest accuracy level obtained in the end, the runtime, the earliest generation that obtained F, the number of filters generating F, and the number of solutions that attain F, respectively. For ease of recognition, the subscripts 25, 50, 75, 100, avg, max, min, and std represent the related values obtained at the end of the 25th, 50th, 75th, and 100th generations, the average, the maximum, the minimum, and the standard deviation, respectively.

6.2. Ex1: Small-Sample OA Test

The orthogonal array used in this study is L9(3^4), as shown in Table 3.

In L9(3^4), there are nine tries and four factors: C = (Cg, Cp, Cw), Nsol, Nvar, and Nfilter; each factor has three levels, as shown in Table 4. The higher the level, the larger the related values, with the exception of C; for example, Nsol = 25 in level 1 is smaller than Nsol in level 2. The most distinguishable difference among the three levels of C in Table 3 is that level 2 has a higher cr, which increases the global search ability, while level 3 has a lower value of cr to enhance the local search ability.

The results obtained from the proposed CSVM in terms of the proposed small-sample OA test are listed in Table 5, in which each try is run fifteen times; the larger Nfilter, Nsol, Nvar, and/or Ngen, the longer the runtime. However, larger values do not necessarily yield better fitness values, as Table 5 shows. For example, the best fitness value has already been found by G25, namely, F25 = F50 = F75, in all datasets except Dataset D, whose best fitness value is found in G75.

Adhering to Rule 1 listed in Section 4, only the try with the highest accuracy is selected to be used for the rest of the unsampled dataset. In this case, Try 7 is selected for Dataset D, since the greatest accuracy is obtained from Try 7 in G75. From Rule 2, the runtime T must be considered if two tries are tied in accuracy. For example, both Try 5 and Try 9 have the highest accuracy in Dataset A, but Try 5 is selected, since its runtime is only 43.43, which is considerably less than the runtime (149.28) of Try 9. Similarly, Try 7 and Try 1 are selected for Datasets B and H, respectively. The parameter setting for the remaining dataset, namely, Dataset F, is based on Rule 3, and Try 5 is selected in accordance with Rule 2.

Hence, we obtain the parameter settings listed in Table 6.

In Dataset F, there are only two attributes, resulting in only two variables in each filter in Table 6. Another observation is that the selected values of Nfilter, Nvar, Nsol, and Ngen are always the smallest, since all the best final fitness values are equal to 88.00000 regardless of the generation number; the parameter setting with the smallest runtime is then selected, which is reasonable. This is similar to Dataset B, whose solution number is only 25, with less local search ability.

In Table 5, the accuracy levels obtained from the SVM for the first fold of each dataset are listed in the second-to-last column, labeled FSVM. From Table 5, all values of FNgen are better than the corresponding FSVM. Moreover, also from Table 5, all fitness values obtained at G25, namely, F25, are already at least equal to FSVM; that is, FSVM ≤ F25 ≤ F50 ≤ F75 ≤ F100. Hence, the proposed CSVM outperforms the traditional SVM in the small-sample OA, and the wide discrepancy between the final performances of the CSVM and the SVM is further reinforced in Subsection 6.3 using the parameter settings from the proposed small-sample OA.

6.3. Ex2

The results for G100 are collected to evaluate the effectiveness of the concept of the proposed small-sample OA and verify any possible effects of higher generation numbers on the average and the best fitness values. The complete data, including the average, best, worst, and standard deviation of the fitness of each fold for each dataset, are listed in Tables 7–11.

6.3.1. Boxplots of the Experimental Results from Ex2

The results obtained from both the 3-layer and 4-layer ANNs are the least favorable, with a large gap from both the proposed CSVM and the traditional SVM. Hence, these two ANN-based methods are not discussed further, and we focus only on the proposed CSVM and the traditional SVM.

We determined that the higher the generation number, the better the average fitness value. However, the best fitness value remains unchanged from G75 to G100 except for the 8th fold in Dataset A, the 1st and 10th folds in Dataset B, and the 5th fold in Dataset F. Therefore, Ngen = 75 is acceptable, and there is no need for Ngen = 100 to increase the fitness value of the best solution. The position (the fitness values obtained) and the length (the range of the fitness values) of the box at G100 are frequently higher and shorter, respectively, than those at G25 in most boxplots. Hence, a larger generation number has a higher probability of enhancing the average solution quality at the cost of a longer runtime but ultimately does little to improve the best fitness value.

6.3.2. Number of Folds for Finding the Final Best Fitness Values

Table 12 lists the number of folds that have found the final best fitness values. The subscripts of dataset ID in the first column of Table 12 indicate the generation number used in Ex 2; for example, B25 indicates that 25 generations are used for Dataset B based on the parameters obtained with small-sample OA in Ex 1. Folds 7, 8, 10, 8, and 10 (see bold numbers in Table 12) under G25, G25, G75, G25, and G25 in Datasets A, B, D, F, and H, respectively, have found the best final fitness values.

The folds written as subscripts indicate the final best fitness values that have failed to be found. For example, 71,4,7 in G25, A25 represents that there are seven folds (from the ten folds) that have already found the best fitness values after 25 generations with the remaining three folds 1, 4, and 7 failing to do so in Dataset A.

To calculate the probability of finding the best final fitness value in Table 12, we add the folds (7 + 8 + 10 + 8 + 10) and divide the sum by the total number of folds (50) in Table 6 to get 86%, which tells us that the probability of finding the best final fitness value without reaching G100, which entails a significantly longer runtime, is 86%.

Hence, the proposed small-sample OA is effective in setting parameters to increase the efficiency and solution quality of the proposed CSVM. The above observation further confirms that having better parameters ultimately negates the need for a greater generation number to increase the fitness of the best solution.

6.3.3. ANOVA of the Experimental Results

To investigate the small-sample OA, the Analysis of Variance (ANOVA) is carried out to test the average fitness obtained from the proposed CSVM in terms of the parameters set by the small-sample OA, as shown in Table 13. The cells marked with “v” indicate that there is a significant gap between the pair of distinctive generation numbers listed in their respective rows in the fold denoted by the column. This is reinforced through the distinct difference between the average fitness values obtained from G25 and G75 in all folds of Dataset A.

From Table 13, the minimal generation numbers should be 75 and 50 for Datasets A and F, respectively, with an insignificant gap between the fitness values in each fold. Hence, the proposed small-sample OA is still effective in determining the generation number to reduce the significant differences among all fitness values, even though it focuses only on the best fitness value and not the average fitness value.

6.3.4. MPI of the Experimental Results

To further investigate the development of the proposed CSVM, two other indices, the average maximum possible improvement (MPIavg%) and the best maximum possible improvement (MPImax%), are introduced and defined as

MPIavg% = (Favg − FSVM)/(100 − FSVM) × 100% and MPImax% = (Fmax − FSVM)/(100 − FSVM) × 100%,

where Favg and Fmax are the average and best accuracies obtained by the proposed CSVM and FSVM is the accuracy obtained by the traditional SVM. For example, if FSVM = 90% and Favg = 95%, then MPIavg% = (95 − 90)/(100 − 90) × 100% = 50%.

The MPIavg% and MPImax% results are listed in Tables 14 and 15, respectively, where the marked cells indicate that both the related FSVM and the average and/or the best fitness obtained are 100% correct (so the improvement is undefined), for example, the 2nd, 4th, and 5th folds of Dataset B in Table 14. The bold numbers denote the best values among all folds for each dataset under the same generation number. Note that a value of 100, as in the 7th fold of Dataset B in Table 15, indicates that the related accuracy reaches 100%.

As shown in Tables 14 and 15, the improvements obtained from the proposed CSVM are at least 14.96% and 20.17% and at most 50.68% and 63.10% in MPIavg% and MPImax%, respectively. The results shed light on the effectiveness of the proposed CSVM in comparison with the traditional SVM. It can also be observed that the more attributes there are, the better the results obtained from the proposed CSVM, regardless of the number of records. Ultimately, compared with the traditional SVM, the proposed CSVM is more suitable for small data.

7. Conclusions and Future Work

Classification is of utmost importance in data mining. The proposed new classifier, CSVM, is a convolutional SVM modified with a new repeated-attribute convolution product, in which all variables in each filter are updated and trained based on the proposed novel SSO. Equipped with a self-adaptive structure and pFilter, this greedy SSO is a one-solution, one-filter, one-variable type and its parameters are delineated by the proposed small-sample OA.

According to the experimental results for the five UCI datasets, namely, Australian Credit Approval, breast-cancer, diabetes, fourclass, and Heart Disease [36], from Ex2 in Section 6, the proposed CSVM with the parameter setting selected from Ex1 outperforms the traditional SVM, the 3-layer ANN, and the 4-layer ANN, with an improvement of at least 14.96% and up to 50.68% in MPIavg%. Hence, the proposed small-sample OA discussed in Subsection 4.2 enables the CSVM to improve its overall performance, while the proposed CSVM ultimately serves as a successful combination of the advantages of the SVM, the convolution product, and the SSO.

The classifier design method is a crucial element in the provision of useful information in the modern world. Based on comparisons of the experimental results, further research on the proposed CSVM will apply it to multiclass datasets based on several references [40, 41] with more attributes, classes, and records, and combine it with particular feature-selection methods.

Data Availability

To validate the proposed CSVM, it was compared with the traditional SVM and the 3-layer and 4-layer ANNs on five well-known datasets, “Australian Credit Approval” (A), “breast-cancer” (B), “Diabetes” (D), “fourclass” (F), and “Heart Disease” (H), at http://archive.ics.uci.edu/ml/.

Disclosure

A preliminary version of this article was submitted to arXiv as a temporary submission for reference only and did not transfer the copyright.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported in part by the Ministry of Science and Technology, R.O.C., under Grants MOST 102-2221-E-007-086-MY3 and MOST 104-2221-E-007-061-MY3 and the National Natural Science Foundation of China under Grant 61702118.