Abstract

Supervised data classification is one of the techniques used to extract nontrivial information from data. It is widely used in various fields, including data mining, industry, medicine, science, and law. This paper considers a new algorithm for supervised data classification problems associated with cluster analysis. The mathematical formulations underlying this algorithm are nonsmooth, nonconvex optimization problems. A new algorithm for solving these optimization problems is employed; it uses a derivative-free technique and is robust and efficient. To improve classification performance and the efficiency of generating the classification model, a new feature selection algorithm based on convex programming techniques is also suggested. The proposed methods are tested on real-world datasets, and the results of the numerical experiments demonstrate the effectiveness of the proposed algorithms.

1. Introduction

Supervised data classification, widely used in fields such as data mining, aims to establish rules for classifying new observations under the assumption that the classes of the data are known. Due to the explosive growth of both business and scientific databases, extracting efficient classification rules from such databases is of major importance.

In recent decades, various algorithms have been designed for supervised data classification, based on quite different approaches, for example, statistical methods [1], neural networks [2], genetic algorithms [3], graphical models [4], and adaptive spline methods [5].

Algorithms based on inductive logic programming [6] and hybrid systems [7] are also used for supervised data classification. Kotsiantis in 2007 and Mangasarian and Musicant in 2001 [8, 9] presented a good review of these approaches, including their definitions and comparisons. One of the newest and most promising approaches to supervised data classification is based on methods of mathematical optimization. There are different ways to apply optimization; see, for example, [10-12]. One of these methods is based on finding clusters for the given training sets. The data vectors are allocated to the closest cluster and, correspondingly, to the set which contains this cluster [13].

On the other hand, one of the most important factors influencing the classification accuracy rate is feature selection. If the dataset contains a large number of features, the dimension of the feature space will be large and noisy, degrading the classification accuracy rate [14]. An efficient and robust feature selection method can eliminate noisy, irrelevant, and redundant data [15]. Therefore, reducing the number of features without loss of useful information is expected to accelerate the algorithms and increase the accuracy. Most feature selection methods are based on statistical considerations, and features are usually removed according to a correlation between observations and features (see [15, 16]). In [17, 18], approaches based on optimization techniques have been developed. Building on this line of work, a new feature selection algorithm based on convex programming techniques is proposed in this paper.

So in this research, new algorithms for classification and feature selection problems based on optimization techniques are designed; executing these approaches requires solving complex nonconvex and nonsmooth unconstrained optimization problems, either locally or globally. Despite the nonsmoothness and nonconvexity of the objective functions, global methods are much simpler and more applicable than local ones. In the present research, one type of direct global optimization method, namely, mesh adaptive direct search (MADS) [19], is adapted and used. MADS is derivative-free in the sense that it neither computes nor even attempts to evaluate derivatives. Mesh adaptive direct search methods are designed to use only function values and require only a numerical value of the objective; no knowledge about the internal structure of the problem is needed [19].

Results of computational experiments on real-world datasets are presented and compared with the best known solutions from the literature.

The paper is organized as follows. The optimization approach to classification is considered in Section 2. In Section 3, an algorithm for the feature selection problem is studied. In Section 4, the algorithm used for solving the optimization problems is presented. The results of the computational experiments and their analysis are discussed in Section 5. Finally, Section 6 concludes the paper.

2. A New Optimization Algorithm for Solving Classification Problem

Consider a finite set of points in n-dimensional space that is partitioned into classes, that is, into nonempty finite subsets of that space. The aim of classification is to categorize a new observation into one of the known classes, and there are many existing approaches for solving this problem (as mentioned in the Introduction). In what follows, a classification method based on optimization is studied; numerical experiments verify that this method outperforms known ones on real-world databases. In order to solve this problem, the clusters of each class of the dataset have to be identified, together with the centers of the corresponding clusters. A new observation is then allocated to the class whose nearest cluster center is closest to it.
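As an illustration of this decision rule, the following minimal Python sketch (our own names and notation, not from the paper) assigns a new observation to the class whose nearest cluster center is closest, assuming the cluster centers of each class have already been computed and that the Euclidean distance is used.

```python
import numpy as np

def classify(x, class_centers):
    """Assign observation x to the class whose closest cluster center
    has the smallest Euclidean distance to x.

    class_centers: dict mapping a class label to an array of shape
    (number_of_clusters, n_features) holding that class's cluster centers.
    """
    best_label, best_dist = None, np.inf
    for label, centers in class_centers.items():
        # distance from x to the nearest center of this class
        dist = np.min(np.linalg.norm(centers - x, axis=1))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# example: two classes, each represented by two cluster centers
centers = {
    "A": np.array([[0.0, 0.0], [1.0, 1.0]]),
    "B": np.array([[5.0, 5.0], [6.0, 4.0]]),
}
print(classify(np.array([0.8, 0.9]), centers))  # -> "A"
```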

Thus, finding the clusters of a finite set is explained first. Clustering in n-dimensional Euclidean space is based on some similarity (distance) metric; the Minkowski metric is used for this purpose. There are various methods for solving the clustering problem; one of the most popular is the center-based clustering model [20-22].
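For reference, the Minkowski distance of order $p \geq 1$ between two points $x, y \in \mathbb{R}^n$ is

$$d_p(x, y) = \Bigl( \sum_{i=1}^{n} |x_i - y_i|^{p} \Bigr)^{1/p},$$

which reduces to the Manhattan distance for $p = 1$ and to the Euclidean distance for $p = 2$.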

Consider one of the sets and suppose first that it consists of only one cluster; its center can then be calculated by solving the following convex programming problem: Suppose that the solution of problem (1) has been found; in order to find the center of the second cluster, solve the following optimization problem:

In the same manner, suppose that the previous centers have already been calculated; then the center of the next cluster is described as a solution to the following problem:
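Since the display formulas of problems (1)-(3) are not reproduced above, the following is only a standard center-based formulation of this incremental scheme, in our own notation, for a finite set $A = \{a^1, \dots, a^m\} \subset \mathbb{R}^n$ with distance $d$:

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^{m} d(x, a^i) \qquad \text{(center of the first cluster)},$$

and, once centers $x^1, \dots, x^{k-1}$ have been computed, the $k$-th center is a solution of

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^{m} \min\bigl\{ d(x^1, a^i), \dots, d(x^{k-1}, a^i), d(x, a^i) \bigr\}.$$

The first problem is convex, while the inner minimum in the second makes it nonsmooth and nonconvex, which is why a derivative-free solver is used later.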

Then the following algorithm for solving the classification problem is proposed. Suppose that the database contains two classes, and let the required initial parameters and a tolerance be given.

Algorithm 1. A new algorithm for the classification problem.
Step 1 (initialization). Suppose that each of the two sets contains a unique cluster; calculate the centers of these clusters by solving the following problems:
Suppose that the solutions to these problems and the corresponding optimal values have been found, and set the initial parameters.
Step 2 (identify the sets of points “misclassified” by the current clusters). Compute the sets
Step 3. If the corresponding condition holds, compute the following sets; otherwise:
Step 4. Improve the centers of the clusters by solving the following constrained convex programming problems:
Let the solutions of problems (8) and (9) be denoted accordingly, and update the corresponding parameters.
Step 5 (checking the stopping criterion). Calculate the functions below; if the stopping condition is satisfied, then the algorithm terminates. Otherwise go to Step 6.
Step 6 (determine the estimate of the next cluster). Solve the following constrained optimization problems:
Step 7. Let the solutions of problems (11) and (12) be denoted accordingly. Update the corresponding parameters and go to Step 2.
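Because the display formulas of Algorithm 1 are not reproduced above, the following Python sketch is only a simplified reading of the scheme under our own assumptions (Euclidean distance, coordinate means as cluster centers, and a fixed maximum number of clusters per class); it is not the paper's exact procedure.

```python
import numpy as np

def nearest_center_dist(points, centers):
    # distance from each point to its nearest center
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1), d.argmin(axis=1)

def build_class_centers(A, B, max_clusters=5, tol=1e-3):
    """Incrementally add cluster centers to two classes A and B until the
    fraction of 'misclassified' training points falls below tol."""
    centers_A = [A.mean(axis=0)]        # Step 1: one cluster per class
    centers_B = [B.mean(axis=0)]
    for _ in range(max_clusters):
        dA_own, _ = nearest_center_dist(A, np.array(centers_A))
        dA_other, _ = nearest_center_dist(A, np.array(centers_B))
        dB_own, _ = nearest_center_dist(B, np.array(centers_B))
        dB_other, _ = nearest_center_dist(B, np.array(centers_A))
        bad_A = A[dA_other < dA_own]     # Step 2: points closer to the other class
        bad_B = B[dB_other < dB_own]
        err = (len(bad_A) + len(bad_B)) / (len(A) + len(B))
        if err < tol:                    # Step 5: stopping criterion
            break
        # Steps 6-7: estimate the next cluster center of each class from its
        # own misclassified points (a crude stand-in for the paper's
        # optimization subproblems)
        if len(bad_A):
            centers_A.append(bad_A.mean(axis=0))
        if len(bad_B):
            centers_B.append(bad_B.mean(axis=0))
    return np.array(centers_A), np.array(centers_B)

# usage with synthetic two-class data
rng = np.random.default_rng(1)
A = rng.normal([0, 0], 1.0, size=(100, 2))
B = rng.normal([4, 4], 1.0, size=(100, 2))
cA, cB = build_class_centers(A, B)
```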

3. Feature Selection Algorithm

Feature selection is concerned with the identification of a subset of features that significantly contributes to the discrimination or prediction problem. The main goal of feature selection is to search for an optimal feature subset of the initial feature set that leads to improved classification performance and efficiency in generating the classification model. During the past decades, extensive research has been conducted by researchers from multidisciplinary fields including data mining, pattern recognition, statistics, and machine learning. In [23], a comparison of various feature selection algorithms for large datasets is presented.

Consider a database which contains two nonempty finite sets. Let their cardinalities, the thresholds, and some tolerance be given.

Algorithm 2. Feature selection.
Step 1 (initialization). Set the initial parameters.
Step 2. Find the cluster centers by assuming that each of the sets contains a unique cluster; compute these centers by solving the following convex programming problems:
Step 3. Find points of the set which are closer to the cluster center of the other set (bad points).
Let the solutions to (13) be given. Compute the following sets and update the corresponding counters.
If , then go to Step 5; otherwise go to Step 4.
Step 4. Calculate the quantity below; if the corresponding condition holds, then the resulting set is a subset of the most informative attributes and the algorithm terminates. Otherwise go to Step 5.
Step 5. To determine the closest coordinates, calculate and define the following set:
Step 6. Construct the set below. If the first condition holds, then this set is the subset of the most informative attributes; if the second condition holds, then the other set is the subset of the most informative attributes. In either case the algorithm terminates; otherwise, update the parameters and go to Step 2.
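Similarly, the thresholds and display formulas of Algorithm 2 are not reproduced above; the sketch below is only a loose illustration of the idea under our own assumptions (a single center per class and a per-coordinate separation score), not the paper's exact algorithm.

```python
import numpy as np

def select_features(A, B, n_keep):
    """Rank features by the separation of the (single) class centers along
    each coordinate and keep the n_keep most informative ones.
    This is an illustrative heuristic, not the paper's exact Algorithm 2."""
    cA, cB = A.mean(axis=0), B.mean(axis=0)          # Step 2: class centers
    spread = np.vstack([A, B]).std(axis=0) + 1e-12   # per-feature scale
    separation = np.abs(cA - cB) / spread            # Step 5: closeness of coordinates
    order = np.argsort(separation)[::-1]             # most separated first
    return np.sort(order[:n_keep])                   # indices of informative features

# usage: keep the 3 most informative of 5 features
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(50, 5)); A[:, 2] += 3.0
B = rng.normal(0.0, 1.0, size=(50, 5)); B[:, 4] -= 3.0
print(select_features(A, B, n_keep=3))
```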

4. Solving Optimization Problems

In this section, the algorithm used for solving the optimization problems arising in the classification algorithm is discussed. Since these functions are nonsmooth and subgradients are difficult to estimate, direct search optimization methods seem to be the best option for solving them. The main attraction of direct search methods is their ability to find optimal solutions without the need for computing derivatives, in contrast to the more familiar gradient-based methods [24].

Direct search algorithms can be applied to problems that are difficult to solve with traditional optimization techniques, including problems that are difficult to model mathematically or are not well defined. They can also be applied when the objective function is discontinuous, stochastic, highly nonlinear, or has undefined derivatives.

In general, direct search algorithms are called pattern search algorithms, and both the generalized pattern search (GPS) algorithm and the MADS algorithm are pattern search algorithms that compute a sequence of points that get closer and closer to the optimal point. At each step, the algorithm investigates a set of points, called a mesh, around the current point (the point computed at the previous step of the algorithm). The mesh is created by adding to the current point a scalar multiple of a set of vectors called a pattern. If the pattern search algorithm discovers a point in the mesh that improves (decreases) the objective function relative to the current point, the new point becomes the current point at the next step of the algorithm.

4.1. The MADS Method

MADS methods are designed to use only function values and require only a numerical value of the objective; no knowledge about the internal structure of the problem is needed. These methods can quickly and easily be applied to nonlinear, nonconvex, nondifferentiable, discontinuous, or undetermined problems [19]. The convergence analysis of MADS guarantees necessary optimality conditions of the first and second order under certain assumptions [19]. A general optimization problem can be written as $\min_{x \in \Omega} f(x)$, where $f \colon \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ and $\Omega \subseteq \mathbb{R}^n$.

MADS is an iterative algorithm. Each iteration (indexed by a subscript) starts from the current best feasible solution, known as the incumbent solution, and consists of two steps. First, an optional search step over the space of variables is performed; it may be any finite process, as long as all trial points lie on the mesh. If no better point is found, or if no global search is applied, the algorithm proceeds to a compulsory local exploration step (compulsory because it ensures convergence). This second step is the poll step: a finite set of trial mesh points near the incumbent solution is chosen (the poll set) and evaluated. If no better neighbor is found, the mesh is refined. If an improved mesh point is found, the mesh is kept the same or coarsened, and the improved point becomes the next incumbent. The exploration directions vary at each iteration and become dense with probability 1; this is the main difference between the pattern search and MADS algorithms. General constraints can be handled with a barrier approach, which redefines the objective as $f_\Omega(x) = f(x)$ if $x \in \Omega$ and $f_\Omega(x) = +\infty$ otherwise. MADS is then applied to the unconstrained barrier problem $\min_{x} f_\Omega(x)$. The feasible region $\Omega$ can be nonlinear, nonconvex, nondifferentiable, or disjoint. No hypotheses are made on the domain, except that the initial point must be feasible. The convergence results depend on the local smoothness of $f$ (and not of $f_\Omega$, which is obviously discontinuous on the boundary of $\Omega$).

Algorithm 3 (the MADS algorithm). A general and flexible algorithmic framework for MADS is studied in [19]. This general framework is then specialized to a specific algorithmic implementation. The main steps of the algorithm are summarized as follows.
Step 1 (initialization). The user defines the starting point and the initial mesh size.
The algorithm initializes other parameters for subsequent steps.
Step 2 (request for an improved mesh point). Consider the following steps: (i) global search (optional): evaluation of the objective over a finite subset of points defined by the mesh; (ii) local poll (mandatory): definition of a poll set and evaluation of the objective over the points in that set.
Step 3 (parameters update). Parameters are updated.
Step 4 (termination). If some stopping criterion is reached, stop; if not, go back to Step 2.
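As a minimal, hedged sketch of this framework, the loop below implements a plain coordinate-pattern poll with the optional search step omitted; true MADS differs in that its poll directions change at each iteration and become asymptotically dense, but the mesh refinement logic of Steps 2-4 is the same in spirit.

```python
import numpy as np

def pattern_search(f, x0, delta=1.0, tol=1e-6, max_iter=1000):
    """Simplified coordinate-pattern poll; not full MADS, whose poll
    directions vary per iteration and become dense in the limit."""
    x, fx = np.asarray(x0, dtype=float), f(x0)
    n = x.size
    directions = np.vstack([np.eye(n), -np.eye(n)])  # fixed poll pattern
    it = 0
    while delta > tol and it < max_iter:
        improved = False
        for d in directions:                 # poll step: evaluate mesh neighbours
            trial = x + delta * d
            ft = f(trial)
            if ft < fx:                      # success: accept, keep the mesh size
                x, fx, improved = trial, ft, True
                break
        if not improved:                     # failure: refine (contract) the mesh
            delta *= 0.5
        it += 1
    return x, fx

# usage on a nonsmooth objective
x_opt, f_opt = pattern_search(lambda x: abs(x[0] - 1) + abs(x[1] + 2), [0.0, 0.0])
print(x_opt, f_opt)
```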

5. Results of Numerical Experiments

To verify the efficiency of the proposed algorithms, a number of numerical experiments with real-world datasets were carried out on a PC with an Intel Core 2 Duo CPU and 1.95 GB of RAM.

The Australian credit dataset, the breast cancer dataset, the diabetes dataset, the heart disease dataset, the liver-disorder dataset, the German Numer dataset, and the mushroom dataset have been applied in numerical experiments.

The description of these datasets can be found in UCI Machine Learning Repository [25].

Table 1 reports, for each dataset, the number of samples, the number of classes, and the number of features.

First, all features were normalized. This is done with a nonsingular scaling matrix so that the standard deviation of every feature equals 1. In order to evaluate performance, 10-fold cross-validation was used: a sample from each dataset was selected and divided into 10 equal-sized subsets. Next, one subset was designated as the test set and the union of the remaining nine subsets was used as the training set. After applying Algorithm 2, which calculates the subset of informative attributes, and selecting the features, the classification model was validated on the test subset. This process was repeated so that each of the 10 subsets was successively selected as the test set. Accordingly, the proposed method was run 10 times and the classification accuracy rate was calculated by averaging over all 10 test runs.
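A minimal sketch of this cross-validation protocol (plain NumPy index splitting; train_and_score is a hypothetical placeholder standing for feature selection, Algorithm 1, and evaluation on the test fold) is:

```python
import numpy as np

def ten_fold_accuracy(X, y, train_and_score, seed=0):
    """Average test accuracy over 10-fold cross-validation.
    X, y are NumPy arrays; train_and_score(X_tr, y_tr, X_te, y_te) is assumed
    to fit the model on the training part and return test accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, 10)          # 10 (nearly) equal-sized subsets
    scores = []
    for k in range(10):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        scores.append(train_and_score(X[train], y[train], X[test], y[test]))
    return float(np.mean(scores))
```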

Note. In the feature selection algorithm (Algorithm 2), the maximum numbers of "bad points" added in each iteration for each class of the dataset play an important role in the execution of the algorithm; therefore, in the numerical experiments, one or two percent of the size of each class of the dataset was used for these values.

Compared with the results introduced in [18], the numerical experiments show that this algorithm significantly reduces the number of attributes: 3 attributes were used for the diabetes, breast cancer, liver-disorder, and Australian credit datasets, 11 attributes for the heart disease dataset, 4 attributes for the German dataset, and 6 attributes for the mushroom dataset when solving the classification problem. Comparing the features obtained by the proposed method with those obtained in [18], for the Australian credit dataset the number of features decreased from 6 to 3; for the breast cancer dataset it was the same; while for the heart disease dataset the number of features increased from 3 to 11.

In the numerical experiments, Algorithm 1 was used for the classification of the datasets with 10-fold cross-validation, and the MADS algorithm was applied for solving the optimization problems in Algorithm 1; in this research this combination is referred to as MA. The results of the numerical experiments are presented in Tables 2-8, where one column reports the error rate on the training data and another the error rate on the test data, the latter being the criterion for the goodness of a method.

Also, the numerical results of the parametric misclassification minimization (PMM) [26], robust linear programming (RLP) [27], the hybrid misclassification minimization (HMM) [28], support vector machine algorithms [29], the k-nearest neighbor algorithm (kNN), the multilayer perceptron (MLP), the probabilistic neural network (PNN), and the sequential minimal optimization algorithm (SMO) [30, 31] were used for the purpose of comparison. Moreover, the results obtained by the particle swarm optimization algorithm (PSO) [32, 33], the music-inspired harmony search algorithm (HS) [34], the firefly algorithm (FFA) [35] and its references, and the Waikato Environment for Knowledge Analysis (WEKA) system release 3.4 [36], which contains a large number of such techniques divided into different groups, were equally used for comparison. From each of these groups, some representatives have been chosen: the radial basis function artificial neural network (RBF) [37]; among the lazy learners, KStar [38]; among the rule-based methods, the ripple down rule (Ridor) [39]; and among the others, the voting feature interval (VFI) [40]. Similarly, MultiBoostAB [41] and, among the Bayesian methods, the Bayes net [42] were used. Parameter values used for any technique are those set as default in WEKA. In addition, the results obtained by the support vector machines algorithm [10], IncNet [43], a fuzzy approach [44], FLEXNFIS [45], FNN [46], RULES-4 [47] and C4.5 [48], Naïve Bayes [49, 50], the BNND and BNNF methods from [51], SSVM [52], RSVM [53], SVM [54], LSSVM [55], FAIRS [56], DC-RBFNN [57], Boost [58], RIPPER [59], INB [60], and GPF [61] were used in the experiments.

The results obtained by 23 classification algorithms from Michie [62], presented in Chapter 9 of that book, were also used; these are statistical, neural network, and machine learning algorithms. Only the best results obtained by these algorithms are presented in Tables 2-8.

The results for the Australian credit database are presented in Table 2, which indicates that the accuracy of the proposed method is higher than the accuracies of other methods pointed out in the table.

The results for the second database, the breast cancer database, are presented in Table 3. It shows that the accuracy of the proposed method is higher than the accuracies of the other methods except for the KStar and HMM methods, whose accuracies are quite close to that of the proposed method.

For the diabetes database, the results of the numerical experiments are presented in Table 4, which shows that the accuracy of the proposed method is higher than the accuracies of the other methods pointed out in this table except for the FNN method, which attains the best accuracy.

The results for the heart database are presented in Table 5. From these results and the previous results, it is safe to conclude that the accuracy of the proposed method is the best and, thus, that it is the most suitable method for this dataset.

The results for the liver database are presented in Table 6, which shows that the accuracy of the proposed method is better than the accuracies of the other methods pointed out in the table except for the PMM method, which attains the best accuracy.

The results for the German database are presented in Table 7, which confirms that the errors of the proposed method are lower than the errors of the other methods pointed out in the table, except for the SMO, DC-RBFNN, and GPF methods, whose errors are lower than those of the MA method.

The results for the last database, the mushroom database, are presented in Table 8. It shows that the errors of the proposed method are near 0, which demonstrates the goodness of the proposed method.

As shown in Tables 2-8, the MA model obtains the best or near-best prediction accuracies on almost all datasets.

Further, in order to evaluate important factors in the performance of the MADS algorithm for solving the classification problem, different experiments were carried out on the datasets mentioned earlier in this paper. Only the main results obtained from these experiments are presented here, to avoid unnecessary details. In this research, two mesh factors have been used: a mesh contraction factor, applied when an iteration is unsuccessful, and a mesh expansion factor, which enlarges the mesh when an iteration is successful. It was also found that the error tends to decrease as the expansion factor increases, with the best observed value near 5.
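Written out, and with the caveat that the paper's exact symbols are not reproduced above, a mesh size parameter $\Delta_k$ with contraction factor $\tau \in (0, 1)$ and expansion factor $\gamma \geq 1$ is updated as

$$\Delta_{k+1} = \begin{cases} \tau \, \Delta_k, & \text{if iteration } k \text{ is unsuccessful (mesh contraction)},\\ \gamma \, \Delta_k, & \text{if iteration } k \text{ is successful (mesh expansion)}. \end{cases}$$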

Since the poll set directions are chosen at random in the MADS algorithm, each run of the algorithm gives a new result; therefore the MA method was performed 10 times and the average of the solutions is presented in the following tables. It was also found that the standard deviations of these solutions are near zero.

Various experiments were also carried out on the datasets mentioned before in the classification algorithm with different values of the corresponding parameter in order to find its best value, and a good value was identified. Also, to appraise important factors in the performance of the MADS algorithm for solving the classification problem, the same experiments were repeated with different strategies for the search step of the MADS algorithm (an empty search step, a search step with randomly chosen mesh directions, a search step using a genetic algorithm, and a search step using the Nelder-Mead algorithm); the results were almost the same.

Therefore, the results presented in Tables 2-8 show that MA gives good results compared with the other methods for all datasets. The results of the numerical experiments demonstrate that the proposed algorithms are effective for solving classification problems.

6. Conclusions

In this paper a new algorithm was proposed for solving the classification problem; the algorithm involves nonsmooth and nonconvex optimization problems. The proposed algorithm works with the classes in the database using cluster centers, so that, for each class, the cluster analysis problem is solved with progressively refined estimates.

The MADS method was used for solving the nonsmooth optimization problems. The new method was tested on real-world datasets, and the results of these computational experiments show the effectiveness of the new algorithms. Since the size of datasets will continue to increase, applying feature selection is clearly useful for the classification problem, and therefore the feature selection procedure should be studied further. Proposing new globalization strategies for this method, based on combining it with other good methods such as PSO, is also suggested as future work.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank the Ministry of Education Malaysia for funding this research project through a Research University Grant of Universiti Teknologi Malaysia (UTM), project titled “Dimension Reduction & Data Clustering for High Dimensional & Large Dataset” (04H40). Also, thanks are due to the Research Management Center (RMC) of UTM for providing an excellent research environment in which this work was completed.