Research Article  Open Access
Qi Yu, Yoan Miche, Antti Sorjamaa, Alberto Guillen, Amaury Lendasse, Eric Séverin, "OPKNN: Method and Applications", Advances in Artificial Neural Systems, vol. 2010, Article ID 597373, 6 pages, 2010. https://doi.org/10.1155/2010/597373
OPKNN: Method and Applications
Abstract
This paper presents a methodology named Optimally Pruned K-Nearest Neighbors (OPKNN), which competes with state-of-the-art methods while remaining fast. It builds a single-hidden-layer feedforward neural network using K-Nearest Neighbors as kernels to perform regression. Multiresponse Sparse Regression (MRSR) is used to rank each kth nearest neighbor, and Leave-One-Out estimation is then used to select the optimal number of neighbors and to estimate the generalization performance. Since the computational time of this method is small, this paper presents a strategy using OPKNN to perform Variable Selection, which is tested successfully on eight real-life data sets from different application fields. In summary, the most significant characteristic of this method is that it provides good performance and a comparatively simple model at extremely high learning speed.
1. Introduction
In many application fields, the regression problem has received wide attention, either to predict a dependent variable (target) from a number of independent variables (observations) or to model numerical data consisting of values of input variables and of one or more output variables. However, regression problems face two main difficulties: accuracy and computational time.
In recent years, many different techniques have been investigated to solve regression problems. Support Vector Machines (SVMs) are among the most popular of these techniques; they were initially developed for classification tasks and were later extended to the domain of regression [1]. Briefly, SVM is a universal constructive learning procedure based on statistical learning theory. Its basic idea is to transform the signal into a higher-dimensional feature space and find the optimal hyperplane for classification and regression problems. Unlike the Multilayer Perceptron (MLP), the nonlinear classification and regression problems are solved using convex optimization, leading to a unique solution, which avoids the local minima problem of MLP [2]. Least Squares Support Vector Machines (LSSVMs) are reformulations of the standard SVM [3]. In LSSVM, the complexity of solving quadratic programs in SVM is reduced to solving the linear Karush-Kuhn-Tucker (KKT) conditions. Only linear equations need to be solved, which makes the approach much simpler. As a consequence, however, LSSVM loses the sparseness property of SVM.
However, some limitations of SVM weaken its performance: the hyperparameters of the kernel have to be chosen, and the training speed is slow, especially when the number of variables is large. From a practical point of view, perhaps the most serious problem with SVM is the high algorithmic complexity and the extensive memory requirements of the required quadratic programming in large-scale tasks. Thus, another group of methods, such as K-nearest neighbors (KNN) or Lazy Learning (LL) [4], is taken into account. The key idea behind KNN is that similar training samples have similar output values; like SVM, it avoids the local minima problem, but it is simpler and faster.
On the other hand, Variable Selection has several important advantages when the number of input variables increases. It helps to decrease the redundancy of the original data. It can also reduce the complexity of the modeling process. Moreover, it contributes to the interpretability of the input variables.
Thus, in this paper, we present a methodology called Optimally Pruned K-Nearest Neighbors (OPKNN), which builds a single-hidden-layer feedforward neural network (SLFN) using KNN as the kernel. The most significant characteristic of this method is that it tends to provide good generalization performance at a fast computing speed and to select the most important variables at the same time.
In the next section, the three steps of OPKNN are introduced. The strategy we use to solve regression problems with OPKNN is shown in Section 3. Section 4 gives the results for a toy example and nine real-life data sets using OPKNN and four other methods, and the last section summarizes the whole methodology.
2. Optimally Pruned K-Nearest Neighbors
In this section, a methodology called Optimally Pruned K-Nearest Neighbors (OPKNN) is presented. The three main steps of OPKNN are summarized in Figure 1.
2.1. Single-Hidden-Layer Feedforward Neural Networks (SLFNs)
Recently, Huang et al. [5] proposed an original algorithm called Extreme Learning Machine (ELM). This method makes the selection of the weights of the hidden neurons very fast in the case of a single-hidden-layer feedforward neural network (SLFN). A more thorough presentation of the ELM algorithm can be found in the original papers [6, 7]. Furthermore, a methodology named Optimally Pruned Extreme Learning Machine (OPELM) [8], based on the original ELM algorithm, has been shown to be more efficient when encountering irrelevant or correlated data.
The first step of the OPKNN algorithm is to build a single-hidden-layer feedforward neural network. This is similar to the core of the Extreme Learning Machine (ELM). The difference is that OPKNN is deterministic, whereas ELM and OPELM choose their hidden nodes randomly.
In the context of a single-hidden-layer perceptron network, let us denote the inputs by $\mathbf{x}$, the outputs by $\mathbf{y}$, and the weight vector between the hidden layer and the output by $\mathbf{b}$. The activation functions used with OPKNN differ from the original SLFN choice, since the original sigmoid activation functions of the neurons are replaced by the K-Nearest Neighbors, hence the name OPKNN. For the output layer, the activation function remains linear, meaning that the relationship between the hidden layer and the output layer is linear.
A theorem proposed in [5] states that the output weights $\mathbf{b}$ can be computed from the real output and the hidden layer output matrix $\mathbf{H}$, where the columns of $\mathbf{H}$ are the corresponding outputs of the K-nearest neighbors. Finally, the output weights are computed by $\mathbf{b} = \mathbf{H}^{\dagger}\mathbf{y}$, where $\mathbf{H}^{\dagger}$ stands for the Moore-Penrose inverse [9] of $\mathbf{H}$ and $\mathbf{y}$ is the output.
The only remaining parameter in this process is the initial number of neurons of the hidden layer.
2.2. K-Nearest Neighbors
The K-Nearest Neighbors (KNN) model is a very simple but powerful tool. It has been used in many different applications, particularly in classification tasks. For regression problems, the key idea behind KNN is that similar training samples have similar output values [10]. In OPKNN, the approximation of the output is the weighted sum of the outputs of the k nearest neighbors. The model introduced in the previous section becomes
$$\hat{y}_i = \sum_{j=1}^{k} b_j \, y_{P(i,j)},$$
where $\hat{y}_i$ represents the output estimation, $P(i,j)$ is the index number of the $j$th nearest neighbor of sample $\mathbf{x}_i$, and $\mathbf{b}$ represents the result of the Moore-Penrose inverse introduced in the previous section.
In this sense, a different nearest neighbor is used for each neuron; in other words, the only remaining hyperparameter to be chosen is the neighborhood size $k$. Besides $k$, there is no other hyperparameter in KNN or in OPKNN.
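The construction of Sections 2.1 and 2.2 can be illustrated with a minimal NumPy sketch. It is not the authors' Matlab implementation: the function names `knn_hidden_layer` and `fit_output_weights`, the Euclidean distance, the exclusion of each sample from its own neighbor list, and the synthetic data are all assumptions made for the illustration.

```python
import numpy as np

def knn_hidden_layer(X, y, k):
    """Return H (n x k): column j holds the output of each sample's (j+1)-th nearest neighbor."""
    # Pairwise squared Euclidean distances between all training samples (assumed metric).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Assumption: a training sample is not counted as its own neighbor.
    np.fill_diagonal(d2, np.inf)
    # P[i, j] is the index of the (j+1)-th nearest neighbor of sample i.
    P = np.argsort(d2, axis=1)[:, :k]
    return y[P]

def fit_output_weights(H, y):
    """Output weights b = pinv(H) @ y (Moore-Penrose inverse)."""
    return np.linalg.pinv(H) @ y

# Synthetic one-dimensional data, used only to exercise the functions.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.standard_normal(200)

H = knn_hidden_layer(X, y, k=10)   # hidden layer output matrix
b = fit_output_weights(H, y)       # one weight per neighbor rank
y_hat = H @ b                      # model output on the training samples
```

Each column of `H` plays the role of one hidden neuron, so pruning neurons later amounts to dropping columns of `H`.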
2.3. Multiresponse Sparse Regression (MRSR)
For the removal of the useless neurons of the hidden layer, the Multiresponse Sparse Regression proposed by Similä and Tikka [11] is used. It is an extension of the Least Angle Regression (LARS) algorithm [12] and hence is actually a variable ranking technique rather than a selection one. The main idea of this algorithm is the following: denote by $\mathbf{T}$ the $n \times p$ matrix of targets and by $\mathbf{X}$ the $n \times m$ matrix of regressors. MRSR adds the regressors one by one to the model $\mathbf{Y}^{k} = \mathbf{X}\mathbf{W}^{k}$, where $\mathbf{Y}^{k}$ is the target approximation by the model. The weight matrix $\mathbf{W}^{k}$ has $k$ nonzero rows at the $k$th step of the MRSR. With each new step, a new nonzero row, and thus a new regressor, is added to the model.
An important detail shared by the MRSR and the LARS is that the ranking obtained is exact in the case where the problem is linear. In fact, this is the case, since the neural network built in the previous step is linear between the hidden layer and the output layer. Therefore, the MRSR provides the exact ranking of the neurons for the problem [12].
Details on the definition of a cumulative correlation between the considered regressor and the current model's residuals and on the determination of the next regressor to be added to the model can be found in the original paper about the MRSR [11].
MRSR is hence used to rank the kernels of the model: the target is the actual output, while the “variables” considered by MRSR are the outputs of the k-nearest neighbors.
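To make the ranking step concrete, the sketch below implements plain LARS for a single output. With one target column, MRSR is understood to coincide with LARS, so this is offered as a stand-in for the single-output case only and not as the MRSR implementation used in the paper. The function name `lars_ranking`, the standardization of the regressors, and the test design are assumptions.

```python
import numpy as np

def lars_ranking(X, y):
    """Order in which plain LARS (no lasso modification) adds the columns of X."""
    n, m = X.shape
    # LARS assumes centered, unit-norm regressors and a centered target.
    Xs = X - X.mean(axis=0)
    Xs = Xs / np.linalg.norm(Xs, axis=0)
    r = y - y.mean()                          # current residual
    c = Xs.T @ r
    active = [int(np.argmax(np.abs(c)))]      # most correlated regressor enters first
    while len(active) < m:
        c = Xs.T @ r                          # correlations with the residual
        C = np.max(np.abs(c))
        s = np.sign(c[active])
        XA = Xs[:, active] * s
        g = np.linalg.solve(XA.T @ XA, np.ones(len(active)))
        A = 1.0 / np.sqrt(g.sum())
        u = XA @ (A * g)                      # equiangular direction
        a = Xs.T @ u
        # Smallest positive step at which an inactive regressor ties with the active set.
        best_gamma, best_j = np.inf, -1
        for j in range(m):
            if j in active:
                continue
            for gamma in ((C - c[j]) / (A - a[j]), (C + c[j]) / (A + a[j])):
                if 1e-12 < gamma < best_gamma:
                    best_gamma, best_j = gamma, j
        r = r - best_gamma * u                # move along the equiangular direction
        active.append(best_j)
    return active

# Orthonormal, centered design: here LARS ranks by decreasing absolute correlation.
rng = np.random.default_rng(2)
M = rng.standard_normal((100, 4))
Q, _ = np.linalg.qr(M - M.mean(axis=0))
y = Q @ np.array([0.5, 3.0, -2.0, 1.0]) + 0.01 * rng.standard_normal(100)
ranking = lars_ranking(Q, y)
```

In OPKNN the columns passed to the ranking would be the columns of the hidden layer output matrix, that is, the outputs of the successive nearest neighbors.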
2.4. Leave-One-Out (LOO) Method
Since the MRSR only provides a ranking of the kernels, the decision over the actual best number of neurons for the model is taken using a Leave-One-Out method. One problem with the LOO error is that it can become very time consuming if the data set has a high number of samples. Fortunately, the PRESS (PREdiction Sum of Squares) statistic provides a direct and exact formula for the calculation of the LOO error for linear models (see [4, 13] for details on this formula and its implementations):
$$\epsilon^{\mathrm{PRESS}}_i = \frac{y_i - \mathbf{h}_i \mathbf{b}}{1 - \mathbf{h}_i \mathbf{P} \mathbf{h}_i^{T}},$$
where $\mathbf{P}$ is defined as $\mathbf{P} = (\mathbf{H}^{T}\mathbf{H})^{-1}$, $\mathbf{h}_i$ is the $i$th row of the hidden layer output matrix $\mathbf{H}$ defined in Section 2.1, and $\mathbf{b}$ is the vector of output weights.
The final decision over the appropriate number of neurons for the model can then be taken by evaluating the LOO error versus the number of neurons used (properly ranked by MRSR already).
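The following NumPy sketch shows how the PRESS formula can replace an explicit Leave-One-Out loop and how the number of neurons could then be chosen along a given ranking. The function names, the use of the mean squared LOO residual as the error measure, and the random test matrix are assumptions made for the illustration; the brute-force function is included only to check the closed form.

```python
import numpy as np

def press_loo_mse(H, y):
    """Mean squared LOO error of the linear model y ~ H b via the PRESS formula."""
    P = np.linalg.inv(H.T @ H)                    # P = (H^T H)^{-1}
    b = P @ H.T @ y                               # least-squares output weights
    # h_i P h_i^T for every row h_i of H (diagonal of the hat matrix).
    leverage = np.einsum("ij,jk,ik->i", H, P, H)
    residuals = (y - H @ b) / (1.0 - leverage)
    return float(np.mean(residuals ** 2))

def brute_force_loo_mse(H, y):
    """Reference: refit the model n times, leaving one sample out each time."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i, *_ = np.linalg.lstsq(H[keep], y[keep], rcond=None)
        errors[i] = y[i] - H[i] @ b_i
    return float(np.mean(errors ** 2))

def select_number_of_neurons(H, y, ranking):
    """Evaluate the LOO error for the first 1, 2, ... ranked neurons and keep the best size."""
    loo = [press_loo_mse(H[:, ranking[:m]], y) for m in range(1, len(ranking) + 1)]
    return int(np.argmin(loo)) + 1, loo

# Synthetic linear problem; the last column is irrelevant by construction.
rng = np.random.default_rng(1)
H = rng.standard_normal((50, 4))
y = H @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(50)

loo_press = press_loo_mse(H, y)
loo_brute = brute_force_loo_mse(H, y)
n_best, loo_curve = select_number_of_neurons(H, y, ranking=[1, 0, 2, 3])
```

The closed form needs a single matrix inversion per candidate size, whereas the brute-force version solves one least-squares problem per sample.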
3. Strategy for Regression Using OPKNN
3.1. Variable Selection (VS)
Variable Selection is one of the most important issues in machine learning, especially when the number of observations (samples) is relatively small compared to the number of input variables. It has been studied in application domains such as pattern recognition, time series modeling, and econometrics. The necessary size of the data set increases exponentially with the number of dimensions. To circumvent this, one solution is to select a subset of the features or variables that best describes the output variables (targets) [14]. It is then possible to capture and reconstruct the underlying regularity or relationship (approximated by the regression model) between the input variables and the output variables.
Variable Selection has several important advantages. It helps to decrease the redundancy of the original data. It can also reduce the complexity of the modeling process. Moreover, it contributes to the interpretability of the input variables.
3.2. Variable Selection Using OPKNN
Whether using KNN, OPKNN, SVM, LSSVM, or some other regression method, an optimization criterion is needed to perform Variable Selection. There are many ways to deal with the Variable Selection problem; a common one is to use an estimate of the generalization error. In this methodology, the set of features that minimizes the generalization error is selected using Leave-One-Out. Other techniques such as Bootstrap or resampling techniques [15, 16] exist, but they are very time consuming and may lead to an unacceptable computational time. In this paper, Variable Selection is performed using the Leave-One-Out error of OPKNN as the criterion, since OPKNN is very fast.
3.2.1. Wrapper Method
As is well known, Variable Selection methods can be roughly divided into two broad classes: filter methods and wrapper methods. As the name implies, our strategy belongs to the wrapper methods, which means that the variables are selected according to a criterion obtained directly from the training algorithm.
In other words, our strategy is to select the input subset that gives the best OPKNN result. Once the input subset is fixed, OPKNN is run again to build the model. Furthermore, the selection procedure is performed on the training set only, and OPKNN is then applied to the selected variables of the test set. In this paper, the input subset is selected by means of the Forward Selection algorithm.
3.2.2. Forward Selection
This algorithm starts from the empty set, which represents the selected set of input variables. Then the best available variable is added to the set, one at a time, until all the variables have been used.
To clarify Forward Selection, suppose a set of inputs $X_i$, $i = 1, 2, \ldots, M$, and the output $Y$; the algorithm is then as follows.
(1) Set $F$ to be the initial set of the original $M$ input variables, and $S$ to be the empty set, as mentioned before.
(2) Find $X_S = \arg\min_{X_i \in F} \{\mathrm{OPKNN}(S \cup X_i)\}$, where $X_S$ represents the selected variable, save the OPKNN results, and move $X_S$ from $F$ to $S$.
(3) Continue the same procedure until the size of $S$ is $M$.
(4) Compare the OPKNN values for all the sizes of the sets $S$; the final selection result is the set $S$ for which the corresponding OPKNN gives the smallest value.
Forward-Backward Selection [17] can also be used instead of Forward Selection in the algorithm, but it increases the computational time.
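The four steps above can be sketched in a few lines of Python. The function `forward_selection` and its `criterion` argument are hypothetical names; in the setting of this paper the criterion would be the Leave-One-Out error of OPKNN on the candidate subset, whereas the toy criterion below is a placeholder chosen only so that the behavior is easy to follow.

```python
from typing import Callable, List, Sequence, Tuple

def forward_selection(
    n_variables: int,
    criterion: Callable[[Sequence[int]], float],
) -> Tuple[List[int], float]:
    """Greedy forward selection; returns the best subset found and its criterion value."""
    remaining = list(range(n_variables))      # F: variables not selected yet
    selected: List[int] = []                  # S: variables selected so far
    best_subset: List[int] = []
    best_score = float("inf")
    while remaining:
        # Step (2): try each remaining variable and keep the one with the lowest criterion.
        score, chosen = min((criterion(selected + [v]), v) for v in remaining)
        selected.append(chosen)
        remaining.remove(chosen)
        # Step (4): remember the subset size that gave the smallest value overall.
        if score < best_score:
            best_score, best_subset = score, list(selected)
    return best_subset, best_score

# Placeholder criterion (an assumption): counts disagreements with a "true" subset {0, 2}.
def toy_criterion(subset: Sequence[int]) -> float:
    relevant = {0, 2}
    return float(sum((v in subset) != (v in relevant) for v in range(5)))

subset, score = forward_selection(5, toy_criterion)
```

The criterion is evaluated M + (M - 1) + ... + 1 times for M variables, which is why a fast model such as OPKNN matters for this wrapper strategy.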
4. Experiments
This section shows the speed and accuracy of the OPKNN method, as well as of the strategy introduced above, using several different regression data sets. For comparison, Section 4.2 also provides the performance of Support Vector Machines (SVMs) [18].
The following subsection shows a toy example to illustrate the performance of OPKNN on a simple case that can be plotted.
4.1. Sine Example
In this toy example, a set of training points is generated (represented as green points in Figure 2); the output is a sum of two sines. This one-dimensional example is used to test the method without the need for variable selection beforehand.
The model built by OPKNN is shown as blue crosses in Figure 2. As seen from the figure, it approximates the data very well.
The dashed blue line in Figure 3 shows the LOO error for different numbers of nearest neighbors. According to the figure, the smallest LOO error reached by the algorithm is close to the level of the real noise introduced in the data set. The computational time for the whole OPKNN is one second (using a Matlab implementation).
Thus, in order to have a very fast and still accurate algorithm, each of the three presented steps has a special importance in the whole OPKNN methodology. The ranking of the K-nearest neighbors by MRSR is one of the fastest ranking methods that provides the exact best ranking, since the model built with KNN is linear between the hidden layer and the output layer. Without MRSR, as shown by the solid red line in Figure 3, the number of nearest neighbors that minimizes the Leave-One-Out error is not optimal, and the Leave-One-Out error curve has several local minima instead of a single global minimum. The linearity also enables the model structure selection step using Leave-One-Out, which is usually very time consuming. Thanks to the PRESS statistic formula for the LOO error calculation, the structure selection can be done in a small computational time.
4.2. RealData Sets
For the comparison of OPKNN with four other methods, nine data sets are selected from different application fields for regression problems [19]. Each data set is randomly permuted (without repetitions) and then divided into a training set (two-thirds of the data set) and a test set (one-third of the data set). Ten such rounds are performed (with different permutations) so that the results are statistically meaningful. The reported test error is thus the average over the 10 trials.
The only exception is the “Delve” data set, which has 2000 training samples and 20732 test samples. Repeating the test 10 times in a Monte Carlo fashion is not necessary in this case, since the number of test samples is very large.
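The evaluation protocol described above (random permutation, two-thirds for training, one-third for testing, 10 rounds) can be sketched as follows. The function name `monte_carlo_test_error`, the use of the mean squared error as test error, and the linear least-squares model standing in for OPKNN are assumptions made for the illustration.

```python
import numpy as np

def monte_carlo_test_error(X, y, fit, predict, n_rounds=10, seed=0):
    """Average test error over several random two-thirds / one-third splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = (2 * n) // 3
    errors = []
    for _ in range(n_rounds):
        perm = rng.permutation(n)                 # random permutation without repetition
        train, test = perm[:n_train], perm[n_train:]
        model = fit(X[train], y[train])           # train on two-thirds of the data
        residual = predict(model, X[test]) - y[test]
        errors.append(np.mean(residual ** 2))     # assumed error measure: test MSE
    return float(np.mean(errors)), float(np.std(errors))

# Placeholder model (an assumption): ordinary least squares instead of OPKNN.
fit = lambda X_tr, y_tr: np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
predict = lambda w, X_te: X_te @ w

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.standard_normal(300)
mean_err, std_err = monte_carlo_test_error(X, y, fit, predict)
```

Any variable selection would be carried out inside `fit`, on the training part only, so that the test part never influences the selected subset.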
Table 1 shows some key information about the data sets and the variables selected on average, while Tables 2 and 3 report the test error and the computational time for all methods, respectively.



As seen from Table 2, OPKNN achieves the best performance in most cases, with the exception of two data sets. According to these results, SVM and OPELM are reliable in general. However, considering the computational time shown in Table 3, the OPKNN method clearly has its own advantage: it is faster than SVM by several orders of magnitude, as can be seen, for example, on the Abalone data set.
Speed is not the only advantage of OPKNN; it also selects the most significant input variables. This operation greatly simplifies the final model and, moreover, makes the data and the model more interpretable. The cost is computational time: with the forward strategy used for variable selection, the higher the dimensionality of the data, the more rounds of OPKNN are required. Therefore, OPKNN is not as fast as OPELM in some cases when selecting variables. However, in the Delve data, for example, only the most important of the original variables are kept, which greatly reduces the complexity. This selection of variables was also tested with the other methods and yielded much better results, for example a lower test error for the MLP.
5. Conclusions
It is usual to have a very long computational time for training a feedforward network using existing classic learning algorithms, even for simple problems, especially when the number of observations (samples) is relatively small compared to the number of input variables. Thus, this paper presents the OPKNN method as well as a strategy using OPKNN to perform Variable Selection. This algorithm has several notable achievements:
(i) keeping good performance while being simpler than most learning algorithms for feedforward neural networks,
(ii) using KNN as the deterministic initialization,
(iii) the computational time of OPKNN being extremely low,
(iv) variable selection greatly simplifies the final model and, moreover, makes the data and the model more interpretable.
In the experiment section, we have demonstrated the speed and accuracy of the OPKNN methodology in nine real applications. The aim of OPKNN is not to be the best method in terms of error, but to show that OPKNN is a good tradeoff between performance, computational time, and variable selection possibility. In short, this makes OPKNN a valuable tool for real applications.
References
[1] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Adaptive Computation and Machine Learning, 2001.
[2] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, New York, NY, USA, 2000.
[3] J. A. K. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle, Least Squares Support Vector Machines, World Scientific Publishing, Singapore, 2007.
[4] G. Bontempi, M. Birattari, and H. Bersini, “Recursive lazy learning for modeling and control,” in Proceedings of the European Conference on Machine Learning, pp. 292–303.
[5] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1–3, pp. 489–501, 2006.
[6] G.-B. Huang and L. Chen, “Convex incremental extreme learning machine,” Neurocomputing, vol. 70, no. 16–18, pp. 3056–3062, 2007.
[7] M.-B. Li, G.-B. Huang, P. Saratchandran, and N. Sundararajan, “Fully complex extreme learning machine,” Neurocomputing, vol. 68, no. 1–4, pp. 306–314, 2005.
[8] Y. Miche, P. Bas, C. Jutten, O. Simula, and A. Lendasse, “A methodology for building regression models using Extreme Learning Machine: OPELM,” in Proceedings of the European Symposium on Artificial Neural Networks, pp. 23–25, 2008.
[9] C. R. Rao and S. K. Mitra, Generalized Inverse of Matrices and Its Applications, John Wiley & Sons, New York, NY, USA, 1972.
[10] A. Sorjamaa, J. Hao, and A. Lendasse, “Mutual information and k-Nearest Neighbors approximator for time series prediction,” in Proceedings of the International Conference on Artificial Neural Networks (ICANN '05), vol. 3697 of Lecture Notes in Computer Science, pp. 553–558, Warsaw, Poland, 2005.
[11] T. Similä and J. Tikka, “Multiresponse sparse regression with application to multidimensional scaling,” in Proceedings of the 15th International Conference on Artificial Neural Networks, vol. 3697 of Lecture Notes in Computer Science, pp. 97–102, 2005.
[12] B. Efron, T. Hastie, I. Johnstone et al., “Least angle regression,” Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
[13] R. H. Myers, Classical and Modern Regression with Applications, Duxbury, Pacific Grove, Calif, USA, 1990.
[14] A. Lendasse, V. Wertz, and M. Verleysen, “Model selection with cross-validations and bootstraps - application to time series prediction with RBFN models,” in Proceedings of the Joint International Conference on Artificial Neural Networks, vol. 2714 of Lecture Notes in Computer Science, pp. 573–580, Istanbul, Turkey, 2003.
[15] B. Efron and R. Tibshirani, New Tools in Nonlinear Modeling and Prediction, Chapman and Hall, London, UK, 1993.
[16] M. Verleysen, “Learning high-dimensional data,” in Proceedings of the NATO Advanced Research Workshop on Limitations and Future Trends in Neural Computing, pp. 22–24, Italy, 2001.
[17] Q. Yu, E. Séverin, and A. Lendasse, “Variable selection for financial modeling,” in Proceedings of the 13th International Conference on Computing in Economics and Finance, Montréal, Canada, 2007.
[18] C. C. Chang and C. J. Lin, “LIBSVM: a library for support vector machines,” 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[19] http://archive.ics.uci.edu/ml/datasets.html.
Copyright
Copyright © 2010 Qi Yu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.