#### Abstract

This paper presents a methodology named Optimally Pruned K-Nearest Neighbors (OP-KNN), which competes with state-of-the-art methods while remaining fast. It builds a single-hidden layer feedforward neural network using K-Nearest Neighbors as kernels to perform regression. Multiresponse Sparse Regression (MRSR) is used to rank each *k*th nearest neighbor, and Leave-One-Out estimation is then used to select the optimal number of neighbors and to estimate the generalization performance. Since the computational time of this method is small, this paper also presents a strategy using OP-KNN to perform Variable Selection, which is tested successfully on eight real-life data sets from different application fields. In summary, the most significant characteristic of this method is that it provides good performance and a comparatively simple model at extremely high learning speed.

#### 1. Introduction

In many application fields, the regression problem has received wide attention, whether to predict a dependent variable (target) from a number of independent variables (observations) or to model numerical data consisting of values of input variables and of one or more output variables. However, regression problems face two main difficulties: accuracy and computational time.

In recent years, many different techniques have been investigated to solve regression problems. Support Vector Machines (SVMs) are among the most popular of these techniques; they were initially developed for classification tasks and were later extended to the domain of regression [1]. Briefly, SVM is a universal constructive learning procedure based on statistical learning theory. Its basic idea is to transform the signal into a higher-dimensional feature space and find the optimal hyperplane for classification and regression problems. Unlike the Multilayer Perceptron (MLP), the nonlinear classification and regression models are solved using convex optimization leading to a unique solution, which avoids the problem of local minima of MLP [2]. Least Squares Support Vector Machines (LS-SVMs) are reformulations of the standard SVM [3]. In LS-SVM, the complexity of solving the quadratic programs of SVM is reduced to solving linear Karush-Kuhn-Tucker (KKT) conditions. Only linear equations need to be solved, which makes the approach much simpler. As a consequence, LS-SVM loses the sparseness property of SVM.

However, some limitations of SVM weaken its performance: the hyperparameters of the kernel have to be chosen, and the training is slow, especially when the number of variables is large. From a practical point of view, perhaps the most serious problem with SVM is the high algorithmic complexity and extensive memory requirements of the required quadratic programming in large-scale tasks. Thus, another group of methods, such as K-nearest neighbors (KNN) or Lazy Learning (LL) [4], is taken into account. The key idea behind KNN is that similar training samples have similar output values. Like SVM, it avoids the local minima problem, but it is simpler and faster.

On the other hand, Variable Selection has several important advantages when the number of input variables increases. It helps to decrease the redundancy of the original data. It can also reduce the complexity of the modeling process. Moreover, it contributes to the interpretability of the input variables.

Thus, in this paper, we present a methodology, Optimally Pruned K-Nearest Neighbors (OP-KNN), which builds a single-hidden layer feedforward neural network (SLFN) using KNN as the kernel. The most significant characteristic of this method is that it tends to provide good generalization performance at a fast computing speed and to select the most important variables at the same time.

In the next section, the three steps of OP-KNN are introduced. The strategy we use to solve regression problems with OP-KNN is shown in Section 3. Section 4 gives the results for a toy example and nine real-life data sets using OP-KNN and four other methods, and the last section summarizes the whole methodology.

#### 2. Optimally Pruned K-Nearest Neighbors

In this section, a methodology called Optimally Pruned K-Nearest Neighbors (OP-KNN) is presented. The three main steps of OP-KNN are summarized in Figure 1.

##### 2.1. Single-Hidden Layer Feedforward Neural Networks (SLFNs)

Recently, Huang et al. [5] proposed an original algorithm called Extreme Learning Machine (ELM). This method makes the selection of the weights of the hidden neurons very fast in the case of a single-hidden layer feedforward neural network (SLFN). A more thorough presentation of the ELM algorithm can be found in the original papers [6, 7]. Furthermore, a methodology named Optimally Pruned Extreme Learning Machine (OP-ELM) [8], based on the original ELM algorithm, has been shown to be more efficient when encountering irrelevant or correlated data.

The first step of the OP-KNN algorithm is building a single-hidden layer feedforward neural network. This is similar to the core of the Extreme Learning Machine (ELM). The difference is that OP-KNN is deterministic, whereas ELM and OP-ELM choose their hidden nodes randomly.

In the context of a single-hidden layer perceptron network, let us denote the inputs by $\mathbf{x}$, the outputs by $\mathbf{y}$, and the weight vector between the hidden layer and the output by $\mathbf{b}$. The activation functions used with OP-KNN differ from the original SLFN choice, since the original sigmoid activation functions of the neurons are replaced by the K-Nearest Neighbors, hence the name OP-KNN. For the output layer, the activation function remains linear, meaning that the relationship between the hidden layer and the output layer is linear.

A theorem proposed in [5] states that the output weights $\mathbf{b}$ can be computed from the real output $\mathbf{y}$ and the hidden layer output matrix $\mathbf{H}$, where the columns of $\mathbf{H}$ are the corresponding outputs of the K-nearest neighbors. Finally, the output weights are computed by $\mathbf{b} = \mathbf{H}^{\dagger}\mathbf{y}$, where $\mathbf{H}^{\dagger}$ stands for the Moore-Penrose inverse [9] of $\mathbf{H}$ and $\mathbf{y}$ is the output.

The only remaining parameter in this process is the initial number of neurons of the hidden layer.

##### 2.2. K-Nearest Neighbors

The K-Nearest Neighbors (KNN) model is a very simple but powerful tool. It has been used in many different applications, particularly in classification tasks. The key idea behind KNN is that similar training samples have similar output values for regression problems [10]. In OP-KNN, the approximation of the output is the weighted sum of the outputs of the $k$ nearest neighbors. The model introduced in the previous section becomes

$$\hat{y}_i = \sum_{j=1}^{k} b_j \, y_{P(i,j)},$$

where $\hat{y}_i$ represents the output estimation, $P(i,j)$ is the index number of the $j$th nearest neighbor of sample $\mathbf{x}_i$, and $\mathbf{b} = (b_1, \ldots, b_k)$ represents the results of the Moore-Penrose inverse introduced in the previous section.

In this sense, each neuron uses a different nearest neighbor; in other words, the only remaining hyperparameter that has to be chosen is the neighborhood size $k$. Besides $k$, there is no other hyperparameter in KNN or in OP-KNN.
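The construction in Sections 2.1 and 2.2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the Euclidean distance, the toy data, and the choice to exclude a training sample from its own neighbor list (`exclude_self`) are all assumptions made here.

```python
import numpy as np

def knn_hidden_layer(X_train, y_train, X_query, k, exclude_self=False):
    """Return H: column j holds the output of the (j+1)-th nearest
    training neighbor of each query sample."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    if exclude_self:
        # Assumption: only meaningful when X_query is X_train itself, so
        # that a sample is never counted as its own nearest neighbor.
        np.fill_diagonal(d2, np.inf)
    order = np.argsort(d2, axis=1)[:, :k]  # P(i, j) in the text
    return y_train[order]                   # shape (n_query, k)

def fit_output_weights(H, y):
    """Output weights b = pinv(H) @ y (Moore-Penrose inverse)."""
    return np.linalg.pinv(H) @ y

# Hypothetical toy data: a noisy sine in one dimension.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

H = knn_hidden_layer(X, y, X, k=10, exclude_self=True)
b = fit_output_weights(H, y)
y_hat = H @ b                               # weighted sum of neighbor outputs
train_mse = float(np.mean((y - y_hat) ** 2))
```

Without the self-exclusion, the first column of `H` on the training set would equal `y` exactly and the fit would be trivial, which is why the sketch makes that assumption explicit.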

##### 2.3. Multiresponse Sparse Regression (MRSR)

For the removal of the useless neurons of the hidden layer, the Multiresponse Sparse Regression proposed by Similä and Tikka in [11] is used. It is an extension of the Least Angle Regression (LARS) algorithm [12] and hence it is actually a variable ranking technique, rather than a selection one. The main idea of this algorithm is the following: denote by $\mathbf{T} = [\mathbf{t}_1 \cdots \mathbf{t}_p]$ the $n \times p$ matrix of targets, and by $\mathbf{X} = [\mathbf{x}_1 \cdots \mathbf{x}_m]$ the $n \times m$ regressors matrix. MRSR adds each regressor one by one to the model $\mathbf{Y}^{k} = \mathbf{X}\mathbf{W}^{k}$, where $\mathbf{Y}^{k} = [\mathbf{y}_1^{k} \cdots \mathbf{y}_p^{k}]$ is the target approximation by the model. The weight matrix $\mathbf{W}^{k}$ has $k$ nonzero rows at the $k$th step of the MRSR. With each new step, a new nonzero row, and thus a new regressor, is added to the total model.

An important detail shared by the MRSR and the LARS is that the ranking obtained is exact in the case where the problem is linear. In fact, this is the case, since the neural network built in the previous step is linear between the hidden layer and the output layer. Therefore, the MRSR provides the exact ranking of the neurons for the problem [12].

Details on the definition of a cumulative correlation between the considered regressor and the current model's residuals and on the determination of the next regressor to be added to the model can be found in the original paper about the MRSR [11].

MRSR is hence used to rank the kernels of the model: the target is the actual output while the “variables” considered by MRSR are the outputs of the k-nearest neighbors.
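To make the idea of ranking kernels concrete, the sketch below uses a deliberately simplified stand-in: greedy forward ranking by correlation with the current residual, in the style of orthogonal matching pursuit. It is not MRSR or LARS, whose step rule is described in [11, 12]; it only illustrates how the columns of the hidden layer output matrix can be ordered for a single output. The function name and the toy data are assumptions.

```python
import numpy as np

def greedy_rank(H, y):
    """Rank the columns of H by greedy residual correlation
    (an OMP-style stand-in, NOT the MRSR algorithm)."""
    n, m = H.shape
    norms = np.linalg.norm(H, axis=0)
    norms[norms == 0] = 1.0            # guard against all-zero columns
    remaining = list(range(m))
    ranking = []
    residual = y.astype(float).copy()
    while remaining:
        # Normalized absolute correlation of each candidate with the residual.
        corr = np.abs(H[:, remaining].T @ residual) / norms[remaining]
        best = remaining[int(np.argmax(corr))]
        ranking.append(best)
        remaining.remove(best)
        # Refit on all columns ranked so far and update the residual.
        coef, *_ = np.linalg.lstsq(H[:, ranking], y, rcond=None)
        residual = y - H[:, ranking] @ coef
    return ranking

# Hypothetical check: only column 2 carries signal, so it should rank first.
rng = np.random.default_rng(0)
H_demo = rng.standard_normal((100, 5))
y_demo = 3.0 * H_demo[:, 2] + 0.01 * rng.standard_normal(100)
ranking = greedy_rank(H_demo, y_demo)
```

The returned list plays the role of the MRSR ranking in the next step: the columns of the hidden layer output matrix are reordered by it before the number of neurons is chosen.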

##### 2.4. Leave-One-Out (LOO) Method

Since the MRSR only provides a ranking of the kernels, the decision over the actual best number of neurons for the model is taken using a Leave-One-Out method. One problem with the LOO error is that it can become very time consuming when the dataset has a high number of samples. Fortunately, the PRESS (PREdiction Sum of Squares) statistic provides a direct and exact formula for the calculation of the LOO error for linear models. See [4, 13] for details on this formula and its implementations:

$$\epsilon_i^{\text{PRESS}} = \frac{y_i - \mathbf{h}_i \mathbf{b}}{1 - \mathbf{h}_i \mathbf{P} \mathbf{h}_i^{T}},$$

where $\mathbf{P}$ is defined as $\mathbf{P} = (\mathbf{H}^{T}\mathbf{H})^{-1}$, $\mathbf{h}_i$ is the $i$th row of the hidden layer output matrix $\mathbf{H}$ defined in Section 2.1, and $\mathbf{b}$ is the vector of output weights.
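The PRESS shortcut can be checked numerically against an explicit Leave-One-Out loop. The following sketch assumes a hidden layer output matrix with full column rank (so that the inverse exists) and uses random data purely for illustration; the function names are hypothetical.

```python
import numpy as np

def press_residuals(H, y):
    """LOO residuals of the linear model y ~ H b via the PRESS formula."""
    P = np.linalg.inv(H.T @ H)          # assumes H has full column rank
    b = P @ H.T @ y                     # least-squares output weights
    # leverage[i] = h_i P h_i^T, where h_i is the i-th row of H.
    leverage = np.einsum("ij,jk,ik->i", H, P, H)
    return (y - H @ b) / (1.0 - leverage)

def brute_force_loo_residuals(H, y):
    """Same quantity, computed by refitting the model n times."""
    n = H.shape[0]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i, *_ = np.linalg.lstsq(H[keep], y[keep], rcond=None)
        out[i] = y[i] - H[i] @ b_i
    return out

rng = np.random.default_rng(0)
H_demo = rng.standard_normal((30, 4))
y_demo = rng.standard_normal(30)

fast = press_residuals(H_demo, y_demo)
slow = brute_force_loo_residuals(H_demo, y_demo)
loo_mse = float(np.mean(fast ** 2))
```

The two residual vectors agree up to floating-point error, while the PRESS version needs a single fit instead of one fit per sample.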

The final decision over the appropriate number of neurons for the model can then be taken by evaluating the LOO error versus the number of neurons used (properly ranked by MRSR already).
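The selection step described above can be sketched as a loop over the first $m$ ranked neurons, keeping the $m$ with the smallest LOO error. In this sketch the leverages are obtained from a QR factorization rather than from an explicit inverse, which is an implementation choice made here, not something stated in the paper; the columns of `H_ranked` are assumed to be already ordered by the ranking step and to be linearly independent.

```python
import numpy as np

def loo_mse(H, y):
    """Mean squared PRESS residual of the linear model y ~ H b."""
    Q, _ = np.linalg.qr(H)              # reduced QR, Q has shape (n, m)
    leverage = (Q ** 2).sum(axis=1)     # diagonal of the hat matrix
    residual = y - Q @ (Q.T @ y)        # ordinary least-squares residual
    return float(np.mean((residual / (1.0 - leverage)) ** 2))

def select_neuron_count(H_ranked, y):
    """Evaluate the LOO error for the first m ranked neurons, m = 1..M,
    and return the best m together with the whole error curve."""
    errors = [loo_mse(H_ranked[:, :m], y)
              for m in range(1, H_ranked.shape[1] + 1)]
    return int(np.argmin(errors)) + 1, errors

# Hypothetical data: only the first two (ranked) columns carry signal.
rng = np.random.default_rng(1)
H_ranked = rng.standard_normal((200, 6))
y_demo = H_ranked[:, 0] + 0.5 * H_ranked[:, 1] + 0.1 * rng.standard_normal(200)

best_m, errors = select_neuron_count(H_ranked, y_demo)
```

Plotting `errors` against the number of neurons gives the kind of LOO curve that the decision is read from.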

#### 3. Strategy for Regression Using OP-KNN

##### 3.1. Variable Selection (VS)

Variable Selection is one of the most important issues in machine learning, especially when the number of observations (samples) is relatively small compared to the number of input variables. It has been studied in application domains such as pattern recognition, time series modeling, and econometrics. The necessary size of the data set increases exponentially with the number of dimensions. To circumvent this, one solution is to select a subset of the features or variables which best describes the output variables (targets) [14]. Then, it is possible to capture and reconstruct the underlying regularity or relationship (that is approximated by the regression model) between input variables and output variables.

Variable Selection has several important advantages. It helps to decrease the redundancy of the original data. It can also reduce the complexity of the modeling process. Moreover, it contributes to the interpretability of the input variables.

##### 3.2. Variable Selection Using OP-KNN

Whether using KNN, OP-KNN, SVM, LS-SVM, or some other regression method, an optimization criterion is needed to perform Variable Selection. There are many ways to deal with the Variable Selection problem; a common one is to use an estimate of the generalization error. In this methodology, the set of features that minimizes the generalization error is selected using Leave-One-Out. Other techniques such as Bootstrap or resampling techniques [15, 16] exist, but they are very time consuming and may lead to an unacceptable computational time. In this paper, Variable Selection is performed using the Leave-One-Out error of OP-KNN as the criterion, since OP-KNN is very fast.

###### 3.2.1. Wrapper Method

Variable Selection methods can be roughly divided into two broad classes: filter methods and wrapper methods. Our strategy belongs to the wrapper methods, which means that the variables are selected according to a criterion obtained directly from the training algorithm.

In other words, our strategy is to select the input subset that gives the best OP-KNN result. Once the input subset is fixed, OP-KNN is run again to build the model. Furthermore, the selection procedure is performed on the training set only, and OP-KNN is then used on the selected variables of the test set. In this paper, the input subset is selected by means of the Forward Selection algorithm.

###### 3.2.2. Forward Selection

This algorithm starts from the empty set, which represents the selected set of input variables. Then the best available variable is added to the set, one at a time, until all the variables have been run through.

To clarify Forward Selection, suppose a set of inputs $x_i$, $i = 1, 2, \ldots, M$, and the output $y$; the algorithm is then as follows.

(1) Set $F$ to be the initial set of the $M$ original input variables, and $S$ to be the empty set mentioned before.

(2) Find $x^{S} = \arg\min_{x_i \in F} \text{OP-KNN}(S \cup \{x_i\})$, where $x^{S}$ represents the selected variable and $\text{OP-KNN}(\cdot)$ is the Leave-One-Out error of OP-KNN built on the given variables; save the OP-KNN results, and move $x^{S}$ from $F$ to $S$.

(3) Continue the same procedure until the size of $S$ is $M$.

(4) Compare the OP-KNN values for all the sizes of the set $S$; the final selection result is the set $S$ whose corresponding OP-KNN gives the smallest value.

Forward-Backward Selection [17] can also be used instead of Forward Selection in the algorithm, but it will increase the computational time.
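The four steps above can be sketched as a generic wrapper loop. In the paper the criterion is the Leave-One-Out error of OP-KNN on the candidate subset; here it is passed in as an arbitrary callable, and the toy criterion used in the example is entirely hypothetical.

```python
from typing import Callable, List, Tuple

def forward_selection(
    n_vars: int,
    criterion: Callable[[List[int]], float],
) -> Tuple[List[int], float, List[Tuple[float, List[int]]]]:
    """Greedy forward selection over variable indices 0..n_vars-1.

    `criterion(subset)` returns an error estimate to minimize."""
    remaining = list(range(n_vars))     # the set F
    selected: List[int] = []            # the set S
    history = []                        # best (error, subset) for each size
    while remaining:
        # Step (2): try every remaining variable and keep the best one.
        scores = [(criterion(selected + [v]), v) for v in remaining]
        best_score, best_var = min(scores)
        selected.append(best_var)
        remaining.remove(best_var)
        history.append((best_score, list(selected)))
    # Step (4): compare all subset sizes and keep the smallest error.
    best_score, best_subset = min(history, key=lambda item: item[0])
    return best_subset, best_score, history

def toy_criterion(subset: List[int]) -> float:
    """Hypothetical error: variables 1 and 3 help, every variable costs 0.5."""
    return 10.0 - 4.0 * (1 in subset) - 3.0 * (3 in subset) + 0.5 * len(subset)

best_subset, best_score, history = forward_selection(4, toy_criterion)
```

With $M$ variables the loop evaluates the criterion $M(M+1)/2$ times, which is why a fast criterion such as the PRESS-based LOO error matters for this strategy.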

#### 4. Experiments

This section shows the speed and accuracy of the OP-KNN method, as well as the strategy introduced before, using several different regression data sets. For comparison, Section 4.2 also provides the performance obtained with Support Vector Machine (SVM) [18].

The following subsection shows a toy example to illustrate the performance of OP-KNN on a simple case that can be plotted.

##### 4.1. Sine Example

In this toy example, a set of training points () are generated (and represented as green points in Figure 2); the output is a sum of two sines. This one-dimensional example is used to test the method without the need for variable selection beforehand.

The model built by OP-KNN is shown as blue crosses in Figure 2. As seen from the figure, it approximates the data very well.

The dashed blue line in Figure 3 shows the LOO error for different numbers of nearest neighbors. From the analysis of the figure, by using nearest neighbors, the algorithm reaches the smallest LOO error () which is close to the real noise introduced in the dataset which is . The computational time for the whole OP-KNN is one second (using Matlab implementation).

Thus, in order to have a very fast and still accurate algorithm, each of the three presented steps has a special importance in the whole OP-KNN methodology. The K-nearest neighbor ranking by the MRSR is one of the fastest ranking methods providing the exact best ranking, since the model built with KNN is linear in the output layer. Without MRSR, as can be seen from the solid red line in Figure 3, the number of nearest neighbors that minimizes the Leave-One-Out error is not optimal, and the Leave-One-Out error curve has several local minima instead of a single global minimum. The linearity also enables the model structure selection step using Leave-One-Out, which is usually very time consuming. Thanks to the PRESS statistic formula for the LOO error calculation, the structure selection can be done in a small computational time.

##### 4.2. Real-Data Sets

For the comparison of OP-KNN with four other methods, nine regression data sets are selected from different application fields [19]. Each data set is randomly permuted (without repetitions) and then divided into a training set (two-thirds of the data set) and a testing set (one-third of the data set). Ten such rounds are performed (with different permutations) so that the results are statistically more reliable. The test error reported is therefore the average over the 10 trials.

The only exception here is the “Delve” data set, which has 2000 samples for training and 20732 samples for testing. Repeating the test over 10 Monte Carlo rounds is not necessary in this case, since the number of testing samples is very large.
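The repeated random split described above can be sketched as follows. The helper name, the use of mean squared error as the test error, and the toy linear model are assumptions made for illustration; any regression method can be plugged in through the `fit_predict` callable.

```python
import numpy as np

def monte_carlo_test_error(X, y, fit_predict, n_rounds=10,
                           train_fraction=2.0 / 3.0, seed=0):
    """Average test MSE over repeated random train/test splits.

    `fit_predict(X_train, y_train, X_test)` must return predictions
    for X_test (hypothetical interface)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_fraction * n))
    errors = []
    for _ in range(n_rounds):
        perm = rng.permutation(n)               # permutation without repetition
        train, test = perm[:n_train], perm[n_train:]
        y_pred = fit_predict(X[train], y[train], X[test])
        errors.append(np.mean((y[test] - y_pred) ** 2))
    return float(np.mean(errors)), float(np.std(errors))

def linear_fit_predict(X_train, y_train, X_test):
    """Toy one-dimensional linear regression used only for the example."""
    slope, intercept = np.polyfit(X_train[:, 0], y_train, 1)
    return slope * X_test[:, 0] + intercept

X_demo = np.arange(30, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo[:, 0] + 1.0               # noiseless line
mean_err, std_err = monte_carlo_test_error(X_demo, y_demo, linear_fit_predict)
```

Reporting the standard deviation alongside the mean gives a rough idea of how much the test error varies from one permutation to the next.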

Table 1 shows some key information about the data sets and the variables selected on average, while Tables 2 and 3 report the test error and the computational time for all methods, respectively.

As seen from Table 2, OP-KNN achieves the best performance in all cases except two data sets. According to these results, SVM and OP-ELM are reliable in general. However, considering the computational time shown in Table 3, the OP-KNN method clearly has its own advantage: it is faster than SVM by several orders of magnitude. For example, in the Abalone data set using the OP-KNN is more than times faster than the SVM.

On the other hand, speed is not the only advantage of OP-KNN; it also selects the most significant input variables. This operation greatly simplifies the final model and, moreover, makes the data and model more interpretable. The cost is computational time: with the forward strategy used in the variable selection part, the higher the dimensionality of the data, the more rounds of OP-KNN are needed. Therefore, OP-KNN is not as fast as OP-ELM in some cases while selecting variables. However, for example, we select most important variables from the original in Delve data, which highly reduces the complexity. This selection of variables was tested with the other methods and yielded much better results, decreasing to the test error for the MLP for example.

#### 5. Conclusions

Training a feedforward network with existing classic learning algorithms usually requires a very long computational time, even for simple problems, especially when the number of observations (samples) is relatively small compared to the number of input variables. Thus, this paper presents the OP-KNN method as well as a strategy using OP-KNN to perform Variable Selection. This algorithm has several notable achievements:

(i) keeping good performance while being simpler than most learning algorithms for feedforward neural networks,

(ii) using KNN as the deterministic initialization,

(iii) the computational time of OP-KNN being extremely low,

(iv) variable selection highly simplifies the final model, and moreover, makes the data and model more interpretable.

In the experiment section, we have demonstrated the speed and accuracy of the OP-KNN methodology in nine real applications. The aim of OP-KNN is not to be the best method in terms of error, but to prove that OP-KNN is a good tradeoff between performance, computational time, and variable selection possibility. In a word, this makes OP-KNN a valuable tool for real applications.