Abstract

The kernel-based neural network (KNN) is proposed as a single neuron suitable for online learning with adaptive parameters. With an adaptive kernel parameter, this neuron can classify data accurately without resorting to a multilayer error backpropagation neural network. The proposed method, built around the kernel least-mean-square algorithm, reduces memory requirements through a sparsification technique, and the kernel spread adapts during learning. Our experiments show that this method is considerably faster and more accurate than previous online learning algorithms.

1. Introduction

Adaptive filters lie at the heart of most neural networks [1]. The LMS method and its kernel-based variants are attractive online methods that learn iteratively to reduce the mean squared error toward the optimum Wiener weights. Owing to its simple implementation [1], LMS has become a natural candidate for online kernel-based learning. Kernel-based learning [2] uses Mercer kernels to produce nonlinear versions of conventional linear methods.

After the introduction of the kernel trick, the kernel least-mean-square (KLMS) algorithm [3, 4] was proposed. KLMS solves the LMS problem in a reproducing kernel Hilbert space (RKHS) [3] using a stochastic gradient methodology. The KNN combines kernel abilities with LMS features: it learns easily over a variety of patterns while retaining the capabilities of a traditional neuron. Our experimental results show that, with suitable parameters, this classifier outperforms other online kernel methods.

Kernel-based methods have two main drawbacks: selecting proper values for the kernel parameters, and series expansions whose size grows with the number of training data, which makes them unsuitable for online applications.

This paper concentrates only on the Gaussian kernel (for reasons similar to those discussed in [5]), although KNN can use other kernels as well. The role of the kernel width in the smoothness of the performance surface is studied in [6]. Determining the width of the Gaussian kernel is very important in kernel-based methods: controlling the kernel width allows us to control the learning rate and the tradeoff between overfitting and underfitting.

Cross-validation is one of the simplest ways to tune this parameter, but it is costly and cannot be used for datasets with too many classes; in [7], the parameters are therefore chosen using a subset of data with a small number of classes. In other methods, a genetic algorithm [8] or grid search [5] is used to determine proper parameter values. However, in all the methods mentioned, the kernel width is chosen in a preprocessing step, which is against the principle of online learning and introduces a time overhead. Therefore, we follow methodologies that are consistent with online applications. In related work, the kernel width is scaled in a distribution-dependent way for SVM methods [9], kernel polarization is used as a kernel optimality criterion independent of the learning machine [10], and in [11] a data-dependent kernel optimization algorithm is employed which maximizes the class separability in the empirical feature space. The computational complexity of these methods is quadratic in the number of selected training data. Recently, an adaptive kernel width method was proposed to optimize the minimum error entropy (MEE) criterion; its complexity is linear in the length of the window used for computing the density estimate [12].

We propose adaptive kernel width learning for the KNN method, which preserves its online nature, requires no preprocessing, and converges. We exploit the gradient structure of KNN to estimate the best kernel width during the learning process. As a result, the KNN method with adaptive kernel width remains online and improves its accuracy compared with versions that use a fixed kernel width.

On the other hand, the computational complexity of kernel-based methods must be reduced for them to be useful in online applications. Batch methods usually rely on pruning [13-15] or fixed-size models [13, 14, 16-18], while online methods use truncation [17, 18]. We focus on online model reduction. Sparsification techniques based on the approximate linear dependency (ALD) criterion, which check the linear dependency of the input feature vectors, were proposed for the KRLS and KLMS algorithms [19, 20], but they have quadratic complexity. A sliding-window kernel RLS algorithm keeps only the last L pairs of arrived data, but it is a local method [21]. In [22, 23], an online coherence-based sparsification method was proposed for kernel-based affine projection (KAP) and KLMS that has only linear complexity. The coherence parameter is a fundamental quantity that characterizes the behavior of dictionaries in sparse approximation problems. In this paper, we combine the coherence criterion with the proposed AKNN to control the growth of the number of stored instances.

This paper is organized as follows. In Section 2, KLMS and KNN are introduced. Section 3 discusses adapting the kernel parameter and sparsifying the stored instances in KNN. In Section 4, experiments illustrate the effectiveness of our approach compared with existing methods. Finally, Section 5 summarizes the conclusions and points out avenues for further research.

2. Background

In this section, a short review of the LMS and KLMS algorithms is presented. The notation used in the formulation is summarized in Table 1.

2.1. LMS Algorithm

The main purpose of the LMS algorithm is to find a proper weight vector, which can reduce the MSE of the system output based on a set of examples $\{(\mathbf{x}_i, d_i)\}_{i=1}^{N}$. Therefore, the LMS cost function is
$$J(\mathbf{w}) = E\big[(d_i - \mathbf{w}^{T}\mathbf{x}_i)^{2}\big].$$
The LMS algorithm approximates the weight vector using a gradient method [1]:
$$e_i = d_i - \mathbf{w}_i^{T}\mathbf{x}_i, \qquad \mathbf{w}_{i+1} = \mathbf{w}_i + \eta\, e_i\, \mathbf{x}_i,$$
where $\eta$ is the stepsize parameter. The LMS algorithm is a famous linear filter because it has easy implementation and low computational cost.
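For concreteness, the following Python sketch runs one pass of the LMS update described above; the function name, the default step size, and the data layout are illustrative assumptions rather than part of the original formulation.

import numpy as np

def lms_train(X, d, eta=0.1):
    # One pass of plain LMS over the samples (rows of the numpy array X).
    # eta is an assumed step-size value used only for illustration.
    w = np.zeros(X.shape[1])
    for x_i, d_i in zip(X, d):
        e_i = d_i - w @ x_i        # instantaneous error
        w = w + eta * e_i * x_i    # stochastic gradient step
    return w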

2.2. KLMS Algorithm

The LMS algorithm learns linear patterns very well but is poor at learning nonlinear patterns. To overcome this problem, the LMS algorithm was derived directly in the kernel feature space [3, 4]. Kernel methods map the input data into a high-dimensional space (HDS) [24]; this mapping makes it possible to solve nonlinear problems with linear algorithms. Mercer's theorem then provides a kernel function that computes the inner product of mapped data directly in the input space, which is known as the kernel trick:
$$\kappa(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^{T}\varphi(\mathbf{x}_j).$$
The basic idea of the KLMS algorithm is to perform the linear LMS algorithm on $\varphi(\mathbf{x}_i)$ in the feature space. So, the KLMS cost function is given by
$$J(\boldsymbol{\Omega}) = E\big[(d_i - \boldsymbol{\Omega}^{T}\varphi(\mathbf{x}_i))^{2}\big],$$
where $\varphi(\mathbf{x}_i)$ and $\boldsymbol{\Omega}$ denote the mapped input vectors and the weight vector in the feature space (RKHS), respectively. For convenience, assuming that $\boldsymbol{\Omega}_0 = \mathbf{0}$, the stochastic gradient update yields
$$\boldsymbol{\Omega}_{i+1} = \boldsymbol{\Omega}_i + \eta\, e_i\, \varphi(\mathbf{x}_i) = \eta \sum_{j=1}^{i} e_j\, \varphi(\mathbf{x}_j).$$
By exploiting the kernel trick, the output for a new input $\mathbf{x}$ is given by
$$f(\mathbf{x}) = \boldsymbol{\Omega}_{i+1}^{T}\varphi(\mathbf{x}) = \sum_{j=1}^{i} a_j\, \kappa(\mathbf{x}_j, \mathbf{x}),$$
where $a_j = \eta\, e_j$. Good prediction ability in nonlinear channels is an advantage of the KLMS algorithm, but its complexity for each input is $O(N)$, where $N$ is the number of training data, which is a problem especially in online applications.
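A minimal sketch of the KLMS recursion follows, under the assumption of a Gaussian kernel and a zero initial weight; the names klms_train and gaussian_kernel and the default parameter values are illustrative and not taken from [3, 4].

import numpy as np

def gaussian_kernel(a, b, sigma):
    # kappa(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def klms_train(X, d, eta=0.2, sigma=1.0):
    # KLMS never forms the feature-space weight explicitly: it stores the
    # centres x_j and the coefficients a_j = eta * e_j.
    centres, coeffs = [], []
    for x_i, d_i in zip(X, d):
        f_i = sum(a_j * gaussian_kernel(x_j, x_i, sigma)
                  for a_j, x_j in zip(coeffs, centres))
        e_i = d_i - f_i
        centres.append(x_i)
        coeffs.append(eta * e_i)
    return centres, coeffs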

3. Kernel-Based Neural Network

This section presents the proposed kernel-based neural network in five parts. First, the KLMS-based neuron is explained; then the kernel adaptation, the adaptation stepsize, and the termination condition are discussed; the final subsection covers sparsification.

3.1. The KLMS Neural Network (KNN)

The KLMS neural network (KNN) performs the classification task by adding a nonlinear logistic function to the KLMS structure; Figure 1 illustrates the structure of the KNN.

Similar to the KLMS algorithm, we perform a gradient search in order to find the optimum weight. If $y_i = g(\boldsymbol{\Omega}_i^{T}\varphi(\mathbf{x}_i))$ is the $i$th output, where $g(\cdot)$ is the logistic function, and $e_i = d_i - y_i$, then
$$\boldsymbol{\Omega}_{i+1} = \boldsymbol{\Omega}_i + \eta\, e_i\, g'\big(\boldsymbol{\Omega}_i^{T}\varphi(\mathbf{x}_i)\big)\, \varphi(\mathbf{x}_i).$$
By using the kernel trick, the output $y$ is given as follows:
$$y = g\Big(\sum_{j=1}^{i} a_j\, \kappa(\mathbf{x}_j, \mathbf{x})\Big),$$
where $a_j = \eta\, e_j\, g'\big(\boldsymbol{\Omega}_j^{T}\varphi(\mathbf{x}_j)\big)$. Therefore, KNN can determine the classifier output from the coefficients computed in the learning stage and the stored input vectors.
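The sketch below illustrates how the KNN neuron differs from plain KLMS: the kernel expansion is passed through a logistic function and the logistic derivative enters the new coefficient. The exact scaling of the coefficient follows our reading of the update above and should be treated as an assumption; gaussian_kernel is redefined so the block is self-contained.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gaussian_kernel(a, b, sigma):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def knn_step(x_t, d_t, centres, coeffs, sigma, eta=0.2):
    # One online KNN step: kernel expansion -> logistic output -> error,
    # then a new coefficient that carries the logistic derivative.
    f_t = sum(a_j * gaussian_kernel(x_j, x_t, sigma)
              for a_j, x_j in zip(coeffs, centres))
    y_t = sigmoid(f_t)
    e_t = d_t - y_t
    # g'(f_t) = y_t * (1 - y_t) for the logistic function
    centres.append(x_t)
    coeffs.append(eta * e_t * y_t * (1.0 - y_t))
    return y_t, e_t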

According to what was said, finding a proper kernel plays an important role in kernel-based learning. The best kernel function differs from dataset to dataset. One way to improve the kernel function is to find the best kernel parameters, so we try to determine the best kernel width $\sigma$ in KNN (assuming a Gaussian kernel). Since KNN is an online learning method, we need an iterative way to find an appropriate $\sigma$, rather than methods with high complexity or methods that determine it by preprocessing the data. A good strategy is to select a random value for $\sigma$ at the beginning of the training step and update it until convergence is obtained.

3.2. Adaptation of Kernel Width

The goal of LMS-family methods is to reduce the mean square error, and this is achieved by gradient search. It can be proved that KNN, like the KLMS algorithm, converges when an infinite number of samples is available. If the kernel space structure and the cost function are differentiable, gradient methods can also be used to find the kernel parameters. Because the Gaussian kernel function and the MSE cost function are both differentiable, a gradient search can be used to update the kernel width so as to reach the least mean square error. The proposed modified KNN cost function is defined as
$$J(\boldsymbol{\Omega}, \sigma) = E\big[(d_t - y_t)^{2}\big] = E\big[e_t^{2}\big],$$
and the Gaussian kernel formulation is
$$\kappa_\sigma(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\Big(\!-\frac{\lVert \mathbf{x}_i - \mathbf{x}_j\rVert^{2}}{2\sigma^{2}}\Big),$$
so the error is a function of both the weight $\boldsymbol{\Omega}$ and the kernel width $\sigma$. Using the gradient of the instantaneous cost with respect to $\sigma$, the least mean square error is approached by
$$\sigma_{t+1} = \sigma_t - \eta_\sigma \frac{\partial e_t^{2}}{\partial \sigma} = \sigma_t - 2\eta_\sigma\, e_t\, \frac{\partial e_t}{\partial \sigma},$$
where $\eta_\sigma$ is the stepsize. First, the derivative of the kernel with respect to $\sigma$ is
$$\frac{\partial \kappa_\sigma(\mathbf{x}_j, \mathbf{x}_t)}{\partial \sigma} = \kappa_\sigma(\mathbf{x}_j, \mathbf{x}_t)\, \frac{\lVert \mathbf{x}_j - \mathbf{x}_t\rVert^{2}}{\sigma^{3}}.$$
Substituting this kernel derivative into the error gradient gives
$$\frac{\partial e_t}{\partial \sigma} = -\,g'(f_t) \sum_{j} a_j\, \kappa_\sigma(\mathbf{x}_j, \mathbf{x}_t)\, \frac{\lVert \mathbf{x}_j - \mathbf{x}_t\rVert^{2}}{\sigma^{3}}, \qquad f_t = \sum_{j} a_j\, \kappa_\sigma(\mathbf{x}_j, \mathbf{x}_t).$$
As a result, by substituting the error gradient into the update rule, $\sigma$ is updated by
$$\sigma_{t+1} = \sigma_t + 2\eta_\sigma\, e_t\, g'(f_t) \sum_{j} a_j\, \kappa_{\sigma_t}(\mathbf{x}_j, \mathbf{x}_t)\, \frac{\lVert \mathbf{x}_j - \mathbf{x}_t\rVert^{2}}{\sigma_t^{3}},$$
where $\eta_\sigma$ is the stepsize that controls the convergence and speed of the kernel width adaptation. Choosing a proper value for $\eta_\sigma$ is a main challenge; some of the problems it raises are discussed in Section 3.3. The AKNN output can be calculated with the coefficients $a_j$ and the kernel width $\sigma$ obtained in the training stage, together with the stored input vectors, as follows:
$$y = g\Big(\sum_{j} a_j\, \kappa_\sigma(\mathbf{x}_j, \mathbf{x})\Big),$$
where $g$ is a sigmoid function that accepts inputs over the whole real line and is bounded between 0 and 1. This function is differentiable, which is important for using the gradient method, and its derivative can be obtained from the sigmoid itself, although other suitable functions could also be used. The sigmoid function and its derivative are
$$g(z) = \frac{1}{1 + e^{-z}}, \qquad g'(z) = g(z)\big(1 - g(z)\big).$$
Using this property, the coefficient of the next sample can be reformulated as
$$a_t = \eta\, e_t\, y_t\,(1 - y_t),$$
which decreases the training time. Algorithm 1 describes the proposed adaptive KNN method, which, like the KLMS algorithm, has a unique solution.

Initialization:
  learning stepsize η
  kernel width learning stepsize η_σ
  initial kernel width σ
  sparsity threshold μ
  dictionary D ← ∅, coefficient vector a ← ∅
while a training instance (x_t, d_t) is available do
  for the arriving instance x_t do
    % compute distances of the dictionary instances to the t-th instance
    δ_j ← ‖x_j − x_t‖²  for all x_j ∈ D
    % compute kernel vector, output, and error for the t-th instance
    κ_j ← exp(−δ_j / (2σ²)),  y_t ← g(Σ_j a_j κ_j),  e_t ← d_t − y_t
    % compute the coefficient of the new instance (recursive form)
    a_t ← η e_t y_t (1 − y_t)
    % save e_t for tracking the MSE
    % update the kernel width σ by the gradient rule of Section 3.2
    σ ← σ + 2 η_σ e_t y_t (1 − y_t) Σ_j a_j κ_j δ_j / σ³
    if σ became negative then
      % decrease η_σ until σ becomes non-negative
      % and update the kernel width again
    end if
    if max_j |κ_j| ≤ μ then
      % add x_t to the dictionary D and append a_t to a
    else
      % do not add x_t; update the existing coefficients with its error (Section 3.5)
    end if
  end for
  if MSE > ε_dng then
    % modulate the stepsize η_σ (Section 3.3)
  else if MSE < ε_mse and its change is below ε_Δ then
    % exit the training process (Section 3.4)
  end if
end while
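As a complement to Algorithm 1, the sketch below isolates the kernel width update as a stand-alone function; the exact constants follow the derivation above and should be taken as assumptions about the paper's formulation rather than a verbatim transcription.

import numpy as np

def update_sigma(x_t, e_t, y_t, centres, coeffs, sigma, eta_sigma):
    # Gradient step on the Gaussian kernel width; the factor 2 comes from
    # differentiating the squared error.
    grad = 0.0
    for a_j, x_j in zip(coeffs, centres):
        dist2 = np.sum((x_j - x_t) ** 2)
        k = np.exp(-dist2 / (2.0 * sigma ** 2))
        grad += a_j * k * dist2 / sigma ** 3   # d kappa / d sigma
    # de/d sigma = -g'(f) * grad, with g'(f) = y_t * (1 - y_t)
    return sigma + 2.0 * eta_sigma * e_t * y_t * (1.0 - y_t) * grad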

3.3. Modulation of Stepsize

There are some problems with choosing a fixed $\eta_\sigma$, which we intend to solve in a simple way without adding a new procedure. These problems and their solutions are as follows.

Preventing Negation of the Kernel Width $\sigma$. Choosing a large $\eta_\sigma$ causes large jumps, which sometimes push $\sigma$ from the positive range into the negative range. In order to prevent negation of $\sigma$, its new value is checked at each time step; if $\sigma$ is negative, $\eta_\sigma$ is decreased until $\sigma$ stays in the positive range.

Tracking the Kernel Width $\sigma$. When the current $\sigma$ is far from the true value $\sigma^{*}$ and $\eta_\sigma$ is too small, the adaptation procedure cannot reach $\sigma^{*}$ well; conversely, when $\sigma$ is near $\sigma^{*}$, a too large $\eta_\sigma$ may occasionally drive it far from $\sigma^{*}$ again.

We can track changes in $\sigma$ by controlling the stepsize $\eta_\sigma$. But when does this problem occur? Let $\varepsilon_{\mathrm{dng}}$ be a danger threshold that the learning error is not expected to exceed, and let MSE denote the mean square error over the recent instances. If $\mathrm{MSE} > \varepsilon_{\mathrm{dng}}$, then we change $\eta_\sigma$ according to the distance of the current $\sigma$ from an estimate $\hat{\sigma}$ of $\sigma^{*}$: a large MSE indicates that $\eta_\sigma$ is too small for $\sigma$ to change quickly under the gradient update. So, when $\sigma$ is far from $\hat{\sigma}$, $\eta_\sigma$ is increased so that $\sigma$ reaches $\hat{\sigma}$ quickly, and when $\sigma$ is near $\hat{\sigma}$, $\eta_\sigma$ is decreased to allow convergence. A simple way to estimate $\hat{\sigma}$ is the average distance of the previously seen instances from the current one:
$$\hat{\sigma} = \frac{1}{t-1}\sum_{i=1}^{t-1} \lVert \mathbf{x}_i - \mathbf{x}_t \rVert.$$
According to the Gaussian kernel definition, the kernel has a good resolution when $\sigma$ is comparable to the distances between instances; therefore, this average gives an approximation of $\sigma^{*}$.
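A possible realization of this heuristic is sketched below; the growth and decay factors (2.0 and 0.5) and the criterion for "far from the estimate" are assumptions, since the text does not fix them.

def modulate_eta_sigma(sigma, sigma_hat, eta_sigma, mse, eps_danger):
    # Speed up the width adaptation when sigma is far from the estimate
    # sigma_hat, slow it down when sigma is already close to it.
    if mse > eps_danger:
        if abs(sigma - sigma_hat) > sigma_hat:   # assumed notion of "far"
            eta_sigma *= 2.0                     # assumed growth factor
        else:
            eta_sigma *= 0.5                     # assumed decay factor
    return eta_sigma

def estimate_sigma_hat(past_distances):
    # sigma_hat as the average distance of the instances seen so far
    # from the current one.
    return sum(past_distances) / max(len(past_distances), 1)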

3.4. Termination Condition

When the MSE is in an acceptable range ($\mathrm{MSE} < \varepsilon_{\mathrm{mse}}$) and its change is not noticeable (below $\varepsilon_{\Delta}$), the procedure can be terminated.
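In code, this termination test reduces to a two-condition check on the tracked MSE; the threshold names are placeholders.

def should_stop(mse, prev_mse, eps_mse, eps_delta):
    # Terminate when the error is acceptably small and has stopped changing.
    return mse < eps_mse and abs(mse - prev_mse) < eps_delta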

3.5. Adaptive Sparsification in KNN

The growth of the network as training inputs arrive is the other drawback of kernel-based online learning algorithms. In this section, we use a suitable criterion to cope with this problem and to produce sparse approximations of functions in the RKHS [22, 23]. As we show, this sparsification approach yields an efficient algorithm whose complexity is linear in the number of dictionary elements. We incorporate the coherence criterion into the new AKNN method so that the growth of the number of stored instances is controlled by the coherence parameter $\mu$.

In the online coherence-based sparsification algorithm, whenever a new data pair $(\mathbf{x}_t, d_t)$ arrives, a decision is made on whether to add the new data to the dictionary $D$. If $\boldsymbol{\kappa}_t$ is the kernel vector between the $t$th instance and the dictionary elements, the coherence rule has two modes [23].

(a) If $\max_{\mathbf{x}_j \in D} |\kappa(\mathbf{x}_t, \mathbf{x}_j)| > \mu$. The new instance is not added to $D$ because $\varphi(\mathbf{x}_t)$ can be reasonably well represented by the kernel functions of the dictionary elements. We suggest a gradient approach that applies the effect of the removed instance's error to the coefficients of the dictionary elements so as to minimize $e_t^{2}$, which gives the recursive update
$$a_j \leftarrow a_j + \eta_r\, e_t\, g'(f_t)\, \kappa(\mathbf{x}_j, \mathbf{x}_t), \qquad \mathbf{x}_j \in D,$$
where $\eta_r$ is a stepsize with forgetting-factor capability.

(b) If $\max_{\mathbf{x}_j \in D} |\kappa(\mathbf{x}_t, \mathbf{x}_j)| \le \mu$. The new instance is added to $D$, and the coefficient vector is extended with $a_t = \eta\, e_t\, y_t(1 - y_t)$. The adaptive kernel width and sparsity techniques in the KNN method (AKNNμ), described in Algorithm 1, are produced by combining this online sparsification strategy with the AKNN method. The algorithm retains high accuracy because the errors of removed instances are still committed to the learning process, and it is suitable for online applications. Algorithm 1 shows that its complexity per instance is $O(M)$, where $M$ is the size of the dictionary.
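The two modes of the coherence rule can be sketched as a single update step, as below; eta_r, the handling of an empty dictionary, and the use of the logistic derivative in mode (a) are assumptions consistent with the description above rather than a verbatim transcription of the paper's equations.

import numpy as np

def gaussian_kernel(a, b, sigma):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def sparsified_step(x_t, e_t, y_t, centres, coeffs, sigma, mu, eta, eta_r):
    # Coherence of x_t with the current dictionary.
    k_vec = [gaussian_kernel(x_j, x_t, sigma) for x_j in centres]
    if k_vec and max(abs(k) for k in k_vec) > mu:
        # (a) x_t is well represented: do not add it, but fold its error
        # into the existing coefficients by a gradient step.
        for j, k_j in enumerate(k_vec):
            coeffs[j] += eta_r * e_t * y_t * (1.0 - y_t) * k_j
    else:
        # (b) x_t is sufficiently novel: add it to the dictionary.
        centres.append(x_t)
        coeffs.append(eta * e_t * y_t * (1.0 - y_t))
    return centres, coeffs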

4. Experiments and Results

Two experiments have been designed. The first experiment demonstrates the effect of the kernel width parameter on the performance of the proposed KNN method with adaptive kernel width and of the KNN method with fixed kernel width. The second experiment compares the performance of the proposed method with other online classification methods on several classification problems.

4.1. Evaluating the Effect of the Parameter σ on the Fixed and Adaptive KNN Methods

This experiment visualizes the effect of the initial $\sigma$ on the learning error. It was performed on an artificial two-dimensional (2D) spiral dataset with 720 samples (Figure 2), whose structure is complex. The performance of the adaptive kernel width KNN method was compared with that of the fixed kernel width KNN approach under the same conditions. Figure 3 plots the average square error rate of both methods for a small initial $\sigma$ (Figure 3(a)) and a large initial $\sigma$ (Figure 3(b)). The evolution of $\sigma$ at each time step of the online AKNN algorithm is plotted in Figure 4(a) (small initial $\sigma$) and Figure 4(b) (large initial $\sigma$).

The empirical feature space preserves the geometrical structure of the data in the feature space [11, 26]. Figure 5 gives the 2D projection of the training data (90% of the data) in the empirical feature space when the Gaussian kernel function is employed with different values of $\sigma$. The projection uses only the two dimensions corresponding to the two largest eigenvalues of the kernel Gram matrix. Classification results on the testing data (10% of the data) for three values of $\sigma$ are presented in Figure 6, where misclassified data are marked by circles.

Figure 3 demonstrates the effect of changing the initial $\sigma$ on the evolution of the MSE for both methods. We observe that AKNN achieves a significantly smaller MSE than the KNN algorithm in both cases. This shows that the proposed AKNN approach is effective for different initial values and can reach an appropriate $\sigma$ (Figure 4). The $\sigma$ obtained by the AKNN approach for the spiral dataset is about 0.14. In addition, the effect of $\sigma$ on the classification result is clearly visible in Figure 6: KNN misclassifies many data for an inappropriate $\sigma$ and the fewest data for the best $\sigma$, which is the value obtained by the proposed method.

Figure 5 gives some intuition about the position of the data in the empirical feature space. From Figure 5(b), it is seen that the embedded data are too dense in the empirical feature space and cannot reveal class separability. In Figure 5(c), with a large $\sigma$, the embedded data are not dense, but they still have a complex structure that cannot be separated by linear hyperplanes. In contrast to these two cases, with the $\sigma$ found by AKNN the embedded data have a good resolution and a linear structure that can be separated by linear hyperplanes.

4.2. Performance Comparison of AKNN with Online Learning Methods

The experiments have been carried out to evaluate the performance of the proposed method in comparison with a number of online classification methods. They include the kernel perceptron algorithm [25], the aggressive version of ROMMA [27], the algorithm of [28], and the passive-aggressive (PA) algorithms [29]. We test all the algorithms on ten real datasets, which are listed in Table 2 and can be downloaded from the LIBSVM, UCI, and MIT CBCL face websites.

To make a fair comparison, all algorithms adopt the same experimental setup. For all algorithms in the comparison, we set the same penalty parameter and train on all datasets with a Gaussian kernel of the same width. For the algorithm of [28], its two parameters are set to 2 and 0.9, respectively. We fix the sparsity threshold at 0.7 (the sparsity parameter in the DUOL algorithm and μ in AKNNμ).

We scale all training and testing data to a common range. Then we use 10-fold cross-validation to estimate the efficiency measures, namely the mistake rate, the density rate, and the running time. The mistake rate evaluates the online learning performance as the percentage of instances that are misclassified by the learning algorithm. We measure the sparsity of the learned classifiers by the density rate, the percentage of retained instances (or support vectors in some of the methods). The running time evaluates the computational efficiency of all the algorithms. All the experiments are run in MATLAB on a Windows machine.

Table 3 presents the results of comparing eight online learning algorithms on the large datasets, and Table 4 summarizes the performance of six online learning algorithms on the small and medium datasets. The KNN and AKNN methods are not included in Table 3 because they do not use a sparsification technique. The online learning efficiency measures are presented in both tables.

According to the experimental results shown in Tables 3 and 4, we can see the following:
(i) Although there is little difference in the training mistake rate among all methods, the proposed method has the best training mistake rate, especially on the large datasets.
(ii) The proposed method achieves significantly smaller testing mistake rates than the other online approaches, except on the pima dataset.
(iii) The perceptron, ALMA, and AKNNμ methods return lower density rates than the other approaches, especially on the large datasets, which means they need to keep fewer training data. Therefore, these methods are faster than the other approaches, because in general the training time is proportional to the size of the dictionary.

So, among all the online learning methods, the AKNNμ method yields the lowest mistake rate with the smallest density rate and running time in most cases. This happens because AKNNμ is able to find a proper kernel function by adapting the kernel parameter, and it keeps only the fewest and best instances selected by the online sparsification method.

5. Conclusion and Future Works

The goal of the present paper was to present a novel adaptive kernel least-mean-square neural network. This method (AKNN) adapts the kernel width and sparsifies the model simultaneously. We briefly reviewed the history of learning algorithms based on least squares. Then, we proposed an adaptive kernel scheme that iteratively decreases the mean square error of the KNN cost function in order to select a proper width for the Gaussian kernel. To make the method practical in online applications, we used a sparsification methodology to control the growth of the model order, whose computational complexity is only linear in the dictionary size. By using the μ-coherence dictionary acquired from the training samples, the algorithm needs less computation and memory than the conventional KNN. The conducted experiments show that the proposed algorithm is robust to the choice of the initial σ. The experimental results also show that the proposed algorithm learns faster than the conventional KNN algorithm and classifies better than other online classification methods. However, selecting proper values for the other parameters is still a challenge.

Our future work deals with the following questions: can we use a more efficient function than the sigmoid in the KNN neuron? Can other kernel functions be adapted for use in the KNN neuron? How can the performance of AKNN be improved on imbalanced and noisy data?