Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA
Abstract
The combination of the famed kernel trick and affine projection algorithms (APAs) yields powerful nonlinear extensions, collectively named KAPA here. This paper is a follow-up study of the recently introduced kernel least-mean-square algorithm (KLMS). KAPA inherits the simplicity and online nature of KLMS while reducing its gradient noise, which boosts performance. More interestingly, it provides a unifying model for several neural network techniques, including kernel least-mean-square algorithms, kernel adaline, sliding-window kernel recursive least squares (KRLS), and regularization networks. Therefore, many insights can be gained into the basic relations among them and the tradeoff between computational complexity and performance. Several simulations illustrate its wide applicability.
1. Introduction
A solid mathematical foundation and a wide range of successful applications have made kernel methods very popular. By the famed kernel trick, many linear methods have been recast in high-dimensional reproducing kernel Hilbert spaces (RKHS) to yield more powerful nonlinear extensions, including support vector machines [1], principal component analysis [2], recursive least squares [3], the Hebbian algorithm [4], Adaline [5], and so forth.
More recently, a kernelized least-mean-square (KLMS) algorithm was proposed in [6], which implicitly creates a growing radial basis function (RBF) network with a learning strategy similar to the resource-allocating networks (RAN) proposed by Platt [7]. As an improvement, kernelized affine projection algorithms (KAPAs) are presented for the first time in this paper by reformulating the conventional affine projection algorithm (APA) [8] in general RKHS. The new algorithms are online and simple, and they significantly reduce the gradient noise compared with the KLMS, which improves performance.
More interestingly, the KAPA reduces naturally to the kernel least-mean-square (KLMS) algorithm, sliding-window kernel recursive least squares (SW-KRLS), kernel adaline, and regularization networks in special cases. Thus it provides a unifying model for these existing methods and helps to better understand the basic relations among them and the tradeoff between complexity and performance. Moreover, it also advances our understanding of the resource-allocating networks. Exploiting the underlying linear structure of RKHS, we also briefly discuss its well-posedness.
The organization of the paper is as follows. In
Section 2, the affine projection algorithms are briefly reviewed. Next, in
Section 3, the kernel trick is applied to formulate the nonlinear affine
projection algorithms. Other related algorithms are reviewed as special cases
of the KAPA in Section 4. We detail the implementation of the KAPA in Section
5. Three experiments are studied in Section 6 to support our theory. Finally, Section 7 summarizes the conclusions and future lines of research.
The notation used throughout the paper is summarized
in Table 1.
2. A Review of the Affine Projection Algorithms
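Let $d$ be a zero-mean scalar-valued random variable, and let $\mathbf{u}$ be a zero-mean $L \times 1$ random variable with a positive-definite covariance matrix $\mathbf{R}_u = E[\mathbf{u}\mathbf{u}^T]$. The cross-covariance vector of $d$ and $\mathbf{u}$ is denoted by $\mathbf{r}_{du} = E[d\mathbf{u}]$. The weight vector $\mathbf{w}$ that solves
$$\min_{\mathbf{w}} E\left|d - \mathbf{w}^T\mathbf{u}\right|^2 \tag{1}$$
is given by $\mathbf{w}^o = \mathbf{R}_u^{-1}\mathbf{r}_{du}$ [8].
Several methods that approximate $\mathbf{w}^o$ iteratively also exist, for example, the common gradient method
$$\mathbf{w}(0) = \text{initial guess}, \quad \mathbf{w}(i) = \mathbf{w}(i-1) + \eta\left[\mathbf{r}_{du} - \mathbf{R}_u\mathbf{w}(i-1)\right] \tag{2}$$
or the regularized Newton's recursion
$$\mathbf{w}(0) = \text{initial guess}, \quad \mathbf{w}(i) = \mathbf{w}(i-1) + \eta\left(\mathbf{R}_u + \varepsilon\mathbf{I}\right)^{-1}\left[\mathbf{r}_{du} - \mathbf{R}_u\mathbf{w}(i-1)\right] \tag{3}$$
where $\varepsilon$ is a small positive regularization factor and $\eta$ is the step size specified by the designer.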
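Stochastic-gradient algorithms replace the covariance matrix and the cross-covariance vector by local approximations obtained directly from data at each iteration. There are several ways to obtain such approximations; they trade off computational complexity, convergence performance, and steady-state behavior [8].
Assume that we have access to observations of the random variables $d$ and $\mathbf{u}$ over time
$$\{d(1), d(2), \ldots\}, \quad \{\mathbf{u}(1), \mathbf{u}(2), \ldots\}. \tag{4}$$
The least-mean-square (LMS) algorithm simply uses the instantaneous values for the approximations, $\hat{\mathbf{R}}_u = \mathbf{u}(i)\mathbf{u}(i)^T$ and $\hat{\mathbf{r}}_{du} = d(i)\mathbf{u}(i)$. The corresponding steepest-descent recursion (2) and Newton's recursion (3) become
$$\begin{aligned} \mathbf{w}(i) &= \mathbf{w}(i-1) + \eta\,\mathbf{u}(i)\left[d(i) - \mathbf{u}(i)^T\mathbf{w}(i-1)\right],\\ \mathbf{w}(i) &= \mathbf{w}(i-1) + \eta\,\mathbf{u}(i)\left[\mathbf{u}(i)^T\mathbf{u}(i) + \varepsilon\right]^{-1}\left[d(i) - \mathbf{u}(i)^T\mathbf{w}(i-1)\right]. \end{aligned} \tag{5}$$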
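The affine projection algorithm, however, employs better approximations. Specifically, $\mathbf{R}_u$ and $\mathbf{r}_{du}$ are replaced by the instantaneous approximations from the $K$ most recent regressors and observations. Denoting
$$\mathbf{U}(i) = [\mathbf{u}(i-K+1), \ldots, \mathbf{u}(i)]_{L \times K}, \quad \mathbf{d}(i) = [d(i-K+1), \ldots, d(i)]^T \tag{6}$$
one has
$$\hat{\mathbf{R}}_u = \frac{1}{K}\mathbf{U}(i)\mathbf{U}(i)^T, \quad \hat{\mathbf{r}}_{du} = \frac{1}{K}\mathbf{U}(i)\mathbf{d}(i). \tag{7}$$
Therefore, with the constant factor $1/K$ absorbed into $\eta$ and $\varepsilon$, (2) and (3) become
$$\mathbf{w}(i) = \mathbf{w}(i-1) + \eta\,\mathbf{U}(i)\left[\mathbf{d}(i) - \mathbf{U}(i)^T\mathbf{w}(i-1)\right], \tag{8}$$
$$\mathbf{w}(i) = \mathbf{w}(i-1) + \eta\left[\mathbf{U}(i)\mathbf{U}(i)^T + \varepsilon\mathbf{I}\right]^{-1}\mathbf{U}(i)\left[\mathbf{d}(i) - \mathbf{U}(i)^T\mathbf{w}(i-1)\right], \tag{9}$$
and (9), by the matrix inversion lemma, is equivalent to [8]
$$\mathbf{w}(i) = \mathbf{w}(i-1) + \eta\,\mathbf{U}(i)\left[\mathbf{U}(i)^T\mathbf{U}(i) + \varepsilon\mathbf{I}\right]^{-1}\left[\mathbf{d}(i) - \mathbf{U}(i)^T\mathbf{w}(i-1)\right]. \tag{10}$$
Note that this equivalence lets us deal with the $K \times K$ matrix $[\mathbf{U}(i)^T\mathbf{U}(i) + \varepsilon\mathbf{I}]$ instead of the $L \times L$ matrix $[\mathbf{U}(i)\mathbf{U}(i)^T + \varepsilon\mathbf{I}]$, and it plays a very important role in the derivation of the kernel extensions. We call recursion (8) APA-1 and recursion (10) APA-2.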
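The APA-1 recursion can be illustrated with a short Python sketch. It assumes the update form w(i) = w(i-1) + η U(i)[d(i) - U(i)^T w(i-1)] over the K most recent samples; the function name `apa1`, the synthetic noiseless linear system, and the values of `eta` and `K` are hypothetical choices for the example, not settings from the paper.

```python
import random

def apa1(inputs, desired, eta, K):
    """APA-1 sketch: w <- w + eta * U (d - U^T w) over the K most recent samples."""
    L = len(inputs[0])
    w = [0.0] * L
    for i in range(len(inputs)):
        window = range(max(0, i - K + 1), i + 1)
        # Errors of the most recent samples, all evaluated with the old weights.
        errors = [desired[k] - sum(wj * uj for wj, uj in zip(w, inputs[k]))
                  for k in window]
        # Gradient step: add eta * sum_k e(k) u(k).
        for e, k in zip(errors, window):
            for j in range(L):
                w[j] += eta * e * inputs[k][j]
    return w

# Hypothetical test system: d = w_true^T u without noise.
rng = random.Random(0)
w_true = [1.0, -2.0]
U = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2000)]
d = [sum(a * b for a, b in zip(w_true, u)) for u in U]
w_hat = apa1(U, d, eta=0.1, K=3)
```

With K = 1 the same loop is the plain LMS update; a larger K averages the gradient over more samples at a proportionally higher cost per iteration.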
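In some circumstances, a regularized solution is needed instead of (1). The regularized LS problem is
$$\min_{\mathbf{w}} E\left|d - \mathbf{w}^T\mathbf{u}\right|^2 + \lambda\|\mathbf{w}\|^2 \tag{11}$$
where $\lambda$ is the regularization parameter (not the regularization factor $\varepsilon$ in Newton's recursion). The gradient method is
$$\mathbf{w}(i) = \mathbf{w}(i-1) + \eta\left[\mathbf{r}_{du} - (\lambda\mathbf{I} + \mathbf{R}_u)\mathbf{w}(i-1)\right] = (1 - \eta\lambda)\mathbf{w}(i-1) + \eta\left[\mathbf{r}_{du} - \mathbf{R}_u\mathbf{w}(i-1)\right]. \tag{12}$$
Newton's recursion with $\varepsilon = 0$ is
$$\mathbf{w}(i) = \mathbf{w}(i-1) + \eta(\lambda\mathbf{I} + \mathbf{R}_u)^{-1}\left[\mathbf{r}_{du} - (\lambda\mathbf{I} + \mathbf{R}_u)\mathbf{w}(i-1)\right] = (1 - \eta)\mathbf{w}(i-1) + \eta(\lambda\mathbf{I} + \mathbf{R}_u)^{-1}\mathbf{r}_{du}. \tag{13}$$
If the approximations (7) are used (again absorbing constant factors into $\eta$ and $\lambda$), we have
$$\mathbf{w}(i) = (1 - \eta\lambda)\mathbf{w}(i-1) + \eta\,\mathbf{U}(i)\left[\mathbf{d}(i) - \mathbf{U}(i)^T\mathbf{w}(i-1)\right], \tag{14}$$
$$\mathbf{w}(i) = (1 - \eta)\mathbf{w}(i-1) + \eta\left[\lambda\mathbf{I} + \mathbf{U}(i)\mathbf{U}(i)^T\right]^{-1}\mathbf{U}(i)\mathbf{d}(i), \tag{15}$$
which is, by the matrix inversion lemma, equivalent to
$$\mathbf{w}(i) = (1 - \eta)\mathbf{w}(i-1) + \eta\,\mathbf{U}(i)\left[\lambda\mathbf{I} + \mathbf{U}(i)^T\mathbf{U}(i)\right]^{-1}\mathbf{d}(i). \tag{16}$$
For simplicity, recursions (14) and (16) are named here APA-3 and APA-4, respectively.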
3. The Kernel Affine Projection Algorithms
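A kernel [9] is a continuous, symmetric, positive-definite function $\kappa: \mathbb{U} \times \mathbb{U} \to \mathbb{R}$. $\mathbb{U}$ is the input domain, a compact subset of $\mathbb{R}^L$. Commonly used kernels include the Gaussian kernel (17) and the polynomial kernel (18):
$$\kappa(\mathbf{u}, \mathbf{u}') = \exp\left(-a\|\mathbf{u} - \mathbf{u}'\|^2\right), \tag{17}$$
$$\kappa(\mathbf{u}, \mathbf{u}') = \left(\mathbf{u}^T\mathbf{u}' + 1\right)^p. \tag{18}$$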
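As a concrete illustration of the two kernels, the following Python sketch assumes the Gaussian kernel has the form exp(-a‖u - u'‖²) with kernel parameter a and the polynomial kernel has the form (u^T u' + 1)^p. The function names and the sample points are hypothetical.

```python
import math

def gaussian_kernel(u, v, a=1.0):
    """exp(-a * ||u - v||^2); a > 0 is the kernel parameter."""
    return math.exp(-a * sum((x - y) ** 2 for x, y in zip(u, v)))

def polynomial_kernel(u, v, p=2):
    """(u^T v + 1)^p for a positive integer order p."""
    return (sum(x * y for x, y in zip(u, v)) + 1.0) ** p

def gram_matrix(points, kernel):
    """Matrix of pairwise kernel evaluations; symmetric by construction."""
    return [[kernel(u, v) for v in points] for u in points]

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
G = gram_matrix(points, gaussian_kernel)
```

The Gaussian kernel is translation invariant, so every diagonal entry of `G` equals 1 regardless of the input.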
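The Mercer theorem [9, 10] states that any kernel $\kappa(\mathbf{u}, \mathbf{u}')$ can be expanded as follows:
$$\kappa(\mathbf{u}, \mathbf{u}') = \sum_{i=1}^{\infty} \sigma_i\,\phi_i(\mathbf{u})\,\phi_i(\mathbf{u}') \tag{19}$$
where $\sigma_i$ and $\phi_i$ are the eigenvalues and the eigenfunctions, respectively. The eigenvalues are nonnegative. Therefore, a mapping $\varphi$ can be constructed as
$$\varphi: \mathbb{U} \to \mathbb{F}, \quad \varphi(\mathbf{u}) = \left[\sqrt{\sigma_1}\,\phi_1(\mathbf{u}), \sqrt{\sigma_2}\,\phi_2(\mathbf{u}), \ldots\right] \tag{20}$$
such that
$$\kappa(\mathbf{u}, \mathbf{u}') = \varphi(\mathbf{u})^T\varphi(\mathbf{u}'). \tag{21}$$
By construction, the dimensionality of $\mathbb{F}$ is determined by the number of strictly positive eigenvalues, which can be infinite in the Gaussian kernel case.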
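We utilize this theorem to transform the data $\mathbf{u}(i)$ into the feature space $\mathbb{F}$ as $\varphi(\mathbf{u}(i))$ and interpret (21) as the usual dot product. Denoting $\varphi(i) = \varphi(\mathbf{u}(i))$, we formulate the affine projection algorithms on the example sequence $\{d(1), d(2), \ldots\}$ and $\{\varphi(1), \varphi(2), \ldots\}$ to estimate the weight vector $\mathbf{w}$ that solves
$$\min_{\mathbf{w}} E\left|d - \mathbf{w}^T\varphi(\mathbf{u})\right|^2. \tag{22}$$
By straightforward manipulation, (8) becomes
$$\mathbf{w}(i) = \mathbf{w}(i-1) + \eta\,\boldsymbol{\Phi}(i)\left[\mathbf{d}(i) - \boldsymbol{\Phi}(i)^T\mathbf{w}(i-1)\right] \tag{23}$$
and (10) becomes
$$\mathbf{w}(i) = \mathbf{w}(i-1) + \eta\,\boldsymbol{\Phi}(i)\left[\boldsymbol{\Phi}(i)^T\boldsymbol{\Phi}(i) + \varepsilon\mathbf{I}\right]^{-1}\left[\mathbf{d}(i) - \boldsymbol{\Phi}(i)^T\mathbf{w}(i-1)\right] \tag{24}$$
where $\boldsymbol{\Phi}(i) = [\varphi(i-K+1), \ldots, \varphi(i)]$.
Accordingly, (14) becomes
$$\mathbf{w}(i) = (1 - \lambda\eta)\mathbf{w}(i-1) + \eta\,\boldsymbol{\Phi}(i)\left[\mathbf{d}(i) - \boldsymbol{\Phi}(i)^T\mathbf{w}(i-1)\right] \tag{25}$$
and (16) becomes
$$\mathbf{w}(i) = (1 - \eta)\mathbf{w}(i-1) + \eta\,\boldsymbol{\Phi}(i)\left[\boldsymbol{\Phi}(i)^T\boldsymbol{\Phi}(i) + \lambda\mathbf{I}\right]^{-1}\mathbf{d}(i). \tag{26}$$
For simplicity, we refer to the recursions (23), (24), (25), and (26) as KAPA-1, KAPA-2, KAPA-3, and KAPA-4, respectively.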
3.1. Kernel Affine Projection Algorithm (KAPA-1)
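It may be difficult to have direct access to the weights and the transformed data in feature space, so (23) needs to be modified. If we set the initial guess $\mathbf{w}(0) = \mathbf{0}$, the iteration of (23) will be
$$\begin{aligned} \mathbf{w}(0) &= \mathbf{0},\\ \mathbf{w}(i-1) &= \sum_{j=1}^{i-1} a_j(i-1)\varphi(j),\\ e_i(k) &= d(k) - \varphi(k)^T\mathbf{w}(i-1) = d(k) - \sum_{j=1}^{i-1} a_j(i-1)\kappa_{k,j}, \quad i-K+1 \le k \le i,\\ \mathbf{w}(i) &= \mathbf{w}(i-1) + \eta\sum_{k=i-K+1}^{i} e_i(k)\varphi(k), \end{aligned} \tag{27}$$
where $\kappa_{k,j} = \kappa(\mathbf{u}(k), \mathbf{u}(j))$ for simplicity.
Note that during the iteration, the weight vector in the feature space assumes the following expansion:
$$\mathbf{w}(i) = \sum_{j=1}^{i} a_j(i)\varphi(j), \quad \forall i > 0, \tag{28}$$
that is, the weight at time $i$ is a linear combination of the previous transformed inputs. This result may seem to be simply a restatement of the representer theorem in [11]. However, it should be emphasized that this result does not rely on any explicit minimal norm constraint as required for the representer theorem. As pointed out in [12], the gradient search in (28) has an inherent regularization mechanism which guarantees that the solution is in the data subspace under appropriate initialization. In general, the initialization $\mathbf{w}(0)$ can introduce whatever a priori information is available, which can be any linear combination of any transformed data in order to utilize the kernel trick.
By (28), the updating on the weight vector reduces to the updating on the expansion coefficients
$$a_k(i) = \begin{cases} \eta\, e_i(i), & k = i,\\ a_k(i-1) + \eta\, e_i(k), & i-K+1 \le k \le i-1,\\ a_k(i-1), & 1 \le k < i-K+1. \end{cases} \tag{29}$$
Since $e_i(k)$ is the prediction error of data $\{\mathbf{u}(k), d(k)\}$ by the network $\mathbf{w}(i-1)$, the interpretation of (29) is straightforward: allocate a new unit with coefficient $\eta\, e_i(i)$ and update the coefficients for the other $K-1$ most recent units by $\eta\, e_i(k)$ for $i-K+1 \le k \le i-1$.
The pseudocode for KAPA-1 is listed in Algorithm 1.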
Algorithm 1: Kernel affine projection algorithm (KAPA-1).
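The KAPA-1 coefficient recursion described in this subsection can be sketched in Python as follows. This is a minimal illustration, not the authors' listing: it assumes a Gaussian kernel of the form exp(-a‖u - u'‖²), keeps every input as a center (no sparsification), and evaluates all errors with the coefficients from the previous step before updating. The names `kapa1_train` and `predict`, the sine target, and the values of `eta` and `K` are hypothetical.

```python
import math
import random

def gaussian_kernel(u, v, a=1.0):
    return math.exp(-a * sum((x - y) ** 2 for x, y in zip(u, v)))

def kapa1_train(inputs, desired, eta, K, kernel=gaussian_kernel):
    """Return the expansion coefficients a_j, one per training input (center)."""
    coeffs = []
    for i in range(len(inputs)):
        lo = max(0, i - K + 1)
        # Errors of the most recent samples under the network from the previous step.
        errors = []
        for k in range(lo, i + 1):
            y = sum(a * kernel(inputs[j], inputs[k]) for j, a in enumerate(coeffs))
            errors.append(desired[k] - y)
        # Allocate a new unit for input i, then adapt the most recent units.
        coeffs.append(0.0)
        for e, k in zip(errors, range(lo, i + 1)):
            coeffs[k] += eta * e
    return coeffs

def predict(x, centers, coeffs, kernel=gaussian_kernel):
    return sum(a * kernel(c, x) for c, a in zip(centers, coeffs))

# Hypothetical regression task: learn sin(x) on [-3, 3].
rng = random.Random(1)
X = [[rng.uniform(-3, 3)] for _ in range(300)]
D = [math.sin(x[0]) for x in X]
A = kapa1_train(X, D, eta=0.2, K=5)
X_test = [[-3 + 6 * t / 49] for t in range(50)]
mse = sum((math.sin(x[0]) - predict(x, X, A)) ** 2 for x in X_test) / len(X_test)
baseline = sum(math.sin(x[0]) ** 2 for x in X_test) / len(X_test)
```

Setting `K=1` in this sketch leaves only the new-unit allocation, which is the KLMS special case discussed in Section 4. The direct error evaluation costs about K times as many kernel evaluations as KLMS per iteration.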
3.2. Normalized KAPA (KAPA-2)
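Similarly, the regularized Newton's recursion (24) can be factorized into the following steps:
$$\begin{aligned} \mathbf{w}(i-1) &= \sum_{j=1}^{i-1} a_j(i-1)\varphi(j),\\ \mathbf{e}(i) &= \mathbf{d}(i) - \boldsymbol{\Phi}(i)^T\mathbf{w}(i-1),\\ \mathbf{G}(i) &= \boldsymbol{\Phi}(i)^T\boldsymbol{\Phi}(i),\\ \mathbf{w}(i) &= \mathbf{w}(i-1) + \eta\,\boldsymbol{\Phi}(i)\left[\mathbf{G}(i) + \varepsilon\mathbf{I}\right]^{-1}\mathbf{e}(i). \end{aligned} \tag{30}$$
In practice, we do not have access to the transformed weight $\mathbf{w}$ or any transformed data, so the update has to be on the expansion coefficients $\mathbf{a}$, as in KAPA-1. The whole recursion is similar to KAPA-1 except that the error is normalized by a $K \times K$ matrix $[\mathbf{G}(i) + \varepsilon\mathbf{I}]^{-1}$.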
3.3. Leaky KAPA (KAPA-3)
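The feature space may be infinite dimensional depending on the chosen kernel, which may cause the cost function (22) to be ill posed in the conventional empirical risk minimization (ERM) sense [13]. The common practice is to constrain the solution norm:
$$\min_{\mathbf{w}} E\left|d - \mathbf{w}^T\varphi(\mathbf{u})\right|^2 + \lambda\|\mathbf{w}\|^2. \tag{31}$$
As we have already shown in (25), the leaky KAPA is
$$\mathbf{w}(i) = (1 - \lambda\eta)\mathbf{w}(i-1) + \eta\,\boldsymbol{\Phi}(i)\left[\mathbf{d}(i) - \boldsymbol{\Phi}(i)^T\mathbf{w}(i-1)\right]. \tag{32}$$
Again, the iteration will be on the expansion coefficients $\mathbf{a}$, which is similar to KAPA-1:
$$a_k(i) = \begin{cases} \eta\left(d(i) - \sum_{j=1}^{i-1} a_j(i-1)\kappa_{i,j}\right), & k = i,\\ (1 - \lambda\eta)\,a_k(i-1) + \eta\left(d(k) - \sum_{j=1}^{i-1} a_j(i-1)\kappa_{k,j}\right), & i-K+1 \le k \le i-1,\\ (1 - \lambda\eta)\,a_k(i-1), & 1 \le k < i-K+1. \end{cases} \tag{33}$$
The only difference is that KAPA-3 has a scaling factor $(1 - \lambda\eta)$ multiplying the previous weight, which is usually less than 1, and it imposes a forgetting mechanism so that the training data in the far past are scaled down exponentially. Furthermore, since the network size grows during training, any transformed data can easily be pruned from the expansion if its coefficient is smaller than some prespecified threshold. For large data sets, the growing nature of this family of algorithms poses a big problem for implementations; therefore, network size control is very important. We will discuss this issue further in the sparsification section.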
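The forgetting mechanism of the leaky update can be made concrete with a small Python sketch of one coefficient step. It assumes the errors of the most recent samples have already been computed with the previous network and that every old coefficient is scaled by (1 - ηλ) before the error correction is added. The function name `kapa3_update` and all numbers are hypothetical.

```python
def kapa3_update(coeffs, errors, eta, lam):
    """One leaky (KAPA-3 style) step on the expansion coefficients.

    coeffs: coefficients a_1..a_{i-1} of the previous network.
    errors: prediction errors of the most recent samples, oldest first;
            the last entry belongs to the newly arrived sample.
    """
    scale = 1.0 - eta * lam
    new = [scale * a for a in coeffs]   # forgetting applies to every old unit
    new.append(0.0)                      # new unit for the present input
    K = len(errors)
    for offset, e in enumerate(errors):
        new[len(new) - K + offset] += eta * e
    return new

# Two old units, window of two samples (one old, one new).
updated = kapa3_update([1.0, 2.0], [0.5, -1.0], eta=0.1, lam=1.0)
```

A unit that has left the window is only ever multiplied by the scale factor, so after n further steps its coefficient has shrunk by (1 - ηλ)^n, which is what makes threshold-based pruning natural here.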
3.4. Leaky KAPA with Newton's Recursion (KAPA-4)
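As before, the KAPA-4 (26) reduces to
$$a_k(i) = \begin{cases} \eta\,\tilde{d}_i(i), & k = i,\\ (1 - \eta)\,a_k(i-1) + \eta\,\tilde{d}_i(k), & i-K+1 \le k \le i-1,\\ (1 - \eta)\,a_k(i-1), & 1 \le k < i-K+1, \end{cases} \tag{34}$$
where $\tilde{\mathbf{d}}(i) = [\mathbf{G}(i) + \lambda\mathbf{I}]^{-1}\mathbf{d}(i)$ and $\tilde{d}_i(k)$ denotes its entry associated with sample $k$. Among these four algorithms, the first three require the error information to update the network, which is computationally expensive. Therefore, the different update rule in KAPA-4 has a huge significance in terms of computation since it only needs a $K \times K$ matrix inversion, which, by using the sliding-window trick, only requires $O(K^2)$ operations [14].
We summarize the four KAPA update equations in Table 2 for convenience.
Table 2: Comparison of four KAPA update rules.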
4. A Taxonomy for Related Algorithms
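4.1. Kernel Least-Mean-Square Algorithm (KAPA-1, $K = 1$)
If $K = 1$, KAPA-1 reduces to the following kernel least-mean-square algorithm (KLMS) introduced in [6]:
$$\mathbf{w}(i) = \mathbf{w}(i-1) + \eta\,\varphi(i)\left[d(i) - \varphi(i)^T\mathbf{w}(i-1)\right]. \tag{35}$$
It is not difficult to verify that the weight vector assumes the following expansion:
$$\mathbf{w}(i) = \eta\sum_{j=1}^{i} e(j)\varphi(j) \tag{36}$$
where $e(j) = d(j) - \mathbf{w}(j-1)^T\varphi(j)$ is the a priori error.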
It is seen that the KLMS allocates a new unit when a new training datum comes in, with the input $\mathbf{u}(i)$ as the center and the prediction error as the coefficient (scaled by the step size). In other words, once the unit is allocated, the coefficient is fixed. It mimics the resource-allocating step in the RAN algorithm whereas it neglects the adaptation step. In this sense, the KAPA algorithms, which allocate a new unit for the present input and also adapt the other $K-1$ most recently allocated units, are closer to the original RAN.
The normalized version of the KLMS (NKLMS) is as follows:
$$\mathbf{w}(i) = \mathbf{w}(i-1) + \frac{\eta\,\varphi(i)\left[d(i) - \varphi(i)^T\mathbf{w}(i-1)\right]}{\varepsilon + \kappa_{i,i}}. \tag{37}$$
Notice that for translation invariant kernels, that is, $\kappa_{i,i} = \text{const}$, the KLMS is automatically normalized. Sometimes we use KLMS-1 and KLMS-2 to distinguish the two.
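4.2. Norma (KAPA-3, $K = 1$)
Similarly, the KAPA-3 (25) reduces to the Norma algorithm introduced by Kivinen in [15]:
$$\mathbf{w}(i) = (1 - \eta\lambda)\mathbf{w}(i-1) + \eta\,\varphi(i)\left[d(i) - \varphi(i)^T\mathbf{w}(i-1)\right]. \tag{38}$$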
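4.3. Kernel Adaline (KAPA-1, $K = N$)
Assume that the size of the training data is finite, $N$. If we set $K = N$, then the update rule of the KAPA-1 becomes
$$\mathbf{w}(i) = \mathbf{w}(i-1) + \eta\,\boldsymbol{\Phi}\left[\mathbf{d} - \boldsymbol{\Phi}^T\mathbf{w}(i-1)\right] \tag{39}$$
where the full data matrices are
$$\boldsymbol{\Phi} = [\varphi(1), \ldots, \varphi(N)], \quad \mathbf{d} = [d(1), \ldots, d(N)]^T. \tag{40}$$
It is easy to check that the weight vector also assumes the following expansion:
$$\mathbf{w}(i) = \sum_{j=1}^{N} a_j(i)\varphi(j) \tag{41}$$
and the updating on the expansion coefficients is
$$a_j(i) = a_j(i-1) + \eta\left[d(j) - \varphi(j)^T\mathbf{w}(i-1)\right]. \tag{42}$$
This is nothing but the kernel adaline introduced in [5]. Notice that the kernel adaline is not an online method.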
4.4. Recursively Adapted Radial Basis Function Networks (KAPA-3, $\eta\lambda = 1$, $K = N$)
Assume the size of the training data is $N$ as above. If we set $\eta\lambda = 1$ and $K = N$, the update rule of KAPA-3 becomes
$$\mathbf{w}(i) = \eta\,\boldsymbol{\Phi}\left[\mathbf{d} - \boldsymbol{\Phi}^T\mathbf{w}(i-1)\right] \tag{43}$$
which is the recursively adapted RBF (RA-RBF) network introduced in [16]. This is a very intriguing algorithm that uses the “global” error directly to compose the new network. By contrast, the KLMS-1 uses the a priori errors to compose the network.
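4.5. Sliding-Window Kernel RLS (KAPA-4, $\eta = 1$)
In KAPA-4, if we set $\eta = 1$, we have
$$\mathbf{w}(i) = \boldsymbol{\Phi}(i)\left[\boldsymbol{\Phi}(i)^T\boldsymbol{\Phi}(i) + \lambda\mathbf{I}\right]^{-1}\mathbf{d}(i) \tag{44}$$
which is the sliding-window kernel RLS (SW-KRLS) introduced in [14]. The cost of inverting the sliding-window Gram matrix can be reduced to $O(K^2)$.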
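4.6. Regularization Networks (KAPA-4, $\eta = 1$, $K = N$)
We assume there are only $N$ training data and $K = N$. With $\eta = 1$, equation (26) becomes directly
$$\mathbf{w} = \boldsymbol{\Phi}\left[\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda\mathbf{I}\right]^{-1}\mathbf{d} \tag{45}$$
which is the regularization network (RegNet) [13].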
We summarize all the related algorithms in Table 3
for convenience.
Table 3: List of related algorithms.
5. KAPA Implementation
In this section, we will discuss the implementation of the KAPA algorithms in detail.
5.1. Error Reusing
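As we see in KAPA-1, KAPA-2, and KAPA-3, the most time-consuming part of the computation is obtaining the error information. For example, suppose $\mathbf{w}(i-1) = \sum_{j=1}^{i-1} a_j(i-1)\varphi(j)$. We need to calculate $e_i(k) = d(k) - \mathbf{w}(i-1)^T\varphi(k)$ ($i-K+1 \le k \le i$) to compute $\mathbf{w}(i)$, which consists of $(i-1)K$ kernel evaluations. As $i$ increases, this dominates the computation time. In this sense, the computational complexity of the KAPA is $K$ times that of the KLMS. However, after a careful manipulation, we can shrink the complexity gap between the KAPA and the KLMS.
Assume that we store all the errors $e_i(k)$ for $i-K+1 \le k \le i$ from the previous iteration. At the present iteration, we have, for the samples that remain in the window,
$$e_{i+1}(k) = d(k) - \varphi(k)^T\mathbf{w}(i) = d(k) - \varphi(k)^T\left[\mathbf{w}(i-1) + \eta\sum_{j=i-K+1}^{i} e_i(j)\varphi(j)\right] = e_i(k) - \eta\sum_{j=i-K+1}^{i} e_i(j)\kappa_{k,j}. \tag{46}$$
Since $e_{i+1}(i+1)$ has not been computed yet, we have to calculate $e_{i+1}(i+1)$ by $i$ kernel evaluations anyway. Overall, the computational complexity of the KAPA-1 is $O(i + K^2)$, which is $O(K^2)$ more than the KLMS.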
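The error-reuse idea can be checked numerically with a small Python sketch. It assumes the KAPA-1 update adds η times the windowed errors to the K most recent coefficients, so that the error of a sample staying in the window changes by -η Σ_j e(j) κ(x_j, x_k), with j running over the window. The scalar Gaussian kernel, the data, and the starting coefficients are hypothetical.

```python
import math

def kernel(u, v):
    return math.exp(-(u - v) ** 2)

def direct_error(k, coeffs, xs, ds):
    """e(k) = d(k) - sum_j a_j kappa(x_j, x_k), evaluated from scratch."""
    return ds[k] - sum(a * kernel(xs[j], xs[k]) for j, a in enumerate(coeffs))

xs = [0.1, 0.7, -0.4, 1.2, 0.3]
ds = [math.sin(3 * x) for x in xs]
eta, K = 0.3, 3

# Network after some earlier training: 4 units with arbitrary coefficients.
coeffs = [0.2, -0.1, 0.05, 0.3]
i = 4                                    # index of the newly arrived sample
window = list(range(i - K + 1, i + 1))   # samples 2, 3, 4
e_old = {k: direct_error(k, coeffs, xs, ds) for k in window}

# KAPA-1 update: new unit plus correction of the K most recent coefficients.
new_coeffs = coeffs + [0.0]
for k in window:
    new_coeffs[k] += eta * e_old[k]

# Reused errors for the samples that stay in the next window: K kernel values each.
reused = {k: e_old[k] - eta * sum(e_old[j] * kernel(xs[j], xs[k]) for j in window)
          for k in window[1:]}
direct = {k: direct_error(k, new_coeffs, xs, ds) for k in window[1:]}
```

The reused errors need K kernel evaluations per sample rather than one per unit in the network, which is where the saving over direct evaluation comes from.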
5.2. Sliding-Window Gram Matrix Inversion
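In KAPA-2 and KAPA-4, another computational difficulty is inverting a $K \times K$ matrix, which normally requires $O(K^3)$ operations. However, in the KAPA, the data matrix $\boldsymbol{\Phi}(i)$ has a sliding-window structure; therefore, a trick can be used to speed up the computation. The trick is based on the matrix inversion formula and was introduced in [14]. We outline the basic calculation steps here. Suppose the sliding matrices share the same sub-matrix $\mathbf{D}$:
$$\mathbf{G}(i-1) + \lambda\mathbf{I} = \begin{bmatrix} a & \mathbf{b}^T\\ \mathbf{b} & \mathbf{D} \end{bmatrix}, \quad \mathbf{G}(i) + \lambda\mathbf{I} = \begin{bmatrix} \mathbf{D} & \mathbf{h}\\ \mathbf{h}^T & g \end{bmatrix} \tag{47}$$
and we know from the previous iteration that
$$\left(\mathbf{G}(i-1) + \lambda\mathbf{I}\right)^{-1} = \begin{bmatrix} e & \mathbf{f}^T\\ \mathbf{f} & \mathbf{H} \end{bmatrix}. \tag{48}$$
First, we need to calculate the inverse of $\mathbf{D}$ as
$$\mathbf{D}^{-1} = \mathbf{H} - \mathbf{f}\mathbf{f}^T/e. \tag{49}$$
Then, we can update the inverse of the new Gram matrix as
$$\left(\mathbf{G}(i) + \lambda\mathbf{I}\right)^{-1} = \begin{bmatrix} \mathbf{D}^{-1} + (\mathbf{D}^{-1}\mathbf{h})(\mathbf{D}^{-1}\mathbf{h})^T s^{-1} & -(\mathbf{D}^{-1}\mathbf{h})\,s^{-1}\\ -(\mathbf{D}^{-1}\mathbf{h})^T s^{-1} & s^{-1} \end{bmatrix} \tag{50}$$
with $s = g - \mathbf{h}^T\mathbf{D}^{-1}\mathbf{h}$. $s$ is the Schur complement of $\mathbf{D}$ in $\mathbf{G}(i) + \lambda\mathbf{I}$, which actually measures the distance of the new data to the other $K-1$ most recent data in the feature space. The overall complexity is $O(K^2)$.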
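The sliding-window inversion can be sketched in pure Python. The sketch assumes the standard block-inverse identities: if the previous inverse is partitioned as [[e, f^T], [f, H]], the inverse of the shared sub-matrix is H - f f^T / e, and appending a new row and column uses the Schur complement s = g - h^T D^{-1} h. The function names, the scalar Gaussian kernel, the sample points, and the regularization value are hypothetical.

```python
import math

def shrink_inverse(prev_inv):
    """From the inverse of [[a, b^T], [b, D]], return D^{-1} (drops the oldest sample)."""
    e = prev_inv[0][0]
    f = [row[0] for row in prev_inv[1:]]
    n = len(f)
    return [[prev_inv[r + 1][c + 1] - f[r] * f[c] / e for c in range(n)]
            for r in range(n)]

def grow_inverse(D_inv, h, g):
    """Return the inverse of [[D, h], [h^T, g]] from D^{-1} (appends the newest sample)."""
    n = len(h)
    z = [sum(D_inv[r][c] * h[c] for c in range(n)) for r in range(n)]  # D^{-1} h
    s = g - sum(h[r] * z[r] for r in range(n))                        # Schur complement
    top = [[D_inv[r][c] + z[r] * z[c] / s for c in range(n)] + [-z[r] / s]
           for r in range(n)]
    return top + [[-z[c] / s for c in range(n)] + [1.0 / s]]

def slide_inverse(prev_inv, h, g):
    return grow_inverse(shrink_inverse(prev_inv), h, g)

def kappa(u, v):
    return math.exp(-(u - v) ** 2)

lam = 0.1
xs = [0.0, 0.5, 1.3, 2.0]

# Inverse for the first window (samples 0..2), grown from the 1x1 case.
inv = [[1.0 / (kappa(xs[0], xs[0]) + lam)]]
for k in (1, 2):
    inv = grow_inverse(inv, [kappa(xs[j], xs[k]) for j in range(k)],
                       kappa(xs[k], xs[k]) + lam)

# Slide the window to samples 1..3 and verify against the regularized Gram matrix.
new_inv = slide_inverse(inv, [kappa(xs[j], xs[3]) for j in (1, 2)],
                        kappa(xs[3], xs[3]) + lam)
pts = xs[1:4]
M = [[kappa(u, v) + (lam if r == c else 0.0) for c, v in enumerate(pts)]
     for r, u in enumerate(pts)]
product = [[sum(M[r][k] * new_inv[k][c] for k in range(3)) for c in range(3)]
           for r in range(3)]
```

Both helper functions touch each matrix entry a constant number of times, so one slide costs a number of operations quadratic in the window length.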
5.3. Sparsification
A sparse model is desired because it reduces the complexity in terms of computation and memory, and it usually yields better generalization [3]. On the other hand, in the context of adaptive filtering, training data may be available only sequentially, that is, one at a time. As we see in the formulation of KAPA, the network size increases linearly with the number of training data, which may pose a big problem for applying the KAPA algorithms online. The sparse model idea is inspired by Vapnik's support vector machines. It is also introduced in [7] with the novelty criterion and extensively studied in [3] under approximate linear dependency (ALD). There are many other ways to achieve sparseness that require the creation of a basis dictionary and storage of the corresponding coefficients. Suppose the present dictionary is $D(i) = \{\mathbf{c}_j\}_{j=1}^{m_i}$, where $\mathbf{c}_j$ is the $j$th center and $m_i$ is the cardinality. When a new data pair $\{\mathbf{u}(i+1), d(i+1)\}$ is presented, a decision is made immediately whether $\mathbf{u}(i+1)$ should be added into the dictionary as a center.
The novelty criterion introduced by Platt is relatively simple. First, it calculates the distance of $\mathbf{u}(i+1)$ to the present dictionary, $\mathrm{dis}_1 = \min_{\mathbf{c}_j \in D(i)} \|\mathbf{u}(i+1) - \mathbf{c}_j\|$. If it is smaller than some preset threshold, say $\delta_1$, $\mathbf{u}(i+1)$ will not be added into the dictionary. Otherwise, the method computes the prediction error $e_{i+1}(i+1)$. Only if the prediction error is larger than another preset threshold, say $\delta_2$, will $\mathbf{u}(i+1)$ be accepted as a new center.
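The two-stage novelty test can be sketched in Python as follows. The sketch assumes Euclidean distance to the nearest center for the first stage and the absolute prediction error for the second; the names `is_novel`, `delta1`, `delta2`, the zero predictor, and the threshold values are hypothetical.

```python
import math

def is_novel(u_new, d_new, dictionary, predict, delta1, delta2):
    """Novelty test sketch: accept only inputs that are far from every
    center AND poorly predicted by the current network."""
    if dictionary:
        dist = min(math.sqrt(sum((x - c) ** 2 for x, c in zip(u_new, center)))
                   for center in dictionary)
        if dist < delta1:
            return False              # too close to an existing center
    error = d_new - predict(u_new)
    return abs(error) > delta2        # accept only if the prediction error is large

centers = [[0.0, 0.0], [1.0, 1.0]]
zero_net = lambda u: 0.0              # stand-in for the current network output

near = is_novel([0.05, 0.0], 1.0, centers, zero_net, delta1=0.1, delta2=0.05)
far_bad_fit = is_novel([2.0, 2.0], 1.0, centers, zero_net, delta1=0.1, delta2=0.05)
far_good_fit = is_novel([2.0, 2.0], 0.01, centers, zero_net, delta1=0.1, delta2=0.05)
```

The distance test runs first because it needs no network evaluation, so inputs close to an existing center are rejected cheaply.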
The ALD test introduced in [3] is more computationally involved. It tests the following cost
$$\delta_3 = \min_{\mathbf{b}} \left\|\varphi(\mathbf{u}(i+1)) - \sum_{\mathbf{c}_j \in D(i)} b_j\,\varphi(\mathbf{c}_j)\right\|$$
which indicates the distance of the new input to the linear span of the present dictionary in the feature space. It turns out that $\delta_3^2$ is the Schur complement of the Gram matrix of the present dictionary. As we saw in the previous section, this result can be used to get the new Gram matrix inverse if $\mathbf{u}(i+1)$ is accepted into the dictionary. Therefore, this method is more suitable for the KAPA-2 and KAPA-4 for efficiency reasons. This link is very interesting since it reveals that the ALD test actually guarantees the invertibility of the new Gram matrix.
In the sparse model, if the new data is determined to be “novel,” the $K-1$ most recent data points in the dictionary are used to form the data matrix $\boldsymbol{\Phi}(i)$ together with the new data. Therefore, a new unit is allocated and the update is on the $K-1$ most recent units in the dictionary. If the new data is determined to be not “novel,” it is simply discarded in this paper, but a different strategy can be employed to utilize the information, as in [3, 7].
The important consequences of the sparsification
procedure are as follows.
(1) If the input domain $\mathbb{U}$ is a compact set, the cardinality of the dictionary is always finite and upper bounded. This statement is not hard to prove using the finite covering theorem of the compact set and the fact that elements in the dictionary are $\delta_1$-separable [3]. Here is the brief idea: suppose spheres with diameter $\delta_1$ are used to cover $\mathbb{U}$ and the optimal covering number is $N_c$. Then, because no two centers in the dictionary can be in the same sphere, the total number of centers will be no greater than $N_c$ regardless of the distribution and temporal structure of $\mathbf{u}$. Of course, this is a worst-case upper bound. In the case of finite training data, the network size will be finite anyway. This is
true in applications like channel equalization, where the training sequence is
part of each transmission frame. In a stationary environment, the network
converges quickly and the threshold on prediction errors plays its part to
constrain the network size. We will validate this claim in the simulation
section. In a nonstationary environment, more sophisticated pruning methods
should be used to constrain the network size. Simple strategies include pruning
the oldest unit in the dictionary [14], pruning randomly [17], and pruning the unit with the least coefficient or
similar [18, 19]. Another alternative
approach is to solve the problem in the primal space [20, 21] directly by using the low
rank approximation methods such as Nyström method [22], incomplete Cholesky
factorization [23],
and kernel principal component analysis [2]. It should be pointed out that the scalability issue
is at the core of the kernel methods and so all the kernel methods need to deal
with it in one way or the other. Indeed, the sequential nature of the KAPA
enables active learning [24, 25] on huge data sets which is impossible in batch mode
algorithms like regularization networks. The discussion on active learning with
the KAPA is out of the scope of this paper and will be part of the future work.
(2) Based on (1), we can prove that the solution norms
of KLMS-1 and KAPA-1 are upper bounded [12].
The significance of (1) is of practical interest
because it states that the system complexity is controlled by the novelty
criterion parameters, and designers can estimate a worst case upper bound. The
significance of (2) is of theoretical interest because it guarantees the
well-posedness of the algorithms. The well-posedness of the KAPA-3 and KAPA-4
is mostly ensured by the regularization term, see [13, 14] for details.
6. Simulations
6.1. Time Series Prediction
The first example is the short-term prediction of the Mackey-Glass (MG) chaotic time series [26, 27]. It is generated from the
following time delay ordinary differential equation:
(51)with
,
, and
. The time series is discretized at a sampling period
of 6 seconds. The time embedding is 7, that is, $\mathbf{u}(i) = [x(i-7), x(i-6), \ldots, x(i-1)]^T$ are used as the input to predict the present one $x(i)$, which is the desired response here. A segment of 500 samples is used as the training data
and another 100 points as the test data (in the testing phase, the filter is
fixed). All the data is corrupted by Gaussian noise with zero mean and 0.001
variance.
We compare the prediction performance of KLMS, KAPA-1,
KAPA-2, KRLS, and a linear combiner trained with LMS. A Gaussian kernel with
kernel parameter
in (17) is
chosen for all the kernel-based algorithms. One hundred Monte Carlo simulations
are run with different realizations of noise. The results are summarized in
Table 4. Figure 1 is the learning curves for the LMS, KLMS-1, KAPA-1, KAPA-2 (
), and
KRLS, respectively. As expected, the KAPA outperforms the KLMS.
Table 4: Performance comparison in MG time series prediction.
Figure 1: The learning curves of the LMS, KLMS, KAPA-1, KAPA-2, SW-KRLS, and KRLS.
As we can see in Table 4, the performance of the KAPA-2 is substantially better than the KLMS. All the results in the tables are in the form of “average ± standard deviation.” Table 5 summarizes the computational complexity of these
algorithms. The KLMS and KAPA effectively reduce the computational complexity
and memory storage when compared with the KRLS. KAPA-3 and sliding-window KRLS
are also tested on this problem. It is observed that the performance of the
KAPA-3 is similar to KAPA-1 when the forgetting term is very close to 1 as
expected, and the results are severely biased when the forgetting term is
reduced further. The reason can be found in [12]. The performance of the sliding-window KRLS is
included in Figure 1 and Table 4 with
. It is observed that KAPA-4 (including the
sliding-window KRLS) does not perform well with small
(
).
Table 5: Complexity comparison at iteration $i$.
Next, we test how the novelty criterion affects the
performance. A segment of 1000 samples is used as the training data and another
100 as the test data. All the data is corrupted by Gaussian noise with zero
mean and 0.001 variance. The thresholds in the novelty criterion are set as
and
. The learning curves are shown in Figure 2 and the
results are summarized in Table 6. It is seen that the complexity can be
reduced dramatically with the novelty criterion with slight performance
degeneration. Here, SKLMS and SKAPA denote the sparse KLMS and the sparse KAPA,
respectively.
Table 6: Performance comparison in MG time series prediction on novelty criterion.
Figure 2: The learning curves of the KLMS-1, KAPA-1, and KAPA-2 with and without sparsification.
Several comments follow. Although formally adaptive filters, these algorithms can be viewed as efficient alternatives to batch-mode RBF networks; therefore, it is practical to freeze their weights during the test phase. Moreover, when comparing with other nonlinear filters such as RBFs, we divide the data into training and testing sets, as is normally done in neural networks. Of course, it is also feasible to use the a priori prediction error as a performance indicator, as in the conventional adaptive filtering literature.
6.2. Noise Cancellation
Another important problem in signal processing is noise cancellation, in which an unknown interference has to be removed based on some reference measurement. The basic structure of a noise cancellation system is shown in Figure 3. The primary signal is $s(i)$ and its noisy measurement $d(i)$ acts as the desired signal of the system. $n(i)$ is a white noise process which is unknown, and $u(i)$ is its reference measurement, that is, a distorted version of the noise process through some distortion function, which is unknown in general. Here, $u(i)$ is the input of the adaptive filter. The objective is to use $u(i)$ as the input to the filter and to obtain, as the filter output, an estimate of the noise source $n(i)$. Therefore, the noise can be subtracted from $d(i)$ to improve the signal-to-noise ratio.
Figure 3: The basic structure of the noise cancellation system.
In this example, the noise source is assumed white,
uniformly distributed between
. The interference distortion function is assumed to
be
(52)
As we see, the distortion function has an infinite impulse response, which means it is impossible to recover $n(i)$ from a finite time delay embedding of $u(i)$. We rewrite the distortion function as
(53)Therefore, the present value of
the noise source
depends not
only on the reference noise measure
, but also on the previous value
, which in turn depends on
, and so on. It means we need a very long time embedding (infinitely long, theoretically) in order to recover $n(i)$ accurately. However, the recursive nature of the adaptive system provides a feasible alternative, that is, we feed back the output of the filter
, which is the estimate of
, to estimate the present one, pretending
is the true
value of
. Therefore, the input of the adaptive filter can be
in the form of
. It can be seen that the system is inherently
recurrent. In the linear case with a DARMA model, it is studied under output
error methods [28].
However, it will be nontrivial to generalize the results concerning convergence
and stability to nonlinear cases, and we will address it in the future work.
We assume the primary signal
during the
training phase. And the system simply tries to reconstruct the noise source
from the reference measure. We use a linear filter trained with the normalized
LMS, two nonlinear filters trained with the SKLMS-1, and the SKAPA-2 (
),
respectively. 2000 training samples are used and 400 Monte Carlo simulations
are run to get the ensemble learning curves as shown in Figure 4. The step size
and regularization parameter for the NLMS are 0.2
and 0.005. The step sizes for SKLMS-1 and SKAPA-2
are 0.5 and 0.2, respectively. The Gaussian kernel
is used for both KLMS and KAPA with kernel parameter
. The tolerance parameters for KLMS and KAPA are
and
, and the noise reduction factor (NR), which is
defined as
, is listed in Table 7. The performance improvement
of SKAPA-2 is obvious when compared with SKLMS-1.
Table 7: Noise reduction comparison in noise cancellation.
Figure 4: Ensemble learning curves of NLMS, SKLMS-1, and SKAPA-2 in noise cancellation.
6.3. Nonlinear Channel Equalization
In this example, we consider a nonlinear channel equalization problem, where the nonlinear channel is modeled by a nonlinear Wiener model. The nonlinear Wiener
model consists of a serial connection of a linear filter and a memoryless
nonlinearity (See Figure 5). This kind of model has been used to model digital
satellite communication channels [29] and digital magnetic recording channels [30].
Figure 5: Basic structure of the
nonlinear channel.
The problem setting is as follows: a binary signal
is fed into the
nonlinear channel. At the receiver end of the channel, the signal is further
corrupted by additive i.i.d. Gaussian noise and is then observed as
. The aim of channel equalization (CE) is to construct
an inverse filter that reproduces the original signal with as low an
error rate as possible. It is easy to formulate CE as a regression problem,
with input-output examples
, where
is the time
embedding length, and
is the
equalization time lag.
In this experiment, the nonlinear channel model is
defined by
,
, where
is the white
Gaussian noise with a variance of
. We compare the performance of the LMS1, the APA1,
the SKLMS1, the SKAPA1 (
), and
the SKAPA2 (
). The
Gaussian kernel with
is used in the
SKLMS and SKAPA selected with cross validation.
and
in the
equalizer. The noise variance is fixed here
. The learning curve is plotted in Figure 6. The MSE
is calculated between the continuous output (before taking the hard decision)
and the desired signal. For the SKLMS1, SKAPA1, and SKAPA2, the novelty
criterion is employed with
,
. The incremental growth of the network is also
plotted in Figure 7 over the training. It can be seen that at the beginning,
the network sizes increase quickly, but after convergence, the network sizes
increase slowly. And in fact, we can stop adding new centers after convergence
by cross-validation by noticing that the MSE does not change after convergence.
Figure 6: The learning curves of the LMS1, APA1, SKLMS1, SKAPA1, and SKAPA2 in the nonlinear channel equalization.
Figure 7: Network size over training in the nonlinear channel equalization.
Next, different noise variances are set. To make the
comparison fair, we tune the novelty criterion parameters to make the network
size almost the same (around 100) in each scenario by cross validation. For
each setting, 20 Monte Carlo simulations are run with different training data
and different testing data. The size of the training data is 1000 and the size
of the testing data is
. The filters are fixed during the testing phase. The
results are presented in Figure 8. The normalized signal-to-noise ratio (SNR) is
defined as
. It is clearly shown that the SKAPA-2 outperforms the
SKLMS-1 substantially in terms of the bit error rate (BER). The linear methods
never really work in this simulation regardless of the SNR. The improvement of
the SKAPA-1 on the SKLMS-1 is marginal but it exhibits a smaller variance. The
variability in the curves is mostly due to the variance from the stochastic
training.
Figure 8: Performance comparison with different SNR in the nonlinear channel
equalization.
In the last simulation, we test the tracking ability of the proposed methods by introducing an abrupt change during training. The training set contains 1500 samples. For the first 500 samples, the channel model is kept the same as before, but for the last 1000 samples, the nonlinearity of the channel is switched to
. The ensemble learning curves from 100 Monte Carlo
simulations are plotted in Figure 9, and the dynamic change of the network size
is plotted in Figure 10. It is seen that the SKAPA-2 outperforms other methods
with its fast tracking speed. It is also noted that the network sizes increase
right after the change to the channel model.
Figure 9: Ensemble learning curves in the nonlinear channel equalization with an abrupt change at iteration 500.
Figure 10: Network size over training in the nonlinear channel equalization with an abrupt change at iteration 500.
7. Discussion and Conclusion
This paper proposes the KAPA algorithm family, which is intrinsically a stochastic gradient methodology to solve the least-squares problem in RKHS. It is a follow-up study of the recently introduced KLMS. Since the KAPA update equation can be written in terms of inner products, KAPA can be efficiently computed in the input space. The good approximation ability of the KAPA stems from the fact that the transformed data $\varphi(\mathbf{u})$ includes possibly infinitely many different features of the original data. In the framework of stochastic projection, the space spanned by $\varphi(\mathbf{u})$ is so large that the projection error of the desired signal could be very small [31], as is well known from Cover's theorem [32].
This capability includes modeling of nonlinear systems, which is the main
reason why the KAPA can achieve good performance in the Mackey-Glass system
prediction, adaptive noise cancellation, and nonlinear channel equalization.
Compared with the KLMS, KRLS, and regularization networks (batch-mode training), KAPA gives yet another way of calculating the coefficients for shallow RBF-like neural networks. The performance of the KAPA lies somewhere between that of the KLMS and the KRLS, as determined by the window length $K$. Therefore, it not only provides a further theoretical understanding of RBF-like neural networks, but it also brings much flexibility for application design under constraints on performance and computational resources.
Three examples are studied in the paper, namely, time
series prediction, nonlinear channel equalization, and nonlinear noise
cancellation. In all examples, the KAPA demonstrates superior performance when
compared with the KLMS, which is expected from the classic adaptive filtering
theory.
As pointed out, the study of the KLMS and KAPA has a close relation with the resource-allocating networks, but in the framework of RKHS, any Mercer kernel can be used instead of restricting the architecture to the Gaussian kernel. An important avenue for further research is how to choose the optimal kernel for a specific problem. A lot of work [33–35] has been done in the context of classical machine learning, which is usually derived in a strict optimization manner. Notice that with stochastic gradient methods, the solution obtained is not strictly the optimal solution; therefore, further investigation is warranted. As we mentioned before, how to control the network size is still a big issue, which needs further study.
Acknowledgment
This work was partially supported by NSF, Grant no. ECS-0601271.
References
- V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
- B. Schölkopf, A. J. Smola, and K.-R. Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation, vol. 10, no. 5, p. 1299, 1998.
- Y. Engel, S. Mannor, and R. Meir, “The kernel recursive least-squares algorithm,” IEEE Transactions on Signal Processing, vol. 52, no. 8, p. 2275, 2004.
- K. I. Kim, M. O. Franz, and B. Schölkopf, “Iterative kernel principal component analysis for image modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 9, p. 1351, 2005.
- T.-T. Frieß and R. F. Harrison, “A kernel-based adaline,” in Proceedings of the 7th European Symposium on Artificial Neural Networks (ESANN '99), p. 245, Bruges, Belgium, April 1999.
- P. P. Pokharel, W. Liu, and J. C. Príncipe, “Kernel LMS,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), vol. 3, p. 1421, Honolulu, Hawaii, USA, April 2007.
- J. Platt, “A resource-allocating network for function interpolation,” Neural Computation, vol. 3, no. 2, p. 213, 1991.
- A. Sayed, Fundamentals of Adaptive Filtering, John Wiley & Sons, New York, NY, USA, 2003.
- N. Aronszajn, “Theory of reproducing kernels,” Transactions of the American Mathematical Society, vol. 68, no. 3, p. 337, 1950.
- C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, p. 121, 1998.
- B. Schölkopf, R. Herbrich, and A. J. Smola, “A generalized representer theorem,” in Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory, p. 416, Amsterdam, The Netherlands, July 2001.
- W. Liu, P. P. Pokharel, and J. C. Príncipe, “The kernel least mean square algorithm,” IEEE Transactions on Signal Processing, vol. 56, no. 2, p. 543, 2008.
- F. Girosi, M. Jones, and T. Poggio, “Regularization theory and neural networks architectures,” Neural Computation, vol. 7, no. 2, p. 219, 1995.
- S. Van Vaerenbergh, J. Vía, and I. Santamaría, “A sliding-window kernel RLS algorithm and its application to nonlinear channel identification,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 5, p. 789, Toulouse, France, May 2006.
- J. Kivinen, A. J. Smola, and R. C. Williamson, “Online learning with kernels,” IEEE Transactions on Signal Processing, vol. 52, no. 8, p. 2165, 2004.
- W. Liu, P. P. Pokharel, and J. C. Príncipe, “Recursively adapted radial basis function networks and its relationship to resource allocating networks and online kernel learning,” in Proceedings of IEEE International Workshop on Machine Learning for Signal Processing (MLSP '07), p. 300, Thessaloniki, Greece, August 2007.
- G. Cavallanti, N. Cesa-Bianchi, and C. Gentile, “Tracking the best hyperplane with a simple budget Perceptron,” Machine Learning, vol. 69, no. 2-3, p. 143, 2007.
- S. Yonghong, P. Saratchandran, and N. Sundararajan, “A direct link minimal resource allocation network for adaptive noise cancellation,” Neural Processing Letters, vol. 12, no. 3, p. 255, 2000.
- O. Dekel, S. Shalev-Shwartz, and Y. Singer, “The forgetron: a kernel-based perceptron on a fixed budget,” in Advances in Neural Information Processing Systems 18, p. 1342, MIT Press, Cambridge, Mass, USA, 2006.
- A. Navia-Vázquez, F. Pérez-Cruz, A. Artés-Rodríguez, and A. R. Figueiras-Vidál, “Weighted least squares training of support vector classifiers leading to compact and adaptive schemes,” IEEE Transactions on Neural Networks, vol. 12, no. 5, p. 1047, 2001.
- J. A. K. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
- C. K. I. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Advances in Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds., p. 682, chapter 13, MIT Press, Cambridge, Mass, USA, 2001.
- S. Fine and K. Scheinberg, “Efficient SVM training using low-rank kernel representations,” Journal of Machine Learning Research, vol. 2, p. 242, 2001.
- A. Bordes, S. Ertekin, J. Weston, and L. Bottou, “Fast kernel classifiers with online and active learning,” Journal of Machine Learning Research, vol. 6, p. 1579, 2005.
- K. Fukumizu, “Active learning in multilayer perceptrons,” in Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds., p. 295, MIT Press, Cambridge, Mass, USA, 1996.
- L. Glass and M. Mackey, From Clocks to Chaos: The Rhythms of Life, Princeton University Press, Princeton, NJ, USA, 1988.
- S. Mukherjee, E. Osuna, and F. Girosi, “Nonlinear prediction of chaotic time series using support vector machines,” in Proceedings of the 7th IEEE Workshop on Neural Networks for Signal Processing, J. C. Príncipe, L. Giles, N. Morgan, and E. Wilson, Eds., p. 511, IEEE Press, Amelia Island, Fla, USA, September 1997.
- G. C. Goodwin and K. S. Sin, Adaptive Filtering Prediction and Control, Prentice Hall, Upper Saddle River, NJ, USA, 1984.
- G. Kechriotis, E. Zervas, and E. S. Manolakos, “Using recurrent neural networks for adaptive communication channel equalization,” IEEE Transactions on Neural Networks, vol. 5, no. 2, p. 267, 1994.
- N. P. Sands and J. M. Cioffi, “Nonlinear channel models for digital magnetic recording,” IEEE Transactions on Magnetics, vol. 29, no. 6, part 2, p. 3996, 1993.
- E. Parzen, “Statistical methods on time series by Hilbert space methods,” Applied Mathematics and Statistics Laboratory, Stanford University, Stanford, Calif, USA, 1959.
- S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, Upper Saddle River, NJ, USA, 2nd edition, 1998.
- C. A. Micchelli and M. Pontil, “Learning the kernel function via regularization,” Journal of Machine Learning Research, vol. 6, p. 1099, 2005.
- A. Argyriou, C. A. Micchelli, and M. Pontil, “Learning convex combinations of continuously parameterized basic kernels,” in Proceedings of the 18th Annual Conference on Computational Learning Theory (COLT '05), p. 338, Bertinoro, Italy, June 2005.
- O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, “Choosing multiple parameters for support vector machines,” Machine Learning, vol. 46, no. 1–3, p. 131, 2002.