Department of Electronic Engineering, Faculty of Science, Technology and Engineering, La Trobe University, Bundoora, VIC 3086, Australia
Abstract
The M-estimate of a linear observation model has many important engineering applications
such as identifying a linear system under non-Gaussian noise. Batch algorithms based on the
EM algorithm or the iterative reweighted least squares algorithm have been widely adopted.
In recent years, several sequential algorithms have been proposed. In this paper, we propose a
family of sequential algorithms based on the Bayesian formulation of the problem. The basic
idea is that in each step we use a Gaussian approximation for the posterior and a quadratic
approximation for the log-likelihood function. The maximum a posteriori (MAP) estimation
leads naturally to algorithms similar to the recursive least squares (RLSs) algorithm. We discuss
the quality of the estimate, issues related to the initialization and estimation of parameters, and
robustness of the proposed algorithm. We then develop LMS-type algorithms by replacing the
covariance matrix with a scaled identity matrix under the constraint that the determinant of the
covariance matrix is preserved. We have proposed two LMS-type algorithms which are effective
and low-cost replacement of RLS-type of algorithms working under Gaussian and impulsive
noise, respectively. Numerical examples show that the performance of the proposed algorithms
are very competitive to that of other recently published algorithms.
1. Introduction
We consider a robust estimation problem for a linear observation model:
(1)
where
is the impulse response to be estimated,
is the known training data and the noise
follows an independent and identical distribution (i.i.d.). Given a set of training data
, the maximum likelihood estimation (MLE) of
leads to the following problem:
(2)
where
is the negative log likelihood function. The M-estimate of a linear model can also be expressed as the above MLE problem
when those well-developed penalty functions [1, 2] are regarded as generalized negative log-likelihood function. This is a robust regression problem. The solution not only is an essential data analysis tool [3, 4], but also has many practical engineering applications such as in system identification, where the noise model is heavy tailed [5].
The batch algorithms and the sequential algorithms are two basic approaches to solve the problem of (2). The batch algorithms include the EM algorithm for a family of heavy-tailed distributions [3, 4] and iterative reweighted least squares (IRLSs) algorithm for the M-estimate [2, 6]. In signal processing applications, a major disadvantage of a batch algorithm is that when a new set of training data is available the same algorithm must be run again on the whole data. A sequential algorithm, in contrast to a batch algorithm, updates the estimate as a new set
of training data is received. In recent years, several sequential algorithms [7–9] have been proposed for the M-estimate of a linear model. These algorithms are based on factorizing the IRLS solution [7] and factorizing the so-called M-estimate normal equation [8, 9]. These sequential algorithms can be regarded as a generalization of recursive least squares (RLSs) algorithm [10]. Other published works include robust LMS-type algorithms [11–13].
Bayesian learning has been a powerful tool for developing sequential learning algorithms. The problem is formulated as a maximum a posteriori (MAP) estimate problem. The basic idea is to break the sequential learning problem into two major steps [14]. In the update step, an approximate of the posterior at time
is used to obtain the new posterior at time
. In the approximation step, this new posterior is approximated by using a particular parametric distribution family. There are many well-documented techniques such as Laplace method [15] and Fisher scoring [16]. The variational Bayesian method has also been studied [17, 18].
In a recent paper [19], we address this problem from a Bayesian perspective and develop RLS-type and LMS-type of sequential learning algorithms. The development is based on using a Laplace approximation of the posterior and solving the maximum a posteriori (MAP) estimate problem by using the MM
algorithm [20]. The development of the algorithm is quite complicated. The RLS-type of algorithm is
further simplified as an LMS-type algorithm by treating the covariance matrix as being fixed. This has significantly reduced the computational complexity at the cost of degraded performance.
There are two major motivations of this work which is clearly an extension of our previous work [19]. Our first motivation is to follow the same problem formulation as in [19] and to explore an alternative and simpler approach to develop sequential M-estimate algorithms. More specifically, at each iteration, we use Gaussian approximation for the likelihood and the prior. As such, we can determine a
close form solution of an MAP estimate sequentially when a set of new training data is available. This MAP estimate is in the similar form as that of an RLS algorithm. Our second motivation is to extend the RLS-type algorithm to the LMS-type algorithm with an adaptive step size. It is well established that a learning algorithm with adaptive step size usually outperforms those with fixed step size in terms of faster initial learning rate and lower steady state [21]. Therefore,
instead of treating the covariance as being fixed, as in our previous work, we propose to use a scaled identity matrix to approximate the covariance matrix. The approximation is subject to preserving the determinant of the covariance matrix. As such, instead of updating the covariance, the scaling factor is updated. The update of the impulse response and the scaling factor thus
constitute an LMS-type algorithm with an adaptive step size. A major contribution of this work is thus the development of new sequential and adaptive learning algorithms. Another major contribution is that performance of proposed LMS-type of algorithms is very close to that of the RLS-type counterpart.
Since this work is an extension of our previous work in which a survey of related works and Bayesian sequential learning have already been briefly discussed, in this paper, for brevity purpose, we have omitted the presentation of an extensive literature survey. Interested readers can refer to [19] and references therein for more information. The rest of this paper is organized as follows. In Section 2, we present the development of the proposed algorithm including a suboptimal solution. We show that the proposed algorithm consists of an approximation step and a minimization step which lead to the update of the covariance matrix and impulse response, respectively. We also discuss the quality of the estimate, issues related to the initialization and estimation of parameters, and the relationship of the proposed algorithms with those of our previous work. In Section 3, we first develop the general LMS-type of algorithm. We then present three specific algorithms, discuss their stability conditions and parameter initiation. In Section 4, we present three numerical examples. The first one evaluates the performance of the proposed RLS-type of algorithms, while the second and the third evaluate the performance of the proposed LMS-type of algorithms under Gaussian and impulsive noise conditions, respectively. A summary of this paper is presented in Section 5.
2. Development of the Algorithm
2.1. Problem Formulation
From the Bayesian perspective, after receiving
sets of training data
the log posterior for the linear observation model (1) is given by
(3)
where
is the prior before receiving any training data and
represents the model assumption. Throughout this paper, we use “c” to represent a constant. The MAP estimate of
is given by
(4)
Since the original M-estimation problem (2) can be regarded as a maximum likelihood estimation problem, in order to apply the above Bayesian approach, in this paper we attempt to solve the following problem:
(5)
This is essentially a constrained MLE problem:
(6)
Using the Lagrange multiplier method, the constrained MLE problem can be recasted as (5), where
is the Lagrange multiplier and is related to the constant
. We can see that both
and
can be regarded as regularization parameters which are used to control the model complexity. Bayesian [22] and non-Bayesian [23] approaches have been developed to determine regularization parameters.
We can see that the constrained MLE problem is equivalent to the MAP problem when we set
and
. This is equivalent to regarding the penalty function as the negative log likelihood and setting a zero mean Gaussian prior for
with covariance matrix
where
is an identity matrix. Therefore, in this paper we develop a sequential M-estimation algorithm by solving an MAP problem which is equivalent to a constrained MLE problem.
Since we frequently use the three variables
,
and
, we define them as follows:
,
and
, where
and
are the estimates of
at time
and
, respectively. We can see that
is the additive noise at time
, and
and
are the modelling errors due to using
and
as the impulse response at time
, respectively.
2.2. The Proposed RLS-Type Algorithms
To develop a sequential algorithm, we rewrite (3) as follows:
(7)
where the term
is the log posterior at time (
) and is also the log prior at time
. The term
is the log-likelihood function. The basic idea of the proposed sequential algorithm is that an approximated log posterior is formed by replacing the log prior
with its quadratic approximation. The negative of the approximated log posterior is then minimized to obtain a new estimate.
To illustrate the idea, we start our development from the beginning stage of the learning process. Since the exact prior distribution for
is usually unknown, we use a Gaussian distribution with zero mean
and covariance
as an approximation. The negative log prior
is approximated by 
(8)
When the first set of training data
is received, the negative log likelihood is
and the negative log posterior with the approximated prior, denoted by
, can be written as
(9)
This is the approximation step. In the minimization step, we determine the minimizer of
, denoted by
, by solving the equation
.
We then determine a quadratic approximation of
around
through the Taylor-series expansion:
(10)
where
is a constant,
is the Hessian evaluated at
and the linear term
is zero since
. Ignoring higher-order terms, we have the quadratic approximation for
as follows:
(11)
This is equivalent to using a Gaussian distribution to approximate the posterior distribution
with mean
and covariance
. In Bayesian learning, this is well-known technique called Laplace approximation [15]. In optimization theory [24], a local quadratic approximation of the objective function is frequently used.
When we receive the second set of training data, we form the negative log posterior, denoted
, by replacing
with
as follows:
(12)
The minimization step results in an optimal estimate
.
Continuing this process and following the same procedure, at time
, we use a quadratic approximation for
and form an approximation of the negative log posterior
as
(13)
where
is optimal estimate at time
and is the minimizer of
. The MAP estimate at time
, denoted by
, satisfies the following equation:
(14)
where
and
. Note that,
in (13) is replaced by
in (14) because
is replaced by
. From (14), it is easy to show that
(15)
Since
depends on
, we need to determine
. Left-multiplying (15) by
, then using the definition of
, we can show that
(16)
where
. Once we have determined
from (16), we can calculate
and substitute it into (15). We show in Appendix that the solution of (16) has the following properties: when
,
, when
,
and
.
Next, we determine a quadratic approximation for
around
. This is equivalent to approximating the posterior
by a Gaussian distribution with mean
and the covariance matrix 
(17)
where
. Using a matrix inverse formula, we have the update of the covariance matrix for
as follows:
(18)
If
, then we have
.
If there is no closed form solution for (16), then we must use a numerical algorithm [25] such as Newton's method or a fixed-point iteration algorithm to find a solution. This would add a significant computational cost to proposed algorithm. An alternative way is to seek a closed form solution by
using a quadratic approximation of the penalty function
as follows:
(19)
As such, the cost function
is approximated by
(20)
In Appendix , we show that the optimal estimate and the update of the covariance matrix are given by
(21)
(22)
respectively. Comparing (15) with (21), we can see that using the quadratic approximation for
results in an approximation of
by
. Comparing (18) with (22), we can see that the only change due to the approximation is replacing
by
.
In summary, the proposed sequential algorithm for a particular penalty function can be developed as follows. Suppose at time
, we have
,
and the training data. We have two approaches here. If we can solve (16) for
, then we can calculate
using (15) and update
using (18). On the other hand, if there is no close form solution for
or the solution is very complicated, then we can use (21) and (22).
2.3. Specific Algorithms
In this section, we present three examples of the proposed
algorithm using three commonly used penalty functions. These penalty functions and their first and second derivatives are listed in Table 1. These functions
are shown in Figure 1. We also discuss the robustness of these algorithms. To simplify discussion, we use (21) and (22) for the algorithm development.
Table 1: A list of some commonly used penalty functions and their first and second derivatives, denoted by

,

and

, respectively.
Figure 1: The three penalty functions and their first and second derivatives. We set

and

when plotting these functions.
2.3.1. The
Penalty Function
We can easily see that by substituting
and
into (21) and (22), we have an RLS-type of algorithm [19]:
(23)
(24)
When
, this reduced to a recursive least squares algorithm[27]. One can easily see that the update of the impulse response is proportional to
. As such, it is not robust against impulsive noise which leads to a large value of
and thus a large unnecessary adjustment.
We note that we have used an approximate approach to derive (23) and (24). This is only used for the simplification of the presentation. In fact, for an
penalty function (23) and (24) can be directly derive from (15) and (18), respectively. The results are exactly the same as (23) and (24).
2.3.2. Huber's Penalty Function
By substituting the respective terms of
and
into (21) and (22), we have the following:
(25)
(26)
where
. Comparing (25) with (23), we can see that when
they are the same. However, when
indicating a possible case of outlier, (25) only uses the sign information to avoid making large misadjustment. For the update of the covariance matrix, when
, it is the same as (24). However, when
, no update is performed.
2.3.3. The Fair Penalty Function
We note that for the Fair penalty function, we have
and
. Substituting the respective values of
and
into (21) and (22), we have the following two update equations:
(27)
where
(28)
It is easy to show that for the Fair penalty function, we have
(29)
(30)
Therefore, the value of
is increasing in
and is bounded by
. As a result, the learning algorithm avoids making large misadjustment when
is large. In addition, the update for the covariance is controlled by the term
which is increasing in
. Thus the amount of adjustment decreases as
increases.
2.4. Discussions
2.4.1. Properties of the Estimate
Since in each step a Gaussian approximation is used for the posterior, it is an essential requirement that
must be positive definite. We show that this requirement is indeed satisfied. Referring to (17) and using the fact that
is nonnegative for the penalty functions considered [see Table 1] and that
is positive definite, we can see that the inverse of the covariance matrix
is positive definite. Using mathematical induction, it is easy to prove that
is positive definite.
In the same way, we can prove that the Hessian of the objective function given by
(31)
is also positive definite. Thus the objective function is strictly convex and the solution
is a global minimum.
Another interesting question is: does the estimate improve due to the new data
? To answer this question, we can study the determinant of the precision matrix which is defined as
. The basic idea is that for a univariate Gaussian, the precision is the inverse of the variance. A smaller variance is equivalent to a larger precision which implies a better estimate. From (17), we can write
(32)
where we have used the substitution
. In deriving the above results, we have used a matrix identity:
. Since
and
[see Table 1], we have
. It means that the precision of the current estimate due to the new training data is better than or at least as good as that of the previous estimate. We note that when we use the update (18) for the covariance matrix, the above discussion is still valid.
2.4.2. Parameter Initialization and Estimation
The proposed algorithm starts with a Gaussian approximation of the prior. We can simply set the prior mean as zero
and set the prior covariance as
, where
is an identity matrix and
is set to a small value to reflect the uncertainty about the true prior distribution. In our simulations, we set
. For the robust penalty functions listed in Table 1,
is a scaling parameter. We propose a simple online algorithm to estimate
as follows:
(33)
where
in our simulations. The function
takes the smaller value of the two inputs as the output. It makes the estimate of
robust to outliers.
It should be noted that for a 0.95 asymptotic efficiency on the standard normal distribution, the optimal value for
can be found in [2]. In addition, for Huber's penalty function, the additional parameter
is set to
for a 0.95 asymptotic efficiency on the normal distribution [2].
2.4.3. Connection with the One-Step MM Algorithm [19]
Since the RLS-type of algorithm [see (21) and (22)] is derived from the same problem formulation as that in our previous work [19] and is based on different approximations, it is interesting to compare the results. For easy
reference, we recall that in [19] we defined
where
. It is easy to show that
(34)
(35)
For easy reference, we reproduce (40) and (44) in [19] as follows:
(36)
(37)
where
,
and
. Substituting (34) into (36), we have the RLS-type algorithm which is the one-step MM algorithm in terms of
as the following:
(38)
(39)
We can easily see that (39) is exactly the same as (22). To compare (38) with (21), we rewrite (21) as follows:
(40)
It is clear that (40) has an extra term
compared to (38). The value of this term depends on the penalty function. For the
penalty function, this term equals to one.
2.4.4. Connections with Other RLS-Type Algorithms
We briefly comment on the connections of the proposed algorithm with that based on the
framework (see [26, Problem 2]) and the classical RLS algorithm with a forgetting factor [10]. For easy reference, the
update equations for these algorithms are listed in Table 2. Comparing these algorithms, we can see that a major difference is in the way
is updated. The robustness of the proposed algorithm is provided by the scaling factor
which controls the “amount” of update. Please refer to Figure 1 for a graphical representation of this function. For the
-based algorithm, an adaptively calculated quantity
(see [26, equation (9)]) is subtracted from the update. This is another way of controlling the “amount” update. For the RLS algorithm, the forgetting factor plays the role of exponential-weighted sum of squared errors. The update is not controlled based on the current modelling error. It is now clear that the term
and the term
play a very different role in their respective algorithms.
Table 2: The update equations of three RLS-type algorithms.
It should be noted that by using the Bayesian approach, it is quite easy to introduce the forgetting factor into the proposed
algorithm. Using the forgetting factor, the tracking performance of the proposed algorithm can be controlled. Since the development has been reported in our previous work [19], we do not discuss it in detail in this paper. A further interesting point is the interpretation of the matrix
. For the
penalty function,
can be called the covariance matrix. But for the Huber and fair penalty function, its interpretation is less clear. However, when we use a Gaussian distribution to approximate the posterior, we can still regard it as a covariance matrix of the Gaussian.
3. Extension to LMS-Type of Algorithms
3.1. General Algorithm
For the RLS-type algorithms, a major contribution to the computational cost is the update of the covariance matrix. To reduce the cost, a key idea is to approximate the covariance matrix
in each iteration by
where
is a positive scalar and
is an identity matrix of suitable dimension. In this paper, we propose an approximation under the constraint of preserving the determinant, that is,
. Since the determinant of the covariance matrix is an indication of the precision of the estimate, preserving the determinant thus permits passing on information about the quality of the estimate at time
to the next iteration. As such, we have
where
is the length of the impulse response. The task of updating
becomes updating
.
From (17) and using a matrix identity
, we can see that
(41)
[Here we assume that the size of the matrix
and the sizes of the two vectors
and
are properly defined]. Suppose, at time
, we have the approximation
. Substituting this approximation into the left-hand side of (41), we have
(42)
Substituting
into (42), we have the following:
(43)
Using a further approximation
to simply (43), we derive the update rule for
as follows:
(44)
Replacing
in (21) by
, we have the update of the estimate
(45)
Equations (44) and (45) can be regarded as the LMS-type of algorithm with an adaptive step size.
In [28], a stability condition for a class of LMS-type of algorithm is established as follows. The system is stable when
(
) is satisfied. We will use this condition to discuss the stability of the proposed algorithms in Section 3.2.
We point out that in developing the above update scheme for
, we have assumed that
is fixed. As such, the update rule cannot cope with a sudden change of
since
is increasing with
. This is inherent problem with the problem formulation. A systematic way to deal with it is to reformulate the problem to allow a time varying
by using a state space model. Another way is to detect the change of
and reset
to its default value accordingly.
3.2. Specific Algorithms
Specific algorithms for the three penalty functions can be developed by substituting
and
into (44) and (45). We note that the
penalty function can be regarded a special case of the penalty functions used in the M-estimate. The discussion of
robustness is very similar to that presented in Section 2.3 and is omitted. Details of the algorithms are described below.
3.2.1. The
Penalty Function
Substituting
and
into (45), we have
(46)
where
. From (44), we have
(47)
which can be rewritten as follows:
(48)
The proposed algorithm is thus given by (46) and (48). A very attractive property of this algorithm is that it has no parameters. We only need to set the initial value of
which can be set to zero (i.e.,
) reflecting our assumption that the prior distribution of
is flat.
The stability of this algorithm can be established by noting that
(49)
Since
when
, the stability condition is satisfied.
3.2.2. Huber's Penalty Function
In a similar way, we obtain the update for
and
as follows:
(50)
(51)
where
. The stability of the algorithm can be established by noting that when
, we have
(52)
which is the same as the
case. One the other hand, when
, we can easily show that
. As such, from (50) we have for 
(53)
Since
, we have
. Thus the stability condition is also satisfied.
3.2.3. The Fair Penalty Function
For the Fair penalty function, we define
. We have
and
. Using (45), we can write
(54)
where
. The update for the precision is given by
(55)
A potential problem is that the algorithm may be unstable in that the stability condition
may not be satisfied. This is because
(56)
where
. We can easily see that when
, we have
which leads to an unstable system.
To solve the potential instability problem, we propose to replace
in (54) by
which is defined as
(57)
where
. We note that
can be regarded as a special case of
when
. When
we can show that
. As a result, the system is stable. On the other hand, when
(implying
), we can show that
which also leads to a stable system.
3.3. Initialization and Estimation of Parameters
In actual implementation, we can set
which corresponds to setting
. In the Bayesian perspective, this sets a uniform prior for
, which represents the uncertainty about
before receiving any training data. To enhance the learning speed of this algorithm, we shrink the value of
in the first
iterations, that is,
where
. An intuitive justification is that
is an approximation of the precision of the estimate. In the
penalty function case,
is scaled by the unknown but assumed constant noise variance. Due to the nature of the approximation that ignores the higher order terms, the precision is overly estimated. A natural idea is to scale the estimated precision
. In simulations, we find that
and
lead to improved learning speed.
For the Huber and the fair penalty functions, it is necessary to estimate the scaling parameter
. We use a simple online algorithm to estimate
as follows:
(58)
where
in our simulations. In addition, for Huber's penalty function, the additional parameter
is set to
for a 0.95 asymptotic efficiency on the normal
distribution [2].
4. Numerical Examples
4.1. General Simulation Setup
To use the proposed algorithms to identify the linear observation model of (1), at the nth iteration we generate a zero mean
Gaussian random vector
of size (
) as the input vector. The variance of this random vector is 1. We then generate the noise and calculate the output of the system
. The performance of an algorithm is measured by
which is a function of
and is called the learning curve. Each learning curve is the result of averaging 50-run of the program using the same
additive noise. The purpose is to average out possible effect of the random input vector
. The result is then plotted in the log scale, that is,
where
is the averaged learning curve.
4.2. Performance of the Proposed RLS Algorithms
We set up the following simulation experiments. The impulse response to be identified is given by
. In the nth iteration, a random input signal vector
is generated as
and
is calculated using (1). The noise
is generated from a mixture of two zero mean Gaussian distributions which is simulated in Matlab by: 
. The threshold
controls the percentage of impulsive noise. In our experiments, we set
which correspond to about 1.2% of impulsive noise. A typical case for the noise used in our simulation is shown in Figure 2
Figure 2: Noise signal used in simulations.
Since the proposed algorithms using Huber and fair penalty functions are similar to the RLS algorithm, we compare their learning performance with that of the RLS and a recently published RLM algorithm [8] using suggested
values of parameters. Simulation results are shown in Figure 3. We observe from simulation results that the learning curves of proposed algorithms are very close to that of the RLM algorithm and are significantly better than that of
the RLS algorithm which is not robust to non-Gaussian noise. The performance of the proposed algorithm in this paper is also very closed to that of our previous work [19] and the comparison results are not presented for brevity.
Figure 3: A comparison of learning curves for different RLS-type algorithms.
4.3. Performance of Proposed LMS Type of Algorithms
We first compare the performance of our proposed LMS-type of algorithms using the fair and Huber penalty functions to a recently
published robust LMS algorithm (called the CAF algorithm in this paper) using the suggested settings of parameters [13]. The CAF algorithm adaptively combines the NLMS and the signed NLMS algorithms. As a bench mark, we also include simulation results using the RLM algorithm which is computationally more demanding than any LMS type of algorithms. The noise used is similar to that described in Section 4.2. We have tested these algorithms with three different length of impulse responses
. In each simulation, the impulse response is generated as a zero-mean Gaussian random (
) vector with standard deviation of 1. Simulation results are shown in Figure 4.
Figure 4: A comparison of the learning performance of different algorithms in terms of the transient response (right panel of the figure) and the steady state (left panel of the figure). Subfigures presented from top to bottom are results of testing different length of impulse response

. Legends for all subfigures are the same and are included only in the top-right sub-figure.
From this figure, we can see that the performance of the two proposed algorithms is consistently better than that of the CAF
algorithm. The performance of the proposed algorithm with the fair penalty function is also better than that with the Huber penalty function. When the length of the impulse response is moderate, the performance of the proposed algorithm with the fair penalty function is very close to that of the RLM algorithm. The latter has a notable faster learning rate than the former when the length is 512. Therefore, the proposed algorithm with the fair penalty
function can be a low computational-cost replacement of the RLM algorithm for identifying an unknown linear system with moderate length.
We now compare the performance of the proposed LMS-type algorithm using the
penalty function with a recently published NLMS algorithm with adaptive parameter estimation [21]. This algorithm (see [21, equation (10)]) is called the VSS-NLMS algorithm in this paper. The VSS-NLMS algorithm is chosen because its performance has been compared to many other LMS-type of algorithms with variable step sizes. We tune the parameter of the VSS-NLMS algorithm such that it reach the lowest possible steady state in each case. As a bench mark, we also include simulation results using the RLS algorithm. We have tested these algorithms with three different length of impulse responses
. In each simulation, the impulse response is generated as a zero mean Gaussian random (
) vector with standard deviation of 1. We have also tested settings with three different noise variances
and 1. We have obtained similar results for all three cases. In Figure 5, we present the steady state and the transient responses for these algorithms under the condition
. We can see that the performance of the proposed algorithm is very close to that of the RLS
algorithm for the two cases
and
In fact, these two algorithms converge to almost the same steady state and the learning rate of the RLS algorithm is
slightly faster. For the case of
the RLS algorithm, being a lot more computational demanding, has a faster learning rate in the transient response
than the proposed algorithm does. Comparing with the VSS-NLMS algorithm, the performance of the proposed algorithm is consistently better. Therefore, the proposed algorithm can be a low computational-cost replacement for the RLS algorithm for learning an unknown linear system of moderate length.
Figure 5: A comparison of the learning performance of different algorithms in terms of the transient response (right panel of the
figure) and the steady state (left panel of the figure). Subfigures presented from top to bottom are results of testing different length of impulse response

. Legends for all subfigures are the same and are included only in the top-right subfigure. We note that for the two cases

and 100, the proposed algorithm converges to the almost the same level of steady state as that of the RLS algorithm.
5. Conclusion
In this paper, we develop a general sequential algorithm for the M-estimate of a linear observation model. Our development is based on formulating the problem from a Bayesian perspective and using a
Gaussian approximation for the posterior and likelihood function in each learning step. The sequential algorithm is then developed by determining a maximum a posteriori (MAP) estimate when a new set of training data is received. The Gaussian approximation leads naturally to a quadratic objective function and the MAP estimate is an RLS-type algorithm. We have discussed the quality of the estimate, issues related to the initialization and estimation of
parameters, and the relationship of the proposed algorithm with those of previous work. Motivated by reducing computational cost of the RLS-type algorithm, we develop a family of LMS-type algorithms by replacing the covariance matrix with a scaled identity matrix. Instead of updating the covariance matrix, we update the scalar which is set to preserve the determinant of the covariance matrix. Simulation results show that the learning performance of the proposed algorithms is competitive to that of some recently published algorithms. In particular, the performance of proposed LMS-type algorithms has been shown to be very close to that of their respective RLS-type algorithms. Thus they can be replacements for RLS-type of algorithms at a relatively low computational cost.
Appendices
A. Properties of 
Let us consider the solution to the following equation:
(A.1)
Comparing it to (16), we can see that
,
,
(
) and
. We note that for the penalty functions
used for M-estimation, we have the following:
,
and
. Let
be a solution of (A.1). We can easily see that when
the solution is
. When
, we can rewrite (A.1) as follows:
(A.2)
The solution
must satisfy two conditions:
and
. These two conditions imply that
which is same as
.
B. Derivation of Equations (21) and (22)
Substituting (19) into (20) and taking the first
derivative, we have
(B.1)
The update for
is then determined by solving
as follows:
(B.2)
where we have replaced
by
. Left multiplying both sides of the above equation by
, then subtracting both sides by
, we obtain
(B.3)
Substitute
into (B.2), we have the update for
given by (21). The update of the covariance matrix
given by (22) can be determined by using
, where
is given by
(B.4)
References
- P. J. Huber, Robust Statistics, John Wiley & Sons, New York, NY, USA, 1981.
- W. J. J. Rey, Introduction to Robust and Quasi-Robust Statistical Methods, Springer, Berlin, Germany, 1983.
- K. Lange and J. S. Sinsheimer, “Normal/independent distributions and their applications in robust regression,” Journal of Computational and Graphical Statistics, vol. 2, no. 2, pp. 175–198, 1993.
- A. Gelman, H. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, Chapman & Hall/CRC, Boca Raton, Fla, USA, 2004.
- S. A. Kassam and H. V. Poor, “Robust techniques for signal proecssing: a survey,” Proceedings of the IEEE, vol. 73, no. 3, pp. 433–481, 1985.
- P. Petrus, “Robust Huber adaptive filter,” IEEE Transactions on Signal Processing, vol. 47, no. 4, pp. 1129–1133, 1999.
- K. L. Boyer, M. J. Mirza, and G. Ganguly, “Robust sequential estimator: a general approach and its application to surface organization in range data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 10, pp. 987–1001, 1994.
- S.-C. Chan and Y.-X. Zou, “A recursive least M-estimate algorithm for robust adaptive filtering in impulsive noise: fast algorithm and convergence performance analysis,” IEEE Transactions on Signal Processing, vol. 52, no. 4, pp. 975–991, 2004.
- D. S. Pham and A. M. Zoubir, “A sequential algorithm for robust parameter estimation,” IEEE Signal Processing Letters, vol. 12, no. 1, pp. 21–24, 2005.
- S. Haykin, Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, NJ, USA, 4th edition, 2002.
- W. A. Sethares, “The least mean square family,” in Adaptive System Identification and Signal Processing Algorithms, N. Kalouptsidis and S. Theodoridis, Eds., pp. 84–122, Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.
- J. Chambers and A. Avlonitis, “A robust mixed-norm adaptive filter algorithm,” IEEE Signal Processing Letters, vol. 4, no. 2, pp. 46–48, 1997.
- J. Arenas-García and A. R. Figueiras-Vidal, “Adaptive combination of normalised filters for robust system identification,” Electronics Letters, vol. 41, no. 15, pp. 874–875, 2005.
- M. Opper, “A Bayesian approach to online learning,” in Online Learning in Neural Networks, D. Saad, Ed., pp. 363–378, Cambridge University Press, Cambridge, UK, 1998.
- D. J. C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, Cambridge, UK, 2003.
- T. Briegel and V. Tresp, “Robust neural network regression for offline and online learning,” in Advances in Neural Information Processing Systems 12, T. K. Leen, K.-R. Muller, and S. A. Solla, Eds., pp. 407–413, MIT Press, Cambridge, Mass, USA, 2000.
- Z. Ghahramani and M. J. Beal, “Propagation algorithms for variational Bayesian learning,” in Advances in Neural Information Processing Systems, T. K. Leen, T. Dietterich, and V. Tresp, Eds., vol. 13, pp. 507–513, MIT Press, Cambridge, Mass, USA, 2001.
- A. Honkela and H. Valpola, “Online variational Bayesian learning,” in Proceedings of the 4th International Symposium on Independent Component
Analysis and Blind Signal Separation (ICA '03), pp. 803–808, Nara, Japan, April 2003.
- G. Deng, “Robust sequential learning algorithms for linear observation models,” IEEE Transactions on Signal Processing, vol. 55, no. 6, pp. 2472–2485, 2007.
- D. R. Hunter and K. Lange, “A tutorial on MM algorithms,” American Statistician, vol. 58, no. 1, pp. 30–37, 2004.
- H.-C. Shin, A. H. Sayed, and W.-J. Song, “Variable step-size nlms and affine projection algorithms,” IEEE Signal Processing Letters, vol. 11, no. 2, pp. 132–135, 2004.
- D. J. C. Mackay, “Bayesian interpolation,” Neural Computation, vol. 4, no. 3, pp. 415–447, 1992.
- M. J. L. Orr, “Recent advances in radial basis function networks,” 1999, http://www.anc.ed.ac.uk/#mjo/papers/recad.ps.gz.
- S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, UK, 2004.
- M. T. Heath, Scientific Computing: An Introductory Survey, McGraw-Hill, New York, NY, USA, 2nd edition, 2002.
- B. Hassibi and T. Kailath, “Adaptive filtering with an h-infinity criterion,” in Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers (ACSSC '94, pp. 1483–1487, Pacific Grove, Calif, USA, October-November 1994.
- S. M. Kay, Fundamentals of Statistical Signal Processing Estimation Theory, Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.
- S. C. Douglas and M. Rupp, “A posteriori update for adaptive filters,” in Proceedings of the 31st Asilomar Conference on Signals, Systems and Computers (ACSSC '97), vol. 2, pp. 1641–1645, Pacific Grove, Calif, USA, November 1997.