EURASIP Journal on Advances in Signal Processing
Volume 2008 (2008), Article ID 459586, 13 pages
doi:10.1155/2008/459586
Research Article

Sequential and Adaptive Learning Algorithms for M-Estimation

Guang Deng

Department of Electronic Engineering, Faculty of Science, Technology and Engineering, La Trobe University, Bundoora, VIC 3086, Australia

Received 1 October 2007; Revised 9 January 2008; Accepted 1 April 2008

Academic Editor: Sergios Theodoridis

Copyright © 2008 Guang Deng. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The M-estimate of a linear observation model has many important engineering applications such as identifying a linear system under non-Gaussian noise. Batch algorithms based on the EM algorithm or the iterative reweighted least squares algorithm have been widely adopted. In recent years, several sequential algorithms have been proposed. In this paper, we propose a family of sequential algorithms based on the Bayesian formulation of the problem. The basic idea is that in each step we use a Gaussian approximation for the posterior and a quadratic approximation for the log-likelihood function. The maximum a posteriori (MAP) estimation leads naturally to algorithms similar to the recursive least squares (RLSs) algorithm. We discuss the quality of the estimate, issues related to the initialization and estimation of parameters, and robustness of the proposed algorithm. We then develop LMS-type algorithms by replacing the covariance matrix with a scaled identity matrix under the constraint that the determinant of the covariance matrix is preserved. We have proposed two LMS-type algorithms which are effective and low-cost replacement of RLS-type of algorithms working under Gaussian and impulsive noise, respectively. Numerical examples show that the performance of the proposed algorithms are very competitive to that of other recently published algorithms.

1. Introduction

We consider a robust estimation problem for a linear observation model: (1) where is the impulse response to be estimated, is the known training data and the noise follows an independent and identical distribution (i.i.d.). Given a set of training data , the maximum likelihood estimation (MLE) of leads to the following problem: (2) where is the negative log likelihood function. The M-estimate of a linear model can also be expressed as the above MLE problem when those well-developed penalty functions [1, 2] are regarded as generalized negative log-likelihood function. This is a robust regression problem. The solution not only is an essential data analysis tool [3, 4], but also has many practical engineering applications such as in system identification, where the noise model is heavy tailed [5].

The batch algorithms and the sequential algorithms are two basic approaches to solve the problem of (2). The batch algorithms include the EM algorithm for a family of heavy-tailed distributions [3, 4] and iterative reweighted least squares (IRLSs) algorithm for the M-estimate [2, 6]. In signal processing applications, a major disadvantage of a batch algorithm is that when a new set of training data is available the same algorithm must be run again on the whole data. A sequential algorithm, in contrast to a batch algorithm, updates the estimate as a new set of training data is received. In recent years, several sequential algorithms [79] have been proposed for the M-estimate of a linear model. These algorithms are based on factorizing the IRLS solution [7] and factorizing the so-called M-estimate normal equation [8, 9]. These sequential algorithms can be regarded as a generalization of recursive least squares (RLSs) algorithm [10]. Other published works include robust LMS-type algorithms [1113].

Bayesian learning has been a powerful tool for developing sequential learning algorithms. The problem is formulated as a maximum a posteriori (MAP) estimate problem. The basic idea is to break the sequential learning problem into two major steps [14]. In the update step, an approximate of the posterior at time is used to obtain the new posterior at time . In the approximation step, this new posterior is approximated by using a particular parametric distribution family. There are many well-documented techniques such as Laplace method [15] and Fisher scoring [16]. The variational Bayesian method has also been studied [17, 18].

In a recent paper [19], we address this problem from a Bayesian perspective and develop RLS-type and LMS-type of sequential learning algorithms. The development is based on using a Laplace approximation of the posterior and solving the maximum a posteriori (MAP) estimate problem by using the MM algorithm [20]. The development of the algorithm is quite complicated. The RLS-type of algorithm is further simplified as an LMS-type algorithm by treating the covariance matrix as being fixed. This has significantly reduced the computational complexity at the cost of degraded performance.

There are two major motivations of this work which is clearly an extension of our previous work [19]. Our first motivation is to follow the same problem formulation as in [19] and to explore an alternative and simpler approach to develop sequential M-estimate algorithms. More specifically, at each iteration, we use Gaussian approximation for the likelihood and the prior. As such, we can determine a close form solution of an MAP estimate sequentially when a set of new training data is available. This MAP estimate is in the similar form as that of an RLS algorithm. Our second motivation is to extend the RLS-type algorithm to the LMS-type algorithm with an adaptive step size. It is well established that a learning algorithm with adaptive step size usually outperforms those with fixed step size in terms of faster initial learning rate and lower steady state [21]. Therefore, instead of treating the covariance as being fixed, as in our previous work, we propose to use a scaled identity matrix to approximate the covariance matrix. The approximation is subject to preserving the determinant of the covariance matrix. As such, instead of updating the covariance, the scaling factor is updated. The update of the impulse response and the scaling factor thus constitute an LMS-type algorithm with an adaptive step size. A major contribution of this work is thus the development of new sequential and adaptive learning algorithms. Another major contribution is that performance of proposed LMS-type of algorithms is very close to that of the RLS-type counterpart.

Since this work is an extension of our previous work in which a survey of related works and Bayesian sequential learning have already been briefly discussed, in this paper, for brevity purpose, we have omitted the presentation of an extensive literature survey. Interested readers can refer to [19] and references therein for more information. The rest of this paper is organized as follows. In Section 2, we present the development of the proposed algorithm including a suboptimal solution. We show that the proposed algorithm consists of an approximation step and a minimization step which lead to the update of the covariance matrix and impulse response, respectively. We also discuss the quality of the estimate, issues related to the initialization and estimation of parameters, and the relationship of the proposed algorithms with those of our previous work. In Section 3, we first develop the general LMS-type of algorithm. We then present three specific algorithms, discuss their stability conditions and parameter initiation. In Section 4, we present three numerical examples. The first one evaluates the performance of the proposed RLS-type of algorithms, while the second and the third evaluate the performance of the proposed LMS-type of algorithms under Gaussian and impulsive noise conditions, respectively. A summary of this paper is presented in Section 5.

2. Development of the Algorithm

2.1. Problem Formulation

From the Bayesian perspective, after receiving sets of training data the log posterior for the linear observation model (1) is given by (3) where is the prior before receiving any training data and represents the model assumption. Throughout this paper, we use “c” to represent a constant. The MAP estimate of is given by (4)

Since the original M-estimation problem (2) can be regarded as a maximum likelihood estimation problem, in order to apply the above Bayesian approach, in this paper we attempt to solve the following problem: (5)

This is essentially a constrained MLE problem: (6) Using the Lagrange multiplier method, the constrained MLE problem can be recasted as (5), where is the Lagrange multiplier and is related to the constant . We can see that both and can be regarded as regularization parameters which are used to control the model complexity. Bayesian [22] and non-Bayesian [23] approaches have been developed to determine regularization parameters.

We can see that the constrained MLE problem is equivalent to the MAP problem when we set and . This is equivalent to regarding the penalty function as the negative log likelihood and setting a zero mean Gaussian prior for with covariance matrix where is an identity matrix. Therefore, in this paper we develop a sequential M-estimation algorithm by solving an MAP problem which is equivalent to a constrained MLE problem.

Since we frequently use the three variables , and , we define them as follows: , and , where and are the estimates of at time and , respectively. We can see that is the additive noise at time , and and are the modelling errors due to using and as the impulse response at time , respectively.

2.2. The Proposed RLS-Type Algorithms

To develop a sequential algorithm, we rewrite (3) as follows: (7) where the term is the log posterior at time () and is also the log prior at time . The term is the log-likelihood function. The basic idea of the proposed sequential algorithm is that an approximated log posterior is formed by replacing the log prior with its quadratic approximation. The negative of the approximated log posterior is then minimized to obtain a new estimate.

To illustrate the idea, we start our development from the beginning stage of the learning process. Since the exact prior distribution for is usually unknown, we use a Gaussian distribution with zero mean and covariance as an approximation. The negative log prior is approximated by (8) When the first set of training data is received, the negative log likelihood is and the negative log posterior with the approximated prior, denoted by , can be written as (9) This is the approximation step. In the minimization step, we determine the minimizer of , denoted by , by solving the equation .

We then determine a quadratic approximation of around through the Taylor-series expansion: (10) where is a constant, is the Hessian evaluated at and the linear term is zero since . Ignoring higher-order terms, we have the quadratic approximation for as follows: (11) This is equivalent to using a Gaussian distribution to approximate the posterior distribution with mean and covariance . In Bayesian learning, this is well-known technique called Laplace approximation [15]. In optimization theory [24], a local quadratic approximation of the objective function is frequently used.

When we receive the second set of training data, we form the negative log posterior, denoted , by replacing with as follows: (12) The minimization step results in an optimal estimate .

Continuing this process and following the same procedure, at time , we use a quadratic approximation for and form an approximation of the negative log posterior as (13) where is optimal estimate at time and is the minimizer of . The MAP estimate at time , denoted by , satisfies the following equation: (14) where and . Note that, in (13) is replaced by in (14) because is replaced by . From (14), it is easy to show that (15) Since depends on , we need to determine . Left-multiplying (15) by , then using the definition of , we can show that (16) where . Once we have determined from (16), we can calculate and substitute it into (15). We show in Appendix that the solution of (16) has the following properties: when , , when , and .

Next, we determine a quadratic approximation for around . This is equivalent to approximating the posterior by a Gaussian distribution with mean and the covariance matrix (17) where . Using a matrix inverse formula, we have the update of the covariance matrix for as follows: (18) If , then we have .

If there is no closed form solution for (16), then we must use a numerical algorithm [25] such as Newton's method or a fixed-point iteration algorithm to find a solution. This would add a significant computational cost to proposed algorithm. An alternative way is to seek a closed form solution by using a quadratic approximation of the penalty function as follows: (19) As such, the cost function is approximated by (20) In Appendix , we show that the optimal estimate and the update of the covariance matrix are given by (21)(22) respectively. Comparing (15) with (21), we can see that using the quadratic approximation for results in an approximation of by . Comparing (18) with (22), we can see that the only change due to the approximation is replacing by .

In summary, the proposed sequential algorithm for a particular penalty function can be developed as follows. Suppose at time , we have , and the training data. We have two approaches here. If we can solve (16) for , then we can calculate using (15) and update using (18). On the other hand, if there is no close form solution for or the solution is very complicated, then we can use (21) and (22).

2.3. Specific Algorithms

In this section, we present three examples of the proposed algorithm using three commonly used penalty functions. These penalty functions and their first and second derivatives are listed in Table 1. These functions are shown in Figure 1. We also discuss the robustness of these algorithms. To simplify discussion, we use (21) and (22) for the algorithm development.

Table 1: A list of some commonly used penalty functions and their first and second derivatives, denoted by , and , respectively.
Figure 1: The three penalty functions and their first and second derivatives. We set and when plotting these functions.
2.3.1. The Penalty Function

We can easily see that by substituting and into (21) and (22), we have an RLS-type of algorithm [19]: (23)(24) When , this reduced to a recursive least squares algorithm[27]. One can easily see that the update of the impulse response is proportional to . As such, it is not robust against impulsive noise which leads to a large value of and thus a large unnecessary adjustment.

We note that we have used an approximate approach to derive (23) and (24). This is only used for the simplification of the presentation. In fact, for an penalty function (23) and (24) can be directly derive from (15) and (18), respectively. The results are exactly the same as (23) and (24).

2.3.2. Huber's Penalty Function

By substituting the respective terms of and into (21) and (22), we have the following: (25)(26) where . Comparing (25) with (23), we can see that when they are the same. However, when indicating a possible case of outlier, (25) only uses the sign information to avoid making large misadjustment. For the update of the covariance matrix, when , it is the same as (24). However, when , no update is performed.

2.3.3. The Fair Penalty Function

We note that for the Fair penalty function, we have and . Substituting the respective values of and into (21) and (22), we have the following two update equations: (27) where (28) It is easy to show that for the Fair penalty function, we have (29)(30) Therefore, the value of is increasing in and is bounded by . As a result, the learning algorithm avoids making large misadjustment when is large. In addition, the update for the covariance is controlled by the term which is increasing in . Thus the amount of adjustment decreases as increases.

2.4. Discussions
2.4.1. Properties of the Estimate

Since in each step a Gaussian approximation is used for the posterior, it is an essential requirement that must be positive definite. We show that this requirement is indeed satisfied. Referring to (17) and using the fact that is nonnegative for the penalty functions considered [see Table 1] and that is positive definite, we can see that the inverse of the covariance matrix is positive definite. Using mathematical induction, it is easy to prove that is positive definite.

In the same way, we can prove that the Hessian of the objective function given by (31) is also positive definite. Thus the objective function is strictly convex and the solution is a global minimum.

Another interesting question is: does the estimate improve due to the new data ? To answer this question, we can study the determinant of the precision matrix which is defined as . The basic idea is that for a univariate Gaussian, the precision is the inverse of the variance. A smaller variance is equivalent to a larger precision which implies a better estimate. From (17), we can write (32) where we have used the substitution . In deriving the above results, we have used a matrix identity: . Since and [see Table 1], we have . It means that the precision of the current estimate due to the new training data is better than or at least as good as that of the previous estimate. We note that when we use the update (18) for the covariance matrix, the above discussion is still valid.

2.4.2. Parameter Initialization and Estimation

The proposed algorithm starts with a Gaussian approximation of the prior. We can simply set the prior mean as zero and set the prior covariance as , where is an identity matrix and is set to a small value to reflect the uncertainty about the true prior distribution. In our simulations, we set . For the robust penalty functions listed in Table 1, is a scaling parameter. We propose a simple online algorithm to estimate as follows: (33) where in our simulations. The function takes the smaller value of the two inputs as the output. It makes the estimate of robust to outliers.

It should be noted that for a 0.95 asymptotic efficiency on the standard normal distribution, the optimal value for can be found in [2]. In addition, for Huber's penalty function, the additional parameter is set to for a 0.95 asymptotic efficiency on the normal distribution [2].

2.4.3. Connection with the One-Step MM Algorithm [19]

Since the RLS-type of algorithm [see (21) and (22)] is derived from the same problem formulation as that in our previous work [19] and is based on different approximations, it is interesting to compare the results. For easy reference, we recall that in [19] we defined where . It is easy to show that (34)(35) For easy reference, we reproduce (40) and (44) in [19] as follows: (36)(37) where , and . Substituting (34) into (36), we have the RLS-type algorithm which is the one-step MM algorithm in terms of as the following: (38)(39) We can easily see that (39) is exactly the same as (22). To compare (38) with (21), we rewrite (21) as follows: (40) It is clear that (40) has an extra term compared to (38). The value of this term depends on the penalty function. For the penalty function, this term equals to one.

2.4.4. Connections with Other RLS-Type Algorithms

We briefly comment on the connections of the proposed algorithm with that based on the framework (see [26, Problem 2]) and the classical RLS algorithm with a forgetting factor [10]. For easy reference, the update equations for these algorithms are listed in Table 2. Comparing these algorithms, we can see that a major difference is in the way is updated. The robustness of the proposed algorithm is provided by the scaling factor which controls the “amount” of update. Please refer to Figure 1 for a graphical representation of this function. For the -based algorithm, an adaptively calculated quantity (see [26, equation (9)]) is subtracted from the update. This is another way of controlling the “amount” update. For the RLS algorithm, the forgetting factor plays the role of exponential-weighted sum of squared errors. The update is not controlled based on the current modelling error. It is now clear that the term and the term play a very different role in their respective algorithms.

Table 2: The update equations of three RLS-type algorithms.

It should be noted that by using the Bayesian approach, it is quite easy to introduce the forgetting factor into the proposed algorithm. Using the forgetting factor, the tracking performance of the proposed algorithm can be controlled. Since the development has been reported in our previous work [19], we do not discuss it in detail in this paper. A further interesting point is the interpretation of the matrix . For the penalty function, can be called the covariance matrix. But for the Huber and fair penalty function, its interpretation is less clear. However, when we use a Gaussian distribution to approximate the posterior, we can still regard it as a covariance matrix of the Gaussian.

3. Extension to LMS-Type of Algorithms

3.1. General Algorithm

For the RLS-type algorithms, a major contribution to the computational cost is the update of the covariance matrix. To reduce the cost, a key idea is to approximate the covariance matrix in each iteration by where is a positive scalar and is an identity matrix of suitable dimension. In this paper, we propose an approximation under the constraint of preserving the determinant, that is, . Since the determinant of the covariance matrix is an indication of the precision of the estimate, preserving the determinant thus permits passing on information about the quality of the estimate at time to the next iteration. As such, we have where is the length of the impulse response. The task of updating becomes updating .

From (17) and using a matrix identity , we can see that (41) [Here we assume that the size of the matrix and the sizes of the two vectors and are properly defined]. Suppose, at time , we have the approximation . Substituting this approximation into the left-hand side of (41), we have (42) Substituting into (42), we have the following: (43) Using a further approximation to simply (43), we derive the update rule for as follows: (44) Replacing in (21) by , we have the update of the estimate (45) Equations (44) and (45) can be regarded as the LMS-type of algorithm with an adaptive step size.

In [28], a stability condition for a class of LMS-type of algorithm is established as follows. The system is stable when () is satisfied. We will use this condition to discuss the stability of the proposed algorithms in Section 3.2.

We point out that in developing the above update scheme for , we have assumed that is fixed. As such, the update rule cannot cope with a sudden change of since is increasing with . This is inherent problem with the problem formulation. A systematic way to deal with it is to reformulate the problem to allow a time varying by using a state space model. Another way is to detect the change of and reset to its default value accordingly.

3.2. Specific Algorithms

Specific algorithms for the three penalty functions can be developed by substituting and into (44) and (45). We note that the penalty function can be regarded a special case of the penalty functions used in the M-estimate. The discussion of robustness is very similar to that presented in Section 2.3 and is omitted. Details of the algorithms are described below.

3.2.1. The Penalty Function

Substituting and into (45), we have (46) where . From (44), we have (47) which can be rewritten as follows: (48) The proposed algorithm is thus given by (46) and (48). A very attractive property of this algorithm is that it has no parameters. We only need to set the initial value of which can be set to zero (i.e., ) reflecting our assumption that the prior distribution of is flat.

The stability of this algorithm can be established by noting that (49) Since when , the stability condition is satisfied.

3.2.2. Huber's Penalty Function

In a similar way, we obtain the update for and as follows: (50)(51) where . The stability of the algorithm can be established by noting that when , we have (52) which is the same as the case. One the other hand, when , we can easily show that . As such, from (50) we have for (53) Since , we have . Thus the stability condition is also satisfied.

3.2.3. The Fair Penalty Function

For the Fair penalty function, we define . We have and . Using (45), we can write (54) where . The update for the precision is given by (55) A potential problem is that the algorithm may be unstable in that the stability condition may not be satisfied. This is because (56) where . We can easily see that when , we have which leads to an unstable system.

To solve the potential instability problem, we propose to replace in (54) by which is defined as (57) where . We note that can be regarded as a special case of when . When we can show that . As a result, the system is stable. On the other hand, when (implying ), we can show that which also leads to a stable system.

3.3. Initialization and Estimation of Parameters

In actual implementation, we can set which corresponds to setting . In the Bayesian perspective, this sets a uniform prior for , which represents the uncertainty about before receiving any training data. To enhance the learning speed of this algorithm, we shrink the value of in the first iterations, that is, where . An intuitive justification is that is an approximation of the precision of the estimate. In the penalty function case, is scaled by the unknown but assumed constant noise variance. Due to the nature of the approximation that ignores the higher order terms, the precision is overly estimated. A natural idea is to scale the estimated precision . In simulations, we find that and lead to improved learning speed.

For the Huber and the fair penalty functions, it is necessary to estimate the scaling parameter . We use a simple online algorithm to estimate as follows: (58) where in our simulations. In addition, for Huber's penalty function, the additional parameter is set to for a 0.95 asymptotic efficiency on the normal distribution [2].

4. Numerical Examples

4.1. General Simulation Setup

To use the proposed algorithms to identify the linear observation model of (1), at the nth iteration we generate a zero mean Gaussian random vector of size () as the input vector. The variance of this random vector is 1. We then generate the noise and calculate the output of the system . The performance of an algorithm is measured by which is a function of and is called the learning curve. Each learning curve is the result of averaging 50-run of the program using the same additive noise. The purpose is to average out possible effect of the random input vector . The result is then plotted in the log scale, that is, where is the averaged learning curve.

4.2. Performance of the Proposed RLS Algorithms

We set up the following simulation experiments. The impulse response to be identified is given by . In the nth iteration, a random input signal vector is generated as and is calculated using (1). The noise is generated from a mixture of two zero mean Gaussian distributions which is simulated in Matlab by: . The threshold controls the percentage of impulsive noise. In our experiments, we set which correspond to about 1.2% of impulsive noise. A typical case for the noise used in our simulation is shown in Figure 2

Figure 2: Noise signal used in simulations.

Since the proposed algorithms using Huber and fair penalty functions are similar to the RLS algorithm, we compare their learning performance with that of the RLS and a recently published RLM algorithm [8] using suggested values of parameters. Simulation results are shown in Figure 3. We observe from simulation results that the learning curves of proposed algorithms are very close to that of the RLM algorithm and are significantly better than that of the RLS algorithm which is not robust to non-Gaussian noise. The performance of the proposed algorithm in this paper is also very closed to that of our previous work [19] and the comparison results are not presented for brevity.

Figure 3: A comparison of learning curves for different RLS-type algorithms.
4.3. Performance of Proposed LMS Type of Algorithms

We first compare the performance of our proposed LMS-type of algorithms using the fair and Huber penalty functions to a recently published robust LMS algorithm (called the CAF algorithm in this paper) using the suggested settings of parameters [13]. The CAF algorithm adaptively combines the NLMS and the signed NLMS algorithms. As a bench mark, we also include simulation results using the RLM algorithm which is computationally more demanding than any LMS type of algorithms. The noise used is similar to that described in Section 4.2. We have tested these algorithms with three different length of impulse responses . In each simulation, the impulse response is generated as a zero-mean Gaussian random () vector with standard deviation of 1. Simulation results are shown in Figure 4.

Figure 4: A comparison of the learning performance of different algorithms in terms of the transient response (right panel of the figure) and the steady state (left panel of the figure). Subfigures presented from top to bottom are results of testing different length of impulse response . Legends for all subfigures are the same and are included only in the top-right sub-figure.

From this figure, we can see that the performance of the two proposed algorithms is consistently better than that of the CAF algorithm. The performance of the proposed algorithm with the fair penalty function is also better than that with the Huber penalty function. When the length of the impulse response is moderate, the performance of the proposed algorithm with the fair penalty function is very close to that of the RLM algorithm. The latter has a notable faster learning rate than the former when the length is 512. Therefore, the proposed algorithm with the fair penalty function can be a low computational-cost replacement of the RLM algorithm for identifying an unknown linear system with moderate length.

We now compare the performance of the proposed LMS-type algorithm using the penalty function with a recently published NLMS algorithm with adaptive parameter estimation [21]. This algorithm (see [21, equation (10)]) is called the VSS-NLMS algorithm in this paper. The VSS-NLMS algorithm is chosen because its performance has been compared to many other LMS-type of algorithms with variable step sizes. We tune the parameter of the VSS-NLMS algorithm such that it reach the lowest possible steady state in each case. As a bench mark, we also include simulation results using the RLS algorithm. We have tested these algorithms with three different length of impulse responses . In each simulation, the impulse response is generated as a zero mean Gaussian random () vector with standard deviation of 1. We have also tested settings with three different noise variances and 1. We have obtained similar results for all three cases. In Figure 5, we present the steady state and the transient responses for these algorithms under the condition . We can see that the performance of the proposed algorithm is very close to that of the RLS algorithm for the two cases and In fact, these two algorithms converge to almost the same steady state and the learning rate of the RLS algorithm is slightly faster. For the case of the RLS algorithm, being a lot more computational demanding, has a faster learning rate in the transient response than the proposed algorithm does. Comparing with the VSS-NLMS algorithm, the performance of the proposed algorithm is consistently better. Therefore, the proposed algorithm can be a low computational-cost replacement for the RLS algorithm for learning an unknown linear system of moderate length.

Figure 5: A comparison of the learning performance of different algorithms in terms of the transient response (right panel of the figure) and the steady state (left panel of the figure). Subfigures presented from top to bottom are results of testing different length of impulse response . Legends for all subfigures are the same and are included only in the top-right subfigure. We note that for the two cases and 100, the proposed algorithm converges to the almost the same level of steady state as that of the RLS algorithm.

5. Conclusion

In this paper, we develop a general sequential algorithm for the M-estimate of a linear observation model. Our development is based on formulating the problem from a Bayesian perspective and using a Gaussian approximation for the posterior and likelihood function in each learning step. The sequential algorithm is then developed by determining a maximum a posteriori (MAP) estimate when a new set of training data is received. The Gaussian approximation leads naturally to a quadratic objective function and the MAP estimate is an RLS-type algorithm. We have discussed the quality of the estimate, issues related to the initialization and estimation of parameters, and the relationship of the proposed algorithm with those of previous work. Motivated by reducing computational cost of the RLS-type algorithm, we develop a family of LMS-type algorithms by replacing the covariance matrix with a scaled identity matrix. Instead of updating the covariance matrix, we update the scalar which is set to preserve the determinant of the covariance matrix. Simulation results show that the learning performance of the proposed algorithms is competitive to that of some recently published algorithms. In particular, the performance of proposed LMS-type algorithms has been shown to be very close to that of their respective RLS-type algorithms. Thus they can be replacements for RLS-type of algorithms at a relatively low computational cost.

Appendices

A. Properties of

Let us consider the solution to the following equation: (A.1) Comparing it to (16), we can see that , , () and . We note that for the penalty functions used for M-estimation, we have the following: , and . Let be a solution of (A.1). We can easily see that when the solution is . When , we can rewrite (A.1) as follows: (A.2) The solution must satisfy two conditions: and . These two conditions imply that which is same as .

B. Derivation of Equations (21) and (22)

Substituting (19) into (20) and taking the first derivative, we have (B.1) The update for is then determined by solving as follows: (B.2) where we have replaced by . Left multiplying both sides of the above equation by , then subtracting both sides by , we obtain (B.3) Substitute into (B.2), we have the update for given by (21). The update of the covariance matrix given by (22) can be determined by using , where is given by (B.4)

References

  1. P. J. Huber, Robust Statistics, John Wiley & Sons, New York, NY, USA, 1981.
  2. W. J. J. Rey, Introduction to Robust and Quasi-Robust Statistical Methods, Springer, Berlin, Germany, 1983.
  3. K. Lange and J. S. Sinsheimer, “Normal/independent distributions and their applications in robust regression,” Journal of Computational and Graphical Statistics, vol. 2, no. 2, pp. 175–198, 1993.
  4. A. Gelman, H. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, Chapman & Hall/CRC, Boca Raton, Fla, USA, 2004.
  5. S. A. Kassam and H. V. Poor, “Robust techniques for signal proecssing: a survey,” Proceedings of the IEEE, vol. 73, no. 3, pp. 433–481, 1985.
  6. P. Petrus, “Robust Huber adaptive filter,” IEEE Transactions on Signal Processing, vol. 47, no. 4, pp. 1129–1133, 1999.
  7. K. L. Boyer, M. J. Mirza, and G. Ganguly, “Robust sequential estimator: a general approach and its application to surface organization in range data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 10, pp. 987–1001, 1994.
  8. S.-C. Chan and Y.-X. Zou, “A recursive least M-estimate algorithm for robust adaptive filtering in impulsive noise: fast algorithm and convergence performance analysis,” IEEE Transactions on Signal Processing, vol. 52, no. 4, pp. 975–991, 2004.
  9. D. S. Pham and A. M. Zoubir, “A sequential algorithm for robust parameter estimation,” IEEE Signal Processing Letters, vol. 12, no. 1, pp. 21–24, 2005.
  10. S. Haykin, Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, NJ, USA, 4th edition, 2002.
  11. W. A. Sethares, “The least mean square family,” in Adaptive System Identification and Signal Processing Algorithms, N. Kalouptsidis and S. Theodoridis, Eds., pp. 84–122, Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.
  12. J. Chambers and A. Avlonitis, “A robust mixed-norm adaptive filter algorithm,” IEEE Signal Processing Letters, vol. 4, no. 2, pp. 46–48, 1997.
  13. J. Arenas-García and A. R. Figueiras-Vidal, “Adaptive combination of normalised filters for robust system identification,” Electronics Letters, vol. 41, no. 15, pp. 874–875, 2005.
  14. M. Opper, “A Bayesian approach to online learning,” in Online Learning in Neural Networks, D. Saad, Ed., pp. 363–378, Cambridge University Press, Cambridge, UK, 1998.
  15. D. J. C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, Cambridge, UK, 2003.
  16. T. Briegel and V. Tresp, “Robust neural network regression for offline and online learning,” in Advances in Neural Information Processing Systems 12, T. K. Leen, K.-R. Muller, and S. A. Solla, Eds., pp. 407–413, MIT Press, Cambridge, Mass, USA, 2000.
  17. Z. Ghahramani and M. J. Beal, “Propagation algorithms for variational Bayesian learning,” in Advances in Neural Information Processing Systems, T. K. Leen, T. Dietterich, and V. Tresp, Eds., vol. 13, pp. 507–513, MIT Press, Cambridge, Mass, USA, 2001.
  18. A. Honkela and H. Valpola, “Online variational Bayesian learning,” in Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA '03), pp. 803–808, Nara, Japan, April 2003.
  19. G. Deng, “Robust sequential learning algorithms for linear observation models,” IEEE Transactions on Signal Processing, vol. 55, no. 6, pp. 2472–2485, 2007.
  20. D. R. Hunter and K. Lange, “A tutorial on MM algorithms,” American Statistician, vol. 58, no. 1, pp. 30–37, 2004.
  21. H.-C. Shin, A. H. Sayed, and W.-J. Song, “Variable step-size nlms and affine projection algorithms,” IEEE Signal Processing Letters, vol. 11, no. 2, pp. 132–135, 2004.
  22. D. J. C. Mackay, “Bayesian interpolation,” Neural Computation, vol. 4, no. 3, pp. 415–447, 1992.
  23. M. J. L. Orr, “Recent advances in radial basis function networks,” 1999, http://www.anc.ed.ac.uk/#mjo/papers/recad.ps.gz.
  24. S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, UK, 2004.
  25. M. T. Heath, Scientific Computing: An Introductory Survey, McGraw-Hill, New York, NY, USA, 2nd edition, 2002.
  26. B. Hassibi and T. Kailath, “Adaptive filtering with an h-infinity criterion,” in Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers (ACSSC '94, pp. 1483–1487, Pacific Grove, Calif, USA, October-November 1994.
  27. S. M. Kay, Fundamentals of Statistical Signal Processing Estimation Theory, Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.
  28. S. C. Douglas and M. Rupp, “A posteriori update for adaptive filters,” in Proceedings of the 31st Asilomar Conference on Signals, Systems and Computers (ACSSC '97), vol. 2, pp. 1641–1645, Pacific Grove, Calif, USA, November 1997.