Abstract

A sparsity-promoting prior combined with Bayesian inference is an effective approach to solving sparse linear models (SLMs). In this paper, we first introduce a new sparsity-promoting prior, coined the Double Lomax prior, which corresponds to a three-level hierarchical model, and then derive a full variational Bayesian (VB) inference procedure. When a noninformative hyperprior is assumed, we further show that the proposed method has one more latent variable than the canonical automatic relevance determination (ARD). This variable has a smoothing effect on the solution trajectories and thus provides improved convergence performance. The effectiveness of the proposed method is demonstrated by numerical simulations including autoregressive (AR) model identification and compressive sensing (CS) problems.

1. Introduction

Consider the following linear model:
$$\mathbf{y} = \boldsymbol{\Phi}\mathbf{w} + \mathbf{n}, \qquad (1)$$
where $\boldsymbol{\Phi} \in \mathbb{R}^{N\times M}$ is the design matrix, $\mathbf{y} \in \mathbb{R}^{N}$ is the measurement vector, $\mathbf{n}$ is the measurement noise, and $\mathbf{w} \in \mathbb{R}^{M}$ is the unknown coefficient vector. This linear model is called underdetermined if $N < M$ and overdetermined otherwise. In the underdetermined case there are in general infinitely many solutions, and additional assumptions on $\mathbf{w}$ are needed to obtain the desired one. In the applications we consider, the sparsity of $\mathbf{w}$ is an important assumption: most elements of $\mathbf{w}$ should be tiny or equal to zero when they are not required to explain the data, but a few should be allowed to be large if necessary. A linear model with such an assumption on $\mathbf{w}$ is referred to as a sparse linear model (SLM). SLMs can represent various types of signals in different transform domains [15]. Obtaining such a representation is equivalent to solving a linear inverse problem with sparsity-promoting regularization. It is a fundamental issue in model (basis) selection [5, 6], compressive sensing [7, 8], source localization [9, 10], and data fusion [11, 12].

Under the sparsity assumption, one seeks a representation of $\mathbf{y}$ with a minimal number of basis vectors of $\boldsymbol{\Phi}$. This is equivalent to seeking the sparsest solution of (1), that is, the one with the maximal number of elements equal to zero. The sparsest solution can be induced by the improper $\ell_0$-norm of $\mathbf{w}$ [13], $\|\mathbf{w}\|_0 = \#\{i : w_i \neq 0\}$. The associated maximum a posteriori (MAP) estimation (assuming that the noise is white Gaussian) is a penalized least squares problem:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \boldsymbol{\Phi}\mathbf{w}\|_2^2 + \rho\,\|\mathbf{w}\|_0, \qquad (2)$$
where $\rho$ is the model parameter that controls the relative importance of the error term and the sparseness term. The challenge in solving (2) is that it requires a combinatorial search, which is NP-hard in general. A natural approach is to replace the $\ell_0$-norm by a tractable approximation. Among all the tractable approximations, the $\ell_1$-norm is the tightest convex relaxation of the $\ell_0$-norm; it leads to the successful Lasso algorithm [1, 3, 4, 14–16]. However, a nonconvex relaxation is able to recover sparsity in a more efficient way. It has been proved that a nonconvex relaxation that closely resembles the $\ell_0$-norm is able to find the correct solution from fewer measurements [17–20].
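
As a concrete illustration of the $\ell_1$ relaxation, the following minimal Python sketch (not from the paper; scikit-learn's Lasso is used as a stand-in solver, and the sizes, seed, and regularization weight are arbitrary choices) recovers a sparse vector from an underdetermined system:

```python
# Minimal sketch of the l1-relaxed problem (Lasso); sizes, seed, and the
# regularization weight alpha are illustrative choices, not values from the paper.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, M, K = 50, 100, 8                      # measurements, unknowns, nonzeros
Phi = rng.standard_normal((N, M)) / np.sqrt(N)
w_true = np.zeros(M)
w_true[rng.choice(M, K, replace=False)] = rng.standard_normal(K)
y = Phi @ w_true + 0.01 * rng.standard_normal(N)

# Solves  min_w  (1/(2N)) ||y - Phi w||_2^2 + alpha * ||w||_1
lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000)
lasso.fit(Phi, y)
w_hat = lasso.coef_
print("recovered support:", np.flatnonzero(np.abs(w_hat) > 1e-3))
```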

One difficulty with nonconvex relaxation is the existence of many local minima. Point-estimation methods such as MAP may easily get stuck in poor local solutions [21]. A posterior-estimation Bayesian approach can nicely alleviate this problem. Given the sparsity-promoting prior $p(\mathbf{w})$, which is the probabilistic representation of a particular $\ell_0$-norm relaxation, the Bayesian estimate can be expressed as
$$\hat{\mathbf{w}} = \mathbb{E}[\mathbf{w}\mid\mathbf{y}] = \int \mathbf{w}\, p(\mathbf{w}\mid\mathbf{y})\, d\mathbf{w},$$
where $p(\mathbf{w}\mid\mathbf{y})$ is the posterior. The Bayesian approach computes the posterior mean instead of its mode, and it integrates instead of maximizing over $\mathbf{w}$. In the presence of many local minima, the mass of the probability density tends to concentrate at a strong minimum, which is more likely to be the global optimum. Although $\hat{\mathbf{w}}$ is not necessarily the exact global optimum in general, it is proved in [22] that, by taking a “zero temperature limit,” $\hat{\mathbf{w}}$ is able to converge to the global optimum. In other words, as the relaxation approaches the $\ell_0$-norm, $\hat{\mathbf{w}}$ converges to the $\ell_0$ solution.

In the Bayesian formulation, the choice of prior is important for both solution sparsity and computational tractability. The Student-$t$ prior is a relatively standard prior that stands out by being non-log-concave (resulting in a nonconvex penalty) and by having a Gaussian-integral representation for efficient processing. With $\mathrm{St}(w\mid\nu,\lambda)$ denoting the Student-$t$ prior, the distribution is defined as
$$\mathrm{St}(w\mid\nu,\lambda) = \frac{\Gamma\big(\frac{\nu+1}{2}\big)}{\Gamma\big(\frac{\nu}{2}\big)}\Big(\frac{\lambda}{\pi\nu}\Big)^{1/2}\Big(1+\frac{\lambda w^2}{\nu}\Big)^{-(\nu+1)/2}, \qquad (4)$$
where $\nu$ controls the degree of sparsity enforced and is known as the shape parameter, and $\lambda$ is the scale parameter. Additionally, denoting the zero-mean Gaussian prior with variance $\sigma^2$ by $\mathcal{N}(w\mid 0,\sigma^2)$ and the Gamma prior by $\mathrm{Gam}(\eta\mid a,b)$, defined as $\mathrm{Gam}(\eta\mid a,b) = \frac{b^{a}}{\Gamma(a)}\,\eta^{a-1}e^{-b\eta}$, the Student-$t$ prior corresponds to a Gaussian integral as follows:
$$\mathrm{St}(w\mid\nu,\lambda) = \int_0^{\infty} \mathcal{N}\big(w\mid 0,(\lambda\eta)^{-1}\big)\,\mathrm{Gam}\Big(\eta\,\Big|\,\frac{\nu}{2},\frac{\nu}{2}\Big)\,d\eta, \qquad (5)$$
where $\eta$ is the latent variable modifying the precision of the Gaussian mixture. The representation in (5) can be treated as an integral over the scale parameter of the Gaussian density and is thus referred to as a Gaussian scale mixture (GSM) [23], which naturally invokes a two-level hierarchical Bayes model.

Specifically, letting the shape parameter $\nu$ tend to zero leads to the noninformative-hyperprior case of the Student-$t$ prior and gives the well-known automatic relevance determination (ARD) prior (the scale parameter is also assumed to be unity) [24] as follows:
$$p_{\mathrm{ARD}}(w) \propto \frac{1}{|w|}. \qquad (6)$$
As the limiting case, the ARD prior provides the strongest sparsity-promoting potential among all Student-$t$ priors with different values of $\nu$ (see the remark following Proposition 1), and it has drawn tremendous attention in many problems [5, 8, 25]. Several methods have been proposed to solve SLMs incorporating ARD priors, including variational Bayes (VB) inference [25], type II maximum likelihood (the “evidence framework”) [26], and a series of reweighted optimization problems [27].

In this paper, we formulate the SLM problem from a Bayesian perspective. In contrast to the two-level Student-$t$ prior, we propose the use of a new non-log-concave sparsity prior, referred to as the Double Lomax prior, which corresponds to a three-level hierarchical Bayes model. Additionally, a full VB inference procedure is derived to solve the SLM using Double Lomax priors. As shown here, when noninformative hyperpriors are assumed, the updating procedure of the proposed method includes the canonical ARD updating procedure as a special case, but it has better global convergence behavior that can result in improved sparse estimates.

The organization of this paper is as follows. In Section 2, we present the proposed Double Lomax prior and demonstrate its properties. In Section 3, we formulate the Bayesian model of the SLM using Double Lomax priors based on a generalized noise model. In Section 4, a full VB inference procedure is developed. In Section 5, we discuss the relationship between the proposed method and the ARD approach. Experimental results are presented in Section 6, where two SLM applications with different matrix properties are considered. One application involves order determination and coefficient estimation for autoregressive (AR) models, where the rows of the matrix $\boldsymbol{\Phi}$ are strongly correlated. The other application is sparse signal recovery in the standard compressive sensing (CS) setup, where the rows of the matrix $\boldsymbol{\Phi}$ are uncorrelated. Conclusions are given in Section 7.

2. Double Lomax Prior and Its Properties

The proposed Double Lomax prior is obtained by splicing two Lomax distributions (also known as the Pareto Type II distribution [28]) back to back, giving the density in (7). It can be shown that (7) integrates to one and is symmetric about zero; thus (7) is a probability density function defined on the whole real line, specified by a scale parameter, a shape parameter, and a zero mean.
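
For reference, the standard Lomax (Pareto Type II) density and the symmetric density obtained by splicing two such halves back to back are sketched below; the exact normalization and parameterization of the paper's (7) may differ:

```latex
% Standard Lomax (Pareto Type II) density with shape \alpha > 0 and scale b > 0,
% and the symmetric "double" form obtained by splicing two halves back to back.
\[
  \mathrm{Lomax}(x \mid \alpha, b) = \frac{\alpha}{b}\Big(1 + \frac{x}{b}\Big)^{-(\alpha+1)},
  \qquad x \ge 0,
\]
\[
  p_{\mathrm{DL}}(w \mid \alpha, b) = \frac{\alpha}{2b}\Big(1 + \frac{|w|}{b}\Big)^{-(\alpha+1)},
  \qquad w \in \mathbb{R}.
\]
```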

Moreover, when the shape parameter goes to infinity, the Double Lomax prior asymptotically becomes the Laplace prior with the corresponding inverse scale; when the shape parameter goes to zero, the Double Lomax prior tends to concentrate its mass strongly around zero, keeping the same peak but exhibiting much longer and thinner tails. This behavior can be observed in Figure 1, which shows the Double Lomax prior for different values of the shape parameter.

Proposition 1. The Double Lomax prior in (7) is log-convex.

Proof. The logarithm of the density function in (7) is continuous on the real line, and its first and second derivatives can be computed on each half-line. The second derivative is nonnegative; thus, the Double Lomax prior is log-convex, and it is strictly log-convex for a finite shape parameter, even though the first and second derivatives do not exist at the origin.

Remark 2. Proposition 1 shows that the implicit penalty induced by the Double Lomax prior, $-\ln p(w)$, is (strictly) concave as a function of $|w|$, spanning from the convex $\ell_1$-norm penalty (when the shape parameter goes to infinity) to a strongly concave $\ell_0$-norm-like penalty (when the shape parameter goes to zero). Figure 2 plots this penalty for different values of the shape parameter.
For the Student-$t$ prior in (4), in contrast, the implicit penalty $-\ln \mathrm{St}(w\mid\nu,\lambda)$ has inconsistent concavity; that is, it is convex on a central interval around zero and concave on each of the two outer intervals. When $\nu \to \infty$, the outer concave intervals vanish, and the Student-$t$ prior becomes the Gaussian prior with precision $\lambda$, leading to the $\ell_2$-norm penalty that does not induce sparsity at all; when $\nu \to 0$, the central convex interval vanishes, and the Student-$t$ prior is then referred to as the ARD prior, which becomes proportional to the noninformative Jeffreys prior and leads to the log-sum penalty $\sum_i \ln|w_i|$, providing a sparsity-promoting effect similar to that of the Double Lomax prior.

Proposition 3. The Double Lomax prior in (7) can be represented as a GSM. Given a Gaussian prior, an exponential prior, and a Gamma prior, one has the representation in (11).

Proof. Equation (7) can be expressed as a mixture of Laplace densities, as given in (12). Meanwhile, note that the Laplace prior has a hierarchical representation as a Gaussian scale mixture with an exponential mixing density, as given in (13). Substituting (13) into (12) gives (11). This proves Proposition 3.
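
The Gaussian scale mixture representation of the Laplace prior referred to in (13) is the classical identity below, stated here in one common parameterization, which may differ from the paper's:

```latex
% Laplace density as a Gaussian scale mixture with an exponential mixing density
% (Andrews & Mallows); \lambda > 0 is the inverse scale of the Laplace prior.
\[
  \frac{\lambda}{2}\, e^{-\lambda |w|}
  = \int_{0}^{\infty} \mathcal{N}(w \mid 0, \tau)\,
    \underbrace{\frac{\lambda^{2}}{2}\, e^{-\lambda^{2}\tau/2}}_{\mathrm{Exp}(\tau \,\mid\, \lambda^{2}/2)}\, d\tau .
\]
```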

Remark 4. Equation (11) corresponds to a three-level hierarchical Bayes model with two latent variables. On the first level, the random variables are generated from zero-mean Gaussian priors with independent variances. On the second level, these variances are generated from exponential priors whose independent inverse scales are formed from a set of independent variables. On the third level, the positive squares of these independent parameters are generated from a Gamma prior whose two parameters share the same value.
Moreover, besides this GSM model, the Double Lomax prior can be considered as a Laplace scale mixture as well, and the inverse scale can be thought of as the width controller of the corresponding Laplace prior: if the inverse scale is large, the corresponding distribution is narrow and the regularizing effect towards zero is strong.
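
To make the Laplace-scale-mixture view concrete, mixing the Laplace inverse scale with a Gamma density yields a symmetric Lomax-type density; under the parameterization assumed above (which may differ from the paper's) the integral is

```latex
% Mixing the Laplace inverse scale \lambda over a Gamma(a, b) density (shape a, rate b)
% produces a back-to-back spliced Lomax density.
\[
  \int_{0}^{\infty} \frac{\lambda}{2}\, e^{-\lambda |w|}\,
    \frac{b^{a}}{\Gamma(a)}\, \lambda^{a-1} e^{-b\lambda}\, d\lambda
  = \frac{a}{2b}\Big(1 + \frac{|w|}{b}\Big)^{-(a+1)} .
\]
```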

3. Sparse Linear Model Using Double Lomax Priors

We assume independent Double Lomax priors over the coefficients $\mathbf{w}$ and use the model in (1). Without loss of generality, the scale parameter is assumed to be unity. Using (11), the coefficient model of (1) can be written in hierarchical form, where one latent variable modifies the variance of each Gaussian in the GSM model, another latent variable modifies the inverse scale of the corresponding exponential prior, and the precision matrix is diagonal with the inverse variances on its diagonal.

Traditionally, the noise in an SLM is presumed to be Gaussian, which means that the measurements are also Gaussian distributed. However, in many real applications, the measurements have tails that decay more slowly than Gaussian tails and therefore cannot be described well by a Gaussian noise model. Here we allow for very slowly decaying tails by modeling the noise with a Student-$t$ distribution. As a generalized noise model, the Student-$t$ distribution can describe a wide range of noises, from Gaussian to very heavy tailed. It also connects robust statistics with sparsity priors through their shared ability to model extreme events: variables that have small probabilities but a large value range.

Using a hierarchical Bayes model similar to (5), denoting the shape parameter (also called the degrees of freedom), the scale parameter (also called the precision parameter), and the corresponding latent variables, and imposing two additional Gamma priors on the shape and scale parameters, the hierarchical form of the Student-$t$ distributed data can be written as a Gaussian likelihood whose precision is scaled by a per-measurement latent variable, where each likelihood term involves the corresponding row of the design matrix $\boldsymbol{\Phi}$, together with Gamma priors on the latent scaling variables and on the shape and precision parameters.
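
A common hierarchical form consistent with this description (written here in assumed notation; the paper's corresponding equations may be parameterized differently) is

```latex
% Student-t likelihood as a Gaussian with a per-measurement latent precision scaling z_i;
% \beta is the global precision (scale) parameter and \nu the degrees of freedom.
\[
  y_i \mid \mathbf{w}, z_i, \beta \sim
    \mathcal{N}\!\big(\boldsymbol{\phi}_i^{\top}\mathbf{w},\ (\beta z_i)^{-1}\big),
  \qquad
  z_i \mid \nu \sim \mathrm{Gam}\!\Big(\frac{\nu}{2}, \frac{\nu}{2}\Big),
\]
with additional Gamma hyperpriors placed on $\beta$ and $\nu$.
```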

Figure 3 shows the complete Bayesian model and the interdependencies among the model parameters. All hyperparameters in this graph are set to small values to obtain flat hyperpriors. It can be observed from the graph that the joint distribution of the observations and parameters factorizes into the product of the conditional densities given in (17), (14), (15), (16), (19), (18), and (20).

4. Variational Bayesian Sparse Regression

It is usually difficult to evaluate the integrals in Bayesian estimation analytically; hence approximation methods are required. There are two major classes of approximations. One is Markov chain Monte Carlo (MCMC), which represents the posterior by samples; asymptotically exact results can be obtained given unlimited time, but MCMC is relatively time consuming [29]. The other is variational relaxation [22], which relaxes the Bayesian estimation into a series of tractable optimization problems, providing a good tradeoff between accuracy and computational efficiency. VB inference [30–32] is one approach of this class.

VB approximates the posterior density by introducing a factorized distribution $q(\Theta)$. Using Jensen's inequality, we get a lower bound on the log probability of the data,
$$\ln p(\mathbf{y}) \geq \int q(\Theta)\,\ln\frac{p(\mathbf{y},\Theta)}{q(\Theta)}\,d\Theta. \qquad (22)$$
In this work, we factorize the approximating distribution over groups of parameters, $q(\Theta) = \prod_j q_j(\Theta_j)$. Maximizing the right-hand side of (22) with respect to $q_j(\Theta_j)$, we get a general expression for the optimal solution,
$$\ln q_j^{*}(\Theta_j) = \big\langle \ln p(\mathbf{y},\Theta)\big\rangle_{i\neq j} + \mathrm{const}, \qquad (24)$$
where $\langle\cdot\rangle_{i\neq j}$ denotes the expectation operator with respect to the distributions $q_i(\Theta_i)$ for all $i\neq j$, and the additive constant corresponds to the normalization of $q_j^{*}$. If a conjugate prior exists and is chosen for each group, the approximate posterior has the same functional form as the prior [30]; hence the normalization constant does not need to be considered.

Now we derive a full VB inference procedure for the above problem. We use the general optimal solution in (24) to obtain the factorized approximate posterior densities. However, unlike most VB updates in the literature, the challenge here is that, for several random variables in the proposed model, no conjugate priors exist for their likelihoods; hence, in order to obtain an iterative algorithm in closed form, extra integration analysis has to be incorporated into our derivation.

4.1. Coefficients

To compute the approximate posterior of the coefficients, we only need to consider those terms that are functionally dependent on $\mathbf{w}$. Because the Gaussian prior is conjugate for the Gaussian likelihood, the approximate posterior of $\mathbf{w}$ is still Gaussian, and it is convenient to ignore the normalization constant. Comparing the quadratic term with the standard Gaussian form, we can identify the mean and covariance of this Gaussian.
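
For concreteness, under the notation assumed above the Gaussian factor takes the following standard form (a sketch; the paper's exact expressions may differ in details such as how the noise latent variables enter):

```latex
% Mean-field Gaussian factor for the coefficients under the assumed notation:
% T = diag(<1/tau_i>) collects the expected prior precisions and
% Z = diag(<z_i>) the expected per-measurement noise precision scalings.
\[
  q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}),
  \qquad
  \boldsymbol{\Sigma} = \big(\langle\beta\rangle\, \boldsymbol{\Phi}^{\top}\mathbf{Z}\boldsymbol{\Phi} + \mathbf{T}\big)^{-1},
  \qquad
  \boldsymbol{\mu} = \langle\beta\rangle\, \boldsymbol{\Sigma}\,\boldsymbol{\Phi}^{\top}\mathbf{Z}\,\mathbf{y}.
\]
```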

4.2. Coefficients’ Latent Variables (Variances)

From (14) and (15) we observe that the terms involving the variance latent variables factorize over the coefficient index; hence this approximate posterior density factorizes further into a product of per-coefficient densities, each depending on the corresponding element of the posterior mean vector and the corresponding diagonal element of the posterior covariance matrix.

However, since the exponential distribution is not the conjugate prior for the variance of a Gaussian distribution, the normalization term must be reinstated. Integrating over the variance gives the normalization constant of this posterior, and therefore its complete form. It is noted that the posterior density in (30) corresponds to a distribution known as the generalized inverse Gaussian (GIG) distribution, whose probability density function is given in [33] in terms of the modified Bessel function of the third kind (also called the modified Bessel function of the second kind). Using an integral representation of this Bessel function, the parameters of the GIG posterior can be identified by comparing (30) with (31). The moments of a GIG distribution are also expressed in terms of ratios of Bessel functions [33]; therefore, the required posterior expectations follow in closed form.
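
The GIG moments needed here can be computed numerically from ratios of modified Bessel functions; a small Python helper, using the common parameterization $p(x) \propto x^{p-1}\exp(-(ax + b/x)/2)$ (which may differ from the paper's (31)), is:

```python
# Moments of the generalized inverse Gaussian distribution
# GIG(p, a, b) with density proportional to x^(p-1) * exp(-(a*x + b/x)/2), x > 0.
# E[x^r] = (b/a)^(r/2) * K_{p+r}(sqrt(a*b)) / K_p(sqrt(a*b)),
# where K_nu is the modified Bessel function of the second (third) kind.
import numpy as np
from scipy.special import kv

def gig_moment(p, a, b, r=1):
    """r-th moment of GIG(p, a, b); requires a, b > 0."""
    s = np.sqrt(a * b)
    return (b / a) ** (r / 2.0) * kv(p + r, s) / kv(p, s)

# Example: posterior expectations of a variance-like variable and of its inverse.
p, a, b = 0.5, 2.0, 0.3
print(gig_moment(p, a, b, r=1))    # E[x]
print(gig_moment(p, a, b, r=-1))   # E[1/x]
```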

4.3. Coefficients’ Latent Variables (Inverse Scales)

The approximate posterior density of the inverse-scale latent variables can also be factorized over the coefficient index. The resulting log-density in (39) contains logarithmic, quadratic, and linear terms; thus it does not correspond to any standard distribution. Using a definite integration result from [34], the normalization term can be expressed in terms of the parabolic cylinder function, whose computation can be implemented as in [35], and the required posterior expectations follow.

However, when a noninformative hyperprior is assumed, the hyperparameter is set to zero and the posterior becomes a Rayleigh distribution, whose moments are available in closed form. Equation (43) then indicates that the required expectation is simply inversely proportional to the corresponding expected variance in this case.
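
For reference, if the resulting factor has the Rayleigh form $q(\lambda) \propto \lambda \exp(-c\lambda^{2}/2)$ (assumed notation), its normalized density and first two moments are

```latex
\[
  q(\lambda) = c\,\lambda\, e^{-c\lambda^{2}/2}, \quad \lambda \ge 0,
  \qquad
  \langle \lambda \rangle = \sqrt{\frac{\pi}{2c}},
  \qquad
  \langle \lambda^{2} \rangle = \frac{2}{c},
\]
so that $\langle \lambda^{2} \rangle$ is indeed inversely proportional to the statistic $c$ appearing in the exponent.
```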

4.4. Noise Precision Parameter

Since the Gamma distribution is conjugate to the precision of a Gaussian, we again obtain a Gamma distribution for the noise precision, with shape and rate parameters given by the corresponding expected sufficient statistics.
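
Under the assumed notation, the standard conjugate update takes a form like the following (a sketch; the exact expected residual term depends on the paper's noise parameterization):

```latex
% Gamma factor for the noise precision \beta with prior Gam(a_0, b_0);
% the expected squared residuals are taken under q(w) and q(z).
\[
  q(\beta) = \mathrm{Gam}\!\Big(\beta \,\Big|\, a_0 + \frac{N}{2},\;
      b_0 + \frac{1}{2}\sum_{i=1}^{N} \langle z_i \rangle\,
      \big\langle (y_i - \boldsymbol{\phi}_i^{\top}\mathbf{w})^{2} \big\rangle \Big).
\]
```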

4.5. Noise’s Latent Variables

It is observed that the approximate posterior density of the noise latent variables factorizes over the measurement index, and each factor is again a Gamma distribution, with parameters given by the corresponding expected sufficient statistics.

4.6. Noise’s Degrees of Freedom

It is noted that the optimal factor for the degrees of freedom does not correspond to any standard distribution. This problem can be handled by expanding the log-Gamma function using Stirling's approximation [6]. The resulting update involves the digamma function, defined as the derivative of the log-Gamma function.
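
One common way to implement such a degrees-of-freedom update is to solve a scalar stationary condition numerically; the sketch below uses the standard EM-style condition for Gaussian scale-mixture representations of the Student-$t$ distribution (with a flat prior on the degrees of freedom), which differs in detail from the paper's Stirling-based expression with a Gamma hyperprior. The expectations E_z and E_log_z stand for the assumed quantities $\langle z_i\rangle$ and $\langle \ln z_i\rangle$.

```python
# Solve for the Student-t degrees of freedom nu from the stationary condition
#   N * [ ln(nu/2) + 1 - psi(nu/2) ] + sum_i ( <ln z_i> - <z_i> ) = 0,
# a standard EM/VB-style update for Student-t scale mixtures (flat prior on nu).
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def update_nu(E_z, E_log_z):
    N = len(E_z)
    c = np.sum(E_log_z - E_z)
    g = lambda nu: N * (np.log(nu / 2.0) + 1.0 - digamma(nu / 2.0)) + c
    return brentq(g, 1e-3, 1e3)   # g is monotonically decreasing on this bracket

# Example with synthetic expectations:
rng = np.random.default_rng(1)
z = rng.gamma(shape=2.0, scale=0.5, size=200)
print(update_nu(z, np.log(z)))
```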

We summarize the VB inference for the Double Lomax formulation (noninformative hyperprior) as follows.

For the coefficient model, consider the updates in (54), (55), (57), and (59); for the noise (observation) model, consider the updates in (61), (63), and (65). When initial expectations are assumed, the iterative updating can be carried out in closed form.
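
To make the overall procedure concrete, the following self-contained Python sketch implements one self-consistent variant of the alternating updates: Gaussian noise with a Gamma prior on its precision, the coefficient hierarchy $w_i \sim \mathcal{N}(0,\tau_i)$, $\tau_i \sim \mathrm{Exp}(\lambda_i^2/2)$, and an improper $1/\lambda_i$ hyperprior (the noninformative limit, giving a Rayleigh factor for $\lambda_i$). The parameterization, and therefore the exact constants, are assumptions and need not match the paper's equations (54)–(65).

```python
# Sketch of a mean-field VB loop for a Lomax-type (Laplace scale mixture) sparse prior.
# Assumed hierarchy (may differ from the paper's parameterization):
#   y = Phi @ w + n,      n ~ N(0, beta^{-1} I),     beta ~ Gam(a0, b0)
#   w_i ~ N(0, tau_i),    tau_i ~ Exp(rate = lambda_i^2 / 2),
#   lambda_i ~ improper 1/lambda prior, so q(lambda_i) is Rayleigh.
import numpy as np

def vb_double_lomax_like(Phi, y, n_iter=100, a0=1e-6, b0=1e-6):
    N, M = Phi.shape
    E_beta = 1.0                       # <beta>
    E_inv_tau = np.ones(M)             # <1/tau_i>
    E_lam2 = np.ones(M)                # <lambda_i^2>
    PhiT_Phi, PhiT_y = Phi.T @ Phi, Phi.T @ y
    for _ in range(n_iter):
        # q(w): Gaussian with precision <beta> Phi'Phi + diag(<1/tau>)
        Sigma = np.linalg.inv(E_beta * PhiT_Phi + np.diag(E_inv_tau))
        mu = E_beta * Sigma @ PhiT_y
        E_w2 = mu**2 + np.diag(Sigma)                     # <w_i^2>
        # q(tau_i): GIG(1/2, <lambda_i^2>, <w_i^2>) under this parameterization
        E_inv_tau = np.sqrt(E_lam2 / E_w2)                # <1/tau_i>
        E_tau = np.sqrt(E_w2 / E_lam2) + 1.0 / E_lam2     # <tau_i>
        # q(lambda_i): Rayleigh, so <lambda_i^2> = 2 / <tau_i>
        E_lam2 = 2.0 / E_tau
        # q(beta): Gamma update with the expected squared residual
        resid = np.sum((y - Phi @ mu) ** 2) + np.trace(PhiT_Phi @ Sigma)
        E_beta = (a0 + N / 2.0) / (b0 + resid / 2.0)
    return mu, Sigma, E_beta

# Toy usage with a synthetic sparse problem:
rng = np.random.default_rng(0)
Phi = rng.standard_normal((40, 80)) / np.sqrt(40)
w_true = np.zeros(80)
w_true[rng.choice(80, 6, replace=False)] = rng.standard_normal(6)
y = Phi @ w_true + 0.01 * rng.standard_normal(40)
mu, _, _ = vb_double_lomax_like(Phi, y)
print("largest recovered coefficients:", np.argsort(-np.abs(mu))[:6])
```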

5. Relations to ARD Update Procedure

Now let us examine the relationship between the proposed Double Lomax updating procedure and the ARD updating procedure.

Considering (57) and (59), which are induced by the Double Lomax priors, it can be shown that if just (57) and (59) are iterated to convergence, then the final result is equivalent to directly applying (66), which is the exact ARD update [26]. To prove this, construct an auxiliary function of the two latent variables. It can be shown that (57) partially minimizes this function with respect to one latent variable and (59) partially minimizes it with respect to the other. If we perform the complete minimization over both variables, then we directly obtain (66). Therefore, when a noninformative hyperprior is assumed, the proposed Double Lomax updates have only one more latent variable than the canonical ARD updates; this latent variable vanishes if we iterate only the updating equations (57) and (59) to convergence.
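
For reference, the canonical ARD update referred to here sets each prior variance to the corresponding posterior second moment of the coefficient; in the notation assumed above, one standard EM-style form from the sparse Bayesian learning literature is

```latex
% EM-style ARD update for the prior variance of coefficient i:
\[
  \tau_i \;\leftarrow\; \langle w_i^{2} \rangle \;=\; \mu_i^{2} + \Sigma_{ii},
\]
where $\mu_i$ and $\Sigma_{ii}$ are the posterior mean and variance of $w_i$ under the current Gaussian factor.
```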

However, if we consider the full iteration procedure involving all the updating equations, (54), (55), (57), (59), (61), (63), and (65), then a difference appears: the solution trajectories of Double Lomax are not necessarily consistent with those of ARD. To see this, we rewrite ARD's (66) and Double Lomax's (57) and (59) from an iteration perspective, yielding (67) for ARD and (68)–(69) for Double Lomax, where the superscript denotes the iteration index.

It can be seen from (69) that the extra latent variable of Double Lomax can be expressed as a function that involves the updates from all former iterations. Considering this expression and comparing (67) with (68), the difference between ARD and Double Lomax becomes clear: in contrast to the ARD update, which depends only on the current coefficient statistics, the Double Lomax update depends not only on the current statistics but also on all the former updates. Consequently, the Double Lomax method is able to smooth the solution trajectory of the variance latent variables and, through the alternating-update strategy of VB inference, the trajectories of all other variables as well. With these smoother trajectories, the algorithm is less likely to deviate strongly from the right direction when there are disturbances or unreliable updates. As will be shown experimentally, compared to ARD, Double Lomax tends to move more carefully around local minima and provides better convergence performance.

In the following, we use linear regression as an example. The coefficient vector to be estimated is sparse, the basis matrix is a 12 × 10 random matrix whose entries are chosen independently at random, and the noise is zero-mean Gaussian with a standard deviation of 0.01. The VB-Double Lomax formulation established in the above section (denoted DL) and an equivalent formulation with the ARD prior (denoted ARD) are used here for comparison. The maximum number of iterations is set to 50 for both algorithms. Figure 4 shows the estimation results of 50 trials, and the estimation mean (denoted mean) and standard deviation (denoted std) are also reported. Neither DL nor ARD can recover the exact coefficients due to the noise. In general, DL shows more cases of convergence to the global minimum, while ARD shows more cases of convergence to local minima; these local solutions deviate considerably from the global minimum, but they occur with small probability and therefore do not change the mean values much. Figure 5(a) reports the coefficient update traces of a typical case in which ARD converges to a local minimum but DL does not. Figure 5(b) reports the coefficient update traces of a typical case in which both ARD and DL converge to the global minimum. In both situations, DL exhibits smoother coefficient update traces, in agreement with the above analysis. Furthermore, the update traces of the latent variables and the noise parameter when both ARD and DL converge to the global minimum (the case in Figure 5(b)) are reported in Figure 6. Compared to the traces of ARD's variance variables, smoother traces are observed for DL, even when plotted on a log scale. The trace shapes of DL's two latent variables are similar to each other, but differences can be seen in Figures 6(d) and 6(e) (traces for nonzero coefficients are shown in Figure 6(d) and for zero coefficients in Figure 6(e) for better visualization). If the extra latent variable is forced to take its fixed-point value at every iteration, an instance of the ARD update is obtained. It is worth noting that, since an extra level of iterations is introduced by the additional latent variable, a slower convergence speed of DL can be expected in general.

6. Computer Simulations

In this section, the proposed VB-Double Lomax formulation (denoted DL) is compared with the equivalent formulation with the ARD prior (denoted ARD). Furthermore, a hybrid formulation (denoted hybrid), in which the first iterations employ DL and the remaining iterations employ ARD until convergence, is also considered as a tradeoff between performance and speed. This hybrid approach is based on the observation that ARD tends to converge to local minima during the early iterations. Two kinds of SLM applications with different matrices are considered here. One is AR model identification, where the rows of the matrix $\boldsymbol{\Phi}$ are strongly correlated; the other is the standard compressive sensing problem, where the rows of the matrix $\boldsymbol{\Phi}$ are uncorrelated. In the experiments reported below, the number of initial DL iterations in the hybrid scheme is fixed, and an experimental convergence criterion, sustained for more than 50 iterations, is used to terminate the iterative procedure.

The autoregressive model is widely used to model time series data; it attempts to predict future values from previous values. An AR model of order $P$ is defined as
$$y_t = \sum_{p=1}^{P} w_p\, y_{t-p} + n_t,$$
where $y_t$ is the $t$th observation (measurement) in the series, $w_1,\dots,w_P$ are the AR coefficients, and $n_t$ is the excitation noise. Using matrix notation, we obtain the form of (1), where the $t$th row of the design matrix contains the lagged observations $y_{t-1},\dots,y_{t-P}$ and the AR coefficient vector is to be estimated. If the model order is unknown, the strategy used here is to choose an order $P$ that is large enough to exceed the true order and then compute the coefficient vector of length $P$; the extra coefficients are expected to have estimates of zero.
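
A small Python helper (illustrative; the names and the example series are ours) that builds the lag matrix for a chosen maximum order is:

```python
# Build the AR design (lag) matrix: row t contains the P previous observations
# [x[t-1], ..., x[t-P]], so that x[t] is approximately lag_row(t) @ w for AR coefficients w.
import numpy as np

def build_ar_design(x, P):
    """Return (Phi, y) for an AR(P) regression from the series x."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    Phi = np.column_stack([x[P - p - 1 : T - p - 1] for p in range(P)])
    y = x[P:]
    return Phi, y

# Example: AR(2) series with an overestimated order P = 6.
rng = np.random.default_rng(0)
x = np.zeros(300)
for t in range(2, 300):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + 0.1 * rng.standard_normal()
Phi, y = build_ar_design(x, P=6)
w_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]   # plain least squares, for illustration
print(np.round(w_ls, 3))   # first two entries should be near 0.6 and -0.3
```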

Figure 7 shows an example of AR identification using DL, where the excitation noise follows a Student-$t$ distribution. The left panel of the figure shows that the algorithm gives relatively satisfactory estimates for the true coefficients, while the estimates of the extra coefficients are set close to zero. The center panel plots the expected values of the one-step-ahead predictions compared with the observations, and the right panel compares the estimated excitation noise distribution with the one used to generate the observations.

To compare DL, ARD, and hybrid, we use three AR models of orders 5, 10, and 15. In each case the excitation noise is Student-$t$ distributed. The coefficient estimation error is calculated as the normalized error between the estimated and true coefficient vectors, and the prediction mean square errors (MSEs) are calculated from the one-step-ahead predictions and the actual observations. Each experiment is carried out 200 times, and the average results for AR(5), AR(10), and AR(15) are shown in Figures 8(a), 8(b), and 8(c), respectively. The panels on the left report the coefficient estimation error with error bars (one standard deviation). DL results in smaller errors than ARD, and the improvement is more significant for AR(10) and AR(15). It is also observed that hybrid gives almost the same performance as DL. As the number of observations increases, the coefficient estimation errors of DL, ARD, and hybrid decrease and their differences become less significant. Similar observations hold for the prediction MSEs, which are reported in the center panels of Figure 8. However, the prediction MSEs do not decrease monotonically with the number of observations. This may be because the current SLM formulation is based on the overall coefficient estimation error and does not take the partial autocorrelation structure of the AR model into consideration. Finally, the right panels of the figure report the number of iterative steps required for convergence versus the number of observations. For the same number of observations, DL requires more iterations than ARD to converge; however, hybrid achieves a convergence rate comparable to that of ARD. In general, as the number of observations increases, the convergence speeds of DL, ARD, and hybrid also increase.

CS is a technique for acquiring and recovering sparse signals through underdetermined linear systems of the form (1), where $\mathbf{y}$ contains the observations, $\boldsymbol{\Phi}$ is the measurement matrix with $N < M$, $\mathbf{n}$ represents the acquisition noise, and $\mathbf{w}$ is the unknown sparse signal; that is, most of its coefficients are zero.
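
The following Python sketch generates a CS problem of the kind described in the next paragraph; the specific sizes N, M, and K are illustrative placeholders, not the paper's settings.

```python
# Generate a compressive sensing problem: measurement matrix with i.i.d. columns
# drawn uniformly from the unit sphere, and a K-sparse signal with random support.
# Sizes N, M, K below are illustrative placeholders, not the paper's values.
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 128, 512, 20
Phi = rng.standard_normal((N, M))
Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)   # unit-norm columns

w = np.zeros(M)
support = rng.choice(M, K, replace=False)
w[support] = np.sign(rng.standard_normal(K))        # type (1): uniform +/-1 spikes

sigma = 0.005                                       # set to 0 for the noiseless case
y = Phi @ w + sigma * rng.standard_normal(N)
print(Phi.shape, np.count_nonzero(w))
```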

We use the following default CS setup in our experiments. The measurement matrix $\boldsymbol{\Phi}$ is formed by sampling i.i.d. columns from a unit sphere. A fixed number of coefficients at random locations of the signal are drawn from four different probability distributions, and the rest of the coefficients are set to zero. The nonzero coefficients of the sparse signals are realizations of the following four distributions: (1) uniform ±1 random spikes, (2) nonnegative unit spikes, (3) zero-mean unit-variance Gaussian, and (4) zero-mean Student-$t$ with unit scale parameter and shape parameter equal to 3. We fix the signal length and the number of nonzero coefficients and vary the number of measurements. Moreover, both noiseless and noisy acquisitions are considered, where zero-mean white Gaussian noise with a standard deviation of 0.005 is used in the noisy experiment. The reconstruction error is calculated as the normalized error between the estimated and true signals. Each experiment is carried out 200 times and the average results are reported. Figure 9 reports the number of measurements versus the reconstruction error with error ranges (one standard deviation) for the noise-free case. DL provides improved overall reconstruction performance over ARD for a reasonable number of measurements; the improvement is more significant for signal types (1) and (2) and less significant for types (3) and (4) (but the differences can still be seen from the error ranges). The hybrid approach provides performance close to that of DL. Figure 10 reports the number of iterative steps required for convergence versus the number of measurements for the noise-free case. DL shows a slower convergence rate than ARD, and hybrid has a rate comparable to that of ARD. Figure 11 shows the number of estimated nonzero coefficients (those exceeding a small threshold) for the noise-free case. Overall, DL gives the sparsest estimation results, followed by hybrid and then ARD. Figure 12 reports the results for noisy acquisitions. Observations similar to the noise-free case (shown in Figure 9) are found in this figure, except that this time none of the algorithms can recover the coefficients exactly due to the noise.

7. Conclusion

In this paper, we propose a new non-log-concave sparsity prior, referred to as the Double Lomax prior, which corresponds to a three-level hierarchical Bayes model. A VB inference procedure is introduced to solve the SLM using Double Lomax priors. When a noninformative hyperprior is assumed, the proposed update procedure includes the canonical ARD update procedure as a special case. However, compared to the canonical ARD update, the proposed update procedure has one more latent variable, whose smoothing effect can result in improved performance. A hybrid formulation of Double Lomax and ARD is also considered as a tradeoff between performance and computational speed. Experiments on SLM simulations with both correlated and uncorrelated design matrices, with applications to AR model identification and compressive sensing, demonstrate the effectiveness of the proposed approach.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive suggestions. This work is supported by the National Natural Science Foundation of China (no. 61205017), the China Postdoctoral Science Foundation funded project (no. 2012M511058), and the Shanghai Postdoctoral Sustentation Fund funded project (no. 12R21412500).