Bayesian Inference for Nonnegative Matrix Factorisation Models

Cemgil, Ali Taylan

doi:https://doi.org/10.1155/2009/785152

Computational Intelligence and Neuroscience

On this page

Abstract Introduction Conclusions Appendix Acknowledgments References Copyright Related Articles

Research Article | Open Access

Volume 2009 | Article ID 785152 | https://doi.org/10.1155/2009/785152

Bayesian Inference for Nonnegative Matrix Factorisation Models

Ali Taylan Cemgil¹

Academic Editor: S. Cruces-Alvarez

Received29 Aug 2008

Accepted14 Feb 2009

Published27 May 2009

Abstract

We describe nonnegative matrix factorisation (NMF) with a Kullback-Leibler (KL) error measure in a statistical framework, with a hierarchical generative model consisting of an observation and a prior component. Omitting the prior leads to the standard KL-NMF algorithms as special cases, where maximum likelihood parameter estimation is carried out via the Expectation-Maximisation (EM) algorithm. Starting from this view, we develop full Bayesian inference via variational Bayes or Monte Carlo. Our construction retains conjugacy and enables us to develop more powerful models while retaining attractive features of standard NMF such as monotonic convergence and easy implementation. We illustrate our approach on model order selection and image reconstruction.

1. Introduction

In machine learning, nonnegative matrix factorisation (NMF) was introduced by Lee and Seung [1] as an alternative to k-means clustering and principal component analysis (PCA) for data analysis and compression (also see [2]). In NMF, given a nonnegative matrix , where , , we seek positive matrices and such thatwhere . We will refer to the matrix as the template matrix, and matrix the excitation matrix. The key property of NMF is that and are constrained to be positive matrices. This is in contrast with PCA, where there are no positivity constraints or k-means clustering where each column of is constrained to be a unit vector. Subject to the positivity constraints, we seek a solution to the following minimisation problem:Here, the function is a suitably chosen error function. One particular choice for , on which we will focus here, is the information (Kullback-Leibler) divergence, which we write asUsing Jensen's inequality [3] and concavity of , it can be shown that is nonnegative and if and only if . The objective in (2) could be minimised by any suitable optimisation algorithm. Lee and Seung [1] have proposed a very efficient variational bound minimisation algorithm that has attractive convergence properties and which has been successfully applied in various applications in signal analysis and source separation, for example, [4–6].

The interpretation of NMF, like singular value decomposition (SVD), as a low rank matrix approximation is sufficient for the derivation of a useful inference algorithm; yet this view arguably does not provide the complete picture about assumptions underlying the statistical properties of . Therefore, we describe NMF from a statistical perspective as a hierarchical model. In our framework, the original nonnegative multiplicative update equations of NMF appear as an expectation-maximisation (EM) algorithm for maximum likelihood estimation of a conditionally Poisson model via data augmentation. Starting from this view, we develop Bayesian extensions that facilitate more powerful modelling and allow more sophisticated inference, such as Bayesian model selection. Inference in the resulting models can be carried out easily using variational (structured mean field) or Markov Chain Monte Carlo (Gibbs sampler). The resulting algorithms outperform existing NMF strategies and open up the way for a full Bayesian treatment for model selection via computation of the marginal likelihoods (the evidence), such as estimating the dimensions of the template matrix or regularising overcomplete representations via automatic relevance determination.

2. The Statistical Perspective

The interpretation of NMF as a low-rank matrix approximation is sufficient for the derivation of an inference algorithm; yet this view arguably does not provide the complete picture. In this section, we describe NMF from a statistical perspective. This view will pave the way for developing extensions that facilitate more realistic and flexible modelling as well as more sophisticated inference, such as Bayesian model selection.

Our first step is the derivation of the information divergence error measure from a maximum likelihood principle. We consider the following hierarchical model: Here, denotes the Poisson distribution of the random variable with nonnegative intensity parameter , whereand is the gamma function. The priors and will be specified later. We call the variables latent sources. We can analytically marginalise out the latent sources to obtain the marginal likelihoodThis result follows from the well-known superposition property of Poisson random variables [7], namely, when and then the marginal probability is given by . The maximisation of this objective in and is equivalent to the minimisation of the information divergence in (3). In the derivation of original NMF in [8], this objective is stated first; the variables are introduced implicitly later during the optimisation on and . In the sequel, we show that this algorithm is actually equivalent to EM, ignoring the priors and .

2.1. Maximum Likelihood and the EM Algorithm

The log-likelihood of the observed data can be written aswhere is an instrumental distribution, that is arbitrary provided that the sum on the right exists; can only vanish at a particular only when does so. Note that this defines a lower bound to the log-likelihood. It can be shown via functional derivatives and imposing the normalisation condition via Lagrange multipliers that the lower bound is tight for the exact posterior of the latent sources, that is,Hence the log-likelihood can be maximised iteratively as follows:

Here, , the expectation of some function with respect to . In the E step, we compute the posterior distribution of . This defines a lower bound on the likelihoodFor many models in the exponential family, which includes (5), the expectation on the right depends on the sufficient statistics of and is readily available; in fact calculating should be literally taken as calculating the sufficient statistics of . The lower bound is readily obtained as a function of these sufficient statistics, and maximisation in the M Step yields a fixed point equation.

2.1.1. The E Step

To derive the posterior of the latent sources, we observe thatFor the model in (5), we haveIt follows from (5), (12), (13), and (7)where are the cell probabilities. Here, denotes a multinomial distribution defined bywhere , , and . Here, , are the cell probabilities, and is the index parameter where . The Kronecker delta function is defined by when , and otherwise. It is a standard result that the marginal mean isthat is, the expected value of each source is a fraction of the observation, where the fraction is given by the corresponding cell probability.

2.1.2. The M Step

It is indeed a good news that the posterior has an analytic form. Since now the M step can be calculated easily as follows:Fortunately, for maximisation with respect to and , the last two difficult terms are merely constant, and we need only to maximise the simpler objectivewhere we only need the expected value of the sources given by the previous values of the templates and excitations:Maximisation of the objective and substituting give the following fixed point equations: Equation (20) is identical to the multiplicative update rules of [8]. However, our derivation via data augmentation obtains the same result as an EM algorithm. It is interesting to note that in literature, NMF is often described as EM-like; here, we show that it is actually just an EM algorithm. We see that the efficiency of NMF is due to the fact that the object needs not to be explicitly calculated as we only need its marginal statistics (sums across or ).

We note that our model is valid when is integer valued. See [9] for a detailed discussion about consequences of this issue. Here, we assume that for nonnegative real valued , we only consider the integer part, that is, we let , where is a noise matrix with entries uniformly drawn in . In practice, this is not an obstacle when the entries of are large.

The interpretation of NMF as a maximum likelihood method in a Poisson model is mentioned in the original NMF paper [1] and discussed in more detail by [5, 10]. The equivalence of NMF and probabilistic latent sematic analysis is shown in [11]. Kameoka in [5] focuses on the optimisation and gives an equivalent description using auxiliary function maximisation. In contrast, the auxiliary variables can be viewed as model variables (the sources ) that are analytically integrated out [10]. A general framework is described in [12]. Prior structures are placed on conditionally Gaussian NMF models to enforce sparsity in [13]. However, all of these approaches are based on regularisation, that is, aim at calculating a maximum a posteriori estimate. In contrast, we provided in this article a full Bayesian treatment where the templates and excitations are integrated out.

2.2. Hierarchical Prior Structure

Given the probabilistic interpretation, it is possible to propose various hierarchical prior structures to fit the requirements of an application. Here we will describe a simple choice where we have a conjugate prior as follows:Here, denotes the density of a gamma random variable with shape and scale defined byThe primary motivation for choosing a Gamma distribution is computational convenience: Gamma distribution is the conjugate prior to Poisson intensity. The indexing highlights the most general case where there are individual parameters for each element and . Typically, we do not allow many free hyperparameters but tie them depending upon the requirements of an application. See Figure 1 for an example. As an example, consider a model where we tie the hyperparameters such as , , , and for , , and . This model is simple to interpret, where each component of the templates and the excitations is drawn independently from the Gamma family shown in Figure 2. Qualitatively, the shape parameter controls the sparsity of the representation. Remember that has the mean and standard deviation . Hence, for large , all coefficients will have more or less the same magnitude , and typical representations will be full. In contrast, for small , most of the coefficients will be very close to zero, and only very few will be dominating, hence favouring a sparse representation. The scale parameter is adapted to give the expected magnitude of each component.

(a)

(b)

Figure 1

(a) A schematic description of the NMF model with data augmentation. (b) Graphical model with hyperparameters. Each source element

s_{ν, i, τ}

is Poisson distributed with intensity

t_{ν, i} v_{i, τ}

. The observations are given by

x_{ν, τ} = \sum_{i} s_{ν, i, τ}

. In matrix notation, we write

X = \sum S_{i}

. We can analytically integrate out over

S

. Due to superposition property of Poisson distribution, intensities add up, and we obtain

〈 X 〉 = T V

. Given

X

, the NMF algorithm is shown to seek the maximum likelihood estimates of the templates

T

and excitations

V

. In our Bayesian treatment, we further assume that elements of

T

and

V

are Gamma distributed with hyperparameters

Θ

.

To model missing data, that is, when some of the are not observed, we define a mask matrix , the same size as where , if is missing and otherwise (see Appendix A.4 for details). Using the mask variables, the observation model with missing data can be written as

The hierarchical model in (21) is more powerful than the basic model of (5), in that it allows a lot of freedom for more realistic modelling. First of all, the hyperparameters can be estimated from examples of a certain class of source to capture the invariant features. Another possibility is Bayesian model selection, where we can compare alternative models in terms of their marginal likelihood. This enables one to estimate the model order, for example, the optimum number of templates to represent a source.

3. Full Bayesian Inference

Below, we describe various interesting problems that can be cast to Bayesian inference problems. In signal analysis and feature extraction with NMF, we may wish to calculate the posterior distribution of templates and excitations, given data and hyperparameters . Another important quantity is the marginal likelihood (also known as the evidence), whereThe marginal likelihood can be used to estimate the hyperparameters, given examples of a source classor to compare two given models via Bayes factors

This latter quantity is particularly useful for comparing different classes of models. Unfortunately, the integrations required cannot be computed in closed form. In the sequel, we will describe the Gibbs sampler and variational Bayes as approximate inference strategies.

3.1. Variational Bayes

We sketch here the Variational Bayes (VB) [3, 14] method to bound the marginal log-likelihood aswhere is an instrumental distribution, and is its entropy. The bound is tight for the exact posterior but as this distribution is complex, we assume a factorised form for the instrumental distribution by ignoring some of the couplings present in the exact posterior as follows:where denotes set of disjoint clusters. Hence, we are no longer guaranteed to attain the exact marginal likelihood . Yet, the bound property is preserved, and the strategy of VB is to optimise the bound. Although the best distribution respecting the factorisation is not available in closed form, it turns out that a local optimum can be attained by the following fixed point iteration:where . This iteration monotonically improves the individual factors of the distribution, that is, for given an initialisation . The order is not important for convergence; one could visit blocks in arbitrary order. However, in general, the attained fixed point depends upon the order of the updates as well as the starting point . We choose the following update order in our derivations:

3.2. Variational Update Equations and Sufficient Statistics

The expectations of are functions of the sufficient statistics of (see the expression in the Appendix A.2). The update equation for the latent sources (30) leads to the following: These equations are analogous to the multinomial posterior of EM given in (14); only the computation of cell probabilities is different. The excitation and template distributions and their sufficient statistics follow from the properties of the gamma distribution:

3.3. Efficient Implementation

One of the attractive features of NMF is easy and efficient implementation. In this section, we derive that the update equations of Section 3.2 in compact matrix notation are to illustrate that these attractive properties are retained for the full Bayesian treatment. A subtle but key point in the efficiency of the algorithm is that we can avoid explicitly storing and computing the object , as we only need the marginal statistics during optimisation. Consider (33). We can writeHere, the denominator has to be nonzero. In the last line, we have represented the expression in compact notation where we define the following matrices:

The matrices subscripted with are in and with are in . For notational convenience, we define and as elementwise matrix multiplication and division, respectively, and as a vector of ones. After straightforward substitutions, we obtain the variational nonnegative matrix factorisation algorithm, that can compactly be expressed as in panel Algorithm 1.

(1) Initialise:
$L_{t}^{(0)} = E_{t}^{(0)} ~ 𝒢 (\cdot; A_{t}, B_{t} {. / A}_{t}) L_{v}^{(0)} = E_{v}^{(0)} ~ 𝒢 (\cdot; A_{v}, B_{v} {. / A}_{v})$
(2) for $n = 1 \dots$ MAXITER do
(3) Source sufficient statistics
$Σ_{t}^{(n)} : = L_{t}^{(n - 1)} . * (((X . * M) . / (L_{t}^{(n - 1)} L_{v}^{(n - 1)})) {L_{v}^{(n - 1)}}^{⊤})$
$Σ_{v}^{(n)} : = L_{v}^{(n - 1)} . * ({L_{t}^{(n - 1)}}^{⊤} ((X . * M) . / (L_{t}^{(n - 1)} L_{v}^{(n - 1)})))$
(4) Means
$E_{t}^{(n)} : = α_{t}^{(n)} . * β_{t}^{(n)} α_{t}^{(n)} = A_{t} + Σ_{t}^{(n)} β_{t}^{(n)} = 1. / (A_{t} . / B_{t} + M {E_{v}^{(n - 1)}}^{⊤})$
$E_{v}^{(n)} : = α_{v}^{(n)} . * β_{v}^{(n)} α_{v}^{(n)} = A_{v} + Σ_{v}^{(n)} β_{v}^{(n)} = 1. / (A_{v} . / B_{v} + {E_{t}^{(n)}}^{⊤} M)$
(5) Optional: Compute Bound (See Appendix, (A.13)
(6) Means of Logs
$L_{t}^{(n)} = exp (Ψ (α_{t}^{(n)})) . * β_{t}^{(n)} L_{v}^{(n)} = exp (Ψ (α_{v}^{(n)})) . * β_{v}^{(n)}$
(7) Optional: Update Hyperparameters (See Appendix, Section )
(8) end for

Similarly, an iterative conditional modes (ICM) algorithm can be derived to compute the maximum a posteriori (MAP) solution (see Appendix A.4):Note that when the shape parameters go to zero, that is, , we obtain the maximum likelihood NMF algorithm.

3.4. Markov Chain Monte Carlo, the Gibbs Sampler

Monte Carlo methods [15, 16] are powerful computational techniques to estimate expectations of formwhere are independent samples drawn from . Under mild conditions on , the estimate converges to the true expectation for . The difficulty here is obtaining independent samples from complicated distributions.

The Markov Chain Monte Carlo (MCMC) techniques generate subsequent samples from a Markov chain defined by a transition kernel, that is, one generates conditioned on as follows:Note that the transition kernel is not needed explicitly in practice; all is needed is a procedure to sample a new configuration, given the previous one. Perhaps surprisingly, even though subsequent samples are correlated, provided that satisfies certain ergodicity conditions, (39) remains still valid, and estimated expectations converge to their true values when number of samples goes to infinity [15]. To design a transition kernel such that the desired distribution is the stationary distribution, that is, , many alternative strategies can be employed [16]. One particularly convenient and simple procedure is the Gibbs sampler where one samples each block of variables from full conditional distributions. For the NMF model, a possible Gibbs sampler is Note that this procedure implicitly defines a transition kernel . It can be shown [15] that the stationary distribution of is the exact posterior . Eventually, the Gibbs sampler converges regardless of the order that the blocks are visited, provided that each block is visited infinitely often in the limit . However, the rate of convergence is very difficult to assess as it depends upon the order of the updates as well as the starting configuration . It is instructive to contrast above (41) with the variational update of (30)–(32): algorithmically the two approaches are quite similar. The pseudo-code is given in Algorithm 2.

(1) Initialize:
$T^{(0)} = ~ 𝒢 (\cdot; A_{t}, B_{t}) V^{(0)} ~ 𝒢 (\cdot; A_{v}, B_{v})$
(2) for $n = 1 \dots$ MAXITER do
(3) Sample Sources
(4) for $τ = 1 \dots K$ , $ν = 1 \dots W$ do
(5) $p_{ν,1 : I, τ}^{(n)} = T^{(n - 1)} (ν,1 : I) . * V^{(n - 1)} {(1 : I, τ)}^{⊤} . / (T^{(n - 1)} (ν,1 : I) V^{(n - 1)} (1 : I, τ))$
(6) $S^{(n)} (ν,1 : I, τ) ~ ℳ (s_{ν,1 : I, τ}; x_{ν, τ}, p_{ν,1 : I, τ}^{(n)})$
(7) end for
$Σ_{t}^{(n)} = \sum_{τ} S_{ν, i, τ}^{(n)} Σ_{v}^{(n)} = \sum_{ν} S_{ν, i, τ}^{(n)}$
(8) Sample Templates
$\begin{matrix} α_{t}^{(n)} = A_{t} + Σ_{t}^{(n)} β_{t}^{(n)} = 1. / (A_{t} . / B_{t} + 1_{W} {(V^{(n - 1)} 1_{K})}^{⊤}) \end{matrix}$
$T^{(n)} ~ 𝒢 (T; α_{t}^{(n)}, β_{t}^{(n)})$
(9) Sample Excitations
$\begin{matrix} α_{v}^{(n)} = A_{v} + Σ_{v}^{(n)} β_{v}^{(n)} = 1. / (A_{v} . / B_{v} + {(1_{W}^{⊤} T^{(n - 1)})}^{⊤} 1_{K}^{⊤}) \end{matrix}$
$\begin{matrix} V^{(n)} ~ 𝒢 (V; α_{v}^{(n)}, β_{v}^{(n)}) \end{matrix}$
(10) end for

3.4.1. Marginal Likelihood Estimation with Chib's Method

The marginal likelihood can be estimated from the samples generated by the Gibbs sampler using a method proposed by Chib [17]. Suppose we have run the block Gibbs sampler until convergence and have samples as follows:The marginal likelihood is (omitting hyperparameters )This equation holds for all points . We choose a point in the configuration space; provided that the distribution is unimodal, a good candidate is a configuration near the mode . The numerator in (43) is easy to evaluate. The denominator isThe first term is full conditional, so it is available for the Gibbs sampler. The third term isThe second term is trickierThe first term here is full conditional. However, the original Gibbs run gives us only samples from , not . The idea is to run the Gibbs sampler for further iterations where we sample from , that is, with clamped at . The resulting estimate is Chib's method estimates the marginal likelihood as follows:

4. Simulations

Our goal is to illustrate our approach in a model selection context. We first illustrate that the variational approximation to the marginal likelihood is close to the one obtained from the Gibbs sampler via Chib's method. Then, we compare the quality of solutions we obtain via Variational NMF and compare them to the original NMF on a prediction task. Finally, we focus on reconstruction quality in the overcomplete case where the standard NMF is subject to overfitting.

Model Order Determination
To test our approach, we generate synthetic data from the hierarchical model in (21) with , , and the number of sources . The inference task is to find the correct number of sources, given . The hyperparameters of the true model are set to , , , and . In the first experiment, the hyperparameters are assumed to be known and in the second are jointly estimated from data, using hyperparameter adaptation. We evaluate the marginal likelihood for models with the number of templates , with the Gibbs sampler using Chib's method and variational lower bound via variational Bayes. We run the Gibbs sampler for steps following a burn-in period of steps; then we clamp the sources and continue the simulation for a further steps to estimate quantities required by Chib's method. We run the variational algorithm until convergence of the bound or iterations, whichever occurs first. In Figure 3(a), we show a comparison of the variational estimate with the average of independent runs obtained via Chib's method. We observe, that both methods give consistent results. In Figure 4, we show the lower bound as a function of model order , where for each , the bound is optimised independently by jointly optimising hyperparameters , and using the equations derived in the appendix. We observe, that the correct model order can be inferred even when the hyperparameters are unknown a priori. This is potentially useful for estimation of model order from real data.

(a)

(b)

(c)

Figure 3

Model selection results. (a) Comparison of model selection by variational bound (squares) and marginal likelihood estimated by Chib's (circles) method. The hyperparameters are assumed to be known. (b) Box-plot of marginal likelihood estimated by Chib's method using

5000,

10 000

, and

10 000

iterations for burn-in, free, and clamped sampling. The boxes show the lower quartile, median, and upper quartile values. (c) Model selection by variational bound when hyperparameters are unknown and jointly estimated.

(a)

(b)

As real data, we use a version of the Olivetti face image database ( images of pixels available at http://www.cs.toronto.edu/~roweis/data/olivettifaces.mat). We further downsampled to or pixels, hence our data matrix is or . We use a model with tied hyperparameters as , , , and , where all hyperparameters are jointly estimated. In Figure 4, bottom, we show results of model order determination for this dataset with joint hyperparameter adaptation. Here, we run the variational algorithm for each model order independently and evaluate the lower bound after optimising the hyperparameters. The Gibbs sampler is not found practical and is omitted here. The lower bound behaves as is expected from marginal likelihood, reflecting the tradeoff between too many and too few templates. Higher resolution implies more templates, consistent with our intuition that detail requires more templates for accurate representation.

We also investigate the nature of the representations (see Figure 5). Here, for each independent run, we fix the values of shape parameters to and only estimate and . This corresponds to enforcing sparse or nonsparse and . Each column shows templates estimated from the dataset conditioned on hyperparameters. The middle image is the same template image above weighted with the excitations corresponding to the reconstruction (the expected value of the predictive distribution) below. Here, we clearly see the effect of the hyperparameters. In the first condition , the prior does not enforces sparsity to the templates and excitations. Hence, for the representation of a given image, there are many active templates. In the second condition, we try to force both matrices to be sparse with . Here, the result is not satisfactory as isolated components of the templates are zeroed, giving a representation that looks like one contaminated by “salt-and-pepper” noise. The third condition () forces only the excitations to be sparse. Here, we observe that the templates correspond to some average face images. Qualitatively, each image is reconstructed using a superposition of a few of these templates. In the final representation, we enforce sparsity in the templates but not in the excitations. Here, our estimate finds templates that correspond to parts of individual face images (eyebrows, lips, etc.). This solution, intuitively corresponding to a parsimonious representation, also is the best in terms of the marginal likelihood. With proper initialisation, our variational procedure is able to find such solutions.

Prediction
We now compare variational Bayesian NMF with the maximum likelihood NMF on a missing data prediction task.

To illustrate the self regularisation effect, we set up an experiment in which we select a subset of the face data consisting of images. From half of the images, we remove the same patch (Figure 6) and predict the missing pixels. This is a rather small dataset for this task, as we have only images for each of the different persons, and half of these images have missing data at the same spot. We measure the quality of the prediction in terms of signal-to-noise ratio (SNR). The missing values are reconstructed using the mean of the predictive distribution where and are point estimates of the template and excitation matrix. We compare our variational algorithm with the classical NMF. For each algorithm, we test two different versions. The variational algorithms differ in how we estimate and . In the first variational algorithm, we use a crude estimate of and as the mean of the approximating distribution. In the second condition, after convergence of hyperparameters via VB, we reinitialise and randomly and switch to an ICM algorithm (see (38). This strategy finds a local mode of the exact posterior distribution. In NMF, we test two initialisation strategies: in the first condition, we initialise the templates randomly; in the second, we set them equal to the images in the dataset with random perturbations.

(a)

(b)

(c)

In Figure 6, we show the reconstruction results for a typical run, for a model with templates. Note that this is an overcomplete model, with twice as many templates as there are images. To characterise the nature of the estimated template and excitation matrices, we use the sparseness criteria [18] of an matrix , defined as . This measure is when the matrix has only a single nonzero entry and when all entries are equal. We see that the variational algorithms are superior in this case in terms of SNR as well as the visual quality of the reconstruction. This is perhaps not surprising, since with maximum likelihood estimation; if the model order is not carefully chosen, generalisation performance is poor: the “noise” in the observed data is fitted but the prediction quality drops on new data. An interesting observation is that highly sparse solutions (either in templates or excitations) do not give the best result; the solution that balances both seems to be the best in this setting. This example illustrates that sparseness in itself may not be necessarily a good criteria to optimise; for model selection, the marginal likelihood should be used as the natural quantity.

On the same face dataset, we compare the prediction error in terms of the SNR for varying model order . Our goal is to compare the prediction performance of the full Bayesian approach with the ML-NMF for a range of conditions (under-complete, complete, and overcomplete). The results shown in Figure 7 are averages of several runs with hyperparameter adaptation and different hyperparameter tying. In the simulations, the shape parameters are tied always as (and ). The scale parameters are untied or tied as (across sources) or (different for each source) and jointly optimised. Regardless of the hyperparameter tying structure, the results were quite similar. The best SNR values are attained with untied scale parameters for both excitations and templates.

We observe that, due to the implicit self-regularisation in the Bayesian approach, the prediction performance is not very sensitive to the model order and is immune to overfitting. In contrast, the ML-NMF with random initialisation is prone to overfitting, and prediction performance drops with increasing model order. Interestingly, when we initialise the ML-NMF algorithm to true data points with small perturbations, the prediction performance in terms of SNR improves. Note that this strategy would not be possible for data where the pixels were truly missing. However, visual inspection shows that the interpolation can still be “patchy” (see Figure 6).

We observe that hyperparameter adaptation is crucial for obtaining good prediction performance. In our simulations, results for VB without hyperparameter adaptation were occasionally poorer than the ML estimates. Good initialisation of the shape hyperparameters seems to be also important. We obtain best results when initialising the shape hyperparameters asymmetrically, for example, and (see 3rd and 4th panels from left in Figure 5). When the shape hyper-parameters are initialised to small , the EM seems to get stuck in a local minima more often. Consequently, the prediction results are poorer. We have also carried out tests with more undercomplete representations when the model order is low . For these simulations, while the marginal likelihood was in favour of the VB solutions, we have not observed statistically significant differences between VB and ML in terms of SNR. The SNR improvement of VB over ML was on average about dB only.

5. Discussion and Conclusions

In this paper, we have investigated KL-NMF from a statistical perspective. We have shown that KL minimisation formulation the original algorithm can be derived from a probabilistic model where the observations are superposition of independent Poisson-distributed latent sources. Here, the template and excitation matrices turn out to be latent intensity parameters. The interpretation of NMF as a maximum likelihood method in a Poisson model is mentioned in the original NMF paper [1] and discussed in more detail by [5, 10], and [5] focuses on the optimisation and gives an equivalent description using auxiliary function maximisation. In contrast, [10] illustrates that the auxiliary variables can be viewed as model variables (the sources ) that are analytically integrated out. The relationship between KL divergence and the Poisson distribution is not just a lucky coincidence. There exists a duality between divergence functions and exponential family distributions. If a cost function is a Bregman divergence, there exists a regular exponential family where minimising the cost corresponds to maximum likelihood parameter estimation [19]; also see [12] it in the context of matrix factorisation models.

The novel observation in the current article is the exact characterisation of the approximating distribution or full conditionals as a product of multinomial distributions, leading to a richer approximation distribution than a naive mean field or single site Gibbs (which would freeze due to deterministic ). This conjugate form leads to significant simplifications in full Bayesian integration. Apart from the conditionally Gaussian case, NMF with KL objective seems to be unique in this respect. For several other distance metrics , we find that full Bayesian inference not as practical as is not standard.

We have also shown that the standard KL-NMF algorithm with multiplicative update rules is in fact an EM algorithm with data augmentation. Extending upon this observation, we have developed an hierarchical model with conjugate Gamma priors. We have developed a variational Bayes algorithm and a Gibbs sampler for inference in this hierarchical model. We have also developed methods for estimating the marginal likelihood for model selection. This is an additional feature that is lacking in existing NMF approaches with regularisation, where only MAP estimates are obtained, such as [13, 18, 20].

Our simulations suggest that the variational bound seems to be a reasonable approximation to the marginal likelihood and can guide model selection for NMF. The computational requirements are comparable to the ML-NMF. A potentially time-consuming step in the implementation of the variational algorithm is the evaluation of the function but this step can also be replaced by a simple piecewise polynomial approximation since for .

We first compare the variational inference with a Gibbs sampler. In our simulations, we observe that both algorithms give qualitatively very similar results, both for inference of templates and excitations as well as model order selection. We find the variational approach somewhat more practical as it can be expressed as simple matrix operations, where both the fixed point equations as well as the bound can be compactly and efficiently implemented using matrix computation software. In contrast, our Gibbs sampler is computationally more demanding, and the calculation of marginal likelihood is somewhat more tricky. With our implementation of both algorithms, the variational method is faster by a factor of around . Reference implementations of both algorithms in Matlab are available from the following url: http://www.cmpe.boun.edu.tr/~cemgil/bnmf/.

In terms of computational requirements, the variational procedure has several advantages. First, we circumvent sampling from multinomial variables, which is the main computational bottleneck with the Gibbs sampler. Whilst efficient algorithms are developed for multinomial sampling [21], the procedure is time consuming when the number of latent sources is large. In contrast, the variational method estimates the expected sufficient statistics directly by elementary matrix operations. Another advantage is hyperparameter estimation. In principle, it is possible to maximise the marginal likelihood via a Monte Carlo EM procedure [22, 23]; yet this potentially requires many more iterations. In contrast, the evaluation of the derivatives of the lower bound is straightforward and can be implemented without much additional computational cost.

The efficiency of the Gibbs sampler could be improved by working out the distribution of the sufficient statistics of sources directly (namely, quantities or ) to circumvent multinomial sampling. Unfortunately, for the sum of binomial random variables with different cell probability parameters, the sum does not have a simple form but various approximations are possible [24].

Inference based on VB is easy to implement but at the end of the day, the fixed point iteration is just a gradient-based lower bound optimisation procedure, and second order Newton methods can provide more efficient alternatives. For NMF models, there exist many conditional independence relations, hence the Hessian matrix has a special block structure [12]. It is certainly interesting to develop efficient inference methods that make use of the special block structure of the Hessian matrix. However, as our primary goal was a practical full Bayesian treatment, we have not investigated this path yet. Another approach in this direction is using alternative deterministic integration techniques such as expectation propagation (EP) [25]. Those techniques work directly on an approximation of the true marginal likelihood rather than a bound. A related approach known as expectation consistent (EC) inference is used with success in related source separation problems [26].

From a modelling perspective, our hierarchical model provides some attractive properties. It is easy to incorporate prior knowledge about individual latent sources via hyperparameters, and one can easily capture variability in the templates and excitations that is potentially useful for developing robust techniques. The prior structure here is qualitatively similar to an entropic prior [20, 27], and we find qualitatively similar representations to the ones found by NMF reported earlier by [1, 18]. However, none of the above mentioned methods provide an estimate of the marginal likelihood, which is useful for model selection. Our generative model formulation can be extended in various ways to suit the specific needs of particular applications. For example, one can enforce more structured prior models such as chains or fields [10]. As a second possibility, the Poisson observation model can be replaced with other models such as clipped Gaussian, Gamma, or Gaussians which lead to alternative source separation algorithms. For example, the case of Gaussian sources where the excitations and templates correspond to the variances is discussed in [28].

Our main contribution here is the development of a principled and practical way to estimate both the optimal sparsity criteria and model order, in terms of marginal likelihood. By maximising the bound on marginal likelihood, we have a method where all the hyperparameters can be estimated from data, and the appropriate sparseness criteria is found automatically. We believe that our approach provides a practical improvement to the highly popular KL-NMF algorithm without incurring much additional computational cost.

Appendix

A. Standard Distributions in Exponential Form, Their Sufficient Statistics and Entropies

(i)Gamma Here, denotes the digamma function defined as .(ii)Poisson (iii)Multinomial Here, , , and . Here, , are the cell probabilities, and is the index parameter where . The entropy is given as follows: A closed form expression for the entropy is not known due to terms but asymptotic expansions exist [29, 30]. Computationally efficient sampling from a multinomial distribution is not trivial; see [21] for a comparison of various methods and detailed discussion of tradeoffs.

A.1. Summary of the Generative Model

We have the following. Indices: , source index;, Row (frequency bin) index;, Column (time frame) index;: template variable at th row of the th source: excitation variable of the th source at th column: source variable of th source at th row (frequency bin) and th column (time frame): observation at th row (frequency bin) and th column (time frame)

A.2. Expression of the Full Joint Distribution

Here, ,

A.3. The Variational Bound

The variational bound in (27) can be written aswhere the energy term is given by the expectation of the expression in Appendix A.2, and denotes the entropy of the variational approximation distribution where the individual entropies are defined in Appendix A:One potential problem is that this expression requires the entropy of a multinomial distribution for which there is no known simple expression. This is due to terms of form where only asymptotic expansions are known. Fortunately, the difficult terms in the energy term can be canceled by the corresponding terms in the entropy term, and one obtains the following expression that only depends on known sufficient statistics:After some careful manipulations, the following expression is obtained where denotes here elementwise logarithm of matrix :

A.4. Handling Missing Data and Map Estimation

When there is missing data, that is, when some of the are not observed, computation is still straightforward in our framework and can be accomplished by a simple modification to the original algorithm. We first define a mask matrix , same size as , whereUsing the mask variables, the observation model with missing data can be written as follows:The prior is not affected. Hence, we merely replace the first two lines of the expression for the full joint distribution (given in the Appendix A.2) as follows:Consequently, it is easy to see thatBy a derivation analogous to one detailed in Section 3.3, we see that the excitation update equations in Algorithm 1, line 4, can be written using matrix notation as follows:The update rules for the templates are similar. Note that when there is no missing data, we have which gives the original algorithm. The bound in (A.13) can also be easily modified for handling missing data. We merely replace and the first term .

We conclude this subsection by noting that the standard NMF update equations, given in (20), can be also rewritten to handle missing data as follows:Here, the denominator has to be nonzero. Similarly, an iterative conditional modes (ICM) algorithm can be derived to compute the maximum a posteriori (MAP) solution as follows:Note that when the shape parameters go to zero, that is, , we obtain the maximum likelihood NMF algorithm.

A.5. Hyperparameter Optimisation

The hyperparameters can be estimated by maximising the bound in 13. Below, we will derive the results for the excitations; the results for templates are similar. The solution for shape parameters involves finding the zero of a function , whereThe solution can be found by Newton's method by iteration of the following fixed point equation:It is well known that Newton iterations can diverge if started away from the root. Occasionally, we observe that can become negative. If this is the case, we set , and try again. The digamma function and its derivative are available in numeric computation libraries (e.g., in Matlab as and , resp.).

The derivation of the hyperparameter update equations is straightforward: Tying parameters across as and yields Tying parameters across and , , and yields The derivation of the template parameters is exactly analogous. We can express the update equations once again in compact matrix notation as follows: Here, we assume is a matrix-valued function that finds root for each element of , starting from the initial matrix . If is a scalar or vector, it is repeated over the missing index to implement parameter tying. For example, if is a vector and is , we assume for all , and the output is the same size as . This is only a notational convenience; an actual implementation can be achieved more efficiently. Again, the implementation of the template parameters is exactly analogous; merely replace above the subscripts as , and .

Acknowledgments

The author would like to thank Nick Whiteley, Tuomas Virtanen, and Paul Peeling for fruitful discussion and for their comments on earlier drafts of this paper. This research is funded by the Engineering and Physical Sciences Research Council (EPSRC) under the grant EP/D03261X/1 and by the Turkish Science, Technology Research Council grant TUBITAK 107E021 and Boğaziçi University Research Fund BAP 09A105P. The research is carried out while the author was with the Signal Processing and Comms. Lab, Department of Engineering, University of Cambridge, UK and Department of Computer Engineering, Boğaziçi University, Turkey.

References

D. D. Lee and H. S. Seung, “Learning the parts of objects with nonnegative matrix factorization,” Nature, vol. 401, pp. 788–791, 1999.
View at: Publisher Site | Google Scholar
P. Paatero and U. Tapper, “Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
View at: Publisher Site | Google Scholar
C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, NY, USA, 2006.
T. Virtanen, Sound source separation in monaural music signals, Ph.D. thesis, Tampere University of Technology, Tampere, Finland, November 2006.
H. Kameoka, Statistical approach to multipitch analysis, Ph.D. thesis, University of Tokyo, Tokyo, Japan, 2007.
A. Cichocki, M. Mørup, P. Smaragdis, W. Wang, and R. Zdunek, “Advances in nonnegative matrix and tensor factorization,” Computational Intelligence and Neuroscience, vol. 2008, Article ID 852187, 3 pages, 2008.
View at: Publisher Site | Google Scholar
J. F. C. Kingman, Poisson Processes, Oxford Science, Oxford, UK, 1993.
D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS '00), vol. 13, pp. 556–562, Denver, Colo, USA, October-November 2000.
View at: Google Scholar
J. Le Roux, H. Kameoka, N. Ono, and S. Sagayama, “On the relation between divergence-based minimization and maximum-likelihood estimation for the i-divergence,” Personal Communication, 2008.
View at: Google Scholar
T. Virtanen, A. T. Cemgil, and S. Godsill, “Bayesian extensions to non-negative matrix factorisation for audio signal modelling,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), pp. 1825–1828, Las Vegas, Nev, USA, March-April 2008.
View at: Publisher Site | Google Scholar
E. Gaussier and C. Goutte, “Relation between PLSA and NMF and implication,” in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 601–602, Salvador, Brazil, August 2005.
View at: Google Scholar
A. P. Singh and G. J. Gordon, “A unified view of matrix factorization models,” in Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD '08), vol. 5212 of Lecture Notes in Computer Science, pp. 358–373, Springer, Antwerp, Belgium, September 2008.
View at: Publisher Site | Google Scholar
M. N. Schmidt and H. Laurberg, “Nonnegative matrix factorization with Gaussian process priors,” Computational Intelligence and Neuroscience, vol. 2008, Article ID 361705, 10 pages, 2008.
View at: Publisher Site | Google Scholar
Z. Ghahramani and M. Beal, “Propagation algorithms for variational Bayesian learning,” in Advances in Neural Information Processing Systems, vol. 13, pp. 507–513, MIT Press, Cambridge, Mass, USA, 2000.
View at: Google Scholar
W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, Eds., Markov Chain Monte Carlo in Practice, CRC Press, London, UK, 1996.
J. S. Liu, Monte Carlo Strategies in Scientific Computing, Springer, New York, NY, USA, 2004.
S. Chib, “Marginal likelihood from the gibbs output,” Journal of the Acoustical Society of America, vol. 90, no. 432, pp. 1313–1321, 1995.
View at: Google Scholar
P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” The Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.
View at: Google Scholar
A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering with Bregman divergences,” Journal of Machine Learning Research, vol. 6, pp. 1705–1749, 2005.
View at: Google Scholar
M. V. Shashanka, B. Raj, and P. Smaragdis, “Sparse overcomplete latent variable decomposition of counts data,” in Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS '07), Vancouver, Canada, December 2007.
View at: Google Scholar
C. S. Davis, “The computer generation of multinomial random variates,” Computational Statistics and Data Analysis, vol. 16, no. 2, pp. 205–217, 1993.
View at: Publisher Site | Google Scholar
M. A. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, Springer, New York, NY, USA, 3rd edition, 1996.
F. A. Quintana, J. S. Liu, and G. E. del Pino, “Monte Carlo EM with importance reweighting and its applications in random effects models,” Computational Statistics and Data Analysis, vol. 29, no. 4, pp. 429–444, 1999.
View at: Publisher Site | Google Scholar
K. Butler and M. Stephens, “The distribution of a sum of binomial random variables,” Tech. Rep. 467, Stanford University, Stanford, Calif, USA, April 1993, prepared for the Office of Naval Research.
View at: Google Scholar
T. Minka, Expectation propagation for approximate Bayesian inference, Ph.D. thesis, MIT Media Lab, Cambridge, Mass, USA, 2001.
O. Winther and K. B. Petersen, “Bayesian independent component analysis: variational methods and non-negative decompositions,” Digital Signal Processing, vol. 17, no. 5, pp. 858–872, 2007.
View at: Publisher Site | Google Scholar
W. Buntine, “Variational extensions to EM and multinomial PCA,” in Proceedings of the 13th European Conference on Machine Learning (ECML '02), vol. 2430 of Lecture Notes In Computer Science, pp. 23–34, Springer, Helsinki, Finland, August 2002.
View at: Google Scholar
A. T. Cemgil, P. Peeling, O. Dikmen, and S. Godsill, “Prior structures for time-frequency energy distributions,” in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '07), pp. 151–154, New Paltz, NY, USA, October 2007.
View at: Publisher Site | Google Scholar
P. Flajolet, “Singularity analysis and asymptotics of Bernoulli sums,” Theoretical Computer Science, vol. 215, no. 1-2, pp. 371–381, 1999.
View at: Publisher Site | Google Scholar
P. Jacquet and W. Szpankowski, “Entropy computations via analytic depoissonization,” IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1072–1081, 1999.
View at: Publisher Site | Google Scholar

Copyright

Copyright © 2009 Ali Taylan Cemgil. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

PDF Download Citation

Download other formats

Order printed copies

Views

10559

Downloads

6454

Citations