Computational Intelligence and Neuroscience
Volume 2008 (2008), Article ID 947438, 9 pages
http://dx.doi.org/10.1155/2008/947438
Research Article

Probabilistic Latent Variable Models as Nonnegative Factorizations

1Mars Incorporated, 800 High Street, Hackettstown, New Jersey 07840, USA
2Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge MA 02139, USA
3Adobe Systems Incorporated, 275 Grove Street, Newton MA 02466, USA

Received 21 December 2007; Accepted 13 February 2008

Academic Editor: Rafal Zdunek

Copyright © 2008 Madhusudana Shashanka et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This paper presents a family of probabilistic latent variable models that can be used for analysis of nonnegative data. We show that there are strong ties between nonnegative matrix factorization and this family, and provide some straightforward extensions which can help in dealing with shift invariances, higher-order decompositions and sparsity constraints. We argue through these extensions that the use of this approach allows for rapid development of complex statistical models for analyzing nonnegative data.

1. Introduction

Techniques to analyze nonnegative data are required in several applications, such as the analysis of images, text corpora, and audio spectra. A variety of techniques have been proposed for the analysis of such data, including nonnegative PCA [1], nonnegative ICA [2], and nonnegative matrix factorization (NMF) [3]. The goal of all of these techniques is to explain the given nonnegative data as a guaranteed nonnegative linear combination of a set of nonnegative “bases” that represent realistic “building blocks” for the data. Of these, probably the most developed is nonnegative matrix factorization, with much recent research devoted to the topic [4–6]. All of these approaches view each data vector as a point in an $M$-dimensional space and attempt to identify the bases that best explain the distribution of the data within this space. For the sake of clarity, we will refer to data that represent vectors in any space as point data.

A somewhat related but separate topic that has garnered much research over the years is the analysis of histograms of multivariate data. Histogram data represent the counts of occurrences of a set of events in a given data set. The aim here is to identify the statistical factors that affect the occurrence of data through the analysis of these counts and appropriate modeling of the distributions underlying them. Such analysis is often required in the analysis of text, behavioral patterns, and so on. A variety of techniques, such as probabilistic latent semantic analysis [7] and latent Dirichlet allocation [8], along with their derivatives, have lately become quite popular. Most, if not all, of them can be related to a class of probabilistic models, known in the behavioral sciences community as latent class models [9–11], that attempt to explain the observed histograms as having been drawn from a set of latent classes, each with its own distribution. For clarity, we will refer to histograms and collections of histograms as histogram data.

In this paper, we argue that techniques meant for analysis of histogram data can be equally effectively employed for decomposition of nonnegative point data as well, by interpreting the latter as scaled histograms rather than vectors. Specifically, we show that the algorithms used for estimating the parameters of a latent class model are numerically equivalent to the update rules for one form of NMF. We also propose alternate latent variable models for histogram decomposition that are similar to those commonly employed in the analysis of text, to decompose point data and show that these too are identical to the update rules for NMF. We will generically refer to the application of histogram-decomposition techniques to point data as probabilistic decompositions. (This must not be confused with approaches that model the distribution of the set of vectors. In our approach, the vectors themselves are histograms, or, alternately, scaled probability distributions.)

Beyond simple equivalences to NMF, the probabilistic decomposition approach has several advantages, as we explain. Nonnegative PCA/ICA and NMF are primarily intended for matrix-like two-dimensional characterizations of data: the analysis is obtained for matrices that are formed by laying data vectors side by side. They do not naturally extend to higher-dimensional tensorial representations; such extensions have often been accomplished by implicitly unwrapping the tensors into a matrix. The probabilistic decomposition, however, naturally extends from matrices to tensors of arbitrary dimensions.

It is often desired to control the form or structure of the learned bases and their projections. Since the procedure for learning the bases that represent the data is statistical, probabilistic decomposition affords control over the form of the learned bases through the imposition of a priori probabilities, as we will show. Constraints such as sparsity can also be incorporated through these priors.

We also describe extensions to the basic probabilistic decomposition framework that permit shift invariance along one or more of the dimensions (of the data tensor) and can thereby abstract convolutively combined bases from the data.

The rest of the paper is organized as follows. Since the probabilistic decomposition approach we promote in this paper is most analogous to nonnegative matrix factorization (NMF) among all techniques that analyze nonnegative point data, we begin with a brief discussion of NMF in Section 2. We present the family of latent variable models that we will employ for probabilistic decompositions in Section 3. We present tensor generalizations in Section 4.1 and convolutive factorizations in Section 4.2. In Section 4.3, we discuss extensions such as the incorporation of sparsity, and in Section 4.4, we present aspects of the geometric interpretation of these decompositions.

2. Nonnegative Matrix Factorization

Nonnegative matrix factorization was introduced by [3] to find nonnegative parts-based representations of data. Given an $M \times N$ matrix $\mathbf{V}$ where each column corresponds to a data vector, NMF approximates it as a product of nonnegative matrices $\mathbf{W}$ and $\mathbf{H}$, that is, $\mathbf{V} \approx \mathbf{W}\mathbf{H}$, where $\mathbf{W}$ is an $M \times K$ matrix and $\mathbf{H}$ is a $K \times N$ matrix. The above approximation can be written column by column as $\mathbf{v}_n \approx \mathbf{W}\mathbf{h}_n$, where $\mathbf{v}_n$ and $\mathbf{h}_n$ are the $n$th columns of $\mathbf{V}$ and $\mathbf{H}$, respectively. In other words, each data vector $\mathbf{v}_n$ is approximated by a linear combination of the columns of $\mathbf{W}$, weighted by the entries of $\mathbf{h}_n$. The columns of $\mathbf{W}$ can be thought of as basis vectors that, when combined with appropriate mixture weights (entries of the columns of $\mathbf{H}$), provide a linear approximation of $\mathbf{V}$.

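The optimal choice of matrices $\mathbf{W}$ and $\mathbf{H}$ is defined by those nonnegative matrices that minimize the reconstruction error between $\mathbf{V}$ and $\mathbf{W}\mathbf{H}$. Different error functions have been proposed, which lead to different update rules (e.g., [3, 12]). Shown below are multiplicative update rules derived by [3] using an error measure similar to the Kullback-Leibler divergence:

$$W_{mk} \leftarrow W_{mk} \sum_{n} \frac{V_{mn}}{(\mathbf{W}\mathbf{H})_{mn}} H_{kn}, \qquad W_{mk} \leftarrow \frac{W_{mk}}{\sum_{m} W_{mk}}, \qquad H_{kn} \leftarrow H_{kn} \sum_{m} W_{mk} \frac{V_{mn}}{(\mathbf{W}\mathbf{H})_{mn}}, \tag{1}$$

where $A_{ij}$ represents the value at the $i$th row and the $j$th column of matrix $\mathbf{A}$.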
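The multiplicative rules of (1) can be tried out numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the random initialization, the small constant `eps`, and the choice to compute both updates from the same ratio matrix are all assumptions made for illustration.

```python
import numpy as np

def nmf_kl_step(V, W, H, eps=1e-12):
    """One pass of the multiplicative updates, with W kept column-normalized."""
    R = V / (W @ H + eps)                      # ratios V_mn / (WH)_mn
    W_new = W * (R @ H.T)                      # W_mk * sum_n R_mn H_kn
    W_new /= W_new.sum(axis=0, keepdims=True)  # each basis vector sums to one
    H_new = H * (W.T @ R)                      # H_kn * sum_m W_mk R_mn
    return W_new, H_new

def kl_divergence(V, A, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(V || A)."""
    return float(np.sum(V * np.log((V + eps) / (A + eps)) - V + A))

def nmf_kl(V, K, n_iter=200, seed=0):
    """Run the updates from a random nonnegative start (illustrative only)."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, K)) + 0.1
    W /= W.sum(axis=0, keepdims=True)
    H = rng.random((K, N)) + 0.1
    for _ in range(n_iter):
        W, H = nmf_kl_step(V, W, H)
    return W, H
```

Calling `nmf_kl(V, K)` on a nonnegative matrix returns a column-normalized `W` and a nonnegative `H`; `kl_divergence(V, W @ H)` can be used to monitor the fit.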

3. Latent Variable Models

In its simplest form, NMF expresses an $M \times N$ data matrix $\mathbf{V}$ as the product of nonnegative matrices $\mathbf{W}$ and $\mathbf{H}$. The idea is to express the data vectors (columns of $\mathbf{V}$) as a combination of a set of basis components or latent factors (columns of $\mathbf{W}$). Below, we show that a class of probabilistic models employing latent variables, known in the field of social and behavioral sciences as latent class models (e.g., [9, 11, 13]), is equivalent to NMF.

Let us represent the two dimensions of the matrix $\mathbf{V}$ by $x_1$ and $x_2$, respectively. We can consider the nonnegative entries as having been generated by an underlying probability distribution $P(x_1, x_2)$. Variables $x_1$ and $x_2$ are multinomial random variables, where $x_1$ can take one out of a set of $M$ values in a given draw and $x_2$ can take one out of a set of $N$ values in a given draw. In other words, one can model $V_{mn}$, the entry in row $m$ and column $n$, as the number of times features $x_1 = m$ and $x_2 = n$ were picked in a set of repeated draws from the distribution $P(x_1, x_2)$. Unlike NMF, which tries to characterize the observed data directly, latent class models characterize the underlying distribution $P(x_1, x_2)$. This subtle difference of interpretation preserves all the advantages of NMF, while overcoming some of its limitations by providing a framework that is easy to generalize, extend, and interpret.
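The histogram interpretation above can be made concrete with a small simulation. This is a hedged sketch: the joint distribution `P`, the number of draws, and the helper name `sample_count_matrix` are hypothetical and chosen only to illustrate how a count matrix arises from repeated draws.

```python
import numpy as np

def sample_count_matrix(P, n_draws, seed=0):
    """Draw n_draws (x1, x2) pairs from the joint distribution P and count them."""
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(n_draws, P.ravel())
    return counts.reshape(P.shape)

# A hypothetical 3 x 4 joint distribution P(x1, x2), for illustration only.
P = np.array([[0.10, 0.05, 0.05, 0.10],
              [0.05, 0.20, 0.05, 0.05],
              [0.05, 0.05, 0.20, 0.05]])

V = sample_count_matrix(P, n_draws=100_000)  # nonnegative "data matrix" of counts
P_hat = V / V.sum()                          # normalized histogram approximates P
```

With many draws, the normalized counts `P_hat` approach `P`, which is why modeling the distribution and modeling the scaled data lead to closely related factorizations.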

There are two ways of modeling $P(x_1, x_2)$, and we consider them separately below.

3.1. Symmetric Factorization

Latent class models enable one to attribute the observations as being due to hidden or latent factors. The main characteristic of these models is conditional independence: multivariate data are modeled as belonging to latent classes such that the random variables within a latent class are independent of one another. The model expresses a multivariate distribution such as $P(x_1, x_2)$ as a mixture where each component of the mixture is a product of one-dimensional marginal distributions. In the case of two-dimensional data such as $\mathbf{V}$, the model can be written mathematically as

$$P(x_1, x_2) = \sum_{z \in \{1, 2, \ldots, K\}} P(z)\, P(x_1|z)\, P(x_2|z). \tag{2}$$

In (2), $z$ is a latent variable that indexes the hidden components and takes values from the set $\{1, 2, \ldots, K\}$. This equation assumes the principle of local independence, whereby the latent variable $z$ renders the observed variables $x_1$ and $x_2$ independent. This model was presented independently as probabilistic latent component analysis (PLCA) by [14]. The aim of the model is to characterize the distribution underlying the data as shown above by learning the parameters so that hidden structure present in the data becomes explicit.

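The model can be expressed as a matrix factorization. Representing the parameters $P(x_1|z)$, $P(x_2|z)$, and $P(z)$ as entries of matrices $\mathbf{W}$, $\mathbf{H}$, and $\mathbf{S}$, respectively, where (i) $\mathbf{W}$ is an $M \times K$ matrix such that $W_{mk}$ corresponds to the probability $P(x_1 = m|z = k)$; (ii) $\mathbf{H}$ is a $K \times N$ matrix such that $H_{kn}$ corresponds to the probability $P(x_2 = n|z = k)$; and (iii) $\mathbf{S}$ is a $K \times K$ diagonal matrix such that $S_{kk}$ corresponds to the probability $P(z = k)$; one can write the model of (2) in matrix form as

$$\mathbf{P} = \mathbf{W}\mathbf{S}\mathbf{H}, \quad \text{or equivalently,} \quad \mathbf{P} = \mathbf{W}\mathbf{H}', \tag{3}$$

where the entries of matrix $\mathbf{P}$ correspond to $P(x_1, x_2)$ and $\mathbf{H}' = \mathbf{S}\mathbf{H}$. Figure 1 illustrates the model schematically.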

Figure 1: Latent variable model of (2) as matrix factorization.

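Parameters can be estimated using the EM algorithm. The update equations for the parameters can be written as

$$\begin{aligned}
P(z|x_1, x_2) &= \frac{P(z)\, P(x_1|z)\, P(x_2|z)}{\sum_{z'} P(z')\, P(x_1|z')\, P(x_2|z')}, \\
P(z) &= \frac{\sum_{x_1, x_2} V_{x_1 x_2}\, P(z|x_1, x_2)}{\sum_{z', x_1, x_2} V_{x_1 x_2}\, P(z'|x_1, x_2)}, \\
P(x_1|z) &= \frac{\sum_{x_2} V_{x_1 x_2}\, P(z|x_1, x_2)}{\sum_{x_1, x_2} V_{x_1 x_2}\, P(z|x_1, x_2)}, \\
P(x_2|z) &= \frac{\sum_{x_1} V_{x_1 x_2}\, P(z|x_1, x_2)}{\sum_{x_1, x_2} V_{x_1 x_2}\, P(z|x_1, x_2)}.
\end{aligned} \tag{4}$$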

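Writing the above update equations in matrix form using $\mathbf{W}$ and $\mathbf{H}'$ from (3), we obtain

$$\begin{aligned}
W_{mk} &\leftarrow W_{mk} \sum_{n} \frac{V_{mn}}{(\mathbf{W}\mathbf{H}')_{mn}} H'_{kn}, &\qquad W_{mk} &\leftarrow \frac{W_{mk}}{\sum_{m} W_{mk}}, \\
H'_{kn} &\leftarrow H'_{kn} \sum_{m} W_{mk} \frac{V_{mn}}{(\mathbf{W}\mathbf{H}')_{mn}}, &\qquad H'_{kn} &\leftarrow \frac{H'_{kn}}{\sum_{k, n} H'_{kn}}.
\end{aligned} \tag{5}$$

The above equations are identical to the NMF update equations of (1) up to a scaling factor in $\mathbf{H}$. This is due to the fact that the probabilistic model decomposes $\mathbf{P}$, which is equivalent to a normalized version of the data $\mathbf{V}$. Reference [14] presents a detailed derivation of the update algorithms and a comparison with the NMF update equations. This model has been used in analyzing image and audio data, among other applications (e.g., [14–16]).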
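The EM iteration for the symmetric model of (2) can be sketched directly from its definition: compute the posterior of the latent variable for every cell, weight it by the data, and renormalize. The sketch below is illustrative only; the function name, the array layout (`W[m, k]` for $P(x_1 = m|z = k)$, `H[k, n]` for $P(x_2 = n|z = k)$), and the stabilizing constant `eps` are assumptions.

```python
import numpy as np

def plca_em_step(V, Pz, W, H, eps=1e-12):
    """One EM iteration for P(x1, x2) = sum_z P(z) P(x1|z) P(x2|z).

    V  : (M, N) nonnegative data (counts or scaled histogram)
    Pz : (K,)   current P(z)
    W  : (M, K) current P(x1|z), columns sum to one
    H  : (K, N) current P(x2|z), rows sum to one
    """
    # E-step: posterior P(z | x1, x2) for every cell, shape (M, K, N).
    joint = Pz[None, :, None] * W[:, :, None] * H[None, :, :]
    post = joint / (joint.sum(axis=1, keepdims=True) + eps)

    # M-step: accumulate the data-weighted posterior and renormalize.
    weighted = V[:, None, :] * post
    mass = weighted.sum(axis=(0, 2))            # total weight per component
    Pz_new = mass / mass.sum()
    W_new = weighted.sum(axis=2) / mass[None, :]
    H_new = weighted.sum(axis=0) / mass[:, None]
    return Pz_new, W_new, H_new
```

Under these assumptions, one step of this function yields the same `W` as the multiplicative update applied to `W` and `Pz[:, None] * H`, and the same scaled `H` up to division by `V.sum()`, which is the scaling factor mentioned in the text.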

3.2. Asymmetric Factorization

The latent class model of (2) considers each dimension symmetrically for factorization. The two-dimensional distribution $P(x_1, x_2)$ is expressed as a mixture of two-dimensional latent factors where each factor is a product of one-dimensional marginal distributions. Now, consider the following factorization of $P(x_1, x_2)$:

$$P(x_1, x_2) = P(x_i)\, P(x_j|x_i), \qquad P(x_j|x_i) = \sum_{z} P(x_j|z)\, P(z|x_i), \tag{6}$$

where $i, j \in \{1, 2\}$, $i \neq j$, and $z$ is a latent variable. This version of the model with asymmetric factorization is popularly known as probabilistic latent semantic analysis (PLSA) in the topic-modeling literature [7].

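Without loss of generality, let $j = 1$ and $i = 2$. We can write the above model in matrix form as $\mathbf{q}_n = \mathbf{W}\mathbf{g}_n$, where $\mathbf{q}_n$ is a column vector indicating $P(x_1|x_2 = n)$, $\mathbf{g}_n$ is a column vector indicating $P(z|x_2 = n)$, and $\mathbf{W}$ is a matrix with the $(m, k)$th element corresponding to $P(x_1 = m|z = k)$. If $z$ takes $K$ values, $\mathbf{W}$ is an $M \times K$ matrix. Concatenating all column vectors $\mathbf{q}_n$ and $\mathbf{g}_n$ as matrices $\mathbf{Q}$ and $\mathbf{G}$, respectively, one can write the model as

$$\mathbf{Q} = \mathbf{W}\mathbf{G}, \quad \text{or equivalently,} \quad \mathbf{V} \approx \mathbf{W}\mathbf{G}\mathbf{S} = \mathbf{W}\mathbf{H}, \tag{7}$$

where $\mathbf{S}$ is an $N \times N$ diagonal matrix whose $n$th diagonal element is the sum of the entries of $\mathbf{v}_n$, and $\mathbf{H} = \mathbf{G}\mathbf{S}$. Figure 2 provides a schematic illustration of the model.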

Figure 2: Latent variable model of (6) as matrix factorization.

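Given data matrix $\mathbf{V}$, parameters $P(x_1|z)$ and $P(z|x_2)$ are estimated by iterations of equations derived using the EM algorithm:

$$\begin{aligned}
P(z|x_1, x_2) &= \frac{P(x_1|z)\, P(z|x_2)}{\sum_{z'} P(x_1|z')\, P(z'|x_2)}, \\
P(x_1|z) &= \frac{\sum_{x_2} V_{x_1 x_2}\, P(z|x_1, x_2)}{\sum_{x_1, x_2} V_{x_1 x_2}\, P(z|x_1, x_2)}, \\
P(z|x_2) &= \frac{\sum_{x_1} V_{x_1 x_2}\, P(z|x_1, x_2)}{\sum_{z', x_1} V_{x_1 x_2}\, P(z'|x_1, x_2)}.
\end{aligned} \tag{8}$$

Writing the above equations in matrix form using $\mathbf{W}$ and $\mathbf{H}$ from (7), we obtain

$$W_{mk} \leftarrow W_{mk} \sum_{n} \frac{V_{mn}}{(\mathbf{W}\mathbf{H})_{mn}} H_{kn}, \qquad W_{mk} \leftarrow \frac{W_{mk}}{\sum_{m} W_{mk}}, \qquad H_{kn} \leftarrow H_{kn} \sum_{m} W_{mk} \frac{V_{mn}}{(\mathbf{W}\mathbf{H})_{mn}}. \tag{9}$$

The above set of equations is exactly identical to the NMF update equations of (1). See [17, 18] for a detailed derivation of the update equations. The equivalence between NMF and PLSA has also been pointed out by [19]. The model has been used for the analysis of audio spectra (e.g., [20]), images (e.g., [17, 21]), and text corpora (e.g., [7]).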
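The stated equivalence can be checked numerically by running one EM iteration of the asymmetric model and one multiplicative NMF update from the same starting point. The following sketch is an illustration under stated assumptions: the function names and array layouts are hypothetical, `G[k, n]` stands for $P(z = k|x_2 = n)$, and both NMF updates are computed from the same ratio matrix.

```python
import numpy as np

def plsa_em_step(V, W, G, eps=1e-12):
    """One EM iteration for P(x1 | x2=n) = sum_z P(x1|z) P(z|x2=n)."""
    joint = W[:, :, None] * G[None, :, :]                    # (M, K, N)
    post = joint / (joint.sum(axis=1, keepdims=True) + eps)  # P(z | x1, x2)
    weighted = V[:, None, :] * post
    W_new = weighted.sum(axis=2)
    W_new /= W_new.sum(axis=0, keepdims=True)   # normalize over x1
    G_new = weighted.sum(axis=0)
    G_new /= G_new.sum(axis=0, keepdims=True)   # normalize over z
    return W_new, G_new

def nmf_step(V, W, H, eps=1e-12):
    """Multiplicative NMF update with column-normalized W."""
    R = V / (W @ H + eps)
    W_new = W * (R @ H.T)
    W_new /= W_new.sum(axis=0, keepdims=True)
    H_new = H * (W.T @ R)
    return W_new, H_new

def compare_one_step(V, W, G):
    """Return both updates, with G rescaled by column sums of V for comparison."""
    s = V.sum(axis=0)                  # diagonal of S
    W_em, G_em = plsa_em_step(V, W, G)
    W_nmf, H_nmf = nmf_step(V, W, G * s)
    return W_em, G_em * s, W_nmf, H_nmf
```

In this sketch, `compare_one_step` returns matching pairs (`W_em` with `W_nmf`, and the rescaled `G_em` with `H_nmf`) up to floating-point error.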

4. Model Extensions

The popularity of NMF comes mainly from its empirical success in finding “useful components” from the data. As pointed out by several researchers, NMF has certain important limitations despite the success. We have presented probabilistic models that are numerically closely related to or identical to one of the widely used NMF update algorithms. Despite the numerical equivalence, the methodological difference in approaches is important. In this section, we outline some advantages of using this alternate probabilistic view of NMF.

The first and most straightforward implication of using a probabilistic approach is that it provides a theoretical basis for the technique. More importantly, the probabilistic underpinning enables one to utilize all the tools and machinery of statistical inference for estimation. This is crucial for extensions and generalizations of the method. Beyond these obvious advantages, we discuss below some specific examples where this approach is particularly useful.

4.1. Tensorial Factorization

NMF was introduced to analyze two-dimensional data. However, there are several domains with nonnegative multidimensional data where a multidimensional correlate of NMF could be very useful. This problem has been termed nonnegative tensor factorization (NTF). Several extensions of NMF have been proposed to handle multidimensional data (e.g., [4–6, 22]). Typically, these methods flatten the tensor into a matrix representation and proceed further with analysis. Conceptually, NTF is a natural generalization of NMF, but the estimation algorithms for learning the parameters do not lend themselves to extensions easily. Several issues contribute to this difficulty. We do not present the reasons here due to lack of space, but a detailed discussion can be found in [6].

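Now, consider the symmetric factorization case of the latent variable model presented in Section 3.1. This model is naturally suited for generalizations to multiple dimensions. In its general form, the model expresses a $D$-dimensional distribution as a mixture, where each $D$-dimensional component of the mixture is a product of one-dimensional marginal distributions. Mathematically, it can be written as

$$P(\mathbf{x}) = \sum_{z} P(z) \prod_{j=1}^{D} P(x_j|z), \tag{10}$$

where $P(\mathbf{x})$ is a $D$-dimensional distribution of the random variable $\mathbf{x} = (x_1, x_2, \ldots, x_D)$, $z$ is the latent variable indexing the mixture components, and $P(x_j|z)$ are one-dimensional marginal distributions. Parameters are estimated by iterations of equations derived using the EM algorithm, and they are

$$\begin{aligned}
R(\mathbf{x}, z) &= \frac{P(z) \prod_{j=1}^{D} P(x_j|z)}{\sum_{z'} P(z') \prod_{j=1}^{D} P(x_j|z')}, \\
P(z) &= \int P(\mathbf{x})\, R(\mathbf{x}, z)\, d\mathbf{x}, \\
P(x_j|z) &= \frac{\int \cdots \int P(\mathbf{x})\, R(\mathbf{x}, z)\, dx_k,\ \forall k \neq j}{P(z)},
\end{aligned} \tag{11}$$

where $P(\mathbf{x})$ on the right-hand sides denotes the observed (normalized) data.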

In the two-dimensional case, the update equations reduce to (4).
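For discrete data, the integrals in the tensor updates become sums over array axes. The following is a minimal three-dimensional sketch, assuming the input array has been normalized to sum to one; the function names and the list-of-factors layout (`F[j][i, k]` for $P(x_j = i|z = k)$) are hypothetical choices for illustration.

```python
import numpy as np

def plca3_reconstruct(Pz, F):
    """Model P(x) = sum_z P(z) P(x1|z) P(x2|z) P(x3|z) as a 3-D array."""
    return np.einsum('k,ak,bk,ck->abc', Pz, F[0], F[1], F[2])

def plca3_em_step(P, Pz, F, eps=1e-12):
    """One EM iteration for a 3-D nonnegative array P that sums to one."""
    joint = np.einsum('k,ak,bk,ck->abck', Pz, F[0], F[1], F[2])
    R = joint / (joint.sum(axis=3, keepdims=True) + eps)   # posterior of z
    weighted = P[..., None] * R                            # data-weighted posterior
    Pz_new = weighted.sum(axis=(0, 1, 2))
    F_new = [weighted.sum(axis=(1, 2)) / Pz_new,   # marginalize all but x1
             weighted.sum(axis=(0, 2)) / Pz_new,   # marginalize all but x2
             weighted.sum(axis=(0, 1)) / Pz_new]   # marginalize all but x3
    return Pz_new, F_new
```

No flattening of the tensor is needed: each marginal is obtained by summing the weighted posterior over the other axes, and further dimensions would add further axes in the same pattern.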

To illustrate the kind of output this algorithm produces, consider the following toy example. The input $P(\mathbf{x})$ was the 3-dimensional distribution shown in the upper left plot in Figure 3. This distribution can also be seen as a rank-3 positive tensor. It is clearly composed of two components, each an isotropic Gaussian with its own mean and variance. The bottom row of plots shows the derived sets of $P(x_j|z)$ using the estimation procedure we just described. We can see that each of them is composed of a Gaussian at the expected position and with the expected variance. The approximation of $P(\mathbf{x})$ obtained using this model is shown in the top right. Other examples of applications on more complex data and a detailed derivation of the algorithm can be found in [14, 23].

Figure 3: An example of a higher dimensional positive data decomposition. An isosurface of the original input is shown at the top left, the approximation by the model in (10) is shown in the top right, and the extracted marginals (or factors) are shown in the lower plots.

4.2. Convolutive Decompositions

Given a two-dimensional dataset, NMF finds hidden structure along one dimension (columnwise) that is characteristic to the entire dataset. Consider a scenario where there is localized structure present along both dimensions (rows and columns) that has to be extracted from the data. An example dataset would be an acoustic spectrogram of human speech which has structure along both frequency and time. Traditional NMF is unable to find structure across both dimensions and several extensions have been proposed to handle such datasets (e.g., [24, 25]).

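The latent variable model can be extended for such datasets, and the parameter estimation still follows a simple EM algorithm based on the principle of maximum likelihood. The model, known as a shift-invariant version of PLCA, can be mathematically written as [23]

$$P(\mathbf{x}) = \sum_{z} P(z) \int P(\mathbf{w}, \boldsymbol{\tau}|z)\, P(\mathbf{h} - \boldsymbol{\tau}|z)\, d\boldsymbol{\tau}, \tag{12}$$

where $\mathbf{x} = (\mathbf{w}, \mathbf{h})$, with $\mathbf{h}$ denoting the dimensions along which shift invariance is applied, and the kernel distribution $P(\mathbf{w}, \boldsymbol{\tau}|z) = 0$ for all $\boldsymbol{\tau} \notin \mathcal{R}$, where $\mathcal{R}$ defines a local convex region along the dimensions of $\mathbf{x}$. Similar to the simple model of (2), the model expresses $P(\mathbf{x})$ as a mixture of latent components. But instead of each component being a simple product of one-dimensional distributions, the components are convolutions between a multidimensional “kernel distribution” and a multidimensional “impulse distribution”. The update equations for the parameters are

$$\begin{aligned}
P(z, \boldsymbol{\tau}|\mathbf{x}) &= \frac{P(z)\, P(\mathbf{w}, \boldsymbol{\tau}|z)\, P(\mathbf{h} - \boldsymbol{\tau}|z)}{\sum_{z'} P(z') \int P(\mathbf{w}, \boldsymbol{\tau}'|z')\, P(\mathbf{h} - \boldsymbol{\tau}'|z')\, d\boldsymbol{\tau}'}, \\
P(z) &= \iint P(\mathbf{x})\, P(z, \boldsymbol{\tau}|\mathbf{x})\, d\boldsymbol{\tau}\, d\mathbf{x}, \\
P(\mathbf{w}, \boldsymbol{\tau}|z) &= \frac{\int P(\mathbf{x})\, P(z, \boldsymbol{\tau}|\mathbf{x})\, d\mathbf{h}}{P(z)}, \\
P(\mathbf{h}|z) &= \frac{\iint P(\mathbf{w}, \mathbf{h} + \boldsymbol{\tau})\, P(z, \boldsymbol{\tau}|\mathbf{w}, \mathbf{h} + \boldsymbol{\tau})\, d\mathbf{w}\, d\boldsymbol{\tau}}{\iiint P(\mathbf{w}, \mathbf{h}' + \boldsymbol{\tau})\, P(z, \boldsymbol{\tau}|\mathbf{w}, \mathbf{h}' + \boldsymbol{\tau})\, d\mathbf{h}'\, d\mathbf{w}\, d\boldsymbol{\tau}},
\end{aligned} \tag{13}$$

where $P(\mathbf{x})$ on the right-hand sides denotes the observed (normalized) data.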

A detailed derivation of the algorithm can be found in [14]. The above model is able to deal with tensorial data just as well as matrix data. To illustrate this model, consider the picture in the top left of Figure 4. This particular image is a rank-3 tensor. We wish to discover the underlying components that make up this image. The components are the digits 1, 2, and 3, which appear in various spatial locations, thereby necessitating a “shift-invariant” approach. Using the aforementioned algorithm, we obtain the results shown in Figure 4. Other examples of such decompositions on more complex data are shown in [23].
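A discrete, two-dimensional special case of the shift-invariant model can be sketched as follows, with one ordinary dimension and one dimension along which the kernel is shifted. This is an illustrative EM derivation under stated assumptions, not the authors' code: the function names, the array layout (`Kz[z, w, tau]` for the kernel, `Iz[z, h]` for the impulse distribution), and the explicit loops are all hypothetical choices.

```python
import numpy as np

def si_reconstruct(Pz, Kz, Iz):
    """Model P(w, h) = sum_z P(z) sum_tau Kz[z, w, tau] * Iz[z, h - tau]."""
    K, Wd, T = Kz.shape
    L = Iz.shape[1]
    model = np.zeros((Wd, L + T - 1))
    for z in range(K):
        for tau in range(T):
            # Kernel column tau, placed at every impulse position.
            model[:, tau:tau + L] += Pz[z] * np.outer(Kz[z, :, tau], Iz[z])
    return model

def si_em_step(P, Pz, Kz, Iz, eps=1e-12):
    """One EM iteration; P is a (Wd, L + T - 1) array that sums to one."""
    K, Wd, T = Kz.shape
    L = Iz.shape[1]
    ratio = P / (si_reconstruct(Pz, Kz, Iz) + eps)
    K_acc = np.zeros_like(Kz)
    I_acc = np.zeros_like(Iz)
    for z in range(K):
        for tau in range(T):
            seg = ratio[:, tau:tau + L]   # data/model ratio seen at shift tau
            # Posterior mass summed over impulse positions (kernel update).
            K_acc[z, :, tau] = Pz[z] * Kz[z, :, tau] * (seg @ Iz[z])
            # Posterior mass summed over w and tau (impulse update).
            I_acc[z] += Pz[z] * Iz[z] * (Kz[z, :, tau] @ seg)
    mass = K_acc.sum(axis=(1, 2))
    return mass / mass.sum(), K_acc / mass[:, None, None], I_acc / mass[:, None]
```

The kernel support `T` plays the role of the local region that limits the shifts; extending the sketch to more shift-invariant dimensions would add further shift indices.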

Figure 4: An example of a higher dimensional shift-invariant positive data decomposition. The original input is shown at the top left, the approximation by the model in (12) is shown in the top middle, and the extracted kernels and impulses are shown in the lower plots.

The example above illustrates shift invariance, but it is conceivable that “components” that form the input might occur with transformations such as rotations and/or scaling in addition to translations (shifts). It is possible to extend this model to incorporate invariance to such transformations. The derivation follows naturally from the approach outlined above, but we omit further discussion here due to space constraints.

4.3. Extensions in the Form of Priors

One of the more apparent limitations of NMF is related to the quality of components that are extracted. Researchers have pointed out that NMF, as introduced by Lee and Seung, does not have an explicit way to control the “sparsity” of the desired components [26]. In fact, the inability to impose sparsity is just a specific example of a more general limitation. NMF does not provide a way to impose known or hypothesized structure about the data during estimation.

To elaborate, let us consider the example of sparsity. Several extensions have been proposed to NMF to incorporate sparsity (e.g., [26–28]). The general idea in these methods is to impose a cost function during estimation that incorporates an additional constraint that quantifies the sparsity of the obtained factors. While sparsity is usually specified as the $L_0$ norm of the derived factors [29], the actual constraints used consider an $L_1$ norm, since the $L_0$ norm is not amenable to optimization within a procedure that primarily attempts to minimize the $L_2$ norm of the error between the original data and the approximation given by the estimated factors. In the probabilistic formulation, the relationship of the sparsity constraint to the actual objective function optimized is more direct. We characterize sparsity through the entropy of the derived factors, as originally specified in [30]. A sparse code is defined as a set of basis vectors such that any given data point can be largely explained by only a few bases from the set, such that the required contribution of the rest of the bases to the data point is minimal; that is, the entropy of the mixture weights by which the bases are combined to explain the data point is low. A sparse code can now be obtained by imposing the entropic prior over the mixture weights. For a given distribution $\boldsymbol{\theta}$, the entropic prior is defined as $P(\boldsymbol{\theta}) \propto e^{-\beta \mathcal{H}(\boldsymbol{\theta})}$, where $\mathcal{H}(\boldsymbol{\theta}) = -\sum_i \theta_i \log \theta_i$ is the entropy. Imposition of this prior (with a positive $\beta$) on the mixture weights just means that we obtain solutions where mixture weights with low entropy are more likely to occur; a low entropy ensures that few entries of the vector are significant. Sparsity has been imposed in latent variable models by utilizing the entropic prior and has been shown to provide a better characterization of the data [17, 18, 23, 31]. Detailed derivations and estimation algorithms can be found in [17, 18]. Notice that priors can be imposed on any set of parameters during estimation.
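The effect of the entropic prior on candidate mixture weights can be illustrated with a few lines of NumPy. This sketch only evaluates the entropy and the unnormalized log-prior $-\beta \mathcal{H}(\boldsymbol{\theta})$; it does not implement the estimation algorithms of [17, 18]. The function names, the constant `eps`, and the two example weight vectors are assumptions made for illustration.

```python
import numpy as np

def entropy(theta, eps=1e-12):
    """Shannon entropy (in nats) of a discrete distribution theta."""
    theta = np.asarray(theta, dtype=float)
    return float(-np.sum(theta * np.log(theta + eps)))

def log_entropic_prior(theta, beta):
    """Unnormalized log of exp(-beta * H(theta))."""
    return -beta * entropy(theta)

# Hypothetical mixture weights over four bases.
uniform = np.full(4, 0.25)                       # every basis contributes equally
sparse = np.array([0.97, 0.01, 0.01, 0.01])      # one basis dominates

for name, theta in [("uniform", uniform), ("sparse", sparse)]:
    print(name, entropy(theta), log_entropic_prior(theta, beta=2.0))
```

With a positive `beta`, the sparse weights receive the larger log-prior, which is the sense in which the prior favors low-entropy solutions; a negative `beta` would reverse the preference.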

Information theoretically, entropy is a measure of information content. One can consider the entropic prior as providing an explicit way to control the amount of “information content” desired in the components. We illustrate this idea using a simple shift-invariance case. Consider an image that is composed of scattered plus-sign characters. Upon analysis of that image, we would expect the kernel distribution to be a “+”, and the impulse distribution to be a set of delta functions placing it appropriately in space. However, using the entropic prior, we can shift information from the kernel distribution to the impulse distribution or vice versa. We show the results from this analysis in Figure 5 for three cases: where no entropic prior is used (left panels), where it is used to make the impulse sparse (middle panels), and where it is used to make the kernel sparse (right panels). In the left panels, information about the data is distributed both in the kernel (top) and in the impulse distribution (bottom). In the other two cases, we were able to concentrate all the information either in the kernel or in the impulse distribution by making use of the entropic prior.

Figure 5: Example of the effect of the entropic prior on a set of kernel and impulse distributions. If no constraint is imposed, the information is evenly distributed among the two distributions (left column); if sparsity is imposed on the impulse distribution, most information lies in the kernel distribution (middle column); and vice versa if we request a sparse kernel distribution (right column).

Other prior distributions that have been used in various contexts include the Dirichlet [8, 32] and log-normal distributions [33], among others. The ability to utilize prior distributions during estimation provides a way to incorporate information known about the problem. More importantly, the probabilistic framework provides proven statistical inference techniques that one can employ for parameter estimation. We point out that these extensions can work with all the generalizations that were presented in the previous sections.

4.4. Geometrical Interpretation

We also want to briefly point out that probabilistic models can sometimes provide insights that are helpful for an intuitive understanding of the workings of the model.

Consider the asymmetric factorization case of the latent variable model as given by (6). Let us refer to the normalized columns of the data matrix $\mathbf{V}$ (obtained by scaling the entries of every column to sum to 1), $\bar{\mathbf{v}}_n$, as data distributions. It can be shown that learning the model is equivalent to estimating parameters such that the model $P(x_1|x_2 = n)$ for any data distribution $\bar{\mathbf{v}}_n$ best approximates it. Notice that the data distributions $\bar{\mathbf{v}}_n$, model approximations $P(x_1|x_2 = n)$, and components $P(x_1|z)$ are all $M$-dimensional vectors that sum to unity, and hence points in a simplex. The model expresses $P(x_1|x_2 = n)$ as points within the convex hull formed by the components $P(x_1|z)$. Since it is constrained to lie within this convex hull, $P(x_1|x_2 = n)$ can model $\bar{\mathbf{v}}_n$ accurately only if the latter also lies within the convex hull. Thus the objective of the model is to estimate $P(x_1|z)$ as corners of a convex hull such that all the data distributions lie within. This is illustrated in Figure 6 for a toy dataset of 400 three-dimensional data distributions.
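The convex-hull argument can be illustrated numerically by fixing a set of components and estimating only the mixture weights for a single data distribution. This is a hedged sketch: the component matrix `W`, the two test points, and the helper `fit_weights` are hypothetical, and the weight update is the EM rule for mixture weights with fixed components.

```python
import numpy as np

def fit_weights(v, W, n_iter=500, eps=1e-12):
    """EM for mixture weights g with fixed components W (columns sum to one)."""
    K = W.shape[1]
    g = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        g = g * (W.T @ (v / (W @ g + eps)))   # reweight by data/model ratio
        g /= g.sum()                          # keep g on the simplex
    return g

# Three hypothetical components in the 2-simplex (columns sum to one).
W = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

v_inside = W @ np.array([0.2, 0.3, 0.5])   # lies inside the convex hull
v_outside = np.array([0.95, 0.03, 0.02])   # lies outside the convex hull

approx_inside = W @ fit_weights(v_inside, W)
approx_outside = W @ fit_weights(v_outside, W)
```

Under these assumptions, `approx_inside` matches `v_inside` closely, while `approx_outside` cannot reach `v_outside` because every convex combination of the components has a first coordinate of at most 0.8.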

Figure 6: Illustration of the latent variable model. The panel shows 3-dimensional data distributions as points within the standard 2-simplex with vertices $(1, 0, 0)$, $(0, 1, 0)$, and $(0, 0, 1)$. The model approximates data distributions as points lying within the convex hull formed by the components (basis vectors). Also shown are two data points and their approximations by the model.

Not all probabilistic formulations provide such a clean geometric interpretation but in certain cases as outlined above, it can lead to interpretations that are intuitively helpful.

5. Discussion and Conclusions

In this paper, we have presented a family of latent variable models and shown their utility in the analysis of nonnegative data. We have shown that the latent variable model decompositions are numerically identical to the NMF algorithm that optimizes a Kullback-Leibler metric. Unlike previously reported results [34], the proof of equivalence requires no assumption about the distribution of the data, or indeed any assumption about the data besides nonnegativity. The algorithms presented in this paper primarily compute a probabilistic factorization of nonnegative data that optimizes the KL distance between the factored approximation and the actual data. We argue that the use of this approach presents a much more straightforward way to make easily extensible models. (It is not clear that the approach can be extended to similarly derive factorizations that optimize other Bregman divergences such as the $L_2$ metric; this is a topic for further investigation.)

To demonstrate this, we presented extensions that deal with tensorial data and shift invariances and that use priors in the estimation. The purpose of this paper is not to highlight the use of these approaches nor to present them thoroughly, but rather to demonstrate a methodology that allows easier experimentation with nonnegative data analysis and opens up possibilities for more stringent and probabilistic modeling than before. A rich variety of real-world applications and derivations of these and other models can be found in the references.

Acknowledgment

Madhusudana Shashanka acknowledges the support and helpful feedback received from Michael Giering at Mars, Inc.

References

  1. M. D. Plumbley and E. Oja, “A “nonnegative PCA” algorithm for independent component analysis,” IEEE Transactions on Neural Networks, vol. 15, no. 1, pp. 66–76, 2004.
  2. M. D. Plumbley, “Geometrical methods for non-negative ICA: manifolds, Lie groups and toral subalgebras,” Neurocomputing, vol. 67, pp. 161–197, 2005.
  3. D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
  4. M. Heiler and C. Schnörr, “Controlling sparseness in non-negative tensor factorization,” in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), vol. 3951, pp. 56–67, Graz, Austria, May 2006.
  5. A. Cichocki, R. Zdunek, S. Choi, R. Plemmons, and S. Amari, “Nonnegative tensor factorization using alpha and beta divergences,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), Honolulu, Hawaii, USA, April 2007.
  6. A. Shashua and T. Hazan, “Non-negative tensor factorization with applications to statistics and computer vision,” in Proceedings of the 22nd International Conference on Machine Learning (ICML '05), pp. 793–800, Bonn, Germany, August 2005.
  7. T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, vol. 42, no. 1-2, pp. 177–196, 2001.
  8. D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” The Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
  9. P. Lazarsfeld and N. Henry, Latent Structure Analysis, Houghton Mifflin, Boston, Mass, USA, 1968.
  10. J. Rost and R. Langeheine, Eds., Applications of Latent Trait and Latent Class Models in the Social Sciences, Waxmann, New York, NY, USA, 1997.
  11. L. A. Goodman, “Exploratory latent structure analysis using both identifiable and unidentifiable models,” Biometrika, vol. 61, no. 2, pp. 215–231, 1974.
  12. D. Lee and H. Seung, “Algorithms for non-negative matrix factorization,” in Proceedings of the 14th Annual Conference on Advances in Neural Information Processing Systems (NIPS '01), Vancouver, BC, Canada, December 2001.
  13. B. F. Green, Jr., “Latent structure analysis and its relation to factor analysis,” Journal of the American Statistical Association, vol. 47, no. 257, pp. 71–76, 1952.
  14. P. Smaragdis and B. Raj, “Shift-invariant probabilistic latent component analysis,” to appear in Journal of Machine Learning Research.
  15. P. Smaragdis, B. Raj, and M. Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixtures,” in Proceedings of the 7th International Conference on Independent Component Analysis and Blind Signal Separation (ICA '07), pp. 414–421, London, UK, September 2007.
  16. P. Smaragdis, B. Raj, and M. Shashanka, “A probabilistic latent variable model for acoustic modeling,” in Proceedings of the Advances in Models for Acoustic Processing Workshop (NIPS '06), Whistler, BC, Canada, December 2006.
  17. M. Shashanka, B. Raj, and P. Smaragdis, “Sparse overcomplete latent variable decomposition of counts data,” in Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS '07), Vancouver, BC, Canada, December 2007.
  18. M. Shashanka, Latent variable framework for modeling and separating single-channel acoustic sources, Ph.D. dissertation, Boston University, Boston, Mass, USA, 2007.
  19. E. Gaussier and C. Goutte, “Relation between PLSA and NMF and implications,” in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05), pp. 601–602, Salvador, Brazil, August 2005.
  20. B. Raj and P. Smaragdis, “Latent variable decomposition of spectrograms for single channel speaker separation,” in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '05), pp. 17–20, New Paltz, NY, USA, October 2005.
  21. M. Shashanka, B. Raj, and P. Smaragdis, “Probabilistic latent variable model for sparse decompositions of non-negative data,” to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence.
  22. M. Welling and M. Weber, “Positive tensor factorization,” Pattern Recognition Letters, vol. 22, no. 12, pp. 1255–1261, 2001.
  23. P. Smaragdis, B. Raj, and M. Shashanka, “Sparse and shift-invariant feature extraction from non-negative data,” in Proceedings of IEEE International Conference on Acoustic, Speech, and Signal Processing (ICASSP '08), Las Vegas, Nev, USA, March-April 2008.
  24. P. Smaragdis, “Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs,” in Proceedings of the 5th International Conference on Independent Component Analysis and Blind Signal Separation (ICA '04), vol. 3195, pp. 494–499, Granada, Spain, September 2004.
  25. P. Smaragdis, “Convolutive speech bases and their application to supervised speech separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 1, pp. 1–12, 2007.
  26. P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” The Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.
  27. M. Morup and M. Schmidt, “Sparse non-negative matrix factor 2-d deconvolution,” Technical University of Denmark, Lyngby, Denmark, 2006.
  28. J. Eggert and E. Korner, “Sparse coding and NMF,” in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '04), vol. 4, pp. 2529–2533, Budapest, Hungary, July 2004.
  29. D. Donoho, “For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution,” Communications on Pure and Applied Mathematics, vol. 59, no. 7, pp. 903–934, 2006.
  30. B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996.
  31. M. V. S. Shashanka, B. Raj, and P. Smaragdis, “Sparse overcomplete decomposition for single channel speaker separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), vol. 2, pp. 641–644, Honolulu, Hawaii, USA, April 2007.
  32. B. Raj, M. V. S. Shashanka, and P. Smaragdis, “Latent Dirichlet decomposition for single channel speaker separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 5, Toulouse, France, May 2006.
  33. D. Blei and J. Lafferty, “Correlated topic models,” in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), Vancouver, BC, Canada, December 2006.
  34. J. Canny, “GaP: a factor model for discrete data,” in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '04), Sheffield, UK, July 2004.