Recently, a considerable growth of interest in projected gradient (PG) methods has been
observed due to their high efficiency in solving large-scale convex minimization problems
subject to linear constraints. Since the minimization problems underlying nonnegative
matrix factorization (NMF) of large matrices well matches this class of minimization
problems, we investigate and test some recent PG methods in the context of their applicability
to NMF. In particular, the paper focuses on the following modified methods:
projected Landweber, Barzilai-Borwein gradient projection, projected sequential subspace
optimization (PSESOP), interior-point Newton (IPN), and sequential coordinate-wise.
The proposed and implemented NMF PG algorithms are compared with respect to their
performance in terms of signal-to-interference ratio (SIR) and elapsed time, using a simple
benchmark of mixed partially dependent nonnegative signals.
1. Introduction and Problem Statement
Nonnegative
matrix factorization (NMF) finds such nonnegative factors (matrices) and with a that ,
given the observation matrix ,
the lower-rank ,
and possibly other statistical information on the observed data or
the factors to be estimated.
This method has found a variety of real-world
applications in the areas such as blind separation of images and nonnegative
signals [1–6], spectra recovering [7–10], pattern recognition and feature extraction [11–16], dimensionality reduction,
segmentation and clustering [17–32], language modeling, text
mining [25, 33], music transcription
[34], and neurobiology (gene separation) [35, 36].
Depending on an application, the estimated factors may
have different interpretation. For example, Lee and Seung [11] introduced NMF as a method
for decomposing an image (face) into parts-based representations (parts
reminiscent of features such as lips, eyes, nose, etc.). In blind source
separation (BSS) [1, 37, 38], the matrix represents the observed mixed (superposed)
signals or images, is a mixing operator, and is a matrix of true source signals or images.
Each row of or is a signal or 1D image representation, where is a number of observed mixed signals and is a number of hidden (source) components. The
index usually denotes a sample (discrete time
instant), where is the number of available samples. In BSS, we
usually have ,
and is known or can be relatively easily estimated
using SVD or PCA.
Our objective is to estimate the mixing matrix and sources subject to nonnegativity constraints of all
the entries, given and possibly the knowledge on a statistical
distribution of noisy disturbances.
Obviously, NMF is not unique in general case, and it
is characterized by a scale and permutation indeterminacies. These problems
have been addressed recently by many researchers [39–42], and in this paper, the problems will not be
discussed. However, we have shown by extensive computer simulations that in
practice with overwhelming probability we are able to achieve a unique
nonnegative factorization (neglecting unavoidable scaling and permutation
ambiguities) if data are sufficiently sparse and a suitable NMF algorithm is
applied [43]. This is
consistent with very recent theoretical results [40].
The noise distribution is strongly
application-dependent, however, in many BSS applications, a Gaussian noise is
expected. Here our considerations are restricted to this case, however, the
alternative NMF algorithms optimized to different distributions of the noise
exist and can be found, for example, in [37, 44, 45].
NMF was proposed by Paatero and Tapper [46, 47] but Lee and Seung [11, 48] highly popularized this
method by using simple multiplicative algorithms to perform NMF. An extensive
study on convergence of multiplicative algorithms for NMF can be found in
[49]. In general, the
multiplicative algorithms are known to be very slowly convergent for
large-scale problems. Due to this reason, there is a need to search more
efficient and fast algorithms for NMF. Many approaches have been proposed in
the literature to relax these problems. One of them is to apply projected
gradient (PG) algorithms [50–53] or projected alternating least-squares (ALS)
algorithms [33, 54, 55] instead of multiplicative ones. Lin [52] suggested applying the
Armijo rule to estimate the learning parameters in projected gradient updates
for NMF. The NMF PG algorithms proposed by us in [53] also address the issue with
selecting such a learning parameter that is the steepest descent and also keeps
some distance to a boundary of the nonnegative orthant (subspace of real
nonnegative numbers). Another very robust technique concerns exploiting the
information from the second-order Taylor expansion term of a cost function to
speed up the convergence. This approach was proposed in [45, 56], where the mixing matrix is updated with the projected Newton method,
and the sources in are computed with the projected least-squares
method (the fixed point algorithm).
In this paper, we extend our approach to NMF that we
have initialized in [53]. We have investigated, extended, and tested several
recently proposed PG algorithms such as the oblique projected Landweber
[57], Barzilai-Borwein
gradient projection [58], projected sequential subspace optimization [59, 60], interior-point Newton
[61], and sequential
coordinate-wise [62].
All the presented algorithms in this paper are quite efficient for solving
large-scale minimization problems subject to nonnegativity and sparsity
constraints.
The main objective of this paper is to develop,
extend, and/or modify some of the most promising PG algorithms to a standard
NMF problem and to find optimal conditions or parameters for such a class of
NMF algorithms. The second objective is to compare the performance and
complexity of these algorithms for NMF problems, and to discover or establish
the most efficient and promising algorithms. We would like to emphasize that
most of the discussed algorithms have not been implemented neither used till
now or even tested before for NMF problems, but they have been rather
considered for solving a standard system of algebraic equations: (for only ) where the matrix and the vectors are known. In this paper, we consider a much
more difficult and complicated problem in which we have two sets of parameters
and additional constraints of nonnegativity and/or sparsity. So it was not
clear till now whether such algorithms would work efficiently for NMF problems,
and if so, what kind of projected algorithms is the
most efficient? To our best knowledge only the Lin-PG NMF algorithm is
widely used and known for NMF problems. We have demonstrated experimentally that there
are several novel PG gradient algorithms which are much
more efficient and consistent than the Lin-PG algorithm.
In Section 2, we briefly discuss the PG approach to
NMF. Section 3 describes the tested algorithms. The experimental results are
illustrated in Section 4. Finally, some conclusions are given in Section 5.
2. Projected Gradient Algorithms
In contrast to the multiplicative algorithms, the
class of PG algorithms has additive updates. The algorithms discussed here
approximately solve nonnegative least squares (NNLS) problems with the basic
alternating minimization technique that is used in NMF:or in the equivalent matrix formwhere , , , , , ,
and usually .
The matrix is assumed to be a full-rank matrix, so there
exists a unique solution for any matrix since the NNLS problem is strictly convex
(with respect to one set of variables ).
The solution to (1) satisfies the Karush-Kuhn-Tucker (KKT)
conditions:or in the matrix notationwhere and are the corresponding gradient vector and
gradient matrix:
Similarly, the KKT conditions for the solution to (2), and the solution to (4) are as follows:or in the matrix notation:where and are the gradient vector and gradient matrix of
the objective function:
There are many approaches to solve the problems (1)
and (2), or equivalently (3) and (4). In this paper, we discuss selected
projected gradient methods that can be generally expressed by iterative
updates: where is a projection of onto a convex feasible set ,
namely, the nonnegative orthant (the subspace of nonnegative real numbers), and are descent directions for and ,
and and are learning rules, respectively.
The projection can be performed in several ways. One of the
simplest techniques is to replace all negative entries in by zero-values, or in practical cases, by a
small positive number to avoid numerical instabilities. Thus,However, this is not the only
way to carry out the projection .
It is typically more efficient to choose the learning rates and so as to preserve nonnegativity of the
solutions. The nonnegativity can be also maintained by solving least-squares
problems subject to the constraints (6) and (10). Here we present the exemplary
PG methods that work for NMF problems quite efficiently, and we implemented
them in the Matlab toolboxm, NMFLAB/NTFLAB, for signal and image processing
[43]. For simplicity,
we focus our considerations on updating the matrix ,
assuming that the updates for can be obtained in a similar way. Note that
the updates for can be readily obtained solving the transposed
system ,
having fixed (updated in the previous step).
3. Algorithms
3.1. Oblique Projected Landweber Method
The Landweber method [63] performs gradient-descent
minimization with the following iterative scheme:where the gradient is given by
(8), and the learning rate .
The updating formula assures an asymptotical convergence to the minimal-norm
least squares solution for the convergence radius defined bywhere is the maximal eigenvalue of .
Since is a nonnegative matrix, we have ,
where . Thus the modified Landweber iterations can be expressed by the
formulaIn the obliqueprojected
Landweber (OPL) method [57], which can be regarded as a particular case of the PG
iterative formula (12), the solution obtained with (14) in each iterative
step is projected onto the feasible set. Finally, we have Algorithm 1 for
updating .
3.2. Projected Gradient
One of the fundamental PG algorithms for NMF was
proposed by Lin in [52]. This algorithm, which we refer to as the Lin-PG,
uses the Armijo rule along the projection arc to determine the steplengths and in the iterative updates (12). For the
cost function being the squared Euclidean distance, and .
For computation of ,
such a value of is decided, on whichwhere is the first nonnegative integer that satisfies The parameters and decide about a convergence speed. In this
algorithm we set , experimentally as default.
The Matlab implementation of the Lin-PG algorithm is
given in [52].
3.3. Barzilai-Borwein Gradient Projection
The Barzilai-Borwein gradient projection method
[58, 64] is motivated by the
quasi-Newton approach, that is, the inverse of the Hessian is replaced with an
identity matrix ,
where the scalar is selected so that the inverse Hessian has
similar behavior as the true Hessian in the recent iteration. Thus,In comparison to, for example,
Lin's method [52],
this method does not ensure that the objective function decreases at every
iteration, but its general convergence has been proven analytically [58]. The general scheme of the
Barzilai-Borwein gradient projection algorithm for updating is in
Algorithm 2.
Since is a quadratic function, the line search
parameter can be derived in the following closed-form formula:
The Matlab implementation of the GPSR-BB algorithm,
which solves the system of multiple measurement vectors subject to
nonnegativity constraints, is given in Algorithm 4 (see also NMFLAB).
3.4. Projected Sequential Subspace Optimization
The projected
sequential subspace optimization (PSESOP) method [59, 60] carries out a projected
minimization of a smooth objective function over a subspace spanned by several
directions which include the current gradient and gradient from the previous
iterations, and the Nemirovski directions. Nemirovski [65] suggested that convex
smooth unconstrained optimization is optimal if the optimization is performed
over a subspace which includes the current gradient ,
the directions ,
and the linear combination of the previous gradients with the coefficients .
The directions should be orthogonal to the current gradient. Narkiss and
Zibulevsky [59] also
suggested to include another direction: ,
which is motivated by a natural extension of the conjugate gradient (CG) method
to a nonquadratic case. However, our practical observations showed that this
direction does not have a strong impact on the NMF components, thus we
neglected it in our NMF-PSESOP algorithm. Finally, we have Algorithm 3 for
updating which is a single column vector of .
Algorithm 3: (NMF-PSESOP).
The parameter denotes the number of previous iterates that
are taken into account to determine the current update.
The line search vector derived in a closed form for the objective function is as follows:The regularization parameter can
be set as a very small constant to avoid instabilities in inverting a
rank-deficient matrix in case that has zero-value or dependent columns.
3.5. Interior Point Newton Algorithm
The interior point Newton (IPN) algorithm [61] solves the NNLS problem (1)
by searching the solution satisfying the KKT conditions (5) which equivalently
can be expressed by the nonlinear equationswhere , ,
andApplying the Newton method to
(22), we have in the th iterative stepwhereIn [61], the entries of the matrix are defined byfor .
If the solution is degenerate, that is, and ,
the matrix may be singular. To avoid such a case, the
system of equations has been rescaled to the following form:with for .
In [61], the system
(27) is solved by the inexact Newton method, which leads to the following
updates:where , ,
and is a projection onto a feasible set .
The transformation of the normal matrix by the matrix in (27) means the system matrix is no longer symmetric and positive-definite.
There are many methods for handling such systems of linear equations, like QMR
[66], BiCG [67, 68], BiCGSTAB [69], or GMRES-like methods
[70], however, they
are more complicated and computationally demanding than, for example, the basic
CG algorithm [71]. To
apply the CG algorithm the system matrix in (27) must be converted to a
positive-definite symmetric matrix, which can be easily done with normal
equations. The methods like CGLS [72] or LSQR [73] are therefore suitable for such tasks. The
transformed system has the formwith and .
Since our cost function is quadratic, its minimization
in a single step is performed with combining the projected Newton step with the
constrained scaled Cauchy step that is given in the formAssuming , is chosen as being either the unconstrained
minimizer of the quadratic function or a scalar proportional to the distance to
the boundary along ,
whereThuswhere with .
For ,
the global convergence is achieved if , with
The usage of the constrained scaled Cauchy step leads
to the following updates: with , and are given by (30) and (35), respectively, and is the smaller square root (laying in ) of the quadratic equation:
The Matlab code of the IPN algorithm, which solves the
system subject to nonnegativity constraints, is given
in Algorithm 5. To solve the transformed system (32), we use the LSQR method
implemented in Matlab 7.0.
3.6. Sequential Coordinate-Wise Algorithm
The NNLS problem (1) can be expressed in terms of the
following quadratic problem (QP) [62]:wherewith and .
The sequential coordinate-wise algorithm (SCWA)
proposed first by Franc et al. [62] solves the QP problem given by (41) updating only
single variable in one iterative step. It should be noted that
the sequential updates can be easily done, if the function is equivalently rewritten aswhere ,
and
Considering the optimization of with respect to the selected variable ,
the following analytical solution can be derived:
Updating only single variable in one iterative step, we haveAny optimal solution to the QP
(41) satisfies the KKT conditions given by (5) and the stationarity condition
of the following Lagrange function:where is a vector of Lagrange multipliers (or dual variables) corresponding
to the vector .
Thus, .
In the SCWA, the Lagrange multipliers are updated in each iteration according
to the formulawhere is the th column of ,
and .
Finally, the SCWA can take the following updates:
4. Simulations
All the proposed algorithms were implemented in our
NMFLAB, and evaluated with the numerical tests related to typical BSS problems.
We used the synthetic benchmark of 4 partially dependent nonnegative signals
(with only samples) which are illustrated in Figure 1(a). The signals are mixed by random, uniformly
distributed nonnegative matrix with the condition number .
The matrix is displayed inThe mixing signals are shown in
Figure 1(b).
Figure 1: Dataset: (a)
original 4 source signals, (b) observed 8 mixed signals.
Because the number of variables in is much greater than in ,
that is, and ,
we test the projected gradient algorithms only for updating .
The variables in are updated with the standard projected fixed
point alternating least squares (FP-ALS) algorithm that is extensively analyzed
in [55].
In general, the FP-ALS algorithm solves the
least-squares problemwith the Moore-Penrose
pseudoinverse of a system matrix, that is, in our case, the matrix .
Since in NMF usually ,
we formulate normal equations as ,
and the least-squares solution of minimal -norm to the normal equations is where is the Moore-Penrose pseudoinverse of .
The projected FP-ALS algorithm is obtained with a simple
“half-rectified” projection, that is,
The alternating minimization is nonconvex in spite of
the cost function being convex with respect to one set of variables. Thus, most
NMF algorithms may get stuck in local minima, and hence, the initialization
plays a key role. In the performed tests, we applied the multistart initialization
described in [53] with
the following parameters: (number of restarts), (number of initial alternating steps), and (number of final alternating steps). Each
initial sample of and has been randomly generated from a uniform
distribution. Each algorithm has been tested for two cases of inner iterations,
that is, with and .
The inner iterations mean a number of iterative steps that are performed to
update only (with fixed ,
i.e., before going to the update of and vice versa). Additionally, the multilayer
technique [53, 54] with 3 layers () is applied.
The multilayer technique can be regarded as multistep
decomposition. In the first step, we perform the basic decomposition using any available NMF algorithm, where and with .
In the second stage, the results obtained from the first stage are used to
perform the similar decomposition: ,
where and ,
using the same or different update rules, and so on. We continue our
decomposition taking into account only the last achieved components. The process
can be repeated arbitrary many times until some stopping criteria are
satisfied. In each step, we usually obtain gradual improvements of the
performance. Thus, our model has the form with the basis matrix defined as .
Physically, this means that we build up a system that has many layers or
cascade connection of mixing subsystems.
There are many stopping criteria for terminating the
alternating steps. We stop the iterations if or the following condition is held, where stands for the number of alternating step, and .
Note that the condition (20) can be also used as a stopping criterion,
especially as the gradient is computed in each iteration of the PG algorithms.
The algorithms have been evaluated with the
signal-to-interference ratio (SIR) measures, calculated separately for each
source signal and each column in the mixing matrix. Since NMF suffers from
scale and permutation indeterminacies, the estimated components are adequately
permuted and rescaled. First, the source and estimated signals are normalized
to a uniform variance, and then the estimated signals are permuted to keep the
same order as the source signals. In NMFLAB [43], each estimated signal is compared to each source
signal, which results in the performance (SIR) matrix that is involved to make
the permutation matrix. Let and be the th source and its corresponding (reordered)
estimated signal, respectively. Analogically, let and be the th column of the true and its corresponding
estimated mixing matrix, respectively. Thus, the SIRs for the sources are given
byand similarly for each column in we have
We test the algorithms with the Monte Carlo (MC)
analysis, running each algorithm 100 times. Each run has been initialized with
the multistart procedure. The algorithms have been evaluated with the mean-SIR
values that are calculated as follows:for each MC sample. The
mean-SIRs for the worst (with the lowest mean-SIR values) and best (with the
highest mean-SIR values) samples are given in Table 1. The number means the number of inner iterations for
updating ,
and denotes the number of layers in the multilayer
technique [53, 54]. The notation means that the multilayer technique was not
used. The elapsed time [in seconds] is measured in Matlab, and it informs us in
some sense about a degree of complexity of the algorithm.
Table 1: Mean-SIRs [dB] obtained with 100 samples of Monte
Carlo analysis for the estimation of sources and columns of mixing matrix from
noise-free mixtures of signals in Figure
1. Sources
are estimated with the projected
pseudoinverse. The number of inner iterations for updating
is denoted by
,
and the number of layers (in the multilayer technique) by
.
The notation
best or
worst in parenthesis that follows the
algorithm name means that the mean-SIR value is calculated for the best or
worst sample from Monte Carlo analysis, respectively. In the last column, the
elapsed time [in seconds] is given for each algorithm with
and
.
For comparison, Table 1 contains also the results
obtained for the standard multiplicative NMF algorithm (denoted as M-NMF) that
minimizes the squared Euclidean distance. Additionally, the results of testing
the PG algorithms which were proposed in [53] have been also included. The acronyms Lin-PG, IPG,
RMRNSD refer to the following algorithms: projected gradient proposed by Lin
[52], interior-point
gradient, and regularized minimal residual norm steepest descent (the
regularized version of the MRNSD algorithm that was proposed by Nagy and
Strakos [74]). These
NMF algorithms have been implemented and investigated in [53] in the context of their
usefulness to BSS problems.
5. Conclusions
The performance of the proposed NMF algorithms can be
inferred from the results given in Table 1. In particular, the results show how
the algorithms are sensitive to initialization, or in other words, how easily
they fall in local minima. Also the complexity of the algorithms can be
estimated from the information on the elapsed time that is measured in Matlab.
It is easy to notice that our NMF-PSESOP algorithm
gives the best estimation (the sample which has the highest best-SIR value),
and it gives only slightly lower mean-SIR values than the Lin-PG algorithm.
Considering the elapsed time, the PL, GPSR-BB, SCWA, and IPG belong to the
fastest algorithms, while the Lin-PG and IPN algorithms are the slowest.
The multilayer technique generally improves the
performance and consistency of all the tested algorithms if the number of
observation is close to the number of nonnegative components. The highest
improvement can be observed for the NMF-PSESOP algorithm, especially when the
number of inner iterations is greater than one (typically, ).
In summary, the best and the most promising NMG-PG
algorithms are NMF-PSESOP, GPSR-BB, and IPG algorithms. However, the final
selection of the algorithm depends on a size of the problem to be solved.
Nevertheless, the projected gradient NMF algorithms seem to be much better (in
the sense of speed and performance) in our tests than the multiplicative
algorithms, provided that we can use the squared Euclidean cost function which
is optimal for data with a Gaussian noise.