Abstract

Recently, a considerable growth of interest in projected gradient (PG) methods has been observed due to their high efficiency in solving large-scale convex minimization problems subject to linear constraints. Since the minimization problems underlying nonnegative matrix factorization (NMF) of large matrices well match this class of problems, we investigate and test some recent PG methods in the context of their applicability to NMF. In particular, the paper focuses on the following modified methods: projected Landweber, Barzilai-Borwein gradient projection, projected sequential subspace optimization (PSESOP), interior-point Newton (IPN), and sequential coordinate-wise. The proposed and implemented NMF PG algorithms are compared with respect to their performance in terms of signal-to-interference ratio (SIR) and elapsed time, using a simple benchmark of mixed partially dependent nonnegative signals.

1. Introduction and Problem Statement

Nonnegative matrix factorization (NMF) finds nonnegative factors (matrices) $\mathbf{A} = [a_{ij}] \in \mathbb{R}^{I \times J}$ and $\mathbf{X} = [x_{jt}] \in \mathbb{R}^{J \times T}$, with $a_{ij} \ge 0$ and $x_{jt} \ge 0$, such that $\mathbf{Y} \cong \mathbf{A}\mathbf{X}$, given the observation matrix $\mathbf{Y} = [y_{it}] \in \mathbb{R}^{I \times T}$, the lower rank $J$, and possibly other statistical information on the observed data or the factors to be estimated.

This method has found a variety of real-world applications in areas such as blind separation of images and nonnegative signals [1–6], spectra recovering [7–10], pattern recognition and feature extraction [11–16], dimensionality reduction, segmentation and clustering [17–32], language modeling, text mining [25, 33], music transcription [34], and neurobiology (gene separation) [35, 36].

Depending on the application, the estimated factors may have different interpretations. For example, Lee and Seung [11] introduced NMF as a method for decomposing an image (face) into parts-based representations (parts reminiscent of features such as lips, eyes, nose, etc.). In blind source separation (BSS) [1, 37, 38], the matrix $\mathbf{Y}$ represents the observed mixed (superposed) signals or images, $\mathbf{A}$ is a mixing operator, and $\mathbf{X}$ is a matrix of true source signals or images. Each row of $\mathbf{Y}$ or $\mathbf{X}$ is a signal or 1D image representation, where $I$ is the number of observed mixed signals and $J$ is the number of hidden (source) components. The index $t$ usually denotes a sample (discrete time instant), where $T$ is the number of available samples. In BSS, we usually have $T \gg I \ge J$, and $J$ is known or can be relatively easily estimated using SVD or PCA.

Our objective is to estimate the mixing matrix $\mathbf{A}$ and the sources $\mathbf{X}$ subject to nonnegativity constraints on all the entries, given $\mathbf{Y}$ and possibly some knowledge of the statistical distribution of the noisy disturbances.

Obviously, NMF is not unique in the general case, and it is characterized by scale and permutation indeterminacies. These problems have been addressed recently by many researchers [39–42], and they will not be discussed in this paper. However, we have shown by extensive computer simulations that, in practice, with overwhelming probability we are able to achieve a unique nonnegative factorization (neglecting unavoidable scaling and permutation ambiguities) if the data are sufficiently sparse and a suitable NMF algorithm is applied [43]. This is consistent with very recent theoretical results [40].

The noise distribution is strongly application-dependent; however, in many BSS applications Gaussian noise is expected. Here our considerations are restricted to this case; alternative NMF algorithms optimized for different noise distributions exist and can be found, for example, in [37, 44, 45].

NMF was proposed by Paatero and Tapper [46, 47], but Lee and Seung [11, 48] highly popularized this method by using simple multiplicative algorithms to perform NMF. An extensive study on the convergence of multiplicative algorithms for NMF can be found in [49]. In general, the multiplicative algorithms are known to be very slowly convergent for large-scale problems. For this reason, there is a need to search for more efficient and faster algorithms for NMF. Many approaches have been proposed in the literature to alleviate these problems. One of them is to apply projected gradient (PG) algorithms [50–53] or projected alternating least-squares (ALS) algorithms [33, 54, 55] instead of multiplicative ones. Lin [52] suggested applying the Armijo rule to estimate the learning parameters in projected gradient updates for NMF. The NMF PG algorithms proposed by us in [53] also address the issue of selecting a learning parameter that follows the steepest descent while keeping some distance to the boundary of the nonnegative orthant (the subspace of real nonnegative numbers). Another very robust technique exploits the information from the second-order Taylor expansion term of a cost function to speed up the convergence. This approach was proposed in [45, 56], where the mixing matrix $\mathbf{A}$ is updated with the projected Newton method, and the sources in $\mathbf{X}$ are computed with the projected least-squares method (the fixed point algorithm).

In this paper, we extend the approach to NMF that we initiated in [53]. We have investigated, extended, and tested several recently proposed PG algorithms: the oblique projected Landweber [57], Barzilai-Borwein gradient projection [58], projected sequential subspace optimization [59, 60], interior-point Newton [61], and sequential coordinate-wise [62] algorithms. All the algorithms presented in this paper are quite efficient for solving large-scale minimization problems subject to nonnegativity and sparsity constraints.

The main objective of this paper is to develop, extend, and/or modify some of the most promising PG algorithms for the standard NMF problem, and to find optimal conditions or parameters for this class of NMF algorithms. The second objective is to compare the performance and complexity of these algorithms on NMF problems, and to identify the most efficient and promising ones. We would like to emphasize that most of the discussed algorithms have neither been implemented nor tested for NMF problems until now; they have rather been considered for solving a standard system of algebraic equations $\mathbf{A}\mathbf{x}^{(k)} = \mathbf{y}^{(k)}$ (for only $k = 1$), where the matrix $\mathbf{A}$ and the vectors $\mathbf{y}$ are known. In this paper, we consider a much more difficult and complicated problem in which we have two sets of parameters and additional constraints of nonnegativity and/or sparsity. So it was not clear until now whether such algorithms would work efficiently for NMF problems, and if so, which kind of projected algorithm is the most efficient. To the best of our knowledge, only the Lin-PG NMF algorithm is widely used and known for NMF problems. We demonstrate experimentally that there are several novel PG algorithms which are much more efficient and consistent than the Lin-PG algorithm.

In Section 2, we briefly discuss the PG approach to NMF. Section 3 describes the tested algorithms. The experimental results are illustrated in Section 4. Finally, some conclusions are given in Section 5.

2. Projected Gradient Algorithms

In contrast to the multiplicative algorithms, the class of PG algorithms has additive updates. The algorithms discussed here approximately solve nonnegative least squares (NNLS) problems with the basic alternating minimization technique that is used in NMF:
$$\min_{\mathbf{x}_t \ge 0} D_F(\mathbf{y}_t \,\|\, \mathbf{A}\mathbf{x}_t) = \frac{1}{2}\|\mathbf{y}_t - \mathbf{A}\mathbf{x}_t\|_2^2, \quad t = 1,\ldots,T, \qquad (1)$$
$$\min_{\underline{\mathbf{a}}_i \ge 0} D_F(\underline{\mathbf{y}}_i \,\|\, \mathbf{X}^T\underline{\mathbf{a}}_i) = \frac{1}{2}\|\underline{\mathbf{y}}_i - \mathbf{X}^T\underline{\mathbf{a}}_i\|_2^2, \quad i = 1,\ldots,I, \qquad (2)$$
or in the equivalent matrix form
$$\min_{x_{jt} \ge 0} D_F(\mathbf{Y} \,\|\, \mathbf{A}\mathbf{X}) = \frac{1}{2}\|\mathbf{Y} - \mathbf{A}\mathbf{X}\|_F^2, \qquad (3)$$
$$\min_{a_{ij} \ge 0} D_F(\mathbf{Y}^T \,\|\, \mathbf{X}^T\mathbf{A}^T) = \frac{1}{2}\|\mathbf{Y}^T - \mathbf{X}^T\mathbf{A}^T\|_F^2, \qquad (4)$$
where $\mathbf{A} = [\mathbf{a}_1,\ldots,\mathbf{a}_J] \in \mathbb{R}^{I \times J}$, $\mathbf{A}^T = [\underline{\mathbf{a}}_1,\ldots,\underline{\mathbf{a}}_I] \in \mathbb{R}^{J \times I}$, $\mathbf{X} = [\mathbf{x}_1,\ldots,\mathbf{x}_T] \in \mathbb{R}^{J \times T}$, $\mathbf{X}^T = [\underline{\mathbf{x}}_1,\ldots,\underline{\mathbf{x}}_J] \in \mathbb{R}^{T \times J}$, $\mathbf{Y} = [\mathbf{y}_1,\ldots,\mathbf{y}_T] \in \mathbb{R}^{I \times T}$, $\mathbf{Y}^T = [\underline{\mathbf{y}}_1,\ldots,\underline{\mathbf{y}}_I] \in \mathbb{R}^{T \times I}$, and usually $I \ge J$. The matrix $\mathbf{A}$ is assumed to be a full-rank matrix, so there exists a unique solution $\mathbf{X}^* \in \mathbb{R}^{J \times T}$ for any matrix $\mathbf{Y}$ since the NNLS problem is strictly convex (with respect to one set of variables $\{\mathbf{X}\}$).

The solution $\mathbf{x}_t^*$ to (1) satisfies the Karush-Kuhn-Tucker (KKT) conditions:
$$\mathbf{x}_t^* \ge 0, \qquad \mathbf{g}_X(\mathbf{x}_t^*) \ge 0, \qquad \mathbf{g}_X(\mathbf{x}_t^*)^T \mathbf{x}_t^* = 0, \qquad (5)$$
or in the matrix notation
$$\mathbf{X}^* \ge 0, \qquad \mathbf{G}_X(\mathbf{X}^*) \ge 0, \qquad \mathrm{tr}\{\mathbf{G}_X(\mathbf{X}^*)^T \mathbf{X}^*\} = 0, \qquad (6)$$
where $\mathbf{g}_X$ and $\mathbf{G}_X$ are the corresponding gradient vector and gradient matrix:
$$\mathbf{g}_X(\mathbf{x}_t) = \nabla_{\mathbf{x}_t} D_F(\mathbf{y}_t \,\|\, \mathbf{A}\mathbf{x}_t) = \mathbf{A}^T(\mathbf{A}\mathbf{x}_t - \mathbf{y}_t), \qquad (7)$$
$$\mathbf{G}_X(\mathbf{X}) = \nabla_{\mathbf{X}} D_F(\mathbf{Y} \,\|\, \mathbf{A}\mathbf{X}) = \mathbf{A}^T(\mathbf{A}\mathbf{X} - \mathbf{Y}). \qquad (8)$$

Similarly, the KKT conditions for the solution $\underline{\mathbf{a}}_i^*$ to (2), and for the solution $\mathbf{A}^*$ to (4), are as follows:
$$\underline{\mathbf{a}}_i^* \ge 0, \qquad \mathbf{g}_A(\underline{\mathbf{a}}_i^*) \ge 0, \qquad \mathbf{g}_A(\underline{\mathbf{a}}_i^*)^T \underline{\mathbf{a}}_i^* = 0, \qquad (9)$$
or in the matrix notation:
$$\mathbf{A}^* \ge 0, \qquad \mathbf{G}_A(\mathbf{A}^*) \ge 0, \qquad \mathrm{tr}\{\mathbf{G}_A(\mathbf{A}^*)\,(\mathbf{A}^*)^T\} = 0, \qquad (10)$$
where $\mathbf{g}_A$ and $\mathbf{G}_A$ are the gradient vector and gradient matrix of the objective function:
$$\mathbf{g}_A(\underline{\mathbf{a}}_i) = \nabla_{\underline{\mathbf{a}}_i} D_F(\underline{\mathbf{y}}_i \,\|\, \mathbf{X}^T\underline{\mathbf{a}}_i) = \mathbf{X}(\mathbf{X}^T\underline{\mathbf{a}}_i - \underline{\mathbf{y}}_i), \qquad \mathbf{G}_A(\mathbf{A}) = \nabla_{\mathbf{A}} D_F(\mathbf{Y}^T \,\|\, \mathbf{X}^T\mathbf{A}^T) = (\mathbf{A}\mathbf{X} - \mathbf{Y})\mathbf{X}^T. \qquad (11)$$

There are many approaches to solve the problems (1) and (2), or equivalently (3) and (4). In this paper, we discuss selected projected gradient methods that can be generally expressed by the iterative updates:
$$\mathbf{X}^{(k+1)} = P_\Omega\!\left[\mathbf{X}^{(k)} - \eta_X^{(k)} \mathbf{P}_X^{(k)}\right], \qquad \mathbf{A}^{(k+1)} = P_\Omega\!\left[\mathbf{A}^{(k)} - \eta_A^{(k)} \mathbf{P}_A^{(k)}\right], \qquad (12)$$
where $P_\Omega[\xi]$ is a projection of $\xi$ onto the convex feasible set $\Omega = \{\xi : \xi \ge 0\}$, namely, the nonnegative orthant $\mathbb{R}_+$ (the subspace of nonnegative real numbers), $\mathbf{P}_X^{(k)}$ and $\mathbf{P}_A^{(k)}$ are descent directions for $\mathbf{X}$ and $\mathbf{A}$, and $\eta_X^{(k)}$ and $\eta_A^{(k)}$ are the corresponding learning rates.

The projection $P_\Omega[\xi]$ can be performed in several ways. One of the simplest techniques is to replace all negative entries in $\xi$ by zero, or in practical cases, by a small positive number $\epsilon$ to avoid numerical instabilities. Thus,
$$P_\Omega[\xi] = \max\{\epsilon, \xi\}. \qquad (13)$$
However, this is not the only way to carry out the projection $P_\Omega[\xi]$. It is typically more efficient to choose the learning rates $\eta_X^{(k)}$ and $\eta_A^{(k)}$ so as to preserve nonnegativity of the solutions. The nonnegativity can also be maintained by solving least-squares problems subject to the constraints (6) and (10). Here we present exemplary PG methods that work quite efficiently for NMF problems, and we implemented them in the Matlab toolboxes NMFLAB/NTFLAB for signal and image processing [43]. For simplicity, we focus our considerations on updating the matrix $\mathbf{X}$, assuming that the updates for $\mathbf{A}$ can be obtained in a similar way. Note that the updates for $\mathbf{A}$ can be readily obtained by solving the transposed system $\mathbf{X}^T\mathbf{A}^T = \mathbf{Y}^T$, with $\mathbf{X}$ fixed (updated in the previous step).
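In Matlab, the element-wise projection (13) amounts to a single statement; a minimal sketch, where the machine epsilon eps plays the role of the small constant $\epsilon$ (our illustrative choice):

% Projection onto the nonnegative orthant, (13): entries below eps are clipped
P_Omega = @(Xi) max(eps, Xi);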

3. Algorithms

3.1. Oblique Projected Landweber Method

The Landweber method [63] performs gradient-descent minimization with the following iterative scheme:
$$\mathbf{X}^{(k+1)} = \mathbf{X}^{(k)} - \eta\,\mathbf{G}_X^{(k)}, \qquad (14)$$
where the gradient is given by (8), and the learning rate $\eta \in (0, \eta_{\max})$. The updating formula assures an asymptotic convergence to the minimal-norm least squares solution for the convergence radius defined by
$$\eta_{\max} = \frac{2}{\lambda_{\max}(\mathbf{A}^T\mathbf{A})}, \qquad (15)$$
where $\lambda_{\max}(\mathbf{A}^T\mathbf{A})$ is the maximal eigenvalue of $\mathbf{A}^T\mathbf{A}$. Since $\mathbf{A}$ is a nonnegative matrix, we have $\lambda_{\max}(\mathbf{A}^T\mathbf{A}) \le \max_j [\mathbf{A}^T\mathbf{A}\mathbf{1}_J]_j$, where $\mathbf{1}_J = [1,\ldots,1]^T \in \mathbb{R}^J$. Thus the modified Landweber iterations can be expressed by the formula
$$\mathbf{X}^{(k+1)} = P_\Omega\!\left[\mathbf{X}^{(k)} - \mathrm{diag}\{\eta_j\}\,\mathbf{G}_X^{(k)}\right], \quad \text{where } \eta_j = \frac{2}{[\mathbf{A}^T\mathbf{A}\mathbf{1}_J]_j}. \qquad (16)$$
In the oblique projected Landweber (OPL) method [57], which can be regarded as a particular case of the PG iterative formula (12), the solution obtained with (14) in each iterative step is projected onto the feasible set. Finally, we have Algorithm 1 for updating $\mathbf{X}$.

Set A, X, % Initialization
η_j^(X) = 2/(A^T A 1_J)_j,
For k = 1, 2, ..., % Inner iterations for X
G_X = A^T (A X − Y), % Gradient with respect to X
X ← P_Ω[X − diag{η_j} G_X], % Updating
End
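For completeness, a minimal Matlab sketch of Algorithm 1 is given below; the function name nmf_opl and the use of eps as the projection threshold are our illustrative choices, not part of the original OPL formulation.

% Oblique projected Landweber (OPL) update of X (illustrative sketch)
function X = nmf_opl(A,Y,X,no_iter)
B = A'*A; Yt = A'*Y;
eta = 2./(B*ones(size(B,1),1)); % eta_j = 2/[A'A 1_J]_j, see (16)
for k = 1:no_iter
G = B*X - Yt; % gradient with respect to X, see (8)
X = max(eps, X - diag(eta)*G); % projected update (16)
end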

3.2. Projected Gradient

One of the fundamental PG algorithms for NMF was proposed by Lin in [52]. This algorithm, which we refer to as Lin-PG, uses the Armijo rule along the projection arc to determine the steplengths $\eta_X^{(k)}$ and $\eta_A^{(k)}$ in the iterative updates (12). For the cost function being the squared Euclidean distance, $\mathbf{P}_X = (\mathbf{A}^{(k)})^T(\mathbf{A}^{(k)}\mathbf{X}^{(k)} - \mathbf{Y})$ and $\mathbf{P}_A = (\mathbf{A}^{(k)}\mathbf{X}^{(k+1)} - \mathbf{Y})(\mathbf{X}^{(k+1)})^T$.

For the computation of $\mathbf{X}$, the steplength is chosen as
$$\eta_X^{(k)} = \beta^{m_k}, \qquad (17)$$
where $m_k$ is the first nonnegative integer $m$ for which
$$D_F(\mathbf{Y}\,\|\,\mathbf{A}\mathbf{X}^{(k+1)}) - D_F(\mathbf{Y}\,\|\,\mathbf{A}\mathbf{X}^{(k)}) \le \sigma\left\langle \nabla_X D_F(\mathbf{Y}\,\|\,\mathbf{A}\mathbf{X}^{(k)}),\; \mathbf{X}^{(k+1)} - \mathbf{X}^{(k)} \right\rangle. \qquad (18)$$
The parameters $\beta \in (0,1)$ and $\sigma \in (0,1)$ determine the convergence speed. In this algorithm, we experimentally set $\sigma = 0.01$ and $\beta = 0.1$ as defaults.

The Matlab implementation of the Lin-PG algorithm is given in [52].
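As a rough illustration of the search in (17)-(18) (not Lin's actual implementation from [52]), the following sketch tries steplengths $\eta = \beta^m$ for $m = 0, 1, \ldots$ until the Armijo condition holds; the variable names and the iteration cap are our own choices.

% Armijo rule along the projection arc for the X update (illustrative sketch)
beta = 0.1; sigma = 0.01; eta = 1;
G = A'*(A*X - Y); % gradient (8)
F0 = 0.5*norm(Y - A*X,'fro')^2;
for m = 0:20 % try eta = beta^m
Xn = max(0, X - eta*G); % projected trial point
Fn = 0.5*norm(Y - A*Xn,'fro')^2;
if Fn - F0 <= sigma*sum(sum(G.*(Xn - X))) % condition (18)
break;
end
eta = beta*eta;
end
X = Xn;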

3.3. Barzilai-Borwein Gradient Projection

The Barzilai-Borwein gradient projection method [58, 64] is motivated by the quasi-Newton approach: the inverse of the Hessian is replaced with a scaled identity matrix $\mathbf{H}_k = \alpha_k\mathbf{I}$, where the scalar $\alpha_k$ is selected so that this approximation behaves similarly to the true inverse Hessian over the most recent iteration. Thus,
$$\mathbf{X}^{(k+1)} - \mathbf{X}^{(k)} \approx \alpha_k\left[\nabla_X D_F(\mathbf{Y}\,\|\,\mathbf{A}^{(k)}\mathbf{X}^{(k+1)}) - \nabla_X D_F(\mathbf{Y}\,\|\,\mathbf{A}^{(k)}\mathbf{X}^{(k)})\right]. \qquad (19)$$
In comparison to, for example, Lin's method [52], this method does not ensure that the objective function decreases at every iteration, but its general convergence has been proven analytically [58]. The general scheme of the Barzilai-Borwein gradient projection algorithm for updating $\mathbf{X}$ is given in Algorithm 2.

Set A, X, α_min, α_max, α^(0) ∈ [α_min, α_max]^T % Initialization
For k = 1, 2, ..., % Inner iterations
Δ^(k) = P_Ω[X^(k) − α^(k) ∇_X D_F(Y||A X^(k))] − X^(k),
λ^(k) = arg min_{λ_t^(k) ∈ [0,1]} D_F(Y||A(X + Δ^(k) diag{λ})), where λ = [λ_t] ∈ R^T,
X^(k+1) = X^(k) + Δ^(k) diag{λ^(k)},
γ^(k) = diag{(Δ^(k))^T A^T A Δ^(k)},
If γ_t^(k) = 0: α_t^(k+1) = α_max,
Else: α_t^(k+1) = min{α_max, max{α_min, [(Δ^(k))^T Δ^(k)]_tt / γ_t^(k)}},
End
End % Inner iterations

Since $D_F(\mathbf{Y}\,\|\,\mathbf{A}\mathbf{X})$ is a quadratic function, the line search parameter $\boldsymbol{\lambda}^{(k)}$ can be derived in the following closed-form formula:
$$\boldsymbol{\lambda}^{(k)} = \max\!\left\{0,\; \min\!\left\{1,\; -\frac{\mathrm{diag}\{(\Delta^{(k)})^T\,\nabla_X D_F(\mathbf{Y}\,\|\,\mathbf{A}\mathbf{X})\}}{\mathrm{diag}\{(\Delta^{(k)})^T \mathbf{A}^T\mathbf{A}\,\Delta^{(k)}\}}\right\}\right\}. \qquad (20)$$

The Matlab implementation of the GPSR-BB algorithm, which solves the system 𝐀𝐗=𝐘 of multiple measurement vectors subject to nonnegativity constraints, is given in Algorithm 4 (see also NMFLAB).

3.4. Projected Sequential Subspace Optimization

The projected sequential subspace optimization (PSESOP) method [59, 60] carries out a projected minimization of a smooth objective function over a subspace spanned by several directions, which include the current gradient, gradients from previous iterations, and the Nemirovski directions. Nemirovski [65] suggested that convex smooth unconstrained optimization is optimal if the optimization is performed over a subspace which includes the current gradient $\mathbf{g}(\mathbf{x})$, the direction $\mathbf{d}_1^{(k)} = \mathbf{x}^{(k)} - \mathbf{x}^{(0)}$, and the linear combination of the previous gradients $\mathbf{d}_2^{(k)} = \sum_{n=0}^{k-1} w_n\,\mathbf{g}(\mathbf{x}^{(n)})$ with the coefficients $w_n$, $n = 0,\ldots,k-1$. These directions should be orthogonal to the current gradient. Narkiss and Zibulevsky [59] also suggested including another direction, $\mathbf{p}^{(k)} = \mathbf{x}^{(k)} - \mathbf{x}^{(k-1)}$, which is motivated by a natural extension of the conjugate gradient (CG) method to a nonquadratic case. However, our practical observations showed that this direction does not have a strong impact on the NMF components, so we neglected it in our NMF-PSESOP algorithm. Finally, we have Algorithm 3 for updating $\mathbf{x}_t$, which is a single column vector of $\mathbf{X}$.

Set A, x_t^(0), p % Initialization
For k = 1, 2, ..., % Inner iterations
d_1^(k) = x^(k) − x^(0),
g^(k) = ∇_{x_t} D_F(y_t||A x_t),
G^(p) = [g^(k−1), g^(k−2), ..., g^(k−p)] ∈ R^{J×p},
w_k = 1 if k = 1, w_k = 1/2 + sqrt(1/4 + w_{k−1}^2) if k > 1,
w^(k) = [w_k, w_{k−1}, ..., w_{k−p+1}]^T ∈ R^p,
d_2^(k) = G^(p) w^(k),
D^(k) = [d_1^(k), d_2^(k), g^(k), G^(p)],
α^(k) = arg min_α D_F(y_t||A(x_t^(k) + D^(k) α)),
x^(k+1) = P_Ω[x^(k) + D^(k) α^(k)]
End % Inner iterations

% Barzilai-Borwein gradient projection (GPSR-BB) algorithm
%
function [X] = nmf_gpsr_bb(A,Y,X,no_iter)
%
% [X] = nmf_gpsr_bb(A,Y,X,no_iter) finds such matrix X that solves
% the equation AX = Y subject to nonnegativity constraints.
%
% INPUTS:
% A - system matrix of dimension [I by J]
% Y - matrix of observations [I by T]
% X - matrix of initial guess [J by T]
% no_iter - maximum number of iterations
%
% OUTPUTS:
% X - matrix of estimated sources [J by T]
%
% #########################################################################
% Parameters
alpha_min = 1E-8; alpha_max = 1;
alpha = .1*ones(1,size(Y,2));
B = A'*A; Yt = A'*Y;
for k=1:no_iter
G = B*X - Yt; % gradient of D_F with respect to X, see (8)
delta = max(eps, X - repmat(alpha,size(G,1),1).*G) - X; % projected step Delta^(k)
deltaB = B*delta;
lambda = max(0, min(1, -sum(delta.*G,1)./(sum(delta.*deltaB,1) + eps))); % line search (20)
X = max(eps,X + delta.*repmat(lambda,size(delta,1),1));
gamma = sum(delta.*deltaB,1) + eps;
if gamma % all entries nonzero
alpha = min(alpha_max, max(alpha_min, sum(delta.^2,1)./gamma ));
else
alpha = alpha_max;
end
end

The parameter 𝑝 denotes the number of previous iterates that are taken into account to determine the current update.

The line search vector $\boldsymbol{\alpha}^{(k)}$, derived in closed form for the objective function $D_F(\mathbf{y}_t\,\|\,\mathbf{A}\mathbf{x}_t)$, is as follows:
$$\boldsymbol{\alpha}^{(k)} = -\left[(\mathbf{D}^{(k)})^T\mathbf{A}^T\mathbf{A}\,\mathbf{D}^{(k)} + \lambda\mathbf{I}\right]^{-1}(\mathbf{D}^{(k)})^T\,\nabla_{\mathbf{x}_t} D_F(\mathbf{y}_t\,\|\,\mathbf{A}\mathbf{x}_t). \qquad (21)$$
The regularization parameter $\lambda$ can be set as a very small constant to avoid instabilities in inverting a rank-deficient matrix in case $\mathbf{D}^{(k)}$ has zero-valued or dependent columns.
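A condensed Matlab sketch of the NMF-PSESOP update of a single column is given below. It is only illustrative: the function name, the regularization constant lam, and the simplified bookkeeping of the weights $w_k$ and of the stored gradients are our assumptions, while the search directions and the closed-form step follow Algorithm 3 and (21).

% NMF-PSESOP update of one column x = x_t (illustrative, simplified sketch)
function x = nmf_psesop(A,y,x,no_iter,p)
lam = 1e-12; % small regularization for (21)
B = A'*A; yt = A'*y;
x0 = x; w = 1; Gp = [];
for k = 1:no_iter
g = B*x - yt; % current gradient, see (7)
d1 = x - x0; % Nemirovski direction x^(k) - x^(0)
Gp = [g, Gp]; Gp = Gp(:,1:min(p,size(Gp,2))); % at most p recent gradients
if k > 1, w = [0.5 + sqrt(0.25 + w(1)^2), w]; end
w = w(1:size(Gp,2));
d2 = Gp*w(:); % weighted combination of past gradients
D = [d1, d2, g, Gp];
alpha = -(D'*B*D + lam*eye(size(D,2))) \ (D'*g); % closed-form step (21)
x = max(eps, x + D*alpha); % projected update
end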

3.5. Interior Point Newton Algorithm

The interior point Newton (IPN) algorithm [61] solves the NNLS problem (1) by searching for the solution that satisfies the KKT conditions (5), which can equivalently be expressed by the nonlinear equations
$$\mathbf{D}(\mathbf{x}_t)\,\mathbf{g}(\mathbf{x}_t) = 0, \qquad (22)$$
where $\mathbf{D}(\mathbf{x}_t) = \mathrm{diag}\{d_1(\mathbf{x}_t),\ldots,d_J(\mathbf{x}_t)\}$, $\mathbf{x}_t \ge 0$, and
$$d_j(\mathbf{x}_t) = \begin{cases} x_{jt} & \text{if } g_j(\mathbf{x}_t) \ge 0, \\ 1 & \text{otherwise.} \end{cases} \qquad (23)$$
Applying the Newton method to (22), we have in the $k$th iterative step
$$\left[\mathbf{D}_k(\mathbf{x}_t)\mathbf{A}^T\mathbf{A} + \mathbf{E}_k(\mathbf{x}_t)\right]\mathbf{p}_k = -\mathbf{D}_k(\mathbf{x}_t)\,\mathbf{g}_k(\mathbf{x}_t), \qquad (24)$$
where
$$\mathbf{E}_k(\mathbf{x}_t) = \mathrm{diag}\{e_1(\mathbf{x}_t),\ldots,e_J(\mathbf{x}_t)\}. \qquad (25)$$
In [61], the entries of the matrix $\mathbf{E}_k(\mathbf{x}_t)$ are defined by
$$e_j(\mathbf{x}_t) = \begin{cases} g_j(\mathbf{x}_t) & \text{if } 0 \le g_j(\mathbf{x}_t) < x_{jt}^{\gamma}, \text{ or } g_j(\mathbf{x}_t)^{\gamma} > x_{jt}, \\ 1 & \text{otherwise,} \end{cases} \qquad (26)$$
for $1 < \gamma \le 2$.

If the solution is degenerate, that is, $\exists t, j: x_{jt}^* = 0$ and $g_{jt} = 0$, the matrix $\mathbf{D}_k(\mathbf{x}_t)\mathbf{A}^T\mathbf{A} + \mathbf{E}_k(\mathbf{x}_t)$ may be singular. To avoid such a case, the system of equations has been rescaled to the following form:
$$\mathbf{W}_k(\mathbf{x}_t)\mathbf{D}_k(\mathbf{x}_t)\mathbf{M}_k(\mathbf{x}_t)\,\mathbf{p}_k = -\mathbf{W}_k(\mathbf{x}_t)\mathbf{D}_k(\mathbf{x}_t)\,\mathbf{g}_k(\mathbf{x}_t) \qquad (27)$$
with
$$\mathbf{M}_k(\mathbf{x}_t) = \mathbf{A}^T\mathbf{A} + \mathbf{D}_k(\mathbf{x}_t)^{-1}\mathbf{E}_k(\mathbf{x}_t), \qquad \mathbf{W}_k(\mathbf{x}_t) = \mathrm{diag}\{w_1(\mathbf{x}_t),\ldots,w_J(\mathbf{x}_t)\}, \qquad w_j(\mathbf{x}_t) = \left(d_j(\mathbf{x}_t) + e_j(\mathbf{x}_t)\right)^{-1}, \qquad (28)$$
for $\mathbf{x}_t > 0$. In [61], the system (27) is solved by the inexact Newton method, which leads to the following updates:
$$\mathbf{W}_k(\mathbf{x}_t)\mathbf{D}_k(\mathbf{x}_t)\mathbf{M}_k(\mathbf{x}_t)\,\mathbf{p}_k = -\mathbf{W}_k(\mathbf{x}_t)\mathbf{D}_k(\mathbf{x}_t)\,\mathbf{g}_k(\mathbf{x}_t) + \mathbf{r}_k(\mathbf{x}_t), \qquad (29)$$
$$\hat{\mathbf{p}}_k = \max\!\left\{\sigma,\; 1 - \left\|P_\Omega\!\left[\mathbf{x}_t^{(k)} + \mathbf{p}_k\right] - \mathbf{x}_t^{(k)}\right\|_2\right\}\left(P_\Omega\!\left[\mathbf{x}_t^{(k)} + \mathbf{p}_k\right] - \mathbf{x}_t^{(k)}\right), \qquad (30)$$
$$\mathbf{x}_t^{(k+1)} = \mathbf{x}_t^{(k)} + \hat{\mathbf{p}}_k, \qquad (31)$$
where $\sigma \in (0,1)$, $\mathbf{r}_k(\mathbf{x}_t) = \mathbf{A}^T(\mathbf{A}\mathbf{x}_t - \mathbf{y}_t)$, and $P_\Omega[\cdot]$ is a projection onto the feasible set $\Omega$.

The transformation of the normal matrix $\mathbf{A}^T\mathbf{A}$ by the matrix $\mathbf{W}_k(\mathbf{x}_t)\mathbf{D}_k(\mathbf{x}_t)$ in (27) means that the system matrix $\mathbf{W}_k(\mathbf{x}_t)\mathbf{D}_k(\mathbf{x}_t)\mathbf{M}_k(\mathbf{x}_t)$ is no longer symmetric and positive-definite. There are many methods for handling such systems of linear equations, like QMR [66], BiCG [67, 68], BiCGSTAB [69], or GMRES-like methods [70]; however, they are more complicated and computationally demanding than, for example, the basic CG algorithm [71]. To apply the CG algorithm, the system matrix in (27) must be converted to a positive-definite symmetric matrix, which can easily be done with normal equations. Methods like CGLS [72] or LSQR [73] are therefore suitable for such tasks. The transformed system has the form
$$\mathbf{Z}_k(\mathbf{x}_t)\,\tilde{\mathbf{p}}_k = -\mathbf{S}_k(\mathbf{x}_t)\,\mathbf{g}_k(\mathbf{x}_t) + \tilde{\mathbf{r}}_k(\mathbf{x}_t), \qquad (32)$$
$$\mathbf{S}_k(\mathbf{x}_t) = \left[\mathbf{W}_k(\mathbf{x}_t)\mathbf{D}_k(\mathbf{x}_t)\right]^{1/2}, \qquad (33)$$
$$\mathbf{Z}_k(\mathbf{x}_t) = \mathbf{S}_k(\mathbf{x}_t)\mathbf{M}_k(\mathbf{x}_t)\mathbf{S}_k(\mathbf{x}_t) = \mathbf{S}_k(\mathbf{x}_t)\mathbf{A}^T\mathbf{A}\,\mathbf{S}_k(\mathbf{x}_t) + \mathbf{W}_k(\mathbf{x}_t)\mathbf{E}_k(\mathbf{x}_t), \qquad (34)$$
with $\tilde{\mathbf{p}}_k = \mathbf{S}_k^{-1}(\mathbf{x}_t)\,\mathbf{p}_k$ and $\tilde{\mathbf{r}}_k = \mathbf{S}_k^{-1}(\mathbf{x}_t)\,\mathbf{r}_k(\mathbf{x}_t)$.

Since our cost function is quadratic, its minimization in a single step is performed by combining the projected Newton step with the constrained scaled Cauchy step, which is given in the form
$$\mathbf{p}_k^{(C)} = -\tau_k\,\mathbf{D}_k(\mathbf{x}_t)\,\mathbf{g}_k(\mathbf{x}_t), \qquad \tau_k > 0. \qquad (35)$$
Assuming $\mathbf{x}_t^{(k)} + \mathbf{p}_k^{(C)} > 0$, $\tau_k$ is chosen as either the unconstrained minimizer of the quadratic function $\psi_k(-\tau\,\mathbf{D}_k(\mathbf{x}_t)\mathbf{g}_k(\mathbf{x}_t))$ or a scalar proportional to the distance to the boundary along $-\mathbf{D}_k(\mathbf{x}_t)\mathbf{g}_k(\mathbf{x}_t)$, where
$$\psi_k(\mathbf{p}) = \frac{1}{2}\mathbf{p}^T\mathbf{M}_k(\mathbf{x}_t)\,\mathbf{p} + \mathbf{p}^T\mathbf{g}_k(\mathbf{x}_t) = \frac{1}{2}\mathbf{p}^T\left[\mathbf{A}^T\mathbf{A} + \mathbf{D}_k^{-1}(\mathbf{x}_t)\mathbf{E}_k(\mathbf{x}_t)\right]\mathbf{p} + \mathbf{p}^T\mathbf{A}^T\!\left(\mathbf{A}\mathbf{x}_t^{(k)} - \mathbf{y}_t\right). \qquad (36)$$
Thus
$$\tau_k = \begin{cases} \tau_1 = \arg\min_{\tau}\,\psi_k\!\left(-\tau\,\mathbf{D}_k(\mathbf{x}_t)\mathbf{g}_k(\mathbf{x}_t)\right) & \text{if } \mathbf{x}_t^{(k)} - \tau_1\mathbf{D}_k(\mathbf{x}_t)\mathbf{g}_k(\mathbf{x}_t) > 0, \\ \tau_2 = \theta\,\min_j\left\{\dfrac{x_{jt}^{(k)}}{\left[\mathbf{D}_k(\mathbf{x}_t)\mathbf{g}_k(\mathbf{x}_t)\right]_j} : \left[\mathbf{D}_k(\mathbf{x}_t)\mathbf{g}_k(\mathbf{x}_t)\right]_j > 0\right\} & \text{otherwise,} \end{cases} \qquad (37)$$
where $\tau_1 = \left(\mathbf{g}_k(\mathbf{x}_t)\right)^T\mathbf{D}_k(\mathbf{x}_t)\mathbf{g}_k(\mathbf{x}_t)\big/\left(\mathbf{D}_k(\mathbf{x}_t)\mathbf{g}_k(\mathbf{x}_t)\right)^T\mathbf{M}_k(\mathbf{x}_t)\mathbf{D}_k(\mathbf{x}_t)\mathbf{g}_k(\mathbf{x}_t)$ and $\theta \in (0,1)$. For $\psi_k(\mathbf{p}_k^{(C)}) < 0$, global convergence is achieved if $\mathrm{red}(\mathbf{x}_t^{(k+1)} - \mathbf{x}_t^{(k)}) \ge \beta$, $\beta \in (0,1)$, with
$$\mathrm{red}(\mathbf{p}) = \frac{\psi_k(\mathbf{p})}{\psi_k\!\left(\mathbf{p}_k^{(C)}\right)}. \qquad (38)$$

The usage of the constrained scaled Cauchy step leads to the following updates:
$$\mathbf{s}_t^{(k)} = t\left(\mathbf{p}_k^{(C)} - \hat{\mathbf{p}}_k\right) + \hat{\mathbf{p}}_k, \qquad \mathbf{x}_t^{(k+1)} = \mathbf{x}_t^{(k)} + \mathbf{s}_t^{(k)}, \qquad (39)$$
with $t \in [0,1)$, where $\hat{\mathbf{p}}_k$ and $\mathbf{p}_k^{(C)}$ are given by (30) and (35), respectively, and $t$ is the smaller root (lying in $(0,1)$) of the quadratic equation
$$\pi(t) = \psi_k\!\left(t\left(\mathbf{p}_k^{(C)} - \hat{\mathbf{p}}_k\right) + \hat{\mathbf{p}}_k\right) - \beta\,\psi_k\!\left(\mathbf{p}_k^{(C)}\right) = 0. \qquad (40)$$

The Matlab code of the IPN algorithm, which solves the system $\mathbf{A}\mathbf{x}_t = \mathbf{y}_t$ subject to nonnegativity constraints, is given in Algorithm 5. To solve the transformed system (32), we use the LSQR method implemented in Matlab 7.0.

% Interior Point Newton (IPN) algorithm function
%
function [x] = nmf_ipn(A,y,x,no_iter)
%
% [x]=nmf_ipn(A,y,x,no_iter) finds such x that solves the equation Ax = y
% subject to nonnegativity constraints.
%
% INPUTS:
% A - system matrix of dimension [I by J]
% y - vector of observations [I by 1]
% x - vector of initial guess [J by 1]
% no_iter - maximum number of iterations
%
% OUTPUTS:
% x - vector of estimated sources [J by 1]
%
% #########################################################################
% Parameters
s = 1.8; theta = 0.5; rho = .1; beta = 1;
H = A'*A; yt = A'*y; J = size(x,1);
% Main loop
for k=1:no_iter
g = H*x - yt; d = ones(J,1); d(g >= 0) = x(g >= 0);
ek = zeros(J,1); ek(g >=0 & g < x.^s) = g(g >=0 & g < x.^s);
M = H + diag(ek./d);
dg = d.*g;
tau1 = (g'*dg)/(dg'*M*dg); tau_2vec = x./dg;
tau2 = theta*min(tau_2vec(dg > 0));
tau = tau1*ones(J,1); tau(x - tau1*dg <= 0) = tau2;
w = 1./(d + ek); sk = sqrt(w.*d); pc = - tau.*dg;
Z = repmat(sk,1,J).*M.*repmat(sk',J,1);
rt = -g./sk;
[pt,flag,relres,iter,resvec] = lsqr(Z,rt - g.*sk,1E-8);
p = pt.*sk;
phx = max(0, x + p) - x;
ph = max(rho, 1 - norm(phx))*phx;
Phi_pc = .5*pc'*M*pc + pc'*g; Phi_ph = .5*ph'*M*ph + ph'*g;
red_p = Phi_ph/Phi_pc; dp = pc - ph;
if red_p >= beta
t = 0;
else
ax = .5*dp'*M*dp; bx = dp'*(M*ph + g);
cx = Phi_ph - beta*Phi_pc;
Deltas = sqrt(bx^2 - 4*ax*cx);
t1 = .5*(-bx + Deltas)/ax; t2 = .5*(-bx - Deltas)/ax;
t1s = t1 > 0 & t1 < 1; t2s = t2 > 0 & t2 < 1; t = min(t1, t2);
if (t <= 0)
if t1s
t = t1; % take the root lying in (0,1)
else
t = t2;
end
end
end
sk = t*dp + ph;
x = x + sk;
end % for k

3.6. Sequential Coordinate-Wise Algorithm

The NNLS problem (1) can be expressed in terms of the following quadratic problem (QP) [62]:
$$\min_{\mathbf{x}_t \ge 0} \Psi(\mathbf{x}_t), \qquad t = 1,\ldots,T, \qquad (41)$$
where
$$\Psi(\mathbf{x}_t) = \frac{1}{2}\mathbf{x}_t^T\mathbf{H}\mathbf{x}_t + \mathbf{c}_t^T\mathbf{x}_t, \qquad (42)$$
with $\mathbf{H} = \mathbf{A}^T\mathbf{A}$ and $\mathbf{c}_t = -\mathbf{A}^T\mathbf{y}_t$.

The sequential coordinate-wise algorithm (SCWA), proposed first by Franc et al. [62], solves the QP problem given by (41) by updating only a single variable $x_{jt}$ in each iterative step. It should be noted that the sequential updates can be easily done if the function $\Psi(\mathbf{x}_t)$ is equivalently rewritten as
$$\Psi(\mathbf{x}_t) = \frac{1}{2}\sum_{p\in\mathcal{I}}\sum_{r\in\mathcal{I}} x_{pt}x_{rt}\left[\mathbf{A}^T\mathbf{A}\right]_{pr} - \sum_{p\in\mathcal{I}} x_{pt}\left[\mathbf{A}^T\mathbf{y}_t\right]_{p} = \frac{1}{2}x_{jt}^2\left[\mathbf{A}^T\mathbf{A}\right]_{jj} - x_{jt}\left[\mathbf{A}^T\mathbf{y}_t\right]_{j} + x_{jt}\sum_{p\in\mathcal{I}\setminus\{j\}} x_{pt}\left[\mathbf{A}^T\mathbf{A}\right]_{pj} - \sum_{p\in\mathcal{I}\setminus\{j\}} x_{pt}\left[\mathbf{A}^T\mathbf{y}_t\right]_{p} + \frac{1}{2}\sum_{p\in\mathcal{I}\setminus\{j\}}\sum_{r\in\mathcal{I}\setminus\{j\}} x_{pt}x_{rt}\left[\mathbf{A}^T\mathbf{A}\right]_{pr} = \frac{1}{2}x_{jt}^2 h_{jj} + x_{jt}\beta_{jt} + \gamma_{jt}, \qquad (43)$$
where $\mathcal{I} = \{1,\ldots,J\}$, and

$$h_{jj} = \left[\mathbf{A}^T\mathbf{A}\right]_{jj}, \qquad \beta_{jt} = -\left[\mathbf{A}^T\mathbf{y}_t\right]_{j} + \sum_{p\in\mathcal{I}\setminus\{j\}} x_{pt}\left[\mathbf{A}^T\mathbf{A}\right]_{pj} = \left[\mathbf{A}^T\mathbf{A}\mathbf{x}_t - \mathbf{A}^T\mathbf{y}_t\right]_{j} - \left[\mathbf{A}^T\mathbf{A}\right]_{jj}x_{jt}, \qquad \gamma_{jt} = -\sum_{p\in\mathcal{I}\setminus\{j\}} x_{pt}\left[\mathbf{A}^T\mathbf{y}_t\right]_{p} + \frac{1}{2}\sum_{p\in\mathcal{I}\setminus\{j\}}\sum_{r\in\mathcal{I}\setminus\{j\}} x_{pt}x_{rt}\left[\mathbf{A}^T\mathbf{A}\right]_{pr}. \qquad (44)$$
Considering the optimization of $\Psi(\mathbf{x}_t)$ with respect to the selected variable $x_{jt}$, the following analytical solution can be derived:
$$x_{jt}^* = \arg\min_{x_{jt}\ge 0}\Psi\!\left(\left[x_{1t},\ldots,x_{jt},\ldots,x_{Jt}\right]^T\right) = \arg\min_{x_{jt}\ge 0}\left\{\frac{1}{2}x_{jt}^2 h_{jj} + x_{jt}\beta_{jt} + \gamma_{jt}\right\} = \max\left\{0,\, -\frac{\beta_{jt}}{h_{jj}}\right\} = \max\left\{0,\; x_{jt} - \frac{\left[\mathbf{A}^T\mathbf{A}\mathbf{x}_t\right]_{j} - \left[\mathbf{A}^T\mathbf{y}_t\right]_{j}}{\left[\mathbf{A}^T\mathbf{A}\right]_{jj}}\right\}. \qquad (45)$$

Updating only a single variable $x_{jt}$ in one iterative step, we have
$$x_{pt}^{(k+1)} = x_{pt}^{(k)}, \quad \forall p \in \mathcal{I}\setminus\{j\}, \qquad x_{jt}^{(k+1)} \ne x_{jt}^{(k)}. \qquad (46)$$
Any optimal solution to the QP problem (41) satisfies the KKT conditions given by (5) and the stationarity condition of the following Lagrange function:
$$L(\mathbf{x}_t, \boldsymbol{\lambda}_t) = \frac{1}{2}\mathbf{x}_t^T\mathbf{H}\mathbf{x}_t + \mathbf{c}_t^T\mathbf{x}_t - \boldsymbol{\lambda}_t^T\mathbf{x}_t, \qquad (47)$$
where $\boldsymbol{\lambda}_t \in \mathbb{R}^J$ is a vector of Lagrange multipliers (or dual variables) corresponding to the vector $\mathbf{x}_t$. Thus, $\nabla_{\mathbf{x}_t}L(\mathbf{x}_t, \boldsymbol{\lambda}_t) = \mathbf{H}\mathbf{x}_t + \mathbf{c}_t - \boldsymbol{\lambda}_t = \mathbf{0}$. In the SCWA, the Lagrange multipliers are updated in each iteration according to the formula
$$\boldsymbol{\lambda}_t^{(k+1)} = \boldsymbol{\lambda}_t^{(k)} + \left(x_{jt}^{(k+1)} - x_{jt}^{(k)}\right)\mathbf{h}_j, \qquad (48)$$
where $\mathbf{h}_j$ is the $j$th column of $\mathbf{H}$, and $\boldsymbol{\lambda}_t^{(0)} = \mathbf{c}_t$.

Finally, the SCWA takes the following updates:
$$x_{jt}^{(k+1)} = \max\left\{0,\; x_{jt}^{(k)} - \frac{\lambda_{jt}^{(k)}}{\left[\mathbf{A}^T\mathbf{A}\right]_{jj}}\right\}, \qquad x_{pt}^{(k+1)} = x_{pt}^{(k)}, \quad p \in \mathcal{I}\setminus\{j\}, \qquad \boldsymbol{\lambda}_t^{(k+1)} = \boldsymbol{\lambda}_t^{(k)} + \left(x_{jt}^{(k+1)} - x_{jt}^{(k)}\right)\mathbf{h}_j. \qquad (49)$$
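The updates (49) sweep over the coordinates $j = 1,\ldots,J$; a minimal Matlab sketch for one column $\mathbf{x}_t$ is given below. The function name is ours, and the multipliers are initialized as $\boldsymbol{\lambda} = \mathbf{H}\mathbf{x} + \mathbf{c}$ so that a nonzero initial guess is also handled (for $\mathbf{x}^{(0)} = \mathbf{0}$ this reduces to $\boldsymbol{\lambda}^{(0)} = \mathbf{c}_t$).

% Sequential coordinate-wise algorithm (SCWA) for one column x = x_t (sketch)
function x = nmf_scwa(A,y,x,no_iter)
H = A'*A; c = -A'*y; J = size(x,1);
lambda = H*x + c; % Lagrange multipliers; equals c_t for x = 0
for k = 1:no_iter
for j = 1:J % sweep over coordinates
xj_new = max(0, x(j) - lambda(j)/H(j,j)); % update (49)
lambda = lambda + (xj_new - x(j))*H(:,j); % update (48)
x(j) = xj_new;
end
end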

4. Simulations

All the proposed algorithms were implemented in our NMFLAB and evaluated with numerical tests related to typical BSS problems. We used the synthetic benchmark of 4 partially dependent nonnegative signals (with only $T = 1000$ samples) illustrated in Figure 1(a). The signals are mixed by a random, uniformly distributed nonnegative matrix $\mathbf{A} \in \mathbb{R}^{8\times 4}$ with the condition number $\mathrm{cond}\{\mathbf{A}\} = 4.11$. The matrix $\mathbf{A}$ is displayed in
$$\mathbf{A} = \begin{bmatrix}
0.0631 & 0.7666 & 0.0174 & 0.6596 \\
0.2642 & 0.6661 & 0.8194 & 0.2141 \\
0.9995 & 0.1309 & 0.6211 & 0.6021 \\
0.2120 & 0.0954 & 0.5602 & 0.6049 \\
0.4984 & 0.0149 & 0.2440 & 0.6595 \\
0.2905 & 0.2882 & 0.8220 & 0.1834 \\
0.6728 & 0.8167 & 0.2632 & 0.6365 \\
0.9580 & 0.9855 & 0.7536 & 0.1703
\end{bmatrix}. \qquad (50)$$
The mixed signals are shown in Figure 1(b).

Because the number of variables in 𝐗 is much greater than in 𝐀, that is, 𝐼×𝐽=32 and 𝐽×𝑇=4000, we test the projected gradient algorithms only for updating 𝐀. The variables in 𝐗 are updated with the standard projected fixed point alternating least squares (FP-ALS) algorithm that is extensively analyzed in [55].

In general, the FP-ALS algorithm solves the least-squares problem
$$\mathbf{X} = \arg\min_{\mathbf{X}}\frac{1}{2}\|\mathbf{Y} - \mathbf{A}\mathbf{X}\|_F^2 \qquad (51)$$
with the Moore-Penrose pseudoinverse of the system matrix, that is, in our case, the matrix $\mathbf{A}$. Since in NMF usually $I \ge J$, we formulate the normal equations as $\mathbf{A}^T\mathbf{A}\mathbf{X} = \mathbf{A}^T\mathbf{Y}$, and the least-squares solution of minimal $\ell_2$-norm to the normal equations is $\mathbf{X}_{LS} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{Y} = \mathbf{A}^{+}\mathbf{Y}$, where $\mathbf{A}^{+}$ is the Moore-Penrose pseudoinverse of $\mathbf{A}$. The projected FP-ALS algorithm is obtained with a simple "half-rectified" projection, that is,
$$\mathbf{X} = P_\Omega\!\left[\mathbf{A}^{+}\mathbf{Y}\right]. \qquad (52)$$
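In Matlab, the update (52) is essentially a one-liner; a minimal sketch (using pinv for clarity, although solving the normal equations directly is cheaper when $\mathbf{A}$ has full column rank):

% Projected FP-ALS update of X, see (51)-(52)
X = max(eps, pinv(A)*Y); % X = P_Omega[A^+ Y], negative entries clipped to eps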

The alternating minimization is nonconvex in spite of the cost function being convex with respect to one set of variables. Thus, most NMF algorithms may get stuck in local minima, and hence, the initialization plays a key role. In the performed tests, we applied the multistart initialization described in [53] with the following parameters: $N = 10$ (number of restarts), $K_i = 30$ (number of initial alternating steps), and $K_f = 1000$ (number of final alternating steps). Each initial sample of $\mathbf{A}$ and $\mathbf{X}$ has been randomly generated from a uniform distribution. Each algorithm has been tested for two numbers of inner iterations, that is, $k = 1$ and $k = 5$. The inner iterations are the iterative steps performed to update only $\mathbf{A}$ (with $\mathbf{X}$ fixed, i.e., before going to the update of $\mathbf{X}$), and vice versa. Additionally, the multilayer technique [53, 54] with 3 layers ($L = 3$) is applied.

The multilayer technique can be regarded as a multistep decomposition. In the first step, we perform the basic decomposition $\mathbf{Y} = \mathbf{A}_1\mathbf{X}_1$ using any available NMF algorithm, where $\mathbf{A}_1 \in \mathbb{R}^{I\times J}$ and $\mathbf{X}_1 \in \mathbb{R}^{J\times T}$ with $I \ge J$. In the second stage, the results obtained from the first stage are used to perform a similar decomposition, $\mathbf{X}_1 = \mathbf{A}_2\mathbf{X}_2$, where $\mathbf{A}_2 \in \mathbb{R}^{J\times J}$ and $\mathbf{X}_2 \in \mathbb{R}^{J\times T}$, using the same or different update rules, and so on. We continue the decomposition taking into account only the last achieved components. The process can be repeated arbitrarily many times until some stopping criteria are satisfied. In each step, we usually obtain a gradual improvement of the performance. Thus, our model has the form $\mathbf{Y} = \mathbf{A}_1\mathbf{A}_2\cdots\mathbf{A}_L\mathbf{X}_L$ with the basis matrix defined as $\mathbf{A} = \mathbf{A}_1\mathbf{A}_2\cdots\mathbf{A}_L \in \mathbb{R}^{I\times J}$. Physically, this means that we build up a system that has many layers or a cascade connection of $L$ mixing subsystems.
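A minimal sketch of this cascade is given below; nmf_update stands for any single-layer NMF routine (a placeholder name, not a function from NMFLAB) that returns the factors of its input.

% Multilayer NMF (illustrative sketch); L layers, J components
L = 3; A = eye(size(Y,1)); Z = Y;
for l = 1:L
[Al, Xl] = nmf_update(Z, J); % decompose the current layer: Z ~ Al*Xl
A = A*Al; % accumulate the basis A = A_1*A_2*...*A_L
Z = Xl; % the next layer factorizes the estimated sources
end
X = Z; % final source estimate X_L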

There are many stopping criteria for terminating the alternating steps. We stop the iterations if $s \ge K_f = 1000$ or if the condition $\|\mathbf{A}^{(s)} - \mathbf{A}^{(s-1)}\|_F < \epsilon$ holds, where $s$ stands for the number of alternating steps and $\epsilon = 10^{-5}$. Note that the condition (20) can also be used as a stopping criterion, especially since the gradient is computed in each iteration of the PG algorithms.
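In code, this test takes the following form (a sketch; A_old denotes the estimate from the previous alternating step):

% Stopping test for the alternating steps (illustrative sketch)
if s >= 1000 || norm(A - A_old,'fro') < 1e-5
break; % terminate the alternating minimization
end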

The algorithms have been evaluated with the signal-to-interference ratio (SIR) measures, calculated separately for each source signal and each column in the mixing matrix. Since NMF suffers from scale and permutation indeterminacies, the estimated components are adequately permuted and rescaled. First, the source and estimated signals are normalized to a uniform variance, and then the estimated signals are permuted to keep the same order as the source signals. In NMFLAB [43], each estimated signal is compared to each source signal, which results in a performance (SIR) matrix that is used to construct the permutation matrix. Let $\mathbf{x}_j$ and $\hat{\mathbf{x}}_j$ be the $j$th source and its corresponding (reordered) estimated signal, respectively. Analogously, let $\mathbf{a}_j$ and $\hat{\mathbf{a}}_j$ be the $j$th column of the true and the estimated mixing matrix, respectively. Thus, the SIRs for the sources are given by
$$\mathrm{SIR}_j^{(X)} = -20\log\left\{\frac{\|\hat{\mathbf{x}}_j - \mathbf{x}_j\|_2}{\|\mathbf{x}_j\|_2}\right\}\ \text{[dB]}, \qquad j = 1,\ldots,J, \qquad (53)$$
and similarly for each column in $\mathbf{A}$ we have
$$\mathrm{SIR}_j^{(A)} = -20\log\left\{\frac{\|\hat{\mathbf{a}}_j - \mathbf{a}_j\|_2}{\|\mathbf{a}_j\|_2}\right\}\ \text{[dB]}, \qquad j = 1,\ldots,J. \qquad (54)$$
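As a quick reference, (53) can be computed as below (assuming the estimated signal xj_hat has already been reordered and rescaled against the source xj, as described above):

% SIR in dB for the j-th source, see (53); xj and xj_hat are column vectors
sir_x = -20*log10(norm(xj_hat - xj)/norm(xj));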

We test the algorithms with a Monte Carlo (MC) analysis, running each algorithm 100 times. Each run has been initialized with the multistart procedure. The algorithms have been evaluated with the mean-SIR values, calculated as
$$\overline{\mathrm{SIR}}_X = \frac{1}{J}\sum_{j=1}^{J}\mathrm{SIR}_j^{(X)}, \qquad \overline{\mathrm{SIR}}_A = \frac{1}{J}\sum_{j=1}^{J}\mathrm{SIR}_j^{(A)}, \qquad (55)$$
for each MC sample. The mean-SIRs for the worst (lowest mean-SIR) and best (highest mean-SIR) samples are given in Table 1. The number $k$ denotes the number of inner iterations for updating $\mathbf{A}$, and $L$ denotes the number of layers in the multilayer technique [53, 54]. The notation $L = 1$ means that the multilayer technique was not used. The elapsed time [in seconds] is measured in Matlab, and it gives a rough indication of the computational complexity of each algorithm.

For comparison, Table 1 also contains the results obtained for the standard multiplicative NMF algorithm (denoted as M-NMF) that minimizes the squared Euclidean distance. Additionally, the results of testing the PG algorithms proposed in [53] have also been included. The acronyms Lin-PG, IPG, and RMRNSD refer to the following algorithms: the projected gradient algorithm proposed by Lin [52], the interior-point gradient algorithm, and the regularized minimal residual norm steepest descent algorithm (a regularized version of the MRNSD algorithm proposed by Nagy and Strakos [74]). These NMF algorithms have been implemented and investigated in [53] in the context of their usefulness for BSS problems.

5. Conclusions

The performance of the proposed NMF algorithms can be inferred from the results given in Table 1. In particular, the results show how sensitive the algorithms are to initialization, or in other words, how easily they fall into local minima. The complexity of the algorithms can also be estimated from the elapsed time measured in Matlab.

It is easy to notice that our NMF-PSESOP algorithm gives the best estimation (the sample which has the highest best-SIR value), and it gives only slightly lower mean-SIR values than the Lin-PG algorithm. Considering the elapsed time, the OPL, GPSR-BB, SCWA, and IPG algorithms belong to the fastest, while the Lin-PG and IPN algorithms are the slowest.

The multilayer technique generally improves the performance and consistency of all the tested algorithms if the number of observations is close to the number of nonnegative components. The highest improvement can be observed for the NMF-PSESOP algorithm, especially when the number of inner iterations is greater than one (typically, $k = 5$).

In summary, the best and most promising NMF-PG algorithms are the NMF-PSESOP, GPSR-BB, and IPG algorithms. However, the final selection of the algorithm depends on the size of the problem to be solved. Nevertheless, in our tests the projected gradient NMF algorithms seem to be much better (in the sense of speed and performance) than the multiplicative algorithms, provided that we can use the squared Euclidean cost function, which is optimal for data with Gaussian noise.