Applications of Methods of Numerical Linear Algebra in Engineering 2016
A General Solution to Least Squares Problems with Box Constraints and Its Applications
The main contribution of this paper is presenting a flexible solution to box-constrained least squares problems. This solution is applicable to many existing problems, such as nonnegative matrix factorization, support vector machines, signal deconvolution, and computed tomography reconstruction. The key concept of the proposed algorithm is to replace the minimization of the cost function at each iteration with the minimization of a surrogate, leading to a guaranteed decrease in the cost function. In addition to the monotonicity, the proposed algorithm has several desirable properties, including self-constraint within the feasible region and the absence of a predetermined step size. This paper theoretically proves global convergence for the special case of below-bounded constraints. Using the proposed mechanism, some valuable algorithms can be derived. The simulation results demonstrate that the proposed algorithm provides performance comparable to that of other commonly used methods in numerical experiments and computed tomography reconstruction.
Solving the linear system $Ax = b$ is a classic inverse problem, where $x$ and $b$ are vectors and $A$ is a matrix. This problem is applicable in many fields, including nonnegative matrix factorization (NMF), support vector machines (SVMs), signal deconvolution, and medical image reconstruction (e.g., computed tomography (CT)) [1–5]. In many cases, we need to impose a box constraint on the problem. A classical approach to choosing $x$ is to minimize the least squares (LS) error between $Ax$ and $b$: $\min_x \|Ax - b\|^2$ subject to $l \le x \le u$ (1), where $l$ and $u$ are given constant vectors and $\|\cdot\|$ denotes the Euclidean distance.
In fact, (1) can easily be converted into an equivalent form by a linear transformation $x \leftarrow x - l$, which is convenient to study. In the following, we only focus on such LS problems and still denote the variables by $x$ for simple notation; subsequently, the readers may generalize all the results of this paper to the generic form by themselves: $\min_x \|Ax - b\|^2$ subject to $0 \le x \le c$ (2), where $c = u - l$ is a constant vector.
If we denote $c = \infty$, then it is the widely used case, that is, the nonnegativity constrained LS problem: $\min_x \|Ax - b\|^2$ subject to $x \ge 0$ (3).
There are many methods for solving nonnegativity constrained problems. A simple method for solving such problems is the pseudo-inverse solution $x = (A'A)^{-1}A'b$. Of course, we must truncate the negative pixel values to satisfy the nonnegativity constraint. However, we will encounter two difficulties: perhaps $A'A$ is not invertible or is too large. A viable example for the minimization of (3) is the gradient-based method. The steepest descent method is perhaps the simplest technique to implement, which takes the negative gradient as the descent direction: $x^{(k+1)} = x^{(k)} - \alpha_k A'(Ax^{(k)} - b)$, where the superscript $(k)$ denotes the $k$th iteration and $\alpha_k$ is the step size.
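As an illustrative sketch (not the paper's algorithm), the projected steepest descent iteration can be written in a few lines of NumPy; the fixed step size $1/\lambda_{\max}(A'A)$ is a standard safe choice we supply ourselves, since the step size is left open above:

```python
import numpy as np

def nnls_projected_gradient(A, b, iters=500):
    """Projected steepest descent for min ||Ax - b||^2 s.t. x >= 0.

    Illustrative sketch: the fixed step 1/lambda_max(A'A) is our own
    safe choice; the paper's method avoids choosing a step size at all.
    """
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A.T @ A, 2)  # 1/L, L = largest eigenvalue of A'A
    for _ in range(iters):
        grad = A.T @ (A @ x - b)             # gradient of 0.5*||Ax - b||^2
        x = np.maximum(x - step * grad, 0)   # negative components truncated to 0
    return x
```

The `np.maximum(..., 0)` line is exactly the truncation mentioned in the text; everything else is a plain gradient step.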
In addition, Lantéri et al. [7, 8] provided a general multiplicative algorithm, which is also a gradient-based method. Let $f(x)$ be a function with positive values for any $x > 0$; then, $-f(x)\nabla J(x)$ (applied componentwise) is also a descent direction. Therefore, the algorithm in a modified gradient form can be written as $x^{(k+1)} = x^{(k)} - \alpha_k f(x^{(k)})\nabla J(x^{(k)})$. Again, let $U(x)$ and $V(x)$ be two positive functions for any $x > 0$, which satisfy $\nabla J(x) = U(x) - V(x)$. Taking $f(x) = x/U(x)$ and $\alpha_k = 1$, Lantéri et al. obtained a multiplicative update $x^{(k+1)} = x^{(k)}\,V(x^{(k)})/U(x^{(k)})$ (componentwise).
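For the common case $A \ge 0$ and $b \ge 0$, the split $\nabla J = U - V$ with $U = A'Ax$ and $V = A'b$ yields the well-known ISRA-type multiplicative update. A minimal NumPy sketch (the function name, starting point, and iteration count are our own choices):

```python
import numpy as np

def isra(A, b, iters=200, eps=1e-12):
    """Multiplicative (split-gradient) update for min ||Ax - b||^2, x >= 0,
    assuming A >= 0 and b >= 0 so that U = A'Ax and V = A'b are nonnegative.
    Starting from a positive point keeps every iterate nonnegative."""
    x = np.ones(A.shape[1])
    Atb = A.T @ b                             # V(x): positive part of the gradient split
    for _ in range(iters):
        x = x * Atb / (A.T @ (A @ x) + eps)   # x <- x * V/U, componentwise
    return x
```

No step size appears anywhere, which is the feature the text highlights.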
The conjugate gradient (CG) method is also a popular method, and it is often implemented as an iterative algorithm, applicable to sparse systems that are too large to be handled by a direct implementation.
The above approaches are not effective for handling upper-bounded constraints; generally, a truncation is imposed on the update rule, which may lead to a divergent modification.
In this paper, we consider a specific application of surrogate-based methods to a specific type of LS problem with box constraints. We derive a multiplicative update rule that iteratively and monotonically (in the sense of decreasing the cost function) solves the problem, similar to the EM algorithm. The nonnegativity constraints are satisfied automatically, and the upper-bounded constraints are enforced by an upper truncation. Meanwhile, the algorithm retains its monotonicity and does not require an adjustable step size. We provide a rigorous global convergence proof for the case of only nonnegativity constraints, which can easily be generalized to the below-bounded case. We demonstrate the power of this mechanism by deriving many existing and new algorithms. We also present computer simulation results to show the desirable behavior of the proposed algorithm with respect to convergence rate and stability compared with existing methods.
A note about our notation: all vectors are column vectors unless transposed to a row vector by a prime superscript $'$. For a matrix or vector $A$, $A \ge 0$ means that every component of $A$ is equal to or greater than 0. For a matrix $A$, $A_{i\cdot}$ and $A_{\cdot j}$ represent the $i$th row and the $j$th column of $A$, respectively.
Definition 1 (surrogate). Denote $G(x; x^{(k)})$ as a surrogate for $J(x)$ at (fixed) $x^{(k)}$ if $G(x; x^{(k)}) \ge J(x)$ for all feasible $x$ and $G(x^{(k)}; x^{(k)}) = J(x^{(k)})$.
Clearly, $J$ is decreasing under the update $x^{(k+1)} = \arg\min_x G(x; x^{(k)})$ because $J(x^{(k+1)}) \le G(x^{(k+1)}; x^{(k)}) \le G(x^{(k)}; x^{(k)}) = J(x^{(k)})$.
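The sandwich inequality above can be checked numerically with any valid surrogate. Here we use a separable quadratic majorizer $G(x; x^{(k)}) = J(x^{(k)}) + \nabla J(x^{(k)})'(x - x^{(k)}) + (L/2)\|x - x^{(k)}\|^2$ (our own illustrative choice, not the surrogate constructed later in the paper), whose minimizer is a gradient step of size $1/L$:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)  # 0.5 factor for a clean gradient
L = np.linalg.norm(A.T @ A, 2)                 # Lipschitz constant of the gradient

def surrogate_step(xk):
    """Minimize G(x; xk) = f(xk) + grad(xk)'(x - xk) + (L/2)||x - xk||^2.
    Its closed-form minimizer is a gradient step with step size 1/L."""
    return xk - (A.T @ (A @ xk - b)) / L

x = np.zeros(2)
vals = [f(x)]
for _ in range(20):
    x = surrogate_step(x)
    vals.append(f(x))
# The surrogate guarantee: the cost never increases from one iterate to the next.
assert all(v2 <= v1 + 1e-12 for v1, v2 in zip(vals, vals[1:]))
```

The final assertion is precisely the monotone-decrease property stated above.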
There are two obvious and important properties for a surrogate: additivity and transitivity. For the former, the sum of two surrogates is a surrogate of the sum of the two original functions. For the latter, the surrogate of a surrogate of a function is a surrogate of this function.
In the following, we proceed step by step: firstly, assume that $A \ge 0$ and that only the nonnegativity constraint $x \ge 0$ is imposed; secondly, remove the restriction of $A \ge 0$; and thirdly, remove the restriction to below-bounded constraints and treat the full box constraint.
Let $b = b^+ - b^-$, where $b^+$ and $b^-$ are nonnegative vectors; then, $J(x) = \|Ax - b^+ + b^-\|^2$. Note that we decompose $b$ instead of $A$ for convenient derivation, which can be seen below. We construct a surrogate by the convexity of $\|\cdot\|^2$. Denote coefficients $\lambda_{ij}$ that satisfy $\lambda_{ij} \ge 0$ and $\sum_j \lambda_{ij} = 1$. They can be the convex combination coefficients such that $\lambda_{ij} = a_{ij}x_j^{(k)}/(Ax^{(k)})_i$.
It is easy to verify that $G(x^{(k)}; x^{(k)}) = J(x^{(k)})$. If we consider Jensen's inequality and the convex combination coefficients $\lambda_{ij}$, then $G(x; x^{(k)}) \ge J(x)$ is proven by the following inequality:
Take the partial derivatives of $G(x; x^{(k)})$ and set them to zero; then, we can solve the one-dimensional equations to obtain a multiplicative update rule: $x_j^{(k+1)} = x_j^{(k)}\,(A'b^+)_j/\big(A'Ax^{(k)} + A'b^-\big)_j$.
2.2. Any $A$ but $x \ge 0$
Let $A = A^+ - A^-$, $A^+ \ge 0$, and $A^- \ge 0$, where $A^+$ and $A^-$ are two nonnegative matrices; then, we can construct the surrogate for $J(x)$ as follows: where
It is easy to verify that $G(x^{(k)}; x^{(k)}) = J(x^{(k)})$. By the convexity of $\|\cdot\|^2$, it is clear that
Following the same process as in Section 2.1, we can construct surrogates for the two resulting terms. Note that the remaining quantities are constant terms, which do not affect the optimization problem; thus, we ignore them. We can now obtain a surrogate for $J(x)$ at $x^{(k)}$:
Take the partial derivatives of and :
Then, we solve the one-dimensional equations to obtain a multiplicative update rule:
2.3. Any $A$ and $0 \le x \le c$
As shown, (21) will ensure the monotonic decrease of the cost function and satisfy the nonnegativity constraints. In the following, it will be modified by a truncation at the end of each iteration to satisfy the box constraints: $x^{(k+1)} \leftarrow \min\{x^{(k+1)}, c\}$ (componentwise). We will show that the truncation still ensures the monotonic decrease of the cost function.
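A compact sketch of the resulting scheme, shown here only for the simple case $A \ge 0$, $b \ge 0$ (the general case additionally decomposes $A$ and $b$ into nonnegative parts): a multiplicative surrogate-minimizing step followed by the upper truncation:

```python
import numpy as np

def box_ls_multiplicative(A, b, c, iters=300, eps=1e-12):
    """Sketch of the scheme for min ||Ax - b||^2 s.t. 0 <= x <= c,
    restricted to the simple case A >= 0, b >= 0: the multiplicative
    update keeps x >= 0 automatically, and the final truncation at c
    enforces the upper bound without destroying monotonicity."""
    x = np.minimum(np.ones(A.shape[1]), c)   # start inside the box
    Atb = A.T @ b
    for _ in range(iters):
        x = x * Atb / (A.T @ (A @ x) + eps)  # surrogate minimizer (x >= 0 free)
        x = np.minimum(x, c)                 # truncation keeps the box constraint
    return x
```

Note that no step size is tuned; the only extra operation relative to the nonnegativity-only case is the `np.minimum(x, c)` truncation.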
It is easy to see the separability of the variables of (19); that is, the surrogate splits as $G(x; x^{(k)}) = \sum_j G_j(x_j; x^{(k)})$, and every separated function $G_j$ has a quadratic form:
We further simplify it into a general form $g(t) = pt^2 + qt + r$, where $p > 0$, $t \in [0, c_j]$, and $t^* = -q/(2p)$ minimizes the quadratic function. If $t^* \le c_j$, we change nothing. If $t^* > c_j$, then it is easy to obtain that $g(c_j) = \min_{t \in [0, c_j]} g(t)$ because $g$ strictly monotonically decreases on the open interval $(-\infty, t^*)$, such that truncation at $c_j$ can only decrease the surrogate. Thus, the proof is established.
In fact, we know that the truncated minimizer still decreases the surrogate; therefore, we may provide a generic solution to (2), which is the main result of this paper.
It is easy to generalize the constraints from $0 \le x \le c$ to $l \le x \le u$ by the transformation $x \leftarrow x + l$, which is left for the readers to complete by themselves.
The convergence proof with box constraints is very difficult; thus, we only focus on the below-bounded constraints, which can be viewed as a special case of box constraints with $c = \infty$. Additionally, we consider the equivalence between the general below-bounded form and the nonnegativity constrained one by a linear transformation $x \leftarrow x - l$; then, we will theoretically prove the global convergence of the latter case for simplification, which is easy to extend to that of the former. In theory, a Kuhn-Tucker (KT) point will be a global solution if the cost function is convex. It is easy to see the convexity of $J(x) = \|Ax - b\|^2$. By Theorem 2.19 in , the KT conditions of (3) are as follows: $\nabla J(x)_j \ge 0$ for all $j$ (25) and $x_j \nabla J(x)_j = 0$ for all $j$ (26), together with $x \ge 0$.
We entertain several important and reasonable assumptions.
Assumption 3. For the iteration sequence $\{x^{(k)}\}$, we assume that (1) the algorithm starts from a positive image; (2) for all ; (3) $J$ is a strictly convex function.
The first assumption forces the iterates to be positive, but the limit may be zero. The second condition is reasonable because suggests for any . Thus, the equation is irrelevant to , and, then, in is removable. The third one is indeed restrictive; however, it is important for the following derivation.
Lemma 4. The set of accumulation points of a bounded sequence with is connected and compact.
Proof. This is Theorem 28.1 from Ostrowski . The reader is kindly referred to that work for the proof.
Lemma 5. The iteration sequence is bounded.
Proof. Because $J$ is a strictly convex function, $A'A$ is a positive definite symmetric matrix, and, thus, it can be factorized into $A'A = R'R$ by Cholesky's method, where $R$ is an invertible matrix. We assume that $x^*$ is the sole minimum, and, then, we expand $J$ at $x^*$ using a Taylor series: $J(x) = J(x^*) + \|R(x - x^*)\|^2$. Let $y = R(x - x^*)$; then, for any constant $M$, it is easy to know the boundedness of the level set $\{y : J \le M\}$, which is equivalent to the corresponding set of $x$ being bounded. Let $M = J(x^{(0)})$; then, $J(x^{(k)}) \le M$ because $J(x^{(k)})$ monotonically decreases. Therefore, $\{x^{(k)}\}$ is bounded.
Lemma 6. The sequence satisfies $\|x^{(k+1)} - x^{(k)}\| \to 0$.
Proof. Take into account that Let . Because , Because monotonically decreases and is bounded from below, ; therefore, .
Lemma 7. If a subsequence $x^{(k_i)} \to x^*$, then $x^{(k_i + 1)} \to x^*$ as well.
Proof. By contradiction, suppose $x^{(k_i + 1)}$ does not converge to $x^*$; then, by the boundedness from Lemma 5, it must have a subsequence converging to some $y^* \ne x^*$. Let $\epsilon = \|y^* - x^*\| > 0$, and if we consider the two convergent subsequences, then there must be a positive integer $K$ such that $\|x^{(k_i)} - x^*\| < \epsilon/3$ and $\|x^{(k_i + 1)} - y^*\| < \epsilon/3$ when $k_i > K$. By the triangle inequality, $\|x^{(k_i + 1)} - x^{(k_i)}\| > \epsilon/3$, which contradicts Lemma 6.
Lemma 8. At each iteration, one knows that
The global convergence will be proven from the three theorems below.
Theorem 9. Let $x^{(k_i)} \to x^*$ be any convergent subsequence; then, $x^*$ meets the first KT condition (25).
Theorem 10. The entire sequence converges.
Proof. According to Lemmas 4, 5, and 6, the set of accumulation points of is connected and compact. If we can prove that the number of accumulation points is finite, then the desired result follows because a finite set can be connected only if it consists of a single point .
To prove that the number of accumulation points is finite, we consider any accumulation point $x^*$. Given the integer set $S = \{1, \dots, n\}$, where $n$ is the total number of components of $x$, the zero set $Z(x^*) = \{j : x^*_j = 0\}$ is a subset of $S$. Let $J_Z$ be the restriction of $J$ to the set $\{x : x_j = 0, j \in Z\}$, which is a strictly convex function of the reduced variables. It follows that $J_Z$ has a unique minimum (by Theorem 9, $x^*$ is such a minimum). This means that an accumulation point must correspond to a subset of $S$. The number of subsets of $S$ is finite, and, thus, the number of accumulation points is also finite.
In Theorem 9, we prove that every accumulation point meets the first KT condition, by which the full sequence convergence is provided in Theorem 10. Naturally, the limit of satisfies the first KT condition. In the following, we will show that the second KT condition is satisfied.
Theorem 11. The limit of satisfies the second KT condition (26).
Proof. When $x^*_j > 0$, by contradiction, we assume that there is an index $j$ that satisfies $\nabla J(x^*)_j \ne 0$; by the first KT condition, $\nabla J(x^*)_j > 0$. Because $x^{(k)} \to x^*$, there exist an $\epsilon > 0$ and a positive integer $K$ such that $\nabla J(x^{(k)})_j > \epsilon$ for $k > K$; then, the update keeps shrinking the $j$th component by a factor bounded away from 1. Thus, we can obtain that $x^{(k)}_j \to 0$, which is a contradiction to $x^*_j > 0$.
4. Solutions to Some Existing and New Examples
NMF has been widely used in pattern recognition, machine learning, and data mining. It decomposes the nonnegative matrix $V$ into the product of two other nonnegative matrices $W$ and $H$. Lee and Seung proposed using the square of the Euclidean distance to measure the similarity between $V$ and $WH$: $J(W, H) = \|V - WH\|_F^2$.
Minimizing it under the constraints $W \ge 0$ and $H \ge 0$, Lee and Seung alternated between solving an optimization problem in the variables $W$ and then solving another one in the variables $H$. As can be seen, $W$ and $H$ have the same roles.
To take $H$, for instance, with $V$ and $W$ given, denote the convex combination coefficients as in Section 2.1 such that the surrogate function can be constructed. The update rule can be obtained by solving $\partial G/\partial H = 0$. Reversing the roles of $W$ and $H$, we can similarly construct the surrogate for $W$ and acquire the update rules: $H \leftarrow H \odot (W'V)/(W'WH)$ and $W \leftarrow W \odot (VH')/(WHH')$, where $\odot$ and the fractions denote componentwise multiplication and division.
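The alternating multiplicative updates can be sketched directly in NumPy (the small constant `eps` guarding against division by zero and the random initialization are our own choices):

```python
import numpy as np

def nmf(V, r, iters=200, eps=1e-12, seed=0):
    """Lee-Seung multiplicative updates for min ||V - WH||_F^2, W, H >= 0.
    r is the factorization rank; positive random initialization keeps
    every subsequent iterate nonnegative."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + 0.1
    H = rng.random((r, n)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H
```

Each line is a surrogate-minimizing step for one factor with the other held fixed, mirroring the alternation described above.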
4.2. Variation of Linearized SVM
SVM attempts to separate points belonging to two given sets in real Euclidean space by a surface. Several practical applications of SVMs use nonlinear kernels, such as the polynomial and radial basis function kernels. However, in applications such as text classification, linear SVMs are still used because it has been observed that many text classification problems are linearly separable. Most literature on large-scale SVM training targets the linear SVM problem, citing this fact. A dual form of the linear SVM is usually used because of its simple structure, which can be formulated as $\min_\alpha \frac{1}{2}\alpha'Q\alpha - e'\alpha$ subject to $0 \le \alpha_i \le C$ and $y'\alpha = 0$, where $x_i$ is the input data, $y_i \in \{-1, +1\}$ is the class label, $C$ is the penalty parameter, and $\alpha$ represents the Lagrangian multipliers that need to be optimized. Hsieh et al. handled the constraint $y'\alpha = 0$ by removing it, which corresponded to removing the "intercept" from the classifier. For a simplified expression, we denote $Q_{ij} = y_i y_j x_i'x_j$, and, then, we obtain the following simple formula: $\min_\alpha \frac{1}{2}\alpha'Q\alpha - e'\alpha$ subject to $0 \le \alpha_i \le C$.
We can solve this optimization problem using the proposed method. Let $Q = Q^+ - Q^-$, where $Q^+$ and $Q^-$ are two nonnegative matrices; then, where
We, respectively, construct surrogates for and . Let then,
We now obtain a surrogate for at :
We solve the one-dimensional equations and truncate at $C$ to obtain an update: $\alpha \leftarrow \min\{\alpha \odot (Q^-\alpha + e)/(Q^+\alpha),\; C\}$ (componentwise).
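A sketch of this scheme in NumPy, under the assumption that the update takes the split-gradient form $\alpha \leftarrow \alpha \odot (Q^-\alpha + e)/(Q^+\alpha)$ followed by truncation at $C$ (function name and initialization are ours):

```python
import numpy as np

def dual_svm_multiplicative(X, y, C=1.0, iters=500, eps=1e-12):
    """Sketch: multiplicative update with upper truncation for the dual
    linear SVM  min 0.5*a'Qa - e'a  s.t. 0 <= a <= C.
    Q is split as Q = Qp - Qn with Qp, Qn >= 0, so the gradient Qa - e
    splits into a positive part U = Qp a and a part V = Qn a + e."""
    Z = X * y[:, None]                         # rows z_i = y_i * x_i
    Q = Z @ Z.T                                # Q_ij = y_i y_j x_i' x_j
    Qp, Qn = np.maximum(Q, 0), np.maximum(-Q, 0)
    a = np.full(len(y), 0.5 * C)               # start strictly inside the box
    e = np.ones(len(y))
    for _ in range(iters):
        a = a * (Qn @ a + e) / (Qp @ a + eps)  # multiplicative step keeps a >= 0
        a = np.minimum(a, C)                   # truncate to satisfy a <= C
    return a
```

As in the general scheme, the lower bound is maintained automatically by the multiplicative form and only the upper bound needs the truncation.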
4.3. Nonnegative Image Deblurring
In many optical devices, the process of image blurring can be considered to be the result of convolution by a point spread function (PSF) [22, 23]. It is assumed that the degraded image is in the form $b = Hx$, where $x$ is the true image and $H$ is the PSF matrix consisting of PSFs at every pixel. As noted in [24, 25], a simple way to approach the deconvolution problem is to find the least squares (LS) estimation between $Hx$ and $b$. It is well known that the problem of restoring the original image from the noisy and degraded version is an ill-posed inverse problem: small perturbations in the data may result in an unacceptable result. The Tikhonov regularization method [27, 28] is a popular method that generally leads to a unique solution, which is formulated as $\min_{x \ge 0} \|Hx - b\|^2 + \beta\|Lx\|^2$. Note that we must allow negative values to exist in the matrix $L$.
In this problem, we assume that $H$ and $b$ are a nonnegative matrix and vector, respectively, and that there is no restriction on $L$; then, we, respectively, construct surrogate functions for $\|Hx - b\|^2$ and $\beta\|Lx\|^2$ using the method proposed in this paper such that we obtain a multiplicative update rule as follows: $x^{(k+1)} = x^{(k)} \odot (H'b + \beta Nx^{(k)})/(H'Hx^{(k)} + \beta Px^{(k)})$, where $L'L = P - N$ with $P, N \ge 0$. If $\beta = 0$, it is De Pierro's ISRT (Image Space Reconstruction Technique): $x^{(k+1)} = x^{(k)} \odot H'b/(H'Hx^{(k)})$.
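A sketch of such a regularized multiplicative update in NumPy, assuming the split-gradient form with $L'L = P - N$, $P, N \ge 0$ (with `beta = 0` it reduces to the ISRT-type update $x \leftarrow x \odot H'b/(H'Hx)$):

```python
import numpy as np

def nonneg_tikhonov(H, b, L, beta=0.1, iters=300, eps=1e-12):
    """Sketch: multiplicative update for min ||Hx - b||^2 + beta*||Lx||^2,
    x >= 0, with H >= 0, b >= 0 and an arbitrary regularization matrix L.
    M = L'L is split into nonnegative parts M = P - N so that numerator
    and denominator both stay nonnegative."""
    M = L.T @ L
    P, N = np.maximum(M, 0), np.maximum(-M, 0)
    x = np.ones(H.shape[1])
    Hb = H.T @ b
    for _ in range(iters):
        x = x * (Hb + beta * (N @ x)) / (H.T @ (H @ x) + beta * (P @ x) + eps)
    return x
```

The gradient splits as $(H'Hx + \beta Px) - (H'b + \beta Nx)$, so this is the same $x \leftarrow x\,V/U$ pattern used throughout the paper.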
4.4. CT Reconstruction with Regularization
In CT reconstruction, let $A$ be the system probability matrix, $b$ the projections, and $L$ a predetermined matrix. Then, a classical approach is to choose an $x$ (X-ray attenuation image) such that the LS error between $Ax$ and $b$ is minimized: $\min_{x \ge 0} \|Ax - b\|^2 + \beta\|Lx\|^2$, where $\beta$ serves as a penalty parameter. Many matrices have been used for $L$, such as the first- or second-order derivative matrix or a wavelet basis matrix. Note that both $A$ and $b$ are a nonnegative matrix and vector, respectively; however, we must accept negative values existing in the matrix $L$.
The alternating direction method of multipliers (ADMM) [29, 30] has been developed for solving optimization problems. This method decomposes the original problem into three subproblems and then sequentially solves them at each iteration.
Introducing an additional variable $z$, a fixed penalty parameter $\rho$, and the Lagrangian multiplier $u$, a unified framework can then be given to solve the norm-regularized LS reconstruction problem.
Algorithm 12 (ADMM general framework).
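Since the concrete steps of Algorithm 12 are not reproduced here, the following NumPy sketch shows the generic three-step ADMM iteration for an $\ell_1$-regularized LS problem, with the regularization matrix taken as the identity for brevity (all names and parameter values are illustrative; the paper's $x$-update additionally enforces nonnegativity):

```python
import numpy as np

def admm_l1_ls(A, b, beta=0.1, rho=1.0, iters=200):
    """Sketch of the ADMM framework for min 0.5*||Ax - b||^2 + beta*||z||_1
    s.t. x - z = 0. Each iteration solves three subproblems in turn:
    an x-update (ridge-type LS), a z-update (soft-thresholding), and a
    dual (multiplier) update."""
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    Minv = np.linalg.inv(AtA + rho * np.eye(n))          # cached x-update system
    for _ in range(iters):
        x = Minv @ (Atb + rho * (z - u))                 # x-update
        w = x + u
        z = np.sign(w) * np.maximum(np.abs(w) - beta / rho, 0)  # soft-threshold
        u = u + x - z                                    # dual update
    return x
```

The three lines in the loop correspond to the three subproblems the text says the method decomposes the original problem into.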
The $x$-update is the most difficult subproblem, which minimizes the LS term with nonnegativity constraints. We can, respectively, construct surrogate functions for the data-fidelity and penalty terms and then minimize the sum of the surrogates to update $x$; however, there is a high overlap with the derivations above, so we leave the details to the readers.
5. Experimental Results
We compare the performance of the proposed surrogate method with those of the PL method (6) and the CG method on simulated data and CT projection data. For the PL method, we use a step size built from the 1-norm and the infinity-norm of the matrix. The code for the CG method comes from , which is slightly modified to meet our criteria.
The experiments are performed on an HP Compaq PC with a 3.00 GHz Core i5 CPU and 4 GB of memory. The algorithms are implemented in MATLAB 7.0. All of the algorithms are initialized with the same uniform image for a fair comparison.
The mean square error (MSE) is used to measure the similarity to the true solution, as given below:
The following criterion is applied to stop the iterative process: where is a difference tolerance.
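For concreteness, the two evaluation quantities can be computed as follows (the exact form of the stopping criterion is not reproduced in the text; the relative-change rule below is one common choice, labeled as such):

```python
import numpy as np

def mse(x, x_true):
    """Mean square error against the known true solution."""
    return np.mean((x - x_true) ** 2)

def should_stop(x_new, x_old, tol=1e-6):
    """Relative-change stopping rule: stop when
    ||x_new - x_old|| < tol * ||x_old||.
    (An assumed common form; the paper's exact criterion is elided.)"""
    return np.linalg.norm(x_new - x_old) < tol * np.linalg.norm(x_old)
```

Both quantities are cheap to evaluate once per iteration alongside the main update.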
5.1. Numerical Simulation
Here, we randomly generate $A$ and the true solution $x^\ast$, where the former is uniformly distributed on and the latter on . Then we obtain $b$ by $b = Ax^\ast$. We will minimize $\|Ax - b\|^2$ under the box constraints $0 \le x \le c$. In the experiment, we generate $A$, $b$, and $c$ once, but we randomly generate start points (uniformly distributed on ) 50 times to obtain a reliable averaged result.
Figure 1 presents a comparison of the MSE and the cost function versus the iteration number. As shown, CG rapidly finishes iterating at a static solution, so the change is zero afterwards, which cannot be shown on a log-scaled axis. However, such a phenomenon does not mean a good result. Instead, it is the worst result among the three algorithms, as observed from the MSE curve. We can conclude that CG is not suitable for the box-constrained LS problem. PL and the proposed method always decrease both the MSE and cost curves. However, the proposed method shows performance superior to that of the PL method.
Table 1 shows the quantitative comparison of the three algorithms, from which we can draw the same conclusion as above. This table further proves that the proposed method is rapid and stable and that CG is not suitable for the box-constrained LS problem.
Table 2 shows the computational time of the algorithms. As can be seen, they have similar computational complexity; however, the proposed algorithm requires only slightly more time because of the nonnegative decomposition of $A$ and $b$. In fact, from the above theoretical analysis, we can observe that the proposed algorithm is also a gradient-based method; thus, it has a computational complexity similar to those of the CG and PL methods.
5.2. CT Reconstruction
CT reconstruction is a medical imaging technique for creating a meaningful diagnostic image. A computer takes the input from the CT machine, runs it through the reconstruction algorithm, and returns a set of images for a physician to examine. Here, we only consider the simplest mathematical model to pursue the structural image. A simulated thorax phantom with grids and 0.5 mm pixels, as shown in Figure 2, is used in the following experiments. There are many advantages to using simulated phantoms, including prior knowledge of the pixel values and the ability to control noise. The total attenuation value is approximately . For this case, the proposed algorithm is called ISRT, which is widely used in the field of CT reconstruction.
The system matrix is obtained using the “angle of view” method . From the system matrix, we forward project the phantom on the sinogram with 128 radial bins (bin size of 0.5 mm) and 128 angular views evenly spaced over . The noisy projections are obtained by the Poisson-random formula , where is the noise-free projection.
Figure 3 shows the reconstructions with 50 iterations for every algorithm. As shown, the CG’s image suffers from serious noise artifacts. The PL algorithm provides a reconstructed image that is slightly blurred compared with that of ISRT. The proposed ISRT provides the sharpest edge and clearest interior region.
Figure 4 quantitatively compares the histograms of the reconstructed images. The CG result still shows clear fluctuation around the true values. PL and ISRT provide two comparable curves, and neither is significantly better than the other.
We also present a comparison of the MSE and cost curves with increasing iteration number in Figure 5. As shown, CG exhibits an unsteady reconstruction property in which more iterations may lead to a worse result. Both PL and ISRT overcome this shortcoming, so more iterations always provide considerably better results. In addition, ISRT provides more rapid convergence than PL.
We present a special application of a surrogate-based method for solving box-constrained LS problems. A new update rule is developed to iteratively update the variables, and this rule exhibits desirable properties, including a monotonic decrease of the cost function, self-constraint within the feasible region, and no need to impose a step size. This algorithm covers many existing algorithms, as well as new ones, as special cases. Targeting the special case of only below-bounded constraints, we provide a rigorous theoretical global convergence proof. We use simulated data to evaluate the performance of the algorithm, demonstrating that the proposed algorithm provides more stable and faster convergence than the PL and CG approaches.
In this paper, we only provide a global convergence proof for the below-bounded case, not the box-constrained case; thus, proving the convergence for the latter case will be the focus of future work.
The authors declare that they have no competing interests.
This work was supported by the National Natural Science Foundation of China (61302013 and 61372014), the Science and Technology Plan of Liaoning Province of China (2014305001), and the Fundamental Research Funds for the Central Universities of China (N130219001 and N141008001).
D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proceedings of the Advances in Neural Information Processing Systems 13 (NIPS '00), pp. 556–562, Breckenridge, Colo, USA, 2000.
W. I. Zangwill, Nonlinear Programming: A Unified Approach, Prentice-Hall, New York, NY, USA, 1969.
A. R. De Pierro, "On the convergence of an EM-type algorithm for penalized likelihood estimation in emission tomography," IEEE Transactions on Medical Imaging, vol. 14, pp. 762–765, 1995.
K. Lange and R. Carson, "EM reconstruction algorithms for emission and transmission tomography," Journal of Computer Assisted Tomography, vol. 8, no. 2, pp. 306–316, 1984.
A. M. Ostrowski, Solution of Equations in Euclidean and Banach Spaces, Academic Press, New York, NY, USA, 1973.
B. Scholkopf, C. Burges, and A. Smola, "Making large-scale SVM learning practical," in Advances in Kernel Methods, pp. 169–184, MIT Press, Cambridge, Mass, USA, 1998.
J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds., pp. 185–208, MIT Press, Cambridge, Mass, USA, 1999.
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan, "A dual coordinate descent method for large-scale linear SVM," in Proceedings of the 25th International Conference on Machine Learning (ICML '08), pp. 408–415, Helsinki, Finland, July 2008.
H. Andrews and B. Hunt, Digital Image Restoration, Prentice-Hall, Englewood Cliffs, NJ, USA, 1977.
A. N. Tikhonov and V. Y. Arsenin, Solution of Ill-Posed Problems, Winston & Sons, Washington, DC, USA, 1977.
S. Boyd, N. Parikh, E. Chu, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, Md, USA, 1996.