Abstract

This paper develops two novel and fast Riemannian second-order approaches for solving a class of matrix trace minimization problems with orthogonality constraints, which is widely applied in multivariate statistical analysis. The existing majorization method is guaranteed to converge but its convergence rate is at best linear. A hybrid Riemannian Newton-type algorithm with both global and quadratic convergence is proposed firstly. A Riemannian trust-region method based on the proposed Newton method is further provided. Some numerical tests and application to the least squares fitting of the DEDICOM model and the orthonormal INDSCAL model are given to demonstrate the efficiency of the proposed methods. Comparisons with some latest Riemannian gradient-type methods and some existing Riemannian second-order algorithms in the MATLAB toolbox Manopt are also presented.

1. Introduction

Many multivariate statistical techniques are based on fitting a model to observed data. In most cases, these models are fitted by least squares, and as a consequence, one has to minimize a least squares loss function possibly subject to certain constraints on the parameters in the model. Least squares loss functions can often be expressed in terms of matrix trace functions. Sometimes, these have straightforward closed-form solutions, but also often such solutions are not available. In such cases, one may resort to iterative algorithms. The main purpose of the present paper is to provide several efficient algorithms that can be used for optimizing such matrix trace functions.

A class of functions that would include all those of interest in multivariate statistical analysis can probably not be given, but in practice many statistical problems can be reformulated as special cases of the optimization of the general function [13]:where , , , ; is an unknown matrix, is a constant that does not depend on , and means the sum of the diagonal elements of . The matrices , , and at this point are arbitrary but with the required orders. In fact, the problems of least squares fitting of a model (involving a matrix of model parameters) to a data matrix , that is, the minimization of over (with denoting the Frobenius norm), can often be written in a form like that of (1). To give an example, consider the case where the data matrix has submatrices and the model has submatrices of the form ; then,

Then, minimization of (2) is equivalent to minimization of (1) by letting , , , and , .

The function can be minimized over arbitrary , but in many practical applications of multivariate statistics, constraints are imposed, the most frequent one being columnwise orthonormality: [110], where denotes the identity matrix of size . Another common constraint is that of reduced rank; that is, the rank of is forced to be less than or equal to (). The present paper discusses only the case of columnwise orthonormality. Therefore, the considered matrix trace minimization problem with constraints can be mathematically formulated by the following problem.

Problem 1. where , , and , are given, is a constant, and .
Next, we present several practical applications of Problem 1 in multivariate statistics. The first example was proposed in [1, 4] for a least squares fit of a model for linear dynamical systems. Following their notation, let be a fixed matrix, , , and be fixed matrices, and be fixed matrices, and be a variable . One of their subproblems is to minimize a functionover , subject to the constraint . This function can be expanded aswhere denotes a constant term not depending on . It is readily verified that is the special case of with replaced by , , , , , , and .
The second example is the least squares fitting of the “DEDICOM model” [1, 5], which can be written as minimizingwhere is an arbitrary fixed matrix and and are variable matrices of orders and , respectively, with columnwise orthonormal. More detailed description of this model will be given in Numerical Experiments section. Kiers et al. [6] have given an alternating least squares algorithm for fitting this model. The DEDICOM function, with considered fixed, can be rewritten aswhere is a constant not depending on . This is a special case of with , , .
The third example is to fit the “orthonormal INDSCAL model” (see [7], pp. 110, 118 and [1, 10]). Let be a fixed symmetric matrix of order , be a variable matrix of order , and be a variable diagonal matrix of order , . Fitting the orthonormal INDSCAL model is equivalent to minimizing the functionsubject to . As in Berge [8], can be minimized by alternatingly updating and , . Function , with , considered fixed, can be rewritten aswhere is a constant with respect to . This is the particular case of with , , , .
As a final example, we consider the simultaneous components analysis problem [3, 9] of minimizingover and , subject to . By means of an alternating least squares algorithm, they updated all parameter matrices in turn. Function , with , considered fixed, can be expanded aswhich, ignoring the constant , is a special case of with replaced by , , and .
Concerning numerical approach for solving Problem 1, actually, Kiers et al. [13] have proposed the well-known majorization algorithm, which is based on majorizing by a different function whose minimum is easily obtained. The majorizing function is chosen such that the parameter set for which the majorizing function attains its minimum will also decrease the function to be minimized. For a more detailed description of such strategies, see [13]. However, since each iteration of the majorization method requires a full eigenvalue decomposition and the rate of convergence is at best linear, the method can potentially be very slow. An average of computational results of 10 random tests of the majorization method on Problem 1 is reported in Table 1, where the constant and and the given matrices , , , , and are all generated randomly by means of (in MATLAB style) , , , , and . In Table 1, “n, p” means the dimension indexes and “CT.” and “IT.” mean the averaged total computing time in seconds and the averaged number of iterations to achieve the stopping criteria, respectively. “Grad.” means the averaged norm of Riemannian gradient ,” means the averaged difference of the objective function values, and “” means the averaged value of feasibility . We can observe from Table 1 and the convergence histories of the objective function value that are unreported here that the existing majorization algorithm is a monotonically convergent algorithm for minimizing the matrix trace function iteratively. However, it requires a large number of iterations to converge to within a given tolerance because the rate of convergence of this method is at best linear.
To overcome the slow linear rate of convergence, in this work we reconsider Problem 1 under the framework of Riemannian optimization, by noting that the feasible set, , with is referred to the well-known Stiefel manifold, an -dimensional embedded submanifold of the vector space . Riemannian optimization refers to optimization on Riemannian manifolds and has been extensively studied over decades. Optimization over the Stiefel manifold is an important special case of Riemannian optimization, which has recently aroused considerable research interests due to the wide applications in different fields such as the linear eigenvalue problem, the orthogonal Procrustes problem, the nearest low-rank correlation matrix problem, the Kohn–Sham total energy minimization, and singular value decomposition. Since optimization over the Stiefel manifold can be viewed as a general nonlinear optimization problem with constraints, many standard algorithms [11] in the Euclidean space can be generalized to manifold setting directly and have been explored and successfully applied to various applications, e.g., Riemannian steepest descent method [12], Riemannian curvilinear search method with Barzilai–Borwein (BB) steps [13], Riemannian Dai’s nonmonotone-based conjugate gradient method [14, 15], Riemannian Polak–Ribière–Polyak-based nonlinear conjugate gradient method [16, 17], and Riemannian Fletcher–Reeves-based conjugate gradient method [18]. However, as we know, gradient-type algorithms often perform reasonably well but might converge slowly when the generated iterates are close to an optimal solution. Usually, fast local convergence cannot be expected if only the gradient information is used. In the Euclidean space , it is well known that higher rates of convergence can be achieved by using second-order information on the cost function. The classical choice is Newton’s method; it plays a central role in the development of numerical techniques for optimization because of its simple formulation and its quadratic convergence properties. The history of Newton’s method on manifolds can be traced back to Gabay [19] who proposed a formulation for the method on embedded submanifolds of . Edelman et al. [20] proposed a formulation of Newton’s method on Stiefel manifold. Recently, Aihara and Sato [21] presented a matrix-free implementation of Riemannian Newton’s method on the Stiefel manifold. Sato [22] developed the Riemannian Newton’s method for the joint diagonalization problem on the Stiefel manifold, where the dimension of Newton’s equation is reduced and can be effectively solved by means of the Kronecker product and the vec and veck operators. Very recently, Hu et al. [23] proposed a regularized Newton method for optimization problems on Riemannian manifolds. They used a second-order approximation of the objective function in the Euclidean space to form a sequence of quadratic subproblems while keeping the manifold constraints. Numerical experiments and comparisons with other state-of-the-art methods indicated that their proposed algorithm is very promising. Another popular second-order algorithm on Stiefel manifold is the Riemannian trust-region (RTR) algorithm [2428], which has been successfully applied to various applications. In particular, Yang et al. [29] presented the RTR algorithm for model reduction of bilinear systems, where the error norm is treated as a cost function on the Stiefel manifold. Sato [30] developed the RTR algorithm to the join singular value decomposition of multiple rectangular matrices, which is formulated as a Riemannian optimization problem on the product of two Stiefel manifolds.
Motivated by these works, in this paper, we are interested in extending Riemannian Newton’s method and the Riemannian trust-region method to the underlying matrix trace minimization problem. We first derive the specific expression of Riemannian gradient and Hessian of the objective function of Problem 1. Specifically, we use the Kronecker product and the vectorization operators to reduce the dimension of the involved Newton’s equation and to transform the equation into a standard symmetric linear system . The resultant system can be efficiently solved by means of direct inversion or some well-known Krylov subspace methods, such as the conjugate residual method (page 182 of [31]). Because Newton’s method is not guaranteed to have global convergence, we need to prepare a suitable starting point that is sufficiently close to an optimal solution. Here, we applied OptStiefelGBB, a state-of-the-art algorithm proposed by Wen and Yin [13] which has the benefit of requiring very little memory and has been proved to be globally convergent, to obtain a suitable initial point. The resulting hybrid Riemannian Newton-type algorithm is globally and quadratically convergent. In the Euclidean space , it is well known that the pure Newton method converges only locally, and it cannot distinguish between local minima, local maxima, and saddle points. Compared to the pure Newton-type algorithm, the advantage of the trust-region algorithm is its more stable behavior [25]. For the RTR algorithm, convergence to stationary points is guaranteed for all initial points. By utilizing the Taylor expansion on Stiefel manifold, RTR algorithm constructs a trust-region subproblem on the tangent space. Here, we proposed a new trust-region subproblem, which has a lower computational cost than the original subproblem, based on our expression of the Riemannian Hessian of the objective function. The new trust-region subproblem can be solved by the classical truncated conjugate gradient, which is most popular due to its good properties and relatively cheap computational cost.
This paper is organized as follows. In Section 2, some basic geometric properties of the Stiefel manifold are given, and the representation matrix formula of Riemannian gradient and Hessian of the objective function are also derived. A hybrid Newton-type algorithm with globally and quadratically convergent is provided in Section 3. A Riemannian trust-region-based algorithm for Problem 1 is described in Section 4. Numerical experiments and application to the least squares fitting of the DEDICOM model and the orthonormal INDSCAL model and comparisons with some latest Riemannian gradient-type algorithms mentioned above and some existing second-order Riemannian algorithms in the MATALB toolbox Manopt are reported in Section 5. The conclusion is presented in Section 6.

2. Preliminaries

In this section, we first recall some notations, definitions, and basic properties of Riemannian manifolds used throughout the paper. The tangent space at on a manifold is denoted by . For manifolds and and a mapping , the differential of at is denoted by , which is a mapping from to . Given a smooth function on a manifold , the symbol is the extension of to the ambient Euclidean space . The symbols and denote the Euclidean and Riemannian gradients, respectively; i.e., given a smooth function on a manifold , and act on and , respectively. The symbol denotes the Riemannian Hessian. The concept of a retraction, which is a smooth map from the tangent bundle of into that approximates the exponential map to the first order, is given as follows.

Definition 1. (see [24], Definition 4.1.1). A retraction on a differentiable manifold is a smooth mapping from the tangent bundle onto satisfying the following two conditions (here denotes the restriction of to ):(1), where denotes the zero element of .(2)For any , it holds thatGiven a retraction and a smooth manifold , the general feasible algorithm framework on the manifold can be expressed aswhere is the step size at the -th iterate and is a tangent vector.
We next introduce some basic geometric properties of the involved Stiefel manifold , the reader is referred to [20, 24] for more details. The tangent space at can be expressed asSince the manifold is an embedded submanifold of the matrix Euclidean space , a natural metric for is the Euclidean metric , which induces a norm , where is the Frobenius norm of a matrix. In what follows, we denote by and the Riemannian metric and its induced norm on , respectively. Under the Riemannian metric, the orthogonal projection at onto is expressed aswhere denotes the symmetric part of , i.e., . We next introduce several different retraction operators on the Stiefel manifold at the current point for a given step size and descent direction (one can refer to [27] for more details).(1)Exponential map [20]:where is the QR decomposition of . This scheme needs to calculate an exponent of a -by- matrix and an decomposition of an -by- matrix.(2)Cayley transformation [13]:where and with . When , this scheme is much cheaper than the exponential map.(3)Polar decomposition (see [24], p. 58):The computational cost is lower than the Cayley transform but the Cayley transformation gives a better approximation to the exponential map.(4)QR decomposition (see [24], p. 59):where denotes the factor of the QR decomposition of the matrix in parentheses. It can be seen as an approximation of the polar decomposition. The main cost is the QR decomposition of a -by- matrix.The Riemannian gradient and Hessian of an objective function are basic concepts in Riemannian optimization; we next derive the representation matrix formulas of Riemannian gradient and Riemannian Hessian of the objective function in Problem 1. The Riemannian gradient, , of an objective function at is defined to be a unique tangent vector which satisfiesThe Hessian, , of at is defined to be a linear transformation of the tangent space through the covariant derivative of evaluated at :where the covariant derivative is defined through the Levi-Civita connection on (see [24], Definition 5.5.1).
In what follows, we define to be a function with the same form as defined in ; that is,The following two lemmas derive the explicit expressions of and .

Lemma 1. The Riemannian gradient of at can be expressed as

Proof. Since is a Riemannian submanifold of endowed with the induced metric, is equal to the orthogonal projection of the Euclidean gradient at onto . On the other hand, according to the matrix trace function differentiation, the Euclidean gradient at can be computed asHence, by using the projection given in (15), we obtain (23).

Lemma 2. Let be a tangent vector at , the Riemannian Hessian of at is expressed as a linear map on and given by

Proof. For , by simple calculation, the Euclidean Hessian can be written asBy using the classical expression of the Riemannian connection on a Riemannian submanifold of a Euclidean space (see [32], §5.3.3) and choosing , the Riemannian Hessian of on can also be expressed by using and the projection map (15). That is,Note that . We view the right-hand side of this relation as a product of two matrix functions of , rather than a composition of a map and a function. Then, (27) can be written as [32]in which we used the relation in the last equality of (28), and for any ,Furthermore, if is symmetric matrix, we haveThis relation can reduce the computational cost of in the last equality of (28). Indeed, it follows from (29) and (30) thatConsequently, by using (29) and (31), we obtain a more concrete expression for acting on asBy substituting (24) and (26) into (32), we obtain (25).
At the end of this section, we introduce the vec and veck operators, which are useful for rewriting a matrix equation by transforming the matrix into an unknown column vector, and some useful properties will be used in the sequel. For any , the vec operator is defined asFor any , where means the set of all -by- skew-symmetric matrices, the veck operator is defined asThe above definitions yield the following inner product equivalence:(i)For , , and , one can get where is the Kronecker product.(ii)There exists an vec-permutation matrix [33] such thatwhere is the matrix with a 1 in position and zeros elsewhere. From [33], we have . Because is a permutation matrix and so is orthogonal, then we have , and then(iii)For , let be the skew-symmetric parts of , i.e., , and we have(iv)For , we havewhere the matrix is defined as the following form:where is the th column of . Obviously, is standard column orthogonal, that is, . Furthermore, for , we have

3. Riemannian Newton’s Method for Problem 1

Since we have already obtained the matrix expressions of and and some other requisites for Riemannian optimization algorithms, in this section, we develop Riemannian Newton’s method for Problem 1. In Newton’s method, the search direction at is determined to be the solution to Newton’s equation

After is obtained, the candidate for the new iterate is then given bywhere the step size is set to one for simplicity. By substituting (23) and (25) into (43), complete Riemannian Newton’s equation for Problem 1 at can be written as

Now, the problem is how to solve the above Newton’s equation. Noting that it is complicated and difficult to solve because it must be solved for for given , i.e., must satisfy . To overcome these difficulties, we wish to obtain the representation matrix of Hess as a linear transformation on for arbitrarily fixed so that we can rewrite (45) into a standard linear equation, which permits many quite effective numerical approaches. To this end, we introduce the following lemma.

Lemma 3 (see [24], p. 42). An equivalent form of the tangent space to at is given bywhere is an arbitrary matrix that satisfies and .

From Lemma 3, one can easily check that the vector space of all tangent vectors has a dimension of . Let be the search direction at . From Lemma 3, can be expressed by

Since is also determined to be a unique tangent vector in , there exist unique matrices and which satisfy

The following proposition shows that we can write and by using and .

Proposition 1. Let and satisfy . If a tangent vector is expressed as (47), then the Hessian of the objective function acts on as with

Proof. From (25) and (48), we have the following equality:Note that and for any . Multiplying equation (51) by from the left and using the relations yieldsBy substituting (47) into (52) and using the relation , we obtain the expression of in (49) by using and . Similarly, we multiply (51) by from the left to obtainBy substituting (47) into (53) and using the relation , we obtain the expression of in (50).
Since the Hessian is a linear transformation with respect to , and, therefore, with respect to the set of matrices with and . Meanwhile, is also in the tangent space and can be uniquely expressed as with . Note that the dimension of the tangent space is ; then, we can regard that the Hessian is a linear transformation on that transforms a -dimension vector into another -dimension vector. The following lemma obtains the representation matrix of that linear transformation, in which we use the and operators taking matrices and skew-symmetric matrices into vectors.

Lemma 4. Let be a linear transformation on that acts on with aswhere and are given in equations (49) and (50). Then, the representation matrix of is given bywhere

Proof. From equations (49) and (50) together with (39)–(42), and are calculated as follows:This completes the proof.
The following proposition considers the structure of the representation matrix .

Proposition 2. The representation matrix is always symmetric.

Proof. Let and be two tangent vectors at which are expressed bywhere and . From the induced metric for , together with (35) and (36), one can getSince the Hessian is symmetric with respect to the induced inner product for [24], we haveThis is equivalent toHence, we obtainbecause and can be arbitrary tangent vectors at .
We are now in the position to solve Newton’s equation (45). From (48), Newton’s equation (45) can be rewritten asBy multiplying (63) from the left by and , (63) is equivalent toNote again that and for any . Applying the vec operation to the two equations of (64), we getTogether with Lemma 4, Newton’s equation (45) can be represented in the following form:where is given in (55), which is the representation symmetric matrix of Hess :We have thus derived an equivalent form, which is a standard symmetric linear equation, of Newton’s equation (45). After we solve this symmetric linear equation (66), we can obtain and ; then, we can easily reshape and . Therefore, we can calculate the solution of Newton’s equation (45).
Based on the above discussion, the corresponding Riemannian Newton’s method for solving Problem 1 can be described as follows.
If we know a good approximate solution of the problem in advance, Newton’s method works effectively. This is because Newton’s method generates locally but quadratically convergent sequences in general (Theorem 6.3.2 of [24, 34]).

Theorem 1. (local convergence). Consider Algorithm 1 with retraction as described in Section 2. Let be a critical point of ; . Assume that is nondegenerate at . Then, there exists a neighborhood of in such that for all , Algorithm 1 generates an infinite sequence converging quadratically to .

From [17, 21, 22], we have several remarks for Algorithm 1 as follows.

 (1) Choose an initial point . (2) fordo (3) Compute that satisfies and . (4) Compute the partition matrix by Lemma 4 and the vector by (68). Compute that satisfies the symmetric linear equation (69). (5) Partition as , where . (6) Construct and that satisfy and . (7) Compute