Abstract

This paper develops two novel and fast Riemannian second-order approaches for solving a class of matrix trace minimization problems with orthogonality constraints, which arises widely in multivariate statistical analysis. The existing majorization method is guaranteed to converge, but its convergence rate is at best linear. We first propose a hybrid Riemannian Newton-type algorithm with both global and quadratic convergence. A Riemannian trust-region method based on the proposed Newton method is then provided. Numerical tests and applications to the least squares fitting of the DEDICOM model and the orthonormal INDSCAL model are given to demonstrate the efficiency of the proposed methods. Comparisons with some recent Riemannian gradient-type methods and with existing Riemannian second-order algorithms in the MATLAB toolbox Manopt are also presented.

1. Introduction

Many multivariate statistical techniques are based on fitting a model to observed data. In most cases, these models are fitted by least squares, and as a consequence, one has to minimize a least squares loss function possibly subject to certain constraints on the parameters in the model. Least squares loss functions can often be expressed in terms of matrix trace functions. Sometimes, these have straightforward closed-form solutions, but also often such solutions are not available. In such cases, one may resort to iterative algorithms. The main purpose of the present paper is to provide several efficient algorithms that can be used for optimizing such matrix trace functions.

A class of functions that would include all those of interest in multivariate statistical analysis can probably not be given, but in practice many statistical problems can be reformulated as special cases of the optimization of the general function [13]:where , , , ; is an unknown matrix, is a constant that does not depend on , and means the sum of the diagonal elements of . The matrices , , and at this point are arbitrary but with the required orders. In fact, the problems of least squares fitting of a model (involving a matrix of model parameters) to a data matrix , that is, the minimization of over (with denoting the Frobenius norm), can often be written in a form like that of (1). To give an example, consider the case where the data matrix has submatrices and the model has submatrices of the form ; then,

Then, minimization of (2) is equivalent to minimization of (1) by letting , , , and , .

The function can be minimized over arbitrary , but in many practical applications of multivariate statistics, constraints are imposed, the most frequent one being columnwise orthonormality: [1–10], where denotes the identity matrix of size . Another common constraint is that of reduced rank; that is, the rank of is forced to be less than or equal to (). The present paper discusses only the case of columnwise orthonormality. Therefore, the considered matrix trace minimization problem with constraints can be mathematically formulated by the following problem.

Problem 1. where , , and , are given, is a constant, and .
Next, we present several practical applications of Problem 1 in multivariate statistics. The first example was proposed in [1, 4] for a least squares fit of a model for linear dynamical systems. Following their notation, let be a fixed matrix, , , and be fixed matrices, and be fixed matrices, and be a variable . One of their subproblems is to minimize a functionover , subject to the constraint . This function can be expanded aswhere denotes a constant term not depending on . It is readily verified that is the special case of with replaced by , , , , , , and .
The second example is the least squares fitting of the “DEDICOM model” [1, 5], which can be written as minimizingwhere is an arbitrary fixed matrix and and are variable matrices of orders and , respectively, with columnwise orthonormal. More detailed description of this model will be given in Numerical Experiments section. Kiers et al. [6] have given an alternating least squares algorithm for fitting this model. The DEDICOM function, with considered fixed, can be rewritten aswhere is a constant not depending on . This is a special case of with , , .
The third example is to fit the “orthonormal INDSCAL model” (see [7], pp. 110, 118 and [1, 10]). Let be a fixed symmetric matrix of order , be a variable matrix of order , and be a variable diagonal matrix of order , . Fitting the orthonormal INDSCAL model is equivalent to minimizing the functionsubject to . As in Berge [8], can be minimized by alternatingly updating and , . Function , with , considered fixed, can be rewritten aswhere is a constant with respect to . This is the particular case of with , , , .
As a final example, we consider the simultaneous components analysis problem [3, 9] of minimizingover and , subject to . By means of an alternating least squares algorithm, they updated all parameter matrices in turn. Function , with , considered fixed, can be expanded aswhich, ignoring the constant , is a special case of with replaced by , , and .
Concerning numerical approaches for solving Problem 1, Kiers et al. [1–3] have proposed the well-known majorization algorithm, which is based on majorizing by a different function whose minimum is easily obtained. The majorizing function is chosen such that the parameter set for which the majorizing function attains its minimum also decreases the function to be minimized. For a more detailed description of such strategies, see [1–3]. However, since each iteration of the majorization method requires a full eigenvalue decomposition and the rate of convergence is at best linear, the method can be very slow. Averaged computational results over 10 random tests of the majorization method on Problem 1 are reported in Table 1, where the constant and and the given matrices , , , , and are all generated randomly by means of (in MATLAB style) , , , , and . In Table 1, “n, p” denotes the dimension indexes, and “CT.” and “IT.” denote the averaged total computing time in seconds and the averaged number of iterations to achieve the stopping criteria, respectively. “Grad.” denotes the averaged norm of the Riemannian gradient ,” denotes the averaged difference of the objective function values, and “” denotes the averaged value of the feasibility . We can observe from Table 1 and from the convergence histories of the objective function value (not reported here) that the existing majorization algorithm is a monotonically convergent algorithm for minimizing the matrix trace function iteratively. However, it requires a large number of iterations to converge to within a given tolerance because its rate of convergence is at best linear.
To overcome the slow linear rate of convergence, in this work we reconsider Problem 1 under the framework of Riemannian optimization, by noting that the feasible set, , with is referred to as the well-known Stiefel manifold, an -dimensional embedded submanifold of the vector space . Riemannian optimization refers to optimization on Riemannian manifolds and has been extensively studied over decades. Optimization over the Stiefel manifold is an important special case of Riemannian optimization, which has recently attracted considerable research interest due to its wide applications in different fields such as the linear eigenvalue problem, the orthogonal Procrustes problem, the nearest low-rank correlation matrix problem, the Kohn–Sham total energy minimization, and singular value decomposition. Since optimization over the Stiefel manifold can be viewed as a general nonlinear optimization problem with constraints, many standard algorithms [11] in the Euclidean space can be generalized directly to the manifold setting and have been explored and successfully applied to various applications, e.g., the Riemannian steepest descent method [12], the Riemannian curvilinear search method with Barzilai–Borwein (BB) steps [13], a Riemannian version of Dai’s nonmonotone conjugate gradient method [14, 15], the Riemannian Polak–Ribière–Polyak-based nonlinear conjugate gradient method [16, 17], and the Riemannian Fletcher–Reeves-based conjugate gradient method [18]. However, gradient-type algorithms often perform reasonably well but might converge slowly when the generated iterates are close to an optimal solution. Usually, fast local convergence cannot be expected if only the gradient information is used. In the Euclidean space , it is well known that higher rates of convergence can be achieved by using second-order information on the cost function. The classical choice is Newton’s method; it plays a central role in the development of numerical techniques for optimization because of its simple formulation and its quadratic convergence properties. The history of Newton’s method on manifolds can be traced back to Gabay [19], who proposed a formulation of the method on embedded submanifolds of . Edelman et al. [20] proposed a formulation of Newton’s method on the Stiefel manifold. Recently, Aihara and Sato [21] presented a matrix-free implementation of Riemannian Newton’s method on the Stiefel manifold. Sato [22] developed the Riemannian Newton’s method for the joint diagonalization problem on the Stiefel manifold, where the dimension of Newton’s equation is reduced and can be effectively solved by means of the Kronecker product and the vec and veck operators. Very recently, Hu et al. [23] proposed a regularized Newton method for optimization problems on Riemannian manifolds. They used a second-order approximation of the objective function in the Euclidean space to form a sequence of quadratic subproblems while keeping the manifold constraints. Numerical experiments and comparisons with other state-of-the-art methods indicated that their proposed algorithm is very promising. Another popular second-order algorithm on the Stiefel manifold is the Riemannian trust-region (RTR) algorithm [24–28], which has been successfully applied to various applications. In particular, Yang et al. [29] presented the RTR algorithm for model reduction of bilinear systems, where the error norm is treated as a cost function on the Stiefel manifold.
Sato [30] applied the RTR algorithm to the joint singular value decomposition of multiple rectangular matrices, which is formulated as a Riemannian optimization problem on the product of two Stiefel manifolds.
Motivated by these works, in this paper, we are interested in extending Riemannian Newton’s method and the Riemannian trust-region method to the underlying matrix trace minimization problem. We first derive the specific expressions of the Riemannian gradient and Hessian of the objective function of Problem 1. Specifically, we use the Kronecker product and the vectorization operators to reduce the dimension of the involved Newton’s equation and to transform the equation into a standard symmetric linear system . The resultant system can be efficiently solved by means of direct inversion or some well-known Krylov subspace methods, such as the conjugate residual method (page 182 of [31]). Because Newton’s method is not guaranteed to have global convergence, we need to prepare a suitable starting point that is sufficiently close to an optimal solution. Here, we apply OptStiefelGBB, a state-of-the-art algorithm proposed by Wen and Yin [13], which has the benefit of requiring very little memory and has been proved to be globally convergent, to obtain a suitable initial point. The resulting hybrid Riemannian Newton-type algorithm is globally and quadratically convergent. In the Euclidean space , it is well known that the pure Newton method converges only locally, and it cannot distinguish between local minima, local maxima, and saddle points. Compared to the pure Newton-type algorithm, the advantage of the trust-region algorithm is its more stable behavior [25]. For the RTR algorithm, convergence to stationary points is guaranteed for all initial points. By utilizing the Taylor expansion on the Stiefel manifold, the RTR algorithm constructs a trust-region subproblem on the tangent space. Here, we propose a new trust-region subproblem, which has a lower computational cost than the original subproblem, based on our expression of the Riemannian Hessian of the objective function. The new trust-region subproblem can be solved by the classical truncated conjugate gradient method, which is popular due to its good convergence properties and relatively low computational cost.
This paper is organized as follows. In Section 2, some basic geometric properties of the Stiefel manifold are given, and the representation matrix formulas of the Riemannian gradient and Hessian of the objective function are derived. A globally and quadratically convergent hybrid Newton-type algorithm is provided in Section 3. A Riemannian trust-region-based algorithm for Problem 1 is described in Section 4. Numerical experiments, applications to the least squares fitting of the DEDICOM model and the orthonormal INDSCAL model, and comparisons with the Riemannian gradient-type algorithms mentioned above and with some existing second-order Riemannian algorithms in the MATLAB toolbox Manopt are reported in Section 5. The conclusion is presented in Section 6.

2. Preliminaries

In this section, we first recall some notations, definitions, and basic properties of Riemannian manifolds used throughout the paper. The tangent space at on a manifold is denoted by . For manifolds and and a mapping , the differential of at is denoted by , which is a mapping from to . Given a smooth function on a manifold , the symbol is the extension of to the ambient Euclidean space . The symbols and denote the Euclidean and Riemannian gradients, respectively; i.e., given a smooth function on a manifold , and act on and , respectively. The symbol denotes the Riemannian Hessian. The concept of a retraction, which is a smooth map from the tangent bundle of into that approximates the exponential map to the first order, is given as follows.

Definition 1. (see [24], Definition 4.1.1). A retraction on a differentiable manifold is a smooth mapping from the tangent bundle onto satisfying the following two conditions (here denotes the restriction of to ):(1), where denotes the zero element of .(2)For any , it holds thatGiven a retraction and a smooth manifold , the general feasible algorithm framework on the manifold can be expressed aswhere is the step size at the -th iterate and is a tangent vector.
We next introduce some basic geometric properties of the involved Stiefel manifold , the reader is referred to [20, 24] for more details. The tangent space at can be expressed asSince the manifold is an embedded submanifold of the matrix Euclidean space , a natural metric for is the Euclidean metric , which induces a norm , where is the Frobenius norm of a matrix. In what follows, we denote by and the Riemannian metric and its induced norm on , respectively. Under the Riemannian metric, the orthogonal projection at onto is expressed aswhere denotes the symmetric part of , i.e., . We next introduce several different retraction operators on the Stiefel manifold at the current point for a given step size and descent direction (one can refer to [27] for more details).(1)Exponential map [20]:where is the QR decomposition of . This scheme needs to calculate an exponent of a -by- matrix and an decomposition of an -by- matrix.(2)Cayley transformation [13]:where and with . When , this scheme is much cheaper than the exponential map.(3)Polar decomposition (see [24], p. 58):The computational cost is lower than the Cayley transform but the Cayley transformation gives a better approximation to the exponential map.(4)QR decomposition (see [24], p. 59):where denotes the factor of the QR decomposition of the matrix in parentheses. It can be seen as an approximation of the polar decomposition. The main cost is the QR decomposition of a -by- matrix.The Riemannian gradient and Hessian of an objective function are basic concepts in Riemannian optimization; we next derive the representation matrix formulas of Riemannian gradient and Riemannian Hessian of the objective function in Problem 1. The Riemannian gradient, , of an objective function at is defined to be a unique tangent vector which satisfiesThe Hessian, , of at is defined to be a linear transformation of the tangent space through the covariant derivative of evaluated at :where the covariant derivative is defined through the Levi-Civita connection on (see [24], Definition 5.5.1).
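For illustration, the following MATLAB sketch implements the orthogonal projection (15) onto the tangent space and the QR- and polar-based retractions listed above; the function names are ours, and the Cayley transformation and the exponential map are omitted.

% Orthogonal projection of an ambient matrix Z onto the tangent space at X,
% P_X(Z) = Z - X*sym(X'*Z), where sym(M) = (M + M')/2.
function PZ = proj_tangent(X, Z)
    W = X' * Z;
    PZ = Z - X * (W + W') / 2;
end

% QR-based retraction: the Q factor of a thin QR factorization of X + Z,
% with signs fixed so that the triangular factor has a positive diagonal.
function Y = retract_qr(X, Z)
    [Q, R] = qr(X + Z, 0);
    d = sign(diag(R));
    d(d == 0) = 1;
    Y = Q * diag(d);
end

% Polar-based retraction: (X + Z)*(I + Z'*Z)^(-1/2), valid for tangent Z
% because (X + Z)'*(X + Z) = I + Z'*Z when X'*X = I and X'*Z + Z'*X = 0.
function Y = retract_polar(X, Z)
    p = size(X, 2);
    Y = (X + Z) / sqrtm(eye(p) + Z' * Z);
end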
In what follows, we define to be a function with the same form as defined in ; that is,The following two lemmas derive the explicit expressions of and .

Lemma 1. The Riemannian gradient of at can be expressed as

Proof. Since is a Riemannian submanifold of endowed with the induced metric, is equal to the orthogonal projection of the Euclidean gradient at onto . On the other hand, according to the matrix trace function differentiation, the Euclidean gradient at can be computed asHence, by using the projection given in (15), we obtain (23).

Lemma 2. Let be a tangent vector at , the Riemannian Hessian of at is expressed as a linear map on and given by

Proof. For , by simple calculation, the Euclidean Hessian can be written asBy using the classical expression of the Riemannian connection on a Riemannian submanifold of a Euclidean space (see [32], §5.3.3) and choosing , the Riemannian Hessian of on can also be expressed by using and the projection map (15). That is,Note that . We view the right-hand side of this relation as a product of two matrix functions of , rather than a composition of a map and a function. Then, (27) can be written as [32]in which we used the relation in the last equality of (28), and for any ,Furthermore, if is symmetric matrix, we haveThis relation can reduce the computational cost of in the last equality of (28). Indeed, it follows from (29) and (30) thatConsequently, by using (29) and (31), we obtain a more concrete expression for acting on asBy substituting (24) and (26) into (32), we obtain (25).
At the end of this section, we introduce the vec and veck operators, which are useful for rewriting a matrix equation by transforming the matrix into an unknown column vector, and some useful properties will be used in the sequel. For any , the vec operator is defined asFor any , where means the set of all -by- skew-symmetric matrices, the veck operator is defined asThe above definitions yield the following inner product equivalence:(i)For , , and , one can get where is the Kronecker product.(ii)There exists an vec-permutation matrix [33] such thatwhere is the matrix with a 1 in position and zeros elsewhere. From [33], we have . Because is a permutation matrix and so is orthogonal, then we have , and then(iii)For , let be the skew-symmetric parts of , i.e., , and we have(iv)For , we havewhere the matrix is defined as the following form:where is the th column of . Obviously, is standard column orthogonal, that is, . Furthermore, for , we have
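As a small numerical illustration of these operators, the following MATLAB snippet checks the Kronecker identity vec(AXB) = (B' ⊗ A)vec(X) and the action of the vec-permutation (commutation) matrix, and forms veck under one common convention (stacking the strictly lower-triangular entries of a skew-symmetric matrix); all variable names are ours.

m = 4; n = 3; p = 3; q = 5;
A = randn(m, n);  X = randn(n, p);  B = randn(p, q);

% vec stacks the columns of a matrix into one long vector.
vecX = X(:);

% Kronecker identity: vec(A*X*B) = kron(B', A) * vec(X).
lhs = reshape(A*X*B, [], 1);
rhs = kron(B', A) * vecX;
fprintf('Kronecker identity residual: %.2e\n', norm(lhs - rhs));

% One common veck convention: stack the strictly lower-triangular entries
% of a skew-symmetric matrix column by column (a p(p-1)/2-dimensional vector).
Omega = randn(p); Omega = (Omega - Omega')/2;     % make it skew-symmetric
mask = tril(true(p), -1);
veckOmega = Omega(mask);

% Vec-permutation (commutation) matrix Kcomm: Kcomm * vec(X) = vec(X').
Kcomm = zeros(n*p, n*p);
for j = 1:p
    for i = 1:n
        Kcomm((i-1)*p + j, (j-1)*n + i) = 1;
    end
end
fprintf('commutation check: %.2e\n', norm(Kcomm*X(:) - reshape(X', [], 1)));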

3. Riemannian Newton’s Method for Problem 1

Since we have already obtained the matrix expressions of and and some other requisites for Riemannian optimization algorithms, in this section, we develop Riemannian Newton’s method for Problem 1. In Newton’s method, the search direction at is determined to be the solution to Newton’s equation

After is obtained, the candidate for the new iterate is then given bywhere the step size is set to one for simplicity. By substituting (23) and (25) into (43), complete Riemannian Newton’s equation for Problem 1 at can be written as

Now, the problem is how to solve the above Newton’s equation. Note that it is complicated and difficult to solve because it must be solved for for given , i.e., must satisfy . To overcome these difficulties, we wish to obtain the representation matrix of Hess as a linear transformation on for arbitrarily fixed so that we can rewrite (45) as a standard linear equation, which admits many effective numerical approaches. To this end, we introduce the following lemma.

Lemma 3 (see [24], p. 42). An equivalent form of the tangent space to at is given bywhere is an arbitrary matrix that satisfies and .

From Lemma 3, one can easily check that the vector space of all tangent vectors has a dimension of . Let be the search direction at . From Lemma 3, can be expressed by

Since is also determined to be a unique tangent vector in , there exist unique matrices and which satisfy
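As a quick sanity check of this parametrization, the following MATLAB snippet builds a random tangent vector in the form of Lemma 3 and verifies the tangency condition and the recovery of its coordinates; the variable names are ours.

n = 8; p = 3;
[X, ~] = qr(randn(n, p), 0);                    % a random point with X'*X = I_p
Xperp = null(X');                               % orthonormal basis of the orthogonal complement

Omega = randn(p); Omega = (Omega - Omega')/2;   % skew-symmetric p-by-p block
K = randn(n - p, p);                            % free (n-p)-by-p block
Z = X*Omega + Xperp*K;                          % a tangent vector at X

% Tangency: X'*Z + Z'*X should vanish up to round-off.
fprintf('tangency residual: %.2e\n', norm(X'*Z + Z'*X, 'fro'));
% The coordinates are recovered by Omega = X'*Z and K = Xperp'*Z.
fprintf('coordinate residuals: %.2e, %.2e\n', ...
    norm(X'*Z - Omega, 'fro'), norm(Xperp'*Z - K, 'fro'));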

The following proposition shows that we can write and by using and .

Proposition 1. Let and satisfy . If a tangent vector is expressed as (47), then the Hessian of the objective function acts on as with

Proof. From (25) and (48), we have the following equality:Note that and for any . Multiplying equation (51) by from the left and using the relations yieldsBy substituting (47) into (52) and using the relation , we obtain the expression of in (49) by using and . Similarly, we multiply (51) by from the left to obtainBy substituting (47) into (53) and using the relation , we obtain the expression of in (50).
The Hessian is a linear transformation with respect to , and, therefore, with respect to the set of matrices with and . Meanwhile, is also in the tangent space and can be uniquely expressed as with . Note that the dimension of the tangent space is ; thus, we can regard the Hessian as a linear transformation on that transforms a -dimensional vector into another -dimensional vector. The following lemma derives the representation matrix of that linear transformation, in which we use the and operators taking matrices and skew-symmetric matrices into vectors.

Lemma 4. Let be a linear transformation on that acts on with aswhere and are given in equations (49) and (50). Then, the representation matrix of is given bywhere

Proof. From equations (49) and (50) together with (39)–(42), and are calculated as follows:This completes the proof.
The following proposition considers the structure of the representation matrix .

Proposition 2. The representation matrix is always symmetric.

Proof. Let and be two tangent vectors at which are expressed bywhere and . From the induced metric for , together with (35) and (36), one can getSince the Hessian is symmetric with respect to the induced inner product for [24], we haveThis is equivalent toHence, we obtainbecause and can be arbitrary tangent vectors at .
We are now in a position to solve Newton’s equation (45). From (48), Newton’s equation (45) can be rewritten asBy multiplying (63) from the left by and , (63) is equivalent toNote again that and for any . Applying the vec operation to the two equations of (64), we getTogether with Lemma 4, Newton’s equation (45) can be represented in the following form:where is given in (55), which is the symmetric representation matrix of Hess :We have thus derived an equivalent form of Newton’s equation (45), namely, a standard symmetric linear equation. After we solve this symmetric linear equation (66), we can obtain and ; then, we can easily reshape and . Therefore, we can calculate the solution of Newton’s equation (45).
Based on the above discussion, the corresponding Riemannian Newton’s method for solving Problem 1 can be described as follows.
If we know a good approximate solution of the problem in advance, Newton’s method works effectively. This is because Newton’s method in general generates sequences that are only locally, but quadratically, convergent (Theorem 6.3.2 of [24, 34]).

Theorem 1. (local convergence). Consider Algorithm 1 with retraction as described in Section 2. Let be a critical point of ; . Assume that is nondegenerate at . Then, there exists a neighborhood of in such that for all , Algorithm 1 generates an infinite sequence converging quadratically to .

From [17, 21, 22], we have several remarks for Algorithm 1 as follows.

(1)Choose an initial point .
(2)fordo
(3) Compute that satisfies and .
(4) Compute the partition matrix by Lemma 4 and the vector by (68). Compute that satisfies the symmetric linear equation (69).
(5) Partition as , where .
(6) Construct and that satisfy and .
(7) Compute .
(8) Compute the next iterate , where is a retraction on .
(9)end for

Remark 1. In Step 3 of Algorithm 1, we can arbitrarily fix to satisfy and . Such can be computed by a full QR decomposition. In practice, calling MATLAB’s qr function with input returns an orthogonal matrix such that where is an upper triangular matrix. If we partition with and , then . Therefore, we can choose as .
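For example, in MATLAB this construction reads (assuming X, n, and p are in the workspace):

[Q, R] = qr(X);              % full (not economy-size) QR factorization of the n-by-p matrix X
Xperp = Q(:, p+1:n);         % last n-p columns of Q
% sanity checks: norm(X'*Xperp) and norm(Xperp'*Xperp - eye(n-p)) should be ~0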

Remark 2. The symmetric linear equation (69) can be solved by direct inversion if is invertible. However, if the dimension of the problem is large, solving (69) by direct inversion is costly. In this case, some Krylov subspace methods such as the conjugate residual (CR) method (page 182 of [31]) can be used to solve (69). The CR method solves linear equations of the form where is an invertible and symmetric matrix and is nonzero. Note that the system matrix is only required to be symmetric, not symmetric positive definite. However, we should point out that other solvers can also be applied to solve (69) because there are no specific structural requirements for the variable . When is positive definite, the conjugate gradient (CG) method can be used. Moreover, if we regard (69) as just a linear equation of standard form, the representation matrix need not be treated as symmetric. In this case, more Krylov subspace methods can be used, such as the generalized minimal residual (GMRES) method and the biconjugate gradient stabilized (BiCGSTAB) method. Specifically, we use the CR method for all the numerical experiments in Section 5.
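A minimal MATLAB sketch of the CR iteration for a symmetric (possibly indefinite) system is given below; the function name, the tolerance handling, and the option of passing the coefficient matrix as a function handle are our own choices.

function y = conjugate_residual(E, b, tol, maxit)
% Conjugate residual (CR) iteration for E*y = b, with E symmetric but not
% necessarily positive definite. E may be a matrix or a function handle v -> E*v.
    if isa(E, 'function_handle'), applyE = E; else, applyE = @(v) E*v; end
    y  = zeros(size(b));
    r  = b;                            % residual for the zero initial guess
    q  = r;                            % search direction
    Er = applyE(r);
    Eq = Er;
    nb = norm(b);
    for k = 1:maxit
        alpha = (r' * Er) / (Eq' * Eq);
        y     = y + alpha * q;
        rnew  = r - alpha * Eq;
        if norm(rnew) <= tol * nb, break; end
        Ernew = applyE(rnew);
        beta  = (rnew' * Ernew) / (r' * Er);
        q     = rnew + beta * q;
        Eq    = Ernew + beta * Eq;
        r = rnew;  Er = Ernew;
    end
end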

Remark 3. In general, the Newton vector , the solution of (43), is not necessarily a descent direction of . Indeed, we havewhich is not guaranteed to be negative without additional assumptions on the operator . A sufficient condition for to be a descent direction is that be positive definite, i.e., for all . The positive definiteness of the representation matrix can be used to check the positive definiteness of the Hessian , which in turn can be used to check whether a critical point obtained by Algorithm 1 is a local minimum. We note that for any with , ,Therefore, if the symmetric matrix is positive definite, then is positive definite, and is a local minimum. By using the partition form (55) of , a necessary condition for the positive definiteness of is that and be positive definite. This yields an easy-to-check necessary condition for positive definiteness: all the diagonal components of and must be positive. We next consider the case where Algorithm 1 arrives at a critical point of and assume that is positive semidefinite. Then, is positive definite if and only if . Therefore, under the assumption of positive semidefiniteness of , the positive definiteness of is necessary and sufficient for and .
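Assuming the representation matrix has been formed explicitly as a MATLAB matrix E, these checks can be carried out, for instance, as follows.

E = (E + E')/2;                    % symmetrize against round-off
[~, chol_flag] = chol(E);
is_pos_def = (chol_flag == 0);     % sufficient condition discussed above
diag_necessary = all(diag(E) > 0); % cheap necessary condition (positive diagonal)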

Remark 4. Similar to unconstrained optimization in the Euclidean space, for some constant is a reasonable stopping criterion for Problem 1. In fact, the Lagrangian of Problem 1 iswhere is a symmetric matrix representing the Lagrange multiplier. Then, the first-order optimality conditions in the Euclidean sense areFrom the second equation, we have . Since must be symmetric, we further obtain . Under the conditions , is equivalent to , since it follows from (15), (23), and (24) thatThus, first-order critical points in the Euclidean sense can be interpreted as stationary points in the Riemannian sense.
Because Newton’s method is not guaranteed to have global convergence, we need to prepare a suitable starting point that is sufficiently close to an optimal solution. Here, we apply a state-of-the-art Riemannian gradient-type algorithm, OptStiefelGBB, to obtain a suitable initial point. OptStiefelGBB, proposed by Wen and Yin [13], is simple and has the benefit of requiring very little memory. The retraction adopted in OptStiefelGBB is the Cayley transformation (17), and an initial step size computed by the Barzilai–Borwein (BB) method [35] is used to speed up the convergence. Since the BB step size does not necessarily decrease the objective value at every iteration, it may compromise convergence, so a nonmonotone line search based on a strategy in [36] is also used to guarantee global convergence.
To speed up the convergence, we propose a hybrid method for Problem 1. The strategy is as follows: we first run OptStiefelGBB until is sufficiently small and then switch to Newton’s method. One possible switching criterion iswhere . The algorithm is stated as follows.
Since the cost function in Problem 1 is smooth and the Stiefel manifold is compact, OptStiefelGBB applied to Problem 1 generates a sequence converging globally to a critical point of [13]. On the other hand, according to Theorem 1, Newton’s method generates a quadratically convergent sequence if the initial point is sufficiently close to an optimal solution. Therefore, we know that the proposed hybrid algorithm is globally and quadratically convergent if is sufficiently small.

Proposition 3. If is sufficiently small, Algorithm 2 with the switching criterion (74) generates a globally and quadratically convergent sequence.

(1)Choose an initial point and a parameter . Set .
(2)whiledo
(3) Perform OptStiefelGBB. All parameters needed in OptStiefelGBB are the same as those in [13].
(4).
(5)end while
(6)Set and .
(7)Perform Steps 2–9 in Algorithm 1.
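To summarize the switching logic, the following MATLAB sketch implements a simplified version of Algorithm 2 in which a plain Riemannian gradient step with a QR retraction and Armijo backtracking stands in for OptStiefelGBB, and the Newton refinement (Steps 2–9 of Algorithm 1) is left as a placeholder; all names here are illustrative and are not the authors' released code.

function X = hybrid_newton(cost, egrad, X0, epsilon, maxit)
% Warm start with a simple Riemannian gradient method until the Riemannian
% gradient is small, then hand over to a Newton refinement (not shown here).
% cost(X) and egrad(X) return the Euclidean objective value and gradient.
    X = X0;
    for k = 1:maxit
        G  = egrad(X);
        W  = X' * G;
        gr = G - X * (W + W') / 2;                 % Riemannian gradient at X
        if norm(gr, 'fro') <= epsilon, break; end
        t = 1;  f0 = cost(X);
        while cost(retract_qr(X, -t*gr)) > f0 - 1e-4*t*norm(gr, 'fro')^2 && t > 1e-12
            t = t / 2;                             % Armijo backtracking
        end
        X = retract_qr(X, -t*gr);
    end
    % X = newton_refine(X);   % placeholder for Steps 2-9 of Algorithm 1
end

function Y = retract_qr(X, Z)
    [Q, R] = qr(X + Z, 0);
    d = sign(diag(R));  d(d == 0) = 1;
    Y = Q * diag(d);
end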

4. Riemannian Trust-Region Method for Problem 1

With the preliminary results discussed in Section 2, we now state the Riemannian trust-region method for solving Problem 1. The trust-region method is an iterative method for minimizing a cost function. At each iteration step, a quadratic model of the cost function is obtained. This model is assumed to be suitable in a region (the trust region) around the current iterate. Then, an update is computed as the minimizer of the model in the trust region. The quality of the trial update is evaluated; it is consequently accepted or rejected, and the trust-region radius is adjusted.

In a Euclidean space , if and is the inner product in , the trust-region subproblem for finding the update for the current iterate is given bywhere denotes the directional derivative of the function at in the direction of and is the trust-region radius. The quality of the model is evaluated by means of the quotient

If is close to 0 or negative, then the model is very inaccurate; i.e., the step must be rejected, and the trust-region radius must be reduced. If is larger but still small, the step is accepted, and the trust-region radius is reduced. Finally, if is close to 1, then there is a good correspondence between the model and the cost function; the step is accepted, and the trust-region radius can be increased.

In the Riemannian trust-region method, at the -th iteration , by utilizing the Taylor expansion on manifold, we consider the following trust-region subproblem on the tangent space:

From (47) and (48), we have

On the other hand, since , then from Lemma 3, can also be uniquely expressed as with and . And

Then, the Riemannian trust-region subproblem (77) can be written as

In order to provide a guidance for selecting the new trust-region radius , we introduce the quotientwhich is also used to judge the acceptance or rejection of the candidate . Due to the fact that , can be given by

As in the Euclidean setting, the constants and are compared with the ratio , and the result determines the trust-region radius in the next iteration. In addition, the constant is used to measure . The trust-region step will be accepted as the next iterate if , and rejected otherwise. Specifically, we present the Riemannian trust-region method for Problem 1 as follows.

The trust-region subproblem (80) is solved iteratively and forms the inner iteration of Algorithm 3. Note that (80) is a minimization problem in and can therefore be solved by many classical approaches. A widely used approach is the truncated conjugate gradient algorithm (tCG), for the following reasons:(1)The tCG is a Krylov subspace-based solver in which if the initial guess and , then the sequences and generated by tCG always satisfy and , and hence the approximation .(2)The sequence is strictly decreasing while is strictly increasing, which makes the RTR a descent approach and guarantees global convergence (see [24], Proposition 7.3.2).

(1)Choose parameters , , , and an initial point .
(2)fordo
(3) Obtain by solving the trust-region subproblem (80).
(4) Evaluate from (82).
(5)ifthen
(6)  .
(7)else if and then
(8)  .
(9)else
(10)  .
(11)end if
(12)ifthen
(13)  .
(14)else
(15)  .
(16)end if
(17)end for

The pseudocode of tCG is presented in Algorithm 4 for completeness. Note that we use superscript indices without round brackets to denote the evolution of and within the inner iteration, while superscripts with round brackets are used in the outer iteration. As suggested by Absil et al. [24], the algorithm can be stopped either after a fixed number of iterations or by the criterion:where are real parameters and . In our numerical testing, we set and .

Remark 5. Fix , and let be the th iterate of the tCG method for the subproblem expressed by (77). In the existing RTR method, which directly solves the subproblem (77), is updated as a tangent vector in , while in Algorithm 4, we update and , where . That is, if the tCG method for (77) terminates with and , we only have to compute . We do not have to compute for . Furthermore, in the existing RTR method, for some must be computed at each iteration of the tCG method. However, the proposed Algorithm 4 needs only and , where . Therefore, if the number of iterations in the inner tCG method needed for solving the trust-region subproblems is sufficiently large, the proposed Algorithm 4 may have a shorter total computational time than the existing RTR method, which directly solves the subproblem (77). These facts imply that our proposed method can reduce the computational cost.
The convergence of the general RTR-tCG on a smooth Riemannian manifold has been extensively analyzed in [24, 37] where both the global convergence and local convergence rate have been established under appropriate assumptions. Therefore, to understand the performance of the RTR-tCG method for solving Problem 1, we only need to check these assumptions, which leads to the following conclusions.

(1)Initialization: set and ; and ; and and .
(2)fordo
(3)if (negative curvature) then
(4)  Compute such that and minimize in (80) and satisfy .
(5)  Return and .
(6)else
(7)  Set .
(8)  Set and .
(9)  if (exceeded trust-region) then
(10)   Compute such that and satisfy .
(11)   Return and .
(12)  else
(13)   Set and .
(14)   Set .
(15)   Set and .
(16)  end if
(17)end if
(18)end for
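As an illustration, the following MATLAB sketch implements a Steihaug–Toint truncated CG solver for a generic subproblem min_s g'*s + (1/2)*s'*H*s subject to ||s|| <= Delta, with an inner stopping rule of the form used in (83); in Algorithm 4 the same recursion is applied to the vectorized coordinates of the tangent vector, with the Hessian accessed through its representation matrix. The function names and the plain Euclidean inner product are our simplifications.

function s = truncated_cg(Hmul, g, Delta, theta, kappa, maxit)
% Steihaug-Toint truncated CG for  min  g'*s + 0.5*s'*H*s,  ||s|| <= Delta,
% with H symmetric and accessed only through the product Hmul(v) = H*v.
    s = zeros(size(g));
    r = g;  d = -r;
    r0norm = norm(r);
    for j = 1:maxit
        Hd  = Hmul(d);
        dHd = d' * Hd;
        if dHd <= 0                                 % negative curvature encountered
            s = s + boundary_step(s, d, Delta);  return;
        end
        alpha = (r' * r) / dHd;
        if norm(s + alpha*d) >= Delta               % step leaves the trust region
            s = s + boundary_step(s, d, Delta);  return;
        end
        s = s + alpha * d;
        rnew = r + alpha * Hd;
        if norm(rnew) <= r0norm * min(r0norm^theta, kappa), return; end
        beta = (rnew' * rnew) / (r' * r);
        d = -rnew + beta * d;
        r = rnew;
    end
end

function tau_d = boundary_step(s, d, Delta)
% Positive tau with ||s + tau*d|| = Delta.
    a = d' * d;  b = 2 * (s' * d);  c = s' * s - Delta^2;
    tau = (-b + sqrt(b^2 - 4*a*c)) / (2*a);
    tau_d = tau * d;
end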

Theorem 2 (global convergence). Let be a sequence of iterates generated by Algorithms 3 and 4, and let an iterate be accepted if with . Then,

In fact, this conclusion is a straightforward result of Theorem 4.4 and Corollary 4.6 of [37], based on the fact that the cost function and the adopted retraction as described in Section 2 are smooth, and the involved manifold is a smooth and compact Riemannian manifold.

Theorem 3 (local convergence speed). Consider Algorithms 3 and 4 with retraction as in Section 2 and stopping criterion in Algorithm 4 as in (83). Let be a (nondegenerate) local minimum of . Then, there exists such that for all sequences generated by the algorithm converging to , there exists such that for all ,with as in (83), and “dist” defines the Riemannian distance on (see [24], p. 46).

5. Numerical Experiments

In this section, we report the numerical performance of the proposed algorithms for Problem 1. All the numerical experiments were carried out on a personal computer with an Intel(R) Core(TM)2 Quad 2.33 GHz CPU and 3.00 GB of RAM, running MATLAB R2019b. We report numerical experiments and applications to the least squares fitting of the DEDICOM model and the orthonormal INDSCAL model. To illustrate the efficiency of our algorithms, we also compare Algorithms 2 and 3 with some recent Riemannian conjugate gradient- (CG-) type methods, which are all applicable to Problem 1 with necessary modifications, and with some existing Riemannian second-order algorithms in the MATLAB toolbox Manopt.

5.1. Algorithmic Issues

Throughout our experiments, in the implementation of Algorithm 2, the parameter in (74) is set to . In the implementation of the inner iteration of Algorithm 2, i.e., the CR method applied to the symmetric linear equation (69), the initial iterate is set to 0 of suitable size at each repetition of Newton’s method, and the iteration is stopped when the relative residual 2-norm becomes less than . Here, the residual is , where denotes the th approximation of CR at the th repetition of Newton’s method. The maximum number of CR iterations is set to 1000. In the implementation of Algorithm 3, OptStiefelGBB is also applied to obtain a suitable initial point, and the parameters are set to , and .

The stopping criteria for Algorithms 2 and 3 are set to bewith . For all the compared Riemannian gradient-type algorithms, we use the same stopping criterion as that in OptStiefelGBB [13]: we let algorithms run up to iterations and stop it at iteration if or and orfor some constants , , , and , where

The default values of , , , and are , , 5, and 2000, respectively. Throughout, “IT.,” “CT.,” “Grad.,” and “Obj.” mean the number of iterations, the total computing time in seconds, the norm of Riemannian gradient , and the objective function value at the final iterate by implementing the proposed algorithms, respectively.

5.2. Numerical Experiments and Application to the Least Squares Fitting of the DEDICOM Model

DEDICOM is a model proposed by Harshman [5] for the analysis of asymmetric data. According to the DEDICOM model, a square data matrix , containing entries representing the (asymmetric) relation of object to object , is decomposed aswhere is an by matrix of weights for the objects on dimensions or aspects, is a square matrix of order , representing (asymmetric) relations among the dimensions, and is an error matrix with entries representing the part of the relation of object to object that is not explained by the model. The objective of fitting this model to the data is to explain the data by means of relations among as small a number of dimensions as possible. These dimensions can be considered as “aspects” of the objects. The “loadings” of the objects on these aspects are given by matrix , which is constrained to be columnwise orthonormal. The entries in matrix indicate the importance of the aspects for the objects. The dimensionality of and and hence the number of aspects to be determined are to be based on some external criterion, defined by the user.

The DEDICOM model has to be fitted in the least squares sense over matrices and of order by and by , respectively. The loss function to be minimized can be written asover matrices and , subject to the constraint . Kiers and ten Berge [1, 2] have given an algorithm for this problem based on the alternating least squares method, which can be interpreted as alternately updating and . Because , the minimum of over for fixed is given by . Minimizing over , for fixed , is equivalent to minimizingwhich is a particular case of Problem 1 with , , and . An algorithm based on majorization was provided in [2] for updating this subproblem on . Because the majorization algorithm requires a large number of iterations to converge to within a given tolerance, we here test the numerical performance of our proposed algorithms for updating .
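For concreteness, one ALS sweep can be sketched in MATLAB as follows, writing the model as Z ≈ X*R*X' with X'*X = I and using a placeholder update_X for the trace subproblem solved by Algorithms 2 and 3; the names are ours.

% One alternating least squares sweep for the DEDICOM fit
%   minimize || Z - X*R*X' ||_F^2   subject to  X'*X = I_p.
R = X' * Z * X;               % closed-form R-update for columnwise orthonormal X
% X = update_X(Z, R, X);      % X-update: the trace subproblem (Algorithm 2 or 3)
loss = norm(Z - X*R*X', 'fro')^2;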

Without loss of generality, the square data matrix in the DEDICOM model is given by and the fixed is given by . The dimensions and were set to 500 and 5, respectively. The iterations of Algorithms 2 and 3 were started with or , where the parameter was set to 10, , and , and is a solution of problem generated by majorization algorithm.

Figure 1 shows the convergence histories of the proposed Algorithm 2 after switching to Newton’s method (Algorithm 1) and of Algorithm 3 after OptStiefelGBB is performed to generate a suitable initial point, for different choices of the initial matrix . The number of iterations is plotted on the horizontal axis, and the log10 of the norm of the Riemannian gradient is plotted on the vertical axis in these figures. More numerical results are reported in Table 2, in which the term “IT.” for Algorithms 2 and 3 means the number of iterations required by OptStiefelGBB to satisfy the switching criterion and the number of extra iterations required by Algorithm 2 or Algorithm 3 to achieve the stopping criteria, and “Obj.” means the objective function value at the final iterate. From Figure 1 and Table 2, we observe that once arrives in the appropriate region determined by , Algorithms 2 and 3 generate sequences that converge quickly. We can also see that the effectiveness of Algorithms 2 and 3 was not affected by the different choices of the initial matrix , and the convergence speed of these two algorithms is not very sensitive to the choice of . This is in accordance with the global convergence of these two algorithms established in Proposition 3 and Theorem 2, respectively. For this reason, we use in the subsequent examples.

Next, we also show the detailed numerical results when applying the CR method to the symmetric linear equation (69) directly in the implementation of Step 7 of Algorithm 2. Figure 2 shows the convergence histories of the CR method for (69). The number of iterations of the CR method is plotted on the horizontal axis, and the log10 of the relative residual norm, , is plotted on the vertical axis in the figures. More numerical results are reported in Table 3. We can observe from Figure 2 and Table 3 that the CR method is effective for solving the Newton equation in the symmetric linear form (69), and that Newton’s method, with (69) solved in this way, generates locally quadratically convergent sequences.

Finally, we numerically compare Algorithm 3 in which the modified trust-region subproblems (80) are solved by Algorithm 4 with the existing trust-region method (denoted by Existing RTR) in MATLAB toolbox Manopt [38]. We fix . For each , we compute the total computing time needed for convergence. Table 4 and Figure 3 show that the performance of the proposed Algorithm 3 is slightly better than that of the existing method in terms of computational time. This phenomenon becomes more pronounced as the matrix dimension increases.

5.3. Numerical Comparison with Latest Riemannian CG Methods for the Least Squares Fitting of the Orthonormal INDSCAL Model

The orthonormal INDSCAL model is used to analyze a three-way array consisting of symmetric slices [7, 8, 10]. In the model, these slices are composed of the doubly centered dissimilarity matrices. The orthonormal INDSCAL model decomposes each slice aswhere is a matrix (assumed to be columnwise orthonormal), is a diagonal matrix, and is a matrix containing the errors of the model fit. That is, all slices share a common loading matrix and differ from each other only in the diagonal elements of , called idiosyncratic saliences. The orthonormal INDSCAL model, which appeared initially in the psychometric literature, is now widely used in the social sciences, marketing research, etc.

The orthonormal INDSCAL problem seeks such that the model fits the data in the least squares sense, i.e., for given and fixed symmetric matrices , , it minimizes the function:where and denotes the set of all diagonal matrices. This optimization problem has no direct analytical solution. The standard numerical solution of this problem is given by an alternating least squares algorithm [1, 8, 10]. Minimizing over , for fixed , , is equivalent to minimizingwhich is also a particular case of Problem 1 with , , , . A majorization-based algorithm was provided in [1] for updating this subproblem on . Here, we test our proposed algorithms on this subproblem.
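In MATLAB, writing the slices as S{1}, ..., S{m} and the model as S{i} ≈ X*D{i}*X' with X'*X = I, the diagonal updates for fixed X take the closed form D{i} = diag(diag(X'*S{i}*X)); a short sketch with our own variable names follows.

loss = 0;
for i = 1:m
    D{i} = diag(diag(X' * S{i} * X));            % optimal diagonal matrix for fixed X
    loss = loss + norm(S{i} - X*D{i}*X', 'fro')^2;
end
% The X-update for fixed D{1},...,D{m} is the trace subproblem handled by
% Algorithms 2 and 3 (or by the compared Riemannian CG methods).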

To illustrate the efficiency of our algorithms, we also compare them with the existing Riemannian trust-region algorithm (denoted by existing RTR) and the existing Riemannian BFGS algorithm (denoted by existing RBFGS) in the MATLAB toolbox Manopt [38], as well as with some recent Riemannian conjugate gradient- (CG-) type methods, which are all applicable with necessary modifications. To this end, we first introduce the iterative framework of each compared Riemannian CG method. For solving Problem 1, the iterate is updated using the following scheme starting in with ,where is the step size and iswhere is a vector transport on [24], which is a smooth mapping from the product of tangent bundles to the tangent bundle , where is the Whitney sum. There are several expressions to update the parameter of equation (1), and some of the most popular expressions are those of the Fletcher–Reeves method:and of the Polak–Ribière method:

In this experiment, we compare Algorithms 2 and 3 with the following Riemannian CG methods. We note that the differences between these Riemannian CG methods are mainly due to their specific ways of updating the parameters and and constructing retractions and vector transports.(i)“RGPRP-CG”: the Riemannian geometric Polak–Ribière–Polyak CG method used in [16], designed to solve Problem 1. The retraction adopted is the -based retraction (19), and the vector transport is constructed by using the orthogonal projection operator onto the tangent space, which has the formThe line search is an Armijo-type line search and the CG direction is given bywherewith as in (3) and .(ii)“RFR-CG”: the Riemannian Fletcher–Reeves CG method used in [18], designed to solve Problem 1, with the same QR-based retraction and vector transport as “RGPRP-CG.” The step size is determined by such thatwhere is the initial step length, , and . The CG direction is given aswith as in (2) and .(iii)“RCG-Cayley”: the Riemannian case of Dai’s nonlinear CG method used in [14], designed to solve Problem 1 by using the Cayley transformation (17) to construct the retraction and the differentiated retraction to construct the vector transportwhere , in which and with . The step size is determined by the following nonmonotone line search condition instead of the Wolfe conditions:where and is some positive integer. The parameter is computed as with(iv)“RCG-QR”: a modification of “RCG-Cayley,” by using the -based retraction and the corresponding differentiated retraction as a vector transport (see p. 173 of [24]).(v)“RCG-Pol”: another modification of “RCG-Cayley” by using the polar decomposition (18) to construct a retraction and naturally using the corresponding differentiated retraction as a vector transport (see p. 173 of [24]).
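To make the comparison above concrete, the following MATLAB sketch shows a generic Riemannian Fletcher–Reeves CG iteration on the Stiefel manifold with a QR retraction, a projection-based vector transport, and Armijo backtracking; it is a textbook-style simplification under our own names, not one of the tuned implementations compared in the experiments.

function X = riemannian_fr_cg(cost, egrad, X0, maxit, tol)
% Generic Riemannian Fletcher-Reeves CG on the Stiefel manifold.
% cost(X) and egrad(X) return the Euclidean objective value and gradient.
    proj = @(X, Z) Z - X*((X'*Z) + (X'*Z)')/2;       % projection onto T_X St
    X = X0;
    g = proj(X, egrad(X));
    d = -g;
    for k = 1:maxit
        if norm(g, 'fro') <= tol, break; end
        t = 1;  f0 = cost(X);  slope = g(:)' * d(:); % Armijo backtracking along d
        while cost(retract_qr(X, t*d)) > f0 + 1e-4*t*slope && t > 1e-12
            t = t / 2;
        end
        Xnew = retract_qr(X, t*d);
        gnew = proj(Xnew, egrad(Xnew));
        beta = (norm(gnew, 'fro') / norm(g, 'fro'))^2;   % Fletcher-Reeves parameter
        d    = -gnew + beta * proj(Xnew, d);             % transport d by projection
        X = Xnew;  g = gnew;
    end
end

function Y = retract_qr(X, Z)
    [Q, R] = qr(X + Z, 0);
    s = sign(diag(R));  s(s == 0) = 1;
    Y = Q * diag(s);
end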

In our experiments, the index . We consider to be correlation matrices of the form

In this case, INDSCAL can be seen as simultaneous principal component analysis of several correlation matrices [10]. We use the diagonal matrices and . All starting points were generated randomly by . In particular, similar to [16, 18], a reasonable initial step length in the “RGPRP-CG” and “RFR-CG” methods can be set to

All the parameters in the compared Riemannian CG methods are chosen in a standard way as in their corresponding literature.

Figures 4 and 5 show the convergence histories of the considered algorithms with different problem sizes . The number of iterations is plotted on the horizontal axis, and the log10 of the norm of the Riemannian gradient is plotted on the vertical axis in these figures. The convergence curves generated by Algorithm 2 or Algorithm 3 are composed of two pieces: the first is an initial part of the curve generated by OptStiefelGBB, and the second is the vertical line segment generated by Newton’s method. These curves show that switching from OptStiefelGBB to Newton’s method or to the trust-region method drastically accelerates the speed of convergence. The convergence curve of OptStiefelGBB alone shows that the convergence of the sequence it generates is very slow. If the method is switched to Newton’s method or the trust-region method after the iterate gets into the -neighborhood of the target, the generated sequence reaches, in a few steps, the final point within the required accuracy. More numerical results are reported in Table 5, where the term “IT.” for Algorithms 2 and 3 is the same as that in Table 3 and “Obj.” means the objective function value at the final iterate. We can observe from Table 5 that Algorithm 3, with OptStiefelGBB used to generate a suitable initial point, works very effectively for the orthonormal INDSCAL fitting problem and always performs much better than the existing Riemannian BFGS algorithm and all the compared Riemannian gradient-type methods in terms of total computing time and the accuracy of the solution. On the other hand, we note that the OptStiefelGBB phase of Algorithm 3 acts much like a preconditioning step. In most cases, this scheme brings a significant reduction in the number of iterations compared with the existing Riemannian RTR method.

We can also see from Table 5 that when the problem size is large, Algorithm 2 uses much more computing time than the other methods. This is because Newton’s method involves an inner iteration, the CR method applied to the symmetric linear equation (69), which consumes a lot of time to obtain an approximate solution, especially when the system dimension is large. Even though Newton’s method takes the longest time per iteration among the considered algorithms, its overall convergence is very fast.

6. Conclusion

We have treated the trace function minimization problem with orthogonality constraints as an optimization problem on the manifold and have developed a Riemannian Newton method and a Riemannian trust-region method for the problem. To reduce the difficulty of solving Newton’s equation in Newton’s method, we have computed the representation matrix of the Hessian of the objective function using the vectorization operators and have reduced Newton’s equation to a standard symmetric linear equation of lower dimension. In addition, we have proposed a new trust-region method in which the trust-region subproblems are solved by the truncated conjugate gradient method based on our expressions for the Hessian of the objective function. Furthermore, we have performed numerical experiments to verify that the proposed algorithms almost always perform much better than some Riemannian gradient-type methods.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

Part of this work was done while the first author was visiting the Department of Mathematics, Southern Illinois University Carbondale; he is grateful to the department for providing good working conditions. The authors thank Prof. Mingqing Xiao for discussions on the practical applications of Problem 1 in multivariate statistics. This research was supported by the National Natural Science Foundation of China (11761024, 11561015, and 11961012), the Natural Science Foundation of Guangxi Province (2016GXNSFAA380074, 2016GXNSFFA380009, and 2017GXNSF-BA198082), the GUET Excellent Graduate Thesis Program (17YJPYSS24 and 2019YJSPY03), and the GUET Graduate Innovation Project (2020YJSCX02).