Abstract

The minimum average variance estimation (MAVE) method has proven to be an effective approach to sufficient dimension reduction (SDR). In this study, we apply the computationally efficient alternating direction method of multipliers (ADMM) to the optimization problem underlying MAVE. Under some assumptions, we prove that the iterative sequence generated by ADMM converges to a stationary point of the associated augmented Lagrangian function. We also present numerical simulations on synthetic data to demonstrate the computational efficiency of the algorithm.

1. Introduction

It is well known that dimension reduction is a highly efficient tool for the visualization and statistical analysis of high-dimensional data. It is assumed that the predictor vector X affects the response variable Y only through a few linear combinations B^T X, where B is a p × d matrix with d < p. Thus, all the information about Y contained in X is summarized by B^T X. The column vectors of B span a subspace, which we call a sufficient dimension reduction (SDR) subspace (see Li [1], Cook [2]). To estimate it, we perform dimension reduction with the smallest possible value of d.

In the dimension reduction literature, two problems have been most widely studied. The first asks how the conditional distribution of Y given X varies with X. Specifically, it seeks a few linear combinations B^T X such that

Y is independent of X given B^T X. (1)

Then, we can find the subspaces span(B) satisfying (1). If the intersection of all such SDR subspaces is itself an SDR subspace, we call it the central subspace (CS) and denote it by S_{Y|X}. In many real applications, the conditional mean E(Y | X) is of most interest to researchers, so the objective of the second problem is to seek a matrix B satisfying

E(Y | X) = E(Y | B^T X). (2)

We call the intersection of all subspaces satisfying (2) the central mean space (CMS) and denote it by S_{E(Y|X)}, provided it is itself such a subspace. Cook [2] gave more discussion of the CS and the CMS; see also Cook and Li [3].

Various methods have been proposed to recover the CS or the CMS. For example, Li and Duan [4] studied ordinary least squares, Li [1] proposed sliced inverse regression, Cook and Weisberg [5] put forward sliced average variance estimation, Li [6] studied principal Hessian directions, Xia et al. [7] proposed minimum average variance estimation (MAVE), Li and Wang [8] considered directional regression, Xia [9] studied density MAVE, Wang and Xia [10] considered sliced regression-based MAVE, and Ma and Zhu [11-13] studied efficient semiparametric estimation. Most of these approaches rely on the principle of inverse regression, which carries a well-known limitation: the need for a linearity condition on the covariates. MAVE and some other semiparametric methods impose no strong hypotheses on the design of the covariates and therefore have wider applicability. Furthermore, in many situations, they yield more accurate estimators of the reduced-dimensional space; see Ma and Zhu [11], Xia et al. [7], and Xia [9]. On the other hand, MAVE is not robust to outliers in the response because of its use of least squares, and it is computationally intensive due to its use of kernel smoothing. The MAVE-type methods estimate the column space of B by minimizing the loss criterion described later in (14) under the orthogonality constraint B^T B = I_d. Thus, the implementation of the MAVE-type methods involves nonconvex optimization problems, owing to the nonconvexity of the orthogonality constraint. Therefore, heuristic algorithms without convergence guarantees are often used in the literature.

Recently, many researchers have applied the ADMM algorithm to high-dimensional statistics and machine learning, see Yin et al. [14], Xue et al. [15], Zhang and Zou [16], Gu et al. [17], and Kapla et al. [18], and to many other topics, such as matrix completion, tensor completion, and sparse recovery, see Li et al. [19, 20], Liu et al. [21], and Shi et al. [22]. The main advantages of the ADMM algorithm are its flexibility in handling a diversity of optimization problems and its good convergence properties, see Boyd et al. [23]. In this study, we demonstrate that the ADMM algorithm can be adapted to solve the optimization problem of the aforementioned MAVE-type methods. Moreover, using the theory of Wang et al. [24] for ADMM on nonconvex problems, we prove that the proposed algorithm converges to a stationary point. To our knowledge, the proposed ADMM algorithm is the first algorithm for MAVE with such a convergence guarantee; see Theorem 2 for details. In addition, Zhang et al. [25] proposed regularization with case-specific parameters to achieve robust estimation and outlier detection simultaneously.

In the rest of this article, we present the proposed ADMM algorithm and prove its convergence properties in Section 2. Numerical simulations are conducted to illustrate the proposed algorithm in Section 3. Finally, Section 4 concludes with a brief discussion. Some technical details are relegated to the Appendix.

2. The Proposed Algorithm Based on ADMM

The following ADMM algorithm is applicable to all the MAVE-type approaches. For simplicity, we only present the ADMM algorithm for estimating the CMS. Xia et al. [7] considered the following model for dimension reduction:

Y = g(B_0^T X) + ε,

where the p × d matrix B_0 satisfies B_0^T B_0 = I_d for some d < p, the link function g is smooth but unknown, and E(ε | X) = 0. Here, we focus on the estimation of B_0 by assuming that d is known.
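
For concreteness, data from such a model can be generated as in the following Python sketch; the link function g, the dimensions, and B_0 used here are purely illustrative and are not the models of Section 3.

    import numpy as np

    def generate_data(n=200, p=10, d=2, seed=0):
        # Draw X, then Y = g(B0'X) + eps for a made-up smooth link g.
        rng = np.random.default_rng(seed)
        B0 = np.linalg.qr(rng.standard_normal((p, d)))[0]   # orthonormal p x d index matrix
        X = rng.standard_normal((n, p))
        U = X @ B0                                          # reduced predictors B0'X
        Y = np.sin(U[:, 0]) + U[:, 1] ** 2 + 0.1 * rng.standard_normal(n)
        return X, Y, B0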

Consider the simple case first. Let B_0 be the solution of

min_B E[{Y - E(Y | B^T X)}^2] subject to B^T B = I_d.

It is known that the conditional variance of Y given B^T X, for a given B, should be

σ_B^2(B^T X) = E[{Y - E(Y | B^T X)}^2 | B^T X]

for any B with B^T B = I_d. Thus, taking the expectation of the conditional variance, we get

E[{Y - E(Y | B^T X)}^2] = E[σ_B^2(B^T X)].

So, we know the expression (6) is equivalent to

min_B E[σ_B^2(B^T X)] subject to B^T B = I_d.

This is called MAVE.

For a sample {(X_i, Y_i)}, i = 1, …, n, write X_ij = X_i - X_j.

For any given X_j, the local linear expansion of E(Y_i | B^T X_i) at X_j can be expressed as follows:

E(Y_i | B^T X_i) ≈ a_j + b_j^T B^T X_ij.

Here, a_j is the value of the conditional mean at B^T X_j and b_j is its gradient there. In addition, the approximation error vanishes as X_ij tends to zero.

Obviously, the residuals we want to examine are

Y_i - a_j - b_j^T B^T X_ij.

In the spirit of local linear smoothing, it is natural to estimate σ_B^2(B^T X_j) using the following approximation:

Σ_{i=1}^n w_ij (Y_i - a_j - b_j^T B^T X_ij)^2.

Here, the w_ij are weights satisfying Σ_{i=1}^n w_ij = 1. Two choices of the weights were given in Xia et al. [7]. The estimates of a_j and b_j are the minimizers of these local approximations, and likewise for B. Then, the estimate of B_0 is the minimizer of the sample version of expression (6). That is, it minimizes the sum of the local approximations over j.

Due to the expressions (4), (7), and (13), the MAVE method estimates B_0 by solving the following minimization problem:

min_{B, a_j, b_j} Σ_{j=1}^n Σ_{i=1}^n w_ij (Y_i - a_j - b_j^T B^T X_ij)^2 subject to B^T B = I_d, (14)

where X_ij = X_i - X_j. Xia et al. [7] solved the minimizer of (14) by an algorithm that iterates between (a_j, b_j) and B. Although the iterative algorithm seems feasible, it is difficult to establish its convergence property due to the nonconvexity of the orthogonality constraint B^T B = I_d. Zhu [26] studied optimization problems on the Stiefel manifold and mentioned that problem (14) can be solved in that framework.
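
To make the criterion concrete, the following Python sketch (not taken from the original paper) evaluates the sample objective in (14) for given B, a_j, and b_j; it assumes Gaussian kernel weights in the reduced directions with an illustrative bandwidth h, which is only one possible choice of the weights mentioned above.

    import numpy as np

    def mave_loss(B, a, b, X, Y, h=1.0):
        # Sample MAVE criterion: sum_j sum_i w_ij * (Y_i - a_j - b_j' B' (X_i - X_j))^2.
        # a has shape (n,), b has shape (n, d), B has shape (p, d).
        n = X.shape[0]
        loss = 0.0
        for j in range(n):
            U = (X - X[j]) @ B                           # rows: B'(X_i - X_j)
            K = np.exp(-0.5 * np.sum((U / h) ** 2, axis=1))
            w = K / K.sum()                              # weights over i, summing to one
            resid = Y - a[j] - U @ b[j]                  # Y_i - a_j - b_j' B'(X_i - X_j)
            loss += np.sum(w * resid ** 2)
        return loss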

Many structured convex optimization problems arise in machine learning, computer vision, and statistics, see, for example, Zanni et al. [27]. Since Gabay and Mercier [28] introduced the alternating direction method of multipliers (ADMM), it has become a powerful and successful method for solving various convex and nonconvex problems in these fields, and its convergence properties for convex objective functions have been extensively studied. Many researchers have successfully applied ADMM to such problems; see, for example, the survey by Boyd et al. [23], as well as Liu et al. [29] and Cascarano et al. [30].

Recently, many researchers have successfully used variants of ADMM to solve nonconvex problems. For example, Hong et al. [31] discussed the convergence properties of variants of ADMM applied to nonconvex problems and established their iteration complexity. For multiblock separable optimization problems with linear constraints and possibly nonconvex component functions, Guo et al. [32] showed that each cluster point of the iteration sequence generated by ADMM is a critical point. Jiang et al. [33] studied the iteration complexity of two linearized variants of the multiblock proximal ADMM. In particular, Wang et al. [24] studied the convergence of ADMM for minimizing a possibly nonconvex and nonsmooth objective function subject to coupled linear equality constraints. Jia et al. [34] considered the convergence rate of ADMM for separable nonconvex optimization problems.

In this study, we are interested in the following ADMM algorithm for optimizing problem (14) under the framework of Wang et al. [24]. For the general ADMM scheme, we refer the reader to the original paper of Gabay and Mercier [28].

Let St(p, d) = {C ∈ R^{p×d} : C^T C = I_d}. Then, we can reformulate the problem of minimizing (14) as that of minimizing

f(B, a, b) + δ(C) subject to B = C, (15)

where f(B, a, b) = Σ_{j=1}^n Σ_{i=1}^n w_ij (Y_i - a_j - b_j^T B^T X_ij)^2 and the function δ is simply the indicator function of the set St(p, d), i.e., δ(C) = 0 if C ∈ St(p, d) and δ(C) = +∞ if C ∉ St(p, d), with a = (a_1, …, a_n) and b = (b_1, …, b_n). The function f is proper, differentiable, and nonconvex.

The inner product of two matrices M and N of the same size is, as usual, written as <M, N> = tr(M^T N), where tr(·) is the trace operator. Then, the following function,

L_ρ(B, a, b, C, Λ) = f(B, a, b) + δ(C) + <Λ, B - C> + (ρ/2)||B - C||_F^2,

is the augmented Lagrangian function of (15), where ρ > 0 is the penalty parameter, Λ is the Lagrange multiplier, and ||·||_F is the Frobenius norm.
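
The augmented Lagrangian above can be evaluated directly. The sketch below reuses mave_loss from the previous snippet for f and handles the indicator δ numerically; the tolerance is an implementation choice, not part of the formulation.

    import numpy as np

    def stiefel_indicator(C, tol=1e-8):
        # delta(C): 0 if C'C = I_d within a numerical tolerance, +inf otherwise.
        d = C.shape[1]
        return 0.0 if np.allclose(C.T @ C, np.eye(d), atol=tol) else np.inf

    def augmented_lagrangian(B, a, b, C, Lam, rho, X, Y, h=1.0):
        # L_rho = f(B, a, b) + delta(C) + <Lam, B - C> + (rho/2) ||B - C||_F^2.
        gap = B - C
        return (mave_loss(B, a, b, X, Y, h)
                + stiefel_indicator(C)
                + np.sum(Lam * gap)                      # <Lam, B - C> = tr(Lam' (B - C))
                + 0.5 * rho * np.linalg.norm(gap, 'fro') ** 2)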

We can compute the estimates of B, a, b, C, and Λ through the following ADMM algorithm.

For given values of B^k, a^k, b^k, C^k, and Λ^k at step k, the iterative process is as follows:

(a^{k+1}, b^{k+1}) = argmin_{a, b} L_ρ(B^k, a, b, C^k, Λ^k),
B^{k+1} = argmin_B L_ρ(B, a^{k+1}, b^{k+1}, C^k, Λ^k),
C^{k+1} = argmin_C L_ρ(B^{k+1}, a^{k+1}, b^{k+1}, C, Λ^k),
Λ^{k+1} = Λ^k + ρ(B^{k+1} - C^{k+1}). (21)

Obviously, the (a, b)-subproblem reduces to minimizing

Σ_{j=1}^n Σ_{i=1}^n w_ij (Y_i - a_j - b_j^T (B^k)^T X_ij)^2,

which is equivalent to (10). Then, by simple algebraic manipulation, we obtain the explicit form: for each j = 1, …, n, (a_j^{k+1}, b_j^{k+1}) is the weighted least-squares estimate obtained by regressing Y_i on (1, (B^k)^T X_ij) with weights w_ij.
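
A minimal sketch of this (a, b) step follows; it assumes an n × n weight matrix W whose columns sum to one, and it adds a tiny ridge term purely for numerical stability, which is not part of the derivation above.

    import numpy as np

    def update_ab(B, X, Y, W):
        # For each j, weighted least squares of Y_i on (1, B'(X_i - X_j)) with weights W[i, j].
        n, p = X.shape
        d = B.shape[1]
        a = np.zeros(n)
        b = np.zeros((n, d))
        for j in range(n):
            U = (X - X[j]) @ B
            Z = np.hstack([np.ones((n, 1)), U])          # design matrix [1, B'(X_i - X_j)]
            w = W[:, j]
            G = Z.T @ (w[:, None] * Z)
            rhs = Z.T @ (w * Y)
            theta = np.linalg.solve(G + 1e-10 * np.eye(d + 1), rhs)   # tiny ridge for stability
            a[j], b[j] = theta[0], theta[1:]
        return a, b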

In (19), after discarding the terms that do not depend on B, one has to minimize

Σ_{j=1}^n Σ_{i=1}^n w_ij (Y_i - a_j^{k+1} - (b_j^{k+1})^T B^T X_ij)^2 + <Λ^k, B> + (ρ/2)||B - C^k||_F^2.

Since the above expression is a quadratic function of B, the update B^{k+1} has an explicit form. Specifically, we can obtain B^{k+1} by setting the gradient of the above expression with respect to B to zero. By some algebraic manipulation, we have

(2 Σ_{j=1}^n Σ_{i=1}^n w_ij u_ij u_ij^T + ρ I_{pd}) vec(B^{k+1}) = 2 Σ_{j=1}^n Σ_{i=1}^n w_ij (Y_i - a_j^{k+1}) u_ij + ρ vec(C^k) - vec(Λ^k).

Then, we obtain B^{k+1} by solving this linear system, where vec(·) denotes the column-wise vectorization of a matrix and u_ij = b_j^{k+1} ⊗ X_ij, with ⊗ the Kronecker product.
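
Under the vectorized form above, the B step reduces to one pd × pd linear solve. The following sketch assembles and solves that system; the function name and the column-major vec convention are implementation choices rather than notation from the paper.

    import numpy as np

    def update_B(a, b, C, Lam, rho, X, Y, W):
        # Solve (2 * sum_ij w_ij u_ij u_ij' + rho I) vec(B) =
        #        2 * sum_ij w_ij (Y_i - a_j) u_ij + rho vec(C) - vec(Lam),
        # with u_ij = kron(b_j, X_i - X_j) and column-major vec.
        n, p = X.shape
        d = b.shape[1]
        A = rho * np.eye(p * d)
        rhs = rho * C.flatten(order='F') - Lam.flatten(order='F')
        for j in range(n):
            Xij = X - X[j]
            U = np.kron(b[j], Xij)                       # row i equals u_ij', shape (n, p*d)
            w = W[:, j]
            A += 2.0 * U.T @ (w[:, None] * U)
            rhs += 2.0 * U.T @ (w * (Y - a[j]))
        return np.linalg.solve(A, rhs).reshape(p, d, order='F')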

The C-subproblem

min_C δ(C) + (ρ/2)||C - (B^{k+1} + Λ^k/ρ)||_F^2

is equivalent to (12). The solution of this subproblem is given in the following lemma.

Lemma 1. For any p × d matrix M with rank d, its projection, P(M), onto St(p, d) is defined as the solution of

min_C ||C - M||_F^2 subject to C^T C = I_d.

Then, we have P(M) = M(M^T M)^{-1/2}.

In the Appendix, we give the proof of Lemma 1. We apply Lemma 1 to the update step of C and obtain

C^{k+1} = P(B^{k+1} + Λ^k/ρ).
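
A short sketch of this C step follows; the SVD-based form U V^T used here is numerically equivalent to the expression M(M^T M)^{-1/2} in Lemma 1.

    import numpy as np

    def stiefel_projection(M):
        # Projection of a full-column-rank p x d matrix M onto {C : C'C = I_d}.
        # Via the thin SVD M = U S V', the projection is U V' = M (M'M)^(-1/2).
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ Vt

    def update_C(B, Lam, rho):
        # C step: project B + Lam / rho onto the Stiefel manifold (Lemma 1).
        return stiefel_projection(B + Lam / rho)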

Finally, the update of Λ is given in (21). The stopping rule is that ||B^{k+1} - C^{k+1}||_max ≤ τ for some small value τ > 0, where ||M||_max denotes the largest absolute entry of a matrix M. We summarize the above procedure in Algorithm 1.

Require: Initial values B^0, C^0, and Λ^0, penalty parameter ρ, and k = 0;
While stopping criterion is not satisfied do
 Compute (a^{k+1}, b^{k+1}) by (23);
 Compute B^{k+1} by (26);
 Compute C^{k+1} by (30);
 Compute Λ^{k+1} by (21);
 k ← k + 1;
end while
return B^k, a^k, b^k, C^k.
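
For illustration only, the sketches above can be assembled into a complete loop. This is a schematic rendering of Algorithm 1 under simplifying assumptions: the kernel weights are held fixed at their initial values, the initialization is a random orthonormal matrix rather than an OPG estimate, and the stopping tolerance is arbitrary.

    import numpy as np

    def kernel_weights(B, X, h=1.0):
        # Gaussian kernel weights w_ij built from B'(X_i - X_j); each column sums to one.
        n = X.shape[0]
        W = np.zeros((n, n))
        for j in range(n):
            U = (X - X[j]) @ B
            K = np.exp(-0.5 * np.sum((U / h) ** 2, axis=1))
            W[:, j] = K / K.sum()
        return W

    def admm_mave(X, Y, d, rho=1.0, h=1.0, tol=1e-4, max_iter=200):
        # Schematic ADMM loop: (a, b) step, B step, C step, multiplier step.
        n, p = X.shape
        rng = np.random.default_rng(0)
        B = np.linalg.qr(rng.standard_normal((p, d)))[0]
        C, Lam = B.copy(), np.zeros((p, d))
        W = kernel_weights(B, X, h)                      # weights held fixed in this sketch
        for _ in range(max_iter):
            a, b = update_ab(B, X, Y, W)
            B = update_B(a, b, C, Lam, rho, X, Y, W)
            C = update_C(B, Lam, rho)
            Lam = Lam + rho * (B - C)
            if np.max(np.abs(B - C)) <= tol:             # stop on a small primal residual
                break
        return C, a, b

With the hypothetical generator sketched earlier, one could call X, Y, B0 = generate_data() and then B_hat, a, b = admm_mave(X, Y, d=2).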

We next derive the convergence property of the proposed ADMM algorithm.

Theorem 2. For any sufficiently large ρ, the sequence {(B^k, a^k, b^k, C^k, Λ^k)} generated by Algorithm 1 has at least one limit point. Moreover, each limit point is a stationary point of the associated augmented Lagrangian function L_ρ.

The above result is based on the ADMM theory for nonconvex problems established by Wang et al. [24]. The proof of our theorem is relegated to the Appendix.

3. Numerical Simulations

Numerical simulation studies are conducted to examine the performance of the algorithm. Matlab is used for the numerical tests. The estimation accuracy is assessed by two different measures: the norm distance and the trace correlation. The norm distance refers to the canonical distance between the projection matrices of the estimated and the true dimension reduction subspaces; that is, it is the Frobenius norm of the difference between the projection matrix onto the column space of the estimate and that onto the column space of the true parameter matrix. The performance of the ADMM algorithm is studied in the first two examples, for single index models and multiple index models, respectively. The third example explores how different penalty parameters affect estimation accuracy.
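
For reference, the two accuracy measures can be computed as in the following sketch; the square-root trace-correlation convention used here is an assumption, since conventions vary in the literature.

    import numpy as np

    def proj(B):
        # Orthogonal projection matrix onto the column space of B.
        return B @ np.linalg.solve(B.T @ B, B.T)

    def norm_distance(B_hat, B_true):
        # Frobenius distance between the two projection matrices.
        return np.linalg.norm(proj(B_hat) - proj(B_true), 'fro')

    def trace_correlation(B_hat, B_true):
        # One common convention: sqrt(trace(P_hat @ P_true) / d).
        d = B_true.shape[1]
        return np.sqrt(np.trace(proj(B_hat) @ proj(B_true)) / d)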

Example 1. In this example, two single index models are considered.
(i) The components of the predictor vector X are independently uniformly distributed on [0, 1], and we generate the data by the single index model of [35].
(ii) The components of X are independently uniformly distributed on [0, 1], and we generate the data by the single index model investigated by Carroll et al. [36].
The error ε has the standard normal distribution in both models. The true index vectors of models (18) and (19) are fixed in advance and used to assess the estimation accuracy.

Example 2. In this example, two multiple index models are studied.
(i) We generate the data by using the multiple index model of [1], with its true index matrix fixed in advance.
(ii) We generate the data by using the multiple index model of [7], with its true index matrix fixed in advance.
We consider the sample sizes n = 100, 200, and 400. In each case, the simulation is repeated 1000 times. The penalty parameter is ρ = 1. The initial values B^0 = C^0 are obtained by the OPG method of Xia et al. [7].
The simulation results are shown in Tables 1 and 2.
Tables 1 and 2 report the two measures of estimation accuracy, their corresponding standard errors (in parentheses), and the running times. We ran the numerical simulations on an Intel(R) Core(TM) i7-5600U CPU at 2.60 GHz with 8 GB of memory. The results suggest that the ADMM algorithm provides slightly more accurate estimates than the MAVE method according to both the norm distance and the trace correlation. The running times also show that the ADMM algorithm is slightly faster than the MAVE method of Xia et al. [7] in most cases.

Example 3. This example examines the performance of the ADMM algorithm for different penalty parameters.
The data are generated from model (19) and model (20), and the penalty parameter ρ is taken as 0.01, 0.5, 5, 10, 20, and 50.
For Example 3, 1000 replications are simulated with n = 200. We display the results in Table 3. For simplicity, we only present the means and standard deviations of the Frobenius norm distance and the running times. These quantities differ very little across the values ρ = 0.01, 0.5, 5, 10, 20, and 50, which reveals that the ADMM algorithm is almost insensitive to the choice of the penalty parameter. Furthermore, the runs with ρ = 10 and 20 are slightly faster than the others.

4. Concluding Remarks

Xia and his students have proposed an efficient direct algorithm to implement MAVE and contributed it to the R package "MAVE". A detailed description can be found at https://CRAN.R-project.org/package=MAVE.

In this study, we develop an ADMM algorithm to solve the MAVE-type methods and establish its convergence properties. The computational efficiency of the algorithm has been demonstrated by numerical experiments. It is noteworthy that the proposed ADMM algorithm is applicable to all the MAVE-type approaches. For example, Xia et al. [37] proposed the hMAVE method for survival data, which can be regarded as a censored version of the MAVE method of Xia et al. [7]; in this setting, our ADMM algorithm can also be used. The ADMM algorithm has been shown to be quite flexible in handling many large-scale statistical problems, see Boyd et al. [23]. Therefore, it is desirable to extend our ADMM algorithm to sufficient dimension reduction for high- or ultrahigh-dimensional data.

Appendix

We first give the Proof of Lemma 1.

Proof of Lemma 1. We use the method of Lagrange multipliers to show this result. Consider the Lagrangian function

L(C, Γ) = ||C - M||_F^2 + <Γ, C^T C - I_d>,

where the entries of the symmetric matrix Γ are the Lagrange multipliers. Setting the derivatives with respect to C and Γ to zero, we have

2(C - M) + 2CΓ = 0 and C^T C = I_d. (A.3)

It follows from (A.3) that

M = C(I_d + Γ). (A.4)

Combining (A.4) with (A.3), we obtain

M^T M = (I_d + Γ)^2, so that I_d + Γ = (M^T M)^{1/2}. (A.5)

Combining (A.4) with (A.5), the solution of (16) is

C = M(M^T M)^{-1/2}.

Furthermore, it is easy to show that C^T C = I_d, so this solution indeed lies in St(p, d).
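
The closed form obtained above can also be checked numerically against the SVD form U V^T; the following small Python check is illustrative only.

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.standard_normal((6, 3))                          # a full-column-rank 6 x 3 example

    evals, evecs = np.linalg.eigh(M.T @ M)                   # M'M is symmetric positive definite
    P1 = M @ (evecs @ np.diag(evals ** -0.5) @ evecs.T)      # M (M'M)^(-1/2)

    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    P2 = U @ Vt                                              # U V' from the thin SVD

    print(np.allclose(P1, P2), np.allclose(P1.T @ P1, np.eye(3)))   # True True
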
For the readers' convenience, we restate Assumptions A1-A5 of Wang et al. [24].
The model in the first scenario of Wang et al. [24] is the minimization of a separable objective subject to coupled linear equality constraints. The assumptions are as follows:
(A1) (Coercivity) The objective is coercive on the feasible set. For any continuous objective, A1 holds trivially when the feasible set is bounded.
(A2) (Feasibility) The image of the coefficient matrix associated with the last block contains the images of the coefficient matrices of the other blocks, where the image denotes the column space of a matrix.
(A3) (Lipschitz subminimization paths) For any fixed values of the other variables, the minimizer of each subproblem is unique and is a Lipschitz continuous function of its data.
(A4) (Restricted prox-regularity) The smooth part of the objective is differentiable and Lipschitz, the remaining part is lower semicontinuous, and it has restricted prox-regularity, as described in Definition 2 of Wang et al. [24].
(A5) The function associated with the last block is Lipschitz differentiable.

Proof of Theorem 2. Based on Theorem 2 of Wang et al. [24], we verify Assumptions A1-A5 to prove this theorem.
Since the feasible set is bounded and the objective function in (9) is continuous on it, Assumption A1 holds.
Note that, in our setting, the coefficient matrices of the linear equality constraints in (9) are identity matrices. Thus, Assumptions A2 and A3 hold.
Assumption A4 holds because the nonsmooth part of our objective, the indicator function δ of the Stiefel manifold (corresponding to the function in A4 of Wang et al. [24]), is lower semicontinuous.
Assumption A5 holds because the smooth part f (corresponding to the function in (7) of Wang et al. [24]) is continuously differentiable.

Data Availability

No underlying data were collected or produced in this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.