Abstract

The joint feature selection problem can be resolved by solving a matrix $\ell_{2,1}$-norm minimization problem. One of the most attractive features of $\ell_{2,1}$-norm regularization is that it encourages multiple predictors to share similar sparsity structures. However, the nonsmooth nature of the problem brings great computational challenges. In this paper, an alternating direction method of multipliers combined with the spectral gradient method is proposed for solving the matrix $\ell_{2,1}$-norm optimization problem arising in multitask feature learning. Numerical experiments show the effectiveness of the proposed algorithm.

1. Introduction

Because of its widespread application in high-dimensional sparse learning, the feature selection problem has attracted wide attention from the machine learning community, in particular in multitask feature learning, and has become an active research field in recent years. The purpose of multitask feature learning is to learn the information shared between related tasks so as to improve the learning performance. Learning multiple related tasks simultaneously is often much more effective than learning each task separately [1, 2]. For feature selection in multitask learning, mixed matrix norms can produce joint sparsity across the feature and task dimensions. In particular, the $\ell_{2,1}$-norm is often advantageous because it tends to produce sparser solutions. In multitask learning, Obosinski et al. [3] and Argyriou et al. [4] introduced $\ell_{2,1}$-norm regularization for the first time, and a large amount of research has been carried out on it in recent years. A very attractive feature of the $\ell_{2,1}$-norm regularized problem is that multiple predictors from different tasks are encouraged to share similar sparsity patterns [3–5]. When the loss function is convex, the $\ell_{2,1}$-norm regularized problem is convex and has a global optimal solution. However, the optimization problem is difficult to solve because of the nonsmoothness of the $\ell_{2,1}$-norm regularization. The method of Liu et al. [6] transforms the $\ell_{2,1}$-norm minimization problem into two equivalent smooth convex optimization problems and then minimizes them by Nesterov's accelerated gradient method [7]. For the $\ell_{2,1}$-norm regularized problem, a proximal alternating direction method was presented recently by Xiao et al. [8]. Hu et al. [9] proposed inexact accelerated proximal gradient algorithms for $\ell_{2,1}$-norm regularized problems.

The training set of $t$ tasks is given by $\{(a_{ij}, b_{ij})\}_{j=1}^{m_i}$, $i = 1, \ldots, t$, where, for the $i$th task, $a_{ij} \in \mathbb{R}^n$ denotes the $j$th sample, $m_i$ denotes the number of training samples, $b_{ij} \in \mathbb{R}$ denotes the corresponding response, and the total number of training samples is $m = \sum_{i=1}^{t} m_i$. The matrix $A_i = [a_{i1}, \ldots, a_{im_i}]^{\top} \in \mathbb{R}^{m_i \times n}$ is the data for the $i$th task, $b_i = (b_{i1}, \ldots, b_{im_i})^{\top} \in \mathbb{R}^{m_i}$ is the corresponding response vector, and $w_i \in \mathbb{R}^n$ is the sparse feature vector for the $i$th task. The columns of $W = [w_1, \ldots, w_t] \in \mathbb{R}^{n \times t}$ are the jointly learned features for the $t$ tasks in multitask learning. To select features globally, several rows of $W$ are encouraged to be entirely zero. According to Argyriou et al. [4], the $\ell_{2,1}$-norm minimization problem can be described as
$$\min_{W \in \mathbb{R}^{n \times t}}\ \frac{1}{2}\sum_{i=1}^{t}\|A_i w_i - b_i\|_2^2 + \mu \|W\|_{2,1}, \qquad (1)$$
in which the matrix norm $\|W\|_{2,1}$ is defined as
$$\|W\|_{2,1} = \sum_{j=1}^{n}\|w^{j}\|_2, \qquad (2)$$
where $w^{j}$ denotes the $j$th row of matrix $W$ and $w_i$ denotes the $i$th column of matrix $W$. In (1), the first term measures the loss incurred by the matrix $W$ on the training data $A_i$ and $b_i$, and the second term is the regularization term, where $\mu > 0$ is the regularization parameter which keeps a balance between the two terms being minimized.
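To make the formulation concrete, the following MATLAB sketch evaluates the objective of (1); the cell arrays A and b and the function name are illustrative assumptions rather than notation used in the paper.

% l21_objective.m: evaluate the objective of problem (1) (illustrative sketch)
% A: 1-by-t cell array, A{i} is the m_i-by-n data matrix of task i
% b: 1-by-t cell array, b{i} is the m_i-by-1 response vector of task i
% W: n-by-t weight matrix; mu: regularization parameter
function val = l21_objective(A, b, W, mu)
    t = numel(A);
    loss = 0;
    for i = 1:t
        r = A{i} * W(:, i) - b{i};        % residual of the i-th task
        loss = loss + 0.5 * (r' * r);     % least-squares loss
    end
    l21 = sum(sqrt(sum(W.^2, 2)));        % ||W||_{2,1}: sum of row 2-norms, cf. (2)
    val = loss + mu * l21;
end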

As described in [10], the alternating direction method of multipliers (ADMM) is a natural choice for large-scale distributed machine learning and big-data-related optimization because it can process the terms of the objective function separately and synchronously, and it has attracted widespread attention in the past few years. The ADMM is widely used in many fields, such as image restoration [11], machine learning [12], and compressed sensing [13]. This widespread application has sparked strong interest in further understanding the theoretical properties of the ADMM (see [14–17]).

Barzilai and Borwein [18] first proposed the spectral gradient method for solving strictly convex quadratic minimization problems. Owing to its efficiency and low computational cost, the BB method has attracted wide attention in the optimization community. Raydan [19] extended the method to general unconstrained optimization problems. More recently, the BB method has been successfully extended to nonsmooth convex optimization problems [20].

In this paper, an ADMM combined with the spectral gradient method is proposed to solve the $\ell_{2,1}$-norm regularized problem arising in multitask learning. We first introduce an auxiliary variable and form the augmented Lagrangian of (1), and then iteratively minimize the augmented Lagrangian function, where one subproblem is solved exactly and the spectral gradient method is employed to solve the other subproblem. Experimental results show that the proposed ADMM-BB method is competitive, fast, and efficient.

The rest of the paper is arranged as follows. Section 2 introduces the ADMM for solving (1). Section 3 explains how to solve the subproblems generated at each iteration and gives a practical ADMM that uses the spectral gradient algorithm. Section 4 reports numerical results on simulated and real data sets and compares the proposed method with the IADM-MFL method. Finally, Section 5 concludes the article.

2. ADMM for $\ell_{2,1}$-Norm Minimization

The matrix $\ell_{2,1}$-norm minimization problem has the following standard form:
$$\min_{W \in \mathbb{R}^{n \times t}}\ \frac{1}{2}\|\mathcal{A}(W) - b\|_2^2 + \mu\|W\|_{2,1}, \qquad (3)$$
where $\mathcal{A}: \mathbb{R}^{n \times t} \to \mathbb{R}^{m}$ is a linear mapping defined by matrix-vector multiplication for each learning task, i.e., $\mathcal{A}(W) = (A_1 w_1; A_2 w_2; \ldots; A_t w_t)$, and $b = (b_1; b_2; \ldots; b_t) \in \mathbb{R}^{m}$ is the stacked response vector. By introducing the auxiliary variable $V \in \mathbb{R}^{n \times t}$, problem (3) is equivalently transformed into a linearly constrained convex programming problem:
$$\min_{W,\,V \in \mathbb{R}^{n \times t}}\ \mu\|W\|_{2,1} + \frac{1}{2}\|\mathcal{A}(V) - b\|_2^2 \quad \text{s.t.}\quad W - V = 0. \qquad (4)$$
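The mapping $\mathcal{A}$ and its adjoint $\mathcal{A}^{*}$ (which appears later in the gradient of the smooth subproblem) can be implemented as in the following minimal MATLAB sketch; the cell-array data layout and the helper names Amap and Atmap are assumptions made for illustration.

% Amap.m: apply the linear mapping A(W) = (A_1*w_1; ...; A_t*w_t)
function y = Amap(A, W)
    t = numel(A);
    parts = cell(t, 1);
    for i = 1:t
        parts{i} = A{i} * W(:, i);        % per-task prediction A_i * w_i
    end
    y = vertcat(parts{:});                % stack task outputs into one long vector
end

% Atmap.m (separate file): apply the adjoint A*(y), returning an n-by-t matrix
function V = Atmap(A, y)
    t = numel(A);
    n = size(A{1}, 2);
    V = zeros(n, t);
    offset = 0;
    for i = 1:t
        mi = size(A{i}, 1);
        V(:, i) = A{i}' * y(offset+1 : offset+mi);   % i-th column is A_i' * y_i
        offset = offset + mi;
    end
end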

The augmented Lagrangian function of problem (4) is defined as
$$\mathcal{L}_{\beta}(W, V, \Lambda) = \mu\|W\|_{2,1} + \frac{1}{2}\|\mathcal{A}(V) - b\|_2^2 - \langle \Lambda, W - V\rangle + \frac{\beta}{2}\|W - V\|_F^2, \qquad (5)$$
where $\beta > 0$ is the penalty parameter and $\langle X, Y\rangle = \mathrm{Tr}(X^{\top}Y)$ is the standard trace inner product for $X$ and $Y$ in $\mathbb{R}^{n \times t}$; the symbol "Tr" represents the trace, i.e., the sum of the diagonal elements of a square matrix, which is also equal to the sum of its eigenvalues. For any matrix $X \in \mathbb{R}^{n \times t}$, $\|X\|_F$ is defined as the Frobenius norm:
$$\|X\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{t} X_{ij}^2}, \qquad (6)$$
where $X_{ij}$ is the $(i, j)$ element of matrix $X$, so that $\|X\|_F^2 = \mathrm{Tr}(X^{\top}X)$. To solve (4) via (5), the iterative scheme of the alternating direction method of multipliers is
$$W^{k+1} = \arg\min_{W}\ \mathcal{L}_{\beta}(W, V^{k}, \Lambda^{k}), \qquad (7)$$
$$V^{k+1} = \arg\min_{V}\ \mathcal{L}_{\beta}(W^{k+1}, V, \Lambda^{k}), \qquad (8)$$
$$\Lambda^{k+1} = \Lambda^{k} - \beta\,(W^{k+1} - V^{k+1}), \qquad (9)$$
where $\Lambda \in \mathbb{R}^{n \times t}$ is the Lagrange multiplier.
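As a sanity check on this notation, a short MATLAB sketch (reusing the hypothetical Amap helper above, with b the stacked response vector) that evaluates the augmented Lagrangian (5) could read:

% auglag.m: evaluate L_beta(W, V, Lambda) in (5) (illustrative sketch)
function val = auglag(A, b, W, V, Lambda, mu, beta)
    r = Amap(A, V) - b;                           % residual A(V) - b
    val = mu * sum(sqrt(sum(W.^2, 2))) ...        % mu * ||W||_{2,1}
        + 0.5 * (r' * r) ...                      % (1/2) * ||A(V) - b||^2
        - sum(sum(Lambda .* (W - V))) ...         % - <Lambda, W - V>
        + 0.5 * beta * norm(W - V, 'fro')^2;      % (beta/2) * ||W - V||_F^2
end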

The ADMM for solving problem (4) can be expressed as follows.

Algorithm 1. ADMM for the $\ell_{2,1}$-norm minimization problem.
Step 1: find $W^{k+1}$ via
$$W^{k+1} = \arg\min_{W}\ \mu\|W\|_{2,1} - \langle \Lambda^{k}, W\rangle + \frac{\beta}{2}\|W - V^{k}\|_F^2, \qquad (10)$$
that is, via the optimality condition
$$0 \in \mu\,\partial\|W^{k+1}\|_{2,1} - \Lambda^{k} + \beta\,(W^{k+1} - V^{k}), \qquad (11)$$
where $\partial\|\cdot\|_{2,1}$ represents the subgradient operator of the convex function $\|\cdot\|_{2,1}$.
Step 2: solve $V^{k+1}$ via
$$V^{k+1} = \arg\min_{V}\ \frac{1}{2}\|\mathcal{A}(V) - b\|_2^2 + \langle \Lambda^{k}, V\rangle + \frac{\beta}{2}\|W^{k+1} - V\|_F^2. \qquad (12)$$
Step 3: compute the multiplier $\Lambda^{k+1}$ by (9).

The following result shows that the optimal solution set of the $\ell_{2,1}$-norm matrix minimization problem (3) is bounded (see [9]).

Lemma 1. For each $\mu > 0$, the optimal solution set of (3) is bounded, and for any optimal solution $W^{*}$, we have
$$\|W^{*}\|_{2,1} \le \frac{\|b\|_2^2}{2\mu}. \qquad (13)$$

The global convergence of Algorithm 1 follows directly from the results developed by Bertsekas and Tsitsiklis ([21], Chapter 3, p. 256) for general convex programming problems.

Theorem 1. Let $\{(W^{k}, V^{k}, \Lambda^{k})\}$ be the sequence generated by Algorithm 1 with $\beta > 0$. Then, $\{(W^{k}, V^{k})\}$ is bounded, and every limit point of $\{(W^{k}, V^{k})\}$ is an optimal solution of the equivalent problem (4).

3. ADMM-BB Method for $\ell_{2,1}$-Norm Minimization

Section 2 gives the theoretical alternating direction method of multipliers for the $\ell_{2,1}$-norm minimization problem. However, a key issue has not yet been resolved: how can subproblems (7) and (8) be solved efficiently? This issue is fundamental, because if each subproblem is difficult to solve, the overall method is of little practical use. In this paper, an exact method is used to solve (7), and the spectral gradient method is employed to solve (8).

Given $V^{k}$ and $\Lambda^{k}$, we have
$$W^{k+1} = \arg\min_{W}\ \mu\|W\|_{2,1} + \frac{\beta}{2}\left\|W - \left(V^{k} + \frac{1}{\beta}\Lambda^{k}\right)\right\|_F^2. \qquad (14)$$

Let $C^{k} = V^{k} + \frac{1}{\beta}\Lambda^{k}$. Equation (14) can then be written in the row-separable form
$$W^{k+1} = \arg\min_{W}\ \sum_{j=1}^{n}\left(\mu\|w^{j}\|_2 + \frac{\beta}{2}\|w^{j} - (C^{k})^{j}\|_2^2\right), \qquad (15)$$
which indicates that problem (15) can be broken down into $n$ independent $t$-dimensional subproblems:
$$\min_{w^{j} \in \mathbb{R}^{t}}\ \mu\|w^{j}\|_2 + \frac{\beta}{2}\|w^{j} - (C^{k})^{j}\|_2^2, \quad j = 1, \ldots, n, \qquad (16)$$
where $(C^{k})^{j}$ denotes the $j$th row of $C^{k}$.

Clearly, the optimal solution of (16) lies in the direction of $(C^{k})^{j}$ and has the form $w^{j} = \alpha\,(C^{k})^{j}$, in which $\alpha \ge 0$ is a scalar parameter. Based on a Lagrangian dual formulation, subproblem (16) has a closed-form solution (see, e.g., [22, 23]), which can be explicitly expressed as
$$(w^{j})^{k+1} = \max\left\{1 - \frac{\mu}{\beta\,\|(C^{k})^{j}\|_2},\, 0\right\}(C^{k})^{j}, \qquad (17)$$
where the right-hand side is understood to be zero when $(C^{k})^{j} = 0$. Therefore, the closed-form solution of (10) is given as follows:
$$(W^{k+1})^{j} = \max\left\{1 - \frac{\mu}{\beta\,\|(V^{k} + \Lambda^{k}/\beta)^{j}\|_2},\, 0\right\}\left(V^{k} + \Lambda^{k}/\beta\right)^{j}, \quad j = 1, \ldots, n. \qquad (18)$$
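The row-wise shrinkage (17)-(18) takes only a few vectorized MATLAB lines; in the sketch below, the input C stands for $V^{k} + \Lambda^{k}/\beta$, tau stands for $\mu/\beta$, and the function name is an assumption.

% l21_shrink.m: row-wise soft thresholding, i.e., the closed-form W-update (18)
function W = l21_shrink(C, tau)
    rownorm = sqrt(sum(C.^2, 2));                  % 2-norm of each row of C
    scale = max(0, 1 - tau ./ max(rownorm, eps));  % per-row shrink factor, cf. (17)
    W = scale .* C;                                % scale each row (implicit expansion)
end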

Next, we analyze the other subproblem (8). For fixed $W^{k+1}$ and $\Lambda^{k}$, let
$$f(V) = \frac{1}{2}\|\mathcal{A}(V) - b\|_2^2 + \langle \Lambda^{k}, V\rangle + \frac{\beta}{2}\|W^{k+1} - V\|_F^2. \qquad (19)$$

Now, we investigate how to use the spectral gradient method to solve the corresponding problem:
$$\min_{V \in \mathbb{R}^{n \times t}}\ f(V). \qquad (20)$$

The function $f$ is convex and everywhere differentiable with
$$\nabla f(V) = \mathcal{A}^{*}(\mathcal{A}(V) - b) + \Lambda^{k} - \beta\,(W^{k+1} - V), \qquad (21)$$
where $\mathcal{A}^{*}$ is the adjoint of $\mathcal{A}$, i.e., $\mathcal{A}^{*}(y) = [A_1^{\top}y_1, \ldots, A_t^{\top}y_t]$ for $y = (y_1; \ldots; y_t)$ with $y_i \in \mathbb{R}^{m_i}$.

In order to distinguish these iterations from the superscript $k$ used in Algorithm 1, we use subscripts for the iterates of this subproblem. The spectral gradient method is defined by
$$V_{j+1} = V_{j} - \alpha_{j}\,\nabla f(V_{j}), \qquad (22)$$
where the step size $\alpha_{j}$ is given by
$$\alpha_{j} = \frac{\langle S_{j-1}, S_{j-1}\rangle}{\langle S_{j-1}, Y_{j-1}\rangle}, \qquad (23)$$
where $S_{j-1} = V_{j} - V_{j-1}$ and $Y_{j-1} = \nabla f(V_{j}) - \nabla f(V_{j-1})$.
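In matrix form, the gradient (21) can be evaluated with the hypothetical Amap/Atmap helpers introduced earlier; the following one-line sketch is illustrative and assumes b is the stacked response vector.

% gradf.m: gradient of f(V) in (21)
function G = gradf(A, b, V, W, Lambda, beta)
    G = Atmap(A, Amap(A, V) - b) + Lambda - beta * (W - V);
end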

Now, the spectral gradient method for (20) can be described as given in Algorithm 2.

Algorithm 2. The spectral gradient method.
Step 0: given $V_0 \in \mathbb{R}^{n \times t}$, $\alpha_0 > 0$, a tolerance $\epsilon_1 > 0$, and set $j := 0$.
Step 1: termination criterion: stop if $V_{j}$ satisfies the termination condition $\|\nabla f(V_{j})\|_F \le \epsilon_1$. Otherwise, go to the next step.
Step 2: compute $\alpha_{j}$ by (23) if $j \ge 1$. Let $V_{j+1} = V_{j} - \alpha_{j}\,\nabla f(V_{j})$.
Step 3: let $j := j + 1$ and go to Step 1.

Finally, by adopting a relaxation factor $\gamma$, the multiplier update formula in Algorithm 1 is replaced by
$$\Lambda^{k+1} = \Lambda^{k} - \gamma\beta\,(W^{k+1} - V^{k+1}). \qquad (24)$$
Glowinski [24] first suggested the restriction $\gamma \in \left(0, \frac{1 + \sqrt{5}}{2}\right)$, and this choice has shown better performance in numerical experiments [25].
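A direct MATLAB transcription of Algorithm 2 might look as follows; the initial step size, the positivity safeguard on the BB quotient, and the iteration cap are illustrative choices rather than values prescribed above.

% spectral_gradient.m: solve the subproblem (20) by Algorithm 2 (sketch)
function V = spectral_gradient(A, b, W, Lambda, beta, V0, tol, maxit)
    V = V0;
    G = gradf(A, b, V, W, Lambda, beta);
    alpha = 1;                                   % initial step size alpha_0 (assumed)
    for j = 0:maxit
        if norm(G, 'fro') <= tol                 % termination criterion (Step 1)
            break;
        end
        Vnew = V - alpha * G;                    % gradient step (22)
        Gnew = gradf(A, b, Vnew, W, Lambda, beta);
        S = Vnew - V;
        Y = Gnew - G;
        sy = S(:)' * Y(:);
        if sy > 0
            alpha = (S(:)' * S(:)) / sy;         % BB step size (23) for the next step
        end
        V = Vnew;
        G = Gnew;
    end
end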
Now, a practical ready-to-implement version of the ADMM (7)–(9) can be described as follows.

Algorithm 3. ADMM-BB for the $\ell_{2,1}$-norm minimization problem.
Step 0: let $\beta > 0$ and $\gamma \in \left(0, \frac{1 + \sqrt{5}}{2}\right)$ be given. Let $V^{0} \in \mathbb{R}^{n \times t}$ be arbitrary. Let $\Lambda^{0} \in \mathbb{R}^{n \times t}$ be the initial estimate of the Lagrange multipliers. Let $k := 0$.
Step 1: when the stopping criterion holds, stop; otherwise, continue.
Step 2: compute $W^{k+1}$ by (18).
Step 3: compute $V^{k+1}$ by solving the following problem with the spectral gradient method:
$$\min_{V \in \mathbb{R}^{n \times t}}\ \frac{1}{2}\|\mathcal{A}(V) - b\|_2^2 + \langle \Lambda^{k}, V\rangle + \frac{\beta}{2}\|W^{k+1} - V\|_F^2. \qquad (25)$$
Step 4: compute $\Lambda^{k+1}$ by (24).
Step 5: let $k := k + 1$ and go to Step 1.

Based on the conclusions of Bertsekas and Tsitsiklis ([21], Chapter 3, Proposition 4.2) and Glowinski ([24], Chapter VI, Theorem 5.1), the following convergence result holds for Algorithm 3.
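Putting the pieces together, Algorithm 3 can be sketched as the following MATLAB loop. It reuses the hypothetical helpers l21_shrink and spectral_gradient and the cell-array data layout assumed above; the inner tolerance, the inner iteration cap, the zero initialization, and the particular stopping rule are placeholder choices, not values prescribed by the paper.

% admm_bb.m: ADMM-BB for problem (3) (illustrative sketch of Algorithm 3)
function W = admm_bb(A, b, mu, beta, gamma, epsilon, maxit)
    n = size(A{1}, 2);
    t = numel(A);
    V = zeros(n, t);  Lambda = zeros(n, t);  W = zeros(n, t);
    for k = 0:maxit
        Wold = W;  Vold = V;
        W = l21_shrink(V + Lambda / beta, mu / beta);               % Step 2: update by (18)
        V = spectral_gradient(A, b, W, Lambda, beta, V, 1e-6, 100); % Step 3: solve (25)
        Lambda = Lambda - gamma * beta * (W - V);                   % Step 4: update by (24)
        change = max(norm(W - Wold, 'fro'), norm(V - Vold, 'fro'));
        if k > 0 && change <= epsilon                               % stopping rule (placeholder variant of (27))
            break;
        end
    end
end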

Theorem 2. Suppose that the Lagrangian function of problem (4) has a saddle point $(W^{*}, V^{*}, \Lambda^{*})$. Let $\{(W^{k}, V^{k}, \Lambda^{k})\}$ be the sequence generated by Algorithm 3 with $\beta > 0$ and $\gamma \in \left(0, \frac{1 + \sqrt{5}}{2}\right)$. Then,
$$\lim_{k \to \infty}\|W^{k+1} - V^{k+1}\|_F = 0. \qquad (26)$$

Moreover, if $(\bar{W}, \bar{V}, \bar{\Lambda})$ is a weak cluster point of $\{(W^{k}, V^{k}, \Lambda^{k})\}$, then $(\bar{W}, \bar{V}, \bar{\Lambda})$ is a saddle point of the Lagrangian function of problem (4).

4. Experiments

In this section, we report numerical results of ADMM-BB for solving the matrix $\ell_{2,1}$-norm minimization problem (3). The experiments are carried out in MATLAB R2018b running on a computer with a 2.8 GHz Intel Pentium CPU and 8 GB of low-voltage memory.

Based on simulated data and real data, we conducted two types of numerical experiments to study the performance of the ADMM-BB method. In the tests, we compared the ADMM-BB method with the IADM-MFL method [8], because IADM-MFL is well known and provides a feasible way to solve the joint feature selection problem in multitask learning. For each test problem, starting from the origin, the algorithm stops when the distance between adjacent iteration points is less than a given constant $\epsilon$, i.e.,
$$\|W^{k+1} - W^{k}\|_F < \epsilon. \qquad (27)$$

We choose a fixed value of $\epsilon$ for the following series of experiments.

Example 1. As in [4], the simulated data sets are created by using a 5-dimensional zero-mean Gaussian distribution with a covariance matrix equal to diag{1, 0.64, 0.49, 0.36, 0.25}, denoted by $N(0, \Sigma)$. Each task weight vector $w_i$ drawn in this way is then expanded with 20 irrelevant dimensions by adding zero elements. The training data $A_i$ are random Gaussian matrices generated by the MATLAB command randn. Using $A_i$ and $w_i$, the outputs are obtained as
$$b_i = A_i w_i + \xi_i, \qquad (28)$$
where the Gaussian noise $\xi_i$ has mean 0 and standard deviation 1.e−2. For each method, $\hat{W}$ denotes the computed solution of the matrix $\ell_{2,1}$-norm minimization problem (3). To measure the quality of $\hat{W}$ with respect to the original $W$, we define the relative error as follows:
$$\mathrm{RelErr} = \frac{\|\hat{W} - W\|_F}{\|W\|_F}. \qquad (29)$$
We analyze the performance of both methods for different numbers of dimensions and tasks, since these factors strongly affect the performance of each algorithm. The numerical results are shown in Table 1, which reports the CPU time required in seconds (TIME), the total number of iterations (ITER), the total number of tasks (t), the dimension of the test data (n), and the dimension of the outputs (m).
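A possible MATLAB sketch of the data generation in Example 1 and of the relative error (29) is given below; the function name, the argument names, and the total dimension of 25 (5 relevant plus 20 zero-padded entries) are our illustrative reading of the setup, not code from the paper.

% make_example1.m: simulated multitask data as in Example 1 (sketch)
% t: number of tasks, mi: number of training samples per task
function [A, b, W] = make_example1(t, mi)
    Sigma = diag([1, 0.64, 0.49, 0.36, 0.25]);        % covariance of the relevant features
    n = 25;                                            % 5 relevant + 20 irrelevant dimensions
    W = zeros(n, t);
    A = cell(1, t);
    b = cell(1, t);
    for i = 1:t
        W(1:5, i) = sqrtm(Sigma) * randn(5, 1);        % zero-mean Gaussian relevant weights
        A{i} = randn(mi, n);                           % Gaussian training data
        b{i} = A{i} * W(:, i) + 1e-2 * randn(mi, 1);   % outputs with noise, cf. (28)
    end
end

% relative error (29) of a computed solution What with respect to the true W:
% relerr = norm(What - W, 'fro') / norm(W, 'fro');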

From the results in Table 1, it can be seen that although both methods terminate successfully, the number of iterations and the CPU time of the ADMM-BB method are much smaller than those of IADM-MFL.

Then, the parameters involved in the methods are fixed. For each solving algorithm, we evaluate the objective function values and the test error rate, and the convergence behavior of the algorithms is shown in Figure 1. To visually compare the convergence speed of the algorithms, the four subgraphs in Figure 1 show how the function value and the relative error change with the number of iterations and with the CPU time for both the ADMM-BB and IADM-MFL algorithms. The first row of Figure 1 plots the relative error and the objective function values against the number of iterations, and the second row plots them against the computational time. Figure 1 shows that although both the ADMM-BB and IADM-MFL algorithms generate decreasing sequences and converge to the same function value and relative error, ADMM-BB performs better than IADM-MFL in terms of the number of iterations and the CPU time.

Example 2. In this test, we demonstrate the performance of the proposed algorithm on a real data set: a text categorization data set in which each of the 10 tasks corresponds to a subcategory of the Arts category. The data set can be downloaded from http://www.dmoz.org/. In order to learn the features shared between tasks, we randomly select data from each task for training, sampling 20%, 30%, 40%, 50%, 60%, and 70% of the data set, respectively, and then test the two methods at the same time. Except for the parameter adjusted to this data set, the other parameters are the same as in the preceding example for both the ADMM-BB and IADM-MFL methods. The corresponding numerical results are summarized in Table 2.

From Table 2, we can see that ADMM-BB is an effective method and works better on these problems.

5. Conclusion

The convergence theory of the alternating direction method of multipliers for convex optimization problems was well established by Bertsekas and Tsitsiklis [21] and Glowinski [24]. The main purpose of this paper is to demonstrate that this method is robust for the matrix $\ell_{2,1}$-norm regularized minimization problem. The key ingredient is the practical efficiency obtained by combining the alternating direction method of multipliers with the spectral gradient method. The corresponding numerical results verify the encouraging efficiency of the proposed method in solving the joint feature selection problem.

Data Availability

The data used to support the findings of this study are available in tables in this paper and can also be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Scientific Research Project of Tianjin Education Commission (no. 2019KJ232).