Abstract

The joint feature selection problem can be resolved by solving a matrix $\ell_{2,1}$-norm minimization problem. One of the most attractive features of $\ell_{2,1}$-norm regularization is that it encourages multiple predictors to share similar sparsity structures. However, the nonsmooth nature of the problem brings great computational challenges. In this paper, an alternating direction method of multipliers combined with the spectral gradient method is proposed for solving the matrix $\ell_{2,1}$-norm optimization problem arising in multitask feature learning. Numerical experiments show the effectiveness of the proposed algorithm.

1. Introduction

Because of its widespread application in high-dimensional sparse learning, the feature selection problem has attracted wide attention from the machine learning community, in particular in multitask feature learning, and has become an active research field in recent years. The purpose of multitask feature learning is to learn the information shared between related tasks so as to improve the learning performance. Learning multiple related tasks simultaneously is often much more effective than learning each task separately [1, 2]. For feature selection in multitask learning, mixed matrix norms can produce joint sparsity across the feature and task dimensions. In particular, the $\ell_{2,1}$-norm is often advantageous because it tends to produce sparser solutions. In multitask learning, Obosinski et al. [3] and Argyriou et al. [4] introduced $\ell_{2,1}$-norm regularization for the first time, and a large amount of research has been carried out on it in recent years. A very attractive feature of the $\ell_{2,1}$-norm regularized problem is that multiple predictors from different tasks are encouraged to share similar sparsity patterns [3–5]. When the loss function is convex, the $\ell_{2,1}$-norm regularized problem is convex and has a global optimal solution. However, the optimization problem is difficult to solve because of the nonsmoothness of the $\ell_{2,1}$-norm regularization. The method of Liu et al. [6] transforms the $\ell_{2,1}$-norm minimization problem into two equivalent smooth convex optimization problems and then minimizes them by Nesterov's accelerated gradient method [7]. For the $\ell_{2,1}$-norm regularized problem, a proximal alternating direction method was presented recently by Xiao et al. [8]. Hu et al. [9] proposed inexact accelerated proximal gradient algorithms for $\ell_{2,1}$-norm regularized problems.

The training set of $t$ tasks is given by $\{(a_{ij}, b_{ij})\}_{j=1}^{m_i}$, $i = 1, \ldots, t$, where, for the $i$th task, $a_{ij} \in \mathbb{R}^n$ denotes the $j$th sample, $m_i$ denotes the number of training samples, $b_{ij} \in \mathbb{R}$ denotes the corresponding response, and the total number of training samples is $m = \sum_{i=1}^{t} m_i$. The matrix $A_i = [a_{i1}, \ldots, a_{im_i}]^{\top} \in \mathbb{R}^{m_i \times n}$ is the data for the $i$th task, $b_i = (b_{i1}, \ldots, b_{im_i})^{\top} \in \mathbb{R}^{m_i}$ is the corresponding response vector, and $w_i \in \mathbb{R}^n$ is the sparse feature vector for the $i$th task. The columns of $W = [w_1, \ldots, w_t] \in \mathbb{R}^{n \times t}$ are the jointly learned features for the $t$ tasks in multitask learning. To select features globally, several rows of $W$ are encouraged to be entirely zero. According to Argyriou et al. [4], the $\ell_{2,1}$-norm minimization problem can be described as
$$\min_{W \in \mathbb{R}^{n \times t}}\ \frac{1}{2}\sum_{i=1}^{t}\|A_i w_i - b_i\|_2^2 + \mu \|W\|_{2,1}, \qquad (1)$$
in which the matrix norm $\|W\|_{2,1}$ is defined as
$$\|W\|_{2,1} = \sum_{j=1}^{n}\|w^{j}\|_2, \qquad (2)$$
where $w^{j}$ denotes the $j$th row of matrix $W$ and $w_i$ denotes the $i$th column of matrix $W$. In (1), the first term measures the loss incurred by the matrix $W$ on the training data $A_i$ and $b_i$, and the second term is the regularization term, where $\mu > 0$ is the regularization parameter which keeps a balance between the two terms being minimized.
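To make the formulation concrete, the following MATLAB sketch evaluates the objective of (1); the cell arrays A and b and the function name are illustrative assumptions rather than notation used in the paper.

% l21_objective.m: evaluate the objective of problem (1) (illustrative sketch)
% A: 1-by-t cell array, A{i} is the m_i-by-n data matrix of task i
% b: 1-by-t cell array, b{i} is the m_i-by-1 response vector of task i
% W: n-by-t weight matrix; mu: regularization parameter
function val = l21_objective(A, b, W, mu)
    t = numel(A);
    loss = 0;
    for i = 1:t
        r = A{i} * W(:, i) - b{i};        % residual of the i-th task
        loss = loss + 0.5 * (r' * r);     % least-squares loss
    end
    l21 = sum(sqrt(sum(W.^2, 2)));        % ||W||_{2,1}: sum of row 2-norms, cf. (2)
    val = loss + mu * l21;
end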

As described in [10], the alternating direction method of multipliers (ADMM) is a natural choice for large-scale distributed machine learning and big-data-related optimization because it can process the terms of the objective function separately and synchronously, and it has attracted widespread attention in the past few years. The ADMM is widely used in many fields, such as image restoration [11], machine learning [12], and compressed sensing [13]. This widespread application has sparked strong interest in further understanding the theoretical properties of the ADMM (see [14–17]).

Barzilai and Borwein [18] first proposed the spectral gradient method for solving strictly convex quadratic minimization problems. Owing to its efficiency and low computational cost, the BB method has attracted wide attention in the optimization community. Raydan [19] extended the method to general unconstrained optimization problems. More recently, the BB method has been successfully extended to nonsmooth convex optimization problems [20].

In this paper, an ADMM combined with the spectral gradient method is proposed to solve the $\ell_{2,1}$-norm regularized problem arising in multitask learning. We first introduce an auxiliary variable and form the augmented Lagrangian of (1), and then iteratively minimize the augmented Lagrangian function, where one subproblem is solved exactly and the spectral gradient method is employed to solve the other subproblem. Experimental results show that the proposed ADMM-BB method is competitive, fast, and efficient.

The rest of the paper is arranged as follows. Section 2 introduces the ADMM for solving (1). Section 3 explains how to solve the subproblems generated at each iteration and gives a practical ADMM that uses the spectral gradient algorithm. Section 4 reports numerical results on simulated and real data sets and compares the proposed method with the IADM-MFL method. Finally, Section 5 concludes the article.

2. ADMM for $\ell_{2,1}$-Norm Minimization

The matrix $\ell_{2,1}$-norm minimization problem has the following standard form:
$$\min_{W \in \mathbb{R}^{n \times t}}\ \frac{1}{2}\|\mathcal{A}(W) - b\|_2^2 + \mu\|W\|_{2,1}, \qquad (3)$$
where $\mathcal{A}: \mathbb{R}^{n \times t} \to \mathbb{R}^{m}$ is a linear mapping defined by matrix-vector multiplication for each learning task, i.e., $\mathcal{A}(W) = (A_1 w_1; A_2 w_2; \ldots; A_t w_t)$, and $b = (b_1; b_2; \ldots; b_t) \in \mathbb{R}^{m}$ is the stacked response vector. By introducing the auxiliary variable $V \in \mathbb{R}^{n \times t}$, problem (3) is equivalently transformed into a linearly constrained convex programming problem:
$$\min_{W,\,V \in \mathbb{R}^{n \times t}}\ \mu\|W\|_{2,1} + \frac{1}{2}\|\mathcal{A}(V) - b\|_2^2 \quad \text{s.t.}\quad W - V = 0. \qquad (4)$$
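The mapping $\mathcal{A}$ and its adjoint $\mathcal{A}^{*}$ (which appears later in the gradient of the smooth subproblem) can be implemented as in the following minimal MATLAB sketch; the cell-array data layout and the helper names Amap and Atmap are assumptions made for illustration.

% Amap.m: apply the linear mapping A(W) = (A_1*w_1; ...; A_t*w_t)
function y = Amap(A, W)
    t = numel(A);
    parts = cell(t, 1);
    for i = 1:t
        parts{i} = A{i} * W(:, i);        % per-task prediction A_i * w_i
    end
    y = vertcat(parts{:});                % stack task outputs into one long vector
end

% Atmap.m (separate file): apply the adjoint A*(y), returning an n-by-t matrix
function V = Atmap(A, y)
    t = numel(A);
    n = size(A{1}, 2);
    V = zeros(n, t);
    offset = 0;
    for i = 1:t
        mi = size(A{i}, 1);
        V(:, i) = A{i}' * y(offset+1 : offset+mi);   % i-th column is A_i' * y_i
        offset = offset + mi;
    end
end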

The augmented Lagrangian function of problem (4) is defined as
$$\mathcal{L}_{\beta}(W, V, \Lambda) = \mu\|W\|_{2,1} + \frac{1}{2}\|\mathcal{A}(V) - b\|_2^2 - \langle \Lambda, W - V\rangle + \frac{\beta}{2}\|W - V\|_F^2, \qquad (5)$$
where $\beta > 0$ is the penalty parameter and $\langle X, Y\rangle = \mathrm{Tr}(X^{\top}Y)$ is the standard trace inner product for $X$ and $Y$ in $\mathbb{R}^{n \times t}$; the symbol "Tr" represents the trace, i.e., the sum of the diagonal elements of a square matrix, which is also equal to the sum of its eigenvalues. For any matrix $X \in \mathbb{R}^{n \times t}$, $\|X\|_F$ is defined as the Frobenius norm:
$$\|X\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{t} X_{ij}^2}, \qquad (6)$$
where $X_{ij}$ is the $(i, j)$ element of matrix $X$, so that $\|X\|_F^2 = \mathrm{Tr}(X^{\top}X)$. To solve (4) via (5), the iterative scheme of the alternating direction method of multipliers is
$$W^{k+1} = \arg\min_{W}\ \mathcal{L}_{\beta}(W, V^{k}, \Lambda^{k}), \qquad (7)$$
$$V^{k+1} = \arg\min_{V}\ \mathcal{L}_{\beta}(W^{k+1}, V, \Lambda^{k}), \qquad (8)$$
$$\Lambda^{k+1} = \Lambda^{k} - \beta\,(W^{k+1} - V^{k+1}), \qquad (9)$$
where $\Lambda \in \mathbb{R}^{n \times t}$ is the Lagrange multiplier.
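As a sanity check on this notation, a short MATLAB sketch (reusing the hypothetical Amap helper above, with b the stacked response vector) that evaluates the augmented Lagrangian (5) could read:

% auglag.m: evaluate L_beta(W, V, Lambda) in (5) (illustrative sketch)
function val = auglag(A, b, W, V, Lambda, mu, beta)
    r = Amap(A, V) - b;                           % residual A(V) - b
    val = mu * sum(sqrt(sum(W.^2, 2))) ...        % mu * ||W||_{2,1}
        + 0.5 * (r' * r) ...                      % (1/2) * ||A(V) - b||^2
        - sum(sum(Lambda .* (W - V))) ...         % - <Lambda, W - V>
        + 0.5 * beta * norm(W - V, 'fro')^2;      % (beta/2) * ||W - V||_F^2
end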

The ADMM for solving problem (4) can be expressed as follows.

Algorithm 1. ADMM for the $\ell_{2,1}$-norm minimization problem.
Step 1: find $W^{k+1}$ via
$$W^{k+1} = \arg\min_{W}\ \mu\|W\|_{2,1} - \langle \Lambda^{k}, W\rangle + \frac{\beta}{2}\|W - V^{k}\|_F^2, \qquad (10)$$
that is, via the optimality condition
$$0 \in \mu\,\partial\|W^{k+1}\|_{2,1} - \Lambda^{k} + \beta\,(W^{k+1} - V^{k}), \qquad (11)$$
where $\partial\|\cdot\|_{2,1}$ represents the subgradient operator of the convex function $\|\cdot\|_{2,1}$.
Step 2: solve $V^{k+1}$ via
$$V^{k+1} = \arg\min_{V}\ \frac{1}{2}\|\mathcal{A}(V) - b\|_2^2 + \langle \Lambda^{k}, V\rangle + \frac{\beta}{2}\|W^{k+1} - V\|_F^2. \qquad (12)$$
Step 3: compute the multiplier $\Lambda^{k+1}$ by (9).

The following result shows that the optimal solution set of the $\ell_{2,1}$-norm matrix minimization problem (3) is bounded (see [9]).

Lemma 1. For each $\mu > 0$, the optimal solution set of (3) is bounded, and for any optimal solution $W^{*}$, we have
$$\|W^{*}\|_{2,1} \le \frac{\|b\|_2^2}{2\mu}. \qquad (13)$$

The global convergence of Algorithm 1 follows directly from the results developed by Bertsekas and Tsitsiklis ([21], Chapter 3, p. 256) for general convex programming problems.

Theorem 1. Let $\{(W^{k}, V^{k}, \Lambda^{k})\}$ be the sequence generated by Algorithm 1 with $\beta > 0$. Then, $\{(W^{k}, V^{k})\}$ is bounded, and every limit point of $\{(W^{k}, V^{k})\}$ is an optimal solution of the equivalent problem (4).

3. ADMM-BB Method for $\ell_{2,1}$-Norm Minimization

Section 2 gives the theoretical alternating direction method of multipliers for the $\ell_{2,1}$-norm minimization problem. However, a key issue has not yet been resolved: how can subproblems (7) and (8) be solved efficiently? This issue is fundamental, because if each subproblem is difficult to solve, the overall method is of little practical use. In this paper, an exact method is used to solve (7), and the spectral gradient method is employed to solve (8).

Given $V^{k}$ and $\Lambda^{k}$, we have
$$W^{k+1} = \arg\min_{W}\ \mu\|W\|_{2,1} + \frac{\beta}{2}\left\|W - \left(V^{k} + \frac{1}{\beta}\Lambda^{k}\right)\right\|_F^2. \qquad (14)$$

Let $C^{k} = V^{k} + \frac{1}{\beta}\Lambda^{k}$. Equation (14) can then be written in the row-separable form
$$W^{k+1} = \arg\min_{W}\ \sum_{j=1}^{n}\left(\mu\|w^{j}\|_2 + \frac{\beta}{2}\|w^{j} - (C^{k})^{j}\|_2^2\right), \qquad (15)$$
which indicates that problem (15) can be broken down into $n$ independent $t$-dimensional subproblems:
$$\min_{w^{j} \in \mathbb{R}^{t}}\ \mu\|w^{j}\|_2 + \frac{\beta}{2}\|w^{j} - (C^{k})^{j}\|_2^2, \quad j = 1, \ldots, n, \qquad (16)$$
where $(C^{k})^{j}$ denotes the $j$th row of $C^{k}$.

Clearly, the optimal solution of (16) lies in the direction of $(C^{k})^{j}$ and has the form $w^{j} = \alpha\,(C^{k})^{j}$, in which $\alpha \ge 0$ is a scalar parameter. Based on a Lagrangian dual formulation, subproblem (16) has a closed-form solution (see, e.g., [22, 23]), which can be explicitly expressed as
$$(w^{j})^{k+1} = \max\left\{1 - \frac{\mu}{\beta\,\|(C^{k})^{j}\|_2},\, 0\right\}(C^{k})^{j}, \qquad (17)$$
where the right-hand side is understood to be zero when $(C^{k})^{j} = 0$. Therefore, the closed-form solution of (10) is given as follows:
$$(W^{k+1})^{j} = \max\left\{1 - \frac{\mu}{\beta\,\|(V^{k} + \Lambda^{k}/\beta)^{j}\|_2},\, 0\right\}\left(V^{k} + \Lambda^{k}/\beta\right)^{j}, \quad j = 1, \ldots, n. \qquad (18)$$
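The row-wise shrinkage (17)-(18) takes only a few vectorized MATLAB lines; in the sketch below, the input C stands for $V^{k} + \Lambda^{k}/\beta$, tau stands for $\mu/\beta$, and the function name is an assumption.

% l21_shrink.m: row-wise soft thresholding, i.e., the closed-form W-update (18)
function W = l21_shrink(C, tau)
    rownorm = sqrt(sum(C.^2, 2));                  % 2-norm of each row of C
    scale = max(0, 1 - tau ./ max(rownorm, eps));  % per-row shrink factor, cf. (17)
    W = scale .* C;                                % scale each row (implicit expansion)
end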

Next, we analyze the other subproblem (8). For fixed $W^{k+1}$ and $\Lambda^{k}$, let
$$f(V) = \frac{1}{2}\|\mathcal{A}(V) - b\|_2^2 + \langle \Lambda^{k}, V\rangle + \frac{\beta}{2}\|W^{k+1} - V\|_F^2. \qquad (19)$$

Now, we investigate how to use the spectral gradient method to solve the corresponding problem:
$$\min_{V \in \mathbb{R}^{n \times t}}\ f(V). \qquad (20)$$

The function $f$ is convex and everywhere differentiable with
$$\nabla f(V) = \mathcal{A}^{*}(\mathcal{A}(V) - b) + \Lambda^{k} - \beta\,(W^{k+1} - V), \qquad (21)$$
where $\mathcal{A}^{*}$ is the adjoint of $\mathcal{A}$, i.e., $\mathcal{A}^{*}(y) = [A_1^{\top}y_1, \ldots, A_t^{\top}y_t]$ for $y = (y_1; \ldots; y_t)$ with $y_i \in \mathbb{R}^{m_i}$.

In order to distinguish these iterations from the superscript $k$ used in Algorithm 1, we use subscripts for the iterates of this subproblem. The spectral gradient method is defined by
$$V_{j+1} = V_{j} - \alpha_{j}\,\nabla f(V_{j}), \qquad (22)$$
where the step size $\alpha_{j}$ is given by
$$\alpha_{j} = \frac{\langle S_{j-1}, S_{j-1}\rangle}{\langle S_{j-1}, Y_{j-1}\rangle}, \qquad (23)$$
where $S_{j-1} = V_{j} - V_{j-1}$ and $Y_{j-1} = \nabla f(V_{j}) - \nabla f(V_{j-1})$.
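In matrix form, the gradient (21) can be evaluated with the hypothetical Amap/Atmap helpers introduced earlier; the following one-line sketch is illustrative and assumes b is the stacked response vector.

% gradf.m: gradient of f(V) in (21)
function G = gradf(A, b, V, W, Lambda, beta)
    G = Atmap(A, Amap(A, V) - b) + Lambda - beta * (W - V);
end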

Now, the spectral gradient method for (20) can be described as given in Algorithm 2.

Algorithm 2. The spectral gradient method.
Step 0: given $V_0 \in \mathbb{R}^{n \times t}$, $\alpha_0 > 0$, a tolerance $\epsilon_1 > 0$, and set $j := 0$.
Step 1: termination criterion: stop if $V_{j}$ satisfies the termination condition $\|\nabla f(V_{j})\|_F \le \epsilon_1$. Otherwise, go to the next step.
Step 2: compute $\alpha_{j}$ by (23) if $j \ge 1$. Let $V_{j+1} = V_{j} - \alpha_{j}\,\nabla f(V_{j})$.
Step 3: let $j := j + 1$ and go to Step 1.

Finally, by adopting a relaxation factor $\gamma$, the multiplier update formula in Algorithm 1 is replaced by
$$\Lambda^{k+1} = \Lambda^{k} - \gamma\beta\,(W^{k+1} - V^{k+1}). \qquad (24)$$
Glowinski [24] first suggested the restriction $\gamma \in \left(0, \frac{1 + \sqrt{5}}{2}\right)$, and this choice has shown better performance in numerical experiments [25].
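A direct MATLAB transcription of Algorithm 2 might look as follows; the initial step size, the positivity safeguard on the BB quotient, and the iteration cap are illustrative choices rather than values prescribed above.

% spectral_gradient.m: solve the subproblem (20) by Algorithm 2 (sketch)
function V = spectral_gradient(A, b, W, Lambda, beta, V0, tol, maxit)
    V = V0;
    G = gradf(A, b, V, W, Lambda, beta);
    alpha = 1;                                   % initial step size alpha_0 (assumed)
    for j = 0:maxit
        if norm(G, 'fro') <= tol                 % termination criterion (Step 1)
            break;
        end
        Vnew = V - alpha * G;                    % gradient step (22)
        Gnew = gradf(A, b, Vnew, W, Lambda, beta);
        S = Vnew - V;
        Y = Gnew - G;
        sy = S(:)' * Y(:);
        if sy > 0
            alpha = (S(:)' * S(:)) / sy;         % BB step size (23) for the next step
        end
        V = Vnew;
        G = Gnew;
    end
end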
Now, a practical ready-to-implement version of the ADMM (7)–(9) can be described as follows.

Algorithm 3. ADMM-BB for the $\ell_{2,1}$-norm minimization problem.
Step 0: let $\beta > 0$ and $\gamma \in \left(0, \frac{1 + \sqrt{5}}{2}\right)$ be given. Let $V^{0} \in \mathbb{R}^{n \times t}$ be arbitrary. Let $\Lambda^{0} \in \mathbb{R}^{n \times t}$ be the initial estimate of the Lagrange multipliers. Let $k := 0$.
Step 1: when the stopping criterion holds, stop; otherwise, continue.
Step 2: compute $W^{k+1}$ by (18).
Step 3: compute $V^{k+1}$ by solving the following problem with the spectral gradient method:
$$\min_{V \in \mathbb{R}^{n \times t}}\ \frac{1}{2}\|\mathcal{A}(V) - b\|_2^2 + \langle \Lambda^{k}, V\rangle + \frac{\beta}{2}\|W^{k+1} - V\|_F^2. \qquad (25)$$
Step 4: compute $\Lambda^{k+1}$ by (24).
Step 5: let $k := k + 1$ and go to Step 1.

Based on the conclusions of Bertsekas and Tsitsiklis ([21], Chapter 3, Proposition 4.2) and Glowinski ([24], Chapter VI, Theorem 5.1), the following convergence result holds for Algorithm 3.
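Putting the pieces together, Algorithm 3 can be sketched as the following MATLAB loop. It reuses the hypothetical helpers l21_shrink and spectral_gradient and the cell-array data layout assumed above; the inner tolerance, the inner iteration cap, the zero initialization, and the particular stopping rule are placeholder choices, not values prescribed by the paper.

% admm_bb.m: ADMM-BB for problem (3) (illustrative sketch of Algorithm 3)
function W = admm_bb(A, b, mu, beta, gamma, epsilon, maxit)
    n = size(A{1}, 2);
    t = numel(A);
    V = zeros(n, t);  Lambda = zeros(n, t);  W = zeros(n, t);
    for k = 0:maxit
        Wold = W;  Vold = V;
        W = l21_shrink(V + Lambda / beta, mu / beta);               % Step 2: update by (18)
        V = spectral_gradient(A, b, W, Lambda, beta, V, 1e-6, 100); % Step 3: solve (25)
        Lambda = Lambda - gamma * beta * (W - V);                   % Step 4: update by (24)
        change = max(norm(W - Wold, 'fro'), norm(V - Vold, 'fro'));
        if k > 0 && change <= epsilon                               % stopping rule (placeholder variant of (27))
            break;
        end
    end
end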

Theorem 2. Suppose that the Lagrangian function of problem (4) has a saddle point $(W^{*}, V^{*}, \Lambda^{*})$. Let $\{(W^{k}, V^{k}, \Lambda^{k})\}$ be the sequence generated by Algorithm 3 with $\beta > 0$ and $\gamma \in \left(0, \frac{1 + \sqrt{5}}{2}\right)$. Then,
$$\lim_{k \to \infty}\|W^{k+1} - V^{k+1}\|_F = 0. \qquad (26)$$

Moreover, if $(\bar{W}, \bar{V}, \bar{\Lambda})$ is a weak cluster point of $\{(W^{k}, V^{k}, \Lambda^{k})\}$, then $(\bar{W}, \bar{V}, \bar{\Lambda})$ is a saddle point of the Lagrangian function of problem (4).

4. Experiments

In this section, we report numerical results of ADMM-BB for solving the matrix $\ell_{2,1}$-norm minimization problem (3). The experiments are carried out in MATLAB R2018b running on a computer with a 2.8 GHz Intel Pentium CPU and 8 GB of low-voltage memory.

Based on simulated data and real data, we conducted two types of numerical experiments to study the performance of the ADMM-BB method. In the tests, we compared the ADMM-BB method with the IADM-MFL method [8], because IADM-MFL is well known and provides a feasible way to solve the joint feature selection problem in multitask learning. For each test problem, starting from the origin, the algorithm stops when the distance between adjacent iteration points is less than a given constant $\epsilon$, i.e.,
$$\|W^{k+1} - W^{k}\|_F < \epsilon. \qquad (27)$$

We choose a fixed value of $\epsilon$ for the following series of experiments.

Example 1. As in [4], the simulated data sets are created by using a 5-dimensional zero-mean Gaussian distribution with a covariance matrix equal to diag{1, 0.64, 0.49, 0.36, 0.25}, denoted by $N(0, \Sigma)$. Each task weight vector $w_i$ drawn in this way is then expanded with 20 irrelevant dimensions by adding zero elements. The training data $A_i$ are random Gaussian matrices generated by the MATLAB command randn. Using $A_i$ and $w_i$, the outputs are obtained as
$$b_i = A_i w_i + \xi_i, \qquad (28)$$
where the Gaussian noise $\xi_i$ has mean 0 and standard deviation 1.e−2. For each method, $\hat{W}$ denotes the computed solution of the matrix $\ell_{2,1}$-norm minimization problem (3). To measure the quality of $\hat{W}$ with respect to the original $W$, we define the relative error as follows:
$$\mathrm{RelErr} = \frac{\|\hat{W} - W\|_F}{\|W\|_F}. \qquad (29)$$
We analyze the performance of both methods for different numbers of dimensions and tasks, since these factors strongly affect the performance of each algorithm. The numerical results are shown in Table 1, which reports the CPU time required in seconds (TIME), the total number of iterations (ITER), the total number of tasks (t), the dimension of the test data (n), and the dimension of the outputs (m).
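A possible MATLAB sketch of the data generation in Example 1 and of the relative error (29) is given below; the function name, the argument names, and the total dimension of 25 (5 relevant plus 20 zero-padded entries) are our illustrative reading of the setup, not code from the paper.

% make_example1.m: simulated multitask data as in Example 1 (sketch)
% t: number of tasks, mi: number of training samples per task
function [A, b, W] = make_example1(t, mi)
    Sigma = diag([1, 0.64, 0.49, 0.36, 0.25]);        % covariance of the relevant features
    n = 25;                                            % 5 relevant + 20 irrelevant dimensions
    W = zeros(n, t);
    A = cell(1, t);
    b = cell(1, t);
    for i = 1:t
        W(1:5, i) = sqrtm(Sigma) * randn(5, 1);        % zero-mean Gaussian relevant weights
        A{i} = randn(mi, n);                           % Gaussian training data
        b{i} = A{i} * W(:, i) + 1e-2 * randn(mi, 1);   % outputs with noise, cf. (28)
    end
end

% relative error (29) of a computed solution What with respect to the true W:
% relerr = norm(What - W, 'fro') / norm(W, 'fro');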

From the results in Table 1, it can be seen that although both methods terminate successfully, the number of iterations and the CPU time of the ADMM-BB method are much smaller than those of IADM-MFL.

Then, the parameters involved in the methods are fixed. For each solving algorithm, we evaluate the objective function values and the test error rate, and the convergence behavior of the algorithms is shown in Figure 1. To visually compare the convergence speed of the algorithms, the four subgraphs in Figure 1 show how the function value and the relative error change with the number of iterations and with the CPU time for both the ADMM-BB and IADM-MFL algorithms. The first row of Figure 1 plots the relative error and the objective function values against the number of iterations, and the second row plots them against the computational time. Figure 1 shows that although both the ADMM-BB and IADM-MFL algorithms generate decreasing sequences and converge to the same function value and relative error, ADMM-BB performs better than IADM-MFL in terms of the number of iterations and the CPU time.

Example 2. In this test, we demonstrate the performance of the proposed algorithm on a real data set: a text categorization data set in which each of the 10 tasks corresponds to a subcategory of the Arts category. The data set can be downloaded from http://www.dmoz.org/. In order to learn the features shared between tasks, we randomly select data from each task for training, sampling 20%, 30%, 40%, 50%, 60%, and 70% of the data set, respectively, and then test the two methods at the same time. Except for the parameter adjusted to this data set, the other parameters are the same as in the preceding example for both the ADMM-BB and IADM-MFL methods. The corresponding numerical results are summarized in Table 2.

From Table 2, we can see that ADMM-BB is an effective method and works better on these problems.

5. Conclusion

The convergence theory of the alternating direction method of multipliers for convex optimization problems was well established by Bertsekas and Tsitsiklis [21] and Glowinski [24]. The main purpose of this paper is to demonstrate that this method is robust for the matrix $\ell_{2,1}$-norm regularized minimization problem. The key ingredient is the practical efficiency obtained by combining the alternating direction method of multipliers with the spectral gradient method. The corresponding numerical results verify the encouraging efficiency of the proposed method in solving the joint feature selection problem.

Data Availability

The data used to support the findings of this study are available in tables in this paper and can also be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Scientific Research Project of Tianjin Education Commission (no. 2019KJ232).