Abstract

We present a hierarchical interactive lasso penalized logistic regression solved by the coordinate descent algorithm, building on hierarchy theory and variable interactions. We define the interaction model based on geometric algebra and hierarchical constraint conditions and then use the coordinate descent algorithm to solve for the coefficients of the hierarchical interactive lasso model. We report the results of experiments based on UCI datasets, the Madelon dataset from NIPS2003, and daily activities of the elderly. The experimental results show that variable interactions and hierarchy contribute significantly to classification. The hierarchical interactive lasso combines the advantages of the lasso and the interactive lasso.

1. Introduction

Sparse linear models (such as the lasso) are a remarkable success in the regression analysis of high-dimensional data [1]. The lasso is a least squares regression with an L1 penalty function. It can also be extended to the generalized linear model [2], for example, logistic regression with an L1 penalty used for classification [3]. In the lasso model, the response variable is assumed to be a linear weighted sum of the predictor variables, and the optimization problem used to find the weighting coefficients can be solved by the coordinate descent algorithm [4]. If, in the analysis of high-dimensional data, the response variable cannot be explained by a linear weighted sum of the predictor variables, a higher-order or quadratic model needs to be used. In most cases, this suggests the presence of variable interactions [5]. Such interactions are considered important; for example, the interaction between single nucleotide polymorphisms (SNPs) plays an important role in the diagnosis of cancer and other diseases [6]. While the linear model has advantages such as good interpretability and simple calculations, variable interaction models are considered a focus of modern research [7].
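For concreteness, the two penalized objectives referred to above can be written in their standard forms (our notation; \(n\) samples, \(p\) predictors):

```latex
% Lasso: least squares with an L1 penalty
\hat{\beta} = \arg\min_{\beta_0,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2
  + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert

% L1-penalized logistic regression for classification
\hat{\beta} = \arg\min_{\beta_0,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[y_i\log p(x_i) + (1-y_i)\log\bigl(1-p(x_i)\bigr)\Bigr]
  + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
```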

There are three types of methods used in hierarchical interaction models. The first is a multistep method, based on removing or adding the best predictor variables or interaction variables in each iteration. Once the predictor variables corresponding to an interaction variable are in the model, the interaction variable must be in the model as well [8]. Alternatively, we can consider variable selection before interaction selection [9]. Usually, a modified LARS algorithm is used to solve such interaction models [10]. The second type is the Bayes model method, which improves the random search variable selection method for the hierarchical interaction model [11]. The third type is based on optimization. The sparse interaction model is formulated as a nonconvex optimization problem [12] and further expressed as a convex optimization problem, such as the all-pair lasso [13] or the interaction group lasso [14].

In the literature on sparse structures [15], composite absolute penalties (CAP) can also obtain group and interaction sparseness, but the interaction coefficient is penalized twice [16]. To obtain hierarchical sparseness in the nonlinear interaction problem, the literature [17] introduced the VANISH method. The logic regression method considers high-level interactions of binary variables [18]. The literature [19] uses a simple recursive approach to select interaction variables from high-dimensional data. The literature [20] proposed a genetic algorithm to select interaction variables in high-dimensional data.

The literature [13] presents a hierarchical interactive lasso method for regression and provides a method for estimating model coefficients using the KKT conditions and the Lagrange multiplier method. Based on the literature [13] and our past work, we propose the concept of geometric algebra interaction and a coordinate descent algorithm for the hierarchical interactive lasso penalized logistic regression. Our experimental data include four datasets from the UCI machine learning database, the Madelon dataset from NIPS2003, and a daily life activity recognition dataset. The experimental results reveal clear advantages of the hierarchical interactive lasso method over the lasso and interactive lasso methods. The innovations include the following: (1) we use geometric algebra to explain variable interaction; (2) we derive an improved coordinate descent algorithm to solve the hierarchical interactive lasso penalized logistic regression; (3) we use the hierarchical interactive lasso for the classification problem.

2. The Variable Interaction Theory of Geometric Algebra

Definition 1. If the function \(f(x_1, x_2, \ldots, x_p)\) cannot be represented as a sum of independent functions, \(f(x_1, x_2, \ldots, x_p) \neq f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p)\), then the variables \(x_1, x_2, \ldots, x_p\) in the function are said to have interaction.

A popular explanation of Definition 1 is that if a response variable cannot be represented as a linear weighted sum of the prediction variables, it is probably because there are interactions between the variables. For example, \(f(x_1, x_2) = x_1 x_2\) cannot be written as \(f_1(x_1) + f_2(x_2)\), so \(x_1\) and \(x_2\) interact.

Interactions between variables can be easily explained by geometric algebra theory. Figure 1 is a diagram showing all the subspaces in geometric algebra. The 1-vectors, namely, the order-1 main variables, represent one-dimensional subspaces of the original data; that is, the basis of the original data space is projected onto the 1-vectors. The 2-vectors show the interaction between two variables; the simplest 2-vector coefficient is the product of two 1-vectors. In the literature [13], the area feature we proposed is considered as one such interaction; in the literature [20], the orthocenter feature we proposed is considered as another. Higher-order interactions are represented by \(k\)-vectors. In this paper, we only study area interactions between 1-vectors. This method can also be extended to nonlinear complex function interactions or higher orders.
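As a brief note on the geometric picture (standard geometric algebra, our notation): the 2-vector formed by two 1-vectors \(a\) and \(b\) is their outer (wedge) product, whose magnitude is the area of the parallelogram they span,

```latex
a \wedge b = -\, b \wedge a, \qquad
\lVert a \wedge b \rVert = \lVert a \rVert\, \lVert b \rVert \sin\theta,
```

where \(\theta\) is the angle between \(a\) and \(b\). The coordinate product \(x_j x_k\) used below is the simplest scalar surrogate for such an area feature.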

3. The Binary Logistic Regression Based on Interaction and Hierarchy

The outcome variable in the binary logistic model is denoted by \(y \in \{0, 1\}\); the input variables are the predictors \(x = (x_1, \ldots, x_p)^T\). The \(x_j\) are order-1 main variables, and the pairwise products \(x_j x_k\) are interaction variables between order-1 main variables. The binary logistic model has the form

\[ \log \frac{\Pr(y = 1 \mid x)}{\Pr(y = 0 \mid x)} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \frac{1}{2} \sum_{j \neq k} \Theta_{jk} x_j x_k, \tag{1} \]

where \(\beta_0\) is the intercept, the main variable coefficients are \(\beta \in \mathbb{R}^p\), the interaction variable coefficients are \(\Theta \in \mathbb{R}^{p \times p}\), the input corresponding to \(\beta_0\) is 1, and \(\Theta\) satisfies \(\Theta = \Theta^T\).

Assume that the training samples are \(\bigl(x^{(i)}, y^{(i)}\bigr)\), \(i = 1, 2, \ldots, n\), \(y^{(i)} \in \{0, 1\}\).

Our goal is to select a feature subset from the order-1 main variables (dimension \(p\)) and the order-2 interaction variables (dimension \(p(p-1)/2\)). We then estimate the coefficient values for the nonzero model parameters. We can obtain the probabilities of the two classes as follows:

\[ \Pr(y = 1 \mid x) = \frac{\exp(\eta)}{1 + \exp(\eta)}, \tag{2} \]

\[ \Pr(y = 0 \mid x) = \frac{1}{1 + \exp(\eta)}, \tag{3} \]

where \(\eta = \beta_0 + \sum_j \beta_j x_j + \frac{1}{2}\sum_{j \neq k} \Theta_{jk} x_j x_k\) denotes the linear predictor of (1).
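A minimal sketch of this feature map and of the class probability in (2); the helper names are our own, and the code assumes a symmetric \(\Theta\) with zero diagonal:

```python
import numpy as np

def interaction_features(x):
    """Append all pairwise products x_j * x_k (j < k) to the order-1
    variables, giving dimension p + p(p-1)/2."""
    p = len(x)
    pairs = [x[j] * x[k] for j in range(p) for k in range(j + 1, p)]
    return np.concatenate([x, pairs])

def prob_class1(x, beta0, beta, theta):
    """Pr(y = 1 | x) from model (1); theta is symmetric with zero diagonal."""
    eta = beta0 + x @ beta + 0.5 * x @ theta @ x
    return 1.0 / (1.0 + np.exp(-eta))
```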

Maximum likelihood estimation is used to estimate the unknown model parameters, choosing the values that maximize the likelihood function of the \(n\) independent observations. Writing \(p(x) = \Pr(y = 1 \mid x)\), we define

\[ \Pr(y \mid x) = p(x)^{y} \bigl(1 - p(x)\bigr)^{1 - y}. \tag{4} \]

Then, the logarithmic likelihood function of (4) is

\[ \ell(\beta_0, \beta, \Theta) = \sum_{i=1}^{n} \Bigl[ y^{(i)} \log p\bigl(x^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - p\bigl(x^{(i)}\bigr)\bigr) \Bigr]. \tag{5} \]

We use the second-order Taylor expansion of (5) at the current estimated values \(\bigl(\tilde{\beta}_0, \tilde{\beta}, \tilde{\Theta}\bigr)\) and obtain the subproblem as follows:

\[ \ell_Q = -\frac{1}{2} \sum_{i=1}^{n} w_i \Bigl( z_i - \beta_0 - \sum_j \beta_j x^{(i)}_j - \frac{1}{2} \sum_{j \neq k} \Theta_{jk} x^{(i)}_j x^{(i)}_k \Bigr)^2 + C, \tag{6} \]

where \(w_i = \tilde{p}\bigl(x^{(i)}\bigr)\bigl(1 - \tilde{p}\bigl(x^{(i)}\bigr)\bigr)\), \(z_i = \tilde{\eta}_i + \bigl(y^{(i)} - \tilde{p}\bigl(x^{(i)}\bigr)\bigr)/w_i\), with \(\tilde{\eta}_i\) the current linear predictor and \(C\) a constant.
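The working quantities in (6) are the standard IRLS ones; a short sketch (names ours, with a small floor on \(w_i\) to avoid division by zero):

```python
import numpy as np

def working_response(eta, y, eps=1e-5):
    """IRLS weights w_i and working responses z_i for the quadratic
    approximation (6), given the current linear predictor eta."""
    p_hat = 1.0 / (1.0 + np.exp(-eta))             # fitted probabilities
    w = np.clip(p_hat * (1.0 - p_hat), eps, None)  # w_i = p(1-p), floored
    z = eta + (y - p_hat) / w                      # working response z_i
    return w, z
```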

The proof that (5) implies (6) is presented in Appendix A.

In order to obtain a sparse solution for the main variable coefficients and the interaction coefficients, the \(L_1\) penalty function is used to enhance the stability of the interactive model:

\[ \min_{\beta_0, \beta, \Theta} \; -\ell_Q(\beta_0, \beta, \Theta) + \lambda \lVert \beta \rVert_1 + \frac{\lambda}{2} \lVert \Theta \rVert_1. \tag{7} \]

We focus on those interactions that have large main variable coefficients. Such restrictions are known as "hierarchy." The mathematical expression for them is \(\Theta_{jk} \neq 0 \Rightarrow \beta_j \neq 0\) and \(\beta_k \neq 0\), or equivalently \(\lVert \Theta_j \rVert_1 \leq \lvert \beta_j \rvert\). So, we add the constraints enforcing the hierarchy into (7) as follows:

\[ \min_{\beta_0, \beta, \Theta} \; -\ell_Q + \lambda \lVert \beta \rVert_1 + \frac{\lambda}{2} \lVert \Theta \rVert_1 \quad \text{s.t.} \quad \lVert \Theta_j \rVert_1 \leq \lvert \beta_j \rvert, \; j = 1, \ldots, p, \tag{8} \]

where \(\Theta_j\) is the \(j\)th column of \(\Theta\). If \(\beta_j = 0\), then \(\lVert \Theta_j \rVert_1 \leq 0\), so \(\Theta_j = 0\), and all interactions involving \(x_j\) are excluded. The new constraint guarantees the hierarchy, but we cannot obtain a convex solution because (8) is not convex. So, instead of \(\beta_j\) we use \(\beta_j = \beta_j^+ - \beta_j^-\), \(\beta_j^+ \geq 0\), \(\beta_j^- \geq 0\). And the corresponding convex relaxation of (8) is as follows:

\[ \min \; -\ell_Q + \lambda \sum_{j=1}^{p} \bigl(\beta_j^+ + \beta_j^-\bigr) + \frac{\lambda}{2} \lVert \Theta \rVert_1 \quad \text{s.t.} \quad \lVert \Theta_j \rVert_1 \leq \beta_j^+ + \beta_j^-, \; \beta_j^+ \geq 0, \; \beta_j^- \geq 0, \tag{9} \]

where \(\beta_j = \beta_j^+ - \beta_j^-\), \(\beta_j^+ \geq 0\), \(\beta_j^- \geq 0\), and \(\Theta = \Theta^T\).

4. Coordinate Descent Algorithm and KKT Conditions

The basic idea of the coordinate descent algorithm is to convert a multivariate problem into multiple single-variable subproblems. It optimizes a single coordinate at a time, and the solution is updated cyclically. We solve (9) using the coordinate descent algorithm.
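As a minimal illustration of a single coordinate step on the weighted subproblem (6), here is a sketch in our own notation; it handles only the plain L1 part (the hierarchy enters later through the dual variables) and omits the \(\Theta\) terms from the partial residual for brevity:

```python
import numpy as np

def soft_threshold(u, t):
    """S(u, t) = sign(u) * max(|u| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def update_coordinate(j, X, w, z, beta0, beta, lam):
    """One cyclic coordinate-descent step for beta_j on the weighted
    least squares subproblem with an L1 penalty of weight lam."""
    r = z - beta0 - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
    num = np.sum(w * X[:, j] * r)
    den = np.sum(w * X[:, j] ** 2)
    return soft_threshold(num, lam) / den
```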

The Lagrange function corresponding to (9) is as follows:

\[ L = -\ell_Q + \lambda \sum_{j=1}^{p} \bigl(\beta_j^+ + \beta_j^-\bigr) + \frac{\lambda}{2} \lVert \Theta \rVert_1 + \sum_{j=1}^{p} \gamma_j \bigl(\lVert \Theta_j \rVert_1 - \beta_j^+ - \beta_j^-\bigr) - \sum_{j=1}^{p} \bigl(\alpha_j^+ \beta_j^+ + \alpha_j^- \beta_j^-\bigr), \tag{10} \]

where \(\gamma_j \geq 0\) and \(\alpha_j^+, \alpha_j^- \geq 0\) are the dual variables corresponding to the hierarchical constraint and the nonnegativity constraints, respectively. Holding the other blocks fixed, formula (10) can be decomposed into \(p\) subproblems, one for each block \(\bigl(\beta_j^+, \beta_j^-, \Theta_j\bigr)\):

\[ \min_{\beta_j^+, \beta_j^-, \Theta_j} \; -\ell_Q + \lambda \bigl(\beta_j^+ + \beta_j^-\bigr) + \frac{\lambda}{2} \lVert \Theta_j \rVert_1 + \gamma_j \bigl(\lVert \Theta_j \rVert_1 - \beta_j^+ - \beta_j^-\bigr) - \alpha_j^+ \beta_j^+ - \alpha_j^- \beta_j^-. \tag{12} \]

The solution of (12), being a convex problem, can be obtained from a set of optimality conditions known as the KKT (Karush-Kuhn-Tucker) conditions. This is a key advantage of our approach.

The stationarity conditions of (12) according to the KKT conditions are \(\partial L / \partial \beta_j^{+} = 0\), \(\partial L / \partial \beta_j^{-} = 0\), and \(0 \in \partial_{\Theta_{jk}} L\). The complementary slackness conditions are \(\gamma_j \bigl(\lVert \Theta_j \rVert_1 - \beta_j^+ - \beta_j^-\bigr) = 0\), \(\alpha_j^+ \beta_j^+ = 0\), \(\alpha_j^- \beta_j^- = 0\), \(\gamma_j \geq 0\), \(\alpha_j^+ \geq 0\), \(\alpha_j^- \geq 0\). We assume that \(\beta_j^+ \beta_j^- = 0\). For our problem, the KKT conditions can be written as follows:

\[ \hat{\beta}_j = \frac{S\Bigl(\sum_{i} w_i x^{(i)}_j r^{(i)}_j, \; \lambda - \gamma_j\Bigr)}{\sum_{i} w_i \bigl(x^{(i)}_j\bigr)^2}, \qquad \hat{\Theta}_{jk} = \frac{S\Bigl(\sum_{i} w_i x^{(i)}_j x^{(i)}_k r^{(i)}_{jk}, \; \frac{\lambda}{2} + \gamma_j + \gamma_k\Bigr)}{\sum_{i} w_i \bigl(x^{(i)}_j x^{(i)}_k\bigr)^2}, \tag{13} \]

where \(w_i\) and \(z_i\) are from (6), \(r^{(i)}_j\) and \(r^{(i)}_{jk}\) are the partial residuals of \(z_i\) excluding the coordinate being updated, and \(S\) denotes the soft-threshold operator defined by \(S(u, t) = \operatorname{sign}(u) \max\bigl(\lvert u \rvert - t, 0\bigr)\).

The proof of both expressions in (13) can be found in Appendix B.

Now, we define \(F(\gamma_j) = \lVert \hat{\Theta}_j(\gamma_j) \rVert_1 - \hat{\beta}_j^+(\gamma_j) - \hat{\beta}_j^-(\gamma_j)\). Then the remaining KKT conditions only involve \(\gamma_j\), \(\hat{\beta}_j^{\pm}\), and \(\hat{\Theta}_j\). Observing that \(F(\gamma_j)\) is nonincreasing with respect to \(\gamma_j\) and is piecewise linear, it is easy to get the solution for \(\gamma_j\).
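A sketch of this one-dimensional dual search (bisection is used here for simplicity; the piecewise-linear structure of \(F\) would also admit an exact breakpoint search):

```python
def solve_gamma(F, gamma_hi, tol=1e-8):
    """Smallest gamma_j >= 0 with F(gamma_j) <= 0, for a nonincreasing
    function F, found by bisection on [0, gamma_hi]."""
    if F(0.0) <= 0.0:
        return 0.0                # constraint already slack: gamma_j = 0
    lo, hi = 0.0, gamma_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mid) <= 0.0:
            hi = mid              # feasible: tighten from above
        else:
            lo = mid              # infeasible: raise gamma
    return hi
```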

In conclusion, the overall idea of the coordinate descent algorithm is that the minimization of (9) is equivalent to the minimization of (10). Formula (10) can be decomposed into the independent subproblems (12), and formula (12) can be solved via (13). The final coefficient optimization iteration formulas are

\[ \beta_j^{(m+1)} = \frac{S\Bigl(\sum_{i} w_i x^{(i)}_j r^{(i)}_j, \; \lambda - \gamma_j\Bigr)}{\sum_{i} w_i \bigl(x^{(i)}_j\bigr)^2}, \qquad \Theta_{jk}^{(m+1)} = \frac{S\Bigl(\sum_{i} w_i x^{(i)}_j x^{(i)}_k r^{(i)}_{jk}, \; \frac{\lambda}{2} + \gamma_j + \gamma_k\Bigr)}{\sum_{i} w_i \bigl(x^{(i)}_j x^{(i)}_k\bigr)^2}, \]

where \(\beta_j^{(m)}\) is the estimated value of the \(j\)th main variable coefficient after \(m\) iterations, \(\Theta_{jk}^{(m)}\) is the estimate of the interaction coefficient between the \(j\)th and \(k\)th variables after \(m\) iterations, and the partial residuals are recomputed at the current iterates.
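Putting the pieces together, the following skeleton shows how the outer IRLS loop and the inner coordinate sweeps fit; it reuses the sketches above (`working_response`, `update_coordinate`) and elides the \(\Theta\) and \(\gamma\) updates, so it illustrates the control flow rather than the full method:

```python
import numpy as np

def fit_hier_lasso(X, y, lam, n_outer=20, n_sweeps=50):
    """Skeleton of the solver: IRLS outer loop, cyclic coordinate descent
    inner loop; theta and gamma updates are elided (see the comment)."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    theta = np.zeros((p, p))
    gamma = np.zeros(p)                        # dual variables for the hierarchy
    for _ in range(n_outer):
        eta = beta0 + X @ beta + 0.5 * np.einsum('ij,jk,ik->i', X, theta, X)
        w, z = working_response(eta, y)        # quadratic approximation (6)
        for _ in range(n_sweeps):
            for j in range(p):
                beta[j] = update_coordinate(j, X, w, z, beta0, beta,
                                            max(lam - gamma[j], 0.0))
                # ... update theta[j, :] and gamma[j] via (13) and solve_gamma ...
        beta0 = np.average(z - X @ beta, weights=w)   # intercept refit (theta elided)
    return beta0, beta, theta
```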

5. The Experimental Results and Analysis

5.1. The Experimental Results and Analysis of Four UCI Datasets

Four datasets from the UCI machine learning database are used, namely, the breast-cancer-Wisconsin, Ionosphere, Liver_disorders, and Sonar datasets, as shown in Table 1.

We run the 10-fold cross-validation (10-CV) experiments 20 times over a decreasing sequence of \(\lambda\) values, employing the interactive hierarchical lasso logistic regression method. The results include the number of nonzero variable coefficients, the average error rate of the 10-CV, the standard deviation (SD), the CPU time, and the estimated value of lambda (\(\lambda\)). The results are shown in Table 2.
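A sketch of this protocol, assuming the `fit_hier_lasso` skeleton from Section 4 and classifying by the sign of the fitted logit:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_error(X, y, lam, n_splits=10, seed=0):
    """Average 10-fold CV misclassification rate for one lambda value."""
    errs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in skf.split(X, y):
        beta0, beta, theta = fit_hier_lasso(X[tr], y[tr], lam)
        eta = beta0 + X[te] @ beta \
              + 0.5 * np.einsum('ij,jk,ik->i', X[te], theta, X[te])
        errs.append(np.mean((eta > 0).astype(int) != y[te]))
    return np.mean(errs), np.std(errs)

# scan a decreasing lambda grid and keep the best CV error, as in Table 2
# lams = np.logspace(1, -3, 50)
# results = [cv_error(X, y, lam) for lam in lams]
```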

The 10-CV results of the proposed method on the four datasets are presented in Figures 2 to 5. In the figures, the horizontal axis represents the logarithmic value of \(\lambda\) and the vertical axis is the error rate of the 10-CV. In addition, the horizontal axis at the top of each figure shows the number of nonzero variable coefficients corresponding to each \(\lambda\) value.

The results for the breast-cancer-Wisconsin datasets are shown in Figure 2. The minimum error rate is 0.03 and the number of selected variables is more than 11. The results for the Ionosphere datasets are shown in Figure 3. When the number of selected variables is 101, the lowest error rate is 0.28 with a small standard deviation. The number of selected variables is larger than the original dimension, so the interactions provide classification information. The results for the Liver_disorders datasets are presented in Figure 4. If the number of selected variables is 25, the lowest error rate reaches 0.26, while the standard deviation is 0.02. Finally, the results for the Sonar datasets are presented in Figure 5. When more than 80 variables are selected, the minimum error rate is 0.14.

In what follows, we compare our method to the existing literature [13]. The classification results and training time of our method are better than those reported in the literature [13]. The experimental results of the lasso, all-pair lasso, and conventional pattern recognition methods with 10-fold cross-validation repeated 20 times on the four UCI datasets are listed, respectively, in Tables 3, 4, and 5. The conventional pattern recognition methods include support vector machine (SVM), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-nearest neighbor (k-NN), and decision tree (DT) methods. The lasso is a method that considers the main variables without the interaction variables. The all-pair lasso is a method that considers the main variables and interaction variables but without the hierarchy. The experimental results show that our model gives better and more stable classification results. This highlights the advantage of the variable interactions and hierarchy.

5.2. The Experimental Results for High-Dimensional Small Sample Data

The Madelon dataset from NIPS2003 was used to evaluate our method. The sample numbers of the training, validation, and testing sets were, respectively, 2000, 600, and 1800. The number of classes is 2. The variable dimension is 500, so the interaction dimension is 124750. More information about the dataset is available at http://www.nipsfsc.ecs.soton.ac.uk/, where the dataset can be downloaded and the challenge results, balanced error rates, and areas under the curve can be viewed. The model is trained using the training set, the model parameters are selected using the validation set, and the prediction results of the final model on the test set are uploaded online to obtain the classification score of the final model. Our results are shown in Table 6. They show that our method is slightly better than the lasso and the all-pair lasso. This implies that interactions may also be important in the Madelon dataset.

5.3. Activity Recognition (AR) Using Inertial Sensors of Smartphones

Anguita et al. collected sensor data from smartphones [10]. They used the support vector machine (SVM) method to solve the classification problem of daily life activity recognition. These results play an extremely significant role in disability and elderly care. The datasets can be downloaded following the literature [10]. 30 volunteers aged 19–48 years participated in the study. Each person performed six activities wearing the smartphone on the waist. To obtain the data class labels, the experiments were video-recorded. The smartphone used in the experiments had a built-in accelerometer and gyroscope measuring 3D linear acceleration and angular velocity. The sampling frequency was 50 Hz, which is more than enough for capturing human movements.

We use these datasets to evaluate our method, taking the upstairs and downstairs movements as the two activity classes. The training sets have 986 and 1,073 samples, respectively, and the test sets have 420 and 471 samples, respectively. The variable dimension is 561, which includes time- and frequency-domain features from the sensor signals.

Experimental results of the three lasso methods and several pattern recognition methods are shown in Table 7. The results show that our method outperforms the pattern recognition methods, since it takes variable selection and interaction into account. Our method achieves the best classification results with less training and testing time.

5.4. The Numerical Simulation Results and Discussion

Now, suppose that the number of the samples is \(n\) and their dimension is \(p\). We take interactions into consideration and provide the following three kinds of simulations based on formula (1).
(1) The real model is hierarchical: \(\Theta_{jk} \neq 0\) only if \(\beta_j \neq 0\) or \(\beta_k \neq 0\). There are 10 nonzero elements in \(\beta\) and 20 nonzero elements in \(\Theta\).
(2) The real model only includes interaction variables: \(\beta = 0\). There are 20 nonzero elements in \(\Theta\).
(3) The real model only includes main variables: \(\Theta = 0\). There are 10 nonzero elements in \(\beta\).
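A sketch of how such data can be generated; the sample size, dimension, seed, and coefficient magnitudes here are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=200, p=30, scenario=1):
    """Draw (X, y) from model (1) under the three scenarios of Section 5.4."""
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    theta = np.zeros((p, p))
    if scenario in (1, 3):
        beta[:10] = rng.choice([-1.0, 1.0], size=10)       # 10 nonzero main effects
    if scenario in (1, 2):
        if scenario == 1:  # hierarchical: interactions within the support of beta
            cand = [(j, k) for j in range(10) for k in range(j + 1, 10)]
        else:              # interaction-only model
            cand = [(j, k) for j in range(p) for k in range(j + 1, p)]
        for m in rng.choice(len(cand), size=20, replace=False):
            j, k = cand[m]
            theta[j, k] = theta[k, j] = rng.choice([-1.0, 1.0])  # 20 interaction pairs
    eta = X @ beta + 0.5 * np.einsum('ij,jk,ik->i', X, theta, X)
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(int)
    return X, y, beta, theta
```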

The signal-to-noise ratio (SNR) of the main variables is 1.5, and the SNR of the interaction variables is 1. The results of 100 repetitions of the experiment are shown in Figure 6.

When the real model is hierarchical, our method is the best, and the lasso is the worst, as shown in Figure 6(a). When the real model only includes interaction variables, the interactive lasso is the best, our method takes second place, and the lasso is still the worst, as shown in Figure 6(b). The reason for this result is that, when our method fits the model, the interaction variables are also treated as main variables. When the real model only includes main variables, the lasso is the best, our method again takes second place, and the all-pair lasso is the worst, as shown in Figure 6(c).

We believe that many actual classification problems could be hierarchical and interactive, containing both main variables and interaction variables. Our method fits this kind of situation.

6. Conclusion

Taking into consideration the interaction between variables, we derive the hierarchical interactive lasso penalized logistic regression solved by the coordinate descent algorithm. We provide the model definition, the constraint condition, and the convex relaxation condition for the model. We obtain a solution for the coefficients of the proposed model based on convex optimization and the coordinate descent algorithm. We further provide experimental results based on four UCI datasets, the NIPS2003 feature selection challenge dataset, and a real daily life activity recognition dataset. The results show that interactions widely exist in classification models and demonstrate that variable interactions contribute to the response. The classification performance of our method is superior to the lasso, the all-pair lasso, and several pattern recognition methods. It turns out that variable interaction and hierarchy are two important factors. Our further research is planned as follows: other convex optimization methods, including the generalized gradient descent method and the alternating direction method of multipliers; the hierarchical interactive lasso penalized multiclass logistic regression method; the elastic net method; and the hierarchical group lasso method. The application of multisensor interaction to the daily life activities of the elderly is a new use of our method.

Appendices

A. Proofs from (5) to (6)

For notational convenience, we write \(p_i\) instead of \(p\bigl(x^{(i)}\bigr)\). The logarithmic likelihood function of (4) is as follows:

\[ \ell = \sum_{i=1}^{n} \Bigl[ y^{(i)} \log p_i + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - p_i\bigr) \Bigr]. \tag{A.1} \]

First, we give the first- and second-order partial derivatives and the mixed partial derivatives of (A.1) with respect to \(\beta_0\) and \(\beta_j\) (the derivatives with respect to \(\Theta_{jk}\) are analogous, with \(x^{(i)}_j x^{(i)}_k\) in place of \(x^{(i)}_j\)):

\[ \frac{\partial \ell}{\partial \beta_0} = \sum_{i=1}^{n} \bigl(y^{(i)} - p_i\bigr), \qquad \frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} x^{(i)}_j \bigl(y^{(i)} - p_i\bigr), \]

\[ \frac{\partial^2 \ell}{\partial \beta_j \, \partial \beta_k} = -\sum_{i=1}^{n} x^{(i)}_j x^{(i)}_k \, p_i \bigl(1 - p_i\bigr). \]

Then (A.1) is expanded by using a second-order Taylor series at the expansion point \(\bigl(\tilde{\beta}_0, \tilde{\beta}, \tilde{\Theta}\bigr)\):

\[ \ell \approx \ell_Q = -\frac{1}{2} \sum_{i=1}^{n} w_i \bigl(z_i - \eta_i\bigr)^2 + C, \]

where \(\eta_i\) is the linear predictor of (1), \(w_i = \tilde{p}_i \bigl(1 - \tilde{p}_i\bigr)\), and \(z_i = \tilde{\eta}_i + \bigl(y^{(i)} - \tilde{p}_i\bigr)/w_i\). This is exactly the subproblem (6).

B. Proofs from (12) to (13)

First, there are three cases in calculating the subgradient \(\partial \lvert \Theta_{jk} \rvert\):
(1) \(\Theta_{jk} > 0\): the subgradient is \(1\);
(2) \(\Theta_{jk} < 0\): the subgradient is \(-1\);
(3) \(\Theta_{jk} = 0\): the subgradient is any \(s \in [-1, 1]\).

We derive the soft-threshold expression for \(\hat{\Theta}_{jk}\) in (13):

\[ \hat{\Theta}_{jk} = \frac{S\Bigl(\sum_{i} w_i x^{(i)}_j x^{(i)}_k r^{(i)}_{jk}, \; \frac{\lambda}{2} + \gamma_j + \gamma_k\Bigr)}{\sum_{i} w_i \bigl(x^{(i)}_j x^{(i)}_k\bigr)^2}. \]

Secondly, from the stationarity condition \(\partial L / \partial \beta_j^{\pm} = 0\) we obtain

\[ \mp \sum_{i} w_i x^{(i)}_j \bigl(z_i - \eta_i\bigr) + \lambda - \gamma_j - \alpha_j^{\pm} = 0. \]

Supposing \(\beta_j^+ \beta_j^- = 0\), we have \(\beta_j = \beta_j^+ - \beta_j^-\) and \(\lvert \beta_j \rvert = \beta_j^+ + \beta_j^-\).

We discuss three cases for the value of \(\beta_j\):
(1) \(\beta_j > 0\), so \(\beta_j^+ > 0\) and \(\alpha_j^+ = 0\);
(2) \(\beta_j < 0\), so \(\beta_j^- > 0\) and \(\alpha_j^- = 0\);
(3) \(\beta_j = 0\).

We derive the soft-threshold expression for \(\hat{\beta}_j\) in (13):

\[ \hat{\beta}_j = \frac{S\Bigl(\sum_{i} w_i x^{(i)}_j r^{(i)}_j, \; \lambda - \gamma_j\Bigr)}{\sum_{i} w_i \bigl(x^{(i)}_j\bigr)^2}. \]

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 61273019, 61473339), China Postdoctoral Science Foundation (2014M561202), Hebei Postdoctoral Science Foundation Special Fund Project, and Hebei Top Young Talents Support Program.