BioMed Research International

Volume 2015, Article ID 713953, 7 pages

http://dx.doi.org/10.1155/2015/713953

## Network-Based Logistic Classification with an Enhanced $L_{1/2}$ Solver Reveals Biomarker and Subnetwork Signatures for Diagnosing Lung Cancer

Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Avenida Wai Long, Taipa 999078, Macau

Received 24 October 2014; Revised 5 April 2015; Accepted 30 April 2015

Academic Editor: Jennifer Wu

Copyright © 2015 Hai-Hui Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Identifying biomarkers and signaling pathways is a critical step in genomic studies, in which regularization methods are widely used feature-selection approaches. However, most regularizers are based on the $L_1$-norm, and their results are often not sparse enough for interpretation and are asymptotically biased, especially in genomic research. Recently, a large amount of molecular interaction information about disease-related biological processes has been accumulated and organized in various databases covering many aspects of biological systems. In this paper, we use an enhanced $L_{1/2}$ penalized solver for the network-constrained logistic regression model, called the enhanced $L_{1/2}$ net, where the predictors are based on gene-expression data with biological network knowledge. Extensive simulation studies showed that our proposed approach outperforms the $L_1$ regularization, the old $L_{1/2}$ penalized solver, and the Elastic net in terms of classification accuracy and stability. Furthermore, we applied our method to lung cancer data analysis and found that it achieves higher predictive accuracy than these alternatives, while selecting fewer but more informative biomarkers and pathways.

#### 1. Introduction

Identifying molecular biomarkers or signaling pathways involved in a phenotype is a particularly important problem in genomic studies. Logistic regression is a powerful discriminative method with an explicit statistical interpretation: it yields the probability of each class label.

A key challenge in identifying diagnostic or prognostic biomarkers with the logistic regression model is that, in most genomic studies, the number of observations is much smaller than the number of measured biomarkers. This limitation causes instability in the algorithms used to select gene markers. Regularization methods have been widely used to deal with this high dimensionality. For example, Shevade and Keerthi proposed sparse logistic regression based on the Lasso regularization [1, 2], and Meier et al. investigated logistic regression with the group Lasso [3]. The Lasso-type procedures are often called $L_1$-norm regularization methods. However, $L_1$ regularization may yield inconsistent selections when applied to variable selection in some situations [4] and often introduces extra bias into the estimation [5]. In many genomic studies we need a sparser solution for interpretation and accurate outcomes, and $L_1$ regularization falls short of these requirements; a further improvement is therefore urgently required. $L_q$ ($0 < q < 1$) regularization can generate sparser and more precise solutions than $L_1$ regularization. Moreover, the $L_{1/2}$ penalty can be taken as a representative of the $L_q$ ($0 < q < 1$) penalties and has demonstrated many attractive properties that $L_1$ regularization lacks, such as unbiasedness, sparsity, and the oracle property [6–8].

By now, a dense body of molecular interaction information about disease-related biological processes has been gathered into databases covering many aspects of biological systems. For example, BioGRID collects biological interactions curated from more than 43,468 publications [9]. These regulatory relationships are usually represented by a network. Combining such graphical information extracted from biological processes with an analysis of gene-expression data has provided useful prior information for detecting noise and removing confounding factors in biological data for several classification and regression models [10–14].

Inspired by the aforementioned methods and ideas, we define here a network-constrained logistic regression model with the $L_{1/2}$ penalty, following the framework established in [11], where the predictors are based on gene-expression data with biological network knowledge. The proposed model aims to identify biomarkers and subnetworks associated with diseases. To achieve better prediction, we use an enhanced half thresholding algorithm for $L_{1/2}$ regularization, which is more efficient than the old half thresholding approach in the literature [6, 15, 16].

The rest of the paper is organized as follows. In Section 2, we propose a new version of the network-constrained logistic regression model with $L_{1/2}$ regularization. In Section 3, we present an enhanced half thresholding method for $L_{1/2}$ regularization and the corresponding coordinate descent algorithm. In Section 4, we evaluate the performance of our proposed approach on simulated data and present an application of the proposed method to the analysis of lung cancer data. We conclude the paper in Section 5.

#### 2. $L_{1/2}$ Penalized Network-Constrained Logistic Regression Model

Generally, assume that the dataset has $n$ samples $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i = (x_{i1}, \ldots, x_{ip})^{T}$ is the $i$th sample with $p$ genes and $y_i$ is the corresponding response variable that takes a value of 0 or 1. Define a classifier $f(x) = \Pr(y = 1 \mid x)$; the logistic regression is defined as

$$\log \frac{f(x)}{1 - f(x)} = \beta_0 + \sum_{j=1}^{p} x_j \beta_j, \tag{1}$$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^{T}$ are the coefficients to be estimated. We can obtain $\hat{\beta}$ by minimizing the negative log-likelihood function of the logistic regression. Following [11], to combine a biological network with an analysis of the gene microarray data, we use a Laplacian constraint approach here. Consider a graph $G = (V, E, W)$, where $V$ is the set of genes that correspond to the explanatory variables and $E$ is the set of edges. If gene $u$ and gene $v$ are connected, then there is an edge between them, denoted by $u \sim v$; otherwise $w(u, v) = 0$, where $w(u, v)$ denotes the weight of the edge between $u$ and $v$. The normalized Laplacian matrix $L$ for $G$ is defined by

$$L(u, v) = \begin{cases} 1 - \dfrac{w(u, v)}{d_u}, & \text{if } u = v \text{ and } d_u \neq 0, \\ -\dfrac{w(u, v)}{\sqrt{d_u d_v}}, & \text{if } u \sim v, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$

where $d_u$ and $d_v$ are the degrees of genes $u$ and $v$, respectively; the degree of gene $u$ (or $v$) is the number of edges connected to $u$ (or $v$). For $\lambda_2 \geq 0$, the network-constrained logistic regression model is presented as

$$\min_{\beta} \left\{ -l(\beta) + \lambda_2 \beta^{T} L \beta \right\}, \tag{3}$$

where the first term in (3) is the negative log-likelihood function of the logistic model, $-l(\beta) = -\sum_{i=1}^{n} \left[ y_i \log f(x_i) + (1 - y_i) \log \left(1 - f(x_i)\right) \right]$, and the second term is a network constraint based on the Laplacian matrix, which induces a solution of $\beta$ that is smooth on the graph.
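As an illustration, the normalized Laplacian of (2) can be built directly from a symmetric gene-gene weight matrix. The sketch below (the function name and the dense double loop are ours, chosen for clarity rather than efficiency) follows the three cases of the definition above.

```python
import numpy as np

def normalized_laplacian(W):
    """Normalized graph Laplacian from a symmetric weight matrix W.

    Diagonal: 1 - w(u,u)/d_u when d_u != 0; off-diagonal: -w(u,v)/sqrt(d_u d_v)
    when u and v are connected; 0 elsewhere.
    """
    d = W.sum(axis=1)                        # gene degrees (weighted)
    L = np.zeros_like(W, dtype=float)
    for u in range(W.shape[0]):
        for v in range(W.shape[1]):
            if u == v and d[u] != 0:
                L[u, v] = 1.0 - W[u, v] / d[u]
            elif W[u, v] != 0 and d[u] != 0 and d[v] != 0:
                L[u, v] = -W[u, v] / np.sqrt(d[u] * d[v])
    return L
```

The resulting matrix is symmetric and nonnegative definite, which is the property used later when the Laplacian term is factored.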

Directly solving (3) performs poorly for both prediction and biomarker selection when the gene number $p$ is much larger than the sample size $n$. Therefore, a regularization approach is vitally needed. Adding a regularization term to (3), the sparse network-constrained logistic regression can be written as

$$\min_{\beta} \left\{ -l(\beta) + \lambda_1 P(\beta) + \lambda_2 \beta^{T} L \beta \right\}, \tag{4}$$

where $\lambda_1$ is a regularization parameter. In Zhang et al. [13], the authors used the Lasso ($L_1$) regularization term $P(\beta) = \sum_{j=1}^{p} |\beta_j|$ to penalize (4). However, the result of Lasso-type ($L_1$) regularization is not sparse enough for interpretation, especially in genomic research. Besides this, $L_1$ regularization is asymptotically biased [17, 18]. To improve the solution's sparsity and predictive accuracy, we need to move beyond $L_1$ regularization to $L_q$ ($0 < q < 1$) penalties. Mathematically, $L_q$ regularization with a lower value of $q$ leads to sparser solutions and gives asymptotically unbiased estimates [17]. Moreover, the $L_{1/2}$ penalty can be taken as a representative of the $L_q$ ($0 < q < 1$) penalties and admits an analytically expressive thresholding representation [6, 7]. Therefore, we propose a novel $L_{1/2}$ net approach, based on $L_{1/2}$ regularization, to penalize the network-constrained logistic regression model:

$$\min_{\beta} \left\{ -l(\beta) + \lambda_1 \|\beta\|_{1/2}^{1/2} + \lambda_2 \beta^{T} L \beta \right\}, \tag{5}$$

where $\|\beta\|_{1/2}^{1/2} = \sum_{j=1}^{p} |\beta_j|^{1/2}$.
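For concreteness, the penalized objective (negative log-likelihood plus the $L_{1/2}$ penalty $\sum_j |\beta_j|^{1/2}$ plus the Laplacian term $\beta^T L \beta$) can be evaluated numerically as in the sketch below; the function name is illustrative and not from the paper.

```python
import numpy as np

def l12_net_objective(beta, X, y, L, lam1, lam2):
    """Value of the L1/2-net objective: negative Bernoulli log-likelihood
    + lam1 * sum_j |beta_j|^(1/2) + lam2 * beta^T L beta."""
    eta = X @ beta
    # -log-likelihood in a numerically stable form: sum log(1+e^eta) - y*eta
    nll = np.sum(np.logaddexp(0.0, eta) - y * eta)
    l_half = lam1 * np.sum(np.sqrt(np.abs(beta)))   # L1/2 penalty term
    network = lam2 * beta @ L @ beta                # Laplacian smoothness term
    return nll + l_half + network
```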

#### 3. A Coordinate Descent Algorithm for the $L_{1/2}$ Network-Constrained Logistic Model with the Enhanced Half Thresholding Operator

The $L_{1/2}$ penalty function is nonconvex, which raises numerical challenges in fitting the model. Recently, coordinate descent algorithms [19] for solving nonconvex regularization models (SCAD [20], MCP [21]) have shown significant efficiency and convergence [22]. Since the computational burden increases only linearly with the feature number $p$, the coordinate descent algorithm can be a powerful tool for solving high-dimensional problems. Its standard procedure is as follows: for every coefficient $\beta_j$, partially optimize the target function with respect to $\beta_j$ while fixing the remaining elements at their most recently updated values. The specific form of the update depends on the thresholding operator of the penalty.

In this paper, we present an enhanced half thresholding operator for the coordinate descent algorithm:

$$\beta_j = h_{\lambda_1, 1/2}(z_j) = \begin{cases} \dfrac{2}{3} z_j \left( 1 + \cos \dfrac{2 \left( \pi - \varphi_{\lambda_1}(z_j) \right)}{3} \right), & \text{if } |z_j| > t_{\lambda_1}, \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$

where $\varphi_{\lambda_1}(z) = \arccos \left( \dfrac{\lambda_1}{8} \left( \dfrac{|z|}{3} \right)^{-3/2} \right)$, $t_{\lambda_1}$ is the enhanced threshold value derived in [7], and $z_j$ is the partial residual for fitting $\beta_j$.

*Remark*. This enhanced thresholding operator outperforms the old thresholding operator introduced in [6, 15, 16]. The quality of the $L_{1/2}$ regularization solution depends heavily on the value of the regularization parameter $\lambda_1$. Based on this enhanced thresholding operator, when $\lambda_1$ is chosen by an efficient parameter-tuning strategy, such as cross validation, the convergence of algorithm (6) is proved in [7].
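For reference, the classical half thresholding operator of Xu et al. [6] can be sketched as below. The threshold constant $\sqrt[3]{54}/4$ is the old value that the enhanced solver replaces with a tighter one; the function name is illustrative.

```python
import numpy as np

def half_threshold(z, lam):
    """Classical half thresholding operator for the L1/2 penalty (Xu et al.).

    Inputs below the threshold are zeroed out; larger inputs are shrunk
    through the closed-form cosine expression."""
    t = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)  # classical threshold
    if abs(z) <= t:
        return 0.0
    phi = np.arccos((lam / 8.0) * (abs(z) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * z * (1.0 + np.cos((2.0 / 3.0) * (np.pi - phi)))
```

Note that the operator is odd in $z$ and shrinks large inputs only slightly, which is the source of the near-unbiasedness discussed in Section 2.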

The Laplacian matrix $L$ is nonnegative definite; thus, it can be written as $L = S S^{T}$ by Cholesky decomposition. Following C. Li and H. Li's approach [11], (5) can be expressed as

$$\min_{\beta^{*}} \left\{ -l^{*}\left(\beta^{*}\right) + \gamma \left\|\beta^{*}\right\|_{1/2}^{1/2} \right\}, \tag{7}$$

where $-l^{*}$ is the negative log-likelihood evaluated on the augmented data $X^{*} = (1 + \lambda_2)^{-1/2} \begin{pmatrix} X \\ \sqrt{\lambda_2}\, S^{T} \end{pmatrix}$ and $y^{*} = \left(y^{T}, \mathbf{0}^{T}\right)^{T}$, $\beta^{*} = \sqrt{1 + \lambda_2}\, \beta$, and $\gamma$ is the regularization parameter, which can be expressed as $\gamma = \lambda_1 / \sqrt{1 + \lambda_2}$.
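The factorization trick can be sketched numerically as follows. We use an eigendecomposition rather than a plain Cholesky factorization so that the positive-semidefinite case is handled without failure; the function name and the omission of the $(1+\lambda_2)^{-1/2}$ rescaling are simplifying assumptions of this sketch.

```python
import numpy as np

def augment_with_laplacian(X, y, L, lam2):
    """Fold lam2 * beta^T L beta into the data: factor L = S S^T, then append
    sqrt(lam2) * S^T as extra design rows with zero responses, so the network
    penalty becomes an ordinary sum of squared residuals."""
    vals, vecs = np.linalg.eigh(L)
    S = vecs * np.sqrt(np.clip(vals, 0.0, None))     # L = S @ S.T
    X_star = np.vstack([X, np.sqrt(lam2) * S.T])
    y_star = np.concatenate([y, np.zeros(L.shape[0])])
    return X_star, y_star
```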

A one-term Taylor series expansion of the log-likelihood in (7) can be written as

$$l_Q(\beta) = -\frac{1}{2} \sum_{i} w_i \left( z_i - x_i^{*T} \beta \right)^2 + C, \tag{8}$$

where $z_i = x_i^{*T} \tilde{\beta} + \dfrac{y_i^{*} - \tilde{f}\left(x_i^{*}\right)}{\tilde{f}\left(x_i^{*}\right) \left( 1 - \tilde{f}\left(x_i^{*}\right) \right)}$ is the estimated response and $w_i = \tilde{f}\left(x_i^{*}\right) \left( 1 - \tilde{f}\left(x_i^{*}\right) \right)$ is the weight for the estimated response; $\tilde{f}\left(x_i^{*}\right)$ is the value evaluated under the current parameters $\tilde{\beta}$. Thus, we can redefine the partial residual for fitting the current $\beta_j$ as $z_j = \sum_{i} w_i x_{ij}^{*} \left( z_i - \tilde{z}_i^{(-j)} \right)$, where $\tilde{z}_i^{(-j)} = \sum_{k \neq j} x_{ik}^{*} \tilde{\beta}_k$. The procedure of the coordinate descent algorithm for the $L_{1/2}$ penalized network-constrained logistic model is described as follows.
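The working response and weights of the Taylor expansion in (8) can be computed as in this sketch (names are illustrative; the small floor on the weights is our numerical safeguard, not part of the derivation):

```python
import numpy as np

def irls_working_response(X, y, beta):
    """Estimated response z_i and weight w_i from the one-term Taylor
    expansion of the logistic log-likelihood around the current beta."""
    eta = X @ beta
    p = 1.0 / (1.0 + np.exp(-np.clip(eta, -30, 30)))  # current fitted probabilities
    w = p * (1.0 - p)                                  # weights for the expansion
    z = eta + (y - p) / np.maximum(w, 1e-8)            # guard against w -> 0
    return z, w
```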

*Algorithm 1 (the coordinate descent algorithm for the $L_{1/2}$ penalized network-constrained logistic model).* We consider the following.

*Step 1*. Initialize all $\beta_j = 0$ ($j = 1, \ldots, p$) and set $m = 0$, with $\lambda_1$ and $\lambda_2$ chosen by cross validation.

*Step 2*. Calculate $z_i$ and $w_i$ and approximate the loss function (8) based on the current $\beta^{(m)}$.

*Step 3*. Update each $\beta_j$, cycling over $j = 1, \ldots, p$ until $\beta$ does not change.

*Step 3.1*. Compute the partial residual $z_j$ and the weighted sum of squares $\sum_i w_i x_{ij}^{*2}$.

*Step 3.2*. Update $\beta_j$ by the enhanced half thresholding operator (6).

*Step 4*. Let $m = m + 1$ and $\beta^{(m)} = \beta$.

If $\beta^{(m)}$ does not converge, then repeat Steps 2 and 3.
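Steps 1 through 4 can be sketched as a compact, runnable routine. This is an illustrative implementation under two simplifying assumptions: the Laplacian term is assumed to have already been folded into the design matrix via the augmentation of (7), and the classical thresholding constant of Xu et al. is used in place of the enhanced one; all names are ours.

```python
import numpy as np

def cd_l12_logistic(X, y, lam, n_outer=30, n_inner=50, tol=1e-5):
    """Coordinate descent for L1/2-penalized logistic regression (a sketch
    of Algorithm 1; network term assumed pre-folded into X)."""

    def half_threshold(z, lam):
        # classical half thresholding operator (old constant 54**(1/3)/4)
        t = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
        if abs(z) <= t:
            return 0.0
        phi = np.arccos((lam / 8.0) * (abs(z) / 3.0) ** (-1.5))
        return (2.0 / 3.0) * z * (1.0 + np.cos((2.0 / 3.0) * (np.pi - phi)))

    n, p = X.shape
    beta = np.zeros(p)                               # Step 1: initialize
    for _ in range(n_outer):                         # Steps 2-4: outer IRLS loop
        eta = np.clip(X @ beta, -30, 30)
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = np.maximum(prob * (1.0 - prob), 1e-5)
        z = eta + (y - prob) / w                     # working response of eq. (8)
        beta_old = beta.copy()
        for _ in range(n_inner):                     # Step 3: cycle over j
            delta = 0.0
            for j in range(p):
                r = z - X @ beta + X[:, j] * beta[j]  # partial residual for beta_j
                den = np.sum(w * X[:, j] ** 2)
                u = np.sum(w * X[:, j] * r) / den
                new = half_threshold(u, lam / den)    # Step 3.2: threshold update
                delta = max(delta, abs(new - beta[j]))
                beta[j] = new
            if delta < tol:
                break
        if np.max(np.abs(beta - beta_old)) < tol:     # Step 4: outer convergence
            break
    return beta
```

On well-conditioned data with a few strong features, the routine recovers a sparse coefficient vector and classifies the training samples well.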

#### 4. Simulation and Application

##### 4.1. Analyses of Simulated Data

We evaluate the performance of four methods: the network-constrained logistic regression model with $L_1$ regularization ($L_1$ net), with $L_{1/2}$ regularization using the old thresholding value ($L_{1/2}$ net) and the enhanced thresholding value (enhanced $L_{1/2}$ net), and the Elastic net regularization approach (Elastic net). We first simulated a graph structure to mimic a gene regulatory network: the graph consists of 200 independent transcription factors (TFs), and each TF regulates 10 distinct genes, giving a total of $p = 2200$ variables. The training and independent test datasets each include $n = 100$ samples. Each TF and its regulated genes were generated from the standard normal distribution $N(0, 1)$. We set the correlation between each TF and each of its regulated genes to 0.75. The binary response $y \in \{0, 1\}$, which is associated with the matrix of TFs and their regulated genes, was generated according to the following formula and rule:

$$\Pr(y = 1 \mid x) = \frac{\exp\left(x^{T} \beta\right)}{1 + \exp\left(x^{T} \beta\right)}, \qquad y \sim \mathrm{Bernoulli}\left( \Pr(y = 1 \mid x) \right), \tag{9}$$

where, for Model 1,

$$\beta = \Bigl( 5, \underbrace{\tfrac{5}{\sqrt{10}}, \ldots, \tfrac{5}{\sqrt{10}}}_{10}, -5, \underbrace{-\tfrac{5}{\sqrt{10}}, \ldots, -\tfrac{5}{\sqrt{10}}}_{10}, 3, \underbrace{\tfrac{3}{\sqrt{10}}, \ldots, \tfrac{3}{\sqrt{10}}}_{10}, -3, \underbrace{-\tfrac{3}{\sqrt{10}}, \ldots, -\tfrac{3}{\sqrt{10}}}_{10}, 0, \ldots, 0 \Bigr)^{T},$$

so that only the first four TFs and their regulated genes are informative.
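The TF/regulated-gene design described above can be generated as follows; the mixing construction (a regulated gene equals $\rho$ times its TF plus independent noise scaled by $\sqrt{1-\rho^2}$) is our way of achieving the stated correlation of 0.75, and all names are illustrative.

```python
import numpy as np

def simulate_tf_network(n_tf=200, n_reg=10, n_samples=100, rho=0.75, seed=0):
    """Expression matrix for n_tf transcription factors, each followed by
    its n_reg regulated genes; every gene is marginally N(0, 1) and each
    regulated gene has correlation rho with its TF."""
    rng = np.random.default_rng(seed)
    p = n_tf * (n_reg + 1)                   # e.g. 200 * 11 = 2200 variables
    X = np.empty((n_samples, p))
    for t in range(n_tf):
        tf = rng.standard_normal(n_samples)
        block = t * (n_reg + 1)
        X[:, block] = tf                     # the TF itself
        for g in range(1, n_reg + 1):
            noise = rng.standard_normal(n_samples)
            X[:, block + g] = rho * tf + np.sqrt(1 - rho ** 2) * noise
    return X
```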

Model 2 was defined similarly to Model 1, except that we considered the case when a TF can have positive and negative effects on its regulated genes at the same time: for each informative TF, the signs of the coefficients of three of its ten regulated genes are flipped, giving

$$\beta = \Bigl( 5, \underbrace{-\tfrac{5}{\sqrt{10}}, \ldots}_{3}, \underbrace{\tfrac{5}{\sqrt{10}}, \ldots}_{7}, -5, \underbrace{\tfrac{5}{\sqrt{10}}, \ldots}_{3}, \underbrace{-\tfrac{5}{\sqrt{10}}, \ldots}_{7}, 3, \underbrace{-\tfrac{3}{\sqrt{10}}, \ldots}_{3}, \underbrace{\tfrac{3}{\sqrt{10}}, \ldots}_{7}, -3, \underbrace{\tfrac{3}{\sqrt{10}}, \ldots}_{3}, \underbrace{-\tfrac{3}{\sqrt{10}}, \ldots}_{7}, 0, \ldots, 0 \Bigr)^{T}. \tag{10}$$

In these two models, 10-fold cross validation on the training datasets was used to tune the regularization parameters of the enhanced $L_{1/2}$ net, the $L_{1/2}$ net, and the $L_1$ net. Both penalty parameters of the Elastic net, for the $L_1$ and ridge terms, were tuned by 10-fold cross validation over a two-dimensional parameter grid. We repeated the simulations 100 times and then computed the average misclassification error, sensitivity, and specificity for each net model on the test datasets.
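The k-fold tuning loop used for all four models can be sketched generically; `fit` and `predict` are placeholders for whichever net model is being tuned, and the function name is ours.

```python
import numpy as np

def cv_tune(X, y, lambdas, fit, predict, k=10, seed=0):
    """Return the regularization parameter with the smallest k-fold
    cross-validated misclassification error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for lam in lambdas:
        fold_err = []
        for i in range(k):
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(X[train], y[train], lam)            # fit on k-1 folds
            pred = predict(model, X[folds[i]])              # score held-out fold
            fold_err.append(np.mean(pred != y[folds[i]]))
        errs.append(np.mean(fold_err))
    return lambdas[int(np.argmin(errs))]
```

For the Elastic net, the same loop is simply run over a two-dimensional grid of $(\lambda_1, \lambda_2)$ pairs instead of a one-dimensional list.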

Table 1 summarizes the simulation results for each regularization net model. In general, our proposed enhanced $L_{1/2}$ net model achieved the smallest misclassification errors in Model 1 (9.22%) and Model 2 (10.76%) compared with the other regularization methods, including the old thresholding method, the $L_{1/2}$ net (9.85% for Model 1 and 10.83% for Model 2), the $L_1$ net (11.81% for Model 1 and 13.21% for Model 2), and the Elastic net (13.12% for Model 1 and 14.14% for Model 2). Meanwhile, the enhanced $L_{1/2}$ net attained the highest sensitivity in Model 1 (98.5%) and the best specificity in Model 2 (98.7%) among the compared methods. To sum up, the enhanced $L_{1/2}$ net outperforms the other three algorithms in terms of prediction accuracy, sensitivity, and specificity.