GEE-TGDR: A Longitudinal Feature Selection Algorithm and Its Application to lncRNA Expression Profiles for Psoriasis Patients Treated with Immune Therapies
With the fast evolution of high-throughput technology, longitudinal gene expression experiments have become affordable and increasingly common in biomedical fields. Generalized estimating equation (GEE) approach is a widely used statistical method for the analysis of longitudinal data. Feature selection is imperative in longitudinal omics data analysis. Among a variety of existing feature selection methods, an embedded method—threshold gradient descent regularization (TGDR)—stands out due to its excellent characteristics. An alignment of GEE with TGDR is a promising area for the purpose of identifying relevant markers that can explain the dynamic changes of outcomes across time. We proposed a new novel feature selection algorithm for longitudinal outcomes—GEE-TGDR. In the GEE-TGDR method, the corresponding quasilikelihood function of a GEE model is the objective function to be optimized, and the optimization and feature selection are accomplished by the TGDR method. Long noncoding RNAs (lncRNAs) are posttranscriptional and epigenetic regulators and have lower expression levels and are more tissue-specific compared with protein-coding genes. So far, the implication of lncRNAs in psoriasis remains largely unexplored and poorly understood even though some evidence in the literature supports that lncRNAs and psoriasis are highly associated. In this study, we applied the GEE-TGDR method to a lncRNA expression dataset that examined the response of psoriasis patients to immune treatments. As a result, a list including 10 relevant lncRNAs was identified with a predictive accuracy of 70% that is superior to the accuracies achieved by two competitive methods and meaningful biological interpretation. A widespread application of the GEE-TGDR method in omics longitudinal data analysis is anticipated.
With fast evolution of high-throughput technology, longitudinal omics experiments have become affordable and increasingly common in many biomedical fields for exploring dynamically or temporally changed biological systems or processes. Usually, the analysis strategies focus on analyzing individual time points separately. As many investigators have pointed out [1–4], a failure to incorporate information contained in the dependent structure of time course data results in inefficient estimation of the standard errors, leading to an inadequate statistical power. Especially in big omics studies, this problem stands out since the sample size of such data is usually small. Furthermore, an oversimplified consideration by combining the results from marginal analysis at individual time points tends to fail to detect a meaningful pattern of changes over time.
The generalized estimating equation (GEE) approach  is a well-established and widely used statistical method to analyze longitudinal data. GEE considers the first two marginal moments (i.e., mean and variance) of data and a working correlation matrix to model correlated responses, artfully avoiding the specification of full joint likelihood function. The appeal of GEE lies in that it yields consistent estimators for the parameters of interest, even if the working correlation structure is incorrectly specified. Naturally, GEE has been modified or extended to identify differentially expressed genes over time for high-throughput data. Such modifications and/or extensions are not simple due to the high dimensionality of omics data, although some efforts have been made [1, 2, 6].
Like its crosssectional counterpart, feature selection is imperative in the learning process for longitudinal omics data. Feature selection is aimed at eliminating irrelevant genes, avoiding overfitting, speeding up the learning process, and achieving a final model that is parsimonious (i.e., the number of selected genes is as least as possible). Consequently, a modification to GEE to analyze high-dimensional data necessitates the involvement of feature selection. In the literature, there are several such algorithms. For example, Wang et al.  used a smoothly clipped absolute deviation (SCAD) penalty term  which is a novel extension to the L1 penalty to equip the GEE models with feature selection capacity. The L1 penalty, also known as LASSO , forces genes with small estimated coefficients out of the final model, rendering a sparse solution by the means of which feature selection occurs. However, two subsequent works on this topic [1, 6] showed that this algorithm usually fails to converge when the number of covariates is much larger than the number of samples. This drawback is more apparent and fatal in longitudinal omics data, where the sample size is typically smaller than that of a crosssectional study.
Among a variety of existent feature selection algorithms, we have devoted dramatic efforts on the threshold gradient descent regularization (TGDR)  method (see the Methods section for its description). Previously, we had extended TGDR for classification task of multiple groups (>2) and for identification of subgroup-specific prognostic genes with a survival outcome [10–14]. By applying these TGDR extensions to different types of omics data including microarray, RNA sequencing, and mass spectrometry (MS) data, we have shown that TGDR and its respective extensions have many merits including easy-to-moderate programming intensity, good predictive performance, and biologically meaningful implications of the resulting signatures. In a recent work , we show that the TGDR algorithm can be regarded as an optimization strategy and that the final models given by TGDR have superior predictive performance and more meaningful biological interpretation than the LASSO models optimized by the coordinate descent method . Therefore, an integration of GEE with TGDR may overcome the drawbacks of existing approaches for the purpose of longitudinal feature selection.
Long noncoding RNAs (lncRNAs) are posttranscriptional and epigenetic regulators and have the characteristics of lower expression levels and more tissue-specific compared with protein-coding genes . Once being regarded as evolutionary junks, lncRNAs have been demonstrated to play essential roles in many complex diseases, especially in cancer . As pointed out by our previous study , psoriasis is an ideal model for examining the effects of targeted immune treatments given that it is well characterized by molecular profiles, displays low placebo effects, and possesses easily accessible diseased tissues. So far, the implication of lncRNAs in psoriasis remains largely unexplored and poorly understood. Among the limited research carried out to explore the roles of lncRNAs play in psoriasis; however, some encouraging results have turned up. For example, a very recent study  has shown that LOC285194 can serve as a sponger for miR-616 that regulates the expression of GATA3 though binding to its 3-untranslated region using Western blotting, quantitative real-time PCR, and dual-luciferase reporter assays. Specifically, the expression level of LOC285194 was lower in the affected skin of patient with psoriasis compared to the unaffected skin. Furthermore, Rakhshan et al.  showed that one SNP (i.e., rs12826786) of the HOX Transcript Antisense RNA (HOTAIR) is associated with a higher risk of developing psoriasis (TC+TT versus CC: , ). Therefore, we believe that the roles of lncRNAs play in psoriasis deserve to be explored deeply and widely.
In this article, we proposed a new feature selection algorithm, referred to as GEE-TGDR, specifically for longitudinal data mining and feature selection. In the GEE-TGDR method, the corresponding quasilikelihood function of a GEE model is the objective function to be optimized, while the optimization and feature selection are accomplished by the TGDR method. We applied this method to a longitudinal microarray gene expression data that is aimed at assessing the treatment efficacy of two immune therapies for psoriasis patients and identified the relevant lncRNAs that can predict the temporal changes of psoriasis area and severity index (PASI) scores that is utilized to determine if a patient with psoriasis responds to the treatments, with the objectives of revealing the underlying mechanisms of these two treatments from the perspective of lncRNAs.
Following the structures of a review by , the article is organized as follows. In Section 2, the details about the proposed GEE-TGDR method are given. In Section 3, the application of the GEE-TGDR method to psoriasis longitudinal lncRNA expression data and the analysis results are presented. Then, the biological relevance of identified lncRNA signature to psoriasis is discussed in detail. In Section 4, the limitations of the present study in addition to contributions and future work are discussed. Lastly, conclusions are given.
2. Materials and Methods
2.1. Experimental Data
The microarray dataset  used to characterize the proposed GEE-TGDR algorithm was in the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) under the accession number of GSE85034. There were 179 arrays in this experiment, including the gene expression profiles of 30 patients with moderate to severe psoriasis at the baseline nonlesion skins and baseline lesion skins and at weeks 1, 2, 4, and 16. Of the 30 patients, half were administrated with adalimumab (ADA), and the other half were treated with methotrexate (MTX). One patient on the ADA arm had no expression measurements of week 16 since his/her psoriasis area and severity index (PASI) score already had experienced a 75% decrement at the week 4. In original paper, a treatment response was based on a reduction of 75% in PASI score after week 12 or later. Longitudinal profiles of PASI scores (baseline lesion skins, at weeks 1, 2, and 4) were the outcomes of interest, and the lncRNA expression values of the baseline lesional skins serve as potential predictors to investigate if they are relevant to the PASI scores of psoriasis patients over time.
In this study, the preprocessed data were directly downloaded from the GEO database. No alternative preprocessing had been carried out. By matching the gene symbols of lncRNAs in the GENCODE (https://www.gencodegenes.org/) database (version 32) to those of genes annotated by the Illumina HumanHT-12 V 4.0 bead chips, 662 unique lncRNAs were identified and included in the downstream analysis.
2.2. Statistical Methods
In this paper, we conceive a new novel feature selection algorithm called GEE-TGDR specifically for selecting relevant features associated with the temporal changes of longitudinal outcomes, in which GEE is equipped with TGDR just as its name implies. We briefly described both GEE and TGDR methods before proceeding to the proposed integration. Here, to keep it the most relevant, we focused on the case of continuous outcomes.
2.2.1. Threshold Gradient Descent Regularization
For continuous outcomes, the TGDR algorithm is based on a linear model, where a response variable ( is the sample size) is modelled by a -dimensional vector of observed covariates (here, ) as . Here, represent the coefficients of covariates for the magnitudes of association between covariates and the outcome. For continuous outcomes, a normal distribution is usually assumed, and then, the corresponding likelihood function is used as a response function/an objective function in the TGDR algorithm. With some algebraic simplification, the response function can be written as
The TDGR algorithm started from that the were initially set at zero’s (corresponding to the null model). Using to denote a small positive increment (e.g., 0.01) in the gradient descent search, and for iteration ,(1)Upon current estimate , a negative gradient matrix with its component as are calculated as(2)Let represent the threshold vector of size at iteration and is an indicator (if the condition is true, this indicator returns 1; otherwise, its value is 0), then its component (for the gene) is(3)Update and (4)Repeat steps 1-3 for times. can be regarded as a tuning parameter, with a large value corresponding to a dense model (more nonzero coefficients) and a small value to a sparse model (less nonzero coefficients). The optimal value of is determined by crossvalidations (CVs).
In the TGDR method, no explicit penalty term is added to the objective function (i.e., response function). The regularization on coefficients (thus the selection of features) is made possible by introducing the threshold function in step 2, which determines if the gradient of a coefficient is large enough to descent or more precisely speaking to be updated. For more detailed description of the TGDR method, the works [9, 21] are referred.
2.2.2. Generalized Estimating Equation
In the longitudinal notation, the time point/measurement of the subject, a -dimensional vector of response variables (here, and ) and covariates (here, represents covariate) are observed. Thus denotes the vector of responses at different time points for subject , and is covariates for subject at time point .
In the GEE model, the first two marginal moments of are denoted by (the expectation of given ) and (the variance of ). Here, are the coefficients representing the magnitude of association between covariates and outcomes, with representing how attribute is associated with the value of outcome (meaning the outcome at time point ). Those are parameters of interest. Furthermore, the distribution of is assumed to belong to an exponential family with a canonical link function. Let and , then under a canonical link function . Here, is an working correlation matrix with as the finite dimensional parameter vector for correlations, which would be usually estimated by the residual-based moment method. In a GEE model, the quasilikelihood function can be written asFour structures are commonly used for the working correlation matrix —first-order autoregressive (AR1), exchangeable, unstructured, and independent structure.
The conventional TGDR method only deals with univariate outcomes. As far as longitudinal outcomes that are multivariate are concerned, the method needs to be extended.
In this study, we proposed to replace the likelihood function with the corresponding quasilikelihood function and to extend TGDR as GEE-TGDR. With denoting a small positive increment (e.g., 0.01) in gradient descent search, then at iteration,(1)Upon current estimate , a negative gradient matrix with its component as are calculated(2)Let represent the threshold vector of size for the time point () at iteration , then its component (for the gene) is(3)Update and (4)Calculate the residuals, viz, , and based on them, to estimate the nuisance parameters involved in (for different correlation structures, the parameters are different) and . Of note, since at different time points, we have different threshold function, the selected genes at different time points are expected to differ. In this way, the selection of critical time points is possible(5)Repeat steps 1-4 for times. is a tuning parameter, the same as in the conventional TGDR method. The optimal value of is also determined by CVs
In this study, we only developed the GEE-TGDR algorithm for continuous outcomes given in the motivated database; PASI scores which are continuous were the outcomes of interest, then the corresponding expectations of given are . Here, let represent the time points measured; then are for the gene expression profiles at time point for subject , and are for the corresponding coefficients of those gene expression values at time point . Figure 1 gives the graphical illustration of the GEE-TGDR algorithm for continuous longitudinal outcomes.
Since the outcomes were continuous, the mean squared error (MSE) statistic was calculated to evaluate the performance of resulting gene signatures. It is worth pointing out that for the outcomes of other types, an extension suitable for the underlying data type of GEE-TGDR algorithm is straightforward, with the corresponding quasilikelihood function serving as the objective function/response function.
2.3. Statistical Language
Statistical analysis was carried out in the R language version 3.6.1 (http://www.r-project.org).
3.1. Identified LncRNA Signatures
In this study, we propose to extend the feature selection algorithm TDGR to account for correlation structure of longitudinal data. This is accomplished by defining the objective function of TDGR as the corresponding quasilikelihood function, which as in GEE is specified based on the first two moments and a working correlation matrix. TDGR-GEE is described in the Materials and Methods section. In this section, we illustrate the application of the proposed method while looking for biomarkers that predict clinical resolution of psoriasis after being treated with two immune therapies.
Gene expression profiles of baseline lesional skin biopsies were obtained for 30 subjects followed up to 16 weeks after treatment with adalimumab and methotrexate. Clinical resolution at weeks 1, 2, and 4 was measured by PASI. In this example, we would like to identify a signature of genes whose baseline expression values correlate with changes in PASI, our continuous longitudinal outcome. WE used 662 lnRNA as covariates in the proposed GEE-TGDR model, under 4 different working correlation structures. The performance statistics (i.e., MSEs) and identified lncRNA genes are presented in Table 1.
In this application, the results obtained under working correlation structures exchangeable, unstructured, and independent barely differ, with similar sets of biomarkers leading to similar performance. This reflects a well-known robust characteristic of GEE, where when predictors are correctly given, the GEE estimates remain consistent even if the correlation structures are misspecified. Under the AR1 structure, GEE-TGDR identified only one lncRNA as being related to PASI scores, leading to an underfitting and inferior to the performance when compared to the other three correlation structures.
Due to the patient burden and budgetary restrictions, longitudinal omics data are usually very short and unevenly spaced. In this case, AR1 is not well suited and the unstructured correlation may be the most suitable structure, even though that this structure corresponds to a model with more nuisance parameters involved in the corresponding working correlation structure.
Crossvalidation (CV) results gave us an idea for the variability in the model performance in this regard; CV results indicated that all correlation structures but AR1 structure provided similar results, with both the exchangeable and independent structures having the least MSEs but a bigger variability and the unstructured structure having a larger MSE but the smaller variations.
Even though that at individual time points, the identified features varied substantially for the unstructured, exchangeable, and independent working correlation structures (Figure 2), the unions of lncRNA lists across time are essentially the same, including 9 lncRNAs identified by all these three structures and one lncRNA selected by the independent structure alone (Figure 3).
3.2. Comparison with Competing Methods
In order to further characterize the GEE-TGDR method, a comparison with two competing methods was made. One competing method under consideration is the GEE-screening method  in which a GEE model was fit with the PASI scores over time as the outcome and the expression values of a certain lncRNA as the covariate. The GEE-screening method filtered genes one by one. Of note, in the GEE-TGDR method and the GEE-screening model, we only considered the unstructured working correlation structure. In the other competing method, namely, linear mixed model-based screening method, a GEE model was replaced by a linear mixed model (the outcome and the covariate are the same as those in the GEE-screening model), and the intercept term was regarded as a random effect. The lncRNAs with corresponding values of the coefficients <0.05 were selected as being relevant in both competing methods. Then, a support vector machine model was fit using the response status as the outcome and the identified lncRNAs by a specific method as predictors. According to the 10-fold crossvalidation results (to estimate the predictive performance of each method), the GEE-TGDR method achieved the best predictive accuracy (Table 2). Of note, even GEE-TGDR has the best performance compared to the other two competitive methods, its predictive accuracy is estimated as 70%, which is far from 100%, leaving a large space to be improved.
3.3. Biological Relevance
In order to gain biological insight-identified biomarkers, we evaluated the relevance to psoriasis of the 10 identified lncRNA using disease confidence scores, where a high score represents a solid support by the literature according to the GeneCards database. None of the 10 lncRNAs were directly related to psoriasis while 5 lncRNAs, listed in a descending order for the confidence scores and thus descending support by the literature according to the GeneCards database, MIR205, XIST, SNHG5, LINC01139, and SDHAP2 were associated with immunity.
Little meaningful information was extracted from currently annotated lncRNA databases, no surprisingly since that psoriasis remains largely unexplored from the perspective of lncRNAs. We thus focused on studying the mRNAs correlated or targeted by these lncRNAs. Specifically, we identified the genes whose baseline lesional expression was strongly correlated with at least one of the 10 lnRNA (, 5) and identified 225 mRNAs genes. According to the GeneCards database , approximately 30% of these mRNAs (64) were directly related to psoriasis, most notably IL10, FABP5, KRT16, CCR6, IL18, STAT3, GATA3, and SERPINB3, providing some validation of the lncRNA biomarkers identified by the GEE-TGDR method. In contrast, among the 29 target mRNAs identified by the lncRNA Disease 2.0 database  as targeted by the 10 lnRNA panel (all of which were identified by the correlation approach), GeneCards claimed that CCR10, AOC3, UBB, and WNK4 were directly related to psoriasis, but only CCR10 had a large confidence score for its relevancy to psoriasis. Of note, among the 10 lncRNAs, only RAMP2-AS1, PAX1P1-AS1, TMEM99, and LIN01018 have many correlated mRNAs, but the other five have few or no correlated mRNAs at all.
3.4. Enriched Pathways by Target mRNAs
A gene-set overrepresentation analysis was carried out on the 225 mRNAs identified as targeted by the 10 lnRNA biomarker panel using the STRING software  on KEGG and GO collections. About 346 enriched biological process (BP) terms, 23 molecular function (MF) terms, and 21 cellular component (CC) terms were identified in the GO collection reflecting the immune pathophysiology of the disease. The top 3 enriched KEGG pathways  reflected the inflammatory processes not only identifying inflammatory bowel diseases () and cytokine-cytokine receptor interaction () but also zeroing on the hallmark pathway in psoriasis: Th17 differentiation ().
Lastly, among the 225 mRNA, we selected the top 10 in terms of psoriasis-relevance () and constructed a lncRNA-mRNA interaction network, visualized by Cytoscape software  (Figure 4). We observed that the target mRNAs are highly connected, with IL10 serving as a hub gene. It is well-known that IL10 is an immunosuppressive cytokine and enables to maintain immunological homeostasis . Based on this, we anticipate that identified lncRNAs may regulate the expression of important cytokines such as IL10 and warrant further investigation.
4.1. Limitations and Future Work
At current stage, the GEE-TGDR method has several limitations. First, no grouping structure is taken into account, and thus, the GEE-TGDR method belongs to the conventional embedded feature selection category. So far, accumulated studies [28–31] have shown that a pathway-based method that considers grouping information is superior to its gene-based counterpart in which grouping information is ignored. Thus, how to extend the proposed GEE-TGDR method to account for correlations among genes is a research avenue we will pursue in the near future.
Second, the TGDR method is much slower than the coordinate descent (CD)  method as shown by our previous study . Given that the GEE-TGDR extension has the TGDR method as an optimization strategy, its speed of convergence is expected to be very slow. A method that combines the merits of these two algorithms together is definitely in demand. Alternatively, a sine cosine algorithm  may be integrated into the gradient descent step for a faster updating and a better tuning of hyperparameters (tuning parameters). Furthermore, the step increment is fixed at a constant value in the current version. In the future, this parameter will be modified to update along the iterations, as in the Adam algorithm, which may boost the computing efficiency and avoid being stuck in a local minimum value as well.
Third, the GEE-TGDR method only takes time-invariant covariates in its current version. For longitudinal gene expression profiles, a summary score would be utilized to summarize each gene’s expression values over time as one overall value. Consequently, covariates became time-invariant again. For example, the mean values of lncRNA expression profiles at baseline and week 1 can be used to represent the corresponding lncRNAs and then as the covariates to investigate they are associated with PASI scores at week 1, week 2, and week 4 or the change of PASI scores at those time points from the baseline levels. On the other hand, the GEE-TGDR method can be certainly extended to handle time-varying covariates, which can examine the impact of dynamic changes in gene expression values on the outcomes of interest and thus facilitate a timely adjustment on treatment strategies accordingly. Lastly, right now, the only type of outcomes is continuous; yet certainly, it can be extended to handle outcomes of other types, with the corresponding quasilikelihood function acting as the objective function.
In this study, we propose a new feature selection algorithm that is capable of analyzing longitudinal outcomes and investigating the associations between gene expression profiles and the temporal changes of outcomes. In the psoriasis application, overfitting might be possible on the basis of the large discrepancy in MSE statistics between the whole training set and the crossvalidations. Even worse but more realistic, overfitting and underfitting may accompany each other to exist in a feature selection process. Since for real-world applications, the true relevant genes are unknown so the biological relevance is usually resorted to abstract some insight about the appropriation of identified gene lists. Nevertheless, for psoriasis and the underlying mechanism of immune treatments to combat this disease, little has been investigated from the perspective of lncRNAs to mine such relevant information. To the best of our knowledge, our work here is one of first efforts to unveil the mechanisms of psoriasis and its immune treatments using lncRNA expression profiles and a feature selection method specific for longitudinal data.
After the limitations of the GEE-TGDR method are addressed in the near future, we believe that a lncRNA signature will be harvested to tell precisely which patients would respond to a specific treatment from those who would not and thus facilitating personalized regimens or at least complementing other molecular markers for precise treatment strategies.
In this study, we proposed a novel feature selection algorithm—GEE-TGDR—capable of handling longitudinal outcomes and identifying relevant genes associated with the temporal changes of such outcomes.
Our future work will focus on eliminating the limitations of the GEE-TGDR method. In addition, extensions of the current procedure to analyze other types of outcomes rather than continuous ones and a more efficient and faster implementation of updating coefficients are at the top of this list.
It is worth mentioning that besides dealing with longitudinal clinical outcomes, the GEE-TGDR can be adopted to inference the associations between lncRNAs and mRNAs and thus construct lncRNA-mRNA interaction networks. For example, using well-known cancer-related mRNAs as outcomes, the lncRNAs that may potentially regulate/target those mRNAs could be found with the aid of the GEE-TGDR method, which is also one of our future works. Therefore, we anticipate a widespread application of the GEE-TGDR method in omics data analysis.
Preprocessed gene expression data (accession no.: GSE85034) along with patient’s clinical information were downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/).
Conflicts of Interest
No competing interests have been declared.
ST conceived and designed the study. ST and CW analyzed the data. CW and ST interpreted data analysis and results. ST and CW wrote the paper. All authors reviewed and approved the final manuscript.
We thank Dr. Danna Gilbreath for the English editing. This study was supported by a fund (No. 31401123) from the National Natural Science Foundation of China. Dr. Suarez-Farinas was also supported by the Irma T. Hirschl/Monique Weill-Coulier Research Award.
J. H. Friedman and B. E. Popescu, Gradient directed regularization for linear regression and classification, Technical Report, Statistics Department, Stanford University, 2003.
S. Tian, H. H. Chang, C. Wang, J. Jiang, X. Wang, and J. Niu, “Multi-TGDR, a multi-class regularization method, identifies the metabolic profiles of hepatocellular carcinoma and cirrhosis infected with hepatitis B or hepatitis C virus,” BMC Bioinformatics, vol. 15, no. 1, p. 97, 2014.View at: Publisher Site | Google Scholar
J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of Statistical Software, vol. 33, no. 1, pp. 1–22, 2010.View at: Google Scholar
J. Correa, J. Kim, S. Tian, L. E. Tomalin, and J. G. Krueger, “Shrinking the psoriasis assessment gap: early gene-expression profiling accurately predicts response to long-term treatment,” Journal of Investigative Dermatology, vol. 137, no. 2, pp. 305–312, 2017.View at: Publisher Site | Google Scholar