Objective. Precision medicine with molecular profiles has revolutionized the management of lung cancer contributing to improved prognosis. Herein, we aimed to uncover the gene expression profiling of transcription factors (TFs) in lung cancer as well as to develop a TF-based genomic model. Methods. We retrospectively curated lung cancer patients from public databases. Through comparing mRNA expression profiling between lung cancer and normal specimens, specific TFs were determined. Thereafter, a TF genomic model was developed with univariate Cox regression and stepwise multivariable Cox analyses, which was verified through the GSE72094 dataset. Gene set enrichment analyses (GSEA) were presented. Downstream targets of TFs were predicted with ChEA, JASPAR, and MotifMap projects, and their biological significance was investigated through the clusterProfiler algorithm. Results. In the TCGA cohort, we proposed a TF-based genomic model, comprised of SATB2, HLF, and NPAS2. Lung cancer individuals were remarkably stratified into high- and low-risk groups. Survival analyses uncovered that high-risk populations presented unfavorable survival outcomes. ROCs confirmed the excellent predictive potency in patients’ prognosis. Additionally, this model was an independent prognostic indicator in accordance with multivariate analyses. The clinical implication of the model was well verified in an independent dataset. High risk score was in relation to carcinogenic pathways. Downstream targets were characterized by immune and carcinogenic activation. Conclusion. The proposed TF genomic model acts as a promising marker for estimation of lung cancer patients’ outcomes. Prospective research is required for testing the clinical utility of the model in individualized management of lung cancer.

1. Introduction

Lung cancer occupies the major cause of cancer incidence and mortality across the globe, with estimated 2.1 million newly diagnosed cases as well as 18.4% of all cancer-related deaths in 2018 [1]. It is primarily classified into small cell lung cancer (SCLC) as well as non-small cell lung cancer (NSCLC). Among them, NSCLC occupies around 85% of lung cancer cases, with an undesirable five-year survival rate below 16%. It is comprised of lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), as well as large cell carcinoma histological subtypes [2]. Chemotherapy represents the major therapeutic option against lung cancer, and platinum-based dual treatment is the standard treatment against advanced patients [3]. Nevertheless, chemotherapeutic resistance remains challenging, impeding much therapeutic success. Though tyrosine kinase-based targeted therapy interferes with the oncogenic pathway in NSCLC, acquired resistance evolves contributing to rapid disease recurrence and progression [4]. Immunotherapy like PD-1 and PD-L1 inhibitors has displayed superiority to traditional chemotherapy, which has become the standard for treating NSCLC [57]. Nevertheless, in past clinical trials, the objective response rate of immunotherapy remains low [8].

Transcription factors (TFs), DNA-binding proteins, recognize the promoter sequences of genes as well as subsequently guide gene expression [9]. Recent studies demonstrate that patterns of TF programs as well as immune pathway activation characterize four main SCLC subtypes with diverse treatment vulnerabilities [10]. Additionally, system-epigenomics inference of TF activities indicates inactivated aryl-hydrocarbon-receptor as an important event during lung carcinogenesis [11]. Experimental evidences suggest the lung carcinogenic roles of TFs. For instance, Oct4, a critical stemness TF, controls expression of lncRNAs NEAT1 and MALAT1 for promoting lung carcinogenesis [12]. TF NFIB is in relation to tumor aggressiveness in LUAD [13]. TF OCT1 triggers the Warburg effect through upregulating hexokinase 2 in NSCLC [14]. TF YY1 mediates lung cancer progression through activation of lncRNA-PVT1 [15]. Hence, characterization of the genomic profiling of TFs may assist in offering an in-depth understanding of the precise treatment of lung cancer. Herein, we developed a TF genomic model for lung cancer outcomes and uncovered the relevant signaling pathways.

2. Materials and Methods

2.1. Public mRNA Expression Cohorts

The Cancer Genome Atlas- (TCGA-) LUAD and TCGA-LUSC datasets containing mRNA expression profiling, survival and clinicopathologic data () were curated from the Genomic Data Commons tool (https://portal.gdc.cancer.gov/) [16], as a training set. Fragments per kilobase million (FPKM) were converted to transcripts per million (TPM). Gene expression arrays of 442 LUAD patients were harvested from GSE72094 [17] via the Gene Expression Omnibus (GEO) repository (https://www.ncbi.nlm.nih.gov/gds/), which acted as a testing set. This dataset was based on the GPL15048 platform. Probe IDs were transformed into gene symbols in accordance with platform annotation file. The standardized expression value was logarithmically converted as well as scaled. Thereafter, the mean expression of genes with various probes was utilized as their expression value. The corresponding patient’s clinical data in TCGA and GSE72094 cohorts are displayed in Supplementary table 1. The gene sets of TFs were harvested from published literature (Supplementary table 2) [9].

2.2. Analyses of Lung Cancer-Specific TFs

DESeq2 package [18] was adopted for differential analyses of sequencing data. TFs with and adjusted were set as thresholds of lung cancer-specific TFs. Volcano plots and heat map were conducted for visualizing the distribution of lung cancer-specific TFs between lung cancer and normal specimens.

2.3. Construction of a TF Genomic Model

Univariate Cox regression models were conducted for evaluation of the interactions of lung cancer-specific TFs with patients’ prognosis in the TCGA cohort. Thereafter, stepwise multivariate Cox regression analyses were utilized for shrinking the variables as well as screening the most predictive biomarkers. The candidate TF-specific TFs were utilized for creating a risk scoring formula that was determined through a linear integration of mRNA expression as well as matched regression coefficients derived from the stepwise multivariate Cox analyses. Through ranking of the risk scoring system, patients were stratified into high- and low-risk subpopulations. Kaplan–Meier methods were presented for estimating the survival outcomes between high- and low-risk subpopulations, and log-rank tests were adopted for the calculation of the discrepancy in prognosis between subpopulations. The receiver operator characteristic (ROC) curve was depicted through the pROC package [19]. The prognostic potency of the TF genomic model was externally verified in the GSE72094 cohort.

2.4. Uni- and Multivariate Cox Regression Analyses

Through univariate Cox regression analyses, the interactions of clinicopathological indicators (age, gender, and pathological staging) and TF genomic model with lung cancer prognosis were evaluated across lung cancer individuals. Thereafter, multivariate Cox regression models were established for uncovering the independent prognostic indicators.

2.5. Gene Set Enrichment Analyses (GESA)

Through the Java program (http://software.broadinstitute.org/gsea/index.jsp), GSEA [20] was carried out on the basis of the “c2.cp.kegg.v7.2.symbols” gene set curated from the Molecular Signatures Database (MSigDB) project [21]. Cytoscape software [22] was adopted for visualizing our GSEA results. The interactions of specific gene sets with risk score for all genes were investigated, and positively and negatively correlated ones to the enrichment score were calculated. In total, 1000 permutations were conducted, and pathways with false discovery rate were regarded as having prominent enrichment.

2.6. Prediction of Downstream Targets of TFs

Three web-based interactive applications, containing ChIP Enrichment Analysis (ChEA; http://amp.pharm.mssm.edu/lib/chea.jsp) [23], JASPAR (http://jaspar.genereg.net), and MotifMap (http://motifmap.igb.uci.edu/) databases, were adopted for estimating the downstream targets of TFs. The ChEA project offers data on eukaryotic TFs, consensus binding sequences, and experimentally validated binding sites as well as modulated genes [24]. The JASPAR project represents an open-access project of curated, nonredundant TF-binding profiling stored as a position frequency matrix for TFs among diverse species in six taxonomic populations [25]. MotifMap provides integrative genome-wide maps of regulated motif sites for model species [26].

2.7. Functional Enrichment Analyses

Gene Ontology (GO) as well as Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were carried out for determining the biological functions of downstream targets of TFs through the clusterProfiler package [27]. GO depicted three complementary biological concepts containing biological process (BP) and molecular function (MF) as well as cellular component (CC). Meanwhile, KEGG may assist uncover high-level functions and utilities of biological systems.

2.8. Statistical Analyses

All statistical analyses were managed through R software (version 3.6.3). Student’s -test or Wilcoxon test was adopted for statistical comparisons, with as statistical significance.

3. Results

3.1. Identification of Lung Cancer-Specific TFs

This study retrospectively curated mRNA expression profiling, survival, and clinicopathologic data of lung cancer from the TCGA project. With the and adjusted thresholds, 320 upregulated TFs as well as 103 downregulated TFs were determined in lung cancer in comparison with normal tissues (Figures 1(a) and 1(b); Supplementary table 3). The above deregulated TFs were regarded as lung cancer-specific TFs.

3.2. Determination of Prognostic Lung Cancer-Specific TFs

On the basis of mRNA expression profiling and survival information of lung cancer patients from the TCGA cohort, we conducted the interactions of lung cancer-specific TFs with clinical prognosis through univariate Cox regression models. In accordance with , 13 lung cancer-specific TFs displayed remarkable associations with lung cancer outcomes (Figure 2(a); Table 1). Among them, GFI1B, HLF, and ZNF750 acted as protective factors of lung cancer prognosis. Meanwhile, CTCFL, TFAP2A, CENPA, VAX1, E2F7, FOXM1, SATB2, ARNTL2, NPAS2, and HMGA1 acted as risk factors of lung cancer prognosis.

3.3. Development of a TF Genomic Model for Lung Cancer Prognosis

Further multivariate Cox regression analyses displayed that SATB2, HLF, and NPAS2 were independently predictive of lung cancer prognosis. In accordance with the regression coefficient derived from multivariate Cox regression models and expression value of SATB2, HLF, and NPAS2, a TF genomic model was conducted for lung cancer prognosis. The risk score of each lung cancer patient in the TCGA cohort was calculated in line with the following formula: . Thereafter, lung cancer patients in the TCGA cohort were separated into two groups following the median risk score. Patients with >median risk score were classified into the high-risk group, while those with <median risk score were classified into the low-risk group (Figure 2(b)). Additionally, Figure 2(c) displayed the discrepancy in survival status between two groups. With the increase in risk score, the number of dead patients was gradually elevated. In comparison to the low-risk group, HLF presented reduced expression while SATB2 and NPAS2 possessed upregulated expression in the high-risk group (Figure 2(d)). Survival analyses uncovered that the high-risk group displayed remarkable survival outcomes in comparison to the low-risk group (Figure 2(e)). Further ROC analyses were conducted for verifying the predictive potency of the TF genomic model in lung cancer outcomes. The area under the curve (AUC) was 0.676, indicative of the convincing predictive potency (Figure 2(f)).

3.4. External Verification of the TF Genomic Model

The clinical application potential of the TF genomic model was externally verified in the GSE72094 cohort. In accordance with the same formula, the risk score of each lung cancer patient was quantified. Thereafter, we clustered patients into high- and low-risk groups (Figure 3(a)). As expected, more dead patients were noted in the high-risk group (Figure 3(b)). There was enhanced expression of HLF as well as weakened expression of SATB2 and NPAS2 in the low-risk group (Figure 3(c)). Additionally, the high risk score presented more undesirable survival outcomes (Figure 3(d)). ROC curves confirmed the favorable predictive potency of the TF genomic model in lung cancer outcomes (; Figure 3(e)).

3.5. Analyses and Verification of the TF Genomic Model as an Independent Prognostic Indicator of Lung Cancer

In the TCGA cohort, uni- and multivariate Cox regression models were presented for investigation of the interaction of conventional clinicopathological indicators and risk score with lung cancer outcomes. As a result, staging as well as risk score was independently predictive of patients’ prognosis (Figures 4(a) and 4(b)). The predictive independency was externally verified in the GSE72094 cohort. Our data confirmed that gender and staging as well as risk score were independent risk factors of lung cancer outcomes (Figures 4(c) and 4(d)).

3.6. Analyses of the TF Genomic Model Associated with Activation of Signaling Pathways

GSEA was conducted for investigating the activated signaling pathways correlated to the TF genomic model. In accordance with , ECM receptor interaction, small cell lung cancer, axon guidance, chronic myeloid leukemia, and adherens junction as well as regulation of actin cytoskeleton were remarkably activated in high-risk specimens (Figures 5(a)5(f)). Additionally, clusters of relevant genes linked to the high risk score were determined, containing genes relating to thyroid cancer, type diabetes mellitus, valine leucine isoleucine, and ECM receptor interaction as well as alpha linolenic acid (Figure 6).

3.7. Determination of Downstream Targets of TFs: SATB2, HLF, and NPAS2

Through integration analyses of the ChEA, JASPAR, and MotifMap databases, we determined 307 downstream targets of HLF, 4 downstream targets (CRY2, PER1, PER2, and CRY1) of NPAS2, and 2 downstream targets (UPF3B and TP63) of SATB2 (Supplementary table 4). GO enrichment analyses uncovered that the rhythmic process was remarkably enriched by downstream targets (Figure 7(a); Table 2). Additionally, we noted that the above downstream targets presented remarkable interactions with circadian rhythm and immune activation pathways (like allograft rejection, T cell receptor signaling pathway, PD-L1 expression and PD-1 checkpoint pathway in cancer, inflammatory mediator regulation of TRP channels, and inflammatory bowel disease) as well as carcinogenic pathways (like transcriptional misregulation in cancer, FoxO signaling pathway, non-small cell lung cancer, MAPK signaling pathway, AMPK signaling pathway, Rap1 signaling pathway, and apoptosis; Figure 7(b) and Table 3).

4. Discussion

High-throughput sequencing technologies may assist in determining more biomarkers that present close interactions with patients’ outcomes at the genetic levels. Herein, we proposed a 3-TF genomic model linked to lung cancer progression through conducting reliable bioinformatic analyses. Additionally, the 3-TF genomic model acted as an independent molecular marker for prediction of lung cancer patients’ survival outcomes. Our findings might be of great significance to elucidate the underlying biological mechanisms of lung carcinogenesis as well as to develop innovative prognostic indicators and molecular therapeutic targets.

Previously, a 7-TF gene model has been established for prediction of colon adenocarcinoma outcomes [28]. A 9-TF signature can be predictive of breast cancer recurrence for optimizing clinical management [29]. Additionally, Yang et al. proposed a TF-based prognostic signature that reliably predicts endometrial cancer individuals’ survival outcomes [30]. Here, through univariate analyses followed by stepwise multivariate Cox regression analyses, we developed a TF genomic model for lung cancer outcomes. In accordance with the formula, a TF genomic model was conducted, containing SATB2, HLF, and NPAS2. ROC curves confirmed the reliability of this model in prediction of patients’ prognosis. Following integration of clinicopathological indicators, the model was independently predictive of clinical prognosis. To our knowledge, the 3-TF genomic model’s potential as a predictor has not been proposed in previous evidences, though research might offer a novel guide of lung cancer outcomes. In routine clinical practice, pathological staging acts as an important survival determinant concerning oncologists as well as lung cancer individuals. Nevertheless, diverse patients’ survival outcomes with the same staging are indicative of the deficient pathological staging system for prognosis on the basis of the anatomic scope and staging system of the disease, in which pathological changes reflect the biological heterogeneity within lung cancer. The issues influence the predictive potency of the conventional system for lung cancer individuals. Our GSEA uncovered that the 7-TF genomic model presented remarkable interactions with ECM receptor interaction, SCLC, axon guidance, chronic myeloid leukemia, and adherens junction as well as regulation of actin cytoskeleton, indicative of the interactions of the 7-TF genomic model with lung carcinogenesis.

Previous evidences suggest the biological significance of SATB2, HLF, and NPAS2 within the TF-based genomic model in lung carcinogenesis. For instance, SATB2 reduces NSCLC invasiveness through modulation of EMT-relevant proteins as well as histone methylation of G9a [31]. Further, hypoxic tumor-derived exosomal miR-31-5p triggers LUAD metastases through negative modulation of SATB2-reversed EMT as well as activation of the MEK/ERK pathway [32]. Downregulated HLF facilitates multiple-organ distant metastases of NSCLC via the PPAR/NF-κb pathway NSCLC [33]. Reduced HLF expression is predictive of undesirable clinical outcomes of LUAD [34]. NPAS2 polymorphism independently predicts NSCLC patients’ prognosis [35]. Further analyses determined the downstream targets of TFs: SATB2, HLF, and NPAS2. Our further biological function analyses demonstrated the interactions of these downstream targets with circadian rhythm and immune activation pathways (like allograft rejection, T cell receptor signaling pathway, PD-L1 expression and PD-1 checkpoint pathway in cancer, inflammatory mediator regulation of TRP channels, and inflammatory bowel disease) and carcinogenic pathways (like transcriptional misregulation in cancer, FoxO signaling pathway, NSCLC, MAPK signaling pathway, AMPK signaling pathway, Rap1 signaling pathway, and apoptosis), indicating that SATB2, HLF, and NPAS2 modulated the above pathways to participate in lung carcinogenesis.

Nevertheless, a few limitations of this study need to be pointed out. Our findings were primarily on the basis of integrated bioinformatic analyses. However, sufficient experimental verification of our results remains lacking. In future studies, we will conduct in-depth in vitro and in vivo experiments to verify our conclusion. Because all patients were retrospectively harvested, the underlying bias linked to unbalanced clinicopathological characteristics cannot be ignored. Additionally, the reliability of the 3-TF genomic model for predicting survival outcomes of lung cancer individuals remains a key issue in the clinic. In particular, the guideline for the clinical application of our 3-TF genomic model requires an in-depth definition in our future studies.

5. Conclusion

Collectively, our findings proposed and verified a 3-TF genomic model (SATB2, HLF, and NPAS2) for prediction of lung cancer outcomes. This genomic model acted as an independent indicator as well as a complement prognostic factor for clinicopathological features of lung cancer.


SCLC:Small cell lung cancer
NSCLC:Non-small cell lung cancer
LUAD:Lung adenocarcinoma
LUSC:Lung squamous cell carcinoma
TFs:Transcription factors
TCGA:The Cancer Genome Atlas
FPKM:Fragments per kilobase million
TPM:Transcripts per million
GEO:Gene Expression Omnibus
ROC:Receiver operator characteristic
GESA:Gene set enrichment analyses
MSigDB:Molecular Signatures Database
FDR:False discovery rate
ChEA:ChIP enrichment analysis
GO:Gene Ontology
KEGG:Kyoto Encyclopedia of Genes and Genomes
BP:Biological process
MF:Molecular function
CC:Cellular component.

Data Availability

The data used to support the findings of this study are included within the supplementary information files.

Conflicts of Interest

The authors declare no conflicts of interest.

Supplementary Materials

Supplementary 1. Supplementary Table 1: clinical features of lung cancer patients in TCGA cohort.

Supplementary 2. Supplementary Table 2: the gene sets of TFs.

Supplementary 3. Supplementary Table 3: the detailed information of lung cancer-specific TFs.

Supplementary 4. Supplementary Table 4: determination of downstream targets of TFs: SATB2, HLF, and NPAS2.