Research Article | Open Access

Volume 2019 |Article ID 6136245 | https://doi.org/10.1155/2019/6136245

Yue Hu, Jin-Xing Liu, Ying-Lian Gao, Sheng-Jun Li, Juan Wang, "Differentially Expressed Genes Extracted by the Tensor Robust Principal Component Analysis (TRPCA) Method", Complexity, vol. 2019, Article ID 6136245, 13 pages, 2019. https://doi.org/10.1155/2019/6136245

Differentially Expressed Genes Extracted by the Tensor Robust Principal Component Analysis (TRPCA) Method

Academic Editor: Danilo Comminiello
Received 31 Dec 2018
Revised 27 Apr 2019
Accepted 15 May 2019
Published 02 Jun 2019

Abstract

In the big data era, sequencing technology has produced a large number of biological sequencing data. Different views of the cancer genome data provide sufficient complementary information to explore genetic activity. The identification of differentially expressed genes from multiview cancer gene data is of great importance in cancer diagnosis and treatment. In this paper, we propose a novel method for identifying differentially expressed genes based on tensor robust principal component analysis (TRPCA), which extends the matrix method to the processing of multiway data. To identify differentially expressed genes, the plan is carried out as follows. First, multiview data containing cancer gene expression data from different sources are prepared. Second, the original tensor is decomposed into a sum of a low-rank tensor and a sparse tensor using TRPCA. Third, the differentially expressed genes are considered to be sparse perturbed signals and then identified based on the sparse tensor. Fourth, the differentially expressed genes are evaluated using Gene Ontology and Gene Cards tools. The validity of the TRPCA method was tested using two sets of multiview data. The experimental results showed that our method is superior to the representative methods in efficiency and accuracy aspects.

1. Introduction

In the rapid development of sequencing technology, large amounts of gene expression data have been generated. Cancer (malignant tumor) is the common type of disease in this era and poses a serious threat to human health. Researchers in molecular biology have shown that the human body carries more than 20000 different genes, but few are associated with biological processes. Therefore, the study of gene expression data has become an important trend. The analysis of expression data can help explore the origin of life and understand differences between individuals. Genes are common determinants of in vivo cancer or tumor onset, which are identified as abnormally expressed. Therefore, on the one hand, identifying differentially expressed genes can help people explore the association between different diseases. On the other hand, this information can provide a theoretical basis for medical studies and clinical diagnosis. Techniques to screen differentially expressed genes from gene expression data have gained much attention [1]. These data consist of tens of thousands of genes and hundreds of samples. It is generally known that analysis of gene expression data is a typical high-dimension-small-size-sample (HD3S) problem. Many researchers have found that only a small portion of genes play key roles in biological processes [2]. Therefore, it is a very great challenge to identify genes related to diseases.

The selection of differentially expressed genes, or feature selection, is the identification of a small subset of informative features from all available features. Genomic data are usually contaminated by noise, and thus the identification of differentially expressed genes must satisfy the optimization criteria of the system [3]. This process requires identifying disease-related genes while reducing noise, which is an HD3S problem. Moreover, the data lie on a low-dimensional manifold embedded in a high-dimensional space, so dimension reduction has become an indispensable task [4]. Although many classic methods have been effectively applied to genomic data, there is still room for improvement. Principal component analysis (PCA) [5] is the most popular method for linear dimension reduction and data analysis. PCA is efficient and effective when the data are only slightly corrupted by small amounts of noise. An important issue, however, is that PCA is vulnerable to severely corrupted data and outliers, which are ubiquitous in real data. In addition, the low-rank representation (LRR) method is also very popular for feature selection. It decomposes the original matrix into the sum of a low-rank matrix and a sparse matrix [6]. Beyond feature selection, the LRR method is also widely used in other areas such as video background separation [7], subspace segmentation [8], image clustering [9], and image denoising [10]. Although the experimental results of the LRR method are good, it still has some disadvantages. To address the above problems, many methods have been proposed to reduce the complexity of the data. The robust principal component analysis (RPCA) [11] method is the first polynomial-time algorithm with strong performance guarantees. Let the given data matrix be $\mathbf{X} \in \mathbb{R}^{n_1 \times n_2}$, which can be decomposed into the sum of matrices $\mathbf{L}$ and $\mathbf{E}$, where $\mathbf{E}$ is a sparse matrix and $\mathbf{L}$ is a low-rank matrix. However, RPCA cannot consider the internal structure of gene expression data, thus overlooking some important information.

The disadvantage of RPCA is that it is a single-view model and can only handle second-order (matrix) data. In the real world, multidimensional data, also known as tensors, are ubiquitous. For example, a color image is three-dimensional data containing rows, columns, and color channels, and a grayscale video contains two spatial dimensions and one temporal dimension. A third-order tensor can also represent the status of a social network: rows and columns represent different users, and the third dimension represents the social platforms connecting them, such as Twitter, Facebook, and WeChat. To use the RPCA method, multiway data must first be converted into matrix form. However, this operation results in a loss of key structural information and thus in poor experimental performance. To avoid this problem, many researchers have proposed tensor methods that handle multiway data directly and exploit the relationships within the internal structure of tensor data.

To overcome the limitations of the matrix dimension, Lu et al. proposed a tensor robust principal component analysis (TRPCA) method, which extends the known RPCA method to the tensor case [12]. This method has been proven to be effective in many areas, such as image denoising, noise removal, and video separation monitoring [13]. Work [14] proposed a Bayesian robust tensor factorization (BRTF) generation model, which aims to capture global information and local sparse tensor information. The multiview gene expression data are similar to the components in the above fields. Their sparse disturbance signals are similar to the noise in the image.

Benefiting from the development of the big data era, multiple attributes of an object can be easily obtained. For example, an object can contain the color view and the shape view; in camera views of multiple angles of a single object, each camera’s characteristics are independent of each other; and the same gene has different levels of gene expression in different cancers. Multiview data contain more information than single-view data for better performance, rather than relying on single-view data [15]. Therefore, the emergence of multiview data has led to the emergence of multiview models. Most available feature selection methods are single-view models, and multiview models are few and far between.

To overcome the above problems, we apply the TRPCA method to multiview data. Although the TRPCA method has been effectively applied to image recovery and the removal of random noise from face images, its validity for gene expression data requires confirmation. Gene expression data lie close to some low-dimensional subspaces, so it is natural to approximate the nondifferentially expressed gene data by a low-rank component. Although the human body contains tens of thousands of genes, only a few are in fact related to biological processes. Therefore, the differentially expressed genes are treated as sparse perturbation signals in the original data.

In this paper, based on TRPCA, a novel approach is proposed for the identification of differentially expressed genes. Unlike the RPCA method, the TRPCA method extends to multiway data and preserves the intrinsic geometry of the data. Thus, it can select more differentially expressed genes. Nondifferentially expressed genes are considered to be low-rank tensor signals, and differentially expressed genes are treated as sparse perturbation signals. For the multiview data, the tensor $\mathcal{X}$ is decomposed into the sum of the low-rank tensor $\mathcal{L}$ and the sparse tensor $\mathcal{E}$. Next, differentially expressed genes are identified based on the sparse tensor $\mathcal{E}$. Finally, differentially expressed genes are evaluated using the Gene Ontology and the Gene Cards tools.

The main contributions of this paper are as follows.

First, multiview data are innovatively constructed from a variety of cancer gene expression data, attempting to explore the intrinsic geometry structure between coexpressed genes by tensor.

Second, we proposed, for the first time, an approach and idea based on TRPCA, which aims to identify differentially expressed genes in a multiview model. In TRPCA framework, the sparse component contributes to capturing multiple interactions among views, which better preserve the complementary information.

Third, a large number of feature selection experiments are provided to identify differentially expressed genes. The selection of differentially expressed genes can be performed because the sparse tensor can restore common characteristic genes from multiview information. Marking these genes as listed genes will facilitate the diagnosis and treatment of cancers.

The rest of the paper is arranged in the following manner. The second section introduces the tensor-related symbol definitions, as well as detailed description of the TRPCA method. The selection results and analysis of differentially expressed genes are presented in Section 3. Finally, the main points and the future work are summarized.

2. Materials and Methods

2.1. Notations and Preliminaries

In this subsection, some symbols and definitions are given. Throughout this subsection, all symbols are defined according to [12]. Tensors are denoted by bold Euler script letters, for example, $\mathcal{A}$. Matrices are represented by bold capital letters, such as $\mathbf{A}$. By analogy, vectors are represented by bold lowercase letters, for instance, $\mathbf{a}$, and scalars by lowercase letters, such as $a$. We denote the identity matrix of size $n \times n$ by $\mathbf{I}_n$. In this paper, $\mathbb{R}$ and $\mathbb{C}$ represent the fields of real and complex numbers, respectively. For a third-order tensor $\mathcal{A} \in \mathbb{C}^{n_1 \times n_2 \times n_3}$, we denote its $(i, j, k)$-th element by $\mathcal{A}_{ijk}$ or $a_{ijk}$. The MATLAB notations $\mathcal{A}(i,:,:)$, $\mathcal{A}(:,i,:)$, and $\mathcal{A}(:,:,i)$ represent the $i$-th horizontal, lateral, and frontal slices of the tensor, respectively. The frontal slice $\mathcal{A}(:,:,i)$ is also written as $\mathbf{A}^{(i)}$. The tube of the tensor is $\mathcal{A}(i,j,:)$.

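The $\ell_1$-norm is expressed as $\|\mathcal{A}\|_1 = \sum_{ijk} |a_{ijk}|$, and the Frobenius norm is defined as $\|\mathcal{A}\|_F = \sqrt{\sum_{ijk} |a_{ijk}|^2}$. These tensor norms reduce to the corresponding matrix and vector norms when $\mathcal{A}$ is a matrix or a vector. We use the fft function in MATLAB to compute $\bar{\mathcal{A}} = \mathrm{fft}(\mathcal{A}, [\,], 3)$, that is, the discrete Fourier transform of the tensor $\mathcal{A}$ along the third dimension. Similarly, $\mathcal{A}$ is recovered from $\bar{\mathcal{A}}$ by $\mathcal{A} = \mathrm{ifft}(\bar{\mathcal{A}}, [\,], 3)$. Let $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$; the tensor nuclear norm of $\mathcal{A}$, denoted by $\|\mathcal{A}\|_*$, is defined as the average of the nuclear norms of the frontal slices of $\bar{\mathcal{A}}$, i.e., $\|\mathcal{A}\|_* = \frac{1}{n_3} \sum_{i=1}^{n_3} \|\bar{\mathbf{A}}^{(i)}\|_*$ [12, 16]. The same definition has been theoretically justified in the work [17]. Therefore, it supports the theoretical analysis and optimization of the TRPCA model based on the tensor nuclear norm [18].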
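The tensor nuclear norm described above can be sketched in a few lines. This is only an illustration under the averaged definition from [12, 16]; the function name `tensor_nuclear_norm` is hypothetical, and NumPy's `np.fft.fft(..., axis=2)` is assumed to play the role of MATLAB's fft along the third dimension.

```python
import numpy as np

def tensor_nuclear_norm(a):
    """Average of the nuclear norms of the Fourier-domain frontal slices."""
    a_hat = np.fft.fft(a, axis=2)            # FFT along the third dimension
    n3 = a.shape[2]
    total = 0.0
    for k in range(n3):
        # nuclear norm of one slice = sum of its singular values
        total += np.linalg.svd(a_hat[:, :, k], compute_uv=False).sum()
    return total / n3

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 3, 5))
tnn = tensor_nuclear_norm(a)

# Round trip: the inverse FFT along the third dimension recovers the tensor.
a_back = np.real(np.fft.ifft(np.fft.fft(a, axis=2), axis=2))
```

When the third dimension has length 1, the function returns the ordinary matrix nuclear norm, which matches the remark that tensor norms reduce to matrix norms.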

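Specifically, we define $\bar{\mathbf{A}} \in \mathbb{C}^{n_1 n_3 \times n_2 n_3}$ as the block diagonal matrix whose $i$-th diagonal block is the frontal slice $\bar{\mathbf{A}}^{(i)}$ of $\bar{\mathcal{A}}$, i.e.,
$$\bar{\mathbf{A}} = \mathrm{bdiag}(\bar{\mathcal{A}}) = \begin{bmatrix} \bar{\mathbf{A}}^{(1)} & & & \\ & \bar{\mathbf{A}}^{(2)} & & \\ & & \ddots & \\ & & & \bar{\mathbf{A}}^{(n_3)} \end{bmatrix}. \tag{1}$$
An important concept is the block-circulant matrix, which can be regarded as a new matricization of a tensor. The tensor-tensor product is defined based on this concept. In concrete terms, the block-circulant matrix of a tensor $\mathcal{A}$ has size $n_1 n_3 \times n_2 n_3$, as shown below:
$$\mathrm{bcirc}(\mathcal{A}) = \begin{bmatrix} \mathbf{A}^{(1)} & \mathbf{A}^{(n_3)} & \cdots & \mathbf{A}^{(2)} \\ \mathbf{A}^{(2)} & \mathbf{A}^{(1)} & \cdots & \mathbf{A}^{(3)} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{A}^{(n_3)} & \mathbf{A}^{(n_3 - 1)} & \cdots & \mathbf{A}^{(1)} \end{bmatrix}, \tag{2}$$
where the tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$.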

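In addition, we also define the following operations [19]:
$$\mathrm{unfold}(\mathcal{A}) = \begin{bmatrix} \mathbf{A}^{(1)} \\ \mathbf{A}^{(2)} \\ \vdots \\ \mathbf{A}^{(n_3)} \end{bmatrix}, \qquad \mathrm{fold}(\mathrm{unfold}(\mathcal{A})) = \mathcal{A}. \tag{3}$$
More directly, these operations are illustrated in Figure 1, where $\mathrm{unfold}(\mathcal{A}) \in \mathbb{R}^{n_1 n_3 \times n_2}$; $\mathrm{fold}(\mathrm{unfold}(\mathcal{A})) \in \mathbb{R}^{n_1 \times n_2 \times n_3}$.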

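The tensor-tensor product (t-product) is an algebraic operation defined between two third-order tensors. For $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ and $\mathcal{B} \in \mathbb{R}^{n_2 \times l \times n_3}$, it is defined as
$$\mathcal{A} * \mathcal{B} = \mathrm{fold}(\mathrm{bcirc}(\mathcal{A}) \cdot \mathrm{unfold}(\mathcal{B})) \in \mathbb{R}^{n_1 \times l \times n_3}. \tag{4}$$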
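The operations above (unfold, fold, the block-circulant matrix, and the t-product) can be sketched as follows. The function names are illustrative only, and the Fourier-domain variant relies on the standard fact that a block-circulant product is a circular convolution along the third dimension; it is shown as an equivalent computation, not as the authors' implementation.

```python
import numpy as np

def unfold(a):
    """Stack the frontal slices vertically: (n1, n2, n3) -> (n1*n3, n2)."""
    n1, n2, n3 = a.shape
    return a.transpose(2, 0, 1).reshape(n1 * n3, n2)

def fold(m, n3):
    """Inverse of unfold: (n1*n3, n2) -> (n1, n2, n3)."""
    n1 = m.shape[0] // n3
    return m.reshape(n3, n1, m.shape[1]).transpose(1, 2, 0)

def bcirc(a):
    """Block-circulant matrix of size (n1*n3, n2*n3)."""
    n1, n2, n3 = a.shape
    out = np.zeros((n1 * n3, n2 * n3), dtype=a.dtype)
    for i in range(n3):
        for j in range(n3):
            # block (i, j) holds frontal slice number (i - j) mod n3
            out[i * n1:(i + 1) * n1, j * n2:(j + 1) * n2] = a[:, :, (i - j) % n3]
    return out

def t_product(a, b):
    """t-product by definition: fold(bcirc(a) @ unfold(b))."""
    return fold(bcirc(a) @ unfold(b), a.shape[2])

def t_product_fft(a, b):
    """Same product via slice-wise matrix products in the Fourier domain."""
    c_hat = np.einsum("ijk,jlk->ilk",
                      np.fft.fft(a, axis=2), np.fft.fft(b, axis=2))
    return np.real(np.fft.ifft(c_hat, axis=2))
```

The Fourier-domain version avoids forming the large block-circulant matrix, which is why FFT-based computation is usually preferred in practice.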

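Let $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ be a real tensor. Then, the tensor $\mathcal{A}$ can be decomposed into
$$\mathcal{A} = \mathcal{U} * \mathcal{S} * \mathcal{V}^{*}, \tag{5}$$
where $\mathcal{U} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ and $\mathcal{V} \in \mathbb{R}^{n_2 \times n_2 \times n_3}$ are orthogonal tensors, and $\mathcal{S} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ is an F-diagonal tensor (each of its frontal slices is a diagonal matrix).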

Figure 2 shows the t-SVD decomposition process for the tensor. Thus, t-SVD can be perfectly derived from the matrix SVD in the Fourier domain.
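As a rough sketch of how the t-SVD can be obtained from matrix SVDs in the Fourier domain (function names are illustrative; the conjugate-symmetry handling is one common way to keep the factors real and is an assumption here, not a detail taken from the paper):

```python
import numpy as np

def t_product(a, b):
    """t-product computed slice-wise in the Fourier domain."""
    c_hat = np.einsum("ijk,jlk->ilk",
                      np.fft.fft(a, axis=2), np.fft.fft(b, axis=2))
    return np.real(np.fft.ifft(c_hat, axis=2))

def t_transpose(a):
    """Transpose every frontal slice and reverse the order of slices 2..n3."""
    return np.concatenate([a[:, :, :1], a[:, :, :0:-1]], axis=2).transpose(1, 0, 2)

def t_svd(a):
    """Return real tensors (u, s, v) with a = u * s * v^T under the t-product."""
    n1, n2, n3 = a.shape
    a_hat = np.fft.fft(a, axis=2)
    u_hat = np.zeros((n1, n1, n3), dtype=complex)
    s_hat = np.zeros((n1, n2, n3), dtype=complex)
    v_hat = np.zeros((n2, n2, n3), dtype=complex)
    d = np.arange(min(n1, n2))
    for k in range(n3 // 2 + 1):
        slice_k = a_hat[:, :, k]
        if k == 0 or 2 * k == n3:
            slice_k = slice_k.real        # these slices are real for real input
        u, s, vh = np.linalg.svd(slice_k)
        u_hat[:, :, k] = u
        s_hat[d, d, k] = s
        v_hat[:, :, k] = vh.conj().T
    for k in range(n3 // 2 + 1, n3):
        # fill the remaining slices by conjugate symmetry so the result is real
        u_hat[:, :, k] = u_hat[:, :, n3 - k].conj()
        s_hat[:, :, k] = s_hat[:, :, n3 - k]
        v_hat[:, :, k] = v_hat[:, :, n3 - k].conj()
    back = lambda t: np.real(np.fft.ifft(t, axis=2))
    return back(u_hat), back(s_hat), back(v_hat)

rng = np.random.default_rng(2)
a = rng.standard_normal((4, 3, 5))
u, s, v = t_svd(a)
a_rec = t_product(t_product(u, s), t_transpose(v))   # should reproduce a
```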

2.2. Related Methods and Works

For the processing of high-dimensional and small-sample data, the most commonly used method is RPCA. Let the given data matrix be $\mathbf{X} \in \mathbb{R}^{n_1 \times n_2}$, which can be decomposed into the sum of matrices $\mathbf{L}$ and $\mathbf{E}$, where $\mathbf{E}$ is a sparse matrix and $\mathbf{L}$ is a low-rank matrix. The objective function can be expressed as
$$\min_{\mathbf{L}, \mathbf{E}} \; \|\mathbf{L}\|_* + \lambda \|\mathbf{E}\|_1 \quad \text{s.t.} \quad \mathbf{X} = \mathbf{L} + \mathbf{E}, \tag{6}$$
where $\|\mathbf{L}\|_*$ represents the matrix nuclear norm (the sum of singular values of $\mathbf{L}$), $\|\mathbf{E}\|_1$ represents the $\ell_1$-norm (the sum of the absolute values of all entries in $\mathbf{E}$), and $\lambda > 0$ is a parameter. RPCA and its extensions have been successfully applied in image segmentation [20], background models [21], and the extraction of characteristic genes from genomic data [9].

Under ideal conditions, we expect to extend the conditions for recovering low-rank matrices to three-dimensional tensors, together with the tools and methods used to recover matrices. However, this extension is not simple. The numerical algebra of tensor data is filled with hardness results [22]. The definition of a tensor rank is crucial for tensor recovery. The rank of a matrix has many well-behaved properties [23], whereas the tensor rank is very different, and it is difficult to settle on a single definition. Thus far, several definitions of tensor rank have been proposed, but each has limitations. Taking [24] as an example, the CP (CANDECOMP/PARAFAC) rank is the minimum number of rank-one tensors in a decomposition, and computing it is an NP-hard problem. Therefore, the associated convex relaxation is also difficult to obtain. Given a $k$-way tensor $\mathcal{X}$, the Tucker rank is a vector defined as $\mathrm{rank}_{tc}(\mathcal{X}) = (\mathrm{rank}(\mathbf{X}^{\{1\}}), \ldots, \mathrm{rank}(\mathbf{X}^{\{k\}}))$, where $\mathbf{X}^{\{i\}}$ is the mode-$i$ matricization of the tensor. The Tucker rank is computationally feasible because it is defined on a matrix basis. Motivated by the fact that the nuclear norm is the convex envelope of the matrix rank, the Sum of Nuclear Norms (SNN) is defined as $\sum_{i} \|\mathbf{X}^{\{i\}}\|_*$ and serves as a convex surrogate of the Tucker rank. Good performance of this method in various fields has been confirmed [25-28]. However, SNN is not a tight convex relaxation of the Tucker rank [29]. That work considers low-rank tensor completion based on the SNN. The model is
$$\min_{\mathcal{X}} \; \sum_{i=1}^{k} \lambda_i \|\mathbf{X}^{\{i\}}\|_* \quad \text{s.t.} \quad P_{\Omega}(\mathcal{X}) = P_{\Omega}(\mathcal{M}). \tag{7}$$
Another work [30] proposed a TRPCA model based on the SNN, whose objective function is
$$\min_{\mathcal{L}, \mathcal{E}} \; \sum_{i=1}^{k} \lambda_i \|\mathbf{L}^{\{i\}}\|_* + \|\mathcal{E}\|_1 \quad \text{s.t.} \quad \mathcal{X} = \mathcal{L} + \mathcal{E}, \tag{8}$$
where $\|\mathcal{E}\|_1$ is the $\ell_1$-norm and refers to the sum of the absolute values of all elements in $\mathcal{E}$. This model also ensures that the tensor can be reliably recovered under certain incoherence conditions.
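To make the matrix-based definitions above concrete, the following sketch computes mode-wise matricizations, the Tucker rank, and an unweighted SNN for a small tensor. The function names are hypothetical, and the column ordering of the matricization is one of several conventions (ranks and singular values do not depend on it).

```python
import numpy as np

def mode_unfold(x, mode):
    """Mode-`mode` matricization: rows are indexed by the chosen mode."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def tucker_rank(x):
    """Vector of matrix ranks of all mode-wise matricizations."""
    return tuple(int(np.linalg.matrix_rank(mode_unfold(x, m)))
                 for m in range(x.ndim))

def sum_of_nuclear_norms(x):
    """Unweighted SNN: sum over modes of the nuclear norm of each unfolding."""
    return sum(np.linalg.svd(mode_unfold(x, m), compute_uv=False).sum()
               for m in range(x.ndim))

# A rank-one tensor (outer product of three vectors).
a = np.array([1.0, 2.0])
b = np.array([1.0, 0.0, -1.0])
c = np.array([2.0, 1.0, 0.0, 3.0])
rank_one = np.einsum("i,j,k->ijk", a, b, c)

print(tucker_rank(rank_one))            # (1, 1, 1)
snn = sum_of_nuclear_norms(rank_one)    # three times the product of the vector norms
```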

2.3. Objective Function and Solutions Process

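The tensor method assumes that a given third-order tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ can be decomposed into $\mathcal{X} = \mathcal{L} + \mathcal{E}$, where $\mathcal{L}$ is the low-rank component and $\mathcal{E}$ is the sparse component. Under certain suitable assumptions, this problem can be solved by convex optimization. The objective function can be expressed as the weighted sum of the tensor nuclear norm and the $\ell_1$-norm, i.e.,
$$\min_{\mathcal{L}, \mathcal{E}} \; \|\mathcal{L}\|_* + \lambda \|\mathcal{E}\|_1 \quad \text{s.t.} \quad \mathcal{X} = \mathcal{L} + \mathcal{E}, \tag{9}$$
where $\|\mathcal{L}\|_*$ represents the nuclear norm of the tensor $\mathcal{L}$ and $\|\mathcal{E}\|_1$ indicates the $\ell_1$-norm of the tensor $\mathcal{E}$. Following [12], the parameter is chosen as $\lambda = 1/\sqrt{\max(n_1, n_2)\, n_3}$. It is observed that when $n_3$ drops to 1, TRPCA degenerates to RPCA, so TRPCA can be seen as an extension of RPCA.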

As in robust principal component analysis [31], exact recovery is not always possible, and this situation also applies to TRPCA. For example, define $\mathcal{X}$ as a tensor with $\mathcal{X}_{ijk} = 1$ when $i = j = k = 1$ and all other entries equal to zero. In this situation, $\mathcal{X}$ is both low-rank and sparse, so the low-rank and sparse components cannot be told apart. Therefore, to avoid this thorny problem, we must assume that the low-rank component $\mathcal{L}$ is not sparse.

The most common algorithm for solving RPCA-related problems is the Alternating Direction Method of Multipliers (ADMM) [16]. Therefore, ADMM is also used to solve the convex TRPCA problem in this paper. Its main idea is that the values of $\mathcal{L}$ and $\mathcal{E}$ are updated alternately in each iteration. The cost of each iteration is dominated by the update of $\mathcal{L}$, because this step requires an FFT along the third dimension and SVDs of the resulting frontal slice matrices.

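For (9), the Lagrangian multiplier $\mathcal{Y}$ is introduced to eliminate the equality constraint. According to a previous work [16], the augmented Lagrangian function used by the ADMM algorithm can be expressed as follows:
$$L(\mathcal{L}, \mathcal{E}, \mathcal{Y}, \mu) = \|\mathcal{L}\|_* + \lambda \|\mathcal{E}\|_1 + \langle \mathcal{Y}, \mathcal{L} + \mathcal{E} - \mathcal{X} \rangle + \frac{\mu}{2} \|\mathcal{L} + \mathcal{E} - \mathcal{X}\|_F^2, \tag{10}$$
where $\mu$ is a scalar parameter with $\mu > 0$, and $\|\cdot\|_F$ is the Frobenius norm.

The TRPCA method decomposes the original tensor after a number of iterations. Minimizing the augmented Lagrangian function alternately with respect to $\mathcal{L}$ and $\mathcal{E}$, that is, setting the corresponding partial derivatives to zero, gives the final iteration formulas
$$\mathcal{L}_{k+1} = \arg\min_{\mathcal{L}} \; \|\mathcal{L}\|_* + \frac{\mu_k}{2} \left\| \mathcal{L} + \mathcal{E}_k - \mathcal{X} + \frac{\mathcal{Y}_k}{\mu_k} \right\|_F^2, \tag{11}$$
$$\mathcal{E}_{k+1} = \arg\min_{\mathcal{E}} \; \lambda \|\mathcal{E}\|_1 + \frac{\mu_k}{2} \left\| \mathcal{L}_{k+1} + \mathcal{E} - \mathcal{X} + \frac{\mathcal{Y}_k}{\mu_k} \right\|_F^2. \tag{12}$$
The details of the solution algorithm can be found in Algorithm 1.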

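Algorithm 1: Solving TRPCA by ADMM.
Input: tensor data $\mathcal{X}$, parameter $\lambda$
Initialize: $\mathcal{L}_0 = \mathcal{E}_0 = \mathcal{Y}_0 = 0$, $\rho > 1$, $\mu_0 > 0$;
$\mu_{\max}$, $\varepsilon$
While not converged do
(1) Update $\mathcal{L}_{k+1}$ by Equation (11);
(2) Update $\mathcal{E}_{k+1}$ by Equation (12);
(3) $\mathcal{Y}_{k+1} = \mathcal{Y}_k + \mu_k (\mathcal{L}_{k+1} + \mathcal{E}_{k+1} - \mathcal{X})$;
(4) Update $\mu_{k+1}$ by $\mu_{k+1} = \min(\rho \mu_k, \mu_{\max})$;
(5) Check the convergence conditions
$\|\mathcal{L}_{k+1} - \mathcal{L}_k\|_\infty \le \varepsilon$, $\|\mathcal{E}_{k+1} - \mathcal{E}_k\|_\infty \le \varepsilon$, $\|\mathcal{L}_{k+1} + \mathcal{E}_{k+1} - \mathcal{X}\|_\infty \le \varepsilon$
End while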
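
The iteration in Algorithm 1 can be sketched as below. This is a minimal illustration, not the authors' code: it assumes the usual closed-form solutions of the two subproblems (singular value thresholding of the Fourier-domain slices for the low-rank update and entrywise soft thresholding for the sparse update). The default values of `rho`, `mu`, `mu_max`, `eps`, and `max_iter`, as well as the default `lam` taken from [12], are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, tau):
    """Entrywise shrinkage: proximal operator of tau times the l1-norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def t_svt(a, tau):
    """Singular value thresholding applied to each Fourier-domain slice."""
    a_hat = np.fft.fft(a, axis=2)
    out_hat = np.empty_like(a_hat)
    for k in range(a.shape[2]):
        u, s, vh = np.linalg.svd(a_hat[:, :, k], full_matrices=False)
        out_hat[:, :, k] = (u * np.maximum(s - tau, 0.0)) @ vh
    return np.real(np.fft.ifft(out_hat, axis=2))

def trpca_admm(x, lam=None, rho=1.1, mu=1e-3, mu_max=1e10,
               eps=1e-7, max_iter=500):
    """Split x into a low-rank part and a sparse part (x = low + sparse)."""
    n1, n2, n3 = x.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2) * n3)      # assumed default, from [12]
    low = np.zeros_like(x)
    sparse = np.zeros_like(x)
    y = np.zeros_like(x)                           # Lagrangian multiplier
    for _ in range(max_iter):
        low_new = t_svt(x - sparse - y / mu, 1.0 / mu)                # step (1)
        sparse_new = soft_threshold(x - low_new - y / mu, lam / mu)   # step (2)
        resid = low_new + sparse_new - x
        y = y + mu * resid                                            # step (3)
        mu = min(rho * mu, mu_max)                                    # step (4)
        change = max(np.abs(low_new - low).max(),
                     np.abs(sparse_new - sparse).max(),
                     np.abs(resid).max())                             # step (5)
        low, sparse = low_new, sparse_new
        if change < eps:
            break
    return low, sparse

# Synthetic check: low-tubal-rank tensor plus a few large sparse entries.
rng = np.random.default_rng(0)
n1, n2, n3, r = 30, 30, 4, 2
p_hat = np.fft.fft(rng.standard_normal((n1, r, n3)), axis=2)
q_hat = np.fft.fft(rng.standard_normal((r, n2, n3)), axis=2)
low_true = np.real(np.fft.ifft(np.einsum("irk,rjk->ijk", p_hat, q_hat), axis=2))
sparse_true = np.zeros((n1, n2, n3))
mask = rng.random((n1, n2, n3)) < 0.05
sparse_true[mask] = rng.choice([-5.0, 5.0], size=int(mask.sum()))
x = low_true + sparse_true

l_hat, e_hat = trpca_admm(x)
print("relative error of the low-rank part:",
      np.linalg.norm(l_hat - low_true) / np.linalg.norm(low_true))
```

The sparse output `e_hat` is the quantity that the gene selection step operates on.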
2.4. The TRPCA Model of Gene Expression Data

Consider the gene expression data $\mathcal{X}$ with size $n_1 \times n_2 \times n_3$. Each row of a frontal slice of $\mathcal{X}$ represents the transcriptional response of one gene in all samples, and each column represents the expression levels of all genes in one sample. Without loss of generality, the size of each frontal slice matrix is $n_1 \times n_2$ with $n_1 \gg n_2$, so this is a classic HD3S problem.

The purpose of using TRPCA to model multiview data is to discover important genes. As mentioned above, it is reasonable to treat important genes as sparse signals. Thus, the differential expression is regarded as the sparse disturbance signal $\mathcal{E}$, and the nondifferential expression is regarded as the low-rank tensor $\mathcal{L}$. From this perspective, genes that are differentially expressed in various cancers can be identified from the sparse disturbance signal $\mathcal{E}$. The multiview model of TRPCA is shown in Figure 3. The three dimensions represent genes, samples, and disease types. Each frontal slice matrix of the input tensor $\mathcal{X}$ represents the expression levels of all genes in all samples of one cancer, so different frontal slices represent different cancer types. The solid color represents data points equal to or close to zero, and a colored noise point denotes a disturbance signal. As shown in Figure 3, the differentially expressed genes in the tensor $\mathcal{E}$ can be recovered from the original tensor of gene expression data.

Assume that the tensor decomposition has been completed by the TRPCA model. By selecting an appropriate parameter $\lambda$, the sparse disturbance signals can be obtained in the sparse tensor. Most of the entries in the sparse tensor are zero or close to zero, and genes with nonzero entries can be considered differentially expressed genes.

2.5. Identification of Differentially Expressed Genes

The low-rank tensor $\mathcal{L}$ and sparse tensor $\mathcal{E}$ can be obtained in the experiment. Because we regard the important genes as sparse signals, the differentially expressed genes are treated as sparse perturbation signals and can therefore be extracted from the sparse perturbation tensor $\mathcal{E}$. We complete the following steps for each frontal slice of the sparse tensor. First, the absolute value of each entry of the frontal slice is calculated, and then the entries of each row (gene) are summed over the columns (samples). Summing the results over all frontal slices yields the vector
$$\hat{\mathbf{e}} = [\hat{e}_1, \hat{e}_2, \ldots, \hat{e}_{n_1}]^{T}, \qquad \hat{e}_i = \sum_{k=1}^{n_3} \sum_{j=1}^{n_2} |e_{ijk}|. \tag{13}$$
Then, the entries of this vector are arranged in descending order:
$$\tilde{\mathbf{e}} = [\tilde{e}_1, \tilde{e}_2, \ldots, \tilde{e}_{n_1}]^{T}, \qquad \tilde{e}_1 \ge \tilde{e}_2 \ge \cdots \ge \tilde{e}_{n_1}. \tag{14}$$
Next, we retain the top 500 values of the sorted vector and extract the corresponding genes. In general, the higher a gene's ranking, the more likely it is to be a differentially expressed gene. Therefore, we selected only the genes corresponding to these leading entries of the vector as differentially expressed genes. An important tool for our analysis of genomic data is GO::TermFinder [32]. GO::TermFinder is open source software for accessing Gene Ontology information and finding enriched Gene Ontology terms. When the gene names are submitted to the GO::TermFinder tool, it generates a table of the terms associated with those genes, with rich biological explanations. The performance of the methods was compared using P-values and hit counts, that is, the number of input genes annotated to each term. A smaller P-value indicates that the selected differentially expressed genes are more significantly enriched. The thresholds of the parameters are set in a uniform way: the maximum P-value is set to 0.01.
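The scoring and ranking steps described above can be sketched as follows, assuming the sparse tensor is stored as a NumPy array with genes along the first axis, samples along the second, and cancer types along the third. The function name `rank_genes` and the toy gene names are hypothetical.

```python
import numpy as np

def rank_genes(sparse_tensor, gene_names, top_k=500):
    """Score each gene by the summed absolute sparse signal and keep the top_k."""
    # sum |e_ijk| over samples (axis 1) and frontal slices (axis 2)
    scores = np.abs(sparse_tensor).sum(axis=(1, 2))
    order = np.argsort(-scores, kind="stable")[:top_k]   # descending order
    return [(gene_names[i], float(scores[i])) for i in order]

# Toy example: 4 genes, 2 samples, 2 cancer types.
e = np.zeros((4, 2, 2))
e[2, 0, 0] = -3.0
e[2, 1, 1] = 1.0
e[0, 1, 0] = 2.0
top = rank_genes(e, ["G1", "G2", "G3", "G4"], top_k=2)
print(top)   # [('G3', 4.0), ('G1', 2.0)]
```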

3. Results and Discussion

3.1. The Composition of the Dataset

The Cancer Genome Atlas (TCGA) maps the genomic variation of cancer using genomics analysis techniques. The TCGA project included the 33 most common cancers and more than 11,000 tumor samples for sequencing. An in-depth study of this information will inspire future clinical trials and treatments. In this paper, we used two multiview datasets to analyze the effectiveness of the proposed method. These multiviews included various cancer types: colon adenocarcinoma (COAD), head and neck squamous cell carcinoma (HNSC), esophageal carcinoma (ESCA), and pancreatic adenocarcinoma (PAAD). To ensure the versatility of the experiments, the experimental materials were composed of cancer data in the TCGA database (https://cancergenome.nih.gov/).

The information contained in the multiple views of multiview data is rich, and it is worthwhile to study it in depth. However, one challenging issue is the heterogeneity between different views. In this experiment, we preprocessed the multiview data as follows. First, data from different sources and with different characteristics were integrated. In this process, the genes common to all datasets were extracted, and these common genes showed different expression levels in response to the same type of pathogen. Second, all common genes were sorted in alphabetical order so that the gene labels of the multiview data are aligned across views.

Under certain conditions, the tensor can be viewed as an extension of the matrix and vector. When the third dimension of the tensor drops to 1, the tensor degenerates to the matrix. When both the second and third dimensions of the tensor are reduced to 1, the tensor is reduced to a vector. In this paper, tensor refers to the three-order tensor, aiming to study the spatial structure between three-dimensional data and then explore the intrinsic links between various cancer diseases. The frontal section of multiview data is composed of gene expression data from different cancers. Subject to the tensor dimension, the number of samples is determined by the cancer with the fewest number of samples. The details of more multiview datasets are summarized in Table 1. The original tensor was decomposed into a sum of a low-rank tensor and a sparse tensor using TRPCA. The differentially expressed genes were considered to be sparse perturbed signals and then identified based on the sparse tensor.
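The construction of the multiview tensor described above (common genes, alphabetical alignment, and a sample count fixed by the smallest cohort) can be sketched as follows. The function name and the toy data are hypothetical, and keeping the first columns of each matrix is only an assumption, since the text does not state how samples are chosen when a cohort is larger.

```python
import numpy as np

def build_multiview_tensor(views):
    """views maps a cancer name to (gene_names, matrix of genes x samples)."""
    # genes shared by every view, sorted alphabetically
    common = sorted(set.intersection(*(set(g) for g, _ in views.values())))
    # the smallest cohort determines the number of samples per slice
    n_samples = min(m.shape[1] for _, m in views.values())
    slices = []
    for cancer in views:
        genes, mat = views[cancer]
        row_of = {g: i for i, g in enumerate(genes)}
        rows = [row_of[g] for g in common]
        slices.append(mat[rows, :n_samples])     # assumption: keep first columns
    return common, np.stack(slices, axis=2)      # genes x samples x cancers

views = {
    "COAD": (["B", "A", "C"], np.arange(12.0).reshape(3, 4)),
    "HNSC": (["A", "C", "D"], 10.0 * np.arange(9.0).reshape(3, 3)),
}
genes, tensor = build_multiview_tensor(views)
print(genes, tensor.shape)   # ['A', 'C'] (2, 3, 2)
```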


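Table 1: Details of the multiview datasets.

| Dataset | Number of genes | Number of samples | Dimensions |
|---|---|---|---|
| COAD_HNSC_ESCA_GE | 20502 | 192∗3=576 | 20502∗192∗3 |
| COAD_HNSC_PAAD_GE | 20502 | 180∗3=540 | 20502∗180∗3 |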

In many cases, there is a certain connection between substances with similar behavior or changes. There are also many common characteristics between the occurrence of cancer, such as the commonality of disease-causing genes and the upregulation and downregulation of genes. Multiview gene expression data provide support for this study. In this paper, gene expression data in multiview data contain views from multiple cancers. These differences exist in the spatial structure between tensor slices, while the TRPCA method can decompose the original tensor into the low-rank tensor and the sparse tensor without destroying the internal structure of the tensor. The geometry between the internal slices of the tensor obtained from the TRPCA decomposition is not destroyed, retaining valid information. This not only improves the accuracy of feature selection, but also provides a new idea based on the tensor method. Therefore, the tensor-based robust principal component analysis method is able to explore more changes between its data than other methods. Extraction of differentially expressed genes under the influence of multiple views can not only elucidate the commonality of disease-causing genes between cancers but also establish the correlation between cancers. Marking these coexpression genes in the list of genes will provide a targeted orientation for cancer detection.

3.2. Results for the COAD_HNSC_PAAD Data

Gene Ontology is composed of three parts: biological processes, molecular functions, and cellular components. The differentially expressed genes were submitted to the GO tool. The resulting GO terms were ranked in ascending order of P-value, and the first ten terms were selected for display. Better experimental results are shown in italic type. Table 2 lists the top ten terms obtained by the TRPCA method and by the RPCA, LLRR, PCA, and BRTF methods for the COAD_HNSC_PAAD_GE dataset. The P-value indicates the degree of enrichment: it is the probability of observing at least x of the total n genes in a list annotated to a particular GO term, given the proportion of genes annotated to that GO term in the entire genome. The closer the P-value is to zero, the more significant is the association of the specific GO term with the gene list. The P-values of these 10 terms revealed that the proposed method is superior to several other methods. Specifically, the P-value of GO:0006614 was 6.49E-74, which is much smaller than the P-values of the other methods. In addition, the largest hit count for this term was also obtained by our method. There are 94 genes annotated to GO:0006614 in the genome, of which RPCA, LLRR, PCA, and BRTF detected 51, 51, 60, and 55 genes, respectively, whereas 61 genes were identified using the TRPCA method. The corresponding name for GO:0072599 was the establishment of protein localization to the endoplasmic reticulum. It contained TNF, TP53, and other genes that are related to the occurrence of induced tumors. By comparing the P-value and the hit count, we can conclude that our experimental method is superior to other methods.
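The P-value described above has the form of a hypergeometric tail probability, which can be sketched with the standard library alone. The function name is illustrative, and the background size of 20502 genes used in the example is an assumption made here for demonstration; the values in Table 2 come from the GO tool, which may use a different background and additional corrections, so this sketch is not expected to reproduce them exactly.

```python
from math import comb

def enrichment_p_value(hits, list_size, term_size, background_size):
    """P(at least `hits` of `list_size` drawn genes are annotated to the term)."""
    upper = min(list_size, term_size)
    numerator = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return numerator / comb(background_size, list_size)

# Small sanity check: drawing 5 of 10 genes, 5 of which are annotated.
p_small = enrichment_p_value(5, 5, 5, 10)        # equals 1/252

# Illustration with the TRPCA counts for GO:0006614 (61 hits among 500 genes,
# 94 annotated genes) and an ASSUMED background of 20502 genes.
p_demo = enrichment_p_value(61, 500, 94, 20502)
print(p_demo)
```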


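Table 2: Top ten GO terms for the COAD_HNSC_PAAD_GE dataset.

| ID | TRPCA P-value | TRPCA Hit Count | RPCA P-value | RPCA Hit Count | LLRR P-value | LLRR Hit Count | PCA P-value | PCA Hit Count | BRTF P-value | BRTF Hit Count | Count in genome |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GO:0006614 | 6.49E-74 | 61 | 5.91E-56 | 51 | 5.31E-56 | 51 | 4.35E-72 | 60 | 5.60E-63 | 55 | 94 |
| GO:0006613 | 5.10E-71 | 61 | 8.65E-54 | 51 | 7.76E-54 | 51 | 2.84E-69 | 60 | 1.53E-60 | 55 | 101 |
| GO:0045047 | 1.24E-70 | 61 | 8.65E-54 | 51 | 7.76E-54 | 51 | 2.84E-69 | 60 | 3.25E-60 | 55 | 102 |
| GO:0072599 | 3.78E-69 | 61 | 2.24E-52 | 51 | 2.01E-52 | 51 | 1.87E-67 | 60 | 5.89E-59 | 55 | 106 |
| GO:0070972 | 4.88E-68 | 64 | 1.03E-50 | 53 | 9.22E-51 | 53 | 4.25E-68 | 64 | 6.89E-57 | 57 | 125 |
| GO:0000184 | 6.19E-66 | 62 | 1.20E-51 | 53 | 1.07E-51 | 53 | 5.41E-66 | 62 | 6.35E-58 | 57 | 121 |
| GO:0000956 | 5.56E-57 | 68 | 1.27E-41 | 56 | 6.89E-43 | 57 | 4.81E-57 | 68 | 1.19E-46 | 60 | 202 |
| GO:0006612 | 2.63E-54 | 65 | 1.81E-41 | 55 | 9.57E-43 | 56 | 4.92E-53 | 64 | 8.00E-48 | 60 | 194 |
| GO:0019083 | 2.35E-53 | 62 | 9.12E-43 | 54 | 8.16E-43 | 54 | 4.73E-52 | 61 | 4.85E-48 | 58 | 176 |
| GO:0006401 | 2.34E-50 | 68 | 1.85E-36 | 56 | 1.32E-37 | 57 | 2.04E-50 | 68 | 5.19E-41 | 60 | 247 |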

For the COAD_HNSC_PAAD_GE dataset, the 500 genes extracted using TRPCA were compared with those obtained from Gene Cards for the three cancers. Of the 500 genes, 255 were associated with these three diseases. Many genes that were previously thought to be unrelated to clinical outcomes were identified. We list the top 10 differentially expressed genes with the highest relevance scores in Table 3, together with their gene names, relevance scores, related GO annotations, and related diseases. In general, among the identified differentially expressed genes, the genes closely related to these three diseases were CDH1, MMP9, EPCAM, and MMP2. CCND1 and MMP1 are associated with the occurrence of COAD and HNSC. INS is associated with the development of COAD and PAAD.


Table 3: Top 10 differentially expressed genes identified by TRPCA.

| Gene name | Relevance score | Related GO annotations | Related diseases |
|---|---|---|---|
| CDH1 | 183.35 | calcium ion binding and protein phosphatase binding | Gastric Cancer, Hereditary Diffuse and Blepharocheilodontic Syndrome 1 |
| CTNNB1 | 168.66 | DNA binding transcription factor activity and binding | Mental Retardation, Autosomal Dominant 19 and Pilomatrixoma |
| CCND1 | 165.91 | protein kinase activity and enzyme binding | Myeloma, Multiple and Von Hippel-Lindau Syndrome |
| MMP9 | 139.39 | identical protein binding and metalloendopeptidase activity | Metaphyseal Anadysplasia 2 and Metaphyseal Anadysplasia |
| EPCAM | 114.16 | protein complex binding | Diarrhea 5, with Tufting Enteropathy, Congenital and Colorectal Cancer, Hereditary Nonpolyposis, Type 8 |
| MMP2 | 92.35 | serine-type endopeptidase activity and metallopeptidase activity | Multicentric Osteolysis, Nodulosis, and Arthropathy and Arthropathy |
| PLAU | 89.58 | serine-type endopeptidase activity | Quebec platelet Disorder and Alzheimer Disease |
| MMP1 | 88.33 | calcium ion binding and metallopeptidase activity | Epidermolysis Bullosa Dystrophica, Autosomal Recessive and Recessive Dystrophic Epidermolysis Bullosa |
| IGF2 | 88.21 | growth factor activity and insulin receptor binding | Growth Restriction, Severe, with Distinctive Facies and Silver-Russell Syndrome |
| INS | 86.99 | identical protein binding and protease binding | Hyperproinsulinemia and Diabetes Mellitus, Insulin-dependent, 2 |

Specifically, the official name of CDH1 is Cadherin 1, and it was the gene most strongly correlated with this dataset. Recently, works [33, 34] have described mutations in CDH1 in COAD cell lines. Cytoplasmic CDH1 has independent prognostic value in PAAD and provides a new target for prognostic treatment [35]. A meta-analysis [36] indicates that CDH1 promoter methylation is associated with HNSC risk and can be used as a valuable diagnostic biomarker for HNSC. In summary, CDH1 is related to the occurrence of these three cancers. In addition, MMP9 has been identified in a variety of malignancies [37] and as a potential marker for the prognosis of HNSC [38]. Dysregulated MMP9 expression induces invasive growth and metastasis of PAAD [39]. Expression of MMP9 is elevated in a variety of inflammatory and oncological indications and is evident in colitis and colorectal cancer [40]. MMP9 has a strong impact on the occurrence and treatment of the three types of cancer, and thus MMP9 is closely related to this dataset. In summary, there was a close correlation between the differentially expressed genes and the cancers contained in the dataset, demonstrating the accuracy of our method.

3.3. Results for the COAD_HNSC_ESCA Data

The results of the experiment using the COAD_HNSC_ESCA dataset are listed in Table 4. TRPCA was compared with the other four methods, and better results are indicated in italic. For the term GO:0005198, the P-value of TRPCA was 1.30E-71, which is less than the 8.97E-64, 7.80E-67, 1.77E-70, and 1.02E-66 obtained by RPCA, LLRR, PCA, and BRTF, respectively. The name of GO:0005198 is structural molecule activity. The INS gene has been identified with a multitude of mutant alleles with phenotypic effects. In terms of the number of hits, TRPCA hit 131 genes, more than the 123, 126, 130, and 126 hit by the other methods, out of a total of 762 genes annotated to this term. The P-value of GO:0006413 for TRPCA was 5.51E-64, which is significantly lower than the P-values obtained using the four methods RPCA, LLRR, PCA, and BRTF. A global examination of the table revealed only two terms (GO:0003723 and GO:0070972) for which our method did not outperform PCA on both measures; for the rest, our method was better than the other four methods. Therefore, in summary, whether we compared the P-values or the number of hits, our method performed better than the other four methods.


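| ID | TRPCA P-value | TRPCA Hit Count | RPCA P-value | RPCA Hit Count | LLRR P-value | LLRR Hit Count | PCA P-value | PCA Hit Count | BRTF P-value | BRTF Hit Count | Count in genome |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GO:0005198 | 1.30E-71 | 131 | 8.97E-64 | 123 | 7.80E-67 | 126 | 1.77E-70 | 130 | 1.02E-66 | 126 | 762 |
| GO:0006614 | 1.46E-66 | 57 | 5.18E-51 | 48 | 1.60E-59 | 53 | 3.63E-61 | 54 | 1.178E-64 | 56 | 94 |
| GO:0006413 | 5.51E-64 | 72 | 1.61E-41 | 55 | 3.65E-49 | 61 | 1.20E-55 | 66 | 6.43E-53 | 64 | 194 |
| GO:0006613 | 5.59E-64 | 57 | 4.91E-49 | 48 | 3.17E-57 | 53 | 8.45E-59 | 54 | 3.80E-62 | 56 | 101 |
| GO:0045047 | 1.24E-63 | 57 | 9.06E-49 | 48 | 6.45E-57 | 53 | 1.75E-58 | 54 | 8.227E-62 | 56 | 102 |
| GO:0003723 | 4.51E-63 | 173 | 1.44E-29 | 125 | 2.10E-40 | 142 | 1.56E-64 | 175 | 2.73E-40 | 142 | 1632 |
| GO:0072599 | 2.65E-62 | 57 | 9.66E-48 | 48 | 9.98E-56 | 53 | 6.62E-59 | 55 | 1.618E-60 | 56 | 106 |
| GO:0000184 | 4.18E-61 | 59 | 3.17E-47 | 50 | 7.46E-55 | 55 | 2.64E-56 | 56 | 2.12E-59 | 58 | 121 |
| GO:0070972 | 5.13E-60 | 59 | 2.30E-46 | 50 | 7.19E-54 | 55 | 5.81E-60 | 59 | 2.447E-58 | 58 | 125 |
| GO:0070973 | 7.65E-45 | 60 | 2.11E-34 | 51 | 3.62E-40 | 56 | 3.40E-41 | 57 | none | none | 116 |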
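
As a rough illustration of how a hit count and a count in the genome relate to an enrichment P-value, the sketch below computes an upper-tail hypergeometric probability, which is one common test for GO term enrichment. It is not necessarily the exact procedure behind Table 4. The function name `enrichment_p_value` and the genome size of 20000 are assumptions made for illustration; the 500 selected genes, the 131 hits, and the count of 762 are taken from the text.

```python
from math import comb


def enrichment_p_value(hits, selected, term_size, genome_size):
    """Upper-tail hypergeometric probability P(X >= hits).

    hits        -- selected genes annotated with the GO term
    selected    -- number of genes selected by the method
    term_size   -- genes annotated with the term in the whole genome
    genome_size -- total number of genes considered (assumed background)
    """
    upper = min(selected, term_size)
    total = comb(genome_size, selected)
    tail = sum(
        comb(term_size, k) * comb(genome_size - term_size, selected - k)
        for k in range(hits, upper + 1)
    )
    # Exact integer arithmetic first, one float division at the end.
    return tail / total


# Hypothetical background of 20000 genes; the true background is not given here.
p_131_hits = enrichment_p_value(131, 500, 762, 20000)
p_123_hits = enrichment_p_value(123, 500, 762, 20000)
```

With the same background, a larger hit count always yields a smaller tail probability, which is why the hit counts and P-values in the table move together. The absolute values depend on the assumed background size and will not reproduce the table.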

Table 5 lists GO annotations and related diseases for the top ten differentially expressed genes screened using the TRPCA method for the COAD_HNSC_ESCA dataset. Overall, 8 of the 10 differentially expressed genes were highly correlated with these three cancers. These 8 genes were EGFR, CDH1, ERBB2, CCND1, MMP9, EPCAM, MMP1, and MMP2. In addition, CTNNB1 was significantly expressed in patients with COAD and ESCA. The last gene, PLAU, could serve as a key biomarker for the accurate diagnosis and prognosis of HNSC, providing a potential target for clinical treatment.


| Gene ID | Relevance score | Related GO Annotations | Related Diseases |
|---|---|---|---|
| EGFR | 219.78 | identical protein binding and protein kinase activity | Inflammatory Skin And Bowel Disease, Neonatal, 2 and Lung Cancer |
| CDH1 | 182.64 | calcium ion binding and protein phosphatase binding | Gastric Cancer, Hereditary Diffuse and Blepharocheilodontic Syndrome 1 |
| ERBB2 | 170.81 | identical protein binding and protein kinase activity | Glioma Susceptibility 1 and Lung Cancer |
| CTNNB1 | 165.7 | DNA binding transcription factor activity and binding | Mental Retardation, Autosomal Dominant 19 and Pilomatrixoma |
| CCND1 | 165.44 | protein kinase activity and enzyme binding | Myeloma, Multiple and Von Hippel-Lindau Syndrome |
| MMP9 | 135.93 | identical protein binding and metalloendopeptidase activity | Metaphyseal Anadysplasia 2 and Metaphyseal Anadysplasia |
| EPCAM | 105.76 | protein complex binding | Diarrhea 5, With Tufting Enteropathy, Congenital and Colorectal Cancer, Hereditary Nonpolyposis, Type 8 |
| MMP1 | 98.28 | calcium ion binding and metallopeptidase activity | Epidermolysis Bullosa Dystrophica, Autosomal Recessive and Recessive Dystrophic Epidermolysis Bullosa |
| MMP2 | 89.9 | serine-type endopeptidase activity and metallopeptidase activity | Multicentric Osteolysis, Nodulosis, And Arthropathy and Arthropathy |
| PLAU | 87.24 | serine-type endopeptidase activity | Quebec Platelet Disorder and Alzheimer Disease |

From Table 5 we can see that the official name of “EGFR” is “Epidermal Growth Factor Receptor”, which is related to HNSC [41]. Its relevance score with these three diseases reached 219.78. The higher the relevance score, the greater is the correlation between the gene and the three diseases.

Therefore, the correlation between the gene “EGFR” selected using TRPCA and CHOL, ESCA, and HNSC was high. The related diseases of EGFR were Inflammatory Skin And Bowel Disease, Neonatal, 2 and Lung Cancer, which are related to ESCA and CHOL [42, 43]. The official name of the gene “EPCAM” is “Epithelial Cell Adhesion Molecule”, which is related to the optimal treatment of HNSC [43]. Its relevance score with the three diseases was 105.76. Based on the table, we can also observe that the GO annotation of EPCAM is protein complex binding. EPCAM, claudin-7, CO-029, and CD44v6 expression were upregulated in COAD and liver metastasis, suggesting that high EPCAM expression is associated with COAD progression [44, 45]. EPCAM expression and release into the circulation can be an effective immunotherapy for ESCA patients [46]. The expression of EPCAM on disseminated tumor cells is significantly associated with the development of lymph node metastasis and with significantly reduced overall survival of ESCA patients [47]. Overexpression of EPCAM eventually leads to uncontrolled development of COAD, HNSC, and ESCA. A number of studies have thus shown that the selected differentially expressed genes are closely related to the diseases, which supports the proposed method for feature selection.

The relevance score measures the strength of the correlation between a selected gene and the three diseases. The larger the relevance score, the greater the correlation between the gene and the three diseases. Table 6 lists the experimental results for the COAD_HNSC_ESCA dataset, which contain the number of related genes, the mean of the relevance scores, and the highest relevance score. By entering the three diseases in the dataset into Gene Cards (https://www.genecards.org/), we can download a table containing the genes associated with the diseases and their relevance scores. The genes we identified are then compared with this table, and the common items are retained. The related number refers to the number of hits in the table among the 500 genes identified using a method. The greater the related number, the more relevant are the genes identified by the method. The average of the related scores is the mean relevance score over all related genes identified. The highest relevant score is the maximum relevance score among all related genes.
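
The three summary quantities described above can be sketched as follows. This is a minimal illustration, assuming the downloaded Gene Cards table has already been loaded into a dictionary that maps gene symbols to relevance scores; the function name `summarize_relevance` and the gene "GENE_X" are hypothetical, and the three scores are copied from Table 5.

```python
def summarize_relevance(selected_genes, relevance_scores):
    """Return (related number, average of related scores, highest score).

    selected_genes   -- gene symbols identified by a method
    relevance_scores -- dict mapping gene symbol to relevance score
    """
    # Keep only the identified genes that also appear in the score table.
    related = [gene for gene in selected_genes if gene in relevance_scores]
    if not related:
        return 0, 0.0, 0.0
    scores = [relevance_scores[gene] for gene in related]
    return len(related), sum(scores) / len(scores), max(scores)


# Toy input: "GENE_X" is a placeholder for a gene absent from the score table.
genecards = {"EGFR": 219.78, "CDH1": 182.64, "PLAU": 87.24}
selected = ["EGFR", "PLAU", "GENE_X"]
count, mean_score, top_score = summarize_relevance(selected, genecards)
```

In the toy input, two of the three selected genes appear in the score table, so the related number is 2 and the other two statistics are computed over those two genes only.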


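| | TRPCA | RPCA | LLRR | PCA | BRTF |
|---|---|---|---|---|---|
| Related number | 250 | 220 | 246 | 215 | 237 |
| Average of related scores | 29.59 | 28.36 | 29 | 27.29 | 29.41 |
| Highest relevant score | 219.78 | 219.78 | 219.78 | 182.64 | 182.64 |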

The related number for TRPCA was 250, whereas the related numbers for RPCA, LLRR, PCA, and BRTF were 220, 246, 215, and 237, respectively. Although the highest relevance score of TRPCA was the same as that of RPCA and LLRR, the mean scores of the other four methods were 28.36, 29, 27.29, and 29.41, while that of TRPCA was 29.59. Therefore, in terms of the related number, the average of the related scores, and the highest relevance score, our method performed as well as or better than the other four methods.

As the results show, the TRPCA method performs better than matrix decomposition methods such as RPCA. The validity of the proposed method indicates that our approach is reasonable for processing multiview gene expression data. The reason is that a matrix decomposition method performs matrix recovery on each gene expression dataset independently and cannot use information across views, thereby ignoring the spatial geometric information between the data. The BRTF method decomposes a tensor into a low-rank tensor, a sparse tensor, and a noise tensor. Differentially expressed genes are then scattered across the sparse tensor and the noise tensor, which in turn affects the feature selection accuracy. The TRPCA approach can take advantage of multidimensional structures to improve the performance, which better preserves the redundant information in multiview data. This provides a new perspective for studying multiview data. Therefore, TRPCA is an effective integration model that considers the intrinsic geometry of multiview data.
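
To make the role of the sparse tensor concrete, the following sketch ranks genes by their total absolute magnitude in a sparse tensor and keeps the top k. It is only an illustration under stated assumptions: the tensor layout (genes by samples by views, stored as nested lists), the scoring rule (sum of absolute values), and the function name `top_genes_from_sparse` are hypothetical and are not taken from the paper's implementation.

```python
def top_genes_from_sparse(sparse, gene_names, k):
    """Return the k genes with the largest total magnitude in the sparse tensor.

    sparse[g][s][v] is assumed to hold the sparse-component value for
    gene g, sample s, and view v.
    """
    scored = []
    for name, gene_slice in zip(gene_names, sparse):
        # Aggregate the perturbation of one gene over all samples and views.
        total = sum(abs(value) for row in gene_slice for value in row)
        scored.append((total, name))
    # Largest aggregated magnitude first; ties keep the input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy sparse tensor: 3 genes, 2 samples, 2 views.
toy_sparse = [
    [[0.0, 0.1], [0.0, 0.0]],
    [[2.0, -1.5], [0.5, 0.0]],
    [[0.0, 0.0], [-0.9, 0.3]],
]
top_two = top_genes_from_sparse(toy_sparse, ["g1", "g2", "g3"], 2)
```

Because the score of a gene sums over every view, a gene that is perturbed consistently across views ranks higher than one that stands out in a single view, which is the kind of cross-view information a per-view matrix method cannot use.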

4. Conclusions

In this paper, the TRPCA method was applied to identify differentially expressed genes. It combines the TRPCA model with the sparsity of multiview data, which provides an efficient approach to identifying genes. The approach decomposes the original tensor into a low-rank tensor and a sparse tensor, and it was compared with the RPCA, LLRR, PCA, and BRTF methods. The results show that the TRPCA method is more effective than these methods on the two multiview datasets tested. Thus, a new approach is proposed for the study of differentially expressed genes.

In the future, we will use the TRPCA method to perform detailed analyses of the intrinsic links of multiview data and continue to develop new methods to discover more differentially expressed genes.

Data Availability

The datasets that support the findings of this study are available at https://cancergenome.nih.gov/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the NSFC under grants nos. 61872220 and 61572284.

References

  1. N. You, J. Liu, and C. X. Mao, “An empirical bayesian method for detecting differentially expressed genes using EST data,” International Journal of Plant Genomics, vol. 2008, Article ID 817210, 4 pages, 2008.
  2. J. C. Liao, R. Boscolo, Y.-L. Yang, L. M. Tran, C. Sabatti, and V. P. Roychowdhury, “Network component analysis: Reconstruction of regulatory signals in biological systems,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 26, pp. 15522–15527, 2003.
  3. A. D'Addabbo, M. Papale, S. D. Paolo et al., “SVD based feature selection and sample classification of proteomic data,” in Proceedings of the International Conference on Knowledge-Based Intelligent Information and Engineering Systems, pp. 556–563, 2008.
  4. L. Van Der Maaten, “Dimensionality reduction: a comparative review,” Review Literature & Arts of the Americas, vol. 5, 2009.
  5. H. Abdi and L. J. Williams, “Principal component analysis,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.
  6. G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 35, no. 1, pp. 171–184, 2013.
  7. S. D. Babacan, M. Luessi, R. Molina, and A. K. Katsaggelos, “Sparse Bayesian methods for low-rank matrix estimation,” IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 3964–3977, 2012.
  8. C. F. Chen, C. P. Wei, and Y. C. F. Wang, “Low-rank matrix recovery with structural incoherence for robust face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2618–2625, 2012.
  9. J.-X. Liu, Y.-T. Wang, C.-H. Zheng, W. Sha, J.-X. Mi, and Y. Xu, “Robust PCA based method for discovering differentially expressed genes,” BMC Bioinformatics, vol. 14, no. S8, p. S3, 2013.
  10. L. Zhuang, H. Gao, Z. Lin, Y. Ma, X. Zhang, and N. Yu, “Non-negative low rank and sparse graph for semi-supervised learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2328–2335, 2012.
  11. I. T. Jolliffe, “Principal component analysis,” Journal of Marketing Research, vol. 87, no. 100, p. 513, 2002.
  12. C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, and S. Yan, “Tensor robust principal component analysis: exact recovery of corrupted low-rank tensors via convex optimization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16), pp. 5249–5257, July 2016.
  13. L. Chen, Y. Liu, and C. Zhu, “Robust tensor principal component analysis in all modes,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '18), pp. 1–6, San Diego, Calif, USA, July 2018.
  14. Q. Zhao, G. Zhou, L. Zhang, A. Cichocki, and S.-I. Amari, “Bayesian robust tensor factorization for incomplete multiway data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 4, pp. 736–748, 2016.
  15. A. Kumar, P. Rai, and H. Daume, “Co-Regularized multi-view spectral clustering,” in Advances in Neural Information Processing Systems, pp. 1413–1421, 2011.
  16. Z. Zhang, G. Ely, S. Aeron, N. Hao, and M. Kilmer, “Novel methods for multilinear data completion and de-noising based on tensor-SVD,” in Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14), pp. 3842–3849, June 2014.
  17. C. Lu, J. Feng, Z. Lin, and S. Yan, “Exact low tubal rank tensor recovery from gaussian measurements,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI '18), pp. 2504–2510, Stockholm, Sweden, July 2018, https://arxiv.org/abs/1806.02511.
  18. C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, and S. Yan, “Tensor robust principal component analysis with a new tensor nuclear norm,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  19. K. Braman, “Third-order tensors as linear operators on a space of matrices,” Linear Algebra and Its Applications, vol. 433, no. 7, pp. 1241–1253, 2010.
  20. H. Li, Y. Zhang, J. Wang, Y. Xu, Y. Li, and Z. Pan, “Inequality-constrained RPCA for shadow removal and foreground detection,” IEICE Transactions on Information and Systems, vol. 98, no. 6, pp. 1256–1259, 2015.
  21. A. Bittoni, F. Piva, M. Santoni et al., “KRAS mutation status is associated with specific pattern of genes expression in pancreatic adenocarcinoma,” Future Oncology, vol. 11, no. 13, pp. 1905–1917, 2015.
  22. C. J. Hillar and L.-H. Lim, “Most tensor problems are NP-hard,” Journal of the ACM, vol. 60, no. 6, p. 45, 2009.
  23. J. G. Cragg and S. G. Donald, “On the asymptotic properties of LDU-based tests of the rank of a matrix,” Journal of the American Statistical Association, vol. 91, no. 435, pp. 1301–1309, 1996.
  24. T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
  25. J. Liu, P. Musialski, P. Wonka, and J. Ye, “Tensor completion for estimating missing values in visual data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 208–220, 2013.
  26. S. Gandy, B. Recht, and I. Yamada, “Tensor completion and low-n-rank tensor recovery via convex optimization,” Inverse Problems, vol. 27, no. 2, Article ID 025010, 2011.
  27. R. Tomioka, K. Hayashi, and H. Kashima, “Estimation of low-rank tensors via convex optimization,” https://arxiv.org/abs/1010.0789, 2010.
  28. M. Signoretto, Q. Tran Dinh, L. de Lathauwer, and J. A. K. Suykens, “Learning with tensors: a framework based on convex optimization and spectral regularization,” Machine Learning, vol. 94, no. 3, pp. 303–351, 2014.
  29. B. Romera-Paredes and M. Pontil, “A new convex relaxation for tensor completion,” Mathematics, pp. 2967–2975, 2013.
  30. R. Tomioka and T. Suzuki, “Convex tensor decomposition via structured schatten norm regularization,” Advances in Neural Information Processing Systems, pp. 1331–1339, 2013.
  31. S. Huang, Y. Yeh, and S. Eguchi, “Robust principal component analysis,” Journal of the ACM, vol. 58, no. 3, pp. 1–37, 2009.
  32. E. I. Boyle, S. Weng, J. Gollub et al., “GO::TermFinder--open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes,” Bioinformatics, vol. 20, no. 18, pp. 3710–3715, 2004.
  33. H. C. Kim, J. M. D. Wheeler, J. C. Kim et al., “The E-cadherin gene (CDH1) variants T340A and L599V in gastric and colorectal cancer patients in Korea,” Gut, vol. 47, no. 2, pp. 262–267, 2000.
  34. S. Govatati, G. K. Singamsetty, N. Nallabelli et al., “Contribution of cyclin D1 (CCND1) and E-cadherin (CDH1) alterations to colorectal cancer susceptibility: a case–control study,” Tumor Biology, vol. 35, no. 12, pp. 12059–12067, 2014.
  35. F. Jiao, H. Hu, T. Han et al., “Aberrant expression of nuclear HDAC3 and cytoplasmic CDH1 predict a poor prognosis for patients with pancreatic cancer,” Oncotarget, vol. 7, no. 13, pp. 16505–16516, 2016.
  36. Z. Shen, C. Zhou, J. Li, H. Deng, Q. Li, and J. Wang, “The association, clinicopathological significance, and diagnostic value of CDH1 promoter methylation in head and neck squamous cell carcinoma: a meta-analysis of 23 studies,” Oncotargets & Therapy, vol. 9, pp. 6763–6773, 2016.
  37. F. Riedel, K. Gotte, J. Schwalb, and K. Hormann, “Serum levels of matrix metalloproteinase-2 and -9 in patients with head and neck squamous cell carcinoma,” Anticancer Research, vol. 20, no. 5, pp. 3045–3049, 2000.
  38. H. Ruokolainen, P. Pääkkö, and T. Turpeenniemi-Hujanen, “Expression of matrix metalloproteinase-9 in head and neck squamous cell carcinoma: a potential marker for prognosis,” Clinical Cancer Research, vol. 10, no. 9, pp. 3110–3116, 2004.
  39. B. Grunwald, J. Vandooren, M. Gerg et al., “Systemic ablation of MMP-9 triggers invasive growth and metastasis of pancreatic cancer via deregulation of IL6 expression in the bone marrow,” Molecular Cancer Research, vol. 14, no. 11, pp. 1147–1158, 2016.
  40. D. C. Marshall, S. K. Lyman, S. McCauley et al., “Selective allosteric inhibition of MMP9 is efficacious in preclinical models of ulcerative colitis and colorectal cancer,” PLoS ONE, vol. 10, no. 5, Article ID e0127063, 2015.
  41. J. R. Grandis, M. F. Melhem, W. E. Gooding et al., “Levels of TGF-α and EGFR protein in head and neck squamous cell carcinoma and patient survival,” Journal of the National Cancer Institute, vol. 90, no. 11, pp. 824–832, 1998.
  42. D. S. Ettinger, “Clinical implications of EGFR expression in the development and progression of solid tumors: Focus on non-small cell lung cancer,” The Oncologist, vol. 11, no. 4, pp. 358–373, 2006.
  43. D. Yoshikawa, H. Ojima, M. Iwasaki et al., “Clinicopathological and prognostic significance of EGFR, VEGF, and HER2 expression in cholangiocarcinoma,” British Journal of Cancer, vol. 98, no. 2, pp. 418–425, 2008.
  44. S. Kuhn, M. Koch, T. Nübel et al., “A complex of EpCAM, claudin-7, CD44 variant isoforms, and tetraspanins promotes colorectal cancer progression,” Molecular Cancer Research, vol. 5, no. 6, pp. 553–567, 2007.
  45. A. Lugli, G. Iezzi, I. Hostettler et al., “Prognostic impact of the expression of putative cancer stem cell markers CD133, CD166, CD44s, EpCAM, and ALDH1 in colorectal cancer,” British Journal of Cancer, vol. 103, no. 3, pp. 382–390, 2010.
  46. H. Kimura, H. Kato, A. Faried et al., “Prognostic significance of EpCAM expression in human esophageal cancer,” International Journal of Oncology, vol. 30, no. 1, pp. 171–179, 2007.
  47. C. Driemel, H. Kremling, S. Schumacher et al., “Context-dependent adaption of EpCAM expression in early systemic esophageal cancer,” Oncogene, vol. 33, no. 41, pp. 4904–4915, 2013.

Copyright © 2019 Yue Hu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
