Abstract

Accurate tumor classification is crucial to the proper treatment of cancer. To date, sparse representation (SR) has shown strong performance for tumor classification. This paper proposes a new SR-based method for tumor classification using gene expression data. In the proposed method, we first use latent low-rank representation to extract salient features and remove noise from the original sample data. Then we use a sparse representation classifier (SRC) to build the tumor classification model. The experimental results on several real-world data sets show that our method is more efficient and more effective than previous classification methods, including SVM, SRC, and LASSO.

1. Introduction

A tumor is a solid lesion caused by the abnormal growth of cells. Timely and accurate treatment is clinically vital. Because of the heterogeneity of cancer, the premise of an accurate treatment is an exact diagnosis; that is, we need to classify tumors accurately before treating them. Current methods for classifying cancer malignancies mostly rely on a variety of morphological, clinical, or molecular variables. Despite recent progress, there are still many uncertainties in diagnosis. The advent of DNA microarray and RNA-seq [1] makes it possible to analyze tumor samples and classify them based on gene expression profiles. Moreover, we can obtain the expression data of tens of thousands of genes simultaneously through DNA microarray or RNA-seq.

Many methods for molecular data classification or clustering based on gene expression data have appeared in this area [2–14]. Huang and Zheng used independent component analysis [5] to extract features; Gao and Church introduced sparse nonnegative matrix factorization for feature extraction [4]; Zheng et al. proposed metasample-based sparse representation [7]; and Furey et al. used support vector machines [8] to classify gene expression data. All these methods have achieved impressive classification performance.

The recently published sparse representation classification (SRC) is also a powerful tool for processing gene expression data. The SRC method was inspired by several theories, such as basis pursuit [15], compressive sensing for signal reconstruction [16], and least absolute shrinkage. It has already been widely used in face recognition [17] and texture classification [18]. In the SRC method, a test sample should be representable as a sparse linear combination of only the training samples from its own class. To this end, an $\ell_1$-regularized least squares optimization is used to calculate an SR coefficient vector with only a few significant coefficients. In theory, a test sample can be well represented using only the training samples from the same class. However, gene expression data contain substantial noise, which obscures the discriminative features and allows a test sample to also be represented by training samples from other classes. This decreases classification accuracy. To reduce noise [19–21] and obtain salient features [20] for tumor classification, in this paper, we introduce latent low-rank representation to preprocess gene expression data. By combining it with the SRC algorithm, we propose a new method for tumor classification.

Latent low-rank representation (LatLRR) is a theory for extracting principal and salient features from original data. LatLRR is an improved version of LRR. Both methods can be solved by inexact augmented Lagrange multiplier (ALM) optimization. In [19–22], LRR has been successfully used for the recovery of subspace structure, subspace segmentation, feature extraction, outlier detection, and so forth. In [23], the authors introduced LRR theory into face recognition in order to remove noise and achieved impressive results. Based on these successful applications, in this paper, we introduce LatLRR into the sparse representation classifier for tumor classification. First, we use LatLRR to remove noise from the original data and extract salient features. Then, based on the newly extracted salient features, we design a sparse representation classifier to classify new test samples. We refer to the proposed method as SRC based on latent low-rank representation (SRC-LatLRR).

The rest of the paper is organized as follows. Section 2 describes our proposed SRC-LatLRR method in detail. We firstly review SRC and latent low-rank representation methods in Sections 2.1 and 2.2, respectively. Then we present our method in detail in Section 2.3. Section 2.4 specifies our experimental setting. In Section 3, we evaluate our method using several publicly available gene expression data sets. Section 4 concludes the paper and outlines our future work.

The abbreviations used in this paper are summarized in the Abbreviations section.

2. Methods

2.1. Sparse Representation Classification

Sparse representation classification is a supervised classification method. Let $A = [a_1, a_2, \ldots, a_n] \in \mathbb{R}^{m \times n}$ denote a training sample matrix with $n$ samples and $m$ genes. As we know, each DNA microarray chip usually contains thousands of genes, so the number of genes is much larger than the number of tumor samples; that is, $m \gg n$.

Let $a_i$ be the $i$th sample of $A$ and suppose the samples are divided into $c$ object classes. Assuming that there are $n_k$ samples belonging to the $k$th class and making up $A_k = [a_{k,1}, a_{k,2}, \ldots, a_{k,n_k}]$, the whole data set can be reexpressed as $A = [A_1, A_2, \ldots, A_c]$. Suppose that a new testing sample $y \in \mathbb{R}^m$ belongs to the $k$th class. Based on the theory of sparse representation, $y$ would lie in the linear span of the training samples $A_k$; that is,

$$y = \alpha_{k,1} a_{k,1} + \alpha_{k,2} a_{k,2} + \cdots + \alpha_{k,n_k} a_{k,n_k}, \quad (1)$$

where each $\alpha_{k,j}$ is a scalar and $j = 1, 2, \ldots, n_k$.

Introducing a linear representation coefficient vector $x \in \mathbb{R}^n$, (1) can also be rewritten as

$$y = Ax. \quad (2)$$

Ideally, if the training samples are sufficient and the training sample sets belonging to different classes are disjoint, then we have

$$x = [0, \ldots, 0, \alpha_{k,1}, \alpha_{k,2}, \ldots, \alpha_{k,n_k}, 0, \ldots, 0]^T; \quad (3)$$

that is, in $x$, only the entries corresponding to the same class as $y$ are nonzero.

From the above analysis, it can be seen that we can classify the test sample $y$ according to $x$. So the key problem is how to calculate $x$ in (2). As in [7], $x$ should be sparse when the number of object classes is large; this is what sparse representation implies. According to the theory of compressive sensing [16, 24–26] and SR, $x$ can be obtained by solving the following $\ell_1$-minimization problem:

$$\hat{x} = \arg\min_x \|x\|_1 \quad \text{s.t.} \quad y = Ax. \quad (4)$$

This problem can be solved by standard linear programming methods [15]. But (4) has no exact solution since $m \gg n$ and the observations are noisy. A generalized version of (4) can therefore be conceived:

$$\hat{x} = \arg\min_x \|Ax - y\|_2^2 + \lambda \|x\|_1, \quad (5)$$

where $\lambda > 0$ is a scalar regularization parameter that balances the fit against the degree of noise. In this study, we solve this problem with the truncated Newton interior-point method [27].
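For readers who want to experiment, the following is a minimal Python sketch of one standard way to solve (5). It uses ISTA (iterative soft-thresholding) rather than the truncated Newton interior-point solver of [27]; the function name, iteration count, and default $\lambda$ are our own illustrative choices, not the paper's.

```python
import numpy as np

def ista_l1(A, y, lam=0.01, n_iter=500):
    """Solve min_x ||Ax - y||_2^2 + lam * ||x||_1 via ISTA,
    a proximal-gradient stand-in for the interior-point solver in [27]."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - y)           # gradient of the smooth term
        z = x - grad / L                         # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return x
```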

2.2. Latent Low-Rank Representation

Latent low-rank representation is an extension of low-rank representation. Consider an observed data matrix $X = [x_1, x_2, \ldots, x_n]$, where each column vector $x_i$ is a sample, and a dictionary $D = [d_1, d_2, \ldots, d_p]$, where each $d_j$ is also a sample. $X$ can be linearly represented by the dictionary; that is,

$$X = DZ, \quad (6)$$

where $Z = [z_1, z_2, \ldots, z_n]$ is a coefficient matrix and each $z_i$ is the representation of $x_i$. Equation (6) means that each column vector of $X$ can be represented by a linear combination of the bases in $D$. In (6), the dictionary $D$ should be overcomplete enough to represent any observed data matrix $X$, but this in turn admits multiple feasible solutions $Z$ to (6). To single out the optimal solution, a low-rankness criterion is imposed on (6):

$$\min_Z \; \text{rank}(Z) \quad \text{s.t.} \quad X = DZ. \quad (7)$$

Here, the optimal solution $Z^*$ is the so-called lowest-rank representation of the data $X$ with respect to the dictionary $D$. Unfortunately, problem (7) is not easy to solve because of the discrete nature of the rank function. Following matrix completion methods [28–30], we replace the rank with the nuclear norm [31]; problem (7) can then be re-expressed as

$$\min_Z \; \|Z\|_* \quad \text{s.t.} \quad X = DZ, \quad (8)$$

where $\|Z\|_*$ denotes the nuclear norm of matrix $Z$, that is, the sum of the singular values of $Z$.
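As a quick illustration (ours, not the paper's), the nuclear norm is directly computable from a singular value decomposition:

```python
import numpy as np

def nuclear_norm(M):
    """Sum of the singular values of M, the convex surrogate for rank(M)."""
    return np.linalg.svd(M, compute_uv=False).sum()

M = np.random.randn(6, 3) @ np.random.randn(3, 8)   # a rank-3 matrix
print(np.linalg.matrix_rank(M), nuclear_norm(M))    # rank 3 and its relaxation
```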

Strictly speaking, the dictionary should be overcomplete and noiseless, but such a dictionary is difficult to obtain. In practice, we usually use the observed data matrix $X$ itself as the dictionary [19, 21, 32]. Finally we have the following convex optimization problem:

$$\min_Z \; \|Z\|_* \quad \text{s.t.} \quad X = XZ. \quad (9)$$

For (9) to work well, two conditions need to be met: first, the data sampling should be sufficient; second, the sampled data should contain enough noiseless samples to guarantee robustness. In practice, the first condition is easily met but the second is not. Because gene expression data are usually noisy, in reality, problem (9) may be invalid and not robust.

To address the shortcomings of (9), we introduce the following LRR problem [20]:

$$\min_Z \; \|Z\|_* \quad \text{s.t.} \quad X_O = [X_O, X_H] Z, \quad (10)$$

where $X_O$ is the observed data matrix and $X_H$ is the unobserved data, that is, the hidden data. We use the concatenation of $X_O$ and $X_H$ as the dictionary. The optimal solution of (10) is $Z^* = [Z^*_{O|O}; Z^*_{H|O}]$, where $Z^*_{O|O}$ and $Z^*_{H|O}$ correspond to $X_O$ and $X_H$, respectively.

By solving (10), the two conditions above can be satisfied. Our next task is then to recover the affinity matrix by using only the observed data $X_O$, in the absence of the hidden data $X_H$. The resulting method is called latent low-rank representation (LatLRR), an improvement over LRR.

Supposing we have two matrices $V_O$ and $V_H$, then by solving (10) we have the following equations:

$$Z^*_{O|O} = V_O V_O^T, \qquad Z^*_{H|O} = V_H V_O^T, \quad (11)$$

where $V_O$ and $V_H$ can be obtained by computing the skinny singular value decomposition of the dictionary, $[X_O, X_H] = U \Sigma V^T$, and partitioning $V$ into the row blocks corresponding to $X_O$ and $X_H$; namely, $V = [V_O; V_H]$.
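The closed-form solution (11) is easy to verify numerically. The sketch below (our illustration) draws observed and hidden samples from a shared low-rank subspace and checks that the constraint of (10) holds with $Z^*_{O|O} = V_O V_O^T$ and $Z^*_{H|O} = V_H V_O^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, n_o, n_h = 50, 5, 20, 30
basis = rng.standard_normal((m, r))               # shared rank-r subspace
X_O = basis @ rng.standard_normal((r, n_o))       # observed data
X_H = basis @ rng.standard_normal((r, n_h))       # hidden data

U, s, Vt = np.linalg.svd(np.hstack([X_O, X_H]), full_matrices=False)
V = Vt[:r].T                                      # skinny SVD: keep the rank-r part
V_O, V_H = V[:n_o], V[n_o:]                       # row blocks matching X_O and X_H

# Check X_O = X_O Z*_{O|O} + X_H Z*_{H|O} from (10)-(11).
recon = X_O @ (V_O @ V_O.T) + X_H @ (V_H @ V_O.T)
print(np.allclose(X_O, recon))                    # True
```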

From (11), we have

$$X_O = X_O Z^*_{O|O} + X_H Z^*_{H|O} = X_O V_O V_O^T + X_H V_H V_O^T. \quad (12)$$

Let $L X_O = X_H V_H V_O^T$; then we have the following simple form:

$$X_O = X_O Z + L X_O. \quad (13)$$

If $X_O$ and $X_H$ come from the same collection of low-rank subspaces, then both $Z$ and $L$ should be of low rank, so we arrive at

$$\min_{Z, L} \; \text{rank}(Z) + \text{rank}(L) \quad \text{s.t.} \quad X_O = X_O Z + L X_O. \quad (14)$$

Just as in [28–30], we also relax the above rank minimization problem to the nuclear norm. Then we have the following convex optimization problem:

$$\min_{Z, L} \; \|Z\|_* + \|L\|_* \quad \text{s.t.} \quad X = XZ + LX. \quad (15)$$

Here, we replace $X_O$, $Z^*_{O|O}$, and $L^*$ with $X$, $Z$, and $L$, respectively, for ease of representation. In (15), $X$ is the noiseless observed data. Considering that there may exist corrupted data or noise in $X$, we also need to introduce a denoising version of (15); then we have

$$\min_{Z, L, E} \; \|Z\|_* + \|L\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = XZ + LX + E, \quad (16)$$

where $\lambda > 0$ is a scalar and $\|E\|_1$ is the $\ell_1$-norm of the sparse noise matrix $E$. If $E = 0$, problem (16) is equivalent to (15); that is, there is no noise in the observed data $X$. In (16), the optimal solutions $XZ^*$, $L^*X$, and $E^*$ represent the principal features, the salient features, and the noise, respectively.

To solve the LatLRR problem in (16), we introduce the augmented Lagrange multiplier (ALM) [33] method. With auxiliary variables $J$ and $S$, (16) is revised as follows to meet the requirements of the ALM algorithm:

$$\min_{Z, L, J, S, E} \; \|J\|_* + \|S\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = XZ + LX + E, \; Z = J, \; L = S. \quad (17)$$

This problem can be solved by the ALM method, which minimizes the following augmented Lagrange function:

$$\begin{aligned} \mathcal{L} = {} & \|J\|_* + \|S\|_* + \lambda \|E\|_1 + \operatorname{tr}\!\left[Y_1^T (X - XZ - LX - E)\right] + \operatorname{tr}\!\left[Y_2^T (Z - J)\right] + \operatorname{tr}\!\left[Y_3^T (L - S)\right] \\ & + \frac{\mu}{2} \left( \|X - XZ - LX - E\|_F^2 + \|Z - J\|_F^2 + \|L - S\|_F^2 \right), \end{aligned} \quad (18)$$

where $\operatorname{tr}[\cdot]$ and $\|\cdot\|_F$ denote the trace and the Frobenius norm of a matrix, respectively, $Y_1$, $Y_2$, and $Y_3$ are Lagrange multipliers, and $\mu > 0$ is a penalty parameter. More details about (18) can be found in [33].
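The paper defers the solver details to [33]. As a concrete starting point, here is a hedged numpy sketch of an inexact ALM iteration for (16), with alternating closed-form updates for each block of (18); the schedule parameters (rho, mu_max, tol, max_iter) are common defaults from LRR/LatLRR implementations, not the authors' settings. Note that $L$ and $S$ are $m \times m$, so for full-resolution expression data this should be run after dimensionality reduction.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def shrink(M, tau):
    """Entrywise soft-thresholding: proximal operator of tau * l1-norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def latlrr(X, lam=0.1, rho=1.1, mu=1e-4, mu_max=1e6, tol=1e-6, max_iter=500):
    """Inexact ALM sketch for (16):
    min ||Z||_* + ||L||_* + lam*||E||_1  s.t.  X = XZ + LX + E."""
    m, n = X.shape
    Z, J, Y2 = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
    L, S, Y3 = np.zeros((m, m)), np.zeros((m, m)), np.zeros((m, m))
    E, Y1 = np.zeros((m, n)), np.zeros((m, n))
    XtX, XXt = X.T @ X, X @ X.T
    In, Im = np.eye(n), np.eye(m)
    for _ in range(max_iter):
        J = svt(Z + Y2 / mu, 1.0 / mu)           # nuclear-norm prox for J
        S = svt(L + Y3 / mu, 1.0 / mu)           # nuclear-norm prox for S
        # Closed-form least-squares updates for Z and L.
        Z = np.linalg.solve(In + XtX,
                            X.T @ (X - L @ X - E) + J + (X.T @ Y1 - Y2) / mu)
        L_rhs = (X - X @ Z - E) @ X.T + S + (Y1 @ X.T - Y3) / mu
        L = np.linalg.solve(Im + XXt, L_rhs.T).T
        E = shrink(X - X @ Z - L @ X + Y1 / mu, lam / mu)  # sparse-noise update
        R = X - X @ Z - L @ X - E                # constraint residual
        Y1 = Y1 + mu * R                         # Lagrange multiplier updates
        Y2 = Y2 + mu * (Z - J)
        Y3 = Y3 + mu * (L - S)
        mu = min(rho * mu, mu_max)               # penalty schedule
        if max(np.abs(R).max(), np.abs(Z - J).max(), np.abs(L - S).max()) < tol:
            break
    return Z, L, E
```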

2.3. Sparse Representation Classification Based on LatLRR

Since LatLRR can extract the salient features and remove noise from original data sets, in this study, before using observed data for classification, we firstly use LatLRR to suppress noise and get the salient features. Then we use the denoised data for tumor classification; that is, we factorize the observed data into

$$X = XZ^* + L^*X + E^*. \quad (19)$$

Here, we only use the salient features $\tilde{X} = L^*X$ for data classification. For a test sample $y$, we can calculate its SR by the following function:

$$\hat{x} = \arg\min_x \|\tilde{X}x - \tilde{y}\|_2^2 + \lambda \|x\|_1, \quad (20)$$

where the parameter $\lambda$ can be determined experimentally and $x$ is a coefficient vector. We assume that the test sample $y$ belongs to one of the $c$ target classes and that the training data set is sufficient. When classifying $y$, we introduce $\tilde{y} = L^* y$, where $L^*$ is the square matrix obtained through the LatLRR method when extracting the salient features.

Ideally, $\tilde{y}$ can be linearly represented by the samples of the same class in $\tilde{X}$. Namely, the representation vector $\hat{x}$ should be sparse, with the nonzero entries associated with the columns of $\tilde{X}$ from the same class. This leads us to classify the test samples. However, noise and modeling errors may also introduce into $\hat{x}$ some nonzero entries that correspond to columns of $\tilde{X}$ from multiple classes [17]. To solve this problem, we classify $y$ based on how well it can be reconstructed using the coefficients from each class, as in [17].

Using the result of (20), we construct $\delta_i(\hat{x})$ as the characteristic function that selects the coefficients associated with the $i$th class in the coefficient vector $\hat{x}$. By using only the $i$th-class coefficients to reconstruct the test sample as $\tilde{X}\delta_i(\hat{x})$, we can assign $y$ to the class with the minimum residual between $\tilde{y}$ and $\tilde{X}\delta_i(\hat{x})$; that is,

$$\text{identity}(y) = \arg\min_i \; r_i(y), \qquad r_i(y) = \|\tilde{y} - \tilde{X}\delta_i(\hat{x})\|_2. \quad (21)$$

Our classification algorithm can be summarized as follows.

Input. Observed data $X$ for $c$ classes; test sample $y$.
Step 1. Normalize the columns of $X$.
Step 2. Extract the salient features of $X$ and remove noise to some extent to obtain $\tilde{X} = L^*X$ as defined in (19).
Step 3. Solve the optimization problem defined in (20).
Step 4. Compute the residuals $r_i(y)$, $i = 1, 2, \ldots, c$.
Output. $\text{identity}(y) = \arg\min_i r_i(y)$.
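Putting the four steps together, the following sketch shows one plausible end-to-end implementation; it reuses the hypothetical latlrr() and ista_l1() helpers from the earlier sketches, and all names and defaults are illustrative rather than the authors' code.

```python
import numpy as np

def src_latlrr_classify(X, labels, y, sr_lam=0.01, lrr_lam=0.1):
    """SRC-LatLRR sketch: X is genes x samples, labels (np.ndarray)
    gives the class of each column, y is a single test sample."""
    X = X / np.linalg.norm(X, axis=0, keepdims=True)  # Step 1: normalize columns
    _, L_star, _ = latlrr(X, lam=lrr_lam)             # Step 2: salient projection L*
    X_t, y_t = L_star @ X, L_star @ y                 # salient features, cf. (19)
    x_hat = ista_l1(X_t, y_t, lam=sr_lam)             # Step 3: sparse coding (20)
    classes = np.unique(labels)
    residuals = []
    for c in classes:                                 # Step 4: residuals (21)
        delta = np.where(labels == c, x_hat, 0.0)     # keep only class-c coefficients
        residuals.append(np.linalg.norm(y_t - X_t @ delta))
    return classes[int(np.argmin(residuals))]         # minimum-residual class
```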

Our method can be seen as a combination of SRC [17] and latent low-rank representation for feature extraction [20], so we name it SRC-LatLRR. In SRC, the test sample is represented as a sparse linear combination of the training samples from the same class. In LatLRR, noise is removed to some extent while salient features are simultaneously extracted from the training samples. The introduction of LatLRR can therefore improve the classification accuracy of SRC.

2.4. Evaluation of the Performance

To evaluate our proposed method, we compare it with SRC [17, 34], LASSO [35], and SVM [8, 36, 37]. SVM has proved to be one of the best classifiers for data of “high dimensionality and small sample size” [36, 37]. We perform binary classification and multiclass classification experiments in Sections 3.1 and 3.2, respectively. In the experiments, the best results of SRC, LASSO, and SVM, obtained by choosing appropriate parameters experimentally, are compared with those of our method. As the number of tumor samples is small, we use stratified 10-fold cross validation in all our experiments. In the multiclass classification experiments, we do not use the LASSO method because it is designed only for binary classification problems [35]. As we know, dimensionality reduction can improve classification performance and computing speed, so we reduce data dimensionality using the between-category to within-category sum of squares (BW) method in our experiments.
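The text does not spell the BW criterion out; the sketch below implements the usual between- to within-category sum of squares ratio per gene (as in Dudoit et al.), which we assume is what is meant, with an illustrative top-k selection:

```python
import numpy as np

def bw_ratio(X, labels):
    """Between- to within-category sum of squares per gene;
    X is samples x genes, labels is a 1-D class array."""
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / (within + 1e-12)   # guard against constant genes

def select_genes(X, labels, k=200):     # k is illustrative, not the paper's value
    """Indices of the k genes with the largest BW ratio."""
    return np.argsort(bw_ratio(X, labels))[::-1][:k]
```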

3. Experimental Results

3.1. Two-Class Classification Problem

In this subsection, three two-class microarray data sets are used to evaluate our method: colon cancer [38], prostate cancer [39], and diffuse large B-cell lymphoma (DLBCL) [40].

The colon data set contains 62 samples: 40 tumor and 22 normal. The prostate data set contains prostate tumor and normal prostate samples, each consisting of the expression levels of 12600 genes. For the DLBCL data set, the gene expression values were measured by high-density oligonucleotide microarrays. An overview of the three data sets is given in Table 1.

The classification results of SVM, LASSO, SRC, and the proposed SRC-LatLRR are listed in Table 2. From Table 2, we can see that our SRC-LatLRR method performs well on all three data sets. Although SRC-LatLRR does not outperform SRC on the prostate cancer data set, it is still better than SVM and LASSO there. In summary, SRC has an advantage on the prostate cancer and DLBCL data sets, while SRC-LatLRR is the best classifier on the colon cancer and DLBCL data sets.

To further evaluate our method, in this experiment, we also introduce BW feature selection into our method to classify these three data sets. The results are listed in Table 3, where the number of genes selected is given in parentheses after each data set name. From Table 3, we can see that, after feature selection, our proposed classification method outperforms the other three classification methods, and it even achieves an accuracy of 100% on the DLBCL data set.

3.2. Multiclass Classification Problem

In this subsection, we use four multiclass data sets to further check the classification performance of SRC-LatLRR. The four data sets are lung cancer [41], leukemia [42], 11_tumors [43], and 9_tumors [44].

In the lung cancer data set, there are four classes of lung cancer plus a normal class; this data set contains 203 samples. In the leukemia data set, all samples are classified into acute myelogenous leukemia, acute lymphoblastic leukemia, or mixed-lineage leukemia; the data set includes 72 samples with 11225 genes. In 11_tumors, there are 11 classes of samples: ovary, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, prostate, pancreas, adeno lung, and squamous lung; this data set includes 174 samples. The 9_tumors data set contains 60 samples with 5726 genes; the 9 tumor types are non-small-cell lung, colon, breast, ovarian, leukemia, renal, melanoma, prostate, and central nervous system. Detailed descriptions of these four data sets are listed in Table 4. All four data sets were produced by oligonucleotide microarrays and the Affymetrix GENECHIP analysis tool [36].

The experimental results are listed in Table 5. From these results, we can see that the proposed SRC-LatLRR method does not have a clear advantage over SVM and SRC. The reason may be that, in these data sets, each class has very few training samples, so the sample space is incomplete.

We then introduced BW feature selection before applying our method. The results are listed in Table 6. From the results we can see that the proposed method classified leukemia well. On the other data sets it has no clear advantage, but it still performed better than SRC on all four data sets.

3.3. The Choice of the Balance Parameter $\lambda$

In this section, we use the data sets described in Section 3.1 to examine how $\lambda$ in (16) affects the classification performance. We show the accuracies and the removed noise level of our method at different values of $\lambda$ in Figures 1, 2, and 3 for the colon, prostate, and DLBCL data sets, respectively. From (16), we know that the smaller $\lambda$ is, the more noise is removed. In these three figures we use $\|E^*\|_1$ to represent the level of the removed noise. From these figures we can see that the noise removed from the original data cannot be too much, or the accuracy will drop: if $\lambda$ is set too small, useful information may be removed along with the noise. On the contrary, if $\lambda$ is too big, too little noise is removed, and we still cannot get a good classification result. The experiments suggest a different best choice of $\lambda$ for each of the colon, prostate, and DLBCL data sets (see Figures 1–3).
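A study like this can be scripted as below; the candidate $\lambda$ values, the CV seed, and the helper functions (latlrr() and src_latlrr_classify() from the earlier sketches) are all illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def sweep_lambda(X, labels, lams=(0.01, 0.05, 0.1, 0.5, 1.0)):
    """Trace stratified 10-fold accuracy and the removed-noise
    level ||E*||_1 against the LatLRR lambda of (16)."""
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
    for lam in lams:
        _, _, E = latlrr(Xn, lam=lam)
        noise = np.abs(E).sum()                       # ||E*||_1, removed noise level
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        correct = 0
        for train, test in skf.split(X.T, labels):    # X is genes x samples
            for i in test:
                pred = src_latlrr_classify(X[:, train], labels[train],
                                           X[:, i], lrr_lam=lam)
                correct += int(pred == labels[i])
        print(f"lambda={lam}: accuracy={correct / len(labels):.3f}, "
              f"noise={noise:.1f}")
```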

4. Conclusions

For gene expression data, cancer diagnosis is one of the most important clinical applications. In this paper, we have proposed a new SR-based method for tumor classification that uses the noiseless salient features extracted from the original samples to classify a test sample. We compared our method with several state-of-the-art methods, including SVM, LASSO, and SRC, on seven data sets. The experimental results show that the proposed method generally outperforms SVM, LASSO, and SRC, demonstrating that SRC-LatLRR is effective and efficient for tumor classification. We also introduced gene selection into our method; the results show that gene selection can improve the classification accuracy to some extent.

During the study, we also found that the optimal $Z^*$ of LatLRR on the observed samples represents an affinity matrix of the samples [21]. In theory, this affinity matrix can be used to cluster samples. In the future, we will extend our method to investigate the properties of sample clusters.

Abbreviations

SR: Sparse representation
SRC: Sparse representation classification
LRR: Low-rank representation
LatLRR: Latent low-rank representation
ALM: Augmented Lagrange multiplier
SVM: Support vector machines
LASSO: Least absolute shrinkage and selection operator
BW: Between-category to within-category sum of squares.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Science Foundation of China under Grant nos. 61272339, 61271098, and 61374181, by the Natural Science Foundation of Anhui Province under Grant no. 1308085MF85, and by the Key Project of the Anhui Educational Committee under Grant no. KJ2012A005.