Abstract

Personalized drug design requires classifying cancer patients as accurately as possible. With advances in genome sequencing and microarray technology, a large amount of gene expression data has been and will continue to be produced from various cancer patients. Such cancer-related gene expression data allows us to classify tumors at the genome-wide level. However, cancer-related gene expression datasets typically contain many more genes (features) than samples (patients), which poses a challenge for tumor classification. In this paper, a new method is proposed for cancer diagnosis using gene expression data by casting the classification problem as finding sparse representations of test samples with respect to training samples. The sparse representation is computed by the $\ell_1$-regularized least squares method. To investigate its performance, the proposed method is applied to six tumor gene expression datasets and compared with various support vector machine (SVM) methods. The experimental results show that the performance of the proposed method is comparable with or better than that of SVMs. In addition, the proposed method is more efficient than SVMs as it requires no model selection.

1. Introduction

The treatment of cancer greatly depends on the accurate classification of tumors. Despite its effectiveness in classifying tumors by microscopic tissue examination, the traditional histopathological approach fails to classify many cancer cases; the number of unclassified cases can reach 40 000 per year in the United States alone [1]. DNA microarray technology, on the other hand, has the potential to provide a more accurate and objective cancer diagnosis due to its high-throughput capability of measuring the expression levels of tens of thousands of genes simultaneously. Since Golub et al. [2] successfully discriminated between acute myeloid leukemia (AML) and acute lymphocytic leukemia (ALL), many other types of cancer have been classified using gene expression data, including breast cancer [3], lymphoma [4], lung cancer [5], bladder cancer [6], colon cancer [7], ovarian cancer [8], prostate cancer [9], melanoma [10], and brain tumors [11].

The successful application of microarray technology to cancer diagnosis greatly depends on the careful design of two important components of a gene expression classification system, shown in Figure 1: gene selection and sample classification. Gene selection mainly serves two purposes: (i) to reduce dramatically the number of genes used in classification and thus manage the "curse of dimensionality" and (ii) to select biologically relevant genes, allowing further biological exploration that may lead to a better understanding of the molecular mechanisms underlying tumorigenesis and progression. Gene selection can be performed using test statistics [12]. An excellent review of gene selection methods can be found in [13].

The second component, sample classification, is challenging because the problem has a small number of learning samples yet a large number of features (genes). The number of samples available for analysis typically ranges from tens to hundreds. Many established methods have been proposed to address this challenge. According to Lee et al. [14], they can be classified into four categories: (i) classical methods such as Fisher's linear discriminant analysis, logistic regression, K-nearest neighbor, and generalized partial least squares, (ii) classification trees and aggregation methods such as CART, random forest, bagging, and boosting, (iii) machine learning methods such as neural networks and support vector machines (SVMs), and (iv) generalized methods such as flexible discriminant analysis, mixture discriminant analysis, and the shrunken centroid method.

In this paper, we propose a novel approach for classification, called sparse representation, inspired by recent progress in $\ell_1$-norm minimization-based methods such as basis pursuit denoising [15], compressive sensing for sparse signal reconstruction [16–18], and the Lasso algorithm for feature selection [19]. Ideally, a testing sample can be represented just in terms of the training samples of the same category. Hence, when the testing sample is expressed as a linear combination of all the training samples, the coefficient vector is sparse, that is, the vector has relatively few nonzero coefficients. Testing samples of the same category will have similar sparse representations, while samples of different categories will result in different sparse representations. To recover the sparse coefficient vector, $\ell_1$-regularized least squares [20] is used.

Unlike general supervised learning methods, where a training procedure is used to create a classification model for testing, the sparse representation approach does not contain separate training and testing stages. Instead, classification is achieved directly from the testing sample's sparse representation in terms of the training samples. Another unique feature of the new method is that no model selection is needed. It is well known that the performance of a classifier such as SVM relies on the careful choice of model parameters via a model selection procedure.

2. Materials and Methods

2.1. Sparse Representation

Consider a training dataset $\{(x_i, l_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ represents the $i$th sample, a $d$-dimensional column vector containing gene expression values with $d$ as the number of genes, and $l_i \in \{1, 2, \ldots, N\}$ is the label of the $i$th sample with $N$ as the number of categories. For a testing sample $y \in \mathbb{R}^d$, the problem of sparse representation is to find a column vector $c = [c_1, c_2, \ldots, c_n]^T$ such that

$$y = \sum_{i=1}^{n} c_i x_i \tag{1}$$

and $\|c\|_0$ is minimized, where $\|\cdot\|_0$ denotes the $\ell_0$-norm, which is equivalent to the number of nonzero components in the vector $c$.

Defining a matrix $A = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$ by putting $x_i$ as the $i$th column, the problem of sparse representation can be converted into

$$\min_{c} \|c\|_0 \quad \text{subject to} \quad y = Ac. \tag{2}$$

Finding the solution to the sparse representation problem (2) is NP-hard due to its combinatorial nature. An approximate solution can be obtained by replacing the $\ell_0$-norm in (2) with the $\ell_1$-norm:

$$\min_{c} \|c\|_1 \quad \text{subject to} \quad y = Ac, \tag{3}$$

where the $\ell_1$-norm of a vector $v$ is defined as $\|v\|_1 = \sum_i |v_i|$. A generalized version of (3), which allows for a certain degree of noise, is to find a vector $c$ that minimizes the objective function

$$J(c) = \|Ac - y\|_2^2 + \lambda \|c\|_p, \tag{4}$$

where the positive scalar regularization parameter $\lambda$ balances the tradeoff between reconstruction error and sparsity.

Since $\ell_1$-norm minimization can efficiently recover sparse signals [20] and is robust against outliers, this study takes $p = 1$ in (4). Therefore, the problem is reduced to solving an $\ell_1$-regularized least squares problem:

$$\min_{c} \|Ac - y\|_2^2 + \lambda \|c\|_1. \tag{5}$$

A truncated Newton interior-point method (TNIPM) proposed in [20] can be used to solve the optimization problem in (5). For the convergence of the algorithm, the regularization parameter must satisfy the condition

$$0 < \lambda < \lambda_{\max} = \|2A^T y\|_{\infty}. \tag{6}$$

Please refer to [20] for more information about $\ell_1$-regularized least squares and the specialized interior-point method.
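For illustration, a minimal Python sketch of solving (5) is given below. It uses scikit-learn's Lasso solver as a stand-in for the TNIPM solver of the l1_ls package employed in this work; the function name sparse_representation and the default value of lam are illustrative choices, not part of the original implementation.

```python
# Sketch: solve the l1-regularized least squares problem (5),
# min_c ||A c - y||_2^2 + lam * ||c||_1, using scikit-learn's Lasso
# as a stand-in for the TNIPM solver in l1_ls (an assumption).
from sklearn.linear_model import Lasso

def sparse_representation(A, y, lam=0.01):
    """A: (d, n) matrix with training samples as columns; y: (d,) testing sample."""
    d = A.shape[0]
    # sklearn's Lasso minimizes (1/(2d))*||y - A c||_2^2 + alpha*||c||_1,
    # so alpha = lam/(2d) matches the objective in (5) up to a constant factor.
    model = Lasso(alpha=lam / (2 * d), fit_intercept=False, max_iter=10000)
    model.fit(A, y)
    return model.coef_
```

Per (6), lam should be chosen in the interval $(0, \|2A^T y\|_\infty)$ so that the solution is nontrivial.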

Another approach to determining the sparse solution to (2) is to use the framework of compressive sensing, which requires the system to be underdetermined. Including the reconstruction error $e$ in (1) yields

$$y = Ac + e. \tag{7}$$

In the compressive sensing approach, we need to rewrite (7) as

$$y = Bw, \tag{8}$$

where $B = [A, I] \in \mathbb{R}^{d \times (n+d)}$ and $w = [c^T, e^T]^T$. With these notations, the sparse representation can be obtained from the following constrained $\ell_1$-norm minimization problem:

$$\min_{w} \|w\|_1 \quad \text{subject to} \quad y = Bw. \tag{9}$$

The above linear programming problem can be solved by a specialized interior-point method called $\ell_1$-magic [21]. The approach in (9) is used in [22] for face recognition by sparse representation.
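To make the contrast concrete, (9) can be cast as a standard linear program by splitting $w$ into nonnegative parts, $w = u - v$ with $u, v \geq 0$. The sketch below solves this LP with SciPy's linprog rather than the specialized $\ell_1$-magic solver, so it illustrates only the formulation, not the fast method, and is practical only for modest $d$.

```python
# Sketch: the compressive-sensing formulation (9), min ||w||_1 s.t. B w = y,
# cast as a linear program and solved with SciPy (not the l1-magic solver).
import numpy as np
from scipy.optimize import linprog

def cs_sparse_representation(A, y):
    d, n = A.shape
    B = np.hstack([A, np.eye(d)])      # B = [A, I], unknowns w = [c; e]
    m = B.shape[1]
    cost = np.ones(2 * m)              # minimize 1^T u + 1^T v = ||w||_1
    A_eq = np.hstack([B, -B])          # equality constraint B(u - v) = y
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    w = res.x[:m] - res.x[m:]
    return w[:n]                       # the first n entries form c
```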

Both approaches generate nearly the same classification performance in our experiments. Our approach, based on $\ell_1$-regularized least squares, however, is much faster. First, the scale of the optimization problem in our approach is much smaller. For example, when the training dataset contains 300 samples and the gene number is 10 000, the matrix $A$ in our approach is of size 10 000 × 300 while $B$ in the compressive sensing approach is of size 10 000 × 10 300. Secondly, the empirical complexity of TNIPM is about $O(n^{1.2})$ while that of $\ell_1$-magic is $O(n^3)$ [20]. In addition, it is worth noting that basis pursuit, compressive sensing, and the Lasso algorithm can also be converted into $\ell_1$-regularized least squares problems [20].

Let $\hat{c}$ denote the sparse representation obtained by $\ell_1$-regularized least squares. Ideally, the nonzero entries in $\hat{c}$ are associated with the columns of $A$ corresponding to those training samples of the same category as the testing sample $y$. However, noise may cause the nonzero entries to be linked with multiple categories [22]. Simple heuristics, such as assigning $y$ to the category with the largest entry in $\hat{c}$, are not dependable. Instead, we define $N$ discriminant functions

$$g_k(y) = \|y - A\hat{c}_k\|_2, \quad k = 1, 2, \ldots, N, \tag{10}$$

where $\hat{c}_k$ is obtained by keeping only those entries in $\hat{c}$ associated with category $k$ and assigning zeros to the other entries. Thus $g_k(y)$ represents the approximation error when $y$ is assigned to category $k$, and we can assign $y$ to the category with the smallest approximation error. The classification algorithm is summarized in Algorithm 1.

Input: training samples $x_1, \ldots, x_n$ with labels $l_1, \ldots, l_n$, and a testing sample $y$
1. Normalize $x_1, \ldots, x_n$ and $y$
2. Create the matrix $A = [x_1, x_2, \ldots, x_n]$
3. Solve the optimization problem defined in (5) to obtain $\hat{c}$
4. Compute $g_k(y) = \|y - A\hat{c}_k\|_2$ for $k = 1, 2, \ldots, N$
Output: $\mathrm{label}(y) = \arg\min_k g_k(y)$
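Under the assumptions of the earlier sketch (the sparse_representation helper solving (5), integer category labels, and unit $\ell_2$ normalization as one plausible reading of step 1), Algorithm 1 might be rendered as follows.

```python
# Sketch of Algorithm 1; assumes sparse_representation from the earlier sketch.
import numpy as np

def classify(X_train, labels, y, lam=0.01):
    # Steps 1-2: normalize samples (unit l2 norm assumed) and build matrix A.
    A = X_train / np.linalg.norm(X_train, axis=0)
    y = y / np.linalg.norm(y)
    # Step 3: solve the l1-regularized least squares problem (5).
    c_hat = sparse_representation(A, y, lam)
    # Step 4: g_k(y) = ||y - A c_k||_2, keeping only category-k coefficients.
    errors = {}
    for k in np.unique(labels):
        c_k = np.where(labels == k, c_hat, 0.0)
        errors[k] = np.linalg.norm(y - A @ c_k)
    # Output: the category with the smallest approximation error.
    return min(errors, key=errors.get)
```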

2.2. Numerical Experiments

Numerical experiments are designed to quantitatively verify the performance of the sparse representation method for cancer classification using gene expression data. The performance metric used in this study is accuracy, obtained by stratified 10-fold cross-validation. We compare our approach with several variants of multicategory SVMs. SVMs, as state-of-the-art machine learning algorithms, have been successfully applied to gene profile classification [23, 24]. The comprehensive study in [25] also shows that SVMs outperform K-nearest neighbors and neural networks in gene expression cancer diagnosis.

All experiments are run on a PC with a dual-core 2.33 GHz Intel CPU and 4 GB of memory under Windows XP (SP2). The sparse representation method is implemented in MATLAB R14. The optimization is done by the l1_ls MATLAB package, which is available online (http://www.stanford.edu/~boyd/l1_ls/). The results of SVMs are obtained with the gene expression model selector (GEMS), software with a graphical user interface for classification of gene expression data, which is freely available at http://www.gems-system.org/ and was used in [25] for the comprehensive study of the performance of multiple classifiers on gene expression cancer diagnosis. Besides the standard binary SVM, GEMS implements the following multiclass SVMs, which are used in comparison with the sparse representation approach: one-versus-rest (OVR) [26], one-versus-one (OVO) [26], directed acyclic graph (DAG) [27], the all-at-once method of Weston and Watkins (WW) [28], and the all-at-once method of Crammer and Singer (CS) [29]. Polynomial and RBF kernels are used for the SVMs.

For a fair comparison, the cross-validation partition file generated by GEMS is also used in the sparse representation approach. For model selection, 9-fold cross-validation is used for the SVMs.
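As a sketch of this protocol, with scikit-learn generating stratified folds in place of the GEMS partition file (an assumption), the accuracy computation might look like the following, where classify is the Algorithm 1 sketch above.

```python
# Sketch: stratified 10-fold cross-validation accuracy for the SR classifier.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_accuracy(X, labels, lam=0.01):
    """X: (d, n) genes-by-samples matrix; labels: (n,) integer categories."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    correct = 0
    for train_idx, test_idx in skf.split(X.T, labels):
        A, l_train = X[:, train_idx], labels[train_idx]
        for j in test_idx:
            if classify(A, l_train, X[:, j], lam) == labels[j]:
                correct += 1
    return correct / len(labels)
```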

The comparison is carried out both with and without gene selection. Two popular gene selection methods are used in this study: the Kruskal-Wallis nonparametric one-way ANOVA (KW) [30] and the ratio of between-groups to within-groups sum of squares (BW) [31].
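Of the two, the KW filter is straightforward to sketch: score each gene by the Kruskal-Wallis statistic across categories and keep the top-ranked genes. The helper below uses scipy.stats.kruskal; the function name and the number of genes kept are illustrative assumptions.

```python
# Sketch: Kruskal-Wallis (KW) gene selection; keep the top-scoring genes.
import numpy as np
from scipy.stats import kruskal

def kw_gene_selection(X, labels, n_genes=100):
    """X: (d, n) genes-by-samples matrix; labels: (n,) categories."""
    classes = np.unique(labels)
    stats = []
    for g in range(X.shape[0]):
        groups = [X[g, labels == k] for k in classes]
        stats.append(kruskal(*groups)[0])  # larger = more differential
    return np.argsort(stats)[::-1][:n_genes]
```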

2.3. Datasets

In the experiments, we use six datasets, which are among the 11 datasets used in the comprehensive study [25]. For easy comparison, we adopt the names used in [25]. The information about the six datasets is summarized below.

(i) 9_Tumors [32]: the dataset comes from a study of 9 human tumor types: NSCLC, colon, breast, ovary, leukemia, renal, melanoma, prostate, and CNS. There are 60 samples, each of which contains 5726 genes.

(ii) 11_Tumors [23]: the dataset includes 174 samples of gene expression data of 11 human tumor types: ovary, bladder/ureter, breast, colorectal, gastro-esophagus, kidney, liver, prostate, pancreas, adeno lung, and squamous lung. The number of genes is 12 533.

(iii) 14_Tumors [24]: the dataset contains 308 samples of 14 human tumor types (leukemia, prostate, lung, colorectal, lymphoma, bladder, melanoma, uterus, breast, renal, pancreas, ovary, mesothelioma, and CNS) and 12 normal tissues (breast, prostate, lung, colon, germinal center, bladder, uterus, peripheral blood, kidney, pancreas, ovary, and brain). Each sample has 15 009 genes.

(iv) Brain_Tumor1 [11]: the dataset comes from a study of 5 human brain tumor types: medulloblastoma, malignant glioma, AT/RT, normal cerebellum, and PNET. It includes 90 samples, each with 5920 genes.

(v) Brain_Tumor2 [33]: there are 4 types of malignant glioma in this dataset: classic glioblastomas, classic anaplastic oligodendrogliomas, nonclassic glioblastomas, and nonclassic anaplastic oligodendrogliomas. The dataset has 50 samples, and the number of genes is 10 367.

(vi) Prostate_Tumor [9]: this binary dataset contains gene expression data of prostate tumor and normal tissues. There are 102 samples, each with 10 509 genes.

According to [25], 9_Tumors, 14_Tumors, and Brain_Tumor2 are the most difficult datasets, on which all classifiers, including SVMs, yield low classification performance.

All gene expression data are normalized by rescaling to the interval [0, 1], which also speeds up the training of SVMs.
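A one-line sketch of this step (assuming a single global rescaling; whether the rescaling is global or per gene is not specified here):

```python
X_scaled = (X - X.min()) / (X.max() - X.min())  # rescale all values to [0, 1]
```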

3. Results and Discussion

Table 1 shows the classification results of the experiment without gene selection for both sparse representation (SR) and SVMs. The results of SVMs are slightly different from those in [25]. A possible explanation is that the cross-validation partition file in our study differs from that of [25]. From Table 1, the proposed SR approach performs better than all SVM variants on 9_Tumors, 11_Tumors, and Brain_Tumor2, and better than most SVM variants on 14_Tumors, while it performs comparably with the SVM variants on Prostate_Tumor and Brain_Tumor1. In addition, similar to SVMs, the SR approach also finds it difficult to classify the three multicategory datasets 9_Tumors, 14_Tumors, and Brain_Tumor2. However, the SR approach performs better than all SVM variants on these datasets except CS and OVR on 14_Tumors. The difficulty may mainly be caused by the small total number of samples and the even smaller number of samples per category. For example, the 9_Tumors dataset has only 60 samples, and category 7 (prostate tumor) has just two samples.

Table 2 shows the results of sparse representation when the KW and BW methods are used for gene selection, along with the best results achieved by SVMs with the corresponding gene selection methods. From Table 2, the performance of the proposed SR approach is comparable with that of the best SVM variant on all six datasets. In addition, since gene selection generates limited improvement for both methods, the sparse representation approach, like SVMs, seems less sensitive to the curse of dimensionality than non-SVM methods such as neural networks and K-nearest neighbors.

It is worth mentioning that the results of SVMs, both with and without gene selection, are obtained by careful model selection using 9-fold cross-validation. The sparse representation approach, on the other hand, has no need of adjusting model parameters for different datasets.

As for computing efficiency, the sparse representation approach is very fast when the sample number is less than 100. For example, without gene selection, it needs less than 10 seconds for the Brain_Tumor2 dataset, which has only 50 samples. The efficiency, however, drops dramatically for relatively large sample sizes: the 14_Tumors dataset, which has 308 samples, needs more than 3000 seconds. The main reason lies in the fact that the current implementation needs to solve one optimization problem defined in (5) for the classification of each testing sample. As a result, the number of optimization problems to be solved equals the number of samples in the dataset. When compared with SVMs, however, the proposed SR approach is still faster, at least, than the GEMS implementations when model selection is taken into account.

4. Conclusion

In this paper, we have described a new approach for cancer diagnosis using gene expression data. The new method expresses each testing sample as a linear combination of all the training samples. The coefficient vector is obtained by $\ell_1$-regularized least squares. Classification is achieved by defining discriminant functions from the coefficient vector for each category. Since $\ell_1$-norm minimization leads to a sparse solution, we call the new approach sparse representation.

Numerical experiments show that the sparse representation approach can match the best performance achieved by SVMs. Furthermore, the new approach requires no model selection. One direction of our future work is to investigate how to classify multiple testing samples by solving only one optimization problem, to improve efficiency.

Acknowledgments

The second author would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for supporting this research. Both authors thank the editor and the reviewers for their kind comments and suggestions.