Personalized drug design requires the classification of cancer patients to be as accurate as possible. With advances in genome sequencing and microarray technology, a large amount of gene expression data has been and will continue to be produced from various cancer patients. Such cancer-related gene expression data allow us to classify tumors at the genome-wide level. However, cancer-related gene expression datasets typically have many more genes (features) than samples (patients), which poses a challenge for the classification of tumors. In this paper, a new method is proposed for cancer diagnosis using gene expression data by casting the classification problem as finding sparse representations of test samples with respect to training samples. The sparse representation is computed by the $\ell_1$-regularized least squares method. To investigate its performance, the proposed method is applied to six tumor gene expression datasets and compared with various support vector machine (SVM) methods. The experimental results show that the performance of the proposed method is comparable with or better than that of SVMs. In addition, the proposed method is more efficient than SVMs, as it requires no model selection.
1. Introduction
The treatment of cancer greatly depends on the accurate classification of tumors. Although microscopic tissue examination is effective for classifying many tumors, the traditional histopathological approach fails to classify many cancer cases. The number of
unclassified cancer cases can reach up to 40 000 per year just in the United
States [1]. DNA microarray technology, on the other hand, has the potential to
provide a more accurate and objective cancer diagnosis because of its high-throughput
capability of measuring the expression levels of tens of thousands of genes
simultaneously. Since Golub et al. [2] successfully distinguished
between acute myeloid leukemia (AML) and acute lymphocytic leukemia (ALL), many
other types of cancer have been classified using gene expression data including
breast cancer [3], lymphoma [4], lung cancer [5], bladder cancer [6], colon
cancer [7], ovarian cancer [8], prostate cancer [9], melanoma [10], and brain
tumors [11].
The successful application of microarray technology in cancer diagnosis greatly
depends on the careful design of two important components of a gene data
classification system: gene selection and sample classification, as shown in Figure 1. Gene selection mainly
serves two purposes: (i) to reduce dramatically the number of genes used in
classification in order to manage the “curse of dimensionality” and (ii) to identify genes
that might be biologically relevant, allowing further biological exploration that
may lead to a better understanding of the underlying
molecular mechanisms associated with tumorigenesis and progression. Gene
selection can be made by test statistics [12]. An excellent review of
gene selection methods can be found in [13].
Figure 1: The pipeline of cancer diagnosis using gene expression data.
The second component, sample classification, is challenging because the number of
training samples is small while the number of features (genes) is large. The number
of samples available for analysis ranges from tens to hundreds. Many methods have
been proposed to address this challenge. According to Lee et al. [14], they can be classified
into four categories: (i) classical methods such as Fisher's linear
discriminant analysis, logistic regression, K-nearest neighbor, and generalized partial least
squares, (ii) classification trees and aggregation methods such as CART, random
forest, bagging, and boosting, (iii) machine learning methods such as neural
networks and support vector machines (SVMs), and (iv) generalized methods such
as flexible discriminant analysis, mixture discriminant analysis, and the shrunken
centroid method.
In this paper, we propose a novel approach for
classification, called sparse representation, inspired by the recent progress
in $\ell_1$-norm minimization-based
methods such as basis pursuit denoising [15], compressive sensing for sparse
signal reconstruction [16–18], and the Lasso
algorithm for feature selection [19]. Ideally, a testing sample can be
represented just in terms of the training samples of the same category. Hence,
when the testing sample is expressed as a linear combination of all the training
samples, the coefficient vector is sparse, that is, the vector has relatively
few nonzero coefficients. Testing samples of the same category will have similar
sparse representations, while those of different categories will have different
sparse representations. To recover the sparse coefficient vector, $\ell_1$-regularized least
squares [20] is used.
Unlike general supervised learning methods, in which a
training procedure is used to create a classification model for testing, the
sparse representation approach does not contain separate training and testing
stages. Instead, classification is achieved directly from the testing sample's
sparse representation in terms of the training samples. Another unique feature of
the new method is that no model selection is needed. It is well known that the
performance of a classifier such as SVM relies on a careful choice of the
model parameters via a model selection procedure.
2. Materials and Methods
2.1. Sparse Representation
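Consider a training dataset $\{(x_i, l_i);\ i = 1, \ldots, n\}$, with $x_i \in \mathbb{R}^d$ and $l_i \in \{1, 2, \ldots, N\}$, where $x_i$ represents the $i$th sample, a $d$-dimensional column vector containing gene expression values with $d$ as the number of genes, and $l_i$ is the label of the $i$th sample with $N$ as the number of categories. For a testing sample $y \in \mathbb{R}^d$, the problem of sparse representation is to find a column vector $c = [c_1, c_2, \ldots, c_n]^T$ such that
$$y = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \tag{1}$$
and $\|c\|_0$ is minimized, where $\|c\|_0$ is the $\ell_0$-norm of $c$, which equals the number of nonzero components in the vector $c$.
Defining a matrix $A = [x_1, x_2, \ldots, x_n]$ by putting $x_i$ as the $i$th column, the problem of sparse representation can be converted into
$$\hat{c} = \arg\min_c \|c\|_0 \quad \text{subject to} \quad Ac = y. \tag{2}$$
Finding the solution to the sparse representation problem is NP-hard because of its combinatorial nature. An approximate solution can be obtained by replacing the $\ell_0$-norm in (2) by the $\ell_p$-norm,
$$\hat{c} = \arg\min_c \|c\|_p \quad \text{subject to} \quad Ac = y, \tag{3}$$
where the $\ell_p$-norm of a vector $v$ is defined as $\|v\|_p = \left(\sum_i |v_i|^p\right)^{1/p}$.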
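To make the difference between counting nonzeros and the $\ell_p$-norm concrete, the following minimal Python sketch evaluates both on a small vector. It assumes NumPy is available; the function names `l0_norm` and `lp_norm` and the tolerance argument are illustrative choices, not part of the original method.

```python
import numpy as np

def l0_norm(v, tol=0.0):
    """Number of entries whose magnitude exceeds tol (the l0 'norm')."""
    return int(np.sum(np.abs(v) > tol))

def lp_norm(v, p):
    """(sum_i |v_i|^p)^(1/p), intended for p >= 1."""
    return float(np.sum(np.abs(v) ** p) ** (1.0 / p))

# A coefficient vector with two nonzero entries out of four.
c = np.array([0.0, 3.0, 0.0, -4.0])
n_nonzero = l0_norm(c)   # counts the nonzero coefficients
l1 = lp_norm(c, 1)       # sum of absolute values
l2 = lp_norm(c, 2)       # Euclidean length
```

The $\ell_0$ count is insensitive to the magnitude of the coefficients, which is what makes its minimization combinatorial, whereas the $\ell_1$-norm is convex in the coefficients.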
A generalized version of (3), which allows for a certain degree of noise, is to find a vector $c$ that minimizes the following objective function:
$$J(c; \lambda) = \|Ac - y\|_2^2 + \lambda \|c\|_p, \tag{4}$$
where the positive scalar $\lambda$ is a regularization parameter that balances the tradeoff between reconstruction error and sparsity.
Since $\ell_1$-norm minimization can efficiently recover sparse signals [20] and is robust against outliers, this study takes $p = 1$ in (4). The problem therefore reduces to an $\ell_1$-regularized least squares problem:
$$\hat{c} = \arg\min_c \left\{ \|Ac - y\|_2^2 + \lambda \|c\|_1 \right\}. \tag{5}$$
A truncated Newton interior-point method (TNIPM) proposed in [20] can be used to solve the optimization problem in (5). For the convergence of the algorithm, the regularization parameter must satisfy the following condition:
$$\lambda \le \|2 A^T y\|_\infty. \tag{6}$$
Please refer to [20] for more information about $\ell_1$-regularized least squares and the specialized interior-point method.
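As an illustration of what solving the $\ell_1$-regularized least squares problem involves, the sketch below minimizes $\|Ac - y\|_2^2 + \lambda\|c\|_1$ with the iterative soft-thresholding algorithm (ISTA). This is a deliberately simple stand-in, not the TNIPM of [20] and not the l1_ls package; the squared-residual form of the objective, the function names, and the iteration count are assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal map of t*||.||_1)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_regularized_least_squares(A, y, lam, n_iter=500):
    """Approximately minimize ||A c - y||_2^2 + lam * ||c||_1 by ISTA.

    Assumes A is not the zero matrix, so the step size is finite.
    """
    # Lipschitz constant of the gradient of the squared residual.
    lipschitz = 2.0 * np.linalg.norm(A, 2) ** 2
    step = 1.0 / lipschitz
    c = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ c - y)
        c = soft_threshold(c - step * grad, step * lam)
    return c

def lambda_max(A, y):
    """Value of lam at and above which the minimizer is the zero vector."""
    return float(np.max(np.abs(2.0 * A.T @ y)))
```

For this objective, any `lam` at or above `lambda_max(A, y)` returns the all-zero vector, so useful values lie below that bound. ISTA converges much more slowly than an interior-point method on large problems; it is shown here only because it fits in a few lines.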
Another approach to determining the sparse solution to (2) is to use the framework of compressive sensing, which requires the system to be underdetermined. Including the reconstruction error $e$ in (1) yields
$$y = Ac + e. \tag{7}$$
In the compressive sensing approach, we need to rewrite (7) as
$$y = [A \;\; I] \begin{bmatrix} c \\ e \end{bmatrix} = Bw, \tag{8}$$
where $B = [A \;\; I] \in \mathbb{R}^{d \times (n+d)}$ and $w = [c^T \;\; e^T]^T$. With this notation, the sparse representation can be obtained from the following constrained $\ell_1$-norm minimization problem:
$$\hat{w} = \arg\min_w \|w\|_1 \quad \text{subject to} \quad Bw = y. \tag{9}$$
The above linear programming problem can be solved by a specialized interior-point method called $\ell_1$-magic [21]. The approach in (9) is used in [22] for face recognition by sparse representation.
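The reason an equality-constrained $\ell_1$-norm minimization can be treated as a linear program is the standard reformulation below. It is a sketch under the assumption that (9) has the basis-pursuit form $\min_w \|w\|_1$ subject to $Bw = y$; the auxiliary vector $t$ is introduced here for illustration and bounds the absolute value of each entry of $w$.

```latex
\begin{aligned}
\min_{w,\,t} \quad & \mathbf{1}^T t \\
\text{subject to} \quad & -t \le w \le t, \\
                        & Bw = y .
\end{aligned}
```

At the optimum each $t_i$ equals $|w_i|$, so the objective equals $\|w\|_1$. The price is that the number of variables doubles, in addition to the extra error variables already contained in $w$.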
Both approaches generate nearly the same classification performance in our
experiments. Our approach, based on $\ell_1$-regularized
least squares, is however much faster. First, the optimization problem in
our approach is much smaller. For example, when the training dataset contains
300 samples and the gene number is 10 000, the matrix $A$ in our approach is $10\,000 \times 300$, while the augmented matrix $B$ in (8) is $10\,000 \times 10\,300$.
Second, according to [20], TNIPM has a lower computational complexity than $\ell_1$-magic.
In addition, basis pursuit, compressive sensing, and the Lasso algorithm can also be
converted into $\ell_1$-regularized least squares problems [20].
Let $\hat{c}$ denote the sparse representation obtained by $\ell_1$-regularized least squares.
Ideally, the nonzero entries in $\hat{c}$ are associated with the columns of $A$
corresponding to those training samples of the same category as the testing
sample $y$. However, noise may cause
the nonzero entries to be linked with multiple categories [22]. Simple heuristics, such as assigning $y$ to the category with the largest
entry in $\hat{c}$,
are not dependable. Instead, we define $N$ discriminant functions
$$g_k(y) = \|y - A\hat{c}_k\|_2, \quad k = 1, 2, \ldots, N, \tag{10}$$
where $\hat{c}_k$ is obtained by keeping only those entries in $\hat{c}$ associated with category $k$ and assigning zeros to the other entries. Thus $g_k(y)$ represents the approximation error when $y$ is assigned to category $k$, and we assign $y$ to the category with the smallest
approximation error. The classification
algorithm is summarized in Algorithm 1.
Algorithm 1: Classification by sparse representation.
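The decision step described above, assigning the testing sample to the category with the smallest approximation error, can be sketched in Python as follows. The function name `classify_by_residual` and its argument layout are hypothetical, and the coefficient vector `c_hat` is assumed to have been computed beforehand by any $\ell_1$-regularized least squares solver; the sketch covers only the residual comparison, not the full Algorithm 1.

```python
import numpy as np

def classify_by_residual(A, labels, y, c_hat):
    """Assign y to the category whose training columns reconstruct it best.

    A      : d x n matrix, one training sample per column
    labels : length-n sequence of category labels, one per column of A
    y      : length-d testing sample
    c_hat  : length-n sparse coefficient vector for y
    """
    labels = np.asarray(labels)
    residuals = {}
    for k in np.unique(labels):
        # Keep only the coefficients that belong to category k.
        c_k = np.where(labels == k, c_hat, 0.0)
        residuals[k.item()] = float(np.linalg.norm(y - A @ c_k))
    predicted = min(residuals, key=residuals.get)
    return predicted, residuals

# Toy example: three training samples in two categories.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
labels = [1, 2, 2]
y = np.array([2.0, 0.1])
c_hat = np.array([1.9, 0.1, 0.0])   # assumed output of the sparse solver
predicted, residuals = classify_by_residual(A, labels, y, c_hat)
```

In the toy example almost all of the weight in `c_hat` falls on the single category-1 column, so the category-1 residual is far smaller than the category-2 residual.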
2.2. Numerical Experiments
Numerical experiments are designed to quantitatively verify the
performance of the sparse representation method for cancer classification using
gene expression data. The performance metric used in this study is accuracy, obtained by
stratified 10-fold cross-validation. We compare our approach with several
variants of multicategory SVMs. SVMs, as state-of-the-art machine learning
algorithms, have been successfully applied in gene profile classification [23, 24]. The comprehensive study in [25] also shows that SVMs outperform K-nearest
neighbors and neural networks in gene expression cancer diagnosis.
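For readers unfamiliar with the evaluation protocol, the sketch below shows one way to compute accuracy by stratified k-fold cross-validation with NumPy. It is not the GEMS partitioning used in the experiments; the function names, the random seed, the round-robin fold assignment, and the convention that samples are stored as rows of `X` are all assumptions of this example.

```python
import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    """Return a fold index per sample, spreading each category across folds."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for k in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == k))
        # Deal the shuffled samples of category k to the folds in turn.
        fold_of[idx] = np.arange(len(idx)) % n_folds
    return fold_of

def cross_validated_accuracy(predict, X, labels, n_folds=10, seed=0):
    """Fraction of samples classified correctly when each fold is held out once.

    predict(X_train, labels_train, X_test) must return predicted labels.
    """
    labels = np.asarray(labels)
    fold_of = stratified_folds(labels, n_folds, seed)
    correct = 0
    for f in range(n_folds):
        test = fold_of == f
        if not test.any():
            continue
        predicted = predict(X[~test], labels[~test], X[test])
        correct += int(np.sum(np.asarray(predicted) == labels[test]))
    return correct / len(labels)
```

Categories with fewer samples than folds appear in only some of the folds, which is one reason small categories make the accuracy estimate less reliable.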
All experiments are done on a PC with a dual-core 2.33 GHz Intel CPU and 4 GB of memory under
Windows XP (SP2). MATLAB R14 is used to implement the sparse representation method.
The optimization is done by the l1_ls MATLAB package, which is available online
(http://www.stanford.edu/~boyd/l1_ls/). The results of SVMs are obtained by the gene
expression model selector (GEMS), a software package with a graphical user interface for
classification of gene expression data, which is freely available at http://www.gems-system.org/
and used in [25] for the comprehensive study of the performance of multiple
classifiers on gene expression cancer diagnosis. Besides standard binary SVM,
GEMS has implemented the following multiclass SVMs: one-versus-rest (OVR) [26],
one-versus-one (OVO) [26], directed acyclic graph (DAG) [27], all-at-once method
by Weston and Watkins (WW) [28], and all-at-once method by Crammer and Singer
(CS) [29], which are used in comparison with sparse representation approach.
Polynomial and RBF kernels are used for SVMs.
For a fair comparison, the cross-validation partition file generated by GEMS is
also used for the sparse representation approach. For model selection of the SVMs,
9-fold cross-validation is used.
The comparison is done with and without gene selection. Two popular gene selection
methods are used in this study: Kruskal-Wallis nonparametric one-way ANOVA (KW)
[30] and the ratio of between-groups to within-groups sum of squares (BW) [31].
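As an illustration of the BW criterion, the sketch below scores each gene by the ratio of its between-groups to within-groups sum of squares and keeps the highest-scoring genes. The function names, the row-per-sample layout of `X`, and the small floor that guards against division by zero are assumptions of this example; the number of genes to keep is a free parameter that the text does not specify.

```python
import numpy as np

def bw_ratio(X, labels):
    """Between-groups over within-groups sum of squares for each gene.

    X      : n x d array, one sample per row and one gene per column
    labels : length-n sequence of category labels
    """
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in np.unique(labels):
        X_k = X[labels == k]
        mean_k = X_k.mean(axis=0)
        between += X_k.shape[0] * (mean_k - overall_mean) ** 2
        within += ((X_k - mean_k) ** 2).sum(axis=0)
    # Floor the denominator so constant genes do not divide by zero.
    return between / np.maximum(within, 1e-12)

def select_top_genes(X, labels, n_genes):
    """Indices of the n_genes columns with the largest BW ratio."""
    scores = bw_ratio(X, labels)
    return np.argsort(scores)[::-1][:n_genes]
```

A gene whose category means are far apart relative to its spread inside each category receives a large score, while a gene with identical category means scores zero.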
2.3. Datasets
In the experiments, we use six datasets, which
are among the 11 datasets used in the comprehensive study [25]. For easy comparison,
we adopt the names used in [25]. The information about the six datasets is
summarized below.
(i) 9_Tumors [32]: the dataset comes from a study of 9 human tumor types: NSCLC, colon, breast, ovary, leukemia, renal, melanoma, prostate, and CNS. There are 60 samples, each of which contains 5726 genes.
(ii) 11_Tumors [23]: the dataset includes 174 samples of gene expression data of 11 various human tumor types: ovary, bladder/ureter, breast, colorectal, gastro-esophagus, kidney, liver, prostate, pancreas, adeno lung, and squamous lung. The number of genes is 12 533.
(iii) 14_Tumors [24]: the dataset contains 308 samples of 14 various human tumor types including leukemia, prostate, lung, colorectal, lymphoma, bladder, melanoma, uterus, breast, renal, pancreas, ovary, mesothelioma, and CNS, and 12 normal tissues including breast, prostate, lung, colon, germinal center, bladder, uterus, peripheral blood, kidney, pancreas, ovary, and brain. Each sample has 15 009 genes.
(iv) Brain_Tumor1 [11]: the dataset comes from a study of 5 human brain tumor types: medulloblastoma, malignant glioma, AT/RT, normal cerebellum, and PNET, including 90 samples. Each sample has 5920 genes.
(v) Brain_Tumor2 [33]: there are 4 types of malignant glioma in this dataset: classic glioblastomas, classic anaplastic oligodendrogliomas, nonclassic glioblastomas, and nonclassic anaplastic oligodendrogliomas. The dataset has 50 samples, and the number of genes is 10 367.
(vi) Prostate_Tumor [9]: the binary dataset contains gene expression data of prostate tumor and normal tissues. There are 10 509 genes in each sample and 102 samples.
According to [25], 9_Tumors, 14_Tumors, and Brain_Tumor2 are
the most difficult datasets, on which all the classifiers, including SVMs,
achieve low classification performance.
All the gene expression data are normalized by rescaling the values to lie between 0 and 1. This
rescaling also serves to speed up the training of SVMs.
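A minimal sketch of such a rescaling is given below. The text does not state whether the rescaling is applied per gene or over the whole data matrix, so the per-gene (column-wise) choice, the function name `rescale_01`, and the handling of constant genes are assumptions of this example.

```python
import numpy as np

def rescale_01(X, axis=0):
    """Linearly map values to [0, 1] along the given axis.

    With samples in rows and genes in columns, axis=0 rescales each gene
    separately. Genes that are constant across samples are mapped to 0.
    """
    low = X.min(axis=axis, keepdims=True)
    high = X.max(axis=axis, keepdims=True)
    # Avoid dividing by zero for constant columns.
    span = np.where(high > low, high - low, 1.0)
    return (X - low) / span

# Three samples, two genes; the second gene is constant.
X = np.array([[1.0, 10.0],
              [3.0, 10.0],
              [5.0, 10.0]])
X_scaled = rescale_01(X)
```

In a cross-validation setting, the minimum and maximum would normally be taken from the training folds only; that refinement is omitted here for brevity.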
3. Results and Discussion
Table 1 shows the classification results of
the experiment without gene selection for both sparse representation (SR) and
SVMs. The results of SVMs are slightly different from those in [25]. A possible
explanation is that the cross-validation partition in
our study differs from that in [25]. From Table 1, the proposed SR approach performs better
than all SVM variants on 9_Tumors, 11_Tumors, and Brain_Tumor2, and better than most SVM
variants on 14_Tumors, while it performs comparably with the SVM
variants on Prostate_Tumor and Brain_Tumor1. In addition, similar to SVMs, the SR approach
also finds it difficult to classify three multicategory datasets: 9_Tumors,
14_Tumors, and Brain_Tumor2. However, the SR approach performs better than all
SVM variants on these datasets except CS and OVR on 14_Tumors. The difficulty may
mainly be caused by the small total number of samples and the even smaller
number of samples in each category. For example, the 9_Tumors dataset has only
60 samples, and category 7 (prostate
tumor) has just two samples.
Table 1: Results without gene selection.
Table 2 shows the results of sparse
representation when the KW and BW methods are used for gene selection, along with
the best results achieved by SVMs with the corresponding gene selection
methods. From Table 2, the performance of the proposed SR is comparable with
that of the best SVM variant on all six datasets. In addition, since gene selection
generates only limited improvement for both methods, the sparse representation approach,
like SVMs, seems less sensitive to the curse of dimensionality than non-SVM
methods such as neural networks and K-nearest neighbors.
Table 2: Results with gene selection.
It is worth mentioning that the results
of SVMs, both with and without gene selection, are obtained by careful model
selection using 9-fold cross-validation. The sparse representation approach, on the
other hand, does not need its model parameters adjusted for different datasets.
As for computing efficiency, the sparse
representation approach is very fast when the number of samples is less than 100. For
example, without gene selection, it needs less than 10 seconds for the Brain_Tumor2
dataset, which has only 50 samples. The efficiency, however, drops dramatically
for datasets with relatively many samples. The 14_Tumors dataset, which has 308
samples, needs more than 3000 seconds. The main reason is that
the current implementation solves one optimization problem of the form (5)
to classify each testing sample. As a result, the number of
optimization problems to be solved equals the number of samples in the
dataset. Compared with SVMs, however, the proposed SR is still faster, at
least than the GEMS implementations, when model selection is counted in the running time of SVMs.
4. Conclusion
In this paper, we have described a new approach for cancer diagnosis
using gene expression data. The new method expresses each testing sample as a
linear combination of all the training samples. The coefficient vector is
obtained by $\ell_1$-regularized
least squares. Classification is achieved by defining discriminant functions
from the coefficient vector for each category. Since $\ell_1$-norm minimization leads to a sparse
solution, we call the new approach sparse representation.
Numerical
experiments show that the sparse representation approach can match the best
performance achieved by SVMs. Furthermore, the new approach requires no
model selection. One direction of our future work is to investigate how to
classify multiple testing samples by solving only one optimization problem, in order to
improve efficiency.
Acknowledgments
The second author would like to thank the
Natural Sciences and Engineering Research Council of Canada (NSERC) for
supporting this research. Both authors thank the editor and reviewers for their
kind comments and suggestions.