Abstract
Supervised learning of microarray data has received much attention in recent years. Multiclass cancer diagnosis based on selected gene profiles is used as an adjunct to clinical diagnosis. However, supervised diagnosis may hinder patient care, add expense, or confound a result. To avoid such misleading outcomes, a multiclass cancer diagnosis with class-selective rejection is proposed. It rejects some patients from one, some, or all classes in order to ensure higher reliability while reducing time and expense. Moreover, this classifier takes into account asymmetric penalties that depend on each class and on each wrong or partially correct decision. It is based on $\nu$-1-SVM coupled with its regularization path and minimizes a general loss function defined in the class-selective rejection scheme. The state-of-the-art multiclass algorithms can be considered as a particular case of the proposed algorithm where the number of decisions is given by the classes and the loss function is defined by the Bayesian risk. Two experiments are carried out, in the Bayesian and the class-selective rejection frameworks. Five gene-selected datasets are used to assess the performance of the proposed method. Results are discussed and accuracies are compared with those of the Naive Bayes, Nearest Neighbor, Linear Perceptron, Multilayer Perceptron, and Support Vector Machines classifiers.
1. Introduction
Cancer diagnosis based on gene expression profiling has improved over the past years. Many studies using microarray technologies were developed to analyze gene expression. The resulting gene profiles are later used to categorize cancer classes. Two different classification approaches can be used: class discovery and class prediction. The first is an unsupervised learning approach that separates samples into clusters based on similarities in gene expression, without prior knowledge of sample identity. The second is a supervised approach that predicts the category of an already defined sample using its gene expression profile. Since these classification problems are described by a large number of genes and a small number of samples, it is crucial to perform gene selection before the classification step. One way to identify informative genes, pointed out in [1], is to use test statistics.
Research shows that the performance of supervised decisions based on selected gene expression can be comparable to that of clinical decisions. However, no classification strategy is absolutely accurate. First, many factors may decrease the predictive power of a multiclass classifier. For example, the findings of [2] imply that information useful for multiclass tumor classification is encoded in a complex gene expression pattern and cannot be given by a simple one. Second, it is not possible to find an optimal classification method for all kinds of multiclass problems. Thus, supervised diagnosis is always considered an important adjunct to traditional diagnostics and never a substitute for it.
Unfortunately, supervised diagnosis can be misleading. It may hinder patient care (wrong decision on a sick patient), add expense (wrong decision on a healthy patient), or confound the results of cancer categories. To overcome these limitations, a multi-SVM [3] classifier with class-selective rejection [4–7] is proposed. Class-selective rejection consists of rejecting some patients from one, some, or all classes in order to ensure higher reliability while reducing time and expense. Moreover, none of the existing multiclass algorithms [8–10] takes asymmetric penalties on wrong decisions into consideration. For example, in a binary cancer problem, a wrong decision on a sick patient must cost more than a wrong decision on a healthy patient. The proposed classifier handles this kind of problem. It minimizes a general loss function that takes into account asymmetric penalties that depend on each class and on each wrong or partially correct decision.
The proposed method divides the multiclass problem into several unary classification problems and trains one $\nu$-1-SVM [11–13] coupled with its regularization path [14, 15] for each class. The winning class or subset of classes is determined using a prediction function that takes the cost asymmetry into consideration. The parameters of all the $\nu$-1-SVMs are optimized jointly in order to minimize a loss function. Taking advantage of the regularization path method, the entire parameter search space is considered. Since the search space is widely extended, the selected decision rule is more likely to be the optimal one. The state-of-the-art multiclass algorithms [8–10] can be considered as a particular case of the proposed algorithm where the number of decisions is given by the existing classes and the loss function is defined by the Bayesian risk.
Two experiments are reported in order to assess the performance of the proposed approach. The first one considers the proposed algorithm in the Bayesian framework and uses the selected microarray genes to make results comparable with existing ones. Performances are compared with those obtained with the Naive Bayes, Nearest Neighbor, Linear Perceptron, Multilayer Perceptron, and Support Vector Machines classifiers used in [1]. The second one shows the ability of the proposed algorithm to solve multiclass cancer diagnosis in the class-selective rejection scheme. It minimizes an asymmetric loss function. Experimental results show that a cascade of class-selective classifiers with class-selective rejections can be considered as an improved supervised diagnosis rule.
This paper is organized as follows. Section 2 presents a description of the model as a gene selection task. It introduces the multiclass cancer diagnosis problem in the class-selective rejection scheme. It also proposes a supervised training algorithm based on $\nu$-1-SVM coupled with its regularization path. The two experiments are carried out in Section 3, where results are reported, compared, and discussed. Finally, a conclusion is presented in Section 4.
2. Models and Methods
This section describes multiclass cancer diagnosis based on microarray data. Feature selection is presented as the first step of gene-based cancer diagnosis. Test statistics are used as a possible way of identifying informative genes [1]. Once gene selection is done, a classification problem should be solved. The multiclass cancer diagnosis problem, formulated in the general framework of class-selective rejection, is introduced. A solution based on $\nu$-1-SVM [11–13] is proposed. First, a brief description of $\nu$-1-SVM and the derivation of its regularization path [14, 15] is presented. Second, the proposed algorithm [3] is explained. It determines a multiclass cancer diagnosis rule that minimizes an asymmetric loss function in the class-selective rejection scheme.
2.1. Genes Selection Using Test Statistics
Gene profiles are successfully applied to supervised cancer diagnosis. Since cancer diagnosis problems are usually described by a small set of samples with a large number of genes, feature or gene selection is an important issue in analyzing multiclass microarray data. Given microarray data with $k$ tumor classes, $n$ tumor samples, and $p$ genes per sample, one should identify a small subset of informative genes that contribute most to the prediction task. Various feature selection methods exist in the literature. One way, pointed out in [1], is to use test statistics for the equality of the class means. The authors of [1] first model the expression levels of a given gene by a one-way analysis of variance model. Second, the power of genes in discriminating between tumor types is determined by a test statistic. The discrimination power is the value of the test statistic evaluated at the expression levels of the gene. The higher the discrimination power is, the more powerful the gene is in discriminating between tumor types. Thus, genes with higher power of discrimination are considered as informative genes.
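Let $Y_{ij}$ be the expression level from the $j$th sample of the $i$th class. The following general model is considered:
$$Y_{ij} = \mu_i + \varepsilon_{ij}, \quad i = 1, \dots, k;\ j = 1, \dots, n_i. \tag{1}$$
In the model, $\mu_i$ represents the mean expression level of the gene in class $i$, and the $\varepsilon_{ij}$ are independent random variables with $E(\varepsilon_{ij}) = 0$ and $V(\varepsilon_{ij}) = \sigma_i^2 < \infty$ for $i = 1, \dots, k$; $j = 1, \dots, n_i$.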
For the case of homogeneity of variances, the ANOVA F test (or $F$ test) [16] is the optimal one for testing the means equality hypothesis. With heterogeneity of variances, the task is challenging. Moreover, it is known that, with a large number of genes present, usually in the thousands, no practical test is available to locate the best set of genes. Thus, the authors of [1] studied six different statistics.
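(i) ANOVA F test statistic. The definition of this test is
$$F = \frac{(n-k)\sum n_i\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2}{(k-1)\sum (n_i-1)\,s_i^2}, \tag{2}$$
where $\bar{Y}_{i\cdot} = \sum_{j=1}^{n_i} Y_{ij}/n_i$ and $\bar{Y}_{\cdot\cdot} = \sum n_i \bar{Y}_{i\cdot}/n$, $s_i^2 = \sum_{j=1}^{n_i}\left(Y_{ij} - \bar{Y}_{i\cdot}\right)^2/(n_i-1)$. For simplicity, $\sum$ is used to indicate the sum taken over the index $i$. Under the means equality hypothesis and assuming variance homogeneity, this test has a distribution of $F_{k-1,\,n-k}$ [16].
(ii) Brown-Forsythe test statistic [17], given by
$$B = \frac{\sum n_i\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2}{\sum (1 - n_i/n)\,s_i^2}. \tag{3}$$
Under the means equality hypothesis, $B$ is distributed approximately as $F_{k-1,\,\tau}$, where
$$\tau = \frac{\left[\sum (1 - n_i/n)\,s_i^2\right]^2}{\sum (1 - n_i/n)^2\, s_i^4/(n_i-1)}. \tag{4}$$
(iii) Welch test statistic [18], defined as
$$W = \frac{\sum w_i\left(\bar{Y}_{i\cdot} - \sum_l h_l \bar{Y}_{l\cdot}\right)^2}{(k-1) + 2(k-2)(k+1)^{-1}\sum (n_i-1)^{-1}(1-h_i)^2}, \tag{5}$$
with $w_i = n_i/s_i^2$ and $h_i = w_i/\sum w_i$. Under the means equality hypothesis, $W$ has an approximate distribution of $F_{k-1,\,\tau_W}$, where
$$\tau_W = \frac{k^2-1}{3\sum (n_i-1)^{-1}(1-h_i)^2}. \tag{6}$$
(iv) Adjusted Welch test statistic [19]. It is similar to the Welch statistic and defined to be
$$W^* = \frac{\sum w_i^*\left(\bar{Y}_{i\cdot} - \sum_l h_l^* \bar{Y}_{l\cdot}\right)^2}{(k-1) + 2(k-2)(k+1)^{-1}\sum (n_i-1)^{-1}(1-h_i^*)^2}, \tag{7}$$
where $w_i^* = n_i/(\Phi_i s_i^2)$ with $\Phi_i$ chosen such that $1 \le \Phi_i \le (n_i-1)/(n_i-3)$ and $h_i^* = w_i^*/\sum w_i^*$. Under the means equality hypothesis, $W^*$ has an approximate distribution of $F_{k-1,\,\tau_W^*}$, where
$$\tau_W^* = \frac{k^2-1}{3\sum (n_i-1)^{-1}(1-h_i^*)^2}. \tag{8}$$
(v) Cochran test statistic [20]. This test statistic is simply the quantity appearing in the numerator of the Welch test statistic $W$, that is,
$$C = \sum w_i\left(\bar{Y}_{i\cdot} - \sum_l h_l \bar{Y}_{l\cdot}\right)^2. \tag{9}$$
Under the means equality hypothesis, $C$ has an approximate distribution of $\chi^2_{k-1}$.
(vi) Kruskal-Wallis test statistic. This is the well-known nonparametric test given by
$$H = \frac{12}{n(n+1)}\sum \frac{R_i^2}{n_i} - 3(n+1), \tag{10}$$
where $R_i$ is the rank sum for the $i$th class. The ranks assigned to the $Y_{ij}$ are those obtained from ranking the entire set of $Y_{ij}$. Assuming each $n_i \ge 5$, then under the means equality hypothesis, $H$ has an approximate distribution of $\chi^2_{k-1}$ [21].
The performances of these tests are evaluated and compared over different supervised learning methods applied to publicly available microarray datasets. Experimental results show that the model for gene expression values without assuming equal variances is more appropriate than that assuming equal variances. Besides, under heterogeneity of variances, the Brown-Forsythe, Welch, adjusted Welch, and Cochran test statistics perform much better than the ANOVA F and Kruskal-Wallis test statistics.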
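As an illustration of how such statistics can be used to rank genes, the following is a minimal Python sketch of the ANOVA F and Welch statistics for a single gene, followed by a top-genes selection. The function names (`anova_f`, `welch_w`, `top_genes`) and the data layout (one list of expression levels per class) are assumptions made for this example and are not taken from [1].

```python
from statistics import mean, variance

def anova_f(groups):
    """ANOVA F statistic for one gene; `groups` holds one list of expression levels per class."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    within = sum((len(g) - 1) * variance(g) for g in groups)
    return ((n - k) * between) / ((k - 1) * within)

def welch_w(groups):
    """Welch statistic, which does not assume equal class variances.

    Each class needs at least two samples and a nonzero sample variance.
    """
    k = len(groups)
    w = [len(g) / variance(g) for g in groups]
    h = [wi / sum(w) for wi in w]
    centre = sum(hi * mean(g) for hi, g in zip(h, groups))
    num = sum(wi * (mean(g) - centre) ** 2 for wi, g in zip(w, groups))
    corr = sum((1 - hi) ** 2 / (len(g) - 1) for hi, g in zip(h, groups))
    return num / ((k - 1) + 2 * (k - 2) / (k + 1) * corr)

def top_genes(expression, statistic, n_top):
    """Rank genes by decreasing statistic; `expression[g]` is the per-class grouping of gene g."""
    scores = {g: statistic(groups) for g, groups in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_top]
```

For instance, `top_genes(expression, welch_w, 50)` would keep the 50 genes with the largest Welch statistic; the cut-off of 50 is only a placeholder here.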
2.2. Multitumor Classes with Selective Rejection
Once gene selection is done, the classification problem should be solved. Let us define this diagnosis problem in the class-selective rejection scheme. Assuming that the multiclass cancer problem deals with $N$ tumor classes noted $\omega_1, \dots, \omega_N$ and that any patient or sample $x$ belongs to one tumor class and is described by its informative genes, a decision rule consists in a partition $Z$ of the gene expression space into $I$ sets $Z_i$ corresponding to the different decision options. In the simple classification scheme, the options are defined by the $N$ tumor classes. In the class-selective rejection scheme, the options are defined by the $N$ tumor classes and the subsets of tumor classes (e.g., assigning patient $x$ to the subset of tumor classes $\{\omega_1, \omega_3\}$ means that $x$ is assigned to cancer categories $\omega_1$ and $\omega_3$ with ambiguity).
The problem consists in finding the decision rule $Z^*$ that minimizes a given loss function $c(Z)$ defined by
$$c(Z) = \sum_{i=1}^{I}\sum_{j=1}^{N} c_{ij}\, P_j\, P(D_i \mid \omega_j), \tag{11}$$
where $c_{ij}$ is the cost of assigning a patient $x$ to the $i$th decision option when it belongs to the tumor class $\omega_j$. Since the aim is to minimize $c(Z)$, only the relative values of the $c_{ij}$ matter, and they can be defined in the interval $[0, 1]$ without loss of generality. $P_j$ is the a priori probability of tumor class $\omega_j$ and $P(D_i \mid \omega_j)$ is the probability that patients of the tumor class $\omega_j$ are assigned to the $i$th option.
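To make the loss concrete, the sketch below evaluates a cost-weighted sum of the form cost × prior × conditional decision probability for a hypothetical two-class problem with three decision options (each class alone, plus rejection to both classes). All numerical values are invented for illustration and do not come from the paper.

```python
def expected_loss(cost, priors, decision_probs):
    """Sum over options i and classes j of cost[i][j] * priors[j] * decision_probs[i][j]."""
    return sum(
        cost[i][j] * priors[j] * decision_probs[i][j]
        for i in range(len(cost))
        for j in range(len(priors))
    )

# Hypothetical 2-class problem with 3 options: class 1, class 2, or {class 1, class 2}.
cost = [[0.0, 1.0],   # deciding class 1 (a missed class-2 patient is the costliest error)
        [0.5, 0.0],   # deciding class 2
        [0.2, 0.2]]   # withholding the decision
priors = [0.3, 0.7]
decision_probs = [[0.8, 0.1],   # P(option i | class j); each column sums to 1
                  [0.1, 0.8],
                  [0.1, 0.1]]
loss = expected_loss(cost, priors, decision_probs)
```

With these values the rejection option carries a small cost whatever the true class, which is what lets a rule trade a few withheld decisions against expensive wrong ones.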
2.3. ν-1-SVM
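To solve the multiclass diagnosis problem, an approach based on $\nu$-1-SVM is proposed. Considering a set of $m$ samples $X = \{x_1, \dots, x_m\}$ of a given tumor class drawn from an input space $\mathcal{X}$, $\nu$-1-SVM computes a decision function $f_\lambda$ and a real number $\rho_\lambda$ in order to determine the region $R_\lambda$ in $\mathcal{X}$ such that $f_\lambda(x) - \rho_\lambda \ge 0$ if the sample $x \in R_\lambda$ and $f_\lambda(x) - \rho_\lambda < 0$ otherwise. The decision function $f_\lambda$ is parameterized by $\lambda = \nu m$ (with $0 \le \nu \le 1$) to control the number of outliers. It is designed by minimizing the volume of $R_\lambda$ under the constraint that all the samples of $X$, except the fraction $\nu$ of outliers, must lie in $R_\lambda$. In order to determine $R_\lambda$, the space of possible functions $f_\lambda$ is reduced to a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ with kernel function $k(\cdot,\cdot)$. Let $\phi : \mathcal{X} \to \mathcal{H}$ be the mapping defined over the input space $\mathcal{X}$. Let $\langle \cdot,\cdot \rangle_{\mathcal{H}}$ be a dot product defined in $\mathcal{H}$. The kernel $k(\cdot,\cdot)$ over $\mathcal{X} \times \mathcal{X}$ is defined by:
$$k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}. \tag{12}$$
Without loss of generality, $k(\cdot,\cdot)$ is supposed normalized such that for any $x \in \mathcal{X}$, $k(x, x) = 1$. Thus, all the mapped vectors $\phi(x_i)$, $i = 1, \dots, m$, are in a subset of a hypersphere with radius one and center $O$. Provided $k(\cdot,\cdot)$ is always positive, $\phi(X)$ is a subset of the positive orthant of the hypersphere. A common choice of $k(\cdot,\cdot)$ is the Gaussian RBF kernel $k(x, y) = \exp\left(-\|x - y\|^2 / (2\sigma^2)\right)$, with $\sigma$ the parameter of the Gaussian RBF kernel. $\nu$-1-SVM consists of separating the mapped samples in $\mathcal{H}$ from the center $O$ with a hyperplane $W_\lambda$. Finding the hyperplane $W_\lambda$ is equivalent to finding the decision function $f_\lambda$ such that $f_\lambda(x) - \rho_\lambda = \langle w_\lambda, \phi(x) \rangle_{\mathcal{H}} - \rho_\lambda \ge 0$ for the $(1-\nu)m$ mapped samples, while $W_\lambda$ is the hyperplane with maximum margin $\rho_\lambda / \|w_\lambda\|$, with $w_\lambda$ the normal vector of $W_\lambda$.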
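This yields $f_\lambda$ as the solution of the following convex quadratic optimization problem:
$$\min_{w_\lambda,\,\rho_\lambda,\,\xi}\ \sum_{i=1}^{m} \xi_i - \lambda\rho_\lambda + \frac{\lambda}{2}\|w_\lambda\|^2 \quad \text{subject to} \quad \langle w_\lambda, \phi(x_i) \rangle_{\mathcal{H}} \ge \rho_\lambda - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \dots, m, \tag{13}$$
where the $\xi_i$ are the slack variables. This optimization problem is solved by introducing Lagrange multipliers $\alpha_i$. As a consequence of the Kuhn-Tucker conditions, $w_\lambda$ is given by
$$w_\lambda = \frac{1}{\lambda}\sum_{i=1}^{m} \alpha_i \phi(x_i), \tag{14}$$
which results in
$$f_\lambda(x) = \frac{1}{\lambda}\sum_{i=1}^{m} \alpha_i k(x_i, x). \tag{15}$$
The dual formulation of (13) is obtained by introducing the Lagrange multipliers as
$$\min_{\alpha}\ \frac{1}{2\lambda}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) \quad \text{with} \quad 0 \le \alpha_i \le 1,\ \ \sum_{i=1}^{m} \alpha_i = \lambda. \tag{16}$$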
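Once the multipliers are known, evaluating the one-class decision function only requires kernel evaluations. The sketch below assumes a Gaussian RBF kernel of the form $\exp(-\|x-y\|^2/(2\sigma^2))$ and a kernel expansion scaled by $1/\lambda$; both conventions, as well as the function names, are assumptions of this example, and the multipliers are taken as given rather than computed by a solver.

```python
import math

def rbf(x, y, sigma):
    """Gaussian RBF kernel; k(x, x) = 1, so mapped samples lie on the unit hypersphere."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def decision_value(x, samples, alphas, lam, sigma):
    """f(x) = (1/lam) * sum_i alpha_i * k(x_i, x), for multipliers that sum to lam."""
    return sum(a * rbf(s, x, sigma) for s, a in zip(samples, alphas)) / lam

def inside_region(x, samples, alphas, lam, rho, sigma):
    """True when x falls in the region accepted by the one-class model, i.e. f(x) >= rho."""
    return decision_value(x, samples, alphas, lam, sigma) >= rho
```

A sample far from all training samples gets a decision value close to zero and is therefore treated as an outlier for any positive offset `rho`.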
A geometrical interpretation of the solution in the RKHS is given by Figure 1. $f_\lambda$ and $\rho_\lambda$ define a hyperplane $W_\lambda$ orthogonal to $w_\lambda$. The hyperplane $W_\lambda$ separates the $\phi(x_i)$s from the sphere center, while having maximum $\rho_\lambda / \|w_\lambda\|$, which is equivalent to minimizing the portion of the hypersphere bounded by $W_\lambda$ that contains the set $\{\phi(x) : x \in R_\lambda\}$.
Tuning $\nu$, or equivalently $\lambda$, is a crucial point since it controls the margin error. Changing $\lambda$ requires solving the optimization problem formulated in (16) again in order to find the new region $R_\lambda$. To obtain great computational savings and extend the search space of $\lambda$, we propose to use the $\nu$-1-SVM regularization path [14, 15]. The regularization path was first introduced by Hastie et al. [14] for a binary SVM. Later, Rakotomamonjy and Davy [15] developed the entire regularization path for a $\nu$-1-SVM. The basic idea of the $\nu$-1-SVM regularization path is that the parameter vector of a $\nu$-1-SVM is a piecewise linear function of $\lambda$. Thus the principle of the method is to start with a large $\lambda$ (i.e., $\lambda = m - \epsilon$) and decrease it towards zero, keeping track of the breaks that occur as $\lambda$ varies.
As $\lambda$ decreases, $\|w_\lambda\|$ increases and hence the distance between the sphere center and $W_\lambda$ decreases. Samples move from being outside (non-margin SVs with $\alpha_i = 1$ in Figure 1) to inside the portion bounded by $W_\lambda$ (non-SVs with $\alpha_i = 0$). By continuity, patients must linger on the hyperplane (margin SVs with $0 < \alpha_i < 1$) while their $\alpha_i$s decrease from 1 to 0. The $\alpha_i$s are piecewise-linear in $\lambda$. Break points occur when a point moves from one position to another. Since $\alpha_i$ is piecewise-linear in $\lambda$, $\lambda f_\lambda$ and $\lambda\rho_\lambda$ are also piecewise-linear in $\lambda$. Thus, after initializing the regularization path (computing the $\alpha_i$ by solving (16) for $\lambda = m - \epsilon$), almost all the $\alpha_i$s are computed by solving linear systems. Only for a few integer values of $\lambda$ smaller than $m$ are the $\alpha_i$ computed by solving (16), according to [15].
Using simple linear interpolation, this algorithm makes it possible to determine very rapidly the $\nu$-1-SVM corresponding to any value of $\lambda$.
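The interpolation step can be sketched as follows, assuming the path algorithm of [15] has already produced the break points and the multiplier vector at each of them (the path computation itself is not implemented here). The function name `alphas_at` and the storage layout are hypothetical.

```python
from bisect import bisect_left

def alphas_at(lam, breakpoints, alpha_path):
    """Interpolate multipliers at `lam`; breakpoints ascend and alpha_path[i] matches breakpoints[i]."""
    if not breakpoints[0] <= lam <= breakpoints[-1]:
        raise ValueError("lam lies outside the stored path")
    hi = bisect_left(breakpoints, lam)
    if breakpoints[hi] == lam:
        return list(alpha_path[hi])
    lo = hi - 1
    # Between two consecutive break points the multipliers vary linearly with lam.
    t = (lam - breakpoints[lo]) / (breakpoints[hi] - breakpoints[lo])
    return [(1 - t) * a + t * b for a, b in zip(alpha_path[lo], alpha_path[hi])]
```

In the toy path `breakpoints = [1.0, 2.0, 3.0]` with `alpha_path = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]`, the interpolated multipliers stay in $[0, 1]$ and keep summing to the requested value of the parameter, which mirrors the constraints of the dual problem under the parameterization assumed here.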
2.4. Multiclass SVM Based on ν-1-SVM
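Given $N$ classes and $N$ trained $\nu$-1-SVMs, one should design a supervised decision rule $Z$, moving from unary to multiclass classification by assigning samples to a decision option. To determine the decision rule, a prediction function should first decide the winning option. A distance measure between $x$ and the training class set $CL_j$, using the $\nu$-1-SVM parameterized by $\lambda_j$, is defined as follows:
$$d_{\lambda_j}(x, CL_j) = \frac{\cos\left(w_{\lambda_j}, \phi(x)\right)}{\cos\left(\theta_{\lambda_j}\right)}, \tag{17}$$
where $\theta_{\lambda_j}$ is the angle delimited by $w_{\lambda_j}$ and the support vector as shown in Figure 1. $\cos(\theta_{\lambda_j})$ is a normalizing factor which is used to make all the $d_{\lambda_j}$ comparable.
Using $\|\phi(x)\| = 1$ in (17) leads to the following:
$$d_{\lambda_j}(x, CL_j) = \frac{f_{\lambda_j}(x)}{\rho_{\lambda_j}}. \tag{18}$$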
Since the $\alpha_i$ are obtained by the regularization path for any value of $\lambda_j$, computing $d_{\lambda_j}$ is an easy and fast task. The distance measure is inspired by [22]. When data are distributed in a unimodal form, $d_{\lambda_j}$ is a decreasing function of the distance between a sample and the data mean. The probability density function is also a decreasing function of the distance from the mean. Thus, $d_{\lambda_j}$ preserves distribution order relations. In such a case, and under optimality of the $\nu$-1-SVM classifier, the use of $d_{\lambda_j}$ should reach the same performance as the one obtained using the distribution.
In the simplest case of multiclass problems where the loss function is defined as the error probability, a patient $x$ is assigned to the tumor class maximizing $d_{\lambda_j}(x, CL_j)$.
To extend the multiclass prediction process to the class-selective scheme, a weighted form of the distance measure is proposed. The weight $\beta_j$ is associated with $d_{\lambda_j}$. $\beta_j d_{\lambda_j}$ reflects an adjusted value of the distance according to the penalty associated with the tumor class $\omega_j$. Thus, introducing the weights $\beta_j$ makes it possible to treat each tumor class differently and helps solve problems with different costs on the classification decisions.
Finally, in the general case where the loss function is considered in the class-selective rejection scheme, the prediction process can be defined as follows: a blinded sample is assigned to the th option if and only if
Thus, in contrast to previous multiclass SVMs, which construct the maximum margin between classes and locate the decision hyperplane in the middle of the margin, the proposed approach more closely resembles the robust Bayesian classifier. The distribution of each tumor class is considered and the optimal decision is slightly deviated toward the class with the smaller variance.
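The prediction step can be illustrated with the sketch below. It makes two assumptions that should be read as such: the similarity of a sample to class $j$ is taken as the ratio of the one-class decision value to its offset, and the class-selective rule is written as a minimum-risk choice in which the weighted similarities play the role of class scores. This is one plausible form of a cost-aware rule, not necessarily the exact rule of the proposed method.

```python
def distances(f_values, rhos):
    """Normalized similarity d_j = f_j(x) / rho_j of a sample to each class model (assumed form)."""
    return [f / r for f, r in zip(f_values, rhos)]

def predict_class(f_values, rhos):
    """Simple scheme: pick the class whose model gives the largest d_j."""
    d = distances(f_values, rhos)
    return max(range(len(d)), key=d.__getitem__)

def predict_option(f_values, rhos, betas, cost):
    """Assumed rule: pick the option i minimizing sum_j cost[i][j] * beta_j * d_j."""
    d = distances(f_values, rhos)
    risks = [sum(c * b * dj for c, b, dj in zip(row, betas, d)) for row in cost]
    return min(range(len(risks)), key=risks.__getitem__)
```

With a 0-1 cost matrix and equal weights, `predict_option` reduces to `predict_class`; adding a row with small costs on two classes creates an option that wins when a sample is about equally close to both.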
The proposed decision rule depends on $\sigma$ and on the vectors of $\lambda_j$ and $\beta_j$ for $j = 1, \dots, N$. Tuning $\lambda_j$ is the most time-expensive task, since changing $\lambda_j$ requires solving the optimization problem formulated in (16). Moreover, tuning $\lambda_j$ is a crucial point, since it controls the margin error. In fact, it was shown in [11] that this regularization parameter is an upper bound on the fraction of outliers and a lower bound on the fraction of the SVs. In [9, 23] a smooth grid search was used to choose the optimal values of the regularization parameters. These values were chosen equal for all classes to reduce the computational costs. However, this assumption reduces the parameter search space too. To avoid this restriction, the proposed approach optimizes all the $\lambda_j$, $j = 1, \dots, N$, corresponding to the $N$ $\nu$-1-SVMs using the regularization path and consequently explores the entire parameter space. Thus the tuned $\lambda_j$ are most likely to be the optimal ones. The kernel parameters are set equal for all classes, $\sigma_1 = \sigma_2 = \dots = \sigma_N = \sigma$.
The optimal vector of $\lambda_j$, $\beta_j$, and $\sigma$, $j = 1, \dots, N$, is the one which minimizes an estimator of $c(Z)$ using a validation set. Since the problem is described by a sample set, an estimator $\hat{c}(Z)$ of $c(Z)$ given by (11) is used:
$$\hat{c}(Z) = \sum_{i=1}^{I}\sum_{j=1}^{N} c_{ij}\, \hat{P}_j\, \hat{P}(D_i \mid \omega_j), \tag{20}$$
where $\hat{P}_j$ and $\hat{P}(D_i \mid \omega_j)$ are the empirical estimators of $P_j$ and $P(D_i \mid \omega_j)$, respectively.
The optimal rule is obtained by tuning $\lambda_j$, $\beta_j$, and $\sigma$ so that the estimated loss $\hat{c}(Z)$ computed on a validation set is minimum. This is accomplished by employing a global search for $\lambda_j$ and $\beta_j$ and an iterative search over the kernel parameter. For each given value $\sigma$ of the kernel parameter, $N$ $\nu$-1-SVMs are trained using the regularization path method on a training set. Then the minimization of $\hat{c}(Z)$ over a validation set is sought by solving an alternate optimization problem over $\lambda_j$ and $\beta_j$, which is easy since all $\nu$-1-SVM solutions are easily interpolated from the regularization path. $\sigma$ is chosen from a previously defined finite set of real numbers. The accompanying algorithm summarizes the proposed approach.
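A minimal sketch of the empirical estimator is given below. It relies on the fact that the estimated prior of a class times the estimated conditional probability of an option given that class is simply the fraction of validation samples with that (option, class) pair. The function name and the integer encoding of options and classes are assumptions of this example.

```python
from collections import Counter

def empirical_loss(cost, true_classes, decisions):
    """Estimate the cost-weighted loss by relative frequencies on a validation set."""
    n = len(true_classes)
    pair_counts = Counter(zip(decisions, true_classes))
    # P_j * P(D_i | w_j) is estimated by count(i, j) / n, so the class counts cancel out.
    return sum(cost[i][j] * c for (i, j), c in pair_counts.items()) / n
```

Here `cost[i][j]` is the cost of choosing option `i` for a sample of class `j`, `true_classes` holds class indices, and `decisions` holds option indices.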
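The alternate search over the regularization parameters and the weights can be pictured with the coordinate-wise sketch below. It is a simplified stand-in that scans finite grids, whereas the proposed method exploits the full regularization path; the `loss` callable, the grids, and the function name are hypothetical, and the outer loop over kernel parameters is omitted.

```python
def alternate_search(loss, lam_grids, beta_grid, sweeps=5):
    """Coordinate-wise minimization of loss(lams, betas) over per-class grids."""
    n = len(lam_grids)
    lams = [grid[0] for grid in lam_grids]
    betas = [beta_grid[0]] * n
    best = loss(lams, betas)
    for _ in range(sweeps):
        improved = False
        # Alternate between the regularization parameters and the class weights.
        for params, grids in ((lams, lam_grids), (betas, [beta_grid] * n)):
            for j in range(n):
                keep = params[j]
                for value in grids[j]:
                    params[j] = value
                    current = loss(lams, betas)
                    if current < best:
                        best, keep, improved = current, value, True
                params[j] = keep
        if not improved:
            break
    return lams, betas, best
```

In practice `loss` would rebuild the decision rule from the interpolated one-class models and return the estimated loss on the validation set; like any coordinate search, this sketch may stop at a local minimum.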
3. Experimental Results
In this section, two experiments are reported in order to assess the performance of the proposed approach. First, the cancer diagnosis problem is considered in the traditional Bayesian framework. Five gene expression datasets and five supervised algorithms are considered. The genes of each dataset were selected using the six test statistics of [1]. The decisions are given by the possible set of tumor classes and the loss function is defined as the probability of error, to make results comparable with those of [1]. Second, in order to show the advantages of considering multiclass cancer diagnosis in the class-selective rejection scheme, one gene dataset is considered and studied with an asymmetric loss function. A cascade of classifiers with rejection options is used to ensure a reliable diagnosis. For both experiments, the loss function was minimized by determining the optimal parameters $\lambda_j$ and $\beta_j$ for $j = 1, \dots, N$ for a given kernel parameter $\sigma$ and by testing different values of $\sigma$ in a predefined set. Finally, the decision rule which minimizes the loss function is selected.
3.1. Bayesian Framework
Five multiclass gene expression datasets, leukemia72 [24], ovarian [25], NCI [26, 27], lung cancer [28], and lymphoma [29], were considered. Table 1 describes the five gene datasets. For each dataset, the six test statistics $F$, $B$, $W$, $W^*$, $C$, and $H$ were used to select informative genes.
The cancer diagnosis problem was considered in the traditional Bayesian framework. Decisions were given by the set of possible classes and the loss function was defined by the error risk. This means that the $c_{ij}$ in (20) are defined according to Table 2. The performance of the proposed method was measured by evaluating its accuracy rate, and it was compared to the results obtained by the five predictors considered in [1]: Naive Bayes, Nearest Neighbor, Linear Perceptron, Multilayer Perceptron Neural Network with five nodes in the middle layer, and Support Vector Machines with a second-order polynomial kernel.
To compute the generalization accuracy of the proposed classifier, the Leave One Out (LOO) resampling method is used to divide a gene dataset of $n$ patients into two sets, a set of $n - 1$ patients and a test set of 1 blinded patient. This method involves $n$ separate runs. For each run, the first set of $n - 1$ samples is divided using 5-fold cross-validation (5-CV) into a training set and a validation set. $N$ $\nu$-1-SVMs are trained using the training set for all values of $\lambda_j$. The decision rule is obtained by tuning the parameters $\lambda_j$, $\beta_j$, and $\sigma$ for $j = 1, \dots, N$ so that the loss function computed on the validation set is minimum. The optimal parameters are then used to build the decision rule using the whole $n - 1$ samples. The blinded test set is classified according to this rule. The overall prediction error is the sum of the patients misclassified on all $n$ runs.
Table 3 reports errors of the proposed algorithm, the average value and the median value of the classifiers' prediction errors reported in [1] when informative genes are used. Table 4 reports values when informative genes are used. $F$, $B$, $W$, $W^*$, $C$, and $H$ represent the six test statistics.
Experimental results show that, for the ovarian, NCI, lung cancer, and lymphoma multiclass gene problems, the proposed approach achieves competitive performance compared to the classifiers reported in [1]. For these datasets, the prediction errors of the proposed approach are lower than the mean and median values of the classifiers' prediction errors reported in [1]. However, for leukemia72, the performance of the proposed algorithm is almost in the same range as that of the classifiers reported in [1]: its prediction error is equal to or, in the worst case, slightly higher than the mean and median errors.
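The resampling protocol can be sketched as nested index splits: an outer leave-one-out loop and, inside each run, a 5-fold split of the remaining samples. The sketch below only generates indices; the fold assignment is deterministic and not stratified, which is an assumption of this example since the text does not specify how the folds are drawn.

```python
def k_fold(indices, k=5):
    """Yield (training, validation) index lists; fold sizes differ by at most one."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def leave_one_out(n_samples, k=5):
    """Yield (test_index, rest, inner_folds) for each of the n_samples runs."""
    for test in range(n_samples):
        rest = [i for i in range(n_samples) if i != test]
        yield test, rest, list(k_fold(rest, k))
```

Each run would tune the parameters on the inner folds, rebuild the decision rule on all indices in `rest`, and classify the single sample `test`; the prediction error is then accumulated over the runs.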
Moreover, we can note that focussing on the test statistics comparison, experimental results confirm those of [1]. , and can be the most performing tests under variances heterogeneity assumptions.
3.2. Class-Selective Rejection Framework
In the following, we present the study of the lung cancer problem in the class-selective rejection scheme. Lung cancer diagnosis problem is determined by the gene expression profiles of lung tumors and normal lung specimens from patients whose clinical course was followed for up to years. The tumors comprised adenocarcinomas (ACs), squamous cell carcinomas (SCCs), large cell lung cancers (LCLCs), and small cell lung cancers (SCLCs). ACs are subdivided into three subgroups: AC of group 1 tumors, AC of group 2 tumors, and AC of group 3 tumors. Thus, the multiclass diagnosis cancer consists of classes.
The authors of [28] observed that AC of group 3 tumors shared strong expression of genes with LCLC and SCC tumors. Thus, poorly differentiated AC is difficult to distinguish from LCLC or SCC. Confusion matrices (Tables 5 and 6) computed in the Bayesian framework, with and support these claims. It can be noticed that of the misclassified patients and of the misclassified patients correspond to confusion between these three subcategories. Therefore, one may define a new decision option as a subset of these three classes to reduce error.
Moreover, the same study affirms that the distinction between patients with non-small cell lung tumors (SCC, AC, and LCLC) and those with small cell tumors (SCLC) is extremely important, since they are treated very differently. Thus, a confusion or wrong decision among patients with non-small cell lung tumors should cost less than a confusion between non-small and small cell lung tumors. Besides, one may provide an extra decision option that includes all the subcategories of tumors to avoid this kind of confusion. Finally, another natural decision option can be the set of all classes, which means that the classifier has fully withheld its decision.
Given all this information, the loss function can be empirically defined according to the asymmetric cost matrix given in Table 7. Solving the lung cancer problem in this scheme leads to the confusion matrix presented in Table 8. As a comparison with Table 5, one may mainly note that the number of misclassified patients decreases from to and withhold decisions or rejected patients. This partial rejection helps avoid confusion between non-small and small cell lung tumors and reduces errors due to indistinctness among LCLC, SCC, and AC3. Besides, in the example under study, no patient is totally rejected. This is an expected result since initially (Table 5) there was no confusion between normal and tumor samples.
To make a decision concerning the rejected patients, we may refer to clinical analysis. It is worth noting that, for partially rejected patients, clinical analysis will be less expensive in terms of time and money than for completely blinded patients. Moreover, a supervised solution can also be proposed. It aims to use genes selected by another test statistic in order to assign rejected patients to one of the possible classes. According to Tables 3 and 4, prediction errors computed on the same patients using genes selected by different test statistics may decrease, since the errors of two different test statistics do not occur on the same patients. Thus, we chose lung cancer dataset to reclassify the rejected patients of Table 8. Five of them were correctly classified while three remained misclassified. Results are reported in Table 9. The number of misclassified patients decreases to which is less than all the prediction errors obtained with informative genes (lung cancer problem prediction errors of Table 3). In fact, many factors play an important role in the cascade classifier system, such as the asymmetric cost matrix, which has been chosen empirically, the choice of test statistics, and the number of classifiers in the cascade. Such concerns are under study.
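A two-stage cascade of this kind can be sketched as follows. The first stage returns a set of candidate classes (a single class, a subset, or all classes), and a second classifier, trained for instance on genes selected by another test statistic, only arbitrates among those candidates. Restricting the second stage to the candidate set, as well as the callable interfaces used here, are assumptions of this example.

```python
def cascade_predict(sample, first_stage, second_stage):
    """Return a single class: stage one may answer with several candidates, stage two picks one."""
    candidates = first_stage(sample)
    if len(candidates) == 1:
        return next(iter(candidates))
    # Partial or total rejection: let the second classifier choose among the candidates only.
    scores = second_stage(sample)
    return max(candidates, key=lambda label: scores[label])
```

Here `first_stage(sample)` is expected to return a set of class labels and `second_stage(sample)` a mapping from class label to score; samples that remain doubtful after the second stage could still be referred to clinical analysis.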
4. Conclusion
Cancer diagnosis using genes involves a gene selection task and a supervised classification procedure. This paper tackles the classification step. It considers the problem of gene-based multiclass cancer diagnosis in the general framework of class-selective rejection. It gives a general formulation of the problem and proposes a possible solution based on $\nu$-1-SVM coupled with its regularization path. The proposed classifier minimizes any asymmetric loss function. Experimental results show that, in the particular case where decisions are given by the possible classes and the loss function is set equal to the error rate, the proposed algorithm is competitive with the state-of-the-art multiclass algorithms. In the class-selective rejection scheme, the proposed classifier ensures higher reliability and reduces time and expense by introducing partial and total rejection. Furthermore, results indicate that a cascade of classifiers with class-selective rejections can be considered a good way to obtain an improved supervised diagnosis. To get the most reliable diagnosis, the cost matrix defining the loss function should be carefully chosen. Finding the optimal loss function according to performance constraints is a promising approach [30] which is currently under investigation.
5. Acknowledgments
The authors thank Dr. Dechang Chen of the Uniformed Services University of the Health Sciences for providing the gene-selected data of leukemia72, ovarian, NCI, lung cancer, and lymphoma obtained with the six test statistics: ANOVA F, Brown-Forsythe, Welch, adjusted Welch, Cochran, and Kruskal-Wallis. This work was supported by the “Conseil Régional Champagne Ardenne” and the “Fonds Social Européen”.
