Abstract

With the recent development of biotechnologies, cDNA microarray chips are increasingly applied in cancer research. Microarray experiments can lead to a more thorough grasp of the molecular variations among tumors because they allow the expression levels of thousands of genes in cells to be monitored simultaneously. Accordingly, how to successfully discriminate tumor classes using gene expression data is an urgent research issue and plays an important role in the study of carcinogenesis. To refine the high-dimensional gene data and effectively classify tumor classes, this study proposes several hybrid discrimination procedures that combine statistical techniques and computational intelligence approaches. A real microarray data set is used to demonstrate the performance of the proposed approaches. In addition, the results of cross-validation experiments reveal that the proposed two-stage hybrid models are more efficient in discriminating the acute leukemia classes than the established single stage models.

1. Introduction

The recent development of cDNA microarray technologies has made it possible to analyze thousands of genes simultaneously and has led to the prospect of providing an accurate and efficient means for classifying and diagnosing human cancers [1–20]. Advances in microarray discrimination methods promise to greatly advance cancer diagnosis, especially in situations where tumors are clinically atypical. The main challenge of microarray analysis, however, is the overwhelming number of genes compared to the small number of available tumor samples, that is, a very large number of variables relative to the number of observations [10, 21–23]. As a consequence, the issue of developing an accurate discrimination method for tumor classification using gene expression data has received considerable attention recently.

Many approaches have been proposed for tumor classification using microarray data [10, 22–33]. The existing methods can be divided into two types: statistical methods [10, 22, 24–26] and computational intelligence methods [22, 27–33]. Because the dimension of the gene data is very large while only a few observations are available, it is essential to reduce and refine the whole data set before performing the classification tasks. While most related works have focused on the use of a single technique for tumor classification, little research has been done on the integrated use of several techniques to classify tumor classes. To achieve high accuracy for a particular classification problem with less computational time, hybrid evolutionary computation algorithms are commonly used to optimize the resolution process [34–36]. As a consequence, in this study, we aim to develop several effective two-stage hybrid discrimination approaches that integrate the framework of statistical methods with computational intelligence methods for tumor classification based on gene expression data.

The remainder of this paper is structured as follows. The second section reviews several existing approaches considered in our comparison study. The third section addresses the proposed hybrid approaches for tumor classification. The fourth section presents the classification results from cross-validation. The final section reports the research findings and concludes the study.

2. Review of Established Methods

Consider a two-class classification problem. Let $\mathbf{x}_j = (x_{1j}, x_{2j}, \ldots, x_{pj})^T$ be the gene expression profile vector, where $x_{ij}$ is the expression level of the $i$th gene in the $j$th tumor sample, $i = 1, \ldots, p$, $j = 1, \ldots, n$. Let $y_j$ be a binary disease status variable ($1$ for the case group and $-1$ for the control group, as a general example). Accordingly, the microarray data may be summarized as the following set:

$D = \{ (\mathbf{x}_j, y_j),\ j = 1, \ldots, n \}.$

The following sections briefly review several well-known established microarray classification methods.

2.1. Fisher’s Linear Discriminant Analysis

With the use of gene expression data, several studies proposed to apply Fisher’s linear discriminant analysis (FLDA) to classify and diagnose cancer [10, 22, 24]. Assume that independent observation vectors $\mathbf{x}_1^{(1)}, \ldots, \mathbf{x}_{n_1}^{(1)}$ and $\mathbf{x}_1^{(2)}, \ldots, \mathbf{x}_{n_2}^{(2)}$ are obtained from the two known groups $\pi_1$ and $\pi_2$, respectively. Let

$W(\mathbf{x}) = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)^T S_p^{-1} \mathbf{x},$

where $\bar{\mathbf{x}}_1$ and $\bar{\mathbf{x}}_2$ are the sample mean vectors of the two groups and $S_p$ is the pooled sample covariance matrix.

To classify a new observation $\mathbf{x}_0$, we can utilize the following FLDA allocation rule: allocate $\mathbf{x}_0$ to $\pi_1$ if

$W(\mathbf{x}_0) \ge \tfrac{1}{2} (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)^T S_p^{-1} (\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2),$

and to $\pi_2$ otherwise.
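For illustration, the allocation rule above can be sketched in a few lines of Python (a toy example with two hypothetical genes and made-up expression values, not the study's data; the function names are our own):

```python
def mean(vectors):
    """Componentwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def pooled_cov2(g1, g2):
    """Pooled 2x2 sample covariance matrix S_p of two groups of 2-D vectors."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for group in (g1, g2):
        m = mean(group)
        for v in group:
            d = [v[0] - m[0], v[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(g1) + len(g2) - 2
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]

def flda_allocate(x0, g1, g2):
    """Allocate x0 to group 1 (return 1) or group 2 (return 2) by Fisher's rule."""
    m1, m2 = mean(g1), mean(g2)
    s = pooled_cov2(g1, g2)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    a = [inv[0][0] * d[0] + inv[0][1] * d[1],   # a = S_p^{-1} (m1 - m2)
         inv[1][0] * d[0] + inv[1][1] * d[1]]
    mid = [(m1[0] + m2[0]) / 2.0, (m1[1] + m2[1]) / 2.0]
    w = a[0] * (x0[0] - mid[0]) + a[1] * (x0[1] - mid[1])
    return 1 if w >= 0 else 2

group1 = [[2.0, 1.0], [3.0, 1.0], [2.0, 2.0]]   # hypothetical case-group profiles
group2 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # hypothetical control-group profiles
print(flda_allocate([2.0, 1.0], group1, group2))  # -> 1
print(flda_allocate([0.0, 0.0], group1, group2))  # -> 2
```

The sketch is restricted to two variables so that the inverse of $S_p$ can be written explicitly; with more genes one would invert the full pooled covariance matrix.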

2.2. Logistic Regression

The microarray discrimination approach using the logistic regression (LR) model has also been studied for disease classification [22, 25, 26]. The structure of the logistic regression model can be briefly described as follows. Let $p(\mathbf{x}) = P(y = 1 \mid x_1, \ldots, x_p)$ be the conditional probability of the event under a given series of independent variables $x_1, \ldots, x_p$. The logistic regression model is then defined as follows:

$\operatorname{logit}\big( p(\mathbf{x}) \big) = \ln \frac{p(\mathbf{x})}{1 - p(\mathbf{x})} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$

A collinearity diagnosis procedure should be conducted first to exclude variables exhibiting high collinearity. After the collinearity diagnosis, the remaining variables are used for logistic regression modeling and testing. Afterward, using logistic regression with the Wald-forward method, we can identify the significant independent variables, say $x_{(1)}, \ldots, x_{(k)}$, and obtain a significant model

$\operatorname{logit}\big( \hat{p}(\mathbf{x}) \big) = \hat{\beta}_0 + \hat{\beta}_1 x_{(1)} + \cdots + \hat{\beta}_k x_{(k)}.$
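As a minimal illustration of the logit model, the following Python sketch fits a one-predictor logistic regression by plain gradient ascent on toy data (illustrative only; the study's LR modeling, including Wald-forward selection, would be done in a statistical package):

```python
import math

def fit_logistic(xs, ys, lr=0.5, iters=2000):
    """Fit logit(p) = b0 + b1*x by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # d loglik / d b0
            g1 += (y - p) * x    # d loglik / d b1
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def predict_prob(b0, b1, x):
    """Estimated P(y = 1 | x) under the fitted logit model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # toy predictor values
ys = [0, 0, 0, 1, 1, 1]              # toy binary outcomes
b0, b1 = fit_logistic(xs, ys)
```

After fitting, `predict_prob` is small for low values of `x` and large for high values, reflecting the positive slope `b1`.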

2.3. Artificial Neural Network

Based on gene expression profiles, the artificial neural network (ANN) has also been used to discriminate tumor classes [22, 27–29]. The ANN framework includes the input, hidden, and output layers. The nodes in the input layer receive input signals from an external source, and the nodes in the output layer provide the target output signals. For each neuron $j$ in the hidden layer and neuron $k$ in the output layer, the net inputs are given by

$net_j = \sum_{i} w_{ij} o_i \quad \text{and} \quad net_k = \sum_{j} w_{jk} o_j,$

where $i$ is a neuron in the previous layer, $w_{ij}$ is the connection weight from neuron $i$ to neuron $j$, and $o_i$ is the output of node $i$. The sigmoid output functions are given by

$o_j = \frac{1}{1 + \exp\big( -(net_j + \theta_j) \big)} \quad \text{and} \quad o_k = \frac{1}{1 + \exp\big( -(net_k + \theta_k) \big)},$

where $o_i = x_i$ is the input signal from the external source to node $i$ in the input layer and $\theta$ is a bias. The conventional technique used to derive the connection weights of the feedforward network is the generalized delta rule [37].
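A forward pass through such a network can be sketched as follows (a minimal Python illustration of the net-input and sigmoid equations above, with hypothetical weights; training by the generalized delta rule is omitted):

```python
import math

def sigmoid(z):
    """Sigmoid activation 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass of a single-hidden-layer feedforward network.

    Each hidden neuron j computes net_j = sum_i w_ij * o_i, then outputs
    sigmoid(net_j + bias_j); the single output neuron does the same over
    the hidden outputs.
    """
    hidden = []
    for w_j, b_j in zip(w_hidden, b_hidden):
        net_j = sum(w * o for w, o in zip(w_j, x))
        hidden.append(sigmoid(net_j + b_j))
    net_k = sum(w * o for w, o in zip(w_out, hidden))
    return sigmoid(net_k + b_out)

# With all-zero weights and biases, every activation is sigmoid(0) = 0.5:
print(forward([0.0, 0.0], [[0.0, 0.0]], [0.0], [0.0], 0.0))  # -> 0.5
```

In a classification setting the output would be thresholded (e.g., at 0.5) to produce a class label.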

2.4. Support Vector Machine

To classify tumor classes using microarray data, the discrimination method with the use of the support vector machine (SVM) has also been discussed [22, 30–33]. The structure of the SVM algorithm can be described as follows. Let $\{(\mathbf{x}_i, y_i)\}$, $i = 1, \ldots, n$, be the training set of input vectors and labels, where $n$ is the number of sample observations, $\mathbf{x}_i \in \mathbb{R}^d$ with $d$ the dimension of each observation, and $y_i \in \{-1, +1\}$ is the known target. The algorithm seeks the hyperplane $\mathbf{w}^T \mathbf{x} + b = 0$, where $\mathbf{w}$ is the normal vector of the hyperplane and $b$ is a bias term, to separate the data from the two classes with maximal margin width $2/\lVert \mathbf{w} \rVert$. To obtain the optimal hyperplane, the SVM solves the following optimization problem:

$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0. \quad (10)$

Because it is difficult to solve (10) directly, the SVM transforms the optimization problem into its dual problem by the Lagrange method. The Lagrange multipliers $\alpha_i$ must be nonnegative real coefficients. Equation (10) is transformed into the following constrained form [38]:

$\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \ 0 \le \alpha_i \le C. \quad (11)$

In (11), $C$ is the penalty factor and determines the degree of penalty assigned to an error. Typically, a linear separating hyperplane cannot be found for all application data. For problems that cannot be linearly separated in the input space, the SVM employs the kernel method to transform the original input space into a high-dimensional feature space where an optimal linear separating hyperplane can be found. Common kernel functions include the linear, polynomial, radial basis function (RBF), and sigmoid kernels. Although several choices for the kernel function are available, the most widely used is the RBF, defined as [39]

$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\big( -\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 \big),$

where $\gamma$ denotes the width of the RBF. Consequently, the RBF kernel and the multiclass SVM method [40] are used in this study.
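As a small illustration, the RBF kernel and the resulting dual-form decision function can be sketched as follows (Python, with hypothetical support vectors and multipliers; in practice these come from solving the dual problem with a package such as LIBSVM):

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq)

def svm_decision(x, support, alphas, labels, b, gamma):
    """Sign of sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    s = sum(a * y * rbf_kernel(sv, x, gamma)
            for sv, a, y in zip(support, alphas, labels)) + b
    return 1 if s >= 0 else -1

# Two hypothetical support vectors with opposite labels; a point near the
# positive one is classified as +1:
print(svm_decision([0.2, 0.0],
                   [[0.0, 0.0], [2.0, 0.0]],  # support vectors
                   [1.0, 1.0],                # multipliers alpha_i
                   [1, -1],                   # labels y_i
                   0.0, 0.5))                 # bias b, width gamma
```

The decision function here is exactly the dual-form expansion implied by (11), with the inner product replaced by the RBF kernel.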

2.5. Multivariate Adaptive Regression Splines

The multivariate adaptive regression splines (MARS) method has also been applied for tumor classification using gene expression data [22, 30]. The general MARS function can be represented as follows:

$\hat{f}(\mathbf{x}) = a_0 + \sum_{m=1}^{M} a_m \prod_{k=1}^{K_m} \big[ s_{km} (x_{v(k,m)} - t_{km}) \big]_+,$

where $a_0$ and $a_m$ are the parameters, $M$ is the number of basis functions (BF), $K_m$ is the number of knots, $s_{km}$ takes on values of either $1$ or $-1$ and indicates the right or left sense of the associated step function, $v(k,m)$ is the label of the independent variable, and $t_{km}$ is the knot location. The optimal MARS model is chosen in a two-step procedure. First, a large number of basis functions are constructed to fit the data initially. Second, basis functions are deleted in order of least contribution using the generalized cross-validation (GCV) criterion. To measure the importance of a variable, we can observe the decrease in the calculated GCV value when the variable is removed from the model. The GCV is defined as follows:

$\operatorname{GCV}(M) = \frac{\frac{1}{n} \sum_{i=1}^{n} \big( y_i - \hat{f}_M(\mathbf{x}_i) \big)^2}{\big( 1 - C(M)/n \big)^2},$

where $n$ is the number of observations and $C(M)$ is the cost penalty measure of a model containing $M$ basis functions.
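A minimal sketch of the MARS ingredients, namely the truncated-linear basis functions, the additive model, and the GCV criterion, may look as follows (Python, with hypothetical coefficients and knots; real MARS software performs the forward/backward search automatically):

```python
def hinge(x, t, s):
    """MARS truncated-linear basis [s*(x - t)]_+ with s = +1 or -1."""
    return max(s * (x - t), 0.0)

def mars_predict(x, a0, terms):
    """Evaluate a0 + sum_m a_m * prod_k hinge(x[v], t, s).

    `terms` is a list of (a_m, factors), where each factor is a tuple
    (v, t, s): variable index, knot location, and sense.
    """
    y = a0
    for a_m, factors in terms:
        prod = 1.0
        for v, t, s in factors:
            prod *= hinge(x[v], t, s)
        y += a_m * prod
    return y

def gcv(rss, n, c_m):
    """GCV = (RSS/n) / (1 - C(M)/n)^2."""
    return (rss / n) / (1.0 - c_m / n) ** 2

# One hypothetical basis function on variable 0 with a knot at 1.0:
print(mars_predict([2.0], 1.0, [(3.0, [(0, 1.0, 1)])]))  # -> 4.0
```

Deleting a basis function raises the RSS but lowers the penalty $C(M)$; the backward step keeps whichever model has the smaller GCV.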

3. The Proposed Hybrid Discrimination Methods

The two-stage hybrid procedure is commonly used in various fields such as financial distress warning systems [41, 42], the medical field [43], statistical inference [44, 45], and statistical process control [36, 46–48]. To obtain the best accuracy for a specific classification problem, hybrid evolutionary computation algorithms are commonly used to optimize the resolution process [34–36]. In this section, several two-stage hybrid discrimination methods that integrate the framework of statistical approaches and computational intelligence methods are proposed for tumor classification based on gene expression microarray data.

The proposed methods include five components: the FLDA, the LR model, the MARS model, the ANN, and the SVM classifiers. The proposed hybrid discrimination methods combine the statistical-based discrimination methods and computational intelligence methods. In stage 1, influencing variables are selected using LR or MARS. In stage 2, the selected important influencing variables are then taken as the input variables of FLDA, LR, ANN, SVM, or MARS. The following sections address the proposed approaches.

3.1. Two-Stage Hybrid Method of LR and Various Classifiers

Stage 1. Substitute the independent variables $x_1, \ldots, x_p$ and the dependent variable $y$ into logistic regression. Apply logistic regression with the Wald-forward method to identify the significant independent variables, say $x_{(1)}, \ldots, x_{(k)}$.

Stage 2. Substitute the significant independent variables obtained in Stage 1 and the dependent variable $y$ into various classifiers such as FLDA, ANN, SVM, or MARS. The corresponding hybrid methods are referred to as LR-FLDA, LR-ANN, LR-SVM, and LR-MARS, respectively.

3.2. Two-Stage Hybrid Method of MARS and Various Classifiers

Stage 1. Substitute the independent variables $x_1, \ldots, x_p$ and the dependent variable $y$ into multivariate adaptive regression splines. Use multivariate adaptive regression splines to identify the significant independent variables, say $x_{(1)}, \ldots, x_{(k)}$.

Stage 2. Substitute the significant independent variables obtained in Stage 1 and the dependent variable $y$ into various classifiers such as FLDA, LR, ANN, or SVM. The corresponding hybrid methods are referred to as MARS-FLDA, MARS-LR, MARS-ANN, and MARS-SVM, respectively.
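The two-stage procedures above share one skeleton: a Stage 1 variable selector followed by a Stage 2 classifier. That skeleton can be sketched generically (a Python illustration with pluggable stub functions; `select` and `classify` are hypothetical placeholders for the LR/MARS selection and the FLDA/LR/ANN/SVM/MARS classification steps):

```python
def two_stage_classify(select, classify, X_train, y_train, X_test):
    """Generic two-stage hybrid discrimination.

    Stage 1: `select(X, y)` returns the indices of the selected variables.
    Stage 2: `classify(X, y, X_new)` fits a classifier on the reduced
    inputs and returns predicted labels for X_new.
    """
    cols = select(X_train, y_train)
    def reduce_cols(X):
        return [[row[j] for j in cols] for row in X]
    return classify(reduce_cols(X_train), y_train, reduce_cols(X_test))

# Stub selector and classifier for demonstration only:
sel = lambda X, y: [1]                                              # keep variable 1
clf = lambda Xtr, ytr, Xte: [1 if r[0] > 0 else -1 for r in Xte]    # threshold rule
print(two_stage_classify(sel, clf,
                         [[9.0, 1.0], [9.0, -1.0]], [1, -1],
                         [[0.0, 2.0], [0.0, -2.0]]))                # -> [1, -1]
```

The point of the skeleton is that any of the eight proposed hybrids arises simply by swapping in a different selector/classifier pair.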

4. The Cross-Validation Experiments

This study performs a series of cross-validation experiments to compare the proposed approaches with those previously discussed in literature. This study considers a leukemia dataset that was first described by Golub et al. [5] and was examined in Dudoit et al. [10] and Lee et al. [22]. This dataset contains 6817 human genes and was obtained from Affymetrix high-density oligonucleotide microarrays. The data consist of 25 cases of acute myeloid leukemia (AML) and 47 cases of acute lymphoblastic leukemia (ALL).

Since the dimension of the data is very large ($p = 6817$ genes) but there are only a few observations ($n = 72$ samples), it is essential to reduce and refine the whole set of genes (independent variables) before we can construct the discrimination model. To refine the set of genes, Golub et al. [5], Dudoit et al. [10], and Lee et al. [22] proposed methods based on subjective ratios to select genes. The two-sample $t$-test is the most popular test for the difference in means between two groups. For the sake of rigor, instead of using a somewhat arbitrary criterion like that used in Golub et al. [5], Dudoit et al. [10], or Lee et al. [22], this study applies the two-sample $t$-test with a significance level of 0.0001 to select the influencing genes. The results are given in Table 1.
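The gene-screening step can be sketched as follows (a minimal Python illustration of the pooled-variance two-sample $t$ statistic on toy expression rows; the cutoff `t_cut` stands in for the critical value at the chosen significance level and is a hypothetical parameter name):

```python
import math

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic for samples a and b."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    sp2 = (ssa + ssb) / (na + nb - 2)          # pooled variance
    return (ma - mb) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))

def select_genes(expr_by_gene, labels, t_cut):
    """Keep indices of genes whose |t| exceeds the critical value t_cut."""
    kept = []
    for g, row in enumerate(expr_by_gene):
        a = [x for x, y in zip(row, labels) if y == 1]
        b = [x for x, y in zip(row, labels) if y == -1]
        if abs(two_sample_t(a, b)) > t_cut:
            kept.append(g)
    return kept

# Toy data: gene 0 clearly differs between groups, gene 1 does not.
labels = [1, 1, 1, -1, -1, -1]
expr = [[5.0, 5.1, 4.9, 0.0, 0.1, -0.1],
        [1.0, 0.9, 1.1, 1.0, 1.1, 0.9]]
print(select_genes(expr, labels, 4.0))  # -> [0]
```

With $\alpha = 0.0001$ the critical value would be looked up from the $t$ distribution with $n_1 + n_2 - 2$ degrees of freedom rather than fixed at 4.0.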

The significant variables selected using the two-sample $t$-test then serve as the input variables of the established single stage discrimination methods reviewed in Section 2 and the proposed two-stage hybrid methods introduced in Section 3. To examine the presence of collinearity, the variance inflation factor (VIF) was calculated. As shown in Table 2, all the VIF values are less than 10. Consequently, there is no high collinearity among these variables. In addition, this study adopts the suggestions of Dudoit et al. [10] and Lee et al. [22] and performs a 2 : 1 cross-validation (training set : test set).
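For the two-predictor case, the VIF reduces to $1/(1-r^2)$ with $r$ the pairwise sample correlation, which can be sketched as follows (a Python illustration on toy vectors; general VIF computation regresses each predictor on all the others):

```python
def vif_two(x1, x2):
    """VIF for either predictor in a two-predictor model: 1 / (1 - r^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r2 = cov * cov / (v1 * v2)   # squared sample correlation
    return 1.0 / (1.0 - r2)

# Uncorrelated predictors give VIF = 1; nearly collinear ones exceed 10.
print(vif_two([1.0, 2.0, 1.0, 2.0], [1.0, 1.0, 2.0, 2.0]))  # -> 1.0
```

Values above the conventional threshold of 10 would flag the kind of high collinearity that Table 2 rules out for the selected genes.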

The difficulty with the ANN is that the design parameters, such as the number of hidden layers and the number of neurons in each layer, have to be set before the training process can proceed. The user has to select the ANN structure and set the values of certain parameters for the ANN modeling process. However, there is no general and explicit approach for selecting optimal parameters for ANN models [49]. Accordingly, the selection of design parameters for the ANN may be based on a trial-and-error procedure.

This study employs the highest accurate classification rate (ACR) as the criterion for selecting the ANN topology. A topology is specified by the number of neurons in the input layer, the number of neurons in the hidden layer, the number of neurons in the output layer, and the learning rate. Too few hidden nodes would limit the network's generalization capability, while too many hidden nodes may result in overtraining or memorization by the network. Since there are 11 input nodes and one output node used in this study, the numbers of hidden nodes tested were 9, 10, 11, 12, and 13. The learning rates were chosen as 0.1, 0.01, and 0.001, respectively. After performing the ANN modeling, this study selected the topology with the best ACR results.

This study also applied SVM modeling to the microarray dataset. The two parameters $C$ and $\gamma$ are the most important factors affecting the performance of the SVM. The grid search method uses exponentially growing sequences of $C$ and $\gamma$ to determine good parameters. The parameter set of $C$ and $\gamma$ that generates the highest ACR is considered the ideal set. Here, the best values of the two parameters $C$ and $\gamma$ are 2 and 0.5, respectively. An SVM package was used to run the dataset, and the corresponding output is displayed in Algorithm 1. Observing Algorithm 1, in the case of $C = 2$ and $\gamma = 0.5$, we have ACR = 100% for the initial training stage. In the testing stage, using the same parameter settings, we obtain ACR = 25% and ACR = 93.75% for AML and ALL, respectively. Accordingly, the ACR is 70.83% for the full sample.

> # Find the best parameters gamma & cost
> p <- seq(-1, 1, 1)
> obj <- tune.svm(y~., data=train, sampling="cross", gamma=2^p, cost=2^p)
> obj
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation
- best parameters:
 gamma cost
   0.5    2
> # Build the SVM model
> svm.model <- svm(y~., data=train, type="C-classification",
+                  gamma=obj$best.parameters$gamma, cost=obj$best.parameters$cost)
> # Classification capability: training set
> svm.pred <- predict(svm.model, train)
> tab <- table(predict=svm.pred, true=train[,1])
> tab
       true
predict  0  1
      0 17  0
      1  0 31
> cat("Accurate Classification Rate =", 100*sum(diag(tab))/sum(tab), "%\n")
Accurate Classification Rate = 100 %
> # Classification capability: test set
> svm.pred <- predict(svm.model, test)
> tab <- table(predict=svm.pred, true=test[,1])
> tab
       true
predict  0  1
      0  2  1
      1  6 15
> cat("Accurate Classification Rate =", 100*sum(diag(tab))/sum(tab), "%\n")
Accurate Classification Rate = 70.83333 %

For MARS modeling, the results are displayed in Table 3. During the selection process, four important explanatory variables were chosen. The corresponding relative importance indicators are shown in Table 3. Those four important variables then serve as the input variables for the hybrid modeling process. In addition, the ACR results for each model are listed in Table 4.

The rationale behind the proposed hybrid discrimination method is to obtain fewer but more informative variables by performing the first-stage LR or MARS modeling. The selected significant variables then serve as the inputs for the second-stage discrimination approach. In this study, the LR and MARS modeling each selected four significant variables. For the hybrid LR-ANN model, the selected topology provided the best ACR results, as did the selected topology for the MARS-ANN hybrid model. Additionally, for both LR-SVM and MARS-SVM modeling, the best values of the two parameters $C$ and $\gamma$ are the same, namely 2 and 0.5, respectively.

For each of the thirteen approaches, FLDA, LR, ANN, SVM, MARS, LR-FLDA, LR-ANN, LR-SVM, LR-MARS, MARS-FLDA, MARS-LR, MARS-ANN, and MARS-SVM, this study presents the corresponding ACRs in Table 4. Comparing the ACRs for AML, while LR has the highest ACR (62.50%) among the 5 single stage methods, both LR-SVM and MARS-LR have the highest ACR (75.00%) among the 8 two-stage methods. Apparently, the two-stage methods provide better classification performance. Comparing the ACRs for ALL, the single stage methods FLDA, ANN, and SVM give the highest ACR (93.75%), and the two-stage methods LR-ANN, LR-MARS, and MARS-ANN achieve the same ACR (93.75%). It seems that the single stage and two-stage methods achieve a similar performance for ALL. As shown in Table 4, among the thirteen methods, the two-stage hybrid model LR-MARS has the highest ACR (83.33%) for the full sample. As a consequence, the proposed two-stage hybrid approaches are more efficient for tumor classification than the established single stage methods.

In addition, Table 5 lists the overall averaged ACRs and the associated standard errors (in parentheses) for the single stage and two-stage methods. Comparing the single stage methods with the proposed two-stage methods in Table 5, one can observe that the proposed methods almost always provide more accurate results. Although the single stage methods have a larger averaged ACR than the two-stage methods in classifying ALL, the difference is small. In addition, Table 5 shows that the proposed two-stage approaches have smaller standard errors in all cases, which implies the robustness of the mechanisms. Figure 1 provides a comparison with respect to the overall improvement percentage over the single stage methods. From Figure 1, it can be seen that the two-stage approaches are more robust than the single stage methods.

5. Conclusions

This study proposes several two-stage hybrid discrimination approaches for tumor classification using microarray data. The proposed approaches integrate the framework of several frequently used statistical-based discrimination methods and computational intelligence classifying techniques. Based on the results of cross-validation in Table 4, it can be easily observed that the proposed hybrid method LR-MARS is more appropriate for discriminating the tumor classes.

Computational intelligence methodology is very useful in many application areas and can deal with complex and computationally intensive problems. Using several computational intelligence techniques, this study develops two-stage hybrid discrimination approaches for tumor classification. The proposed hybrid models are not the only discrimination methods that can be employed, and further research can build on this work. For example, one can combine other computational intelligence techniques, such as rough set theory [50] or the extreme learning machine, with neural networks or support vector machines to further refine the structure and improve the classification accuracy. Extensions of the proposed two-stage hybrid discrimination method to other statistical techniques or to multistage discrimination procedures are also possible. Such work deserves further research and is a direction for our future study.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was partially supported by the Ministry of Science and Technology, Taiwan (Republic of China), Grant no. MOST 103-2118-M-030-001 and Grant no. MOST 103-2221-E-030-021.