Classification of cancers based on gene expressions produces better accuracy when compared to that of the clinical markers. Feature selection improves the accuracy of these classification algorithms by reducing the chance of overfitting that happens due to large number of features. We develop a new feature selection method called Biological Pathway-based Feature Selection (BPFS) for microarray data. Unlike most of the existing methods, our method integrates signaling and gene regulatory pathways with gene expression data to minimize the chance of overfitting of the method and to improve the test accuracy. Thus, BPFS selects a biologically meaningful feature set that is minimally redundant. Our experiments on published breast cancer datasets demonstrate that all of the top 20 genes found by our method are associated with cancer. Furthermore, the classification accuracy of our signature is up to 18% better than that of vant Veers 70 gene signature, and it is up to 8% better accuracy than the best published feature selection method, I-RELIEF.

1. Introduction

An important challenge in cancer treatment is to classify a patient to an appropriate cancer class. This is because class specific treatment reduces toxicity and increases the efficacy of the therapy [1]. Traditional classification techniques are based on different kinds of clinical markers such as the morphological appearance of tumors, age of the patients, and the number of lymph nodes [2]. These techniques however have extremely low (9%) prediction accuracies [3].

Class prediction based on gene expression monitoring is a relatively recent technology with a promise of significantly better accuracy compared to the classical methods [1]. These algorithms often use microarray data [4] as input. Microarrays measure gene expression and are widely used due to their ability to capture the expression of thousands of genes in parallel. A typical microarray database contains gene expression profiles of a few hundred patients. For each patient (also called observation), the microarray records expressions of more than 20 000 genes. We define an entry of a microarray as a feature.

Classification methods often build a classification function from a training data. The class labels of all the samples in the training data are known in advance. Given new sample, the classification function assigns one of the possible classes to that sample. However, as the number of features is large and the number of observations is small, standard classification algorithms do not work well on microarray data. One potential solution to this problem is to select a small set of relevant features from all microarray features and use only them to classify the data.

The research on microarray feature selection can be divided into three main categories: filter, wrapper, and embedded [5]. We elaborate on these methods in Section 2. These methods often employ statistical scoring techniques to select a subset of features. Selection of a feature from a large number of potential candidates is however difficult as many candidate features have similar expressions. This potentially leads to inclusion of biologically redundant features. Furthermore, selection of redundant features may cause exclusion of biologically necessary features. Thus, the resultant set of features may have poor classification accuracy.

One way to select relevant features from microarray data is to exploit the interactions between these features, which is the problem considered in this paper. More specifically we consider the following problem.

Problem Statement
Let be the training microarray dataset where each sample belongs to one of the possible classes. Let be the gene regulatory and the signaling network. Choose features using and so that these features maximize the classification accuracy for an unobserved microarray sample that has the same distribution of values as those in .

Unlike most of the traditional feature selection methods, we integrate gene regulatory and signaling pathways with microarray data to select biologically relevant features. On the pathway, one gene can interact with another in various ways, such as by activating or inhibiting it. In Figure 1, RacGEF activates RAC, BAD inhibits Bcl-xl, and PKB/Akt inhibits BAD by phosphorylation. We use the term influence to imply this interaction between two genes. We quantify influence by considering the number of intermediate genes between two genes on the pathway that connects them. The influence is the highest when two genes are directly connected. Our hypothesis in this paper is that selecting two genes that highly influence each other often implies inclusion of biologically redundant genes. The rationale behind this is that manipulating one of these genes will have significant impact on the other one. Thus, selecting one of them produces comparable prediction accuracy. So we choose the set of features such that each of them has the lowest influence on other selected features.

We propose a novel algorithm called Biological Pathway-based Feature Selection algorithm (BPFS) based on the above hypothesis that has the following characteristics.

Let the complete set of features be and the set of already selected features be . BPFS ranks all the features in with an SVM-based scoring method MIFS [6]. The score quantifies the capacity of a feature to improve the already attained classification accuracy. BPFS ranks features in decreasing order of their scores.

BPFS chooses a small subset of highly ranked features from and evaluates the influence of every feature in on the features in . Finally, it selects the feature in that has the lowest influence on the features in and moves it to from . BPFS repeats this step for a fixed number of iterations.

We observe that a significant fraction of the gene entries in the microarray do not have any corresponding gene in the pathway. We use the term unresolved genes to represent these genes. We propose a probabilistic model to estimate the influence of those genes on selected features.

We tested the performance of our method on five breast cancer data sets [711] to predict whether breast cancer for those patients relapsed before five years or not. Our experiments show that our method achieves up to 18% and 8% better accuracy than the 70-gene prognostic signature [7] and I-RELIEF [2], respectively.

The organization of the rest of this paper is as follows. Section 2 discusses the background material. Section 3 describes the proposed algorithm. Section 4 presents experimental results. Section 5, briefly, concludes the paper.

2. Background

Feature selection is an important area in data mining for preprocessing of data. Feature selection techniques select a subset of features to reduce relatively redundant, noisy, and irrelevant part of the data. The reduced set of features increases the speed of the data mining algorithms and improves accuracy and understandability of result. Feature selection is often used in areas such as sequence, microarray, and mass-spectra analysis [5]. The popular feature selection methods can be broadly categorized into the following.

Filter Methods (see [1214])
These are widely studied methods that work independent of the classifier. They rank the features depending on the intrinsic properties of data. One such method is to select sets of features whose pairwise correlations are as low as possible.

Wrapper Methods (see [15, 16])
These methods embed the feature selection criteria into the searching of subset of features. They use a classification algorithm to select the feature set and evaluate its quality using the classifier.

Embedded Methods (see [17])
These approaches select features as a part of the classification algorithm. Similar to the wrapper methods, they interact with the classifier, but at a lower cost of computation.
All the above-mentioned traditional feature selection methods ignore the interactions of the genes. Considering each gene as an independent entity can lead to redundancy and low classification accuracy as many genes can have similar expression patterns.

Several recent works on microarray feature selection have leveraged metabolic and gene interaction pathways in their methods. Vert and Kanehisa [18] encoded the graph and the set of expression profiles into kernel functions and performed a generalized form of canonical correlation analysis in the corresponding reproducible Hilbert spaces. Rapaport et al. [19] proposed an approach based on spectral decomposition of gene expression profiles with respect to the eigenfunctions of the graph. Wei and Pan [20] proposed a spatially correlated mixture model, where they extracted gene specific prior probabilities from gene network. Wei and Li [21] developed a Markov random field-based method for identifying genes and subnetworks that are related to a disease. A drawback of the last two model-based approaches is that the number of parameters to be estimated is proportional to the number of genes. So optimizing the objective function is costly as the number of genes in microarrays is more than 20 000. C. Li and H. Li [22] introduced a network constraint regularization procedure for linear regression analysis, which formulates the Laplacian of the graph extracted from genetic pathway as a regularization constraint.

One limitation of all the above mentioned methods that use biological pathway is that, all of them consider genetic interactions between immediate neighbors on the pathway. None of them explicitly consider interactions that are beyond immediate neighbors. Also, most of them performed quantitative analysis of the selected features on simulated datasets. So it is not possible to quantify the accuracy of those selected features on some real datasets only from the results in these papers. Additional experiments on real microarray datasets are required to justify those methods and their set of features. Also, as the reconstruction of genetic pathway is yet to be completed, we cannot always map a microarray entry to the biological pathway. They do not consider the implications of those missing information. In this paper, we introduce a new microarray feature selection method that addresses these issues.

3. Algorithm

This section describes our Biological Pathway-based Feature Selection algorithm (BPFS) in detail. BPFS takes a labeled two-class microarray data as input and selects a set of features. Algorithm 1 portrays a synopsis of BPFS. We discuss an overview of BPFS next.

* G and S denote the set of all features and the set of
* selected features respectively. Set represents all
* the remaining features.
Select the first feature from that has highest mutual information.
Repeat till there is more features to select.
  (a) Calculate marginal classification power for all the features in .
  (b) Select top features with highest marginal classification power as candidate set .
  (c) Calculate Total Influence Factor (TIF) for all the features in .
  (d) Select the feature with lowest TIF and include it into .

We denote the set of all features by . Let be the set of features selected so far. The set represents all the remaining features. BPFS iteratively moves one feature in to using the following steps, till the required number of features is selected (along with their rank).

Step 1 . (determine the best candidates (see 2(a) to 2(b) in Algorithm 1)). This step creates a candidate set of features from by considering their classification accuracy alone. To do this, BPFS first sorts all the available features in decreasing order of their marginal classification power and chooses the top (typically = 10 in practice) of them as the candidate set for next step. We define the marginal classification power of a feature as its ability to improve the classification accuracy when we include it into . Let us denote the set that contains these top features by the variable .

Step 2 . (pick the best gene using pathways (see 2(c) to 2(d) in Algorithm 1)). In this step, we use signaling and regulatory pathways to distinguish among the feature set obtained in Step 1. Given a set of already selected features , BPFS aims to select the next most biologically relevant feature from . We define a metric to compare the features in for this purpose. This metric estimates the total influence between a candidate feature and the set of selected features. We denote this total influence as the Total Influence Factor (TIF). TIF is a measure of the potential interaction (activation, inhibition, etc.) between a candidate gene and all the selected genes. A high value of TIF for a gene implies that the gene is highly influenced by some or all of the already selected set of genes. We choose the gene in that has the lowest TIF. We then include it in . We elaborate this step in Section 3.3.
In the following subsections we discuss the above aspects of our algorithm in more detail. Section 3.1 defines how we select our first feature. Section 3.2 discusses the first round of selection procedures based on classification capability. Section 3.3 describes the use of pathways for feature selection. Section 3.4 presents a technique to utilize the training space efficiently in order to improve the quality of features.

3.1. Picking the First Feature: Where to Start?

BPFS incrementally selects one feature at a time based on the features that are already selected. The obvious question, then is, how do we select the first feature? There are many alternative ways to do this. One possibility is to get an initial feature using domain knowledge. This, however, is not feasible if no domain knowledge exists on the dataset.

We use mutual information to quantify the discriminating power of a feature. Let us represent the th feature of microarray using a random variable and the class label of the data using another random variable . Assume that there are observations in the data. and can assume different values over those observations. Let an instance of and be an instance of . The mutual information of and is where is the joint probability mass function of and ; and are the respective marginal probability mass function of and . Thus, we use to quantify the relevance of the th feature for classification. We choose the feature with maximum mutual information as the starting feature.

Another way to select the first feature can be to utilize the marginal classification power. Essentially, it is way we can apply the second step of our algorithm which and select the top candidate as the first candidate.

Next we discuss how we select the remaining features.

3.2. Selecting the Candidate Features

In this step, BPFS sorts all the available features in in decreasing order of their marginal classification power. We define the marginal classification power of a feature later in this section. BPFS then chooses the features with the highest marginal classification power as the candidate set that will be explored more carefully in the subsequent steps. We elaborate on this next.

We use an SVM based algorithm, MIFS [6], to calculate the marginal classification power of all available features as follows. BPFS, first, trains SVM using the features in to get the value of the objective function of SVM. We use linear kernel for the SVM in our experiments. For very high dimensional data a linear kernel performs better than or comparable to a non-linear kernel [23]. A linear kernel is a simple dot product of the two inputs. So the objective function of SVM becomes where , and denote the Lagrange multiplier, the class label and the value of selected set of features of th observation respectively. Here, are vectors and is the dot product of them. Then for each feature , BPFS calculates the objective function if is added to as

Here denotes the value of the selected set of features along with the aforementioned feature for the th observation. Using the last two equations, we calculate the marginal classification power of a feature , as the change in the objective function of SVM, when is included in . We denote this value with variable . To paraphrase, marginal classification power of a feature is the capability of a new feature to improve the classification accuracy of a set of selected features, when the new feature is added to the already selected set. Formally, we compute for all as . BPFS sorts all the features in descending order of . It considers the top ( = 10 in our experiments) genes as possible candidates for the next round. Let denote the set of these genes. In the next steps, BPFS examines biological networks to find out the most biologically meaningful feature in .

3.3. Selecting the Best Candidate Gene

All the features in the candidate set often have high marginal classification power. In this step, we distinguish the features in by considering their interactions with the features in (the set of features that are already selected). We hypothesize that if a feature in is influenced by the features in greatly, then that feature is redundant for even if it has high marginal classification power. We discuss how we measure the influence of a feature on another one next.

Consider the entire pathway as a graph, where all the genes are vertices and there is an edge between two vertices if they interact with each other. In this paper, we do not consider any specific pathway such as p53 signaling pathway, rather a consolidation of all the available human signalling and regulatory pathways. If we have had the knowledge about the pathways that are affected by that specific biological condition (such as cancer), we could select features only from those pathways. However, the available literature does not provide the comprehensive list of affected pathways most of the time. Thus, we create a consolidation of all the regulatory and signaling pathways.

There are different kinds of interactions such as activation and inhibition. If two genes do not have a common edge, but they are still connected by a path, it means that they interact indirectly through a chain of genes. For example, in Figure 1, RacGEF activates RAC and RAC activates NFkB. Thus, RacGEF activates NFkB through RAC. We, therefore, compute the distance between them as two. A higher number of edges on the path that connects two genes implies feebler influence.

An abnormally expressed gene does not necessarily imply that its neighbor will be abnormally expressed [24]. This is because the interaction between two genes is a probabilistic event. For example, in Figure 1, if RacGEF becomes aberrantly expressed, there is a probability that RAC is also expressed aberrantly. Let us denote this probability with the variable . Similarly, if RAC is abnormally expressed, NFkB is abnormally expressed with a probability . So, if RacGEF is over expressed, NFkB can be over expressed with a probability of . This leads to the conclusion that as the number of hops increases the influence decreases exponentially. Thus, we use an exponential function to model this influence.

To quantify influence we define a metric, termed Influence Factor (IF), between two genes and as , provided . Here is the length of the shortest path that connects and on the pathway. To calculate total influence on a candidate gene asserted by a selected set of genes we calculate IF between the candidate gene and every gene in the selected set and sum it up. We call it the total influence factor (TIF) of a candidate gene with respect to already set of selected genes. Formally,

is zero if there is no path between and .

For example, in Figure 1 consider PKB/Akt as a candidate gene and assume that consists of two genes NFkB and CASP9. CASP9 is one hop away from PKB/Akt. So IF(PKB/Akt, CASP9) = 1. The shortest path from PKB/Akt to NFkB is of two hops through IKK. So IF(PKB/Akt, NFkB) = 0.5. Thus, TIF(PKB/Akt, CASP9, NFkB) = 1.5.

BPFS calculates TIF for all the genes in and selects the one that generates the lowest value of TIF. A low TIF value implies lower aggregate influence on the set of selected features . To paraphrase, the gene with lowest TIF is least interacting with the genes in . So, we select the gene that is biologically most independent from .

The gene databases, like KEGG, are still evolving. Thus, many of the genes cannot be mapped from microarray data to these databases. In Section 4 we describe a probabilistic technique to handle this problem.

3.4. Exploration of Training Space

We have described the key components of our feature selection algorithm (BPFS) in the previous subsections. As a dataset consists of comparatively smaller number of observations and a large feature set, BPFS is prone to overfitting. To avoid this problem, we propose a method that utilizes the training space efficiently.

Let be the training data. We create data subsets ( in our experiments) ,,, each containing 80% of the randomly sampled from it. We, then, run BPFS on each of them and get sets of features. We store these feature sets in a matrix , where the th row contains first features obtained from . Thus, is the th feature obtained from . We use this matrix to rank all the features in the following fashion.

(1)We assign a linearly decreasing weight across a row to emphasize the importance of the features that come first. More specifically, we assign a weight of to a feature that appears in the th column of a row.(2)We sum the weights of the features over all the rows to determine the overall weight of the features in . For example, assume that a feature appears in three rows of , at (5,3), (17,14), and (29,10), where the first number in each pair indicates the row and second number indicates the column. Also, assume that we want to choose a total of 150 features. Then, the total weight of this feature is .

We pick the features with the highest weight from our feature set. Weighing the features based on their positions helps us to prioritize the features that occur frequently and/or appear with high rank. We discuss the impact of the value of in Section 4.

4. Data Set and Experiments

In this section we evaluate BPFS experimentally. We use multiple real microarray datasets instead of synthetically generated data, as synthetic data may not accurately model different aspects of a real microarray data [25]. We observe that we can map only a small portion (25%) of the microarray entries to KEGG regulatory pathway. Some of them do not take part in any single interaction. So the only information we have about them is their measured expression value on the microarray dataset. Due to this missing data problem it is difficult to quantify the implication of biological pathway in our algorithm. To handle this problem we have conducted our experiments on two different kind of information. In one case, we use the KEGG pathways as it is and used a randomized technique to handle the interactions with unresolved genes. In the other case, we map all the microarray genes to KEGG pathway and assume that genes within a single pathway are fully connected and there is no common gene between two pathways. Still, we need to be careful while interpreting the results with fully connected pathway as it is only a simplistic view of the actual pathway. We cover the experiments with real pathways in the paper from Sections 4.24.6. In Section 4.7 we discuss the experiments with fully connected pathways.

In Section 4.1 we describe the experimental setup. In Section 4.2 we describe the randomization technique. We show the biological validity of our feature set, by tabulating the supporting publications against every feature (Section 4.3). We compare our signature against van't Veer's [7] on four data sets (Section 4.4). We compare the testing accuracy of our method to that of I-RELIEF, a leading microarray feature selection method (Section 4.5). We conducted cross-validation experiments where we extracted features from one dataset and tested its accuracy on another dataset in Section 4.6. Finally, we executed BPFS and I-RELIEF on an idealistic fully connected pathway in Section 4.7.

4.1. Experimental Setup

Microarray Data
In our experiments we used five breast cancer microarray datasets from the literature. We name these datasets as BCR [10], JNCI [8], Lancet [9], CCR [12] and Nature [7], respectively, according to the name of the journals they were published. BCR, CCR, and Lancet use Affymetrix GeneChip Human Genome U133 Array Set HG-U133A consisting of 24, 481 entries. Nature has its own microarray platform with 24, 481 entries. JNCI has the same platform as that of Nature, but it consists of a much smaller feature set of 1, 145 entries. We removed the observations whose class labels were not defined. For the rest of the data points we created two classes depending on whether relapse of the disease happened in five years or not, counting from the time of the primary disease. The datasets Nature, JNCI, BCR, CCR, and Lancet contain 97, 291, 159, 190, and 276 observations, respectively.

Pathway Data
We used the gene regulatory and signaling pathways of Homo Sapience in KEGG. We merged all the relevant pathway files to build a consolidated view of the entire pathway. The final pathway consists of 8 270 genes and 7 628 interactions. Clearly, some genes do not take part in any interaction.

Training and Testing Data
We randomly divided a microarray dataset (e.g., BCR dataset) in 4  :  1 ratio to create training and testing subset. We maintained the distribution of two classes in the undivided dataset unchanged in the training and testing subset. We collected features from the training dataset and tested the classification accuracy using those features on the test dataset. Now we elaborate on how we utilized the training space to select features. We created a number of subsets (typically 50 in our experiments) using bootstrapping from training dataset. Each subset contains 80% samples of the training data. We selected features from each of those subspaces using Section 3.1 to Section 3.3. Then we combined the obtained set of features using the method of Section 3.4.

Implementation and System Details
We implemented our feature selection algorithm (BPFS) using Matlab. For SVM, we used “Matlab SVM Toolbox”, a fast SVM implementation in C based on sequential minimal optimization algorithm [26]. For pathway analysis code we used Java. We ran our implementation on a cluster of ten Intel Xeon 2.8 GHz nodes on Ubuntu Linux.

Availability of code
The implementation of the proposed method can be downloaded from http://bioinformatics.cise.ufl.edu/microp.html.

4.2. Pathways with Unresolved Genes

To calculate the influence factor (IF) we need to calculate the number of hops between two genes on the pathway. This requires a mapping of those microarray entries to pathway genes. However, as some of the microarray entries are not complete genes and biological pathway construction is not yet finished [59], we can not map all microarray entries. We denote all the unmapped genes as unresolved genes. For example, Affymetrix microarray HG-U133A contains 24,481 entries. We are able to map only 6 500 entries to KEGG (Kyoto Encyclopedia of Genes and Genomes). For example, in Figure 1 we draw four rectangles with dotted lines that correspond to four microarray entries in Affymetrix platform. As they do not have Entrez Gene identification number, we can not associate them with any pathway. Hence, these are unresolved genes.

Our preliminary experiments suggest that unresolved genes represent a large fraction of the genes in set (more than 60% on average). To estimate the TIF of the unresolved genes, we develop a probabilistic model. Let be the set of candidate genes and be the set of selected genes in an iteration of BPFS. While calculating TIF for a we consider two cases:

Case 1 : (the candidate gene is resolved). Assume that is resolved. Let be the set that contains all the unresolved features in . Let be the set of resolved features in . Let be the expected influence of a gene on gene if all genes were mapped and the pathway construction was complete. We discuss how we estimate the value of later in this section. Then the expected number of genes from having influence on is = , where is the number of genes in . So the Total Influence Factor becomes

Case 2 : (the candidate gene is unresolved). When g is unresolved we consider it as a special case of Case 1. As the connectivity between and all genes in is unknown, we estimate TIF as In summary, to handle the unresolved genes we augment the probabilistic model to biological pathway-based selection and replaced (3) by (4). Among the genes in candidate set we select the gene with smallest TIF using (4). Now, we describe how we derive the value of in (4) and (5).

To derive the value of , we propose the following approach. It is reasonable to assume that there are many missing interactions in the currently available pathway databases, since missing interactions are continually being discovered. Let us denote the present incomplete pathway graph by and the hypothetical complete pathway by . Assume that contains times more interactions than that of . From we estimate as a function of and the expected number of the genes in as follows. We, first, build a graph as described in Section 3.3 from the KEGG database. We, then, randomly delete edges of and create the subgraphs of corresponding to 10%, 20%, , 100% of edges of that of . For each of these sub-graphs we calculate the average number of vertices reachable from a vertex.

Formally, let denote the set of vertices in , where . We denote the number of reachable vertices from () by . Then, the average reachability of is

where is number of vertices in . We calculate , the average reachability for all the subgraphs.

We, then, construct the function that evaluates to . To construct , we use a converging power series. We derive the value of parameters using , , . To calculate the average reachability of the hypothetical pathway we use value of s greater than 100 in . In Figure 2 we plot , , along with the constructed function against the s, the fraction of the current pathway.

We observe that interpolates the values , , accurately and it converges at around with the value of 180. Thus, we can conclude that the average reachability of is around 180. The probability that an unresolved gene has an interaction with a randomly chosen gene in in is given by where is number of nodes in the pathway. As the total number of human genes is close to 20,500 [60], we get 0.0088 as the value of .

4.3. Biological Validation of Selected Features

We collected the list of publications that support the relevance of the first twenty features selected from BCR on cancer. Table 1 lists the publications and cancer types for each of the genes.

To get the features, we created the training dataset as described in Section 4.1. We, then, trained BPFS on the training data and obtained a ranked set of features. We repeated this process ten times on BCR. We selected first twenty features from each rankings and merged them using the method described in Section 3.4 to get the final twenty features. We found relevant publication for all the twenty genes. We observed that four of them are directly responsible for breast cancer. Another seven genes are associated to breast cancer from the point of histology (two prostrate, four gastric and one colorectal). The rest of them are related to other kinds of cancer such as ovarian cancer and lung cancer.

In some cases a gene is involved for more than one kind of cancer. For example, ASCL1 is associated with both prostrate cancer and lung cancer. We concluded that BPFS chooses the set of genes that are responsible for breast cancer and other kind of cancers. Hence, BPFS selects a biologically meaningful feature set that reduces the number of redundant features and improves generalization accuracy by selecting more relevant feature set.

In general, the above approach of combining features obtained from ten different runs may lead to selecting features from the entire dataset. We did it only for this experiment to filter out the biologically significant features from a dataset. For the remaining experiments we kept separate training and testing datasets.

4.4. Comparison with van't Veer's Gene Signature

In this section we compare our gene signatures to the breast cancer prognostic signature found by van't Veer et al. [7]. van't Veer et al. generated the 70 gene signature using a correlation based classifier on 98 primary breast cancer patients. van't Veer's 70 gene experiment [7] demonstrated that genetic signature can have a much higher accuracy in predicting disease relapse free survival against clinical markers (50% versus 10%).

In our experiment, we demonstrate that our method finds a better gene signature than these 70-gene signature for a particular dataset. We created training and testing data from the four datasets JNCI, Lancet, BCR and CCR as described in Section 4.1. From those four training dataset we created four set of features using our method. We calculated the accuracy of the four set of features using the corresponding four test datasets. Also, we computed the test accuracy using this 70-gene signature on the same four test dataset just mentioned. Finally we plotted the testing accuracies obtained from both set of features in Figure 3.

Figure 3 illustrates the results for up to 150 genes for both the signature on BCR, CCR and Lancet. We observe that for all four data sets our accuracy is better than that of 70-gene signature. BPFS attains 18% better accuracy for JNCI dataset. From this we conclude that BPFS finds a better gene signature for all the datasets.

4.5. Comparison with I-RELIEF

We compare the accuracy of our method to that of I-RELIEF [2]. I-RELIEF is a nonlinear feature selection algorithm. It produced significant accuracy over van't Veer's 70 gene prognostic signature and standard clinical markers [61].

We, first, created training and testing dataset from the given data as discussed in Section 4.1. We obtained the ordered feature set by training the BPFS on the training data. We tested the quality of those features using the SVM classifier. We used identical set up and data sets for I-RELIEF. We used 2.0 as the kernel width for I-RELIEF as recommended [2]. We repeated the experiments ten times on each data set and present the average accuracy for different features. Figure 4 plots the standard deviation of the accuracy of our method over this 10 times of running. For all of them except the Nature dataset, the standard deviation is less than 1 (i.e., 10% accuracy). So we can conclude that our method is quite stable while we execute it over several subsets of data created from a single dataset.

Figure 5 compares our algorithm to I-RELIEF. We observe that for two data sets (BCR and JNCI), BPFS outperforms I-RELIEF for all the features selected. BPFS shows highest improvement (8%) over I-RELIEF for JNCI dataset, at around 50 features. JNCI has higher fraction of resolved genes (45% versus 25%). Thus, BPFS has a higher chance to select more resolved genes. This implies less dependability on the probabilistic model, the selection of genes is more accurate. From this observation, we expect that BPFS would produce better result when the missing links of the pathway would be discovered. For Nature dataset, BPFS produces better accuracy than I-RELIEF up to 130 features. BPFS has similar accuracy with that of I-RELIEF for CCR and Lancet data.

Our algorithm is based on linear kernel which is in general more appropriate when the number of features is much higher than number of samples. On the other hand I-RELIEF employs a non-linear kernel [23]. It is possible that the distribution of Lancet data works better with the type of kernel that I-RELIEF uses. We can potentially improve the classification accuracy of BPFS by using a non-linear kernel.

We observe that for all the datasets our method reaches its highest accuracy at around 50–70 features. We conclude that these 50–70 features consist the most important set of genes that are associated with the breast cancer.

4.6. Cross Validation Experiments

We conduct several cross validation experiments where we generate a set of features on one data set and validate its quality by testing it on some other data set. For this cross validation, we limit ourselves to the same microarray platform that we use to generate the feature set. For example, we test BCR dataset's features on Lancet as the microarray platforms on which they were generated are same. The main reason for doing so is that the set of genes used in two different microarrays can be different. Even for the same gene they use different part of the genomic sequence. Thus, inter-platform validation may not be representative of the actual generalization. Table 2 displays the result of our cross validation experiments.

We use two different version of our algorithm, one with the pathway information and another without the pathway information on observe the contribution of inclusion of the pathway information into our algorithm. In Table 2 we denote the version of our method with pathway by putting 1 at superscript of the result. Similarly we denote the version without pathway by putting 2 at the superscript of the datasets.

Also, to establish the relevance of our signature on a standard benchmark, we compare our signatures with van't Veer's 70-gene signature in the context of cross validation. For example, in Table 2 BCR dataset is cross validated with three feature sets obtained from Lancet, CCR and 70-gene signature. We observe that on an average our signatures perform better than the 70-gene one. For CCR data, both Lancet and BCR features generate better accuracy. For Lancet data the accuracies obtained using CCR and BCR features are similar to that of 70-gene signature. For BCR data, up to 80 features of our signatures outperform the 70-gene signature. Beyond that the 70-gene signature has a better accuracy. To sum up, we get better accuracy while testing with the features on a different platform compared to van't Veer's prognostic signature.

Regarding the comparison of the two versions of our algorithm (with and without the pathway information) it's difficult to reach a conclusive decision. For example, while we extract the feature set from CCR dataset and cross validate it BCR dataset the algorithm without pathway information is doing better than the other for upto 20 features, but for the larger number of features the algorithm with pathway information provides a better accuracy.

When we compare the accuracy with features from a different dataset to that of its own feature set, we observe that on the average, the drop of the accuracies are not more than 6%. For some extreme cases the drop can be higher. Specifically when we cross validate with features extracted from Lancet, the accuracy is lower.

4.7. Experiments on Fully Connected Pathways

In this section we describe the experiments with idealistic fully connected pathway. Here, the approach we took was to evaluate our experiments on an idealistic pathway, where we map all the genes including the unresolved genes into the KEGG pathway. The unmapped genes become singleton pathway with only a single member gene. For other pathways that are already listed in the KEGG database we assume that all the genes within a pathway are fully connected and there is no connection between two pathway. If we consider each pathway as the smallest indivisible module of interactions, then all the genes within a pathway coexpress in a similar fashion. We compare the accuracy of BPFS with this pathway with that of I-RELIEF. In Figure 6 we see that for BCR, JNCI and CCR environment, BPFS has a better accuracy. For BCR there is an improvement of 5% around when BPFS uses around 40 features. For JNCI, the improvement is over 10% with around 70 features. For CCR there is a 3% improvement for 30 features. For Nature dataset, I-RELIEF has slightly (around 1.5%) better accuracy up to 100 features. For Lancet dataset I-RELIEF performs better than BPFS. So, we observe that for three dataset BPFS has better accuracy, for one dataset the accuracy is almost comparable.

4.8. Contribution of Pathway Information

In this section, we describe the experiments that we conduct to understand the contribution of pathway information in our algorithm in terms of classification accuracy. In one version we use the complete version of the algorithm as it is, in the other version we select the features only based on the marginal classification power and skip the next step.

However, we observe that the contribution of biological network is not very decisive. For some dataset and some set of features the complete version of the method generates better accuracy, sometimes the trimmed version produces more accurate result. For instance, in Figure 7, for BCR dataset the complete method has upto 6% higher accuracy, where for Nature dataset the method with pathway generates better result for 50–100 number of features. The reason behind this fluctuation of accuracy might be that reconstruction of gene regulatory and signaling network is still in progress. Among almost 22 000 Affymetrix transcripts, we could map only 3300 genes that take part into at least one KEGG pathway. Even for those genes, the pathway construction is not complete. We hope that our algorithm can generate better accuracy in the near future when we can have a more comprehensive pathway structure.

5. Conclusions

In this paper we considered feature selection problem for a classifier on cancer microarray data. Instead of using the expression level of a gene as the sole feature selection criteria we also considered its relation with other genes on the biological pathway. Our objectives were to develop an algorithm for finding a set of biologically relevant features and to reduce the number of redundant genes. The key contributions of the paper are the following.

(i)We proposed a new feature selection method that leverages biological pathway information along with classification capabilities to reduce the redundancy in gene selection based on biological pathway. (ii)We proposed a probabilistic solution to handle the problem of unresolved genes that are currently not mappable from microarray to biological pathway. (iii)We presented a new framework of utilizing the training subspace that improve the quality of feature set.

Our algorithm improve quality of features by a biological way by excluding the features that have total influence factor, and includes genes that are apart in the biological network and still have high marginal classification power. Thus, we believe that instead of selecting a close set of genes as features our method identify biologically important key features for a significant number of pathways. We demonstrated the biological significance of our feature set by tabulating the relevant publications. We also established the quality of our feature set by cross validating them on other data sets and comparing them against van't Veer's 70-gene prognostic signature. Our experiments showed that it is better than best published available method I-RELIEF.