Abstract

Selection of reliable cancer biomarkers is crucial for gene expression profile-based precise diagnosis of cancer type and successful treatment. However, current studies are confronted with overfitting and the curse of dimensionality in tumor classification and with false positives in the identification of cancer biomarkers. Here, we developed a novel gene-ranking method based on neighborhood rough set reduction for molecular cancer classification based on gene expression profiles. Compared with other methods such as PAM, ClaNC, the Kruskal-Wallis rank sum test, and Relief-F, our method achieves higher tumor classification accuracy with only a few top-ranked genes. Moreover, although the selected genes are not typical known oncogenes, a search of the scientific literature and an analysis of their protein interaction partners indicate that they play a crucial role in the occurrence of tumors, so they may be used as candidate cancer biomarkers.

1. Introduction

DNA microarray technology, a powerful tool in functional genome studies, has been widely used for extracting disease-relevant genes and for the diagnosis and classification of human tumors [1-3]. Generally, genes are ranked according to their differential expression in a combined analysis of normal and tumor samples, and genes above a predefined threshold are considered candidate genes for the cancer being studied [4]. However, this method may produce a vast number of false positives. In addition to the false-positive problem, the imbalance between the number of samples and the number of genes may degrade the classification accuracy and can lead to overfitting, the curse of dimensionality, or even complete failure in the analysis of microarray data [2]. An efficient way to solve these problems is gene selection. In fact, a good gene-selection method that can identify key tumor-related genes is of vital importance for tumor classification and for the identification of diagnostic and prognostic signatures that predict therapeutic responses [5, 6].

Identifying minimum gene subsets means discarding as much noise and redundancy in the dataset as possible, which not only improves classification accuracy but also decreases the cost of tumor diagnosis in clinical application. This is still a key challenge in gene expression profile- (GEP-) based tumor classification. Rough set theory has been successfully used in feature selection [7, 8]. However, it is difficult for rough set theory to deal directly and effectively with the real-valued attributes of microarray datasets [9]. Dataset discretization is usually adopted to tackle the problem, but this pretreatment may lose some useful information. To combat this problem, Hu et al. [10] first presented the basic concepts of the neighborhood rough set (NRS) model and designed a novel feature selection method called forward attribute reduction based on neighborhood model (FARNeM) to select a minimal reduct, which avoids data discretization and hence decreases the information lost in pretreatment. However, a reduct that satisfies the criteria of high classification performance and few genes is not unique, and which one is found depends partly on chance. It is therefore not appropriate to train a classifier on a single gene subset (a single reduct); to avoid the selection bias problem, numerous minimal gene subsets with the highest or near-highest dependence on the training set should be selected. Breadth-first search (BFS) [11], a basic graph search algorithm that begins at the root node and explores all the neighboring nodes, was adopted to select any number of optimal and minimum gene subsets. However, for $n$ nodes (genes), there are $2^n$ possible gene subsets in total, so searching all of them is not practical: the computational complexity is too high. To circumvent these problems, we propose a breadth-first heuristic search algorithm based on neighborhood rough set (HBFSNRS) to select numerous gene subsets. The dependence function of NRS was selected as the heuristic information.

To prioritize the numerous selected genes, a parameter, the occurrence probability of a gene in the selected gene subsets, was introduced. Previous studies showed that significant class predictor genes, whose expression profile vectors discriminate well among the different sample classes of a specific cancer, may play a crucial role in the development of cancer [4]. We hypothesized that the occurrence probability of a gene in the final selected gene subsets reflects, to some extent, its power in tumor classification and its significance. To test this hypothesis, several publicly available microarray datasets were used. The HBFSNRS method was also compared with four related methods, PAM, ClaNC, the Kruskal-Wallis rank sum test (KWRST), and Relief-F, to demonstrate its performance, efficiency, and effectiveness in gene selection, prioritization, and cancer classification.

2. Materials and Methods

2.1. The Framework of Our Analysis Method

Our proposed method differs from the two traditional gene selection strategies, filters and wrappers. Filter methods, such as relative entropy, information gain, KWRST, and the t-test, select genes mostly by a between-class separability criterion [12] and do not use feedback from predictor performance during gene selection. Wrapper methods, such as GA/SVM [13] and GA/KNN [14], use the performance of a predictor as the criterion for gene subset selection. Our method combines the filter and wrapper approaches. The HBFSNRS-based cancer classification framework is illustrated in Figure 1. The four major steps of the method are described as follows.

2.2. Gene Pre-Selection Based on KWRST

All of the microarray datasets, both training and test sets, were normalized per gene by subtracting the minimum expression measurement of that gene and dividing by the difference between its maximum and minimum values, so that the expression levels of each gene were scaled to $[0, 1]$.
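The per-gene scaling described above is ordinary min-max normalization. The following is a minimal sketch in plain Python, assuming a samples-by-genes list of lists; the function name `min_max_normalize` and the handling of constant genes (mapped to 0.0) are illustrative assumptions, not part of the original method description.

```python
def min_max_normalize(matrix):
    """Scale each gene (column) of a samples x genes matrix to [0, 1]."""
    n_genes = len(matrix[0])
    lows = [min(row[j] for row in matrix) for j in range(n_genes)]
    highs = [max(row[j] for row in matrix) for j in range(n_genes)]
    scaled = []
    for row in matrix:
        scaled_row = []
        for j, value in enumerate(row):
            span = highs[j] - lows[j]
            # Assumption: a gene with constant expression is mapped to 0.0.
            scaled_row.append((value - lows[j]) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Toy data: 3 samples (rows) x 2 genes (columns).
expression = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
normalized = min_max_normalize(expression)
```

Note that the text normalizes training and test data together; in a sketch like this, the minima and maxima would be computed over whichever rows are passed in.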

Gene preselection can improve classification performance because it reduces noise, and it is a common procedure in most classification applications [15]. We applied gene preselection to the training dataset. All of the genes on the arrays of the training data were sorted according to KWRST, which is suitable for multiclass problems. In this study, the $p$ top-ranking genes (the initial informative gene set $G$) were used to find minimum gene subsets for constructing the ensemble tumor classifier with HBFSNRS. Generally speaking, more than 1% of the genes in the human genome are involved in oncogenesis [16], so we set the number of selected top-ranked genes to $p = 300$.
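As an illustration of this preselection step, the sketch below ranks genes by the Kruskal-Wallis H statistic and keeps the top $p$. It is a simplified stand-in under stated assumptions: it omits the tie correction, and it ranks by H rather than by p-value (for a fixed number of classes a larger H corresponds to a smaller p-value under the chi-square approximation). The function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_h(values, labels):
    """Kruskal-Wallis H statistic of one gene (no tie correction)."""
    n = len(values)
    rank_sums, counts = {}, {}
    for rank, label in zip(average_ranks(values), labels):
        rank_sums[label] = rank_sums.get(label, 0.0) + rank
        counts[label] = counts.get(label, 0) + 1
    total = sum(rank_sums[c] ** 2 / counts[c] for c in rank_sums)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def preselect_genes(matrix, labels, p):
    """Indices of the p genes (columns) with the largest H statistic."""
    n_genes = len(matrix[0])
    scores = [kruskal_wallis_h([row[j] for row in matrix], labels)
              for j in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda j: scores[j], reverse=True)
    return ranked[:p]


# Toy data: gene 1 separates the two classes, gene 0 does not.
toy = [[5, 1], [1, 2], [4, 3], [2, 4], [6, 5], [3, 6]]
toy_labels = [0, 0, 0, 1, 1, 1]
top_genes = preselect_genes(toy, toy_labels, p=1)
```

In practice a statistics library would normally be used for the test itself; the hand-written version is only meant to make the ranking step explicit.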

2.3. Neighborhood Rough Set Reduction

The basic concepts of the neighborhood rough set (NRS) were introduced by Hu et al. [10]. In our proposed algorithm, the dependence function of NRS is used to evaluate the goodness of selected gene subsets. Here, we present only the basic NRS notation used in this paper.

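Assume there are $c$ subclasses of cancer, and let $D = \{d_1, d_2, \ldots, d_m\}$ denote the class labels of $m$ samples, where $d_i = k$ indicates that sample $i$ belongs to cancer subclass $k$, with $k \in \{1, 2, \ldots, c\}$. Let $S = \{s_1, s_2, \ldots, s_m\}$ be a set of samples and $G = \{g_1, g_2, \ldots, g_n\}$ be a set of genes. The corresponding gene expression matrix can be represented as $X = (x_{ij})_{m \times n}$, where $x_{ij}$ is the expression level of gene $g_j$ in sample $s_i$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, and usually $n \gg m$.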

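Given an information system for classification learning

$$IS = \langle S, G \cup D, V, f \rangle,$$

where $S$ is a nonempty sample set called the sample space, $G$ is a nonempty set of genes, also called condition attributes, that characterize the samples, $D$ is the output variable, called the decision attribute (the class labels of the tumor samples), $V_a$ is the value domain of attribute $a \in G \cup D$, and $f$ is an information function $f: S \times (G \cup D) \to V$ with $V = \bigcup_a V_a$, a reduction is a minimal set of attributes $B \subseteq G$ that preserves the classification ability of $G$.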

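Given any $s_i \in S$ and $B \subseteq G$, the neighborhood $\delta_B(s_i)$ of $s_i$ in the subspace $B$ is defined as

$$\delta_B(s_i) = \{ s_j \mid s_j \in S,\ \Delta_B(s_i, s_j) \le \delta \},$$

where $\delta$ is the threshold and $\Delta_B$ is the metric function in subspace $B$. Three metric functions are widely used. Let $s_1$ and $s_2$ be two samples in the $n$-dimensional space $G = \{g_1, g_2, \ldots, g_n\}$, and let $f(s, g_k)$ denote the value of gene $g_k$ in sample $s$. Then the Minkowski distance is defined as

$$\Delta_P(s_1, s_2) = \left( \sum_{k=1}^{n} \left| f(s_1, g_k) - f(s_2, g_k) \right|^{P} \right)^{1/P},$$

where (1) if $P = 1$, it is called the Manhattan distance $\Delta_1$; (2) if $P = 2$, it is called the Euclidean distance $\Delta_2$; (3) if $P = \infty$, it is called the Chebyshev distance $\Delta_\infty$. Here, we use the Manhattan distance.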
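The neighborhood definition can be made concrete with a short sketch. It assumes samples are lists of expression values indexed by gene position; `minkowski` and `neighborhood` are hypothetical helper names, and the `order` argument plays the role of the Minkowski exponent (1 for Manhattan, 2 for Euclidean, infinity for Chebyshev).

```python
def minkowski(x, y, genes, order=1):
    """Minkowski distance between samples x and y restricted to `genes`."""
    diffs = [abs(x[g] - y[g]) for g in genes]
    if order == float("inf"):
        return max(diffs)  # Chebyshev distance
    return sum(d ** order for d in diffs) ** (1.0 / order)


def neighborhood(i, samples, genes, delta, order=1):
    """Indices of all samples within distance delta of sample i (i included)."""
    return [j for j, other in enumerate(samples)
            if minkowski(samples[i], other, genes, order) <= delta]


# Toy data: samples 0 and 1 are close, sample 2 is far away.
toy_samples = [[0.0, 0.0], [0.1, 0.1], [0.9, 0.8]]
close_to_first = neighborhood(0, toy_samples, genes=[0, 1], delta=0.25)
```

Because expression values are scaled to $[0, 1]$ per gene, the threshold and the number of genes in the subspace jointly determine how large a neighborhood is under the Manhattan distance.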

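Given a neighborhood decision table $NDT = \langle S, G \cup D, V, f \rangle$, let $D_1, D_2, \ldots, D_c$ be the sample subsets with decisions 1 to $c$, and let $\delta_B(s_i)$ be the neighborhood information granule that includes $s_i$ and is generated by gene subset $B \subseteq G$. The lower and upper approximations of the decision $D$ with respect to gene subset $B$ are then, respectively, defined as

$$\underline{N_B} D = \bigcup_{k=1}^{c} \underline{N_B} D_k, \qquad \overline{N_B} D = \bigcup_{k=1}^{c} \overline{N_B} D_k,$$

where

$$\underline{N_B} D_k = \{ s_i \mid \delta_B(s_i) \subseteq D_k,\ s_i \in S \}, \qquad \overline{N_B} D_k = \{ s_i \mid \delta_B(s_i) \cap D_k \neq \emptyset,\ s_i \in S \}.$$

Here $\underline{N_B} D_k$ is the lower approximation of the sample subset $D_k$ with respect to gene subset $B$, and $\underline{N_B} D$ is also called the positive region, denoted by $POS_B(D)$: the set of samples that can be classified into one of the classes without uncertainty using gene subset $B$. $\overline{N_B} D$ denotes the upper approximation; obviously $\overline{N_B} D = S$. The decision boundary region of $D$ with respect to $B$ is defined as

$$BN_B(D) = \overline{N_B} D - \underline{N_B} D.$$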

The neighborhood model divides the samples into two groups: the positive region and the boundary region. The decision boundary is the set of samples whose neighborhoods contain samples from more than one class; from the neighborhood information alone, we cannot be completely sure which class these samples belong to. Different gene subset subspaces yield different boundary regions and positive regions. The size of the boundary region reflects the discriminability of the classification problem in the corresponding subspace, and thus the recognition or characterizing power of the condition attributes: the greater the positive region, the smaller the boundary region and the stronger the characterizing power of the condition attributes. We therefore use the dependency degree of $D$ on $B$ to characterize the power of a selected gene subset. It is defined as the ratio of consistent objects,

$$\gamma_B(D) = \frac{|POS_B(D)|}{|S|},$$

where $|POS_B(D)|$ and $|S|$ denote the cardinalities of the sample sets $POS_B(D)$ and $S$, respectively. If $\gamma_B(D) = 1$, we say that $D$ depends totally on $B$, and if $0 < \gamma_B(D) < 1$, we say that $D$ depends partially on $B$. Here we define a target dependency value, denoted $r\_Max$ in Algorithm 1, and our goal is to find gene subsets whose dependency equals this set value.
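To make the positive region and the dependency degree concrete, here is a small sketch using the Manhattan distance. A sample is counted as consistent when every sample in its $\delta$-neighborhood carries the same class label; the helper names and the toy data are illustrative assumptions.

```python
def manhattan(x, y, genes):
    """Manhattan distance between two samples restricted to `genes`."""
    return sum(abs(x[g] - y[g]) for g in genes)


def positive_region(samples, labels, genes, delta):
    """Indices of samples whose delta-neighborhood contains one class only."""
    positive = []
    for i, sample in enumerate(samples):
        neighbour_labels = {labels[j] for j, other in enumerate(samples)
                            if manhattan(sample, other, genes) <= delta}
        if neighbour_labels == {labels[i]}:
            positive.append(i)
    return positive


def dependency(samples, labels, genes, delta):
    """Dependency degree: size of the positive region over the sample count."""
    return len(positive_region(samples, labels, genes, delta)) / len(samples)


# Toy data: gene 0 separates the classes, gene 1 is constant.
toy_samples = [[0.0, 0.5], [0.1, 0.5], [0.5, 0.5], [0.9, 0.5]]
toy_labels = [0, 0, 1, 1]
gamma_informative = dependency(toy_samples, toy_labels, genes=[0], delta=0.15)
gamma_constant = dependency(toy_samples, toy_labels, genes=[1], delta=0.15)
gamma_wide = dependency(toy_samples, toy_labels, genes=[0], delta=0.45)
```

With the small threshold the informative gene gives total dependency and the constant gene gives none. Widening the threshold to 0.45 pulls samples 1 and 2 into each other's neighborhoods, so they fall into the boundary region and the dependency drops.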

2.4. Gene Reduction Based on HBFSNRS

Informative gene selection involves two tasks: evaluating the quality of candidate gene subsets and searching for good gene subsets quickly. The dependence function of NRS is used to measure the goodness of a selected gene subset, and the computational cost of the search is addressed as follows.

Initially, let $RED = \{\{g_1\}, \{g_2\}, \ldots, \{g_p\}\}$ be a set of gene subsets, each containing a single informative gene. Then each $red_j \in RED$ is expanded to $p - 1$ subsets by adding a different gene to $red_j$, which gives $p(p - 1)$ subsets in total. Among these subsets, the $\omega$ top-ranked gene subsets according to the dependence function are selected to reconstruct the set $RED$ and to be expanded in the next iteration; each element of $RED$ now has 2 genes. Similarly, in the next search layer, each $red_j \in RED$ is extended to $p - 2$ subsets by adding genes not already listed in $red_j$, which gives $\omega(p - 2)$ subsets. Among these, the $\omega$ top-ranked gene subsets are again selected for expansion in the next layer, and each element of $RED$ now has 3 genes. The search continues in this way until the stop criteria are met. In each layer, every $red_j$ is expanded to $p - |red_j|$ subsets, and only the $\omega$ top-ranked gene subsets out of the total $\omega(p - |red_j|)$ are selected to reconstruct $RED$, so the search time does not increase exponentially with search depth. Here, $|red_j|$ denotes the number of genes in the gene subset. Following the minimum-construction idea, one technique for feature selection is to choose minimal gene subsets that fully describe the tumor classes in a given data set. Therefore, the search ends at the current level when the maximal dependence of the elements of $RED$ reaches the set value $r\_Max$, when the increment between the maximal dependence of two adjacent search levels is less than $\theta$, or when the number of iterative steps equals the set value $Depth$. Otherwise, the search continues until a stopping criterion is met. The pseudocode of HBFSNRS is shown in Algorithm 1.

Algorithm 1: HBFSNRS.
Input: S, G, D, δ, θ, p, ω, r_Max, and Depth
    // δ is the threshold that controls the size of the neighborhood, θ is the threshold of increment,
    // p is the number of preselected genes, ω is the search breadth, r_Max is a given maximal
    // dependency function value, and Depth is the upper bound of the search depth.
Output: RED, the pool that contains the selected gene subsets red.
Step 1: For each g_i ∈ G                        // Compute the p-value by KWRST
            P_i = KWRST(g_i);
        End
Step 2: gg = sort(P, "ascend");                  // Rank genes by P in ascending order
Step 3: G = Select(G, gg, p);                    // Select the p top-ranked genes as the initial gene set G by P
Step 4: For each g_i ∈ G                         // Let RED = {{g_1}, {g_2}, ..., {g_p}} be a set of gene subsets
            red_i ← {g_i};                       // where each gene subset has only one informative gene
            RED ← RED ∪ {red_i};
        End
Step 5: iter = 1;                                // Iteration counter
Step 6: For each red_j ∈ RED
            For each g_k ∈ G − red_j
                red_j ∪ {g_k} → RED;             // Add a gene not listed in red_j to red_j and save the result as an element of RED
                γ_{red_j ∪ g_k}(D) = Card(Pos_{red_j ∪ g_k}(D)) / Card(S);   // Dependence degree of D on red_j ∪ {g_k}
            End
        End
Step 7: rr = sort(γ, "descend");                 // Rank gene subsets by γ in descending order
        RED = Select(RED, rr, ω);                // Select the ω top-ranked gene subsets to reconstruct RED
Step 8: If (max_iter(γ) >= r_Max) or (abs(max_iter(γ) − max_{iter−1}(γ)) < θ) or (iter = Depth)
            Break;                               // Here, we define max_0(γ) = 0
        Else
            iter = iter + 1;
            Go to Step 6;
        End
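Steps 4 to 8 of Algorithm 1 can be sketched in runnable form as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: it assumes the genes passed in are already the preselected ones (Steps 1 to 3), uses the Manhattan distance, merges duplicate subsets such as {a, b} and {b, a}, and breaks ties between equally scored subsets by gene index. The function and parameter names are hypothetical.

```python
def dependency(samples, labels, genes, delta):
    """Share of samples whose Manhattan delta-neighborhood holds one class only."""
    consistent = 0
    for i, sample in enumerate(samples):
        neighbours = [
            j for j, other in enumerate(samples)
            if sum(abs(sample[g] - other[g]) for g in genes) <= delta
        ]
        if all(labels[j] == labels[i] for j in neighbours):
            consistent += 1
    return consistent / len(samples)


def hbfsnrs(samples, labels, candidate_genes, delta, theta, omega, r_max, depth):
    """Layer-wise search that keeps the omega best gene subsets per layer."""
    red = [frozenset([g]) for g in candidate_genes]  # one-gene subsets (Step 4)
    previous_best = 0.0  # max_0(gamma) = 0, as in Algorithm 1
    for _ in range(depth):
        # Step 6: expand every kept subset by each gene it does not contain.
        # Assumption: duplicates such as {a, b} and {b, a} are merged.
        expanded = {subset | {g} for subset in red
                    for g in candidate_genes if g not in subset}
        if not expanded:
            break
        gamma = {s: dependency(samples, labels, sorted(s), delta) for s in expanded}
        # Step 7: keep the omega subsets with the highest dependency.
        ranked = sorted(expanded, key=lambda s: (-gamma[s], sorted(s)))
        red = ranked[:omega]
        # Step 8: stop on target dependency or on a too-small improvement.
        best = gamma[red[0]]
        if best >= r_max or abs(best - previous_best) < theta:
            break
        previous_best = best
    return [sorted(s) for s in red]


# Toy data: 4 samples x 3 genes; gene 1 separates the classes, gene 2 is constant.
toy_samples = [[0.1, 0.0, 0.5], [0.9, 0.1, 0.5], [0.2, 0.9, 0.5], [0.8, 1.0, 0.5]]
toy_labels = [0, 0, 1, 1]
selected = hbfsnrs(toy_samples, toy_labels, [0, 1, 2], delta=0.3, theta=0.01,
                   omega=2, r_max=1.0, depth=3)
```

Keeping only a fixed number of subsets per layer makes this a beam-style search: each layer evaluates at most $\omega$ times the number of remaining genes, instead of all possible subsets.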

The dependence function of NRS is chosen as the objective function for evaluating the goodness of a selected gene subset mainly because it is fast to compute and does not use feedback information from the test data during training. To optimize the parameter $\delta$ of NRS, which controls the size of the neighborhood, values of $\delta$ from 0 to 1 in steps of 0.01 were tested by running forward attribute reduction based on neighborhood model (FARNeM). The $\delta$ values were sorted according to the classification accuracy of a 3-KNN classifier using the corresponding gene subset selected by FARNeM, and the 5 top-ranked $\delta$ values were used in the next step. For ALL (a multiclass dataset), however, the minimal and optimal reduct selected for some of the top five $\delta$ values contained 20 or more genes. Because a large gene subset may contain much noise and redundancy, which may bias and negatively influence tumor classification and gene prioritization, we discarded such $\delta$ values and reselected the five top-ranked $\delta$ values that produced reducts with fewer than 20 genes.

2.5. Evaluation Criterion for the Selected Gene Subsets

We adopted a 3-KNN classifier to evaluate the classification performance of the selected gene subsets. To improve prediction accuracy and stability, an ensemble classifier was constructed from the selected gene subsets. For each $\delta$ value, a simple majority voting strategy was applied to integrate the individual classifiers, each built from one of the gene subsets that HBFSNRS selected using the training set only. Then, a second ensemble classifier was built in a similar way from the classification results obtained with each $\delta$ value.
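The first-level ensemble can be sketched as one 3-nearest-neighbor classifier per gene subset, combined by simple majority voting. The text does not state which distance the KNN classifier uses, so the Manhattan distance below is an assumption, as are the function names and the toy data; ties in a vote are resolved in favor of the label seen first.

```python
from collections import Counter


def knn_predict(train_x, train_y, sample, genes, k=3):
    """Predict a label with k-NN, using Manhattan distance on the given genes."""
    by_distance = sorted(
        range(len(train_x)),
        key=lambda i: sum(abs(train_x[i][g] - sample[g]) for g in genes),
    )
    votes = Counter(train_y[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


def ensemble_predict(train_x, train_y, sample, gene_subsets, k=3):
    """Majority vote over one k-NN classifier per selected gene subset."""
    votes = Counter(knn_predict(train_x, train_y, sample, genes, k)
                    for genes in gene_subsets)
    return votes.most_common(1)[0][0]


# Toy training data: 6 samples x 2 genes.
train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
train_y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
gene_subsets = [[0], [1], [0, 1]]  # hypothetical subsets selected for one delta
prediction = ensemble_predict(train_x, train_y, [0.85, 0.9], gene_subsets)
```

The second-level ensemble described in the text would apply the same voting once more, this time over the per-$\delta$ predictions.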

Here, we hypothesized that genes with higher occurrence frequency are more likely to be important, cancer-related genes. Therefore, we count the occurrence frequency of each gene in all the selected gene subsets to measure its significance. For a specific cancer, however, different $\delta$ values may select minimum gene subsets of different sizes, so counting the occurrence frequency alone is not appropriate for measuring the significance of genes. To avoid this selection bias, the significance of a gene $g$ is measured by its occurrence probability, which is defined as

$$P_{occ}(g) = \frac{1}{N} \sum_{i=1}^{N} \frac{f_i(g)}{n_i\,\omega}, \qquad (6)$$

where $f_i(g)$ is the occurrence frequency of gene $g$ in all the gene subsets selected by HBFSNRS with $\delta_i$; $N$ is the total number of neighborhood values $\delta$ (we set $N = 5$); $n_i$ is the number of genes in a selected gene subset with $\delta_i$; and $\omega$ is the number of final gene subsets selected by HBFSNRS.
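A small sketch of this scoring step is given below. It assumes that the score averages, over the $\delta$ values, the occurrence frequency of a gene divided by the number of gene slots (genes per subset times number of subsets) for that $\delta$; this normalization is an interpretation of the variables listed in the text, and the function name and gene labels are hypothetical.

```python
from collections import Counter


def occurrence_probability(subsets_per_delta):
    """Score genes by normalized occurrence, averaged over delta values.

    subsets_per_delta holds one list of selected gene subsets per delta value.
    """
    n_deltas = len(subsets_per_delta)  # N, the number of delta values
    score = Counter()
    for subsets in subsets_per_delta:  # subsets selected with one delta_i
        omega = len(subsets)           # number of selected gene subsets
        n_i = len(subsets[0])          # genes per subset (assumed equal for one delta)
        frequency = Counter(g for subset in subsets for g in subset)  # f_i(g)
        for gene, f in frequency.items():
            score[gene] += f / (n_i * omega) / n_deltas
    return dict(score)


# Toy input: two delta values, two subsets each, with subset sizes 2 and 3.
toy_selection = [
    [["gA", "gB"], ["gA", "gC"]],
    [["gA", "gB", "gC"], ["gA", "gB", "gD"]],
]
scores = occurrence_probability(toy_selection)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Dividing by the subset size keeps a $\delta$ value that yields large subsets from dominating the ranking, which is the selection bias the text aims to avoid.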

To further investigate the significance of the selected genes, two main methods were used: (1) the selected genes were used as the predictor set of a classification model; (2) literature search and protein-protein interaction (PPI) network analysis.

2.6. Dataset

To evaluate the performance of the proposed method, six gene expression datasets were used in this study: Acute Lymphoblastic Leukemia (ALL) [17], Breast cancer 30 (GSE5764) [18], Breast cancer 22 (GSE8977) [18], Colon cancer [19], Prostate cancer 102 [20], and Prostate cancer 34 [21]. The two pairs of cross-platform datasets were used to evaluate the generalization performance of our cross-platform classification model. The Breast cancer, Colon cancer, and Prostate cancer datasets are two-class classification problems that contain normal and tumor samples. The ALL dataset is a multiclass classification problem that contains six subtypes of ALL: BCR-ABL, E2A-PBX1, Hyperdiploid>50, MLL, T-ALL, and TEL-AML1. The Breast cancer datasets contain a very large number (54675) of Affymetrix probe identifiers, so the raw data were processed as follows: each Affymetrix probe identifier was converted to an Entrez identifier, and when multiple probes corresponded to the same Entrez ID, their probe intensities were averaged. The division into training and test sets is shown in Table 1.
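The probe-to-gene collapsing step can be sketched as follows. The probe names, Entrez IDs, and the function name are hypothetical, and dropping probes without a mapping is an assumption that the text does not state.

```python
from collections import defaultdict


def average_probes(probe_values, probe_to_entrez):
    """Average the intensity vectors of probes that map to the same Entrez ID."""
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        entrez = probe_to_entrez.get(probe)
        if entrez is not None:  # assumption: unmapped probes are dropped
            grouped[entrez].append(values)
    # Average sample by sample across the probes of each gene.
    return {entrez: [sum(column) / len(column) for column in zip(*rows)]
            for entrez, rows in grouped.items()}


# Toy data: intensities of 4 probes across 2 samples (all identifiers made up).
probe_values = {
    "probe_1": [1.0, 3.0],
    "probe_2": [3.0, 5.0],
    "probe_3": [10.0, 20.0],
    "probe_4": [7.0, 7.0],
}
probe_to_entrez = {"probe_1": "100", "probe_2": "100", "probe_3": "200"}
gene_values = average_probes(probe_values, probe_to_entrez)
```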

3. Results

3.1. Redundant and Irrelevant Genes Potentially Degrade the Classification Accuracy

To avoid overfitting and improve classification accuracy and stability, an ensemble classifier was constructed on the basis of the selected gene subsets. We observed that the final integrated results (Table 2) were not satisfactory: the ensemble did not achieve higher classification accuracy than some of the individual classifiers. The main reason may be that our method used all the selected gene subsets as the classification model, and these contain many redundant and tumor-unrelated genes that may degrade the classification performance. Figure 2 shows the classification accuracy obtained with different numbers of top-ranked genes, sorted according to the significance of genes defined in (6). Only a few top-ranked genes were enough to obtain higher classification accuracy, and when more genes were used as the predictor set, the classification performance increased only slightly or even decreased. Therefore, we inferred that selecting too many genes introduces redundancy and irrelevancy, which degrades the classification accuracy.

3.2. Comparison with Other Related Methods

To demonstrate the effectiveness of HBFSNRS, we compared the accuracy of our approach with that of common filter methods including the t-test, information gain, KWRST, and Relief-F. The experimental results indicate that our method is clearly superior to the t-test and information gain and slightly outperforms KWRST and Relief-F in tumor classification. For simplicity, we present only the KWRST and Relief-F results here (Figure 2). With our proposed search algorithm, only a few top-ranked genes were needed to achieve higher accuracy in classifying tumor samples of different classes. For the ALL dataset, the prediction accuracy of HBFSNRS is superior to that of the other methods even though far fewer genes are used in cancer classification. For the breast-cancer dataset, a single gene gave a test accuracy of 22.73% with Relief-F and 63.64% with KWRST, whereas 100% test accuracy was obtained with one gene by the proposed HBFSNRS method. For the colon-cancer dataset, using one and six genes gave prediction accuracies of 80% and 85% with our method, 65% and 70% with Relief-F, and 65% and 75% with KWRST, respectively. For the prostate-cancer dataset, KWRST significantly outperformed our method and Relief-F when more than ten genes were used for tumor classification, but our method performed as well as KWRST when only a few top-ranked genes were used (both our method and KWRST reached 97.06% accuracy using one gene). In addition, we compared our method with the statistical methods PAM and ClaNC. PAM, a statistical technique for class prediction from gene expression data that uses nearest shrunken centroids, was used to identify class predictor genes [22]. ClaNC ranks genes by standard t-statistics, does not shrink centroids, and uses a class-specific gene selection procedure [23]. In our experiments, ClaNC slightly outperformed PAM, so we present only the comparison with ClaNC here (Table 3). In comparison with ClaNC, our method obtained higher classification accuracy when using a few top-ranked genes. The one-gene model produced by our method provides classification accuracies of 100%, 80%, and 97.06% for the breast-cancer, colon-cancer, and prostate-cancer datasets, respectively, whereas ClaNC requires more genes to reach the same accuracy. For the ALL dataset, the test accuracies of our method on the independent test dataset are 87% with six genes, 94% with 12 genes, and 97% with 18 genes. With the same numbers of genes (six, 12, and 18), ClaNC reached accuracies of 86%, 95%, and 97%, respectively, which indicates that the two methods are comparable on the ALL dataset. Overall, HBFSNRS attains the highest accuracy with the smallest number of genes. In addition, the results show that our method is clearly better than ClaNC on the colon-cancer and breast-cancer cross-platform datasets; it is likely that ClaNC is not suitable for cross-platform datasets. We propose that these few genes, whose expression profile vectors show remarkable discrimination capability, may be closely correlated with cancer and could be seen as possible disease signatures.

3.3. Analysis of the Top-Ranked Genes (Case Studies)

Mining genes that give rise to oncogenesis is one of the key challenges in cancer research. The experimental results suggest that the selected genes with high classification accuracy are functionally related to carcinogenesis or tumor histogenesis, so we infer that the few top-ranked genes may be very important for tumor diagnosis. The 10 top-ranked genes according to the score in (6) for each tumor, which are regarded as candidate cancer genes, are listed in Table 4. To demonstrate the ability of our method to uncover known cancer genes and to predict novel cancer biomarkers, the breast-cancer dataset was analyzed following the method of [24].

First, we checked whether our method can uncover well-known cancer genes. We downloaded a list of 25 breast cancer biomarkers that have been annotated in the OMIM database [25]. Unfortunately, the dataset we used (the 300 top-ranked genes selected by KWRST) does not include the 25 known breast cancer genes, so our method cannot be evaluated against this list in terms of uncovering known cancer genes. From another point of view, this indicates that a higher differential expression of a gene does not necessarily reflect a greater likelihood of the gene being related to cancer. In other words, important genes are not necessarily differentially expressed. Nevertheless, genes with higher differential expression remain important for cancer diagnosis and for understanding cancer development.

Next, a literature search was used to check whether our method can predict novel cancer biomarkers. We found that the top 10 genes ranked by (6) for breast cancer play an important role in the occurrence of breast cancer. Collagen triple helix repeat containing 1 (CTHRC1), ranked first, is aberrantly expressed in many human solid cancers including breast cancer, and this aberrant expression seems to be associated with cancer tissue invasion and metastasis [26]. PDZ and LIM domain protein 4 (PDLIM4), ranked second, was frequently methylated in breast cancers but not in normal breast tissues [27]. Keratin, type I cytoskeletal 17 (KRT17), ranked third, was specifically overexpressed in basal-like subtypes of breast cancer [28]. Secreted frizzled-related protein 1 (SFRP1), ranked fourth, was recently found to be associated with progression and poor prognosis in early-stage breast cancer [29]. Collagen alpha-1 (III) chain (COL3A1), ranked fifth, was up-regulated in both invasive ductal and lobular carcinoma cells when compared with normal ductal and lobular cells [30]. Peptidase inhibitor 15 (PI15), ranked sixth, was also differentially expressed, but it was down-regulated in lobular and ductal invasive breast carcinomas [30]. Actin gamma-enteric smooth muscle (ACTG2), ranked seventh, is involved in the architecture and remodeling of the cytoskeleton in basal medullary breast cancer [31]. For tissue factor pathway inhibitor 2 (TFPI2), ranked eighth, aberrant hypermethylation of the gene promoter was associated with metastasis in breast cancer [32]. Serpin B5 (SERPINB5), ranked ninth, an epithelial-specific serine protease inhibitor, was a biomarker in disseminated breast-cancer cells [33]. Fibronectin 1 (FN1), ranked tenth, was recently suggested to be associated with the prognosis of patients with breast cancer [34].

Finally, we examined the pathways in which the 10 top-ranked genes are involved. The analysis was carried out using software from the webpage “http://vortex.cs.wayne.edu/projects.htm” [35], which helps researchers better understand the biological phenomenon under study by pointing out significant cellular functions of the selected genes. The results indicate that the 10 top-ranked genes are involved in the following pathways: ECM-receptor interaction (COL3A1, FN1), focal adhesion (COL3A1, FN1), Vibrio cholerae infection (ACTG2), p53 signaling pathway (SERPINB5), small cell lung cancer (FN1), Wnt signaling pathway (SFRP1), regulation of actin cytoskeleton (FN1), and pathways in cancer (FN1), which agrees well with current knowledge on breast cancer [36]. The selected genes, which are closely related to adhesion, motility, and metastasis, may therefore provide new insights into the molecular mechanisms underlying disease development, into therapy design, and into prognostication for patients with breast carcinoma. Thus, the analysis of existing biological experimental results for the breast-cancer dataset illustrates that our method has considerable power to identify tumor-related genes.

Furthermore, a second case study, on the prostate-cancer dataset, is presented here. Of the 10 top-ranked genes, six (HPN, MAF, GSTP1, WWC1, JUNB, and RND3) have been reported to be associated with prostate cancer. Hepsin (HPN), ranked first, is a cell surface serine protease that is markedly up-regulated in human prostate cancer; its overexpression in prostate epithelium in vivo causes disorganization of the basement membrane and promotes primary prostate cancer progression and metastasis to liver, lung, and bone [37]. The transcription factor MAF, ranked second, was down-regulated in tumors relative to normal prostate tissue and may be regarded as a candidate tumor suppressor gene [38]. For glutathione S-transferase P (GSTP1), ranked fourth, CpG island hypermethylation is the most common somatic genome alteration described for human prostate cancer [39]. The gene WWC1, ranked sixth, was found to interact with histone H3 via its glutamic acid-rich region, and this interaction might play a mechanistic role in conferring an optimal ER transactivation function as well as in the proliferation of ligand-stimulated breast-cancer cells [40]. The transcription factor jun-B (JUNB), ranked seventh, is an essential upstream regulator of p16 and contributes to maintaining the cell senescence that blocks malignant transformation of TAC. JUNB thus apparently plays an important role in controlling prostate carcinogenesis and may be a new target for cancer prevention and therapy [41]. The Rho-related GTP-binding protein RhoE (RND3), ranked ninth, a recently described member of the Rho GTPase family, was regarded as a possible antagonist of the RhoA protein, which stimulates cell cycle progression and is overexpressed in prostate cancer [42]. The remaining genes have not previously been reported to correlate with prostate cancer and need further analysis.

Genes related to a specific disease phenotype or to similar phenotypes tend to be located in a specific neighborhood of the protein-protein interaction network, and a protein is likely to be coexpressed with its interaction partners and with proteins that have a similar function. Here, we applied a protein-network-based method to analyze the effect of neighborhood partners on the selected genes, using all interactions in the Human Protein Reference Database (HPRD) [43]. Figure 3 shows the protein-interaction network for each top-ranked gene of prostate cancer (KIAA0430 has no interaction partners in HPRD). The red-ellipse nodes represent the 10 top-ranked genes ranked by the score in (6); among them, those marked with an asterisk are known cancer genes. The diamond nodes indicate direct interaction partners of the selected genes that are not cancer genes, and the blue-octagon nodes indicate partners that are known cancer genes. Known cancer genes were collected by querying the Memorial Sloan Kettering computational biology website for the categories “oncogene”, “tumor suppressor”, and “stability”, as in [4, 44]. Among the 10 top-ranked genes for the prostate-cancer dataset (Figure 3), the 6 genes marked with an asterisk (ABL1, JUNB, MAF, P4HB, GSTP1, and RND3) have been identified as known cancer genes. Here, we mainly discuss the three genes P4HB, PEX3, and ABL1, for which we did not find reports of an association with prostate cancer. Of these three, P4HB and ABL1 are known cancer genes. PEX3 is also a well-known disease gene that causes peroxisome biogenesis disorder, complementation group 12, and Zellweger syndrome. Mutations in these genes can thus lead to many diseases, and the genes may have a close relationship with prostate cancer. In this sense, our method is effective for cancer-related gene selection. Recently, Aragues et al. [4] suggested that the cancer linker degree (CLD) of a protein, defined as the number of cancer genes to which a gene is connected, is a good indicator of the probability of it being a cancer gene. We analyzed the CLD of the 10 top-ranked genes on each of the four datasets. For prostate cancer, as shown in Figure 3, most of the top-ranked genes, with the exception of PEX3, interact directly with known cancer genes, and the CLD of ABL1, JUNB, WWC1, MAF, P4HB, GSTP1, HPN, and RND3 is 46, 13, 2, 6, 7, 1, 1, and 1, respectively. Among the 10 top-ranked genes of ALL (TCFL5 and LRMP have no interaction partners in HPRD), SMARCA4, DNTT, and NONO are known cancer genes, and the CLD of SMARCA4, DNTT, NONO, CD72, MPP1, and CD99 is 19, 3, 6, 1, 2, and 2, respectively. For breast cancer, CTHRC1, PI15, and SERPINB5 have no interaction partners in HPRD. Among the remaining 7 of the 10 top-ranked genes, SFRP1 and TFPI2 are known cancer genes, and SFRP1, TFPI2, FN1, COL3A1, and KRT17 interact directly with known cancer genes, with CLDs of 2, 1, 17, 2, and 1, respectively. For colon cancer, FUCA1 has no interaction partners in HPRD. Among the remaining 9 genes, MYH9 is a known cancer gene, and the CLD of DES, MYH9, C3, and SEPT2 is 4, 3, 1, and 1, respectively. These results show that, besides a few selected genes that correspond to known specific cancer mutations, a considerable portion of the top-ranked genes have many direct interactions with cancer genes. This suggests that these genes are very likely to be involved in cancer and may play a central role in the protein network by interconnecting many known cancer genes, so the top-ranked genes can be regarded as reliable candidate disease biomarkers.
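The cancer linker degree used above is simply a count of distinct direct interaction partners that are known cancer genes. A minimal sketch, assuming the interaction network is given as an undirected edge list; the function name and the placeholder identifiers (G1, C1, and so on) are hypothetical and do not correspond to real HPRD entries.

```python
def cancer_linker_degree(gene, interactions, cancer_genes):
    """Number of distinct direct interaction partners that are cancer genes."""
    partners = set()
    for a, b in interactions:  # undirected edges; self-interactions are ignored
        if a == gene and b != gene:
            partners.add(b)
        elif b == gene and a != gene:
            partners.add(a)
    return len(partners & set(cancer_genes))


# Toy network: G1 touches two cancer genes (C1, C2) and one other protein (X).
toy_interactions = [("G1", "C1"), ("C2", "G1"), ("G1", "X"),
                    ("G1", "C1"), ("G2", "X")]
toy_cancer_genes = {"C1", "C2"}
cld_g1 = cancer_linker_degree("G1", toy_interactions, toy_cancer_genes)
cld_g2 = cancer_linker_degree("G2", toy_interactions, toy_cancer_genes)
```

Collecting partners in a set means that an interaction reported more than once is counted only once.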

4. Discussions and Conclusions

4.1. Better Performance on Tumor Classification and Gene Selection and Prioritization

An ongoing challenge is to identify new prognostic markers that are directly related to disease and that can more accurately predict the likelihood of developing cancer in unknown samples. Results indicate that our proposed method of gene selection by HBFSNRS has the following advantages in tackling this challenge. (1) Our method could obtain the highest or near-highest prediction accuracy of tumor classification with the minimum gene subset. (2) Our approach presents ranked lists of potential candidate biomarkers for a specific cancer. (3) Our proposed method can obtain many optimal gene subsets in a short period of time, which is essential to the whole search process. (4) Compared with the other gene-ranking methods KWRST and Relief-F, our method is relatively stable and is little affected by chance. The success of gene selection by HBFSNRS can be attributed to a combination of several aspects. First, we adopted the dependence function of NRS to evaluate the goodness of selected gene subsets. This has two main advantages: it saves time, and it evaluates gene subsets without feedback or leaked information from the test dataset. Second, and more importantly, the designed gene-search process can select any number of optimal gene subsets in a comparatively short time, which is an optimization of best-first search. Finally, the choice of the neighborhood threshold value used to evaluate gene subsets raises a problem: the same gene may receive different ranked positions, and hence different apparent relevance to cancer, under different threshold values. To avoid this selection bias, we defined a score that describes the significance of a gene by combining five groups of results, one obtained with each threshold value. We presented two case studies, on breast cancer and prostate cancer, to illustrate the power of our method to identify tumor-related genes. These case studies illustrate its strength in both tumor classification and gene prioritization.

4.2. Limitation and Extension

One limitation of our approach is data quality: current high-throughput technologies remain error prone, and the resulting data may be far from complete. In a recent paper, Zhang et al. [45] held that the integration of microarray data gives more analytical power and reduces the false discovery rate. For a specific cancer, efficiently integrating multiple independent microarray datasets may therefore be a good way to address the issue of data quality. The other limitation is the optimization of the threshold value of the neighborhood rough set. On one hand, we used the neighborhood rough set reduction method to evaluate the goodness of the selected gene subsets, which saves time in tumor classification and avoids using feedback information from the test dataset. On the other hand, the threshold itself is selected through feedback information from the test set. In addition, different threshold values may select different gene subsets, so a gene may occupy different positions in the prioritization under different threshold values, which would make the choice of threshold critical for gene prioritization. Fortunately, the choice of threshold turns out to be less important for gene ranking, because the change in gene position across threshold values is not significant. In our study, Spearman’s rank correlation coefficient was used to determine whether the gene-prioritization results obtained with different threshold values are consistent. Results indicate that there is high consistency among these results.
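
The consistency check described above can be sketched as follows. This is a minimal illustration, assuming that each prioritization is a tie-free ranking of the same genes, so that the simplified formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies. The function name `spearman_rho` and the two toy rankings are hypothetical and are not results from the study.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two tie-free rankings of the same genes.

    rank_a, rank_b: dicts mapping gene -> rank position (1 = top-ranked).
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same genes")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two genes")
    # Sum of squared rank differences over all genes
    d2 = sum((rank_a[g] - rank_b[g]) ** 2 for g in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy rankings of five hypothetical genes under two threshold values
ranks_threshold_1 = {"g1": 1, "g2": 2, "g3": 3, "g4": 4, "g5": 5}
ranks_threshold_2 = {"g1": 2, "g2": 1, "g3": 3, "g4": 4, "g5": 5}

rho = spearman_rho(ranks_threshold_1, ranks_threshold_2)
print(round(rho, 3))
```

Here only the top two genes swap positions, so the coefficient stays close to 1, which is the pattern one would expect when the prioritization is insensitive to the threshold. With tied ranks the simplified formula no longer holds, and a tie-aware implementation would be needed.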

4.3. Future Work

Our proposed HBFSNRS method has improved the performance of microarray-based tumor classification and has identified and prioritized lists of potential tumor-related genes from GEP. Our future work will benefit further from integrating other data sources. Recent high-throughput technologies have produced vast amounts of protein-protein interaction data, which are valuable resources for candidate-gene prioritization and give new insights into the mechanism of disease. A great number of studies have shown that integrating multiple sources of data is more reliable for predicting cancer genes than using a single criterion [4, 46–48]. Thus, integrating GEP with the protein interaction network is a promising approach to gene prioritization. Although gene expression data and protein interaction data have already been integrated for gene prioritization [49, 50], the results are not yet satisfactory. This integration therefore remains a challenging problem in cancer research.

Acknowledgments

This work was supported by the grants of the National Science Foundation of China, Nos. 60905023, 30900321, 60973153, 60873012 & 60805021, the grant from the National Basic Research Program of China (973 Program), No. 2007CB311002, the grant of the Guide Project of Innovative Base of Chinese Academy of Sciences (CAS), No. KSCX1-YW-R-30, the Knowledge Innovation Program of the Chinese Academy of Sciences (0823A16121), and the China Postdoctoral Science Foundation (Grant no. 20090450825).