Abstract

Conventionally, pathway-based analysis assumes that the genes in a pathway contribute equally to a biological function and therefore assigns them uniform weight. However, this assumption has been shown to be incorrect, and uniform weighting may not be appropriate for tasks such as the molecular classification of diseases, because genes in a functional group may differ in predicting power. Hence, we propose assigning different weights to genes in pathway-based analysis and devise four weighting schemes. We applied them in two existing pathway analysis methods using both real and simulated gene expression data for pathways. Among all schemes, the random weighting scheme, which generates random weights and selects the weights that minimize an objective function, performs best in terms of 𝑃 value or error rate reduction. Weighting changes pathway scoring and brings up some new significant pathways, leading to the detection of disease-related genes that are missed under uniform weight.

1. Introduction

With the advent of microarray technology in the field of biomedical research [1–7], numerous statistical methods [8, 9] were proposed to analyze microarray gene expression data. Most of them, however, are single-gene based and do not consider the interactions or dependencies among genes in a functional group. In single-gene-based analysis, genes that are differentially expressed subtly but in a coordinated way are often not identified as significant and are usually dropped by feature selection with a strict cutoff threshold [10, 11]. In contrast, pathway-based analysis considers a set of biologically related genes and helps detect subtle changes in gene expression through the joint effect of the genes [3, 4, 12, 13]. Many researchers have discussed the advantages of pathway-based analysis. Subramanian, for instance, considered an enrichment-based approach using various Kolmogorov-Smirnov statistics [3]; Curtis gave a good review of computational approaches proposed for pathway-based analysis [4]; Goeman et al. proposed the global test based on a generalized linear model [12]; Pang et al. described the random forest-based pathway analysis [13]; Harris et al. considered gene grouping based on gene ontology [14]; and Misman et al. reviewed these approaches [15].

A biological pathway is a series of actions among molecules in a cell that leads to a certain product or a change in the cell. Such a pathway can trigger the assembly of new molecules, such as fat or protein. Pathways can also turn genes on and off or spur a cell to move [16]. Biological pathways help researchers learn about human disease, since identifying the genes, proteins, and other molecules involved in a biological pathway can provide clues about what goes wrong when a disease strikes. Researchers may compare certain biological pathways in a healthy person to the same pathways in a person with a disease to discover the roots of the disorder. Using pathways extensively allows a quick overview of expression results in relation to biological mechanisms, facilitating the understanding of gene, protein, and metabolite interactions at higher levels. Over the past decade, researchers have discovered many important biological pathways through laboratory studies of cultured cells and various organisms, and these pathways are stored in public domain biological pathway databases [16]. Biological pathways have also been curated manually by combining three content sources: public domain databases, literature, and experts [17].

Pathway analysis aims to define the meaning of biological processes by identifying significant pathways through statistical evaluations. In these evaluations, pathways are scored based on activity, coregulation, and cascade effects, as measured by the gene expression levels from the microarray experimental data. The score ranks higher those pathways in which more genes are overexpressed or underexpressed relative to a reference state [18]. Ranking pathways relevant to a particular biological process or disease is useful, since it allows researchers to focus on a smaller number of pathways for further study of the biological process or disease of interest. Most pathway analysis tools and methods, however, assume that all genes in a pathway contribute equally to a biological process and thus assign uniform weight. This assumption has been shown to be incorrect [19], because some genes may be more relevant to a particular biological process, and those genes presumably have higher predicting or classifying power than the others. Another issue in pathway analysis is the quality of the pathways, since biological pathway databases are not comprehensive, and the biological pathway content varies greatly in quality and completeness among the tools and databases [17]. Pathway data taken from public databases and the open literature may include nonrelevant genes and/or exclude relevant genes [20]. For instance, in the case of Mootha’s well-known type II diabetes pathway dataset [21], genes such as CAP1, MAP2K6, ARF6, and SGK contained in the pathway ID 36, c17 U133 probes, are known to be related to human insulin signaling [15], while the other genes in that pathway are not yet known to be. Likewise, SHC contained in the pathway ID 229 is known to be related to human insulin signaling, while the others are not.

To address the problem of pathway quality and incompleteness in pathway analysis tools and approaches, some researchers tried to minimize misspecification by defining signature genes to represent pathway behaviors and/or by refining pathways to adapt to specific conditions by removing unaltered genes from the dataset [19, 22–24]. Others tried to improve the functional interpretation of gene groups by including additional information associated with the group [24]. Joining such efforts, we propose a nonuniform weighting scheme, which applies different weights to the genes in a pathway based on the relevance of each gene to a related biological process or disease. The intuition behind our proposal is that not all genes grouped in a pathway are related to a particular biological process or disease with the same significance, and thus weighting genes in proportion to their relevance to a certain biological process or disease may generate more accurate results for pathway-based analysis such as the molecular classification of diseases.

To investigate the impact of using weighting schemes in pathway-based analysis, we devise four different weighting schemes and incorporate them into existing pathway analysis methods, namely the global test [12] and the random forests [13, 25–27]. Our schemes essentially apply larger weights to genes that are more differentially expressed between different sample groups (e.g., normal versus tumor samples), so that those genes have more impact on the final results of the analysis. The four weighting schemes we introduce in this paper are as follows. The first weighting scheme, denoted absT, is based on the absolute value of the two-sample 𝑡 statistic. The second, denoted Qdiff, is based on the 𝑄 test statistic of the global test. The third and the fourth are based on a computational approach, which assigns weights randomly to genes and selects the weights that minimize an objective function. The third scheme, called RWV (random weight vector), assigns 𝑚 weights to 𝑚 genes, so that all samples of a gene receive the same weight. The fourth, called RWM (random weight matrix), assigns a matrix of weights to a pathway of 𝑚 (genes) × 𝑛 (samples), so that the samples of a gene receive different weights.

We performed our experiments using the type II diabetes dataset obtained from Mootha et al. [28] and the canine dataset from Enerson et al. [29]. We also used simulated datasets to gain an in-depth understanding of weighting effects in a controlled way. In our experiments, we apply each weighting scheme to the datasets and select the top 20 or 33 significant pathways. We evaluate the performance of our weighting schemes by comparing the 𝑃 values of the pathways selected using each scheme with those of the pathways selected using uniform weighting. We observe that when our weights are applied, the scoring of the pathways changes, and some pathways originally in lower ranks are elevated to higher ranks, contributing to improved prediction rates. According to previous studies [28, 30–34], several significant pathways identified by our weighting schemes are biophysiologically associated with the related diseases.

2. Materials and Methods

2.1. Global Test and Random Forest

We used the global test [12] and the random forests [13, 25, 26] methods to investigate the impact of weighting in pathway-based analysis and to evaluate the performance of our weighting schemes. First, we review the two methods briefly to explain how we incorporate our proposed weighting schemes into them.

2.1.1. Overview of Global Test

The global test is a pathway analysis method developed by Goeman et al. [12]. Based on a logistic regression, it tests whether subjects with similar gene expression profiles have similar class labels. Suppose that the gene expression data, containing $n$ samples for $p$ genes, are normalized. Of these $p$ genes, a subgroup of $m$ ($1 \le m \le p$) genes is to be tested. Let $X = (x_{ij})$ be the $n \times m$ data matrix containing the $m$ genes for the $n$ samples of interest, and let $Y_i$ be the clinical outcome of the $i$th sample, so that $Y$ is an $n \times 1$ vector. To model how the clinical outcome $Y$ depends on the gene expression data $X$, the global test adopts the generalized linear model framework developed by McCullagh [35], expressed as follows:

$$E(Y_i \mid \beta) = h^{-1}\Big(\alpha + \sum_{j=1}^{m} x_{ij}\beta_j\Big), \tag{1}$$

where $\beta_j$ is the regression coefficient for gene $j$ ($j = 1, \ldots, m$), $h$ is a link function (e.g., the logit function), and $\alpha$ is an intercept. Testing for a predictive effect of the gene expressions on the clinical outcome is equivalent to testing the hypothesis $H_0: \beta_1 = \beta_2 = \cdots = \beta_m = 0$. Assume that $\beta_1, \ldots, \beta_m$ are a sample from some common distribution with zero mean and variance $\tau^2$; then the single unknown parameter $\tau^2$ determines the allowed deviation of the regression coefficients from zero, and the null hypothesis becomes $H_0: \tau^2 = 0$. The quantity $r_i = \sum_j x_{ij}\beta_j$ ($i = 1, \ldots, n$) is the linear predictor, that is, the total effect of all covariates for the $i$th sample. As $r = (r_1, \ldots, r_n)$ is a random vector with $E(r) = 0$ and $\mathrm{cov}(r) = \tau^2 X X^{T}$, the generalized linear model simplifies to $E(Y_i \mid \beta) = h^{-1}(\alpha + r_i)$. A test statistic for testing $H_0$ is defined as

$$Q = \frac{1}{\mu_2}\sum_{i=1}^{n}\sum_{j=1}^{n} R_{ij}\,(Y_i - \mu)(Y_j - \mu), \tag{2}$$

where $R = (1/m) X X^{T}$ is an $n \times n$ matrix proportional to the covariance matrix of the random effects $r$, $\mu = h^{-1}(\alpha)$ is the expectation of $Y$ under $H_0$, $\mu_2$ is the second central moment of $Y$ under $H_0$, and $(Y - \mu)(Y - \mu)^{T}$ is the covariance matrix of the clinical outcomes of the samples. The test statistic $Q$ takes a higher value when the entries of the two matrices are more strongly correlated. Essentially, it tests whether samples with similar gene expressions also have similar outcomes.
The empirical distribution of the test statistic $Q$ under the null hypothesis $H_0$ is obtained by randomly permuting the outcome vector $Y$ a large number of times (such as 100,000) and recomputing $Q$ for each permutation. The empirical $P$ value is the number of permutations for which $Q$ computed on the permuted $Y$ is at least as large as the observed $Q$, divided by the total number of permutations. For our microarray datasets, $Y$ is 1 for a disease sample and 0 for a normal sample.
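
The statistic and its permutation 𝑃 value can be illustrated with a short sketch. It is not the implementation used in the experiments: it assumes a binary outcome, estimates the mean and second central moment of 𝑌 from the sample, and uses hypothetical function names (`q_statistic`, `permutation_p_value`) and a small default number of permutations.

```python
import random

def q_statistic(X, y):
    """Q = (1/mu2) * sum_i sum_j R_ij (y_i - mu)(y_j - mu), with R = X X^T / m.

    X is a list of n rows (samples), each holding m gene expression values.
    """
    n, m = len(X), len(X[0])
    mu = sum(y) / n                           # sample estimate of E(Y) under H0
    mu2 = sum((v - mu) ** 2 for v in y) / n   # second central moment of Y
    resid = [v - mu for v in y]
    # (y - mu)^T X X^T (y - mu) equals ||X^T (y - mu)||^2, so R is never formed.
    proj = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(m)]
    return sum(p * p for p in proj) / (m * mu2)

def permutation_p_value(X, y, n_perm=2000, seed=0):
    """Fraction of label permutations whose Q is at least the observed Q."""
    rng = random.Random(seed)
    q_obs = q_statistic(X, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if q_statistic(X, y_perm) >= q_obs:
            hits += 1
    return hits / n_perm
```

Because the mean and second central moment of 𝑌 do not change under permutation, only the projection of the permuted residuals onto the genes varies from one permutation to the next.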

We selected the global test for our study of the weighting effect in pathway-based analysis because generating and assigning weights for the genes in a pathway is easy and straightforward in this method. Multiplying the gene expression data matrix 𝑋=(𝑥𝑖𝑗) element-wise by a desired nonuniform weight matrix 𝑊=(𝑤𝑖𝑗) does not incur any side effects in the global test.

2.2. Overview of Random Forests

The random forests method is a tree-based method developed by Breiman et al. (1984, 2001) [25–27], which can be used for classification or regression [13]. It grows multiple classification or regression trees using a deterministic algorithm, with each tree constructed from a different bootstrap sample of the original data. About one-third of the cases are left out of each bootstrap sample; these out-of-bag (OOB) cases are not used in constructing the 𝑘th tree but are kept as a test set for it. At the end of the run, each case 𝑛 is assigned the class that received the most votes over all trees for which 𝑛 was out of bag. The proportion of cases for which this assigned class differs from the true class is called the estimated out-of-bag (OOB) error (http://stat-www.berkeley.edu/). Pang et al. [13] were the first to apply the random forests approach to pathway analysis, and we adopted their approach to study the weighting effect in pathway analysis. Our objective is to find, for each pathway, the optimal weight matrix $W$ that minimizes the estimated OOB classification error, using the following modified objective function:

$$W = \arg\min_{w}\,\big[F(wX)\big], \tag{3}$$

where $F(X)$ is the original cost function of the random forests, which computes the OOB error rate of a group of data $X$, and $w$ is a weight matrix for the group of data $X$.
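
The OOB bookkeeping described above can be sketched in a few lines. This is an illustration under simplifying assumptions that are ours, not the method's: depth-one trees (decision stumps) stand in for full trees, no random feature selection is performed, labels are 0/1, and the names `fit_stump` and `oob_error` are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a one-split tree: the feature/threshold rule with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        vals = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive observed values.
        cuts = [(a + b) / 2 for a, b in zip(vals, vals[1:])] or vals
        for t in cuts:
            for left, right in ((0, 1), (1, 0)):
                err = sum((left if row[j] <= t else right) != label
                          for row, label in zip(X, y))
                if best is None or err < best[0]:
                    best = (err, j, t, left, right)
    _, j, t, left, right = best
    return lambda row: left if row[j] <= t else right

def oob_error(X, y, n_trees=200, seed=0):
    """Estimated OOB error: each case is scored only by trees that did not see it."""
    rng = random.Random(seed)
    n = len(X)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        in_bag = set(idx)
        tree = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        for i in range(n):
            if i not in in_bag:                         # case i is out of bag
                votes[i][tree(X[i])] += 1
    scored = [i for i in range(n) if votes[i]]
    wrong = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return wrong / len(scored)
```

In terms of Equation (3), `oob_error` plays the role of 𝐹, and the search over weights amounts to calling it on differently weighted copies of the data.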

2.3. Proposed Weighting Schemes

We considered four nonuniform weighting schemes, each of which generates a weight for every gene in a pathway based on its degree of differential expression between the phenotypes. In this section, we describe the rationale behind each weighting scheme and how the nonuniform weights are generated and assigned to the genes in a pathway.

2.3.1. absT Based on Two-Sample 𝑡-Test Statistic |𝑇|

The two-sample $t$-test statistic is widely used to determine whether the means of two populations are equal [35]. To measure how differentially a gene is expressed between two groups (e.g., normal versus disease), we calculate the two-sample $t$-test statistic of the gene, take its absolute value, and denote it as $|T|$. The absT scheme sets the weight of each gene in a pathway to the $|T|$ value of that gene divided by the sum of the $|T|$ values of all genes in the pathway. Mathematically, the weight $W_{|T|}(j)$ for the $j$th gene is

$$W_{|T|}(j) = \frac{|T(j)|}{\sum_{k=1}^{m} |T(k)|}. \tag{4}$$

With this scheme, the most differentially expressed gene has the largest $|T|$ value and receives the largest weight. The rationale is the hypothesis that more differentially expressed genes are more relevant to the disease or phenotype of interest.
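
As a minimal sketch of Equation (4), the weights can be computed as follows. The text does not say which form of the two-sample statistic is used; the unequal-variance (Welch) form below is an assumption, as is the function name `abs_t_weights`.

```python
import math
from statistics import mean, variance

def abs_t_weights(X, y):
    """Return one weight per gene: |T(j)| divided by the sum of |T| over all genes.

    X is a list of n rows (samples) with m gene values each; y holds 0/1 labels.
    Each group needs at least two samples with nonzero variance.
    """
    m = len(X[0])
    abs_t = []
    for j in range(m):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        # Welch standard error (assumed form of the two-sample t statistic).
        se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
        abs_t.append(abs(mean(a) - mean(b)) / se)
    total = sum(abs_t)
    return [t / total for t in abs_t]
```

The weights are nonnegative and sum to one, so a pathway dominated by one strongly differential gene concentrates most of its weight on that gene.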

2.3.2. Qdiff Based on the Test Statistic 𝑄 of the Global Test

The test statistic $Q$ of the global test measures whether samples with similar gene expressions also have similar outcomes. If the covariance structure of the gene expressions between two sample groups resembles the covariance structure of their outcomes, the $Q$ statistic is large. The proposed Qdiff weighting scheme uses the $Q$ statistic of a pathway to construct the weights for the genes in the pathway. The idea is based on our hypothesis that if excluding one gene from a pathway results in a large change in the original test statistic $Q$, the excluded gene may be strongly relevant to the related disease or phenotype. To determine the weight for the $j$th gene in a pathway containing $m$ genes, the scheme uses the following formula:

$$W_{|Qdiff|}(j) = \frac{|Q - Q(-j)|}{\sum_{k=1}^{m} |Q - Q(-k)|}. \tag{5}$$

Here, $Q$ is the test statistic of the pathway including all $m$ genes, and $Q(-j)$ is the test statistic of the same pathway excluding the $j$th gene. The weight of the $j$th gene is the absolute difference between the two test statistics $Q$ and $Q(-j)$, divided by the sum of all such differences calculated for the $m$ genes in the pathway.
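
The leave-one-gene-out computation in Equation (5) might look like the sketch below. It assumes the form of 𝑄 given in Equation (2) with sample estimates for the moments of 𝑌, and the names `q_statistic` and `qdiff_weights` are hypothetical.

```python
def q_statistic(X, y):
    """Global test Q with R = X X^T / m; X has n rows (samples) and m columns (genes)."""
    n, m = len(X), len(X[0])
    mu = sum(y) / n
    mu2 = sum((v - mu) ** 2 for v in y) / n
    resid = [v - mu for v in y]
    proj = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(m)]
    return sum(p * p for p in proj) / (m * mu2)

def qdiff_weights(X, y):
    """Weight of gene j: |Q - Q(-j)| normalized over all genes (requires m >= 2)."""
    m = len(X[0])
    q_full = q_statistic(X, y)
    diffs = []
    for j in range(m):
        # Drop column j and recompute Q on the remaining m - 1 genes.
        X_minus_j = [row[:j] + row[j + 1:] for row in X]
        diffs.append(abs(q_full - q_statistic(X_minus_j, y)))
    total = sum(diffs)
    return [d / total for d in diffs]
```

Since 𝑅 is scaled by 1/𝑚, dropping a gene in this sketch also changes the scaling to 1/(𝑚−1), so a gene's weight reflects how far its contribution to 𝑄 lies from the average contribution in the pathway.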

2.3.3. RWV Based on Random Weight Vectors Generated by a Computational Approach

The computational RWV (random weight vector) scheme assigns 𝑚 random weights to the 𝑚 genes in a pathway and identifies the vector of 𝑚 weights that minimizes the 𝑃 value of the pathway. It uses the following algorithm to obtain the optimal weight vector for each pathway.

Step 1. Run the global test on the original gene expression of a pathway and obtain the 𝑃-value for the pathway. Initialize this 𝑃-value as minP and the uniform weight vector as optW.

Step 2. for 𝑖=1: COUNT.

Substep 1. Generate a set of $m$ random values in the predefined range (i.e., $0.1 \le \text{range} \le 1.0$).

Substep 2. Pick 𝑚 values randomly from the set of 𝑚 random values constructed in Substep 1, allowing replacements.

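Substep 3. Multiply each gene expression vector $X_j = [X_{j,1}, X_{j,2}, \ldots, X_{j,n}]$ by the corresponding weight $w_j$ ($1 \le j \le m$). This process constructs the weighted gene expression matrix $wX$ for the pathway:

$$wX = \begin{bmatrix} w_1 X_1 & w_2 X_2 & \cdots & w_m X_m \end{bmatrix}. \tag{6}$$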

Substep 4. Run the global test on the weighted gene expression matrix $wX$ of the pathway and obtain its $P$ value.

Substep 5. If the $P$ value of the weighted gene expression matrix $wX$ obtained in Substep 4 is smaller than the current minP, update minP with this $P$ value and update the optimal weight vector optW with the new $w = [w_1, w_2, \ldots, w_m]$ constructed in Substep 2.

End (for loop)
Of course, a larger number of iterations improves the quality of the solution, but at the cost of longer computation time. Note also that, like the absT and Qdiff schemes, this weighting scheme assigns each gene a single weight across all samples.
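
The steps above can be sketched as a short random search. The sketch is generic in the objective: `p_value` is any function that maps a weighted data matrix and the labels to a 𝑃 value (for instance, a global test implementation). The function name `rwv_search`, its parameters, and the samples-by-genes layout of `X` are assumptions made for illustration.

```python
import random

def rwv_search(X, y, p_value, count=1000, low=0.1, high=1.0, seed=0):
    """Random search for a per-gene weight vector minimizing p_value(wX, y).

    X is a list of n rows (samples) with m gene values each.
    Returns (optW, minP).
    """
    rng = random.Random(seed)
    m = len(X[0])
    opt_w = [1.0] * m                       # Step 1: start from uniform weights
    min_p = p_value(X, y)
    for _ in range(count):                  # Step 2
        # Substep 1: a set of m random values in the predefined range.
        pool = [rng.uniform(low, high) for _ in range(m)]
        # Substep 2: draw m weights from that set, with replacement.
        w = [rng.choice(pool) for _ in range(m)]
        # Substep 3: scale gene j by w[j] in every sample.
        wX = [[w[j] * row[j] for j in range(m)] for row in X]
        # Substep 4: evaluate the weighted pathway.
        p = p_value(wX, y)
        # Substep 5: keep the best weights seen so far.
        if p < min_p:
            min_p, opt_w = p, w
    return opt_w, min_p
```

Because the best value is only ever replaced by a smaller one, the returned 𝑃 value is never larger than the one obtained under uniform weight.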

2.3.4. RWM Based on Random Weight Matrices Generated by a Computational Approach

In contrast to the three schemes above, which assign the same weight across all samples of a gene, the RWM (random weight matrix) scheme assigns a different weight to each sample of a gene. Essentially, the RWM scheme uses the same algorithm as the RWV scheme, except that it generates $n \times m$ random values instead of $m$ random values, for the $n$ samples in a pathway of $m$ genes. The $n \times m$ random values in the predefined range are multiplied element-wise by the $n \times m$ gene expression data. Among all sets of random weights it applies, the scheme selects the set of weights that minimizes the $P$ value in the global test, or the OOB error rate in the random forests, for the pathway. The weighted gene expression matrix $wX$ of a pathway is expressed as follows:

$$wX = \begin{bmatrix} w_{1,1}X_{1,1} & w_{1,2}X_{1,2} & \cdots & w_{1,m}X_{1,m} \\ w_{2,1}X_{2,1} & \cdots & \cdots & w_{2,m}X_{2,m} \\ \vdots & & & \vdots \\ w_{n-1,1}X_{n-1,1} & \cdots & \cdots & w_{n-1,m}X_{n-1,m} \\ w_{n,1}X_{n,1} & w_{n,2}X_{n,2} & \cdots & w_{n,m}X_{n,m} \end{bmatrix}. \tag{7}$$

Because every RWV weight vector corresponds to an RWM weight matrix with identical rows, the RWM scheme searches a larger space and can in principle find a better solution than the RWV scheme in minimizing the $P$ value or the OOB error, but it is computationally more expensive.
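
The element-wise weighting of Equation (7) can be illustrated with the sketch below, which also shows a per-gene weight vector written as a weight matrix with identical rows. The helper names are hypothetical, and the search loop around them is omitted.

```python
import random

def random_weight_matrix(n, m, low=0.1, high=1.0, rng=None):
    """Draw an n x m matrix of weights uniformly from the predefined range."""
    rng = rng or random.Random()
    return [[rng.uniform(low, high) for _ in range(m)] for _ in range(n)]

def apply_weights(W, X):
    """Element-wise product: entry (i, j) of the result is W[i][j] * X[i][j]."""
    return [[w * x for w, x in zip(w_row, x_row)] for w_row, x_row in zip(W, X)]

def vector_as_matrix(w, n):
    """A per-gene weight vector expressed as an n x m matrix with identical rows."""
    return [list(w) for _ in range(n)]
```

For example, `apply_weights(vector_as_matrix(w, n), X)` reproduces per-gene weighting, whereas a matrix drawn with `random_weight_matrix` generally rescales each sample of a gene differently.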

2.4. Datasets

Real Datasets
The first real dataset we used is the well-known type II diabetes microarray gene expression dataset obtained from Mootha et al. [28], consisting of 278 pathways for 13,842 genes, sampled from 26 people with type II diabetes and 17 without. The pathways were obtained from the KEGG pathway database (http://www.genome.jp/kegg/pathway.html), and the curated pathways were constructed from known biological experiments performed by Mootha et al. The other real dataset is the canine dataset obtained from Enerson et al. [29], consisting of 441 pathways for 6,592 genes, sampled from 12 dogs with lesions and 17 without. The canine dataset was generated from investigative toxicology studies designed to identify the molecular pathogenesis of drug-induced vascular injury in the coronary arteries of dogs treated with the adenosine receptor agonist CI-947. The canine genes were mapped to human orthologs, which were identified by matching the gene sequences using BLASTx [13, 29]. Note that not all genes in a pathway are equally relevant to the related disease: some genes could be related to the disease more strongly, and others less or not at all. Pathway ID 36 in the type II diabetes pathway dataset, for instance, contains several genes such as CAP1, MAP2K6, ARF6, and SGK, which are known to be related to human insulin signaling, alongside other genes whose relevance to type II diabetes is not yet known [21].

Simulated Datasets
To study the weighting effect with more control, we created two simulated datasets using the simulator function available in the boost R package, which allows simulated data to retain the mean and the correlation structure of the original pathway data [13, 36]. As the basis of our simulations, we selected two real pathways that contain more than 20 genes and yield a high 𝑃 value in the global test or a high OOB error rate in the random forests under uniform weight, so that the weighting effect would be manifested more clearly. One pathway is “MAP00480_Glutathione_metabolism,” ID 164 from the type II diabetes dataset, containing 26 genes and ranked 277th with a 𝑃 value of 0.95 in the global test. The other pathway is “Eicosenoid Metabolism,” ID 441 from the canine dataset, containing 21 genes and ranked 421st with an out-of-bag (OOB) error rate of 0.48% in the random forests. For both cases, we used the multivariate normal distribution to create the simulated pathway data for sample sizes of 30, 50, and 100, with the samples divided evenly between the normal and disease groups.
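
The idea of drawing multivariate normal data that retain the mean and covariance of a real pathway can be sketched as follows. This is not the simulator from the R package; it is a standard-library illustration with hypothetical function names, and it assumes the covariance estimate is positive definite (the small jitter is only a numerical safeguard). To simulate two phenotype groups, one would call it once per group with that group's mean and covariance.

```python
import math
import random

def mean_and_cov(X):
    """Sample mean vector and covariance matrix of n rows with m columns."""
    n, m = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(m)]
    cov = [[sum((row[a] - mu[a]) * (row[b] - mu[b]) for row in X) / (n - 1)
            for b in range(m)] for a in range(m)]
    return mu, cov

def cholesky(A, jitter=1e-8):
    """Lower-triangular L with L L^T approximately equal to A."""
    m = len(A)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(max(A[i][i] - s, 0.0) + jitter)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def simulate(mu, cov, n_samples, seed=0):
    """Draw n_samples rows from a multivariate normal with the given mean and covariance."""
    rng = random.Random(seed)
    L = cholesky(cov)
    m = len(mu)
    out = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]   # independent standard normals
        out.append([mu[i] + sum(L[i][k] * z[k] for k in range(i + 1))
                    for i in range(m)])
    return out
```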

3. Results and Discussion

We applied each proposed weighting scheme to each dataset in the global test and ranked the pathways in increasing order of the 𝑃 values obtained from the global test. From the ordered list of pathways for each output set, we selected the top 20 pathways for our analysis. In the random forests case, we applied only the RWM scheme, since the other three proposed schemes apply the same weight across all samples of a gene, which does not change the outcome of the out-of-bag error calculations by the random forests algorithm. For the random forests results, we selected the top 33 pathways instead of 20, in increasing order of OOB error rate, to include the multiple pathways tied at some ranks within the top 20. Ranking pathways is important in pathway analysis because it enables researchers to focus on a small number of pathways that are estimated to be statistically significant in terms of their relationship to the disease or phenotype of interest. In this paper, we focus on the top 20 or 33 pathways selected under each weighting scheme to analyze the performance of the proposed schemes and compare it with the performance of uniform weight.
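
The reason per-gene weights leave tree-based results unchanged can be checked on a single feature: multiplying all samples of a gene by the same positive weight preserves their order, so every threshold split partitions the samples in the same way. The sketch below, with a hypothetical helper `split_partitions`, illustrates this and contrasts it with sample-specific weights of the kind RWM uses.

```python
def split_partitions(values):
    """All distinct left-hand sample sets reachable by thresholding one feature."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    parts = set()
    for cut in range(1, len(values)):
        # A threshold can separate two neighbours only if their values differ.
        if values[order[cut - 1]] < values[order[cut]]:
            parts.add(frozenset(order[:cut]))
    return parts

gene = [2.0, 0.5, 3.1, 1.7]

# Same positive weight for every sample: the reachable partitions are unchanged.
same_weight = [0.3 * v for v in gene]

# Sample-specific weights: the ordering, and hence the partitions, can change.
per_sample = [w * v for w, v in zip([0.1, 1.0, 0.2, 0.9], gene)]
```

Comparing `split_partitions(gene)` with `split_partitions(same_weight)` gives identical sets, while `split_partitions(per_sample)` differs for these values.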

For the greedy search for the optimal set of weights in the RWV and RWM schemes, we used 25,000 iterations, since our experiments on the type II diabetes dataset in the global test showed no meaningful decrease in the 𝑃 values beyond 20,000 iterations. The average 𝑃 values of the type II diabetes pathways for different numbers of iterations of the RWM scheme in the global test are displayed in Figure 1.

To help readers refresh the memory of our four proposed weighting schemes before we discuss the application results of those in the following sections, we provide a brief summary of the four schemes in Table 1.

3.1. The Global Test Application Results
3.1.1. Reduction of 𝑃 Values

Type II Diabetes Dataset
The pathway identification numbers (PIDs) of the type II diabetes pathways in all top 20 groups are displayed in Table 2. While the average 𝑃 value of the 20 pathways under uniform weight is 0.0612, it is much smaller under the proposed weighting schemes. In terms of 𝑃 value reduction, RWM performed best (with an average 𝑃 value of 0.0001), followed by absT (0.0007), Qdiff (0.0027), and RWV (0.0044). The reductions range from 0.0611 for RWM to 0.0568 for RWV. As another metric of the impact of our weighting schemes, we counted the total number of pathways with a 𝑃 value less than 0.05. Among all 278 pathways in the dataset, RWM yields the largest number (264) of pathways with 𝑃 values less than 0.05, followed by absT (142), Qdiff (74), RWV (66), and uniform weight (8). These results show that our schemes effectively reduce the 𝑃 values of the pathways compared to uniform weight. The 𝑃 value distributions for all top 20 pathway groups are shown in the box plots in Figure 2. In terms of 𝑃 values, RWM is the best, followed by absT, and uniform weight is the worst. The dispersion of 𝑃 values under uniform weight is the widest of all, with the largest number of outliers.

Canine Dataset
According to the pathway analysis performed by Pang et al., the canine dataset has a relatively large number of differentially expressed genes [13]. We were interested in the performance of the proposed weighting schemes on such a dataset. The top 20 pathway groups for all weighting schemes are displayed in Table 3. The average 𝑃 value of the 20 pathways under uniform weight is 0.00015, and it is smaller still under our weighting schemes. In terms of 𝑃 value reduction, the best performing scheme is RWM (with an average 𝑃 value of 0.00001), followed by absT (0.00002), RWV (0.00002), and Qdiff (0.00012). The reductions range from 0.00014 for RWM to 0.00003 for Qdiff. Compared to the results for the type II diabetes pathways, the reductions for the canine pathways are smaller. This result is not unexpected, since the canine dataset is known to have more differentially expressed genes and may leave less room for improvement. Among all 441 pathways in the dataset, RWM has the largest number (431) of pathways with a 𝑃 value less than 0.05, followed by absT (405), RWV (388), uniform weight (204), and Qdiff (170). Our weighting schemes, except Qdiff, roughly double the number of pathways with 𝑃 values less than 0.05. It is rather interesting that Qdiff improves the 𝑃 values of the top 20 pathways over uniform weight but decreases the total number of pathways with 𝑃 values less than 0.05. The 𝑃 values for all top 20 pathway groups are shown in the box plots in Figure 3. In terms of 𝑃 values, RWM and RWV are the best, followed by absT, and uniform weight and Qdiff are the worst. The 𝑃 values for RWM and RWV are similar, but RWM is better in terms of outliers.

Simulated Datasets
Having observed that RWM performs best in terms of 𝑃 value reduction, we applied the RWM scheme to our simulated data to study the 𝑃 value reduction in a more controlled environment. The 𝑃 values of all simulated pathway data under uniform weight and the RWM scheme are given in Table 4. In simulation case 1, the 𝑃 values of the simulated pathways with 26 genes for 30, 50, and 100 samples were 0.2246, 0.2155, and 0.2573, respectively, under uniform weight (Table 4(a)), but were reduced to 0.0014, 0.0007, and 0.0002, respectively, under the RWM scheme (Table 4(b)). In simulation case 2, the 𝑃 values of the simulated pathways with 21 genes for 30 and 50 samples were 0.0289 and 0.0004 under uniform weight, but 0.0002 and 0.0001, respectively, under the RWM scheme. For the data with sample size 100, however, the 𝑃 value was already zero under uniform weight, so RWM could make no further improvement.

3.1.2. Change of Ranks and New Significant Pathways

Our weighting schemes reduce 𝑃 values of most pathways in each dataset and hence change the ranks of the pathways determined by the uniform weight. So, some pathways in low ranks under the uniform weight improve their rankings and may draw researchers’ attention. We describe a few such cases in the following.

Type II Diabetes Dataset
We observed that five pathways selected under the absT scheme, with pathway identification numbers (PIDs) 13, 43, 51, 66, and 109, were originally ranked 107th or below under uniform weight. Interestingly, these pathways are reported to be associated with type II diabetes in some way in a couple of papers [37, 38]. The names, ranks, and 𝑃 values of those pathways under uniform weight and the absT scheme are given in Table 5. Such low-ranking pathways might have been ignored by researchers under uniform weight, while they would draw researchers’ attention with our weighting schemes.

Canine Dataset
Six canine pathways selected under the absT scheme, with PIDs 133, 154, 156, 320, 375, and 420, were originally ranked 258th or below under uniform weight. The associations of these newly identified significant pathways with the cancer-related disease are also reported in several papers [39, 40]. The names, ranks, and 𝑃 values of those pathways under the absT scheme are compared to those under uniform weight in Table 6. We observe similar impacts on pathway ranks induced by our other weighting schemes. They are not reported here to conserve space but are available in the first author’s technical report.

3.1.3. Overlapping Pathways

While the newly identified significant pathways would draw researchers’ fresh attention, pathways identified as significant repeatedly under multiple weighting schemes may be worth additional attention. We observed that several pathways hold high rankings across different weighting schemes, and their biological associations with the related diseases are discussed in numerous reports. The overlapping pathways that appear under three or more weighting schemes are shown in boldface in Tables 2 and 3. We discuss them in more detail for the two datasets in the following.

Type II Diabetes Dataset
Overlapping pathways among the top 20 groups include Alanine and aspartate metabolism (PID = 4), Glutamate metabolism (PID = 92), MAP00252 Alanine and aspartate metabolism (PID = 140), MAP00430 Taurine and hypotaurine metabolism (PID = 158), and Oxidation Phosphorylation (PID = 228); they are presented in boldface in Table 2. Among them, oxidation phosphorylation (PID = 228) and glutamate metabolism (PID = 92) are well-known type II diabetes pathways [28, 31]. Alanine and aspartate metabolism (PID = 4), glutamate metabolism (PID = 92), MAP00430_Taurine_and_hypotaurine_metabolism (PID = 158), and MAP00252_Alanine_and_aspartate_metabolism (PID = 140) are also reported by some researchers to be strongly associated with type II diabetes in some way [31, 41, 42]. It is interesting to notice that the pathways with PIDs 4 and 140 retain high ranks (4th or above) across three different schemes.

Canine Dataset
Androgen and estrogen metabolism (PID = 17), tryptophan metabolism (PID = 39), multistep regulation of transcription by Pitx (PID = 117), RNA polymerase III transcription (PID = 151), mitochondrial carnitine palmitoyltransferase system (PID = 217), and Rho cell motility signaling pathway (PID = 391) overlap among different weighting schemes. Among them, tryptophan metabolism (PID = 39) and mitochondrial carnitine palmitoyltransferase system (PID = 217) hold the 8th or higher ranks, and the biological significance of the two pathways to lesions or cancerous lesions is discussed by many researchers [32–34, 43–46]. Biological associations of the other overlapping pathways with the related disease are also discussed in some reports [47, 48].

3.1.4. Prediction Performances

Prediction rates are another metric we used to measure the performance of our weighting schemes. Using the LDA (linear discriminant analysis), SVML (support vector machine with a linear kernel), SVMP (support vector machine with a polynomial kernel), and KNN (k-nearest neighbors) classification methods, we measured the prediction performance of all genes in a pathway, cross-validated the classification results using the LOOCV (leave-one-out cross-validation) technique, and averaged the results over all pathways in each of the top 20 groups. The prediction performances of the pathways in all top 20 groups are presented in Tables 7 and 8 for the two datasets.
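
Leave-one-out cross-validation of a classifier on one pathway can be sketched as follows, using k-nearest neighbors as the example classifier. The choice of Euclidean distance, the default `k=3`, and the function names are assumptions made for illustration; the settings used in the experiments are not stated.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority class among the k training samples closest to x (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loocv_accuracy(X, y, k=3):
    """Hold out each sample once, predict it from the rest, and return the hit rate."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        correct += knn_predict(train_X, train_y, X[i], k) == y[i]
    return correct / len(X)
```

The prediction rate of a group of pathways would then be the average of `loocv_accuracy` over the expression matrices of the pathways in that group.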

As Tables 7 and 8 show, however, the predicting power of the pathways selected under the proposed weighting schemes (except RWM) differs insignificantly from that of the pathways selected under uniform weight. The explanation is that the classifiers we used for the performance measurement are single-gene based and do not consider the dependencies among genes in the pathway. Since our weighting schemes, except RWM, apply the same weight across all groups of samples for each gene, the classifying power of the genes does not change. Hence, those classifiers cannot be used to evaluate the improvement in predicting power of the pathways selected using our weighting schemes. Note that, unlike the other schemes, RWM applies different weights to the samples of a gene, and thus the classifiers measure the weighting effect on the samples of each gene but not on the genes in the pathways. We therefore discuss the improvement in predicting power only for the 20 pathways selected under the RWM scheme.

Table 6 shows the improvement of the predicting power of the genes in the 20 type II diabetes pathways selected under the RWM scheme. The prediction rate of 0.5 measured by LDA for the 20 pathways under uniform weight was increased to 0.81 under RWM, which is a 24% improvement. The prediction improvement made by the RWM scheme was 18% when measured by SVML, 23% by SVMP, and 21% by KNN. As for the canine dataset results, the improvements were 2%, 0%, −1%, and 3% as measured by LDA, SVML, SVMP, and KNN, respectively, as shown in Table 8. The small improvement for the canine pathways compared to that for the type II diabetes pathways may share the same reason as the small reductions of the 𝑃-values: the canine dataset has relatively more differentially expressed genes, and thus may leave less room for improvement.

3.2. Random Forests Results

The proposed absT and Qdiff weighting schemes are designed to be incorporated into the covariance structure of the random effect R when the test statistic 𝑄 is calculated in the global test for a group of genes. Hence, applying these schemes in the random forests method is not appropriate, and indeed the poor experimental results confirmed this. Applying RWV in the random forests is not appropriate either, since, like the absT and Qdiff schemes, it assigns the same weight across all samples for a gene. Thus, for the random forests method we discuss only the application results of the RWM scheme and compare them with those of uniform weight. We also compare them to the RWM application results in the global test method.

3.2.1. Reduction of out-of-Bag (OOB) Error Rate

The out-of-bag (OOB) error rate is the percentage of times that the random forests classification or regression is incorrect for the OOB data. To obtain an unbiased estimate of the classification or regression error in the random forests, the OOB data are run down the trees, and the overall error rate is computed once a specified number of trees have been added to the forest. We used 50,000 trees to estimate the classification error, the same number used in the similar experiments performed by Pang et al. for their pathway analysis using the random forests method [13, 25–27].
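The OOB estimate can be sketched in a few lines. The example below is a simplified illustration, not the random forests implementation used in the experiments: it bags one-feature decision stumps (no random feature selection), uses 200 trees rather than 50,000, and runs on made-up scores. The function names `fit_stump` and `oob_error` are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Best one-threshold rule on a single feature: predict `hi` if x > t."""
    uniq = sorted(set(xs))
    # Candidate thresholds: -inf (constant prediction) and midpoints.
    thresholds = [float("-inf")] + [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    best = None
    for t in thresholds:
        for hi in (0, 1):
            errors = sum((hi if x > t else 1 - hi) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, hi)
    return best[1], best[2]

def oob_error(xs, ys, n_trees=200, seed=0):
    """Each sample is voted on only by trees whose bootstrap sample
    did not contain it; the OOB error is the misclassified fraction."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [[0, 0] for _ in range(n)]  # per-sample OOB votes for class 0 / 1
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        t, hi = fit_stump([xs[i] for i in boot], [ys[i] for i in boot])
        in_bag = set(boot)
        for i in range(n):
            if i not in in_bag:
                pred = hi if xs[i] > t else 1 - hi
                votes[i][pred] += 1
    scored = [i for i in range(n) if sum(votes[i]) > 0]
    wrong = sum((0 if votes[i][0] >= votes[i][1] else 1) != ys[i] for i in scored)
    return wrong / len(scored)

# Hypothetical pathway-level scores for 4 + 4 samples (not data from the paper).
scores = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
err = oob_error(scores, groups)
print(err)
```

Because every sample is scored only by trees that never saw it, no separate test set is needed, which is why the OOB error rate serves as the performance measure in this section.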

Type II Diabetes Dataset
Table 9 displays the PIDs and the OOB error rates of the top 33 type II diabetes pathways in the random forests under uniform weight and the RWM scheme. While the average OOB error rate under uniform weight is 35%, it is only 18% under the RWM scheme. Applying the RWM scheme in the random forests thus reduces the OOB error rate by almost half.

Canine Dataset
The average error rate of 8% under uniform weight is reduced to 6% under the RWM scheme, which is only half of the reduction made for the type II diabetes data under RWM. Table 10 shows the PIDs and the OOB error rates of the 33 canine pathways under uniform weight and the RWM scheme. Again, the larger number of differentially expressed genes in the canine dataset may leave only little room for weighting to improve the application result.
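The comparison between the two datasets can be made explicit through the relative reduction of the average OOB error rate. The worked figures below use only the averages quoted in this section; reading "half of the reduction" as a relative rather than an absolute reduction is our assumption.

```latex
\[
\text{relative reduction}
  = \frac{e_{\text{uniform}} - e_{\text{RWM}}}{e_{\text{uniform}}},
\qquad
\frac{0.35 - 0.18}{0.35} \approx 0.49 \ \text{(type II diabetes)},
\qquad
\frac{0.08 - 0.06}{0.08} = 0.25 \ \text{(canine)}.
\]
```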

Simulation Datasets
In simulation case 1, the error rates of the simulated pathways with 26 genes for 30, 50, and 100 samples are 0.27, 0.48, and 0.30, respectively, under uniform weight, and are reduced significantly to 0.13, 0.36, and 0.22, respectively, under the RWM scheme. In simulation case 2, the error rates of the simulated pathways with 21 genes for 30 and 50 samples are 0.50 and 0.30, respectively, under uniform weight, and are reduced to 0.30 and 0.20, respectively, under RWM. For the sample size of 100, the error rate was the same (0.24) under both uniform weight and the RWM scheme. The OOB error rates of the simulated pathways under uniform weight and the RWM scheme are given in Table 11. The substantial reduction of the error rates under the RWM scheme over uniform weight in the two simulation cases supports our hypothesis that applying different weights to genes in the pathway analysis may enhance the quality of the analysis.

3.2.2. Change of Ranks and New Significant Pathways

Fourteen type II diabetes pathways and five canine pathways, out of each group of 33 selected under the RWM scheme, were originally ranked 100th or below under uniform weight. We list the five most significantly changed pathways for each of the type II diabetes and canine datasets in Tables 12 and 13, respectively, to compare their original ranks under uniform weight with their new ranks under the RWM scheme.

3.2.3. Overlapping Pathways

Three pathways with PIDs of 1, 4, and 140 from the type II diabetes dataset overlap between uniform weight and the RWM scheme. Nine pathways with PIDs of 17, 39, 117, 151, 274, 354, 368, 378, and 395 from the canine dataset overlap. Further, several pathways overlap between the global test and the random forests application results, both under the RWM scheme: four type II diabetes pathways with PIDs of 144, 176, 197, and 245, and five canine pathways with PIDs of 17, 39, 40, 117, and 274. Interestingly, four of these five canine pathways (PIDs of 17, 39, 117, and 274) also overlap between uniform weight and the RWM scheme in the random forests. We believe that pathways overlapping across different weighting schemes applied in the same pathway analysis method, and across different pathway analysis methods for the same weighting scheme, may have even stronger relevance to the related phenotypes.
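The canine overlaps can be checked directly as set intersections. The sketch below uses only the PIDs quoted in the paragraph above; the variable names are ours.

```python
# Canine PIDs overlapping between uniform weight and RWM in the random forests.
rf_uniform_and_rwm = {17, 39, 117, 151, 274, 354, 368, 378, 395}

# Canine PIDs overlapping between the global test and the random forests under RWM.
gt_and_rf_under_rwm = {17, 39, 40, 117, 274}

# Pathways that appear in both overlap lists.
shared = sorted(rf_uniform_and_rwm & gt_and_rf_under_rwm)
print(shared)  # [17, 39, 117, 274]
```

Only PID 40 from the second list is absent from the first, which leaves four canine pathways common to both comparisons.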

3.2.4. Prediction Performances

The prediction rates of each group of 33 pathways for each real dataset are given in Table 14. According to the four classifiers we used to measure the prediction rates of the selected pathways, the RWM scheme improved the prediction rate of the type II diabetes pathways from 52% to 63% (LDA), 48% to 64% (SVML), 49% to 54% (SVMP), and 52% to 59% (KNN). For the canine pathways, however, it worsened the prediction rates. Presumably, weights applied to genes that already have good predicting power in the significant canine pathways add noise to the expression data of those genes and degrade their predicting power.

3.3. Biological Support

To further investigate the biological meaning of the proposed weighting schemes, we searched the functional annotations of the genes in the pathways selected using weights. We particularly sought biological support for the absT scheme, since our overall performance analysis of the four proposed schemes finds absT the most useful and efficient: unlike the RWM scheme, it requires no complex computation. Using the DAVID functional annotation tool [49], we extracted 952 Homo sapiens genes from the 2,150 genes contained in the 20 pathways selected under the absT scheme. The DAVID tool identified eleven enriched genes associated with type II diabetes with 𝑃-value < 0.01 (by the gene-disease association search with the GENETIC_ASSOCIATION_DB_DISEASE option). We list those eleven genes in Table 15.

Interestingly, the DAVID tool failed to identify any enriched genes for type II diabetes in the top 20 pathways selected under uniform weight.

4. Conclusions

In this paper, we proposed to apply different weighting schemes in pathway-based analysis, based on the intuition that genes more differentially expressed between two different groups of samples (normal versus tumor samples) contribute more significantly to the related biological function or disease. We devised four weighting schemes, absT, Qdiff, RWV, and RWM, that assign different weights to genes in the pathways. The former two schemes assign weights to genes based on their relevance to the related disease, and the latter two schemes select, among all sets of randomly assigned weights, the weights that minimize 𝑃-values or error rates. We investigated the impact of weighting in pathway-based analysis using two real and two simulated pathway datasets. To the best of our knowledge, we are the first team in the open literature to apply weights to genes in pathway-based analysis.
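The random-weight idea summarized above (draw candidate weights, keep the set that minimizes an objective) can be sketched as a simple random search. This is an illustration under stated assumptions, not our implementation: the objective below is a hypothetical stand-in based on a Welch t statistic of a weighted pathway score, whereas the schemes in this paper minimize the 𝑃-value of the global test or the OOB error rate. The data are made up, weights are drawn per gene (as in RWV), and all function names are ours.

```python
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch two-sample t statistic."""
    return (mean(a) - mean(b)) / ((variance(a) / len(a) + variance(b) / len(b)) ** 0.5)

def objective(weights, expr, labels):
    """Stand-in objective (an assumption): smaller when the weighted
    pathway score separates the two sample groups better."""
    scores = [sum(w * x for w, x in zip(weights, sample)) for sample in expr]
    g0 = [s for s, y in zip(scores, labels) if y == 0]
    g1 = [s for s, y in zip(scores, labels) if y == 1]
    return 1.0 / (1.0 + abs(welch_t(g0, g1)))

def random_weight_search(expr, labels, n_candidates=500, seed=0):
    """Draw random per-gene weight vectors and keep the best one."""
    rng = random.Random(seed)
    n_genes = len(expr[0])
    best_w = [1.0] * n_genes  # start from uniform weight
    best_obj = objective(best_w, expr, labels)
    for _ in range(n_candidates):
        w = [rng.random() for _ in range(n_genes)]
        obj = objective(w, expr, labels)
        if obj < best_obj:
            best_w, best_obj = w, obj
    return best_w, best_obj

# Hypothetical expression matrix: rows are samples, columns are genes.
expr = [
    [1.0, 5.0, 2.0],
    [1.2, 4.0, 2.1],
    [0.9, 6.0, 1.9],
    [2.0, 5.5, 2.0],
    [2.2, 4.5, 2.2],
    [1.9, 5.0, 1.8],
]
labels = [0, 0, 0, 1, 1, 1]

uniform_obj = objective([1.0] * 3, expr, labels)
best_w, best_obj = random_weight_search(expr, labels)
print(uniform_obj, best_obj)
```

Since the search starts from uniform weight and only accepts improvements, the selected weights can never score worse than uniform weight on the chosen objective; the cost is one objective evaluation per candidate, which is what makes the random schemes computationally expensive.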

We made a few interesting observations through our investigations. First, our weighting schemes effectively reduce the 𝑃-values of the pathways in the global test and the OOB error rates in the random forests for all datasets used in our experiments. Second, our schemes increase the number of pathways with 𝑃-values less than 0.05. RWM performs best among all proposed schemes in terms of 𝑃-value and OOB error rate reduction, but the scheme is computationally expensive. Third, RWM improves the prediction rates of high-ranking pathways. Fourth, all the improvements discussed above are more significant for the type II diabetes dataset than for the canine dataset. This may be because genes that already have better predicting power or are more differentially expressed leave less room for further improvement. In addition to the above improvements, our schemes could find potentially significant pathways that were missed under uniform weight. As described in Section 3, pathways whose ranks improved by weighting are associated with the related diseases according to numerous reports in the literature. Finally, it is worth noting that the absT and Qdiff schemes are, in theory, inferior to the RWM scheme but are computationally far less complex. It may therefore be a good idea to apply them when one cannot afford large computing power or long computing time.

Some issues remain unresolved in evaluating the weighting effect on the prediction performance of our proposed schemes. The four prediction methods (LDA, SVML, SVMP, and KNN) are single gene based and cannot be used to evaluate the absT, Qdiff, and RWV schemes. From the perspective of those methods, the same weight assigned across all samples of a single gene makes no difference in classifying two different groups of samples for that gene. Even for the RWM scheme, they can evaluate the weighting effect only on samples, not on genes, since they cannot consider the interactive relationships or dependencies among genes in a group. A new prediction method that considers the dependencies among genes is needed for a more accurate assessment of weighting effects in pathway-based analysis. Developing such a prediction method is left for future research.

Acknowledgments

This research was supported in part by NIH Grants (nos. NS29525-13A and EB000830) and a DOD/CDMRP Grant (no. BC030280).