Review Article

Incorporating Pathway Information into Feature Selection towards Better Performed Gene Signatures

Table 1

A selective review of pathway-guided gene selection algorithms.

Reference | Brief description of the proposed method and its characteristics | Category

Zhu et al. [26] | The proposed network-based SVM method combines the network-constrained penalty (see equation (1)) with an SVM model to carry out feature selection and classification. It makes SVM models capable of feature selection; the network-constrained penalty gives heavier weights to genes with more direct neighbors (thus increasing the chance of such genes being selected) and encourages a grouping effect. However, the method handles only binary classification and considers only immediate neighbors. | Penalty
Chen et al. [28] | The netSVM method also combines the network-constrained penalty (see equation (1)) with an SVM model. Its advantages and disadvantages are similar to those of the network-based SVM method by Zhu et al. [26] (see above). | Penalty
Sokolov et al. [29] | A generalized elastic net penalty function is proposed and combined with an objective function to select important genes; the resulting method is named GELnet. The authors claimed that this penalty function includes many well-known penalty terms as special cases and that the method is flexible enough to handle many outcome types. An independent R package (i.e., gelnet) implements this method, but the package currently supports only binary classification. | Penalty
Zhang et al. [53] | The Net-Cox method adds a network-constrained penalty term to the partial likelihood function of a Cox model, aiming to select important prognostic genes. The Matlab code is available online, which makes the method easy to implement. The method considers only direct neighbors. | Penalty
Bandyopadhyay et al. [32] | After ranking genes in a pathway according to their marginal classification power, the proposed BPFS method starts from the gene with the largest power and then adds genes. The authors claimed that this method goes beyond the immediate neighbors and eliminates redundant genes. Also, genes missing from the pathway databases are mapped to the network using a probabilistic technique. However, the method is hard to comprehend, and no code is available. | Stepwise forward
Lee et al. [33] | In each pathway, the method reorders genes according to their t-scores and then identifies the subset of genes whose combined expression has optimal discriminative power, called CORGs. Only the pathway membership of genes is considered. The method is simple and easy to implement. | Stepwise forward
Razi et al. [34] | The proposed NBCG method starts with a seed gene and traverses the network to find the optimal subset on the basis of the Shapley value. The method uses the concept of the Shapley value to take into account the collective power of the resulting gene subset. The choice of a seed gene may result in excluding a gene subset with subtle individual effects but a significant concordant effect. | Stepwise forward
Wu et al. [54] | The shortest path method (with well-known genes related to the disease under study, i.e., gastric cancer, as seeds) is used to mine candidate genes, and the combination of random forest + incremental feature selection (RF+IFS) is used to obtain the optimal subset. The proposed method considers the topology information of a network. The use of a wrapper method (RF+IFS) and permutation tests may slow the method down. | Stepwise forward¹
Tian et al. [20] | The weighted-SAMGSR method extends the SAMGSR algorithm by weighting SAMGS statistics according to genes' connectivity levels in the network. The method considers both the membership information and the connectivity level, and can handle two-class and multiple-class classification. The R code is available in the supplementary material. Computing time is a major concern since permutation tests are needed to calculate p-values of the test statistics. | A hybrid of weighting and stepwise forward
Johannes et al. [23] | The RRFE method uses the GeneRank algorithm to alter the ranking criterion of the SVM-RFE algorithm and selects a subset with the best discriminative power. The method weights the coefficients of SVM models by their GeneRanks to increase the probability of selecting a gene with more connected genes. An independent R package (i.e., pathClass) is provided to implement this method. The method only considers how many direct neighbors a gene has and ignores topology information completely. | Weighting
Chan et al. [39] | The wgSVM-SCAD method weights the expression values of genes in a pathway according to their t-values and then uses a penalized SVM model (with SCAD penalty) to identify relevant genes. The proposed method considers only membership information, and the weights are based only on the relevance score (i.e., t-values) instead of pathway information. | Weighting
Tian et al. [16] | Using the sign average of all genes inside a gene set to represent the corresponding gene set, the proposed methods (i.e., one forward bi-level selection method and one backward bi-level selection method) filter out insignificant gene sets and insignificant genes in a specific order. The sign average metric provides a better representation of a gene set than the mean, the median, and the first PC. The proposed methods consider only membership information. | Bi-level selection
Lim and Wong [19] | In both the FSNet and PFSNet methods, a fuzzy value is assigned to each gene for each sample, and then majority voting is used to determine important genes. The code is available online. The proposed methods consider only the gene grouping membership information. | Bi-level selection

Note: Bi-level selection algorithms are regarded as a special case of pathway-guided gene selection algorithms.
¹ Can be loosely categorized into the indicated category (e.g., stepwise forward).
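
To make the stepwise forward category in Table 1 more concrete, the following is a minimal sketch in the spirit of the pathway-level search described for Lee et al. [33]: genes in one pathway are ranked by their t-scores and added greedily while the discriminative power of the combined expression keeps improving. It is not the authors' implementation. The function names (`t_score`, `forward_select`), the use of a Welch t-statistic as the discriminative score, the orientation of each gene by the sign of its own t-score, the division by the square root of the subset size, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def t_score(values, labels):
    """Welch t-statistic of `values` between class 1 and class 0."""
    a, b = values[labels == 1], values[labels == 0]
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


def forward_select(expr, labels):
    """Greedy forward selection within a single pathway (illustrative only).

    expr   : array of shape (n_genes, n_samples), member genes of one pathway
    labels : array of shape (n_samples,), binary class labels (0 or 1)
    Returns the indices of the selected genes and the final score.
    """
    gene_t = np.array([t_score(g, labels) for g in expr])
    # Rank genes from most to least discriminative on their own.
    order = np.argsort(-np.abs(gene_t))
    selected, best = [], 0.0
    running = np.zeros(expr.shape[1])
    for idx in order:
        # Assumption: flip down-regulated genes so all members point the same way.
        candidate = running + np.sign(gene_t[idx]) * expr[idx]
        activity = candidate / np.sqrt(len(selected) + 1)
        score = abs(t_score(activity, labels))
        if score <= best:
            break  # stop as soon as adding a gene no longer helps
        selected.append(int(idx))
        running, best = candidate, score
    return selected, best


# Toy pathway with three genes and eight samples (made-up numbers).
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
expr = np.array([
    [0.0, 1.0, 0.0, 1.0, 5.0, 6.0, 5.0, 6.0],  # up in class 1
    [1.0, 0.0, 1.0, 0.0, 2.0, 0.0, 1.0, 0.0],  # nearly uninformative
    [3.0, 4.0, 3.5, 4.0, 0.0, 1.0, 0.5, 1.0],  # down in class 1
])
selected, best = forward_select(expr, labels)
print(selected, round(best, 2))
```

Because the search stops at the first gene that fails to improve the score, a subset whose members are individually weak but jointly informative can be missed, which is the same kind of limitation Table 1 notes for seed-based searches. The sketch also uses only pathway membership, not network topology.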