Review Article

A Review for Detecting Gene-Gene Interactions Using Machine Learning Methods in Genetic Epidemiology

Table 4

Strengths and weaknesses of neural network, support vector machine, and random forest methods for detecting gene-gene interactions.

Methods | Authors | Strengths | Weaknesses

Neural network (NN)
Authors: Musani et al. [40]; Upstill-Goddard et al. [31]
Strengths:
(i) NN is able to model the relationship between disease and single nucleotide polymorphisms (SNPs).
(ii) NN can predict the disease outcome for data in which it is unknown by learning from a dataset in which the outcome is given.
(iii) NN can deal with large volumes of data.
(iv) NN is suitable for data exhibiting genetic heterogeneity, high phenocopy rates, polygenic inheritance, and incomplete penetrance.
(v) GPNN and GENN are able to optimize the architecture of NN and possess high power to discover the presence of nonfunctional SNPs.
(vi) GPNN does not overfit the data.
(vii) GPNN possesses high power in dealing with epistasis models with weak marginal effects.
(viii) GENN outperforms GPNN by optimizing the NN in fewer generations.
(ix) GENN possesses high power to detect high-risk loci in complex disease.
Weaknesses:
(i) NN is a black box.
(ii) It is difficult to enumerate all possible NN architectures, which makes it hard to find the optimal architecture.
(iii) GPNN needs a parallel processing environment.
(iv) GPNN yields a high false-positive rate in three-locus models.
(v) The output of GPNN is a binary expression, which can be hard to interpret (for instance, with up to 500 nodes).
(vi) The results of NN are hard to interpret because of the dimensionality problem.
(vii) NN needs comprehensive cross-validation to confirm validity.

Support vector machine (SVM)
Authors: Chen et al. [19]; Wasan et al. [41]; Upstill-Goddard et al. [31]
Strengths:
(i) SVM can deal with high-dimensional data sets.
(ii) SVM can be used to classify complex biological gene expression data.
(iii) SVM does not get trapped in local minima.
(iv) SVM is not prone to overfitting.
(v) SVM is robust to noise.
(vi) The output of SVM is more interpretable than that of MDR.
(vii) SVM does not require user-defined decisions for classification.
(viii) SVM generalizes readily to new structures.
Weaknesses:
(i) SVM is a black box.
(ii) SVM is restricted to pairwise classification.
(iii) SVM cannot be used directly for feature selection.
(iv) The results may be affected by the presence of missing data.
(v) The power of SVM might be reduced in the presence of genetic heterogeneity.
(vi) Additional training may be needed to correct the bias in prediction accuracy; however, the proposed procedure could be computationally expensive.
(vii) The accuracy of SVM might be suboptimal because the SVM parameter C is fixed at a single constant. Hence, a grid search over this parameter, using some promising SNP combinations, is needed to refine the results.

Random forest (RF)
Authors: Upstill-Goddard et al. [31]
Strengths:
(i) RF can uncover interactions among genes that do not exhibit strong main effects.
(ii) RF does not "overfit" the data.
(iii) RF identifies SNPs that are predictive of a phenotype.
Weaknesses:
(i) RF is a black box.
(ii) RF does not succeed on GWAS data.
(iii) RF sometimes underestimates the importance scores of SNPs without marginal effects.
(iv) RF only detects interactions with large effect sizes.
Random jungle (RJ)
Strengths:
(i) RJ is able to analyze data on a genome-wide scale.
(ii) RJ is more computationally efficient than RF.
Weaknesses:
(i) If the main effects are weak, RJ fails to detect interactions.
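
Several entries in the table turn on interactions between SNPs that have weak or absent marginal (main) effects. The short sketch below shows what such a model looks like numerically. It is an illustration under stated assumptions, not a model taken from the cited studies: the penetrance table is hypothetical, both SNPs are assumed to have an allele frequency of 0.5 under Hardy-Weinberg equilibrium, and the names `GENOTYPE_FREQ`, `PENETRANCE`, and `marginal_penetrance` are introduced here for illustration only.

```python
from fractions import Fraction

# Assumption: allele frequency 0.5 and Hardy-Weinberg equilibrium, so the
# frequencies of carrying 0, 1, or 2 copies of the minor allele are 1/4, 1/2, 1/4.
GENOTYPE_FREQ = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

# Hypothetical penetrance table: P(disease | genotype at SNP A, genotype at SNP B).
# Rows index the SNP A genotype, columns index the SNP B genotype.
PENETRANCE = [
    [Fraction(0), Fraction(1, 10), Fraction(0)],
    [Fraction(1, 10), Fraction(0), Fraction(1, 10)],
    [Fraction(0), Fraction(1, 10), Fraction(0)],
]


def marginal_penetrance(table, freqs, axis):
    """Disease risk per genotype of one SNP, averaged over the other SNP."""
    if axis == 0:
        # Marginal for SNP A: average each row over the SNP B genotypes.
        return [sum(p * f for p, f in zip(row, freqs)) for row in table]
    # Marginal for SNP B: average each column over the SNP A genotypes.
    n = len(table)
    return [sum(table[i][j] * freqs[i] for i in range(n)) for j in range(n)]


marginal_a = marginal_penetrance(PENETRANCE, GENOTYPE_FREQ, axis=0)
marginal_b = marginal_penetrance(PENETRANCE, GENOTYPE_FREQ, axis=1)

# The joint table takes more than one value, so risk depends on the genotype pair.
joint_values = {p for row in PENETRANCE for p in row}

print("Marginal penetrance by SNP A genotype:", [float(m) for m in marginal_a])
print("Marginal penetrance by SNP B genotype:", [float(m) for m in marginal_b])
print("Distinct joint penetrance values:", sorted(float(p) for p in joint_values))
```

In this hypothetical model every single-SNP genotype carries the same marginal risk (1/20), although the joint risk ranges from 0 to 1/10 depending on the genotype pair. Any procedure that ranks SNPs one at a time sees no signal here, which is the situation the table describes for RF importance scores and for RJ when the main effects are weak.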