Research Article  Open Access
Multivariate Cluster-Based Multifactor Dimensionality Reduction to Identify Genetic Interactions for Multiple Quantitative Phenotypes
Abstract
In complex diseases such as hypertension, diabetes, and autism, deleterious phenotypes are unlikely to result from the effects of single genes; rather, they arise from gene-gene interactions (GGIs), which are widely analyzed by multifactor dimensionality reduction (MDR). Early MDR methods mainly focused on binary traits. More recently, several extensions of MDR have been developed for analyzing various traits such as quantitative traits and survival times. Newer technologies, such as genome-wide association studies (GWAS), have now been developed for assessing multiple traits, to simultaneously identify genetic variants associated with various pathological phenotypes. It has also been well demonstrated that analyzing multiple traits has several advantages over single-trait analysis. While there remains a need to find GGIs for multiple traits, such studies have become more difficult, due to a lack of novel methods and software. Herein, we propose a novel multiCMDR method, combining fuzzy clustering and MDR, to find GGIs for multiple traits. MultiCMDR showed similar power to existing methods when phenotypes followed bivariate normal distributions, and showed better power than others for skewed distributions. The validity of multiCMDR was confirmed by analyzing real-life Korean GWAS data.
1. Introduction
In genome-wide association studies (GWAS), genotype data from a large number of single nucleotide polymorphisms (SNPs) are collected, to associate SNPs with traits of interest [1]. Not only single gene effects, but also interaction effects between genes, play important roles in complex diseases such as hypertension, diabetes, and autism. By identifying gene-gene interactions (GGIs), we expect to increase the statistical power to detect associations. Moreover, we also hope to clarify the biological pathways underlying human diseases, by detecting interactions between loci [2].
In many cases, a single phenotype is considered, and there are various studies on statistical methods for finding GGIs for univariate phenotypes. For studying qualitative traits, as in case-control studies, one simple way of identifying genetic interactions is to fit a logistic regression model (LRM) that includes main effects and relevant interaction terms. However, LRMs perform poorly when there is a dimensionality problem. Another well-known approach is the multifactor dimensionality reduction (MDR) method [3, 4], which reduces dimensions by converting a high-dimensional model to a one-dimensional model. The genotype combinations are classified as either “high-risk” or “low-risk,” depending on the ratio of cases to controls for each genotype combination. Thus, MDR can avoid the issues of sparse data cells and overparameterization of models [2] and can outperform LRMs for detecting higher-order GGIs [5]. Recently, various approaches, such as using multiple contingency tables (MODENDR) [6] or a particle swarm optimization method (PBMDR) [7], have been developed.
Due to its superior performance, there are now various extensions of MDR, covering ordinal phenotypes, quantitative phenotypes, survival information, and odds-ratio-based analysis [8–11]. One specific extension of MDR, generalized MDR (GMDR), which is applicable to both dichotomous and continuous traits, was proposed [12]. However, GMDR does not provide a computationally efficient algorithm that is easy to implement, and it still requires a dichotomous outcome in the data file [9]. As an alternative, quantitative MDR (QMDR) modified MDR’s constructive induction algorithm: it assigns a genotype to either the high- or low-risk group by comparing the local and global means and then applies a t-test to compare the means of the two groups. More recently, cluster-based MDR (CLMDR), which is less sensitive to outliers and distributional assumptions, was also developed [13, 14]. Compared to QMDR, CLMDR was shown to yield higher power when the phenotype distribution is skewed. However, CLMDR was developed only for univariate phenotypes rather than multivariate phenotypes.
When considering multiple phenotypes, it becomes more difficult to find GGIs. Thus, most GWAS studies still focus on one trait to identify genetic variants associated with common complex traits, even though multiple phenotypes or repeated measurements of phenotypes are available. However, in the study of a complex disease, several correlated traits are often measured at the same time as risk factors for the disease. For example, it is known that intermediately correlated phenotypes, such as Factors VII, VIII, IX, XI, and XII and von Willebrand factor, jointly predict the risk of developing thrombosis [1, 9, 20]. By modeling multivariate disease-related traits, the power to detect associations between genes and diseases is expected to increase. Analyses of multiple traits have been successful in analyzing various complex diseases. In general, the multivariate approach has several advantages over the univariate approach considering one trait at a time. For example, the multivariate approach can consider several traits simultaneously in one model and hence it can take into account the correlation among traits. As a result, the multivariate approach would have higher power to detect pleiotropic genes and it can identify genetic variants not easily detected by the univariate approach [21].
There is relatively little GGI research for the multivariate-trait case. To deal with multiple phenotypes, generalized estimating equations (GEE)-GMDR was proposed as an extension of the GMDR method, using the GEE model [22]. MultiQMDR, which extends QMDR to multivariate cases, has also been proposed [5]. MultiQMDR classifies samples into high- vs. low-risk groups by using summary statistics, based mainly on principal component scores. After classification, the two groups’ mean vectors are compared using Hotelling’s $T^2$ statistic. While this approach is simple and intuitive, it is not appropriate when the distribution of phenotypes is asymmetric or skewed, and it is also sensitive to outliers.
Recently, several MDR extensions were proposed using fuzzy set theory [23–27]. Such fuzzy set-based MDR methods treat the classification into high-risk or low-risk groups as equivalent to defining the degree of membership in the high- and low-risk groups. By adopting fuzzy set theory, these methods take into account the uncertainty of this binary classification. They allow the possibility of partial membership in the high- and low-risk groups through a membership function, which transforms the degree of uncertainty into a [0, 1] scale. Then, the best genotype combinations can be selected by maximizing a new fuzzy set-based accuracy measure. Specifically, fuzzy MDR [23] was proposed to detect GGIs for a binary trait and was shown to yield higher power than the original MDR. Furthermore, an empirical fuzzy MDR (EFMDR) model [24] was proposed to overcome the problem of selecting tuning parameters in the original fuzzy MDR, while a fuzzy set-based generalized multifactor dimensionality reduction (FGMDR) model [25] was proposed for covariate adjustment, for both quantitative and binary traits. More recently, a faster version of EFMDR was developed [26]. A fuzzy C-means-based entropy approach [27] was proposed to detect GGIs for a binary trait. It uses two measures: correct classification rate (FCMEMDRCCR) and likelihood ratio (FCMEMDRLR).
Here, we propose a new method to detect GGIs for multiple quantitative traits. The main idea lies in combining fuzzy clustering with a modified multifactor dimensionality reduction (MDR) approach, named “multivariate cluster MDR” (multiCMDR). Like other MDR-based methods, multiCMDR pools multiple genotype combinations into two groups and uses them as a new attribute, reducing a multidimensional space into one dimension. To classify genotype combinations, we first perform fuzzy k-means clustering and compute a threshold, representing the ratio of the sums of the membership degrees of the two groups. Each multilocus genotype is labeled by comparing the local ratio, in each multilocus genotype, to the global ratio. Then, multiCMDR identifies the best genotype model using Hotelling’s $T^2$ statistic. To find the overall best model, 10-fold cross-validation (CV) is performed and the model with the largest CV consistency is chosen. Unlike other GGI methods for multiple quantitative traits, multiCMDR is robust to outliers and underlying distributions.
We first introduce the multiCMDR method in detail in Section 2. In Section 3.1, we present a simulation study to show the performance of the proposed method by comparing it to other methods, such as multiQMDR. For the phenotype distribution, multivariate normal and multivariate gamma distributions are considered. In Section 3.2, we apply our method, as an illustration, to data on three lipid-related phenotypes extracted from the GWA study of the Korean Association Resource (KARE) project. We discuss the results in Section 4 and end with some conclusions in Section 5.
2. Materials and Methods
In this section, we introduce a new procedure, multiCMDR, for finding GGIs for multiple continuous phenotypes. Similar to other MDR-based methods, multiCMDR pools multiple genotype combinations into two groups and uses them as a new attribute that reduces a multidimensional space into only one dimension. The detailed algorithm is described in Figure 1 and the multiCMDR pseudocode is presented in Pseudocode 1.

Step 0. Preprocessing. (i) Suppose there are $n$ samples, with $p$ SNPs and $q$ continuous phenotypes. Let $y_i = (y_{i1}, \ldots, y_{iq})$ be the phenotype vector and let $x_i = (x_{i1}, \ldots, x_{ip})$ be the genotype vector for the $i$-th subject, $i = 1, \ldots, n$. (ii) Standardize all the phenotypes to have a mean of zero and unit variance.
Step 1. Perform fuzzy k-means clustering. (i) Perform fuzzy k-means clustering with $k = 2$ using phenotype information. Here, we make an additional pseudocluster (i.e., “noise cluster”) during the process of clustering [28]. Samples are then allocated into one of three clusters: two good clusters and one noise cluster. In this study, we set the noise cluster threshold value to equal the average squared Euclidean distance between samples. Two good clusters and one noise cluster are obtained by minimizing the following objective function $J$: $$J = \sum_{i=1}^{n} \sum_{j=1}^{2} u_{ij}^{m} \, \lVert y_i - c_j \rVert^2 + \sum_{i=1}^{n} \delta^2 u_{i*}^{m}$$ such that $\sum_{j=1}^{2} u_{ij} + u_{i*} = 1$ for every $i$. Here $y_i$ is the standardized phenotype vector of the $i$-th subject, $u_{ij}$ is the membership degree of the $i$-th subject in group $j$, $c_j$ is the center of cluster $j$, $u_{i*}$ is the membership degree of the $i$-th subject in the noise cluster, $m$ (> 1) is the fuzzifier parameter which defines the group’s fuzziness (usually $m = 2$), and $\delta^2$ is the squared distance of each data point to the noise cluster.
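The clustering in Step 1 can be sketched as follows. This is a minimal illustration of noise clustering in the style of [28], not the authors' R implementation: the function names, the `init_centers` argument, the fixed number of iterations, and the rule of allocating each sample to its largest membership are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def memberships_for(points, centers, delta2, m):
    """Membership of every point in each good cluster, plus the noise cluster (last column)."""
    expo = 1.0 / (m - 1.0)
    rows = []
    for p in points:
        inv = [(1.0 / (squared_distance(p, c) + 1e-12)) ** expo for c in centers]
        inv.append((1.0 / delta2) ** expo)  # noise cluster sits at constant distance delta2
        total = sum(inv)
        rows.append([v / total for v in inv])  # each row sums to 1
    return rows

def noise_fuzzy_kmeans(points, init_centers=None, k=2, m=2.0, n_iter=50, seed=0):
    """Fuzzy k-means with one extra noise cluster (hypothetical helper, fixed iteration count)."""
    n, dim = len(points), len(points[0])
    # Noise distance: average squared Euclidean distance between samples.
    pair_sum = sum(squared_distance(points[i], points[j])
                   for i in range(n) for j in range(i + 1, n))
    delta2 = pair_sum / (n * (n - 1) / 2)
    if init_centers is None:
        init_centers = random.Random(seed).sample(points, k)
    centers = [list(c) for c in init_centers]
    for _ in range(n_iter):
        u = memberships_for(points, centers, delta2, m)
        for j in range(len(centers)):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            centers[j] = [sum(w * p[d] for w, p in zip(weights, points)) / total
                          for d in range(dim)]
    return centers, memberships_for(points, centers, delta2, m), delta2

def allocate(memberships):
    """Assumed allocation rule: index of the largest membership (last index = noise)."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in memberships]
```

With two tight groups of points and one distant point, the distant point receives most of its membership from the noise cluster and would be trimmed in Step 2.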
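Step 2. Trim the data and calculate the global ratio. (i) Data are trimmed by removing all the samples in the noise cluster. The remaining samples have membership degrees for each of the two groups. Denote these two groups as $G_1$ and $G_2$, and let $u_{ij}$ be the membership degree of the $i$-th subject in group $G_j$, $j = 1, 2$. (ii) Calculate the global ratio $\theta$: $$\theta = \frac{\sum_{i} u_{i1}}{\sum_{i} u_{i2}},$$ where $u_{ij}$ is the membership degree of the $i$-th subject in cluster $G_j$ and the sums run over the samples remaining after trimming.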
Step 3. Divide the samples into N folds. (i) For N-fold cross-validation (CV), split the samples randomly into N subgroups of equal size. Let N − 1 sets of samples be the training dataset and let the remaining set be the test dataset used for evaluating the model.
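The fold split in Step 3 can be sketched as below. The helper names and the seed argument are hypothetical; the sketch only assumes that samples are referred to by their integer index.

```python
import random

def split_into_folds(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds nearly equal subgroups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]

def train_test_indices(folds, test_fold):
    """Use one fold as the test set and the remaining folds as the training set."""
    test = folds[test_fold]
    train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
    return train, test

folds = split_into_folds(400, n_folds=10)
train, test = train_test_indices(folds, test_fold=0)
print(len(train), len(test))  # prints "360 40"
```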
Step 4. Calculate the local ratio. (i) To find the $m$-th order gene-gene interactions, select a set of $m$ SNPs from a pool of $p$ SNPs. Calculate the local ratio $\theta_g$ for the $g$-th genotype combination in the training set. $\theta_g$ is the ratio of the sum of membership degrees of the samples belonging to $G_1$ to that belonging to $G_2$: $$\theta_g = \frac{\sum_{i \in g} u_{i1}}{\sum_{i \in g} u_{i2}},$$ where $u_{ij}$ is the membership degree of the $i$-th subject with the $g$-th genotype combination, in cluster $G_j$. (ii) Label each genotype combination either “$H$,” if $\theta_g \geq \theta$, or “$L$,” if $\theta_g < \theta$.
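The ratio comparison in Steps 2 and 4 can be sketched as follows. The function names and the labels "H" and "L" are illustrative choices, and memberships are assumed to be given as pairs (membership in the first group, membership in the second group) for the samples that survive trimming.

```python
from collections import defaultdict

def global_ratio(memberships):
    """Ratio of summed memberships in the first group to those in the second group."""
    s1 = sum(u1 for u1, _ in memberships)
    s2 = sum(u2 for _, u2 in memberships)
    return s1 / s2

def label_genotype_cells(genotypes, memberships, theta):
    """Label each genotype combination by comparing its local ratio to theta."""
    sums = defaultdict(lambda: [0.0, 0.0])
    for g, (u1, u2) in zip(genotypes, memberships):
        sums[g][0] += u1
        sums[g][1] += u2
    labels = {}
    for g, (s1, s2) in sums.items():
        local = s1 / s2 if s2 > 0 else float("inf")
        labels[g] = "H" if local >= theta else "L"
    return labels

# Toy example: four subjects, two genotype combinations of a SNP pair.
memberships = [(0.9, 0.1), (0.8, 0.2), (0.2, 0.8), (0.1, 0.9)]
genotypes = [(0, 0), (0, 0), (1, 2), (1, 2)]
theta = global_ratio(memberships)
labels = label_genotype_cells(genotypes, memberships, theta)
# labels maps (0, 0) to "H" and (1, 2) to "L"
```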
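Step 5. Calculate the test statistic. (i) Calculate Hotelling’s $T^2$ statistic, for both training and testing datasets, to test differences of the mean vectors between the $H$ and $L$ groups: $$T^2 = \frac{n_H n_L}{n_H + n_L} (\bar{y}_H - \bar{y}_L)^\top S^{-1} (\bar{y}_H - \bar{y}_L), \qquad S = \frac{\sum_{i=1}^{n_H} (y_{Hi} - \bar{y}_H)(y_{Hi} - \bar{y}_H)^\top + \sum_{i=1}^{n_L} (y_{Li} - \bar{y}_L)(y_{Li} - \bar{y}_L)^\top}{n_H + n_L - 2},$$ where $n_H$ is the number of observations in group $H$ and $n_L$ is the number of observations in group $L$; $y_{Hi}$ is the $i$-th observation of $H$; $y_{Li}$ is the $i$-th observation of $L$; and $\bar{y}_H$ and $\bar{y}_L$ are the corresponding mean vectors. (ii) The model with the largest $T^2$ statistic in the training data is chosen as the best model. The statistics for the test data are used later, in Step 6.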
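A two-sample Hotelling's statistic with a pooled covariance matrix, as used in Step 5, can be computed as in the sketch below. It is written in plain Python for illustration; the helper names are hypothetical, and the small Gaussian-elimination solver stands in for a matrix inverse.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(len(rows[0]))]

def scatter_matrix(rows, center):
    """Sum of outer products of deviations from the center."""
    q = len(center)
    s = [[0.0] * q for _ in range(q)]
    for r in rows:
        dev = [r[d] - center[d] for d in range(q)]
        for a in range(q):
            for b in range(q):
                s[a][b] += dev[a] * dev[b]
    return s

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    q = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(q):
        pivot = max(range(col, q), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular pooled covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, q):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, q + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * q
    for r in range(q - 1, -1, -1):
        x[r] = (aug[r][q] - sum(aug[r][c] * x[c] for c in range(r + 1, q))) / aug[r][r]
    return x

def hotelling_t2(group_h, group_l):
    """Two-sample Hotelling's T^2 comparing the mean vectors of two groups."""
    n_h, n_l = len(group_h), len(group_l)
    mean_h, mean_l = mean_vector(group_h), mean_vector(group_l)
    s_h, s_l = scatter_matrix(group_h, mean_h), scatter_matrix(group_l, mean_l)
    q = len(mean_h)
    pooled = [[(s_h[a][b] + s_l[a][b]) / (n_h + n_l - 2) for b in range(q)]
              for a in range(q)]
    diff = [mean_h[d] - mean_l[d] for d in range(q)]
    weighted = solve(pooled, diff)  # S^{-1} (mean_h - mean_l)
    return (n_h * n_l / (n_h + n_l)) * sum(d * w for d, w in zip(diff, weighted))
```

For example, `hotelling_t2([(2, 0), (4, 0), (3, 1), (3, -1)], [(0, 0), (2, 0), (1, 1), (1, -1)])` evaluates to 12: the mean difference is (2, 0), the pooled covariance is (2/3) times the identity, and the sample-size factor is 16/8 = 2.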
Step 6. Find the final best model and obtain the empirical p-value. (i) Repeat Steps 4 and 5 N times, once for each fold, and count the number of times each SNP combination is selected as the best model. We call this count the cross-validation consistency (CVC). (ii) Find the best final interaction model, i.e., the one with the largest CVC. (iii) Derive the final statistic for the best model by averaging the N $T^2$ statistics for the test data and let this statistic be $T^2_{\mathrm{obs}}$. (iv) To evaluate the statistical significance of the best model, perform a permutation test and obtain the empirical p-value. Generate $B$ permuted datasets by shuffling only the phenotype vector $y_i$ across individuals while fixing the genotype vector $x_i$. This way of shuffling nullifies the association between the phenotype and genotype vectors, while preserving the correlation structures within their components. Perform the multiCMDR and calculate the $T^2$ statistic for each permuted dataset; denote these test statistics by $T^2_{(1)}, \ldots, T^2_{(B)}$. The empirical p-value is calculated as $$p = \frac{1}{B} \sum_{b=1}^{B} I\left(T^2_{(b)} \geq T^2_{\mathrm{obs}}\right),$$ where $I(A)$ is the indicator function, returning 1 if $A$ is true, otherwise 0.
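The bookkeeping in Step 6 can be sketched as follows. The function names are hypothetical, and the p-value is taken as the plain proportion of permuted statistics that reach the observed one, which is an assumption about the exact form used.

```python
import random
from collections import Counter

def cross_validation_consistency(best_models_per_fold):
    """Return the model chosen most often across folds and its count (the CVC)."""
    counts = Counter(best_models_per_fold)
    model, cvc = counts.most_common(1)[0]
    return model, cvc

def permute_phenotypes(phenotypes, seed=0):
    """Shuffle whole phenotype vectors across individuals; genotypes stay fixed elsewhere."""
    shuffled = list(phenotypes)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def empirical_p_value(observed, permuted_statistics):
    """Proportion of permuted statistics at least as large as the observed one."""
    hits = sum(1 for t in permuted_statistics if t >= observed)
    return hits / len(permuted_statistics)

# Toy example: the same SNP pair wins in 7 of 10 folds.
best = [("SNP1", "SNP2")] * 7 + [("SNP3", "SNP4")] * 3
model, cvc = cross_validation_consistency(best)  # (("SNP1", "SNP2"), 7)
p = empirical_p_value(5.0, [1.0, 2.0, 6.0, 5.0, 3.0])  # 2 of 5 permuted values reach 5.0
```

Shuffling whole phenotype vectors, rather than each phenotype separately, is what keeps the correlation among phenotypes intact in the permuted data.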
3. Results and Discussion
3.1. Simulation Analysis
In this section, we conducted simulations to compare the performance of the proposed multiCMDR method with the multiQMDR and univariate QMDR methods. We also compared the performance of two variants of multiCMDR. One variant is a nontrimmed version of multiCMDR; that is, the noise cluster is not generated in the fuzzy clustering step. The other variant uses k-means clustering, without considering membership scores. For the multiQMDR method, the First Principal Component (FPC) was used to classify each cell into high- or low-risk groups, as previously described [5]. For the univariate approach, QMDR was performed for each phenotype separately. All of these methods were compared in terms of their hit ratios, that is, the proportion of simulated datasets in which the true causal SNP pair is identified as the best model.
We generated phenotypes from a multivariate normal distribution and a multivariate gamma distribution. We used 70 different penetrance functions that define a probabilistic relationship with the disease-causal interaction. The models consisted of 7 different heritability values (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4) and 2 different minor allele frequencies (MAFs, 0.2 and 0.4). A total of 5 models for each of the 14 heritability-MAF combinations were considered. Thus, a total of 70 models were generated. The details of the 70 penetrance functions are given in [29]. For each of the 70 models, 100 datasets were generated. For each dataset, the sample size was 400, and we considered 20 SNPs and 2 continuous phenotypes. SNP1 and SNP2 were the disease-causal interacting SNPs. We used 10-fold cross-validation to determine the best overall model.
3.1.1. Multivariate Normal Distribution
For the multivariate normally distributed case, two continuous phenotype values, $(y_1, y_2)$, associated with SNP1 and SNP2, were generated from a bivariate normal distribution whose parameters depend on $f_{ij}$, the element in the $i$-th row and $j$-th column of a penetrance function representing the two functional interacting SNPs, and on the correlation $\rho$ between the two phenotypes. We considered 3 different values of $\rho$: $\rho$ = 0, 0.25, 0.5. We used R software to generate the simulation data. For multivariate normally distributed cases, we used the mvrnorm() function in the MASS package in R.
The hit ratios for each heritability value are reported in Figure 2. In the bivariate normal distribution case, all the multivariate methods were generally more powerful than the univariate QMDR methods. As the correlation increased, however, the difference between multivariate and univariate methods decreased. All multivariate methods showed similar performance. In the case of zero correlation, multiQMDR showed slightly better performance than multiCMDR. The hit ratios of multiCMDR with trimming were similar to those of multiCMDR without trimming. That is, there was no effect of trimming outliers in multiCMDR for the bivariate normal distribution case. The lower the correlation, the higher the hit ratio, when the values of heritability were 0.05, 0.1, and 0.2. This is because the lower the correlation, the more unique information each variable carries. Similarly, when the correlation was high, the hit ratios of the multivariate and univariate methods were similar.
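Drawing correlated normal phenotypes can be sketched as below. The simulations in the text used mvrnorm() in R; this Python sketch is only an analogue, and the unit variances, the `mean` argument, and the function names are assumptions made for illustration rather than the exact simulation settings.

```python
import math
import random

def simulate_bivariate_normal(n, mean, rho, seed=0):
    """Draw n pairs with unit variances (assumed) and correlation rho."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho ** 2)
    samples = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Cholesky factor of [[1, rho], [rho, 1]] applied to independent normals.
        samples.append((mean[0] + z1, mean[1] + rho * z1 + scale * z2))
    return samples

def pearson_correlation(pairs):
    """Sample Pearson correlation of a list of (a, b) pairs."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    s12 = sum((a - m1) * (b - m2) for a, b in pairs)
    s11 = sum((a - m1) ** 2 for a, _ in pairs)
    s22 = sum((b - m2) ** 2 for _, b in pairs)
    return s12 / math.sqrt(s11 * s22)

for rho in (0.0, 0.25, 0.5):
    draws = simulate_bivariate_normal(400, mean=(0.0, 0.0), rho=rho, seed=1)
    print(rho, round(pearson_correlation(draws), 2))
```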
3.1.2. Multivariate Gamma Distribution
For the skewed distribution, we generated a bivariate gamma distribution using a Gaussian copula [30]. In the Gaussian copula, the correlation matrix is responsible for the dependence. We used the same correlation structure as for the bivariate normal case. When the marginal distributions are continuous, a bivariate distribution can be defined by a density of the following form: $$f(y_1, y_2) = c\left(F_1(y_1), F_2(y_2)\right) f_1(y_1) f_2(y_2),$$ where $c$ represents the copula density, $f_1$, $f_2$ are the marginal probability density functions, and $f$ is the joint density function. The Gaussian copula density with correlation matrix $R$ is then defined as follows: $$c(u_1, u_2) = |R|^{-1/2} \exp\left(-\tfrac{1}{2} z^\top (R^{-1} - I) z\right),$$ where $z = (\Phi^{-1}(u_1), \Phi^{-1}(u_2))^\top$, $u_k = F_k(y_k)$, and $\Phi^{-1}$ is the inverse cumulative distribution function of the standard normal distribution; $F_1$, $F_2$ are the marginal cumulative distribution functions. Both marginal distributions are gamma distributions. We considered 3 different values of $\rho$: $\rho$ = 0, 0.25, 0.5. For multivariate gamma distributed cases, we used the mvdc(), normalCopula(), and rMvdc() functions in the copula package in R.
In Figure 2, we observed that the proposed multiCMDR outperformed QMDR and multiQMDR for all ranges of heritability in the bivariate gamma distribution case. Also, multiCMDR without trimming performed better than multiQMDR. For the bivariate gamma distribution, the lower the correlation, the higher the overall hit ratio. The difference in hit ratios between multiCMDR and the other methods was greatest when the heritability was 0.1. As the correlation increases, the differences between the hit ratios of the multivariate methods, except multiCMDR, decrease.
To sum up, the power of the proposed multiCMDR is similar to that of multiQMDR for symmetric distributions, while it outperformed multiQMDR for the skewed distribution. Moreover, the powers of the two other versions of multiCMDR were also slightly better than that of multiQMDR in skewed phenotype distributions. For all situations, multivariate methods performed better than univariate methods. Results for each combination of the two minor allele frequency (MAF) values and 5 models are presented in the supplemental materials (Supplemental Figures 1–6).
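Sampling through a Gaussian copula can be sketched as follows: draw correlated standard normals, map them to uniforms with the normal distribution function, then apply the inverse of each marginal distribution function. This is a simplified Python analogue of the R workflow, not the simulation code itself. As a loudly stated assumption, the gamma shape is restricted to an integer so that the distribution function has a closed form, and the default shape and scale values are illustrative only.

```python
import math
import random
from statistics import NormalDist

def gamma_cdf_integer_shape(x, shape, scale):
    """CDF of a gamma distribution with integer shape (closed-form series)."""
    if x <= 0:
        return 0.0
    t = x / scale
    term, total = 1.0, 1.0
    for n in range(1, shape):
        term *= t / n
        total += term
    return 1.0 - math.exp(-t) * total

def gamma_quantile_integer_shape(u, shape, scale):
    """Invert the CDF by bracketing and bisection."""
    lo, hi = 0.0, scale
    while gamma_cdf_integer_shape(hi, shape, scale) < u:
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if gamma_cdf_integer_shape(mid, shape, scale) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def sample_copula_gamma(n, rho, shape=2, scale=1.0, seed=0):
    """Draw n pairs with gamma marginals coupled by a Gaussian copula."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    root = math.sqrt(1.0 - rho ** 2)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + root * rng.gauss(0.0, 1.0)
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)  # uniform marginals
        pairs.append((gamma_quantile_integer_shape(u1, shape, scale),
                      gamma_quantile_integer_shape(u2, shape, scale)))
    return pairs
```

The resulting phenotypes are right-skewed (their mean exceeds their median), which is the setting in which the text reports the largest gains for multiCMDR.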
3.1.3. Empirical False Positive Rate
We also computed the empirical false positive rate. To generate null data, we permuted phenotypes over individuals for each case. With 20 SNPs there are $\binom{20}{2} = 190$ possible SNP pairs, so the selection rate of each SNP pair in null data is $1/190 \approx 0.0053$. To compute the empirical false positive rate, we counted the number of times a specific SNP combination, SNP1 and SNP2, was detected as the best model. Overall, the empirical false positive rates of each method are close to the expected value of 0.0053. Results for the empirical false positive rates of each method are presented in the supplemental materials (Supplemental Tables 1–6).
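The expected value of 0.0053 follows from counting the candidate SNP pairs among the 20 simulated SNPs. A short check, with a hypothetical helper name:

```python
import math

def null_selection_rate(n_snps, order=2):
    """Chance that one specific SNP combination is picked when all are equally likely."""
    return 1.0 / math.comb(n_snps, order)

rate = null_selection_rate(20)
print(f"{math.comb(20, 2)} candidate pairs, null selection rate = {rate:.4f}")
# prints "190 candidate pairs, null selection rate = 0.0053"
```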
3.2. Real Biological Data Analysis
For real-life data analysis, data on three lipid-related phenotypes, retrieved from the Korean Association Resource (KARE) project [31], were considered to evaluate the proposed multiCMDR. The three lipid-related phenotypes were high-density lipoprotein cholesterol (HDL), low-density lipoprotein cholesterol (LDL), and triglyceride (TG). After removing observations with at least one missing phenotype value, 8,581 samples remained. The largest absolute value of correlation among the three phenotypes was 0.39 (Figure 3). Among 344,596 SNPs, we used the 324 SNPs selected in [5] for this analysis.
We then applied the proposed multiCMDR to search for the best second-order interaction model, again using 10-fold CV. Table 1 displays the best first- and second-order SNP combinations identified by the proposed multiCMDR. In addition to the best model, which has the highest CVC, Table 1 shows other candidate models selected as the best model in any of the 10 training datasets. Several of the best SNP combinations found in this study, including those described in Table 1, were previously reported in [5].

For the first-order analysis, rs11066280 was selected as the best model with the highest CVC. rs11066280 was identified as significantly associated with metabolism, TGs, and HDLs [5, 15] and was selected as the best lipid-related SNP in a first-order analysis from a univariate analysis of HDL using QMDR [5]. The second best model, rs10503669, has been reported to be associated with LPL [16]. The third best model, rs2074356, is associated with HDL [1]. All p-values selected by the multiCMDR method were <.
For the second-order analysis, the proposed multiCMDR identified the best combination of two SNPs, rs11216126 and rs4244457, where rs11216126 is reported to be related to HDL [17]. rs4244457 (LPL) occurs in the gene for the key enzyme responsible for the lipolytic processing of TG-rich lipoproteins [5]. Note that rs4244457 was selected as the most lipid-related SNP in a first- and second-order analysis, using a multiQMDR method for testing association with LDL [5]. Moreover, rs11600380, rs10503669, and rs16940212 were previously reported to relate to TG, LDL, and HDL, respectively [16, 18, 19]. Each of those three SNPs was also reported in previous studies, but as far as we know, no second-order interactions among them have been reported.
4. Discussion
For GGI analysis of multiple quantitative traits, we proposed multiCMDR. Analyzing correlated multivariate phenotypes was shown to have higher power to detect susceptible genes and GGIs, by using more information from the data [32]. The main difference between multiQMDR and multiCMDR lies in how groups are defined for each genotype-combination cell. MultiQMDR uses summary scores obtained by principal component analysis to classify high-risk and low-risk groups. The observations of each cell are assigned to the high-risk group if the local mean is greater than or equal to the global mean; otherwise the observations are assigned to the low-risk group. On the other hand, multiCMDR divides groups using clustering. By comparing the global and local ratios, calculated from the membership degrees obtained through fuzzy k-means clustering, the observations of each cell are assigned to group $H$ if the local ratio is greater than or equal to the global ratio; otherwise the observations are assigned to group $L$.
The proposed multiCMDR was shown to be less sensitive to outliers and nonsymmetric distributions than other methods. 10-fold cross-validation and Hotelling’s $T^2$ statistic were used to select the best model. In the simulation study, we showed that the proposed multiCMDR could be used effectively in the case of a bivariate gamma distribution. While the proposed method did not seem to have an advantage in computing time over the multiQMDR method, its power was higher for the skewed distribution. In real-life data analysis, multiCMDR detected the best SNPs and 2-way interactions for lipid-related traits (HDL, TG, and LDL). The best SNPs selected by our method have been reported to be associated with similar traits [1, 5, 15–19]. While our proposed method performs well for nonsymmetric distributions, it is always worth trying appropriate transformations to make nonsymmetric distributions symmetric.
In terms of computation time, multiQMDR was slightly faster than multiCMDR. Using an AMD Ryzen 2700x desktop machine with 16 GB RAM, multiQMDR took 145.8841 seconds on average (100 repetitions) to conduct the real data analysis for the first-order interaction, whereas multiCMDR took 162.7906 seconds on average. For a simulation dataset with a sample size of 400 and 20 SNPs, multiQMDR took 17.3334 seconds on average to conduct the second-order interaction analysis, while multiCMDR took 19.3947 seconds on average. That is, when the number of SNPs is small, the difference in computation time is small. An R program to conduct multiCMDR is available at our GitHub repository (https://github.com/stat17hb/MultiCMDR).
5. Conclusion
For the analysis of GGIs associated with multiple quantitative traits, we proposed a new extension of the MDR algorithm that includes clustering. Using fuzzy k-means clustering, we divided samples into two groups and trimmed the outliers in the noise cluster. Fuzzy k-means clustering can capture numerous attributes of multivariate data. Using values calculated from the clusters to set thresholds for assigning observations to groups, as multiCMDR does, is therefore a productive approach. Unlike k-means clustering, where each observation is assigned to only one cluster, fuzzy k-means clustering provides each observation with a degree of membership in each cluster. Fuzzy k-means clustering is especially useful when the cluster boundary is not clear; it also allows outliers to be collected into a noise cluster and reflects the individual membership degrees of elements in the same cluster. We expect that multiCMDR will improve the identification of gene-gene interactions associated with numerous multifactorial human pathologies.
Data Availability
The Korea Association Resource (KARE) project data will be publicly distributed by the Distribution Desk of Korea Biobank Network (https://koreabiobank.re.kr/). The data request should be made directly to Distribution Desk of Korea Biobank Network. Any inquiries should be sent to admin@koreabiobank.re.kr.
Disclosure
This paper was presented at the 2018 annual meeting of the Western North American Region of the International Biometric Society (WNAR), Edmonton, Canada. Our earlier work on univariate CLMDR was presented at the 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Kansas City, USA.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF2017R1A2B4011504, 2013M3A9C4078158). This research was also supported by the Bio & Medical Technology Development Program of the NRF funded by the Korean government, MSIP (No. 2016M3A9B694241).
Supplementary Materials
There were 10 combinations of 2 minor allele frequencies (MAFs) and 5 models for each simulation setup. The MAF was 0.2 for models 1 to 5 and 0.4 for models 6 to 10. We also considered correlation values of 0, 0.25, and 0.5. The methods are labeled as follows: CMDR (multiCMDR), MCMDR2 (multiCMDR without trimming), MCMDR3 (multiCMDR without membership score), MQMDR (multiQMDR), QMDR Y1 (QMDR with $y_1$), and QMDR Y2 (QMDR with $y_2$).
References
[1] S. Basu, Y. Zhang, D. Ray, M. B. Miller, W. G. Iacono, and M. McGue, “A Rapid Gene-Based Genome-Wide Association Test with Multivariate Traits,” Human Heredity, vol. 76, no. 2, pp. 53–63, 2013.
[2] H. J. Cordell, “Detecting gene-gene interactions that underlie human diseases,” Nature Reviews Genetics, vol. 10, no. 6, pp. 392–404, 2009.
[3] M. D. Ritchie, L. W. Hahn, N. Roodi et al., “Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer,” American Journal of Human Genetics, vol. 69, no. 1, pp. 138–147, 2001.
[4] L. W. Hahn, M. D. Ritchie, and J. H. Moore, “Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions,” Bioinformatics, vol. 19, no. 3, pp. 376–382, 2003.
[5] W. Yu, M. Kwon, and T. Park, “Multivariate Quantitative Multifactor Dimensionality Reduction for Detecting Gene-Gene Interactions,” Human Heredity, vol. 79, no. 3-4, pp. 168–181, 2015.
[6] C. Yang, L. Chuang, and Y. Lin, “Multiobjective differential evolution-based multifactor dimensionality reduction for detecting gene-gene interactions,” 2017.
[7] C. Yang, H. Yang, and L. Chuang, “PBMDR: A particle swarm optimization-based multifactor dimensionality reduction for the detection of multilocus interactions,” Journal of Theoretical Biology, vol. 461, pp. 68–75, 2019.
[8] D. Gola, J. M. Mahachie John, K. van Steen, and I. R. König, “A roadmap to multifactor dimensionality reduction methods,” Briefings in Bioinformatics, vol. 17, no. 2, pp. 293–308, 2016.
[9] M. Germain, N. Saut, N. Greliche et al., “Genetics of venous thrombosis: insights from a new genome wide association study,” PLoS ONE, vol. 6, no. 9, 2011.
[10] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, “Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions,” Bioinformatics, vol. 23, no. 1, pp. 71–76, 2007.
[11] S. Yeoun Lee, Y. Chung, R. C. Elston, Y. Kim, and T. Park, “Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions,” Bioinformatics, vol. 23, no. 19, pp. 2589–2595, 2007.
[12] X.-Y. Lou, G.-B. Chen, L. Yan et al., “A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence,” American Journal of Human Genetics, vol. 80, no. 6, pp. 1125–1137, 2007.
[13] Y. Lee, H. Kim, T. Park, and M. Park, “Cluster-based multifactor dimensionality reduction method to identify gene-gene interactions for quantitative traits in genome-wide studies,” in Proceedings of the 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM '17), pp. 1772–1776, 2017.
[14] Y. Lee, H. Kim, T. Park, and M. Park, “Gene-gene interaction analysis for quantitative trait using cluster-based multifactor dimensionality reduction method,” International Journal of Data Mining and Bioinformatics, vol. 20, no. 1, p. 1, 2018.
[15] N. Kato, F. Takeuchi, Y. Tabara et al., “Meta-analysis of genome-wide association studies identifies common variants associated with blood pressure variation in east Asians,” Nature Genetics, vol. 43, no. 6, pp. 531–538, 2011.
[16] C. J. Willer, S. Sanna, A. U. Jackson et al., “Newly identified loci that influence lipid concentrations and risk of coronary artery disease,” Nature Genetics, vol. 40, no. 2, pp. 161–169, 2008.
[17] Y. J. Kim, M. J. Go, C. Hu et al., “Large-scale genome-wide association studies in East Asians identify new genetic loci influencing metabolic traits,” Nature Genetics, vol. 43, no. 10, pp. 990–995, 2011.
[18] F. Asselbergs, Y. Guo, E. van Iperen et al., “Large-Scale Gene-Centric Meta-analysis across 32 Studies Identifies Multiple Lipid Loci,” American Journal of Human Genetics, vol. 91, no. 5, pp. 823–838, 2012.
[19] M. J. Go, J. Hwang, D. Kim et al., “Effect of Genetic Predisposition on Blood Lipid Traits,” Genomics & Informatics, vol. 10, no. 2, pp. 99–105, 2012.
[20] J. C. Souto, L. Almasy, M. Borrell et al., “Genetic susceptibility to thrombosis and its relationship to physiological risk factors: the GAIT study. Genetic Analysis of Idiopathic Thrombophilia,” American Journal of Human Genetics, vol. 67, no. 6, pp. 1452–1459, 2000.
[21] S. Oh, I. Huh, S. Y. Lee, and T. Park, “Analysis of multiple related phenotypes in genome-wide association studies,” Journal of Bioinformatics and Computational Biology, vol. 14, no. 05, p. 1644005, 2016.
[22] H. Xu, X. Sun, T. Qi et al., “Multivariate Dimensionality Reduction Approaches to Identify Gene-Gene and Gene-Environment Interactions Underlying Multiple Complex Traits,” PLoS ONE, vol. 9, no. 9, pp. 1–12, 2014.
[23] H. Jung, S. Leem, S. Lee, and T. Park, “A novel fuzzy set based multifactor dimensionality reduction method for detecting gene–gene interaction,” Computational Biology and Chemistry, vol. 65, pp. 193–202, 2016.
[24] S. Leem and T. Park, “An empirical fuzzy multifactor dimensionality reduction method for detecting gene-gene interactions,” BMC Genomics, vol. 18, 2, pp. 1–12, 2017.
[25] H. Jung, S. Leem, and T. Park, “Fuzzy set-based generalized multifactor dimensionality reduction analysis of gene-gene interactions,” BMC Medical Genomics, vol. 11, no. S2, pp. 11–20, 2018.
[26] S. Leem and T. Park, “EFMDR-Fast: An Application of Empirical Fuzzy Multifactor Dimensionality Reduction for Fast Execution,” Genomics & Informatics, vol. 16, no. 4, p. e37, 2018.
[27] C.-H. Yang, L.-Y. Chuang, and Y.-D. Lin, “Epistasis Analysis using an Improved Fuzzy C-means-based Entropy Approach,” IEEE Transactions on Fuzzy Systems, vol. PP, no. L, p. 1, 2019.
[28] R. N. Davé, “Characterization and detection of noise in clustering,” Pattern Recognition Letters, vol. 12, no. 11, pp. 657–664, 1991.
[29] D. R. Velez, B. C. White, A. A. Motsinger et al., “A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction,” Genetic Epidemiology, vol. 30, no. 8, pp. 718–727, 2007.
[30] Y. Stitou, N. Lasmar, and Y. Berthoumieu, “Copulas based multivariate Gamma modeling for texture classification,” in Proceedings of the IEEE Int. Conf. Data Min, pp. 1045–1048, 2009.
[31] Y. S. Cho, M. J. Go, Y. J. Kim et al., “A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits,” Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.
[32] J. Choi and T. Park, “Multivariate generalized multifactor dimensionality reduction to detect gene-gene interactions,” BMC Systems Biology, vol. 7, no. Suppl 6, pp. 1–11, 2013.
Copyright
Copyright © 2019 Hyein Kim et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.