Abstract
Detecting SNP-SNP interactions associated with disease is an important task in genome-wide association studies (GWAS). Owing to the heavy computational burden and the diversity of disease models, existing methods suffer from low detection power and long running time. To tackle these drawbacks, a fast self-adaptive memetic algorithm (SAMA) is proposed in this paper. In this method, the crossover, mutation, and selection operators of the standard memetic algorithm are improved to adapt SAMA to the detection of SNP-SNP interactions associated with disease. Furthermore, a self-adaptive local search algorithm is introduced to enhance the detection power of the proposed method. SAMA is evaluated on a variety of simulated datasets and a real-world biological dataset, and it is compared with four recently developed methods based on evolutionary algorithms (FHSA-SED, AntEpiSeeker, IEACO, and DESeeker). The results of extensive experiments show that SAMA outperforms the four compared methods in terms of detection power and running time.
1. Introduction
The development of high-throughput sequencing technology makes it possible to analyze single-nucleotide polymorphisms (SNPs) from thousands of individuals [1, 2]. By detecting associations between SNPs and a disease, genome-wide association studies (GWAS) play a vital role in recognizing the causes of diseases [3–5]. GWAS has been successfully applied to identify numerous SNPs associated with diverse diseases, such as about 30 loci associated with schizophrenia [6–8]. However, due to the large amount of computation imposed by the high-dimensional search space, it is difficult to measure the association between SNP-SNP interactions and disease in genome-wide data [9–11].
In the past few years, many methods have been proposed for detecting two-locus disease models. These algorithms can be categorized into exhaustive search, stochastic search, heuristic search, and swarm intelligence optimization algorithms [12]. Exhaustive search evaluates the degree of correlation between every possible SNP-SNP combination and the disease [13, 14], but it is often computationally unaffordable for datasets with a very large number of SNPs.
Stochastic search uses probabilistic methods to find the optimal solution [15, 16]. Heuristic search is an approximate search that speeds up the process by reducing the search space [17, 18]. However, neither kind of search can guarantee finding the optimal solution every time.
In recent years, swarm intelligence optimization algorithms, which are inspired by natural phenomena and biological systems, have attracted considerable attention for the detection of disease-associated SNP-SNP interactions [19–21]. For instance, FHSA-SED [22] combines the harmony search algorithm with two scoring functions for the detection of SNP-SNP interactions. AntEpiSeeker [23] detects disease-associated SNP-SNP interactions by using a two-stage ant colony optimization (ACO) [24, 25]. IEACO [26] automatically adjusts path selection strategies using information entropy to detect SNP-SNP interactions. DESeeker [27] uses a two-stage differential evolution (DE) [28, 29] algorithm to identify SNP-SNP interactions. However, all of these methods still suffer from low detection power.
One promising approach for tackling the drawbacks mentioned above is to use a fast local search within an evolutionary algorithm. Hybridization of genetic algorithms (GAs) with local search (LS) has already been studied in various optimization problems [30–32]. Such a hybrid algorithm is often called a memetic algorithm (MA) [33]. We therefore propose a fast self-adaptive memetic algorithm (SAMA) to detect two-locus SNP-SNP interactions associated with disease. In the SAMA algorithm, we improve the crossover, mutation, and selection operations of MA so that they are better suited to detecting two-locus SNP-SNP interactions. Moreover, we incorporate a self-adaptive local search into the proposed algorithm to avoid premature convergence. We compare our algorithm with state-of-the-art methods and conduct experiments on a wide range of simulated datasets and a real-world biological dataset. The results show that the proposed algorithm has improved power in detecting correct SNP-SNP interactions under different disease models.
The paper is organized as follows. In Section 2, we introduce the problem definition of two-locus SNP-SNP interactions associated with disease and propose the SAMA algorithm. In Section 3, we describe the experiments carried out to determine the detection power of our method. Finally, we present the conclusion in Section 4.
2. Methods
2.1. Problem Definition
A set of SNPs is represented by $S = \{s_1, s_2, \ldots, s_n\}$, where $s_i$ is an SNP and $n$ is the number of SNPs. For detecting two-locus disease models, there are $n(n-1)/2$ combinations that can be selected. Each SNP takes the value 0, 1, or 2, representing the homozygous major genotype, the heterozygous genotype, and the homozygous minor genotype, respectively. A dataset contains samples (cases and controls), and each sample has a set of SNPs. If the genotype distribution of a two-locus SNP-SNP interaction is significantly different between cases and controls, the interaction may be associated with an increased risk of the disease.
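As a rough illustration of this definition, the following sketch counts the two-locus combinations and tabulates the joint genotype distribution of one SNP pair in cases and controls. The function names and the toy data are assumptions made for illustration only; they are not part of the original method.

```python
from collections import Counter

def num_pairs(n):
    """Number of unordered two-locus combinations among n SNPs."""
    return n * (n - 1) // 2

def genotype_table(snp_a, snp_b, labels):
    """Count joint genotypes (0/1/2 x 0/1/2) for controls (label 0) and cases (label 1).

    Returns a dict mapping (genotype_a, genotype_b) -> [control_count, case_count].
    """
    counts = Counter(zip(snp_a, snp_b, labels))
    return {
        (a, b): [counts[(a, b, 0)], counts[(a, b, 1)]]
        for a in range(3)
        for b in range(3)
    }

# Hypothetical toy data: six samples, two SNPs, label 1 = case, 0 = control.
snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [0, 0, 1, 1, 2, 2]
labels = [0, 0, 0, 1, 1, 1]
table = genotype_table(snp_a, snp_b, labels)

print(num_pairs(2000))  # 1999000
print(table[(0, 0)])    # [1, 0]
```

Even for 2000 SNPs there are close to two million candidate pairs, which is why an exhaustive scan becomes expensive at genome-wide scale.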
2.2. The SAMA Algorithm
It is a time-consuming task to detect SNP-SNP interactions associated with disease if all possible two-locus interactions among hundreds of thousands of SNPs are considered at a genome-wide scale. In this paper, a fast self-adaptive memetic algorithm (SAMA) is proposed to enhance the detection power for two-locus SNP-SNP interactions in an efficient way.
The memetic algorithm (MA) [33] is inspired by models of natural systems and population evolution. By combining evolutionary algorithms with local search, it provides a local improvement opportunity for the individuals in a genetic search. Figure 1 shows the basic structure of the MA algorithm. MA consists of two parts: genetic search and local search, where the genetic search part includes crossover, mutation, and selection. The SAMA algorithm follows the basic framework in Figure 1 to detect two-locus SNP-SNP interactions associated with disease, and the process is shown in Algorithm 1.
2.2.1. Initialization
The SAMA algorithm randomly generates an initial population of individuals. An individual is expressed as $X_i = (x_{i,1}, x_{i,2})$, where $x_{i,1}$ and $x_{i,2}$ are SNPs, and the individual is generated by $x_{i,j} = \lceil r \cdot n \rceil$, where $\lceil \cdot \rceil$ is an upward rounding operation, $r$ is a random number between 0 and 1, and $n$ is the number of SNPs in a dataset. After initialization, SAMA finds the current optimal solution, that is, the individual with the best value of the fitness function. In SAMA, the test is used as the fitness function to measure the association between two-locus SNP-SNP interactions and the disease.
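The initialization step can be sketched as follows. The index rule follows the upward-rounding description above. The fitness function shown here is a Pearson chi-square statistic on the 9 x 2 joint-genotype table, which is only an assumed stand-in for the test used in the paper. The requirement that the two SNPs of an individual differ, the function names, and the data layout are also assumptions.

```python
import math
import random

def init_individual(n, rng):
    """Draw two distinct 1-based SNP indices via ceil(r * n), r uniform in [0, 1)."""
    def draw():
        # max(1, ...) guards the edge case r == 0.0, where ceil(0) would give 0.
        return max(1, math.ceil(rng.random() * n))
    first = draw()
    second = draw()
    while second == first:  # assumption: an individual holds two different SNPs
        second = draw()
    return (first, second)

def chi_square_fitness(individual, genotypes, labels):
    """Pearson chi-square on the 9 x 2 joint-genotype table of one SNP pair.

    genotypes[s][j] is the genotype (0, 1, or 2) of sample s at SNP j + 1;
    labels[s] is 1 for a case and 0 for a control.
    """
    i, j = individual
    table = [[0, 0] for _ in range(9)]
    for row, label in zip(genotypes, labels):
        table[3 * row[i - 1] + row[j - 1]][label] += 1
    total = len(labels)
    col = [sum(r[0] for r in table), sum(r[1] for r in table)]
    stat = 0.0
    for r in table:
        row_sum = r[0] + r[1]
        if row_sum == 0:
            continue  # skip genotype combinations that never occur
        for k in range(2):
            expected = row_sum * col[k] / total
            if expected > 0:
                stat += (r[k] - expected) ** 2 / expected
    return stat

rng = random.Random(0)
population = [init_individual(100, rng) for _ in range(5)]
```

A larger statistic indicates a stronger difference between the case and control genotype distributions, so it can serve as a fitness value to be maximized.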
2.2.2. Hybrid Crossover (HC)
The crossover operator, a fundamental genetic search operator, takes advantage of the information available in the search space. In the SAMA algorithm, we use a hybrid crossover (HC) to cross two individuals. HC can be regarded as a hybrid between the current best individual and the individuals in the current iteration. The pseudocode of HC is shown in Algorithm 2.


In the algorithm, the current best individual $X_{best}$ and the individual $X_i$ in the current iteration are selected as two parents. If the random number $r_1$ between 0 and 1 is less than the crossover probability $c_1$, the first SNP in $X_i$ is replaced by the first SNP in $X_{best}$. If the random number $r_2$ is less than the crossover probability $c_2$, the second SNP in $X_i$ is replaced by the second SNP in $X_{best}$. If both conditions are satisfied at the same time, $X_i$ is replaced by $X_{best}$.
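A minimal sketch of the hybrid crossover described in this subsection is given below. The function name, the parameter names `c1` and `c2`, and the tuple representation of an individual are assumptions for illustration, not the authors' implementation.

```python
import random

def hybrid_crossover(individual, best, c1, c2, rng):
    """Copy SNPs from the current best individual into `individual`.

    Each position is replaced independently: position 0 with probability c1,
    position 1 with probability c2. If both replacements fire, the child
    equals `best`.
    """
    child = list(individual)
    if rng.random() < c1:
        child[0] = best[0]
    if rng.random() < c2:
        child[1] = best[1]
    return tuple(child)

rng = random.Random(0)
child = hybrid_crossover((12, 57), (3, 88), c1=0.5, c2=0.5, rng=rng)
```

Because each position is decided by its own random draw, an individual may inherit none, one, or both SNPs of the current best individual.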
2.2.3. Distributed Breeder Mutation (DBM)
The mutation operator is used to randomly create diversity among the individuals in a population. We use a mutation called distributed breeder mutation (DBM) in the SAMA algorithm. DBM, inspired by the breeder genetic algorithm proposed by Mühlenbein and Schlierkamp-Voosen [34], is a robust global search based on a solid theory. The mutated individual is calculated by the following equation:
$$x'_{i,j} = x_{i,j} \pm rang \cdot \delta, \quad j = 1, 2,$$
where $rang$ is the mutation range, $\delta$ is calculated from a distribution which prefers small values, and the "+" or "−" is chosen with a probability of 0.5. Thus, $x_{i,1}$ is mutated in the interval between $x_{i,1} - rang$ and $x_{i,1} + rang$, and $x_{i,2}$ is mutated in the interval between $x_{i,2} - rang$ and $x_{i,2} + rang$.
If the mutated individual is outside the specified range $[1, n]$, where $n$ is the number of SNPs, it will be reinitialized. $\delta$ is computed according to the following equation:
$$\delta = \sum_{k=0}^{15} \alpha_k 2^{-k}.$$
Each $\alpha_k$ is set to 0 before the mutation operation. Then, each $\alpha_k$ is mutated to 1 with a probability of 1/16. The minimum step size is produced with a precision of $rang \cdot 2^{-15}$. Algorithm 3 gives the execution process of DBM.
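The sketch below illustrates a breeder-style mutation in the form introduced by the breeder genetic algorithm cited above [34]: a step of size `rang * delta` is added or subtracted, where `delta` is a sum of negative powers of two whose bits are each switched on with probability 1/16. The value of `rang`, the rounding to an integer SNP index, and the choice to redraw only the out-of-range position are assumptions made for this example.

```python
import random

def bga_delta(rng, bits=16):
    """delta = sum(alpha_k * 2**-k); each alpha_k is 1 with probability 1/bits."""
    return sum(2.0 ** -k for k in range(bits) if rng.random() < 1.0 / bits)

def breeder_mutation(individual, n, rang, rng):
    """Shift each SNP index by +/- rang * delta; redraw indices that leave [1, n]."""
    mutated = []
    for x in individual:
        sign = 1 if rng.random() < 0.5 else -1
        candidate = int(round(x + sign * rang * bga_delta(rng)))
        if candidate < 1 or candidate > n:
            candidate = rng.randint(1, n)  # assumed reinitialization rule
        mutated.append(candidate)
    return tuple(mutated)

rng = random.Random(0)
mutant = breeder_mutation((40, 75), n=100, rang=10, rng=rng)
```

Since every bit is switched on with a low probability, small values of `delta` are most common, so most mutations stay close to the original SNP index while occasional larger jumps remain possible.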

2.2.4. Self-Adaptive Local Search (SLS)
Local search (LS) is a simple iterative method for finding approximate solutions. If a candidate solution has better or equal fitness, LS moves the search from the current solution to the candidate solution. If LS is applied to every solution many times, the running time becomes very long because the additional function evaluations required for LS are expensive. Thus, a self-adaptive LS (SLS) is introduced, which uses a probability to reduce the number of times local search is applied. The probability that each individual is selected to apply the SLS operation is , and the is defined by where is the switch parameter, and is an individual after HC and DBM. The initial of each individual is 1; hence, each individual will be selected at least once for SLS. If the fitness value of the individual is improved, the probability that is selected is still 1. Otherwise, is changed to . If the fitness value of is not improved after being selected times, this value is . The pseudocode of SLS is shown in Algorithm 4.
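The self-adaptive idea can be sketched as follows: every individual carries its own selection probability, which starts at 1, stays at 1 while local search keeps improving the individual, and shrinks after an unsuccessful attempt. The multiplicative `decay` factor used here is an assumed stand-in for the switch parameter, and `neighbor`, `max_steps`, and the function name are hypothetical as well.

```python
import random

def self_adaptive_local_search(individual, prob, fitness, neighbor, rng,
                               decay=0.5, max_steps=10):
    """Run local search on `individual` with probability `prob`.

    Returns (individual, new_prob). The probability is reset to 1.0 after an
    improvement and multiplied by `decay` (an assumption) after a failure.
    """
    if rng.random() >= prob:
        return individual, prob  # not selected in this iteration
    start_fit = fitness(individual)
    current, current_fit = individual, start_fit
    for _ in range(max_steps):
        candidate = neighbor(current, rng)
        candidate_fit = fitness(candidate)
        if candidate_fit >= current_fit:  # accept better-or-equal candidates
            current, current_fit = candidate, candidate_fit
    improved = current_fit > start_fit
    return current, (1.0 if improved else prob * decay)

# Hypothetical toy problem: fitness peaks when the first SNP index equals 10.
toy_fitness = lambda ind: -abs(ind[0] - 10)
toy_neighbor = lambda ind, rng: (ind[0] + 1, ind[1])
result, new_prob = self_adaptive_local_search(
    (5, 1), 1.0, toy_fitness, toy_neighbor, random.Random(0), max_steps=3
)
```

Individuals that stop improving are revisited less and less often, which is how the scheme limits the number of extra fitness evaluations spent on local search.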

2.2.5. Elitist Selection (ES)
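In the SAMA algorithm, an elitist selection is introduced to select the individuals that evolve to the next iteration. After HC, DBM, and SLS, the ES operation is performed according to
$$X_i^{t+1} = \begin{cases} X_i', & f(X_i') > f(X_i^{t}), \\ X_i^{t}, & \text{otherwise}, \end{cases}$$
where $f$ is the fitness function, $X_i^{t}$ is the previous individual, and $X_i'$ is the individual obtained after HC, DBM, and SLS. If the fitness value of the individual $X_i'$ is greater than that of the previous individual $X_i^{t}$, $X_i^{t}$ is replaced by $X_i'$. Otherwise, $X_i^{t}$ is unchanged.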
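The elitist selection rule amounts to a pairwise comparison between each previous individual and its evolved counterpart. A minimal sketch, with a hypothetical function name and a toy fitness function, could look like this:

```python
def elitist_selection(previous, candidates, fitness):
    """Keep each candidate only if it is strictly fitter than its predecessor."""
    return [
        cand if fitness(cand) > fitness(prev) else prev
        for prev, cand in zip(previous, candidates)
    ]

# Toy fitness (an assumption for illustration): the sum of the two SNP indices.
previous = [(1, 2), (5, 6)]
candidates = [(3, 4), (1, 1)]
next_generation = elitist_selection(previous, candidates, fitness=sum)
print(next_generation)  # [(3, 4), (5, 6)]
```

Because a candidate must be strictly better to replace its predecessor, the fitness of each slot in the population never decreases from one iteration to the next.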
2.3. A Running Instance of SAMA
In this subsection, we give a running instance of SAMA in Figure 2. Suppose that there are five individuals in the current population. After initialization, , , , , and . Among them, obtains the highest fitness value, i.e., , and hence, is the current optimal solution ( and ).
First, we perform the HC operation. Suppose and for and , < for , < for , and < and < for . According to Algorithm 2, and are not changed and assigned directly to and , whereas the other three individuals are changed. One SNP in and is replaced; hence, is changed to and is changed to . is changed to because both SNPs in are replaced.
Next is the DBM operation. We assume that of is 0, the of and is 10, and the of and is 15. and get “”, whereas and get “.” Thus, is not changed and assigned directly to , is changed to , is changed to , is changed to , and is changed to .
After completing HC and DBM, the SLS operation is executed. , , and are not changed and assigned directly to , , and due to . For and , SLS is operated cyclically because of . is changed to and is changed to after the DBM operation in SLS.
Finally, the selection operation is performed. We suppose that , , , , and . Thus, , , and are retained to the next generation. For and , the two individuals are replaced and assigned to the next generation.
3. Results
To evaluate the performance of the SAMA algorithm, we test it on both simulated and real-world biological datasets. We compare it with FHSA-SED, AntEpiSeeker, IEACO, and DESeeker on these datasets. For the simulated datasets, we adopt three two-locus disease models. For the real-world biological dataset, we run SAMA on an age-related macular degeneration (AMD) dataset [35].
3.1. Simulated Datasets
In this subsection, we carry out the experiments on three simulated disease models (Models 1–3) [36]. Model 1 is a two-locus multiplicative model in which the disease prevalence increases multiplicatively with the incremental presence of the disease genotype interaction. Model 2 is a two-locus threshold model, in which the disease prevalence does not increase until the number of disease genotype interactions passes the threshold. Model 3 is a two-locus concrete model that simulates the effects of SNP-SNP interactions on susceptibility to traits. In the three models, the disease prevalence is set to 0.1, and the minor allele frequency (MAF) is 0.05, 0.10, 0.20, or 0.50. The genetic heritability ($h^2$) is 0.005 in Model 1 and 0.02 in Models 2 and 3. According to the combination of these values, 12 penetrance tables are obtained (see Table 1). For each penetrance table, 200 datasets are generated using GAMETES_2.0 [37]. The first 100 datasets contain 100 SNPs each, whereas the other 100 datasets contain 2000 SNPs each.
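To make the link between a penetrance table, the MAF, and the disease prevalence concrete, the sketch below computes the population prevalence implied by a two-locus penetrance table. It assumes Hardy-Weinberg equilibrium at each SNP and independence between the two SNPs. The penetrance values are hypothetical and are not taken from Table 1.

```python
def genotype_freqs(maf):
    """Hardy-Weinberg genotype frequencies for genotypes 0, 1, and 2."""
    q = maf
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]

def prevalence(penetrance, maf_a, maf_b):
    """Population prevalence: penetrance weighted by joint genotype frequency.

    penetrance[a][b] = P(disease | genotype a at SNP A, genotype b at SNP B).
    """
    fa = genotype_freqs(maf_a)
    fb = genotype_freqs(maf_b)
    return sum(fa[a] * fb[b] * penetrance[a][b] for a in range(3) for b in range(3))

# Hypothetical penetrance table used only for illustration.
toy_table = [
    [0.08, 0.08, 0.08],
    [0.08, 0.12, 0.18],
    [0.08, 0.18, 0.27],
]
toy_prevalence = prevalence(toy_table, 0.2, 0.2)
```

A simulator can tune the entries of such a table until the implied prevalence and heritability match the target values for a given MAF.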
3.2. Parameter Setting
In the experiments, we set the same maximum number of iterations for the five algorithms, that is, the maximum iteration number for datasets with SNPs is set to and the maximum iteration number for datasets with is set to . The maximum number of iterations is less than the number of iterations required by an exhaustive algorithm. Furthermore, the other parameters of the five compared algorithms are shown in Table 2.
3.3. Performance Evaluation Criteria
To evaluate the algorithms comprehensively, we use two measurements: detection power and running time. The detection power is defined as $\text{Power} = N_D / N$, where $N$ is the number of datasets generated by the same penetrance table and $N_D$ is the number of those datasets in which the two-locus SNP-SNP interaction associated with disease is detected.
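The detection power can be computed as a simple hit rate over the datasets of one setting. In the sketch below, the function name and the convention that an algorithm reports one SNP pair (or `None`) per dataset are assumptions made for illustration.

```python
def detection_power(detected_pairs, true_pair):
    """Fraction of datasets whose reported SNP pair matches the planted pair.

    detected_pairs holds one reported pair (or None) per dataset; the order of
    the two SNPs within a pair is ignored.
    """
    target = frozenset(true_pair)
    hits = sum(
        1 for pair in detected_pairs
        if pair is not None and frozenset(pair) == target
    )
    return hits / len(detected_pairs)

# Hypothetical results for four datasets generated from the same penetrance table.
reported = [(1, 2), (2, 1), (3, 4), None]
print(detection_power(reported, true_pair=(1, 2)))  # 0.5
```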
3.4. Experiments on Simulated Datasets
Figures 3 and 4 present the detection power of the five compared algorithms on the three disease models. The figures indicate that the SAMA algorithm is better than or equal to FHSA-SED, AntEpiSeeker, IEACO, and DESeeker on most settings, with an exception in Model 1 with 200 SNPs. SAMA detects all disease-associated SNP-SNP interactions on six settings for the datasets with 200 SNPs and on two settings for the datasets with 2000 SNPs. On the datasets with 200 SNPs, the other four algorithms are comparable with SAMA because they also perform well. On the datasets with 2000 SNPs, the detection power obtained by our algorithm is significantly greater than that of the other four algorithms, especially in Model 3. FHSA-SED and DESeeker follow and also perform reasonably well. IEACO comes next, and AntEpiSeeker performs worst in this experiment. The above analysis indicates that the proposed algorithm is more effective for detecting two-locus SNP-SNP interactions.
Tables 3 and 4 show the running time of the five compared algorithms on the three disease models. As the two tables illustrate, the running time of our method is less than that of the other four methods. This indicates that SAMA reduces the running time required to detect two-locus SNP-SNP interactions.
3.5. Experiments on a Real-World Biological Dataset
According to the results of the simulated experiments, SAMA performs well in detecting two-locus SNP-SNP interactions. In this section, we conduct experiments on a real-world biological dataset [35]. The purpose of the experiment is to detect two-locus SNP-SNP interactions associated with the disease by using the five compared algorithms. The five algorithms are run 10 times, and Figure 5 is drawn according to the obtained values. In the figure, each solid dot has two coordinates: one gives the $p$ value, and the other identifies the SNP-SNP interaction detected by an algorithm with that $p$ value. For the SAMA algorithm, 31 solid dots are detected, that is, 31 two-locus SNP-SNP interactions are detected. The number of solid dots found by the proposed algorithm is clearly larger than that found by the other four algorithms. AntEpiSeeker follows with 27 solid dots. Next are DESeeker and FHSA-SED, which detect 23 and 22 solid dots, respectively. IEACO finds the fewest interactions, with only 21 solid dots. The above analysis shows that SAMA can detect more two-locus SNP-SNP interactions than the other algorithms under the same number of iterations.
Table 5 presents the two-locus SNP-SNP interactions with $p$ values less than 1.0e-06 detected by our method. In the table, the numbers of two-locus SNP-SNP interactions found by the SAMA algorithm with $p$ values less than 1.0e-08, 1.0e-07, and 1.0e-06 are 1, 9, and 21, respectively. Table 6 gives the number of two-locus SNP-SNP interactions detected by SAMA under different parameters. Table 5 shows that rs380390 and rs1329428 interact with many other SNPs. The two SNPs are located in the CFH gene, and the CFH gene has been commonly associated with AMD [16, 38–40]. Furthermore, many SNPs included in the detected SNP-SNP interactions are located in non-gene coding regions (NA). There are seven interactions between the CFH gene and NA when the $p$ value is less than 1.0e-07, and there are ten interactions between the CFH gene and NA when the $p$ value is between 1.0e-07 and 1.0e-06. The CFH gene has one interaction with the KDM4C gene and two interactions with the MED27 gene. SNP rs2224762 is located in the KDM4C gene, which can regulate chromosome segregation during mitosis [41]. A possible association between this gene and AMD has been reported before [22, 42]. SNPs rs7467596 and rs9328536 in the MED27 gene are related to melanoma [43], and mutation in the MED27 gene may be associated with AMD [42]. Moreover, SAMA detected some new two-locus SNP-SNP interactions that have not been reported before. For example, rs1329428 has an interaction with rs10272438, and rs1740752 has an interaction with rs943008. SNP rs10272438 resides in the BBS9 gene, which is involved in parathyroid hormone action in bones. SNP rs943008 resides in the NEDD9 gene, which is closely related to cancer. However, these two-locus SNP-SNP interactions require further examination in future studies. Table 6 shows that the parameter settings chosen earlier find the largest number of two-locus SNP-SNP interactions.
4. Conclusion
In this paper, we propose the SAMA algorithm to detect two-locus SNP-SNP interactions associated with disease. The global search ability of SAMA is greatly increased by using HC, DBM, and ES. The self-adaptive behavior of SLS enhances the local search ability of SAMA without significantly increasing the running time. On the simulated datasets, the experimental results indicate that SAMA is more effective than FHSA-SED, AntEpiSeeker, IEACO, and DESeeker in terms of detection power and running time. On the real-world biological dataset, the experiments show that the proposed algorithm successfully detected known disease-associated SNP-SNP interactions and some new suspected interactions. However, the SAMA algorithm still has some limitations. First, the detection power of SAMA is low for the disease models with small . Furthermore, the current version of SAMA cannot detect high-order SNP-SNP interactions (those involving more than two SNPs). As far as we know, there does not exist a powerful method for detecting high-order SNP-SNP interactions in GWAS. Therefore, detecting high-order SNP-SNP interactions associated with disease leaves much room for exploration in the future.
Abbreviations
ACO:  Ant colony optimization 
AntEpiSeeker:  Two-stage ant colony optimization algorithm
AMD:  Age-related macular degeneration
DE:  Differential evolution
DBM:  Distributed breeder mutation
DESeeker:  Two-stage differential evolution algorithm
ES:  Elitist selection
FHSA-SED:  Harmony search algorithm with two scoring functions
GA:  Genetic algorithm
GWAS:  Genome-wide association study
IEACO:  Self-adjusting ant colony optimization based on information entropy
HC:  Hybrid crossover
LS:  Local search
MA:  Memetic algorithm
MAF:  Minor allele frequency
SAMA:  Self-adaptive memetic algorithm
SNP:  Single-nucleotide polymorphism
SLS:  Self-adaptive local search
Data Availability
The data used to support the findings of this study are included within the article and are described in detail in [30, 32].
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation Program of China under grant 61772124.