Abstract

Gene-gene interaction studies investigate the association between single nucleotide polymorphisms (SNPs) of genes and disease susceptibility. Statistical methods are widely used to search for good models of gene-gene interaction for disease analysis, and previously determined models have successfully explained the effects between SNPs and diseases. However, the huge number of potential combinations of SNP genotypes limits the use of statistical methods for analysing high-order interactions, and finding a usable high-order model of gene-gene interaction remains a challenge. In this study, an improved particle swarm optimization with double-bottom chaotic maps (DBM-PSO) was applied to assist statistical methods in the analysis of variations associated with disease susceptibility. A large data set was simulated using the published genotype frequencies of 26 SNPs amongst eight genes for breast cancer. Results showed that the proposed DBM-PSO successfully determined two- to six-order models of gene-gene interaction for the risk association with breast cancer (odds ratio > 1.0; P value < 0.05). The analysis supported that the proposed DBM-PSO can identify good models and provide higher chi-square values than conventional PSO. This study indicates that DBM-PSO is a robust and precise algorithm for determining gene-gene interaction models for breast cancer.

1. Introduction

Genome-wide association studies (GWAS) that analyse gene-gene interaction are important for detecting genetic effects on cancer and other diseases [1–4]. Such studies usually entail the collection of a vast number of samples and of SNPs selected from several disease-related genes in order to identify associations amongst genes. Disease effect, in general, is influenced by the association between SNPs from several genes; these SNPs could have a potential association that provides information for disease analysis. Therefore, a method for searching high-order interactions is needed to determine the potential association between several loci.

Good models of the association between SNPs from several genes are usually hidden in the large number of possible models. The total number of possible models of association between case data and control data can be computed by $C(n, m) \times g^{m}$, where $n$ represents the total number of SNPs, $m$ is the number of selected SNPs, and $g$ is the number of genotypes. Data mining and machine learning methods have been proposed for use in GWAS data analysis. These computational approaches were developed to examine epistasis in family-based and case-control association studies [5–12]. The genetic algorithm (GA), particle swarm optimization (PSO), and chaotic particle swarm optimization (CPSO) methods were proposed to identify the models of gene-gene interaction. However, their ability to determine relative model quality needs to be improved. Mathematically, the problem space for identifying good models is nonlinear, and an algorithm easily converges to a local optimum when no better models are found near the best model in that region. PSO often converges prematurely, especially in complex multipeak search problems. Therefore, the use of chaotic sequences to improve PSO has been proposed to identify models of gene-gene interaction [7]. An improved PSO using double-bottom chaotic maps (DBM-PSO) [13] has been shown to overcome the respective disadvantages of PSO and CPSO. In this study, DBM-PSO is applied to assist statistical methods in the analysis of variations associated with disease susceptibility.
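The size of the model space can be illustrated with a short calculation. The sketch below assumes that a model is formed by choosing a subset of SNPs and assigning one genotype to each selected SNP, so that the count is a binomial coefficient multiplied by a power of the number of genotypes; the function name `count_models` is hypothetical and used only for illustration.

```python
from math import comb


def count_models(n_snps: int, order: int, n_genotypes: int = 3) -> int:
    """Count candidate models: choose `order` SNPs out of `n_snps`,
    then assign one of `n_genotypes` genotypes to each chosen SNP."""
    return comb(n_snps, order) * n_genotypes ** order


if __name__ == "__main__":
    # Model counts for 26 SNPs with three genotypes, orders 2 to 7.
    for order in range(2, 8):
        print(order, count_models(26, order))
```

The count grows quickly with the interaction order, which is why an exhaustive statistical scan becomes impractical for high-order models.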

A total of 26 SNPs obtained from eight breast cancer-related genes (EGF, IGF1, IGF1R, IGF2, IGFBP3, IL10, TGFB1, and VEGF) were used to test the various methods and compare their association models. It is proposed that the interactions between polymorphisms of breast cancer-related genes may have synergistic effects on the pathogenesis of cancer and disease; this would explain differences in disease susceptibility. The quality of a model of gene-gene interaction can be assessed by determining its odds ratio (OR), confidence interval, and P value. We systematically evaluate the model effects from two- to five-order interactions to compare DBM-PSO with other PSO methods.

2. Methods

2.1. Problem Description

To assess the quality of models in the gene-gene interaction problem, each model includes SNPs and their corresponding genotypes. The set $X = \{x_1, x_2, \ldots, x_D\}$ represents a possible model as a solution in the problem space; each parameter $x_d$ is a real number. The chi-square test is used to design the PSO and DBM-PSO fitness functions. The objective is to search for a vector $X^{*}$ which has the best fitness value according to the fitness function $f$; that is, $f(X^{*}) \geq f(X)$ for all $X \in \Omega$, where $\Omega$ is a nonempty large finite set serving as the search space.

2.2. Particle Swarm Optimization

Particle swarm optimization (PSO) is a population-based stochastic optimization technique [14]. The concept of PSO is based on swarm intelligence theory and is used to search for optimal solutions to complex problems. Swarm intelligence describes an automatically evolving system based on simulating the social behaviour of organisms, for example, knowledge sharing. Valuable information can therefore be shared amongst swarm members to suggest a common objective which leads individuals toward an optimal direction. PSO has been used to solve several types of optimization problems [15], including function optimization and parameter optimization [16], and shows promise for nonlinear function optimization [17–22]. In PSO, possible solutions are represented as particles. During each generation, particle positions are adjusted according to the updated velocity toward a significant objective. The objective of each particle is defined based on the particle’s previous experience (pbest) and the knowledge commonly held by the population (gbest). Thus, particles can effectively converge into a solution-rich area to find a better solution. Finally, the particles follow the current best particle in the search space until a predefined number of generations is reached. The PSO procedure entails population initialization, objective function evaluation, identification of pbest and gbest, particle updating, and the termination condition. These steps are described in detail in the following sections.

2.3. Double-Bottom Map Particle Swarm Optimization

Double-bottom map particle swarm optimization (DBM-PSO) was proposed by Yang et al. in 2012 [13]. PSO is easily trapped when a nonlinear fitness function has multiple local optima, and DBM-PSO was designed to address this issue. A local optimum, $X^{*}$, can be described as follows: there exists $\varepsilon > 0$ such that $f(X^{*}) \geq f(X)$ for all $X$ with $\lVert X - X^{*} \rVert_{p} < \varepsilon$, where $\lVert \cdot \rVert_{p}$ represents any $p$-norm distance measure. In PSO, the flexibilities of given constraints and vector space in the problem influence the determination of the best solution. Generally speaking, the random values $r_1$ and $r_2$ independently influence search exploitation and exploration, and the effect of $r_1$ and $r_2$ on the convergence behaviour is very important in PSO. Recently, chaos approaches have been proposed to overcome the inherent disadvantages of PSO. Chaotic maps are easily applied in PSO to prevent entrapment of the population in local optima [23]. DBM-PSO proposes a new type of chaotic map, called the double-bottom map, to improve the search ability of PSO. Double-bottom maps are used to design an updating function that balances exploration and exploitation in the PSO search. The superiority of the double-bottom map over other chaotic maps lies in the fact that it provides high frequencies in three regions over time, that is, around 0.0, 0.5, and 1.0. A balanced distribution of values across 0.0, 0.5, and 1.0 can be effective in balancing the search behaviour, and the double-bottom map is designed to satisfy this property.

Algorithm 1 shows the DBM-PSO pseudocode and explains all processes in DBM-PSO used to identify the best model of gene-gene interaction. The difference between PSO and DBM-PSO is that the proposed double-bottom map is applied in the updating function of the PSO process (line 14 of Algorithm 1). All steps in DBM-PSO for identifying the models of gene-gene interaction are explained below.

Algorithm 1: DBM-PSO pseudocode.
(01) begin
(02)   Randomly initialize the particle swarm and DBMr
(03)   for t = 1 to the number of iterations
(04)     Evaluate fitness values of particles by FITNESS( )
(05)     for i = 1 to the number of particles
(06)       Find pbest by (13)
(07)       Find gbest by (14)
(08)       for d = 1 to the number of dimensions of a particle
(09)         Update the velocities of particles by (15)
(10)         Update the positions of particles by (16)
(11)       next d
(12)     next i
(13)     Update the inertia weight value by (17)
(14)     Update the value of DBMr by (18)
(15)   next t
(16) end

2.4. Initializing Particles and DBMr

In DBM-PSO, a point in the search space is a set $X$ which includes the real elements $x_d$. Each particle is a possible solution to the corresponding problem. The current iteration is denoted by $t$ and the subsequent iteration by $t + 1$. Since the elements in a set are likely to change over a sequence of iterations, (1) represents the $i$th particle in the population of iteration $t$ as

$$X_i^{t} = \left(x_{i,1}^{t}, x_{i,2}^{t}, \ldots, x_{i,D}^{t}\right). \tag{1}$$

In this study, a particle in the population represents a solution, that is, a model of gene-gene interaction. A particle contains two separate sets: a set of selected SNPs and a set of genotypes. Each element of a particle is restricted to a certain range of values. The values are related to physical components or measurements, that is, natural bounds. The initial population covers this range as much as possible by uniformly randomizing individuals within the search space constrained by the minimum and maximum bounds, which are represented by $SNP_{min}$ and $SNP_{max}$ for the SNPs and by $Genotype_{min}$ and $Genotype_{max}$ for the genotypes, respectively. Equation (2) shows all genotypes. The homozygous reference genotype is represented as 1, the heterozygous genotype as 2, and the homozygous variant genotype as 3:

$$Genotype \in \{1, 2, 3\}. \tag{2}$$

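The particles are generated by (3). Particles are initialized by generating the random set in a particle:

$$X_i = \left(SNP_{i,1}, \ldots, SNP_{i,m}, Genotype_{i,1}, \ldots, Genotype_{i,m}\right), \quad SNP_{i,j} = Rand(SNP_{min}, SNP_{max}), \quad Genotype_{i,j} = Rand(Genotype_{min}, Genotype_{max}), \tag{3}$$

where $SNP_{min}$ and $SNP_{max}$ represent the limits of the SNP index, while $Genotype_{min}$ and $Genotype_{max}$ represent the limits of the possible genotypes. For example, let $X_i = (1, 3, 4, 2, 1, 2)$; thus $X_i$ represents the $i$th particle in the first generation with selected SNPs (1, 3, 4) and genotypes (2, 1, 2), and it can be described by the SNPs associated with the genotypes as follows: (1, 2), (3, 1), and (4, 2).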

All random values in the particles are generated with a random value between 0.0 and 1.0 for each independent run.
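The initialization step can be sketched as follows. This is a minimal illustration rather than the authors' Java implementation: it draws the SNP indices and genotype codes directly as integers, and it assumes that the SNPs selected within one particle are distinct. The function name `init_particle` and its parameters are hypothetical.

```python
import random


def init_particle(n_snps, order, genotypes=(1, 2, 3), rng=random):
    """Draw one candidate model: `order` distinct SNP indices (1-based)
    and one genotype code per selected SNP."""
    snps = sorted(rng.sample(range(1, n_snps + 1), order))
    gts = [rng.choice(genotypes) for _ in snps]
    return snps, gts


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    snps, gts = init_particle(n_snps=26, order=3, rng=rng)
    # Pair each selected SNP with its genotype, e.g. (SNP, genotype).
    print(list(zip(snps, gts)))
```

Pairing the two halves of the particle reproduces the (SNP, genotype) reading used in the example above, such as (1, 2), (3, 1), and (4, 2).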

2.5. Evaluating the Qualities of Particles Using Fitness Function

In the DBM-PSO process, the fitness function measures the quality of particles in the population. Studies of gene-gene interaction focus on combinations of SNP genotypes to identify the highest chi-square (χ2) value between breast cancer cases and noncancer cases; this value is called the fitness value in DBM-PSO. Algorithm 2 shows the fitness value computation pseudocode. Equations (4) and (5) use the sizes of the case data and control data, respectively, and (4), (5), (6), and (7) operate on the sets of case data and control data. Equation (4) counts the number of case samples that include the model encoded by the particle, that is, the samples whose genotypes match the model at every selected SNP. Equation (5) counts the number of control samples that include the model in the same way. Equation (6) gives the total number of unmatched samples in the case data, that is, the size of the case data minus the count from (4). Equation (7) gives the total number of unmatched samples in the control data, that is, the size of the control data minus the count from (5). Equation (9) computes the difference between case data and control data (RorP in Algorithm 2) and is used to determine whether the model is associated with risk or protection. Equation (10) is used to compute the fitness value if the objective is to search for a risk association model. Equation (11) is used to compute the fitness value if the objective is to search for a protection association model. Equation (12) is the chi-square (χ2) function and is used to compute the χ2 value between breast cancer cases and noncancer cases in this study.

Algorithm 2: Fitness value computation pseudocode.
(01) FITNESS( )
(02)   Compute the number of matched cases using (4)
(03)   Compute the number of matched controls using (5)
(04)   Compute the number of unmatched cases using (6)
(05)   Compute the number of unmatched controls using (7)
(06)   Compute RorP using (9)
(07)   if the objective is to search for a risk association model
(08)     Compute fitness_value using (10)
(09)   else if the objective is to search for a protection association model
(10)     Compute fitness_value using (11)
(11)   end if
(12)   Return fitness_value
(13) End
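The fitness computation can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: each sample is assumed to be a list of genotype codes indexed by SNP, the χ2 value is assumed to be the Pearson statistic of the 2 × 2 table (matched or unmatched by case or control) without continuity correction, RorP is assumed to be the difference in matching proportions, and a model pointing in the wrong direction is assumed to receive a fitness of zero. All function names are hypothetical.

```python
def matches(sample, model):
    """True if the sample carries the model genotype at every selected SNP.
    `model` is a list of (snp_index, genotype) pairs with 1-based SNP indices."""
    return all(sample[snp - 1] == gt for snp, gt in model)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square of the table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: an empty row or column
    return n * (a * d - b * c) ** 2 / denom


def fitness(model, cases, controls, objective="risk"):
    case_match = sum(matches(s, model) for s in cases)
    ctrl_match = sum(matches(s, model) for s in controls)
    case_unmatch = len(cases) - case_match
    ctrl_unmatch = len(controls) - ctrl_match
    # Assumed form of RorP: positive means the model is more common in cases.
    r_or_p = case_match / len(cases) - ctrl_match / len(controls)
    chi2 = chi_square_2x2(case_match, ctrl_match, case_unmatch, ctrl_unmatch)
    if objective == "risk":
        return chi2 if r_or_p > 0 else 0.0
    return chi2 if r_or_p < 0 else 0.0
```

For example, a model matched by 30 of 100 cases and 10 of 100 controls gives a χ2 value of 12.5 as a risk model and a fitness of zero as a protection model under these assumptions.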

2.6. Updating the pbest of Particles and gbest of Population

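Each particle can be improved according to the two objectives, pbest and gbest, to search for a better solution. pbest indicates the best fitness value of a position previously visited by the particle, and its position is denoted by $pbest_i$. Equations (13) are the updating functions for a particle’s pbest position and value, respectively, as follows:

$$pbest_i^{t+1} = \begin{cases} X_i^{t+1}, & f(X_i^{t+1}) > f(pbest_i^{t}), \\ pbest_i^{t}, & \text{otherwise}, \end{cases} \qquad f(pbest_i^{t+1}) = \max\left(f(X_i^{t+1}), f(pbest_i^{t})\right), \tag{13}$$

where $X_i^{t}$ is the position of the $i$th particle at iteration $t$ and $f$ is the fitness function. gbest indicates the best value of all pbest values in the population, and its position is denoted by $gbest$. Equations (14) provide the updating functions for the gbest position and value, respectively, as follows:

$$gbest^{t+1} = pbest_{k}^{t+1}, \quad k = \arg\max_{i} f(pbest_i^{t+1}), \qquad f(gbest^{t+1}) = \max_{i} f(pbest_i^{t+1}). \tag{14}$$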
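The bookkeeping for pbest and gbest can be sketched as below, assuming that a larger fitness value (a higher χ2) is better. The function name `update_bests` and the list-based data layout are hypothetical choices made for this illustration.

```python
def update_bests(positions, fitnesses, pbest_pos, pbest_val, gbest_pos, gbest_val):
    """Update each particle's personal best in place and return the
    (possibly improved) global best position and value."""
    for i, (pos, fit) in enumerate(zip(positions, fitnesses)):
        if fit > pbest_val[i]:
            pbest_val[i] = fit
            pbest_pos[i] = list(pos)  # copy so later moves do not alter pbest
    # The global best is the best of all personal bests.
    best = max(range(len(pbest_val)), key=pbest_val.__getitem__)
    if pbest_val[best] > gbest_val:
        gbest_pos, gbest_val = list(pbest_pos[best]), pbest_val[best]
    return gbest_pos, gbest_val
```

Copying the position when a best is recorded keeps the stored best independent of the particle's subsequent movement.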

2.7. Updating Particle Velocities and Positions

DBM-PSO executes a search for optimal solutions by continuously updating particle positions in all iterations. Equations (15) and (16) are used to update the velocity and position of a particle, respectively, as follows:

$$v_{id}^{new} = w \times v_{id}^{old} + c_1 \times r_1 \times \left(pbest_{id} - x_{id}^{old}\right) + c_2 \times r_2 \times \left(gbest_{d} - x_{id}^{old}\right), \tag{15}$$

$$x_{id}^{new} = x_{id}^{old} + v_{id}^{new}, \tag{16}$$

where $c_1$ and $c_2$ are acceleration constants that control how far a particle moves in a given iteration. The random values $r_1$ and $r_2$ in (15) are generated by a function based on the results of the double-bottom map, with values between 0.0 and 1.0; they are described in the following sections. Velocities $v_{id}^{new}$ and $v_{id}^{old}$ are a particle’s new and old velocities, respectively. Positions $x_{id}^{old}$ and $x_{id}^{new}$ are the particle’s current and updated positions, respectively. Variable $w$ is the inertia weight and is described in the following section.
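The velocity and position updates follow the standard PSO form and can be sketched as below. In this illustration the values `r1` and `r2` are passed in as arguments, so that they can come either from a uniform random generator (conventional PSO) or from a chaotic map (as in DBM-PSO); the function name `update_particle` is hypothetical.

```python
def update_particle(x, v, pbest, gbest, w, r1, r2, c1=2.0, c2=2.0):
    """Return the new position and velocity of one particle.
    x, v, pbest, and gbest are equal-length lists of floats."""
    new_v = [
        w * vd + c1 * r1 * (pd - xd) + c2 * r2 * (gd - xd)
        for xd, vd, pd, gd in zip(x, v, pbest, gbest)
    ]
    new_x = [xd + nvd for xd, nvd in zip(x, new_v)]
    return new_x, new_v


if __name__ == "__main__":
    new_x, new_v = update_particle(
        x=[1.0], v=[0.5], pbest=[2.0], gbest=[3.0], w=0.9, r1=0.5, r2=0.25
    )
    print(new_x, new_v)
```

Keeping the source of `r1` and `r2` outside the update function reflects the point made in the text that the chaotic map only changes how these two values are produced.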

2.8. Updating Particle Inertia Weight Values

Variable $w$ in DBM-PSO is called the inertia weight and is used to control the impact of a particle’s previous velocity. Throughout all iterations, $w$ decreases linearly from 0.9 to 0.4 [24], and the equation can be written as

$$w = \left(w_{max} - w_{min}\right) \times \frac{Iter_{max} - Iter_{i}}{Iter_{max}} + w_{min}, \tag{17}$$

where $Iter_{i}$ represents the $i$th iteration and $Iter_{max}$ represents the iteration size. Values $w_{max}$ and $w_{min}$ represent the maximal and minimal values of $w$, respectively.
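The linear decrease of the inertia weight can be written as a one-line function. The sketch below assumes the common linearly decreasing schedule with the bounds 0.9 and 0.4 stated in the text; the function name `inertia_weight` is hypothetical.

```python
def inertia_weight(iteration, max_iterations, w_max=0.9, w_min=0.4):
    """Linearly decreasing inertia weight: w_max at iteration 0,
    w_min at iteration == max_iterations."""
    return (w_max - w_min) * (max_iterations - iteration) / max_iterations + w_min


if __name__ == "__main__":
    for it in (0, 25, 50, 75, 100):
        print(it, round(inertia_weight(it, 100), 3))
```

A large weight early in the run favours exploration, and the smaller weight near the end favours refinement around the best positions found.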

2.9. Updating Particle DBM Values

In DBM-PSO, two random values in the updating function are generated by the following double-bottom map function:

2.10. Parameter Settings

In this study, all methods used the same parameters to test their search ability for the identification of models of gene-gene interaction. The population size is 100 and the maximal number of iterations is 100. The inertia weight decreases from 0.9 to 0.4 [25]. Both learning factors, $c_1$ and $c_2$, are equal to 2 [26]. All tests are implemented in Java as a single thread in a PC environment running 32-bit Windows 7 with an Intel Core 2 Quad CPU Q6600 at 2.4 GHz and 4 GB of RAM.

2.11. Statistical Analysis

The model of associations between SNPs can be evaluated by the odds ratio (OR), its 95% confidence interval (CI), and the P value [27]. The OR quantitatively measures the risk of disease for a model; the P value evaluates whether the difference between the case data and control data is statistically significant. All statistical analyses are implemented using SPSS version 19.0 (SPSS Inc., Chicago, IL).
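For illustration, the OR and an approximate 95% CI can be computed from the 2 × 2 table of matched and unmatched samples. The sketch below assumes the standard logit (Woolf) interval; it is not taken from the SPSS procedure used in the study, and the function name `odds_ratio_ci` is hypothetical. All four cell counts are assumed to be nonzero.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI (logit method).
    a: matched cases, b: matched controls,
    c: unmatched cases, d: unmatched controls (all nonzero)."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    print(odds_ratio_ci(30, 10, 70, 90))
```

An OR above 1 with a lower bound above 1 points to a risk association, and an OR below 1 with an upper bound below 1 points to a protection association.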

3. Results and Discussion

3.1. Data Set

The growth factor-related genes of breast cancer, including genes of EGF, IGF1, IGF1R, IGF2, IGFBP3, IL10, TGFB1, and VEGF with 26 SNPs, were tested in this study. A genotype generator is used to generate a large simulated data set according to the genotype frequencies. Algorithm 3 shows the genotype generator pseudocode to explain how the data set was generated. The genotype frequencies of SNPs are collected from Pharoah et al.’s breast cancer association study [39], which explains the significance of these SNPs of genes in breast cancer.

Algorithm 3: Genotype generator pseudocode.
(01) begin
(02)   for n = 1 to the number of SNPs
(03)     compute the size of the “AA” genotype in n-SNP
(04)     compute the size of the “Aa” genotype in n-SNP
(05)     compute the size of the “aa” genotype in n-SNP
(06)     generate the three genotypes into a set S_n according to each size
(07)     randomly sort the elements of S_n
(08)   next n
(09)   set dataset = {S_n | n = 1, 2, ..., N}, where N is the number of SNPs
(10) end
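The genotype generator can be sketched as follows. This is an illustration under assumptions: genotypes AA, Aa, and aa are coded as 1, 2, and 3, each SNP column is built from rounded genotype counts and then shuffled independently, and case and control sets would be generated by separate calls with their own frequencies. The function names are hypothetical and the frequencies in the usage example are made up for demonstration only.

```python
import random


def generate_snp_column(freqs, n_samples, rng=random):
    """Build one SNP column from (AA, Aa, aa) frequencies and shuffle it."""
    n_AA = min(round(freqs[0] * n_samples), n_samples)
    # Clamp so that rounding can never make the three counts exceed n_samples.
    n_Aa = min(round(freqs[1] * n_samples), n_samples - n_AA)
    n_aa = n_samples - n_AA - n_Aa
    column = [1] * n_AA + [2] * n_Aa + [3] * n_aa
    rng.shuffle(column)
    return column


def generate_dataset(freq_table, n_samples, rng=random):
    """Return a list of samples; each sample holds one genotype per SNP."""
    columns = [generate_snp_column(f, n_samples, rng) for f in freq_table]
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    rng = random.Random(7)
    # Illustrative frequencies only, not values from the study.
    data = generate_dataset([(0.5, 0.3, 0.2), (0.25, 0.5, 0.25)], 10, rng)
    print(data)
```

Because each column is shuffled independently, the simulated SNPs carry the specified marginal genotype frequencies without any built-in dependence between loci.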

3.2. Evaluation of Breast Cancer Susceptibility Using 26 SNPs from Eight Growth Factor-Related Genes

Table 1 shows the performance (OR and 95% CI) for estimating the effect of each single SNP from the eight growth factor-related genes (EGF, IGF1, IGF1R, IGF2, IGFBP3, IL10, TGFB1, and VEGF). Amongst the 26 SNPs in the eight genes, eight SNPs in four genes display a statistically significant OR (P < 0.05) for breast cancer. Six SNPs have a risk association for breast cancer, including rs5742678-GG, rs1549593-AA, rs6220-GG, IGF1R-10-aa, rs2132572-GA and -AA, and rs1800470-CC. The highest and lowest OR values amongst these are 1.33 and 1.09, respectively. Two SNPs have a protection association for breast cancer, including rs2229765-AA and rs2854744-CC. The highest and lowest OR values amongst these are 0.88 and 0.82, respectively. The other SNPs show no statistically significant OR for breast cancer.

3.3. Analysis of Models for Gene-Gene Interaction with Risk Association between the Case and Control Data Sets Using PSO, CPSO, and DBM-PSO

Table 2 shows the 2- to 7-order risk association models for gene-gene interaction. The results are compared using the χ2 value, with a high value indicating a good result. The 2-SNP model with its corresponding genotypes, SNPs (1, 7) with genotypes 1-3, [rs5742678-CC]-[IGF1R-10-aa], is identified by all three methods with a χ2 value of 9.451, which explains the difference between the case and control data sets. However, the results for the 3- to 7-SNP models clearly indicate that the DBM-PSO algorithm exhibited an improved search ability over PSO and CPSO in terms of the χ2 value. For example, for the 3-SNP models, DBM-PSO identified a model with a χ2 value of 8.772, whereas those of PSO and CPSO are 3.364 and 3.997, respectively. Table 2 also shows the OR and its 95% CI, which estimate the impact of the risk association model on the occurrence of breast cancer. A larger OR value (>1) indicates a stronger risk association between the SNPs with combined genotypes and the disease. DBM-PSO shows high OR values (1.346–10.018) for models with a high association for the risk of breast cancer, and the P value (<0.05) indicates that the models have a statistically significant difference between patients and nonpatients. Aside from a 3-SNP model of CPSO, the P values of the 3- to 7-SNP models of PSO and CPSO show no statistical significance, indicating that PSO and CPSO have difficulty in identifying statistically significant models for risk association for breast cancer. In contrast, DBM-PSO successfully identifies good models for risk association for breast cancer.

3.4. Analysis of Models of Gene-Gene Interaction with Protection Association between Case and Control Data Sets Using PSO, CPSO, and DBM-PSO

Table 3 shows the 2- to 7-order protection association models. The OR values (<1) estimate the impact of the protection association model on the occurrence of breast cancer. High χ2 values in the models indicate good results, and the P value indicates whether the model has a statistically significant difference between patients and nonpatients. The results for the 3- to 7-SNP models show that DBM-PSO obtains higher χ2 values than PSO and CPSO, indicating that DBM-PSO is better at searching for good protection association models than the other methods. DBM-PSO has OR values ranging from 0.755 to 0.850, with a P value of <0.05 for protection against breast cancer. The 2-SNP and 3-SNP models in PSO and CPSO show a statistically significant difference between patients and nonpatients (P < 0.05), and the 4-SNP model in CPSO also shows a statistically significant difference. Although CPSO provides better OR values than DBM-PSO in the 5-, 6-, and 7-SNP models, the P values indicate that these models are not statistically significant. DBM-PSO successfully identifies good models for protection association for breast cancer.

3.5. Discussion

Effects between SNPs from several genes could contribute to disease development. Case-control studies are the main method to determine the association between SNPs. Many breast cancer studies have analysed the associations between important related genes [28–34], hypothesizing that disease risk may be associated with the co-occurrence of SNPs displaying a joint effect, including genes related to DNA repair [35, 36], chemokine ligand-receptor interactions [37], and estrogen-response genes [4].

Evolutionary algorithms have been applied to identify good models of gene-gene interaction [7, 9]. Previous studies used the difference between case and control data sets to design the fitness function, allowing for the identification of models with high difference values for all SNP combinations. However, the highest difference between the case and control data sets is not necessarily statistically significant (P value < 0.05). The chi-square test is a statistical tool to evaluate the difference between the observed and expected data sets under specific hypothetical conditions. A property of the chi-square test is that, for fixed degrees of freedom, the P value decreases as the chi-square value increases. Therefore, the chi-square test is used to design the fitness function in this study. PSO and CPSO [7] were used to search for good models based on the new fitness function, but the results (Tables 2 and 3) fail to identify high-order associations. In contrast, DBM-PSO effectively identified good risk and protection association models of gene-gene interactions for breast cancer. Statistical measures, such as the P value, the OR, and its 95% CI, provide strong validation of the search ability of DBM-PSO.
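The inverse relation between the chi-square value and the P value can be illustrated with a short calculation. The sketch below assumes one degree of freedom, which corresponds to a 2 × 2 table of matched and unmatched samples against case and control status; in that case the upper-tail probability has the closed form erfc(sqrt(χ2 / 2)). The function name `chi2_p_value_df1` is hypothetical.

```python
from math import erfc, sqrt


def chi2_p_value_df1(chi2):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2.0))


if __name__ == "__main__":
    # Chi-square values quoted in the text for the risk association models.
    for value in (3.364, 3.997, 8.772, 9.451):
        print(value, chi2_p_value_df1(value))
```

Under this assumption, a χ2 value above roughly 3.84 corresponds to a P value below 0.05, so maximizing the χ2 value as the fitness directly favours statistically significant models.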

The computational complexity of PSO and DBM-PSO is measured by the number of fitness function computations. The difference between the two methods can be observed in (15) and (18): (18) is only used to amend the original PSO updating equation (15). Therefore, DBM-PSO does not increase the complexity of the PSO search process. The computational complexity of DBM-PSO is $O(T \times N)$, where $T$ is the number of iterations and $N$ is the number of particles.

The results of DBM-PSO are influenced by its parameters, including the double-bottom chaotic map (18), population size, iteration size, and $c_1$ and $c_2$ in the updating function (15). Yang et al. [13] tested the 22 most commonly used representative benchmark functions and selected the optimal parameter (4π) for the proposed double-bottom chaotic map. Therefore, the parameter is suggested as 4π in (18). The population and iteration sizes can be adjusted according to the size of the data set. The suggested population size ranges from 50 to 200, and the suggested number of iterations ranges from 100 to 1000. $c_1$ and $c_2$ are both suggested to be 2 [38].

4. Conclusion

We proposed a new fitness function to identify good models of gene-gene interaction for the investigation of polygenic diseases and cancers. The fitness function based on the chi-square test addresses the disadvantage of previously proposed fitness functions, namely, that the highest difference between the case and control data sets is not necessarily statistically significant (P value < 0.05). The proposed DBM-PSO was shown to successfully determine cross interactions amongst the 26 SNPs for risk and protection models of gene-gene interaction in breast cancer. The results indicate that DBM-PSO can successfully use the chi-square test to identify good models by evaluating the difference between the observed and expected data sets under specific hypothetical conditions.

Conflict of Interests

The authors declare that they have no conflict of interests regarding the publication of this paper.

Acknowledgments

This study was partly supported by the National Science Council of Taiwan for Grants NSC102-2221-E-151-024-MY3, NSC102-2622-E-151-003-CC3, NSC101-2221-E-214-075, NSC101-2622-E-151-027-CC3, NSC100-2221-E-151-049-MY3, and NSC100-2221-E-151-051-MY2, the National Sun Yat-sen University-KMU Joint Research Project (no. NSYSU-KMU 103-p014), and the Ministry of Health and Welfare, Taiwan (MOHW103-TD-B-111-05).