Abstract

Gene expression programming (GEP), an improved form of genetic programming (GP), has become a popular tool for data mining. However, like other evolutionary algorithms, it tends to suffer from premature convergence and a slow convergence rate when solving complex problems. In this paper, we propose an enhanced GEP algorithm, called CTSGEP, which is inspired by the principle of minimal free energy in thermodynamics. CTSGEP employs a component thermodynamical selection (CTS) operator to quantitatively keep a balance between the selective pressure and the population diversity during the evolution process. Experiments are conducted on several benchmark datasets from the UCI machine learning repository. The results show that CTSGEP performs better than the conventional GEP and some GEP variations.

1. Introduction

Gene expression programming (GEP) [1, 2], an improved form of genetic programming (GP) with linear representation [3, 4], is an artificial problem solver inspired by the natural genotype/phenotype system. GEP combines the simple, fixed-length linear chromosomes used to represent solutions in genetic algorithms (GA) with ramified structures of different sizes and shapes similar to the parse trees of GP [3, 5, 6]. Thus, GEP has the advantages of both GA and GP, while overcoming some of their individual limitations [3, 4]. Because of its high performance, GEP has recently attracted increasing attention as an efficient and effective data mining approach. Moreover, it has been successfully applied to many fields, such as function finding [7–9], symbolic regression [10–13], parameter optimization [14], rule mining [15], classification [3, 16], time series forecasting [2], prediction of flow number of asphalt mixes [17], prediction of material load [18, 19], prediction of the strength of concrete [20], engineering design [21], and machine scheduling [22, 23].

Although GEP has been successfully employed in a variety of areas, in practical applications, it is found that the conventional GEP usually suffers from premature convergence and slow convergence rate resulting in poor solution quality and/or large computational cost [24]. The main reason is that the conventional GEP cannot quantitatively keep a balance between the selective pressure and the population diversity during the evolution process. Therefore, this may lead to trapping in the local optimum and/or slowing down the search speed.

In general, increasing selective pressure and promoting population diversity in GEP are often in conflict with each other [3, 4]. Increasing selective pressure draws more individuals close to the best individual, so the average fitness of the population improves and the population converges faster. However, it may also result in an evolutionary state in which most of the individuals are approaching the best individual. As a result, the population diversity is significantly reduced after some generations, increasing the possibility of trapping in local optima. On the contrary, promoting population diversity distributes the individuals widely in the search space and increases the probability of finding the global optimum, but this may slow down the convergence speed.

To the best of our knowledge, there has been little research focusing on how to quantitatively balance the selective pressure and population diversity of GEP during the evolution process. Therefore, this motivates us to investigate a selection mechanism that can quantitatively keep a balance between the selective pressure and population diversity of GEP to enhance the global search ability and simultaneously to accelerate the convergence speed. Our work along this idea has produced a novel GEP based on component thermodynamical selection operator (CTS), called CTSGEP. This proposed approach, inspired by the principle of minimal free energy in thermodynamics, seeks to map the selective pressure and the population diversity into the mean energy and the entropy, respectively. In order to quantitatively balance the selective pressure and the population diversity of GEP, in the CTS, when selecting individuals for the next generation from the parent and offspring individuals, the selected individuals for the next generation should satisfy the principle of minimal free energy.

The rest of the paper is organized as follows. Section 2 introduces the notations and terminologies of GEP that are useful for the review of the previous works of GEP in Section 3. The proposed algorithm, CTSGEP, is elaborated in Section 4, with detailed explanations on the component thermodynamical selection operator. The computational results and comparisons are provided in Section 5. Finally, we end the paper with some conclusions in Section 6.

2. GEP Basic Concepts

2.1. Chromosomes Representation

The most innovative feature of GEP is the improved representation of chromosomes. GEP separates the genotype from the phenotype of the chromosomes [3]; the lack of such a separation is one of the greatest limitations of both GA [24, 25] and GP. In GEP, individuals are represented by linear strings called chromosomes. The chromosomes consist of genes and link operators, in which the link operators connect the genes. The link operators are usually arithmetic operators, such as +, −, *, and /. Moreover, each gene of GEP has two forms [2]: genotype and phenotype. The genotype is the code of the gene, similar to that used in GA, and the genetic operators directly manipulate the genotype, while the phenotype is the decoding of the gene into ramified structures of different sizes and shapes similar to the parse trees of GP. For instance, the transformation of an example gene into its expression tree is shown in Figure 1. The merits of separating the genotype from the phenotype of the chromosomes are obvious. On the one hand, the representation of the chromosomes is simple and compact, so the genetic operators are easy to implement and very efficient. On the other hand, this mechanism makes GEP able to solve complex problems.
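The genotype-to-phenotype decoding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual breadth-first reading of a GEP gene, and the gene string `Q*+-abcda`, the symbol `Q` for square root, and the terminal names `a` to `d` are hypothetical choices made for the example.

```python
import math
from collections import deque

# symbol -> (arity, implementation); symbols not listed here are terminals
FUNCTIONS = {
    "+": (2, lambda a, b: a + b),
    "-": (2, lambda a, b: a - b),
    "*": (2, lambda a, b: a * b),
    "/": (2, lambda a, b: a / b),
    "Q": (1, math.sqrt),
}

def decode(gene):
    """Read the gene left to right and fill the expression tree level by level."""
    symbols = iter(gene)
    root = (next(symbols), [])          # node = (symbol, list of child nodes)
    pending = deque([root])
    while pending:
        symbol, children = pending.popleft()
        arity = FUNCTIONS[symbol][0] if symbol in FUNCTIONS else 0
        for _ in range(arity):
            child = (next(symbols), [])
            children.append(child)
            pending.append(child)
    return root  # symbols left unread form the non-coding part of the gene

def evaluate(node, values):
    """Evaluate a decoded tree for the given terminal values."""
    symbol, children = node
    if symbol not in FUNCTIONS:
        return values[symbol]
    args = [evaluate(child, values) for child in children]
    return FUNCTIONS[symbol][1](*args)

# Hypothetical gene: decodes to sqrt((a + b) * (c - d)); the last symbol is unused.
tree = decode("Q*+-abcda")
result = evaluate(tree, {"a": 1.0, "b": 2.0, "c": 7.0, "d": 4.0})
print(result)
```

Because the genotype is a flat string, genetic operators can work on it with plain string operations, while the tree is rebuilt only when the individual is evaluated.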

In GEP, each gene is composed of two parts: a head and a tail. The head contains functional symbols (e.g., +, −, *, /, etc.) and terminal symbols, but the tail contains only terminal symbols. The length of the head $h$ is selected by the user according to the specific problem, while the length of the tail $t$ is a function of $h$ and $n$, where $n$ is the number of arguments of the function that takes the most arguments. In addition, $t$ should satisfy (1), which ensures that any gene can be decoded into a correct mathematical expression:

$$t = h(n - 1) + 1. \tag{1}$$

For example, consider a gene composed of the functional symbols +, −, *, /, and Q and two terminal symbols, where Q represents the square root function. In the set of functional symbols {+, −, *, /, Q}, $n$ is 2. Assuming $h$ is 4, it follows that $t = 4 \times (2 - 1) + 1 = 5$. Thus, the length of the gene is $h + t = 4 + 5 = 9$.
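The head/tail arithmetic can be checked with a few lines of code. The sketch below assumes the standard GEP relation between head length, maximum arity, and tail length (tail = head × (maximum arity − 1) + 1); the function names are illustrative only.

```python
def tail_length(head_length, max_arity):
    """Tail length that guarantees every gene decodes to a valid expression."""
    return head_length * (max_arity - 1) + 1

def gene_length(head_length, max_arity):
    """Total gene length: head plus tail."""
    return head_length + tail_length(head_length, max_arity)

# Head length 4 with binary functions as the widest ones.
print(tail_length(4, 2), gene_length(4, 2))
```

With a head length of 20 and a maximum arity of 2, the same relation gives a gene length of 41, which matches the head and gene lengths listed in the experimental setup of Section 5.1.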

2.2. Genetic Operators

There are many genetic operators in GEP, including the selection, mutation, transposition-insertion, and recombination operators. These operators are subject to the following conditions: the length of the head and that of the tail satisfy formula (1), and the tail contains only terminal symbols [2]. These conditions ensure that the genetic operators always generate new genes that can be decoded into correct mathematical expressions. Therefore, these operators are simple and easy to implement. A detailed description of these operators can be found in [1, 2].
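As an illustration of how an operator can respect these structural conditions, here is a minimal sketch of point mutation on a single gene. It is an assumption-laden example rather than the operator defined in [1, 2]: the symbol sets, the per-symbol mutation rate, and the function name `mutate` are all hypothetical.

```python
import random

FUNCTIONS = ["+", "-", "*", "/", "Q"]   # hypothetical function set
TERMINALS = ["a", "b"]                  # hypothetical terminal set

def mutate(gene, head_length, rate, rng):
    """Point mutation that keeps the head/tail structure of the gene intact."""
    mutated = []
    for position, symbol in enumerate(gene):
        if rng.random() < rate:
            # Head positions may receive any symbol; tail positions only terminals.
            pool = FUNCTIONS + TERMINALS if position < head_length else TERMINALS
            symbol = rng.choice(pool)
        mutated.append(symbol)
    return "".join(mutated)

rng = random.Random(0)
child = mutate("Q*+-ababa", head_length=4, rate=1.0, rng=rng)
print(child)
```

Since the gene length never changes and the tail never receives a function symbol, every mutated gene still decodes to a valid expression.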

2.3. Fitness Functions

Generally, different fitness functions are suitable for different problems. The choice of fitness function is crucial for GEP, mainly because it directly affects the convergence speed and the solution quality. In GEP, there are several kinds of fitness functions: the absolute error fitness function, the relative error fitness function, and the logic synthesis fitness function [1, 2]. They are described as follows:

$$f_i = \sum_{j=1}^{C_t} \left( M - \left| C_{(i,j)} - T_j \right| \right), \tag{2}$$

$$f_i = \sum_{j=1}^{C_t} \left( M - \left| \frac{C_{(i,j)} - T_j}{T_j} \cdot 100 \right| \right), \tag{3}$$

$$f_i = \begin{cases} n, & \text{if } n \ge \frac{1}{2} C_t, \\ 1, & \text{otherwise}, \end{cases} \tag{4}$$

where $M$ is a constant that determines the range of the fitness $f_i$, $C_{(i,j)}$ is the value calculated by individual $i$ for sample instance $j$, $T_j$ is the target value for sample instance $j$, $C_t$ is the total number of sample instances, and $n$ is the number of sample instances correctly predicted. In general, fitness functions (2) and (3) are employed to solve function regression problems and fitness function (4) is applied to Boolean concept learning problems.
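The three fitness functions can be sketched in a few lines. The forms used below follow the commonly cited GEP definitions (a selection range constant minus the absolute or percentage error, summed over the samples, and a thresholded hit count for logic synthesis); they are stated here as an assumption, and the function and parameter names are illustrative.

```python
def absolute_error_fitness(predicted, target, selection_range):
    """Sum over samples of (selection_range - absolute error)."""
    return sum(selection_range - abs(c - t) for c, t in zip(predicted, target))

def relative_error_fitness(predicted, target, selection_range):
    """Sum over samples of (selection_range - percentage error); targets must be nonzero."""
    return sum(selection_range - abs((c - t) / t * 100.0)
               for c, t in zip(predicted, target))

def logic_synthesis_fitness(n_correct, n_samples):
    """Number of hits if at least half the samples are correct, otherwise 1."""
    return n_correct if n_correct >= n_samples / 2 else 1

print(absolute_error_fitness([1.0, 2.0], [1.0, 3.0], 100))
print(relative_error_fitness([1.0, 2.0], [1.0, 4.0], 100))
print(logic_synthesis_fitness(3, 4), logic_synthesis_fitness(1, 4))
```

In all three cases a larger value indicates a better individual, which is the convention the selection operator in Section 4 relies on.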

2.4. The Framework of GEP

The framework of GEP is similar to that of GA [26]. The major difference between GEP and GA is the representation of chromosomes. However, the essential idea of GEP is the same as the one of GA [2], which is based on the concepts of natural selection and survival of the fittest. The procedure of GEP is described in Algorithm 1.

Algorithm 1: GEP algorithm
Step  1   Initialize the parameters, set t = 0, and generate an initial population P(t);
Step  2   while (FES < Max_FES)
  {
     Evaluate population P(t);
     Save the best individual;
     Execute selection operator;
     Execute mutation operator;
     Execute transposition-insertion operator;
     Execute recombination operator;
     t = t + 1;
  }
Step  3   Output the best individual.

3. Previous Work

In order to enhance the performance of the traditional GEP algorithm, many scholars recently have proposed several GEP variants. Moreover, these GEP variants can be classified into two categories: accelerating convergence speed and promoting population diversity.

3.1. Accelerating Convergence Speed

In order to accelerate the convergence speed of the traditional GEP, Karakasis and Stafylopatis [3] proposed a novel GEP for data mining tasks that incorporates a principle inspired by the immune system, namely, the clonal selection principle. In the proposed algorithm, a receptor-editing step was added in order to achieve faster exploration of the antibody-antigen binding space. Experimental results showed that the proposed GEP variant outperformed the conventional GEP in terms of both prediction accuracy and computational efficiency. Zhang et al. [27] introduced an improved gene expression programming (IGEP), which employed a dynamic mutation operator to enhance the efficiency. The proposed algorithm obtained better results than the heuristic method when predicting retention times for a larger set of pesticides. Further, IGEP, as a nonlinear method, showed good generalization performance. By applying parallel tabu search, Rao et al. [28] presented an enhanced GEP to improve the local search ability of the conventional GEP. Wu et al. [29] proposed a parallel niche GEP for general multicore processors to improve the evolution efficiency; the parallel model of niche GEP was implemented with OpenMP. Based on an analysis of the intelligibility and efficiency of expression-tree-based expression in GEP, Chen et al. [7] introduced a reduced GEP, in which the chromosomes were evaluated directly on the reduced gene without being expressed as expression trees. Moreover, the result evolved by the reduced GEP was simpler and easier to understand and explain.

3.2. Promoting Population Diversity

For maintaining good population diversity of the conventional GEP, Jiang et al. [30] proposed an adaptive GEP algorithm based on cloud model. The proposed GEP algorithm employed an adaptive cloud strategy to determine the mutation and crossover rate dynamically to improve the population diversity. Li et al. [31] introduced an improved GEP (AMACGEP) by statistical analysis and critical velocity, which utilized statistical analysis of repeated bodies to enhance the diversity of the initial population. Moreover, it proposed a dynamic mutation operator to improve the diversity of individuals. Liu et al. [32] proposed a population diversity-oriented GEP (Mod-GEP) for function finding, in which two strategies including population updating and population pruning were used to increase the diversity of population. The experimental results showed that Mod-GEP can obtain more satisfactory solution than GP, GEP, and some other GEP variants. Zhang and Xiao [33] presented a population diversity strategy GEP (GEP-PDS). The presented GEP-PDS inherited the advantage of superior population producing strategy and various population strategies to maintain the diversity of population. Further, Zhang et al. [34] proposed an improved GEP based on block strategy (BS-GEP), in which the population was divided into several blocks according to the individual fitness of each generation and the genetic operators were reset differently in each block to preserve the population diversity. In addition, BS-GEP was also utilized in prediction of software failure sequence.

4. The Proposed CTSGEP Algorithm

4.1. Motivations

As pointed out in Section 3, some researchers have developed GEP variants that increase the selective pressure in order to accelerate the convergence speed, but this may increase the possibility of trapping in local optima [3, 27, 28]. Meanwhile, to decrease the possibility of trapping in local optima, many scholars have attempted to encourage population diversity during the evolution process. However, this may slow down the search [13, 31–33]. Therefore, improving only one of the two, selective pressure or population diversity, cannot overcome these deficiencies of GEP. A better approach is to keep a balance between the selective pressure and the population diversity during the evolution process. In essence, reconciling the conflict between the selective pressure and the population diversity amounts to solving a bi-objective optimization problem that can be formulated as follows.

Given the parent population $P(t)$ of size $N$, $M$ offspring individuals are created by the GEP genetic operators. Hence, there are $N + M$ individuals in total. The bi-objective optimization problem is to select $N$ individuals from the $N + M$ parent and offspring individuals to form the next generation population $P(t+1)$ such that both the selective pressure, measured by the average fitness, and the population diversity of $P(t+1)$ are optimized.

Notice that, in the above formulation, without loss of generality, we assume that a larger fitness value implies a better individual in GEP. In addition, the selective pressure is measured by the average fitness of the population.

Many existing approaches, such as evolutionary multiobjective optimization algorithms, can tackle the above bi-objective optimization problem. However, this problem has to be solved in every generation of GEP, so the computational complexity of the solving process should be low; otherwise, the overall GEP algorithm may converge very slowly. Thus, approaches with high computational complexity (e.g., evolutionary multiobjective optimization algorithms) may not be suitable. Furthermore, it is unrealistic to obtain the exact solution of the bi-objective optimization problem, because the computational cost of doing so is prohibitive.
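As a rough illustration of why an exact solution is out of reach, one can count how many candidate next-generation populations an exhaustive search would have to compare. This count is an illustrative assumption, not the complexity expression given by the authors; the population size of 100 comes from Section 5.1, while the offspring size of 20 is a hypothetical value.

```python
import math

def candidate_populations(parent_size, offspring_size):
    """Number of ways to keep parent_size individuals out of parents plus offspring."""
    return math.comb(parent_size + offspring_size, parent_size)

# Hypothetical setting: 100 parents and 20 offspring.
print(candidate_populations(100, 20))
```

Even for these modest sizes the count is astronomically large, which motivates a cheap approximate selection rule applied once per generation.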

Based on the above considerations, we present a novel method, called CTS, to obtain an approximate solution of the above bi-objective optimization problem with very low computational complexity. Its primary idea is inspired by the principle of minimal free energy in thermodynamics, which can be described as follows [35, 36]. In the annealing process, a metal, starting at a high temperature and in a disordered state, is gradually cooled so that the system approximately reaches thermodynamic equilibrium at every temperature. This cooling process can be regarded as an adaptation procedure to achieve the stability of the final crystalline solid. In addition, any change of the system from nonequilibrium to equilibrium at each temperature follows the principle of minimal free energy. This means that the system changes spontaneously to reach a lower total free energy and achieves equilibrium when its free energy is at a minimum [36]. The free energy $F$ is defined by

$$F = \langle E \rangle - TH, \tag{5}$$

where $\langle E \rangle$ is the mean energy of the system, $H$ is the entropy, and $T$ is the temperature. According to the principle of minimal free energy, any change of the system can be viewed as a result of the competition between the mean energy and the entropy, and the temperature determines their relative weights in the competition [36]. In other words, the two objectives, namely, the mean energy and the entropy, are in conflict with each other, and the temperature is the weight between them. Moreover, the final objective can be converted into minimizing the free energy. This is similar to the relationship between the selective pressure and the population diversity addressed before. Therefore, we can solve the above bi-objective optimization problem according to the principle of minimal free energy.

4.2. Basic Concepts of Component Thermodynamical Selection Operator

In order to utilize the principle of minimal free energy to reconcile the conflicts between the selective pressure and the population diversity, we should first map the selective pressure and the population diversity into the mean energy and the entropy, respectively. According to the characteristics of GEP and our previous works in [36, 37], we give the following definitions.

Definition 1. Let $S$ be the search space. For any GEP individual $x \in S$, its fitness value is $f(x)$, where a larger fitness value indicates a better individual. The absolute energy $e(x)$ of individual $x$ is defined by (6).

Definition 2. Let $P(t)$ be the GEP population of generation $t$. The absolute energy window $[e_l(t), e_u(t)]$ is defined as follows. (1) When $t = 0$, the window is given by (7). (2) When $t > 0$, the window is given by (8), where $O(t)$ is the offspring population.

Definition 3. Let $P(t)$ and $[e_l(t), e_u(t)]$ be the GEP population of generation $t$ and the absolute energy window, respectively. For any GEP individual $x \in P(t)$, its normalization energy within the absolute energy window is defined by

$$e_r(x) = \frac{e(x) - e_l(t)}{e_u(t) - e_l(t)}. \tag{9}$$

Definition 4. The $k$th rank $R_k$ in the absolute energy window is defined by (10), where $\alpha$ is a scaling factor and $K$ is the number of ranks; if $e_r(x) \in R_k$, then $x$ is located in rank $k$.

Definition 5. Let $P(t)$ and $[e_l(t), e_u(t)]$ be the GEP population of generation $t$ and the absolute energy window, respectively. Moreover, let $N$ be the number of individuals in $P(t)$ and $n_k$ the number of individuals located in the $k$th rank. The rank entropy is defined as follows:

$$H(P(t)) = -\sum_{k=1}^{K} \frac{n_k}{N} \log_K \frac{n_k}{N}. \tag{11}$$

Definition 6. Let $P(t)$ and $[e_l(t), e_u(t)]$ be the GEP population of generation $t$ and the absolute energy window, respectively. The free energy is defined as follows:

$$F(T, P(t)) = \langle E(P(t)) \rangle - T\,H(P(t)), \tag{12}$$

where $T$ is the temperature and $\langle E(P(t)) \rangle$ is the mean energy, which is defined by

$$\langle E(P(t)) \rangle = \frac{1}{N} \sum_{x \in P(t)} e_r(x). \tag{13}$$

Definition 7. Let $P(t)$ and $[e_l(t), e_u(t)]$ be the GEP population of generation $t$ and the absolute energy window, respectively. For any GEP individual $x \in P(t)$ located in rank $k$, where the number of individuals is $n_k$, the component free energy of individual $x$ is defined by

$$F_c(T, x) = e_r(x) + T \log_K \frac{n_k}{N}. \tag{14}$$
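To make the interplay between mean energy, rank entropy, and temperature concrete, here is a minimal sketch. It rests on assumptions that should be checked against the original definitions: the entropy is taken as the Shannon entropy of the rank occupancy with logarithm base equal to the number of ranks, the mean energy is the average of the normalized energies, and the free energy is the mean energy minus temperature times entropy. All function names are illustrative.

```python
import math

def rank_entropy(rank_counts):
    """Entropy of the rank occupancy (assumed log base: number of ranks, at least 2)."""
    total = sum(rank_counts)
    base = len(rank_counts)
    return -sum((n / total) * math.log(n / total, base)
                for n in rank_counts if n > 0)

def free_energy(normalized_energies, rank_counts, temperature):
    """Mean normalized energy minus temperature times rank entropy."""
    mean_energy = sum(normalized_energies) / len(normalized_energies)
    return mean_energy - temperature * rank_entropy(rank_counts)

# Four individuals spread evenly over two ranks versus crowded into one rank.
print(rank_entropy([2, 2]), rank_entropy([4, 0]))
print(free_energy([0.0, 1.0], [1, 1], temperature=0.5))
```

Under these assumptions an evenly spread population has the highest entropy, so at a high temperature diversity dominates the free energy, while at a low temperature the mean energy, and hence the selective pressure, dominates.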

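From the above definitions, we can obtain the following conclusion, whose proof can be found in our previous work [36, 37]:

$$F(T, P(t)) = \frac{1}{N} \sum_{x \in P(t)} F_c(T, x), \tag{15}$$

where $F(T, P(t))$ is the free energy of population $P(t)$ at temperature $T$, $F_c(T, x)$ is the component free energy of individual $x$, and $N$ is the number of individuals in $P(t)$.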

Our objective is to minimize the free energy. According to this conclusion, the free energy equals the mean of the component free energy of every individual in the population. Hence, the minimal free energy can be approximately obtained by keeping the individuals with the smallest component free energy. Next, we present the component thermodynamical selection operator of GEP based on this conclusion.
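The conclusion can be verified in a few lines under assumed forms of the definitions: the component free energy of an individual $x$ in rank $k$ is taken to be $F_c(T, x) = e_r(x) + T \log_K (n_k / N)$, and the rank entropy is taken to be $H = -\sum_k (n_k/N) \log_K (n_k/N)$. Both forms are assumptions made for this sketch. Each rank $k$ contributes $n_k$ identical logarithmic terms to the sum over individuals, which gives:

```latex
\begin{aligned}
\frac{1}{N}\sum_{x \in P} F_c(T, x)
  &= \frac{1}{N}\sum_{x \in P} e_r(x)
     + \frac{T}{N}\sum_{k=1}^{K} n_k \log_K \frac{n_k}{N} \\
  &= \langle E \rangle
     - T\left(-\sum_{k=1}^{K} \frac{n_k}{N}\log_K \frac{n_k}{N}\right) \\
  &= \langle E \rangle - T H .
\end{aligned}
```

The free energy of the population therefore decomposes into per-individual contributions, which is what allows selection to work on individuals rather than on whole candidate populations.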

4.3. Component Thermodynamical Selection Operator of GEP

Based on the definitions in Section 4.2, we now introduce the component thermodynamical selection (CTS) operator of GEP. The main idea of CTS is to pick the $M$ individuals with the largest component free energy from the parent and offspring populations and then eliminate them. Further, it can be proved that the remaining $N$ individuals approximately satisfy the principle of minimal free energy. The proof is similar to that in our previous work [36, 37]. The pseudocode of the CTS operator is presented in Algorithm 2.

Algorithm 2: Component thermodynamical selection operator of GEP
 Step  1  Combine offspring population O(t) with parent population P(t) to generate a temporary population P'(t);
 Step  2  Compute the component free energy of each individual in population P'(t);
 Step  3  Pick the M individuals with the largest component free energy from population P'(t);
 Step  4  Eliminate the M picked individuals from population P'(t) to generate the population P(t+1) for the next generation.
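Steps 3 and 4 of the selection operator can be sketched as follows, assuming the component free energies have already been computed once for the combined population. The function name `cts_select` and the use of a full sort are illustrative choices, not the authors' implementation.

```python
def cts_select(population, component_free_energy, n_eliminate):
    """Drop the n_eliminate individuals with the largest component free energy."""
    # Indices ordered from smallest to largest component free energy.
    order = sorted(range(len(population)),
                   key=lambda i: component_free_energy[i])
    # Keep the smallest ones, restoring their original relative order.
    keep = sorted(order[:len(population) - n_eliminate])
    return [population[i] for i in keep]

survivors = cts_select(["a", "b", "c", "d"], [0.3, 0.9, 0.1, 0.5], n_eliminate=2)
print(survivors)
```

Note that the component free energies are evaluated once and not recomputed after each elimination, which is what keeps the operator cheap; a partial selection such as `heapq.nsmallest` could replace the full sort if needed.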

In the CTS of GEP, we first calculate the component free energy of the individuals of the parent and offspring populations and then eliminate the $M$ individuals with the largest component free energy to compose the next generation population. Using this method, we can select $N$ individuals for the next generation with very low computational cost. Furthermore, the process of computing the component free energy of each individual in the temporary population is shown in Algorithm 3, where $K$ is the number of ranks, NR is an array that records the number of individuals in each rank, and $P'(t)$ is the temporary population.

Algorithm 3: The process of computing the component free energy of each individual
   Step  1  /* initialize the number of individuals in each rank and compute rank */
  for (i = 0; i < K; i++)
  {
        NR[i] = 0; /* initialize the number of individuals in each rank */
        Compute rank R_i according to (10)
  }
   Step  2  /* compute the number of individuals in each rank and obtain the rank of each individual */
  for (each individual x in P'(t))
  {
      for (i = 0; i < K; i++)
      {
         if (the normalization energy of x lies in rank R_i)
         {
            NR[i] = NR[i] + 1;
            rank(x) = i;
            break;
         }
      }
  }
   Step  3  /* compute the component free energy of each individual */
  for (each individual x in P'(t))
  {
     n = NR[rank(x)];
     Compute the component free energy of individual x according to (9) and (14);
  }
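The three steps above can be sketched compactly. Several details are assumptions made only so the example runs: the energy window is taken as the minimum and maximum of the given energies, the ranks are equal-width intervals of the normalized energy (the rank boundaries of the original definition may differ), and the component free energy is the normalized energy plus temperature times the base-K logarithm of the rank's share of the population.

```python
import math

def component_free_energies(energies, num_ranks, temperature):
    """Component free energy of every individual (see assumptions in the comments)."""
    low, high = min(energies), max(energies)   # ASSUMPTION: window = observed range
    span = (high - low) or 1.0                 # avoid division by zero
    normalized = [(e - low) / span for e in energies]
    # ASSUMPTION: equal-width ranks over [0, 1]; the top value goes to the last rank.
    ranks = [min(int(e * num_ranks), num_ranks - 1) for e in normalized]
    counts = [0] * num_ranks
    for r in ranks:
        counts[r] += 1
    total = len(energies)
    return [e + temperature * math.log(counts[r] / total, num_ranks)
            for e, r in zip(normalized, ranks)]

values = component_free_energies([0.0, 0.0, 0.0, 1.0], num_ranks=2, temperature=1.0)
print(values)
```

Computing the rank index directly avoids the inner loop over ranks. In this toy run the isolated high-energy individual receives a lower component free energy than the three crowded low-energy ones, which shows how the entropy term can protect diversity at a high temperature.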

4.4. Algorithm Description of the Proposed CTSGEP

Similar to the traditional GEP, CTSGEP starts by initializing a population of $N$ individuals. Then, at each temperature $T$, it evolves for a fixed number of generations. At each generation, $M$ new individuals are created by the uniform selection, mutation, transposition-insertion, and recombination operators, and then $N$ individuals are selected from the $N + M$ individuals for the next generation using CTS. This process is repeated until the termination criterion is reached. The CTSGEP algorithm is summarized in Algorithm 4.

Algorithm 4: GEP based on component thermodynamical selection
 Step  1   Create a random initial population P(0);
 Step  2   Evaluate the population P(0), and calculate the absolute energy of each individual according to (6);
 Step  3   Set t = 0 and initialize the temperature T and the remaining control variables;
 Step  4   Compute the absolute energy window according to (7);
 Step  5   while (FES < MAX_FES)
     {
       for (each generation evolved at the current temperature T)
       {
         Create M new individuals by the uniform selection, mutation, transposition-insertion and recombination operators;
         Establish the offspring population O(t) by the M new individuals;
         Evaluate the population O(t), and calculate the absolute energy of each individual according to (6);
         Save the best individual;
         Compute the absolute energy window according to (8);
         Utilize CTS operator to select N individuals from P(t) and O(t) for the next generation;
         t = t + 1;
       }
       Update the temperature T and the related control variables;
     }
 Step  6   Output the best individual.

5. Numerical Experiments

5.1. Experimental Setup

In order to evaluate the performance of the proposed CTSGEP algorithm for function finding, in this section we compare CTSGEP with the traditional GEP and some GEP variations, namely IGEP [27], AMACGEP [31], and Mod-GEP [32], on function finding datasets. All of the compared algorithms are implemented in C++.

The function finding datasets are taken from the UCI machine learning repository [38]. There are about 200 test instances for the function finding problems in UCI [38], and we randomly select 15 test instances, which are instances 10, 21, 35, 44, 49, 52, 76b, 84b, 103, 126a, 148c, 155, 163, 182c, and 203.

In our experimental studies, for each algorithm and each test instance, 30 independent runs are conducted with 400000 function evaluations (FES) as the termination criterion. To fairly compare the mentioned algorithms, the common parameter settings of all the algorithms, as used or recommended in [1, 2, 31], are as follows:
(i) head length: 20,
(ii) gene length: 41,
(iii) number of genes: 5,
(iv) linking function: +,
(v) function set: +, −, *, /, pow, sqrt, sin, cos, log, and exp,
(vi) population size: 100,
(vii) mutation probability: 0.08,
(viii) one-point recombination rate: 0.3,
(ix) two-point recombination rate: 0.3,
(x) gene recombination rate: 0.3,
(xi) IS transposition rate: 0.1,
(xii) RIS transposition rate: 0.1,
(xiii) gene transposition rate: 0.1.

In addition, the other parameter values of IGEP [27], AMACGEP [31], and Mod-GEP [32] are the same as in their original papers. The five CTSGEP-specific parameters are set to 20, 20, 10, 2, and 100, respectively. In our experiments, as recommended in [2], the average and standard deviation of the mean square error (MSE) are recorded to measure the performance of each algorithm. The mean square error is calculated by [2]

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (T_i - P_i)^2, \tag{16}$$

where $T_i$ is the target value for sample $i$, $P_i$ is the value predicted by the algorithm for sample $i$, and $n$ is the total number of samples in each dataset.
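The performance measure can be sketched as follows: the MSE of one run, and the mean and standard deviation over independent runs. The use of the sample standard deviation is an assumption, since the text does not say which variant is reported, and the function names are illustrative.

```python
import statistics

def mean_squared_error(targets, predictions):
    """Average squared difference between target and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def summarize_runs(mse_per_run):
    """Mean and sample standard deviation of the MSE over independent runs."""
    return statistics.mean(mse_per_run), statistics.stdev(mse_per_run)

print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(summarize_runs([1.0, 3.0]))
```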

5.2. Comparison between CTSGEP and Other GEP Algorithms

The mean and the standard deviation of the MSE obtained by each algorithm for the 15 test instances are summarized in Table 1. All the results are obtained from 30 independent runs. In addition, the best results among the five algorithms are marked in boldface. In order to draw statistically sound conclusions, a two-tailed t-test at a 0.05 significance level is conducted on the experimental results. The last three rows of Table 1 summarize the experimental results.

Clearly, CTSGEP is the best among the five algorithms on the 15 test instances. According to the two-tailed t-test, it performs significantly better than GEP, IGEP, AMACGEP, and Mod-GEP on fifteen, fourteen, thirteen, and ten test instances, respectively. In addition, GEP cannot outperform CTSGEP on any test instance, while IGEP, AMACGEP, and Mod-GEP only surpass CTSGEP on one, one, and three test instances, respectively.

To compare the performance of these algorithms on the 15 test instances, the average ranking of the Friedman test is conducted by the suggestions considered in [39, 40]. Table 2 reports the average ranking of the five GEP algorithms on the 15 test instances. These GEP algorithms can be sorted by the average ranking into the following order: CTSGEP, Mod-GEP, AMACGEP, IGEP, and GEP. Thus, the best average ranking is obtained by the CTSGEP algorithm, which outperforms the other four GEP algorithms.
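The average ranking used in the Friedman test can be sketched as follows: on each test instance the algorithms are ranked by mean MSE (rank 1 is best, ties share the average rank), and the ranks are then averaged over the instances. The input values below are made up for illustration and are not results from the paper.

```python
def average_ranks(results):
    """results[i][j] = mean MSE of algorithm j on instance i (lower is better)."""
    n_algorithms = len(results[0])
    totals = [0.0] * n_algorithms
    for row in results:
        order = sorted(range(n_algorithms), key=lambda j: row[j])
        pos = 0
        while pos < n_algorithms:
            end = pos
            # Extend the group while the next algorithm has the same MSE (a tie).
            while end + 1 < n_algorithms and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1   # average of the 1-based ranks in the group
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [total / len(results) for total in totals]

# Hypothetical mean MSE values for three algorithms on two instances.
print(average_ranks([[0.1, 0.2, 0.3], [0.2, 0.2, 0.1]]))
```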

To compare the performance differences between CTSGEP and the other four GEP algorithms, we conduct a Wilcoxon signed-ranks test [41, 42] with a significance level of 0.05. Table 3 shows the resulting p-values when comparing CTSGEP with the other four GEP algorithms. The values below 0.05 are typed in bold. From the results, it can be observed that CTSGEP is significantly better than the GEP, IGEP, and AMACGEP algorithms. The difference between CTSGEP and Mod-GEP is not statistically significant; however, CTSGEP performs better than Mod-GEP according to the average rankings shown in Table 2.
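The rank sums behind the Wilcoxon signed-ranks test can be sketched as follows. The sketch computes only the two rank sums for paired results; converting them to a p-value requires the test's reference distribution and is left out. Zero differences are dropped, which is one common convention and an assumption here, and the sample values are made up.

```python
def wilcoxon_rank_sums(a, b):
    """Return (W_plus, W_minus) for paired samples a and b; zero differences are dropped."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # Group differences with equal magnitude so that they share an average rank.
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        for i in order[pos:end + 1]:
            ranks[i] = (pos + end) / 2 + 1
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Hypothetical paired MSE values of two algorithms on four instances.
print(wilcoxon_rank_sums([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 1.0, 7.0]))
```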

In summary, CTSGEP is the winner on these 15 test instances. This may be because CTSGEP quantitatively keeps a balance between the selective pressure and the population diversity during the evolution process, whereas IGEP only employs a dynamic mutation operator to enhance the convergence speed, and AMACGEP and Mod-GEP merely maintain the diversity of the population.

For the convenience of illustration, the evolution of the mean MSE derived from GEP, IGEP, AMACGEP, Mod-GEP, and CTSGEP versus the number of FES is plotted in Figures 2, 3, and 4 for some typical test instances. From Figures 2, 3, and 4, it is clear that CTSGEP exhibits faster and more stable convergence, for it can obtain a compromise between the selective pressure and the population diversity.

5.3. Parameter Sensitivity Study

In this section, we conduct a series of experiments to study two important parameters of CTSGEP: the offspring population size $M$ and the number of ranks $K$. The former is related to the selective pressure, while the latter is correlated with the population diversity.

5.3.1. Sensitiveness to Offspring Population Size

An experiment is conducted to investigate the sensitivity of the CTSGEP algorithm to variations in the offspring population size $M$, based on the 15 test instances described in Section 5.1 over 30 independent runs. Obviously, the offspring population size $M$ is related to the population size $N$. Therefore, we set $M$ as a percentage of $N$ and vary this percentage with a step equal to 5 in the experiment. In addition, all the other parameters of CTSGEP are the same as those in Section 5.1. Results for some typical test instances, reported in Figure 5, show that the performance of CTSGEP changes with the offspring population size $M$. Here, we omit plots for all other test instances as they exhibit a similar behavior. The $x$-coordinate of each plot in Figure 5 represents the offspring population size $M$, while the $y$-coordinate stands for the average MSE over 30 independent runs. Figure 5 indicates the range of $M$ in which CTSGEP performs best.

5.3.2. Sensitiveness to Number of Ranks

The impact of the number of ranks $K$ is investigated using the 15 test instances described in Section 5.1 over 30 independent runs. We keep the parameters of CTSGEP the same as those in Section 5.1 except that $K$ is varied with a step of 5. The results for some typical test instances are shown in Figure 6. Here, we also omit results for all other test instances since they show a similar tendency. Figure 6 indicates the number of ranks $K$ with which CTSGEP works best.

6. Conclusion

GEP is an increasingly popular tool for data mining. However, it tends to suffer from premature convergence and a slow convergence rate when solving complex problems. To address this drawback of GEP, we present a novel GEP based on the component thermodynamical selection operator. CTSGEP, proposed in this paper, is inspired by the principle of minimal free energy in thermodynamics and maps the selective pressure and the population diversity into the mean energy and the entropy, respectively. Further, because the individuals chosen for the next generation satisfy the principle of minimal free energy, the proposed approach can quantitatively keep a balance between the selective pressure and the population diversity of GEP.

The experimental studies in this paper were conducted on 15 test instances of function finding problems taken from the UCI machine learning repository. CTSGEP was compared with the conventional GEP and three GEP variations, that is, IGEP, AMACGEP, and Mod-GEP. The experimental results demonstrated that its overall performance was better than the four competitors. Moreover, the parameters sensitivity study of CTSGEP was also experimentally investigated.

In the future, we will perform a more detailed evaluation of CTSGEP on large-scale data mining problems, which are considered a challenge by the data mining community. In addition, it would be interesting to study how to incorporate parameter adaptation schemes into CTSGEP.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of the paper.

Acknowledgments

The authors would like to thank anonymous reviewers for their detailed and constructive comments which help them to improve the quality of this work significantly. This work was supported in part by the National Natural Science Foundation of China (nos. 61070008, 61303137, and 61364025), by Startup Foundation for Ph.D. of JiangXi University of Science and Technology (no. jxxjbs13028), and by the Natural Science Youth Foundation of Hebei Educational Committee (no. QN20131053).