Abstract

As one of the most effective function mining algorithms, Gene Expression Programming (GEP) has been widely used in classification, pattern recognition, prediction, and other research fields. Through self-evolution, GEP is able to mine an optimal function for dealing with further complicated tasks. However, in big data research, GEP suffers from low efficiency because its function mining process is time consuming. To improve the efficiency of GEP in big data research, especially for processing large-scale classification tasks, this paper presents a parallelized GEP algorithm based on the MapReduce computing model. The experimental results show that the presented algorithm is scalable and efficient for processing large-scale classification tasks.

1. Introduction

In recent years, the Gene Expression Programming (GEP) [1] algorithm has been widely studied due to its significant function mining ability. Compared with other machine learning algorithms such as support vector machines and neural networks, the most remarkable characteristic of GEP is that it can explicitly mine from a dataset the mathematical equation relating the dependent variable to the independent variables. As a result, the equation can be easily stored and employed in future studies of the data. Similar to the Genetic Algorithm, GEP simulates the process of biological evolution to mine the function with the best fitness to represent the data relations. During the evolution, the algorithm employs selection, crossover, and mutation operations to generate offspring. Each individual of the offspring is assessed by a fitness function. An individual with a better fitness has a higher chance of being selected to produce the next generation. The population keeps evolving until a satisfactory function that can describe the data relations is found.

As an effective data analysis approach, classification has been researched extensively. Classification algorithms, especially supervised ones such as artificial neural networks (ANNs), show remarkable classification abilities. However, ANNs are fundamentally also function-fitting algorithms, although they cannot output the fitted functions explicitly as GEP does. This observation suggests that GEP can also be employed to deal with supervised classification tasks using the following idea (a minimal sketch of this idea is given below):

(1) Let the training data be T and the encoded classes be C.
(2) Train the GEP algorithm using T and C to mine a function f.
(3) Input the to-be-classified data x to f and observe the output f(x), which represents the class that x should belong to. Therefore, x is classified.

It also suggests, based on the works [2, 3], that GEP-based classification has great potential to be further applied to large-scale classification tasks.
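To make the three steps concrete, the following Python sketch illustrates the idea; it is not the authors' implementation, and encode_classes, fit_function, and the data shown are hypothetical stand-ins (fit_function replaces the actual GEP mining step).

    # Minimal sketch of the three-step idea above (not the authors' code):
    # encode class labels as numbers, "train" to obtain a function f, then
    # classify a new instance by rounding f(x) back to the nearest class code.

    def encode_classes(labels):
        """Map each distinct class label to an integer code, e.g. {'setosa': 0, ...}."""
        return {label: code for code, label in enumerate(sorted(set(labels)))}

    def fit_function(X, y_codes):
        """Placeholder for GEP mining: here a trivial one-attribute linear fit."""
        scale = sum(y_codes) / max(sum(x[0] for x in X), 1e-9)
        return lambda x: scale * x[0]

    labels = ["a", "a", "b", "b"]
    codes = encode_classes(labels)                       # {'a': 0, 'b': 1}
    X = [[0.2], [0.4], [1.8], [2.0]]
    f = fit_function(X, [codes[l] for l in labels])
    print(round(f([1.9])))                               # nearest class code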

Unfortunately, several works [4–6] pointed out that processing large-scale tasks using GEP may suffer from low efficiency. The reason is that, as a heuristic algorithm, GEP needs an extremely long time to mine the best-fitted function from a large volume of data. Improving the efficiency of large-scale classification with GEP is therefore a focus of this paper. To this end, this paper presents a parallelized GEP algorithm for enabling large-scale classification. The algorithm is designed and implemented in the MapReduce distributed computing environment, and a number of tests based on standard benchmark datasets have been carried out. The experimental results reveal that the parallelized GEP algorithm shows clear advantages in dealing with large-scale classification tasks.

The rest of the paper is organized as follows: Section 2 reviews the related work; Section 3 presents the parallelization of GEP; Section 4 discusses the experimental results; and Section 5 concludes the paper.

2. Related Work

As an effective function mining algorithm, GEP has been widely applied in numerous studies. Sabar et al. [7] employed GEP to design a hyperheuristic framework for solving combinatorial optimization problems. Their experimental results show that the proposed framework has great potential to solve such problems. Hwang et al. [8] employed GEP to predict QoS (Quality of Service) traffic in the Ethernet passive optical network. The authors used GEP to tackle queue variation during waiting times as well as to reduce the high-priority packet delay. Deng et al. [9] employed GEP together with rough set theory to assess security risk in cyber physical power systems. Based on their studies, security risk levels of a cyber physical power system can be accurately predicted.

However, several works [4–6] pointed out that GEP suffers from low efficiency when processing complicated tasks. To address the issue, several researchers have focused on improving the algorithmic components of GEP. Xue and Wu [10] proposed Symbiotic Gene Expression Programming (SGEP), which combines a symbiotic algorithm, an estimation of distribution algorithm, and improved evolution processes with GEP. Their experimental results indicate that SGEP outperforms GEP in terms of efficiency. Chen et al. [11] pointed out that the most computationally expensive part of GEP is the evolution of the expression tree. They therefore proposed the Reduced-GEP algorithm, which is based on chromosome reduction. Their experimental results show that the algorithm is effective and efficient in calculating the fitness and reducing the size of the chromosome. In [12], inspired by the diversity of chromosome arrangements in biology, an unconstrained encoded Gene Expression Programming was proposed. The approach enlarges the function search space, which enhances the parallelism and adaptability of the standard GEP algorithm.

Another effective way of addressing the efficiency issue of GEP is to use parallel or distributed computing. Du et al. [13] proposed an asynchronous distributed parallel GEP algorithm aimed at speeding up convergence towards the optimal solution using MPI (Message Passing Interface) [14]. A standalone GEP algorithm runs on each processor; the processors then exchange their best individuals and continue evolving, and the algorithm stops when a termination message is sent to the processors. Based on their experimental results, the authors claimed that the proposed algorithm greatly speeds up convergence. However, they did not evaluate the algorithm on large volumes of data, and MPI depends heavily on a homogeneous hardware environment, which limits the algorithm's adaptability. Du et al. also proposed a MapReduce [15] based distributed GEP algorithm to process large populations and datasets. Similar to [13], each mapper computes the fitness, and the selection, mutation, and crossover operations are executed in each reducer. The results are written to the distributed file system, where the exchanges of the best individuals occur. Although the authors claimed a speedup, two issues should be discussed. Firstly, the algorithm needs a large number of reducers, which generates system overhead because of the IO operations [2] in the reducers. Secondly, the algorithm needs a large number of iterations, but MapReduce does not natively support iteration. Instead, the algorithm has to submit a series of MapReduce jobs to the cluster, which generates a tremendously large overhead [3]. Browne and dos Santos [16] also discussed the parallelization of GEP using the island model. However, their algorithm does not focus on parallelization, and its performance in processing large volumes of data has not been evaluated.

The improvement of GEP presented in this paper mainly focuses on parallelizing the GEP algorithm for executing large-scale classification. Our algorithm employs the Hadoop framework [17] as the underlying infrastructure and, combined with ensemble techniques, is able to provide efficiency, scalability, and accuracy.

3. Algorithm Design

3.1. Classification Using GEP

Based on selection, crossover, mutation, and fitness, GEP is able to mine a function from a given dataset. Let T denote the training dataset; x denote an instance in T; n denote the length of x; C_j denote the jth class of T; S denote the testing dataset; s denote an instance in S; c_j denote the coded identifier of class C_j; θ denote a threshold; rightNum represent the number of correctly classified instances; and trainNum represent the number of training instances. The classification using GEP is shown in Algorithm 1.

For each class C_j in T
   Encode c_j representing the class of C_j
For each instance x in T
     Input x and its class code c_j into GEP
Let rightNum be 0 and trainNum be the number of instances in T.
Initiate GEP components: function set, link function, selection,
  mutation, crossover, and the fitness function.
In each generation, GEP mines a candidate function f(x);
  if |f(x) - c_j| < θ
    rightNum = rightNum + 1
GEP keeps running
  until the terminating condition (determined generation or fitness) is met
Output f, which represents the mined relation between instances and classes
Let θ denote a threshold
  For each instance s in S
   Compute f(s)
   if |f(s) - c_j| < θ
     s belongs to class C_j
   else
      s is an outlier
Classification terminates
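The fitness evaluation implied by Algorithm 1 can be sketched as follows. This is our interpretation rather than the authors' code: the proportional-hit fitness, the candidate function, and the sample data are illustrative assumptions.

    # Minimal sketch (our interpretation of Algorithm 1, not the authors' code):
    # evaluate a candidate function mined by GEP against the encoded training data.
    # The proportional-hit fitness below is an assumption for illustration.

    def fitness(candidate, training_data, threshold):
        """training_data: list of (instance, class_code) pairs."""
        right_num = sum(
            1 for x, code in training_data
            if abs(candidate(x) - code) < threshold
        )
        train_num = len(training_data)
        return right_num / train_num  # fraction of correctly classified instances

    # Illustrative candidate and data (hypothetical, two attributes per instance).
    candidate = lambda x: x[0] + 0.5 * x[1]
    data = [([0.1, 0.2], 0), ([0.9, 0.3], 1), ([1.8, 0.5], 2)]
    print(fitness(candidate, data, threshold=0.5))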
3.2. MapReduce and Hadoop

The MapReduce computing model provides two main functions, Map and Reduce, to facilitate the development of distributed computing applications. The Map function executes the main computation, and the Reduce function collects the intermediate output of the mappers and generates the final output. Each mapper processes the data instances one by one in the form of key-value pairs <k1, v1> and emits the computed result as an intermediate output <k2, v2>. The reducers collect the intermediate outputs of all mappers; each reducer then merges the inputs having the same key and generates the final output.
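As an illustration of this data flow, the following Python sketch simulates the map, shuffle, and reduce stages outside Hadoop, using word counting as an assumed example; the function names are ours.

    # Minimal simulation of the MapReduce data flow (not Hadoop itself):
    # map -> shuffle (group by key) -> reduce, using word counting as an example.
    from collections import defaultdict

    def map_fn(_, line):                 # input pair <k1, v1>: (offset, text line)
        for word in line.split():
            yield word, 1                # intermediate pair <k2, v2>

    def reduce_fn(word, counts):         # all values sharing one key
        yield word, sum(counts)          # final output pair

    def run_mapreduce(records, map_fn, reduce_fn):
        groups = defaultdict(list)
        for k1, v1 in records:                       # map phase
            for k2, v2 in map_fn(k1, v1):
                groups[k2].append(v2)                # shuffle: group by key
        out = []
        for k2, values in groups.items():            # reduce phase
            out.extend(reduce_fn(k2, values))
        return out

    print(run_mapreduce([(0, "a b a"), (6, "b c")], map_fn, reduce_fn))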

The Hadoop framework [17, 18] is a Java-based implementation of MapReduce. A typical Hadoop cluster consists of two types of nodes: one NameNode and several DataNodes. The NameNode manages the metadata, whilst the DataNodes execute a number of Map (mapper) and Reduce (reducer) operations in parallel. Both the NameNode and the DataNodes contribute their resources, including processors, memory, hard disks, and network adaptors, to form the Hadoop Distributed File System (HDFS) [17]. HDFS is not only responsible for high-performance data storage but also manages the data served to the mappers and reducers, which provides fault tolerance, load balancing, scalability, and heterogeneous hardware support. Figure 1 shows the structure of the Hadoop framework.

3.3. Parallelization of GEP in Enabling Large-Scale Classification

In the training phase, let m denote the number of mappers. The training dataset T can therefore be divided into m data chunks, in which each chunk is represented by t_i and satisfies T = t_1 ∪ t_2 ∪ ⋯ ∪ t_m. Firstly, the m mappers input the data chunks in parallel. Each mapper then initiates one sub-GEP, which starts the function mining according to its input training data. Once a function (a classifier) has been mined in each mapper, a total of m classifiers are generated.
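The chunking can be pictured with the short sketch below; it is a logical illustration only (split_into_chunks is a hypothetical helper), since in the actual algorithm the chunks are stored in HDFS and consumed by the mappers.

    # Minimal sketch: splitting a training set into m chunks, one per mapper.

    def split_into_chunks(training_data, m):
        """Return m nearly equal-sized chunks covering the whole training set."""
        chunk_size = (len(training_data) + m - 1) // m  # ceiling division
        return [training_data[i:i + chunk_size]
                for i in range(0, len(training_data), chunk_size)]

    chunks = split_into_chunks(list(range(10)), m=4)
    print(chunks)   # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]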

In the classification phase, the testing dataset S is also divided into chunks. Each previously trained classifier inputs one testing data chunk and executes the classification. Therefore, the classification using GEP can be parallelized. However, one problem should be mentioned: due to the separation of the training dataset, each sub-GEP in each mapper is trained by only a subset of the original dataset. This insufficient training may lead to a loss of classification accuracy. In order to parallelize GEP classification while avoiding accuracy loss, ensemble techniques including bootstrapping and majority voting are employed.

Bootstrapping is based on the idea of controlling the number of times the training instances appear in the bootstrap samples, so that, across the B bootstrap samples, each instance appears the same number of times [19]. To create balanced bootstrap samples, the following steps can be followed (a sketch of this procedure is given below):

(1) Construct a string of the n instances repeated B times, so that a sequence of length Bn is obtained.
(2) Take a random permutation of the integers from 1 to Bn. The first bootstrap sample is then created from elements 1 to n of the permuted sequence, the second bootstrap sample from elements n+1 to 2n, and so on.
(3) Repeat step (2) until element Bn is reached, which yields the Bth bootstrap sample.

Based on this bootstrapping, the instance distribution of the original dataset is reproduced in the samples, so more of the original data information is preserved. Majority voting is a commonly used combination technique: the ensemble classifier predicts for a test instance the class that is predicted by the majority of the base classifiers [20].
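A minimal sketch of these two ensemble techniques is given below; balanced_bootstrap and majority_vote are our illustrative helpers, not part of the paper's implementation.

    # Minimal sketch of balanced bootstrap sampling plus majority voting.
    import random
    from collections import Counter

    def balanced_bootstrap(instances, B):
        """Create B samples in which every instance appears exactly B times overall."""
        n = len(instances)
        sequence = instances * B          # the n instances repeated B times
        random.shuffle(sequence)          # random permutation of the Bn elements
        return [sequence[i * n:(i + 1) * n] for i in range(B)]

    def majority_vote(predictions):
        """Return the class predicted by most base classifiers."""
        return Counter(predictions).most_common(1)[0][0]

    samples = balanced_bootstrap(["a", "b", "c", "d"], B=3)
    print(samples)                            # three samples of size 4
    print(majority_vote([1, 2, 1, 1, 0]))     # -> 1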

By employing the bootstrapping and majority voting, the parallelized GEP classification algorithm works as follows.

In the training phase (a sketch of this workflow is given below):

(1) The algorithm first generates a number of bootstrapped sample sets using the training dataset T. Each set is saved as one data chunk stored in HDFS.
(2) Each mapper initiates a sub-GEP and inputs one data chunk from HDFS.
(3) Each mapper trains its sub-GEP according to the training steps of Algorithm 1. Once the training terminates, the mined function is collected by the reducer and saved into HDFS in the form of key-value pairs.
(4) As the sub-GEP in each mapper mines its function individually, a number of different functions are finally saved, which means a number of weak classifiers are created.

Figure 2 shows the training phase of the algorithm.
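The training workflow can be sketched logically (outside Hadoop) as follows; train_sub_gep is a placeholder for the actual sub-GEP evolution, and all names and data are assumptions for illustration.

    # Logical (non-Hadoop) sketch of the training phase: one mapper per
    # bootstrap sample trains a sub-GEP and emits its mined function.

    def train_sub_gep(sample):
        """Placeholder: evolve a function on one bootstrap sample of (x, code) pairs."""
        # A real implementation would run selection/crossover/mutation here.
        mean = sum(x for x, _ in sample) / len(sample)
        return lambda x: x - mean          # hypothetical mined function

    def training_phase(bootstrap_samples):
        # "Map": each sample is processed independently (in parallel on Hadoop).
        intermediate = [(i, train_sub_gep(s)) for i, s in enumerate(bootstrap_samples)]
        # "Reduce": collect the mined functions (stored in HDFS in the paper).
        return dict(intermediate)          # mapper id -> weak classifier

    classifiers = training_phase([[(0.1, 0), (0.9, 1)], [(0.2, 0), (1.1, 1)]])
    print(len(classifiers))                # number of weak classifiers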

In the classification phase (a sketch of this workflow is given below):

(1) Each mapper retrieves one function from HDFS, so that the mapper becomes one weak classifier.
(2) All the mappers then input the same instance s from the testing dataset S.
(3) In each mapper, when s is input, the value f(s) is computed. By comparing f(s), the coded class identifier c_j, and the threshold θ according to the classification step of Algorithm 1, s can be classified.
(4) The ith mapper outputs its classification result as an intermediate key-value pair.
(5) One reducer collects the intermediate outputs from all the mappers and merges the outputs having the same key into one group. Within the group, majority voting is executed to determine the final classification result.

Figure 3 shows the classification phase.
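The classification workflow, including the reduce-side majority voting, can be sketched as follows; the weak classifiers, class codes, and threshold shown are illustrative assumptions rather than the paper's mined functions.

    # Logical (non-Hadoop) sketch of the classification phase: every weak
    # classifier scores the same test instance, a single "reducer" groups the
    # votes, and majority voting decides the final class.
    from collections import Counter

    def classify_instance(f, s, class_codes, threshold):
        """Map side: pick the class whose code is nearest to f(s) within threshold."""
        label, code = min(class_codes.items(), key=lambda kv: abs(f(s) - kv[1]))
        return label if abs(f(s) - code) < threshold else None   # None = outlier

    def classification_phase(classifiers, s, class_codes, threshold=0.5):
        # "Map": each weak classifier votes on the same instance s.
        votes = [classify_instance(f, s, class_codes, threshold) for f in classifiers]
        # "Reduce": majority voting over all votes for this instance.
        return Counter(v for v in votes if v is not None).most_common(1)[0][0]

    # Hypothetical weak classifiers and class codes, for illustration only.
    classifiers = [lambda x: x * 0.9, lambda x: x * 1.1, lambda x: x * 0.8]
    print(classification_phase(classifiers, 1.0, {"c0": 0, "c1": 1, "c2": 2}))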

4. Algorithm Evaluation

4.1. Experimental Environment

In order to evaluate the algorithm performance, a physical Hadoop cluster consisting of one NameNode and four DataNodes is established. The details of the cluster are listed in Table 1.

The datasets employed in the experiments are Iris dataset [21] and Wine dataset [22]. The details of the datasets are listed in Table 2.

The parameters of GEP used in the experiments are listed in Table 3.

4.2. Accuracy of the Classification

Let rightNum represent the number of correctly classified instances and wrongNum represent the number of wrongly classified instances. The classification accuracy is then defined as

accuracy = rightNum / (rightNum + wrongNum).

In the following tests, increasing numbers of instances are selected from the two datasets as training instances, whilst the remaining instances are used for testing. The bootstrapping number is four, and the algorithm starts eleven mappers for executing the parallelized GEP classification. The experimental results are shown in Figures 4–14 and in the lists The Functions in Each Sub-GEP for Classifying Iris Dataset and The Functions in Each Sub-GEP for Classifying Wine Dataset in the appendix.

Figure 4 shows the classification accuracy on the Iris dataset with the number of training instances increasing from 12 to 105. It can be observed that the parallel GEP algorithm is highly stable and outperforms the standalone GEP algorithm. Visualizations of the classification results are given in Figures 11 and 12 and in The Functions in Each Sub-GEP for Classifying Iris Dataset in the appendix.

The Wine dataset has also been employed to evaluate the classification accuracy. Compared with the Iris dataset, each instance of the Wine dataset has 13 attributes, which may impact the classification accuracy. The experimental result is shown in Figure 5.

Figure 5 shows the classification accuracy on the Wine dataset with the number of training instances increasing from 12 to 118. The result shows that, because the instances have more attributes, the accuracy of the parallel GEP starts to fluctuate. However, in most of the tests the parallel GEP still outperforms the standalone GEP in terms of accuracy, which further indicates that the ensemble techniques help to improve the classification accuracy. Visualizations of the classification results are given in Figures 13 and 14 and in The Functions in Each Sub-GEP for Classifying Wine Dataset in the appendix.

To further evaluate the effectiveness of the proposed algorithm, we also implemented a backpropagation neural network (BPNN). The comparisons of classification accuracy are shown in Figure 6.

Figure 6 indicates that, in terms of classification accuracy on the Iris dataset, the parallel GEP algorithm outperforms BPNN. Although the neural network also performs well, it gives lower classification accuracy when the number of training instances is small.

Figure 7 indicates that, in terms of classification accuracy on the Wine dataset, the parallel GEP algorithm greatly outperforms BPNN. Because the dataset has more attributes, it is difficult for BPNN to correctly classify most of the testing instances. By contrast, the parallel GEP still maintains a higher accuracy.

It should be noted that the bootstrapping number, which represents the number of times each training instance appears in the bootstrap samples, also impacts the algorithm accuracy. Figure 8 therefore shows the classification results with increasing bootstrapping numbers. The Wine dataset is selected as the experimental dataset, in which 118 instances are used for training and the remaining 60 instances for testing.

In Figure 8, it can be observed that the classification precision keeps increasing while the bootstrapping number is less than 6; after that, the precision varies only slightly. Figure 8 clearly shows that enlarging the bootstrapping number initially improves the classification; however, once the bootstrapping number reaches a certain value, the classification precision cannot be improved further.

4.3. Running Time of the Classification

In this section, the Wine dataset is selected as the experimental dataset, and the algorithm processing time for increasing training data sizes is evaluated. In the following tests, the bootstrapping number is initially 4, which means that each training instance appears 4 times. The number of training instances is 118, whilst the number of testing instances remains 60. The training data are then duplicated so that the data size grows from approximately 0.5 MB to 1024 MB. It should be pointed out that, because of the duplication, the bootstrapping number changes from 4 to 4d, where d represents the number of duplications. However, this section focuses only on the algorithm efficiency. Therefore, although the varying bootstrapping numbers may slightly affect the classification precision according to Figure 8, Figure 9 highlights the algorithm processing time with increasing training data sizes.

Figure 9 shows that when the training data size is small, the performance of the standalone GEP and the parallel GEP is nearly the same. However, when the data size becomes larger, the parallel GEP outperforms the standalone GEP. When the data size exceeds 256 MB, the standalone GEP cannot finish the classification due to memory limitations. By contrast, the parallel GEP still works well even when the data size increases to 1024 MB.

To further compare the classification efficiency with that of other classification algorithms, the MapReduce based parallel backpropagation neural network algorithms (MRBPNN 1, 2, and 3) [2] are also implemented. The comparisons are shown in Figure 10.

Figure 10 shows that, in terms of running time, the parallel BPNN algorithms MRBPNN 1 and 2 outperform the parallel GEP. The main reason is that GEP needs a longer time to evolve, whereas MRBPNN 1 and 2 need a shorter time to train the neurons. Although the parallel GEP runs slower than MRBPNN 1 and 2, it provides higher classification accuracy according to Figures 6 and 7.

5. Conclusion

This paper presents a parallel Gene Expression Programming algorithm based on MapReduce and ensemble techniques for enabling large-scale classification. The parallelization of GEP mainly focuses on parallelizing the training phase (the function mining phase), which is the most time-consuming and computationally intensive process. The experimental results show that the presented algorithm outperforms the standalone GEP and BPNN in terms of classification accuracy. In the running-time evaluations, the presented parallel GEP also shows remarkable performance compared with the standalone GEP. Although the parallel GEP runs slower than MRBPNN 1 and 2, it provides higher classification accuracy, which makes the presented parallel GEP an effective tool for dealing with large-scale classification.

Appendix

In the appendix, the details of classifying the Iris and Wine datasets are listed. Figure 11 visualizes the classification result of the standalone GEP for the Iris dataset.

Figure 11 indicates that GEP has the ability to mine a function f that classifies the instances into three classes. In this case, the mined function f is represented by:

The Iris dataset classification results of the eleven sub-GEPs employed by parallel GEP are shown in Figure 12.

Figure 12 shows that the eleven sub-GEPs have different classification results due to their differently mined functions. However, because of the majority voting in the reducer, the parallel GEP is able to output a correct classification result. The eleven mined functions are listed as follows.

The Functions in Each Sub-GEP for Classifying Iris Dataset

Figure 13 visualizes the classification result of the standalone GEP for the Wine dataset.

In this case, the mined function f is represented by:

The Wine dataset classification results of the eleven sub-GEPs employed by parallel GEP are shown in Figure 14.

The eleven mined functions are listed as follows.

The Functions in Each Sub-GEP for Classifying Wine Dataset

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this article.

Acknowledgments

The authors gratefully acknowledge the support of the National Natural Science Foundation of China (no. 51437003).