Research Article | Open Access
Using Genetic Programming with Prior Formula Knowledge to Solve Symbolic Regression Problem
A researcher can infer mathematical expressions of functions quickly by using his professional knowledge (called Prior Knowledge). But the results he finds may be biased and restricted to his research field due to limitation of his knowledge. In contrast, Genetic Programming method can discover fitted mathematical expressions from the huge search space through running evolutionary algorithms. And its results can be generalized to accommodate different fields of knowledge. However, since GP has to search a huge space, its speed of finding the results is rather slow. Therefore, in this paper, a framework of connection between Prior Formula Knowledge and GP (PFK-GP) is proposed to reduce the space of GP searching. The PFK is built based on the Deep Belief Network (DBN) which can identify candidate formulas that are consistent with the features of experimental data. By using these candidate formulas as the seed of a randomly generated population, PFK-GP finds the right formulas quickly by exploring the search space of data features. We have compared PFK-GP with Pareto GP on regression of eight benchmark problems. The experimental results confirm that the PFK-GP can reduce the search space and obtain the significant improvement in the quality of SR.
Symbolic regression (SR) is used to discover mathematical expressions of functions that can fit the given data based on the rules of accuracy, simplicity, and generalization. As distinct from linear or nonlinear regression that efficiently optimizes the parameters in the prespecified model, SR tries to seek appropriate models and their parameters simultaneously for a purpose of getting better insights into the dataset. Without any prior knowledge of physics, kinematics, and geometry, some natural laws described by mathematical expressions, such as Hamiltonians, Lagrangians, and other laws of geometric and momentum conservation, can be distilled from experimental data by the Genetic Programming (GP) method on SR .
Since SR is an NP-hard problem, some evolutionary algorithms were proposed to find approximate solutions to the problem, such as Genetic Programming (GP) , Gene Expression Programming (GEP) , Grammatical Evolution (GE) [4, 5], Analytic Programming (AP) , and Fast Evolutionary Programming (FEP) . Moreover, recent researches in SR problem have taken into account machine learning (ML) algorithms [8–10]. All of the above algorithms randomly generate candidate population. But none of them can use various features of known functions to construct mathematical expressions adapted for describing the features of given data. Therefore, these algorithms may exploit huge search space that consists of all possible combinations of functions and its parameters.
Nevertheless, a researcher always analyzes data, infers mathematical expressions, and obtains results according to his professional knowledge. After getting experimental data, he observes the data distribution and their features and analyzes them with his knowledge. Then, he tries to create some mathematical models based on natural laws. He can obtain the values of coefficients in these models through regression analysis methods or other mathematical methods. And he evaluates the formulas which are mathematical models with the values by using various fitness functions. If the researcher finds some of the formulas that fit the experimental data, he can transform and simplify these formulas and then obtain the final formula that can represent the data. Furthermore, his rich experience and knowledge can help him to reduce the searching space complexity so that he can find the best fit mathematical expression rapidly. As the researchers use their knowledge to discover the best fitted formulas, the methods that inject domain knowledge into the process of SR problem solving have been proposed to improve performance and scalability in complex problem [11–13]. The domain knowledge, which is manually created by the researcher’s intuition and experience, is of various formulas which are prior solutions to special problems. If the domain knowledge automatically generates some fitted formulas that are used in evolutionary search without the researcher involvement, the speed of solving SR problem will be quickened. A key challenge is how to build and utilize the domain knowledge just like the researcher does.
In this paper, we present a framework of connection between Prior Formula Knowledge and GP (PFK-GP) to address the challenge:(i)We classify a researcher’s domain knowledge into PFK Base (PFKB) and inference ability after analyzing the process that a research discovered formulas from experimental data (Section 2). PFKB contains two primary functions: classification and recognition. The aim of two functions is to generate feature functions which can represent the feature of experimental data.(ii)In order to implement classification, we use the deep learning method DBN [14, 15] which, compared with other shallow learning methods (Section 3.1), can classify experimental data into a special mathematical model that is consistent with data features. However, the classification method may lead to overfitting because the method can only categorize experimental data into known formula models which come from the set of training formula models.(iii)Therefore, recognition is used to overcome the overfitting. It can extract mathematical models of functions that can show full or partial features of experimental data. Three algorithms GenerateFs, CountSamePartF, and CountSpecU (see Algorithms 2, 3, and 4) are designed to implement recognition. For example, from the dataset generated by , the basic functions sin, exp, and cube can be found by the above three algorithms. In Figure 1, the function sin shows the periodicity of data, and exp or cube shows the growth rate of data. Therefore, these basic functions (called feature functions) can describe some features of the dataset.(iv)The inference ability is concluded to the searching ability of evolutionary algorithm. As researches infer mathematical models, GP is used to combine, transform, and verify these models. These feature functions that are generated by PFKB are selected to be combined into the candidate population in the light of algorithm randomGenP (see Algorithm 5). With the candidate population, GP can get convergent result quickly because it searches answers in a limit space which is composed of various feature functions. Through experiment on eight benchmark problems (Table 5 ), the results demonstrate that PFK-GP, compared with Pareto optimization GP [16, 17], shows the significant improvement in accuracy and convergence.
2.1. Definition and Representation of Mathematical Expression
In this section, we will define concepts about SR problem and show how to represent these concepts by applying BNF expression. For SR problem, the word “formula” is the general term which describes mathematical expression that fits the given data. We define a formula model is a special mathematical model in which formulas have the same relationships and variables except for different coefficient values. Relationships can be described by operators, such as algebraic operators, functions, and differential operators (http://en.wikipedia.org/wiki/Mathematical_model). Therefore, a formula model is a set where each element is a formula. For example, the two formulas and belong to the formula model . Data that are represented by different formulas in one formula model may have similar features which are data distributions, data relationships between different variables, data change laws, and so on, because these formulas have the same relationships.
In order to represent a formula model and its corresponding formulas, we define the following BNF expressions: , , , , ∣ , , ,
where is a formula model. is a parameter set. indicates a coefficient set. is a set of binary functions, while is a set of unary functions. is a set of atomic functions which does not contain any subfunctions. is a set of complex functions which contains complex functions in and atomic functions in . With the above definitions, any formulas and its corresponding model can be shown by these BNF expressions. For instance, the formula is represented by and , and its subfunction is represented by . The constants , , and are shown by elements in . With these expressions, a formula model can be transformed into one tree. And the tree is a candidate individual in population of GP solving SR problem. Every subtree in the tree is a subformula which can show special data features. A subtree that shows features of experimental data is called feature-subtree. If a tree has more feature-subtrees, the tree is more likely to fit the data. How to construct the tree consisting of feature-subtrees is a key step in our method which is implemented by the algorithm randomGenP (see Algorithm 5).
2.2. The Process of Researcher Analyzing Data
The process that a researcher tries to solve SR problems is shown in Figure 2. He depends heavily on his experience which is obtained through a long-term accumulation of study and research. After a researcher collected experimental data, he discovers regular patterns from data by using the methods of data analysis and visualization. He then constructs formula models which were consistent with these regular patterns according to his experiences. After that, he computes the coefficient values in formula models by using appropriate regression methods and obtains some formulas from different formula models. According to results of evaluating these formulas, he chooses the formula that is most fitted to the data. If the formula cannot represent data features, he needs to reselect a new formula model and do the above steps until one fitting formula is found.
We think the researcher’s experience and knowledge have two roles in processing SR problem. One role is Prior Formula Knowledge (PFK) which can help a researcher to quickly find fitted formulas that match experimental data features. Through study and work, the researcher accumulates his domain knowledge of various characteristics of formula model. When the researcher observes experimental data, he can apply his domain knowledge to recognize and classify the data. The other is the ability of inference and deduction which can help the researcher to combine, transform, and verify mathematical expression. We conclude that the PFK contains two primary functions: classification and recognition.
Classification. when experimental data features are in accord with characteristics of one formula model in PFK, the dataset can be categorized into the model. The prerequisite of classification is that different formula models have different characteristics in PFK Base. As shown in Figure 3, six families of curves are generated by six formula models taking different coefficient values. The curves in the same family show similar data features while the curves in different families show different data features. Therefore, we can infer that the curves (including surfaces and hypersurfaces) generated by different formula models can be classified according to their data features.
Although many machine learning algorithms such as linear regression , SVM , Boosting , and PCVMs  can be used to identify and classify data, it is difficult for these algorithms to classify these curves. That is because these algorithms depend on features that are extracted manually from data, while these features from different complex curves are difficult to be represented by a feature vector which is built based on the researcher’s experiences. In contrast to these algorithms, DL can automatically extract features and have a good performance for the recognition of complex curves, such as image , speech , and natural language . The GenerateFs algorithm (see Algorithm 2) based on DBN is shown to classify the data.
Recognition. Some formulas can represent remark features of curves generated by formula model. For example, after observing the curve in Figure 1, a researcher can easily infer that the formula sin or cos is one of formulas that constitute the curve because data in curve show periodicity. Therefore, these formulas are called feature functions that can be recognized or extracted by PFK. Algorithms CountSamePartF and CountSpecU (see Algorithms 3 and 4) are built to recognize the feature functions.
Recognition can help the researcher overcome overfitting of results that are generated by classification because classification can help researcher to only identify formula models from training set while recognition can help the researcher identify subformula models that are consistent with local data features.
The ability of inference and deduction is one of main measurements for evaluating performance of artificial intelligence methods. In the SR problem, GP, compared with other methods such as logical reasoning, statistical learning, and genetic algorithm, is a revolutionary method of searching fitting formulas because it can seek the appropriate formula models and their coefficient values simultaneously by evolving the population of formula individuals. Therefore, in the paper, we use GP as the method of inferring and deducing formulas.
To optimize GP, researchers have proposed various approaches, such as optimal parsimony pressure ; Pareto front optimization  and its age-fitness method  are used to control bloat and premature convergence in GP. In order to reduce the space complexity of searching formulas, the methods of arithmetic  and machine learning are injected into GP. In the paper, with the algorithm randomGenP (see Algorithm 5) about generating population and the method of Pareto front optimization, PFK-GP can research the formula model in the appropriate space and can find right formulas quickly.
3. Genetic Programming with Prior Formula Knowledge Base
3.1. Formula Prior Knowledge Base
The FPK needs to have the ability of identifying and classifying the formula model based on data features. Although the features between formula models are different, it is difficult to extract features from data which are generated by these models because different formula models represent seemly similar but different features. Based on the above definitions in formula model, the features among functions in set are different. The features between the function and the function may be similar if is the parameter of . As shown in Figure 4, these functions , , and belonging to constitute , such as and , and are shaped into the final function . shows periodicity in results; shows slow variation trend in results; shows high variation trend in results. So, these features in the three functions are different. However, there are similar features between and , because the two functions have high variation trend in result. The shallow learning method such as SVM can hardly identify complex features of functions shown in Figure 5.
In this paper, DBN is used to classify data into a special formula model according to features that are automatically extracted from data. Generally, DBN is made up of multiple layers that are represented by Restricted Boltzmann Machine (RBM). As RBM is an universal approximate of discrete models , we can speculate that DBN which has many layers of RBM can recognize the features of data generated by complex function just like Convolutional Neural Network (CNN) classifying image . The process of DBN recognizing formulas is illustrated in Figure 4. DBN can extract features from data, layer by layer. So, the lower RBM layer can represent the simple functions and the higher RBM layer can transform the simple functions into more complex functions.
We use the data generated by the different formula models (see Table 7) as training samples. DBN is trained by these samples. The model that is finally gained by DBN training methods is PFKB, which is aimed at identifying the formula model that can represent features of the data. The process of DBN training is outlined in algorithm TrainDBN (Algorithm 1) which uses the same steps as mentioned in [14, 15].
PFKB is only changed with formula models. If there are no new trained formula models in an application, the algorithm TrainDBN will not be executed. When the number of trained formula models is large enough, little new formula model will appear, and PFKB will seldom be changed. In the paper, TrainDBN is performed exactly once in order to generate PFKB.
3.2. Classification and Recognition with PFKB
In order to deal with the problem of how to classify and recognize formula model from data, we should consider the problem from two aspects. One situation is that data can be represented by a special formula model from PFKB, while the other one is that data cannot be represented by a formula model from PFKB. In the first case, we exploit PFKB to identify formula models of data by DBN classification. Based on ordered results of DBN classification, we gain a set of formula models () which are most similar to features of the data. The process that deals with the first case is outlined in algorithm GenerateFs. The algorithm is fast because PFKB has been built by TrainDBN, and is small integer value.
In the second case, when a researcher observes laws that are hidden in experimental data, he often tries to find some formulas which are consistent with partial features of the data. Therefore, we propose the two assumptions as follows.
Assumption 1. More formula models s have the same subformula model pf in the set Fs which is the result of GenerateFs running, more strongly that the pf can express features of data.
In order to compute the same pf in Fs, we express the formula model as the string of expression and seek the same part of them by using intersection between the two strings (without considering the elements in sets X and A). Define the intersection between two expressions as follows:For example, , , , The method, which obtains pf whose frequency of occurrence in f is larger than threshold t, is described as the algorithm CountSamePartF.
For verifying Assumption 1, we apply the dataset from (see Table 5) as the testing set and get the identifying results from Table 7 through the algorithm GenerateFs. The intersections between the top two formulas are and , which are partial mathematical expressions in . And we use the dataset from Table 6 to test and get the identifying results through GenerateFs algorithm. The intersection between two expressions is as follows: , , . We find that the elements which have more frequency of occurrence in the intersections set are more likely to express some data features. The above two experiments illustrate that Assumption 1 is rational.
Assumption 2. If function exists in Fs obtained by GenerateFs and the number of the same is larger than threshold , we can conclude that can show some local data features.
The function except is common function, which has a high probability of occurrence in mathematical expressions. Therefore, it is difficult to express special data features. Compared with , the function can show obvious features of data. For instance, presents the periodicity of data and represents data features about extreme increase or decrease. The method, which obtains the special function that can show the local data features, is outlined as the algorithm CountSpecU.
For verifying Assumption 2, we also choose the dataset which are generated from (see Table 5) as the testing data and apply the CountSpecU algorithm to calculate the special among . The result of the algorithm is shown in Table 1. We find the result ( and are the operators of the same kind) is part of . Hence, we can discover that the u set, which is gained by the algorithm CountSpecU, can show local features of the dataset.
3.3. GP with Prior Formula Knowledge Base
In order to deal with SR problem, GP is executed to automatically composite and evolve mathematical expression. The process of GP is similar to the process that a researcher transforms formula models and obtains fitting formulas based on his knowledge. Since those algorithms in PFKB, which is created based on analyzing the process of how a research infers fitted formulas, can recognize formula models that are consistent with data features, we combine these formula models of PFKB recognizing into the process of GP in order to reduce the searching space and increase the speed of discovering right solutions.
When initializing GP algorithm, we can select candidate formulas from Fs, C, and specU as individuals in population of GP. The sets Fs, C, and specU are gained by the above algorithms in PFKB. Therefore, the PFKB is injected into the process of population generating. And this population can contribute to reserving data features as much as possible and reducing the searching space because these individuals commonly have good fitness value. With the population, GP algorithm can speed up the convergence and improve the accuracy of SR results. However, it may lead to the bias results. To overcome the problem, some random individuals must be imported into the population. The process of population creating is as follows.
Firstly, the elements in sets Fs and are inserted into the population. Then, the set specU and the candidate function sets and U are merged into the new candidate function queue Q. And the number of elements in specU is twice as much as the other elements in Q because . Those elements in specU are more likely to be part of individuals in the population after applying the method traditionalRandomIndividual  which is designed to generate randomly individuals from the special function set. At last, the rest of individuals of population are created by traditionalRandomIndividual with sets B and U. The process of population generating is described as the algorithm randomGenP.
Generally, , where is the number of individuals in population. Furthermore, in order to enhance the affection of PFKB in the process of GP evolution, the method randomGenP is used to create new individuals in every few generations of evolutionary computation. Meanwhile, the method of Pareto front  is introduced into the algorithm PFK-GP to balance the accuracy against the complexity of model. The detail of algorithm PFK-GP is shown in Algorithm 6.
In the experiments, we employ DBN in the DeepLearnToolbox  to classify formula models and build the algorithm PFK-GP based on GPTIPS . The 39 formula models in Table 7 are composed of formulas from [31, 32] and some formulas are created by ourselves. The data generated by these 39 formula models is used as training data of algorithm DBN to create PFKB. The formula models in Table 5 are used to generate the testing data for verifying accuracy of algorithms GenerateFs and PFK-GP. The formula models in Table 6 are devoted to validating the two algorithms CountSamePartF and CountSpecU (see Algorithms 3 and 4).
For most formula models from Tables 5, 6, and 7, we sampled them by equal step taking their parameter values from the range [−49, 50]. For some particular formulas, we also sample them with a special equal step from special numerical scope. For example, the value in is in the range [0, 99], the value in ranges between 1 and 100. We create 500 groups of different parameters value in each formula model. The coefficients in these formula models are fetched with equal step from the range [−2.996 3.0]. When all coefficients of a formula model take special values, the formula model generates a formula, namely, a sample of the formula model. We create 7500 groups of different coefficients in each formula model. So, each formula model has 7500 samples where each sample has 500 groups of different parameters value. We take 6000 samples of these samples as training data and the others as test data.
We adopt DBN as the classification model and compare it with SVM that is implemented by the tool libsvm . The training and testing data for the two algorithms are originated from formula models . The parameter values in DBN and SVM are illustrated as Table 2. We take the first five formulas from Fs generated by GenerateFs as a result set of recognition. If the test formula is included in the set, we think that the recognition result is correct. The accuracy of recognition results of DBN and SVM is showed as Figure 5. The DBN method can help to classify all kinds of test data into its fitted formula models. However, the SVM method can only correctly classify several kinds of test data. The overall average accuracy of DBN classification is 99.65%, while the accuracy of SVM is 26.72%. The result demonstrates that DBN is more suitable for recognizing data generated by mathematical expression, because DBN can automatically extract features from the data, layer by layer, and is similar to composition of formula which is constituted by its subformulas.
We set parameters in GP with Pareto optimization (PO-GP)  and PFK-GP as shown in Table 3. For data generated by (see Table 7, coefficient is −2.098, is −2.998), PO-GP and PFK-GP deal with the SR problem, respectively. The result was illustrated as Figure 6, where
We could find that after processing test data of formula model , PFK-GP found its best model at the first generation and its fitness is higher, while PO-GP found its best model until 718th generation and its fitness is much lower than that in PFK-GP. The PFK-GP can get the right formulas quickly because the model recognized by the algorithm GenerateFs is inserted into the initialized population of evolutionary computation. For the formula models whose characteristics are consistent with data features in PFKB, they can be recognized with high probability and can be combined into population of PFK-GP. The PFK-GP can firstly search the coefficients in these formula models and get the mathematical expression with good fitness value. Therefore, the algorithm GenerateFs can speed up the process of PFK-GP dealing with SR and can improve the accuracy of SR results.
In order to test whether PFK-GP can overcome overfitting or not, a dataset is created by which has not existed in the training models of PFKB. The two algorithms PO-GP and PFK-GP are, respectively, applied to process the dataset. The two algorithms, which run, respectively, 100 and 1000 generations, have similar convergence curves in Figure 8. However, PFK-GP can find better fitness results compared with PO-GP, because PFK-GP searches fitted solution in the space includes more functions whose data features are in accord with . Since the initial population, which is generated by the algorithms (CountSamePartF and CountSpecU) in PFKB, contains subformulas in formula models which are recognized by PFKB and represents data features of these subformulas, PFK-GP can find the right formulas which are more fitted to the raw dataset.
In order to observe overall performance of the PFK-GP, we select six datasets as testing set. Three of them generated by formula models () from Table 7 are involved in the process of training DBN, while the other three generated by formula models () from Table 5 are not involved in that process. The two algorithms PFK-GP and PO-GP are executed, respectively, ten times in order to gain the right formulas from the six different datasets. The six results of mean training error gained by the two algorithms are shown in Figure 9. And the average results from six groups of mean training errors are listed in Figure 7. The PFK-GP(E) and GP(E) are the average results of , and , while PFK-GP(P) and PO-GP(P) are the average results of , and . We can conclude that the comprehensive performance of the PFK-GP is better than that of the PO-GP based on the results in Figures 7 and 9, because the algorithm PFK-GP utilizes the method GenerateFs to find the fitted formula model directly and the methods CountSamePartF and CountSpecU to identify subformula models which have data features consistent with test set. The best mathematical expressions PFK-GP and PO-GP found are listed in Table 4.