Stock Market Forecasting Using Restricted Gene Expression Programming
Stock index prediction has been considered a difficult task over the past decade. In order to predict stock indexes accurately, this paper proposes a novel prediction method based on the S-system model. Restricted gene expression programming (RGEP) is proposed to encode and optimize the structure of the S-system. A hybrid intelligent algorithm based on brain storm optimization (BSO) and particle swarm optimization (PSO) is proposed to optimize the parameters of the S-system model. Five real stock market indexes, namely the Dow Jones Index, Hang Seng Index, NASDAQ Index, Shanghai Stock Exchange Composite Index, and SZSE Component Index, are collected to validate the performance of our proposed method. Experimental results reveal that our method performs better than the deep recurrent neural network (DRNN), flexible neural tree (FNT), radial basis function (RBF) neural network, backpropagation (BP) neural network, and ARIMA for 1-week-ahead and 1-month-ahead stock prediction problems. Our proposed hybrid intelligent algorithm also converges faster than PSO and BSO.
The stock market plays a leading and crucial role in the market mechanism, connecting savers and investors [1, 2]. The operating mechanism of the stock market reflects the situation of the national economy and is recognized as the signal system of the national economy [3, 4]. Because of uncontrollable factors such as economic growth, the economic cycle, interest rates, fiscal revenue and expenditure, money supply, and prices, the prediction of the stock market index is considered to be a difficult job [5–7].
Many machine learning (ML) methods, including statistical models, artificial neural networks, and hybrid prediction models, have been proposed to model and predict stock indexes. As a classical statistical model, the ARIMA model has been applied to predict the New York Stock Exchange (NYSE) and the Nigeria Stock Exchange (NSE), and the results revealed that the ARIMA model performed better for short-term prediction [8–10]. Compared with the ARIMA model, the artificial neural network (ANN) model has stronger prediction and modeling ability. Adebiyi et al. compared ARIMA and ANN models for stock price prediction and found that the forecasting model based on the ANN approach outperformed ARIMA models.
In the past decades, many ANN models have been employed for solving real problems, especially stock market price forecasting [12, 13]. Dong et al. presented backpropagation (BP) neural networks for stock prediction. A feedforward ANN was proposed to predict the price movement of the stock market. Akita et al. proposed a novel deep learning method based on paragraph vectors and long short-term memory (LSTM) to predict the Tokyo Stock Exchange. Rout et al. used the radial basis function (RBF) neural network to forecast the DJIA and S&P 500 stock indices. Wang et al. proposed a novel method based on the complex-valued neural network (CVNN) and the Cuckoo search (CS) algorithm to forecast stock prices. Chen et al. presented the flexible neural tree (FNT) ensemble technique to analyze 7-year Nasdaq-100 main index values and 4-year NIFTY index values.
However, the existing methods mainly train a black-box model on the training samples. The model changes its internal structure and parameters to approximate the training samples, but the resulting model cannot display a distinct input-output relationship or give insight into the internal mechanisms of real-world problems. Moreover, in most of these methods all variables are fed into the model, which easily leads to overfitting. Recently, methods based on mathematical formulations have been proposed to predict time series; these can clearly indicate the mathematical relationship between the input data and the output data. Zuo et al. utilized gene expression programming (GEP) to identify differential equations for time series prediction. Graff et al. proposed genetic programming (GP) to forecast time series. Grigioni et al. proposed a modified power-law mathematical model to predict the blood damage sustained by red cells from the load history. Mina et al. proposed a beta-function formula to forecast the maxillary arch form. Chen et al. identified ordinary differential equations (ODEs) to forecast small-time-scale traffic measurement data and showed that the ODE model was more feasible and efficient than ANN models.
As a classical nonlinear differential equation, the S-system model has been proposed to predict time series and identify genetic networks. Zhang and Yang proposed a restricted additive tree (RAT) to represent the S-system model for stock market index forecasting. However, the RAT method has a nonlinear structure and is inconvenient to implement. In this paper, a novel stock index prediction method based on the S-system model is proposed. Restricted gene expression programming (RGEP) is proposed to encode and optimize the structure of the S-system. In order to optimize the parameters of the S-system model accurately, a new hybrid intelligent algorithm based on the brain storm optimization (BSO) algorithm and the particle swarm optimization (PSO) algorithm is proposed.
The Dow Jones Index, Hang Seng Index, and NASDAQ Index are long-established and well-known stock indexes, which are usually utilized to reflect the development of the global economy. The Shanghai Stock Exchange Composite Index and SZSE Component Index represent the general trend of China’s stock market and economic development. These five stock indexes are regarded as standard datasets for evaluating the performance of stock prediction models [26–30]. Thus, the Dow Jones Index, Hang Seng Index, NASDAQ Index, Shanghai Stock Exchange Composite Index, and SZSE Component Index are collected to validate the performance of our proposed method.
2. Background Concepts and Related Technologies
2.1. Data Description
Let the stock time series data be $x_1, x_2, \ldots, x_n$ ($n$ is the number of time points). Generally, the data from the past time points are used to predict the data at the current time point. Figure 1 shows an example of data partition with input variables. The data in the box are utilized as the input vector, and the value on the right side of the box is the prediction target. Two forecasting strategies, 1-week-ahead and 1-month-ahead, are utilized in this paper.
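The data partition described above can be sketched as a sliding window over the series. This is only an illustration: the names `make_windows`, `n_inputs`, and `horizon` are hypothetical, and the window length and forecast horizon used in the paper are not given in this text, so the values below are placeholders.

```python
def make_windows(series, n_inputs, horizon):
    """Split a series into (input vector, target) pairs.

    n_inputs: number of past points in each input vector (assumed name).
    horizon:  number of steps between the last input and the target (assumed name).
    """
    samples = []
    last_start = len(series) - n_inputs - horizon
    for start in range(last_start + 1):
        window = series[start:start + n_inputs]
        target = series[start + n_inputs + horizon - 1]
        samples.append((window, target))
    return samples


# Toy series; real index data would replace this list.
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
pairs = make_windows(prices, n_inputs=3, horizon=1)
first_window, first_target = pairs[0]
```

A longer horizon simply moves the target further to the right of the window, which is one plausible way to express the 1-week-ahead and 1-month-ahead settings.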
2.2. S-System Model
The S-system model has a complex and powerful structure, which captures the dynamic nature of the real system and achieves good performance in terms of precision and flexibility [31, 32]. The nonlinear differential equation in the S-system is described as follows:

$$\frac{dX_i}{dt} = \alpha_i \prod_{j=1}^{N} X_j^{g_{ij}} - \beta_i \prod_{j=1}^{N} X_j^{h_{ij}}, \quad i = 1, 2, \ldots, N, \tag{1}$$

where $N$ is the number of equations, $X_i$ is the $i$-th variable, $\alpha_i$ and $\beta_i$ are the rate constants of the production function and the consumption function, and $g_{ij}$ and $h_{ij}$ are the kinetic orders.
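As a minimal sketch of how such a model is evaluated, the function below computes the right-hand side of an S-system, assuming the standard form in which each rate is a production power-law term minus a consumption power-law term. The function name and argument layout are illustrative choices, not taken from the paper.

```python
import math


def s_system_rhs(x, alpha, beta, g, h):
    """Return dX_i/dt = alpha_i * prod_j X_j**g[i][j] - beta_i * prod_j X_j**h[i][j]."""
    n = len(x)
    dxdt = []
    for i in range(n):
        production = alpha[i] * math.prod(x[j] ** g[i][j] for j in range(n))
        consumption = beta[i] * math.prod(x[j] ** h[i][j] for j in range(n))
        dxdt.append(production - consumption)
    return dxdt


# Two-variable toy system with hand-picked (hypothetical) parameters.
rates = s_system_rhs(
    x=[2.0, 3.0],
    alpha=[1.0, 2.0],
    beta=[0.5, 1.0],
    g=[[1.0, 0.0], [0.0, 1.0]],
    h=[[0.0, 1.0], [1.0, 0.0]],
)
```

A kinetic order of zero removes a variable from a term, which is how a sparse model structure shows up in the parameter matrices.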
2.3. Brain Storm Optimization Algorithm
The brain storm optimization (BSO) algorithm is a swarm intelligence optimization algorithm, which was proposed by Shi in 2011. In BSO, a clustering algorithm is used to search for the local optimal solutions, and the global optimal solution is obtained through the comparison of all local optimal solutions. A mutation strategy is utilized to enhance the diversity of the algorithm and avoid being trapped in a local optimal solution. The BSO process is described as follows:

(1) Initialize the population and generate $N$ potential solutions ($X_i$, $i = 1, 2, \ldots, N$).

(2) The k-means clustering algorithm is utilized to divide the $N$ individuals into $k$ classes. The fitness value of each individual is calculated. The best individual in each class is selected as the central individual.

(3) Randomly select the central individual of a class and mutate it with a random disturbance.

(4) Update the individual with the following four methods.

(a) Randomly select a class (the probability is proportional to the number of individuals in each class). A new individual ($X_{new}$) is generated by adding a random perturbation to the central individual ($X_{center}$), which is defined as follows:

$$X_{new} = X_{center} + \xi \cdot n(\mu, \sigma), \tag{2}$$

where $n(\mu, \sigma)$ is the Gaussian random function and $\xi$ is the factor that balances the random number, which is defined as follows:

$$\xi = \operatorname{logsig}\left(\frac{0.5\,T - t}{k}\right) \cdot \operatorname{rand}(), \tag{3}$$

where $\operatorname{logsig}$ is a logarithmic S-transform function, $T$ is the maximum number of iterations in the algorithm, $t$ is the number of current iterations, $k$ is the gradient which is utilized to control the logarithmic S-transformation function, and $\operatorname{rand}()$ is a random number in the interval [0, 1].

(b) Randomly select a class and an individual in the selected class. A new individual is created from the selected individual and a Gaussian value by equations (2) and (3).

(c) Randomly select two classes, and take the two central individuals of these classes as the candidate individuals $X_1$ and $X_2$, which are fused with the following formula:

$$X_{merged} = R \cdot X_1 + (1 - R) \cdot X_2, \tag{4}$$

where $R$ is a random number in the interval [0, 1]. After merging the candidate individuals, the individual is updated according to formula (2).

(d) Two candidate individuals $X_1$ and $X_2$ are selected randomly from the two selected classes. The fusion and updating operators are implemented with equations (2) and (4).

After the new individual is generated, its fitness value is calculated. It is compared with the fitness values of the candidate individuals, and the individuals with the better fitness values are selected for the next generation. When $N$ new individuals have been generated, enter the next iteration.

(5) When the maximum iteration number is reached, the algorithm stops; otherwise, go to step (2).
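The perturbation and fusion operators of step (4) can be sketched as follows. The sketch assumes the step-size form commonly used for BSO, a log-sigmoid of (0.5 × maximum iterations − current iteration) divided by a slope, multiplied by a uniform random number, together with a standard normal perturbation. Clustering and selection are left out, and all function names are hypothetical.

```python
import math
import random


def logsig(a):
    """Numerically safe logistic sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def step_size(max_iter, cur_iter, slope, rng):
    """Step-size factor: large in early iterations, small in late ones."""
    return logsig((0.5 * max_iter - cur_iter) / slope) * rng.random()


def perturb(individual, max_iter, cur_iter, slope, rng):
    """Add scaled Gaussian noise to an individual (one step size per call,
    a simplification; per-dimension step sizes are also common)."""
    xi = step_size(max_iter, cur_iter, slope, rng)
    return [value + xi * rng.gauss(0.0, 1.0) for value in individual]


def fuse(first, second, rng):
    """Blend two candidates with a single random weight in [0, 1]."""
    r = rng.random()
    return [r * a + (1.0 - r) * b for a, b in zip(first, second)]


rng = random.Random(0)
center = [1.0, 2.0, 3.0]
candidate = perturb(center, max_iter=100, cur_iter=10, slope=20.0, rng=rng)
blended = fuse([0.0, 0.0], [1.0, 1.0], rng)
```

Because the sigmoid argument decreases with the iteration count, the expected perturbation shrinks as the search proceeds, shifting the search from exploration to refinement.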
2.4. Particle Swarm Optimization Algorithm
The particle swarm optimization (PSO) algorithm is a classical swarm intelligence method. In PSO, each potential solution is represented by a particle. A swarm of particles moves in order to search for the food source, each with a moving velocity vector $v_i$. At each step, each particle separately searches for its own optimal position in the space, which is recorded in a vector $p_i$. The global optimal position is searched among all the particles and is kept as $p_g$.

At each step, a new velocity for particle $i$ is computed by the following equation:

$$v_i(t+1) = w\,v_i(t) + c_1 r_1 \left(p_i - x_i(t)\right) + c_2 r_2 \left(p_g - x_i(t)\right), \tag{5}$$

where $w$ is the inertia weight, which affects the convergence rate of PSO and is calculated adaptively from the maximum number of iterations in the algorithm and the number of current iterations, $c_1$ and $c_2$ are positive constants, and $r_1$ and $r_2$ are uniformly distributed random numbers in [0, 1].

With the updated velocities, each particle changes its position according to the following equation:

$$x_i(t+1) = x_i(t) + v_i(t+1). \tag{6}$$
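One PSO step can be sketched as below, assuming the canonical update in which the new velocity is the sum of an inertia term, a pull toward the particle's own best position, and a pull toward the global best position. The inertia weight is passed in as a plain argument because the adaptive schedule used in the paper is not reproduced in this text; the function name is hypothetical.

```python
import random


def pso_step(positions, velocities, personal_best, global_best, w, c1, c2, rng):
    """Return updated (positions, velocities) after one PSO step."""
    new_positions, new_velocities = [], []
    for x, v, p in zip(positions, velocities, personal_best):
        new_v, new_x = [], []
        for d in range(len(x)):
            r1, r2 = rng.random(), rng.random()
            vd = (w * v[d]
                  + c1 * r1 * (p[d] - x[d])
                  + c2 * r2 * (global_best[d] - x[d]))
            new_v.append(vd)
            new_x.append(x[d] + vd)
        new_velocities.append(new_v)
        new_positions.append(new_x)
    return new_positions, new_velocities


rng = random.Random(0)
positions = [[1.0, 2.0]]
velocities = [[0.5, -0.5]]
# With c1 = c2 = 0 only the inertia term acts, so the move is easy to check.
moved, kept_velocity = pso_step(positions, velocities, positions, positions[0],
                                w=1.0, c1=0.0, c2=0.0, rng=rng)
```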
3.1. Restricted Gene Expression Programming
Restricted gene expression programming (RGEP), an improved version of GEP, was proposed to identify the S-system model for gene regulatory network (GRN) inference. The flowchart of RGEP is described as follows:

(1) Initialize the population. One example of a chromosome in the population is depicted in Figure 2. Each chromosome contains two genes, and each gene contains a head part and a tail part, which are created randomly using the function set ($F$) and the variable set ($T$):

$$F = \{*_2, *_3, \ldots, *_n\}, \quad T = \{x_1, x_2, \ldots, x_m, R\},$$

where $*_n$ is an operation that multiplies $n$ variables, $x_i$ is the $i$-th variable, $m$ is the number of input variables, and $R$ is the constant.
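In order to make the chromosome similar to the S-system, each gene is allocated the corresponding parameters. For gene 1, the production rate constant $\alpha$ is given as its coefficient and each variable $x_j$ is given a kinetic order $g_j$ as its exponent. For gene 2, the consumption rate constant $\beta$ is given as its coefficient and each variable $x_j$ is given a kinetic order $h_j$ as its exponent. The two genes are connected by the subtraction operation (−). Figure 3 shows the expression tree (ET) of Figure 2, from which the corresponding S-system model is obtained.

(2) According to the given fitness function, evaluate the population with the training samples. In this process, the S-system model is solved by the fourth-order Runge–Kutta method. For the differential equation $\frac{dx}{dt} = f(t, x)$, the solution is as follows:

$$x_{n+1} = x_n + \frac{\Delta t}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right),$$

$$k_1 = f(t_n, x_n), \quad k_2 = f\left(t_n + \frac{\Delta t}{2},\, x_n + \frac{\Delta t}{2} k_1\right), \quad k_3 = f\left(t_n + \frac{\Delta t}{2},\, x_n + \frac{\Delta t}{2} k_2\right), \quad k_4 = f\left(t_n + \Delta t,\, x_n + \Delta t\, k_3\right),$$

where $\Delta t$ is the step size.

(3) If the optimal solution appears, RGEP is terminated; otherwise, turn to (4).

(4) Selection, recombination, and mutation are used for the reproduction of each chromosome, as introduced in the RGEP reference.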
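The classical fourth-order Runge–Kutta scheme mentioned in step (2) can be sketched for a scalar equation as follows. The test equation is a hypothetical one-variable S-system with hand-picked parameters (production term 2·x⁰, consumption term 1·x¹), chosen only because its exact solution is known; it is not a model from the paper.

```python
import math


def rk4_step(f, t, x, dt):
    """Advance dx/dt = f(t, x) by one classical RK4 step of size dt."""
    k1 = f(t, x)
    k2 = f(t + dt / 2.0, x + dt * k1 / 2.0)
    k3 = f(t + dt / 2.0, x + dt * k2 / 2.0)
    k4 = f(t + dt, x + dt * k3)
    return x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def integrate(f, t0, x0, dt, steps):
    """Apply rk4_step repeatedly and return the final state."""
    t, x = t0, x0
    for _ in range(steps):
        x = rk4_step(f, t, x, dt)
        t += dt
    return x


def toy_s_system(t, x):
    """One-variable S-system: dx/dt = 2 * x**0 - 1 * x**1 (illustrative)."""
    return 2.0 * x ** 0.0 - 1.0 * x ** 1.0


x_end = integrate(toy_s_system, t0=0.0, x0=1.0, dt=0.1, steps=10)
exact = 2.0 - math.exp(-1.0)  # closed-form solution at t = 1
```

For a multi-variable S-system the same scheme applies with the state held as a vector and the arithmetic done element-wise.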
In the initial stage of structural optimization, the symbols of each chromosome in RGEP, including function symbols and variable symbols, are selected randomly. With the training data, the reproduction operators then change and optimize the chromosomal symbols during the optimization process. The optimized S-system structure does not necessarily contain all the input variables: according to the training data, RGEP can automatically select the appropriate input variables. From Figure 2, we can see that the coefficients (rate constants) and the exponents (kinetic orders) also need to be optimized. In this paper, the parameters in each chromosome are optimized by a hybrid intelligent algorithm based on the BSO algorithm and the PSO algorithm.
3.2. Hybrid Optimization Algorithm
The BSO algorithm is suitable for solving multipeak and high-dimensional function problems. The PSO algorithm has the advantages of easy implementation, high accuracy, and fast convergence. However, both methods are prone to converging prematurely and falling into a local optimum. In order to improve the diversity of the population, a novel hybrid intelligent algorithm based on BSO and PSO (BSO-PSO) is proposed. In the BSO-PSO algorithm, half of the individuals are selected randomly and optimized by BSO, and the other individuals are optimized by PSO. The flowchart is described in Figure 4.
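The population split at the heart of BSO-PSO can be sketched as below. Only the random half-and-half assignment is taken from the description above; how the two halves exchange information is defined by Figure 4, which is not reproduced here, so the two update functions are placeholders and every name in the sketch is hypothetical.

```python
import random


def hybrid_generation(population, bso_update, pso_update, rng):
    """Advance one generation: a random half goes to BSO, the rest to PSO."""
    indices = list(range(len(population)))
    rng.shuffle(indices)
    half = len(indices) // 2
    bso_ids, pso_ids = indices[:half], indices[half:]

    next_population = list(population)
    for ids, update in ((bso_ids, bso_update), (pso_ids, pso_update)):
        updated = update([population[i] for i in ids])
        for i, individual in zip(ids, updated):
            next_population[i] = individual
    return next_population, bso_ids, pso_ids


# Placeholder updates that only tag which optimizer handled an individual.
def tag_bso(group):
    return [("bso", individual) for individual in group]


def tag_pso(group):
    return [("pso", individual) for individual in group]


population = [[float(i)] for i in range(10)]
new_population, bso_ids, pso_ids = hybrid_generation(
    population, tag_bso, tag_pso, random.Random(1))
```

Redrawing the split every generation means each individual is exposed to both search behaviors over time, which is one plausible reading of how the hybrid improves population diversity.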
3.3. Time Series Data Forecasting Using S-System
The flowchart of time series forecasting using the S-system model is described in Figure 5. During the training phase, the S-system model is optimized according to the genetic operators of RGEP, hybrid intelligent algorithm, and training dataset. During the test phase, the optimal S-system is used to make the prediction of the stock index. The detailed process is described as follows.
3.3.1. Training Phase
(1) Initialize the S-system population with structures and parameters. Each S-system is encoded as an RGEP chromosome, as described in Figure 2.

(2) With the training samples, each S-system is solved by the fourth-order Runge–Kutta method and its fitness value is calculated. Search for the best S-system according to the fitness values. If the optimal model is found, the algorithm stops.

(3) Selection, recombination, and mutation are used to search for the optimal structure of the S-system. Go to step (2).

(4) At some iterations in RGEP, the BSO-PSO algorithm is used to optimize the parameters of the RGEP chromosomes. In this process, the structure of the S-system model is fixed. According to the structure of the model, the number of parameters (coefficients and exponents) is counted. With the hybrid intelligent algorithm, search for and update the optimal parameters of each S-system.
3.3.2. Testing Phase
With the data at the previous time points, the optimal S-system model obtained in the training phase is solved and the data at the current time point are predicted. This procedure is repeated until the data at all testing time points have been predicted. The prediction error is then calculated from the predicted data and the target data.
4. Results and Discussion
4.1. Data and Evaluation Standard
Five stock indexes, namely the Dow Jones Index (DJI), Hang Seng Index (HSI), NASDAQ Index (NASI), SSE (Shanghai Stock Exchange) Composite Index (SSEI), and SZSE Component Index (SZSEI), are used to test the performance of our method. Seventy percent of the data are used for training, and 30% of the data are used for testing. The descriptions of the five stock indexes are listed in Table 1.
RMSE (root mean square error), MAP (mean absolute percentage), MAPE (mean absolute percentage error), $R^2$ (coefficient of multiple determination for multiple regressions), ARV (average relative variance), and VAF (variance accounted for) are used to evaluate the performance of our method [30, 39]. These measures are computed from the number of stock sample points, the real stock value at each time point, the predicted stock value at each time point, and the mean of the stock index.
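Four of these measures can be sketched with their commonly used textbook definitions, which are assumed here because the exact formulas of the paper are not reproduced in this text. MAP and ARV are left out on purpose, since their definitions vary between papers. All function names are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def variance(values):
    """Population variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def vaf(actual, predicted):
    """Variance accounted for, in percent."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return 100.0 * (1.0 - variance(residuals) / variance(actual))


# Toy values for illustration only; they are not results from the paper.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
scores = (rmse(actual, predicted), mape(actual, predicted),
          r_squared(actual, predicted), vaf(actual, predicted))
```

Under these definitions a perfect forecast gives an RMSE and MAPE of 0, an $R^2$ of 1, and a VAF of 100%, which matches the direction of the comparisons made in the results.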
4.2. Prediction Results
In order to assess the performance of our method clearly, five state-of-the-art methods (deep recurrent neural network (DRNN), FNT, RBFNN, BPNN, and ARIMA) are also used to predict the five stock indexes.
For the 1-week-ahead prediction problem, the RGEP method is configured with its own function set and variable set. By optimizing the S-system models with our method, we obtain the optimal phenotypes and expression trees (ETs) for the five stock indexes, which are described in Figure 6. The five optimal S-system models obtained for the five stock datasets are listed in Table 2. The forecasting results of the five stock indexes by our method are depicted in Figure 7. From Figure 7, it can be clearly seen that the predicted curves are very close to the target ones, and the errors are nearly zero.
Comparison results of the different prediction models on the five stock indexes are listed in Table 3. From Table 3, among the five existing state-of-the-art methods, the DRNN model performs best for the prediction of the five stock indexes. However, in terms of the six indicators (RMSE, MAP, ARV, MAPE, $R^2$, and VAF), our proposed method performs better than the DRNN model. In terms of RMSE, our method is 34.8% lower than DRNN for the DJI dataset, 46.4% lower for the HSI dataset, 40.4% lower for the NASI dataset, 19.8% lower for the SSEI dataset, and 7.4% lower for the SZSEI dataset. In terms of ARV, our method is 58.7% lower than DRNN for the DJI dataset, 67.1% lower for the HSI dataset, 68.8% lower for the NASI dataset, 36.9% lower for the SSEI dataset, and 16.5% lower for the SZSEI dataset. In terms of MAPE, our method is 37.5% lower than DRNN for the DJI dataset, 48% lower for the HSI dataset, 42.9% lower for the NASI dataset, 35.2% lower for the SSEI dataset, and 18% lower for the SZSEI dataset. In terms of VAF, our method is closer to 100% than DRNN for all five stock indexes. These results show that our proposed method improves the prediction accuracy substantially.
For the 1-month-ahead prediction problem, the RGEP method is likewise configured with its own function set and variable set. With the five stock indexes, we obtain five optimal phenotypes and expression trees (ETs), which are described in Figure 8. The S-system models obtained from the five ETs are listed in Table 4. The forecasting results of the five stock indexes by our method are depicted in Figure 9. From Figure 9, we can see clearly that the predicted and target curves are very close.
Six prediction models are used to forecast the five stock indexes, and the prediction results are listed in Table 5. From Table 5, it can be seen that five indicators (RMSE, ARV, MAPE, $R^2$, and VAF) of our method are the best among the six methods on three datasets (DJI, HSI, and NASI). The DRNN model has the highest MAP on these datasets, with values of 2.1368, 2.9568, and 6.3901, respectively. For the SSEI and SZSEI datasets, our proposed method has the best performance in terms of RMSE, MAP, ARV, MAPE, $R^2$, and VAF. In terms of ARV, our method is closer to 0 than the other five methods. In terms of $R^2$, our method is closer to 1. In terms of VAF, our method is closer to 100%. Thus, our proposed forecasting model tends to be more accurate.
4.3. Hybrid Intelligent Algorithm Analysis
In order to test the performance of our proposed hybrid intelligent algorithm, we use BSO and PSO to optimize the parameters of the S-system models in comparison experiments. Over 20 runs with the DJI dataset, the 1-week-ahead prediction results of the three evolutionary methods are listed in Table 6, which contains the best value, worst value, mean value, and standard error (SD) of the mean of the 20-run RMSEs. From Table 6, we can see that over 20 runs the best RMSE values of the three evolutionary methods are very close, but the other three indicators differ considerably. Our hybrid intelligent algorithm obtains a smaller worst RMSE, mean RMSE, and SD than PSO and BSO, which indicates that it is more robust and less prone to falling into a local optimum than PSO and BSO.
Figure 10 compares the RMSE convergence curves obtained by our hybrid intelligent algorithm, BSO, and PSO with the DJI dataset for 1-week-ahead prediction. Figure 10 reveals that our proposed hybrid intelligent algorithm converges faster than PSO and BSO in the early stage of the optimization process. When the number of iterations reaches 200, the RMSE drops to about $10^{-3}$, which indicates a significant minimization of the error.
4.4. Restricted Gene Expression Programming Analysis
In order to test the performance of restricted gene expression programming for S-system optimization, the restricted additive tree (RAT) is used to optimize the structure of the S-system model in comparison experiments. Over 20 runs with the five stock indexes, the 1-week-ahead prediction results of RGEP and RAT are depicted in Figure 11, which contains the best, worst, and mean values of the 20-run RMSEs. From Figure 11, it can be clearly seen that RGEP obtains smaller best, worst, and mean RMSE values than RAT, which reveals that RGEP can find the optimal S-system model more easily than RAT.
In this paper, a novel stock prediction method based on the S-system model is proposed to forecast the stock market. An improved gene expression programming method (RGEP) is proposed to represent and optimize the structure of the S-system model. A hybrid intelligent algorithm based on BSO and PSO is used to optimize the parameters of the S-system model. Our proposed method is tested by predicting five real stock index datasets, namely DJI, HSI, NASI, SSEI, and SZSEI. The results of 1-week-ahead and 1-month-ahead prediction reveal that our method predicts the stock indexes accurately and performs better than DRNN, FNT, RBFNN, BPNN, and ARIMA.
The convincing performance of our method is mainly due to three aspects. First, the nonlinear ordinary differential equation model, the S-system, has strong nonlinear modeling and forecasting ability. Second, Table 6 and Figure 10 show that our hybrid intelligent algorithm is more robust and less prone to falling into a local optimum than PSO and BSO. Third, Tables 2 and 4 show that the optimal S-system models contain only a portion of the input variables. This is because our method can automatically select the proper input variables for different stock data, which also helps prevent overfitting.
The five stock indexes could be downloaded freely at https://hk.finance.yahoo.com/.
Conflicts of Interest
There are no conflicts of interest regarding the publication of this paper.
This work was supported by the Natural Science Foundation of China (no. 61702445), Shandong Provincial Natural Science Foundation, China (no. ZR2015PF007), the Ph.D. research startup foundation of Zaozhuang University (no. 2014BS13), and Zaozhuang University Foundation (no. 2015YY02).
T. A. Marsh and R. C. Merton, “Dividend variability and variance bounds tests for the rationality of stock market prices,” American Economic Review, vol. 76, no. 3, pp. 483–498, 1986.
A. A. Ariyo, A. O. Adewumi, and C. K. Ayo, “Stock price prediction using the ARIMA model,” in Proceedings of 16th International Conference on Computer Modelling and Simulation, pp. 106–112, East Lansing, MI, USA, September 2015.
O. O. Mathew, A. F. Sola, B. H. Oladiran, and A. A. Amos, “Prediction of stock price using autoregressive integrated moving average filter ((ARIMA (p,d,q))),” Global Journal of Science Frontier Research, vol. 13, no. 8, pp. 79–88, 2013.
B. Y. Bejarbaneh, E. Y. Bejarbaneh, M. F. M. Amin, A. Fahimifar, D. Jahed Armaghani, and M. Z. A. Majid, “Intelligent modelling of sandstone deformation behaviour using fuzzy logic and neural network systems,” Bulletin of Engineering Geology and the Environment, vol. 77, no. 1, pp. 345–361, 2016.
G. Dong, K. Fataliyev, and L. Wang, “One-step and multi-step ahead stock prediction using back propagation neural networks,” in Proceedings of 9th International Conference on Information, Communications & Signal Processing, pp. 1–5, Bangkok, Thailand, October 2014.
R. Akita, A. Yoshihara, T. Matsubara, and K. Uehara, “Deep learning for stock prediction using numerical and textual information,” in Proceedings of 15th International Conference on Computer and Information Science (ICIS), pp. 1–6, Okayama, Japan, June 2016.
H. Wang, B. Yang, and J. Lv, “Complex-valued neural network model and its application to stock prediction,” in Proceedings of 16th International Conference on Hybrid Intelligent Systems, pp. 21–28, Marrakech, Morocco, November 2016.
N. Noman and H. Iba, “Inference of genetic networks using S-system: information criteria for model selection,” in Proceedings of 8th Annual Conference on Genetic and Evolutionary Computation, vol. 1, pp. 263–270, Seattle, WA, USA, July 2006.
Y. Shi, “Brain storm optimization algorithm,” in Proceedings of Second International Conference Advances in Swarm Intelligence, vol. 6728, no. 3, pp. 1–14, Chongqing, China, June 2011.
J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proceedings of IEEE International Conference on Neural Networks, vol. 4, no. 8, pp. 1942–1948, Perth, Australia, November 1995.
D. J. Armaghani, M. F. M. Amin, S. Yagiz, R. S. Faradonbeh, and R. A. Abdullah, “Prediction of the uniaxial compressive strength of sandstone using various modeling techniques,” International Journal of Rock Mechanics and Mining Sciences, vol. 85, pp. 174–186, 2016.
M. Hermans and B. Schrauwen, “Training and analysing deep recurrent neural networks,” in Proceedings of Advances in Neural Information Processing Systems, pp. 190–198, Lake Tahoe, NV, USA, December 2013.