$\ell_p$-Norm Multikernel Learning Approach for Stock Market Price Forecasting
Linear multiple kernel learning models have been used for predicting financial time series. However, $\ell_1$-norm multiple kernel support vector regression is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we adopt $\ell_p$-norm multiple kernel support vector regression ($p \geq 1$) as a stock price prediction model. The optimization problem is decomposed into smaller subproblems, and an interleaved optimization strategy is employed to solve the regression model. The model is evaluated on forecasting the daily stock closing prices of Shanghai Stock Index in China. Experimental results show that our proposed model performs better than the $\ell_1$-norm multiple kernel support vector regression model.
1. Introduction
Forecasting the future values of financial time series is an appealing yet difficult activity in the modern business world. As explained by Deboeck and Yaser [1, 2], financial time series are inherently noisy, nonstationary, and deterministically chaotic. In the past, many methods were proposed for tackling this kind of problem. For instance, the linear models for forecasting the future values of stock prices include the autoregressive (AR) model, the autoregressive moving average (ARMA) model, and the autoregressive integrated moving average (ARIMA) model. Over the last decade, nonlinear approaches have received increasing attention in financial time series prediction. For example, Yao and Tan used time series data and technical indicators as the input of neural networks to increase the forecast accuracy of exchange rates; Cao and Tay [6, 7] applied the support vector machine (SVM) to financial forecasting and compared it with the multilayer back-propagation (BP) neural network and the regularized radial basis function (RBF) neural network; Qi and Wu proposed a multilayer feed-forward network to forecast exchange rates; Pai and Lin investigated a hybrid ARIMA and support vector machine model for stock price forecasting; Pai et al. presented a hybrid SVM model that exploits the unique strengths of the linear and nonlinear SVM models in forecasting exchange rates; Kwon and Moon proposed a hybrid neurogenetic system for stock trading; Hung and Hong presented an improved ant colony optimization algorithm in a support vector regression (SVR) model, called SVRCACO, for selecting suitable parameters in exchange rate forecasting; and Jiang and He introduced local grey SVR (LG-SVR), which integrates the grey relational grade with local SVR for financial time series forecasting.
In comparison with the previous models, SVR with a single kernel function can exhibit better prediction accuracy because it embodies the structural risk minimization principle, which considers both the training error and the capacity of the regression model [14, 15]. However, researchers have to determine in advance the type of kernel function and the associated kernel hyperparameters for SVR. Unsuitably chosen kernel functions or hyperparameter settings may lead to significantly poorer performance [16, 17].
In recent years there has been a lot of interest in designing principled regression algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance and a lower generalization error. When the right choice of features is unknown, learning linear combinations of multiple kernels is an appealing strategy. This approach, in which the combination is found by an optimization process, is called multiple kernel learning (MKL). A first step towards a more realistic model of MKL was achieved by Lanckriet et al., who showed that, given a candidate set of kernels, it is computationally feasible to learn a support vector machine and a linear kernel combination simultaneously. In MKL we need to solve a joint optimization problem while also learning the optimal weights for combining the kernels. Several practitioners have adopted linear multiple kernels to deal with practical problems. For example, Rakotomamonjy et al. addressed the MKL problem through a weighted 2-norm regularization formulation and proposed an algorithm, named SimpleMKL, for solving it. Bach studied the asymptotic model consistency of the group Lasso. Zhang and Shen presented a multimodal multitask learning algorithm for joint prediction of multiple regression and classification variables in Alzheimer’s disease. In particular, Chi-Yuan Yeh and his coworkers developed a two-stage MKL algorithm by incorporating sequential minimal optimization and the gradient projection method. The new method performed better than previous ones for forecasting financial time series. Previous approaches to MKL have promoted sparse kernel combinations to support interpretability and scalability. Unfortunately, sparsity at the kernel level may harm the generalization performance of the learner; therefore $\ell_1$-norm MKL is rarely observed to outperform trivial baselines in practical applications.
To allow for robust kernel mixtures that generalize well, researchers have extended $\ell_1$-norm MKL to arbitrary norms, that is, $\ell_p$-norm MKL ($p \geq 1$). For example, Marius Kloft et al. developed two efficient interleaved strategies for $\ell_p$-norm MKL and showed that it can achieve better accuracy than $\ell_1$-norm MKL on real-world problems; Francesco Orabona et al. presented an MKL optimization algorithm based on stochastic gradient descent for $\ell_p$-norm MKL, which possesses a faster convergence rate as the number of kernels grows.
In this paper, a multiple kernel learning framework is established for learning and predicting stock prices. We present a regression model for the future values of stock prices, that is, $\ell_p$-norm multiple kernel support vector regression ($\ell_p$-norm MK-SVR), where $p \geq 1$. We decompose the optimization problem into smaller subproblems and adopt the interleaved optimization strategy to solve the regression model. Our experimental results show that $\ell_p$-norm MK-SVR achieves better performance than the compared models.
The rest of this paper is arranged as follows. Section 2 details the construction of the $\ell_p$-norm MK-SVR model and describes the algorithm for our regression model. Experimental results are presented in Section 3. Section 4 concludes the paper and provides some future research directions.
2. Forecasting Methodology
2.1. $\ell_p$-Norm Multiple Kernel Support Vector Regression
In this section, the idea of $\ell_p$-norm multiple kernel support vector regression ($\ell_p$-norm MK-SVR) is introduced formally.
Let the training set consist of input vectors, each paired with a desired output value. Consider a function that maps the samples into a high-dimensional, possibly infinite-dimensional, space. A regression model is learned from the training set and used to predict the target values of unseen input vectors. SVR is a nonlinear kernel-based regression method which tries to locate a regression hyperplane with small risk in the high-dimensional feature space. Considering the soft margin formulation, SVR solves the following constrained optimization problem:
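For reference, the standard soft-margin $\varepsilon$-insensitive SVR primal is commonly written as below. The symbols are notation assumed here and may differ from the paper's own: $n$ training pairs $(x_i, y_i)$, feature map $\phi$, weight vector $w$, bias $b$, trade-off constant $C$, tube width $\varepsilon$, and slack variables $\xi_i$, $\xi_i^*$.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*}\quad & \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right) \\
\text{s.t.}\quad & y_i - \langle w, \phi(x_i)\rangle - b \le \varepsilon + \xi_i, \\
& \langle w, \phi(x_i)\rangle + b - y_i \le \varepsilon + \xi_i^*, \\
& \xi_i \ge 0,\quad \xi_i^* \ge 0, \qquad i = 1, \dots, n.
\end{aligned}
```

The first term controls the capacity of the regression model and the second penalizes training errors that fall outside the $\varepsilon$-tube.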
An SVR model usually uses a single mapping function and hence a single kernel function. Although the SVR model has good function approximation and generalization capabilities, it is not well suited to a data set that has a locally varying distribution. To resolve this problem, we can construct an MK-SVR model. By combining multiple kernels instead of using a single one, the $\ell_p$-norm MK-SVR model can capture the varying distribution well. We therefore use a composite feature map with a block structure to map the input space to the feature space, in which each component function carries its own weight. Given a set of base kernels that correspond to these feature maps, linear MK-SVR aims to learn a linear combination of the base kernels. In learning with MK-SVR we aim at minimizing the loss on the training data with respect to the optimal kernel mixture, in addition to regularizing the kernel weights to avoid overfitting. The primal can therefore be formulated as
Previous research on MK-SVR employs an $\ell_1$-norm regularizer on the kernel weights, which promotes sparse kernel mixtures. However, sparsity is not always desirable, since the information carried in the zero-weighted kernels is lost. Therefore we propose to use nonsparse and thus more robust kernel mixtures by employing an $\ell_p$-norm constraint with $p > 1$ on the kernel weights. After a change of variables in (3) and a rescaling of the first equation, the following $\ell_p$-norm MK-SVR is obtained:
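The idea of a weighted kernel mixture under an $\ell_p$-norm constraint can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code: the helper names (`rbf_gram`, `rescale_to_unit_lp`, `combine`), the RBF parameterisation `exp(-gamma * ||x - z||^2)`, and the toy data are all assumptions made for the example.

```python
import math


def rbf_gram(points, gamma):
    """Gram matrix of k(x, z) = exp(-gamma * ||x - z||^2) (assumed parameterisation)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [[math.exp(-gamma * sq_dist(a, b)) for b in points] for a in points]


def lp_norm(weights, p):
    """l_p norm of a weight vector, for p >= 1."""
    return sum(abs(w) ** p for w in weights) ** (1.0 / p)


def rescale_to_unit_lp(weights, p):
    """Rescale nonnegative kernel weights so that their l_p norm equals 1."""
    norm = lp_norm(weights, p)
    if norm == 0.0:
        raise ValueError("at least one kernel weight must be positive")
    return [w / norm for w in weights]


def combine(grams, weights):
    """Weighted sum of equally sized Gram matrices, one weight per base kernel."""
    n = len(grams[0])
    return [[sum(w * g[i][j] for w, g in zip(weights, grams)) for j in range(n)]
            for i in range(n)]


# Toy usage: three base kernels of different widths on three 1-D points.
points = [[0.0], [1.0], [3.0]]
grams = [rbf_gram(points, gamma) for gamma in (0.1, 1.0, 10.0)]
theta = rescale_to_unit_lp([1.0, 1.0, 1.0], p=2.0)
K = combine(grams, theta)
```

With $p > 1$ every base kernel typically keeps a nonzero weight, which is the nonsparse behaviour the text argues for.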
An alternative to the previous formulation has been considered by other researchers. For example, Zien and Ong upper-bound the value of the regularizer and incorporate the regularizer as an additional constraint into the optimization problem. Following this idea, the $\ell_p$-norm MK-SVR model (4) can be transformed into the following form:
It can be shown (see the Appendix for details) that the dual of (5) is where , , , , and is the dual norm of . Once the optimal , and are found by solving (6), the regression hyperplane for the $\ell_p$-norm MK-SVR model is given by where is obtained from any and , with . In the following section, an efficient algorithm is proposed for solving the optimization problem (6).
2.2. An Optimization Algorithm
The $\ell_p$-norm MK-SVR model (6) can be trained with several algorithms, for example, the sequential minimal optimization algorithm and multi-kernel learning with online-batch optimization. In this paper, interleaved optimization is used as the optimization scheme. We can exploit the structure of the $\ell_p$-norm MK-SVR cost function by alternating between optimizing the weights of the linear combination of base kernels and the remaining variables. We do so by setting up a two-stage optimization algorithm. The basic idea is to divide the optimization variables of the $\ell_p$-norm MK-SVR problem (6) into two groups, the kernel weights on one hand and the remaining SVR variables on the other. The procedure alternates between these two stages via a block coordinate descent algorithm. The optimization over the kernel weights is carried out analytically, and the remaining variables are computed in the dual. The two stages are performed iteratively until the specified stopping criterion is met, as shown in Figure 1.
In the first stage, the SVR variables are kept fixed, that is, they are treated as known. Then the optimal kernel weights in the $\ell_p$-norm MK-SVR model (6) can be calculated analytically by the following process.
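Set the first partial derivatives of the Lagrangian with respect to the kernel weights to zero. At the optimal point the norm constraint on the kernel weights holds with equality, so the resulting equation yields the closed-form expression (10) for the kernel weights.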
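The analytical step can be illustrated with the closed-form update given by Kloft et al. for $\ell_p$-norm MKL, in which each kernel weight is proportional to $\|w_m\|^{2/(p+1)}$ and the weights are normalized to unit $\ell_p$ norm. Whether equation (10) of this paper has exactly this form is an assumption; the function names and the fallback to uniform weights when all block norms vanish are likewise choices made for the sketch.

```python
import math


def update_kernel_weights(block_norms, p):
    """Closed-form l_p-norm MKL update (after Kloft et al.):
    theta_m = ||w_m||^(2/(p+1)) / (sum_k ||w_k||^(2p/(p+1)))^(1/p).
    """
    if p < 1.0:
        raise ValueError("p must be at least 1")
    denom = sum(w ** (2.0 * p / (p + 1.0)) for w in block_norms) ** (1.0 / p)
    if denom == 0.0:
        # Assumed fallback: uniform weights with unit l_p norm.
        m = len(block_norms)
        return [(1.0 / m) ** (1.0 / p)] * m
    return [w ** (2.0 / (p + 1.0)) / denom for w in block_norms]


def block_norm(theta_m, gram_m, beta):
    """||w_m|| = theta_m * sqrt(beta^T K_m beta), where beta holds the
    dual coefficients of the fixed SVR solution (assumed notation)."""
    n = len(beta)
    quad = sum(beta[i] * gram_m[i][j] * beta[j] for i in range(n) for j in range(n))
    return theta_m * math.sqrt(max(quad, 0.0))


# Equal block norms give equal weights; larger blocks get larger weights.
uniform = update_kernel_weights([1.0, 1.0, 1.0, 1.0], p=2.0)
skewed = update_kernel_weights([1.0, 2.0, 0.5], p=1.5)
```

The exponent $2/(p+1)$ follows from setting the derivative of $\sum_m \|w_m\|^2/\theta_m$ plus the Lagrangian term for the norm constraint to zero, which is the stationarity argument described above.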
In the second stage, the following algorithm is used. We give a chunking-based training algorithm (Algorithm 1) via analytical update for $\ell_p$-norm MK-SVR. The kernel weights and the SVR dual variables are optimized in an interleaved way. The basic idea of this algorithm is to divide the optimization problem into an inner subproblem and an outer subproblem. The algorithm alternates between solving the two subproblems until convergence.
In every iteration, the inner subproblem (the SVR step) identifies the constraint that maximizes (6) with the kernel weights held fixed. The outer subproblem (the kernel-weight step) is also called the restricted master problem; the kernel weights are computed with (10).
The interleaved optimization algorithm is depicted in Algorithm 1, and its details are as follows.
2.2.1. Initialization
The dual variables are initialized to 0 for all training samples, and every kernel weight is initialized to the same constant.
2.2.2. Chunking and Carrying out with SVR
In each iteration, the procedure is standard in chunking-based SVR solvers and is carried out by , where is chosen as described in . We implement the greedy second-order working set selection strategy of Fan et al. Rather than computing the gradient repeatedly, we speed up variable selection by caching, separately for each kernel. The cache needs to be updated every time we change and in the reduced variable optimization. In Algorithm 1, (4) and (5) compute the objective values of SVR. Finally, the analytical value of the kernel weights is computed via (10).
2.2.3. Stopping Criterion
When the duality gap falls below a prespecified threshold, we terminate the algorithm and output the current dual variables and kernel weights.
3. Experimental Results
In this section, two experiments on a real financial time series are carried out to assess the performance of $\ell_p$-norm MK-SVR. The motivation behind the two experiments is to compare the performance of our proposed method with that of other methods, that is, single kernel support vector regression (SKSVR) and $\ell_1$-norm MK-SVR. All calculations are performed with programs developed in MATLAB R2010a.
3.1. Experiment I
Firstly, we compare the performance of $\ell_p$-norm MK-SVR with that of SKSVR. In this experiment, the daily stock closing prices of Shanghai Stock Index in China for the period of January 2003 to December 2007 are used, and the training/validating/testing data sets are generated by a one-season moving-window testing approach. Following Tay and Cao, three data sets, data1 to data3, are formed. For instance, in data1 the daily stock closing prices from January 2003 to December 2006 are selected as the training data set, those from January 2007 to March 2007 as the validating data set, and those from April 2007 to June 2007 as the testing data set. The corresponding time periods for data1 to data3 are listed in Table 1.
We derive training patterns from the original daily stock closing prices for SKSVR and $\ell_p$-norm MK-SVR. Let × be the -day exponential moving average of the th day, where is the closing price of the th day and , then the output variable can be defined as Let be the input vector and let be the lagged relative difference in percentage of price (RDP). Moreover, we can obtain a transformed closing price by subtracting a -day EMA from the closing price, that is,
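The three building blocks named above (exponential moving average, lagged RDP, and the EMA-detrended closing price) can be sketched as follows. The specific window lengths and lags used in the paper are not reproduced here, so `n` and `k` are left as parameters; the smoothing factor `2 / (n + 1)` and seeding the EMA with the first price are common conventions assumed for this illustration.

```python
def ema(prices, n):
    """n-day exponential moving average with smoothing factor 2 / (n + 1),
    seeded with the first price (both conventions are assumptions)."""
    alpha = 2.0 / (n + 1)
    smoothed = [prices[0]]
    for price in prices[1:]:
        smoothed.append(alpha * price + (1.0 - alpha) * smoothed[-1])
    return smoothed


def rdp(prices, k):
    """Lagged relative difference in percentage of price:
    100 * (p_i - p_{i-k}) / p_{i-k}, defined for i >= k."""
    return [100.0 * (prices[i] - prices[i - k]) / prices[i - k]
            for i in range(k, len(prices))]


def ema_detrended(prices, n):
    """Transformed closing price: closing price minus its n-day EMA."""
    return [p - s for p, s in zip(prices, ema(prices, n))]


# Toy usage on a short, made-up price series.
closing = [100.0, 110.0, 121.0, 121.0]
features = rdp(closing, 1)
residual = ema_detrended(closing, 3)
```

Working with relative differences and detrended prices rather than raw prices makes the inputs less sensitive to the overall price level, which is the usual motivation for this kind of transformation.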
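Based on the quantities defined above, the input variables can be defined as , , , , and . We adopt the root mean squared error (RMSE) for performance comparison, that is, $$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \tag{13}$$ where $y_i$ and $\hat{y}_i$ are the desired output and the predicted output, respectively, and $n$ is the number of samples.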
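The RMSE criterion is simple to compute directly. The function below is a minimal sketch; the name `rmse` and the argument names are chosen for this example.

```python
import math


def rmse(desired, predicted):
    """Root mean squared error between desired and predicted outputs."""
    if len(desired) != len(predicted) or not desired:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(d - q) ** 2 for d, q in zip(desired, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Toy usage: errors of 3 and 4 give sqrt((9 + 16) / 2).
score = rmse([0.0, 0.0], [3.0, 4.0])
```

Lower values indicate better forecasts, and a perfect forecast gives an RMSE of zero.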
There are three parameters that should be determined in advance for SKSVR with the RBF kernel, that is, , , and . The forecasting performance of SKSVR is examined with and . Because the forecasting performance obtained by SKSVR is affected by the parameter , we try different settings of it from 0.01 to 3 with a step of 0.05. Figure 2 shows the RMSE on the three data sets for SKSVR. The figure shows that SKSVR requires different settings for different data sets to obtain the best performance. For example, the best performance for data1 occurs when . The best RMSE values obtained by SKSVR are listed in Table 2.
For the $\ell_p$-norm MK-SVR training model, we adopt the RBF kernel . A kernel combining 60 different RBF kernels is considered, that is, with step . Hence, the kernel matrix is a weighted sum of 60 kernel matrices, that is, where denotes the kernel weight for the first kernel matrix with and denotes the kernel weight for the second kernel matrix with , and so on. For the three data sets, the RMSE values obtained by $\ell_p$-norm MK-SVR are also listed in Table 2. When , , and , the $\ell_p$-norm MK-SVR model performs better than SKSVR for data1, data2, and data3, respectively.
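A grid of kernel parameters of this kind can be generated as sketched below. With a start of 0.01, an upper limit of 3, and a step of 0.05 (the values quoted for the SKSVR search), the grid has 60 points, the same as the number of base kernels; that the base kernels use exactly this grid is an assumption. The uniform starting weights with unit $\ell_p$ norm are also an assumed convention, not a detail taken from the paper.

```python
import math


def parameter_grid(start, stop, step):
    """Values start, start + step, ... that do not exceed stop."""
    count = int(math.floor((stop - start) / step + 1e-9)) + 1
    return [round(start + k * step, 10) for k in range(count)]


def uniform_weights(m, p):
    """Equal kernel weights whose l_p norm is 1 (assumed starting point)."""
    return [(1.0 / m) ** (1.0 / p)] * m


# One candidate RBF parameter per base kernel, and matching start weights.
grid = parameter_grid(0.01, 3.0, 0.05)
theta0 = uniform_weights(len(grid), p=2.0)
```

Each grid value would define one base kernel matrix, and the learned weights then decide how much each kernel width contributes to the mixture.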
3.2. Experiment II
Secondly, we compare the performance of $\ell_p$-norm MK-SVR with that of $\ell_1$-norm MK-SVR. In this experiment, the daily stock closing prices of Shanghai Stock Index in China for the period of January 2008 to December 2011 are used, and the training/validating/testing data sets are generated by a one-season moving-window testing approach. Following Tay and Cao, three data sets, D-I to D-III, are formed. The corresponding time periods for D-I to D-III are listed in Table 3.
We also adopt RMSE (13) for performance comparison. For the $\ell_1$-norm MK-SVR and $\ell_p$-norm MK-SVR training models, a kernel combining 40 different RBF kernels is considered, that is, , . Hence, the kernel matrix is a weighted sum of 40 kernel matrices, that is, where denotes the kernel weight for the first kernel matrix with and denotes the kernel weight for the second kernel matrix with , and so on. For the three data sets, the RMSE values obtained by $\ell_1$-norm MK-SVR and $\ell_p$-norm MK-SVR are listed in Table 4. When , , and , the $\ell_p$-norm MK-SVR model performs better than the $\ell_1$-norm MK-SVR model for D-I, D-II, and D-III, respectively. Figure 3 shows the forecasting results for D-I and D-II by the two regression models.
Furthermore, we use the statistical test proposed by Diebold and Mariano to assess the statistical significance of the forecasts by the $\ell_p$-norm MK-SVR model. The loss-differential series of $\ell_1$-norm MK-SVR and $\ell_p$-norm MK-SVR are shown in Figures 4 and 5. Following Diebold and Mariano, we adopt the asymptotic test as the test statistic, where is the loss-differential series of the $\ell_1$-norm MK-SVR and $\ell_p$-norm MK-SVR models, and denote the forecasting errors; is the weighted sum of the available sample autocovariances: , where is the sample size, , and is the lag window, defined as where ; reports the number of forecasting steps ahead.
We denote as the forecasting accuracy of $\ell_1$-norm MK-SVR and as the forecasting accuracy of $\ell_p$-norm MK-SVR. Under the null hypothesis: , the test was performed at the and significance levels. The test results are shown in Table 5. For the three data sets, all asymptotic tests reject the null hypothesis. The test results show that the $\ell_p$-norm MK-SVR model indeed improves the forecasting accuracy in comparison with the $\ell_1$-norm MK-SVR model.
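The asymptotic Diebold-Mariano statistic can be sketched as below. The squared-error loss is an assumption (the loss used in the paper is not recoverable from the text); the long-run variance uses the rectangular lag window with truncation lag `h - 1`, as in the original Diebold-Mariano test, and the function names are chosen for this example.

```python
import math


def diebold_mariano(errors_a, errors_b, h=1):
    """Asymptotic DM statistic for two forecast-error series.

    Loss differential d_t = e_a,t^2 - e_b,t^2 (squared-error loss, assumed).
    A positive statistic means model b has the smaller average loss.
    """
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    size = len(d)
    d_bar = sum(d) / size

    def autocov(lag):
        return sum((d[t] - d_bar) * (d[t - lag] - d_bar)
                   for t in range(lag, size)) / size

    # Rectangular lag window: autocovariances up to lag h - 1.
    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, h))
    if long_run_var <= 0.0:
        raise ValueError("long-run variance estimate is not positive")
    return d_bar / math.sqrt(long_run_var / size)


def two_sided_p_value(stat):
    """Two-sided p-value under the standard normal limit."""
    return math.erfc(abs(stat) / math.sqrt(2.0))


# Toy usage with made-up one-step-ahead errors.
stat = diebold_mariano([1.0, 2.0, 1.0, 2.0], [0.0, 0.0, 0.0, 0.0], h=1)
p_val = two_sided_p_value(stat)
```

The null hypothesis of equal forecasting accuracy is rejected at a given significance level when the p-value falls below that level.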
The superior performance of the $\ell_p$-norm MK-SVR model ($p > 1$) is not surprising. When we use the sparsity-inducing norm ($p = 1$), some of the kernel weights are forced to zero, and the corresponding kernels are eliminated, leading to some information loss. The daily stock closing prices do not carry large parts of overlapping information, and the information is discriminative. So a nonsparse kernel mixture can access more information and perform more robustly.
4. Summary and Prospect
In this paper, an $\ell_p$-norm MK-SVR model for stock market price forecasting is proposed. The model is trained with an interleaved optimization scheme that decomposes the problem into smaller subproblems. In an empirical evaluation, we show that $\ell_p$-norm MK-SVR can improve predictive accuracy on real-world data sets. Although we focus on forecasting stock prices in this paper, our $\ell_p$-norm MK-SVR model could be applied to more general financial forecasting problems. In the future we will therefore apply our model to other financial markets, such as exchange markets.
Appendix
$\ell_p$-Norm MK-SVR Dual Formulation
In this appendix, we detail the dual formulation of $\ell_p$-norm MK-SVR. We again consider $\ell_p$-norm MK-SVR with a general convex loss,
In the following, we build the Lagrangian of (A.1). By introducing Lagrangian multipliers , , , and , the Lagrangian saddle point problem is given by Setting the first partial derivatives of the Lagrangian with respect to , , , and to zero reveals the optimality conditions
Resubstituting the previous equations into the Lagrangian yields the following, which can also be written as
For standard support vector regression formulations, the hinge loss function can be defined as . This loss is also convex with a sub-gradient bounded by . Recall that the Fenchel-Legendre conjugate of a function is defined as , and the dual form is denoted by (the norm defined via the identity ). According to (A.3), (A.5), and the Fenchel-Legendre conjugate of the hinge loss function, we can obtain the following dual: where , , , , and .
In the following, we find at optimality. Let us solve for the unbounded ; then we can obtain the optimal as Obviously, , so we can ignore the corresponding constraint from the optimization problem and plug (A.7) into (A.6). Then the dual optimization problem for $\ell_p$-norm MK-SVR can be written as
For the choice of the $\ell_p$-norm, holds in the optimal point so that the -term can be discarded. Therefore the previous equations reduce to an optimization problem that depends on and as
Now, the $\ell_p$-norm MK-SVR model has been constructed.
Acknowledgments
The authors would like to thank the handling editor and the anonymous reviewers for their constructive comments, which led to significant improvement of the paper. This work was partially supported by the National Natural Science Foundation of China under Grant no. 51174236.
References
J. W. Hall, “Adaptive selection of U.S. stocks with neural nets,” in Trading on the Edge: Neural, Genetic, and Fuzzy Systems for Chaotic Financial Markets, G. J. Deboeck, Ed., John Wiley & Sons, New York, NY, USA, 1994.
Y. S. Abu-Mostafa and A. F. Atiya, “Introduction to financial forecasting,” Applied Intelligence, vol. 6, no. 3, pp. 205–213, 1996.
D. G. Champernowne, “Sampling theory applied to autoregressive schemes,” Journal of the Royal Statistical Society B, vol. 10, pp. 204–231, 1948.
G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control, Prentice Hall, Englewood Cliffs, NJ, USA, 3rd edition, 1994.
P. F. Pai, W. C. Hong, C. S. Lin, and C. T. Chen, “A hybrid support vector machine regression for exchange rate prediction,” International Journal of Information and Management Sciences, vol. 17, no. 2, pp. 19–32, 2006.
W. M. Hung and W. C. Hong, “Application of SVR with improved ant colony optimization algorithms in exchange rate forecasting,” Control and Cybernetics, vol. 38, no. 3, pp. 863–891, 2009.
H. Jiang and W. He, “Grey relational grade in local support vector regression for financial time series prediction,” Expert Systems with Applications, vol. 39, no. 3, pp. 2256–2262, 2012.
V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan, “Learning the kernel matrix with semidefinite programming,” Journal of Machine Learning Research, vol. 5, pp. 27–72, 2004.
A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, “SimpleMKL,” Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.
F. R. Bach, “Consistency of the group lasso and multiple kernel learning,” Journal of Machine Learning Research, vol. 9, pp. 1179–1225, 2008.
D. Zhang, D. Shen, and The Alzheimer’s Disease Neuroimaging Initiative, “Multimodal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease,” NeuroImage, vol. 59, pp. 895–907, 2012.
M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, “$\ell_p$-norm multiple kernel learning,” Journal of Machine Learning Research, vol. 12, pp. 953–997, 2011.
F. Orabona, L. Jie, and B. Caputo, “Multi kernel learning with online-batch optimization,” Journal of Machine Learning Research, vol. 13, pp. 165–191, 2012.
S. V. N. Vishwanathan, Z. Sun, N. Theera-Ampornpunt, and M. Varma, “Multiple kernel learning and the SMO algorithm,” in Advances in Neural Information Processing Systems, 2010.
C. C. Chang and C. J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 1–27, 2011.
R. E. Fan, P. H. Chen, and C. J. Lin, “Working set selection using second order information for training support vector machines,” Journal of Machine Learning Research, vol. 6, pp. 1889–1918, 2005.