`Computational Intelligence and Neuroscience, Volume 2012 (2012), Article ID 601296, 10 pages, http://dx.doi.org/10.1155/2012/601296`
Research Article

## $\ell_p$-Norm Multikernel Learning Approach for Stock Market Price Forecasting

1School of Mathematics and Statistics, Central South University, Changsha, Hunan 410075, China
2Wengjing College, Yantai University, Yantai, Shandong 264005, China
3School of Mathematics and Information Science, Yantai University, Yantai, Shandong 264005, China

Received 10 August 2012; Revised 18 November 2012; Accepted 27 November 2012

Copyright © 2012 Xigao Shao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Linear multiple kernel learning models have been used for predicting financial time series. However, $\ell_1$-norm multiple kernel support vector regression is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we adopt $\ell_p$-norm multiple kernel support vector regression ($1 \le p < \infty$) as a stock price prediction model. The optimization problem is decomposed into smaller subproblems, and an interleaved optimization strategy is employed to solve the regression model. The model is evaluated on forecasting the daily closing prices of the Shanghai Stock Index in China. Experimental results show that the proposed model performs better than the $\ell_1$-norm multiple kernel support vector regression model.

#### 1. Introduction

Forecasting the future values of financial time series is an appealing yet difficult activity in the modern business world. As explained in [1, 2], financial time series are inherently noisy, nonstationary, and deterministically chaotic. In the past, many methods were proposed for tackling this kind of problem. For instance, the linear models for forecasting the future values of stock prices include the autoregressive (AR) model [3], the autoregressive moving average (ARMA) model [4], and the autoregressive integrated moving average (ARIMA) model [4]. Over the last decade, nonlinear approaches have received increasing attention in financial time series prediction and have been proposed as a more satisfactory answer to the problem. For example, Yao and Tan [5] used time series data and technical indicators as the input of neural networks to increase the forecast accuracy of exchange rates; Cao and Tay [6, 7] applied the support vector machine (SVM) to financial forecasting and compared it with the multilayer back-propagation (BP) neural network and the regularized radial basis function (RBF) neural network; Qi and Wu [8] proposed a multilayer feed-forward network to forecast exchange rates; Pai and Lin [9] investigated a hybrid ARIMA and support vector machines model for stock price forecasting; Pai et al. [10] presented a hybrid SVM model to exploit the unique strengths of the linear and nonlinear SVM models in forecasting exchange rates; Kwon and Moon [11] proposed a hybrid neurogenetic system for stock trading; Hung and Hong [12] presented an improved ant colony optimization algorithm in a support vector regression (SVR) model, called SVRCACO, for selecting suitable parameters in exchange rate forecasting; Jiang and He [13] introduced local grey SVR (LG-SVR), which integrates the grey relational grade with local SVR for financial time series forecasting; and so on.

In comparison with the previous models, SVR with a single kernel function can exhibit better prediction accuracy because it embodies the structural risk minimization principle, which considers both the training error and the capacity of the regression model [14, 15]. However, researchers have to determine in advance the type of kernel function and the associated kernel hyperparameters for SVR. Unsuitably chosen kernel functions or hyperparameter settings may lead to significantly poorer performance [16, 17].

In recent years there has been a lot of interest in designing principled regression algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance and a lower generalization error. When the right choice of features is unknown, learning linear combinations of multiple kernels is an appealing strategy. An approach that learns the combination through an optimization process is called multiple kernel learning (MKL). A first step towards a more realistic model of MKL was achieved by Lanckriet et al. [18], who showed that, given a candidate set of kernels, it is computationally feasible to learn a support vector machine and a linear kernel combination simultaneously. In MKL we need to solve a joint optimization problem while also learning the optimal weights for combining the kernels. Several practitioners have adopted linear multiple kernels to deal with practical problems. For example, Rakotomamonjy et al. [19] addressed the MKL problem through a weighted 2-norm regularization formulation and proposed an algorithm, named SimpleMKL, for solving it. Bach [20] studied the asymptotic model consistency of the group Lasso. Zhang and Shen [21] presented a multimodal multitask learning algorithm for joint prediction of multiple regression and classification variables in Alzheimer’s disease. In particular, Yeh et al. [22] developed a two-stage MKL algorithm by incorporating sequential minimal optimization and the gradient projection method, which performed better than previous methods for forecasting financial time series. Previous approaches to MKL have promoted sparse kernel combinations to support interpretability and scalability. Unfortunately, sparsity at the kernel level may harm the generalization performance of the learner; therefore $\ell_1$-norm MKL is rarely observed to outperform trivial baselines in practical applications [23]. To allow for robust kernel mixtures that generalize well, researchers have extended $\ell_1$-norm MKL to arbitrary norms, that is, $\ell_p$-norm MKL ($p \ge 1$). For example, Kloft et al. developed two efficient interleaved optimization strategies for $\ell_p$-norm MKL and showed that it can achieve better accuracy than $\ell_1$-norm MKL on real-world problems [23]; Orabona et al. presented an MKL optimization algorithm based on stochastic gradient descent for $\ell_p$-norm MKL, which possesses a faster convergence rate as the number of kernels grows [24].

In this paper, a multiple kernel learning framework is established for learning and predicting stock prices. We present a regression model for the future values of stock prices, that is, $\ell_p$-norm multiple kernel support vector regression ($\ell_p$-norm MK-SVR), where $1 \le p < \infty$. We decompose the optimization problem into smaller subproblems and adopt the interleaved optimization strategy to solve the regression model. Our experimental results show that $\ell_p$-norm MK-SVR achieves better forecasting performance than the compared methods.

The rest of this paper is arranged as follows. Section 2 details the construction of the $\ell_p$-norm MK-SVR model and describes the algorithm for our regression model. Experimental results are presented in Section 3. Section 4 concludes the paper and provides some future research directions.

#### 2. Forecasting Methodology

##### 2.1. $\ell_p$-Norm Multiple Kernel Support Vector Regression

In this section, the idea of $\ell_p$-norm multiple kernel support vector regression ($\ell_p$-norm MK-SVR) is introduced formally.

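Let $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, be the training set. Each $y_i$ is the desired output value for the input vector $x_i$. Consider a function $\phi$ that maps the samples into a high, possibly infinite, dimensional feature space. A regression model is learned from the training set and used to predict the target values of unseen input vectors. SVR is a nonlinear kernel-based regression method which tries to locate a regression hyperplane with small risk in the high-dimensional feature space [14]. Considering the soft margin formulation, the following objective function and constraints for SVR should be solved, where $C$ is the regularization parameter and $\varepsilon$ is the width of the insensitive tube:

$$\begin{aligned} \min_{w, b, \xi, \xi^*} \quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*) \\ \text{s.t.} \quad & y_i - \langle w, \phi(x_i)\rangle - b \le \varepsilon + \xi_i, \\ & \langle w, \phi(x_i)\rangle + b - y_i \le \varepsilon + \xi_i^*, \\ & \xi_i, \xi_i^* \ge 0, \quad i = 1, \dots, n. \end{aligned} \tag{1}$$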

An SVR model usually uses a single mapping function $\phi$ and hence a single kernel function $K$. Although the SVR model has good function approximation and generalization capabilities, it is not well suited to a data set with a locally varying distribution. To resolve this problem, we can construct an MK-SVR model. By combining multiple kernels instead of using a single one, the $\ell_p$-norm MK-SVR model can capture the varying distribution well. We therefore use a composite feature map with a block structure, $\phi_\theta = \sqrt{\theta_1}\,\phi_1 \times \cdots \times \sqrt{\theta_M}\,\phi_M$, to map the input space to the feature space, where $\theta_m \ge 0$ are the weights of the component functions. Given a set of base kernels $K_1, \dots, K_M$ which correspond to the previous feature maps $\phi_1, \dots, \phi_M$, linear MK-SVR aims to learn a linear combination of the base kernels as $K = \sum_{m=1}^{M} \theta_m K_m$. In learning with MK-SVR we aim at minimizing the loss on the training data with respect to the optimal kernel mixture, in addition to regularizing $\theta$ to avoid overfitting; this gives the primal formulation. Previous research on MK-SVR employs a regularizer of the form $\|\theta\|_1$, which promotes sparse kernel mixtures. However, sparsity is not always desirable, since the information carried in the zero-weighted kernels is lost. Therefore we propose to use nonsparse and thus more robust kernel mixtures by employing an $\ell_p$-norm constraint with $p > 1$ on the kernel weights $\theta_m \ge 0$, $m = 1, \dots, M$. After a change of variables in (3) and a rescaling of its first equation, the $\ell_p$-norm MK-SVR model (4) is obtained.

An alternative to the previous formulation has also been studied. For example, Zien and Ong [25] upper-bound the value of the regularizer and incorporate it as an additional constraint into the optimization problem. Following this idea, the $\ell_p$-norm MK-SVR model (4) can be transformed into the constrained form (5).

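It can be shown (see the Appendix for details) that the dual of (5) is the problem (6), in which $p^* = p/(p-1)$ defines the dual norm of the $\ell_p$-norm. Once the optimal solution of (6) has been found, the regression hyperplane of the $\ell_p$-norm MK-SVR model follows from the optimal dual variables, the kernel weights $\theta$, and a bias term obtained from any training sample whose dual variable lies strictly inside its box constraint. In the following section, an efficient algorithm is proposed for solving the optimization problem (6).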

##### 2.2. An Optimization Algorithm

The $\ell_p$-norm MK-SVR model (6) can be trained with several algorithms, for example, the sequential minimal optimization algorithm [26] and multi-kernel learning with online-batch optimization [24]. In this paper, interleaved optimization is used as the optimization scheme, following the idea of [23]. We exploit the structure of the $\ell_p$-norm MK-SVR cost function by alternating between optimizing the kernel weights $\theta$ of the linear combination of base kernels and the remaining variables. To do so we set up a two-stage optimization algorithm. The basic idea is to divide the optimization variables of the $\ell_p$-norm MK-SVR problem (6) into two groups, the kernel weights $\theta$ on one hand and the remaining SVR variables on the other. The procedure alternates between these two stages via block coordinate descent: the optimization over $\theta$ is carried out analytically, and the remaining variables are computed in the dual. The two stages are performed iteratively until the specified stopping criterion is met, as shown in Figure 1.

Figure 1: $\ell_p$-norm MK-SVR model learning algorithm (see [27]).

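In the first stage, the SVR variables are kept fixed, that is, the block norms $\|w_m\|$, $m = 1, \dots, M$, are known. Then the optimal $\theta$ in the $\ell_p$-norm MK-SVR model (6) can be calculated analytically by the following process.

Form the Lagrangian of the problem in $\theta$ under the $\ell_p$-norm constraint, set its first partial derivatives with respect to $\theta_m$ to $0$, and use the fact that $\|\theta\|_p = 1$ holds at the optimal point. Following [23], this yields the closed-form update

$$\theta_m = \frac{\|w_m\|^{2/(p+1)}}{\left(\sum_{m'=1}^{M}\|w_{m'}\|^{2p/(p+1)}\right)^{1/p}}, \quad m = 1, \dots, M, \tag{10}$$

where $\|w_m\|$ is evaluated from the current dual variables and the base kernel $K_m$.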
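The analytical kernel-weight step can be illustrated with a short Python sketch. It assumes the closed-form $\ell_p$-norm update given by Kloft et al. [23], $\theta_m = \|w_m\|^{2/(p+1)} / (\sum_{m'} \|w_{m'}\|^{2p/(p+1)})^{1/p}$; the function name `update_kernel_weights` and the example block norms are hypothetical and are not taken from the paper.

```python
def update_kernel_weights(w_norms, p):
    """Closed-form l_p-norm kernel-weight update (form assumed from Kloft et al. [23]).

    w_norms: block norms ||w_m|| >= 0, one per base kernel (not all zero).
    p:       norm parameter, p >= 1.
    Returns weights theta with theta_m >= 0 and ||theta||_p = 1.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    denom = sum(n ** (2.0 * p / (p + 1.0)) for n in w_norms) ** (1.0 / p)
    if denom == 0.0:
        raise ValueError("at least one block norm must be positive")
    return [n ** (2.0 / (p + 1.0)) / denom for n in w_norms]


# Hypothetical block norms for three base kernels.
theta = update_kernel_weights([3.0, 1.0, 0.5], p=2.0)
lp_norm = sum(t ** 2.0 for t in theta) ** 0.5
uniform = update_kernel_weights([1.0, 1.0, 1.0, 1.0], p=2.0)
```

Kernels with a larger block norm receive a larger weight, and for finite $p > 1$ no kernel with a positive norm is driven exactly to zero, which is the nonsparse behaviour the text argues for.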

In the second stage, the following algorithm is used. We give a chunking-based training algorithm (Algorithm 1) via analytical update for $\ell_p$-norm MK-SVR. The kernel weighting $\theta$ and the SVR dual variables are optimized in an interleaving way. The basic idea of this algorithm is to divide the optimization problem into an inner subproblem and an outer subproblem. The algorithm alternates between solving the two subproblems until convergence.

Algorithm 1

In every iteration, the inner subproblem (the SVR step) identifies the constraint that maximises (6) with the kernel weighting $\theta$ fixed. The outer subproblem (the $\theta$ step) is also called the restricted master problem; there, $\theta$ is computed with (10).

The interleaved optimization algorithm is depicted in Algorithm 1, and its details are as follows.

###### 2.2.1. Initialization

Assume that the initial values of the dual variables are $0$ for all training samples and that the initial value of every kernel weight $\theta_m$, $m = 1, \dots, M$, is the same constant.

###### 2.2.2. Chunking and Carrying out with SVR

In the iteration process, the procedure is standard in chunking-based SVR solvers and is carried out on a working set of variables, which is chosen as described in [28]. We implement the greedy second-order working set selection strategy of [28]. Rather than computing the gradient repeatedly, we speed up variable selection by caching, separately for each kernel. The cache needs to be updated every time the variables change in the reduced variable optimisation. In Algorithm 1, (4) and (5) compute the objective values of SVR. Finally, the analytical value of $\theta$ is computed with (10).

###### 2.2.3. Stopping Criterion

When the duality gap falls below a prespecified threshold, we terminate the algorithm and output the optimal dual variables and the kernel weights $\theta$.
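The two-stage interleaved scheme can be sketched end to end in Python. This is only an illustration under several loudly stated assumptions: the inner SVR solver is replaced by kernel ridge regression (a swapped-in technique, chosen so the sketch needs no QP solver), the stopping rule is the change in $\theta$ rather than the duality gap, the $\theta$ update uses the closed form of Kloft et al. [23], and all function names, the toy data, and the kernel parameters are hypothetical.

```python
import math


def rbf_gram(xs, gamma):
    """Gram matrix of the RBF kernel exp(-gamma * (x - z)^2) on scalar inputs."""
    return [[math.exp(-gamma * (a - b) ** 2) for b in xs] for a in xs]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def interleaved_mkl(grams, y, p=2.0, lam=0.1, tol=1e-8, max_iter=200):
    """Alternate between a regression step (theta fixed) and a theta step."""
    n_kernels, n = len(grams), len(y)
    theta = [n_kernels ** (-1.0 / p)] * n_kernels  # uniform start, ||theta||_p = 1
    alpha = [0.0] * n
    for _ in range(max_iter):
        # Regression step: kernel ridge regression stands in for the SVR solver.
        K = [[sum(t * G[i][j] for t, G in zip(theta, grams)) + (lam if i == j else 0.0)
              for j in range(n)] for i in range(n)]
        alpha = solve_linear(K, y)
        # Block norms ||w_m|| = theta_m * sqrt(alpha^T K_m alpha).
        norms = []
        for t, G in zip(theta, grams):
            q = sum(alpha[i] * G[i][j] * alpha[j] for i in range(n) for j in range(n))
            norms.append(t * math.sqrt(max(q, 0.0)))
        # Theta step: analytical l_p-norm update (form assumed from [23]).
        denom = sum(v ** (2.0 * p / (p + 1.0)) for v in norms) ** (1.0 / p)
        new_theta = [v ** (2.0 / (p + 1.0)) / denom for v in norms]
        change = max(abs(a - b) for a, b in zip(new_theta, theta))
        theta = new_theta
        if change < tol:  # simplified stopping rule, not the duality gap
            break
    # alpha was fitted with the weights in force before the final theta update.
    return theta, alpha


xs = [0.1 * i for i in range(12)]
ys = [math.sin(3.0 * x) for x in xs]
grams = [rbf_gram(xs, g) for g in (0.1, 1.0, 10.0)]
theta, alpha = interleaved_mkl(grams, ys, p=2.0)
```

Because the $\theta$ update is invariant to a common rescaling of the block norms, the sketch keeps $\|\theta\|_p = 1$ at every iteration regardless of the regularization strength `lam`.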

#### 3. Experimental Results

In this section, two experiments on a real financial time series are carried out to assess the performance of $\ell_p$-norm MK-SVR. The motivation behind the two experiments is to compare the performance of our proposed method with that of other methods, that is, single kernel support vector regression (SKSVR) [29] and $\ell_1$-norm MK-SVR [22]. All calculations are performed with programs developed in MATLAB R2010a.

##### 3.1. Experiment I

First, we compare the performance of $\ell_p$-norm MK-SVR with that of SKSVR. In this experiment, the daily closing prices of the Shanghai Stock Index in China for the period from January 2003 to December 2007 are used, and the training/validating/testing data sets are generated by a one-season moving-window testing approach. Following [29], three data sets, data1 to data3, are formed. For instance, in data1 the daily closing prices from January 2003 to December 2006 are selected as the training data set, those from January 2007 to March 2007 as the validating data set, and those from April 2007 to June 2007 as the testing data set. The corresponding time periods for data1 to data3 are listed in Table 1.

Table 1: The data sets for the first experiment.
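The one-season moving-window scheme can be sketched in Python as follows. This is a minimal illustration, assuming that each successive data set shifts the training, validating, and testing periods forward by one season (three months) and keeps the window lengths of data1 (48, 3, and 3 months); only the data1 periods are stated in the text, and the helper names `add_months` and `moving_windows` are hypothetical.

```python
def add_months(year_month, k):
    """Shift a (year, month) pair by k months (k may be negative)."""
    year, month = year_month
    index = year * 12 + (month - 1) + k
    return (index // 12, index % 12 + 1)


def moving_windows(start, n_sets, train_months=48, val_months=3,
                   test_months=3, step=3):
    """Return inclusive (first month, last month) periods for each data set."""
    windows = []
    for s in range(n_sets):
        train_start = add_months(start, s * step)
        val_start = add_months(train_start, train_months)
        test_start = add_months(val_start, val_months)
        windows.append({
            "train": (train_start, add_months(val_start, -1)),
            "validate": (val_start, add_months(test_start, -1)),
            "test": (test_start, add_months(test_start, test_months - 1)),
        })
    return windows


# data1 starts in January 2003; later sets are an assumed one-season shift.
windows = moving_windows((2003, 1), n_sets=3)
```

Under these assumptions the third testing window ends in December 2007, which agrees with the overall period quoted for the first experiment.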

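According to [29], we can derive training patterns based on the original daily stock closing prices for SKSVR and $\ell_p$-norm MK-SVR. Let $\mathrm{EMA}_n(i) = \alpha \times p_i + (1 - \alpha) \times \mathrm{EMA}_n(i-1)$ be the $n$-day exponential moving average of the $i$th day, where $p_i$ is the daily stock closing price of the $i$th day and $\alpha = 2/(n+1)$. Then the output variable can be defined as

$$y_i = \mathrm{RDP}{+}5 = \frac{\mathrm{EMA}_3(i+5) - \mathrm{EMA}_3(i)}{\mathrm{EMA}_3(i)} \times 100. \tag{11}$$

Let $x_i$ be the input vector and let $\mathrm{RDP}{-}k = (p_i - p_{i-k})/p_{i-k} \times 100$ be the $k$-day lagged relative difference in percentage of price (RDP). Moreover, we can obtain a transformed closing price by subtracting a 15-day EMA from the closing price, that is,

$$\mathrm{EMA15} = p_i - \mathrm{EMA}_{15}(i). \tag{12}$$

Based on the previous definitions, the input variables can be defined as $\mathrm{EMA15}$, $\mathrm{RDP}{-}5$, $\mathrm{RDP}{-}10$, $\mathrm{RDP}{-}15$, and $\mathrm{RDP}{-}20$. We adopt the root mean squared error (RMSE) for performance comparison, that is,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \tag{13}$$

where $y_i$ and $\hat{y}_i$ are the desired output and the predicted output, respectively.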
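A small Python sketch may make the training patterns and the error measure concrete. It assumes the feature set commonly attributed to Tay and Cao [29] (a price minus its 15-day EMA, four lagged RDP values, and a 5-day-ahead RDP computed on a 3-day EMA); the specific window lengths, the seeding of the EMA with the first price, and all function names are assumptions of this sketch rather than details confirmed by the text.

```python
import math


def ema(prices, n):
    """n-day exponential moving average with alpha = 2 / (n + 1).

    Seeding the recursion with the first price is an assumption of this sketch.
    """
    alpha = 2.0 / (n + 1)
    out = [prices[0]]
    for price in prices[1:]:
        out.append(alpha * price + (1.0 - alpha) * out[-1])
    return out


def rdp_lag(prices, i, k):
    """k-day lagged relative difference in percentage of price at day i."""
    return (prices[i] - prices[i - k]) / prices[i - k] * 100.0


def make_pattern(prices, i):
    """Build one (input vector, target) pair for day i."""
    if i < 20 or i + 5 >= len(prices):
        raise ValueError("day i needs 20 past and 5 future prices")
    ema15 = ema(prices, 15)
    ema3 = ema(prices, 3)
    x = [prices[i] - ema15[i]] + [rdp_lag(prices, i, k) for k in (5, 10, 15, 20)]
    y = (ema3[i + 5] - ema3[i]) / ema3[i] * 100.0
    return x, y


def rmse(desired, predicted):
    """Root mean squared error between desired and predicted outputs."""
    n = len(desired)
    return math.sqrt(sum((d - q) ** 2 for d, q in zip(desired, predicted)) / n)


flat_x, flat_y = make_pattern([10.0] * 40, 25)   # a flat series carries no signal
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

On a constant price series every input and the target are zero, which is a quick sanity check that the transformations remove the price level and keep only relative movements.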

There are three parameters that should be determined in advance for SKSVR, that is, $C$, $\varepsilon$, and $\gamma$ when using the RBF kernel. The forecasting performance of SKSVR is examined with $C$ and $\varepsilon$ held fixed. Because the forecasting performance obtained by SKSVR is affected by the parameter $\gamma$, we try different settings of it from 0.01 to 3 with a stepping factor of 0.05. Figure 2 shows the RMSE on the three data sets obtained by SKSVR. The figure shows that SKSVR requires different settings of $\gamma$ for different data sets to obtain the best performance; for example, the best performance for data1 occurs at one particular value of $\gamma$ (see Figure 2). The best RMSE values obtained by SKSVR are listed in Table 2.

Table 2: The comparison of RMSE values between SKSVR and $\ell_p$-norm MK-SVR.
Figure 2: Forecasting performance of SKSVR with different hyperparameters.

For the $\ell_p$-norm MK-SVR training model, we adopt the RBF kernel. A kernel combining 60 different RBF kernels is considered, whose kernel parameters are taken from a grid with a fixed step. Hence, the kernel matrix is a weighted sum of 60 kernel matrices, that is, $K = \sum_{m=1}^{60} \theta_m K_m$, where $\theta_1$ denotes the kernel weight for the first kernel matrix $K_1$, $\theta_2$ denotes the kernel weight for the second kernel matrix $K_2$, and so on, each kernel matrix having its own kernel parameter. For the three data sets, the RMSE values obtained by $\ell_p$-norm MK-SVR are also listed in Table 2. With the value of $p$ selected for each data set, the $\ell_p$-norm MK-SVR model performs better than SKSVR on data1, data2, and data3.
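The weighted sum of base kernel matrices can be sketched in Python as follows. The RBF form $\exp(-\gamma\|x - z\|^2)$, the grid of 60 values of $\gamma$ (0.01 upward in steps of 0.05), the uniform weights, and the toy inputs are all assumptions made for illustration; the paper's actual grid and learned weights are not recoverable from the text.

```python
import math


def rbf(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) (assumed parameterisation)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def combined_gram(X, gammas, theta):
    """Kernel matrix K = sum_m theta_m * K_m over a bank of RBF kernels."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for gamma, weight in zip(gammas, theta):
        for i in range(n):
            for j in range(n):
                K[i][j] += weight * rbf(X[i], X[j], gamma)
    return K


gammas = [0.01 + 0.05 * m for m in range(60)]      # hypothetical grid of 60 widths
p = 2.0
theta = [len(gammas) ** (-1.0 / p)] * len(gammas)  # uniform weights, ||theta||_p = 1
X = [[0.0, 1.0], [0.5, 0.2], [1.0, -0.3]]          # toy input vectors
K = combined_gram(X, gammas, theta)
```

Since every RBF kernel satisfies $k(x, x) = 1$, each diagonal entry of the combined matrix equals the sum of the kernel weights, which gives a simple check on the construction.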

##### 3.2. Experiment II

Second, we compare the performance of $\ell_p$-norm MK-SVR with that of $\ell_1$-norm MK-SVR. In this experiment, the daily closing prices of the Shanghai Stock Index in China for the period from January 2008 to December 2011 are used, and the training/validating/testing data sets are generated by a one-season moving-window testing approach. Following Tay and Cao [29], three data sets, D-I to D-III, are formed. The corresponding time periods for D-I to D-III are listed in Table 3.

Table 3: The data sets for the second experiment.

We also adopt RMSE (13) for performance comparison. For the $\ell_1$-norm MK-SVR and $\ell_p$-norm MK-SVR training models, a kernel combining 40 different RBF kernels is considered, each with its own kernel parameter. Hence, the kernel matrix is a weighted sum of 40 kernel matrices, that is, $K = \sum_{m=1}^{40} \theta_m K_m$, where $\theta_1$ denotes the kernel weight for the first kernel matrix $K_1$, $\theta_2$ denotes the kernel weight for the second kernel matrix $K_2$, and so on. For the three data sets, the RMSE values obtained by $\ell_1$-norm MK-SVR and $\ell_p$-norm MK-SVR are listed in Table 4. With the value of $p$ selected for each data set, the $\ell_p$-norm MK-SVR model performs better than the $\ell_1$-norm MK-SVR model on D-I, D-II, and D-III. Figure 3 shows the forecasting results for D-I and D-II by the two regression models.

Table 4: The comparison of RMSE values between $\ell_1$-norm MK-SVR and $\ell_p$-norm MK-SVR.
Figure 3: Forecasting results by $\ell_1$-norm MK-SVR and $\ell_p$-norm MK-SVR.

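Furthermore, we use the statistical test proposed by Diebold and Mariano [30] to assess the statistical significance of the forecasts by the $\ell_p$-norm MK-SVR model. The loss-differential series of $\ell_1$-norm MK-SVR and $\ell_p$-norm MK-SVR are shown in Figures 4 and 5. According to [30], we adopt the asymptotic test statistic

$$S = \frac{\bar{d}}{\sqrt{2\pi \hat{f}_d(0)/T}},$$

where $\bar{d} = \frac{1}{T}\sum_{t=1}^{T} d_t$ is the sample mean of the loss-differential series $d_t = g(e_{1,t}) - g(e_{2,t})$ of the two models, $e_{1,t}$ and $e_{2,t}$ denote the forecasting errors, and $g$ is the loss function; $2\pi \hat{f}_d(0)$ is the weighted sum of the available sample autocovariances,

$$2\pi \hat{f}_d(0) = \sum_{\tau=-(T-1)}^{T-1} 1\!\left(\frac{\tau}{S(T)}\right)\hat{\gamma}_d(\tau),$$

where $T$ is the sample size, $\hat{\gamma}_d(\tau) = \frac{1}{T}\sum_{t=|\tau|+1}^{T}(d_t - \bar{d})(d_{t-|\tau|} - \bar{d})$, and $1(\tau/S(T))$ is the lag window, defined as $1(\tau/S(T)) = 1$ for $|\tau/S(T)| \le 1$ and $0$ otherwise, where $S(T) = k - 1$; $k$ is the number of forecasting steps ahead.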
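The asymptotic Diebold-Mariano statistic can be computed with a few lines of Python. The sketch below assumes a squared-error loss for the loss differential and a rectangular lag window truncated at $h - 1$ lags for an $h$-step-ahead forecast; the function name `diebold_mariano` and the toy error series are hypothetical, and the loss function actually used in the paper is not stated in the text.

```python
import math


def diebold_mariano(errors_a, errors_b, h=1):
    """Asymptotic Diebold-Mariano statistic for two forecast-error series.

    Assumes squared-error loss, d_t = e_a[t]^2 - e_b[t]^2, and a rectangular
    lag window with truncation lag h - 1 (h = forecasting steps ahead).
    A positive value means model a has the larger average loss.
    """
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("error series must be non-empty and of equal length")
    T = len(errors_a)
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    d_bar = sum(d) / T

    def autocov(tau):
        # Sample autocovariance of the loss differential at lag tau >= 0.
        return sum((d[t] - d_bar) * (d[t - tau] - d_bar) for t in range(tau, T)) / T

    long_run_var = autocov(0) + 2.0 * sum(autocov(tau) for tau in range(1, h))
    if long_run_var <= 0.0:
        raise ValueError("non-positive long-run variance estimate")
    return d_bar / math.sqrt(long_run_var / T)


# Toy error series, not data from the paper.
errors_a = [1.0, 2.0, 1.0, 2.0]
errors_b = [0.0, 0.0, 0.0, 0.0]
stat = diebold_mariano(errors_a, errors_b)
reverse = diebold_mariano(errors_b, errors_a)
```

Under the null hypothesis of equal forecasting accuracy the statistic is asymptotically standard normal, so it can be compared with the usual normal critical values at the chosen significance level.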

Figure 4: Loss differential between the $\ell_1$-norm and $\ell_p$-norm MK-SVR models for D-I.
Figure 5: Loss differential between the $\ell_1$-norm and $\ell_p$-norm MK-SVR models for D-II.

Under the null hypothesis that the $\ell_1$-norm MK-SVR and $\ell_p$-norm MK-SVR models have equal forecasting accuracy, the test was performed at two significance levels [12]. The test results are shown in Table 5. For the three data sets, all asymptotic tests reject the null hypothesis. The test results indicate that the $\ell_p$-norm MK-SVR model indeed improves the forecasting accuracy in comparison with the $\ell_1$-norm MK-SVR model.

Table 5: Asymptotic test.

We briefly mention that the superior performance of the $\ell_p$-norm MK-SVR model ($p > 1$) is not surprising. When we use the sparsity-inducing norm ($p = 1$), some of the kernel weights are forced to become zero, and the corresponding kernels are eliminated, leading to some information loss. The daily stock closing prices do not carry large parts of overlapping information, and the information is discriminative. So a nonsparse kernel mixture can access more information and perform more robustly.

#### 4. Summary and Prospect

In this paper, an $\ell_p$-norm MK-SVR model for stock market price forecasting is proposed. The model is trained with an interleaved optimization scheme that decomposes the problem into smaller subproblems and admits an efficient implementation. In an empirical evaluation, we show that $\ell_p$-norm MK-SVR can improve predictive accuracy on relevant real-world data sets. Although we focus on price forecasting of stock markets in this paper, our $\ell_p$-norm MK-SVR model could be applied to more general financial forecasting problems. Therefore, in the future we will apply our $\ell_p$-norm MK-SVR model to other financial markets, such as exchange markets.

#### Appendix: $\ell_p$-Norm MK-SVR Dual Formulation

In this appendix, we detail the dual formulation of $\ell_p$-norm MK-SVR. We again consider $\ell_p$-norm MK-SVR with a general convex loss, given as (A.1).

In the following, we build the Lagrangian of (A.1). By introducing Lagrangian multipliers for its constraints, we obtain the Lagrangian saddle point problem. Setting the Lagrangian’s first partial derivatives with respect to the primal variables to $0$ reveals the optimality conditions.

Resubstituting the previous equations into the Lagrangian yields the dual objective, which can also be written in an equivalent form.

For standard support vector regression formulations, the hinge-type ($\varepsilon$-insensitive) loss function can be defined as $V(t, y) = \max(0, |t - y| - \varepsilon)$. This loss is convex with a bounded subgradient. As is well known, the Fenchel-Legendre conjugate of a function $f$ is defined as $f^*(x) = \sup_u \, (x^\top u - f(u))$, and the dual norm is denoted by $\|\cdot\|_{p^*}$ (the norm defined via the identity $\frac{1}{p} + \frac{1}{p^*} = 1$). According to (A.3), (A.5), and the Fenchel-Legendre conjugate of the loss function, we can obtain the dual (A.6).

In the following, we find the remaining Lagrangian multiplier at optimality. Solving for the unbounded multiplier gives its optimal value (A.7). This value is nonnegative, so we can ignore the corresponding constraint from the optimization problem and plug (A.7) into (A.6), which gives the dual optimization problem for $\ell_p$-norm MK-SVR.

For the choice of the $\ell_p$-norm, $\|\theta\|_p = 1$ holds at the optimal point, so that the corresponding multiplier term can be discarded [23]. Therefore the previous equations reduce to an optimization problem that depends only on the dual variables and the kernel weights $\theta$.

Now, the $\ell_p$-norm MK-SVR model has been constructed.

#### Acknowledgments

The authors would like to thank the handling editor and the anonymous reviewers for their constructive comments, which led to significant improvement of the paper. This work was partially supported by the National Natural Science Foundation of China under Grant no. 51174236.

#### References

1. J. W. Hall, “Adaptive selection of U.S. stocks with neural nets,” in Trading on the Edge: Neural, Genetic, and Fuzzy Systems for Chaotic Financial Markets, G. J. Deboeck, Ed., John Wiley & Sons, New York, NY, USA, 1994.
2. Y. S. Abu-Mostafa and A. F. Atiya, “Introduction to financial forecasting,” Applied Intelligence, vol. 6, no. 3, pp. 205–213, 1996.
3. D. G. Champernowne, “Sampling theory applied to autoregressive schemes,” Journal of the Royal Statistical Society B, vol. 10, pp. 204–231, 1948.
4. G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control, Prentice Hall, Englewood Cliffs, NJ, USA, 3rd edition, 1994.
5. J. Yao and C. L. Tan, “A case study on using neural networks to perform technical forecasting of forex,” Neurocomputing, vol. 34, pp. 79–98, 2000.
6. L. Cao and F. E. H. Tay, “Financial forecasting using support vector machines,” Neural Computing and Applications, vol. 10, no. 2, pp. 184–192, 2001.
7. L. J. Cao and F. E. H. Tay, “Support vector machine with adaptive parameters in financial time series forecasting,” IEEE Transactions on Neural Networks, vol. 14, no. 6, pp. 1506–1518, 2003.
8. M. Qi and Y. Wu, “Nonlinear prediction of exchange rates with monetary fundamentals,” Journal of Empirical Finance, vol. 10, no. 5, pp. 623–640, 2003.
9. P. F. Pai and C. S. Lin, “A hybrid ARIMA and support vector machines model in stock price forecasting,” Omega, vol. 33, no. 6, pp. 497–505, 2005.
10. P. F. Pai, W. C. Hong, C. S. Lin, and C. T. Chen, “A hybrid support vector machine regression for exchange rate prediction,” International Journal of Information and Management Sciences, vol. 17, no. 2, pp. 19–32, 2006.
11. Y. K. Kwon and B. R. Moon, “A hybrid neurogenetic approach for stock forecasting,” IEEE Transactions on Neural Networks, vol. 18, no. 3, pp. 851–864, 2007.
12. W. M. Hung and W. C. Hong, “Application of SVR with improved ant colony optimization algorithms in exchange rate forecasting,” Control and Cybernetics, vol. 38, no. 3, pp. 863–891, 2009.
13. H. Jiang and W. He, “Grey relational grade in local support vector regression for financial time series prediction,” Expert Systems with Applications, vol. 39, no. 3, pp. 2256–2262, 2012.
14. V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
15. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
16. O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, “Choosing multiple parameters for support vector machines,” Machine Learning, vol. 46, no. 1–3, pp. 131–159, 2002.
17. K. Duan, S. S. Keerthi, and A. N. Poo, “Evaluation of simple performance measures for tuning SVM hyperparameters,” Neurocomputing, vol. 51, pp. 41–59, 2003.
18. G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan, “Learning the kernel matrix with semidefinite programming,” Journal of Machine Learning Research, vol. 5, pp. 27–72, 2004.
19. A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, “SimpleMKL,” Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.
20. F. R. Bach, “Consistency of the group lasso and multiple kernel learning,” Journal of Machine Learning Research, vol. 9, pp. 1179–1225, 2008.
21. D. Zhang, D. Shen, and The Alzheimer’s Disease Neuroimaging Initiative, “Multimodal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease,” NeuroImage, vol. 59, pp. 895–907, 2012.
22. C. Y. Yeh, C. W. Huang, and S. J. Lee, “A multiple-kernel support vector regression approach for stock market price forecasting,” Expert Systems with Applications, vol. 38, no. 3, pp. 2177–2186, 2011.
23. M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, “$\ell_p$-norm multiple kernel learning,” Journal of Machine Learning Research, vol. 12, pp. 953–997, 2011.
24. F. Orabona, L. Jie, and B. Caputo, “Multi kernel learning with online-batch optimization,” Journal of Machine Learning Research, vol. 13, pp. 165–191, 2012.
25. A. Zien and C. S. Ong, “Multiclass multiple kernel learning,” in Proceedings of the 24th International Conference on Machine Learning (ICML'07), pp. 1191–1198, June 2007.
26. S. V. N. Vishwanathan, Z. Sun, N. Theera-Ampornpunt, and M. Varma, “Multiple kernel learning and the SMO algorithm,” in Advances in Neural Information Processing Systems, 2010.
27. C. C. Chang and C. J. Lin, “LIBSVM: a library for support vector machines,” in ACM Transactions on Intelligent Systems and Technology (TIST '11), vol. 2, no. 3, pp. 1–27, ACM, 2011.
28. R. E. Fan, P. H. Chen, and C. J. Lin, “Working set selection using second order information for training support vector machines,” Journal of Machine Learning Research, vol. 6, pp. 1889–1918, 2005.
29. F. E. H. Tay and L. Cao, “Application of support vector machines in financial time series forecasting,” Omega, vol. 29, no. 4, pp. 309–317, 2001.
30. F. X. Diebold and R. S. Mariano, “Comparing predictive accuracy,” Journal of Business and Economic Statistics, vol. 20, no. 1, pp. 134–144, 2002.