#### Abstract

Exchange rate is one of the key variables in the international economics and international trade. Its movement constitutes one of the most important dynamic systems, characterized by nonlinear behaviors. It becomes more volatile and sensitive to increasingly diversified influencing factors with higher level of deregulation and global integration worldwide. Facing the increasingly diversified and more integrated market environment, the forecasting model in the exchange markets needs to address the individual and interdependent heterogeneity. In this paper, we propose the heterogeneous market hypothesis- (HMH-) based exchange rate modeling methodology to model the micromarket structure. Then we further propose the entropy optimized wavelet-based forecasting algorithm under the proposed methodology to forecast the exchange rate movement. The multivariate wavelet denoising algorithm is used to separate and extract the underlying data components with distinct features, which are modeled with multivariate time series models of different specifications and parameters. The maximum entropy is introduced to select the best basis and model parameters to construct the most effective forecasting algorithm. Empirical studies in both Chinese and European markets have been conducted to confirm the significant performance improvement when the proposed model is tested against the benchmark models.

#### 1. Introduction

In the post-Bretton Woods era, the worldwide exchange markets have shifted towards a more floating and volatile regime, characterized by high levels of fluctuation and risk exposure. Given the role of the exchange rate as one of the most important economic factors for the national economy in the increasingly open and globalized economic system, accurate and reliable forecasting of exchange rate movements has profound impacts throughout different levels of the economy, including government, enterprises, and academia [1].

Numerous empirical studies have been conducted to investigate the interrelationship and comovement between the exchange markets and other markets, including the crude oil market, stock market, and bond market. For example, Salisu and Mobolaji [2] found statistical evidence of a bidirectional relationship between the oil price and the US-Nigeria exchange rate [2]. Chkili and Nguyen [3] use a regime-switching model to identify the dynamic linkages between exchange rates and stock market returns during both calm and turbulent periods [3]. Hacker et al. [4] have found new evidence of a negative linkage between the exchange rate and interest rate differentials in the wavelet time scale domain [4]. Meanwhile, much less attention has been paid in the literature to the linkage among different exchange markets, which represents an essential theoretical challenge. For example, Kim et al. [5] find that the conditional correlation between the Japanese Yen and other Asian currencies is decoupled and insignificant due to liquidity deterioration and elevated risk aversion in the international capital market [5]. But recent empirical studies suggest that the comovement across markets, with transmission as one particular case, turns out to be more complicated than what is assumed in the traditional linear framework. This may stem from the loss of information when low frequency data are used and from methodologies that are not able to analyze the cross correlations accurately.

Methodologically, traditional linear structural models have been effective in forecasting exchange rate movements over the medium to long time horizon, with acceptable approximation accuracy and computational efficiency, where the aggregated price behavior is comparably stable and stationary. These include models in both asset and monetary views as well as the Keynesian view, regression models, cointegration and vector autoregressive models, multivariate stochastic models, and so forth [6]. Over shorter time horizons, however, these traditional models have largely failed to demonstrate competent forecasting performance in empirical studies [6]. Part of the reason can be attributed to the nonlinear data characteristics revealed in recent empirical studies [7, 8]. Different distribution-free artificial intelligence techniques such as neural networks and support vector regressions, together with innovative optimization methods, have shown superior performance under different circumstances [7, 9–11]. But they are mainly black box in nature and offer little insight into the underlying patterns or the supporting theories behind them.

Recent empirical studies on the fractal and multiscale data characteristics indicate the emergence of multiscale modeling as an important alternative [7]. Wavelet analysis, as one popular multiscale modeling technique, has been introduced to model not only horizontal dependency in the time domain such as volatility clustering (conditional heteroscedasticity) and long memory (slow decaying autocorrelation) but also vertical dependency across time scales simultaneously [12, 13]. For example, Tiwari et al. [14] identify the comovement of the oil price and the Indian Rupee at higher time scales, but not lower ones, using wavelet analysis [14]. Reboredo and Rivera-Castro [15] use wavelet analysis to disentangle the oil price-exchange rate relationship in the time scale domain [15]. Orlov [16] tests for time varying exchange rate comovements at different time scales [16]. The rise of this approach has led to the emergence of the heterogeneous market hypothesis (HMH) as the theoretical foundation to replace the efficient market hypothesis (EMH) when modeling heterogeneous data characteristics. Methodologically this approach helps explain the heterogeneity of the underlying data comovement and transmission mechanism behind the exhibited nonlinear data characteristics. However, in the exchange rate forecasting literature, we have witnessed only limited attempts to model the correlations and comovements among exchange markets when constructing the forecasting algorithm.

In the meantime, entropy theory has been used to analyze the information content of the wavelet decomposed multiscale data structure. Wavelet entropy, relative wavelet entropy, and many other variants have been proposed in the literature to calculate the entropy of the energy distribution in the typical wavelet decomposition, as well as the cost function for the best basis algorithm to choose the optimal basis for the wavelet packet transform [17, 18]. Xu et al. [19] use a modified wavelet entropy measure to differentiate between the normal and hypertension states [19]. Samui and Samantaray [20] incorporate the wavelet entropy measure in constructing the measuring index for islanding detection in distributed generation [20]. Wang et al. [21] use best basis-based wavelet packet entropy to extract features in the decomposed structure for the follow-up classification algorithm, which performs well in EEG analysis for patient classification [21]. In the forecasting field, entropy maximization has recently been proposed to select the best forecasters, but it has attracted much less attention in the literature. For example, Bessa et al. [22] adopt the maximum entropy criterion in neural network training and find that it provides superior performance to the traditional mean square error (MSE) criterion in wind power prediction [22].

In this paper we propose an innovative entropy optimized multivariate wavelet denoising model. Empirical studies are conducted in the closely related Chinese and European exchange markets to evaluate the additional value offered by the incorporation of nonlinear multiscale cross-market correlations in the proposed algorithm. Our contributions are threefold. Firstly we provide empirical evidence of multiscale heterogeneous data characteristics distinguishable by sizes. Secondly we incorporate this stylized fact in the construction of the innovative wavelet denoising-based forecasting algorithms. Thirdly we propose the maximum entropy as a measure of in-sample performance to select the best basis and decomposition level. To the best of our knowledge, the work in this paper is amongst the first to introduce the maximum entropy in forecasting the exchange rate movement in the multiscale domain, to select the best basis and parameters.

The rest of the paper proceeds as follows. Section 2 briefly reviews the two relevant theories underlying the proposed model, that is, multivariate wavelet denoising theory and entropy theory. Section 3 proposes the multivariate wavelet analysis to analyze the time varying correlations and constructs the multivariate wavelet denoising-based exchange rate forecasting algorithm. In Section 4 we conduct experiments to empirically test and confirm the performance superiority of the proposed algorithm against the benchmark models, together with a detailed analysis of the experiment results. Section 5 concludes with summarizing remarks.

#### 2. Multivariate Wavelet-Based Denoising Theory and Entropy Theory

The ultimate goal of denoising is to set the right boundary and remove the noises while preserving major data features. In recent years wavelet denoising algorithm dominates more traditional methods such as moving average filter, exponential smoothing filter, linear Fourier smoothing, and simple nonlinear noise reduction, as it does not assume homogeneous error structure. For example, Kwon et al. [23] point out the problems with existing denoising techniques as assuming homogeneous error structure and they proposed wavelet denoising method incorporating a variance change point detection thresholding method to deal with it in protein mass spectroscopy applications [23]. Boto-Giralda et al. [24] use the stationary wavelet-based denoising methods to improve the performance of traffic volume prediction models in intelligent transportation system [24]. Gao et al. [25] propose an adaptive denoising algorithm and contend it to be superior to wavelet-based approaches when applied to analysis of electroencephalogram (EEG) signals contaminated with noises [25]. Lotric and Dobnikar [26] and Lotric [27] integrate the neural network with the wavelet denoising method to optimize the denoising parameters dynamically and find the performance improvement in prediction accuracy [26, 27].

Generally the multivariate wavelet denoising algorithm involves the following procedures.

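(1) The original data series are projected into a higher-dimensional space and split into different subspaces characterized by scales using the multivariate wavelet transform. These subspaces constitute a doubly infinite nested sequence of subspace pairs of both denoised and noise data components, $V_j^2$ and $W_j^2$, respectively, of $L^2(\mathbb{R}^2)$ as follows [28–30]:

$$\cdots \subset V_{-1}^2 \subset V_0^2 \subset V_1^2 \subset \cdots \subset L^2(\mathbb{R}^2), \quad (1)$$

in which

$$V_j^2 = V_j \otimes V_j, \quad (2)$$

where $\otimes$ is the tensor product operator.

Then for each $j$ in $\mathbb{Z}$, the orthogonal complement space $W_j^2$ of $V_j^2$ in $V_{j+1}^2$ is defined as

$$W_j^2 = (V_j \otimes W_j) \oplus (W_j \otimes V_j) \oplus (W_j \otimes W_j). \quad (3)$$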

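For all $f \in L^2(\mathbb{R}^2)$, the decomposition and reconstruction in the two-dimensional case are defined as in (4), respectively:

$$a_{J,k} = \langle f, \phi_{J,k} \rangle, \quad d^{m}_{j,k} = \langle f, \psi^{m}_{j,k} \rangle, \qquad f = \sum_{k} a_{J,k}\,\phi_{J,k} + \sum_{m \in \{H, V, D\}} \sum_{j \ge J} \sum_{k} d^{m}_{j,k}\,\psi^{m}_{j,k}, \quad (4)$$

where $a_{J,k}$ and $d^{m}_{j,k}$ are the approximation and detail coefficients, $\phi_{J,k}$ and $\psi^{m}_{j,k}$ are the dilated and translated scaling and wavelet functions, $j \in \mathbb{Z}$ is the scale index, $k \in \mathbb{Z}^2$ is the translation index, and $J$ is the coarsest scale. In the 2-dimensional case, three orthonormal wavelet bases, including the horizontal wavelet $\psi^{H}$, vertical wavelet $\psi^{V}$, and diagonal wavelet $\psi^{D}$, are needed to produce the detail subspaces [31].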

When the data are of finite support, the discrete wavelet transform encounters the boundary distortion issue at the edges of the data under analysis. Padding, that is, adding extra data points at either the beginning or the end of the data, is one approach to facilitate the transform. Different padding techniques exist, including zero padding and symmetric padding [30].

(2) The separation between the denoised and noise subspace pairs is achieved by applying thresholds, chosen specifically at different scales for different directions, to either suppress or shrink the wavelet coefficients. The denoised and noise parts are separated, with finer details revealing patterns at more microscopic scales.

The dominant threshold selection rules include Universal, Minimax, and Stein's unbiased risk estimate (SURE) [32]. Different threshold selection rules set different noise reduction targets. For example, the universal threshold selection rule aims to remove as much noise as possible. The threshold is selected as $\lambda = \hat{\sigma}\sqrt{2 \ln N}$, where $N$ is the number of wavelet coefficients and $\hat{\sigma}$ is the estimate of the volatility level. When the sample size is large and the data series are normally distributed, the universal threshold gives a statistical upper bound on the noise level in the data. However, this method achieves the maximal level of smoothness at the cost of lower goodness-of-fit, and the denoised data risk losing some important data features. The Minimax threshold selection rule adopts function fitness criteria such as MSE. The denoised data represent the best fit approximation to the original data, retaining spikes and hikes. However, this is achieved at the cost of lower function smoothness.

The mainstream shrinkage rules are the hard and soft shrinkage rules [32, 33]. The hard shrinkage rule acts like a high pass filter on coefficient magnitudes: it suppresses the wavelet coefficients below the chosen threshold value and leaves the remaining coefficients intact as follows:

$$\hat{w} = \begin{cases} w, & |w| > \lambda, \\ 0, & |w| \le \lambda, \end{cases}$$

where $w$ refers to the wavelet coefficients and $\lambda$ is the set threshold value. The soft shrinkage rule focuses on signal smoothing. It suppresses the wavelet coefficients below the set threshold value and subtracts the threshold value from the magnitude of the remaining wavelet coefficients. Compared to the hard shrinkage rule, the data processed with the soft shrinkage rule are smoother but lose the abrupt changes in the original data. It filters the signal as follows:

$$\hat{w} = \operatorname{sgn}(w)\,(|w| - \lambda)_{+},$$

where $(x)_{+} = \max(x, 0)$.

(3) Processed wavelet coefficients are reconstructed into the unified data series using wavelet synthesis.
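The three steps above can be sketched in a few lines of Python. This is only an illustrative, univariate, single-level simplification and not the multivariate algorithm used in the paper: it assumes the Haar wavelet, a median-absolute-deviation estimate of the volatility level, and a one-point symmetric pad for odd-length input. All function names are hypothetical.

```python
import math
from statistics import median

def haar_step(x):
    """One level of the orthonormal Haar transform (len(x) must be even)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse_step(approx, detail):
    """Wavelet synthesis: invert haar_step."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def universal_threshold(detail):
    """sigma * sqrt(2 ln N); sigma from the median absolute deviation (assumed)."""
    sigma = median(abs(d) for d in detail) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(len(detail)))

def hard_shrink(w, lam):
    """Keep coefficients above the threshold, zero the rest."""
    return [v if abs(v) > lam else 0.0 for v in w]

def soft_shrink(w, lam):
    """Zero small coefficients and pull the others towards zero by lam."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in w]

def denoise(x, shrink=hard_shrink):
    """Return (denoised, noise) parts of x; the two parts sum back to x."""
    padded = list(x)
    if len(padded) % 2:              # symmetric padding: repeat the last point
        padded.append(padded[-1])
    approx, detail = haar_step(padded)
    lam = universal_threshold(detail)
    smooth = haar_inverse_step(approx, shrink(detail, lam))
    noise = [p - s for p, s in zip(padded, smooth)]
    return smooth[:len(x)], noise[:len(x)]
```

Passing `shrink=soft_shrink` switches the shrinkage rule; a multilevel version would apply `haar_step` repeatedly to the approximation coefficients.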

During the denoising process, there are unknown parameters that have significant impacts on the denoising performance. Other than the statistical approach to quantify them, the information theoretic approach can be brought in to quantify them as well. The entropy is a widely used statistical measure of disorder and uncertainty, quantifying the data randomness [34]. It also corresponds to a measure of information content. For a stochastic time series system, the classical Shannon entropy is defined as follows [35]:

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i,$$

where $p_i$ denotes the probability of the $i$th of the $n$ possible states.

When normalized by its maximum value $\log n$, where $n$ is the number of possible states, the value of entropy lies between 0 and 1. The higher the entropy is, the higher the level of disorder and uncertainty is.
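As a small numerical illustration, the following Python sketch computes the Shannon entropy of a discrete distribution, optionally normalized by $\log n$ so that it falls in $[0, 1]$. Estimating the probabilities from an equal-width histogram is an assumption made here for illustration; the function names are hypothetical.

```python
import math

def shannon_entropy(probs, normalize=False):
    """H = -sum p_i log p_i; zero-probability states contribute nothing."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    if normalize and len(probs) > 1:
        h /= math.log(len(probs))    # maximum entropy of n states is log n
    return h

def histogram_probs(values, bins=10):
    """Relative frequencies of values over equal-width bins (assumed estimator)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant series
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]
```

A uniform distribution attains the maximum (normalized entropy of 1), while a distribution concentrated on a single state has entropy 0.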

#### 3. An Entropy Optimized Multivariate Wavelet Denoising-Based Exchange Rate Forecasting Model (MWVAR)

Homogeneity and rationality are two basic assumptions imposed in the traditional EMH behind major multivariate exchange rate forecasting algorithms. To recognize the multiscale properties in the high frequency data, we propose the HMH instead. In the HMH, the heterogeneous market microstructure is acknowledged explicitly, by assuming different investor strategies, scales, and time horizons, to name just a few [36–42].

Following HMH framework, the exchange market receives the joint influence from market agents with different defining characteristics including investment strategies, time horizons, and investment scales.

Based on the stylized facts, we make some simplifying assumptions: (1) investment strategies within each time horizon are homogeneous and (2) investment strategies across time scales are mutually independent. Then based on the aforementioned theoretical framework, we propose the entropy optimized multivariate wavelet denoising-based exchange rate forecasting algorithm. It involves the following steps.

(1) Suppose that the returns of the exchange markets are the sum of common latent factors and individual latent factors; the multivariate wavelet denoising algorithm is used to separate data from noise using particular wavelet families. By decomposing the data into the multiscale domain, data and noise are separated based on their different characteristics across scales, with noise being smaller in scale. Thus a more subtle distinction between data and noise can be drawn.

(2) The denoised and noise data are assumed to follow some particular stochastic process. The conditional mean matrix for the denoised data and noise data is modeled by employing particular conditional time series models. In this paper, we adopt vector autoregressive (VAR) processes as follows:

$$\mu_t = \sum_{i=1}^{m} \phi_i\, r_{t-i} + \sum_{j=1}^{n} \theta_j\, \varepsilon_{t-j},$$

where $\mu_t$ is the conditional mean at time $t$, $r_{t-i}$ is the lag returns with parameter $\phi_i$, and $\varepsilon_{t-j}$ is the lag residuals in the previous period with parameter $\theta_j$.

VAR is chosen over the more theoretically sound vector autoregressive moving average (VARMA) model in this paper for the following reasons. Firstly there is a lack of authoritative methodology to uniquely identify and estimate the VARMA model, although some initial attempts have been made [43]. VAR is still by far the most well established and widely applied multivariate time series model in the literature. Secondly any invertible vector ARMA can be approximated by a VAR of infinite order [44]. We use the information criteria (IC) to determine the optimal specification for the VAR model. Typical IC include the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).

(3) Using the in-sample data, different criteria such as MSE and entropy can be used to determine the model specifications and parameters. The minimization of MSE corresponds to the minimization of error variance. The entropy maximization corresponds to the maximization of information content in the predictors and higher generalizability. Given the predicted random variable $\hat{y}$, generated with the unknown data generating process (DGP) with unknown parameters $\theta$ and the observation $y$, the Shannon entropy of the predictor is defined as follows:

$$H(\hat{y}) = -\int f(\hat{y}) \log f(\hat{y})\, d\hat{y},$$

where $H(\hat{y})$ refers to the Shannon entropy of the predictor $\hat{y}$ and $f(\cdot)$ refers to the probability density function (PDF). The objective is to maximize $H(\hat{y})$ by adjusting the different parameters of the forecasting algorithm that produces $\hat{y}$.

(4) With the chosen model specifications and parameters, the forecast matrix is reconstructed from the individual forecasts, both denoised and noise parts. Thus the mean matrix can be aggregated from the individual mean matrix forecasts.
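Steps (2) and (4) can be illustrated with a minimal Python sketch. It assumes a bivariate VAR(1) without intercept, estimated by ordinary least squares, and takes the denoised and noise parts as already available (for instance from a wavelet denoising step); the lag order, the estimator, and all function names are assumptions made for illustration rather than the paper's exact implementation.

```python
def fit_var1(series):
    """Least-squares VAR(1) without intercept for a list of (y1, y2) pairs.

    Solves A = (sum y_t x_t') (sum x_t x_t')^{-1} with x_t = y_{t-1}.
    """
    sxx = [[0.0, 0.0], [0.0, 0.0]]
    syx = [[0.0, 0.0], [0.0, 0.0]]
    for prev, cur in zip(series[:-1], series[1:]):
        for i in range(2):
            for j in range(2):
                sxx[i][j] += prev[i] * prev[j]
                syx[i][j] += cur[i] * prev[j]
    det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
    inv = [[sxx[1][1] / det, -sxx[0][1] / det],
           [-sxx[1][0] / det, sxx[0][0] / det]]
    return [[sum(syx[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def forecast_var1(a, last):
    """One-step-ahead conditional mean A @ y_t."""
    return tuple(a[i][0] * last[0] + a[i][1] * last[1] for i in range(2))

def combined_forecast(denoised, noise):
    """Forecast each part separately, then aggregate the two mean forecasts."""
    f_d = forecast_var1(fit_var1(denoised), denoised[-1])
    f_n = forecast_var1(fit_var1(noise), noise[-1])
    return tuple(d + n for d, n in zip(f_d, f_n))
```

In practice the lag order would be chosen by AIC or BIC, as described in step (2), and a library estimator would replace the hand-written 2x2 solve.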

#### 4. Empirical Studies

##### 4.1. Experiment Settings

We choose the US Dollar against Chinese Renminbi (RMB) and US Dollar against European Euro (Euro) exchange rates to construct the data set for the empirical studies. The dataset extends from 23 July 2007 to 30 August 2013. The starting date is chosen because the Chinese government changed its exchange rate policy from a fixed peg to the US Dollar to a peg to a basket of currencies. This has had a significant impact on the exchange rate movement, which shifted into a different regime and evolved based on a different underlying mechanism with wider bands as well as a higher level of fluctuations. The end date is set to include the latest data available when the research was conducted. The dataset is preprocessed to correct recording errors and to remove the records when the readings are interrupted due to holiday breaks in either market. This results in 1533 daily observations. Since there is no consensus on the division of the dataset, either in the machine learning or the econometric literature [45], we follow the common criterion that reserves at least 60% of the data for training and retains a sufficiently large test set for the results to be statistically valid [46]. The dataset is divided into three subsets, that is, the training set for the proposed wavelet denoising VAR model (36%), the model tuning set to select the best basis and relevant parameters (24%), and the test set for the out-of-sample test to evaluate the performance of different models (40%). We perform one day ahead forecasts using the rolling-window method.
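The split and the rolling-window scheme can be sketched as follows. The exact rounding of the subset boundaries is not stated in the text, so truncation towards zero is an assumption here, and the function names are hypothetical.

```python
def split_sizes(n, train=0.36, tune=0.24):
    """Sizes of the training, tuning, and test subsets for n observations."""
    n_train = int(n * train)         # truncation is an assumed convention
    n_tune = int(n * tune)
    return n_train, n_tune, n - n_train - n_tune

def rolling_windows(data, window):
    """Yield (history, target) pairs for one-step-ahead forecasting."""
    for end in range(window, len(data)):
        yield data[end - window:end], data[end]
```

With the 1533 daily observations, `split_sizes(1533)` gives subsets of 551, 367, and 615 observations under this convention. Each `history` returned by `rolling_windows` would be used to refit the model before forecasting the corresponding `target`.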

Descriptive statistics for data characteristics are listed in Table 1.

In Table 1, the two price series refer to the prices of the Euro and the RMB, and the *P* value is that of the JB test statistic. Descriptive statistics in Table 1 show some interesting stylized facts. The market exhibits considerable fluctuations, as suggested by the significant volatility level. The distribution of the market price is fat-tailed and leptokurtic, as suggested by significant skewness and kurtosis levels. There is also a high level of market risk exposure due to extreme events in the market, as reflected in the significant kurtosis level. The market price also deviates from the normal distribution and exhibits nonlinear dynamics; this is further confirmed by the rejection of the Jarque-Bera test of normality [47, 48]. As the autocorrelation and partial autocorrelation functions indicate the presence of trend factors, the daily prices are log differenced at the first order to remove them, as in $r_t = \ln P_t - \ln P_{t-1}$, where $P_t$ is the price at time $t$. We further calculate the descriptive statistics on the return data and find them to be closer to the normal distribution, as indicated by the four moments. The kurtosis, however, still deviates from the normal level, which indicates that the market exhibits significant abnormal return events. Besides, since the null hypothesis of the JB test is rejected, this further indicates that the market return contains unknown nonlinear dynamics, not easily captured by traditional linear models.
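The log differencing and the normality diagnostics described above can be reproduced with a short Python sketch. It uses population (biased) moment estimators and the textbook form of the Jarque-Bera statistic, $JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$; the function names are hypothetical.

```python
import math

def log_returns(prices):
    """First-order log differences r_t = ln P_t - ln P_{t-1}."""
    return [math.log(b) - math.log(a) for a, b in zip(prices[:-1], prices[1:])]

def moments(x):
    """Mean, standard deviation, skewness, and kurtosis (population form)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2

def jarque_bera(x):
    """JB statistic; asymptotically chi-squared with 2 degrees of freedom."""
    n = len(x)
    _, _, skew, kurt = moments(x)
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
```

A large JB value relative to the chi-squared critical value (about 5.99 at the 5% level) leads to rejection of normality.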

##### 4.2. Empirical Analysis of Dynamic Behaviors

The exchange markets are subject to increasingly frequent and abrupt external shocks such as the subprime crisis and European debt crisis, which are illustrated in the plot of returns of both Chinese Renminbi and European Euro as in Figures 1 and 2.

In Table 2, the RMB returns and the Euro returns are reported over each of three periods. The three periods refer to July, 2007 to May, 2009; May, 2009 to December, 2011; and December, 2011 to August, 2013. It can be seen from Figures 1 and 2 and Table 2 that returns in both markets share some common features while demonstrating their unique characteristics. Both markets are subject to the influences of the subprime crisis, which began in February, 2007 and extended to May, 2009, and of the European debt crisis from December, 2009 to December, 2011. Meanwhile each market is individually subject to the influences of some unique external forces. As shown in Figure 1, the Renminbi market has a much lower level of fluctuations than the Euro market, being subject to tighter control by the central bank. As shown in Figure 2, the European debt crisis has a much stronger impact on the Euro market than on the Renminbi market. Generally the impact of the subprime crisis is larger than that of the European debt crisis for both markets. Meanwhile, as shown in both figures, both markets are subject to some common latent factors. Both markets demonstrate obvious jump behaviours around 2008, when they are subject to the subprime crisis.

We further calculate the cross correlations between the two exchange markets, using rolling windows of size 252, and plot the results in Figure 3.
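A rolling-window correlation of this kind can be computed as in the following sketch, which assumes the Pearson correlation coefficient and a default window of 252 observations (roughly one trading year); the function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def rolling_correlation(x, y, window=252):
    """Correlation over each trailing window; one value per window end."""
    return [pearson(x[i - window:i], y[i - window:i])
            for i in range(window, len(x) + 1)]
```

For two return series of length `n`, the result has `n - window + 1` entries, each describing the comovement over the preceding `window` observations.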

It can be seen from Figure 3 that the correlations are dynamically changing, over different periods at the significant level. Taking subprime crisis and European debt crisis as the marking events, we further calculated the descriptive statistics of correlations for three periods and listed the results in Table 3.

It can be seen from both Figure 3 and Table 3 that for periods 1 and 3, the correlation is positive and relatively stable. The correlations turn negative for period 2, accompanied by significant fluctuations. Within period 2, there are frequent regime-switching behaviors. For example, the correlation from November, 2007 to August, 2009 is mainly affected by the subprime crisis, while the subsequent behavior results from the European debt crisis. From November, 2009 to December, 2010, the correlation stays at the positive level, subject to the positive measures taken by the European Union. Then, between December, 2010 and August, 2011, the correlation becomes negative again because of the second wave of the European debt crisis. In general, we find that the correlations of the Renminbi and the Euro fluctuate significantly over different periods. They are stable over relatively short periods of time and shift frequently between different regimes. This observation also implies that the time varying features of the correlations result from the joint influences of common latent factors such as the global financial crisis, as well as individual risk factors such as China’s tight exchange rate bands.

##### 4.3. Experiment Results

To evaluate the performance of the proposed algorithm against the benchmark ones, we use MSE to measure its predictive accuracy and the Clark-West test of equal predictive accuracy to test the statistical significance of the difference in predictive accuracy [49, 50].
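For reference, the two evaluation measures can be sketched in Python as follows. The Clark-West statistic is written in its commonly stated MSPE-adjusted form, where `bench` is the forecast of the parsimonious benchmark and `model` that of the larger model; the simple (non-HAC) variance estimate and the function names are assumptions made for illustration.

```python
import math

def mse(actual, forecast):
    """Mean squared forecast error."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def clark_west_stat(actual, bench, model):
    """t-statistic of the MSPE-adjusted loss differential.

    f_t = (y - bench)^2 - [(y - model)^2 - (bench - model)^2];
    the statistic is mean(f) divided by its standard error.
    """
    f = [(a - b) ** 2 - ((a - m) ** 2 - (b - m) ** 2)
         for a, b, m in zip(actual, bench, model)]
    n = len(f)
    mean = sum(f) / n
    var = sum((v - mean) ** 2 for v in f) / (n - 1)
    return mean / math.sqrt(var / n)
```

The test is one sided: a sufficiently large positive statistic, compared with standard normal critical values, favours the larger model over the benchmark.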

MSEs for the benchmark vector random walk (VRW) and VAR model are and . Then the wavelet denoising VAR model is applied to the testing data to investigate the effects of different parameters on the model performance. Different combinations of parameters choices are pooled into the parameters pool, including two shrinkage rules (i.e., hard and soft shrinkage rules), decomposition levels up to scale 9, and 2 wavelet families including Daubechies 2 and Coiflet 2. The universal threshold is used during the denoising process. The model orders for VAR () processes are determined following AIC and BIC minimization principle.
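The parameter pool described above can be enumerated as in the sketch below. The wavelet labels `db2` and `coif2` and the choice of levels 1 through 9 are naming and indexing assumptions made for illustration.

```python
from itertools import product

SHRINKAGE_RULES = ("hard", "soft")
WAVELETS = ("db2", "coif2")      # Daubechies 2 and Coiflet 2 (labels assumed)
LEVELS = range(1, 10)            # decomposition levels 1..9 (assumed to start at 1)

def parameter_pool():
    """All combinations of wavelet family, shrinkage rule, and level."""
    return [{"wavelet": w, "shrinkage": s, "level": lvl}
            for w, s, lvl in product(WAVELETS, SHRINKAGE_RULES, LEVELS)]
```

This yields 2 x 2 x 9 = 36 candidate specifications, each of which would be evaluated on the tuning set.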

Experiment results in Table 4 further confirm that the forecasting accuracy of the proposed wavelet denoising model is sensitive to the wavelet parameters used to denoise the original data. Some of the wavelet parameters can improve the forecasting performance to a level that beats the traditional benchmark models significantly. For example, the wavelet denoising VAR outperforms both the VRW and VAR models when the Coiflet 2 wavelet is used, for both hard and soft shrinkage rules.

Different criteria can then be used to determine the model specifications and parameters. Following the MSE minimization principle when identifying the appropriate model specifications, the chosen model specifications from results in Table 4 are Coiflet 2 wavelet family with hard threshold strategy at decomposition level 7.

Meanwhile, we propose the entropy maximization principle as the criterion for identifying the appropriate model specifications. Experiment results for different model specifications and parameters are listed in Table 5.

From results in Table 5, the chosen model specifications are Coiflet 2, with hard threshold strategy at decomposition level 3.
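The two selection principles can be contrasted with a short sketch. The text does not specify how the density of the predictor is estimated, so an equal-width histogram is assumed here; `candidates` is a hypothetical mapping from a specification label to its in-sample forecasts.

```python
import math

def histogram_entropy(values, bins=10):
    """Shannon entropy of an equal-width histogram of values (assumed estimator)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant forecast
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in counts if c)

def select_by_entropy(candidates):
    """Pick the specification whose forecasts have maximum entropy."""
    return max(candidates, key=lambda k: histogram_entropy(candidates[k]))

def select_by_mse(candidates, actual):
    """Pick the specification whose forecasts have minimum MSE."""
    def mse(f):
        return sum((a - b) ** 2 for a, b in zip(actual, f)) / len(actual)
    return min(candidates, key=lambda k: mse(candidates[k]))
```

The two criteria need not agree, which is consistent with the different decomposition levels chosen from Tables 4 and 5.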

Experiment results in Table 6 show that the proposed algorithm outperforms the benchmark VRW and VAR models in terms of predictive accuracy. The Clark-West test of equal predictive accuracy suggests that the performance gap is significant at the 7% significance level against the VRW model and at the 13% significance level against the VAR model. Meanwhile, experiment results also show that adopting the proposed entropy measure during the model parameter determination leads to a lower MSE and a higher *P* value in the out-of-sample significance test, compared to the currently dominant MSE measure. Out-of-sample performance comparisons for the proposed MWVAR using different in-sample criteria during the model parameter determination suggest that using the proposed entropy measure leads to improved out-of-sample model performance. The proposed MWVAR algorithm achieves a lower MSE in general, compared with the algorithm using in-sample MSE as the criterion.

The performance improvement is attributed to the analysis of the latent risk structure in the multiscale domain using wavelet analysis, as well as the selection of the best basis and parameters based on the entropy maximization principle. These results further imply that exchange rate data are complicated processes with a mixture of underlying DGPs of different natures. There may be redundant representations of the underlying latent structure, since explicit analytic solutions are lacking. The determination of the best basis and parameters, heterogeneous in nature, holds the key to further performance improvement and a more thorough understanding of the DGPs during the modeling process.

#### 5. Conclusions

In this paper, we propose the HMH-based theoretical framework for exchange rate forecasting. Under the proposed theoretical framework we further propose the multivariate wavelet-based exchange rate forecasting algorithm as one particular implementation. We find that exchange rate behaviors are affected by noise and main trends, which have different characteristics. The separation of noise and data needs to be conducted in a multiscale manner to recover the useful data for further modeling by the VAR model. Results from empirical studies using the USD against the RMB and the Euro, as the typical pair, confirm the performance improvement of the proposed models against the benchmark models. The work done in this paper suggests that a more accurate separation of data and noise leads to better behaved data and a higher level of model generalizability.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The authors would like to express their sincere appreciation to the editor and the three anonymous referees for their valuable comments and suggestions, which helped improve the quality of the paper tremendously. This work is supported by the CityU Strategic Research Grant (no. 7004135), the National Natural Science Foundation of China (NSFC nos. 71201054 and 91224001), and the Fundamental Research Funds for the Central Universities (no. ZZ1315).