A New Approach for Reconstruction of IMFs of Decomposition and Ensemble Model for Forecasting Crude Oil Prices
Accurate forecasting of crude oil prices is important for government agencies, investors, and researchers. To address this issue, this paper designs a new paradigm for the reconstruction of intrinsic mode functions (IMFs) in decomposition and ensemble models, reducing computational complexity and enhancing forecasting accuracy. Under the framework of "divide and conquer", decomposition and ensemble methodologies significantly enhance forecasting accuracy when combined with the proposed IMF reconstruction method. The proposed approach uses the autocorrelation at lag 1 of each IMF for the reconstruction. The ensemble empirical mode decomposition (EEMD) technique is employed to decompose the data into different IMFs. Models applied to decomposed data generally perform well compared to their application to undecomposed data. However, decomposition may sometimes produce poor results because errors accumulate in the final ensemble step. Thus, in this study, the reconstruction of IMFs is proposed to minimize this error and thereby increase forecasting accuracy. The Brent and West Texas Intermediate (WTI) datasets (daily and weekly) are used to compare the forecasting performance of autoregressive integrated moving average (ARIMA) and artificial neural network (ANN) models on the decomposed data. The results show that the new paradigm of reconstructing IMFs through autocorrelation is a simple strategy that significantly improves the performance of single models, including ARIMA and ANN. Hence, it is concluded that the proposed model takes less computational time and achieves higher forecasting accuracy with the reconstruction of IMFs than with the use of all IMFs.
Crude oil is a very important commodity because of its unique nature, as it affects the life of every individual in many ways. According to the IEA, in early 2018 the world consumed about 99.3 million barrels of oil and liquid fuels daily. In the global market, it is the most actively and heavily traded commodity. Due to the high demand for crude oil in every field of life, it needs more attention than other commodities. Oil is a nonrenewable commodity, yet the world consumes it in many ways; it is therefore a challenge for mathematicians, statisticians, and econometricians to develop better strategies for understanding the price dynamics of crude oil. Due to the irregular and stochastic nature of oil prices, developing an appropriate model for forecasting crude oil prices (COPs) is a very complex and challenging task. The compound and complex nature of COPs leaves this area wide open for researchers to develop many different forecasting procedures.
In the existing literature, several models have been proposed for forecasting time series data and applied to forecasting COPs; they can generally be described as [1, 2]

x_{t+h} = f(x_t, x_{t-1}, …, x_{t-p}) + ε_t,

where x_t represents the price of crude oil at time t, x_{t-i} is the previous value at lag i, h is the horizon of the prediction, and ε_t denotes the prediction error, assumed to be an independent and identically distributed random variable. The function f is designed differently by each model class, together with its parameter estimation technique, and current COPs forecasting models can be broadly divided into three categories. The first category comprises traditional econometric models with strict assumptions such as stationarity and linearity in parameter estimation but relatively simple fixed functional forms. The second category is artificial intelligence (AI) models, which use flexible functions capable of self-learning at the model training stage. The third category, very popular these days, systematically combines several single models into a hybrid model. This study focuses on the third category: hybridizing popular single models into one to achieve high forecasting accuracy with a simple technique.
The first category consists of the ARIMA, RW, GARCH, VAR, and ECM models, which are frequently used to forecast COPs. For the prediction of monthly COPs, Y. Xiang and X. H. Zhuang  used the ARIMA model on Brent data covering the period from November 2012 to April 2013. For estimation of the volatility and the conditional mean, a GARCH model was used for daily WTI COPs covering the period from 9 December 2000 to 2 January 2010. The VAR model was utilized in  for US monthly COPs covering January 1980 to November 2002. The RW model has also been used as a benchmark to forecast COPs moments . The ECM model was used in  for the prediction of Brent and WTI COPs covering the period from 1994 to 2002.
The second category concerns AI methods; the predominant techniques frequently used to forecast COPs, whose superiority over traditional econometric models has been shown in empirical studies, are SVR, ANN, and LSSVR. For SVR, W. Xie et al.  used the technique for the prediction of monthly WTI COPs covering the period from January 1970 to December 2003, and the empirical study showed that the SVR model provides more accurate predictions than the ARIMA and BPNN models. Similarly, an SVR model was used in  for the prediction of WTI COPs covering the period from January 1986 to December 2009. Movagharnejad et al.  used an ANN model for forecasting COPs data covering the period from January 2000 to April 2010. An evolutionary neural network was presented by Chiroma et al.  for forecasting monthly WTI COPs covering the period from May 1987 to December 2011. Li et al.  utilized the LSSVR model for forecasting WTI COPs from 4 January 2008 to 18 October 2018, concluding that the BPNN, ARIMA, and SVR models were outperformed by the LSSVR model. AI models perform better than traditional models; however, overfitting and best-parameter selection are the weaknesses of these models .
The third category concerns the hybridization of models into a single model. In this category, one of the most famous concepts, "divide and conquer" , is used to decompose the original series into meaningful components and then ensemble them after applying different techniques to each component. Nowadays, the decomposition and ensemble strategy is frequently used to forecast COPs. The methodology has three main steps: in the first step, the time series is divided into small independent components; in the second step, a suitable technique is applied to predict each component; and in the last step, the individual predictions are ensembled into the prediction of the original series [13, 15]. Using the "divide and conquer" concept in decomposition and ensemble modelling, some recent studies have demonstrated the usefulness of these techniques in enhancing forecasting accuracy. Yu et al.  decomposed the original series using EMD and predicted daily WTI and Brent COPs covering the periods from 20 May 1987 to 30 September 2006 and from 1 January 1986 to 30 September 2006, respectively, concluding that the "divide and conquer" approach performed well and enhanced the forecasting accuracy for both the WTI and Brent COPs series. The complementary ensemble empirical mode decomposition technique was introduced in  for forecasting COPs, and its empirical results support the claim that the decomposition and ensemble strategy improves the forecasting accuracy of the model. The same outcome was achieved in , which suggested a new AI learning paradigm with compressed sensing for forecasting daily COPs covering the period from 3 January 2011 to 17 July 2013. An EEMD and EELM paradigm was introduced in  for the prediction of WTI COPs covering the period from 2 January 1986 to 21 October 2013, and the results supported the decomposition and ensemble methodologies.
In conclusion, the decomposition and ensemble methodologies worked well and significantly improved the performance of the models used for forecasting the COPs.
Moreover, Lin and Sun  proposed a hybrid method combining CEEMDAN and the MLGRU neural network to forecast COPs and concluded that the new model enhanced forecasting accuracy. Xian et al.  suggested a new integrated model called EEMD-ICA for financial time-series signal processing. Wu et al.  proposed a novel model based on EEMD and LSTM; the proposed EEMD-LSTM model was applied to WTI COPs, and the numerical results proved the superiority of the new technique.
Furthermore, the complex data of COPs are handled more effectively by decomposition ensemble models than by single models such as ARIMA, GARCH, and ANN. However, an essential issue can arise regarding the models' computational time, cost, and complexity. In decomposition ensemble models, the time series data are broken down into several independent components called IMFs. After the decomposition, modelling all IMFs can be time-consuming because each IMF is modelled and forecast individually. Moreover, poor results are sometimes produced at the end because the estimation errors of all the models accumulate in the final ensemble step of forecasting [2, 22]. To address this problem, a reconstruction of the IMFs is incorporated after the decomposition step, before the individual components are forecast. During the IMF reconstruction step, all the IMFs obtained from decomposing the original series are reconstructed into a few meaningful components. To overcome the above problem in decomposition and ensemble models, B. Zhu et al.  proposed a new multiscale paradigm through a kernel function incorporating EEMD, PSO, and LSSVM. Using the EEMD method, the original series was decomposed into IMFs and reconstructed into groups: high-frequency components (HFs), low-frequency components (LFs), and the trend. The HFs were then forecast using the ARIMA model, while the LFs and trend were forecast by the PSO method, and for the final prediction, all the forecast results were simply combined. Aamir et al.  also reconstructed the IMFs of the EEMD model using the order of the ARIMA model for forecasting weekly Brent and WTI COPs; the authors concluded that the reconstruction of IMFs was an effective technique that reduces time and cost and enhances forecasting accuracy. Rios and De Mello  empirically proved that the acquired IMFs present various levels of stochasticity.
This was the motivation for formal research proving that the decomposed IMFs might be separated into two components, namely, stochastic and deterministic. The problem is how to divide these IMFs into stochastic and deterministic (SD) components. The IMFs were divided in  into two components based on the RP and AMI, with the conclusion that dividing the IMFs into two components enhanced forecasting accuracy. Furthermore, the authors in  claimed that the IMFs can be divided into two additive components, called stochastic and deterministic, and that researchers can work to model these components better. Aamir and Shabri  also reconstructed the IMFs into stochastic and deterministic components through AMI and concluded that the reconstructed components enhanced forecasting accuracy, taking daily WTI and Brent COPs as a sample. Furthermore, Du et al.  proposed the BPE index method based on PEEMD to solve this problem; the authors also used the reconstruction-of-IMFs procedure by implementing a T-test.
Most of the above studies have used one or two criteria for reconstructing the IMFs into components, such as HFs, LFs, trend, RP, AMI, the order of the ARIMA model, the T-test, and the average, while ignoring other data characteristics. To capture the important components of the data dynamics efficiently, a comprehensive data analysis, or some other effective technique, is strongly recommended for all the decomposed modes, together with a better procedure for forecasting the reconstructed modes. In this study, the EEMD technique is used to decompose the time series, allowing a wider class of time series to be analyzed.
To overcome the above problems and further improve forecasting efficiency, in this study we propose a novel method for using the decomposed IMFs obtained from EEMD. The innovations and contributions of this study are as follows: (i) a new forecasting approach for COPs utilizing the reconstructed IMFs is proposed, following the well-known "decomposition and ensemble" framework; (ii) to the best of our knowledge, this is the first time that IMFs are reconstructed in this format and used for time-series forecasting; (iii) a threshold value of autocorrelation for separating the IMFs into SD components is fixed based on simulations and statistical tests; (iv) extensive experiments were conducted on the publicly accessible Brent and WTI (daily and weekly) COPs, showing that the proposed approach outperforms several state-of-the-art methods for forecasting COPs; (v) we further analyze the characteristics of the stochastic component (SC), which needs more consideration and can significantly help improve the forecasting accuracy of COPs.
In the following section, the framework of the proposed study, methods, and materials are outlined in detail.
2. Materials and Methods
2.1. Data Description
The data used in this study were obtained from DataStream and can be found in the supplementary files. A total of four datasets are used to evaluate the performance of the proposed procedure. The datasets consist of the daily and weekly COPs of Brent and WTI. Each dataset has a different number of observations, which allows the correctness, robustness, and generalizability of the suggested procedure to be checked. Every time series was divided into training and testing sets: the first 80% of the total observations were used as the training set, whereas the remaining 20% were used as the testing set. The details of every time series are discussed briefly in the subsequent sections.
For the daily data, the Brent COPs series contains a total of 12,436 observations (spanning 2 February 1970 to 1 October 2017); the first 9,949 observations form the training series, and the remaining 2,487 form the testing series. The WTI COPs series contains a total of 9,060 observations (spanning 12 January 1983 to 1 October 2017); the first 7,248 observations form the training series, while the remaining 1,812 form the testing series.
For the weekly data, the Brent COPs series contains a total of 2,488 observations (spanning 2 February 1970 to 1 October 2017); the first 1,991 observations form the training series, and the remaining 497 form the testing series. The WTI COPs series contains a total of 1,812 observations (spanning 12 January 1983 to 1 October 2017); the first 1,450 observations form the training series, while the remaining 362 form the testing series.
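The training/testing sizes quoted above follow from rounding 80% of each series length up to the next whole observation. A small sketch (exact rational arithmetic is used only to avoid floating-point edge cases; the rule itself is simply "80%, rounded up"):

```python
import math
from fractions import Fraction

def train_test_sizes(n, train_frac=Fraction(4, 5)):
    """Number of training and testing observations for an n-point series,
    using the first 80% (rounded up) for training and the rest for testing."""
    n_train = math.ceil(train_frac * n)
    return n_train, n - n_train
```

Applied to the four series lengths (12,436; 9,060; 2,488; 1,812), this reproduces the split sizes reported in the text.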
2.2. Framework of the Proposed Study
Reconstruction of the IMFs obtained from EEMD is proposed in this study. First, the obtained IMFs are divided into two components, namely, stochastic and deterministic. The proposed criterion for dividing the IMFs into two components is the autocorrelation (at lag 1) of every IMF: if the autocorrelation is less than a threshold value, the IMF is considered stochastic; otherwise, it is deterministic. To obtain the two components, all stochastic IMFs are summed to form the SC and all deterministic IMFs to form the DC. The idea of reconstructing the IMFs into SD components should be explored in more detail for further improvement, because the SC requires more attention, which can help improve forecasting accuracy significantly. The IMFs forming the SC are handled separately because most of the total variation is contributed by the SC . Thus, in this study, the stochastic IMFs are modelled separately: a different model is chosen for every stochastic IMF, and all the models' forecasts are ensembled for the final forecast, whereas all deterministic IMFs are treated as one component. The ARIMA and ANN models are fitted to the stochastic IMFs and the DC, and the final output is the sum of all these components; the proposed models are therefore denoted EEMD-R-ARIMA and EEMD-R-ANN. For the ANN, the AR terms of the ARIMA model are used as inputs. The complete framework of the proposed method is demonstrated in Figure 1.
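The classification step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; in particular, the threshold of 0.5 here is purely illustrative, not the value fixed later by the simulations:

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation of a series at lag 1."""
    x = np.asarray(x, dtype=float)
    xm = x - x.mean()
    return np.sum(xm[1:] * xm[:-1]) / np.sum(xm ** 2)

def reconstruct(imfs, threshold=0.5):
    """Split IMFs into a stochastic component (SC) and a deterministic
    component (DC) by comparing each IMF's lag-1 autocorrelation with
    the threshold; stochastic IMFs are also returned individually,
    since they are modelled separately."""
    n = len(imfs[0])
    sc, dc = np.zeros(n), np.zeros(n)
    stochastic_imfs = []
    for imf in imfs:
        if lag1_autocorr(imf) < threshold:
            sc += imf
            stochastic_imfs.append(imf)
        else:
            dc += imf
    return sc, dc, stochastic_imfs
```

By construction, SC + DC recovers the sum of all IMFs, so no information is lost in the reconstruction.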
2.3. Experimental Study
As discussed, the main aim of this study is to use the minimum number of IMFs while achieving higher forecasting accuracy. To use a minimum number of IMFs, the reconstruction of IMFs is proposed. Thus, all IMFs are first obtained from EEMD and divided into two components, namely, stochastic and deterministic. However, a procedure, or a threshold value, is required to decide whether a given IMF is stochastic or deterministic. The proposed criterion is the autocorrelation of the IMFs, from which it is decided whether a given IMF has stochastic or deterministic influences. Based on four different threshold values of the autocorrelation, the obtained IMFs are divided into two components. The SC consists of all those IMFs whose autocorrelation is less than the threshold value; otherwise, the IMF belongs to the DC. To obtain the best threshold value for the division of IMFs into SD components, synthetic time series characterized by additive noise are used, organized into four different scenarios:
(1) A time series composed of a sine function and normal noise, where the sine function represents the DC and the normal noise the SC.
(2) A time series composed of a sine function and an ARMA model, where the sine function represents the DC and the ARMA model the SC, with an error of 0.25.
(3) A time series composed of a sine function, an ARMA model, and normal noise, where the sine function represents the DC, while the ARMA model and normal noise represent the SC.
(4) A time series composed of a Lorenz system and normal noise, where the Lorenz system represents the DC and the normal noise the SC.
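The exact parameters of the scenarios are given in the paper's equations (not reproduced in this extract). In their spirit, the following sketch generates series of the form of scenarios (1) and (2); the period, ARMA coefficients, and sample size are illustrative choices, not the paper's values:

```python
import numpy as np

def scenario_1(n=1000, seed=0):
    """Sine (deterministic) plus Gaussian white noise (stochastic)."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    dc = np.sin(2 * np.pi * t / 100.0)   # deterministic component
    sc = rng.standard_normal(n)          # stochastic component
    return dc + sc, dc, sc

def scenario_2(n=1000, phi=0.5, theta=0.3, sigma=0.25, seed=0):
    """Sine (deterministic) plus an ARMA(1,1) process with error sd 0.25."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    dc = np.sin(2 * np.pi * t / 100.0)
    e = rng.normal(scale=sigma, size=n)
    sc = np.zeros(n)
    for i in range(1, n):                # ARMA(1,1) recursion
        sc[i] = phi * sc[i - 1] + e[i] + theta * e[i - 1]
    return dc + sc, dc, sc
```

Because the true DC and SC are known by construction, such synthetic series allow the reconstruction quality to be measured directly.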
Therefore, to divide the IMFs into SD components, autocorrelation is used in this study. IMFs with stochastic influences become part of the SC, while those with deterministic influences become part of the DC. Adding both SD components recovers the original series. The following formula is used to compute the lag-1 autocorrelation of each decomposed IMF:

r_1 = Σ_{t=2}^{n} (x_t − x̄)(x_{t−1} − x̄) / Σ_{t=1}^{n} (x_t − x̄)²,

where n is the number of observations, x_t is the IMF value at time t, and x̄ is its mean.
The problem is how to decide whether a given IMF has a stochastic or a deterministic influence. To settle this, an experimental study is performed to obtain a threshold value, or rather a confidence interval, with which the IMFs can be separated according to their influences. Once the IMFs are separated according to their influences, the next step is to obtain the SD components by summing the respective IMFs. A correlation test is then performed to check whether the separated components are independent. The following test statistic is used with the alternative hypothesis H_1: ρ ≠ 0:

t = r √(n − 2) / √(1 − r²),

where n denotes the total number of observations and r represents the correlation between the SD components, defined as the Pearson correlation coefficient of the two component series.
To double-check the distribution of these two components, another assessment is conducted which tests the difference between the correlation produced by the synthetic-data SD components and that of the SD components reconstructed from the IMFs. The following test statistic is used with the alternative hypothesis that the difference between the two correlations is not zero, i.e., H_1: ρ_s ≠ ρ_d:

Z = (z_s − z_d) / √(1/(n_s − 3) + 1/(n_d − 3)),

where z_s = ½ ln((1 + r_s)/(1 − r_s)), z_d = ½ ln((1 + r_d)/(1 − r_d)), "s" represents the synthetic data, and "d" is used for the decomposed data. The test is performed for the different threshold values of the autocorrelation that divide the IMFs into the two (SD) components. The best threshold value is then chosen and implemented for the real-life datasets.
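A minimal sketch of the two tests, assuming the standard t test for a single correlation and Fisher's z transform for comparing two correlations (the usual constructions for these hypotheses):

```python
import math

def corr(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_stat_corr(r, n):
    """t statistic for H0: rho = 0 against H1: rho != 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def fisher_z_diff(r_s, n_s, r_d, n_d):
    """Z statistic for the difference of two correlations via Fisher's z."""
    z = lambda r: 0.5 * math.log((1 + r) / (1 - r))
    se = math.sqrt(1 / (n_s - 3) + 1 / (n_d - 3))
    return (z(r_s) - z(r_d)) / se
```

Large |t| rejects independence of the SD components; a small |Z| indicates the reconstructed components correlate like the synthetic ones.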
The above four scenarios are used to check how well the proposed method extracts the SD components. The purpose of using different series is to check the generalizability and robustness of the proposed method. One advantage of using synthetic data is that it provides a controlled scenario and thus better supports the conclusions.
2.4. Methods Used in This Study
2.4.1. The ARIMA Box–Jenkins Approach
The inclusion of both AR and MA terms makes the Box–Jenkins models exceptionally flexible. The interdependencies in a time series (x_t) are measured by the AR terms, while the dependencies on preceding error terms are measured by the MA terms [30, 31]. An ARMA model of order (p, q) for a univariate series has the following form:

x_t = φ_0 + φ_1 x_{t−1} + ⋯ + φ_p x_{t−p} + ε_t − θ_1 ε_{t−1} − ⋯ − θ_q ε_{t−q},

where p and q represent the number of AR and MA terms, respectively, φ_0, φ_1, …, φ_p represent the constant and the AR coefficients, whereas θ_1, …, θ_q are used for the MA terms. The white noise process is denoted by ε_t. The Box–Jenkins procedure requires stationary data, which can be achieved by differencing the time series several times (usually at most twice). The ACF and PACF plots can be used to select the appropriate orders of the polynomials by comparing them with the theoretical white-noise bounds of ±1.96/√n. Rival ARIMA models can also be selected based on the AIC  and BIC . Next comes diagnostic checking that the fitted model is adequate, using the LB test . After obtaining the appropriate ARIMA model, the last step is forecasting the series.
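To make the recursion concrete, the following sketch simulates an ARMA(1,1) process under the sign convention above and computes a sample autocorrelation; the coefficients are illustrative, not estimates from the paper's data:

```python
import numpy as np

def simulate_arma(phi, theta, n=5000, c=0.0, seed=0):
    """Simulate x_t = c + phi*x_{t-1} + e_t - theta*e_{t-1} with N(0,1) errors."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = c + phi * x[t - 1] + e[t] - theta * e[t - 1]
    return x

def sample_acf(x, lag):
    """Sample autocorrelation of a series at the given lag."""
    xm = x - x.mean()
    return np.sum(xm[lag:] * xm[:-lag]) / np.sum(xm ** 2)
```

For ARMA(1,1) with phi = 0.8 and theta = 0.3, the theoretical lag-1 autocorrelation is (φ − θ)(1 − φθ)/(1 + θ² − 2φθ) ≈ 0.62, which the sample ACF of a long simulated path should approximate.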
2.4.2. Artificial Neural Network
ANNs have many distinguishing characteristics compared with ARIMA models and are a very successful alternative to linear models for handling nonlinearity when forecasting time series. One of the best characteristics of an ANN is the universal approximation of any nonlinear function to any desired degree of accuracy [36, 37]. The most commonly used ANN for time-series forecasting applications is the FNN with a single hidden layer and one output node [38, 39]. Such an ANN model has the following form:

x_t = α_0 + Σ_{j=1}^{q} α_j g(β_{0j} + Σ_{i=1}^{p} β_{ij} x_{t−i}) + ε_t,

where α_j and β_{ij} denote the weights, α_0 and β_{0j} represent the bias terms, ε_t is the white noise, x_{t−i} represents the input patterns, x_t represents the single output of the model, p is the number of input variables, q is the number of hidden nodes, and g is the transfer function of the hidden layer. This equation performs a nonlinear mapping from the past observations (x_{t−1}, …, x_{t−p}) to the future value x_t, i.e.,

x_t = f(x_{t−1}, …, x_{t−p}, w) + ε_t,

where f is a function determined by the neural network training and w is a vector representing all parameters. Therefore, from this viewpoint, the FNN model is equivalent to an NAR model [40, 41]. The ANN effectively models nonlinear time series, but for linear problems it has provided mixed results .
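The forward pass of this single-hidden-layer NAR network is short enough to write out directly. The weights below are placeholders for whatever the training algorithm would produce, so this is only a sketch of the mapping, not a trained model:

```python
import numpy as np

def tansig(z):
    """Hyperbolic tangent transfer function (the tan-sigmoid g above)."""
    return np.tanh(z)

def fnn_forecast(lags, W_in, b_in, w_out, b_out):
    """One-step forecast from p lagged values through q hidden nodes.

    lags  : shape (p,)   -- past observations x_{t-1}, ..., x_{t-p}
    W_in  : shape (q, p) -- input-to-hidden weights  beta_ij
    b_in  : shape (q,)   -- hidden biases            beta_0j
    w_out : shape (q,)   -- hidden-to-output weights alpha_j
    b_out : float        -- output bias              alpha_0
    """
    hidden = tansig(W_in @ lags + b_in)
    return b_out + w_out @ hidden
```

With all weights at zero the output collapses to the bias alpha_0, which is a quick sanity check on the wiring.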
2.4.3. Empirical Mode Decomposition
New opportunities to extract time-varying components from data were provided in  by the development of EMD, a temporally local and adaptive method. Temporal locality is one of the most important qualities of EMD; it is acquired from the spline fitting through the maxima/minima of the input data. The spline fitting is highly local in time, which can easily be confirmed numerically. The temporal locality of EMD is well preserved if the number of siftings is small and fixed, whereas excessive sifting degrades both the locality and the oscillatory components. One benefit of the EMD procedure is that it automatically bypasses the stationarity assumption on the data. Temporal locality is a necessary condition for the physical interpretability of the results, but not a sufficient one. EMD proceeds by successively extracting oscillatory components of decreasing frequency, and noise-contaminated results at one level can propagate distortion into the subsequent oscillatory components, making the resulting decomposition hardly interpretable physically. Hence, owing to this lack of robustness, the plain EMD procedure can become inadequate.
Several attempts have been made to overcome this robustness problem. For example, Huang et al.  proposed the discontinuity test, in which the variation is controlled by one element; because of the controlled scenario, it reduces the adaptiveness of EMD, calling for more satisfactory methodologies. To solve the robustness problem, a noise-assisted data analysis procedure, namely, EEMD, was introduced in , which addresses exactly the issue raised for EMD. The steps involved in EEMD are as follows:
(1) A realization of white noise is added to the true series.
(2) The noise-added series is decomposed into IMFs.
(3) Steps (1) and (2) are repeated, adding a different white noise realization to the true series each time.
(4) The ensemble means of the respective IMFs are taken as the final outcome.
The EEMD steps above exploit the following properties of white noise: its minima and maxima are distributed uniformly in time, and EMD acts effectively as a dyadic filter bank for white noise [44–46]. With the added white noise, the distribution of minima and maxima over all time spans becomes relatively uniform. Another property of EEMD is the control over the oscillatory components: the added white noise works like a dyadic filter bank, which significantly reduces the mode-mixing problem. As an outcome, the decomposition becomes more stable and physically meaningful.
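The ensemble-averaging loop of steps (1)–(4) can be sketched as follows. Here `simple_decompose` is an explicit stand-in (a two-band moving-average split), not a real EMD sifting routine, so that the skeleton stays self-contained; in practice the inner call would be an EMD implementation:

```python
import numpy as np

def simple_decompose(x, window=11):
    """Placeholder for EMD: split a series into a fast band and a slow band
    using a centred moving average. A real implementation would sift IMFs."""
    kernel = np.ones(window) / window
    slow = np.convolve(x, kernel, mode="same")
    fast = x - slow
    return np.vstack([fast, slow])

def eemd_like(x, n_ensemble=100, noise_amp=0.2, seed=0):
    """Noise-assisted ensemble: add white noise of amplitude noise_amp times
    the data's standard deviation, decompose, and average the modes."""
    rng = np.random.default_rng(seed)
    amp = noise_amp * np.std(x)
    acc = np.zeros((2, len(x)))
    for _ in range(n_ensemble):
        noisy = x + rng.normal(scale=amp, size=len(x))
        acc += simple_decompose(noisy)
    return acc / n_ensemble          # ensemble means of the modes
```

Averaging over the ensemble cancels most of the added noise, so the summed modes remain close to the original series while each mode is stabilized.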
2.4.4. EEMD-ARIMA and ANN Model
Wang et al.  summarized the hybrid EEMD-ARIMA model in the following steps:
(1) First, the true series is decomposed into different IMFs.
(2) Second, for every IMF, the best ARIMA model is selected and a forecast is made accordingly.
(3) In the last step, the forecasts of all the IMFs are summed to obtain the forecast of the targeted series.
The EEMD-ANN model is fitted in the same way. Furthermore, for every ANN model, the delay parameter, i.e., the initial input structure, is given by the AR terms of the ARIMA model of the corresponding IMF.
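Putting the steps together, a decomposition-ensemble forecast reduces to forecasting each mode and summing the results. In this sketch, `forecast_mode` is a deliberately naive persistence forecaster standing in for the fitted ARIMA or ANN of each IMF:

```python
import numpy as np

def forecast_mode(mode, horizon):
    """Naive persistence forecast for one mode (placeholder for ARIMA/ANN)."""
    return np.full(horizon, mode[-1])

def ensemble_forecast(imfs, horizon):
    """Forecast every IMF separately, then sum the forecasts (ensemble step)."""
    return np.sum([forecast_mode(m, horizon) for m in imfs], axis=0)
```

Swapping `forecast_mode` for a per-mode model selector is all that separates this skeleton from the full EEMD-ARIMA/EEMD-ANN pipeline.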
2.5. Forecast Evaluation Criterions
Forecasting accuracy is used to measure the degree of exactness of the proposed and state-of-the-art methods. When there are competing models, forecasting accuracy measures are the most important criteria. In this study, four typical forecasting accuracy measures are used: MAE, RMSE, MAPE, and DS. These measures are defined as follows:
(1) MAE = (1/F) Σ_{t=1}^{F} |y_t − ŷ_t|
(2) RMSE = √((1/F) Σ_{t=1}^{F} (y_t − ŷ_t)²)
(3) MAPE = (100/F) Σ_{t=1}^{F} |(y_t − ŷ_t)/y_t|
where F represents the total number of forecasts in the testing set, the original value of the given time series at time t is denoted by y_t, and the forecasted value is denoted by ŷ_t. Four different datasets are used in this study for validation and justification purposes. The next measure is directional prediction, which is more appropriate for business purposes, since investors are concerned with the market trend. Thus, DS is used to measure the directional forecasting accuracy of the competing models and is defined as follows:
(4) DS = (100/F) Σ_{t=1}^{F} a_t, where a_t = 1 if (y_t − y_{t−1})(ŷ_t − y_{t−1}) ≥ 0, else a_t = 0, and F corresponds to the total number of forecasts in the testing set.
Clearly, the lower the values of MAPE, RMSE, and MAE, the higher the forecast accuracy; in contrast, higher values of DS indicate more accurate directional forecasts. Next, to compare the prediction errors of two models, we employ the DM test, defined as follows:
(5) DM test
It is used to test the statistical significance of the difference in forecast accuracy between two forecasting methods:

DM = d̄ / √((γ_0 + 2 Σ_{k≥1} γ_k)/F),

where d_t = |e_{1t}| − |e_{2t}| is the difference of the absolute errors of the two methods, d̄ is the mean of d_t, and γ_k is the autocovariance of d_t at lag k. Under the null hypothesis, DM ∼ N(0, 1); if the p value is greater than α, we conclude that there is no significant difference between the two forecasts.
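The four measures and the DM statistic can be sketched directly from the definitions above; the DM version here uses the absolute-error loss and, with h = 1, its variance reduces to γ₀/F:

```python
import numpy as np

def mae(y, f):
    return np.mean(np.abs(y - f))

def rmse(y, f):
    return np.sqrt(np.mean((y - f) ** 2))

def mape(y, f):
    return 100.0 * np.mean(np.abs((y - f) / y))

def ds(y, f):
    """Directional statistic: percentage of correctly predicted moves."""
    move_true = y[1:] - y[:-1]
    move_pred = f[1:] - y[:-1]
    return 100.0 * np.mean(move_true * move_pred >= 0)

def dm_test(e1, e2, h=1):
    """Diebold-Mariano statistic on absolute-error loss differentials."""
    d = np.abs(e1) - np.abs(e2)
    n = len(d)
    dbar = d.mean()
    dd = d - dbar
    gamma = [np.sum(dd[k:] * dd[:n - k]) / n for k in range(h)]
    var = (gamma[0] + 2.0 * sum(gamma[1:])) / n
    return dbar / np.sqrt(var)
```

A large positive DM value indicates the second method's absolute errors are systematically smaller than the first's.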
3.1. ARIMA Model
Data stationarity is one of the assumptions required when using the ARMA model. In most cases, real time-series data are not stationary because of a trend factor, which is where the ARIMA model becomes applicable. To obtain a stationary time series, we take successive differences of the original series [48, 49]. The ADF test is used to check the stationarity of a time series . When using the ARIMA model, after obtaining a stationary time series, the appropriate model is first identified by selecting the orders of the AR and MA terms. ACF and PACF plots are used to select the best orders of the AR and MA terms. Other, more formal approaches for obtaining the orders of the AR and MA terms are the AIC and BIC. For model adequacy checking, the LB test is used. In the last step, we forecast the given time series. In this section, the ARIMA model is fitted on all training periods and the testing periods are forecast. Table 1 presents the orders of the best ARIMA models for all datasets.
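A sketch of the identification step: after differencing, flag the lags whose sample ACF exceeds the ±1.96/√n white-noise bounds. This is only a heuristic screen, standing in for the full ADF/AIC/BIC machinery used in the paper:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelations for lags 1..max_lag."""
    xm = x - x.mean()
    denom = np.sum(xm ** 2)
    return np.array([np.sum(xm[k:] * xm[:-k]) / denom
                     for k in range(1, max_lag + 1)])

def significant_lags(x, max_lag=20):
    """Lags whose sample ACF lies outside the +-1.96/sqrt(n) bounds."""
    bound = 1.96 / np.sqrt(len(x))
    r = acf(x, max_lag)
    return [k + 1 for k, v in enumerate(r) if abs(v) > bound]
```

For white noise almost no lags should be flagged (about 1 in 20 by chance at the 5% level), while for an AR(1) series lag 1 should clearly exceed the bound.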
The fitted ARIMA models are presented in Table 2, where the estimated values of their respective parameters are given with the standard errors in parentheses.
The next step after estimating the parameters is to extract the fitted-model residuals and test whether they are uncorrelated. Model diagnostic tests are performed for all four datasets to confirm that the fitted models are adequate and can be used for forecasting.
From Figure 2, the LB test reveals that all the fitted models are adequate and can be used for forecasting. The third panel of Figure 2, showing the LB statistic, indicates that all p values for all datasets are greater than 0.05, which means that the hypothesis of no serial correlation among the fitted-model residuals is not rejected. Thus, all these models are adequate and provide the best available forecasts. Finally, the forecasting accuracies of these models are presented in Tables 3 and 4 for the daily and weekly datasets, respectively.
3.2. ANN Model
The input structure of the ANN depends on the order of the AR terms of the best chosen ARIMA model. To fix the input parameters of the ANN, the orders of the AR terms provided in Table 1 are used. The ANN structure is constructed under the following settings:
(1) Feedforward backpropagation is used as the network type.
(2) The ANN inputs for the WTI weekly data, and likewise for Brent, are the lagged values given by the respective AR orders. These inputs are automatically created by MATLAB ntstool once the training and target time series are imported into the ANN structure.
(3) The Levenberg-Marquardt training algorithm is used as the training function in this study.
(4) The weights are initialized randomly, following the strategies of the ANN toolboxes.
(5) The maximum number of neurons in the hidden layer equals 2n + 1, where n is the number of inputs . The maximum number of hidden neurons for the WTI weekly data is 11 and for Brent is 5.
(6) Crude oil prices are inherently nonlinear, so to capture the nonlinearity in the hidden layer, a nonlinear transfer function is chosen, such as tan-sigmoid or log-sigmoid. In this study, the NAR problem is selected from the ntstool.
The complete structure of the ANN is presented in Figure 3 for both the daily and weekly datasets. After completing the ANN construction, the next step is the training and testing processes to obtain the fitted and forecast series. The training and testing processes only require importing the input and target variables for the training and testing periods, respectively.
3.3. Data Decomposition
When using EEMD, two parameters should be fixed in advance, the white-noise amplitude and the number of ensembles, because they affect the decomposition process. Wu and Huang  described the rule for these parameters in the form of the equation ε_n = ε/√N, where N is the ensemble number, ε_n represents the standard deviation of the error, and ε is the amplitude of the added white noise. One important point to keep in mind when adding white noise to the original time series is that its amplitude should be neither too large nor too small: if too large, EEMD will return some redundant IMFs; on the contrary, if too small, the EEMD IMFs remain the same as those of EMD. The authors [43, 52] suggested setting the amplitude of the added white noise equal to 20% of the standard deviation of the data. More details about the white-noise amplitude and the size of the ensemble number can be found in the study of . In this study, the EEMD technique is used with the ensemble number and the white-noise amplitude set to 100 and 0.2 times the standard deviation, respectively.
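The relation ε_n = ε/√N can be checked numerically: averaging N independent white-noise realizations leaves a residual whose standard deviation shrinks as ε/√N. A small sketch (series length and seed are arbitrary):

```python
import numpy as np

def residual_noise_sd(eps=0.2, n_ensemble=100, length=10000, seed=0):
    """Empirical sd of the pointwise mean of n_ensemble white-noise series
    of amplitude eps -- the noise left after EEMD's ensemble averaging."""
    rng = np.random.default_rng(seed)
    noise_mean = rng.normal(scale=eps, size=(n_ensemble, length)).mean(axis=0)
    return noise_mean.std()
```

With eps = 0.2 and N = 100 the residual noise is about 0.2/√100 = 0.02; quadrupling N to 400 halves it to about 0.01, matching the 1/√N rule.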
The EEMD technique is applied to all datasets using the above settings. In this study, the R package “Rlibeemd” is used for the data decomposition. All the decomposed IMFs are presented in Figure 4 for all four datasets.
The total number of IMFs, including the residual, is 13 for both daily Brent and WTI, while for the weekly data it is 11 for Brent and 10 for WTI. From Figures 4(a)–4(d), IMF.1, IMF.2, IMF.3, and IMF.4 have no specific trend, and their observations are distributed almost randomly. After IMF.4, however, the IMFs follow a trend and move slowly around the long-term mean, whereas the last IMF, called the residual, represents the overall trend of the series. The overall decomposition provided by EEMD is physically meaningful: all IMFs are independent, and each IMF is consistent with the previous IMF. Therefore, the decomposition obtained from EEMD is very helpful because it transforms a nonstationary and nonlinear series into stationary series, which improves the forecasting accuracy of the models.
3.4. Simulations for Threshold Value Determination
This study aims to determine a threshold value for dividing the IMFs into SD components. After obtaining the decomposed IMFs for all synthetic datasets, the autocorrelation is computed for every IMF. The autocorrelation shows how much the current observation depends on its past observations: a value close to zero means the current value has little dependence on its past values, whereas a value close to one means the current value is strongly influenced by them. The problem is how to fix a point below which the considered IMF can be called stochastic. To solve this issue, different values of the autocorrelation are considered for dividing the IMFs into two components. First, the IMFs are separated into two components using four selected threshold values. The SC consists of all IMFs whose autocorrelation is less than the given threshold value; the remaining IMFs are included in the DC. Initially, a larger set of autocorrelation values was selected and tested for the reconstruction of IMFs, but the computations provided almost the same results; thus, to avoid redundant analysis, only four values were retained. In this section, only the synthetic data are used because they provide a controlled scenario and better support the conclusions. The number of IMFs belonging to the SD components is presented in Table 5 for all scenarios.
After dividing the IMFs into two components using the above threshold values, the next step is to test whether the two obtained components are independently distributed. The t-test is used to test the hypothesis of zero correlation between the SD components. The best threshold value is the one for which the null hypothesis is not rejected for all scenarios and all sample sizes. Table 6 presents the p values confirming that both SD components are independently distributed.
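The zero-correlation t-test used here can be sketched as follows; assuming the standard test for H0: ρ = 0, with the function name being our own.

```python
import math

def zero_correlation_t(x, y):
    """Pearson correlation between two components and the t statistic for
    H0: rho = 0 (t has n - 2 degrees of freedom under the null)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return r, t
```

The null hypothesis of independence is not rejected when |t| is below the t critical value with n − 2 degrees of freedom at the chosen significance level.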
To double-check the threshold values, another test is performed: the difference-of-correlations test. This test confirms that the difference between the correlation among the original components and the correlation among the components obtained from the reconstruction of IMFs is equal to zero. Only those threshold values of the autocorrelation are selected for which the null hypothesis of no difference is not rejected for all scenarios and all samples.
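A standard way to test equality of two correlations from independent samples is Fisher's z transform; the sketch below assumes this form of the difference-of-correlations test, and the function names are our own.

```python
import math

def fisher_z(r):
    """Fisher's variance-stabilizing transform of a correlation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def correlation_difference_z(r1, n1, r2, n2):
    """z statistic for H0: rho1 = rho2 between two independent samples;
    compare |z| with the normal critical value (1.96 at the 5% level)."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se
```

The null hypothesis of no difference is not rejected when |z| falls below the normal critical value, which is the selection rule applied to the candidate thresholds.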
From Tables 6 and 7, scenarios I and IV recommend one set of threshold values for dividing the IMFs into two components, scenario III recommends another, and scenario II recommends a single value. Across all scenarios, only one value of the autocorrelation significantly divides the IMFs into two components, and this value is therefore recommended. The summary of the threshold-value determination is presented in Table 8.
Only one value is recommended at both levels of significance. Thus, this threshold value of the autocorrelation is recommended and can be used for the division of the IMFs into SD components. In Section 3.5, the real-world applications are tested using the recommended threshold value.
3.5. IMFs Reconstruction
In this study, the reconstruction of IMFs is proposed based on the autocorrelation of every kth IMF obtained from EEMD. The autocorrelation of each IMF is compared with the recommended threshold value. An IMF is considered stochastic if its autocorrelation is less than the threshold value; otherwise, it is considered deterministic. For comparison, the IMFs are also reconstructed through AMI, which is simply a visual inspection of the plots obtained for all IMFs of each dataset. The detailed analysis of every dataset follows in the subsequent sections.
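The proposed split can be sketched directly: compute the lag-1 autocorrelation of each IMF and sum the IMFs on either side of the threshold. The value 0.5 below is a placeholder standing in for the threshold determined in Section 3.4, and the function names are our own.

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a mean-removed series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float((x[:-1] * x[1:]).sum() / (x * x).sum())

def reconstruct(imfs, threshold):
    """Sum IMFs with lag-1 autocorrelation below `threshold` into the
    stochastic component (SC); the rest form the deterministic one (DC)."""
    sc = np.zeros_like(np.asarray(imfs[0], dtype=float))
    dc = np.zeros_like(sc)
    for imf in imfs:
        if lag1_autocorr(imf) < threshold:
            sc += imf
        else:
            dc += imf
    return sc, dc
```

Because the rule is numeric rather than visual, any researcher applying it to the same IMFs obtains the same SC/DC split, which is the advantage over the AMI-based inspection.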
From Section 3.3, the total number of IMFs for the daily datasets is 12 plus the residue. For simplicity, the residue term is treated as an IMF, so the total number of IMFs for both daily datasets is 13. For the weekly data, the total number of IMFs is 11 for Brent and 10 for WTI. The autocorrelation of all IMFs is computed and presented in Table 9.
It is observed from Table 9 that the autocorrelations of IMF.1, IMF.2, and IMF.3 are less than the threshold for all datasets. So the first three IMFs are considered stochastic in all datasets, while the rest are considered deterministic. To form the SC, all stochastic IMFs are summed, and to form the DC, all deterministic IMFs are summed. Next, to obtain the SD components from the AMI method for the Brent daily data, their plots are presented in Figure 5.
From Figure 5, it is observed that the AMI plots show the same pattern after the 6th IMF, so the first six IMFs are considered stochastic and the remainder deterministic. To form the two components, the first six IMFs are summed for the SC and IMFs seven through thirteen for the DC. Next, to obtain the SD components of the WTI daily data, the plots of the AMI for all IMFs are presented in Figure 6.
From Figure 6, the AMI plots likewise show the same pattern after the 6th IMF, so the first six IMFs are considered stochastic and the remainder deterministic; the first six IMFs are summed for the SC and IMFs seven through thirteen for the DC. Next, to obtain the SD components of the Brent weekly data, the plots of the AMI for all IMFs are presented in Figure 7.
From Figure 7, the AMI plots again show the same pattern after the 6th IMF, so the first six IMFs are summed for the SC and IMFs seven through eleven for the DC. Next, to obtain the SD components of the WTI weekly data, the plots of the AMI for all IMFs are presented in Figure 8.
From Figure 8, the AMI plots also show the same pattern after the 6th IMF, so the first six IMFs are summed for the SC and IMFs seven through ten for the DC. The next step is the correlation test verifying that the SD components obtained from the reconstruction of IMFs are independent.
From Table 10, it is observed that, for all reconstructed components, the hypothesis of zero correlation is not rejected, meaning that the division into SD components is independent at the 5% level of significance. Moreover, there is a theorem put forward by Takens, known as the embedding theorem, stating that a time series can be reconstructed as vectors of m values (m being the embedding dimension), each value corresponding to an observation spaced at intervals of a time delay (or separation dimension) called τ. As numerous studies have shown, these vectors represent the interdependent relationships between observations, and since they increase the accuracy of the modelling, they also enhance the accuracy of the prediction. The plots of the AMI of the SD components are presented in Figure 9, showing the interdependency among the SD components.
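The AMI curves inspected in these figures can be estimated with a simple histogram-based mutual information at each lag; this is one common construction (assumed here, with our own function name), and the first local minimum over lags is the usual choice of the Takens delay τ.

```python
import numpy as np

def average_mutual_information(x, lag, bins=16):
    """Histogram estimate of the mutual information between x(t) and
    x(t + lag). In Takens-style embedding, the first local minimum of
    this curve over lags is a common choice of the delay tau."""
    a, b = x[:-lag], x[lag:]
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = pxy / pxy.sum()                       # joint distribution
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)   # marginals
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / np.outer(px, py)[mask])).sum())
```

A strongly dependent (deterministic) component yields a high AMI at small lags, while a stochastic component yields an AMI near zero, which is what the visual inspection of Figures 5–9 exploits.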
From Figures 9(a)–9(d), it is observed that the division of the IMFs into SD components is independently distributed. After obtaining the two components, the next step is selecting an appropriate model for every component using the same steps discussed above. Every model is selected based on the LB test, for which the p values are greater than 0.05 for both the daily and weekly datasets, all IMFs, and all components, so that all the fitted models are adequate and can be used for forecasting. Next, all the selected models are used for forecasting, with the last 20% of every series retained as the testing set to be forecast. For the ARIMA models, R software is used for forecasting; for the ANN, the data are transferred to MATLAB, and ntstool is used to forecast every component. After forecasting the series, the accuracy measures are calculated and presented in Tables 3 and 4 for the daily and weekly series.
The performance of the proposed model EEMD-R-ANN is evaluated using four quantitative measures: RMSE, MAPE, MAE, and DS. The results of all these measures for all models are presented in Table 3 for both markets of the daily COPs. The benchmark models include the well-known single ARIMA and ANN models; the hybrid models EEMD-ARIMA and EEMD-ANN; the hybrid models using only the SD components obtained from the proposed method of reconstruction, EEMD-SD-ANN and EEMD-SD-ARIMA; the models using AMI for the reconstruction of IMFs, EEMD-AMI-ARIMA and EEMD-AMI-ANN; and, lastly, the proposed method of reconstruction (R) of IMFs based on autocorrelation, denoted EEMD-R-ARIMA and EEMD-R-ANN.
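The four accuracy measures can be written out explicitly. The DS definition below is one common form (change agreement between forecast and actual, with ties counted as correct); the paper's exact convention may differ slightly, and the function names are our own.

```python
import numpy as np

def rmse(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

def mae(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs(actual - forecast)))

def mape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

def ds(actual, forecast):
    """Directional statistic: percentage of periods in which the forecast
    moves in the same direction as the actual series."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    same = (actual[1:] - actual[:-1]) * (forecast[1:] - actual[:-1]) >= 0
    return float(same.mean() * 100)
```

RMSE, MAE, and MAPE measure the size of the forecast errors (MAPE in percent of the actual level), while DS measures how often the direction of movement is predicted correctly, which is the quantity policymakers and investors care about most.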
The RMSE is computed for all models included in this study, shown in Tables 3 and 4 for both markets and all COP datasets, and plotted in Figures 10 and 11 for a clearer picture. Among all models, EEMD-R-ANN and EEMD-R-ARIMA achieved almost the same, and the best (lowest), values for both COP series. The reconstructed hybrid models EEMD-AMI-ARIMA and EEMD-AMI-ANN achieved nearly similar but higher values; EEMD-SD-ANN, EEMD-SD-ARIMA, and the hybrid models EEMD-ANN and EEMD-ARIMA achieved low values, though still higher than those of EEMD-R-ANN. The single models ARIMA and ANN achieved the worst (highest) values among all considered models. Therefore, from the RMSE perspective, the proposed model EEMD-R-ANN confirmed its usefulness.
As another measure of forecasting performance, the MAE values of all the models on the WTI and Brent (daily and weekly) COPs are shown in Tables 3 and 4 and plotted in Figures 10 and 11 for a clearer picture. It can be seen that EEMD-R-ANN outperformed all other models, and the interpretation does not differ from that of the RMSE for both markets and all datasets. Hence, from the MAE viewpoint, the proposed model EEMD-R-ANN confirmed its usefulness by attaining the lowest values.
The MAPE is a further measure of forecasting performance. The MAPE values of all models on both markets, WTI and Brent (daily and weekly), are presented in Tables 3 and 4 and plotted in Figures 10 and 11 for a clearer picture. From Tables 3 and 4, the model EEMD-R-ANN attained the lowest values for both markets and all datasets (daily and weekly). The three models EEMD-ANN, EEMD-ARIMA, and EEMD-R-ARIMA attained nearly similar values, whereas the two models EEMD-SD-ANN and EEMD-SD-ARIMA attained nearly similar values but higher than the preceding three. The models ANN, ARIMA, EEMD-AMI-ANN, and EEMD-AMI-ARIMA achieved nearly similar values, but these were the highest, ranking worst among all models. Therefore, from the MAPE perspective, the proposed model EEMD-R-ANN confirmed its usefulness by attaining the lowest values.
The model EEMD-R-ANN produced the lowest MAPE values for the Brent and WTI daily COPs, 0.542% and 0.572%, respectively, and attained the category of highly accurate forecasts. The MAPE values for the Brent and WTI daily COPs of the three models EEMD-R-ARIMA, EEMD-ANN, and EEMD-ARIMA were in the range 0.559%–0.697%, achieving the same category as the model EEMD-R-ANN but with higher values. The MAPE values for the models EEMD-SD-ARIMA, EEMD-SD-ANN, EEMD-AMI-ARIMA, EEMD-AMI-ANN, ARIMA, and ANN were in the range 1.018%–1.489% for both COPs and fall under the category of good forecasts. Thus, from the MAPE perspective, the best model for obtaining highly accurate forecasts of the daily COPs of both markets, i.e., Brent and WTI, is EEMD-R-ANN.
For the weekly data, the model EEMD-R-ANN produced the lowest MAPE values for the Brent and WTI COPs, 0.933% and 0.936%, respectively, and attained the class of highly accurate forecasts. The MAPE values for the Brent and WTI weekly COPs of the model EEMD-R-ARIMA were 0.936% and 0.952%, achieving the same category as the EEMD-R-ANN model but with higher values. The MAPE values for the models ANN, ARIMA, EEMD-SD-ARIMA, EEMD-SD-ANN, EEMD-AMI-ARIMA, EEMD-AMI-ANN, EEMD-ARIMA, and EEMD-ANN were in the range 1.018%–1.489% for both COPs and fall under the class of good forecasts. Thus, from the MAPE perspective, the best model for obtaining highly accurate forecasts of the weekly COPs of both markets, i.e., Brent and WTI, is EEMD-R-ANN.
The next forecast accuracy measure concerns directional forecasts, which are especially important for policymakers and investors since they watch the market trend, i.e., whether the COPs are going up or down. The DS values are presented in Tables 3 and 4 for the daily and weekly data, respectively, and plotted in Figure 12 for both markets. The DS values of the ANN and ARIMA models for both the daily and weekly COPs of both markets are in the range 51.29%–55.55%, almost equal to a random guess. Hence, it is very challenging for single models to achieve directional forecasts at a reasonable level, and the reason may be the complex and compound nature of the COPs. Next, the decomposition and ensemble models that used all IMFs obtained from EEMD attained DS values in the range 80.17%–83.38% for both the daily and weekly COPs. Thus, decomposition and ensemble models achieved higher forecasting accuracy and have a good capability of producing more accurate directional forecasts in comparison with single models. However, these models are very time-consuming; to cope with this issue, the idea of the reconstruction of IMFs is introduced. The directional forecasting performance of the models that used the IMFs reconstructed through AMI is in the range 55.81%–62.47%; these results are better than those of the single models but not far from a random guess. The models that used only the SD components of the proposed method of reconstruction attained DS values in the range 75.01%–78.39%, showing some satisfactory directional forecasts. Lastly, the directional forecasts of the proposed models EEMD-R-ARIMA and EEMD-R-ANN are in the range 88.25%–94.46% for both data frequencies and both markets, the highest among all models.
Thus, the proposed reconstruction-of-IMFs method attained the highest values of the directional forecasts and outperformed all other benchmark models, including ARIMA, ANN, EEMD-ARIMA, EEMD-ANN, EEMD-AMI-ARIMA, EEMD-AMI-ANN, EEMD-SD-ARIMA, and EEMD-SD-ANN. The model EEMD-R-ANN performed relatively better than EEMD-R-ARIMA, so the model highly recommended on the basis of the DS values is EEMD-R-ANN for both data frequencies and both markets.
Thus, the proposed method of IMF reconstruction through autocorrelation significantly improved the performance of the well-known ARIMA and ANN models. From this study, the suggested model is EEMD-R-ANN, which produced the highest DS values and the lowest RMSE, MAPE, and MAE values for both the Brent and WTI (daily and weekly) datasets.
To check the superiority of the proposed model EEMD-R-ANN, the DM test is also conducted. The DM statistics with their p values are shown in Table 11 for both the Brent and WTI COPs. The output of the DM test statistically validates the conclusions drawn from the above forecasting accuracy measures. The proposed model EEMD-R-ANN statistically outperformed the ARIMA, ANN, EEMD-ANN, EEMD-ARIMA, EEMD-AMI-ANN, EEMD-AMI-ARIMA, EEMD-SD-ANN, and EEMD-SD-ARIMA models, with respective p values far below 0.01 for both the daily and weekly data as well as for both the Brent and WTI COPs. However, the DM test statistics were not statistically significant between the EEMD-R-ANN and EEMD-R-ARIMA models for both markets and all datasets, which confirms the usefulness of the proposed reconstruction strategy. The above analysis and discussions are summarized in Section 4.1.
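The DM statistic compares two forecast error series through a loss differential. The sketch below uses squared-error loss and the naive (lag-0) variance estimate, i.e., it omits the autocovariance correction needed for multi-step horizons and the small-sample adjustment; it is illustrative, not the paper's exact implementation.

```python
import math

def dm_statistic(errors_a, errors_b, power=2):
    """Diebold-Mariano statistic on the loss differential
    d_t = |e_a|^p - |e_b|^p, with the naive lag-0 variance estimate.
    Negative values favour model A; compare with normal critical values."""
    d = [abs(a) ** power - abs(b) ** power for a, b in zip(errors_a, errors_b)]
    n = len(d)
    dbar = sum(d) / n
    var = sum((x - dbar) ** 2 for x in d) / (n - 1)
    return dbar / math.sqrt(var / n)
```

A strongly negative statistic with a p value below 0.01, as reported in Table 11, indicates that model A's losses are significantly smaller than model B's.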
4.1. Summary of the Work
From the above discussions and analysis, some important findings are summarized as follows:
(1) For accurate forecasting of the COPs, it is hard to attain satisfactory results using single models (i.e., ARIMA and ANN) because of the nonstationary and nonlinear structure of the data.
(2) The models that used all IMFs performed relatively better than the models that used the IMFs reconstructed into the two components, namely, stochastic and deterministic.
(3) The threshold value of the autocorrelation recommended on the basis of the synthetic data was used for the reconstruction of IMFs on the four real datasets. The analysis proved that the reconstruction of IMFs through autocorrelation is a better and simpler strategy that enhanced the performance of all models.
(4) Combining the benefits of EEMD, reconstruction, ARIMA, and ANN, the proposed ensemble models EEMD-R-ANN and EEMD-R-ARIMA significantly outperform all other models listed in this paper in terms of MAE, RMSE, MAPE, DS, and the DM test.
(5) The difference in performance between the proposed models EEMD-R-ANN and EEMD-R-ARIMA is statistically insignificant in terms of the DM test. However, in terms of MAE, RMSE, MAPE, and DS, the model EEMD-R-ANN performed relatively better than EEMD-R-ARIMA. Thus, the suggested model for forecasting the COPs is EEMD-R-ANN.
(6) The experimental results demonstrated that the reconstruction of IMFs into SD components using the proposed method was effective, but modelling the IMFs belonging to the stochastic component separately was an even more effective and powerful approach for forecasting the COPs.
(7) As crude oil is one of the most important commodities in the world, we analyzed the COPs for the illustration and verification of the proposed method. The empirical results showed that our proposed reconstruction-based decomposition and ensemble approach is vital and effective and could be tested on more complex tasks.
This paper proposed a new approach for the reconstruction of the IMFs of a decomposition and ensemble model according to their influences, i.e., stochastic and deterministic. We considered four different synthetic series to fix a value of the autocorrelation for the reconstruction of IMFs with minimum human interference. All these analyses confirm that the new approach supports the fast selection of the reconstruction of IMFs and helps researchers make predictions under time constraints. In that sense, we believe that our approach can improve the most complex stage, reconstructing the IMFs according to their influences (Section 3.5). Thus, from this study, the recommended threshold value of the autocorrelation is used for separating the IMFs into SD components. To evaluate the performance of the proposed approach, four real-world COP series were considered. The experimental findings demonstrated that all methodologies, including the two single benchmark models and the eight ensemble models, were effective. However, the forecasting accuracy measures in terms of MAE, RMSE, MAPE, and DS highlighted that the models EEMD-R-ANN and EEMD-R-ARIMA, based on the reconstruction of IMFs through autocorrelation, were the most efficient methods for forecasting the COPs. Moreover, the performance of the model EEMD-R-ANN is slightly better than that of the EEMD-R-ARIMA model. Thus, the recommended model for forecasting the world COPs is EEMD-R-ANN, and the reconstruction of IMFs based on autocorrelation is the best method to reconstruct the IMFs according to their influences.
The advantages of the proposed reconstruction approach result from handling the stochastic IMFs separately, in which the linear and nonlinear parts of the ARIMA and ANN models are combined to handle the stochastic uncertainty. Furthermore, the proposed method overcomes the supervision problem of the AMI approach, in which the reconstruction is carried out through visual inspection of the AMI plots. In the proposed method, the reconstruction of IMFs is performed through the autocorrelation of each IMF, so every researcher can obtain the same number of reconstructed IMFs from the same data, with high forecasting accuracy and less computational time.
In the future, this work could be extended in two aspects: (1) studying the stochastic IMFs in more detail to improve the performance in forecasting the COPs and (2) applying EEMD-R-ANN to forecast other energy time series, such as wind speed and electricity prices.
AIC: Akaike’s information criterion
AMI: Average mutual information
ANN: Artificial neural network
ARIMA: Autoregressive integrated moving average
BIC: Bayesian information criterion
BPE: Bilateral permutation entropy
BPNN: Backpropagation neural network
CEEMDAN: Complete ensemble empirical mode decomposition with adaptive noise
COPs: Crude oil prices
ECM: Error correction model
EELM: Extended extreme learning machine
EEMD: Ensemble empirical mode decomposition
IEA: International Energy Agency
EMD: Empirical mode decomposition
FNN: Feedforward neural network
GARCH: Generalized autoregressive conditional heteroscedasticity
ICA: Independent component analysis
IMFs: Intrinsic mode functions
LSSVR: Least-square support vector regression
LSTM: Long short-term memory
MAE: Mean absolute error
MAPE: Mean absolute percentage error
RMSE: Root mean square error
MLGRU: Multilayer gated recurrent unit
PACF: Partial autocorrelation function
PEEMD: Partly ensemble empirical mode decomposition
PSO: Particle swarm optimization
SD: Stochastic and deterministic
SVR: Support vector regression
WTI: West Texas Intermediate.
The data used to support the findings of this study are included within the supplementary information files.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
The COPs data used in this article consist of WTI and Brent (daily and weekly). (Supplementary Materials)
Y. Xiang and X. H. Zhuang, “Application of ARIMA model in short-term prediction of international crude oil price,” Advanced Materials Research, vol. 798, pp. 979–982, 2013.
S. Mirmirani and H. Cheng Li, “A comparison of VAR and neural networks with genetic algorithm in forecasting price of oil,” in Applications of Artificial Intelligence in Finance and Economics, pp. 203–223, Emerald Group Publishing Limited, Bingley, UK, 2004.
A. Khashman and N. I. Nwulu, “Intelligent prediction of crude oil price using support vector machines,” in Proceedings of the IEEE 9th International Symposium on Applied Machine Intelligence and Informatics (SAMI), pp. 165–169, Smolenice, Slovakia, January 2011.
X. Li, L. Yu, L. Tang, and W. Dai, “Coupling firefly algorithm and least squares support vector regression for crude oil price forecasting,” in Proceedings of the 2013 Sixth International Conference on Business Intelligence and Financial Engineering (BIFE), pp. 80–83, Hangzhou, China, November 2013.
S. Y. Wang, L. A. Yu, and K. K. Lai, “Crude oil price forecasting with TEI@I methodology,” Journal of Systems Science and Complexity, vol. 18, pp. 145–166, 2005.
M. Aamir, A. Shabri, and M. Ishaq, “Improving forecasting accuracy of crude oil prices using decomposition ensemble model with reconstruction of IMFs based on ARIMA model,” Malaysian Journal of Fundamental and Applied Sciences, vol. 14, no. 4, pp. 471–483, 2018.
D. C. Montgomery, C. L. Jennings, and M. Kulahci, Introduction to Time Series Analysis and Forecasting, John Wiley & Sons, Hoboken, NJ, USA, 2015.
M. Aamir and A. B. Shabri, “Modelling and forecasting monthly crude oil prices of Pakistan: a comparative study of ARIMA, GARCH and ARIMA-GARCH models,” Science International, vol. 27, no. 3, pp. 2365–2371, 2015.
G. E. Box, G. M. Jenkins, and G. C. Reinsel, Time Series Analysis: Forecasting and Control, vol. 734, John Wiley & Sons, Hoboken, NJ, USA, 2011.
L. Yu, S. Wang, and K. K. Lai, Foreign-Exchange-Rate Forecasting with Artificial Neural Networks, vol. 107, Springer Science & Business Media, Berlin, Germany, 2010.
Z. Wu and N. E. Huang, “A study of the characteristics of white noise using the empirical mode decomposition method,” Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, vol. 460, no. 2046, pp. 1597–1611, 2004.
G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time Series Analysis: Forecasting and Control, John Wiley & Sons, Hoboken, NJ, USA, 2015.
M. Aamir and A. Shabri, “Modelling and forecasting monthly crude oil price of Pakistan: a comparative study of ARIMA, GARCH and ARIMA Kalman model,” in Proceedings of the 23rd Malaysian National Symposium of Mathematical Sciences (SKSM23), Advances in Industrial and Applied Mathematics, vol. 1750, Johor Bahru, Malaysia, November 2016.
S. Debert, M. Pachebat, V. Valeau, and Y. Gervais, “Ensemble-empirical-mode-decomposition method for instantaneous spatial-multi-scale decomposition of wall-pressure fluctuations under a turbulent flow,” Experiments in Fluids, vol. 50, no. 2, pp. 339–350, 2011.
F. Takens, “Detecting strange attractors in turbulence,” in Dynamical Systems and Turbulence, Warwick 1980, pp. 366–381, Springer, Berlin, Germany, 1981.