Abstract

The integration of the global economy has strengthened the connection between commodity futures and spot markets. First, based on one-minute high-frequency prices, this paper applies the thermal optimal path (TOP) method to examine the lead-lag relationship between Chinese crude oil futures and spot prices from March 2018 to December 2021. Second, we apply the Mixed Frequency Data Sampling (MIDAS) regression model, together with indicators such as the deviation degree, to test how well high-frequency futures prices predict the spot market. The results show that the futures market leads the spot market most of the time, but the lead reverses when major events occur; 60-minute futures prices are the most predictive of daily spot prices; and the predictive power of crude oil futures declined after the COVID-19 outbreak and is higher when night trading is available. These findings offer guidance for investors and provide empirical evidence for policy makers.

1. Introduction

Financial experts and investors have long paid close attention to the relationships between international commodity markets: whether they exhibit similar price characteristics, whether they converge over time, and whether they are fully integrated. Concepts such as lead-lag relationships and comovement are often used to study efficiency in financial markets. The lead-lag relationship between the futures and spot markets for the same commodity is reflected in transaction prices and thus reveals the information efficiency of each market. In the commodity market, crude oil futures have both commodity attributes and financial attributes. Crude oil is an important strategic resource that bears on energy security and people’s livelihoods, and crude oil futures are an important hedging tool in the crude oil market, which in turn affects crude oil pricing. It is therefore of great practical value to study the lead-lag relationship between crude oil futures and spot prices and to forecast them. On the one hand, investors who can accurately predict the trend of future liquidity in the market can adjust their investment strategies in a timely manner and improve their returns; on the other hand, regulators can focus on monitoring possible future anomalies and take measures to avoid abnormal market fluctuations and promote the healthy development of the market. China’s crude oil futures have been traded since March 2018, have become one of the four major crude oil futures varieties in the world, and play an important role in the international crude oil market. Our research therefore focuses on two questions: what kind of lead-lag relationship exists between China’s crude oil futures and spot prices, and whether movements in the futures price can predict changes in the spot price.
In this paper, we use TOP and MIDAS to investigate the lead-lag relationship between the two markets. In addition, the outbreak of the COVID-19 epidemic led to the suspension of night trading in Chinese crude oil futures for about three months, giving us an additional window in which to observe the trend of crude oil spot prices during this period.

The rest of the paper is organized as follows: Section 2 reviews the literature, Section 3 introduces the data sources and methods, Section 4 presents the empirical results, and Section 5 concludes.

2. Literature Review

Since this paper explores the lead-lag and forecasting relationship between Chinese crude oil futures and spot using both the TOP in Econophysics and MIDAS, we provide an overview of the literature in terms of both research subjects and research methods to paint a profile of the current research on Chinese crude oil futures.

2.1. Relationship between Futures and Spot Market

The relationship between futures and spot prices has been studied for a long time. The existing literature is extensive and focuses mainly on the lead-lag relationship and price discovery. The guiding role of futures prices for spot prices was first tested empirically by Garbade and Silber, who established a dynamic model of the relationship between the two prices [1]. Research on the futures-spot relationship now covers almost all financial derivatives markets, including index futures [2, 3] and energy futures [4]. Academia has also conducted a lot of research on the mainstream international oil futures and explored their impact on the stock market [5, 6], the industrial chain [7, 8], and national economic conditions [9, 10]. With the diversification of financial instruments, oil futures are widely used in hedging to control the risks of spot oil transactions. Once the oil futures price changes, the fluctuation spreads directly to the spot market.

Prior to 2018, most studies in this area focused on the spillover effects between the Chinese crude oil spot market and the international crude oil futures market, finding that global financial markets such as crude oil show a trend of global integration [11] and that there are also significant spillover effects between Chinese crude oil spot and WTI crude oil futures [12–14]. A study of the relationship between fuel oil futures and other energy financial derivatives also finds that the correlation between Chinese fuel oil spot, fuel oil futures, and energy equity markets is weaker than that of the US market, and the strength of the correlation has weakened after the financial crisis [15]. With the listing of Chinese crude oil futures after 2018, academics began to focus on its relationship with global benchmark crude oil futures, trying to analyze the impact it brings to the global energy finance market from several perspectives.

In terms of research on the relationship between Chinese crude oil futures and spot, using a complex network model, Ji et al. found a higher degree of integration with the global market during the nighttime trading hours but a regional fractional effect during the daytime hours [16]. By analyzing the correlation between Chinese crude oil futures and spot, Jie et al. found that Chinese crude oil futures achieved their main functions and establishment goals [17]. Most of these studies analyze the correlation between Chinese crude oil futures and international crude oil futures. Yang et al. examined the transmission characteristics of returns and volatility between Chinese crude oil futures and international crude oil futures based on intraday data and found that there is a cointegration relationship between crude oil futures [18]. Huang et al. used wavelet analysis to investigate the return linkage characteristics between Chinese crude oil futures and the two major international benchmark crude oil futures (WTI and Brent). The results show that the linkage between Chinese crude oil futures and international crude oil futures is relatively weak and directionally variable, and the linkage with international crude oil prices is very different, and the price fluctuations of Chinese crude oil futures lag behind those of international crude oil futures [19]. Palao et al. found that Chinese crude oil futures do not have influence on WTI and Brent crude oil futures prices, but they are sensitive to Brent crude oil futures price fluctuations [20]. Similarly, Yang et al. constructed a VaR correlation network between Chinese crude oil futures and international crude oil futures for both upside and downside risks [21].

In 2020, with the outbreak of COVID-19, the changes in the crude oil futures market during the epidemic also attracted the interest of many scholars. Most scholars have uncovered the linkages and spillovers between world benchmark crude oil futures and other financial markets during the epidemic, including gold, stock markets, cryptocurrencies, and crude oil spot [22–25]. Jefferson analyzed the impact of the epidemic on the oil industry in different countries and found that, despite a partial recovery, many oil industries will still suffer [26]. Wang investigated the multifractal cross-correlation between crude oil and agricultural futures under the epidemic and found that it increased after the outbreak [27]. Wang et al. found that COVID-19 disrupted the oil supply and demand balance, reducing US oil by 18.14% [28]. Peng et al. found significant risk spillovers from the stock market to the oil market during the COVID-19 pandemic; the oil market was subject to high-risk spillovers from the second board market; and bidirectional risk spillovers between the Chinese stock market and the oil market increased rapidly [29]. Regarding Chinese crude oil futures, Lin and Su found that the driving effect of macro factors in China on futures prices has oscillated after the COVID-19 outbreak [30]. In addition, the outbreak exacerbated the spillover effects of the commodity markets to the US and Chinese stock markets [31]. At the same time, the comovement of Chinese crude oil futures with international benchmark crude oil futures weakened, thus providing investors with a way to hedge their risks [16].

2.2. TOP and MIDAS
2.2.1. Thermal Optimal Path (TOP)

Sornette and Zhou proposed the TOP method, used it to verify the time-varying linkages among the GDP growth rate, the unemployment rate, and the inflation rate in the United States, and further explained its advantages over cross-correlation analysis [32, 33]. Compared with traditional static research methods, TOP can accurately describe the time-varying lead-lag relationship between two time series. The method is not limited by the usual parametric distribution assumptions, can reveal structural changes in a nonlinear lead-lag relationship, and can determine the lead-lag direction and lead-lag order at each moment.

TOP is now widely used in research on the real economy and finance. Guo et al. studied the lead-lag relationship between the S&P index and bond yields and found that the stock market significantly leads short-term bonds and has a weaker effect on long-term bonds [34]. In addition, they found that the Chinese stock market has a barometer effect on the economy [35]. Xu et al. used the thermal optimal path method to analyze the intraday and interday lead-lag relationship between the onshore and offshore RMB exchange rates [36]. Meng et al. used self-consistent tests to check the rationality and effectiveness of TOP results and further improved the TOP method [37].

2.2.2. Mixed Frequency Data Sampling Regression Models (MIDAS)

The data selected by the traditional econometric model when analyzing time series data must have the same frequency; otherwise, the model cannot be identified. Therefore, when the variable frequencies are different, the academic community usually adopts two types of processing methods: one is to use interpolation to make the low-frequency data high-frequency data; the other is to use the simple average method to make the high-frequency sample data low-frequency sample data. However, these methods ignore the timeliness of high-frequency data, and the processing results may lose the effective information of high-frequency data, which will lead to serious consequences such as distortion of quantitative analysis and reduced prediction accuracy.

To solve this problem, Ghysels et al. first proposed the MIDAS model in 2004 [38], which allows the variables on the two sides of the model equation to have different frequencies. High-frequency variables are aggregated by a parameterized polynomial weight function into variables at the same frequency as the low-frequency variables. MIDAS was originally used for volatility prediction [39, 40]. Recent work mostly uses monthly data to improve quarterly macro forecasts through regression [41–43] or uses daily financial data to improve quarterly and monthly macroeconomic forecasts [44].

From another perspective, we can regard MIDAS as a simplified representation of the linear projection generated from the state-space model. In this simplified form, MIDAS regression does not require a complete state-space equation. Bai et al. pointed out that, in some cases, MIDAS regression is an accurate representation of the Kalman filter [45]. Although the Kalman filter is optimal for linear prediction, it requires a complete system of measurement and state equations, which makes it more prone to specification errors and requires more parameters, leading to computational complexity. In contrast, when large datasets are involved, MIDAS regression is computationally easy to implement and less prone to specification errors [44].

3. Data and Methods

3.1. Data

The data used in the paper derive from two sources. The futures prices, including daily data and 60 min, 30 min, and 15 min high-frequency data, come from the Tongdaxin database. For the spot price, based on data availability and integrity, we use the daily price of Shengli crude oil from the Wind database. Because China’s crude oil futures were officially listed and began trading on March 26, 2018, the data period for futures and spot in this article runs from March 27, 2018, to December 31, 2021. Two methods are applied to the collected data. The first is TOP, which finds the lead-lag relationship between two time series of equal length. In this article, we use daily futures and spot data for the TOP analysis, and each series contains 890 observations after matching. The second method is MIDAS. Due to the impact of the COVID-19 epidemic, the China Securities Regulatory Commission canceled night trading of crude oil futures on February 3, 2020, and resumed it on May 6, 2020, which gives us an opportunity to study a control sample. Because of the cancellation of the night session, the number of high-frequency observations per day was not uniform throughout the sample period. Therefore, in the MIDAS analysis, we divide the sample period into three parts: the first, with night trading, from March 27, 2018, to January 23, 2020; the second, without night trading, from February 3 to April 30, 2020; and the third, with night trading, from May 8, 2020, to December 31, 2021. The number of price observations per day at each frequency of crude oil futures in the periods with and without night trading is shown in Table 1.

3.2. Methods
3.2.1. TOP

The core of the TOP method is a recursion based on a classic model in physics, the probability transfer model. TOP introduces a distance matrix, works nonparametrically, applies the classic partition function recursively, and finally obtains the dynamic lead-lag relationship between the two sequences. In this article, we implement the algorithm in Python; the steps of the TOP method are as follows.

First, the two sequences are normalized and their distance matrix is obtained from the following equation; the simulated result is shown in Figure 1:

$$\epsilon(t_1, t_2) = \left| X(t_1) - Y(t_2) \right|, \tag{1}$$

where $X$ and $Y$ indicate that the time series have been standardized. As shown in (1), the matrix element $\epsilon(t_1, t_2)$ represents the distance between the value of time series $X$ at time $t_1$ and the value of time series $Y$ at time $t_2$, which can also be called the energy value. TOP takes the lowest energy as the objective function and achieves this by finding the correct path.
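The distance matrix of this first step can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: it assumes the distance is the absolute difference of the standardized series, and the names `standardize`, `distance_matrix`, and the toy data are hypothetical.

```python
import numpy as np

def standardize(series):
    """Rescale a series to zero mean and unit standard deviation."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / series.std()

def distance_matrix(x, y):
    """Energy landscape eps[t1, t2] = |X(t1) - Y(t2)| for standardized X and Y."""
    xs, ys = standardize(x), standardize(y)
    return np.abs(xs[:, None] - ys[None, :])

# Toy example (assumed data): y repeats x one step later, so the zero-energy
# entries sit one step off the main diagonal, at t2 = t1 + 1.
x_toy = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 1.0])
y_toy = np.roll(x_toy, 1)
eps_toy = distance_matrix(x_toy, y_toy)
print(np.round(eps_toy, 2))
```

In this toy case the valley of low energy runs parallel to the main diagonal, which is the pattern the later steps of TOP are designed to trace.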

Rotating the coordinate system in (1) gives

$$t = t_2 + t_1, \qquad x = t_2 - t_1. \tag{2}$$

After the coordinate rotation in (2), the $t$-axis of the new coordinate system runs along the main diagonal of the original coordinate system, the $x$-axis runs along the subdiagonal of the original coordinate system, and the $x$-axis is perpendicular to the $t$-axis.

Secondly, the partition function relies on ideas from physics. Zhou and Sornette use the Boltzmann factor as the weight of the partition function, and plenty of studies have shown that the lead-lag relationship of time series can be identified more accurately with the two-level algorithm. A path may advance in only three directions (vertical, horizontal, and diagonal) as it approaches the target while keeping the energy low. The corresponding partition function therefore satisfies the recursion

$$G(x, t+1) = \left[ G(x-1, t) + G(x+1, t) + G(x, t-1) \right] e^{-\epsilon(x, t+1)/T}, \tag{3}$$

where $T$ represents the temperature: the probability that the path deviates from the minimum-energy path increases with $T$. When $T \to 0$, equation (3) amounts to finding the position $x$ that minimizes $\epsilon(x, t)$ at each time $t$. However, actual data are often noisy and unstable. The path will overfit the noise when $T$ is too small and will lose much information if $T$ is too high, so choosing an appropriate temperature is very important for determining an accurate thermal optimal path.
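The role of the temperature can be illustrated with Boltzmann weights on a handful of candidate positions. The sketch below is only an illustration under assumed values: the three energies and the helper name `boltzmann_weights` are hypothetical and are not taken from the paper.

```python
import numpy as np

def boltzmann_weights(energies, temperature):
    """Normalized weights proportional to exp(-energy / temperature)."""
    e = np.asarray(energies, dtype=float)
    w = np.exp(-(e - e.min()) / temperature)   # shift by the minimum for numerical stability
    return w / w.sum()

energies = [0.2, 0.5, 1.0]                     # assumed energies of three candidate positions
for temperature in (0.1, 1.0, 10.0):
    print(temperature, np.round(boltzmann_weights(energies, temperature), 3))

cold = boltzmann_weights(energies, 0.1)
hot = boltzmann_weights(energies, 10.0)
```

At low temperature almost all weight falls on the lowest-energy position, so the path follows every wiggle of the data; at high temperature the weights become nearly equal and the energy landscape barely matters.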

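Finally, the thermal average position $\langle x(t) \rangle$ is calculated. At each time $t$, the probability that the path passes through position $x$ is $G(x, t)/G(t)$, where $G(t) = \sum_x G(x, t)$; this probability decreases as the energy rises. The average lead-lag order is obtained by the following equation:

$$\langle x(t) \rangle = \sum_x x \, \frac{G(x, t)}{G(t)}. \tag{4}$$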

After the above steps, we can find the thermal average position at different time, that is, the leading-lag order. Although the calculation results will fluctuate to some extent when the order changes, the TOP method can still confirm the order well. Coordinate system transformation and iterative format of transfer matrix of TOP are shown in Figure 2.
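The steps above can be put together in a compact Python sketch. It is a simplified illustration under stated assumptions, not the implementation used in this paper: the energy is taken as the absolute difference of the standardized series, every path is assumed to start at the origin of the lattice, the recursion is run in the original $(t_1, t_2)$ coordinates in log space to avoid overflow, and the function name `thermal_optimal_path` and the synthetic data are hypothetical.

```python
import numpy as np

def thermal_optimal_path(x, y, temperature):
    """Thermal average lag <x(t)> for t = t1 + t2 = 0, ..., 2n - 2, with x = t2 - t1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    eps = np.abs(x[:, None] - y[None, :])            # eps[t1, t2]

    # log G on a lattice padded with -inf, so border nodes need no special cases
    log_g = np.full((n + 1, n + 1), -np.inf)
    log_g[1, 1] = -eps[0, 0] / temperature           # assumed start at (t1, t2) = (0, 0)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if i == 1 and j == 1:
                continue
            # vertical, horizontal, and diagonal predecessors
            incoming = np.logaddexp.reduce(
                [log_g[i - 1, j], log_g[i, j - 1], log_g[i - 1, j - 1]])
            log_g[i, j] = incoming - eps[i - 1, j - 1] / temperature
    log_g = log_g[1:, 1:]

    lag = np.empty(2 * n - 1)
    for t in range(2 * n - 1):                       # one anti-diagonal per value of t
        t1 = np.arange(max(0, t - n + 1), min(t, n - 1) + 1)
        t2 = t - t1
        w = np.exp(log_g[t1, t2] - log_g[t1, t2].max())
        lag[t] = np.sum((t2 - t1) * w) / w.sum()     # sum over x of x * G(x, t) / G(t)
    return lag

# Synthetic check (assumed data): y copies x with a delay of 5 steps plus small noise.
rng = np.random.default_rng(0)
n, true_lag = 200, 5
z = rng.standard_normal(n + true_lag)
x_demo = z[true_lag:]                                # leading series
y_demo = z[:n] + 0.1 * rng.standard_normal(n)        # y(t) = x(t - 5) + noise
lag_path = thermal_optimal_path(x_demo, y_demo, temperature=0.5)
print(np.median(lag_path[100:300]))
```

A positive value of the returned path means that the first series leads the second. The double loop is quadratic in the series length, which is acceptable for a few hundred daily observations but would need vectorizing for intraday data.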

3.2.2. MIDAS

MIDAS is a method of estimating and forecasting from models where the dependent variable is recorded at a lower frequency than one or more of the independent variables. Unlike the traditional aggregation approach, MIDAS uses information from every observation in the higher frequency space. In this paper, we run MIDAS in Matlab.

The univariate MIDAS model is the simplest and most basic MIDAS model. It links the high-frequency data to the low-frequency data through a weight function and estimates the parameters by nonlinear least squares, so that the high-frequency explanatory variable can be used to predict the low-frequency dependent variable. The basic form of the univariate MIDAS($m$, $K$) model is

$$y_t = \beta_0 + \beta_1 B\left(L^{1/m}; \theta\right) x_t^{(m)} + \varepsilon_t,$$

where $y_t$ is the low-frequency dependent variable and $x_t^{(m)}$ is the high-frequency explanatory variable. $m$ represents the ratio between the sampling frequencies of the high-frequency and low-frequency variables. For example, when $y_t$ is daily data and $x_t^{(m)}$ is 60 min data, $m = 24$ if trading continues throughout the day. $B(L^{1/m}; \theta) = \sum_{k=0}^{K} w(k; \theta) L^{k/m}$ is the weighted lag polynomial, $L^{k/m}$ is the high-frequency lag operator, with $L^{k/m} x_t^{(m)} = x_{t-k/m}^{(m)}$, and $K$ is the maximum lag order of the high-frequency variable. Since the lags in this model are weighted lags of the high-frequency explanatory variable, it is impossible to make out-of-sample forecasts in this form.
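To make the frequency alignment concrete, the following Python sketch collapses the most recent high-frequency observations of each low-frequency period into one regressor. It is an illustration only: the paper's estimation was run in Matlab, and the function name `midas_regressor`, the toy series with four intraday values per day, and the equal weights are assumptions.

```python
import numpy as np

def midas_regressor(x_high, m, weights):
    """Weighted sum of the latest len(weights) high-frequency values in each low-frequency period.

    weights[0] multiplies the most recent high-frequency value (lag 0).
    """
    x_high = np.asarray(x_high, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n_low = len(x_high) // m
    out = np.full(n_low, np.nan)                  # NaN where history is too short
    for t in range(n_low):
        end = (t + 1) * m                         # index just past the last value of period t
        start = end - len(weights)
        if start < 0:
            continue
        window = x_high[start:end][::-1]          # reverse so that lag 0 comes first
        out[t] = weights @ window
    return out

x_high = np.arange(1.0, 13.0)                     # 3 days with m = 4 intraday values each
equal_weights = np.full(4, 0.25)                  # equal weights reduce to the simple daily average
aggregated = midas_regressor(x_high, m=4, weights=equal_weights)
print(aggregated)
```

With equal weights the result is the plain average that traditional aggregation would use; MIDAS instead lets the data choose the weights through a small number of parameters.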

Economic indicators often display inertia, and successive observations in a time series are autocorrelated. It is therefore necessary to add an autoregressive term to the mixed-frequency model. Clements and Galvao argue that this inertia can only be further addressed by adding dynamic autoregression to the model [46]. In this paper, we add the first lag of China’s crude oil spot price as an autoregressive term and obtain the MIDAS-AR(1) model

$$y_t = \beta_0 + \lambda y_{t-1} + \beta_1 B\left(L^{1/m}; \theta\right) x_t^{(m)} + \varepsilon_t.$$
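One simple way to estimate a model of this kind is to profile the nonlinear weight parameters on a grid and solve for the linear coefficients by ordinary least squares at each grid point. The Python sketch below does this for simulated data. It is a hedged illustration rather than the Matlab procedure used in the paper: the two-parameter exponential Almon weights, the grid search in place of a numerical optimizer, the helper names `exp_almon` and `fit_midas_ar1`, and the simulated series are all assumptions.

```python
import numpy as np

def exp_almon(theta1, theta2, max_lag):
    """Two-parameter exponential Almon weights for lags k = 0, ..., max_lag."""
    k = np.arange(max_lag + 1, dtype=float)
    z = theta1 * k + theta2 * k ** 2
    w = np.exp(z - z.max())                       # shift for numerical stability
    return w / w.sum()

def fit_midas_ar1(y, x_lags, grid):
    """Profile least squares for y_t = b0 + lam * y_{t-1} + b1 * (weights @ x_lags[t]) + e_t.

    x_lags[t, k] holds the high-frequency value at lag k for low-frequency period t.
    """
    y_now, y_prev, x_now = y[1:], y[:-1], x_lags[1:]
    best = None
    for theta1 in grid:
        for theta2 in grid:
            z = x_now @ exp_almon(theta1, theta2, x_now.shape[1] - 1)
            design = np.column_stack([np.ones_like(z), y_prev, z])
            coef, *_ = np.linalg.lstsq(design, y_now, rcond=None)
            rmse = np.sqrt(np.mean((y_now - design @ coef) ** 2))
            if best is None or rmse < best["rmse"]:
                best = {"theta": (theta1, theta2), "coef": coef, "rmse": rmse}
    return best

# Simulated data with known parameters (assumed values, for illustration only).
rng = np.random.default_rng(1)
n, max_lag = 300, 7
x_lags = rng.standard_normal((n, max_lag + 1))
w_true = exp_almon(0.1, -0.1, max_lag)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 + 0.3 * y[t - 1] + 2.0 * (x_lags[t] @ w_true) + 0.05 * rng.standard_normal()

fit = fit_midas_ar1(y, x_lags, grid=np.linspace(-0.5, 0.5, 11))
print(fit["theta"], np.round(fit["coef"], 3), round(fit["rmse"], 4))
```

In real data the rows of `x_lags` overlap from one day to the next, and a numerical optimizer would normally replace the coarse grid; the sketch only shows how the autoregressive term enters alongside the weighted high-frequency regressor.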

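At present, various parsimonious polynomial specifications have been considered, including (1) Almon lag polynomial specifications, (2) exponential Almon lag polynomial specifications, (3) beta polynomial, and (4) step functions [47].

(1) Almon lag polynomial specifications are as follows:

$$w(k; \theta) = \sum_{p=0}^{P} \theta_p k^p.$$

(2) Exponential Almon lag polynomial specifications are as follows:

$$w(k; \theta) = \frac{\exp\left(\theta_1 k + \theta_2 k^2 + \cdots + \theta_P k^P\right)}{\sum_{j=0}^{K} \exp\left(\theta_1 j + \theta_2 j^2 + \cdots + \theta_P j^P\right)}.$$

The exponential Almon lag polynomial is the most frequently used weight function nowadays, because it can produce a wide variety of weight shapes while ensuring that the weights are greater than zero, which gives the model a good property of approximating the residual to zero.

(3) Beta polynomial is as follows:

$$w(k; \theta_1, \theta_2) = \frac{f\left(k/K; \theta_1, \theta_2\right)}{\sum_{j=0}^{K} f\left(j/K; \theta_1, \theta_2\right)},$$

where

$$f(u; a, b) = \frac{u^{a-1} (1-u)^{b-1} \, \Gamma(a+b)}{\Gamma(a) \, \Gamma(b)}.$$

(4) Step functions are as follows:

$$w(k; \theta) = \theta_1 I_{k \in [a_0, a_1]} + \sum_{p=2}^{P} \theta_p I_{k \in (a_{p-1}, a_p]},$$

where $a_0 = 0 < a_1 < \cdots < a_P = K$ and $I$ is the indicator function.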
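Two of these weighting schemes are easy to compute directly. The Python sketch below evaluates normalized exponential Almon and beta weights for lags $k = 0, \ldots, K$; it is illustrative only, the function names are hypothetical, and the clipping of $k/K$ away from 0 and 1 is an assumption added to avoid endpoint singularities in the beta density. The gamma-function factor is omitted because it cancels in the normalization.

```python
import numpy as np

def exp_almon_weights(theta, max_lag):
    """Exponential Almon weights; theta = (theta_1, ..., theta_P) multiplies k, k^2, ..., k^P."""
    k = np.arange(max_lag + 1, dtype=float)
    z = sum(th * k ** (p + 1) for p, th in enumerate(theta))
    w = np.exp(z - np.max(z))                     # shift for numerical stability
    return w / w.sum()

def beta_weights(theta1, theta2, max_lag, edge=1e-6):
    """Normalized beta weights on u = k / K, clipped to (edge, 1 - edge)."""
    u = np.clip(np.arange(max_lag + 1) / max_lag, edge, 1.0 - edge)
    f = u ** (theta1 - 1.0) * (1.0 - u) ** (theta2 - 1.0)
    return f / f.sum()

almon_flat = exp_almon_weights((0.0, 0.0), max_lag=4)      # all parameters zero: equal weights
almon_decay = exp_almon_weights((0.0, -0.1), max_lag=4)    # weights decline with the lag
beta_flat = beta_weights(1.0, 1.0, max_lag=4)              # theta1 = theta2 = 1: equal weights
beta_decay = beta_weights(1.0, 3.0, max_lag=4)             # more weight on recent observations
print(np.round(almon_decay, 3), np.round(beta_decay, 3))
```

Both schemes yield strictly positive weights that sum to one, so a few parameters control how quickly the influence of older intraday prices fades.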

4. Empirical Results

4.1. Lead-Lag Relationship

Figure 3 shows the futures closing price and the spot price over the sample period of 890 observations. The main vertical axis represents the daily closing price of the futures in RMB per barrel, and the secondary vertical axis represents the daily spot price in USD per barrel. Since we are examining the lead-lag relationship of price changes, the effects of exchange rate changes are not considered here. For comparison, we have removed dates that do not overlap. The figure shows that, although trading methods and trading hours differ between spot and futures, the two price trends are broadly consistent, and changes in futures prices are slightly ahead, indicating that oil futures have a price discovery function. To determine by how many days the futures price changes lead, we use the TOP method for further analysis.

Figure 4 shows the lead-lag relationship at three different temperatures $T$, with positive values on the vertical axis indicating that futures lead the spot ($\langle x(t) \rangle > 0$) and negative values indicating the reverse ($\langle x(t) \rangle < 0$). We found that, before the first delivery of the crude oil futures contract SC1809 on September 7, 2018, changes in temperature cause the lead order to jump significantly: the lower the temperature, the larger the lead order. This is because the optimal path is too sensitive to noise at low temperature; the sensitivity decreases as the temperature rises, but too high a temperature also causes information loss. The change in the order also shows that, before the first delivery, the price of oil futures did not yet perform the functions of price discovery and market stabilization, and its formation was more affected by speculative factors. After the first delivery, the lead-lag order at different temperatures is basically the same, with only small gaps at some time points, which indicates that, from this time on, the trading mechanism of the futures market gradually matured and slowly began to perform its intended functions.

In Figure 4, we additionally label the timing of events that may have had a significant impact on the world crude oil market during the sample period. At the first settlement date on the Shanghai International Energy Exchange (INE), futures prices are ahead of spot prices by about five trading days. In April 2019, the US announced the continuation of sanctions against Iran, at which point changes in the futures market lagged the spot market. In June 2019, tankers were bombed in the Strait of Hormuz, but the market, particularly in China, may have taken such events for granted and did not react violently, as was also seen in the bombing of Saudi crude oil facilities in October of the same year. In 2020, futures prices moved slightly behind spot prices under the influence of a number of factors such as COVID-19, a trend that continued through the end of the year. The lead of futures over spot prices begins to be significant as market sentiment changes on November 2, 2020, when the US presidential election is held, and is strengthened again by the Suez Canal blockage in 2021. As can be seen from Figure 4, futures lead spot price changes for most of the sample period; in a few periods the lead-lag relationship reverses due to epidemic factors, but the lag order is less than 10 days at all three temperatures, indicating that, to a certain extent, INE guides domestic oil spot prices well.

4.2. Forecasting Results

In this part, MIDAS regression models with different weight functions, including Beta-MIDAS, Beta-Nonzero-MIDAS, Exp-Almon-MIDAS, Stepfun-MIDAS, and Almon-MIDAS, were used to simulate the influence of different high-frequency explanatory variables on the daily spot price and to predict it, with the lag order of the high-frequency variable varying from 1 to 40. The optimal model is the one with the smallest RMSE (root mean square error) and the largest R² (goodness of fit). The deviation degree and deviation distance are used to measure prediction accuracy in the in-sample forecasts. The estimation results and the in-sample prediction accuracy together determine the optimal form of the weight function and the optimal lag order of the high-frequency explanatory variables. Comparing the models with and without autoregressive terms shows that adding autoregressive terms greatly improves both the fit and the in-sample prediction accuracy. Tables 2, 3, and 4 show the actual and predicted values at different frequencies with night trading before COVID-19, without night trading during the COVID-19 outbreak, and with night trading resumed after the COVID-19 outbreak, respectively. The corresponding line graphs are shown in Figures 5, 6, and 7.
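The two selection criteria can be computed as follows. This Python sketch is only an illustration of the standard RMSE and R² formulas and of picking the candidate with the smallest RMSE; the candidate names and the toy numbers are hypothetical and are not results from this paper.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

actual = np.array([1.0, 2.0, 3.0, 4.0])
candidates = {                                   # in-sample predictions of two hypothetical models
    "model_a": np.array([1.0, 2.0, 3.0, 5.0]),
    "model_b": np.array([2.0, 2.0, 2.0, 2.0]),
}
scores = {name: (rmse(actual, pred), r_squared(actual, pred))
          for name, pred in candidates.items()}
best_model = min(scores, key=lambda name: scores[name][0])
print(scores, best_model)
```

The same loop could run over every combination of weight function and lag order, keeping the specification with the lowest RMSE and checking that it also has the highest R².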

By comparing Figures 5, 6, and 7, we find that, regardless of whether there is a night trading market, futures prices at different frequencies predict the daily spot price well. However, the prediction appears more accurate when the night market is open. To verify this conjecture, we use the deviation degree and deviation distance to measure the prediction accuracy; the results are shown in Table 5. The deviation degree measures the degree of correlation between two variables $x$ and $y$, and its range is $[-1, 1]$. The greater the absolute value of the deviation degree, the higher the correlation between $x$ and $y$. When they are linearly correlated, the deviation degree is 1 (positive linear correlation) or -1 (negative linear correlation). Therefore, the prediction sequence with the highest accuracy can be determined according to the maximum absolute value of the deviation degree and the minimum deviation distance. It can be found that, regardless of whether there is a night market, the prediction accuracy of the 60 min data is the highest, indicating that the information contained in the hourly data predicts the daily price most accurately. In addition, in the period without a night market, the deviation degree and deviation distance are both lower than in the periods with a night market, indicating that the night market also contains valuable information and improves the prediction accuracy. It is also worth noting that the forecast accuracy in the third stage is weaker than in the first stage. This may be explained by the unexpectedly large increase in international crude oil prices in late 2021, which weakened the forecast accuracy of the futures.
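The paper does not state the formulas for the two accuracy measures. The Python sketch below therefore rests on two assumptions that should be read as such: the deviation degree is taken to be the Pearson correlation coefficient, which matches the stated range and the values of 1 and -1 under linear correlation, and the deviation distance is taken to be the Euclidean distance between the actual and predicted series. The function names and toy data are hypothetical.

```python
import numpy as np

def deviation_degree(actual, predicted):
    """Assumed form: Pearson correlation between actual and predicted values."""
    return float(np.corrcoef(actual, predicted)[0, 1])

def deviation_distance(actual, predicted):
    """Assumed form: Euclidean distance between actual and predicted values."""
    diff = np.asarray(actual, float) - np.asarray(predicted, float)
    return float(np.sqrt(np.sum(diff ** 2)))

actual = np.array([1.0, 2.0, 3.0])
scaled = np.array([2.0, 4.0, 6.0])               # perfectly linear in actual, but offset in level
degree = deviation_degree(actual, scaled)
distance = deviation_distance(actual, scaled)
print(degree, distance)
```

The toy case shows why both measures are reported together: a forecast can be perfectly correlated with the actual series and still sit far from it in level.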

4.3. Discussion of Results

The lead-lag and forecast analysis reveals, first, that the lead-lag relationship between the futures and spot markets alternates over the sample period according to the thermal optimal path, and that neither the futures nor the spot market dominates throughout the sample period. Moreover, some of the periods with strong lead-lag signals coincide with significant changes in oil and stock markets and with geopolitical events, suggesting that futures markets can reflect real global economic, financial, and geopolitical regime changes, which is consistent with the findings of Shao et al. [48]. Given the sample period used, it can be assumed that a given lead-lag relationship between oil futures and spot markets exists only temporarily and will change in the long run due to shocks from other factors. Second, using the MIDAS model, we find that INE plays a role in the discovery of domestic crude oil spot prices, reflecting its dominant influence on the domestic market, which is similar to the findings of Wang et al. [49]. The relationship between domestic and global oil prices changed significantly with the launch of INE: China’s own crude oil futures now complement international crude oil futures, which were previously the only way to hedge domestic crude oil spot risk. Based on the results of the empirical analysis, investors can use 60 min futures data that include night trading to forecast spot price movements 5–10 days ahead and thereby hedge their investment risk in the short term.

5. Conclusion

First, China’s crude oil futures have gradually played a good price guidance role since the first delivery and have reacted immediately to world events affecting crude oil supply. Crude oil futures prices can lead domestic crude oil spot prices by 5–10 days in some periods, providing an early warning of risk to investors. Chinese crude oil spot pricing has gradually moved from referring only to foreign benchmark crude oil futures to referring jointly to foreign and domestic crude oil futures, which provides a more comprehensive pricing basis for the Chinese crude oil market. After the COVID-19 outbreak, the price discovery function of crude oil futures was temporarily missing, and the relationship between futures and spot prices changed rapidly from leading to lagging, at one point reaching a lag of half a month, most likely because the existing global industrial production and supply chain system was paralyzed by the outbreak. According to the thermal optimal path chart, the impact of the epidemic on Chinese crude oil futures continued until the end of April 2020, after which Chinese crude oil futures resumed guiding the spot price as the global economy gradually recovered. As crude oil futures and spot prices have continued to climb since November 2020, the predictive function of futures for spot has continued to strengthen. Because this upward trend is ongoing, its discussion is left for future studies.

Second, futures trading price data at different frequencies are good predictors of spot daily prices, with the 60-minute data having the highest predictive accuracy, suggesting that higher frequency is not better when using high-frequency markets to predict low-frequency markets. The most likely conjecture for this phenomenon is that too high a frequency will trap a lot of white noise, while too low a frequency will lose some important information. 60-minute data is the most informative data for investors to forecast prices and hedge risks.

Finally, the temporary suspension of night trading in crude oil futures provides a natural experiment for exploring how the price discovery function of futures changes when only day trading is available. The deviation degree and deviation distance confirm that futures prices predict spot prices less accurately in the period without night trading than in periods that include night trading, suggesting that night trading in Chinese crude oil futures contains important market information. Further, a comparison of the periods with night trading before and after COVID-19 shows that forecast accuracy decreased after the epidemic, probably because unforeseen developments in the world geopolitical situation caused significant price fluctuations. Through the night market, Chinese crude oil futures can maintain connectivity with overseas markets and avoid situations in which positions cannot be closed under unexpected conditions and significant losses result.

This study has important theoretical value and practical significance. On the one hand, the relevant research results can effectively guide relevant enterprises to take advantage of changes in the crude oil futures market and effectively avoid the adverse effects of crude oil price fluctuations. On the other hand, while developing the crude oil futures market and improving the efficiency of the crude oil futures market, the regulatory authorities should pay attention to the price signals emitted by the crude oil futures market and take measures to resolve abnormal situations when they occur, to provide help for better prevention of financial risks.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Authors’ Contributions

C. Zhang contributed to formal analysis, investigation, writing of the original draft, visualization, and data curation. D. Pan contributed to investigation and reviewed and edited the manuscript. M. Yang contributed to methodology, software, and validation and reviewed and edited the manuscript. Z. Pu contributed to conceptualization, methodology, investigation, supervision, and funding acquisition and reviewed and edited the manuscript.

Acknowledgments

This work was supported by the National Social Science Foundation, China (no. 20&ZD110), and the Social Science Fund, Nanjing University of Posts and Telecommunications (no. NYY219004).