Abstract

International oil prices are affected not only by the fundamentals of supply and demand but also by nonfundamental indicators such as emergencies. With the development of big data technology, many unstructured and semistructured factors can be reflected through Internet information. Based on this, this paper proposes a hot degree (HD)-based oil price forecasting model to explore the impact of Internet information on international oil prices. First, we use Latent Dirichlet Allocation (LDA) and other methods to extract topics from massive online news. Second, based on conditional probability and correlation, the positive hot degree (PHD) and negative hot degree (NHD) of the oil market are constructed to realize the quantitative representation of Internet information. Finally, a Structural Vector Autoregression (SVAR) model is established to explore the interactive relationship between HD and oil prices. The empirical results indicate that PHD and NHD predict international oil prices better than Google Trends, which is widely used in other research. In addition, PHD has a significant positive impact on oil prices and NHD has a negative impact. In the long term, PHD accounts for 51.00% of oil price fluctuations, ranking first among the relevant influencing factors. The findings of this paper can provide support to investors and policy-makers.

1. Introduction

As a strategic energy source, oil has commodity, financial, and political attributes [1, 2]. Fluctuations in oil prices have an important impact on economic growth, stock exchange rates, bond markets, and national security, so the forecasting of crude oil prices has received much attention [3, 4]. However, crude oil price prediction is a typical Nondeterministic Polynomial Complete (NP-C) problem, and the indicators affecting its price fluctuations are complex [5–7]: they relate not only to the fundamentals of supply and demand but also to the USD exchange rate, emergencies, market speculation, and games between major powers [8, 9]. Fluctuations in nonfundamental factors mostly lead to psychological changes in investors, triggering market speculation [10, 11], which in turn causes changes in fundamental supply and demand. Faced with such a complex system in the oil market, how to assess the price trend is key [12, 13]. However, many indicators affecting the crude oil market are difficult to quantify directly, and investors’ psychological changes are hard to capture in time. Therefore, it is necessary to discover new indicators that characterise the volatility of the oil market quickly and accurately. With the advent of the big data era, investor behaviors are increasingly influenced by the orientation of Internet information. Some studies propose the use of Internet information to quantify investors’ speculative behaviors [14–16].

2. Literature Review

In fact, several studies have proved that Internet information can promote the prediction of commodity price trends. The current research is mainly divided into two parts: on the one hand, more traditional research investigates planned and easily identifiable news cases, macroeconomic reports, income reports, and so forth to characterise the impact of Internet information on asset prices [17,18]. For example, Yuan [19] uses Dow Jones record-breaking events and front-page news to characterise the stock market attention and concludes that investors would generally sell stocks when widely concerned, which has a negative impact on prices; Schmidbauer and Rösch [20] put the announcements issued after the OPEC meeting as dummy variables into a Generalized Autoregressive Conditionally Heteroskedastic (GARCH) model to assess the impact of the announcement on oil price fluctuations, and the results show a significant effect. On the other hand, most of the more advanced research uses Google Trends to conduct related research [21,22]. Among them, Yao et al. [23] used the principal component analysis (PCA) method to combine the Google Trends to characterise oil market investor attention, and based on the Structural Vector Autoregression (SVAR) model, the results show investor attention has a significant negative impact on crude oil prices. Wang et al. [24] constructed an Internet concern index by analysing the correlation between Google Trends and oil prices and predicted oil prices by combining Extreme Learning Machine (ELM) methods, which improves the accuracy of forecasting; Gao et al. [17] explored the impact of Internet attention on China’s stock market through Qihoo 360’s search index and found that Internet attention contributes to the spread of information to stock prices and weakens information asymmetry.

As a free and open tool, Google Trends has the advantages of easy accessibility, timeliness, and objectivity. It is favored by researchers and has achieved good results in the field of price forecasting [25]. However, as an indicator of investor attention, Google Trends still has certain limitations. For example, Li et al. [26] showed that minors and nonprofessional investors tend to search for industry news through Google, whereas experienced investors in the oil market choose more professional platforms to obtain news first. In addition, users’ motivation to search actively often derives from the continuous reporting and fermentation of news; that is, news reports lead the changes in Google Trends and are timelier. Moreover, Google Trends counts repeated searches: multiple searches by the same user are all recorded, which biases the data.

Given the abovementioned limitations of Google Trends, news reports issued by professional media can play a complementary role [27]. Studies have shown that news reports often have a large influence, disseminating major events in the energy market to various groups, and have an important role in promoting price predictions [28,29]. However, owing to the fact that news is a text variable, there are technical barriers to data preprocessing and quantification. Based on the above analysis, a summary of the characterisation methods of the oil market’s Internet information is shown in Table 1.

As Internet information is uneven in quality and difficult to use directly, natural language processing (NLP) and text mining technologies are used: first, to quickly collect a large amount of information from the Internet; then, to denoise the extracted Internet information (news, social network feeds, etc.) to enhance data availability; and finally, to quantify the characteristics of the text information so as to explore the relationship between Internet information and financial price series data [30,31]. In terms of data acquisition, web crawling tools such as Scrapy, Puppeteer, and Selenium are currently often used [32]. Among them, Selenium is a browser automation tool that can be used for data acquisition from Python [33]. The principle is to open the specified web page in a browser controlled by the program and locate the data with CSS selectors. This requires some basic knowledge of web pages, and the tool is free to use. More about the operating principle of Selenium can be found at “https://selenium-python.readthedocs.io/.” In terms of data application, Wang et al. [34] used the Term Frequency–Inverse Document Frequency (TF–IDF) method to represent Internet text as a feature vector and put it into the Autoregressive Integrated Moving Average (ARIMA) model to predict stock prices, obtaining a better prediction; Ho et al. [35] extracted emotional information from online news and put it into the Fractionally Integrated Generalized Autoregressive Conditionally Heteroskedastic (FIGARCH) and Regime-Switching GARCH models to analyse the dynamic relationship between emotion and stock returns, concluding that news emotion can better reduce yield volatility; Füss et al. [36] proposed the use of information density to measure the response of prices to online news, to characterise uncertainties in the market, and to analyse its relationship with market price “jumps.” Lee [37] used word2vec (a word embedding method) to represent news headlines as vectors and introduced a recurrent convolutional neural network (RCN) model for deep mining of stock market information. The final results show that the information is more analytical and more conducive to price forecasting.

On the whole, Internet information is hard to capture and quantify in time. Although in the financial market, research on the deep mining of Internet information to assist the forecasting of price fluctuations has made some breakthroughs, research into oil markets mostly uses Google Trends. In addition, the content expressed in the form of Internet information is diversified, and different topics will have different effects on prices. The aforementioned studies take into account all the information on the Internet, do not implement any filtering of fraudulent or irrelevant information, and do not focus on analysing different types of Internet information, which will lead to subjectivity and bias in the results. Therefore, the application of Internet information extraction in oil price changes remains to be further studied.

Based on the above problems, we use the probabilistic topic model of NLP technology to extract the topics of news reports about the oil market and classify them automatically. This method not only filters out invalid information in Internet news but also realises text clustering and topic factor mining. More importantly, it can mine the topic hot degree based on conditional probability to quantify the text data. Compared with traditional processing based on search volume and news volume, this method is more rational and interpretable. Linear regression does not readily capture the dynamic relationship between sequences [38], and the currently popular machine learning models only provide prediction accuracy, with poor economic significance and interpretability. The SVAR model can better avoid these problems. It is often used to forecast systems of interconnected time series and to analyse the dynamic impact of random disturbances on the variable system, so as to explain the impact of various economic shocks on economic variables. It has been widely applied in the literature on energy economics and policy modeling. For our research, SVAR can analyse crude oil prices and their influencing factors at the same time and can provide rich quantitative results on the impact of these factors on crude oil prices. Thus, we use an SVAR model to explore the dynamic effects between hot degree (HD) and oil prices [39,40]. Generally, the SVAR model is developed under the hypothesis that all variables are stationary [41]; however, most variables cannot meet this condition. Research indicates that the results of the unit root test are sensitive to the size of the dataset [42,43].
Fortunately, the method proposed by Toda and Yamamoto [44] to estimate the SVAR model is less restrictive; its advantage is that it does not require the stationarity of the variables, or their integration and cointegration relationships, to be considered. The main problems addressed in this paper are as follows:
(1) How does the hot degree extracted based on the probabilistic topic model affect the oil price trend? How long will this impact last?
(2) Compared with the traditional influencing factors, how much does the hot degree contribute to oil price changes?
(3) Compared with Google Trends, which is widely used in other research, can the hot degree extracted here better explain oil price fluctuations?

3. Methods

To quantify Internet news and analyse how Internet information affects oil price fluctuations, this paper builds an HD-based oil price forecasting model based on LDA, SVAR, and other methods. The model framework is shown in Figure 1.

First, web news related to the international oil market is obtained; then, the cleaned news is modeled with the oil market topic generation method, and the topic hot degree is obtained from the probability matrix. Next, by analysing the correlation between the crude oil price and each topic hot degree, the tendency of each topic is identified, and the positive and negative hot degrees are selected. Finally, the supply and demand factors and the hot degree (HD) are put into the SVAR model to explore the impact of HD on international oil prices, and a comprehensive comparison is made with the effect of Google Trends. The model is explained in detail below.

3.1. Oil Market Text Analysis

The theoretical basis for text analysis of the oil market is the probabilistic topic model, whose development has been a process of continuous improvement. The main idea of this model is to regard a text as a multinomial distribution over several topics and each topic as a multinomial distribution over words. The earliest theory is the Latent Semantic Analysis (LSA) model, proposed by Scott et al. [45] in 1990; subsequently, Hofmann [46] proposed the Probabilistic Latent Semantic Analysis (PLSA) model in 1999 to better describe polysemy in texts. Blei et al. [47] proposed the Latent Dirichlet Allocation (LDA) model on the basis of PLSA in 2003; its main contribution is to add the Dirichlet prior distribution, which effectively solves the overfitting problem caused by too many parameters in the PLSA model. A large number of papers have shown that the LDA model performs well in text topic extraction [48–50]: it yields not only a qualitative output of topic keywords but also a quantitative output of topic probability values. In view of this, we select the LDA model as the topic extraction model for crude oil market text in order to extract the topic hot degree of web information. The process of generating simulated text based on the LDA model is as follows:

Let $\theta_i$ be the topic distribution of text $i$, sampled from the Dirichlet distribution $\mathrm{Dir}(\alpha)$; $z_{ij}$ the topic of the $j$-th word position of news item $i$, sampled with probability $\theta_i$; $\varphi_k$ the word distribution of topic $k$, sampled from the Dirichlet distribution $\mathrm{Dir}(\beta)$; and $w_{ij}$ a word generated by sampling from $\varphi_{z_{ij}}$. In the oil market text generation process, the joint probability is defined as follows.

Definition 1. The joint probability of generating oil market text based on the Dirichlet distribution and the multinomial distribution is given by the following equation:
$$p(\theta_i, z_i, w_i \mid \alpha, \varphi) = p(\theta_i \mid \alpha) \prod_{j=1}^{N_i} p(z_{ij} \mid \theta_i)\, p(w_{ij} \mid z_{ij}, \varphi), \tag{1}$$
where $p(\theta_i \mid \alpha)$ is Dirichlet, $p(z_{ij} \mid \theta_i)$ is a multinomial parameterised by $\theta_i$, $p(w_{ij} \mid z_{ij}, \varphi)$ is a multinomial over the words, and $N_i$ is the number of words in text $i$.
Since the log-likelihood function of the LDA model contains latent variables and cannot be estimated by a simple maximum likelihood method, the Expectation–Maximization (EM) algorithm proposed by Dempster et al. [51] in 1977 is adopted. At this point, the unstructured and semistructured data can be structured, and the optimal values of $\theta$ and $\varphi$ are obtained. Then, the topic of each text and the keywords under each topic can be determined according to the probability values.
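
The generative process described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the vocabulary, the symmetric prior values `alpha` and `beta`, the corpus size, and all function names are hypothetical choices made only for illustration, and the Dirichlet draws are obtained by normalising Gamma variates from the Python standard library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(n_docs, doc_len, vocab, n_topics, alpha=0.5, beta=0.5, seed=0):
    """Simulate texts with the LDA generative process (hypothetical settings)."""
    rng = random.Random(seed)
    # phi[k]: word distribution of topic k, drawn from a symmetric Dirichlet(beta)
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    corpus, thetas = [], []
    for _ in range(n_docs):
        # theta: topic distribution of this text, drawn from a symmetric Dirichlet(alpha)
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)   # topic of this word position
            w = sample_categorical(phi[z], rng)  # word drawn from the chosen topic
            words.append(vocab[w])
        corpus.append(words)
        thetas.append(theta)
    return corpus, thetas, phi

# Hypothetical toy vocabulary, not taken from the paper's corpus.
VOCAB = ["opec", "supply", "demand", "sanction", "iran", "shale"]
corpus, thetas, phi = generate_corpus(n_docs=3, doc_len=8, vocab=VOCAB, n_topics=2)
```

Estimation runs in the opposite direction: given only the observed words, the topic and word distributions are inferred, which is where the EM algorithm mentioned above comes in.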

3.2. Oil Market Hot Degree Extraction Method

Based on the values of the text-topic distribution $\theta$ and the topic-word distribution $\varphi$, the probability that each text corresponds to each topic and that each topic corresponds to each word is obtained. Next, we define the topic hot degree based on the changes of the topic over time.

Definition 2. At time $t$, the probabilities of each text corresponding to a topic are summed and divided by the total number of texts to get the hot degree of that topic at this moment, as defined in the following equation:
$$HD_{j,t} = \frac{1}{N_t} \sum_{i=1}^{N_t} \theta_{ij,t}\, \mathbb{1}\left[\theta_{ij,t} \ge \delta\right], \tag{2}$$
where $HD_{j,t}$ is the hot degree of topic $j$ at time $t$, $N_t$ is the number of texts, and $\theta_{ij,t}$ is the probability of text $i$ corresponding to topic $j$ at time $t$; the smaller the value is, the less the content of the text is related to the topic. In addition, the threshold $\delta$ is set here to filter out invalid information: a probability less than $\delta$ does not participate in the calculation. On the whole, $HD_{j,t}$ forms a time series over time, which in turn quantifies the Internet information. The higher $HD_{j,t}$ is, the more texts are related to the topic, that is, the higher the Internet attention to the topic. On this basis, we give the definition of positive and negative hot degree.
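
As a small worked illustration of Definition 2, the sketch below computes the hot degree of each topic for one period from a text-topic probability matrix. The probabilities and the function name are hypothetical; only the rule itself (sum the probabilities that reach the threshold, then divide by the number of texts) follows the definition.

```python
def topic_hot_degree(theta_t, threshold=0.1):
    """Hot degree of each topic for one period.

    theta_t: list of rows, one per text, each row holding the topic probabilities.
    Probabilities below the threshold do not participate in the sum.
    """
    n_texts = len(theta_t)
    if n_texts == 0:
        return []
    n_topics = len(theta_t[0])
    sums = [0.0] * n_topics
    for row in theta_t:
        for j, prob in enumerate(row):
            if prob >= threshold:
                sums[j] += prob
    # Divide by the total number of texts in the period.
    return [s / n_texts for s in sums]

# Hypothetical text-topic probabilities for four news items and three topics.
theta_month = [
    [0.70, 0.25, 0.05],
    [0.10, 0.85, 0.05],
    [0.02, 0.08, 0.90],
    [0.50, 0.50, 0.00],
]
hd = topic_hot_degree(theta_month, threshold=0.1)
# Topic 1: (0.70 + 0.10 + 0.50) / 4 = 0.325; the entry 0.02 is filtered out.
```

Repeating this calculation for every month yields one time series per topic.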

Definition 3. Calculate the linear correlation coefficient between crude oil prices and the topic hot degree $HD_j$ obtained from Definition 2. The topic hot degrees are divided into the positive topic hot degree set $S^{+}$ and the negative topic hot degree set $S^{-}$, as defined in the following equation:
$$S^{+} = \{HD_j \mid \rho_j \ge \lambda\}, \quad S^{-} = \{HD_j \mid \rho_j \le -\lambda\}, \quad j = 1, 2, \ldots, k, \tag{3}$$
where $\rho_j$ is the correlation coefficient between $HD_j$ and the crude oil price; $\lambda$ is the threshold value to ensure the relevance of the data; and $k$ is the number of topics.
Next, perform a principal component analysis (PCA) on the topic hot degrees in the set $S^{+}$ and obtain the first principal component. Then, the positive hot degree (PHD) is obtained within the range of 0–100 by min-max normalisation; this construction is very similar to the definition of Google Trends. The negative hot degree (NHD) is defined in the same way from the set $S^{-}$, as shown in equation (4). At this point, two indicators, PHD and NHD, are used to characterise the tendency of the news with respect to the trend of oil prices, and the two are collectively referred to as HD:
$$PHD = \mathrm{Norm}\left(\mathrm{PCA}_1(S^{+})\right), \quad NHD = \mathrm{Norm}\left(\mathrm{PCA}_1(S^{-})\right), \tag{4}$$
where PCA stands for principal component analysis, the number “1” stands for obtaining the first principal component, and $\mathrm{Norm}$ denotes min-max normalisation to the range 0–100.
On the basis of the definitions of joint probability, topic hot degree, and HD tendency, the PHD and NHD extraction algorithm for massive Internet information about the oil market is summarised in Algorithm 1.
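
To make the construction of PHD and NHD concrete, the following sketch splits topic hot degree series by their correlation with the price, takes the first principal component of each group, and rescales it to 0–100. All series and the threshold of 0.5 are hypothetical. The first principal component is computed here by power iteration on the covariance matrix, and its sign is fixed so that the loadings sum to a positive number; both are choices made for this illustration, not details stated in the paper.

```python
import math

def pearson(x, y):
    """Linear correlation coefficient between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def split_by_correlation(hd_series, price, threshold):
    """Indices of topics in the positive and negative hot degree sets."""
    pos, neg = [], []
    for j, series in enumerate(hd_series):
        rho = pearson(series, price)
        if rho >= threshold:
            pos.append(j)
        elif rho <= -threshold:
            neg.append(j)
    return pos, neg

def first_principal_component(columns, iters=200):
    """Scores of the first principal component via power iteration."""
    n, m = len(columns[0]), len(columns)
    means = [sum(c) / n for c in columns]
    centred = [[c[t] - means[j] for t in range(n)] for j, c in enumerate(columns)]
    cov = [[sum(centred[a][t] * centred[b][t] for t in range(n)) / (n - 1)
            for b in range(m)] for a in range(m)]
    v = [1.0] * m
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    if sum(v) < 0:  # resolve the sign ambiguity of the eigenvector
        v = [-x for x in v]
    return [sum(v[j] * centred[j][t] for j in range(m)) for t in range(n)]

def min_max_0_100(series):
    """Min-max normalisation to the range 0-100."""
    lo, hi = min(series), max(series)
    return [100.0 * (x - lo) / (hi - lo) for x in series]

# Hypothetical monthly price and four topic hot degree series.
price = [100.0, 110.0, 105.0, 90.0, 60.0, 50.0]
hd_series = [
    [0.30, 0.34, 0.31, 0.27, 0.15, 0.12],  # moves with the price
    [0.20, 0.22, 0.21, 0.18, 0.11, 0.10],  # moves with the price
    [0.10, 0.08, 0.09, 0.15, 0.30, 0.33],  # moves against the price
    [0.20, 0.10, 0.20, 0.10, 0.20, 0.10],  # weakly related, dropped
]
pos, neg = split_by_correlation(hd_series, price, threshold=0.5)
phd = min_max_0_100(first_principal_component([hd_series[j] for j in pos]))
nhd = min_max_0_100(first_principal_component([hd_series[j] for j in neg]))
```

In this toy example the PHD peaks in the month with the highest price and the NHD peaks in the month with the lowest price, which is the pattern the two indicators are designed to capture.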

3.3. Crude Oil Price Forecasting Based on Hot Degree

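To explore the relationship between Internet information and oil price fluctuations, we define a vector $y_t = (X_t, HD_t, P_t)'$, where $X_t$ represents the supply and demand factors of the oil market, $HD_t$ is the hot degree factor proposed in this paper, and $P_t$ is the international crude oil price. Compared with the traditional VAR model, the SVAR model can capture the contemporaneous correlation between variables and can reflect the response of the model system to independent shocks, thus better explaining the fluctuation of oil prices. Therefore, we define the SVAR model related to the oil market as follows:
$$A_0 y_t = \sum_{i=1}^{p} A_i y_{t-i} + \varepsilon_t, \tag{5}$$
where $A_0, A_1, \ldots, A_p$ are the parameter matrices to be estimated, $p$ is the lag order, and $\varepsilon_t$ is the vector of structural innovations. Assuming that $A_0$ is invertible, the SVAR model is simplified to the following equation:
$$y_t = \sum_{i=1}^{p} A_0^{-1} A_i y_{t-i} + e_t, \tag{6}$$
where $e_t$ represents the residual vector of the reduced-form SVAR model and $e_t = A_0^{-1} \varepsilon_t$ (see equation (7)). According to Kilian and Lee [52], the restrictions on $A_0^{-1}$ mean that it is in lower triangular form:
$$e_t = \begin{pmatrix} e_t^{S} \\ e_t^{D} \\ e_t^{HD} \\ e_t^{P} \end{pmatrix} = \begin{pmatrix} a_{11} & 0 & 0 & 0 \\ a_{21} & a_{22} & 0 & 0 \\ a_{31} & a_{32} & a_{33} & 0 \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix} \begin{pmatrix} \varepsilon_t^{S} \\ \varepsilon_t^{D} \\ \varepsilon_t^{HD} \\ \varepsilon_t^{P} \end{pmatrix}, \tag{7}$$
where the superscripts $S$, $D$, $HD$, and $P$ denote oil supply, oil demand, hot degree, and oil price; $a_{ij}$ represents the response coefficient of the $i$th variable to the $j$th variable’s structural shock; the larger the coefficient, the greater the impact on the whole system; and 0 means that the variable in that position has no contemporaneous response to the specific shock.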
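
One common way to obtain a lower triangular impact matrix of this kind is the Cholesky factorisation of the reduced-form residual covariance matrix, under the additional assumption that the structural shocks are uncorrelated with unit variance. The sketch below illustrates that idea only; the paper does not state its estimation routine, and the matrix `B_TRUE` is a hypothetical example whose signs merely mimic a negative supply effect and positive demand and HD effects on the oil price.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def cholesky_lower(sigma):
    """Lower triangular L with L @ L' == sigma, for a positive definite sigma."""
    n = len(sigma)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                low[i][j] = (sigma[i][j] - s) / low[j][j]
    return low

# Hypothetical impact matrix, ordered as (supply, demand, HD, oil price).
B_TRUE = [
    [1.0, 0.0, 0.0, 0.0],
    [0.2, 1.0, 0.0, 0.0],
    [0.1, 0.3, 1.0, 0.0],
    [-0.4, 0.5, 0.6, 1.0],
]
# Covariance of reduced-form residuals implied by unit-variance structural shocks.
sigma_e = matmul(B_TRUE, transpose(B_TRUE))
B_hat = cholesky_lower(sigma_e)  # recovers the lower triangular impact matrix
```

Because the factorisation is unique for a positive diagonal, the ordering of the variables fully determines which contemporaneous responses are restricted to zero.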

Next, we will explore the impact of hot degree (HD) on oil prices through the impulse response function (IRF) and variance decomposition (VD) of the SVAR model. The IRF is used to calculate the response of the whole system when the error term of the Internet information changes. The VD is used to analyse the contribution (measured by variance) of each structural shock to oil price changes and further evaluate the importance of different shocks.
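
The two tools can be sketched for the simplest case of a reduced-form VAR with one lag. The coefficient matrix `phi` and the impact matrix `impact` below are hypothetical two-variable values (hot degree, oil price) chosen for illustration; the response at horizon h is computed as the h-th power of `phi` times `impact`, and the variance decomposition is the share of squared responses accumulated up to the chosen horizon.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def impulse_responses(phi, impact, horizons):
    """Responses to one-unit structural shocks for horizons 0..horizons-1 (VAR(1))."""
    responses, current = [], impact
    for _ in range(horizons):
        responses.append(current)
        current = matmul(phi, current)
    return responses

def variance_decomposition(responses, variable):
    """Share of each structural shock in the forecast error variance of one variable."""
    n_shocks = len(responses[0][0])
    contrib = [sum(theta[variable][j] ** 2 for theta in responses)
               for j in range(n_shocks)]
    total = sum(contrib)
    return [c / total for c in contrib]

# Hypothetical system ordered as (hot degree, oil price).
phi = [[0.8, 0.0],
       [0.3, 0.5]]
impact = [[1.0, 0.0],
          [0.2, 1.0]]

irf = impulse_responses(phi, impact, horizons=6)
share_h1 = variance_decomposition(irf[:1], variable=1)  # one-step horizon
share_h2 = variance_decomposition(irf[:2], variable=1)  # two-step horizon
```

In this toy system the share of the oil price forecast error variance attributed to the hot degree shock rises from about 3.8% at the first horizon to about 13.8% at the second, which illustrates how an explanatory ratio can grow with the forecast horizon.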

4. Empirical Analysis

4.1. Data Sources

The sources of Internet news are of uneven quality, including social networking sites and new media, but scientific research needs to ensure the security, normalisation, and universality of information sources. Therefore, this paper uses “oil,” “oil price,” “oil market,” “crude oil,” “OPEC,” “WTI,” and “Brent” as keywords to crawl the news published by UPI, Reuters, Oil Price, and World Oil, which are authoritative online media, as the source of oil market Internet information; the key technology is Python’s Selenium framework. The total number of news items is 220,362; after text preprocessing and information filtering, we finally obtained 88,763 oil market-related news items from January 2012 to June 2019.

Meanwhile, the SVAR model adopted here considers four variables: in addition to the HD extracted from the Internet in this paper, it also includes global oil supply, global oil demand, and Brent crude oil prices, which are all monthly data. Among them, oil supply is represented by global oil production; on the demand side, the Purchasing Manager Index (PMI) has become an important evaluation indicator of world economic operations and a barometer of world economic changes [53]. Accordingly, oil demand is represented by PMI. The specific sources are shown in Table 2. In addition, the samples are split into two subsets when forecasting, with the data from January 2012 to December 2018 being regarded as the training set and the data from January 2019 to July 2019 forming the test set.

4.2. Topic Generating of Oil Market

First, text preprocessing is undertaken in four steps: removing invalid text, removing abnormal vocabulary, removing stop words, and converting word forms. Specifically, owing to network connection faults and other reasons, some news items are empty and need to be deleted. Internet news obtained at a first pass is prone to containing garbled or abnormal characters; to avoid interference with information quality, such erroneous data must also be removed. Different word forms increase the time complexity of the model and require conversion to a common form. Stop words are a relatively complex part of the data: stop words in the oil market text data include not only general stop words (a, an, the, etc.) but also a large number of words that carry little information about the oil market (year, time, week, etc.) and would otherwise interfere with the results. Through analysis of the preliminary results, 34 such words were added to the common stop word list to form a dedicated stop word dictionary for the oil market.

In addition, the number of topics is the key factor in determining the topic extraction effect of the LDA model. Based on the literature [54–56], the number of topics k is set to 5, 8, 10, 12, and 15 in turn; by repeatedly executing Algorithm 1 and comparing the perplexity of the resulting LDA models, the optimal number of topics is found to be 8. We take the 50 keywords most relevant to each topic to illustrate its meaning and select 10 words that have practical significance. The final topic generation effect is shown in Table 3.
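
For readers unfamiliar with perplexity, the sketch below shows the usual definition: the exponential of the negative average log-likelihood per word, with the probability of a word in a text obtained by mixing the topic-word distributions with the text's topic weights. It assumes the text-topic and topic-word probabilities are already available; the toy inputs are hypothetical, and in practice the value is often computed on held-out texts. A lower value indicates a better fit, so candidate topic numbers can be compared on it.

```python
import math

def perplexity(docs, theta, phi):
    """Perplexity of a topic model.

    docs:  list of texts, each a list of word ids
    theta: theta[d][k], probability of topic k in text d
    phi:   phi[k][w], probability of word w under topic k
    """
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(docs):
        for w in words:
            # Mixture probability of the word under the text's topic weights.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_words += 1
    return math.exp(-log_lik / n_words)

# Hypothetical toy inputs with a vocabulary of four words and two topics.
uniform_phi = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
sharp_phi = [[0.70, 0.10, 0.10, 0.10], [0.10, 0.10, 0.10, 0.70]]

pp_uniform = perplexity([[0, 1, 2, 3]], [[0.5, 0.5]], uniform_phi)
pp_sharp = perplexity([[0, 0, 0, 1]], [[1.0, 0.0]], sharp_phi)
```

A model that spreads probability uniformly over four words has a perplexity of 4, while the sharper model that matches its text scores lower.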

Algorithm 1: PHD and NHD extraction from oil market Internet information.
Step 1: Use crawler technology to obtain massive Internet information related to the oil market and then preprocess the Internet information, including removing invalid text, filtering abnormal vocabulary, removing stop words, and converting word forms.
Step 2: Vectorise the cleaned oil market text. First, all the words appearing in the texts constitute a dictionary. If a word appears in text $i$ with frequency $n$, the position of this word is recorded as $n$; otherwise, it is recorded as 0. Based on this, each text becomes a vector, and all the texts form a word frequency matrix.
Step 3: Select the appropriate number of topics, use the EM algorithm to estimate the joint probability distribution of text-word, get the text-topic probabilities $\theta$ and the topic-word probabilities $\varphi$, and determine the 50 words most relevant to each topic according to the probability values to define the realistic meaning of the topic.
Step 4: Investigate whether the text topics are reasonable and effective. If there is too much redundancy in the information or the meaning of a topic is ambiguous, repeat Steps 1 to 3 until the topics are reasonable and effective and the model perplexity is small. After meeting the above conditions, output the current topics and get the factors influencing the oil market.
Step 5: Calculate the hot degree corresponding to the oil market text topics based on equation (2), and realize the quantification of Internet information.
Step 6: Calculate the positive and negative topic hot degree sets based on equation (3), and obtain the values of PHD and NHD based on equation (4). At this point, the indicators relating news reports to crude oil prices have been extracted.
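
Step 2 of the algorithm can be illustrated with a short sketch that builds the dictionary and the word frequency matrix from already cleaned texts. The three example texts are hypothetical, tokenisation is reduced to lowercasing and splitting on whitespace, and the stop word set contains only the six examples named in the text rather than the full dedicated dictionary.

```python
from collections import Counter

# Only the example stop words mentioned in the text; the full dictionary is larger.
STOP_WORDS = {"a", "an", "the", "year", "time", "week"}

def build_word_frequency_matrix(texts, stop_words=STOP_WORDS):
    """Return the dictionary (sorted word list) and the text-by-word count matrix."""
    tokenised = [[w for w in text.lower().split() if w not in stop_words]
                 for text in texts]
    dictionary = sorted({w for doc in tokenised for w in doc})
    index = {w: i for i, w in enumerate(dictionary)}
    matrix = []
    for doc in tokenised:
        row = [0] * len(dictionary)  # words absent from the text stay at 0
        for word, count in Counter(doc).items():
            row[index[word]] = count
        matrix.append(row)
    return dictionary, matrix

# Hypothetical cleaned headlines.
texts = [
    "OPEC cuts the oil supply",
    "the oil demand rises this week",
    "oil oil sanction",
]
dictionary, matrix = build_word_frequency_matrix(texts)
```

Each row of `matrix` is the vector of one text, and the rows together form the word frequency matrix passed to the topic model in Step 3.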

Taking Topic 8 as an example, it includes keywords such as Iran, attack, and sanction. It can be speculated that the topic is related to the Middle East situation. Overall, the news topics include market economy, exploration and development, government intervention, and military war. These topics are closely related to the composition of the oil market, and this is basically consistent with the factors influencing the oil market proposed by Miao et al. [6] and Huang et al. [9] which corroborates the effectiveness of the news topics extracted in this paper. But the meaning of each topic is pluralistic, and it is difficult to summarise it with simple words.

4.3. Hot Degree Extraction of Oil Market

For the news-topic probability output matrix of the LDA model, based on Definition 2, the probability threshold $\delta$ is set to 0.1, 0.2, and 0.3, respectively. After executing Step 5 of Algorithm 1, the HD of each topic is obtained at monthly frequency. The correlation coefficient between each topic hot degree $HD_j$ and Brent oil prices is then calculated. It can be seen from Figure 2 that the correlation is the strongest when $\delta$ is set to 0.1. In Figure 2, the deeper the circle colour and the larger its area, the stronger the correlation. It can be found that the correlations among the hot degrees of the topics are small, with values between 0.03 and 0.65, indicating that the information contained in each topic is basically independent of the others; this also verifies that the topic clustering of the LDA model is good.

Looking at the first column of Figure 2 to explore the correlation between crude oil prices and topic hot degree, we can find that some topic hot degrees have a negative correlation with crude oil prices, while others have a positive correlation with oil prices. According to Definition 3, a correlation threshold is set, and the positive and negative topic hot degree sets are obtained. In order to avoid the SVAR estimation failing because of too many parameters, based on equation (4), the PHD and NHD are obtained through PCA and min-max normalisation to represent the hot degree of the oil market. The comparison between the two and the trend of oil prices is shown in Figure 3.

It can be found from Figure 3 that there is a clear codirectional relationship between crude oil prices and PHD. In 2012, oil prices were at a high level, and PHD was also at high levels. In 2014, oil prices plummeted and PHD also fell to low levels. The NHD shows exactly the opposite effect. Next, the dynamic relationship between crude oil prices and HD will be captured in detail through the SVAR model.

4.4. Analysis of the Interactive Relationship between Oil Price and Web Information Index

The input vector of the SVAR model is $y_t = (S_t, D_t, HD_t, P_t)'$, where $S_t$, $D_t$, and $P_t$ denote oil supply, oil demand, and the oil price, and $HD_t$ is represented by PHD and NHD, respectively. Before model estimation, data preprocessing is performed, including deflation and seasonal adjustment of Brent crude oil prices, seasonal adjustment of the supply and demand factors, and finally taking logarithms of all series. Next, the unit root test is performed on each variable by using the Augmented Dickey–Fuller (ADF) and Phillips–Perron (PP) tests. The results are shown in Table 4. It can be found that all variables are integrated of order one, that is, stationary after first differencing. Consequently, we use the method proposed by Toda and Yamamoto [44] to estimate the SVAR model. As long as certain conditions are met ($d_{\max} \le k$, where $k$ is the optimal lag of the VAR model and $d_{\max}$ is the maximum order of integration of the variables), the model can be established with the variables in levels. Moreover, using variables in levels facilitates the capture of long-term information and enhances the explanatory ability of the model [57].

Then, $HD_t$ is first represented by PHD, and the optimal lag order of the VAR is determined by considering multiple criteria. The results are shown in Table 5. Among them, the Final Prediction Error (FPE), Akaike Information Criterion (AIC), Schwarz Criterion (SC), and Hannan–Quinn (HQ) criteria show that the optimal lag order is 1, and the maximum order of integration of the variables in the SVAR model is 1. According to the principle proposed by Toda and Yamamoto, the VAR lag order selected here is therefore 2.
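
The lag rule applied here can be written down in a few lines. The sketch assumes the usual reading of the Toda and Yamamoto procedure, in which the levels VAR is fitted with the optimal lag plus the maximum order of integration, provided the latter does not exceed the former; the function names and the toy series are hypothetical.

```python
def toda_yamamoto_order(optimal_lag, max_integration_order):
    """Lag order of the levels VAR: the optimal lag augmented by extra lags."""
    if max_integration_order > optimal_lag:
        raise ValueError("maximum integration order must not exceed the optimal lag")
    return optimal_lag + max_integration_order

def lagged_design(series, order):
    """Pair each observation with its stacked lags [y_{t-1}, ..., y_{t-order}]."""
    rows = []
    for t in range(order, len(series)):
        regressors = []
        for lag in range(1, order + 1):
            regressors.extend(series[t - lag])
        rows.append((series[t], regressors))
    return rows

# With an optimal lag of 1 and a maximum integration order of 1, two lags are used.
order = toda_yamamoto_order(optimal_lag=1, max_integration_order=1)

# Hypothetical two-variable series in levels, five observations.
series = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
design = lagged_design(series, order)
```

Each element of `design` pairs the current observation with the regressors of a two-lag VAR in levels, which is the form in which the model is estimated.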

The SVAR model is estimated on the basis of equations (5) to (7), and the impact matrix is shown in equation (8). Observing the impact of the various shocks on oil price fluctuations, the coefficient of the supply shock in the oil price equation, $a_{41}$, is smaller than zero, indicating that the fluctuation of global oil supply has a certain negative impact on oil prices. Meanwhile, the coefficients of the demand and PHD shocks, $a_{42}$ and $a_{43}$, are positive, meaning that changes in oil demand and PHD have positive effects on oil prices. An increase in production causes oil prices to fall, and an increase in demand leads to an increase in oil prices, which is consistent with our usual perception. The positive effect of PHD on oil prices is consistent with the results in Figure 3, indicating that the larger the PHD, the more bullish news the media reports, and the higher the price of oil.

4.4.1. The Forecasting Effect of HD on Oil Prices

This section forecasts crude oil prices based on the VAR model and analyses the influence of the HD factor on the prediction results. Four models are compared. The baseline model takes oil supply, oil demand, and oil prices as input; two further models add PHD and NHD to the baseline input, respectively. In addition, as mentioned before, related research uses Google Trends to indicate investor attention in the oil market [17,23,24], analyses its impact on oil price fluctuations, and achieves good results. Google Trends and the HD extracted in this paper are both products of the Internet era. Based on this, the ability of Google Trends and HD to predict crude oil prices is compared. We process Google Trends in the same way as Yao et al. [23] and obtain the Google search volume index (GSVI); the fourth model adds GSVI to the baseline input.

The prediction is performed one step ahead, and the obtained prediction results are shown in Table 6. It can be found that, in terms of both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), the prediction effect of the model with HD is significantly better than that of the original model, and the model reduces the MAE by $3.7075 compared to the model . This is a major improvement, and the prediction effect of is better than . It works best in the four models. Unfortunately, the model with the addition of GSVI has the worst prediction effect and does not play a role in assisting the prediction. It can be seen that the HD factor extracted in this paper significantly improves the forecasting of oil prices and is significantly better than Google Trends as an auxiliary predictor. Next, we will specifically analyse how the HD affects oil prices through the impulse response function (IRF) and variance decomposition (VD) method.
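
For reference, the two error measures used in this comparison can be computed as in the sketch below. The price and forecast values are hypothetical and are not taken from Table 6.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average absolute forecast error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the average squared forecast error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical monthly prices and one-step-ahead forecasts (USD per barrel).
actual = [60.0, 64.0, 66.0, 71.0]
forecast = [62.0, 63.0, 69.0, 67.0]

mae_value = mae(actual, forecast)    # (2 + 1 + 3 + 4) / 4 = 2.5
rmse_value = rmse(actual, forecast)  # sqrt((4 + 1 + 9 + 16) / 4) = sqrt(7.5)
```

Because RMSE squares the errors before averaging, it penalises large misses more heavily than MAE, which is why the two measures are reported together.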

4.4.2. The Influence Timeliness and Explanation Ratio of PHD Shocks on Oil Price Fluctuations

Based on the estimation results of the model, the impulse response of oil prices to other variables’ shock is shown in Figure 4. From Figure 4(c), it can be found that, within the sample interval, given the shock of one standard deviation of PHD, the oil price showed a significant positive response in the first period, with a value of 1.45%, and the oil price reached the maximum response (5.92%) in the seventh period. The impact persists for a long time, with the most significant impact during the fifth to ninth periods. In addition, the positive impact of the demand factors represented by PMI also gives a positive response to oil prices in the short term, but the oil prices respond more strongly and last longer to PHD shocks, clearly leading the demand factor. This figure indicates that PHD has better timeliness and a higher impact on oil prices than traditional demand indicators. Moreover, the positive shock of oil supply factors has a certain negative impact on oil price fluctuations, and this effect has been present for a long time.

To further analyse the share of oil price fluctuations explained by PHD, the VD method was used to decompose the variance of the oil price prediction error. The results are shown in Table 7. The share of the error variance attributable to oil prices themselves is over 80% in the first period; as the forecast horizon increases, the shares attributable to supply, demand, and PHD increase. In the short term, PHD explains 3.72% of the variance of the oil price prediction error in the first period, and this share then rises gradually; in the fifth period, the proportion of oil price fluctuations explained by PHD shocks reaches 28.63%, ranking first among all variables. In the long term, PHD’s ability to explain oil price fluctuations still ranks first among the factors considered in this paper. Therefore, PHD has a strong ability to explain oil price fluctuations, while the explanatory power of the demand factor represented by PMI remains relatively low, and that of the supply factor increases steadily.
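The variance decomposition can be illustrated with the same kind of toy VAR(1). Again this is a sketch under assumptions: the matrices `A` and `sigma`, the variable ordering, and the Cholesky identification are hypothetical and do not reproduce Table 7.

```python
import numpy as np

def fevd(A, sigma, horizon):
    """Forecast error variance decomposition for a VAR(1).

    Entry [h, i, j] of the result is the share of the (h + 1)-step forecast
    error variance of variable i that is attributed to orthogonalized
    shocks in variable j. Each row sums to 1.
    """
    n = A.shape[0]
    P = np.linalg.cholesky(sigma)
    power = np.eye(n)                  # holds A^h
    cumulative = np.zeros((n, n))      # running sum of squared responses
    shares = []
    for _ in range(horizon):
        theta = power @ P              # impulse responses at this step
        cumulative += theta ** 2
        shares.append(cumulative / cumulative.sum(axis=1, keepdims=True))
        power = A @ power
    return np.array(shares)

# Hypothetical system ordered as [hot_degree, oil_price].
A = np.array([[0.5, 0.0],
              [0.3, 0.6]])
sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])

shares = fevd(A, sigma, horizon=12)
hd_share_of_oil = shares[:, 1, 0]      # share of oil price variance due to hot_degree
print(np.round(hd_share_of_oil, 4))
```

Reading `hd_share_of_oil` from the first to the last horizon gives the kind of short-term versus long-term explanatory ratios reported in the text.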

In summary, the impulse response and variance decomposition results both show that the impact of PHD shocks on oil prices exceeds that of the traditional supply and demand factors. In practice, the response cycle and explanatory ratio of crude oil prices to PHD shocks can help policy-makers and investors arrange appropriate countermeasures, anticipate the turning point of an event, and make decisions in a timely manner.

4.4.3. The Influence Timeliness and Explanation Ratio of NHD Shocks on Oil Price Fluctuations

Based on the estimation results of the model, the impulse responses of oil prices to shocks in the other variables are shown in Figure 5. Given a one-standard-deviation shock to NHD, the oil price shows a significant negative response in the first period, which once again validates our hypothesis. The impact lasts for about 7 periods and is most significant in periods 1–3, peaking in magnitude at 3.70% in the second period. Compared with PHD, the impact of NHD on oil prices appears more quickly and disappears faster, which suggests that news events with a suppressing effect tend to have shorter timeliness. In addition, the responses of oil prices to supply and demand shocks are similar to those in Figure 4, consistent with reality, and stable.

Under the current model, the variance decomposition results of crude oil price prediction errors are shown in Table 8. Consistent with the impulse response results, NHD performs well in the short term, with a peak share of 12.19%. In the long term, the share of oil price fluctuations explained by NHD declines, but it remains higher than that of demand. In addition, the supply factor’s ability to explain price fluctuations grows steadily and ranks first among the factors in the model.

The analysis above shows that the PHD and NHD indicators extracted from web news perform well in the SVAR model and clearly outperform the demand factor; the PHD factor also surpasses the supply factor and has the largest impact on oil price fluctuations among the factors considered. This answers the second question raised at the beginning of this paper. The third question is considered next: can PHD explain fluctuations in oil prices better than Google Trends?

4.4.4. The Impact of Google Trends on Oil Price Fluctuations and Comparison

The construction of GSVI based on Google Trends was explained in section (1), and an oil price forecasting model based on it was built. This section compares the impact of Google Trends and PHD on oil prices in several respects. First, the trends of Brent oil prices, PHD, and GSVI are compared in Figure 6. The GSVI fluctuates over a wide range and has two peaks, and its overall trend is negatively correlated with the crude oil price. To better understand the causal relationships between the variables, Granger causality tests were performed on GSVI, PHD, and Brent oil prices. The results are shown in Table 9. Brent oil prices Granger-cause changes in GSVI, whereas PHD Granger-causes changes in Brent crude oil prices, which indicates that PHD explains oil prices better than GSVI does. From a practical point of view, people often search in response to changes in oil prices, which in turn causes fluctuations in GSVI. Changes in oil prices, on the other hand, often follow news reports.
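The logic of a Granger causality test can be sketched as a comparison of two regressions: one that predicts a series from its own lags only, and one that also includes lags of the candidate cause. The example below is a minimal sketch on simulated data, not the procedure or data behind Table 9; the helper name `granger_f_stat`, the lag length, and the simulated series are assumptions.

```python
import numpy as np

def granger_f_stat(y, x, lags):
    """F statistic for H0: lags of x do not help predict y given lags of y."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    T = len(y)
    target = y[lags:]
    # Column k holds the series lagged by k periods, aligned with `target`.
    y_lags = np.column_stack([y[lags - k:T - k] for k in range(1, lags + 1)])
    x_lags = np.column_stack([x[lags - k:T - k] for k in range(1, lags + 1)])
    const = np.ones((T - lags, 1))
    restricted = np.hstack([const, y_lags])
    unrestricted = np.hstack([const, y_lags, x_lags])

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = (T - lags) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / dof)

# Simulated example: `news` leads `price` by one period (assumed data).
rng = np.random.default_rng(0)
news = rng.standard_normal(200)
noise = 0.1 * rng.standard_normal(200)
price = np.zeros(200)
price[1:] = 0.8 * news[:-1] + noise[1:]

f_news_to_price = granger_f_stat(price, news, lags=1)
f_price_to_news = granger_f_stat(news, price, lags=1)
print(f_news_to_price, f_price_to_news)
```

A large F statistic in one direction and a small one in the other is the pattern described in the text. In a full test, each statistic would be compared with the critical value of the F distribution with `lags` and `dof` degrees of freedom.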

Based on the model’s estimation results, the impulse response and variance decomposition results corresponding to GSVI are obtained. They are compared with the PHD results in Figure 7. Figure 7(a) shows that, for a one-standard-deviation shock to GSVI, oil prices respond in the first period with a trend similar to that of NHD shocks, although the response to GSVI decays more slowly. The positive response brought by PHD is more pronounced: both the maximum and the cumulative impulse response exceed those of GSVI. Figure 7(b) shows that, in terms of variance decomposition, the contribution of GSVI to the crude oil price is relatively high in the early periods, peaking at 25.39%, and gradually stabilizes in the later periods. The contribution of PHD shocks to oil prices increases steadily and becomes much higher than that of GSVI. PHD therefore outperforms GSVI in predictive power, causality, and explanatory ratio, which further validates the effectiveness of the PHD index proposed in this paper. It also provides new ideas for current research that relies on Google Trends.

4.5. Robustness Analysis

In order to ensure the robustness of the experimental results in this paper, we tested the model based on three aspects: transforming the sample interval, estimating the DCC-GARCH model of dynamic relationship testing, and using a different benchmark of crude oil prices.

4.5.1. Analysis of the Influences of HD on Oil Prices during Different Sample Intervals

The sample range of 2012–2019 is relatively long, so the possibility of structural changes cannot be fully ruled out. This section discusses whether HD influences oil prices differently in different sample intervals.

Firstly, we define the concept of a time window. L is the length of the time window, and the sample is divided into a finite number of subintervals of length L; s is the moving step of the window, so each subinterval is shifted by s steps to form a new time window. The schematic diagram is shown in Figure 8.

According to the above definition, the sample range from 2012 to 2019 is divided into 7 rolling time windows of 48 months each; that is, the window length is L = 48 and the window moving step is 6 months, in other words, s = 6. One-step-ahead predictions are performed in each window. The first window runs from January 2012 to December 2015, and the test set of each window is the 6 months following its training set. The last window runs from January 2015 to December 2018, and its test set is from January 2019 to June 2019. Taking PHD as an example, the obtained prediction results are shown in Table 10.
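The window scheme described above can be generated programmatically. The sketch below assumes monthly data and that the usable sample ends in June 2019 (the end of the last test set stated in the text); the function names `add_months` and `rolling_windows` are hypothetical.

```python
from datetime import date

def add_months(d, months):
    """Return the first day of the month that lies `months` after d."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)

def rolling_windows(start, end, length, step, test_length):
    """List (train_start, train_end, test_start, test_end) tuples.

    All dates are first-of-month markers and both ends are inclusive.
    A window is kept only if its test set fits inside the sample.
    """
    windows = []
    train_start = start
    while True:
        train_end = add_months(train_start, length - 1)
        test_start = add_months(train_start, length)
        test_end = add_months(test_start, test_length - 1)
        if test_end > end:
            break
        windows.append((train_start, train_end, test_start, test_end))
        train_start = add_months(train_start, step)
    return windows

# L = 48 months, s = 6 months, 6-month test sets, sample assumed to end in June 2019.
windows = rolling_windows(date(2012, 1, 1), date(2019, 6, 1),
                          length=48, step=6, test_length=6)
for train_start, train_end, test_start, test_end in windows:
    print(f"train {train_start:%Y-%m} to {train_end:%Y-%m}, "
          f"test {test_start:%Y-%m} to {test_end:%Y-%m}")
```

With these settings the loop yields 7 windows, the first training on January 2012 to December 2015 and the last on January 2015 to December 2018, which matches the description in the text.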

It can be found that, as the time window moves closer to the present, the prediction error becomes smaller. A possible reason is that, with the globalization of the Internet, the volume of news has grown and the timeliness of news reports has improved, which is in line with practice. Next, we analyse how HD affects crude oil price trends through impulse response analysis and variance decomposition in the different windows.

Based on the estimated results of the model, the impulse response of crude oil prices to PHD shocks is shown in Figure 9. In most time windows, given a one-standard-deviation shock to PHD, the oil price shows a clear positive response in the current period, and the response persists for a long time. It is worth noting that, in the sample window from January 2012 to December 2015, oil prices respond negatively to PHD shocks. This may be due to the US shale oil and gas revolution in 2014, when supply greatly exceeded demand and global oil prices plummeted; even when news favourable to oil prices appeared, it did not cause oil prices to rise.

To further analyse PHD’s ability to explain crude oil price fluctuations, the variance decomposition method is used to analyse the variance composition of crude oil price forecasting errors, as shown in Figure 10. In the different time windows, the proportion explained by PHD gradually increases as the horizon lengthens; after the fifth period, it slowly decreases and levels off. The one difference is that, in the window from January 2012 to December 2015, due to the influence of political events, PHD’s ability to explain oil price fluctuations gradually increases after the 9th period.

In summary, the results of impulse response and variance decomposition show that, in the absence of major events, PHD shocks have similar effects on oil prices and forecast errors across time windows. This illustrates the stability of the model in this paper and the validity of the proposed HD index. In practice, the response cycle and explanatory ratio of crude oil prices to HD shocks can help policy-makers and investors arrange appropriate countermeasures, anticipate the turning point of an event, and make timely decisions.

4.5.2. The Dynamic Correlations between Oil Prices and HD Variables Based on the DCC-GARCH Model

In addition to the SVAR analysis, other models can also capture the possible dynamic correlations between oil prices and HD variables [58]. To verify the robustness of the model results, we estimate the DCC-GARCH model; the resulting dynamic correlations between oil prices and the relevant factors (HD, supply, and demand) are shown in Figure 11.

Firstly, in terms of the size of the dynamic correlation coefficient, the correlation coefficients between PHD and Brent lie in the range (0.1, 0.4), so PHD is positively correlated with oil prices, with a correlation of up to 0.4. The correlation coefficients between NHD and Brent lie in the range (−0.75, −0.15), so the strongest negative correlation reaches −0.75, even exceeding the negative influence of the well-known supply factor on Brent.

Secondly, in terms of the sign of the correlation coefficient, the correlation between PHD and Brent is positive throughout the sample period. This shows that the larger the PHD, the more good news the media report, which is conducive to an upward movement of oil prices. The correlation between NHD and Brent is negative, which means that NHD has a significant negative spillover effect on Brent, in line with expectations.

Thirdly, the path of the correlation coefficient between HD and Brent fluctuates more frequently, which means that Brent is more sensitive to HD fluctuations: small fluctuations in HD quickly cause Brent to change, whereas Brent’s response to supply and demand is relatively flat. From these three observations, we conclude that the DCC-GARCH model and the SVAR model used in this paper reach consistent conclusions.
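The time-varying correlations discussed above come from the correlation recursion at the core of a DCC model. The following sketch shows only that recursion, under stated assumptions: the parameters `a` and `b` are fixed illustrative values rather than maximum-likelihood estimates, and the standardized residuals are simulated instead of being obtained from fitted univariate GARCH models.

```python
import numpy as np

def dcc_correlations(z, a, b):
    """Conditional correlation matrices from the DCC recursion.

    z is a (T, n) array of standardized residuals. The recursion is
    Q_t = (1 - a - b) * Q_bar + a * z_{t-1} z_{t-1}' + b * Q_{t-1},
    and R_t rescales Q_t so that its diagonal equals one.
    """
    z = np.asarray(z, dtype=float)
    T, n = z.shape
    q_bar = (z.T @ z) / T              # unconditional second-moment matrix
    q = q_bar.copy()
    correlations = np.empty((T, n, n))
    for t in range(T):
        if t > 0:
            prev = z[t - 1][:, None]
            q = (1 - a - b) * q_bar + a * (prev @ prev.T) + b * q
        scale = 1.0 / np.sqrt(np.diag(q))
        correlations[t] = q * np.outer(scale, scale)
    return correlations

# Simulated standardized residuals for two series (assumed data).
rng = np.random.default_rng(1)
common = rng.standard_normal(300)
z = np.column_stack([common + rng.standard_normal(300),
                     common + rng.standard_normal(300)])
z = z / z.std(axis=0)

R = dcc_correlations(z, a=0.05, b=0.90)
dynamic_corr = R[:, 0, 1]              # correlation path between the two series
print(dynamic_corr.min(), dynamic_corr.max())
```

The range and sign of `dynamic_corr` correspond to the quantities read off Figure 11 in the text, and how often the path moves indicates how sensitive one series is to the other.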

4.5.3. Analysis of the HDʼs Ability to Explain the WTI Crude Oil Market

Like the Brent crude oil market, the WTI crude oil market occupies an important position in international trading. The price ranges and fluctuation trends of the two markets differ because of differences in their own attributes. To explore the explanatory power of HD for the WTI market, the Brent oil price series is replaced by WTI crude oil prices, and the SVAR model is re-estimated. The impulse responses of WTI oil prices to PHD and NHD shocks are shown in Figure 12. According to Figure 12(a), PHD shocks also have a positive effect on the WTI oil price. The impact gradually weakens from the second period to the 25th period and is most significant from the third period to the sixth period. Figure 12(b) shows that NHD has a suppressive effect on the WTI oil price, which is consistent with the results for Brent oil prices and further shows that HD has a similar cycle and effect in the WTI crude oil market.

The results of variance decomposition are shown in Figure 13. The proportion of WTI oil price fluctuations explained by PHD reaches 51.50% in the long term, slightly higher than for Brent oil prices, and also ranks first among the factors. In addition, the highest proportion of WTI oil price fluctuations explained by NHD reaches 12.92%. Overall, the explanatory power of HD for WTI oil prices is similar to that for Brent prices, which supports the validity and robustness of HD.

5. Conclusions

Based on massive web news about the oil market, this paper uses the LDA model to perform multiple topic extraction and result evaluation on 88,763 items of oil news. After filtering duplicate and invalid information, eight topics related to the oil market are obtained. Based on the probability matrix, the topic hot degree is defined to characterise how much each topic is being discussed on the web. Then, using correlation coefficients and principal component analysis, the positive hot degree (PHD) and negative hot degree (NHD) are obtained, which quantifies web news and establishes a bridge between web news reports and crude oil prices. Finally, this paper uses the SVAR model to explore the impact of HD on oil prices. The main conclusions are as follows:

(1) The results of the SVAR model show that PHD and NHD can assist oil price forecasting. PHD shocks have a significant positive impact on crude oil prices; this effect persists for a long period, and the most significant impact appears between the second and fifth periods. NHD shocks have a significant negative impact on oil prices, and the impact gradually disappears after 5 periods. In addition, supply shocks have a significant negative impact on oil prices, and demand shocks have a significant positive impact.

(2) In the long term, PHD accounts for 51.00% of all oil price fluctuations, ranking first among the influencing factors considered. By comparison, the contribution of the oil supply factor is 19.83%, and that of the demand factor is only 5.07%. In the robustness analysis, HD has the same impact on the fluctuation of WTI oil prices, further supporting the view that HD extracted from news reports can better explain the fluctuations of the global oil market.

(3) A comparison shows that PHD performs better than Google Trends in oil price forecasting, maximum impact value, and maximum explanation ratio, which verifies the effectiveness of PHD indicators derived from news reports and provides new ideas for current research that relies on Google Trends.

(4) The robustness checks, which change the sample interval, estimate a DCC-GARCH model, and use a different oil price benchmark, confirm that our empirical results are robust.

In summary, accurately capturing the various influencing factors is the key to improving the accuracy of oil price forecasting. However, the factors that affect oil prices are intricate and complex. Many nonfundamental factors hidden in Internet text are difficult to characterise through quantitative indicators. Using NLP technology to structure this unstructured Internet information is the main work of this paper, and the importance of these data for oil price forecasting is verified through econometric models. Sudden and large fluctuations in oil prices are of great significance to future inflation and economic growth. At the same time, fluctuations in oil prices have a significant impact on various key macroeconomic indicators (e.g., fixed investment, consumption, employment, and unemployment). Therefore, the European Central Bank and the Federal Reserve have used future oil prices as an important reference in the decision-making process. For policy-makers, the method of this paper adds “HD” as an indicator of nonfundamental factors to help predict future prices, improve the accuracy of forecasts, respond to market trends in a timely manner, and stabilize market operations. For investors, the HD changes proposed in this paper provide timely insight into the impact of online news volatility on oil prices, which could not be quantitatively estimated in the past, and thus help formulate better investment strategies.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors gratefully acknowledge the financial support from the National Natural Science Foundation of China under Grant nos. 71871020 and 71021002.