Abstract

Historical trading data, which are inevitably associated with the framework of causality both financially and theoretically, have been widely used to predict stock market values. With the popularity of social networking and Internet search tools, the ways of collecting information have diversified. Beyond theoretical causality in forecasting, the importance of relations within the data has risen. Thus, the aim of this study was to investigate the performance of forecasting stock markets using data from Google Trends, historical trading data (HTD), and hybrid data. The keywords employed for Google Trends were collected in three different ways: users’ definitions (GTU), trending searches of Google Trends (GTTS), and tweets (GTT). The hybrid data combine Internet search trends from Google Trends with historical trading data. In addition, the correlation-based feature selection (CFS) technique is used to select independent variables, and a one-step-ahead policy is adopted by the least squares support vector regression (LSSVR) for predicting stock markets. Numerical experiments indicate that using hybrid data can provide more accurate forecasting results than using only historical trading data or only data from Google Trends. Thus, using hybrid data of Internet search trends and historical trading data with LSSVR models is a promising alternative for forecasting stock markets.

1. Introduction

With the advances of the Internet and communication in recent years, the increasing amount of data from social networks has changed the ways in which data are collected and analyzed. Google Trends (http://www.google.com/trends) can be used to examine the search trends of keywords. Hence, data from Google Trends have been applied to many fields such as economics, elections, and medicine. Compared to structured data, data collected from social networks offer another way to depict the issues concerned, and thus, interesting and essential insights that are not captured by traditional data collection may be discovered. Stock markets have been hard to predict ever since they began, yet they have profound effects on a country. In the past, the forecasting of stock markets has relied heavily on historical trading data, and most forecasting models using historical trading data are based on theoretical causality. With the popular use of Internet search, people tend to seek data or information from the Internet and express opinions on social networks. Stephens-Davidowitz [1] indicated that when social censoring issues are studied, Internet search behaviors can reflect the real thinking of people better than survey data, and the data can be obtained closer to real time [2–6]. However, the importance of historical trading data in forecasting stock market values should not be disregarded. This study attempts to incorporate the data from Google Trends and historical trading data together to predict stock markets. The performance of the hybrid data and of each single data type in forecasting stock market closing values was examined in this investigation. 
Five stock markets, namely, Dow Jones Industrial Average Index (DJIA), Nasdaq Composite Index (IXIC), Russell 2000 Index (RUT), Standard & Poor’s 500 Index (S&P 500), and Chicago Board Options Exchange Volatility Index (VIX), and three companies, the Apple corporation (APPL), the Alphabet corporation (GOOGL), and the Microsoft Corporation (MSFT), were forecast by least squares support vector regression models with different data types. The rest of this article is organized as follows: Section 2 reviews the related work. Section 3 introduces the methods employed in this study. Section 4 illustrates the proposed stock-forecasting framework and numerical examples. Section 5 draws conclusions.

2. Related Work

Hassan [7] noted that predicting stock markets using complex calculations does not help much. The author proposed a forecasting technique combining the hidden Markov model and fuzzy concept to predict stock markets. The results showed that the presented model outperformed the autoregressive integrated moving average model, the neural network model, and other hidden Markov models. Hadavandi et al. [8] claimed that a successful forecasting model for stock markets is one that can obtain accurate forecasting results with the smallest amount of input data and the simplest stock market model. Their study combined genetic fuzzy systems and neural networks to forecast stock markets for information technology companies and airline companies. In the data-preprocessing stage, stepwise regression analysis was used to select factors, and a self-organizing map was then employed to cluster the data. The experimental results showed that the proposed approach can obtain more accurate results than some other forecasting methods. Singh and Borah [9] designed a forecasting model consisting of fuzzy theory and the particle swarm optimization technique to predict stock markets by using historical data from the State Bank of India. The numerical results illustrated that the proposed forecasting model is superior to the grey model, artificial neural networks, and regression models.

Another tendency in forecasting stock markets is to put financial indicators into forecasting models. Laboissiere et al. [10] developed a model including correlation analysis and artificial neural networks to predict stock prices of Brazilian electric companies. In addition to the historical trading data, some indices such as the Ibovespa index, the Electric Power index, and the American dollar quote were employed to predict stock prices. The numerical results were promising in terms of forecasting accuracy. Lincy and John [11] presented a multiple fuzzy inference systems model to predict the prices of selected stocks on the Nasdaq stock exchange. Four indicators, Moving Average Convergence/Divergence, Relative Strength Index, Stochastic Oscillator, and Chaikin Oscillator, were used by the proposed model, and decision rules were generated by using fuzzy set theory and multicriteria decision-making approaches. Simulation results revealed that the presented model is an effective way to analyze stock prices in terms of profit return. de Oliveira et al. [12] used artificial neural networks to forecast Petrobras’ PETR4 stock with fundamental and technical factors that may influence stock markets. After the data-preprocessing procedure, the essential factors that remained were used by the artificial neural networks. This study reported that the testing accuracy of stock market directions was more than ninety percent. Göçken et al. [13] applied metaheuristics, which were employed to select essential indicators, together with artificial neural networks in stock price prediction. In addition, this study examined the suitable number of neurons in the hidden layer in order to deal with the overfitting or underfitting problems of artificial neural networks. The results indicated that the proposed forecasting model was a dominant way to predict stock markets.

Because the use of social networks is booming, data from social networks offer valuable insights into what people think and want. Thus, these data have become more and more popular for collecting opinions and for forecasting. Stephens-Davidowitz [1] studied the relation between the voting of American presidential election and racially charged language. The author pointed out that the Google search queries were more useful than the survey data when social censoring issues were investigated. The results showed that there was a relation between voting and the search queries of racial animus. Gunn III and Lester [5] employed Google Trends with three terms to analyze the relation between the three terms and monthly suicide rates. They reported that the information from the Internet search is correlated with the number of suicides, and thus, it is a faster way of monitoring possible suicide trends than compiling suicide statistics. Yang et al. [14] analyzed the relation between Internet search trends and suicide death. The conclusions revealed that suicide-related search terms were related to suicide death, and thus, keyword-driven search results of the Internet are the essential knowledge to reduce suicide deaths. Frijters et al. [4] conducted a study about the relationship between macroeconomic conditions and an indicator of problem drinking data from Google searches. The results showed that the macroeconomic conditions are associated with health in some ways, and the real-time data provided by Google searches are crucial information for policy-makers. Smith [15] investigated the volatility in forecasting foreign currency exchange rates by using three Google search keywords and time-series models. The results demonstrated that the information from Google searches is important in forecasting the market for foreign currency. Fondeur and Karamé [16] used the Google search data to enhance the prediction accuracy of youth unemployment in France. 
The results indicated that Google search data did improve the prediction of unemployment. Li et al. [17] used both statistical data and Google search data to predict the consumer price index by a mixed-data sampling model. Numerical results revealed that the proposed approach was helpful in forecasting the consumer price index by using data from the user-generated content. Takeda and Wakao [18] studied the relation between the Google search intensity, stock trading volume, and stock prices. It was reported that the positive relationship between Google search intensity and trading volume is stronger than that between Google search intensity and stock prices. Araz et al. [2] used Google Flu Trends data to forecast influenza-like illness, and a strong positive relation between Google Flu Trends data and influenza-like illness was revealed. In addition, using Google Flu Trends data as independent variables can result in accurate forecasting results. Some studies have examined the relation between the Internet search and some diseases, such as disease-related genes [19], kidney stones [20, 21], epilepsy [3, 22], allergy [23], and restless legs [24].

Most data on social networks are unstructured. Therefore, text mining has been one of the major tools employed to find meaningful information in social networks. Mostafa [25] used tweet samples on some famous companies to analyze the sentiments of users and to forecast the Prosperity index of each company. This investigation concluded that text mining in social networks is a helpful way to capture consumers’ views and preferences regarding products. Ikeda et al. [26] investigated Japanese Twitter users and developed a hybrid text-based and community-based method to predict the demographic groups of Twitter users. The proposed method can analyze a user’s hobbies, occupation, marital status, age, gender, and area. The authors reported that the proposed hybrid method can increase the precision of the text-based method. He et al. [27] collected social media data from both a company’s own sites and its competitors’ sites in the pizza industry. This study indicated that social media competitive analysis is essential and can help companies to form marketing strategies. Yu and Wang [28] gathered real-time tweets during the 2014 World Cup games and employed text mining tools to distinguish positive and negative comments, which may reflect the moods of soccer fans during matches. This study showed that opinions of sports fans can be learned from Twitter, and the results were fairly close to the predictions of the disposition theory. Chae [29] used a collection of Twitter hashtags related to the supply chain to gain some insight into supply chain management. The presented model consists of four approaches: descriptive analytics, content analytics, integrated text mining and sentiment analysis, and network analytics. Some interesting and valuable conclusions were reached on the professional use of Twitter, the organizational use of Twitter, and supply chain research, respectively.

3. Methodology

Proposed by Hall [30], correlation-based feature selection (CFS) is a feature identification technique for determining the features that have a critical influence on the prediction classes. The influence of a feature is related to the correlation between the feature and the prediction class labels. The correlation function is represented as follows:

$$M_p = \frac{NV\,\overline{r_{qi}}}{\sqrt{NV + NV\,(NV - 1)\,\overline{r_{ii}}}}, \tag{1}$$

where $M_p$ is the degree of importance of a feature set p, NV is the number of features in the subset p, $\overline{r_{qi}}$ is the average correlation between the feature i in the subset p and the class q, and $\overline{r_{ii}}$ is the average intercorrelation between features. The best-first search algorithm [31] was employed to generate the appropriate feature subset, and the Weka [30, 32] software was utilized to perform CFS in this investigation.

The support vector machines [33, 34] model has been one of the most prevalent classification techniques in the past two decades. The support vector machines model was extended to cope with regression problems, and support vector regression [35–37] has become popular in solving function approximation problems. Both support vector machines and support vector regression have to solve quadratic programming problems, which is a time-consuming task. The least squares formulation overcomes this restriction by transforming the quadratic programming problem into a system of linear equations that can be solved directly. The least squares support vector regression (LSSVR) [38] model can be represented as follows:

$$\min_{w,\,b,\,e}\; J(w, e) = \frac{1}{2} w^{T} w + \frac{C}{2} \sum_{i=1}^{N} e_i^{2} \quad \text{subject to} \quad y_i = w^{T} \varphi(x_i) + b + e_i,\; i = 1, \ldots, N, \tag{2}$$

where $w$ is the weight vector or the normal of the hyperplane, $C$ is the penalty parameter that controls the balance between the minimization of the estimation error and the smoothness of the estimated function, $e_i$ is the error of the ith sample point, $\varphi(\cdot)$ is the nonlinear function mapping $x_i$ from the original space into a high-dimensional feature space, $b$ is the bias parameter, and $x_i$ and $y_i$ are the input data and output value, respectively.

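Because the optimization problem is difficult to solve directly, the Lagrange function is constructed, and the dual problem can be represented as follows:

$$L(w, b, e, \alpha) = \frac{1}{2} w^{T} w + \frac{C}{2} \sum_{i=1}^{N} e_i^{2} - \sum_{i=1}^{N} \alpha_i \left[ w^{T} \varphi(x_i) + b + e_i - y_i \right], \tag{3}$$

where $\alpha_i$ are the Lagrange multipliers.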

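The solution of the problem is obtained by setting all partial derivatives of the Lagrange function to zero, in accordance with the Karush–Kuhn–Tucker conditions [39–41]. The optimality conditions are as follows:

$$
\begin{aligned}
\frac{\partial L}{\partial w} &= 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i \varphi(x_i), \\
\frac{\partial L}{\partial b} &= 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i = 0, \\
\frac{\partial L}{\partial e_i} &= 0 \;\Rightarrow\; \alpha_i = C e_i, \\
\frac{\partial L}{\partial \alpha_i} &= 0 \;\Rightarrow\; w^{T} \varphi(x_i) + b + e_i - y_i = 0.
\end{aligned} \tag{4}
$$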

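By removing $w$ and $e_i$ from (4), the following linear equation can be obtained:

$$\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & K + C^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}, \tag{5}$$

where $y = [y_1, \ldots, y_N]^{T}$, $\mathbf{1} = [1, \ldots, 1]^{T}$, $\alpha = [\alpha_1, \ldots, \alpha_N]^{T}$, and $I$ is the identity matrix.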
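The elimination step can be sketched as follows, assuming the standard LSSVR optimality conditions $w = \sum_{j} \alpha_j \varphi(x_j)$, $\sum_{j} \alpha_j = 0$, and $\alpha_i = C e_i$. The symbols $\alpha_i$ (Lagrange multipliers), $C$ (penalty parameter), and $K_{ij}$ (kernel matrix entry) follow common usage and are assumptions wherever the text does not fix them. Substituting the first and third conditions into the equality constraint gives, for each sample $i$,

$$y_i = w^{T} \varphi(x_i) + b + e_i = \sum_{j=1}^{N} \alpha_j \varphi(x_j)^{T} \varphi(x_i) + b + \frac{\alpha_i}{C} = \sum_{j=1}^{N} K_{ij}\, \alpha_j + b + \frac{\alpha_i}{C}, \quad i = 1, \ldots, N.$$

Stacking these N equations together with the condition $\sum_{j} \alpha_j = 0$ yields N + 1 linear equations in the N + 1 unknowns $b$ and $\alpha_1, \ldots, \alpha_N$.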

$K$ is the kernel matrix, determined by

$$K_{ij} = \varphi(x_i)^{T} \varphi(x_j) = k(x_i, x_j), \tag{6}$$

where $k(\cdot, \cdot)$ indicates the kernel function satisfying Mercer’s condition [42].

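In this study, the radial basis function represented by (7) was employed as the kernel function:

$$k(x_i, x_j) = \exp\left( -\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}} \right), \tag{7}$$

where $\sigma$ is the kernel width. By solving (5), $\alpha$ and $b$ can be obtained, and the LSSVR function is represented as follows:

$$f(x) = \sum_{i=1}^{N} \alpha_i\, k(x, x_i) + b. \tag{8}$$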
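A minimal NumPy sketch of how the linear system (5) and the prediction function (8) could be implemented is given below. It is not the authors' implementation: the function names, the toy sine data, and the values `C_demo = 100` and `sigma_demo = 0.2` are illustrative assumptions, and the kernel is assumed to take the form exp(-||x_i - x_j||^2 / (2 sigma^2)). In the study itself, the two LSSVR parameters were determined by genetic algorithms rather than fixed by hand.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))


def lssvr_fit(X, y, C, sigma):
    """Solve the bordered linear system for the bias b and multipliers alpha."""
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0                   # first row: sum of alpha equals zero
    A[1:, 0] = 1.0                   # bias column
    A[1:, 1:] = K + np.eye(n) / C    # kernel matrix plus ridge term I / C
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]


def lssvr_predict(X_train, alpha, b, X_new, sigma):
    """Evaluate f(x) = sum_i alpha_i k(x, x_i) + b for each row of X_new."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b


# Toy example (assumed data, not from the study): fit one period of a sine.
C_demo, sigma_demo = 100.0, 0.2
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X[:, 0])
b, alpha = lssvr_fit(X, y, C_demo, sigma_demo)
pred = lssvr_predict(X, alpha, b, X, sigma_demo)
```

For any solution of this system, the multipliers sum to zero and each training residual equals alpha_i / C, which gives a quick consistency check on an implementation.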

4. The Proposed Stock Market-Forecasting Framework and Numerical Examples

4.1. The Proposed Framework

Figure 1 shows the framework of this study. Three major types of data, namely, data from Google Trends, historical trading data, and hybrid data, were gathered in this study. When Google Trends data are used as independent attributes for making a forecast, the choice of related search keywords strongly influences the forecasting results. Thus, in this study, keywords of Google Trends were collected in three ways: users’ definitions (GTU), trending searches of Google Trends (GTTS), and tweets (GTT). Firstly, for collecting GTU data, users specified keywords subjectively with some domain knowledge or intuition. Secondly, keywords of Google Trends were gathered by the GTTS approach. Google Trends has a way to calculate the activity levels of keywords, namely, trending searches of Google Trends. When a specific term is considered, the results show other related keywords ranked from the highest activity level to the lowest one, and users can select keywords according to this ranking. The third way of generating keywords for Google Trends is the GTT method, which collects texts on Twitter. To obtain keywords for Google Trends from Twitter, the “word clusters” tool provided by KH Coder [43] was employed in this study to select the first 100 terms according to the scores calculated. For all three methods of generating keywords, only keywords with Google Trends scores were used as independent variables to forecast stock markets in this study; some keywords have no scores because of their low search frequencies. Three hybrid data sets, shown in Table 1, were generated by combining the historical data set with the three data sets of Google Trends. Hybrid data I, hybrid data II, and hybrid data III represent historical data combined with data of GTU, GTTS, and GTT, respectively.

Then, the correlation-based feature selection technique was performed to determine the essential independent variables for predicting stock markets. Since the GTU data and the historical trading data have only a small number of features, all data sets except the GTU data and the historical trading data were processed by the feature selection procedure. Therefore, a total of 12 types of independent variables were used in this study to forecast stock markets. A one-step-ahead policy was employed to predict values of stock markets for all data sets. All 12 types of data were divided into three parts, namely, training data, validation data, and testing data, for LSSVR models to predict five stock markets. The training and validation data were used to select the LSSVR models, and the testing data were utilized to evaluate the forecasting performance of the LSSVR models. In addition, genetic algorithms [44] were employed to determine the parameters of the LSSVR models [45]. The mean absolute percentage error (MAPE) and mean absolute error (MAE) were used to measure the performance of the LSSVR models. The MAPE can be represented as follows:

$$\mathrm{MAPE}(\%) = \frac{100}{N} \sum_{t=1}^{N} \left| \frac{A_t - F_t}{A_t} \right|, \tag{9}$$

where $N$ is the number of forecasting periods, $A_t$ is the actual value at period $t$, and $F_t$ is the forecasting value at period $t$.
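The two error measures can be computed as in the short sketch below. The MAPE follows the definition above as a percentage over N forecasting periods; the MAE is written in its standard form, the mean of |A_t - F_t|, which is an assumption because its formula is not stated in the text. The function names and the three-point example are illustrative only.

```python
import numpy as np


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))


def mae(actual, forecast):
    """Mean absolute error, in the units of the series."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs(actual - forecast))


# Illustrative values (not from the study).
actual_values = [100.0, 200.0, 400.0]
forecast_values = [110.0, 190.0, 400.0]
mape_value = mape(actual_values, forecast_values)  # (10% + 5% + 0%) / 3 = 5%
mae_value = mae(actual_values, forecast_values)    # (10 + 10 + 0) / 3
```

Because the MAPE is scale-free, it allows comparison across indices with very different levels, whereas the MAE stays in the units of each series.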

4.2. Numerical Examples

Five daily data sets of stock markets, Dow Jones Industrial Average Index (DJIA), Russell 2000 Index (RUT), Standard & Poor’s 500 Index (S&P 500), Volatility Index (VIX), and Nasdaq Composite Index (IXIC), and three companies, the Apple corporation (APPL), the Alphabet corporation (GOOGL), and the Microsoft Corporation (MSFT), obtained from Yahoo Finance (http://finance.yahoo.com) were employed in this study. The data from Google Trends and historical trading data of the current working day were used to predict the stock market values or stock prices of the next working day. Owing to a functional limitation of Google Trends, daily search data can be collected only within a time horizon of 270 days. Within this horizon, excluding weekends and national holidays, the data of working days were gathered, and a one-step-ahead policy was employed to predict values of stock markets for all data sets. The Google Trends data and historical trading data cover the period from June 14, 2016, to March 9, 2017, and were divided into the training data set (from June 14, 2016, to December 9, 2016), the validation data set (from December 12, 2016, to January 25, 2017), and the testing data set (from January 26, 2017, to March 9, 2017). The training, validation, and testing data sets contain 126, 30, and 30 observations, respectively. For the data from Google Trends, three types of data, namely, GTU data, GTTS data, and GTT data, were used in this study. The Google Trends search keywords determined by users, trending searches, and tweets are listed in Tables 2–4, respectively. When the GTT data were collected, the terms for the five stock markets and three corporations, namely, Dow Jones Industrial Average Index, Russell 2000 Index, S&P 500, Volatility Index, Nasdaq Composite Index, APPL, GOOGL, and MSFT, were searched by the Twitter search engine, and related tweets were determined. 
Then, KH Coder [43] was used as a text mining tool to select terms from the tweets. The top 100 terms provided by KH Coder were put into the Google Trends search. Not all keywords selected by KH Coder could be observed in the Google Trends search because of insufficient search volume. Subsequently, CFS was performed to select the essential keywords of Google Trends determined by trending searches and by tweets. The results are shown in Tables 5 and 6, respectively.
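To illustrate what the CFS step computes, the sketch below evaluates a merit score of the form NV · r_cf / sqrt(NV + NV(NV − 1) · r_ff), where r_cf is the average feature-target correlation and r_ff is the average feature-feature correlation of a subset. Several simplifications are assumed: absolute Pearson correlation is used as the correlation measure, a plain greedy forward search stands in for the best-first search used in the study, and the data are synthetic. It is not a reproduction of the Weka implementation.

```python
import numpy as np


def cfs_merit(X, y, subset):
    """CFS merit of a feature subset, using absolute Pearson correlations."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([
        abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
        for pos, i in enumerate(subset)
        for j in subset[pos + 1:]
    ])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)


def greedy_forward_cfs(X, y):
    """Add the feature that most improves the merit until no feature helps."""
    remaining = list(range(X.shape[1]))
    selected, best = [], 0.0
    while remaining:
        score, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if score <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best


# Synthetic example: two noisy copies of the signal and one pure-noise column.
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
X_demo = np.column_stack([
    signal + 0.3 * rng.normal(size=200),
    signal + 0.3 * rng.normal(size=200),
    rng.normal(size=200),
])
selected, best_merit = greedy_forward_cfs(X_demo, signal)
```

The merit rewards features that correlate with the target and penalizes redundancy among the selected features, which is why the pure-noise column is left out in this example.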

Five variables, including opening values, maximum values, minimum values, closing values, and trading volume, were used as input variables, and the closing values of the next day were used as the predicted variable [7–9, 12, 46]. Three types of hybrid data were used to predict stock markets. Tables 7–9 show the keywords and historical data attributes selected by CFS for the three hybrid data sets. Tables 10–17 report the testing MAPE and MSE values and the two LSSVR parameters of the different data types for predicting the five stock markets and three corporations. The point-to-point comparisons of actual and predicted values obtained by using the various data to forecast values of stock markets and corporations are presented in Figures 2–9. The experimental results revealed that using hybrid data with LSSVR models does improve forecasting performance on the closing values of the five stock markets and three corporations.
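The one-step-ahead setup and the chronological split described in this section can be sketched as follows. The 126/30/30 partition is taken from the text; everything else is assumed for illustration, including the function names, the synthetic series, and the choice of 187 trading days so that pairing each day with the next day's close yields 186 samples.

```python
import numpy as np


def one_step_ahead_pairs(features, close):
    """Pair the features of day t with the closing value of day t + 1."""
    features = np.asarray(features, dtype=float)
    close = np.asarray(close, dtype=float)
    return features[:-1], close[1:]


def chronological_split(X, y, n_train, n_val, n_test):
    """Split samples in time order, without shuffling."""
    if n_train + n_val + n_test != len(X):
        raise ValueError("split sizes must sum to the number of samples")
    a, b = n_train, n_train + n_val
    return (X[:a], y[:a]), (X[a:b], y[a:b]), (X[b:], y[b:])


# Synthetic stand-in for one index (not real market data).
n_days = 187
close = 100.0 + np.arange(n_days, dtype=float)
features = np.column_stack([close, 0.5 * close])  # hypothetical input columns
X_all, y_all = one_step_ahead_pairs(features, close)
train, val, test = chronological_split(X_all, y_all, 126, 30, 30)
```

Keeping the split in time order means the validation and testing periods always lie after the training period, so no future information is used when fitting the model.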

5. Conclusions

Many forecasting models have been proposed for stock market forecasting in the past decades. With the rise of social networking and Internet search tools, the types of data employed for predicting stock markets have diversified. This study proposed a framework to explore the influence of Internet search trends, historical trading data, and hybrid data on the prediction of stock markets by least squares support vector regression models. Numerical experiments indicate that using hybrid data can provide satisfactory forecasting results. The superior performance of the proposed framework is most likely due to combining the distinct advantages of Internet search data and historical trading data. Empirically, the Google data may capture a part of the nonlinear data patterns [47], and therefore, the variety of the data has a chance to improve the forecasting performance. The promising results achieved in this study reveal the potential of the proposed framework for forecasting stock markets.

Keywords of Google Trends significantly affect the forecasting accuracy: Naccarato et al. [48] pointed out that the selection of keywords results in different data sets for analysis and thus generates different numerical results. This study provided three ways, namely, users’ definitions, trending searches of Google Trends, and tweets, to determine keywords for Google Trends. The three ways can be easily and systematically reproduced for future use. More advanced techniques for determining appropriate keywords for Google Trends could be an essential direction for future study. In addition, numerical examples from developed markets were employed to illustrate the proposed framework. For emerging markets, owing to the restrictions on the languages used for Twitter and Google Trends, some hurdles have to be overcome before the performance of the proposed framework can be analyzed.

Data Availability

The data used to support the findings of this study are available through the website links given in the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

The authors would like to thank the Ministry of Science and Technology of the Republic of China, Taiwan, for financially supporting this research under Contract nos. MOST 103-2410-H-260-020, MOST 104-2410-H-260-018, and MOST 105-2410-H-260-017-MY2. The authors acknowledge Hsiao-Ting Hsu, Pao Hsiung Huang, Fang-Ru He, Chia-Hsin Liu, and Yi-Ting Huang, who assisted with data collection and analysis.