Abstract
Investor sentiment has been widely used in the research of the stock market, and how to accurately measure investor sentiment is still being explored. With the rise of social media, investor sentiment is no longer only influenced by macroeconomic data and news media, but also guided by We-Media and fragmented information. We take the data of China A-shares from January 2020 to December 2020 as the research object and propose a stock price prediction method that combines investor sentiment with multisource information. Firstly, the sentiment of macroeconomic data, brokerage research reports, news, and We-Media is calculated, respectively, and then the investor sentiment vector combining multisource information is obtained by the multilayer perceptron. Finally, the LSTM model is used to represent the stock time series characteristics. The results show that (1) the proposed algorithm is superior to the benchmark algorithm in terms of accuracy and F1-score, (2) investor sentiment vector can effectively measure the investment sentiment of stocks, and (3) compared with vector concatenation, multilayer perceptron can better represent investor sentiment.
1. Introduction
Behavioral finance, which is derived from finance, psychology, communication, and behavioral science, believes that the stock price is not only determined by the intrinsic value of an enterprise but is largely influenced by the psychology and behavior of investors [1]. The idea in behavioral finance is that investors in markets are not completely rational people. In the process of investment decision-making, investors often cannot make a correct and reasonable judgment due to factors such as emotional preference and cognitive bias. In other words, investor sentiment reflects investor behavior and affects the final investment decision to some extent. Researchers try to explain market behavior from the perspective of investors. To verify the effectiveness of investor sentiment, Akerlof and Shiller [2] found a close relationship between investor sentiment and stock price by studying the volatility of investor sentiment and stock price. You and Wu [3] used the “spiral of silence” theory in media effect research of communication to study the impact of sentiment index on stock asset pricing from the perspective of media. Sentiment, as a factor affecting investors’ psychological activities and then their behavior, has gradually become an important research issue in the task of stock price prediction.
Investor sentiment plays an important role in stock price forecasting. Song et al. [4] proposed a method for predicting stock excess returns that integrates research reports and investor sentiment and verified on the Chinese A-share market that it effectively improves forecast accuracy. Li et al. [5] conducted a similar study on the Hong Kong stock market. Polk and Sapienza [6] showed that investor sentiment is similar to mispricing behavior in the stock market. Others hold that investor sentiment is formed by the wrong estimation of asset value, which to a certain extent indicates the speculative propensity of investors [7]. Although no unified definition of investor sentiment has been reached, the different definitions agree that investor sentiment is an expectation of future stock returns and that, because investors behave irrationally and do not rely entirely on fundamental analysis, their expectations deviate to some degree [8].
In current research, measures of investor sentiment can be divided into three categories. The first is the direct measurement method, which uses indicators obtained from market surveys to directly replace investor sentiment. The second is the indirect measurement method, which uses single economic variables and combination variables as proxy variables to measure investor sentiment. The third uses machine learning to extract online text information from social media and then constructs an investor sentiment index. The information explosion and fragmented nature of the age of big data make it inadequate to use any one of these measures alone. In our opinion, the measurement of investor sentiment should take into account four factors simultaneously: macroeconomic conditions, brokerage research reports, news, and We-Media information. Based on this, we propose a stock price prediction method based on investor sentiment that fuses multisource information. First, the sentiment of macroeconomic data, brokerage research reports, news, and We-Media is calculated; then the multisource information is fused by a multilayer perceptron into the ISV (Investor Sentiment Vector); finally, the LSTM model is used to represent the stock time series characteristics. The contributions of this paper are as follows:
(1) An investor sentiment measurement method integrating multisource information is proposed
(2) The positive role of investor sentiment in the stock prediction task is verified
(3) A stock price prediction framework is proposed based on deep learning
The rest of this paper is organized as follows. Section 2 reviews investor sentiment measurement and its relationship with stock prices. Section 3 introduces our proposed method. Section 4 presents the experimental setup and details. Section 5 presents the experimental results and discussion. Section 6 gives our conclusions.
2. Related Works
With the continuous development of the Internet, social media has given users a new platform to search for information, express their feelings, and exchange opinions. Using social media indices as a proxy has also become a convenient way to capture investor sentiment in the market. Da et al. [9] constructed an investor sentiment index from Google search keywords and found that the index could predict the short-term return and volatility of stocks. Meng et al. [10] used the Baidu search index to measure investor sentiment and found that investor sentiment has a linkage mechanism with the stock market. Although such quantitative indicators can reflect investors’ attention to the stock market, they struggle to capture more in-depth investor sentiment information [11].
With the rise of big data, text mining, machine learning, and sentiment analysis technologies, researchers can more quickly and accurately extract valuable information from texts for the construction of investor sentiment [12]. Oliveira et al. [13] show that investor sentiment extracted from social media platforms has a certain impact on stock prices and that social media provides abundant data for constructing investor sentiment. Bollen and Mao [14] compared the predictive ability of traditional investor sentiment metrics with that of social media and found that sentiment indicators extracted from social media predict better. Sentiment indicators obtained from text analysis of social media content have been widely used in stock market prediction, but the research conclusions are inconsistent [15]. Ma and Zhang [16] attributed the inconsistent conclusions to differences in sample data selection and in the accuracy of investor sentiment measurement. At present, research is no longer limited to judging whether investor sentiment can predict the stock market. How to extract valuable information from a large amount of data and apply it to the construction of an investor sentiment index has become the focus of research.
Pröllochs et al. [17] analyzed the information in financial news media and found that the sentiment of negative sentences in financial news is correlated with stock prices. In terms of information usefulness, Sprenger et al. [18] point out that many professional and amateur investors and analysts use Twitter to post news comments and opinions, usually more frequently than professional news media. In terms of the speed of information transmission, Sul et al. [19] believe that investors’ emotions transmitted through social media are more likely to affect stock prices quickly, while investors’ emotions that spread more slowly take longer to affect stock prices and are more likely to predict prices in the next few days. In addition to Twitter, StockTwits [20] and Yahoo Finance [21, 22] are also used to mine investor sentiment.
3. Method
We propose a stock price prediction method consisting of an investor sentiment module and a stock prediction module. The investor sentiment module separately calculates four dimensions (macroeconomic status, brokerage report sentiment, news sentiment, and We-Media sentiment) through different methods and then obtains the ISV with an MLP (multilayer perceptron) [23]. The stock prediction module consists of an LSTM [24], where the first input of the LSTM is investor sentiment and the subsequent inputs are stock prices. The method flow is shown in Figure 1.

3.1. Investor Sentiment Vector
3.1.1. Macroeconomic Status
MS (macroeconomic status) includes market status and economic status. Market status is measured by five indicators: the transaction volume (VOLUME), the number of new investor accounts (NEWIN), the consumer confidence index (CCI), the closed-end fund discount rate (FUND), and the market turnover rate (HS_TVR). Economic status is measured by four indicators: the consumer price index (CPI), the amount of new credit (IC), the economic growth rate (GDP), and the money supply (M2). The macroeconomic status is measured in two steps.
The first step is to calculate the preliminary sentiment index, Sentiment1. Specifically, the market status indicators are first standardized; principal component analysis is then performed on them, and the three principal components that explain the most variance are selected, with their explained variance used as weights; finally, the weighted average of the factor loadings gives the coefficients of Sentiment1.
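The second step is to control for the influence of the economic status indicators by regressing Sentiment1 on them, as shown in formula (2). The residual is the measurement index of macro sentiment (MS).
$$\mathrm{Sentiment1}_m = \alpha + \beta_1 \mathrm{CPI}_m + \beta_2 \mathrm{IC}_m + \beta_3 \mathrm{GDP}_m + \beta_4 \mathrm{M2}_m + \varepsilon_m \tag{2}$$
Here, $m$ is the month, $\alpha$ is a constant, $\beta_1 \sim \beta_4$ are the regression coefficients to be estimated, and the residual $\varepsilon_m$ is the MS of month $m$.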
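The two steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `macro_sentiment`, the use of the correlation matrix for the principal components, and the random placeholder data are ours, not details given in the paper.

```python
import numpy as np

def macro_sentiment(market, economy, n_components=3):
    """Return (sentiment1, ms) for monthly indicator matrices.

    market:  (months, 5) array, e.g. VOLUME, NEWIN, CCI, FUND, HS_TVR
    economy: (months, 4) array, e.g. CPI, IC, GDP, M2
    """
    # Step 1a: standardize each market indicator (z-score per column).
    z = (market - market.mean(axis=0)) / market.std(axis=0)

    # Step 1b: principal components from the correlation matrix.
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 1c: weight the loadings of the kept components by their
    # share of explained variance (assumed weighting scheme).
    weights = eigvals / eigvals.sum()
    coeffs = eigvecs @ weights            # one coefficient per indicator
    sentiment1 = z @ coeffs

    # Step 2: regress Sentiment1 on the economic indicators plus a
    # constant; the OLS residual is taken as the macro sentiment MS.
    X = np.column_stack([np.ones(len(economy)), economy])
    beta, *_ = np.linalg.lstsq(X, sentiment1, rcond=None)
    ms = sentiment1 - X @ beta
    return sentiment1, ms

# Placeholder data: 13 months of random indicators, not real WIND data.
rng = np.random.default_rng(0)
market = rng.normal(size=(13, 5))
economy = rng.normal(size=(13, 4))
sentiment1, ms = macro_sentiment(market, economy)
```

Because the regression includes a constant, the residual series has zero mean and is uncorrelated with each economic indicator, which is what "controlling for" the economic status means here. Note that the sign of each eigenvector is arbitrary, so a real implementation would need a convention for orienting the components.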
3.1.2. Brokerage Report Sentiment
Referring to the method proposed in [4], we first split the brokerage research report into attention and rating sentiment and then take the product of the two as the BRS (brokerage report sentiment), as shown in the following equation:
$$\mathrm{BRS}_{k,t} = \mathrm{ATT}_{k,t} \times \mathrm{RS}_{k,t}$$
where $\mathrm{ATT}_{k,t}$ is the attention and $\mathrm{RS}_{k,t}$ is the rating sentiment of stock k on day t.
The difference is that attention in this paper is calculated daily (in [4], it is calculated monthly). Specifically, the attention index is the ratio of the number of reports on stock k on a given day to the total number of reports on all A-share stocks on that day:
$$\mathrm{ATT}_{k,t} = \frac{N_{k,t}}{N_t}$$
where $N_{k,t}$ is the total number of reports of stock k on day t and $N_t$ is the total number of reports of all stocks in the A-share market on day t.
The report rating sentiment of stock k comprehensively considers two factors: the base rating and the rating change. The values assigned to the base rating and the rating change are shown in Table 1, and the rating sentiment is calculated from these two values. When there are multiple ratings for an individual stock in a single day, the research report ratings are averaged.
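As a small worked illustration of the BRS computation, the sketch below multiplies the daily attention ratio by the averaged rating score. The function names and the per-report scores are hypothetical; the actual values come from Table 1, which is not reproduced here.

```python
def attention(reports_k, reports_all):
    """Share of the day's reports that cover stock k."""
    return reports_k / reports_all if reports_all else 0.0

def rating_sentiment(scores):
    """Average the per-report rating scores for one stock on one day."""
    return sum(scores) / len(scores) if scores else 0.0

def brs(reports_k, reports_all, scores):
    """Brokerage report sentiment = attention x rating sentiment."""
    return attention(reports_k, reports_all) * rating_sentiment(scores)

# Hypothetical day: 4 of 200 reports cover stock k, with made-up scores.
value = brs(4, 200, [1.0, 0.5, 1.0, 0.5])
```

With these made-up inputs the attention is 4/200 and the averaged rating is 0.75, so a heavily covered stock with favorable ratings receives a larger BRS than a rarely covered one with the same ratings.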
3.1.3. News Sentiment
Online news articles are long texts, and headlines alone cannot express their content accurately and completely. Therefore, we first generate a summary of the news to capture the intent of the text and then calculate the NS (news sentiment).
(1) News Summary Generation. We use the architecture of Seq2Seq [25] to generate the summary, where the encoder takes a sequence as input, encodes the information in the sequence as a semantic vector, and then outputs the summary text through the decoder. The model is shown in Figure 2.

The encoder is a bidirectional long short-term memory (Bi-LSTM) network. The input news is represented as $x = (x_1, \ldots, x_n)$, and we encode $x$ into hidden state vectors $h = (h_1, \ldots, h_n)$. In particular, $h_n$ is the result of merging the last hidden states of the two directions. The decoder is an LSTM whose initial state is the output of the encoder. At step t, the decoder receives the previous decoder state $s_{t-1}$ and the previously generated token $y_{t-1}$, and the current decoder state $s_t$ is calculated as follows:
$$s_t = \mathrm{LSTM}(s_{t-1}, y_{t-1})$$
This method only uses $h_n$ to connect the encoder and decoder, so the encoder must compress the entire sequence into a fixed-length vector, which limits its capacity. As the input sequence grows longer, the information entered first is diluted by the information entered later. For better decoding, we use the attention mechanism [26] to guide the decoder in generating the next word through a probability distribution over the source words. The attention distribution $a_t$ can be calculated from the encoder hidden states $h_i$ and the decoder state $s_t$:
$$e_{t,i} = v^{\top} \tanh(W_h h_i + W_s s_t + b_a), \qquad a_t = \mathrm{softmax}(e_t)$$
where $v$, $W_h$, $W_s$, and $b_a$ are learnable parameters, and $a_t$ is used to compute the context vector $c_t$:
$$c_t = \sum_{i=1}^{n} a_{t,i} h_i$$
The context vector $c_t$ contains decoding information, and we finally get the probability distribution of the output words through $c_t$:
$$P(y_t) = \mathrm{softmax}(W_o [s_t; c_t] + b_o)$$
where $W_o$ and $b_o$ are learnable parameters.
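A single attention step of the decoder can be illustrated as follows. The sketch assumes an additive scoring function, which is a common choice but is not confirmed by the text; all parameter names, dimensions, and random values are placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

def attention_step(H, s_t, W_h, W_s, b, v):
    """One decoder attention step.

    H:   (n, d_h) encoder hidden states, one row per source token
    s_t: (d_s,)   current decoder state
    Returns the attention distribution a_t (n,) and context c_t (d_h,).
    """
    # Additive score for every source position (assumed form).
    scores = np.tanh(H @ W_h.T + s_t @ W_s.T + b) @ v
    a_t = softmax(scores)     # probability distribution over source words
    c_t = a_t @ H             # attention-weighted sum of encoder states
    return a_t, c_t

rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 6, 8, 8, 5   # placeholder sizes
H = rng.normal(size=(n, d_h))
s_t = rng.normal(size=d_s)
W_h = rng.normal(size=(d_a, d_h))
W_s = rng.normal(size=(d_a, d_s))
b = np.zeros(d_a)
v = rng.normal(size=d_a)
a_t, c_t = attention_step(H, s_t, W_h, W_s, b, v)
```

Because the context vector is recomputed at every decoding step, the decoder no longer depends on a single fixed-length summary of the whole article, which is the limitation the paragraph above describes.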
(2) News Sentiment Computing Based on Rules. Referring to the study of Qi [27], we construct semantic rules to uncover the real emotions of semantic words in different contexts. Specifically, the text is divided into multiple clusters according to the number of semantic words, and the emotional value of the text is the sum of the semantic values of the clusters. The calculation formulas distinguish the sentiment value of a cluster without negative words from that of a cluster with negative words and depend on the following quantities: S, the semantic value of the emotional word; the degree value of the degree adverb that modifies the semantic word; the degree value of the degree adverb that modifies the negative word; and the number of negative words. Then, the average sentiment of all the news of the day is calculated, namely, the news sentiment (NS).
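The rule-based scoring can be illustrated with a toy example. This is a simplified sketch, not the rule set of [27]: the tiny English lexicon, the degree values, and the assumption that each negative word flips the sign of the cluster are all ours, and the sketch does not distinguish a degree adverb that modifies the negative word from one that modifies the semantic word.

```python
# Toy dictionaries; a real system would use a Chinese financial lexicon.
NEGATIONS = {"not", "no", "never"}
DEGREE = {"slightly": 0.5, "very": 1.5, "extremely": 2.0}
LEXICON = {"good": 1.0, "strong": 1.0, "bad": -1.0, "weak": -1.0}

def cluster_value(modifiers, word):
    """Score one cluster: a sentiment word and the modifiers before it."""
    value = LEXICON[word]
    negations = 0
    for m in modifiers:
        if m in NEGATIONS:
            negations += 1
        elif m in DEGREE:
            value *= DEGREE[m]
    # Assumption: every negative word flips the polarity once.
    return value * (-1) ** negations

def text_sentiment(tokens):
    """Sum the cluster values found in a tokenized text."""
    total, modifiers = 0.0, []
    for tok in tokens:
        if tok in LEXICON:
            total += cluster_value(modifiers, tok)
            modifiers = []
        elif tok in NEGATIONS or tok in DEGREE:
            modifiers.append(tok)
        else:
            modifiers = []    # an unrelated word ends the modifier run
    return total

def daily_sentiment(texts):
    """Average the text-level scores of all news items of one day."""
    scores = [text_sentiment(t.lower().split()) for t in texts]
    return sum(scores) / len(scores) if scores else 0.0

ns = daily_sentiment(["Earnings are very strong", "Outlook is not good"])
```

In this toy example the first headline scores 1.5 (degree adverb times a positive word) and the second scores -1.0 (one negation flips a positive word), so the daily average is 0.25.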
3.1.4. We-Media Sentiment
The data we chose to calculate WMS (We-Media sentiment) came from the stock BBS (Bulletin Board System). Retail investors communicate in the form of posts and replies, and the information they publish is usually short. After analysis, we believe that a post, including the post information and the reply information, represents the investment sentiment, so we combine the short text of the post and reply into a long text and calculate the sentiment of the long text. Refer to formulas (11)–(13) for WMS calculation.
3.1.5. Multisource Information Fusion
We use a multilayer perceptron [23] to fuse the four sentiment outputs and then use the aggregate vector as the first input of the stock prediction LSTM. The ISV is calculated as follows:
$$\mathrm{ISV} = \mathrm{MLP}(\mathrm{MS}, \mathrm{BRS}, \mathrm{NS}, \mathrm{WMS})$$
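A minimal sketch of the fusion step is given below. The paper does not specify the number of layers, the layer widths, or the activations, so the single hidden layer, the ReLU and tanh activations, the output size of 16, and the treatment of each indicator as a scalar are all assumptions made for illustration.

```python
import numpy as np

def mlp_fuse(ms, brs, ns, wms, params):
    """Fuse four sentiment indicators into one investor sentiment vector."""
    W1, b1, W2, b2 = params
    x = np.array([ms, brs, ns, wms], dtype=float)
    hidden = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer (assumed)
    return np.tanh(W2 @ hidden + b2)        # bounded output vector (assumed)

# Randomly initialized placeholder weights; in practice they are learned
# jointly with the downstream LSTM.
rng = np.random.default_rng(0)
hidden_dim, isv_dim = 8, 16
params = (
    rng.normal(scale=0.5, size=(hidden_dim, 4)),
    np.zeros(hidden_dim),
    rng.normal(scale=0.5, size=(isv_dim, hidden_dim)),
    np.zeros(isv_dim),
)
isv = mlp_fuse(0.3, 0.015, 0.25, -0.1, params)
```

Unlike plain concatenation, the learned layers let the model weight the sources differently and capture interactions between them before the result is passed to the LSTM.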
3.2. Stock Price Prediction Model Based on LSTM
The LSTM takes the ISV produced by the investor sentiment module as the input of time step −1 to guide the prediction of subsequent stock prices, and the output of the LSTM at time step t − 1 is the input of time step t. In the training phase, for the ISV $V$ and the stock prices $P = (P_0, \ldots, P_N)$, the probability of the stock price of the next trading day is predicted as follows:
$$x_{-1} = V, \qquad x_t = P_t, \quad t \in \{0, \ldots, N-1\}, \qquad p_{t+1} = \mathrm{LSTM}(x_t)$$
where $x_{-1}$ and $x_t$, respectively, represent the inputs of the LSTM at time steps −1 and t, $P_0$ represents the closing price of the stock on the first trading day, and $P_N$ represents the closing price of the stock on the last trading day. The loss function of the whole model is as follows:
$$L(V, P) = -\sum_{t=1}^{N} \log p_t(P_t)$$
where $V$ represents the ISV.
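The idea of feeding the sentiment vector to the LSTM before the price sequence can be sketched with a hand-written LSTM cell. Everything concrete here is an assumption for illustration: the dimensions, the projection `W_e` that lifts a scalar price to the input size, the three-class output head, and the random untrained weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c = f * c + i * g          # update the cell state
    h = o * np.tanh(c)         # expose the gated hidden state
    return h, c

def predict_direction(isv, prices, p):
    """Return probabilities over (negative, flat, positive)."""
    hidden = p["U"].shape[1]
    h, c = np.zeros(hidden), np.zeros(hidden)
    # First step: the sentiment vector conditions the LSTM state.
    h, c = lstm_step(isv, h, c, p["W"], p["U"], p["b"])
    # Later steps: each scalar price is lifted to the input size by W_e.
    for price in prices:
        h, c = lstm_step(p["W_e"] * price, h, c, p["W"], p["U"], p["b"])
    logits = p["W_out"] @ h + p["b_out"]
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D, H = 16, 32                  # placeholder input and hidden sizes
p = {
    "W": rng.normal(scale=0.1, size=(4 * H, D)),
    "U": rng.normal(scale=0.1, size=(4 * H, H)),
    "b": np.zeros(4 * H),
    "W_e": rng.normal(scale=0.1, size=D),
    "W_out": rng.normal(scale=0.1, size=(3, H)),
    "b_out": np.zeros(3),
}
isv = rng.normal(size=D)
prices = np.array([0.00, 0.01, -0.02, 0.03, 0.01])  # made-up daily returns
probs = predict_direction(isv, prices, p)
```

Since the sentiment vector enters only at the first step, its influence on the hidden state has to survive every later update, which is relevant when the observation window becomes long.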
4. Experiment
4.1. Data and Preprocessing
The experimental data were selected from January 1, 2020, to December 31, 2020, excluding new shares and long-term suspended stocks. All the web texts are captured by the Scrapy crawler framework and preprocessed by word segmentation and removal of stop words.
(1) Macroeconomic data: considering the lag of macroeconomic data, the data is selected from September 2019 to September 2020, and the data source is the WIND database (https://www.wind.com.cn/NewSite/edb.html).
(2) Brokerage research reports: a total of 32,724 reports on 2,365 A-share companies released by 63 securities institutions are included. Data on the number of published reports, date of release, title, basic rating, and change of rating were obtained from East-money (https://www.eastmoney.com/). The brokerage rating confidence data are shown in Table 2.
(3) News: we selected the news on the four authoritative websites of China Securities Network (http://www.cs.com.cn/), Sina Finance (https://finance.sina.com.cn/), Netease Finance (https://money.163.com/), and Securities Times (http://www.stcn.com/) as news data sources and captured the content including news headlines, release time, and news content. After sorting, a total of 96,532 news articles were obtained.
(4) We-Media: the We-Media data came from Guba (https://guba.eastmoney.com/) and Xueqiu (https://xueqiu.com/), two BBS where Chinese retail investors discuss stocks. The number of “We-Media” texts after splicing is 183,938.
4.2. Baseline
In this paper, stock returns are used as the research object to predict whether the returns obtained by individual stocks in a certain period in the future are positive, negative, or flat. To ensure robustness, all data are standardized according to the returns of the market in this period. To measure the advantages and disadvantages of the model from different perspectives, this paper selected SVM, LSTM model, RrmsNet [4], and SenticNet [5] as the benchmark methods for comparison with our work.
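One way to turn window returns into the three target classes is sketched below. The paper does not state how returns are standardized against the market or how "flat" is defined, so subtracting the market return and using a symmetric band of 0.5% (`flat_band`) are assumptions, and the price series are made up.

```python
def window_return(prices, start, window):
    """Simple return from day `start` to day `start + window`."""
    return prices[start + window] / prices[start] - 1.0

def label(stock_ret, market_ret, flat_band=0.005):
    """Classify the market-adjusted return as positive, negative, or flat."""
    excess = stock_ret - market_ret   # assumed form of standardization
    if excess > flat_band:
        return "positive"
    if excess < -flat_band:
        return "negative"
    return "flat"

# Made-up closing prices for one stock and a market index over 6 days.
stock = [10.0, 10.2, 10.1, 10.4, 10.6, 10.8]
market = [3000.0, 3010.0, 3005.0, 3020.0, 3030.0, 3030.0]
y = label(window_return(stock, 0, 5), window_return(market, 0, 5))
```

Here the stock gains about 8% while the index gains 1%, so the 5-day label is "positive".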
4.3. Metrics
In the experiment, accuracy and F1-score are adopted to evaluate the performance of each method. Let $N$ denote the total number of samples and $N_i$ denote the number of samples whose true label is $i$. These metrics are defined as follows:
$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i} n_{ii}, \qquad F1 = \frac{2PR}{P + R}$$
where $n_{ij}$ is the number of samples whose true label is $i$ and whose predicted label is $j$, and $P$ and $R$ are precision and recall; for class $i$, $P_i = n_{ii} / \sum_{j} n_{ji}$ and $R_i = n_{ii} / N_i$.
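The metrics can be computed directly from the confusion counts, as in the sketch below. How the per-class F1 values are combined into a single score is not stated in the text; the unweighted (macro) average used here is an assumption, and the labels and predictions are made up.

```python
def confusion(y_true, y_pred, labels):
    """n[i][j] = number of samples with true label i and predicted label j."""
    idx = {lab: k for k, lab in enumerate(labels)}
    n = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        n[idx[t]][idx[p]] += 1
    return n

def accuracy(n):
    total = sum(sum(row) for row in n)
    return sum(n[i][i] for i in range(len(n))) / total

def per_class_f1(n):
    scores = []
    for i in range(len(n)):
        predicted_i = sum(n[j][i] for j in range(len(n)))  # column sum
        true_i = sum(n[i])                                 # row sum, N_i
        p = n[i][i] / predicted_i if predicted_i else 0.0
        r = n[i][i] / true_i if true_i else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return scores

def macro_f1(n):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    f1 = per_class_f1(n)
    return sum(f1) / len(f1)

labels = ["positive", "negative", "flat"]
y_true = ["positive", "positive", "negative", "flat", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "flat", "negative", "positive"]
n = confusion(y_true, y_pred, labels)
acc, f1 = accuracy(n), per_class_f1(n)
```

In this example five of six predictions are correct, so the accuracy is 5/6, while the single misclassified "positive" sample lowers the F1 of both the "positive" and the "negative" class to 0.8.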
5. Results
5.1. Main Result
Because information dissemination is time-sensitive, we choose 5-day, 15-day, and 30-day windows for experimental observation. Table 3 shows the detailed results of the comparative experiment. In general, the proposed method achieves the best results in both accuracy and F1-score, which shows that investor sentiment vectors combined with multisource information can effectively improve the performance of stock price prediction.
Figure 3 shows the accuracy and F1-score for different time windows. It can be seen from the figure that the accuracy and F1-score of all methods decline as the window lengthens. Taking our method as an example, the accuracy at 5, 15, and 30 days is 0.749, 0.693, and 0.668, respectively, and the F1-score is 0.723, 0.699, and 0.641, respectively. There are two reasons for this. First, all methods, whether or not they include third-party information, predict future stock prices from historical stock prices. As a result, the longer the horizon, the greater the uncertainty of the prediction. Second, investor sentiment based on the comprehensive calculation of different information is essentially an expression of information dissemination, so its influence on stock price prediction weakens over time. This is consistent with the theory of information communication, that is, the longer the time, the weaker the influence of information.

5.2. Ablation Experiments
To better observe the influence of MS, BRS, NS, and WMS on the performance of stock price prediction, an ablation experiment was carried out. The main idea is to remove one of the above indicators at a time to obtain four models: Without_MSI, Without_BRS, Without_NS, and Without_WMS. Their accuracy and F1-score are then compared with those of the Our_full model. The larger the difference, the greater the influence and contribution of the removed indicator. The ablation results are shown in Table 4. In general, when any indicator is excluded, the accuracy and F1-score are lower than those of the Our_full model, which indicates that all four indicators measuring investor sentiment have a positive impact on stock price prediction. Among them, the Without_BRS model has the largest gap compared with the Our_full model, which shows that, among the four indicators, BRS has the greatest impact on stock price prediction. There are two reasons. First, BRS is derived from professional brokerage research reports, which are more easily recognized by shareholders. Second, compared with other sentiment indicators, brokerage research reports directly give buy or sell recommendations.
Next, we remove two indicators at a time to further test the performance of the model. Specifically, one model, Without_MSI_BRS, is obtained by removing MS and BRS simultaneously, and the other, Without_NS_WMS, is obtained by removing NS and WMS simultaneously. The reason for this grouping is that MS and BRS information is more formal and comes from official or institutional sources, while NS and WMS information comes from news and comments on the Internet, which is more casual and free. The results show that the gap between the Without_MSI_BRS model and the Our_full model is larger, which shows that although the amount of news and commentary information from the Internet is greater, stock prices are more affected by official economic indexes and brokerages.
Finally, we compare the influence of different fusion methods for the four indicators on stock price prediction. The Concatenation model splices the four indicators into a one-dimensional vector as the LSTM input, and the results show that its performance is inferior to that of the Our_full model, indicating that the fusion method proposed in this paper is more suitable for stock price prediction tasks.
5.3. Long-Term Stock Price Impact Analysis
To examine whether information has a long-term effect on stock prices, we conduct experiments on 45-day, 60-day, and 90-day windows. The results are shown in Table 5. It can be seen from the table that the accuracy and F1-score of all methods are between 0.50 and 0.60, and no method clearly outperforms the others. Case analysis shows that, in the 45-day, 60-day, and 90-day windows after the information is released, the price performance of the predicted stocks is unstable and even presents a certain degree of randomness.
First, all models except SVM are built on the LSTM model, and the LSTM itself suffers from vanishing gradients when the input sequence is too long. Further, the investor sentiment vector aggregated by our proposed method is used as the initial input of the LSTM model (t = 1). As the observation window expands to 45 days or even 90 days, the influence of the investor sentiment vector on subsequent time steps gradually weakens.
Second, in the We-Media era, the update cycle of market-related information is relatively short. Among the four information sources we selected, the MS is updated monthly, the BRS update cycle is about 20 days, the NS update cycle is about 2 weeks, and the WMS is updated daily, as shown in Table 6. In other words, the longest update cycle of any source is 30 days, which means that new information overwrites old information and affects investors’ decision-making.
Finally, China’s A-share market is a semiclosed and immature market. Investors’ decision-making is often affected by the latest information, leading to frequent transactions and short holding periods. According to statistics, the average holding period of individual investor accounts in the A-share market is less than 20 trading days; even for investment institutions, the average holding period is about 30–40 trading days. The characteristics of the market determine the direction of the market.
In summary, the investor sentiment vector calculated from the day of information collection has a limited impact on the stock price 45 days or even 90 days later, which is also the reason for the poor performance of the model.
6. Conclusion
The relationship between investor sentiment and the stock price has always been a hot research topic. In the era of big data, the channels for investors to obtain information have changed from research reports and news dominated by securities brokers to We-Media information. Multiple sources of information have brought new changes to measures of investor sentiment. Based on multisource information fusion, this paper proposes a new measurement method of investor sentiment and incorporates the new investor sentiment into the framework of stock price prediction. In the experiment with the data of China A-shares from January 2020 to December 2020, the results show that (1) investor sentiment is an important factor affecting stock price fluctuations, (2) among the different indicators of investor sentiment, brokerage report sentiment has the greatest impact on stock prices, and (3) multilayer perceptrons can better integrate emotional indicators.
Data Availability
All data in this paper come from public information on the Internet.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This study was supported by the Natural Science Foundation of Jiangxi Province (grant no. 20212BAB202016) and the Science and Technology Research Project of Jiangxi Provincial Department of Education (grant no. GJJ200318).