Evolutionary Prediction of Nonstationary Event Popularity Dynamics of Weibo Social Network Using Time-Series Characteristics
A growing number of web users around the world post their opinions on social media platforms and share them with others. Building a highly scalable evolution prediction model based on the volatility of the evolution trend plays a significant role in enterprise marketing, public opinion supervision, personalized recommendation, and so forth. However, historical patterns cannot cover the systematic time-series dynamics and volatility features in the prediction problems of a social network. This paper investigates the popularity prediction problem from a time-series perspective using dynamic linear models. First, the stationary and nonstationary time series of Weibo hot events are detected and transformed into time-dependent variables. Second, a systematic general popularity prediction model N-M is proposed to recognize and predict the nonstationary event propagation of a hot event on the Weibo social network. Third, the explanatory compensation variable social intensity (SI) is introduced to optimize the model N-M. Experiments on three Weibo hot events with different subject classifications show that our prediction approach is effective for the propagation of hot events with burst traffic.
Social media platforms, for example, Weibo and Facebook, enable web users to post their views in virtual communities behind screens. This new type of communication has been well accepted and acclaimed for its apparent advantages of low cost and low user interaction risk. However, these obvious benefits have created potential problems that almost everybody may encounter. An increasing number of social users indulge in passing on unverified gossip without doing anything meaningful in the virtual community. Therefore, establishing a highly extensible propagation prediction model has become a recognized need in social media analysis. The related research aims to provide reasonable techniques for avoiding public opinion outbreaks and maintaining social stability.
Information popularity prediction in social networks is usually related to information diffusion models and information-carrying patterns. The primary idea of popularity analysis is to design or learn a prediction model that can accurately reflect the information propagation law of a hot event. The temporal evolution of event popularity can be separated into two types (stationary and nonstationary) according to the degree of fluctuation of its trend. This work is based on the assumption that the evolution prediction accuracy of a hot event is closely correlated with its propagation type in a social network. Some of the current literature on prediction and detection problems in engineering applications pays particular attention to deep learning [1–3]. It is now well established from a variety of studies (e.g., Lin, 2020) that a small dataset can train a high-precision model. Some studies of engineering problems benefit considerably from hybrid deep learning models or deep fusion models. On the other hand, dynamic linear models (DLMs) [4, 5] are well suited to the prediction problem because of their feature extensibility. A DLM also has strong interpretability compared with deep learning models. Hence, this study seeks to obtain a general popularity prediction framework based on the DLM, which will help address the evolutionary prediction problem of information propagation dynamics based on stationary theory and time-series characteristics.
The rest of this paper is organized as follows. Section 2 summarizes the related work. Section 3 systematically reviews the evolution analysis models and assessment methods used in this study, including state space models, dynamic linear models, Kalman filtering, Kalman smoothing and prediction, and maximum likelihood estimation. Section 4 details the proposed information popularity prediction models based on the DLM. Section 5 presents the experimental results. Finally, we conclude this work in Section 6.
2. Related Work
Social networks play an essential role in our daily life. People join multiple social network platforms, for example, Facebook and Twitter, to enjoy different services. Information propagation through online social networks has proved to be a powerful tool in many situations. Reviews of information propagation have highlighted several application disciplines ranging from biology to social sciences, mathematics, physics, and computer science. The classical application in this field is virus spread analysis in ecology, biology, and marketing.
Previous studies of information propagation prediction in social networks address one of the following three tasks: predicting information popularity [11–14], predicting user influence [15–18], and predicting information diffusion paths (links) [19–22]. Some of the literature focuses on user influence in social analysis [15, 17]. Other significant studies are concerned with link prediction to reveal the evolution of real social networks [19, 20]. Much of the current literature on propagation prediction pays particular attention to information popularity, since it can clearly and intuitively reveal the genuine impact of a hot event through statistics. For example, the popularity of video content on the social network platforms TikTok and YouTube is often measured by statistics such as views, followers, favorites, shares, and downloads [23, 24]. Popularity prediction focuses on the overall trend of information diffusion, for example, the propagation range and lifespan [26, 27], to provide valuable decision-making support for monitoring and guiding network public opinion. This study sets out to propose a universal scheme for information popularity prediction problems with time-series characteristics.
Weibo is the unique microblog platform of China which can satisfy regular users’ demands to send messages across the country. Any public opinion that might spread in society would certainly rely on the Weibo platform in China. There will always be hundreds of millions of users forwarding their posts to Weibo from the portal, forums, moments, and other media if the content is novel or valuable. Weibo social network is popularly accepted as a monitoring center for public opinion in the era of big data. This study takes Weibo as the carrier to deal with popularity prediction considering its influence on all social networks in China.
The popularity prediction in social networks is a typical time-series problem, in which solutions can provide useful decision supports for avoiding negative propaganda, rumor, hotspot disposal, and so forth. The current research on Weibo popularity prediction is mainly based on the methods of epidemic models [28, 29], classification models [30, 31], and regression models [32, 33]. Such approaches highlight the requirement for time-series analysis of social networks. Time-series analysis can help researchers realize the random mechanism of generating time feature sequences, set up the data generation model, and predict the future possible values of time series. However, a systematic understanding and process of how to analyze time-series problems in the evolution prediction of a social network is still lacking.
Traditional predictive models are usually related to information cascade theory and are divided into graph-based and non-graph-based ones. The former explore the dynamics of information propagation node by node, starting from an initial set of nodes and spreading through the network according to a cascade model, for example, linear threshold models and independent cascade models. The latter study diffusion mathematically using population-based dynamics, for example, SIR models. Some recent attention has focused on combining traditional prediction models with time-series factors [37, 38]. Wu et al. have shown an example of applying the time series of dynamic data to the task of popularity analysis. This method focuses on the regression modeling of the time series corresponding to the propagation of user-generated content. The investigation of how the popularity of user-generated content changes over time is supported by Hu et al., who utilize regression models and three time-series features to recognize content changes. Matsubara et al. extended the examination of popularity prediction by applying the SpikeM pattern to fit the above-reported time-series models. The published studies have focused on using a temporal feature as an analytical tool rather than concentrating on the model theory of time series in social networks. Small differences in how the popularity prediction problem is posed may therefore lead to large variations in approach.
Some investigators [38, 42] have examined trend analysis in the popularity prediction problem of social networks. Manshad et al. suggested a new time-series trend prediction method based on irregular cellular learning automata and evolutionary computation. Figueiredo et al. used YouTube videos as samples to extract popularity trends from previously uploaded video objects, combining a new time-series classification algorithm (TrendLearner) with the target features to predict the trend of a new target.
Popularity trend fluctuation in a social network can be directly exploited for the evolution prediction problem. Some studies deal with the phenomenon of trend fluctuation but do not investigate the role of volatility in the prediction problems of social networks. Wang et al. introduced a time-series prediction method based on complex network theory, which maps the time series to a data network and extracts fluctuation sequence features based on its network topology. The data fluctuation term is recommended there, for the first time, as a way to improve the prediction. Hu et al. proposed a time-series feature space containing the average of the prevalence, trend, and period to capture the epidemic behavior of viral hot topics. The study observed a high degree of similarity between the short-term trend fluctuations of these hot topics.
Constructing a general evolutionary prediction model with time series is considered one of the most critical social network analysis tasks. Understanding the evolution trend volatility and offering some vital insights into its applications in the popularity prediction problem benefit a lot in the time-series analysis of social networks.
This paper studies the popularity prediction problem from a time-series perspective by means of dynamic linear models. A systematic general popularity prediction model N-M is proposed to recognize and predict the nonstationary event propagation of a hot event on the Weibo social network. First of all, the popularity evolution of a social network can be distinctly divided into two patterns, stationary (Figure 1(a)) and nonstationary (Figure 1(b)), according to the apparent difference in volatility. The intuition behind the volatility of a time feature sequence is that the quantitative fluctuation term is highly correlated with the accuracy of event propagation prediction. Our idea requires creating separate prediction models for stationary and nonstationary events based on a volatility calculation.
On top of that, the explanatory compensation variables are introduced into the model N-M, optimizing dynamic time-series prediction models of the Weibo hotspot event propagation. We summarize the main contributions as follows:
(i) One of the more significant contributions to emerge from this study is that we recommend a general time-series modeling method of popularity prediction by establishing dynamic linear models based on stationary and nonstationary time-series evolution characteristics in social networks.
(ii) The benefit of N-M is that the feature parameters of the above social network prediction problems can be updated simply by adding matrix rows and columns of the model, thus avoiding the negative influence on model design and model adaptability for the variations of the target prediction task.
(iii) This study evaluates N-M on three real-world hot events of the Weibo social network. The results show the superior performance by using a proposed explanatory compensation variable, social intensity (SI). The accuracy of the N-M model can be significantly improved by adding compensation parameters.
This section systematically reviews the evolution analysis technologies based on model theory, aiming to analyze the evolution prediction of information diffusion dynamics in social networks. Characterizing the time series that appear in social networks is essential for understanding the rules of network information diffusion. At the same time, we need to learn the time-series law of social networks so that a high-precision information diffusion prediction model can be established. A time series is a sequence of numbers obtained by successive observations of the same phenomenon at separate times. Statistically, a time series is the realization of a stochastic process, which can be classified as stationary or nonstationary according to its statistical characteristics. However, the series that appear in social networks are, for the most part, nonstationary. The remainder of this section describes the preliminaries needed to establish a popularity evolutionary prediction model of information diffusion dynamics based on time-series characteristics.
3.1. State Space Model
The state space model (SSM) is a dynamic time-domain model with time-dependent variables. A growing number of studies have applied the SSM in social and economic analysis. One purpose of the SSM is to process the time sequences of several variables as a vector time series, transforming the information cascade problem of the social network into a vector sequence analysis problem. Specifically, an SSM is a dynamic system that evolves over time and is determined by two different time series. One is called the state sequence, denoted by $\{x_t\}$, that is, $x_0, x_1, \ldots, x_T$, with respect to the discrete time variable $t$. Any state $x_t$ belonging to the sequence is hidden and unobservable, since the system is inevitably subject to external interference. The other is the counterpart observable sequence, denoted by $\{y_t\}$. The SSM connects the two series by introducing an iterative state equation and an observation equation. The former depicts the transition relation between the current state $x_t$ and the state at the next moment $x_{t+1}$. The latter indicates the internal relationship between the observation and state sequences. An SSM follows two basic assumptions:
(i) Markov hypothesis: the state sequence $\{x_t\}$ constitutes a Markov chain, and the current state $x_t$ is only related to the state at the previous moment, $x_{t-1}$.
(ii) Conditional independence assumption: conditional on the states $\{x_t\}$, the observed values $y_t$ are independent of each other. In other words, $y_t$ depends only on $x_t$.
Based on the above assumptions, the primary form of an SSM can be formalized as the state equation (1) and the observation equation (2):
$$x_t = f(x_{t-1}) + w_t, \tag{1}$$
$$y_t = g(x_t) + v_t, \tag{2}$$
where $w_t$ and $v_t$ represent the system noise and the measurement noise, respectively. The two noises are typically assumed to be mutually independent and Gaussian. The function $f(\cdot)$ depicts the transfer relationship of the state variables $x_t$. The function $g(\cdot)$ describes the mapping between the state variables in the sequence $\{x_t\}$ and the observation variables in the sequence $\{y_t\}$. Both $f(\cdot)$ and $g(\cdot)$ may be linear or nonlinear. According to the form of these functions, SSMs can be roughly divided into three categories: linear, nonlinear, and mixed linear/nonlinear. The complexity increases from the linear to the nonlinear models.
Constructing an SSM for a specific time-series system amounts to estimating the state sequences that may occur in the system life cycle. The state estimation problem is a dynamic estimation problem, which can be divided into three types: smoothing, filtering, and prediction. Among them, filtering is the core. The process of state estimation is shown in Figure 2.
(i) A state estimation problem is called smoothing if we utilize the information up to the current moment to estimate and trace back the past states. In other words, the noise-removed state values within the total time $T$, that is, $x_1, \ldots, x_T$, are the target to be traced and identified by investigating the observation sequence $y_{1:T}$.
(ii) A state estimation problem is called filtering if both the system information up to the current moment $t$ and the observed value $y_t$ are used to estimate and revise the current state $x_t$.
(iii) A state estimation problem is called prediction if the information up to the current moment is used to predict the future states. The problem estimates the state values at times $t+1, t+2, \ldots$ according to the known observation sequence $y_{1:t}$.
In summary, the SSM can be seen as a theoretical algorithm framework, which provides a flexible method for time-series analysis by means of state vectors. System state prediction based on the SSM makes it convenient for the analyst to test the model with statistical theory. Existing research on system state prediction recognizes the critical role played by the SSM. Many economic and financial time-series models can be represented in the form of an SSM, such as the autoregressive integrated moving average (ARIMA) model, the dynamic linear model (DLM) [4, 5], and the stochastic volatility model (SVM). This paper proposes a general algorithm framework for the time-series analysis of information propagation in social networks by using a DLM. The approach introduced for this study (Section 3.2) is a well-designed DLM. An empirical analysis of information dissemination prediction is presented in Section 4 by using the Weibo social network data and DLMs.
3.2. Dynamic Linear Model
The DLM is a particular case of the general SSM with Gaussian and linear characteristics. Its estimation and forecasting tasks can be solved recursively by the well-known Kalman filter, which is the most prevalent and widely accepted approach to analyzing state space problems. The information cascade prediction of social networks can therefore be simplified as a state iteration problem. The characteristic of a DLM is that all model variables obey Gaussian distributions and satisfy linear relationships. A DLM can be represented by the following mathematical form:
$$x_t = F x_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q), \tag{3}$$
$$y_t = H x_t + v_t, \quad v_t \sim \mathcal{N}(0, R). \tag{4}$$
Equations (3) and (4) are the state equation and the observation equation of the DLM, respectively. The uppercase parameters $F$ and $H$ denote the state transition matrix and the measurement matrix, respectively. The parameters $w_t$ and $v_t$ are the system noise and the measurement noise, respectively. The two noises are independent of each other and follow Gaussian distributions with covariances $Q$ and $R$.
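To make the state and observation equations concrete, the following is a minimal sketch that simulates a scalar DLM. It assumes a one-dimensional state, so the transition matrix and measurement matrix reduce to the scalars `f` and `h`, and the noise variances are `q` and `r`. These names, the default values, and the function `simulate_dlm` are illustrative assumptions, not part of the original model specification.

```python
import random


def simulate_dlm(n, f=1.0, h=1.0, q=0.1, r=1.0, x0=0.0, seed=0):
    """Simulate a scalar DLM (illustrative names, not from the paper).

    State equation:       x_t = f * x_{t-1} + w_t,  w_t ~ N(0, q)
    Observation equation: y_t = h * x_t     + v_t,  v_t ~ N(0, r)
    """
    rng = random.Random(seed)
    states, observations = [], []
    x = x0
    for _ in range(n):
        # Hidden state evolves linearly and is disturbed by system noise.
        x = f * x + rng.gauss(0.0, q ** 0.5)
        # The observation is a noisy linear function of the hidden state.
        y = h * x + rng.gauss(0.0, r ** 0.5)
        states.append(x)
        observations.append(y)
    return states, observations


if __name__ == "__main__":
    xs, ys = simulate_dlm(5)
    for t, (x, y) in enumerate(zip(xs, ys), start=1):
        print(f"t={t}  state={x:.3f}  observation={y:.3f}")
```

Setting both variances to zero makes the recursion deterministic, which is a convenient way to check that the two equations are wired up as intended.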
Let $\Theta$ denote the parameter set related to a DLM. All parameters in $\Theta$ except the observed variables $y_t$ may be unknown when a DLM is applied to a practical task, so the DLM can be used for parameter estimation. The problem can be divided into three categories according to the estimation targets required in $\Theta$:
(i) Given the state transition matrix $F$, the measurement matrix $H$, the system noise $w_t$, and the measurement noise $v_t$, the target parameter to be estimated is the state $x_t$. This is a typical state estimation problem, for which we generally use the Kalman filtering, smoothing, and prediction algorithms.
(ii) Given the state transition matrix $F$ and the measurement matrix $H$, the target is to estimate the noise terms $w_t$ and $v_t$. Here the maximum likelihood estimation approach has obvious advantages.
(iii) When the model matrices and noise terms are themselves unknown and the parameters within them must be estimated, the Markov Chain Monte Carlo (MCMC) method based on Markov process theory is a potential choice.
The information diffusion evolution prediction of social networks mainly involves the first two categories. The following is a brief description of the Kalman filtering, smoothing, prediction, and the maximum likelihood estimation method applied in this study.
3.2.1. Kalman Filtering
As mentioned above, state estimation can be divided into prediction, filtering, and smoothing according to the information obtained from the system. Among them, filtering is the core of state estimation. In the filtering problem, the data are supposed to arrive sequentially in time. We require a procedure to estimate the current value of the state vector, based on the observations up to time $t$, and to update our estimates and forecasts as new data become available at time $t+1$. The filtering process provides the rules for updating the current inference on the state vector as new data arrive. We first give the derivation and procedure of the Kalman filtering algorithm.
The Kalman filtering algorithm [46, 47] was developed to recursively solve the linear filtering problem of discrete data by using the Bayes theorem. The primary task is to predict the prior distribution of the state, that is, the probability density at time $t$, based on the state of the system at time $t-1$. Then, the algorithm corrects the prior distribution using the Bayes theorem once the likelihood factor, that is, the system observation value at time $t$, is obtained. The resulting probability density is the posterior distribution of the state at time $t$. Therefore, the state estimation at each moment can be regarded as a standard prediction/correction process. Let $y_{1:t}$ denote the observation sequence $y_1, \ldots, y_t$ and let $p(\cdot \mid \cdot)$ represent a conditional distribution. We formalize the Kalman filter process below to facilitate the introduction of filtering problems of social network information diffusion:
$$p(x_t \mid y_{1:t}) = \frac{p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})}. \tag{5}$$
The conditional distribution in equation (5) describes the probability distribution of one random variable (the state value $x_t$) given that another random variable (the observation sequence $y_{1:t}$) takes fixed values. Both $x_t$ and $y_t$ follow Gaussian distributions in a DLM. All distributions of this form are therefore uniquely determined by their expectations and covariances.
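The first operations of Kalman filtering are the prediction operations, which aim to ascertain the prior distribution and the normalization factor of the system states. First of all, we calculate the predicted mean and covariance matrix of the state variables at any time $t$:
$$\hat{x}_{t|t-1} = F \hat{x}_{t-1|t-1}, \tag{6}$$
$$P_{t|t-1} = F P_{t-1|t-1} F^{\top} + Q, \tag{7}$$
where $F$ and $F^{\top}$ are the state transition matrix and its transpose, and $Q$ is the covariance of the system noise. The notation $\hat{x}_{t|t-1}$ denotes the state mean predicted from time $t-1$ to time $t$.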
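On top of that, we estimate the predicted mean and covariance matrix of the observed variable at time $t$:
$$\hat{y}_{t|t-1} = H \hat{x}_{t|t-1}, \tag{8}$$
$$S_t = H P_{t|t-1} H^{\top} + R, \tag{9}$$
where $H$ and $H^{\top}$ are the measurement matrix and its transpose, and $R$ denotes the covariance of the measurement noise. $\hat{x}_{t|t-1}$ and $P_{t|t-1}$ are the predicted mean and covariance matrix calculated by equations (6) and (7), respectively.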
The correction operations of Kalman filtering build on the values obtained by the prediction operations. Their objective is to calculate the filtering mean and covariance matrix of the state variable at time $t$:
$$\hat{x}_{t|t} = \hat{x}_{t|t-1} + P_{t|t-1} H^{\top} S_t^{-1} \left(y_t - \hat{y}_{t|t-1}\right), \tag{10}$$
$$P_{t|t} = P_{t|t-1} - P_{t|t-1} H^{\top} S_t^{-1} H P_{t|t-1}, \tag{11}$$
where $S_t^{-1}$ is the inverse of the covariance matrix $S_t$. The detailed procedure of the Kalman filtering algorithm is given in Algorithm 1.
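The prediction/correction cycle described above can be sketched in a few lines for the scalar case. This is a minimal illustration under stated assumptions rather than a reproduction of Algorithm 1: the state is one-dimensional, so the matrices reduce to the scalars `f` and `h`, the noise variances are `q` and `r`, and the prior on the initial state is `m0`, `p0`. All of these names are hypothetical.

```python
def kalman_filter(ys, f, h, q, r, m0, p0):
    """Scalar Kalman filter for x_t = f*x_{t-1} + w_t, y_t = h*x_t + v_t.

    q and r are the variances of w_t and v_t; m0 and p0 are the prior mean
    and variance of the initial state. All names are illustrative.
    Returns predicted means/variances and filtered means/variances.
    """
    pred_means, pred_vars = [], []
    filt_means, filt_vars = [], []
    m, p = m0, p0
    for y in ys:
        # Prediction: prior mean and variance of the state at time t.
        m_pred = f * m
        p_pred = f * p * f + q
        # Predicted observation and its variance (normalization factor).
        y_pred = h * m_pred
        s = h * p_pred * h + r
        # Correction: update the prior with the new observation.
        gain = p_pred * h / s
        m = m_pred + gain * (y - y_pred)
        p = p_pred - gain * h * p_pred
        pred_means.append(m_pred)
        pred_vars.append(p_pred)
        filt_means.append(m)
        filt_vars.append(p)
    return pred_means, pred_vars, filt_means, filt_vars


if __name__ == "__main__":
    counts = [1.0, 2.0, 1.5, 3.0]  # toy observations, not real Weibo data
    _, _, means, variances = kalman_filter(counts, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0)
    for t, (m, p) in enumerate(zip(means, variances), start=1):
        print(f"t={t}  filtered mean={m:.4f}  filtered variance={p:.4f}")
```

The predicted quantities are returned alongside the filtered ones because a backward smoothing pass and the likelihood evaluation both reuse them.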
3.2.2. Kalman Smoothing and Prediction
Smoothing estimation in time-series problems is mainly used for the retrospective analysis of observation sequences, to explore potential phenomena or laws underlying the observed values. In many practical problems, the a priori parameters of the system cannot be calculated directly and need to be estimated. For example, in economic research, researchers may need to use a country’s recent gross domestic product to understand the past socioeconomic behavior of that country’s systems. The forward-backward smoothing algorithm can make such an estimation based on an observation sequence. For each state in a discrete system, it calculates both the “forward” probability of reaching the state and the “backward” probability of generating the model’s final state. This study applies the underlying gradient descent idea in the forward-backward smoothing algorithm to modify the past state observation values. The filtering distribution is necessary for forward-backward smoothing. The smoothing distribution is derived from it as follows:
$$p(x_t \mid y_{1:T}) = p(x_t \mid y_{1:t}) \int \frac{p(x_{t+1} \mid x_t)\, p(x_{t+1} \mid y_{1:T})}{p(x_{t+1} \mid y_{1:t})}\, dx_{t+1}. \tag{12}$$
According to equation (12), the entire smoothing process proceeds backward in time and employs the predicted distribution $p(x_{t+1} \mid y_{1:t})$, the filtering distribution $p(x_t \mid y_{1:t})$ of the state, and the smoothing distribution $p(x_{t+1} \mid y_{1:T})$ of the next moment. Through forward filtering, the filtering density and the predicted density at time $t$ can be obtained. We then combine them with the smoothing density at time $t+1$, so that the target smoothing density at time $t$ can be deduced backward.
This study takes the output of the Kalman filter algorithm (Algorithm 1), that is, the state mean $\hat{x}_{t|t}$ and covariance $P_{t|t}$, as the input of the forward-backward smoothing and prediction algorithm. The aim is to obtain the smoothed estimates, denoted by $\hat{x}_{t|T}$ and $P_{t|T}$, respectively. The algorithm distinguishes two cases. First, the smoothing distribution is equal to the filtering distribution if the time parameter satisfies $t = T$:
$$p(x_T \mid y_{1:T}) = \mathcal{N}\left(\hat{x}_{T|T}, P_{T|T}\right). \tag{13}$$
In this case, the smoothing distribution coincides with the normal distribution with parameters $\hat{x}_{T|T}$ and $P_{T|T}$. Second, if $t$ is an intermediate time before the terminal time $T$, we calculate the smoothed mean and the smoothed covariance for time $t$ according to the following equations, with $J_t = P_{t|t} F^{\top} P_{t+1|t}^{-1}$:
$$\hat{x}_{t|T} = \hat{x}_{t|t} + J_t \left(\hat{x}_{t+1|T} - \hat{x}_{t+1|t}\right), \tag{14}$$
$$P_{t|T} = P_{t|t} + J_t \left(P_{t+1|T} - P_{t+1|t}\right) J_t^{\top}. \tag{15}$$
Then, the smoothing distribution $p(x_t \mid y_{1:T})$ is obtained for every time $t$.
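The backward pass can be sketched as follows for the scalar case. This is an illustrative Rauch-Tung-Striebel style recursion under the Gaussian assumption, not a reproduction of Algorithm 2. It assumes the forward filter has already produced the filtered means and variances (`filt_means`, `filt_vars`) and the one-step predicted means and variances (`pred_means`, `pred_vars`); these argument names and the scalar transition coefficient `f` are hypothetical.

```python
def backward_smoother(filt_means, filt_vars, pred_means, pred_vars, f):
    """Scalar backward smoothing pass (illustrative names).

    filt_*[t] are the filtered mean/variance at step t; pred_*[t] are the
    one-step predicted mean/variance for step t given data up to t-1.
    Returns smoothed means and variances conditioned on all observations.
    """
    n = len(filt_means)
    # At the terminal time the smoothed and filtered distributions coincide.
    smooth_means = list(filt_means)
    smooth_vars = list(filt_vars)
    for t in range(n - 2, -1, -1):
        # Smoother gain linking time t to time t+1.
        gain = filt_vars[t] * f / pred_vars[t + 1]
        smooth_means[t] = filt_means[t] + gain * (
            smooth_means[t + 1] - pred_means[t + 1]
        )
        smooth_vars[t] = filt_vars[t] + gain * (
            smooth_vars[t + 1] - pred_vars[t + 1]
        ) * gain
    return smooth_means, smooth_vars


if __name__ == "__main__":
    # Filter output for observations [1, 2] with f = h = q = r = 1,
    # prior mean 0 and prior variance 1 (toy values).
    fm, fv = [2.0 / 3.0, 1.5], [2.0 / 3.0, 0.625]
    pm, pv = [0.0, 2.0 / 3.0], [2.0, 5.0 / 3.0]
    print(backward_smoother(fm, fv, pm, pv, 1.0))
```

Because the smoothed estimate at each step also uses later observations, its variance is never larger than the filtered variance at the same step.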
Under the assumption of the Gaussian distribution related to the DLM models, the execution process of the forward-backward smoothing algorithm can be summarized as Algorithm 2.
As explained in Algorithm 2, the filtering distribution has been determined at any time $t$. Subsequently, we can deal with the prediction of the observed values or state values at any future time point $t+k$ ($k \geq 1$) by a recursive prediction process. The calculation is as follows:
(i) Prediction of the state variables:
$$\hat{x}_{t+k|t} = F \hat{x}_{t+k-1|t}, \quad P_{t+k|t} = F P_{t+k-1|t} F^{\top} + Q. \tag{16}$$
(ii) Prediction of the observation variable:
$$\hat{y}_{t+k|t} = H \hat{x}_{t+k|t}, \quad S_{t+k|t} = H P_{t+k|t} H^{\top} + R. \tag{17}$$
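The recursive prediction process can be sketched for the scalar case as below. The function starts from the last filtered mean and variance (`m`, `p`) and rolls the state equation forward without new observations. The names `f`, `h`, `q`, `r`, and `steps` are illustrative assumptions.

```python
def forecast(m, p, f, h, q, r, steps):
    """k-step-ahead forecasts for a scalar DLM (illustrative names).

    m, p: last filtered state mean and variance.
    Returns one tuple per horizon:
    (state mean, state variance, observation mean, observation variance).
    """
    out = []
    for _ in range(steps):
        # Propagate the state distribution one step ahead.
        m = f * m
        p = f * p * f + q
        # Map the predicted state to the observation scale.
        out.append((m, p, h * m, h * p * h + r))
    return out


if __name__ == "__main__":
    for k, row in enumerate(forecast(1.5, 0.625, 1.0, 1.0, 1.0, 1.0, 3), start=1):
        print(f"k={k}  state=({row[0]:.3f}, {row[1]:.3f})  obs=({row[2]:.3f}, {row[3]:.3f})")
```

Since no observations arrive during forecasting, the variance grows with the horizon whenever the system noise variance is positive and the transition coefficient is at least one in magnitude.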
3.2.3. Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is one of the most important and widely used parameter estimation methods in statistics. The idea is to determine the unknown parameters in a system model by maximizing the probability of observed values. Under the condition that the state transition matrix and the measurement matrix are known from the system environment, the maximum likelihood estimation of noise terms can be derived.
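This section uses the simplified expression $y_{1:T}$ to describe the observation data $y_1, \ldots, y_T$, whose distribution depends on the system noise $w_t$ and the observation noise $v_t$ through their covariances $Q$ and $R$. Let the joint probability density $L(Q, R)$ denote the likelihood function of the noise terms relative to the observation sequence $y_{1:T}$. We have the following formal description:
$$L(Q, R) = p(y_{1:T}; Q, R) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1}; Q, R). \tag{18}$$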
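Each factor in equation (18) is the predictive probability density of the observation $y_t$, which obeys the Gaussian distribution with mean $\hat{y}_{t|t-1}$ and covariance $S_t$ obtained by the forward Kalman filter algorithm. To simplify the operation, we convert equation (18) to its log-likelihood form, where $m$ is the dimension of $y_t$:
$$\ell(Q, R) = -\frac{1}{2} \sum_{t=1}^{T} \left[ m \log(2\pi) + \log \lvert S_t \rvert + \left(y_t - \hat{y}_{t|t-1}\right)^{\top} S_t^{-1} \left(y_t - \hat{y}_{t|t-1}\right) \right]. \tag{19}$$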
The purpose of equation (19) is to find the noise terms, that is, the covariances $Q$ and $R$, that make the observation sequence $y_{1:T}$ most probable under the model. According to the principle of MLE, these parameters can be estimated by solving the optimization task of maximizing equation (19). Finally, we use the quasi-Newton method to solve for the parameters.
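As an illustration of the estimation idea, the sketch below evaluates the Gaussian log-likelihood of a scalar DLM by running the forward filter and then picks the noise variances that maximize it. Note the substitution: instead of the quasi-Newton method used in this study, it uses a plain grid search, chosen only because it needs no external optimizer. The names `q`, `r`, `q_grid`, `r_grid`, and both functions are hypothetical.

```python
import math


def log_likelihood(ys, f, h, q, r, m0, p0):
    """Log-likelihood of a scalar DLM via the prediction error decomposition."""
    total = 0.0
    m, p = m0, p0
    for y in ys:
        m_pred = f * m
        p_pred = f * p * f + q
        err = y - h * m_pred          # one-step prediction error
        s = h * p_pred * h + r        # its variance
        total += -0.5 * (math.log(2.0 * math.pi) + math.log(s) + err * err / s)
        gain = p_pred * h / s
        m = m_pred + gain * err
        p = p_pred - gain * h * p_pred
    return total


def grid_search_mle(ys, f, h, m0, p0, q_grid, r_grid):
    """Pick (q, r) from the grids that maximize the log-likelihood.

    A grid search stands in for the quasi-Newton optimizer here.
    """
    best = max(
        (log_likelihood(ys, f, h, q, r, m0, p0), q, r)
        for q in q_grid
        for r in r_grid
    )
    return best[1], best[2], best[0]


if __name__ == "__main__":
    ys = [0.9, 1.1, 1.0, 1.2, 0.8]  # toy series, not real Weibo data
    grid = [0.01, 0.1, 1.0]
    q_hat, r_hat, ll = grid_search_mle(ys, 1.0, 1.0, 0.0, 10.0, grid, grid)
    print(f"q={q_hat}  r={r_hat}  log-likelihood={ll:.4f}")
```

In practice the same `log_likelihood` function could be handed to a gradient-based optimizer, with the variances reparameterized (for example on a log scale) to keep them positive.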
4. Information Popularity Prediction in Social Networks Based on Dynamic Linear Models
A growing number of information senders and receivers around the world post their speeches, thoughts, and comments online on social media platforms such as Weibo, Facebook, Twitter, WeChat, and LinkedIn. These attention-grabbing open media have been well accepted and acclaimed for their apparent advantages: faster transmission, greater timeliness, broader coverage, greater openness, and more interactivity. These types of social network platforms will play an increasingly significant role in our future life, society, business, and so forth.
The graphical properties of the users and the relationships between users in a social platform are called social networks. The information popularity prediction in social networks is a typical time-series problem, the solutions of which can provide useful decision supports for avoiding negative propaganda, rumor, hotspot disposal, and so forth. The issue of time-series analysis in social networks has received considerable attention. However, a systematic understanding and process of how to analyze time-series problems in social networks is still lacking.
This section takes Weibo as an example to discuss how the DLM can be used as a standard analysis framework for the propagation and prediction issues of social networks with time-series characteristics. The study first highlights some important statistics and evolution rules of the Weibo social network, including the number of daily posts, retweets, comments, and how they change with time. Second, we use the theory of the DLM, combined with social data and search data, to decompose the hotspot events’ time series and predict their propagation effect on the Weibo network.
In this paper, the time series of information propagation in Weibo is regarded as the superposition of trend, periodic, and random terms.
(i) The trend term characterizes the objective evolutionary trend of Weibo hotspot events.
(ii) The periodic term describes the regular changes due to cyclic factors such as the incubation, development, outbreak, mature, recession, and extinction periods of a network public opinion event.
(iii) The random term involves a sudden change or noise disturbance. A sudden change refers to a change caused by unexpected circumstances such as the retweeting of hot news or emergent social events. The noise reflects the influence of many random factors, such as the different attitudes of users who release or forward messages, public opinion reversal caused by facts, and public opinion explosion.
The DLM model can describe the parameters of the information propagation time series listed above. The proposed models are given in Section 4.3.
4.1. Challenges of Information Propagation Prediction
A well-structured propagation prediction model helps researchers or government departments find the regularity, suddenness, or reversibility in the process of information propagation. Predicting the spread of a particular news event helps to guide negative public opinion in advance. Research on information propagation models and prediction for online social networks similar to Weibo faces two challenges:
(1) Avoidance of low-quality data: Spam users (nonzombies) flood every corner of the Weibo platform. Many users engaged in Taobao, microbusiness, and network shopping platforms continue to release advertisements, order displays, and purchasing agency information on Weibo. Such jumbled information results in low-quality data. Despite the importance of statistical characteristics such as the number of posts, comments, retweets, and mentions of a hot event on the Weibo social network, information propagation prediction based only on these numbers rarely reaches acceptable accuracy.
(2) Data association mining: The essence of prediction analysis is to find the law of event occurrence from mass data. The timestamp of a time-series system cannot be used directly for prediction. We still need to discover the underlying factors that influence the state changes of the time series.
This study adopts a divide-and-conquer strategy in response to the issue of low-quality data. The propagation of a hot event on Weibo can be classified as stationary or nonstationary by calculating the event’s time-series volatility. First, DLMs built from the historical time-series information are used for stationary event popularity prediction. Second, some explanatory variables (hot event characteristic variables) are captured to enhance the DLMs. The improved models are applied to the propagation prediction of nonstationary situations.
Association mining methods in  are introduced to find explanatory variables from the Weibo platform by executing information filtering, text classification, data normalization, and so forth. Then, we calculate the correlation coefficient between the event propagation time series and the explanatory variable time series. The resulting highly correlated variables were chosen as the random terms of the time series in the process of event propagation. The complete framework for information popularity prediction of a hot event on the Weibo social network is shown in Figure 3.
4.3. Modeling of Weibo Hot Events Prediction Based on Dynamic Linear Models
Definition 1. A DLM is called a stationary event propagation prediction model (M) if it has the characteristics of local linear trend and seasonality and possesses a state equation form built from the components described below.
Definition 1 provides a general model for popularity prediction problems with stationary event behavior in social networks. The state at each time step comprises the level trend and the inclination trend, together with the state mean estimate of a periodic item, which can be obtained by Algorithm 2. The period length is a model parameter. For example, if the model takes a week (seven days) as a cycle, the period length is 7, and the information propagation of the seventh day is predicted using historical data from the previous six days. The tail terms of the equations are the measurement noise, level trend noise, inclination trend noise, and mean estimate noise.
Representing M in matrix form is an effective way to carry out the programming and the subsequent data experiments. Hence, we convert the M described in Definition 1 into matrix form, consisting of (i) an observation equation and (ii) a state equation, each with its own coefficient matrix.
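As an illustration of the matrix form, the following minimal sketch builds an observation vector and a transition matrix for a local linear trend model with a seasonal block. The names `F` and `G`, the state layout, and the sum-to-zero seasonal constraint are assumptions taken from the standard DLM literature, not necessarily the notation or the exact matrices used in this paper.

```python
import numpy as np

def build_m_matrices(period):
    """Return (F, G) for a local-linear-trend DLM with a seasonal block.

    Assumed state layout: [level, slope, s_1, ..., s_{period-1}].
    """
    if period < 2:
        raise ValueError("period must be at least 2")
    n_seasonal = period - 1
    dim = 2 + n_seasonal

    # Observation vector: picks the level and the current seasonal effect.
    F = np.zeros(dim)
    F[0] = 1.0
    F[2] = 1.0

    # Transition matrix: trend block followed by the seasonal block.
    G = np.zeros((dim, dim))
    G[0, 0] = G[0, 1] = G[1, 1] = 1.0   # level += slope; slope persists
    G[2, 2:] = -1.0                     # seasonal effects sum to zero over a period
    for i in range(1, n_seasonal):
        G[2 + i, 2 + i - 1] = 1.0       # shift older seasonal effects down
    return F, G

# Weekly cycle: 2 trend states + 6 seasonal states.
F, G = build_m_matrices(period=7)
state = np.array([10.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
next_state = G @ state                  # noise-free one-step state evolution
one_step_forecast = F @ next_state
```

With a weekly period the state has eight entries, and a noise-free forecast is the inner product of `F` with the propagated state.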
Definition 2. A DLM is called a nonstationary event propagation prediction model (N-M) if it has the characteristics of local linear trend, seasonality, and regression and possesses a state equation form built from the components described below.
Definition 2 presents a model suitable for analyzing the propagation of nonstationary events in social networks. As in Definition 1, the state comprises the level trend, the inclination trend, and the periodic item estimate at each time step, and the noise terms relate to the system, the level trend, the inclination trend, and the periodic item mean calculation, respectively. In addition, the model includes explanatory variables, each weighted by an explanatory variable coefficient. The compound term combining the coefficient and the explanatory variable denotes a random term in a nonstationary event that suffers a sudden occurrence or reversal. We convert the model N-M into matrix form in the same way, with (i) an observation equation and (ii) a state equation, each with its own coefficient matrix.
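The role of the regression term can be sketched with a small Kalman filter in which the explanatory variable enters the observation vector at every step. This is a hedged illustration under assumed settings: the function name `kalman_filter_nm`, the noise variances, the vague prior, and the synthetic data are all hypothetical, and the regression coefficient is modeled as a random walk, which may differ from the paper's specification.

```python
import numpy as np

def kalman_filter_nm(y, x, period, obs_var=1.0, state_var=0.01):
    """One-step forecasts from a trend + seasonal + regression DLM (sketch)."""
    if period < 2:
        raise ValueError("period must be at least 2")
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n_seasonal = period - 1
    dim = 3 + n_seasonal                 # level, slope, seasonal effects, coefficient

    G = np.zeros((dim, dim))
    G[0, 0] = G[0, 1] = G[1, 1] = 1.0    # local linear trend block
    G[2, 2:2 + n_seasonal] = -1.0        # seasonal block (sum-to-zero form)
    for i in range(1, n_seasonal):
        G[2 + i, 1 + i] = 1.0
    G[-1, -1] = 1.0                      # coefficient follows a random walk

    W = state_var * np.eye(dim)          # assumed state noise covariance
    m = np.zeros(dim)                    # state mean
    C = 1e4 * np.eye(dim)                # vague prior covariance
    forecasts = np.empty(len(y))

    for t in range(len(y)):
        F = np.zeros(dim)
        F[0] = 1.0
        F[2] = 1.0
        F[-1] = x[t]                     # explanatory variable at time t
        a = G @ m                        # predicted state mean
        R = G @ C @ G.T + W              # predicted state covariance
        f = F @ a                        # one-step forecast
        Q = F @ R @ F + obs_var          # forecast variance
        forecasts[t] = f
        A = R @ F / Q                    # Kalman gain
        m = a + A * (y[t] - f)           # updated state mean
        C = R - np.outer(A, A) * Q       # updated state covariance
    return forecasts, m

# Synthetic data (hypothetical): linear trend plus a regression effect.
rng = np.random.default_rng(0)
t = np.arange(120)
si = rng.normal(size=120)                # stand-in for a social-intensity series
y = 5.0 + 0.5 * t + 3.0 * si
forecasts, final_state = kalman_filter_nm(y, si, period=4)
```

Because the coefficient sits in the state vector, it is updated together with the trend and seasonal states, which is what lets a burst in the explanatory variable move the forecast.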
This section discusses the datasets, stationary and nonstationary identification, and the model evolution analysis of three hot events that appeared on the Weibo social network.
5.1. Event Datasets and Propagation Network Setting
To estimate the effectiveness of the proposed models for popularity prediction, we collected the datasets from the popular Chinese social platform Sina Weibo, including three Weibo hot public opinion events in the three content categories of social, entertainment, and international news from 2015 to 2017. A brief description of the three events is presented as follows:
(i) Chengdu female driver incident (Chengdu-Driver). On the afternoon of May 3, 2015, a beating incident occurred near the Jiaozi overpass on Chengdu Third Ring Road. Ms. Lu was forced to stop by Zhang at the Jiaozi interchange for changing lanes and was later beaten and injured. The incident was broadcast on Weibo. People were first shocked by the ferocity of Zhang. However, public opinion turned to blame the female driver for driving recklessly and dangerously when the event video came out. The propagation of the event formed the typical public opinion reversal effect.
(ii) Celebrity Bai Baihe event (Baihe-Cheating). The cheating of a female star, Bai Baihe, led to a backlash from her fans and formed a long-term discussion on the behavior of the entertainer on Weibo.
(iii) THAAD incident. South Korea’s Lotte Group decided to transfer the Seongju Golf Course to the South Korean defense ministry to deploy the THAAD system. The event triggered an immediate outcry in China against Lotte. Many businesses and people boycotted Lotte Group’s operations in China. With THAAD’s deployment in South Korea, anti-South-Korean sentiment in China, especially toward Lotte Group, also increased. The incident eventually led to the closure of 55 Lotte Mart stores in China.
The datasets “Chengdu female driver was beaten.csv,” “Baihe Bai derailed.csv,” and “THADD incident.csv” refer to the social event (Chengdu-Driver), the entertainment event (Baihe), and the international news event (THAAD), respectively.
The original hot event data is prepared by crawling the user information pages involved in the three event topics, including the post contents, user attributes, follower relations, the number of comments and thumb-ups, post time, and forwarding (propagation) links. This study implements the data crawling and the proposed prediction models in Python 3.7 and runs the code on a Linux server with an Intel(R) Xeon(R) CPU (E5-2620 v4) and a GeForce TITAN X GPU (12 GB memory). Valid entries account for 95%, 96%, and 98% of the total results returned by our crawler system for the three events, respectively. The details of the event datasets are shown in Table 1. The development trends of the three events’ datasets are shown in Figure 4.
Our experiments will evaluate and predict the popularity of the three hot events on the Weibo social network. The best environment setting is to get the full structure of Weibo as a map of the spread. However, Weibo allows only a small part of follower relations to be returned by crawlers. Hence, the relations that come from traditional Weibo crawling methods are quite incomplete. On the other hand, the size of Weibo network is enormous. The empirical treatment is to extract a subnet with common characteristics from the original Weibo network. We repeatedly remove the nodes with the degrees less than 2 or more than 1000. Then, the community discovery algorithm reported in  is performed to find the subnet that preserves the neighbors and relations of the event-related users. The subnet has the typical social network characteristics, including the approximate power-law degree distribution (Figure 5) and the high aggregation coefficient.
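The repeated degree-based pruning can be sketched as follows. This is a minimal illustration that assumes an undirected network stored as a symmetric adjacency dictionary; the function name `prune_by_degree` and the toy graph are hypothetical, and the subsequent community discovery step is not shown.

```python
def prune_by_degree(adjacency, min_degree=2, max_degree=1000):
    """Repeatedly drop nodes whose degree lies outside [min_degree, max_degree]."""
    # Copy so the caller's adjacency structure is left untouched.
    graph = {node: set(neighbors) for node, neighbors in adjacency.items()}
    while True:
        to_remove = [node for node, neighbors in graph.items()
                     if len(neighbors) < min_degree or len(neighbors) > max_degree]
        if not to_remove:
            return graph
        for node in to_remove:
            for neighbor in graph.pop(node):
                if neighbor in graph:
                    graph[neighbor].discard(node)

# Toy network: a triangle (a, b, c) with a two-node tail (d, e) hanging off a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}
core = prune_by_degree(toy)
```

Removing one node can push a neighbor below the lower bound, which is why the removal is repeated until no node violates the degree limits. In the toy network, removing `e` exposes `d`, so only the triangle remains.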
5.2. Predictive Evaluation
5.2.1. Stationary Sequence Identification and Social Intensity
Historically, the statistical concept “volatility” has been used to describe the degree of return fluctuation on investments in financial markets. Using this concept to distinguish stationary and nonstationary events is an essential experimental step, so the experiment must state explicitly how to obtain the “volatility” of a hot event on the Weibo social network. The calculation process is defined as follows.
(1) Calculate the percentage return of the explanatory variable sequence (equation (27)).
(2) Calculate the volatility (standard deviation) of the percentage return sequence (equation (28)).
Here, the explanatory variable sequence holds the values of an explanatory variable at different times, as also introduced in equation (25) in Section 4.3, and the standard deviation is taken around the average value of the percentage return sequence. The forwarding behavior of users on Weibo plays an increasingly important role in helping researchers measure how actively an event is discussed. The forwarding time series of an event is the concrete expression of the law of event propagation. Hence, our experiments take the forwarding number of an event on Weibo at each time as the explanatory variable. Then, the sequence of percentage returns and the subsequent volatility can be calculated. We follow the steps below to identify the stationary/nonstationary state of an event:
(1) The life cycle of an event is divided into time-series intervals in days.
(2) Count the forwarding numbers from 8:00 to 24:00 at intervals of 1 hour within a day (obtaining an explanatory variable sequence for each day).
(3) Calculate the percentage return to obtain the return sequence according to equation (27).
(4) Calculate the day-to-day volatility according to equation (28).
(5) The possible values of volatility are divided into several (set as nine) intervals. The forwarding numbers of the event in different volatility intervals are calculated by simple numerical statistics.
(6) Estimate the ratio of the forwarding number in each volatility interval to the total number of retweets.
(7) Calculate the sum of the forwarding ratios with volatility values greater than 1. If this sum is more than 75%, the event is decided to be nonstationary; otherwise, it is stationary.
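The identification steps above can be sketched in a few lines. The exact forms of equations (27) and (28) are assumptions here: the percentage return is taken as the relative change between consecutive hourly counts, and the volatility as the sample standard deviation of those returns. The function names and the toy counts are hypothetical, hourly counts are assumed to be strictly positive, and the nine-interval binning is collapsed into a single comparison against the volatility value 1.

```python
import numpy as np

def percentage_return(counts):
    """Relative change between consecutive counts (assumed form of equation (27))."""
    counts = np.asarray(counts, dtype=float)
    return (counts[1:] - counts[:-1]) / counts[:-1]

def daily_volatility(counts):
    """Sample standard deviation of the returns (assumed form of equation (28))."""
    return float(np.std(percentage_return(counts), ddof=1))

def is_nonstationary(hourly_counts_by_day, vol_threshold=1.0, ratio_threshold=0.75):
    """Share of retweets that fall on days whose volatility exceeds vol_threshold."""
    total = 0.0
    volatile = 0.0
    for counts in hourly_counts_by_day:
        day_total = float(np.sum(counts))
        total += day_total
        if daily_volatility(counts) > vol_threshold:
            volatile += day_total
    ratio = volatile / total
    return ratio > ratio_threshold, ratio

# Toy data: two bursty days and one flat day (hypothetical hourly counts).
days = [[1, 10, 1, 100], [1, 10, 1, 100], [10, 10, 10, 10]]
flag, ratio = is_nonstationary(days)
```

Weighting each day by its retweet total mirrors steps (6) and (7): a day contributes to the decision value in proportion to how much of the event's forwarding happened on it.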
The threshold of 75% is used to distinguish the application scopes of the two models M and N-M when we need to represent, analyze, and predict the process of event popularity evolution on the Weibo social network. The annual report of hot searches on Weibo shows that about 90% of the hot events on Weibo are nonstationary. The volatility distribution of an example using the Chengdu-Driver dataset is shown in Figure 6.
Figure 6 presents the statistical relationship between the forwarding numbers and the volatility intervals for the event Chengdu-Driver. The forwarding longitudinal series is a measure of the depth at which an event post is forwarded on the Weibo social network. Series 1–4 shown in Figures 6(a) and 6(b) illustrate that the statistical results of forwarding numbers contain a minimum of one (resp., three) forwarding action and a maximum of four forwarding longitudinal actions. We plot the data distribution of volatility intervals in a descending order. The right label list of proportions shows the ratio of retweets with different volatility intervals to the total forwarding number. The cumulative line (orange color line) on the second axis identifies the cumulative value from the fluctuating high to the low point as a percentage of the total retweets based on the forwarding numbers.
The red point shown in Figure 6 marks the parameter value used to identify the stationary state of the event. As a result, the propagation of the social event Chengdu-Driver on the Weibo network is nonstationary, since the decision value 0.78 exceeds the threshold of 0.75. If the experiment ignores the forwarding data of series 1 and 2, the decision point value shown in Figure 6(b) is 0.88. A decision point larger than 0.78 indicates that propagation in depth causes the event’s principal fluctuation. Similar analyses are applied to the events Baihe and THAAD, yielding the decision points 0.83 and 0.91, which are presented in Figures 7(a) and 7(b), respectively. Comparing the three points (0.78, 0.83, and 0.91) shows that the international news event THAAD had the most significant fluctuations among the three Weibo events. The nonstationary state implies a large amount of discussion in a short period.
5.2.2. Correlation Analysis of Explanatory Variables
Having explained what is meant by explanatory variables, we now discuss how to find the best explanatory variable. A simple linear regression model is used to establish the quantitative relationships between event data and explanatory variables and to determine their correlation scores. This study selects the best explanatory variable by comparing the Pearson correlation coefficients between each variable’s time series and the time series of forwarding numbers.
We refer to the percentages of forwarding numbers, comment numbers, and thumb-up numbers of an event at a given time as the social intensity, comment intensity, and thumb-up intensity, respectively. The main temporal explanatory variables involved in the analysis of the three experimental datasets are the comment number (CN), thumb-up number (TUN), social intensity (SI), comment intensity (CI), thumb-up intensity (TUI), and average comment length (AveL). All of these explanatory variables are time-dependent. Table 2 shows the correlation scores between the six explanatory variables and the forwarding numbers for the three events. The experiment filtered out the explanatory variables with low correlation and temporal fluctuation. According to the correlation coefficient comparison, social intensity is the best explanatory variable for all three datasets.
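The selection step can be sketched as a comparison of Pearson correlation coefficients. The helper name `best_explanatory_variable` and the toy series below are hypothetical and are not taken from Table 2; ranking by absolute correlation is also an assumption.

```python
import numpy as np

def best_explanatory_variable(target, candidates):
    """Rank candidate series by Pearson correlation with the target series."""
    scores = {name: float(np.corrcoef(target, series)[0, 1])
              for name, series in candidates.items()}
    # Assumption: a strong negative correlation would count as informative too.
    best = max(scores, key=lambda name: abs(scores[name]))
    return best, scores

# Toy series (hypothetical): forwarding numbers and two candidate variables.
forwarding = [10, 40, 90, 60, 20]
candidates = {
    "SI": [0.05, 0.18, 0.41, 0.27, 0.09],
    "AveL": [12, 11, 13, 12, 11],
}
best, scores = best_explanatory_variable(forwarding, candidates)
```

In this toy case `SI` tracks the forwarding series almost proportionally, so it receives the highest score.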
In general, the higher the correlation between the explanatory variable sequence and the target sequence, the better the prediction effect. Figure 8 shows the quantitative relationship between the event popularity evolution and the social intensity of the event Chengdu-Driver, and the dataset has been normalized.
5.3. Experimental Results and Case Analysis
In order to evaluate the effect of the time-series model and of the temporal model after the introduction of the explanatory variable SI, we compare them with the gradient boosting regression tree (GBRT) model based on model combinations. The evaluation indexes adopted are the determination coefficient R2, the root mean square error (RMSE), and the prediction accuracy when the absolute error is within 20% and within 50%. The results of the comparison of the evaluation indicators of the prediction models are shown in Table 3.
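The evaluation indexes can be computed as in the sketch below. The two accuracy measures are interpreted here as the fraction of predictions whose relative absolute error is at most 20% and at most 50%; this reading, the key names `acc_20` and `acc_50`, and the toy values are assumptions.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return R2, RMSE, and tolerance-based accuracies for a prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    rel_err = np.abs(residual) / np.abs(y_true)   # assumes nonzero observations
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residual ** 2))),
        "acc_20": float(np.mean(rel_err <= 0.20)),
        "acc_50": float(np.mean(rel_err <= 0.50)),
    }

# Toy observed and predicted retweet counts (hypothetical values).
metrics = evaluate([100, 200, 300, 400], [110, 150, 300, 700])
```

A single large miss, such as the last toy point, dominates R2 and RMSE while lowering each accuracy measure by only one sample, which is one reason to report both kinds of index.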
The results of this study indicate that when the popularity prediction of a Weibo hot event only depends on the time information, the error between the prediction result and the real event propagation is usually large. By comparing three hot Weibo events, we can see from Figures 9, 10, and 11 that using highly correlated explanatory variables as the proposed model parameters can improve the accuracy and reliability of event popularity prediction.
We illustrate the experimental results using the event Chengdu-Driver as an example. The blue trend line in Figure 9(a) shows the observed retweets of the social event over a month. The most distinct fluctuation in Figure 9(a) is the burst of network traffic around the third day. The green and orange lines represent the fitting results with and without the explanatory variable SI, respectively. The purple-dotted line and the red-dotted line are the corresponding prediction results. As time goes on, the fitted values for the propagation of the event Chengdu-Driver almost coincide with the real trend line. A comparison of the green and orange trend lines in Figure 9(a) reveals that the prediction accuracy of the N-M model with explanatory variables is significantly higher than that of models without explanatory variables. Furthermore, the cumulative relative error recorded in Figure 9(b) shows that the prediction results are more accurate when the explanatory variables are introduced into the N-M model for nonstationary event popularity evolution with burst traffic.
The entertainment event Baihe (Figure 10) and the news event THAAD (Figure 11) showed a similar accuracy advantage for the fits and predictions with explanatory variables. Comparing the fit with SI against the original fit, the model presented in this study yields a clear improvement. In contrast to the event Chengdu-Driver, however, the fitted values with SI show a slightly larger error for the events Baihe and THAAD: the green trend line is not entirely close to the observation line in the prediction charts. The difference is due to the underlying quality of the three datasets. Together, these results highlight the significance of combining dynamic linear models with highly correlated explanatory variables for predicting hot event propagation.
The common feature of the three Weibo hotspot events is that they all have burst traffic characteristics. In order to verify the adaptability of these events to the parameters defined by the prediction models, the event THAAD, which has the highest volatility, was selected to decompose the time-series quantification values. Figure 11(a) represents the observation, fitting, and prediction of THAAD with burst traffic. Figures 12(a), 12(b), and 12(c) represent the trend, periodicity, and regression results of the burst flow event, respectively, reflecting the characteristics of its nonstationary time series. The decomposition results further show that the proposed dynamic linear model N-M (Definition 2) can accurately deal with the prediction task of hot event propagation on the Weibo social network.
This research is undertaken to design a time-series analysis framework and provide a generic solution for the evolutionary prediction of information popularity dynamics in social networks. This study also aims to alleviate the contradiction between stationary and nonstationary event propagation in a standard prediction model. Analyzing all hot events in a social network with a single class of models, without considering their volatility, does not necessarily increase the prediction accuracy. Hence, the framework divides event propagation into two kinds (stationary and nonstationary) according to event volatility. This study defines two specific DLMs, namely, M and N-M, to analyze stationary and nonstationary propagation, respectively.
This paper mainly reveals the superiority of using the N -M model to predict nonstationary popularity evolution. N -M is innovatively constructed by introducing the nonstationary feature parameters of local linear trend, seasonality, and regression. Three different parameter estimation methods, Kalman filter, Kalman smoothing, and maximum likelihood, are used to simplify the parameter adjustment process.
Experiments on three popular hot events of different topic categories that appeared on the Weibo social network confirm the effectiveness and superiority of the N-M model through the comparison of the evaluation indexes R2, RMSE, and the two accuracy measures. The experimental propagation trend lines also show that the prediction accuracy of event popularity evolution can be significantly improved by introducing explanatory variables into the N-M model. The insights into model construction gained from this study may benefit the analysis of complex hot event evolution in social networks with time-series characteristics.
The data used to support the findings of this study are publicly available at https://github.com/YangMin-10/Datasets.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work was supported by the National Natural Science Foundation of China (nos. 61902324, 11426179, and 61872298), the Social Science Planning Project of Sichuan Province (no. SC20TJ020), the Science and Technology Program of Sichuan Province (nos. 2021YFQ0008, 2020JDRC0067, and 2019GFW131), the Foundation of Cyberspace Security Key Laboratory of Sichuan Higher Education Institutions (no. sjzz2016-73), the Scientific Research Fund of Sichuan Provincial Education Committee (nos. 15ZB0134 and 17ZA0360), and the Open Fund Project of Xihua University (nos. 20170410143123 and szjj2015-059).
M. Jalili and M. Perc, “Information cascades in complex networks,” Journal of Complex Networks, vol. 5, pp. 665–693, 2017.
P. Domingos, “Mining social networks for viral marketing,” Journal of Retailing and Consumer Services, vol. 20, no. 1, pp. 24–28, 2010.
H. T. Li, X. Q. Ma, F. Wang, J. C. Liu, and K. Xu, “On popularity prediction of videos shared in online social networks,” in Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pp. 169–178, ACM, New York, NY, USA, 2013.
P. Bao, H. W. Shen, J. M. Huang, and X. Q. Cheng, “Popularity prediction in microblogging network: a case study on sina weibo,” in Proceedings of the 22nd International Conference on World Wide Web, pp. 177-178, ACM, New York, NY, USA, 2013.
Q. Kong, W. Mao, and C. Liu, “Popularity prediction based on interactions of online contents,” in Proceedings of the 4th International Conference on Cloud Computing and Intelligence Systems, pp. 1–5, ACM, Beijing, China, 2016.
Q. Cao, H. W. Shen, J. H. Gao, B. Z. Wei, and X. Q. Cheng, “Popularity prediction on social platforms with coupled graph neural networks,” in Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 70–78, ACM, New York, NY, USA, 2020.
M. Ra2l, G. Ortiz, M. Postigo-Boix, L. Jos2s, and M. Mels, “Extent prediction of the information and influence propagation in online social networks,” Computational and Mathematical Organization Theory, 2020.
R. Pagare and A. Khare, “Churn prediction by finding most influential nodes in social network,” in Proceedings of the International Conference on Computing, Analytics, and Security Trends, pp. 68–71, IEEE, Pune, India, 2016.
Y. L. Shen, N. D. Thang, H. Y. Zhang, and T. T. My, “Interest matching information propagation in multiple online social networks,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 1824–1828, ACM, New York, NY, USA, 2012.
X. G. He, M. Gao, M. Y. Kan, Y. Q. Liu, and K. Sugiyama, “Predicting the popularity of web 2.0 items based on user comments,” in Proceedings of the 37th International ACM and SIGIR Conference on Research and Development in Information Retrieval, pp. 233–242, ACM, New York, NY, USA, 2014.
J. Yang and S. Counts, “Predicting the speed, scale, and range of information diffusion in Twitter,” in Proceedings of the 7th International AAAI Conference on Weblogs and Social Media, pp. 355–358, Washington, DC, 2010.
S. B. Kong, L. Feng, G. Z. Sun, and K. Luo, “Predicting lifespans of popular tweets in microblog,” in Proceedings of the 35th International ACM and SIGIR Conference on Research and Development in Information Retrieval, pp. 1129-1130, ACM, New York, NY, USA, 2012.
R. Zafarani, M. A. Abbasi, and H. Liu, Social Media Mining: An Introduction, Cambridge University Press, Cambridge, UK, 2014.
H. V. Singh, “Predicting the popularity of online news using social features,” in Proceedings of the 2nd International Conference on Green Computing and Internet of Things, pp. 514–518, Bangalore, India, 2018.
C. C. Hsu, L. W. Kang, C. Y. Lee, J. Y. Lee, Z. X. Zhang, and S. M. Wu, “Popularity prediction of social media based on multi-modal feature mining,” in Proceedings of the 27th ACM International Conference on Multimedia, pp. 2687–2691, ACM, New York, NY, USA, 2019.
W. Chen, Y. Yuan, and L. Zhang, “Scalable influence maximization in social networks under the linear threshold model,” in Proceedings of the IEEE International Conference on Data Mining, pp. 88–97, IEEE, Sydney, Australia, 2011.
K. Saito, R. Nakano, and M. Kimura, “Prediction of information diffusion probabilities for independent cascade model,” in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 67–75, Springer, Zagreb, Croatia, 2008.
M. Wang, A. L. M. Vilela, L. Tian, H. Xu, and R. Du, “A new time series prediction method based on complex network theory,” in Proceedings of the IEEE International Conference on Big Data, pp. 4170–4175, Boston, MA, USA, 2017.
Y. Wu, X. L. Chen, and Z. Y. Jiang, “Survey on predicting popularity of information in microblogs,” Journal of Xihua University, vol. 36, no. 1, pp. 1–6, 2017.
Y. Matsubara, Y. Sakurai, B. A. Prakash, L. Li, and C. Faloutsos, “Rise and fall patterns of information diffusion: model and implications,” in Proceedings of the 18th ACM and SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 8710, pp. 6–14, ACM, New York, NY, USA, 2012.
H. Simon and W. John, “Kalman filtering and neural networks,” Adaptive and Learning Systems for Signal Processing Communications and Control, vol. 88, pp. 170–174, 2001.
J. A. Hartigan, Bayes Theory, Springer-Verlag, New York, NY, USA, 1983.
Z. Mohammed, “Scalable algorithms for association mining,” IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 372–390, 2000.
F. Par, D. G. Gasulla, A. Vilalta et al., “Fluid communities: a competitive, scalable and diverse community detection algorithm,” in Proceedings of the 6th International Conference on Complex Networks and Their Applications, pp. 229–240, Springer, Lyon, France, 2017.