Abstract

Social unrest events are common in modern society and need to be handled proactively. An effective approach is to continuously assess the risk of upcoming social unrest events and predict their likelihood. Our previous work built a hidden Markov model- (HMM-) based framework to predict indicators associated with country instability, but it left two shortcomings: it omitted the interaction between event participants, and it learned the state residence time only implicitly. To address them, we propose a new prediction framework in this paper, using frequent subgraph patterns and hidden semi-Markov models (HSMMs). A feature called BoEAG (Bag-of-Event-Association-subGraph) is constructed based on frequent subgraph mining and the bag-of-words model. The new framework leverages the large-scale digital history events captured from GDELT (Global Data on Events, Location, and Tone) to characterize the transitional process of the social unrest events’ evolutionary stages, uncovering the underlying event development mechanics and formulating social unrest event prediction as a sequence classification problem based on Bayes decision. Experimental results with data from five main countries in Southeast Asia demonstrate the effectiveness of the new method, which outperforms the traditional HMM by 5.3% to 16.8% and the logistic regression by 11.2% to 43.6%.

1. Introduction

The era of information technology boosts the rapid development of the Internet of things, social media, and big data. As a data-intensive science, social computing is an emerging thing that leverages the capacity to collect and analyze data with an unprecedented breadth, depth, and scale. It represents a new computing paradigm and an interdisciplinary research and application field. Topics related to social computing have attracted the attention of more and more researchers.

Social unrest events such as protests, strikes, demonstrations, and occupy movements are an important research focus in the social computing area, and they are common in both democracies and authoritarian regimes [1]. Most social unrest events are initially intended as a demonstration to the public or the government. However, on many occasions, they escalate into general chaos, resulting in violence, riots, sabotage, and other forms of crime and social disorder. Take Thailand for example: a series of political protests and three military coups happened between 1990 and 2015, resulting in the government being deposed, illustrating the power of social unrest. Figure 1 depicts the activities that causally preceded the protest against the amnesty bill in Bangkok on August 7, 2013. Anticipating these latent instabilities before they occur and applying preventive strategies to avoid them have important ramifications, such as prioritizing citizen grievances for the decision makers, issuance of travel warnings for the tourism industry, and insight into how citizens express themselves for the social scientist. This has motivated many social and data science researchers to focus on revealing the patterns contained in these events and on predicting future latent social unrest.

Traditionally, research on social unrest was based on static, macro-level qualitative analysis by political researchers. With the development of data science, especially the rise of big data, more and more data-driven approaches have been proposed that offer microscopic insight into possible social unrest events. In the last century, most researchers conducted prediction work using human-coded data, including WEIS [2] and COPDAB [3]. In the past two decades, several small-scale vertical machine-readable datasets [4, 5] and large-scale coded event data like ICEWS (Integrated Crisis Early Warning System) [6] and GDELT [7] appeared, fueling the development of computational methods for the analysis and prediction of social unrest.

Our previous work [8] published in Discrete Dynamics in Nature and Society built a hidden Markov model- (HMM-) based framework to predict indicators associated with country instability. The framework used the temporal burst patterns in GDELT event streams as features to train the hidden Markov models. There are two shortcomings in that work. First, the temporal burst pattern is essentially a simple feature in the number of coded events. The interaction characteristics between event participants are missing. Second, the probability of state residence time in the HMMs decreases exponentially with time, which is obviously not in line with the actual situation of social unrest events.

In response to the above shortcomings, we propose a new prediction framework in this paper, using frequent subgraph patterns and hidden semi-Markov models (HSMMs). The new framework also leverages the large-scale digital history events captured from GDELT to characterize the transitional process of the social unrest events’ evolutionary stages. Our proposed framework converts the GDELT event streams to frequent subgraph patterns to better capture interaction features. In addition, the mechanism of the HSMM guarantees that the prediction model can explicitly learn the probability distribution of state residence time from the historical data. Eventually, the social unrest event prediction is formulated as a sequence classification problem using Bayes decision. More concretely, our main contributions in this updated paper are fourfold:
(i) First, we identify a sequence of stages of events that potentially lead to social unrest. Typical evolutionary stages of social unrest include appeal, accusation, refuse, escalation, and eruption, where each stage corresponds to a state in the hidden semi-Markov model. It should be noted that not all unrest events will go through all four development stages before reaching the eruption stage.
(ii) Second, we propose the BoEAG (Bag-of-Event-Association-subGraph) features to capture the characteristics of frequent patterns instead of the temporal burst patterns used in our previous work [8]. The original GDELT data within a certain time are represented as an event element association graph, from which the frequent subgraph patterns are mined. In the end, the BoEAG features are constructed like the classic BoW (bag-of-words) model [9] used in text processing.
(iii) Third, we propose a hidden semi-Markov model-based framework which contains four major components: ground set extraction, BoEAG feature construction, HSMM training, and event prediction. The ground set contains social unrest events that are significant enough to garner more-than-usual real-time coverage in mainstream news reporting. The BoEAG features of the GDELT stream are taken as the observations. Then, two HSMMs are trained, one for social unrest-prone sequences and one for social unrest-free sequences, after which new sequences’ likelihoods are calculated and predictions are made using Bayes decision theory to specify the classification rule.
(iv) Last, we conduct extensive experimental evaluations with GDELT event data from five main countries in Southeast Asia. The proposed framework outperforms the traditional HMM by 5.3% to 16.8% and the logistic regression method by 11.2% to 43.6% for different countries. Sensitivity analyses are also conducted, revealing the impact of the parameters on the new framework’s performance.

The paper is organized as follows. A coarse introduction of related work is provided in Section 2. Our HSMM-based social unrest event prediction framework is presented in Section 3. In Section 4, extensive experiments to evaluate the performance of the new method are conducted and analyzed. The work is summarized and conclusions are drawn in Section 5.

2.1. Social Unrest Event Prediction

Predictive analysis of social unrest events has long stayed at the level of qualitative analysis relying on the experience of experts, especially political scientists. Since 2009, research studies on social unrest event prediction based on data mining have taken shape in some international political science journals [5, 10]. Especially since 2013, with the popularization of big data technology, big data-driven social unrest event prediction research has ushered in a period of vigorous development. In the conferences such as SIGKDD [11, 12], WWW [13, 14], SDM [15], AAAI [1, 16], and journals such as IEEE Trans. [17, 18], more than 30 related works have been published in succession, and the degree of attention is evident.

Event prediction has been explored in a variety of applications, including elections [19, 20], disease outbreaks [21], stock market movements [22, 23], social unrest event prediction [11, 13, 24–31], movie earnings [22], crime [32], and failure prediction [33]. Most recent social unrest event prediction techniques can be categorized into three types: planned event forecasting, classification-based prediction, and time series mining.

Planned event prediction methods do not need to mine patterns from the previous data. They are based on the hypothesis that protests that are larger will be more disruptive and will communicate support for its cause better than smaller protests. Mobilizing large numbers of people is more likely to occur if a protest is organized and the time and place are announced in advance [1, 11, 25]. For example, Basnet et al. [34] used the GDELT data to propose a clustering method based on spatiotemporal k-dimensional structure trees to study the spatiotemporal distribution of conflict events in India in 2014.

Classification-based prediction incorporates volume features and informative features such as semantic topics to train a classification model and then predicts the occurrence of future events. Several classification methods are utilized such as random forest [13], support vector machines [21], logistic regression [22, 24, 28, 35] and LASSO-based logistic regression [26, 27]. Wang et al. [36] used the LSTM model combined with GDELT’s event data to predict the number of conflicting events. Yang et al. [37] used a two-stage sentiment analysis method based on deep neural networks to conduct early warning research on group aggregation behavior. Phillips [38] summarized the use of social media to predict future events, including applied research in the detection of political events and threat events. Parrish [39] used the recurrent neural network GRU sequence model and aggregated the GDELT event data by day, splicing them into feature vectors to determine whether a country has a social unrest event including domestic political crisis, riots, racial violence, and change of leadership. Zhao et al. [40] used the multitask learning of geographical spatial stratification, judging whether unrest events occurred on the specified date. Wu et al. [41] used the “Protest Participation Theory” proposed in the field of political science, combined with the SVM support vector machine model, to conduct early warning research on social unrest events. Deng et al. [12] extracted and learned graph representations from historical event documents. By employing the hidden word graph features, the model predicts the occurrence of future events and identifies sequences of dynamic graphs as event context.

Time series-based mining uses the temporal correlation of relevant features such as tweet volume by adopting appropriate approaches. For example, Achrekar et al. [42] used autoregressive modeling to predict flu trends using Twitter data. Radinsky et al. [29] utilized NYT news articles from 1986 to 2007 to build event chains and identify significant increases in the likelihood of disease outbreaks, deaths, and riots in advance of the occurrence of these events in the world.

So far, there are few works aiming at utilizing GDELT to make predictions about social unrest. Existing works attempted to use linear regression [43], time series forecasting [44], deep neural networks [36, 39], and frequent subgraphs [28, 35] to conduct the prediction work using GDELT. In [27], GDELT and ICEWS are used as data sources to predict unrest in Latin America. Nevertheless, in these works, comparatively little attention has been paid to consider the event evolutionary stages in the prediction models.

2.2. Hidden Semi-Markov Model

A hidden semi-Markov model (HSMM) is a statistical model with the same structure as a hidden Markov model except that the probability of there being a change in the hidden state depends on the amount of time that has elapsed since entry into the current state. This is in contrast to the original hidden Markov models where there is a constant probability of changing state given survival in the state up to that time.

HSMM was first proposed by Baum et al. [45] and has been successfully used in many applications, including word recognition task [46], daily return series modeling in financial market [47, 48], equipment health diagnosis and prognosis [49], activity recognition and abnormality detection [50], DNA analysis [51], and online failure prediction [52]. It is worth noting that in our work, we referred to the basic idea of online failure prediction in a commercial telecommunication system by Salfner et al. [33, 52]. The works motivate us to apply the hidden Markov model and hidden semi-Markov model to the social unrest event prediction task. The prediction mechanism and Bayes decision-based classification are adopted specifically.

2.3. The GDELT Dataset

The GDELT Project [7] is a real-time network diagram and database of global human society for open research which monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, counts, quotes, and events driving our global society every second of every day, creating a free open platform for computing on the entire world. Each day, the GDELT Project monitors the news media across nearly every corner of the world and compiles a list of over 300 categories of “events” from riots and protests to peace appeals and diplomatic exchanges, recording the details of the event, including its georeferenced location, into a master “event database” of more than a quarter billion events, dating back to 1979 and updated each morning around 4 AM EST. In particular, from 19 February 2015, GDELT 2.0 has been online which updates every 15 minutes accessing the world’s breaking events and reaction in near real time.

In GDELT event data table, each record has 58 fields (61 fields in GDELT 2.0), capturing information pertaining to a specific event in CAMEO format [53]. In this paper, we use the following nine fields from a record: SQLDATE, MonthYear, EventRootCode, GoldsteinScale, NumMentions, AvgTone, ActionGeo_CountryCode, ActionGeo_Lat, and ActionGeo_Long. SQLDATE and MonthYear are the date the event took place in YYYYMMDD format and YYYYMM format, respectively. EventRootCode defines the root-level category the event code falls under. For example, code 1452 (engage in violent protest for policy change) has a root code of 14 (PROTEST). This makes it possible to aggregate events at various resolutions of specificity. GoldsteinScale is a numeric score from −10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country. NumMentions is the total number of mentions of this event across all source documents, which can be used as a method of assessing the importance of an event: the more discussion of that event, the more likely it is to be significant. AvgTone is the average tone of all documents containing one or more mentions of this event. The score ranges from −100 (extremely negative) to +100 (extremely positive). ActionGeo_CountryCode is the location of the event, which is a 2-character FIPS10-4 country code for the location. ActionGeo_Lat and ActionGeo_Long are the centroid latitude and centroid longitude of the landmark for mapping.
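The relationship between an event code and its root code, and the SQLDATE format, can be illustrated with a minimal sketch. The helper names `root_code` and `parse_sqldate` are hypothetical, and the sketch assumes that CAMEO codes are handled as strings whose first two characters give the root category, as in the example of code 1452 above.

```python
from datetime import date, datetime


def root_code(event_code: str) -> str:
    """Return the two-character CAMEO root code, e.g. "1452" -> "14"."""
    return str(event_code)[:2]


def parse_sqldate(sqldate) -> date:
    """Convert a YYYYMMDD value (int or str) into a date object."""
    return datetime.strptime(str(sqldate), "%Y%m%d").date()


def is_unrest(record: dict) -> bool:
    """True if the record falls under root code 14 (PROTEST)."""
    return str(record["EventRootCode"]) == "14"
```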

The dataset is also available on Google Cloud Platform and can be accessed using Google BigQuery. In this paper, we export the GDELT event data for the experiments from the Google BigQuery web service.

3. HSMM-Based Social Unrest Event Prediction

3.1. Framework

Proactive reaction to social unrest events is at first glance closely coupled with social unrest event detection: an unrest event needs to be detected before the government can react to it. However, the fact is that not the detection result but the eruption of a social unrest event is the kind of event that should be primarily avoided, which makes a big difference. Hence, it goes without saying that efficient proactive handling of social unrest events requires the prediction of the future level of social unrest, to judge whether the current situation bears the risk of an unrest event or not.

The evolutionary stages of the social unrest event cannot be directly observed. However, the stages have been explicitly coded more or less on the Internet. The basic assumption of our approach is that the eruption of social unrest events can be identified by frequent subgraph patterns of the event sequence prior to the happening time point using HSMMs. The prediction mechanism for upcoming social unrest events is illustrated in Figure 2. If a prediction is performed at time $t$, we would like to know whether a social unrest event will occur or not between time $t + \Delta t_l$ and $t + \Delta t_l + \Delta t_p$.

$\Delta t_l$ is usually called the lead time. $\Delta t_l$ has a lower bound called the warning time $\Delta t_w$, which is determined by the time needed for the specified organization, like the government, to perform some proactive action, e.g., the time needed to make a public statement. $\Delta t_d$ stands for the length of the data window, called the data window size, which contains the predictive sequence of data. The sequence describes the current state of the country or district. The prediction period $\Delta t_p$ is the length of the time interval for which the prediction holds.
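The timing of a single prediction can be made concrete with a small sketch. It assumes day-level granularity and that the prediction interval starts one lead time after the prediction time and lasts one prediction period; the function name and parameter names are hypothetical.

```python
from datetime import date, timedelta


def prediction_windows(t: date, lead_days: int, data_window_days: int,
                       prediction_days: int):
    """Return the data window and the prediction interval for a prediction at t.

    The data window covers the data_window_days before t; the prediction
    interval starts lead_days after t and lasts prediction_days.
    """
    data_window = (t - timedelta(days=data_window_days), t)
    start = t + timedelta(days=lead_days)
    prediction_interval = (start, start + timedelta(days=prediction_days))
    return data_window, prediction_interval
```

A longer lead time gives decision makers more time to react, but the data window then ends further from the event being predicted.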

Based on the above prediction mechanism, our prediction task will revolve around predicting significant social unrest events at the country level, considering that country alone. To accurately predict social unrest events, it is crucial to be able to characterize these events’ underlying stages before their occurrence by utilizing relevant observations from GDELT event records. We propose a hidden semi-Markov model-based framework to characterize the underlying development of these events. Figure 3 illustrates the proposed HSMM-based social unrest event prediction framework, which contains four major components: ground set extraction, BoEAG feature construction, HSMM training, and event prediction.

Formally, denote $e$ as a basic GDELT event record, and let $e$(“column name”) be the value of the specified column in that record. Denote $E_T$ as a collection of GDELT event record data split into different countries in time period $T$. The country and the day can be filtered by $e$(“ActionGeo_CountryCode”) and $e$(“SQLDATE”), respectively. Since event records are being added daily by the hundreds or thousands to the GDELT event table, we aggregate those event records by day, defined as $E_{c,t}$, meaning the daily aggregated event record on the day $t$ in country $c$. Then, a sequence of $E_{c,t}$ is defined as $E_{c,T}$, which contains all the daily aggregated event records in country $c$ in the time period $T$.
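The daily aggregation can be sketched as follows, assuming each record is a dictionary keyed by the GDELT column names used above; `aggregate_by_day` is a hypothetical helper, not part of any GDELT tooling.

```python
from collections import defaultdict


def aggregate_by_day(records, country):
    """Group the records of one country by SQLDATE, in chronological order.

    Returns a dict mapping each day to the list of records observed on it.
    """
    daily = defaultdict(list)
    for rec in records:
        if rec["ActionGeo_CountryCode"] == country:
            daily[rec["SQLDATE"]].append(rec)
    # YYYYMMDD values sort chronologically, so sorting the keys is enough.
    return dict(sorted(daily.items()))
```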

3.2. Ground Set Extraction

Ground truth is absolutely vital for the prediction problem. Unfortunately, until now, there is no public ground set in the social unrest prediction area. As a result, in this paper, we treat GDELT as the ground truth for social unrest events. Actually, the generated ground set does reflect the real world happenings well according to our manual inspection (see Figure 4).

For each country, the social unrest events we are interested in predicting are those that are significant enough to garner more-than-usual real-time coverage in mainstream news reporting for the country. That is, there is a significant social unrest event in country $c$ on the day $t$. In GDELT, root event code 14 can be taken to mean social unrest. More records with event code 14 means more social unrest event report coverage. For each country $c$ we are interested in, we first aggregate the count of event mentions with root event code 14 on each day $t$. Since new events are being added daily by the hundreds or thousands to GDELT, there is a heterogeneous upward trend in the event mentions, and what counts as more than usual changes over time. As a result, to remove the upward trend in the unrest event mentions, we normalize the mention counts with root code 14 by the average volume of the trailing quarter (90 days). That is, we let $M_{c,t}$ be the sum of $e$(“NumMentions”) over the records with root event code 14 on the day $t$ in country $c$, divided by the average daily volume over the trailing 90 days, where $M_{c,t}$ is the normalized total count of social unrest event mentions on the day $t$ in country $c$ and $e$(“NumMentions”) is the value of NumMentions of each record. Next, we define the average event mention count on each day in country $c$ as

$$\bar{M}_c = \frac{1}{|T_{\mathrm{train}}|} \sum_{t \in T_{\mathrm{train}}} M_{c,t},$$

where $T_{\mathrm{train}}$ denotes the set of days in the training set.

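To smooth the data, we consider a seven-day moving average. By definition, we say that a significant social unrest event in country $c$ occurs during a 7-day stretch if the average of $M_{c,t}$ over that stretch, measured against $\bar{M}_c$, reaches $\theta$, where $\theta$ is the significance threshold.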
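The ground set extraction steps can be sketched as below. This is a minimal illustration under stated assumptions, not the exact rule of the paper: the daily counts are normalized by their own trailing 90-day mean, and a stretch is taken to be significant when its seven-day moving average is at least `theta` times the training mean. All function names are hypothetical.

```python
def normalize_by_trailing_mean(counts, window=90):
    """Divide each daily count by the mean of the preceding `window` days.

    The first `window` days have no full trailing quarter and are dropped.
    """
    normalized = []
    for t in range(window, len(counts)):
        base = sum(counts[t - window:t]) / window
        normalized.append(counts[t] / base if base > 0 else 0.0)
    return normalized


def moving_average(values, width=7):
    """Mean of every run of `width` consecutive values."""
    return [sum(values[i:i + width]) / width
            for i in range(len(values) - width + 1)]


def significant_stretches(normalized, train_mean, theta, width=7):
    """Start indices of stretches whose moving average reaches the threshold.

    Assumption: the threshold is expressed as a multiple of the training mean.
    """
    return [i for i, avg in enumerate(moving_average(normalized, width))
            if avg >= theta * train_mean]
```

Dropping the first 90 days matches the remark in Section 4.1.1 that the training data start one quarter after the first available date.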

3.3. BoEAG Feature Construction

The Bag-of-Event-Association-subGraph (BoEAG) feature is constructed from frequent subgraphs and the bag of word model. The original GDELT data within a certain time are first represented as a big single event element association graph. Then, the frequent subgraph patterns are mined from the big single graph. In the end, the BoEAG features are constructed like the classic BoW (bag of word) model.

The event element association graph draws on the SUBDUE system [54], which analyzed aviation safety events using graph mining. The system converts a series of aviation safety-related event records into graph data for processing. The node labels represent the aviation safety event id and the attribute value. The edge labels represent the attribute name (such as location, time, and flight altitude) and the relationship between events. For example, a “near_to” relationship means that the two accidents occurred within 200 km of each other.

Figure 5 gives a schematic diagram of the event element association graph of this paper. The figure contains two events numbered id1 and id2. The node label in the figure represents the number and attribute value of the GDELT event record, and the edge label represents the attribute name, such as event type, location, participants, and GoldsteinScale value. When two events contain at least one identical participant, there will be a “relate_to” relationship between the two events connected by an edge.
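The construction of such a graph can be sketched with plain sets of labeled nodes and edges. The record keys (`id`, `type`, `location`, `participants`) and the function name are hypothetical and chosen only for illustration; the GoldsteinScale attribute is omitted for brevity.

```python
from itertools import combinations


def build_event_graph(events):
    """Build an event element association graph as (nodes, edges).

    Nodes are (kind, value) pairs; edges are (source, target, label) triples.
    """
    nodes, edges = set(), set()
    for ev in events:
        ev_node = ("event", ev["id"])
        nodes.add(ev_node)
        # Attribute values become nodes; the attribute name labels the edge.
        for attr in ("type", "location"):
            value_node = (attr, ev[attr])
            nodes.add(value_node)
            edges.add((ev_node, value_node, attr))
        for participant in ev["participants"]:
            p_node = ("participant", participant)
            nodes.add(p_node)
            edges.add((ev_node, p_node, "participant"))
    # Events sharing at least one participant are linked by "relate_to".
    for a, b in combinations(events, 2):
        if set(a["participants"]) & set(b["participants"]):
            edges.add((("event", a["id"]), ("event", b["id"]), "relate_to"))
    return nodes, edges
```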

The bag-of-words model is a feature vectorization method commonly used in the field of text retrieval and text classification. In this paper, BoEAG feature construction is similar to BoW. The collection of GDELT event element association graphs aggregated by day corresponds to the corpus in the BoW model. Each event element association graph corresponds to a document and each frequent subgraph corresponds to a word in the BoW model. The weight $w_{i,j}$ of the frequent subgraph $s_i$ in the event element association graph $g_j$ can be calculated by the following formula:

$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}, \qquad (4)$$

where $f_{i,j}$ denotes the frequency of subgraph $s_i$ in the event association graph $g_j$. This value can be directly obtained through the single graph frequent subgraph mining algorithm SSIGRAM proposed in our previous work [55]. $N$ denotes the number of event association graphs, that is, the time span of the dataset in days; $n_i$ is the number of event association graphs that contain subgraph $s_i$.
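The weighting can be sketched as a TF-IDF-style computation. The sketch assumes the weight is the subgraph frequency times the natural logarithm of the number of daily graphs divided by the number of graphs containing the subgraph; the exact logarithm base and any smoothing are assumptions, and `boeag_weights` is a hypothetical name.

```python
import math


def boeag_weights(subgraph_counts):
    """Compute TF-IDF-style weights for frequent subgraphs.

    subgraph_counts: one dict per daily graph, mapping a subgraph code
    to its frequency in that graph.
    Returns the sorted vocabulary and one weight vector per daily graph.
    """
    n_graphs = len(subgraph_counts)
    graph_freq = {}
    for counts in subgraph_counts:
        for code, freq in counts.items():
            if freq > 0:
                graph_freq[code] = graph_freq.get(code, 0) + 1
    vocab = sorted(graph_freq)
    vectors = [
        [counts.get(code, 0) * math.log(n_graphs / graph_freq[code])
         for code in vocab]
        for counts in subgraph_counts
    ]
    return vocab, vectors
```

A subgraph that appears in every daily graph receives weight zero, so only patterns that distinguish some days from others contribute to the feature vector.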

Algorithm 1 gives the process of BoEAG feature construction illustrated above. The input of the algorithm includes three parameters: the original GDELT event records $R$ (e.g., the set of event records of a certain country within a certain period of time), the support threshold $\sigma$, and the maximum number of subgraphs $K$. The output is the BoEAG feature vector set. Lines 4 to 19 of the algorithm construct event association graphs. Lines 20–22 use the SSIGRAM algorithm for single large graphs for frequent subgraph mining. $K$ caps the number of subgraphs returned: when the total number of frequent subgraphs found during the mining process reaches $K$, the algorithm stops iterating and arranges all subgraphs in descending order of frequency. Line 24 obtains the canonical adjacency matrix coding sequence of each subgraph and uses it as the “word.” Line 25 calculates the feature vector corresponding to each event association graph according to formula (4).

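Algorithm 1: BoEAG feature construction.
Require: original event records $R$, support threshold $\sigma$, and maximum subgraphs returned $K$
Ensure: BoEAG feature set $F$
(1) $G \leftarrow \emptyset$ /* The set of event association graphs */
(2) $S \leftarrow \emptyset$ /* The set of subgraphs */
(3) $F \leftarrow \emptyset$
(4) $R_D \leftarrow$ event records in $R$ aggregated by day
(5) for $R_t$ in $R_D$ do /* All the event records at date t */
(6)   $g_t \leftarrow \emptyset$ /* The event association graph at date t */
(7)   for $e_i$ in $R_t$ do
(8)     if $e_i$ is not traversed then
(9)       constructing the graph unit of event $e_i$ in $g_t$
(10)    for $e_j$ in $R_t$ do
(11)      if $e_j$ is not traversed then
(12)        constructing the graph unit of event $e_j$ in $g_t$
(13)      if $e_i$ and $e_j$ contain at least one identical participant then
(14)        generating “relate_to” edge between $e_i$ and $e_j$
(15)      end if
(16)    end for
(17)  end for
(18)  $G \leftarrow G \cup \{g_t\}$
(19) end for
(20) for $g_t$ in $G$ do
(21)   $S_t \leftarrow$ SSIGRAM($g_t$, $\sigma$, $K$) /* Mining frequent subgraphs using the SSIGRAM algorithm */
(22)   $S \leftarrow S \cup S_t$
(23) end for
(24) Representing each subgraph in $S$ as its canonical adjacency matrix (CAM) coding sequence (for details of the canonical adjacency matrix, please refer to [55]).
(25) $F \leftarrow$ weights of the subgraphs in $S$ for each graph in $G$ /* Calculating feature set using formula (4) */
(26) return $F$
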
3.4. HSMM Training
3.4.1. Structure of HSMM

Usually, a social unrest event passes through a series of evolutionary stages over a longer or shorter life cycle, meaning that it is usually not a sudden outbreak. Typical stages in the events’ life cycle often include appeal, accusation, refuse, escalation, and eruption. In this paper, a hidden semi-Markov model with five states and a left-to-right structure is designed, whose structure is shown in Figure 6.

The structure contains five states, corresponding from left to right to the typical stages of the evolutionary process of social unrest events: appeal, accusation, refuse, escalation, and eruption. The state in this structure starts from $s_1$ (appeal) and ends at state $s_5$ (eruption). During the state transition, the number of the next transition state cannot be lower than the current state number. Correspondingly, the state transition probability matrix has the following form:

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} & a_{15} \\ 0 & a_{22} & a_{23} & a_{24} & a_{25} \\ 0 & 0 & a_{33} & a_{34} & a_{35} \\ 0 & 0 & 0 & a_{44} & a_{45} \\ 0 & 0 & 0 & 0 & a_{55} \end{bmatrix}$$
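The left-to-right constraint can be illustrated with a small sketch that draws a random row-stochastic matrix whose entries below the diagonal are zero. The random initialization and the function name are assumptions made for illustration; the paper does not state how the matrix is initialized.

```python
import random


def left_to_right_matrix(n_states=5, seed=0):
    """Random row-stochastic matrix with no transitions to lower-numbered states."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_states):
        # Only states j >= i are reachable from state i.
        raw = [rng.random() if j >= i else 0.0 for j in range(n_states)]
        total = sum(raw)
        rows.append([x / total for x in raw])
    return rows
```

Because zero entries stay zero under Baum–Welch-style re-estimation, the structure imposed at initialization is preserved during training.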

In the traditional HMM model, the state residence time probability shows an exponential downward trend with the number of residence time units [56], which is obviously not consistent with the state residence time of many application scenarios in the real world, especially the social unrest events. In order to improve this shortcoming, the state residence time probability distribution can be explicitly introduced into the HMM model so that it can automatically learn the probability distribution of the state residence time from historical data. This is the original intention of the hidden semi-Markov model.
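The exponential decay mentioned above follows from the fact that, in an HMM with self-transition probability $a_{ii}$, the probability of staying exactly $d$ time units in state $i$ is $a_{ii}^{d-1}(1 - a_{ii})$. A minimal sketch (the function name is hypothetical):

```python
def hmm_duration_pmf(self_transition, max_d):
    """Implicit residence time distribution of an HMM state.

    Staying d units means d - 1 self-transitions followed by one exit.
    """
    return [self_transition ** (d - 1) * (1 - self_transition)
            for d in range(1, max_d + 1)]
```

Whatever the value of the self-transition probability, a residence time of one unit is always the most likely, which is hard to reconcile with stages that typically last several days.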

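Let $S = \{s_1, s_2, \dots, s_N\}$ denote the set of latent states, with $N = 5$ here. Let $\pi$ denote the vector of initial state probabilities. Given a sequence of the above BoEAG feature observations $O = (o_1, o_2, \dots, o_T)$, a standard continuous HSMM can be defined as $\lambda = (\pi, A, B, \mathbf{P})$, where the initial state probability $\pi$ and output matrix $B$ have the same meaning as in the HMM, while the state transition matrix $A$ is defined as

$$A = [a_{ij}]_{N \times N}, \qquad \sum_{j=1}^{N} a_{ij} = 1,$$

where $a_{ij}$ is the probability of moving to state $s_j$ when the residence in state $s_i$ ends.

This paper considers the discrete time probability, that is, the state residence time can only be an integer multiple of the residence time unit, e.g., a day. Let $D$ represent the maximum possible residence time; then, $\mathbf{P}$ can be denoted as a residence time probability matrix of size $N \times D$, whose element $p_i(d)$ represents the probability of the state $s_i$ lasting $d$ time units:

$$\mathbf{P} = \begin{bmatrix} p_1(1) & p_1(2) & \cdots & p_1(D) \\ \vdots & \vdots & \ddots & \vdots \\ p_N(1) & p_N(2) & \cdots & p_N(D) \end{bmatrix}$$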

3.4.2. Sequence Likelihood

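Consider an observation sequence $O = (o_1, o_2, \dots, o_T)$ consisting of $T$ days’ BoEAG feature vectors. The goal of hidden semi-Markov model training is to optimize the model parameters $\pi$, $A$, $B$, and $\mathbf{P}$ so that the likelihood of the model generating the sequence $O$ is maximized. Given the HSMM $\lambda$, the sequence likelihood of the observation sequence $O$ is defined as

$$P(O \mid \lambda) = \sum_{Q} P(O, Q \mid \lambda),$$

where $Q$ represents the hidden state sequence with length $T$. Similar to the traditional HMM, the sum over $Q$ can also be calculated by the forward-backward algorithm proposed in [57]. The difference is that the state residence time needs to be explicitly added during the derivation process. Define $\alpha_t(j)$ as the forward variable, which means the probability of ending at the hidden state $s_j$ at time $t$, given observation sequence $o_1, \dots, o_t$:

$$\alpha_t(j) = P(o_1, \dots, o_t,\ s_j \text{ ends at } t \mid \lambda).$$

$\alpha_t(j)$ can be recursively calculated from front to back as follows:

$$\alpha_t(j) = \sum_{d=1}^{\min(D,\, t)} \left[ \sum_{i=1}^{N} \alpha_{t-d}(i)\, a_{ij} \right] p_j(d) \prod_{\tau = t-d+1}^{t} b_j(o_\tau),$$

where the bracketed term is replaced by $\pi_j$ when $d = t$, $N$ is the number of states, $D$ is the maximum residence time, and $b_j(o_\tau)$ is the observation probability of $o_\tau$ in state $s_j$.

Finally, the sequence likelihood can be efficiently computed by

$$P(O \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j).$$

The backward variable is defined as $\beta_t(i)$, which means the probability of starting at the hidden state $s_i$ at time $t$, given observation sequence $o_t, \dots, o_T$:

$$\beta_t(i) = P(o_t, \dots, o_T \mid s_i \text{ starts at } t,\ \lambda).$$

$\beta_t(i)$ can be recursively calculated from back to front as follows:

$$\beta_t(i) = \sum_{d=1}^{\min(D,\, T-t+1)} p_i(d) \prod_{\tau = t}^{t+d-1} b_i(o_\tau) \left[ \sum_{j=1}^{N} a_{ij}\, \beta_{t+d}(j) \right],$$

where the bracketed term is replaced by 1 when $t + d - 1 = T$.

The sequence likelihood can be efficiently computed by

$$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, \beta_1(i).$$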
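The forward computation with explicit residence times can be sketched as follows. This is a plain-probability illustration of the standard explicit-duration recursion, assuming that the first state starts at the first observation and the last state ends exactly at the last one; it is not claimed to be the exact implementation used in this paper. The observation probabilities are assumed to be precomputed in `emis`, and all names are hypothetical.

```python
def hsmm_likelihood(pi, A, dur, emis):
    """Sequence likelihood of an explicit-duration HSMM via the forward pass.

    pi[i]      : initial probability of state i
    A[i][j]    : probability of moving from state i to state j
    dur[i][d-1]: probability that state i lasts d time units
    emis[t][i] : observation probability of the (t+1)-th observation in state i
    """
    T, N, D = len(emis), len(pi), len(dur[0])
    # alpha[t][j]: probability of o_1..o_t with state j ending at time t.
    alpha = [[0.0] * N for _ in range(T + 1)]
    for t in range(1, T + 1):
        for j in range(N):
            total, obs = 0.0, 1.0
            for d in range(1, min(D, t) + 1):
                obs *= emis[t - d][j]  # extend the segment back by one step
                if d == t:
                    enter = pi[j]      # segment starts at the first observation
                else:
                    enter = sum(alpha[t - d][i] * A[i][j] for i in range(N))
                total += enter * dur[j][d - 1] * obs
            alpha[t][j] = total
    return sum(alpha[T])
```

With a maximum residence time of one unit the recursion reduces to the ordinary HMM forward algorithm, which gives a simple sanity check. For long sequences the products underflow, so a practical version would work with log-probabilities.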

3.4.3. Parameter Estimation

There are four parameters to be estimated in model training: the initial probability distribution $\pi$, the state transition probability $a_{ij}$, the observation probability density function $b_i(o)$, and the state residence time probability density function $p_i(d)$. $\pi$ and $a_{ij}$ can be calculated directly. For $b_i(o)$ and $p_i(d)$, the form of the probability density function needs to be specified in advance. We use a multivariate Gaussian mixture density to describe the probability density of the observations $b_i(o)$:

$$b_i(o) = \sum_{m=1}^{M} c_{im}\, \mathcal{N}(o;\ \mu_{im}, \Sigma_{im}),$$

where $M$ represents the number of Gaussian mixture elements; $c_{im}$ is the weight of the $m$-th Gaussian mixture element in the state $s_i$; $\sum_{m=1}^{M} c_{im} = 1$; and $\mu_{im}$ and $\Sigma_{im}$ are the mean and variance of the Gaussian element, respectively.

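We use a single Gaussian distribution to describe the probability density of the state residence time $p_i(d)$:

$$p_i(d) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left( -\frac{(d - \mu_i)^2}{2\sigma_i^2} \right),$$

where $\mu_i$ and $\sigma_i^2$ are the mean and variance, respectively.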
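Since the residence time is restricted to integer multiples of the time unit, the Gaussian density has to be evaluated on a discrete grid. The sketch below evaluates it at the integer durations from one up to the maximum residence time and renormalizes so that the values sum to one; the renormalization step and the function name are assumptions.

```python
import math


def gaussian_duration_pmf(mean, std, max_d):
    """Discretized Gaussian residence time distribution over d = 1..max_d."""
    weights = [math.exp(-((d - mean) ** 2) / (2 * std ** 2))
               for d in range(1, max_d + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

Unlike the geometric distribution implied by an HMM, this distribution can peak at a residence time longer than one unit.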

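Denote the variable $\xi_t(i, j, d)$ as the probability of transferring from state $s_i$ to state $s_j$ at the time $t$ after residing in $s_i$ for $d$ time units. Given the observation sequence $O$ and the model parameters $\lambda$, then

$$\xi_t(i, j, d) = P(s_i \text{ occupies } t-d+1, \dots, t,\ s_j \text{ starts at } t+1 \mid O, \lambda).$$

Given the definitions of the forward variable and backward variable, $\xi_t(i, j, d)$ can be calculated as

$$\xi_t(i, j, d) = \frac{\left[ \sum_{k=1}^{N} \alpha_{t-d}(k)\, a_{ki} \right] p_i(d) \prod_{\tau = t-d+1}^{t} b_i(o_\tau)\ a_{ij}\, \beta_{t+1}(j)}{P(O \mid \lambda)},$$

where the bracketed term is replaced by $\pi_i$ when $d = t$.

With these quantities, the parameter estimation can be achieved by the expectation maximization (EM) algorithm, also known as the Baum–Welch algorithm in HMM [57]. The E step of the EM algorithm constructs an auxiliary function $\mathcal{Q}(\lambda, \lambda')$, which is then maximized in the M step. Thus, we can obtain the re-estimated model parameters $\hat{\pi}$, $\hat{A}$, $\hat{B}$, and $\hat{\mathbf{P}}$. Then, the process iterates continuously until the parameters converge or the maximum number of iterations is reached, formulated as

$$\hat{\lambda} = \arg\max_{\lambda'} \mathcal{Q}(\lambda, \lambda').$$

As the ground truth contains multiple positive samples and negative samples, we need to use multiple sets of observation data to train the model. Denote $\mathcal{O} = \{O^{(1)}, O^{(2)}, \dots, O^{(K)}\}$ as the training data containing $K$ observation sequences. All observation sequences have the same length $T$. We assume that the observation sequences are independent of each other. $P(\mathcal{O} \mid \lambda)$ represents the probability of the combination of observation sequences under a given model; then,

$$P(\mathcal{O} \mid \lambda) = \prod_{k=1}^{K} P(O^{(k)} \mid \lambda).$$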

Finally, we train two HSMMs on two corresponding sets of sequences: one set consists of the sequences prior to the positive 7-day stretches minus the lead time period, and the other consists of the corresponding sequences prior to the negative stretches. Thus, one model characterizes the evolution process leading to a social unrest event, while the other one characterizes the process that does not lead to a social unrest event.

3.5. Event Prediction

After the training of model parameters, we formalize the social unrest event prediction as a sequence classification problem. For the prediction, an unknown sequence prior to the target 7-day stretch minus the lead time period is aligned with the trained model of each class. The sequence is classified into the class with the higher alignment score, that is, the higher likelihood. However, the likelihood becomes very small for long sequences, such that the limit of double-precision floating point operations may be reached. For this reason, log-likelihoods are used as a scaling technique. Besides, different costs should be associated with classification. For example, falsely classifying an unrest-prone sequence as unrest-free might be much worse than vice versa.
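The underflow problem and one common remedy can be illustrated with the log-sum-exp trick, which adds probabilities that are stored as logarithms without leaving the log domain. This is a generic sketch of the scaling idea and is not claimed to be the specific scaling scheme used in the experiments.

```python
import math


def log_sum_exp(log_values):
    """Return log(sum(exp(v))) computed stably for very negative inputs."""
    m = max(log_values)
    if m == -math.inf:
        return -math.inf  # all terms are zero probabilities
    return m + math.log(sum(math.exp(v - m) for v in log_values))
```

For instance, `math.exp(-1000)` evaluates to zero in double precision, whereas `log_sum_exp([-1000.0, -1000.0])` still returns a finite value close to -999.3.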

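We use Bayes decision theory to specify the classification rule: the unknown sequence of observations $O$ is classified as event-prone if

$$\log P(O \mid \lambda_{E}) - \log P(O \mid \lambda_{\bar{E}}) > \log \frac{c_{\bar{E}E} - c_{\bar{E}\bar{E}}}{c_{E\bar{E}} - c_{EE}} + \log \frac{P(\bar{E})}{P(E)},$$

where $\lambda_{E}$ and $\lambda_{\bar{E}}$ denote the models trained on event-prone and event-free sequences, and $c_{ta}$ denotes the associated cost for assigning a sequence of type $t$ to class $a$, e.g., $c_{E\bar{E}}$ denotes the cost for falsely classifying an event-prone sequence as event-free. $P(E)$ and $P(\bar{E})$ are constants representing the prior probabilities of event-prone sequences and event-free sequences, respectively (see, e.g., [58] for a derivation of the formula).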

Thus, given the costs of misclassification, the right-hand side of this inequality determines a constant threshold on the difference of the sequence log-likelihoods, denoted as $\theta$. If the threshold is small, more sequences are classified as event-prone, which increases the chance of detecting event-prone sequences but also raises the risk of falsely classifying an event-free sequence as event-prone. If the threshold increases, the behavior is reversed: more event-prone sequences go undetected, while the risk of misclassifying event-free sequences falls.
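As a small worked illustration of how costs and priors fix the threshold, the sketch below computes the log cost ratio plus the log prior ratio and applies the decision rule. The function and argument names are hypothetical, the numeric costs and priors are invented for illustration, and the costs of correct decisions are assumed to be zero by default.

```python
import math


def bayes_threshold(cost_false_alarm, cost_miss, prior_prone,
                    cost_true_alarm=0.0, cost_true_reject=0.0):
    """Threshold on log P(O | prone model) - log P(O | free model).

    cost_false_alarm: cost of labelling an event-free sequence as event-prone
    cost_miss:        cost of labelling an event-prone sequence as event-free
    """
    prior_free = 1.0 - prior_prone
    cost_term = math.log(
        (cost_false_alarm - cost_true_reject) / (cost_miss - cost_true_alarm)
    )
    prior_term = math.log(prior_free / prior_prone)
    return cost_term + prior_term


def classify(loglik_prone, loglik_free, theta):
    """Apply the decision rule to a pair of sequence log-likelihoods."""
    return "event-prone" if (loglik_prone - loglik_free) > theta else "event-free"


theta_equal = bayes_threshold(1.0, 1.0, prior_prone=0.5)
theta_rare = bayes_threshold(1.0, 1.0, prior_prone=0.1)
theta_costly_miss = bayes_threshold(1.0, 9.0, prior_prone=0.1)

label_rare = classify(-120.0, -121.0, theta_rare)
label_costly = classify(-120.0, -121.0, theta_costly_miss)
```

With equal costs and a 10% prior the threshold is about 2.2, so a log-likelihood difference of 1.0 is not enough to raise an alarm; making a miss nine times as costly as a false alarm brings the threshold back to approximately zero and the same sequence is flagged.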

4. Experimental Evaluation

This section presents an experimental evaluation of the proposed HSMM-based prediction framework using data from five countries in Southeast Asia.

4.1. Experiment Design
4.1.1. Dataset

Our study area covers five major nations in Southeast Asia: Thailand, Malaysia, the Philippines, Indonesia, and Cambodia. These countries have experienced mass protests of varying degrees over the past decade, which makes them suitable sources of research data. As mentioned above, GDELT uses the CAMEO coding system [53], in which root event code 14 represents social unrest. Figure 7 illustrates the mention counts of protest events in these countries retrieved from GDELT between January 1, 2001, and February 29, 2016. Thailand was mentioned most often in protest reports (25877 times), followed by the Philippines (23381 times); Cambodia was mentioned least (7322 times). Because of the quarterly normalization in Section 3.2, the actual training data run from April 1, 2001, to December 31, 2013, and the test data from January 1, 2014, to February 29, 2016.

4.1.2. Comparison Methods

Three methods are selected for comparison. The first is the traditional hidden Markov model (HMM), which also has a left-to-right structure as in Figure 6 but does not explicitly estimate the state residence time probability distribution during training; the remaining steps are the same as in the HSMM method. The second is logistic regression: two logistic regression models are trained, and sequence classification is conducted on this basis. The third is a baseline that does not train any model; it directly uses the historical probability of protest event records in a country as the probability of future social unrest events.

4.1.3. Performance Metrics

We evaluate our social unrest event prediction framework using metrics similar to those described in Kallus et al. [13]. We quantify the success of the proposed predictive mechanism and the comparison methods by their balanced accuracy. Let $T_{ci}$ and $P_{ci}$, respectively, denote whether a significant social unrest event occurs in country $c$ during the 7-day stretch $i$ and whether we predict there to be one. The true positive rate (TPR) is the fraction of positive instances correctly predicted to be positive, and the true negative rate (TNR) is the fraction of negative instances predicted to be negative. The balanced accuracy (BACC) is the unweighted average of these:

$$\mathrm{BACC} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2}.$$

BACC, unlike the marginal accuracy, cannot be artificially inflated. Because positive and negative examples are unbalanced in our dataset, always predicting “no social unrest event” without using any data yields a marginal accuracy of nearly 90% but a balanced accuracy of only 50%, since its true positive rate is 0 and its true negative rate is 1. More generally, a prediction made without any relevant data always yields a BACC of 50% on average, by statistical independence.
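The difference between the two metrics can be checked with a few lines of code. This is a minimal sketch with hypothetical function names and an invented label vector (10 positive and 90 negative stretches, mirroring the roughly 10% positive rate of the ground truth); it is not the evaluation code used in the experiments.

```python
def marginal_accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the true positive rate and the true negative rate."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / positives
    tnr = tn / negatives
    return (tpr + tnr) / 2


# Invented example: 10% of the 7-day stretches are positive.
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100

acc = marginal_accuracy(y_true, always_negative)
bacc = balanced_accuracy(y_true, always_negative)
```

The constant predictor never detects a positive stretch (TPR of 0) and never raises a false alarm (TNR of 1), so its marginal accuracy is high while its balanced accuracy sits at the chance level.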

4.1.4. Parameter Settings

In the ground-truth extraction stage, the threshold is set to 2.3. This value is approximately equal to the 90% quantile of the standard exponential distribution, so approximately 10% of the 7-day time windows in the ground truth are marked positive.
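The stated quantile can be verified directly from the exponential distribution function, whose $p$-quantile for rate 1 is $-\ln(1 - p)$. The helper name below is hypothetical.

```python
import math


def exponential_quantile(p, rate=1.0):
    """Inverse CDF of the exponential distribution: solves 1 - exp(-rate * x) = p."""
    return -math.log(1.0 - p) / rate


q90 = exponential_quantile(0.90)

# Probability mass above the threshold 2.3 under a standard exponential.
tail_above_threshold = math.exp(-2.3)
```

The 90% quantile is $\ln 10 \approx 2.303$, and the tail mass above 2.3 is about 0.100, which matches the statement that roughly 10% of the windows are marked positive if the normalized counts follow a standard exponential distribution.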

In the BoEAG feature extraction stage, the maximum number of returned frequent subgraphs is set to 10000. The logistic regression has one parameter, the iteration convergence threshold. The baseline method does not require any parameter values to be set in advance. The HMM and the HSMM each have six parameters to set: the number of hidden states, the number of Gaussian mixture components used to estimate the probability density of the observations, the prediction interval, the lead time, the prediction data time window, and the likelihood threshold. The number of hidden states, the number of mixture components, and the prediction interval are fixed parameters, that is, they take the same values in the experiments on the datasets of all five countries. The prediction interval is 7 days: the task is to determine whether a social unrest event will occur during a 7-day (one-week) time window. The lead time, the prediction data time window, and the likelihood threshold are adjustable parameters, and their optimal values are obtained by 10-fold cross-validation on the training set of each country. The lead time ranges from one day to seven days, the prediction data time window takes the values 10, 20, 30, and 40 days, and the likelihood threshold ranges over [−2, 2] with a step of 0.1. The final values are shown in Table 1.
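The search space for the three adjustable parameters can be enumerated as below. This is a sketch only: the variable names are hypothetical, and `toy_score` is an invented placeholder for the mean 10-fold cross-validated BACC, which in practice would require training and evaluating the models for each configuration.

```python
from itertools import product

lead_times = list(range(1, 8))                  # 1 to 7 days
data_windows = [10, 20, 30, 40]                 # days of observation data
thresholds = [round(-2.0 + 0.1 * k, 1) for k in range(41)]  # -2.0 to 2.0

grid = list(product(lead_times, data_windows, thresholds))


def select_best(configurations, score):
    """Return the configuration with the highest score."""
    return max(configurations, key=score)


def toy_score(config):
    """Placeholder for a cross-validated BACC; peaks at (1, 20, 0.3)."""
    lead, window, theta = config
    return -(abs(lead - 1) + abs(window - 20) / 10 + abs(theta - 0.3))


best = select_best(grid, toy_score)
```

The grid contains 7 × 4 × 41 = 1148 configurations per country, so an exhaustive search with 10-fold cross-validation trains on the order of ten thousand model pairs per country.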

4.1.5. Ground Truth

Table 2 gives the ground-truth results on five datasets of Thailand, Malaysia, Philippines, Indonesia, and Cambodia. The experiment uses 7-day time stretches as the time unit. The time span of the dataset (2001.04.01–2016.02.29) contains a total of 778 7-day time windows. The training period includes 666 7-day stretches while the testing period includes 112. The number of positive stretches in the training set (2001.04.01–2013.12.31) and the test set is listed in the table.

Figure 4 takes Thailand as an example and shows its normalized number of protest reports (the red line represents the threshold). We mark the ten 7-day time windows with the most reports and briefly describe each. These correspond to social unrest events that actually happened in Thailand, such as the “Tak Bai incident” on October 28, 2004, which involved about 1500 protesters in the Tak Bai district of Southern Thailand and was triggered by the detention of 6 Muslim believers. The protest against the Abhisit government that broke out in Bangkok on April 7, 2009, is also included. This supports the effectiveness of the proposed method of extracting ground truth from GDELT data.

4.2. Event Prediction Results

Table 3 gives the balanced accuracy (BACC) values of the hidden semi-Markov model (HSMM), the traditional hidden Markov model (HMM), the logistic regression, and the baseline method on the test set. With the BoEAG feature pattern, the HSMM-based prediction method proposed in this paper performs best on the test dataset of every country, which indicates that explicitly modeling the residence time of each evolution stage helps the HSMM capture the characteristics of mass protest events. The HMM performs second best, followed by the logistic regression; the baseline performs worst and is essentially random guessing. A comparison across the five countries shows that each method performs best on the Thailand test set, especially the HSMM method, which achieves a BACC of 95.9%. For all five countries, our proposed HSMM-based approach achieved the best balanced accuracy, outperforming the HMM by 12.7%, 7.1%, 4%, 16.8%, and 5.3% and the logistic regression by 43.6%, 25.8%, 11.2%, 33.5%, and 12.6% for Thailand, Indonesia, the Philippines, Malaysia, and Cambodia, respectively.

In addition, comparing each method’s performance with the BoEAG pattern against its performance with the temporal burst pattern used in our previous work [8], we can see that the BoEAG pattern constructed from frequent subgraphs better models the stages of social unrest events: the BACC values of the HSMM, the HMM, and the logistic regression all improve when the BoEAG patterns are used. This is because the BoEAG pattern considers both temporal burst and the interaction between event participants.

By adjusting the likelihood ratio threshold, a series of correspondences between the true positive rate (TPR) and the false positive rate (FPR) can be obtained, and ROC analysis can then be performed for each method. Figure 8 shows the ROC curves of the HSMM, the HMM, and the logistic regression. The larger the area under the ROC curve (AUC), the better the prediction performance of the model. Among the three methods shown, the HSMM has the largest AUC on each test set and thus performs best.
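The threshold sweep behind an ROC curve can be sketched as follows. The scores stand for log-likelihood differences between the two class models; the six scores and labels are invented for illustration, and the function names are hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """(FPR, TPR) pairs obtained by classifying score > threshold as positive."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = {(0.0, 0.0), (1.0, 1.0)}  # end points of every ROC curve
    for theta in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > theta)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > theta)
        points.add((fp / negatives, tp / positives))
    return sorted(points)


def area_under_curve(points):
    """Trapezoidal area under a curve given as sorted (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented log-likelihood differences and ground-truth labels.
scores = [1.5, 0.8, 0.3, -0.2, -0.9, -1.4]
labels = [1, 1, 0, 1, 0, 0]
thresholds = [round(-2.0 + 0.1 * k, 1) for k in range(41)]

curve = roc_points(scores, labels, thresholds)
auc = area_under_curve(curve)
```

In this toy example eight of the nine positive-negative pairs are ranked correctly, so the area under the curve is 8/9; a model whose scores separate the classes perfectly would reach 1.0.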

4.3. Sensitivity Analysis on Lead Time and Prediction Data Time Window

Although the model parameters are fixed on the training set by 10-fold cross-validation, it is still necessary to investigate the performance of the prediction model for different lead times and prediction data time windows, which also offers guidance for applying the model in practice.

Figure 9 shows how the prediction performance of the HSMM on each test set varies with the lead time and the prediction data time window. The lead time ranges from 1 day to 10 days, and the time window takes the values 10, 20, and 30 days. Two phenomena can be observed. First, as the lead time increases, the overall prediction accuracy of the model decreases; in most cases, the BACC value is highest at the shortest lead time. This agrees with intuition: the closer the observation data are to the time of the event, the more accurately the event can be predicted. Second, the performance of the model does not necessarily improve with the length of the observation time window. A longer observation sequence does not guarantee higher prediction accuracy, because more data can also introduce more interference. Given the trained prediction model and the lead time, different test sets require different prediction data time windows to achieve optimal prediction accuracy.

5. Discussion

This paper presents a hidden semi-Markov model-based framework that leverages large-scale digital history coded events captured from GDELT. It uses frequent subgraph patterns mined from the GDELT event streams to uncover the underlying event evolution mechanics and formulates social unrest event prediction as a sequence classification problem. Extensive empirical testing with data from five countries in Southeast Asia demonstrated the effectiveness of this framework in comparison with the traditional HMM, the logistic regression model, and the baseline model. The results show that the GDELT dataset does reflect useful precursor indicators that reveal the causes or evolution of future events.

We plan to extend this work in three directions. First, we plan to introduce a multilevel prediction mechanism into our framework, for example, at the city or province level. Second, GDELT 2.0 also provides event mention details and global knowledge graphs [59] in real time, which can offer detailed insights into the events. More machine learning and deep learning methods, such as graph neural networks [60], can be developed with these additional event elements. Third, the prediction framework may be improved by distinguishing widespread news coverage from localized coverage.

Data Availability

The GDELT data used to support the findings of this study are included within the article in Section 2.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was supported by the National Social Science Fund of China under grant no. 2019-SKJJ-C-078.