Postmortem Analysis of Decayed Online Social Communities: Cascade Pattern Analysis and Prediction
Recently, many online social networks, such as MySpace, Orkut, and Friendster, have faced inactivity decay of their members, which contributed to the collapse of these networks. The reasons, mechanics, and prevention mechanisms of such inactivity decay are not fully understood. In this work, we analyze decayed and alive subwebsites from the Stack Exchange platform. The analysis mainly focuses on the inactivity cascades that occur among the members of these communities. We provide measures to understand the decay process and statistical analysis to extract the patterns that accompany the inactivity decay. Additionally, we predict cascade size and cascade virality using machine learning. The results of this work include a statistically significant difference in the decay patterns between the decayed and the alive subwebsites. These patterns are mainly cascade size, cascade virality, cascade duration, and cascade similarity. Additionally, the contributed prediction framework showed satisfactory prediction results compared to a baseline predictor. Supported by empirical evidence, the main findings of this work are (1) there are significantly different decay patterns in the alive and the decayed subwebsites of the Stack Exchange; (2) the cascade’s node degrees contribute more to the decay process than the cascade’s virality, which indicates that the expert members of the Stack Exchange subwebsites were mainly responsible for the activity or inactivity of the Stack Exchange subwebsites; (3) the Statistics subwebsite is going through decay dynamics that may lead to it becoming fully decayed; (4) the decay process is not governed by only one network measure; it is better described using multiple measures; (5) decayed subwebsites were originally less resilient to inactivity decay, unlike the alive subwebsites; and (6) the network’s structure in the early stages of its evolution dictates the activity/inactivity characteristics of the network.
In recent years, online social networks (OSNs) have proven their aptitude as a new medium for sharing news and knowledge, expressing opinions, finding jobs, and many other things. In the literature, there are many works that focus on the growth dynamics of a network, starting with the seminal works of Barabási and Albert  and Watts and Strogatz , which were the basis for the field of network science, via many studies examining the growth dynamics of social networks [3–6] to community membership evolution , which provide methods and models for analyzing and understanding growth dynamics in social networks. Nevertheless, the dynamics of members’ interactions in social networks is not always growth dynamics; many online social platforms have gone through decay dynamics in terms of low activity among their members and/or members leaving or deleting their accounts. Online social platforms such as MySpace and Orkut are now out of service after being very active for years and are examples of decayed online social networks. This phenomenon has not been studied well in the literature; the causes, mechanics, and prevention of decay are still open questions that need to be answered.
Here, we approach the decay dynamics problem from a network perspective by modeling the members as network nodes and their social interactions as temporal edges. We aim to better understand the patterns that occur during the decay process by investigating what we call inactivity cascades, which were extracted from decayed Stack Exchange subwebsites. These inactivity cascades are mainly constructed from the structure of the modeled network, where the network structure has already been shown to be crucial in understanding the dynamics of any process that takes place on top of a network such as the structure of the World Wide Web networks [8, 9] and social network analysis [10–13]. Moreover, network structure is correlated in many studies to understanding the dynamics of the processes over networks such as epidemic dynamics [14, 15], knowledge spread , and knowledge transfer . The information produced and evolved on the Stack Exchange website as an information exchange platform makes this work also connected to the information dynamics area [15, 18, 19], where we are concerned with the decay of the information production process on the Stack Exchange website as a medium of knowledge production and sharing.
Based on that, the contributions of this work are summarized as follows:
(i) Extracting and analyzing inactivity cascades from the decayed and alive subwebsites of Stack Exchange
(ii) Devising measures for understanding the decay process and possible patterns in both decayed and alive subwebsites
(iii) Finding different inactivity patterns in alive and decayed subwebsites
(iv) Finding empirical evidence that an inactivity cascade is not driven by only one network measure
(v) Building a machine learning framework for predicting the size and virality of inactivity cascades
The previous contributions can be seen as two parts: (1) analysis of the decay process via cascade modeling and (2) prediction of cascades’ properties. These two parts are complementary: analysis without prediction limits our control over these platforms, and predicting the properties of a decay requires a better understanding of the decay process itself so that we can provide a good prediction model.
The remainder of this paper is structured as follows. Section 2 describes the related work and highlights how this work contributes to the literature. Section 3 provides the definitions and the methods used throughout this paper, and Section 4 describes the datasets used and some preliminary analyses of these datasets. A detailed description of the results and the prediction framework are provided in Section 5. In parallel to the results, Section 5 also includes a discussion of the results and conclusions of this work. Section 6 presents the limitations of this work and directions for future research.
2. Related Work
This paper is related to studies and works that are concerned with decay or inactivity dynamics in social networks. In this section, we present the related works and show how this work compares to them.
Due to limitations on existing data about interaction decay, researchers have focused on theoretical work based on random networks. For example, Dorogovtsev and Mendes  presented a model for understanding the properties of random networks if edges are removed, signaling that the dynamics of a network is not limited to adding nodes and/or edges. Later, with the rise of many social networks and social platforms, research primarily focused on growth dynamics, with very few works dealing with decay dynamics. Torkjazi et al.  studied users’ migration from MySpace to Facebook when the latter was getting more attention from users. Their study suggests that OSNs have a life cycle that may end with service decay. Dev et al.  studied the reasons behind the failure of what they call knowledge markets, such as Stack Exchange. They utilized economic production models in order to understand the dynamics of knowledge generated on these knowledge markets. Wu et al.  predicted the activity and inactivity of members of the DBLP coauthorship dataset by modeling the dynamics of the social engagement of the members of DBLP. They also provide insights regarding the characteristics of the members who departed the networks using network measures. Similarly, Fenner et al.  contributed a theoretical model for generalizing the rich-get-richer model of network evolution, which focuses mainly on growth dynamics, by extending it to link deletion in the Web network. Their model implicitly assumes that dynamics is not limited to growth dynamics but may include link removal. Asur et al.  approached the activity of users from a trend analysis perspective in Twitter, shedding light on what causes some tweets to be trendy. They also found that the decay dynamics of a trend follows a linear function.
Community activity has also been studied by Kairam et al. ; they provide machine learning prediction models to predict community longevity. The authors also provide insights into the factors that contribute to keeping online communities active. In the same vein, Abufouda  contributed machine learning prediction models for predicting users who left decayed and alive communities, with a focus on the decay dynamics of online communities. Cannarella and Spechler built an epidemic model for predicting the dynamics of the members of Facebook . Their model predicted that Facebook would lose 80% of its users between 2015 and 2017 (the same model was used by Facebook researchers and predicted that Princeton University would lose half of its students by 2018, see https://www.facebook.com/notes/mike-develin/debunking-princeton/10151947421191849/). Decay dynamics also raise computational questions. Bhawalkar et al.  and Zhang et al.  provided a theoretical model and mathematical framework for finding the set of nodes whose deletion generates the smallest $k$-core subgraph of a network, focusing on the computational challenge of the decay. Their works confirm that the node removal problem is relevant in social and other networks. Ribeiro  studied user activity and inactivity by providing a model that uses the number of daily active users as an indicator of the dynamics in membership-based websites. This author also presented a prediction model for predicting whether a community will continue to grow or not, similar to the work in . Malliaros and Vazirgiannis  provide a model for social engagement describing the activity and inactivity of members of social networks based on game theory. Similar to the work in , Garcia et al.  investigated the decay of the Friendster social network using game theory. As one of the results of their work, Garcia et al. argue that decay has a direction, which starts from nodes with less coreness; this was later refuted by Seki and Nakamura , who provide a model that shows that decay starts from nodes with higher coreness. Abufouda and Zweig [35, 36] presented a stochastic model for describing the mechanics of inactivity cascades. The model has optimization guarantees that make controlling the decay computationally viable.
The previous works fall into two categories: (1) works that consider both growth and decay processes as a common behavior of online social networks and (2) works that approach the decay process in a social context only via models, which were not validated with real inactivity decay data using temporal snapshots. Although the first category seems to be more realistic, none of the related work in this category provides any thorough analysis of the mechanics of the decay process compared to the rich analysis of growth dynamics. This means there is little insight into the decay process of online social interaction, which would serve to better understand online behavior. As a result, the second category of the related work realized that decay dynamics needs to be considered as a separate process and requires further thorough investigation, particularly after the decline of many online social networks like MySpace and Friendster. However, these works either used synthesized data, which led to contradictory conclusions on the same research question (see the work in  and an opposing argument in  regarding the decay direction and our attempt to resolve this issue in Section 5), or did not consider the temporal aspect of the problem. This study fills the gap by focusing only on decay dynamics using real temporal data from decayed online social communities. Furthermore, we enhance the analysis using inactivity cascades, which, to the best of our knowledge, have not been covered before. This enables us to better understand the characteristics of real inactivity cascades and, hence, helps us gain more insights into the online behavior of humans.
3. Definitions and Methods
3.1. Networks and Measures
An undirected graph $G$ is defined as a tuple $G = (V, E)$, where $V$ is the set of nodes of $G$ and $E$ is the set of edges, defined as $E \subseteq V \times V$. An edge $e \in E$ is defined as a pair of two nodes $u$ and $v$, where $u, v \in V$. Graphs at a specific point of time are denoted as $G_t = (V_t, E_t)$, where $V_t$ and $E_t$ are the sets of nodes and edges that are observed at time point $t$, respectively. The set of graphs $\{G_0, G_1, \ldots, G_T\}$ is a temporal structure of a graph at time points $t \in \{0, 1, \ldots, T\}$. The graph $G_0$ is called the initial network.
A tree is a connected graph with no cycles. An inactivity cascade tree, a cascade for short, is a rooted tree where each directed edge $(u, v)$ contains two nodes such that the last observed time points of nodes $u$ and $v$ were $t_u$ and $t_v$, respectively, with $t_u < t_v$. That is, node $u$ became inactive before its neighbor, node $v$. Algorithm 1 describes the steps we followed to extract such cascades. The root of a cascade is called a cascade initiator, which is any node that becomes inactive while all of its neighbors are active. If no such node exists, we arbitrarily select one of the earliest nodes that became inactive. The number of nodes in a cascade is called cascade size.
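Algorithm 1 is not reproduced in this text, so the following is only a minimal sketch of how cascades matching the definition above could be extracted; it should not be read as the authors' exact procedure. The function names, the breadth-first choice of a node's parent, and the strict handling of ties in last-observed times are all assumptions made for illustration.

```python
from collections import deque


def find_initiators(adjacency, last_seen):
    """Return nodes that became inactive while all their neighbours were still active.

    adjacency: dict mapping node -> set of neighbours in the initial network
    last_seen: dict mapping node -> last time point at which the node was observed
    """
    initiators = [
        u for u in adjacency
        if adjacency[u] and all(last_seen[u] < last_seen[v] for v in adjacency[u])
    ]
    if not initiators:
        # Fallback described in the text: pick one of the earliest inactive nodes.
        initiators = [min(adjacency, key=last_seen.get)]
    return initiators


def extract_cascade(adjacency, last_seen, initiator):
    """Build one cascade tree rooted at `initiator` as a child -> parent mapping."""
    parent = {initiator: None}
    queue = deque([initiator])
    while queue:
        u = queue.popleft()
        for v in sorted(adjacency[u]):
            # Follow u -> v only if u became inactive strictly before its neighbour v
            # (assumption: ties are not followed; first parent reached wins).
            if v not in parent and last_seen[u] < last_seen[v]:
                parent[v] = u
                queue.append(v)
    return parent


adjacency = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
last_seen = {"a": 1, "b": 2, "c": 3, "d": 2}
for root in find_initiators(adjacency, last_seen):
    tree = extract_cascade(adjacency, last_seen, root)
    print(root, "cascade size:", len(tree))
```

In this sketch the cascade size is simply the number of keys in the returned mapping.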
The edge formation period for an edge , where , is defined as . Based on that, we measure the normalized cascade duration, which is defined as
For the set of temporal graphs, a set of inactivity cascade trees is extracted. The virality of a cascade measures how far the effect of the initiator of a cascade goes . The measure is defined as (this measure was originally proposed as the Wiener index ) $\nu(C) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}$, where $d_{ij}$ is the length of the shortest path between the nodes $i$ and $j$ and $n$ is the number of nodes in a cascade. We propose a Jaccard-like similarity measure of two cascades. To have more structural similarity, we consider the structural properties of a cascade by considering the neighborhood of nodes in cascades such that if there is a node shared between two cascades with also many shared neighbors, then the two cascades are assumed to be more similar. Thus, we define
In addition, we used the features in Table 1 for building a supervised machine learning model for predicting cascade’s properties.
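To make the Wiener-index-based virality measure concrete, the sketch below computes the average shortest-path length over all ordered pairs of distinct nodes in a cascade tree. It assumes the common normalization by $n(n-1)$ and an undirected view of the tree; the function name and the input format are illustrative choices, not taken from the original work.

```python
from collections import deque


def wiener_virality(tree_adjacency):
    """Average shortest-path distance over all ordered pairs of distinct nodes.

    tree_adjacency: dict mapping node -> iterable of neighbours
                    (undirected view of a cascade tree).
    """
    n = len(tree_adjacency)
    if n < 2:
        return 0.0
    total = 0
    for source in tree_adjacency:
        # Breadth-first search gives shortest-path lengths from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in tree_adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))


star = {"r": ["x", "y", "z"], "x": ["r"], "y": ["r"], "z": ["r"]}
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(wiener_virality(star), wiener_virality(path))
```

A star-shaped cascade (all nodes attached to the initiator) scores lower than a chain of the same size, which matches the intuition that the measure captures how far the initiator's effect travels.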
3.2. Statistical Divergence
3.2.1. Cumulative Distribution Function
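The cumulative distribution function (CDF) for a discrete random variable $X$ is defined as $F_X(x) = P(X \le x) = \sum_{x_i \le x} P(X = x_i)$. If $X$ is continuous, then the CDF is defined as $F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$, where $f_X$ is the probability density function of $X$. Similarly, the complementary CDF (CCDF) is defined as $\bar{F}_X(x) = P(X > x) = 1 - F_X(x)$.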
3.2.2. Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test (KS-test) is a statistical test that tells whether two different samples were drawn from the same distribution or not. The test is used to compare two patterns in order to know if they are the same or statistically different. Informally, its statistic is the maximum absolute distance between the two CDFs of the two samples. More formally, for two CDFs, $F_1$ and $F_2$, the KS-test statistic is defined as $D = \sup_x |F_1(x) - F_2(x)|$, where $\sup$ is the supremum of a set.
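As an illustration of the statistic only (not of the full test with its p-value), the following sketch computes the maximum absolute distance between the empirical CDFs of two samples. The function name is hypothetical, and the sample values in the usage lines are made up for demonstration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The empirical CDFs are step functions, so the largest gap
    # is attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

In practice, a library routine such as `scipy.stats.ks_2samp` also returns a p-value, which is what a significance statement requires.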
3.2.3. Patterns’ Entropic Similarity
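Shannon entropy  quantifies the information in a discrete random variable $X$ as follows: $H(X) = -\sum_{x} P(x) \log P(x)$. Given two probability distributions $P$ and $Q$, the Kullback-Leibler divergence  ($D_{KL}$) is a measure of how much these two distributions differ, and it is defined as $D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$. The Jensen-Shannon divergence is then defined as $JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$, where $M = \frac{1}{2}(P + Q)$, which is a symmetric distance variation of the $D_{KL}$.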
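The divergences above can be sketched for two discrete distributions given as aligned probability lists. The base-2 logarithm is an assumption (the text does not state the base), and the function names are illustrative.

```python
from math import log2


def kl_divergence(p, q):
    """Kullback-Leibler divergence of p from q; terms with p_i = 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetrised KL against the mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # disjoint support
```

Because every entry of the mixture is positive wherever either input is positive, the Jensen-Shannon divergence stays finite even when the two distributions do not share support, which is not true of the plain Kullback-Leibler divergence.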
4. Datasets
Stack Exchange (https://StackExchange.com/) is a network of question-and-answer websites that contains subwebsites for specific topics, such as computer science, German language, or workplace, to name just a few. Before being available to the public permanently, each of these websites must go through a beta version, becoming permanent for the public if it sustains a certain level of activity. If the subwebsite does not meet the activity requirement, it is shut down. Some of these subwebsites go back and forth between being beta and closed. As a result, all of the users’ accounts and their interactions are saved. This information is the dataset used for this work. We parsed, structured, and analyzed a set of closed (decayed) subwebsites as an example of communities that underwent decay dynamics [27, 36], as well as a set of alive subwebsites. The decayed subwebsites we considered in this work are Business Startups and Economics. In addition to that, we also have data for alive websites, such as Statistics, Latex, and Music. We used both types in order to make a comparison, if possible, between the patterns and cascades found in the alive and the decayed communities. One advantage of this dataset is that it contains all the temporal information needed to construct temporal social networks based on the interactions among the users. So, we constructed networks where the nodes are the members of these networks and the edges are the interactions among them, including replying to a question, upvoting, or downvoting. Table 2 shows a summary of the datasets used, the monitoring period for the interactions, the number of networks constructed, information about the first and the last networks, and the number of extracted cascades. The monitoring period for the datasets differed according to their active periods; e.g., for the decayed subwebsites (the first three rows in Table 2), the last monitoring day was the last day these subwebsites were active.
Conversely, the last three subwebsites are still alive, so the last monitoring day was the same. Note that the set of nodes refers to the core nodes used for this study, which means other nodes emerging in-between were ignored. The core nodes were members with a reputation score of at least 500; we tried smaller values, e.g., 100, 200, 300, and 400, for the reputation score, and the resulting temporal networks were too sparse with too many disconnected components, which hinders any subsequent analysis. The reason for that in the context of the Stack Exchange websites is that there are many users who come only for one question or make only one comment and then do not appear again on the platform. We consider those users as outliers to the platform’s core activity, e.g., information production, and thus, the chosen value, i.e., a reputation score of at least 500, is justified from the lower bound side. We did not select larger values for two reasons: (1) there are few users in some communities who have a reputation score larger than 500 and (2) selecting a larger reputation score excludes members with less activity and thus leaves significantly fewer core nodes. Both cases render the constructed networks useless for any analysis. Thus, the chosen value is justified from the upper bound side.
From the table, it is clear that the alive subwebsites Latex and Statistics, which are considered very active, succeeded in keeping nearly 10% of the core nodes in the last network, whereas this percentage is almost zero in the other subwebsites. We found that those 10% of the members were users with very high overall reputation scores. For instance, user number 5001 (https://tex.stackexchange.com/users/5001/mico) was active in all of the networks used over time for the Latex subwebsite; this user ranks among the top Stack Exchange users, with a reputation score in the thousands. The same behavior was found on the Statistics subwebsite for user 805 (https://stats.stackexchange.com/users/805/glen-b), who also ranks among the top Stack Exchange users, with a reputation score in the thousands. We noticed that these two users were active mainly on the corresponding subwebsite, Latex and Statistics, respectively. For the Music subwebsite, the situation is different. The number of retained members from the core nodes was only two users, which is very similar to the decayed subwebsites. Moreover, those two users were mainly active on other subwebsites; for example, user 932 (https://music.stackexchange.com/users/932/leftaroundabout) was found in all of the networks of the Music dataset, but his main activity was on the Stack Overflow subwebsite. For the decayed websites, it was hard to get information about the retained users from the core users because no user information was available.
5. Results and Discussions
5.1. Analysis and Modeling Results
Here, we start presenting the results of the analysis by providing information about the largest cascades extracted from the datasets. Figure 1 shows the largest cascades of the subwebsites Startups, Economics, Statistics, and Latex. We observe that the cascades of the decayed communities, such as Startups and Economics, contain a larger fraction of nodes from the initial network. The fraction of the nodes in the largest cascades, considering the initial network, is 0.44, 0.45, 0.15, 0.21, and 0.09 for the subwebsites Startups, Economics, Statistics, Latex, and Music, respectively.
The figure also shows that for the decayed subwebsites, the colors of the nodes are very close to each other, which suggests that the duration of the cascades in the decayed subwebsites was short compared to that in the alive subwebsites, where the colors of the nodes are clearly lighter at the nodes close to the leaves. This will be statistically supported in the following section.
5.1.1. Cascade Size
The size of a cascade is the number of nodes it contains. Figure 2 shows the results obtained for different subwebsites. We can observe in the figure that all datasets contain cascades that have at least 38% of the nodes of the initial network. This percentage is even higher in decayed communities (Startups and Economics) and reaches 55% on the Startups subwebsite. Figure 2 also shows that the cascade size patterns appear visually different. The difference is even clearer in Figures 2(b) and 2(d), where the cascades in the decayed communities contain a lot more nodes. To test the statistical significance of this difference, we used the KS-test described in Section 3.2. We found that there is a statistically significant difference between the decayed and the alive subwebsites. We found that the probability distributions of the cascade size are the same (i.e., they seem to be drawn from the same distribution) in the alive websites (), are the same for the decayed subwebsites (), and are different when testing an alive website against a decayed website (). The only exception to this occurred when testing the statistical significance between the Statistics and the Latex subwebsites; although both are still alive, the cascade sizes were statistically different ().
Figure 2 panels: (a) 100% of the cascades; (b) largest 50% of the cascades; (c) 100% of the cascades; (d) largest 50% of the cascades.
Discussion Point 1. Different inactivity cascade patterns exist in alive and decayed subwebsites.
The size of the cascades extracted from different subwebsites shows that inactivity dynamics is common in both alive and decayed subwebsites of the Stack Exchange. However, the size of the cascades in the decayed ones was significantly larger than the size of the inactivity cascades found in the alive subwebsites. Based on Figure 2, the smallest cascade in the largest 50% of the cascades contains more than 20% of the nodes from the initial network of the decayed subwebsites, compared to nearly 10% for the alive ones. Our interpretation of this is that there are members of the alive subwebsites who keep these communities alive by continuously providing content (in terms of, for example, answers to the questions), which keeps the platform active. This can be clearly seen in Table 2, where in the alive subwebsites, the number of nodes found in the last observed network is much higher than that of the nodes found in the decayed subwebsites. It seems that those members are experts whose existence is vital for sustaining these communities. Investigating the profiles of some of those members (see Section 4) supports our interpretation.
5.1.2. Cascade Virality
Figure 3 shows the Wiener index of the cascades extracted from different subwebsites as a measure of virality. As the size of the networks and the size of the cascades differ across the subwebsites, it was necessary to normalize the Wiener index to enable a meaningful comparison of the distributions. To that end, we used a sigmoid function for normalization (other normalization methods like the tanh function and min–max normalization provided almost identical results). Generally, the patterns of virality across different subwebsites are statistically the same (), except for the Economics subwebsite, where the virality patterns are statistically different with . This special behavior of the Economics dataset is ascribed to it being a small dataset with only 17 cascades. Surprisingly, the figure shows that the cascades of the decayed subwebsite Startups are less viral, with a mean of 0.27. This suggests that there should be another feature affecting the decay of the decayed subwebsites. In the following section, we will discuss this in more detail.
Figure 3 panels: (a) 100% of the cascades; (b) largest 50% of the cascades; (c) 100% of the cascades; (d) largest 50% of the cascades.
5.1.3. Maximum Degree of Cascade
Another pattern that we looked into is the maximum degree in a cascade. Figure 4 shows the normalized maximum degree in a cascade for different subwebsites. The visualization suggests that the decayed subwebsites Startups and Economics contain cascades of nodes with larger degrees than the alive subwebsites. The statistical analysis shows that the decayed subwebsites have a very similar distribution of the maximum degree in a cascade with . The decayed and the alive subwebsites are statistically different with . Once again, the Statistics subwebsite shows a different pattern. It is neither similar to any of the decayed subwebsites nor to any of the alive subwebsites, with .
Discussion Point 2. Inactivity decay is ascribed to a cascade’s node degrees, not to its virality.
Unexpectedly, the decayed subwebsites we examined had fewer viral cascades than the alive subwebsites. This led us to investigate the microproperties of the cascades rather than relying only on the macroproperties. We found that the cascades in the decayed subwebsites are less viral, but their nodes have larger degrees compared to those in the alive subwebsites. Additionally, we discovered that cascade initiators in decayed subwebsites have larger degrees in the cascade trees than noninitiators. This indicates that the expert members (who have larger degrees due to their activity and contribution) started the inactivity process, followed by nonexpert members. Having said that, one possible reason for the closure of the decayed subwebsites is the lack of activity from those members who should have sustained the community and kept it going until it reached the public version. On the other hand, the more viral cascades in the alive subwebsites, which also have a smaller number of nodes and contain nodes with smaller degrees than the decayed subwebsites, indicate that the effect of inactivity is limited. The reason for this is that the size of the cascades in the alive subwebsites is small, with initiators having smaller degrees, compared to decayed subwebsites. We conclude that expert members acted as obstruction points in the cascade trees, stopping the effect of inactivity cascades from being very disruptive.
5.1.4. Cascade Duration
Here, we provide the results for the analysis of cascade duration defined earlier in Section 3.1, (1). Figure 5 shows the cascade duration of different subwebsites. The normalized -axis reflects how long the cascade takes to be completed, i.e., until the last day of the observed time. The figure shows that the cascades in the decayed subwebsite Startups took noticeably less time to be completed, i.e., it had faster cascades. This is also clearly visible in Figure 5(a). The statistical analysis of cascade duration showed that every subwebsite has its own characteristics, with no common pattern identified ().
Discussion Point 3. Which subwebsite is going to decay next?
Although the Statistics subwebsite is alive and falls into the category of alive subwebsites based on the results described in Sections 5.1.1, 5.1.3, and 5.1.4, we discovered that the Statistics subwebsite inactivity patterns are closer to the patterns found in the decayed subwebsites than to those of the other alive subwebsites. Using the divergence measures described in Section 3.2, we found, strangely, that the Statistics subwebsite is closer to the decayed subwebsites in terms of cascade size, virality, maximum degree in a cascade, and cascade duration. We investigated this behavior and found that the Statistics subwebsite is the least active subwebsite among all Stack Exchange subwebsites with the fewest answered questions; that is, only 61% of the questions were answered (https://stackexchange.com/sites), whereas on other subwebsites, the answer rate is much higher, for example, reaching 93% and 97% on the Latex and Music subwebsites, respectively. This odd behavior, which was caught by our result, supports the effectiveness of the method we used. We think that the Statistics subwebsite may fall into a decay process if its activity level remains as low as it is.
5.1.5. Cascade Coreness
Here, we examine the coreness of the nodes in a cascade as a microscopic property of a cascade. We start by examining the coreness of an initiator. Figure 6(a) shows a comparison between the coreness of all noninitiator nodes in the initial network and the coreness of the initiators from all subwebsites as CCDF. The figure shows that the probability of having a given coreness value is larger for the initiators than for all nodes. This suggests that the coreness of the initiators is larger than that of the other nodes in the initial network. This was also statistically confirmed with . However, further examination provided different insights and patterns. We performed the same analysis for each of the subwebsites. For example, in Figure 6(b), there was no clearly different pattern for the subwebsite Startups, where the initiators have higher coreness for the coreness values [22, 27] but less coreness for the coreness values [20, 32]. For the other subwebsites in Figures 6(c), 6(d), and 6(e), the initiators have a clear pattern. They have more coreness than the other nodes in the corresponding initial network. An opposite pattern was found in the subwebsite Music (cf. Figure 6(f)).
The previous analysis only refers to the initiators. To understand the coreness in the temporal context, we define the following: a cascade path $P$ is a connected directed subgraph of a cascade $C$, where the maximum degree for all nodes of $P$ is 2, with no cycles. The coreness monotonicity of a cascade path is said to be increasing if the coreness of its nodes increases along the path, decreasing if it decreases along the path, and nonmonotone otherwise. If the nodes in a cascade path have the same coreness, then we consider it nonmonotone. All coreness values are calculated in the initial network. Based on that, we extracted cascade paths from all cascade trees where the first node in a path is the initiator of this cascade tree. Then, we examined the coreness monotonicity of these paths. The results are shown in Figure 7 and indicate that the coreness of the cascade paths is clearly different across different subwebsites. Moreover, the fraction of monotonically increasing and monotonically decreasing paths was nearly identical in some cases (see, for example, the Statistics and Music subwebsites). Also, in the case of decayed subwebsites (see the Startups subwebsite), the fraction of nonmonotone paths was larger than for either of the other two types.
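The path classification described above can be sketched as follows. Coreness is taken here to be the usual $k$-core number, computed by repeatedly peeling a minimum-degree node; the comparison between consecutive nodes is assumed to be non-strict (ties are allowed as long as the path is not constant), since the exact inequalities are not recoverable from the text. Both function names are hypothetical.

```python
def core_numbers(adjacency):
    """Core number of every node, via simple minimum-degree peeling.

    adjacency: dict mapping node -> set of neighbours (undirected graph).
    """
    degree = {u: len(nbrs) for u, nbrs in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        u = min(remaining, key=degree.get)
        k = max(k, degree[u])  # core values never decrease during peeling
        core[u] = k
        remaining.remove(u)
        for v in adjacency[u]:
            if v in remaining:
                degree[v] -= 1
    return core


def coreness_monotonicity(path_coreness):
    """Classify a cascade path from the coreness of its nodes, initiator first."""
    if len(set(path_coreness)) <= 1:
        return "nonmonotone"  # constant paths count as nonmonotone, as in the text
    pairs = list(zip(path_coreness, path_coreness[1:]))
    if all(a <= b for a, b in pairs):  # assumption: non-strict comparison
        return "increasing"
    if all(a >= b for a, b in pairs):
        return "decreasing"
    return "nonmonotone"


graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
core = core_numbers(graph)
print(coreness_monotonicity([core[n] for n in ["d", "a", "b"]]))
```

This quadratic-time peeling is adequate for small core-node sets; a bucket-based implementation would be preferable for large networks.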
Discussion Point 4. Coreness (and generally speaking, any single measure) alone does not control inactivity cascades.
In their work, Garcia et al.  posed the question of whether the decay starts from the interiors (nodes with high coreness) or from exteriors (nodes with low coreness). In their work, they argued that the decay of the Friendster social network started from exterior nodes. Later, Seki and Nakamura  presented a counter-argument, showing that the decay started from the interiors, and provided a model for understanding the decay process. Here, we argue that the answer to the question “Does the decay start from the interior or the exterior nodes?” is neither. The results of this work show no uniform pattern across different subwebsites that correlates with the direction and the coreness of the decay (cf. Figure 7). Furthermore, we argue that the question contains an implicit unsupported assumption, namely, that coreness alone controls the decay. We strongly believe that coreness alone cannot be used to understand the direction of decay dynamics if the direction really matters. In Section 5.1.5, we provided a formal framework defining the direction of the decay considering the temporal decay so that we can explicitly tell whether coreness alone can be used as an indicator for the direction of the decay. We found that the initiators of cascades contain opposing patterns in terms of whether their coreness is higher or smaller than the coreness of noninitiators. Additionally, we analyzed the coreness of the nodes in the cascade paths (coreness monotonicity) and found evidence that coreness is not correlated with the direction of the decay. Moreover, we performed an analysis using different measures, like degree and betweenness. We conclude that it is very hard to describe the decay process using only one measure. This is also clearly visible in the prediction results (cf. Figure 8) where the importance of the features used for predicting cascade size and virality was close. To further support our argument, we predicted cascade size and virality using only one feature.
In no case were the results better than when we predicted them using multiple features. We found the results of prediction using only one feature to be very close to the baseline predictor; for example, the MAE was 0.23, 0.23, 0.22, and 0.22 for predicting cascade virality using betweenness, degree, coreness, and min. cut, respectively. To sum up this point, we think that inactivity decay may be caused by network-independent factors, like privacy issues, competence between social network providers, and/or content quality. If any of these factors manifests itself, it renders the network measures unusable for describing inactivity decay.
Figure 8: Results for predicting (a) cascade size and (b) cascade virality.
5.1.6. Cascade Similarity
Using the similarity measure defined in (3), we calculated the similarity of each pair of cascades. Figure 9 shows a heat map of the cascade similarities for the different subwebsites. Figure 9(a) clearly shows less similarity between the cascades of the Startups subwebsite than the other panels in Figure 9 show for their subwebsites. We also observe that cascades with a smaller number of nodes tend to be more similar than those with a larger number of nodes. An exception is the Economics subwebsite, where cascades with a larger number of nodes are more similar than those with fewer nodes.
To get statistical confidence about the comparison, we used the statistics described in Section 3.2. We found that although all of the subwebsites exhibit different similarity patterns (), the decayed subwebsite Startups has the smallest average similarity with a value of 0.03, compared to 0.21, 0.16, 0.17, and 0.11 for the other subwebsites. This difference can easily be seen in Figure 10.
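The pairwise comparison described above can be illustrated with a minimal sketch. The similarity measure of Equation (3) is not restated in this section, so the sketch substitutes the Jaccard index on the node sets of two cascades purely as an assumption; the function names and the toy cascades are hypothetical and are not taken from the dataset.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two node sets (a stand-in, not Equation (3))."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty cascades
    return len(a & b) / len(union)


def average_pairwise_similarity(cascades):
    """Mean similarity over all unordered pairs of cascades."""
    pairs = list(combinations(cascades, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical cascades, each given as the set of member ids it contains.
overlapping = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}]
disjoint = [{1, 2}, {3, 4}, {5, 6}]

print("overlapping cascades:", average_pairwise_similarity(overlapping))
print("disjoint cascades:   ", average_pairwise_similarity(disjoint))
```

Under this stand-in measure, cascades that reuse the same members score high, while cascades that never share members score zero, which mirrors the contrast between high and low average similarity reported for the subwebsites.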
Discussion Point 5. Cascade similarity reflects how resilient a network was while it evolved.
The model we described for the extracted cascades in Section 3.1 allows different cascades to share nodes and/or edges, which means that we can measure the similarity of two cascades. If a subwebsite contains many similar cascades, then inactivity propagated along fewer distinct paths than in a subwebsite whose cascades are less similar. In other words, low cascade similarity indicates many decay propagation paths that are susceptible to inactivity, and, conversely, high cascade similarity indicates fewer such paths. Thus, cascade similarity can be seen as a measure of the resilience (or vulnerability) of a community in any future model or simulation of inactivity decay. Based on the results described in Section 5.1.6, it is apparent that the decayed subwebsites contain more nodes that are susceptible to inactivity than the alive subwebsites. The similarity of the cascades in the alive subwebsites is high, suggesting a lower number of cascade paths.
5.2. Prediction Results
In this section, we present the framework we designed for predicting some cascade features. We formalize the prediction problem as follows. We are given a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the set of input features of length $m$, $y_i$ is the target value to be predicted, and $n$ is the number of data points in the training set. The prediction problem is then defined as estimating a function $f(x_i) = \hat{y}_i$, where $\hat{y}_i$ is the predicted target value that is compared to the real target value $y_i$. Thus, the optimization problem is generally defined as $\min_{f} \sum_{i=1}^{n} L(y_i, \hat{y}_i)$, where $L$ is an arbitrary cost function. In this work, we used the mean absolute error (MAE) cost function, which is defined as $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$. To evaluate the performance of the model, we used data points that had not been used during training and compared the predictions for them with the true target values using the cost function. We used gradient boosting regression (GBR), which is basically an ensemble of decision trees with simple rules that is built over several iterations, where in each iteration a new decision tree is used to predict the residual of the previous prediction. GBR outperformed the other algorithms and techniques that we tested, such as logistic regression and classical decision trees; the technical details of the GBR algorithm are available in the cited reference. We used the scikit-learn Python library implementation of GBR.
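The residual-fitting idea behind gradient boosting can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions and not the scikit-learn implementation used in this work: it uses a single feature, one-split trees (stumps), and squared-error residuals, and all function names, the learning rate, the number of rounds, and the toy data are hypothetical.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_stump(xs, residuals):
    """Best single-threshold split of one feature, by squared error.

    Returns (threshold, left_value, right_value): inputs with x <= threshold
    are assigned left_value, all others right_value.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is identical, so no split is possible
        constant = sum(residuals) / len(residuals)
        return xs[0], constant, constant
    return best[1], best[2], best[3]


def fit_boosting(xs, ys, n_rounds=100, learning_rate=0.1):
    """Additive model of stumps, each fitted to the current residuals."""
    base = sum(ys) / len(ys)  # initial prediction: the training mean
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left, right = fit_stump(xs, residuals)
        stumps.append((threshold, left, right))
        preds = [
            p + learning_rate * (left if x <= threshold else right)
            for x, p in zip(xs, preds)
        ]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    total = model["base"]
    for threshold, left, right in model["stumps"]:
        total += model["learning_rate"] * (left if x <= threshold else right)
    return total


# Hypothetical toy data: one feature value and one target per cascade.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 2, 3, 3, 8, 9, 9, 10]

model = fit_boosting(xs, ys)
fitted = [predict(model, x) for x in xs]
baseline = [model["base"]] * len(ys)

print("mean-baseline MAE:", mae(ys, baseline))
print("boosted MAE:      ", mae(ys, fitted))
```

The errors printed here are measured on the training points only, so they illustrate the mechanics of boosting rather than the held-out evaluation described above.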
The features we used are shown in Table 1. We used these features to predict cascade size and cascade virality. We used only structural features of the network and did not use any temporal features in order to make the prediction more realistic, as temporal features of a network act as proxies for the predicted values, which weakens the applicability of the method. The features described in Table 1 have different effects on the prediction; thus, we performed feature ranking to gain insight into which features are more important for the prediction. Figure 11 shows the feature ranking for predicting cascade size and cascade virality. Figures 11(a) and 11(b) show that the importance of the features differs between the two tasks: for predicting cascade size, the average of the neighbors’ degrees was the most important feature, whereas coreness was the most important feature for predicting cascade virality. In both cases, degree and eccentricity were the least important features in the set. Based on that, we used the five best features from each ranked set. Other combinations of the features resulted in lower, but very close, prediction performance.
Figure 11: Feature ranking for predicting (a) cascade size and (b) cascade virality.
To perform a meaningful prediction, we combined the feature values of all subwebsites into one dataset. We then split this dataset into two subsets, with 75% (1002 cascades) for training and 25% (334 cascades) for testing. We used the MAE as the prediction accuracy measure. Because the split into training and test sets was random, we ran the prediction experiment 100 times to obtain statistically reliable results. Additionally, we compared the results to baseline predictors that use naive rules, such as predicting the mean, the median, or a constant value for the target. We report the comparison against the best baseline we obtained, which was the mean baseline. The MAE for predicting cascade size was 9.9, which is 35% better than the baseline predictor; that is, on average, the predicted cascade size deviates from the true size by about 10 nodes. The MAE for predicting cascade virality was 0.194, which is more than 25% better than the baseline predictor. Figure 8 shows the results of the 100 runs for predicting cascade size and cascade virality in (a) and (b), respectively. The figure shows that the GBR algorithm clearly outperforms the baseline predictor.
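The evaluation protocol above (repeated random 75%/25% splits, MAE, and comparison against a mean baseline) can be sketched as follows. This is an illustration under stated assumptions: the data are synthetic, the one-feature least-squares line is only a stand-in for the GBR model, and reading "X% better" as the relative reduction in MAE is our assumption about how the improvement is computed. All function names are hypothetical.

```python
import random
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def split_indices(n, test_fraction, rng):
    """Shuffle 0..n-1 and return (train_indices, test_indices)."""
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def linear_fit_predict(x_train, y_train, x_test):
    """Stand-in model (not GBR): a one-feature least-squares line."""
    mx, my = mean(x_train), mean(y_train)
    sxx = sum((x - mx) ** 2 for x in x_train)
    sxy = sum((x - mx) * (y - my) for x, y in zip(x_train, y_train))
    slope = sxy / sxx if sxx else 0.0
    return [my + slope * (x - mx) for x in x_test]


def relative_improvement(model_error, baseline_error):
    """Fraction by which the model error undercuts the baseline error."""
    return (baseline_error - model_error) / baseline_error


def repeated_evaluation(xs, ys, fit_predict, runs=100, test_fraction=0.25, seed=0):
    """Return per-run test MAE of the model and of the mean baseline."""
    rng = random.Random(seed)
    model_errors, baseline_errors = [], []
    for _ in range(runs):
        train, test = split_indices(len(xs), test_fraction, rng)
        x_train = [xs[i] for i in train]
        y_train = [ys[i] for i in train]
        x_test = [xs[i] for i in test]
        y_test = [ys[i] for i in test]
        model_errors.append(mae(y_test, fit_predict(x_train, y_train, x_test)))
        # Mean baseline: always predict the mean target of the training set.
        baseline_errors.append(mae(y_test, [mean(y_train)] * len(y_test)))
    return model_errors, baseline_errors


# Synthetic data, for illustration only (not the Stack Exchange cascades).
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [3 * x + data_rng.gauss(0, 1) for x in xs]

model_errors, baseline_errors = repeated_evaluation(xs, ys, linear_fit_predict)
print("mean model MAE:      ", mean(model_errors))
print("mean baseline MAE:   ", mean(baseline_errors))
print("relative improvement:",
      relative_improvement(mean(model_errors), mean(baseline_errors)))
```

With 1336 data points and a test fraction of 0.25, `split_indices` yields the 1002/334 split reported above; any other regressor can be passed in place of `linear_fit_predict` as long as it follows the same call signature.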
Discussion Point 6. For temporal networks, the early structure of the network encompasses sufficient information to predict the properties of potential decay cascades.
It was surprising that using only structural features of the network provided a satisfactory prediction of cascade virality and size. These results suggest that the early structure of an evolving network dictates its future. The prediction model described and evaluated in Section 5.2, which used no temporal information at all, indicates that the (in)activity dynamics of social networks are governed by the topological structure of the network itself.
6. Closing Thoughts
Although the method used in this work is reliable and the results have been validated, this work is subject to the following limitations. The networks used in this work were aggregated from different types of interactions on Stack Exchange subwebsites. This aggregation used the social interactions among the members of these subwebsites, and we assumed that the resulting network is a community. To make sure that the networks we used represent real temporal interaction among the users, we used different time frames to take a snapshot of each subwebsite. The reason for this is that each subwebsite has a different timespan; for example, the alive subwebsites are still active, unlike the decayed subwebsites, which have a significantly shorter lifespan. We believe that our design decisions for selecting the time frames have no significant effect on the results and the conclusions. Also, the results and conclusions in this work are valid for the Stack Exchange subwebsites and similar platforms. We did not check other types of social networks, nor did we aim to generalize the results to any type of social network. Nor did we provide a model (other than data fitting using the machine learning regression model we described in Section 5.2) for better understanding the decay of online social communities. Such a model might help to eventually control and prevent such decay. This gap remains open and requires future work.
The dataset and the code used for this work are available upon request.
This work is part of the ongoing PhD research of Mohammed Abufouda, supervised by Prof. Dr. Katharina A. Zweig.
Conflicts of Interest
The author is employed as a research fellow in the computer science department at the University of Kaiserslautern, Germany.
A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, “Group formation in large social networks: membership, growth, and evolution,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '06, pp. 44–54, Philadelphia, PA, USA, August 2006.
J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” in Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 668–677, San Francisco, California, USA, January 1998.
S. Milgram, “The small-world problem,” Psychology Today, vol. 1, no. 1, 1967.
J. L. Moreno, Who Shall Survive? Foundations of Sociometry, Group Psychotherapy and Sociodrama, 1953.
S. Asur, B. A. Huberman, G. Szabo, and C. Wang, Trends in Social Media: Persistence and Decay, 2011.
S. R. Kairam, D. J. Wang, and J. Leskovec, “The life and death of online groups: predicting group growth and longevity,” in Proceedings of the Fifth ACM International Conference on Web Search and Data Mining - WSDM '12, pp. 673–682, Seattle, Washington, USA, February 2012.
M. Abufouda, “Community aliveness: discovering interaction decay patterns in online social communities,” in 4th European Network Intelligence Conference, Lecture Notes on Social Networks, Springer International Publishing, 2017.
F. Zhang, Y. Zhang, L. Qin, W. Zhang, and X. Lin, “Finding critical users for social network engagement: the collapsed k-core problem,” in Thirty-First AAAI Conference on Artificial Intelligence, pp. 245–251, San Francisco, CA, USA, 2017.
F. D. Malliaros and M. Vazirgiannis, “To stay or not to stay: modeling engagement dynamics in social graphs,” in Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management - CIKM '13, pp. 469–478, San Francisco, California, USA, November 2013.
M. Abufouda and K. A. Zweig, “Stochastic modeling of the decay dynamics of online social networks,” in Complex Networks VIII, B. Gonçalves, R. Menezes, R. Sinatra, and V. Zlatic, Eds., pp. 119–131, Cham, Springer International Publishing, 2017.
F. Pedregosa, G. Varoquaux, A. Gramfort et al., “Scikit-learn: machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts, “Understanding variable importances in forests of randomized trees,” in Advances in Neural Information Processing Systems, pp. 431–439, 2013.