Advances in Processing, Mining, and Learning Complex Data: From Foundations to Real-World Applications
Research Article | Open Access
Mining Community-Level Influence in Microblogging Network: A Case Study on Sina Weibo
Social influence analysis is important for many social network applications, including recommendation and cybersecurity analysis. We observe that the influence of a community comprising multiple users outweighs individual influence. Existing models focus on individual influence analysis, and few studies estimate community influence, which is ubiquitous in online social networks. A major challenge is that researchers need to take into account many factors, such as user influence, social trust, and user relationships, to model community-level influence. In this paper, aiming to assess community-level influence effectively and accurately, we formulate the problem of modeling community influence and construct a community-level influence analysis model. It first eliminates zombie fans and then calculates the user influence. Next, it calculates the user final influence by combining the user influence with the user’s willingness to diffuse theme information. Finally, it evaluates the community influence by jointly considering the user final influence, social trust, and relationship tightness between users inside the community. To handle real-world applications, we propose a community-level influence analysis algorithm called CIAA. Empirical studies on a real-world dataset from Sina Weibo demonstrate the superiority of the proposed model.
Community-level influence analysis is an emerging problem with applications in many fields, for example, recommendation systems [1, 2], public opinion prediction, and cybersecurity analysis. Many researchers are interested in analyzing social influence in social networks, but few assess influence at the community level. With the rapid spread of online social networks, such as Twitter, Facebook, and Sina Weibo, large amounts of real-world data are produced, which provide support for social influence analysis.
How to establish an effective model for analyzing community-level influence has become an important research question for online social networks. Community-level influence is greater than individual-level influence, but few researchers have studied it. Existing studies establish various social influence analysis models [6, 7], but they study influence only at the individual level and mostly ignore the common influence pattern of a community that includes multiple nodes. Many results have been obtained on individual-level influence, but most of the studies are based on static statistical methods [8–11], link analysis algorithms [12–14], or probabilistic models [15–17]. These studies do not consider whether a user is willing to receive or diffuse information, do not consider the role of social trust between users, and do not remove zombie fans. However, these factors are very important for analyzing social influence. Meanwhile, the existing works on community-level influence focus on the influence strength between communities and ignore the problem of measuring the influence of a community itself. For example, Belák et al. calculated the community-level influence simply by averaging the influence of all users in a community.
An important observation is that zombie fans contribute nothing to social influence, that the willingness of users to diffuse information affects the accuracy of calculating social influence, and that social trust plays an important role in social influence. The degree to which user A trusts user B determines the influence of user B on user A: the more user A trusts user B, the more influence user B has on user A. Because user influence is the basis of community influence, a small error in the former leads to errors in the latter.
Aiming to assess community-level influence effectively and accurately, we construct a community-level influence analysis model. Based on our model, we propose a community-level influence analysis algorithm (CIAA for short). The main idea of our model is as follows. First, we eliminate the interference of zombie fans with social influence to make the results more accurate. Then, in the process of calculating user influence, we consider social trust and use the random walk method. To evaluate the user’s theme information, the user’s mean willingness is calculated by exploring the content related to that theme information. We combine these two factors (the user influence and the user’s willingness to diffuse theme information) to calculate the user final influence. Finally, the community-level influence is calculated by jointly considering the user final influence, social trust, and relationship tightness between users inside the community. Experiments are conducted on a real-world dataset crawled from Sina Weibo. Compared with the state-of-the-art algorithm (the averaging user influence algorithm), the results show that our model evaluates community-level influence more effectively and accurately.
The contributions of this paper can be summarized as follows. We formulate the problem of analyzing the community-level influence and design a community-level influence analysis model. CIAA, a community-level influence analysis algorithm based on our model, is proposed, which is effective and reliable to evaluate the community influence of microbloggers from Sina Weibo. We conduct extensive experiments to assess the performance of the proposed model. Experimental results on the real-world dataset demonstrate the superiority of the proposed CIAA.
The rest of the paper is organized as follows. In Section 2, we summarize the related works. In Section 3, we propose the community-level influence analysis model and give an example to illustrate its working principle, and the CIAA is proposed. In Section 4, we conduct experiments on the real-world dataset crawled from Sina Weibo and then analyze the performance of the proposed approach. Finally, we state the conclusion and future work in Section 5.
2. Related Works
Since Katz and Lazarsfeld found in the 1950s that social influence plays an important role in social life and decision-making, researchers in the computer field have spared no effort to study the relevant problems. It has been found that popular users play an important role in the adoption of innovation, the propagation and guidance of public opinion, the formation and development of group behavior, and so on.
A great deal of research effort has been devoted to measuring individual-level influence [20, 21], typically that of “opinion leaders.” Existing methods can be categorized into three types: network structure based methods, user behavior based methods, and mutual information based methods. The network structure based methods include degree centrality, closeness centrality, betweenness centrality, eigenvector centrality, Katz centrality, PageRank, and clustering coefficient. Node degree essentially captures the connection between a node and its neighbors. Methods based on node degree express this meaning intuitively, and their computational cost is smaller than that of other methods. These methods are widely used to measure users’ influence in social networks. However, because they reflect only the connection between users and their neighbors, they capture local influence and cannot measure users’ influence in the entire social network. For example, based on the community scale-sensitive maxdegree, Hao et al. proposed an approach called CSSM for discovering influential users when placing advertisements. CSSM uses degree centrality and neighbors’ degrees to evaluate the influence of nodes (microbloggers). However, the algorithm does not consider the contribution of microblogs to user influence. Compared with degree-based methods, methods based on the shortest path (closeness centrality and betweenness centrality) can measure individual-level influence in the entire social network. Nevertheless, their computational complexity is higher than that of the degree centrality method. For example, based on text mining and social network analysis, Bodendorf and Kaiser proposed an approach to detect opinion leaders in a directed graph of user communication relationships. It can predict the tendency of network opinion leaders via closeness centrality and betweenness centrality.
Moreover, measuring individual-level influence by the shortest path assumes an ideal setting that is difficult to achieve in real-world application scenarios. Besides, methods based on random walks consider only the structural characteristics of a node while ignoring its behavioral characteristics. For example, Xiang et al. provided an understanding of PageRank and authority from an influence propagation perspective by performing random walks. However, they did not consider personal attributes in their understanding of PageRank or of the relationship between PageRank and social influence analysis. Zhu et al. proposed a novel information diffusion model called CTMC-ICM, which introduces continuous-time Markov chain theory into the Independent Cascade Model. Based on the model, they proposed a new ranking metric called SpreadRank. Based on a continuous-time Markov process, Li et al. proposed a dynamic information propagation model called IDM-CTMP to predict the influence dynamics of social network users. IDM-CTMP defined two other dynamic influence metrics and could predict the spreading coverage of a user within a given time period. Zhou et al. established new upper bounds to significantly reduce the number of Monte-Carlo simulations in greedy-based algorithms, especially at the initial step. Based on the bound, they proposed a new upper bound based lazy forward algorithm for discovering the top-$k$ influential nodes in social networks.
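The cost difference between degree-based and shortest-path-based measures discussed above can be seen in a minimal sketch. It uses only the standard textbook definitions of degree centrality and closeness centrality on an undirected graph; the toy graph and all function names are hypothetical and are not taken from any of the cited works. Degree centrality needs only each node's neighbor count, whereas closeness centrality runs one breadth-first search per node.

```python
from collections import deque


def degree_centrality(adj):
    """Degree of each node divided by (n - 1), for an undirected graph."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def closeness_centrality(adj):
    """(number of reachable nodes) / (sum of shortest-path distances)."""
    result = {}
    for source in adj:
        # One breadth-first search per source node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


# Hypothetical toy graph: the path a - b - c - d.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
deg = degree_centrality(toy)
clo = closeness_centrality(toy)
```

On this path graph the inner nodes score higher under both measures, but only the closeness computation has to explore the whole graph from every node.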
The aforementioned models focus only on assessing the social influence of single individuals. Only a small number of works attempt to build models for community influence analysis. Qi et al. applied degree centrality, closeness centrality, and betweenness centrality to groups and classes as well as individuals. Latora and Marchiori put forward a group information centrality to measure the importance of node sets. Mehmood et al. exploited information diffusion records to calculate the influence strength between different communities. Although these works make a preliminary study of community-level influence, none of them focuses on how to measure a community’s influence. Belák et al. assessed the community-level influence as the average of the influence of all users in the same community. Because the distribution of users’ influence is uneven across communities, an average based method is unfair to bigger communities, while a summation based method is unfair to smaller ones. At present, community-level influence analysis is still a challenging problem.
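The bias of the two simple aggregates can be illustrated with a small sketch. The influence scores below are hypothetical numbers chosen for illustration, not values from the paper's dataset; the larger community contains the same two strong users as the smaller one plus four weaker users.

```python
def average_influence(user_influences):
    """Community score as the mean of its users' influence."""
    return sum(user_influences) / len(user_influences)


def summed_influence(user_influences):
    """Community score as the sum of its users' influence."""
    return sum(user_influences)


# Hypothetical per-user influence scores.
small_community = [0.9, 0.8]
large_community = [0.9, 0.8, 0.3, 0.3, 0.3, 0.3]

avg_small = average_influence(small_community)
avg_large = average_influence(large_community)
sum_small = summed_influence(small_community)
sum_large = summed_influence(large_community)
```

Averaging ranks the large community below the small one even though it contains the same strong users, while summation ranks it above the small one largely because of its size.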
3. Proposed Methodology
We construct our model and implement the corresponding algorithm in this section. First, we give the related definitions in Section 3.1. Then, we propose the community-level influence analysis model for microbloggers. Next, we describe the working principle of our model via an example in Section 3.2. Finally, the community-level influence analysis algorithm is proposed in Section 3.3.
3.1. Related Definitions and Community-Level Influence Analysis Model
3.1.1. Related Definitions
Social networks and communities are described as follows. A typical social network can be represented as a bipartite graph $G = (V, E)$, where $V$ is a set of nodes (users) in the social network and $E$ is a set of edges used to describe the relationships between nodes. A community can be represented as a subgraph of a social network, that is, $C = (V_C, E_C)$ with $V_C \subseteq V$ and $E_C \subseteq E$, where $V_C$ is the set of users in the community and $E_C$ is the set of relationships between users within the community. A node is defined as a user within the community if he/she belongs to the community; otherwise, he/she is defined as a user outside the community. The set of users outside the community is written as UOC. Modeling and calculating the community influence of $C$ are the basis of our work, and the objective function of our model is as follows:
$$CI(C) = f(V_C, E_C)$$
$CI(C)$ denotes the community influence of the community $C$, and the function $f$ indicates that the assessment method is based on $V_C$ and $E_C$. There are two entities (i.e., users and communities) which can produce influence. To study the community-level influence, we give the related definitions as follows.
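As a minimal sketch of these definitions, the network can be held as a node set and an edge set, with a community obtained as the subgraph induced by its members. The names `V`, `E`, `V_C`, `E_C`, the helper `induced_subgraph`, the toy edges, and the reading of an edge `(u, v)` as "u follows v" are all assumptions made for illustration.

```python
def induced_subgraph(edges, members):
    """Return the edges of the network whose endpoints both lie in `members`."""
    members = set(members)
    return {(u, v) for (u, v) in edges if u in members and v in members}


# Hypothetical follower edges: (u, v) is read here as "u follows v".
V = {"a", "b", "c", "d", "e"}
E = {("a", "b"), ("b", "c"), ("c", "a"), ("d", "a"), ("e", "d")}

V_C = {"a", "b", "c"}            # users inside the community
E_C = induced_subgraph(E, V_C)   # relationships inside the community
UOC = V - V_C                    # users outside the community
```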
Trust. A node in a social network has a certain trust degree in other nodes according to its past contact with other nodes or the reputation of other nodes [39, 40]. According to the different sources of trust, we divide the trust into direct trust and indirect trust.
(1) Direct Trust (DT). Assume that the node $u$ is the entry node of the node $v$, indicating that there is contact between $u$ and $v$. According to the previous contacts and the reputation of $v$, $u$ will have direct trust in $v$.
(2) Indirect Trust (IT). Assume that the node $u$ is the reachable node of the node $v$; $u$ will have indirect trust in $v$ because the reputation of $v$ can be transmitted to $u$.
Users not only have mutual trust, but also mutually influence each other. According to the different sources of influence, this paper divides the influence into direct influence and indirect influence.
(1) Direct Influence (DI). Assume that the node $u$ is the entry node of the node $v$; $v$ will have an influence on $u$: that is, $v$ produces direct influence on $u$.
(2) Indirect Influence (II). Assume that the node $u$ is a reachable node of the node $v$; $v$ will have an influence on $u$ through transmission layer by layer: that is, $v$ produces indirect influence on $u$.
In order to assess the overall influence of $v$ on $u$, we define the user combined influence.
User Combined Influence (UCI). Because $u$ has direct trust or indirect trust in $v$, and $v$ has direct influence or indirect influence on $u$, we comprehensively combine the four factors to calculate the combined influence of $v$ on $u$.
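Whether the direct or the indirect variants of trust and influence apply to a pair of users depends only on graph structure. The sketch below classifies a pair under two stated assumptions: an entry node of a user is a node with an edge pointing to that user, and a reachable node is one connected to the user by a longer directed path. The function names and the toy edges are hypothetical.

```python
from collections import deque


def entry_nodes(edges, v):
    """Nodes with an edge pointing to v (assumed meaning of 'entry node')."""
    return {a for (a, b) in edges if b == v}


def shortest_path_length(edges, source, target):
    """Breadth-first hop count from source to target, or None if unreachable."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in successors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


def relation(edges, u, v):
    """Classify how u relates to v: 'direct', 'indirect', or 'none'."""
    d = shortest_path_length(edges, u, v)
    if d is None or d == 0:
        return "none"
    return "direct" if d == 1 else "indirect"


# Hypothetical edges: (a, b) is read as "a follows b".
edges = {("a", "b"), ("b", "c"), ("d", "a")}
```

For example, `relation(edges, "a", "b")` is a direct relationship, while `relation(edges, "d", "c")` is indirect because the only path runs through two intermediate users.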
(1) User Influence (UI). User influence refers to the influence of individual on other users.
(2) Community Influence (CI). Community influence is the overall influence of the community, which is formed by the of all the users in the community and the community’s self-factors.
Mean Willingness to Diffuse Theme Information. In communities, some users who receive theme information may not diffuse it, some users prefer to post their own blogs, and some users prefer to forward others’ blogs. We assess the community influence by taking into account the diffusion of information between users. The mean willingness represents a user’s willingness to diffuse the information of a blog. The theme information of a user is stored in a set whose $i$th element represents the user’s $i$th theme information. If a piece of theme information is diffused in a social network, a path graph is formed to describe its propagation path. We store the path graphs formed by the user’s theme information in a second set.
3.1.2. Model Framework
Our model consists of four modules: data preprocessing module, data source module, the user final influence module, and the community influence module. Figure 1 shows our model framework.
Data preprocessing module is used to eliminate zombie fans. We judge the zombie fans from the behavior dimension and time dimension. Behavior dimension is based on the amount of theme information posted by the user and the fans’ influence of the user. Time dimension is based on the user login frequency and the frequency of diffusing theme information. Finally, the data preprocessing results are stored to the data source.
Data source module is responsible for providing the relevant data needed for influence analysis. We establish the user information table, the microblog table, the user fans information table, and the user attention table to access the user’s relevant information efficiently.
The user final influence module first calculates the mean willingness to diffuse theme information for each user in a community and then calculates the user’s influence. Next, it combines these two results to get the user final influence.
The community influence module first calculates the community size, the tightness of user relationship, and the user-integrated influence in the community and then evaluates the community influence by integrating the three factors.
3.2. Working Principle
In this subsection, we introduce the working principle of each module in the model framework in detail. We assume that and are two users in community . After performing data preprocessing, Figure 2 shows the working principle, where the mathematical notations will be described in the following subsections in detail.
The working principle can be described as the following steps.
Step 1. Calculate the and of . Then calculate the of . Finally, calculate of .
Step 2. According to Step 1, calculate the and of .
Step 3. Integrate and to calculate the . Then calculate and . Finally, combine the three factors to calculate the community influence.
3.2.1. Data Preprocessing
In microblogging networks, some users with ulterior motives or business purposes produce zombie fans. Zombie fans are defined as fake fans that are generated and maintained mostly for economic purposes. Zombie fans inevitably interfere with the analysis of social influence. A small number of empirical studies have been conducted on recognizing zombie fans [41–43], and most of them targeted the Twitter platform.
Presently, researchers generally detect zombie fans based on the amount of attention, the number of fans, the frequencies of original and forwarded information, and other basic attributes. As zombie fans keep evolving, they will exhibit more features, and the existing feature-based methods for eliminating them may gradually fail. We observe that, because zombie fans are managed only occasionally, by software programs or by a few people behind them, they rarely post, seldom log in, or are no longer used at all, and their behaviors can differ vastly from those of ordinary users in profile information and content. Moreover, no matter how the features of zombie fans change, they can be split into a time dimension and a behavior dimension. Thus, it is reasonable to recognize zombie fans from the time dimension and the behavior dimension, and this approach adapts better to the needs of detecting zombie fans in microblogging networks.
According to expert knowledge criteria, in the time dimension, we assess zombie fans by the user login frequency and the diffusing advertisement frequency. Login frequency refers to the number of logins in a period. The lower the login frequency is, the higher the probability that the user is a zombie fan. The login frequency is calculated from the number of logins. The higher the diffusing advertisement frequency is, the higher the probability that the user is a zombie fan. The diffusing advertisement frequency is calculated from the number of advertisements diffused.
For the same reason, in the behavior dimension, we assess zombie fans by the amount of user theme information and the individual influence of the user’s fans. Thus, we take into account the number of pieces of user theme information, the number of users to whom the user pays attention, and the number of the user’s fans.
To ensure that the criteria for the parameters are reliable, the corresponding criteria are obtained from prior knowledge, expert knowledge, or experimental trial. For example, we place the users who fall in the lowest 10% by login frequency and whose login time interval is greater than 7 days into a set. To reduce the amount of calculation, we filter all users in a microblogging network. If a user has a certified user among his/her fans, the user is not considered a zombie fan. If a user does not have a certified user among his/her fans, the details of eliminating zombie fans are described in Algorithm 1.
Unlike classification and pattern recognition approaches, the proposed method for eliminating zombie fans requires neither labeled data nor a trained model. It is effective and easy to use in practice.
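The example criteria in this paragraph can be sketched as a simple pre-filter. This is not Algorithm 1, which is not reproduced in the text; it only illustrates the stated example (lowest 10% by login frequency, login time interval above 7 days, users with a certified fan exempted). The `User` fields, the function name, and the choice to rank over all users are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    login_frequency: float    # logins per period (hypothetical field)
    days_since_login: int     # login time interval in days (hypothetical field)
    has_certified_fan: bool   # True if any fan is a certified user


def zombie_candidates(users, bottom_fraction=0.10, max_interval_days=7):
    """Names of users meeting the example time-dimension criteria."""
    ranked = sorted(users, key=lambda u: u.login_frequency)
    cutoff = max(1, int(len(ranked) * bottom_fraction))
    lowest = {u.name for u in ranked[:cutoff]}
    return [
        u.name
        for u in users
        if not u.has_certified_fan          # certified fan: never a zombie
        and u.name in lowest                # lowest share by login frequency
        and u.days_since_login > max_interval_days
    ]


# Hypothetical users: u0 rarely logs in; the rest log in regularly.
users = [User("u0", 0.0, 30, False)] + [
    User(f"u{i}", float(i), 1, False) for i in range(1, 10)
]
candidates = zombie_candidates(users)
```

The thresholds are parameters because the text states that they come from prior knowledge, expert knowledge, or experimental trial rather than being fixed constants.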
3.2.2. The User Final Influence
The traditional models are simple and take into account neither the degree of social trust between users nor the user’s willingness to diffuse theme information. However, these two factors are important to the user final influence. In this paper, the user final influence is calculated by integrating the mean willingness to diffuse theme information and the user influence. Because the influence of a user on other users is related to the user’s willingness to exert his/her influence, the bigger the value of the mean willingness, the greater the probability that the user diffuses a piece of theme information. The user final influence is calculated as follows:
Mean Willingness to Diffuse Theme Information. The higher frequency of diffusing theme information means a higher user influence, because more users will know the user. Therefore, reflects the probability that a user has high-impact in a microblogging network. The parameter indicates the state of receiving theme information for the user as follows:
The initial value of is set to 0. Meanwhile, to know the result of diffusing the theme information , we observe . The parameter indicates whether diffuses the theme information that he/she received.
When the outdegree of a user in the path graph is greater than 0, the user has already diffused the theme information; otherwise, the user has never diffused the theme information. The number of users receiving theme information is written as NRTI and the number of users diffusing theme information is written as NDTI.
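The outdegree rule above can be sketched on a single propagation path graph. The sketch assumes that an edge `(u, v)` in the path graph means the theme information passed from `u` to `v`, and that NDTI counts only receivers that passed the information on; both conventions, the function name, and the toy path are assumptions rather than details confirmed by the text.

```python
def receive_and_diffuse_counts(path_edges):
    """Return (NRTI, NDTI) for one propagation path graph.

    A user has received the information if some edge points to him/her,
    and has diffused it if his/her outdegree in the path graph is above 0.
    """
    receivers = {v for (_, v) in path_edges}
    senders = {u for (u, _) in path_edges}
    nrti = len(receivers)
    ndti = len(receivers & senders)
    return nrti, ndti


# Hypothetical path graph: p posts; a and b receive; a forwards to c.
path = {("p", "a"), ("p", "b"), ("a", "c")}
nrti, ndti = receive_and_diffuse_counts(path)
```

Here three users receive the information and only one of them forwards it.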
is calculated aswhere . is the of . is the weight. represents the total number of theme information posts by . is the set of indegree nodes of . represents the weight of the user , which is determined by his/her outdegree. is the total number of . The initial value of is set as 1. We give an example for calculating in Figure 3.
Assume that the of all users initially are 1, , and then calculate the as follows.
(1) . From Figures 3(b)–3(d), we have . For , he/she posts two-theme information, which forms two theme information graphs in Figures 3(b) and 3(c). Thus, we get the set . From Figure 3(d), , NDTI = 0, because the outdegree of node is 0, and forms its one theme information graph. The is calculated as follows:
The User Influence. Users both influence and trust one another, and social trust plays an important role in calculating the user influence. A user is influenced by others, including users inside and outside the community.
(1) Calculating Direct Trust and Direct Influence. If is an entry node of , then will have direct trust on .where is the direct trust of on . is the reputation of user . is the set of entry nodes of , and is the reputation of the entry neighbor of . The value of depends on the average reputation of all ’s entry neighbors. For each node, we give the initial direct trust value 0.1. In Figure 3(a), we calculate the direct trust on from other nodes as follows:
has a direct influence on as follows:where is the direct influence of on . is the degree of interest of to . is the amount of the theme information from in the receiving theme information of .
(2) Indirect Trust and Indirect Influence. If is the reachable node of , then will have indirect trust on as follows:
is ’s indirect trust on . is the length of the shortest path from to .
In Figure 3(a), we calculate the indirect trust on gained from other nodes as follows:
has an indirect influence on as follows:
In Figure 3(a), we calculate the indirect influence of other nodes on as follows. The calculation of is the same as the above formula.
(3) User Combined Influence. Assuming that can reach through a path, we introduce the factor .
If is the entry node of , the combined influence of on isIf is not an entry node of node , but is a reachable node of , the combined influence is Assume . In Figure 3, we calculate the combined influence of other nodes on as follows. is the entry node of ; then we have . is the entry node ; then we have . is the entry node of ; then we have . is the entry node of ; then we have . is the reachable node of ; then we have .
(4) User Influence. User influence is got by combining all users’ influence:where SUCP represents a set of users that can reach through a certain path. For example, in Figure 3, the user influence of is calculated as follows:
When we get and , the user final influence can be calculated according to (4).
3.2.3. Community Influence
The community influence is composed of the users’ interaction inside and outside the community. In this paper, we consider it from three factors, that is, the user-integrated influence, the community size, and the degree of relationship tightness among users inside the community.
User-integrated influence () is integrated from the final influence of all users within the community.where is of the community . is the set of users inside community .
The community size () is important to the calculation of the community-level influence. The larger the number of users in a community is, the greater the influence of the community becomes. The formula is as follows:where represents the number of users in a community and represents the total number of users in the social network.
The degree of relationship tightness () represents the degree of closeness between users inside a community. We describe it from the user’s outdegree and indegree as follows:
Therefore, we calculate the as follows:where and () are used to distinguish the importance of different factors.
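Two of the three factors can be illustrated with a short sketch. The exact formulas are not recoverable from the text, so both functions are hypothetical stand-ins: community size is taken as the share of all network users that belong to the community, and relationship tightness as the share of edges touching the community that stay inside it. The final weighted combination is deliberately left out because its form is not given here.

```python
def community_size(members, all_users):
    """Assumed form: users in the community divided by users in the network."""
    return len(set(members)) / len(set(all_users))


def relationship_tightness(edges, members):
    """Assumed form: among edges touching the community, the share whose
    endpoints are both members (built from members' in- and out-edges)."""
    members = set(members)
    internal = 0
    total = 0
    for u, v in edges:
        if u in members or v in members:
            total += 1
            if u in members and v in members:
                internal += 1
    return internal / total if total else 0.0


# Hypothetical network and community.
all_users = {"a", "b", "c", "d", "e"}
edges = {("a", "b"), ("b", "c"), ("c", "a"), ("d", "a"), ("e", "d")}
members = {"a", "b", "c"}

size = community_size(members, all_users)
tightness = relationship_tightness(edges, members)
```

Both values lie between 0 and 1, which makes them convenient inputs to a weighted combination whose weights distinguish the importance of the factors.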
3.3. The Proposed Algorithm
According to the above description, we propose a community-level influence analysis algorithm, called CIAA, in a pseudo-code format in Algorithm 2. It can be seen from the algorithm that the total time complexity is . This means that our algorithm can be applied on large-scale social dataset.
4. Experiments
We conduct experiments on a real-world microblogging network to validate the effectiveness of the proposed approach. In this section, we describe the experimental setup and then discuss the experimental results.
4.1. Dataset
The real-world dataset in this paper is crawled from Sina Weibo using a Weibo crawler. Similar to a hybrid of Twitter and Facebook, Sina Weibo is one of the most popular sites in China. It is used by more than 33% of the Internet users in China, and its market penetration is equivalent to that of Twitter in the United States. According to figures released by Sina Weibo, as of June 2016, its active users, who come from different social and cultural backgrounds, had reached 282 million monthly and 86.8 million daily. Moreover, nearly 100 million new microblogs are posted every day. Users promote and disseminate views and attitudes on business, culture, education, and so forth. The crawled data includes 20,151,129 microblogs, 932,578,467 comments, and 9,218 users. In this paper, we selected more than 1000 users from the crawled dataset and divided the related information into Tables 1, 2, 3, and 4 as data sources according to our model framework. They are stored in txt-formatted files.
4.2. Experimental Setting
All experiments are conducted on a PC with an Intel Core i5 processor and 8 GB of RAM. According to prior knowledge, we set the parameters of the experiments as listed in Table 5.
4.3.1. Community Structure Analysis
In order to mine and study the characteristics of the community, we plot the outdegree distribution and the degree distribution of users in the community. In a directed social network, the indegree of a node is the number of fans of the user, and the outdegree of a node is the number of users to whom the user pays attention. Figure 4 shows the outdegree and degree distributions of the data sources.
As shown in Figure 4, the outdegree distribution and the degree distribution of Sina Weibo dataset follow the power-law distribution, which indicates that the social network composed of the dataset is a scale-free network.
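A common informal check for power-law behavior is that the degree distribution looks close to a straight line on log-log axes. The sketch below estimates the slope of that line by ordinary least squares. It is an illustrative check on a synthetic, hypothetical degree sequence, not the procedure used to produce Figure 4, and a straight log-log line is suggestive rather than a rigorous statistical test.

```python
import math
from collections import Counter


def degree_histogram(degrees):
    """Map each positive degree k to the fraction of nodes with that degree."""
    positive = [d for d in degrees if d > 0]
    counts = Counter(positive)
    return {k: c / len(positive) for k, c in counts.items()}


def loglog_slope(hist):
    """Least-squares slope of log P(k) against log k."""
    xs = [math.log(k) for k in hist]
    ys = [math.log(p) for p in hist.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical degree sequence whose counts are proportional to k^-2.
degrees = [1] * 400 + [2] * 100 + [4] * 25 + [5] * 16
slope = loglog_slope(degree_histogram(degrees))
```

For this synthetic sequence the fitted slope is approximately -2, which corresponds to a power-law exponent of 2.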
4.3.2. Eliminating Zombie Fans