Quantitative Similarity Evaluation of Internet Social Network Entities Based on Supernetwork
Accurately characterizing the similarities of entities is the basis for detecting the virtual community structure of an Internet social network. This paper proposes a supernetwork-based approach to quantitative similarity evaluation among entities using two indices: friend relation and interest similarity. Supernetwork theory is first introduced to model the complex relationships of online social network entities by integrating three basic networks (entity, action, and interest) and establishing three kinds of mappings: from entity to action, from action to interest, and from entity to interest. The last is a hidden relation mined through the transfer characteristic of the two visible mappings. The similarity degree between two entities is then calculated by weighting the values of the two indices. Experiments show that this model not only provides a more realistic picture of the relations among individual users within an Internet social network but also builds a weighted social network, that is, a graph in which user entities are vertices and similarities are edges whose values record their similarity strength relative to one another.
1. Introduction
Many systems can be represented as complex networks or graphs, that is, collections of vertices joined in pairs by edges. Examples include the Internet and World Wide Web, citation networks, social networks, and biological networks. In particular, with the rapid development of the Internet, more and more people treat it as a place to express their views, vent their feelings, and communicate with each other. Interpersonal communication has changed greatly, moving from real social networks to virtual network communities, which provide open platforms for expression and communication so that strangers in the real world can share their ideas, form an influential group, and even lead network events. Thus, complex networks have attracted considerable attention in many fields as a representation of a variety of complex systems. Detecting community structure, which regards communities as groups of nodes within which edges are denser and between which edges are sparser, has become one of the hot research topics in the field of complex networks. Its main task is to divide a whole network into subgroups with strong similarity in order to analyze the dynamic evolution of a virtual community and to identify functional units such as collections of pages about a single topic on the Web or cycles in metabolic networks. Community structure detection has been widely used in many fields, such as crime investigation and web search.
In recent years, many kinds of algorithms have been put forward to uncover the features of a virtual community [1–5]. Existing mining methods are mainly based on three types of information: the topological structure of a network (such as clustering coefficient and betweenness), dynamic properties of the whole network, and graph partitioning using the spectral analysis theory of matrices. Great achievements have been made in the field of community structure detection of complex networks. However, most complex networks studied have been binary in nature, that is, the edges between vertices are either present or not. Such networks are represented by (0, 1) or binary matrices. They ignore that there may be stronger or weaker social ties between individual nodes. Moreover, most methods of analyzing nodes' connections take into account only a single factor, such as the friend relation of individuals or the hyperlinks of web pages. This hinders the mining of effective and realistic community structure. For example, two strangers may belong to the same group if they have a strong interest in the same topic. In fact, the similarity degree of two entities in a community depends on multiple factors or attributes, such as friend relation, hobby, and social status. Similarity analysis of vertices is the basis of mining communities in a complex network. Thus, designing a novel model that explores the similarities of nodes based on multiple factors is significant for detecting community structure.
This paper builds an Internet social network modeled as a graph in which user IDs or entities are vertices and their similarities are edges. For this network, we put forward a novel supernetwork-based model that characterizes network social groups by multiple factors. The model consists of three basic networks, two visible mappings, and one hidden mapping. The similarities of individuals in the defined social network are calculated by weighting the values of two indices: friend relation and interest similarity. This addresses the problem of obtaining intrinsic weight values, that is, connection strength values between any two user IDs in online social networks. A weighted social network is constructed so as to lay the foundation for community structure detection, that is, dividing a whole network into subgroups with strong similarity.
The remainder of this paper is organized as follows. In Section 2, we review related work on community structure detection in complex networks. Section 3 presents a new supernetwork-based relation model that characterizes network social groups by multiple factors. Section 4 gives its concrete algorithm implementation. Experimental analysis and discussion are described in Section 5. Finally, Section 6 concludes the paper and outlines future work.
2. Related Work
Researchers have found that many complex systems in various fields share significant topological properties, such as the small-world property, the scale-free property, and community structure, which have already attracted much attention. Community structure is of particular importance because it helps in understanding both the structure and the function of networks. So far, most work on detecting community structure has focused on two kinds of networks: computer-generated networks and real-world networks, including the Zachary club network, a jazz musician network, the C. elegans metabolic network, a university e-mail network, a football match network [3, 7, 8], and Web networks [9, 10]. This paper treats Internet social networks as the analysis target and builds weighted social networks from them. Because online social networks share many similarities with Web networks, such as running in the Internet environment and having virtual community structure, the rest of this section reviews related work on detecting the community structure of Web networks and online social networks, as well as methods of analyzing weighted networks.
Generally, the Web network is modeled as a graph in which Web pages are vertices and hyperlinks are edges. Gary William Flake et al. defined a Web community as a collection of Web pages in which each member page has more hyperlinks within the community than outside the community. They identified highly related Web pages based solely on connectivity, without the inherent bias of text-based approaches. The result could be applied to improved search engines, content filtering, and objective analysis of relationships within and between communities on the Web. Lefteris Moussiades et al. proposed a novel definition of a refined community for the community structure of a Web site. It requires that the number of links connecting a vertex to its community be higher than the number of links connecting the vertex to any other community, but not necessarily higher than the total number of links connecting the vertex outside its community. They also put forward a novel graph clustering algorithm to mine refined communities from Web sites. Thus, much attention has been paid to grouping highly related Web pages, which are identified through the hyperlinks of Web sites in Web networks.
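To make the connectivity-based definition of a Web community concrete, the following Python sketch checks whether every member page of a candidate set has more links inside the set than outside it. It is only an illustration under stated assumptions: links are treated as undirected, and the function name `is_web_community` and the toy pages are hypothetical rather than taken from the cited work.

```python
def is_web_community(members, links):
    """Return True if each member has more links inside the set than outside it."""
    members = set(members)
    for page in members:
        neighbors = links.get(page, set())
        inside = len(neighbors & members)   # links to other members
        outside = len(neighbors - members)  # links leaving the candidate set
        if inside <= outside:
            return False
    return True

# Hypothetical toy link structure (undirected adjacency sets).
links = {
    "p1": {"p2", "p3", "x"},
    "p2": {"p1", "p3"},
    "p3": {"p1", "p2", "x"},
    "x": {"p1", "p3", "y"},
    "y": {"x"},
}
tight = is_web_community({"p1", "p2", "p3"}, links)
loose = is_web_community({"p1", "x"}, links)
```

In this toy graph the set {p1, p2, p3} satisfies the condition, while {p1, x} does not, because p1 has two links leaving the set and only one inside it.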
In the field of extracting online social community structure, Alan Mislove et al. made the first study to examine multiple online social networking sites at scale, including Orkut, YouTube, and Flickr. The publicly accessible user links on each site were crawled to obtain a large portion of each social network's graph. The authors analyzed the linkages of more than 11 million users from virtual online communities and illustrated that online social networks have three characteristics: power law, small world, and scale-free [11, 12]. Feng Fu et al. analyzed the structure of the Sina blog community, a kind of social network, by calculating degree distribution, clustering coefficient, and average shortest path length. They also demonstrated that the blogging network has the small-world property and that the in- and out-degree distributions have power-law forms. Makoto Uchida et al. analyzed a weblog network in the Japanese blogosphere, in which weblogs are written by bloggers and reflect people's up-to-date interests. A community extraction method based on modularity was applied to the weblog network, in which individual posts are vertices and hyperlinks to or from other weblog posts are edges.
We found that most Web and online social networks studied in the current literature are unweighted networks. Ju Xiang et al. pointed out that in unweighted networks the strength of connections is neglected and only the network topology is retained. They put forward two new concepts: the intrinsic weight and the structural weight of edges. The former represents the strength of connections among various elements of networks. The latter is extracted from the topological structure of the network and is helpful for identifying inter- or intra-community edges. However, how to obtain the weight value of each edge in a weighted network is not mentioned. Newman analyzed weighted networks, in which connections have associated weights that record their strength relative to one another. He illustrated that weighted networks can be analyzed using a simple mapping from a weighted network to an unweighted multigraph. As in the work of Xiang et al., the method of determining the intrinsic weight values of edges is not presented.
This paper proposes a new way to evaluate the similarity level of online social entities and constructs weighted social networks in which user IDs are nodes and their similarities are edges. Many factors, such as social relation and interest, affect the similarity between two social entities. Thus, a formal model that combines multiple attributes is needed to determine the connection strength of nodes in online social networks and to evaluate the similarities of user entities comprehensively.
3. Supernetwork Based Relation Model within Internet Social Groups
Owing to limited space, this paper chooses two factors, friend relation and interest, to measure the similarities of user entities within Internet social groups. This section first introduces the basic concepts of supernetworks. Then the supernetwork-based model of entity relations in an online social group is presented.
3.1. Introduction of Supernetwork
A supernetwork is a compound network that is “above and beyond” existing networks, which consist of nodes and edges. It can link multitier networks according to many kinds of criteria and provides tools to study interrelated networks. It also allows for the application of efficient algorithms for computation. Supernetworks are conceptual in scope, graphical in perspective, and, with the accompanying theory, predictive in nature. The supernetwork model can be represented by a hypergraph, whose formal definition is as follows.
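Given a finite set $V=\{v_1, v_2, \dots, v_s\}$ and a set $\mathcal{E}=\{E_1, E_2, \dots, E_k\}$ which is a family of subsets of $V$, that is to say $E_i \subseteq V$ $(i=1,2,\dots,k)$, $H=(V,\mathcal{E})$ is a hypergraph of $V$ if $E_i \neq \emptyset$ $(i=1,2,\dots,k)$ and $\bigcup_{i=1}^{k} E_i = V$. The elements of $V$ and $\mathcal{E}$ are the nodes and edges of the hypergraph, respectively.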
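As a small illustration, the following Python sketch checks the two conditions usually required of a hypergraph: every hyperedge is a non-empty subset of the node set, and the hyperedges together cover the node set. The function name `is_hypergraph` and the toy data are hypothetical and serve only this example.

```python
def is_hypergraph(nodes, hyperedges):
    """Return True if (nodes, hyperedges) meets the usual hypergraph conditions."""
    nodes = set(nodes)
    edges = [set(e) for e in hyperedges]
    # Every hyperedge must be a non-empty subset of the node set.
    if any(len(e) == 0 or not e <= nodes for e in edges):
        return False
    # The union of all hyperedges must cover the whole node set.
    covered = set().union(*edges) if edges else set()
    return covered == nodes

# Hypothetical toy data: three nodes grouped by two overlapping hyperedges.
nodes = {"u1", "u2", "u3"}
ok = is_hypergraph(nodes, [{"u1", "u2"}, {"u2", "u3"}])
bad = is_hypergraph(nodes, [{"u1", "u2"}, set()])
```

Here `ok` holds because both hyperedges are non-empty and cover all three nodes, whereas `bad` fails on the empty hyperedge.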
Supernetworks have become a powerful tool of analyzing complex networked systems, such as knowledge network, research network, transportation network, and social network. Therefore, it is feasible to model the complicated relationships and further evaluate similarities of entities in Internet social groups based on supernetwork theory.
3.2. Model of Entities’ Relationship of an Online Social Group
This paper combines two attributes of virtual individuals, friend relation and interest similarity, to measure the similarities of different entities. We define three kinds of basic networks (entity network, action network, and interest network) and build two visible mappings, from entity to action and from action to interest, and one hidden mapping, from entity to interest. The supernetwork-based model proposed in this paper for the relationships among user entities is shown in Figure 1.
There are strong links among the three basic networks contained in the relationship model of online social users. Whether two user entities are friends can be identified directly by reading the friend attribute of an entity or judged from the number of communications between them. However, it is very difficult to determine directly whether two individual entities share a common interest. We can only observe the actions performed by entities, such as publishing posts, expressing opinions, and following homepages, to obtain their interest similarity degree indirectly. The hidden mapping from entity to interest can be obtained by computing the above supernetwork-based relation model.
Given m individual user nodes within one Internet social network, the formal description of supernetwork based relation model is shown in the following two sections.
3.2.1. Three Kinds of Basic Networks
Network 1. Entity
Represented by $G_E=(E, R_E)$, where $E=\{e_1, e_2, \dots, e_m\}$ is the finite set of entities and $R_E=\{r^E_{ij}\}$ is the set of edges between two nodes. Each component $e_i$ of $E$ may be one registered account ID or user name. The variable $m$ represents the number of user entities within a social network. The component $r^E_{ij}$ of $R_E$ represents the friend relation of two entities $e_i$ and $e_j$. If its value is 1, these two entities are friends and there is one edge between them in the constructed hypergraph. Otherwise, no friend relation exists between $e_i$ and $e_j$.
Network 2. Action
Denoted by $G_A=(A, R_A)$, where $A=\{a_1, a_2, \dots, a_n\}$ is the finite set of actions and $R_A=\{r^A_{ij}\}$ is the set of edges between two action nodes. The variable $n$ represents the number of all actions launched by entities. Each component $a_i$ of $A$ is one action performed by entities, such as expressing an opinion about some topic in a forum or participating in a network application provided by a community. Moreover, each network activity can reflect one kind of interest of the corresponding ID entity. The component $r^A_{ij}$ of $R_A$ represents a common interest reflected by two activities $a_i$ and $a_j$. If there is an edge between the two action nodes $a_i$ and $a_j$, its value is set to 1. Otherwise, its value is set to 0, which means that they do not reflect a common interest according to prior knowledge of actions and their corresponding interests.
Network 3. Interest
Expressed by $G_T=(T, R_T)$, where $T=\{t_1, t_2, \dots, t_p\}$ is the finite set of interests and $R_T=\{r^T_{ij}\}$ is the set of edges between two interests. The variable $p$ is the number of interests corresponding to all actions. Each component $t_i$ of $T$ denotes one kind of predefined interest. Each boolean component $r^T_{ij}$ of $R_T$ indicates whether there is a common interest core between $t_i$ and $t_j$, which reflects the main characteristics of the interests. If its value is 1, there is one common interest core between the two kinds of interests. Otherwise, no common interest core exists between them and its value is set to 0. Moreover, interests with the same interest core can be clustered into one hobby. This paper clusters all interests into a set of hobbies so as to lower the dimension of the relation matrix between entity and interest and thus simplify the correlation model proposed in this paper.
3.2.2. Three Kinds of Network Mappings
Mapping 1. From Entity to Action
Conveying all network activities performed by entities within a period of time. The mapping from an entity to its set of activities is written as $R_{EA}=(x_{ij})_{m\times n}$, where $x_{ij}=1$ signifies that entity $e_i$ performed action $a_j$.
Mapping 2. From Action to Interest
A visible mapping that represents all interests corresponding to any entity action. The mapping from a network action to its set of interests is written as $R_{AT}=(y_{jk})_{n\times p}$, where $y_{jk}=1$ denotes that action $a_j$ performed by an entity corresponds to interest $t_k$.
Mapping 3. From Entity to Interest
A hidden mapping that provides all interests owned by any network user node. The mapping from an entity to its interests is written as $R_{ET}=(w_{ik})_{m\times p}$, where $w_{ik}=1$ means that entity $e_i$ has a bias towards interest $t_k$. This mapping cannot be obtained directly from the raw Web dataset. It is derived through the transfer characteristic of the two visible network mappings, Mapping 1 and Mapping 2.
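The way the hidden mapping follows from the two visible ones can be sketched in a few lines of Python: an entity is linked to an interest whenever at least one of its actions corresponds to that interest. The dictionaries and the helper name `compose` are hypothetical and use made-up assignments, not data from the experiment.

```python
def compose(entity_to_actions, action_to_interests):
    """Derive entity -> interests by following entity -> action -> interest."""
    entity_to_interests = {}
    for entity, actions in entity_to_actions.items():
        interests = set()
        for action in actions:
            # Actions without a known interest contribute nothing.
            interests |= action_to_interests.get(action, set())
        entity_to_interests[entity] = interests
    return entity_to_interests

# Hypothetical visible mappings.
entity_to_actions = {"e1": {"a1", "a2"}, "e2": {"a3"}}
action_to_interests = {"a1": {"World Cup"}, "a2": {"TOEFL"}, "a3": {"World Cup"}}

hidden = compose(entity_to_actions, action_to_interests)
```

With these toy mappings, `e1` is linked to both interests and `e2` only to World Cup, even though neither link is stored explicitly anywhere.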
4. Algorithm of Similarity Evaluation among Social Entities
The supernetwork-based entity relation model proposed in this paper calculates the similarity degree of two entities by weighting the values of two indices: friend relation and interest similarity. In our model, the similarity matrix $S$ of user entities is defined by (1), where $F$ and $I$ are the index matrices of friend relation and interest similarity, respectively, and $\alpha$ and $\beta$ are weightings whose values are assigned by system administrators according to the importance assigned to each index:

$$S = \alpha F + \beta I. \tag{1}$$

If there are $m$ entities within an Internet social network, $S$, $F$, and $I$ are three matrices with the dimension $m \times m$. The following two sections describe how to determine the values of $F$ and $I$.
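A minimal sketch of this weighting step, assuming the combination is the element-wise linear sum of the friend relation matrix and the interest similarity matrix, might look as follows. The function name `combine`, the 3 x 3 toy matrices, and the zero diagonal are illustrative assumptions; the weights 0.3 and 0.7 are the values used later in the experiment.

```python
def combine(friend, interest, alpha, beta):
    """Element-wise weighted sum alpha*friend + beta*interest of two m x m matrices."""
    m = len(friend)
    return [[alpha * friend[i][j] + beta * interest[i][j] for j in range(m)]
            for i in range(m)]

# Hypothetical index matrices for three entities (diagonal entries are ignored here).
friend = [[0, 1, 0],
          [1, 0, 0],
          [0, 0, 0]]
interest = [[0.0, 0.9, 0.5],
            [0.9, 0.0, 0.2],
            [0.5, 0.2, 0.0]]

similarity = combine(friend, interest, alpha=0.3, beta=0.7)
```

In this toy case the friends in positions 0 and 1 obtain 0.3 + 0.7 × 0.9 = 0.93, while the strangers in positions 0 and 2 obtain 0.7 × 0.5 = 0.35.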
4.1. Approach to Quantify Friend Relation
For any element $f_{ij}$ of the friend relation matrix $F$, there are two possible values, 1 and 0, which indicate that the two entities $e_i$ and $e_j$ are friends or strangers, respectively. The friend relationship is determined in one of the following two situations.
(a) When information about the friend list of entity $e_i$ or entity $e_j$ is provided in the social network, $f_{ij}$ can be obtained directly by extracting the friend list. It is given as follows:

$$f_{ij} = \begin{cases} 1, & e_j \in FL_i, \\ 0, & \text{otherwise}, \end{cases}$$

where $FL_i$ is the set of $e_i$'s friends. If $e_j$ belongs to $FL_i$, $f_{ij}$ is set to 1. Otherwise, it is set to 0.
(b) When no information about the friend list of entity $e_i$ or $e_j$ can be obtained directly from the Internet social network, the value of the friend relation $f_{ij}$ is obtained from the number $N_{ij}$ of interactive activities between entity $e_i$ and entity $e_j$, which can be counted from posts published by $e_i$ and replied to by $e_j$ or published by $e_j$ and replied to by $e_i$. If $N_{ij}$ is larger than a predefined threshold $\theta$, these two entities are treated as friends and the value of $f_{ij}$ is set to 1. Otherwise, it is set to 0. It is given as follows:

$$f_{ij} = \begin{cases} 1, & N_{ij} > \theta, \\ 0, & \text{otherwise}. \end{cases}$$
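The two situations can be sketched as a single Python function. This is only an illustration under assumptions not stated in the paper: the friend relation is treated as symmetric, reply counts in both directions are summed, and the threshold value 3 is arbitrary. The name `friend_value` and the toy data are hypothetical.

```python
def friend_value(i, j, friend_lists=None, interactions=None, threshold=3):
    """Return 1 if entities i and j are treated as friends, else 0."""
    lists = friend_lists or {}
    if i in lists or j in lists:
        # Case (a): a friend list is available for at least one of the two entities.
        return int(j in lists.get(i, set()) or i in lists.get(j, set()))
    # Case (b): no friend list, so count reply interactions in both directions.
    counts = interactions or {}
    n_ij = counts.get((i, j), 0) + counts.get((j, i), 0)
    return int(n_ij > threshold)

# Hypothetical data: one known friend list and some reply counts.
friend_lists = {"e1": {"e2"}}
interactions = {("e3", "e4"): 3, ("e4", "e3"): 2}

a = friend_value("e1", "e2", friend_lists, interactions)  # case (a)
b = friend_value("e3", "e4", friend_lists, interactions)  # case (b), 5 replies
c = friend_value("e3", "e5", friend_lists, interactions)  # case (b), no replies
```

The threshold should be tuned to the activity level of the network, since a fixed value treats very active and rarely active users alike.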
4.2. Approach to Quantify Interest Similarity Degree
Cosine similarity is widely used to measure the similarity between two vectors by calculating the cosine of the angle between them. It is often used to compare documents in text mining, including clustering and classifying Web texts. The cosine equals 1 when the angle is 0 and is less than 1 for any other angle. Based on this fact, this paper proposes a method of calculating the interest similarity matrix of virtual individuals based on the cosine similarity of two vectors in the hidden relation matrix of entities and interests. First, the three visible relation matrices are multiplied in sequence to obtain the hidden mapping relation from entity to interest according to the proposed supernetwork model (see Figure 1). Then the interest similarity matrix among user ID nodes is acquired by calculating the cosine between each pair of row vectors of this hidden relation matrix. The details of the algorithm are as follows.
Input: network user entities set $E=\{e_1, \dots, e_m\}$, network actions set $A=\{a_1, \dots, a_n\}$, interests set $T=\{t_1, \dots, t_p\}$, hobbies set $H=\{h_1, \dots, h_q\}$.
Output: interest similarity matrix $I$.
Step 1. Constructing the activities set $A_i$ of $e_i$ according to the network actions performed by entity $e_i$.
Step 2. Establishing the interest set $T_j$ of action $a_j$ according to prior relation knowledge of network actions and their corresponding interests.
Step 3. Building the hobby set $H_k$ of interest $t_k$ based on prior knowledge of interests and their corresponding hobbies.
Step 4. Generating three relation matrices $R_{EA}$, $R_{AT}$, and $R_{TH}$ based on the vectors $A_i$, $T_j$, and $H_k$ by the supernetwork mapping rules, respectively. They represent three kinds of relations: between entity and action, between action and interest, and between interest and hobby. These three matrices are as follows:

$$R_{EA} = (x_{ij})_{m\times n}, \quad R_{AT} = (y_{jk})_{n\times p}, \quad R_{TH} = (z_{kl})_{p\times q}.$$

The components of the above three matrices are binary, that is, the value of each component is 0 or 1. If entity $e_i$ performed action $a_j$, the component $x_{ij}$ of matrix $R_{EA}$ is set to 1. Otherwise, it is set to 0. Similarly, the component $y_{jk}$ of $R_{AT}$ represents whether action $a_j$ corresponds to interest $t_k$. The component $z_{kl}$ of $R_{TH}$ signifies whether interest $t_k$ belongs to hobby $h_l$ or not.
Step 5. Calculating the relation matrix $R_{EH}$ of entities and interests, expressed at the hobby level, through the transfer characteristic of the network mapping relations. It can be obtained by the following equation:

$$R_{EH} = R_{EA} \times R_{AT} \times R_{TH}.$$

In matrix $R_{EH}=(r_{il})_{m\times q}$, the larger the value of $r_{il}$ is, the higher is the degree of interest owned by the node entity $e_i$ in hobby $h_l$.
Step 6. Computing the cosine of two row vectors $r_i$ and $r_j$ of $R_{EH}$ to get the similarity $I_{ij}$, which is one element of the interest similarity matrix $I$. The similarity degree of entity $e_i$ and entity $e_j$ is given as follows:

$$I_{ij} = \cos(r_i, r_j) = \frac{\sum_{l=1}^{q} r_{il}\, r_{jl}}{\sqrt{\sum_{l=1}^{q} r_{il}^{2}}\,\sqrt{\sum_{l=1}^{q} r_{jl}^{2}}}.$$

The larger the value of $I_{ij}$ is, the higher is the interest similarity degree of the two entities $e_i$ and $e_j$ in online social networks.
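The six steps can be traced end to end with a small pure-Python sketch. It assumes that the entity-to-hobby relation is the plain product of the three binary matrices and that an entity with an all-zero row has similarity 0 with everyone, neither of which is stated explicitly in the text. All names (`relation_matrix`, `matmul`, `cosine`) and the toy data are hypothetical.

```python
import math

def relation_matrix(rows, cols, related):
    """Binary matrix: entry [i][j] is 1 if cols[j] is in related[rows[i]], else 0."""
    return [[1 if c in related.get(r, set()) else 0 for c in cols] for r in rows]

def matmul(x, y):
    """Plain-Python matrix product of x (a x b) and y (b x c)."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def cosine(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: no recorded hobby means no measurable similarity
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

# Hypothetical toy data (not from the RenRen experiment).
entities = ["e1", "e2", "e3"]
actions = ["a1", "a2", "a3", "a4"]
interests = ["t1", "t2", "t3"]
hobbies = ["h1", "h2"]
performed = {"e1": {"a1", "a2"}, "e2": {"a2", "a3"}, "e3": {"a4"}}   # Step 1
reflects = {"a1": {"t1"}, "a2": {"t2"}, "a3": {"t2"}, "a4": {"t3"}}  # Step 2
belongs = {"t1": {"h1"}, "t2": {"h1"}, "t3": {"h2"}}                 # Step 3

# Step 4: binary relation matrices.
r_ea = relation_matrix(entities, actions, performed)
r_at = relation_matrix(actions, interests, reflects)
r_th = relation_matrix(interests, hobbies, belongs)

# Step 5: entity-to-hobby relation via the chained product.
r_eh = matmul(matmul(r_ea, r_at), r_th)

# Step 6: pairwise cosine similarity between entity rows.
interest_sim = [[cosine(u, v) for v in r_eh] for u in r_eh]
```

In this toy case `e1` and `e2` both concentrate on hobby `h1`, so their rows point in the same direction and their cosine is 1, while `e3` shares no hobby with them and its cosine with both is 0.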
5. Experiment and Analysis
5.1. Experiment Environment
Because of limited Internet access, this paper chose one real social network, RenRen, as the platform for verifying the validity of the proposed entity similarity relationship model [18, 19]. RenRen, which currently has about 338 million registered users, provides the largest social network service in China and adopts a real-name system for user registration. Its service is basically the same as Facebook's in terms of functions and features, so it can be regarded as China's version of Facebook. A registered user may easily find friends according to the educational information in the profile and add them to the friend list. The user can also update a personal status, publish blogs, upload or share photos, leave a message, or comment on a photo, blog, or status. In particular, there is an area of common web pages that provides user entities with an interactive platform through logs, music, and videos. According to the linkage structure, the area of common web pages is divided into three levels: domain, topic, and common web page. It involves seven domains: Company, Sports & Events, Famous Icon, Relaxation, Education & Science, Movie & Cartoon, and Media & Organization. Each domain consists of several topics. For example, the domain Sports & Events is further divided into World Cup, Asian Games, World Exposition, and so on. There is one common web page corresponding to each topic, where its creator shares photos or videos and publishes logs. Registered users who are interested in a topic usually leave comments on the photos, videos, or logs in the corresponding common web page. Such a page is then called a focused common web page of the users who express their ideas about it.
In the social network RenRen, the common web pages that users focus on mirror their interests well. To identify the actions, interests, and hobbies of an individual social user, this paper regards the user's comments on the common web pages as the user's action set. The topics corresponding to the focused pages and the domains to which the focused pages belong are regarded as the user's interests and hobbies, respectively.
5.2. Experiment Results and Analysis
The virtual network community RenRen provides people with a communication platform. Individual posts and comments posted by registered users reflect their concerns well. Strangers in the real world may form an influential group, and even lead to collective network events, as long as they have a strong interest in one topic. In contrast, friends who do not have common concerns may not form an influential group. Thus, this paper assigned the values 0.3 and 0.7 to the weightings $\alpha$ (friend relation) and $\beta$ (interest similarity), respectively, to reflect the different levels of importance assigned to each index.
We chose 12 registered network users as analysis objects because of the space limit. In real society, the 12 selected users, who come from different cities, are master's students at our university. Z. Y. Zhang is the student leader in our school, so almost everyone knows him. He likes to collect common homepages related to the interests of celebrity entrepreneur, celebrity model, astrology, video games, and athletic club. F. Liu, X. Y. Wang, S. X. Du, Y. Ding, and C. G. Liu work in the same laboratory, the network attack and defense lab. Both F. Liu and S. X. Du pay close attention to homepages about Famous Icon, Relaxation, Movie & Cartoon, Sports & Events, and Education & Science. T. Qiao, X. X. Wei, and Y. Wang analyze information content security and work in the content security lab. H. J. Jin and H. Li work in another lab, the security management lab. Coincidentally, both H. J. Jin and X. X. Wei follow web pages on the subjects of Famous Icon, Relaxation, and Sports & Events, such as World Cup, AC Milan, and YuShu earthquake. Only J. Zha does research on encryption and decryption, in the information security lab. Moreover, there is a tendency among young students to add their friends to the friend list of the RenRen social network, express their ideas about the events they focus on, and learn about the latest activities by reading the posts written by their friends.
Web crawler technology was adopted to obtain the source code of each user ID's homepage. From this source code we extracted the user's friend list, by taking the information between the tag “id” and the tag “vip”, and the user's focused common web pages. The friend relations in the virtual network world are shown in Figure 2. It can be seen from Figure 2 that entity Z. Y. Zhang is a friend of all other entities except S. X. Du, which is consistent with his wide social relations in real society. There are many lines among F. Liu, X. Y. Wang, S. X. Du, Y. Ding, and C. G. Liu. The three users T. Qiao, X. X. Wei, and Y. Wang are friends with each other. H. Li and H. J. Jin know each other well. Figure 2 thus reflects the real working relationships of the 12 registered users.
These 12 social network users focused on 36 common web pages, that is, 36 kinds of interests are involved in this experiment, such as TOEFL and World Cup. There are seven hobbies in the area of common web pages of the RenRen community: Company, Sports & Events, Famous Icon, Relaxation, Education & Science, Movie & Cartoon, and Media & Organization. Each kind of interest corresponds to one hobby. According to prior knowledge of actions, interests, and hobbies, we obtained the supernetwork relation model of the group of 12 social entities from the RenRen social network. It is shown in Figure 3, where triangles, squares, small dots, and large dots represent entities, actions, interests, and hobbies, respectively.
According to the algorithm of similarity evaluation among user entities introduced in Section 4, we further got relation mappings from entities to hobbies and their degrees, which are shown in Table 1 where the first column and first row represent node entities and user hobbies, respectively.
It can be seen from Table 1 that T. Qiao and X. Y. Wang have the highest degree of concern for the hobby Relaxation. The degree of concern of Y. Wang for the hobby Famous Icon is 5.5. By analyzing his focused common web pages in the sampled dataset, we found that he collected 8 web pages: AC Milan, Eason Chen, World Cup, Liu Yan, Lady GaGa, ZL Zhang, redemption, and YUSHU earthquake. Half of them, namely Eason Chen, Lady GaGa, ZL Zhang, and redemption, show that he is interested in musician, celebrity host, celebrity model, and celebrity actress, which belong to one hobby: Famous Icon. Table 1 also shows that the 12 network users have no concern for two hobbies: Media & Organization and Company. Consistently, our analysis of the sampled web pages found that none of the 12 registered users follows web pages related to these two hobbies.
Finally, the values of similarity degree of 12 user entities are shown in Table 2. Figure 4 shows the weighted network of social entities' similarities based on the values of friend relation and interest similarity.
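Turning a similarity matrix into a weighted network amounts to listing each pair of entities once, with its similarity as the edge weight. The sketch below shows one way to do this in Python. The function name `weighted_edges`, the optional cutoff `min_weight`, and the toy values are hypothetical and are not taken from Table 2.

```python
def weighted_edges(names, similarity, min_weight=0.0):
    """List (name_i, name_j, weight) for each unordered pair with weight > min_weight."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # visit each unordered pair once
            weight = similarity[i][j]
            if weight > min_weight:
                edges.append((names[i], names[j], weight))
    return edges

# Hypothetical symmetric similarity matrix for three users.
names = ["u1", "u2", "u3"]
similarity = [[1.0, 0.93, 0.35],
              [0.93, 1.0, 0.0],
              [0.35, 0.0, 1.0]]

edges = weighted_edges(names, similarity)
```

Pairs with zero similarity are dropped, so `u2` and `u3` are not connected in the resulting network.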
The entity similarities shown in Figure 4 illustrate the effect of the two factors, friend relation and interest similarity, on the comprehensive similarities of entities. The obtained results are explained as follows.
(1) The similarity degree between F. Liu and S. X. Du is 0.99. It is the highest among all similarity degrees because they are friends, share five hobbies (Famous Icon, Relaxation, Movie & Cartoon, Sports & Events, and Education & Science), and have a strong bias towards the first three. Most of the correlation degrees between entity Z. Y. Zhang and other entities are above 0.7 because of his wide friendships. It can be seen from Figure 2 that entity Z. Y. Zhang is friends with all other entities except S. X. Du. In contrast, each of the remaining 11 registered users has at least 6 strangers. Moreover, Z. Y. Zhang shares at least two hobbies, Famous Icon and Relaxation, with every other network user. It can be concluded that network users who are friends and share more hobbies have larger similarity degree values and may cluster into one powerful virtual community.
(2) By comparing Figures 2 and 4, we found that entity H. J. Jin knows only two entities: Z. Y. Zhang and H. Li. Although he does not know Y. Wang, C. G. Liu, and X. X. Wei, he has high similarity degrees with them, with values above 0.55, because they share many hobbies. The similarity degree between H. J. Jin and X. X. Wei is 0.59 because they share three hobbies: Famous Icon, Relaxation, and Sports & Events. This illustrates that two strangers with many common hobbies may contribute to the formation of a virtual community.
(3) Table 2 shows that Y. Ding and J. Zha have the smallest similarity degree, 0.2. Figure 2 shows that they are not friends. We also found from Table 1 that they have a weak bias towards two common hobbies: Famous Icon and Relaxation. The experimental results further indicate that two strangers who share few common hobbies would not form one virtual network group and have less effect on the tendency of popular network opinions.
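As a rough consistency check on the reported values, suppose the combined similarity $s$ of a pair is the weighted sum $s = 0.3 f + 0.7 c$, where $f \in \{0, 1\}$ is the friend index and $c$ is the cosine-based interest similarity (the linear form and the symbols $s$, $f$, and $c$ are assumptions made for this illustration). The interest similarity implied by each reported value can then be recovered. These implied values are derived here, not reported in the paper, and are approximate because the reported similarities are rounded.

For the friends F. Liu and S. X. Du ($f = 1$, $s = 0.99$):

$$c = \frac{0.99 - 0.3}{0.7} \approx 0.99.$$

For the strangers H. J. Jin and X. X. Wei ($f = 0$, $s = 0.59$):

$$c = \frac{0.59}{0.7} \approx 0.84.$$

For the strangers Y. Ding and J. Zha ($f = 0$, $s = 0.2$):

$$c = \frac{0.2}{0.7} \approx 0.29.$$

Under these weights, two strangers can reach a combined similarity of at most 0.7, so a value of 0.59 already corresponds to strongly aligned hobby profiles.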
6. Conclusion
We have proposed an approach to the quantitative similarity evaluation of Internet social users based on supernetwork theory, using two indices: friend relation and interest similarity. Three kinds of basic networks (entity, action, and interest) are integrated to describe the complex correlations among network users. The cosine similarity of two vectors is used to obtain the interest similarity degree between two network entities. Experimental results in a real network environment illustrate that the two indices of friend relation and interest similarity together describe the correlation degree of network nodes more effectively than previous approaches that use only one factor. The approach builds a weighted online social network, that is, a graph in which user entities are vertices and similarities are edges. Moreover, the weight value of each edge records the similarity strength of the two entities it connects relative to the others.
There is much more to be explored in the future. Our future work will focus on studying the method of detecting community structure of a weighted social network so as to understand both structure and function of online social networks.
This paper was supported by State Key Development Program of Basic Research of China (no. 2010CB731403/2010CB731406) and National Natural Science Foundation of China (no. 61071152).
Network Data, http://www-personal.umich.edu/~mejn/netdata.
A. E. Mislove, Online Social Networks: Measurement, Analysis, and Applications to Distributed Information Systems, Ph.D. thesis, RICE University, 2009.
M. Uchida, N. Shibata, and Y. Kajikawa, “Identifying the large-scale structure of the blogsphere,” Advances in Complex Systems, vol. 12, no. 2, pp. 207–219, 2009.
A. Nagurney, “Supernetworks: an introduction to the concept and its applications with a specific focus on knowledge supernetworks,” http://supernet.som.umass.edu/articles/uksupernetworks.pdf.
Cosine Similarity, http://en.wikipedia.org/wiki/Cosine_similarity.