Efficient Community Detection in Heterogeneous Social Networks
Community detection is important because it reveals network structure and supports many real-world applications such as recommendation systems. Heterogeneous social networks, which contain multiple social relations and various user generated content, make the community detection problem more complicated. In particular, social relations and user generated content are regarded as link information and content information, respectively. Since the two types of information indicate a common community structure from different perspectives, it is better to mine them jointly to improve detection accuracy. Some detection algorithms utilizing both link and content information have been developed. However, most works take the private community structure of a single data source as the common one, and some methods spend extra time transforming the content data into link data rather than mining it directly. In this paper, we propose a framework based on regularized joint nonnegative matrix factorization (RJNMF) that uses link and content information jointly to enhance community detection accuracy. In the framework, we develop joint NMF to analyze link and content information simultaneously and introduce regularization to obtain the common community structure directly. Experimental results on real-world datasets show the effectiveness of our method.
The past few years have witnessed the emergence and popularity of online social media. Millions of users participate in online social media such as Twitter and Facebook, forming many social networks. Social network analysis has consequently attracted enormous attention, and community detection is one of the fundamental tasks in this field. A community can be defined as a group of users that interact with each other more frequently than with those outside the group and are more similar to each other than to those outside the group. Research on community detection benefits a variety of real-world applications such as online marketing and recommendation systems.
Many community detection methods utilizing only one type of social relation have been developed [2–4]; however, one type of relation can only provide limited and incomplete information for detecting communities [1, 5]. In real-world social media, the networks are often heterogeneous, containing multiple types of social relations and user generated content [5–7], so combining the social relations and content information is a better strategy for community detection [1, 8–13]. For example, Facebook has different interactions among the same set of users: users can be friends with each other, and a user can also follow someone. These interactions among users are regarded as link information and modeled as graphs [6, 14, 15]. There is also user generated content on Facebook: users can post photos and share articles and other things they are interested in, and users are thereby related to certain items. The relations between users and content are regarded as content information [9, 14] and represented by a user-content matrix. Figure 1 is an example of a heterogeneous social network containing various link information and content information. Figure 1(a) shows three types of link information among the same set of users. Each type of relation provides part of the topological information of the whole social network. Figure 1(b) shows the content information: users are related to various items, and these relations reflect users’ intrinsic features. Based on the definition of community, both users interacting closely and users with high similarity tend to be in the same community. In heterogeneous social networks, different types of link information reveal the network topology from different perspectives, and various content information indicates user similarity. Furthermore, network topology and user similarity indicate a common community structure and provide complementary information from different views for mining it [9, 11, 13, 16].
Thus, researchers have turned to combining the two kinds of information to make community detection results more reliable. Note that this research rests on the assumption that different information indicates a common community structure [1, 8–12].
Community detection in heterogeneous networks faces several challenges. First, link data contain noise, and handling the noisy data to extract an accurate community structure is challenging. Moreover, the link information (represented by adjacency matrices) and the content information (represented by user-content matrices) are different types of data, and combining them efficiently and effectively is a complicated problem. Several methods have been proposed to detect communities in heterogeneous networks. Gu and Zhou propose a dual regularized coclustering method based on seminonnegative matrix factorization (semi-NMF) for community detection, where a feature graph is constructed based on user similarity. The clustering method can explore the geometric structure of both the link manifold and the content manifold by coclustering. Pei et al. propose a nonnegative matrix trifactorization (NMTF) based clustering framework, in which user similarity, message similarity, and user interaction are captured explicitly from user generated content to improve the accuracy of community detection. Hidru and Goldenberg develop a graph regularized multiview NMF-based method for data integration, in which the parameters are automatically learnt from data. Ni et al. discuss the case in which multiple clustering structures across different networks are allowed and construct a network of networks based on the domain similarity to regularize the clustering structure in different networks. Hu et al. develop a scalable method that can handle a large number of networks; it identifies recurrent patterns across multiple networks to discover biological modules. In the method, different networks are compressed into two metagraphs and clustering is performed on these two graphs. Lukk et al. construct a global gene expression map by integrating microarray data from different cell and tissue types, enabling further research. Tang et al. review four existing strategies for integrating multiple relational information to find the shared community structure: network integration, utility integration, feature integration, and partition integration. Ruan et al. first calculate the content similarity, combine link strength with content similarity to construct the final backbone graph, and then apply a single-view community detection algorithm to the backbone graph to derive the common structure.
Although many algorithms have been developed, community detection in heterogeneous networks remains an open issue. The existing methods have some shortcomings. The methods in [6, 13, 17, 18] are developed to integrate multiple link information for community detection and cannot deal with content information. The strategies in [8, 14] first turn the content information into link information and then process the constructed network. However, compared with processing the user-content matrix directly, this preprocessing adds extra time consumption. Besides, constructing a network from content information is usually based on k-nearest neighbor sets, and the best value of the parameter k has to be learnt. The methods in [1, 8] concentrate on dealing with one type of content and one type of link information, and it is hard to extend them to multiple social relations; besides, the constraints introduced in [1, 10] are so strict that the methods perform poorly on some datasets. Moreover, in most methods, the individual community structures of different views are detected simultaneously, and then one of the detection results is chosen as the common community structure [1, 9, 11, 13, 17]. Although community structures of different views are restricted to be similar, each individual view contains some private components, which make the detected individual result inconsistent with the common structure. Furthermore, which result to choose as the common structure is also an unsolved problem.
In this paper, we develop a community detection framework for heterogeneous social networks that derives the common community structure by utilizing multiple link information and multiple content information. In particular, we extract content information through matrix factorization rather than by constructing a similarity network, and we derive the common structure directly from the combination of multiple link information and content information rather than taking one of the individual results as the common structure. Specifically, we propose a regularized joint NMF (RJNMF) based community detection framework. Our contributions are summarized as follows:
(1) We investigate the community detection problem in heterogeneous networks and develop a framework to deal with multiple link information and content information simultaneously; the content information can be processed without being turned into link information, and the framework is simple and effective.
(2) We introduce regularization into the optimization function to control the similarity of community structures explored from different data sources; thus, the effects of noise can be reduced. Furthermore, the common community structure can be derived directly by obtaining the common community indicator from solving the optimization problem; although the individual result of each view can also be detected with the assistance of other views, we do not take the individual result as the common community structure.
(3) We carry out experiments on real-world datasets to evaluate the effectiveness of the proposed method.
The rest of this paper is organized as follows: Section 2 includes a brief review of related work. In Section 3, we introduce the basic problem definition and related notions; then we describe our method in detail. Experimental results on real-world data are presented in Section 4. Finally, some conclusions are given in Section 5.
2. Related Work
Community detection using link and content information has been studied for years. The methods are mainly classified into two categories. In the first category, the methods deal with the link and content information in the same way. Gu and Zhou proposed a regularized coclustering method based on semi-NMF to find community structures. In the method, the content information was turned into link information, and two types of regularization were introduced to explore geometric structures, requiring the cluster labels of data points to be smooth with respect to the link manifold and the cluster labels of features to be smooth with respect to the content manifold. However, the method can only deal with one type of link information. He et al. extended NMF for multiview clustering by jointly factorizing the multiple matrices through coregularization and proposed a coregularized NMF framework for combining multiple content information without turning content information into link information. In the framework, pairwise regularization and cluster-wise regularization were developed to enforce similarity across different views. Pei et al. proposed a clustering framework integrating NMTF with three types of graph regularization. In the framework, user similarity, message similarity, and user interaction were captured to contribute to community detection. However, the constraints put on structures of different views were strict, resulting in poor performance on some datasets. Tang et al. presented a joint NMF optimization framework to integrate multiple views. The components were separately constrained with a sparsity-promoting norm regularization or Frobenius norm regularization to ensure sparsity and accuracy. However, the structures derived from multiple data sources were not restricted to be similar. Cheng et al. discussed the multiview clustering problem in a complex scenario where users in different domains might not match and proposed a robust framework which allowed partial mapping and could handle graphs of different sizes. Cheng et al. mined the common structure across multiple views by concatenating the low-rank matrices; in particular, they sought sparsity-consistent low-rank affinities from the joint decomposition of multiple feature matrices into pairs of sparse and low-rank matrices, and the noisy data was removed by introducing a norm penalty. However, the method, which was developed for image segmentation, could not be used for social networks because link data is neither sparse nor low-rank. Guo et al. enhanced codetection by extracting a shared low-rank representation of the object instances in multiple feature spaces. The representation was based on a linear reconstruction over the entire data set, and the low-rank approach enabled effective removal of noisy and outlier samples. The extracted low-rank representation could be used to detect the target objects by spectral clustering. However, the method performs well only on content information; its performance on link information is poor. Xia et al. proposed a shared transition probability matrix of the multiple views to conduct spectral clustering. They first constructed a transition probability matrix from each single view and then used these matrices to recover a shared low-rank transition probability matrix as a crucial input to the standard Markov chain method for clustering. Deng et al. established a framework to capture both common components of all the views and private components of individual views. They decomposed an input data matrix concatenated from multiple views as the sum of low-rank, sparse, and noisy parts. Then a unified optimization framework was established, where the low-rankness and group-structured sparsity constraints were imposed to simultaneously capture the shared and private components at both instance and view levels. However, because link data is neither low-rank nor sparse, the methods in [19, 22] perform poorly on link data. Nguyen et al. discussed community detection in multiplex social networks in which a user can have multiple accounts. They developed a unifying approach which aggregated multiple accounts of the same users and a coupling approach which used coupling techniques to find a consistent community structure. However, the method only considered the relations between users while neglecting the content information. Mahmood and Small developed a community detection algorithm which was fundamentally different from most existing methods based on graph theory. The algorithm was based on the fact that each network community spanned a different subspace in the geodesic space, and each node was efficiently represented as a linear combination of nodes spanning the same subspace. Sparse linear coding with a norm constraint was used to make the detection process more robust. The algorithm showed excellent performance on both benchmark and real-world networks. It is promising to extend this method to detect communities in heterogeneous networks. Guesmi et al. proposed a method to deal with multiple types of objects and relationships derived from bibliographic networks. The approach first constructed the Relation Context Family (RCF) to represent the different objects and relations using relational concept analysis methods and then explored the RCF for community detection.
In the other category, the methods turn the content information into link information according to some rules. Ruan et al. analyzed the community signal strength between nodes in the social network by fusing the link strength with content similarity. Content similarity was first estimated through cosine similarity or the Jaccard coefficient; then the final network based on link information and similarity was constructed, and a single-view community detection algorithm was used to analyze the network. Greene and Cunningham proposed producing a single unified graph based on the combination of k-nearest neighbor sets for users. For content information, the neighbors were obtained by estimating content similarity. However, the complexity of this construction depends on the number of nodes. Compared with processing the user-content matrices directly to derive the community structure, turning content information into link information adds extra time consumption.
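As a rough illustration of this second category, the sketch below turns a user-content matrix into a k-nearest neighbor graph using cosine similarity. It is a minimal sketch under stated assumptions, not the construction of any cited method: the function name `knn_graph_from_content`, the dense NumPy representation, and the symmetrization by union of neighbor sets are all choices made here for illustration.

```python
import numpy as np

def knn_graph_from_content(X, k):
    """Build an undirected k-nearest-neighbor graph from a user-content matrix.

    X is (n_users, n_items); similarity is cosine similarity between rows.
    Hypothetical helper: the name and the union-based symmetrization are assumptions.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)      # unit-length rows
    S = Xn @ Xn.T                          # dense n x n similarity matrix
    np.fill_diagonal(S, -np.inf)           # a user is not its own neighbor
    nbrs = np.argsort(-S, axis=1)[:, :k]   # indices of the k most similar users
    n = X.shape[0]
    A = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    A[rows, nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)              # keep an edge if either side selected it
```

This naive version materializes the full pairwise similarity matrix before any community detection starts, which is the kind of preprocessing overhead that direct factorization of the user-content matrix avoids.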
Ensemble methods have also been proposed. Zheng et al. proposed an NMF-based ensemble clustering method which combined multiple clustering results into a single consolidated partition. In the method, community detection was implemented on each data source separately; then the detection results were integrated with different weights to capture the common community structure. Liu et al. proposed spectral ensemble clustering, which employed spectral clustering on the coassociation matrix to find the consensus partition. The method could deal with large-scale datasets and performed well; moreover, it was robust to incomplete basic partitions with many missing values.
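To make the coassociation matrix concrete, the following minimal sketch computes it from several basic partitions: entry (i, j) is the fraction of partitions that place nodes i and j in the same cluster. The function name `coassociation` is hypothetical, the partitions are assumed to be complete (no missing labels), and the subsequent spectral clustering step is omitted.

```python
import numpy as np

def coassociation(partitions):
    """Fraction of basic partitions in which each pair of nodes shares a cluster.

    partitions: sequence of r label vectors, each of length n (assumed complete).
    Returns an (n, n) symmetric matrix with ones on the diagonal.
    """
    P = np.asarray(partitions)                    # shape (r, n)
    same = P[:, :, None] == P[:, None, :]         # shape (r, n, n), boolean
    return same.mean(axis=0)
```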
In this paper, we develop our method based on NMF. NMF is a popular factorization method in which all the elements involved are restricted to be nonnegative; it is widely used to model social networks and cluster users into communities [8, 23, 29], and it has been extended to multiview clustering [1, 8–11]. Its core idea is to approximate a higher dimensional matrix with nonnegative lower dimensional matrices. The factorization approximates the data matrix by the product of two nonnegative factors; in social network analysis, one of the factors is regarded as the community indicator matrix, which encodes community membership [23, 29]. Here, we extend NMF with proper regularization to integrate link and content information for community detection and derive the common community structure by obtaining the common community indicator from solving the optimization problem.
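For readers unfamiliar with NMF, the sketch below shows plain two-factor NMF with the classical multiplicative updates for the squared Frobenius error, followed by a row-wise argmax to read off communities. This is standard NMF, not RJNMF; the names `nmf`, `U`, and `V`, the random initialization, and the small constant `eps` are assumptions made for illustration.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix X (n x m) by U @ V.T with U, V >= 0.

    U (n x k) plays the role of the community indicator matrix,
    V (m x k) the role of the basis matrix. Names are illustrative.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.random((n, k))
    V = rng.random((m, k))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V

def memberships(U):
    """Assign each node (row) to the community (column) with the largest weight."""
    return np.argmax(U, axis=1)
```

On a small block-structured adjacency matrix, for instance `np.kron(np.eye(2), np.ones((3, 3)))`, calling `memberships(nmf(X, 2)[0])` groups the first three and the last three nodes separately.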
3. Proposed Framework
In this section, we first introduce symbols used in the paper and then propose the framework for detecting common communities directly utilizing link and content information.
3.1. Problem Statement
The main notations used in this paper are listed in Table 1. Given a set of nodes in the heterogeneous network, link information is represented by adjacency matrices, which describe the social relations between users, and content information is represented by user-content matrices, which describe the relations between users and content. Each link view contributes one adjacency matrix and each content view contributes one user-content matrix; the numbers of adjacency and user-content matrices may differ, and the nodes are grouped into a given number of communities. Each adjacency matrix and each user-content matrix is decomposed into its own individual community indicator matrix. A separate common community indicator matrix is used to derive the common community structure in the heterogeneous network. An indicator matrix represents the membership of each node with respect to the communities [23, 29]: each entry shows the degree to which the node of that row belongs to the community of that column. Normally, a node is assigned to the community whose entry is the maximum value in the node's row.
With the notations given in Table 1, the community detection problem in this paper is formally defined below.
Problem 1. Given the adjacency matrices and the user-content matrices, find communities that have maximal links within each community and minimal links across communities in terms of the link-based views, and whose members are more similar to each other than to those outside the community in terms of the content-based views.
We first investigate how to combine the multiple matrices jointly for community detection. NMF has been shown to be useful in clustering problems [8, 29]. The 2-factor NMF approximates a nonnegative data matrix by the product of two nonnegative factor matrices, minimizing the reconstruction error. In [1, 9, 13], NMF was extended to multiview clustering by establishing a joint optimization model: given several data sources, the joint NMF minimizes a weighted sum of the reconstruction errors of the individual sources, where a weight parameter controls the contribution of each data source.
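In symbols, the two models just described can be sketched as follows. The notation is assumed for illustration and is not taken from Table 1: X denotes a nonnegative data matrix, U the indicator factor, V the basis factor, and λ_i the weight of the i-th of m data sources.

```latex
% Two-factor NMF (assumed notation)
\min_{U \ge 0,\; V \ge 0} \; \lVert X - U V^{\top} \rVert_F^2

% Joint NMF over m data sources with assumed weights \lambda_i
\min_{U^{(i)} \ge 0,\; V^{(i)} \ge 0} \;
  \sum_{i=1}^{m} \lambda_i \, \lVert X^{(i)} - U^{(i)} {V^{(i)}}^{\top} \rVert_F^2
```

Without further coupling, the second problem separates into m independent NMF problems, which is why a regularization term linking the indicator factors is needed.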
Since the multiple adjacency matrices and user-content matrices indicate a common underlying structure, similar indicator matrices are expected to be learnt. By introducing different regularization, various NMF-based integration methods have been developed. In one such method, similarity constraints are imposed on each pair of indicator matrices; specifically, the regularization penalizes the disagreement between every pair of indicator matrices, with a separate weight parameter for each view pair. Under this formulation, community detection in heterogeneous networks is a joint optimization problem, where the individual community indicators of multiple link and content data are learnt simultaneously. However, the above methods only restrict the individual indicator matrices of different views to be similar and fail to find the common indicator across multiple data. The popular approach is to choose one of the individual detection results as the common community structure [1, 9, 11, 13, 17]; however, because each view contains some private components, the individual detection results are not completely consistent with the common structure. Besides, the detection results of different views also differ from each other, and how to choose the best result is an unsolved problem.
3.2. Proposed Method
In this subsection, the proposed community detection framework based on regularized joint NMF is presented. Given a set of nodes in the network, the link and content data are represented by the adjacency matrices and the user-content matrices, respectively. The joint optimization problem sums the weighted factorization errors of all link views and all content views, where two sets of parameters control the contributions of the different data sources. In particular, the adjacency matrices can be weighted and symmetric. For directed networks, the graph is first symmetrized.
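A directed adjacency matrix can be symmetrized in several ways. The sketch below averages the matrix with its transpose, which is one common choice and is an assumption here; the paper's own symmetrization formula may differ. The function name `symmetrize` is hypothetical.

```python
import numpy as np

def symmetrize(A):
    """Return an undirected version of a directed, possibly weighted adjacency matrix.

    Assumption: edge weights in the two directions are averaged.
    """
    A = np.asarray(A, dtype=float)
    return (A + A.T) / 2.0
```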
Based on the joint optimization framework developed above, we deal with the link and content information simultaneously. Each column vector of an indicator matrix represents a cluster. When the column vectors are normalized under a vector-based norm, each entry of the product of the transposed indicator matrix with the indicator matrix gives the cosine similarity between two clusters; this product can therefore be interpreted as the cluster similarity matrix. In order to reduce the effects of noise and obtain the common community indicator, the individual indicators are constrained to be similar to the common indicator. Here, we introduce a consensus similarity regularization, which minimizes the disagreement between the cluster similarity matrices of the individual indicators and that of the common community indicator. Incorporating the consensus similarity regularization into the joint NMF process gives the proposed objective, formula (6), in which two parameters control the weights of the regularization. In the framework, we obtain the common community indicator by solving the optimization problem (6); the common community structure in the heterogeneous network is then derived directly.
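The cluster similarity matrix and the consensus penalty described above can be sketched as follows. This is an illustrative reading, assuming columns are normalized with the Euclidean norm and the disagreement is measured by a squared Frobenius norm; the exact form and weighting in formula (6) may differ. The function names are hypothetical.

```python
import numpy as np

def cluster_similarity(U, eps=1e-12):
    """k x k matrix of cosine similarities between the columns (clusters) of U."""
    norms = np.linalg.norm(U, axis=0, keepdims=True)
    Un = U / np.maximum(norms, eps)        # unit-length columns
    return Un.T @ Un

def consensus_penalty(indicators, H):
    """Sum over views of the squared Frobenius gap between each view's
    cluster similarity matrix and that of the common indicator H (assumed form)."""
    S_common = cluster_similarity(H)
    return sum(
        np.linalg.norm(cluster_similarity(U) - S_common, "fro") ** 2
        for U in indicators
    )
```

Because the comparison is made between k x k similarity matrices rather than between the indicator matrices themselves, the penalty is insensitive to the scale of each column, so views with very different magnitudes can still agree with the common indicator.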
Since the objective function in formula (6) is not jointly convex in all variables, we seek a solution with an iterative updating algorithm that updates each variable while the others are fixed. To enforce the nonnegativity constraints, we introduce a Lagrange multiplier matrix for each nonnegativity constraint, form the Lagrangian function, and take its derivatives with respect to each factor matrix.
Applying the KKT conditions to each factor matrix and solving the resulting equations, we derive the following update rules.
The four groups of variables are updated in turn according to formulas (13), (14), (15), and (16); in each step, one group is updated while the other variables are fixed.
With the above updating rules, the optimization algorithm is presented in Algorithm 1. Note that the iterative process stops when the cluster matrices converge or the number of iterations reaches a given threshold. The community detection results are then obtained from the common community indicator by finding the max value in each row.
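The stopping rule and the final assignment step can be sketched as below. The relative-change criterion, the tolerance value, and the function names are assumptions made for illustration; Algorithm 1 may use a different convergence test.

```python
import numpy as np

def has_converged(H_new, H_old, tol=1e-4):
    """True when the relative change of the indicator matrix falls below tol."""
    change = np.linalg.norm(H_new - H_old, "fro")
    scale = max(np.linalg.norm(H_old, "fro"), 1e-12)
    return change <= tol * scale

def communities_from_indicator(H):
    """Assign each node (row of H) to the community with the largest entry."""
    return np.argmax(H, axis=1)
```

In an iterative solver, `has_converged` would be evaluated after each full sweep of updates, together with a cap on the number of iterations.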
3.4. Theoretical Analysis
3.4.1. Correctness Analysis
In this subsection, we prove the correctness of the updating rules according to KKT condition. Taking formula (13) as an example, we show that the solution is a KKT fixed point.
From the derivative of the Lagrangian in formula (9) and the corresponding KKT condition, the fixed point must satisfy a stationarity equation, formula (17), at convergence. At convergence, the updated variable equals its value at the previous iteration, so the solution of formula (13) yields formula (18).
It is obvious that formula (18) is identical to formula (17); thus, the solution of formula (13) satisfies formula (17); that is, the solution satisfies the KKT condition. Similarly, the correctness of the other updating rules can be proved.
3.4.2. Convergence Analysis
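Definition 2. A function of two arguments is an auxiliary function for an objective function if it is never smaller than the objective evaluated at its first argument and equals the objective when its two arguments coincide.
Lemma 3. If such an auxiliary function exists, then the objective function is nonincreasing under the update rule that sets the next iterate to the minimizer of the auxiliary function with the current iterate as its second argument.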
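In the standard formulation used for NMF convergence proofs (due to Lee and Seung), the definition, the update, and the resulting descent chain read as follows. The symbols F, G, and h are assumed here for illustration: F is the objective, G the auxiliary function, and h the variable being updated.

```latex
% Auxiliary function conditions (assumed symbols)
G(h, h') \ge F(h), \qquad G(h, h) = F(h)

% Update rule
h^{(t+1)} = \arg\min_{h} \, G\bigl(h, h^{(t)}\bigr)

% Descent chain that proves the lemma
F\bigl(h^{(t+1)}\bigr) \le G\bigl(h^{(t+1)}, h^{(t)}\bigr)
  \le G\bigl(h^{(t)}, h^{(t)}\bigr) = F\bigl(h^{(t)}\bigr)
```

The first inequality uses the upper-bound condition, the second uses the fact that the new iterate minimizes G, and the final equality uses the coincidence condition.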
Following the definition above, we give an auxiliary function for the objective function in formula (6) with regard to the variable updated by formula (13). Collect all terms in formula (6) that contain this variable; an auxiliary function for their sum can be constructed elementwise from the entries of the adjacency matrix and of the indicator matrix. This auxiliary function is convex in the variable, and its global minimum gives the update in formula (13).
Then we show the convergence of updating by formula (13); that is, we prove the following statement: when other variables are fixed, updating according to formula (13) monotonically decreases formula (6) until convergence.
Proof. According to Definition 2, Lemma 3, and the auxiliary function we construct, at any iteration of this update the objective value at the new iterate is no larger than the auxiliary function evaluated there, which in turn is no larger than the objective value at the previous iterate. Thus, the objective decreases monotonically. Since the objective function in formula (6) is bounded below by 0, the updating process converges. The convergence of the updating rule in formula (13) is thereby proved.
Similarly, the updating rules of the other variables can be proved convergent. Therefore, alternately updating by formulas (13), (14), (15), and (16) monotonically decreases formula (6) until convergence, and the stationary point is a KKT fixed point, which guarantees the correctness and convergence of Algorithm 1.
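The monotone decrease can be observed numerically on plain two-factor NMF, whose multiplicative updates are covered by the same auxiliary-function argument. The sketch below records the squared Frobenius error after every sweep; it illustrates the behavior for standard NMF only, not for the RJNMF objective, and the function name and defaults are assumptions.

```python
import numpy as np

def nmf_objective_trace(X, k, n_iter=50, eps=1e-9, seed=0):
    """Run multiplicative NMF updates and return the objective after each sweep."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.random((n, k)) + 0.1
    V = rng.random((m, k)) + 0.1
    trace = [np.linalg.norm(X - U @ V.T, "fro") ** 2]
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
        trace.append(np.linalg.norm(X - U @ V.T, "fro") ** 2)
    return trace
```

Plotting or printing the returned list shows a sequence that never increases, up to floating-point rounding.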
3.4.3. Complexity Analysis
We now analyze RJNMF’s time complexity, using standard NMF as the basis for big-O notation. RJNMF is essentially an extension of NMF to multiple data matrices. For standard NMF, the cost of the update rules in each iteration is known to scale with the product of the two dimensions of the data matrix and the number of communities. For the variables whose update rule is the same as in the original NMF, the cost is the same. For each variable updated by formula (13), the additional cost relative to standard NMF comes from the second term of the numerator and denominator. Because the number of communities is a small constant compared with the matrix dimensions, the cost of updating each such variable remains of the same order as in standard NMF. Similarly, the costs of updating the remaining variables scale with the numbers of link views and content views, respectively. Therefore, the time complexity of the RJNMF updating rules in each iteration grows linearly with the number of views, making RJNMF a linear extension of NMF.
4. Experimental Results
In this section, we evaluate the RJNMF algorithm on real-world datasets and compare it with some existing community detection algorithms.
Four real-world datasets containing both link data and content data are used in the experiments. The datasets are collected from Twitter (http://mlg.ucd.ie/networks/); ground truth communities are available, so we can evaluate the performance of our method against them. The details of the datasets are summarized in Table 2. The politics-uk dataset is a collection of 419 Members of Parliament in the UK, and it consists of 5 communities, corresponding to the political parties. The politics-ie dataset describes Irish politicians and political organizations, which are clustered into 7 communities. The football dataset contains 248 English Premier League football players and clubs active on Twitter; the ground truth corresponds to 20 clubs. The olympics dataset contains a collection of 464 users (https://twitter.com/Telegraph2012/london2012) consisting of the athletes and organizations involved in the London 2012 Summer Olympics; they are assigned to 28 communities according to different sports.
A collection of different link and content data is available for each dataset. We use follows, mentions, retweets, user lists, and tweets in the experiments. In particular, follows describes the follow relationship, mentions contains links between users who mentioned each other, and retweets describes the retweet interaction; all three relations are regarded as link information, and three adjacency matrices are constructed from them, respectively. User lists is constructed from the Twitter lists to which each user has most recently been assigned, and tweets is constructed from the concatenation of the 500 most recently posted tweets for each user; two user-content matrices are obtained from these two types of content information.
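As a hedged illustration of how such inputs might be assembled, the sketch below builds an adjacency matrix from a list of directed interactions and a user-content matrix from per-user token lists using raw term counts. The preprocessing actually applied to the datasets is not described here, so the function names, the count-based weighting, and the absence of any term weighting or filtering are all assumptions.

```python
import numpy as np

def adjacency_from_edges(n_users, edges):
    """Weighted, directed adjacency matrix; repeated (u, v) pairs add weight."""
    A = np.zeros((n_users, n_users))
    for u, v in edges:
        A[u, v] += 1.0
    return A

def user_content_matrix(docs):
    """Rows are users, columns are vocabulary terms, entries are term counts.

    docs: one list of tokens per user (e.g. tokens of concatenated tweets).
    Returns the matrix and the sorted vocabulary.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: j for j, tok in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for tok in doc:
            X[i, index[tok]] += 1.0
    return X, vocab
```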
To further evaluate the performance of our method, we carry out experiments on the Last.fm and Yelp datasets, which contain thousands of items. The Last.fm dataset consists of 9694 artists which are clustered into 21 music genres; for each artist, his or her bio description and user comments are crawled, yielding 131153 users, 31172 comments, and 14076 descriptions. Users, comments, and descriptions are all content information used to cluster the artists, and three corresponding user-content matrices are obtained. The Yelp dataset consists of 2,624 items from 7 categories. There are also three types of content information, that is, users, comments, and businesses’ names (descriptions), from which we obtain three user-content matrices. Summary statistics of the datasets are shown in Table 3.
4.2. Baseline Methods
To demonstrate the performance of our method, we also run the following baseline algorithms on the datasets described above and compare the performance of RJNMF with theirs.
(i) Pairwise Coregularized Spectral clustering (PCoSpec) and Center-wise Coregularized Spectral clustering (CCoSpec): two coregularization schemes are adopted in the spectral clustering framework; PCoSpec utilizes a pairwise coregularization to enforce the eigenvectors of each pair of views to be similar, and CCoSpec employs a centroid-based coregularization to enforce the eigenvectors to be similar to a common center. In the experiments, all the information contained in each dataset is utilized. The input affinity matrices of link information are all the adjacency matrices of each dataset, and the affinity matrices of content information are obtained with the default Gaussian kernel from all the user-content matrices, following the authors’ suggestions. The regularization parameters are set to 0.01 as suggested by the authors.
(ii) Coregularized Graph Clustering (CGC): CGC is based on symmetric NMF with coregularization to deal with multiple link data. The private clustering result of each adjacency matrix is obtained by solving the joint matrix factorization problem. In the experiments, all the information contained in each dataset is utilized. The input affinity matrices of link information are all the adjacency matrices of each dataset; the affinity matrices of content information are computed using the RBF kernel from all the user-content matrices, following the authors’ guidance. We set the regularization parameters to 1, as suggested by the authors.
(iii) Pairwise Coregularized NMF (PCoNMF) clustering and Cluster-wise Coregularized NMF (CCoNMF) clustering: CoNMF was proposed to combine the link and content information for joint factorization and to find a separate clustering solution for each view; in particular, two kinds of coregularization penalties, pairwise and cluster-wise constraints, are developed to ensure the similarity of each pair of community indicators. In the experiments, all the information contained in each dataset is utilized; the input matrices are all the adjacency matrices and user-content matrices. In the experiments on the first four datasets, the parameters for follows, mentions, retweets, user list, and tweets are set to 2, 1, 1, 2, and 1, respectively, because follows and user list are more important according to prior knowledge; the regularization parameters are set to 1 as suggested. In the experiments on the Last.fm and Yelp datasets, the parameter for each view is set to 1, and the regularization parameters are set to 2, as suggested.
In this paper, accuracy  and normalized mutual information (NMI)  are adopted to evaluate the community detection performance of the different methods. Formally, let $C = \{C_1, \dots, C_k\}$ be the set of communities in the ground truth and $C' = \{C'_1, \dots, C'_{k'}\}$ be the communities extracted by a given approach. Both accuracy and NMI use the ground truth as the reference; their values range from 0 to 1, and a higher value means better performance.
Accuracy measures how closely the extracted community structure approaches the ground truth community structure; it is the ratio of the nodes clustered into the correct communities to all the nodes in the network. To compute the accuracy, each ground truth community in $C$ is assigned a label, which is also assigned to each node $i$ in that community as its true label, denoted as $l_i$. Then, we scan the nodes in the extracted communities $C'$ to count the occurrences of each true label in each derived community $C'_j$ and take the label occurring most frequently in $C'_j$ as the label of that community. After this process, some communities may have the same label. Among these communities, we keep the one with the largest number of nodes carrying that label. For each of the others, if the nodes in the community have no other labels, the community is removed from $C'$ and all of its nodes are counted as misclassified; otherwise, we take the label with the next-largest number of nodes in the community as the label of that community. Then, a ground truth community and an extracted community with the same label are matched with each other, and we assign the label of the extracted community to each of its nodes $i$ as the detected label, denoted as $l'_i$. Accuracy is defined as follows:
$$\mathrm{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \delta\left(l_i, l'_i\right),$$
where $l_i$ denotes the community label of node $i$ in the ground truth, $l'_i$ is the detected community label of node $i$, $n$ is the number of nodes, and $\delta(\cdot, \cdot)$ is the Kronecker delta function, which is 1 if the two community labels of node $i$ are the same and 0 otherwise. If $C' = C$, accuracy equals 1. If $C$ and $C'$ are completely different, accuracy equals 0.
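The label-matching procedure above can be sketched as follows. This is one greedy reading of the description, not the authors' implementation: the function name `matched_accuracy` is hypothetical, and ties between equally frequent labels are broken by first occurrence, which the text does not specify.

```python
from collections import Counter


def matched_accuracy(true_labels, detected):
    """Accuracy after matching detected communities to ground-truth labels.

    true_labels[i] is the ground-truth label of node i and detected[i] is the
    id of the detected community of node i.
    """
    n = len(true_labels)
    # Count the occurrences of each true label inside each detected community.
    counts = {}
    for t, d in zip(true_labels, detected):
        counts.setdefault(d, Counter())[t] += 1
    # Candidate labels of each detected community, most frequent first.
    ranked = {d: [lab for lab, _ in c.most_common()] for d, c in counts.items()}
    pos = {d: 0 for d in counts}  # index of the candidate currently tried
    holder_of = {}                # label -> detected community that owns it
    pending = list(counts)
    while pending:
        d = pending.pop()
        while pos[d] < len(ranked[d]):
            lab = ranked[d][pos[d]]
            holder = holder_of.get(lab)
            if holder is None:
                holder_of[lab] = d
                break
            if counts[d][lab] > counts[holder][lab]:
                # d has more nodes with this label: it keeps the label and
                # the previous holder falls back to its next candidate.
                holder_of[lab] = d
                pos[holder] += 1
                pending.append(holder)
                break
            pos[d] += 1
        # A community that runs out of candidates stays unmatched, so all of
        # its nodes are counted as misclassified.
    label_of = {d: lab for lab, d in holder_of.items()}
    correct = sum(
        1 for t, d in zip(true_labels, detected)
        if d in label_of and label_of[d] == t
    )
    return correct / n


if __name__ == "__main__":
    truth = [0, 0, 0, 1, 1, 1]
    found = [5, 5, 7, 7, 7, 7]
    print(matched_accuracy(truth, found))
```

In the toy call, one node of true community 0 ends up in the detected community that is matched to label 1, so five of the six nodes are counted as correct.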
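NMI measures the partitioning quality with respect to the ground truth; it estimates the similarity between the true partition and the detected one. The NMI of our partition $C'$ and the ground truth $C$ is defined as follows. Let $N$ be the confusion matrix whose element $N_{ij}$ is the number of nodes of community $i$ of the partition $C$ that are also in community $j$ of the partition $C'$. NMI is calculated as follows:
$$\mathrm{NMI}(C, C') = \frac{-2 \sum_{i=1}^{k} \sum_{j=1}^{k'} N_{ij} \log\left(\dfrac{N_{ij}\, n}{N_{i\cdot}\, N_{\cdot j}}\right)}{\sum_{i=1}^{k} N_{i\cdot} \log\left(\dfrac{N_{i\cdot}}{n}\right) + \sum_{j=1}^{k'} N_{\cdot j} \log\left(\dfrac{N_{\cdot j}}{n}\right)},$$
where $k$ ($k'$) is the number of groups in the partition $C$ ($C'$), $N_{i\cdot}$ ($N_{\cdot j}$) is the sum of the elements of $N$ in row $i$ (column $j$), and $n$ is the number of nodes. If $C' = C$, $\mathrm{NMI}(C, C') = 1$. If $C$ and $C'$ are completely different, $\mathrm{NMI}(C, C') = 0$.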
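As a minimal sketch of the NMI computation from a confusion matrix, the following function builds the matrix from two label vectors and evaluates the confusion-matrix form of NMI given by its row sums, column sums, and the number of nodes. The function name `nmi` is hypothetical, and returning 1 when both partitions consist of a single group is a convention assumed here, since the ratio is undefined in that case.

```python
import numpy as np


def nmi(true_labels, detected):
    """Normalized mutual information of two partitions given as label vectors."""
    _, t_idx = np.unique(np.asarray(true_labels).ravel(), return_inverse=True)
    _, d_idx = np.unique(np.asarray(detected).ravel(), return_inverse=True)
    # Confusion matrix: rows are ground-truth groups, columns detected groups.
    N = np.zeros((t_idx.max() + 1, d_idx.max() + 1))
    np.add.at(N, (t_idx, d_idx), 1)
    n = N.sum()
    row = N.sum(axis=1)  # sizes of the ground-truth groups
    col = N.sum(axis=0)  # sizes of the detected groups
    nz = N > 0           # empty cells contribute nothing to the sum
    expected = np.outer(row, col)
    numerator = -2.0 * np.sum(N[nz] * np.log(N[nz] * n / expected[nz]))
    denominator = np.sum(row * np.log(row / n)) + np.sum(col * np.log(col / n))
    if denominator == 0:
        # Both partitions have a single group (assumed convention: NMI = 1).
        return 1.0
    return float(numerator / denominator)


if __name__ == "__main__":
    print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed labels
    print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent
```

Because the measure depends only on the confusion matrix, it is invariant to how the community ids are named, which is why no label matching is needed here, unlike for accuracy.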
The accuracy and NMI scores of the different methods on the datasets are presented in Tables 4 and 5, respectively. All results are averaged over 50 runs, and the parameters of every method are tuned for its best performance. Several conclusions can be drawn from the tables.
It can be observed that RJNMF performs better than the other community detection algorithms on most of the datasets. For example, on the politics-uk dataset, RJNMF achieves improvements of about 19% in accuracy and 8% in NMI over PCoSpec, the second-best method, and of about 30% in both accuracy and NMI over CCoNMF. On the Last.fm and Yelp datasets, which contain thousands of nodes, RJNMF also performs well. Although PCoNMF performs better than RJNMF on the Last.fm dataset, the gap is relatively small. The better performance of RJNMF may be due to two reasons. First, the similarity regularization helps reduce the effect of noise and capture the core community structure. Second, RJNMF derives the common community structure directly, by obtaining a common community indicator from the optimization problem, and thus uses the link and content information more effectively than the other methods. Overall, the results demonstrate the effectiveness of RJNMF for community detection in heterogeneous networks.
4.4. RJNMF Parameter Study
There are four sets of parameters in RJNMF to balance the effects of different parts in the optimization: for the link data, for the content data, and and for regularization. The values of and determine the importance of different link and content data in the optimization, respectively, while the values of and determine the weights of the similarity regularization. We carry out experiments on politics-uk, politics-ie, football, and olympics datasets to evaluate the performances of our method with varying parameters.
We first discuss the relative values of the parameters for the different link data and content data. For the first four datasets, the single-view detection results and the ground truth show that the follow data indicates the community structure more clearly than the other two types of link data, so we set the weight of follows higher than those of mentions and retweets. Similarly, the weight of user lists is set higher than that of tweets. However, in most real-world cases, it is usually unknown which data is more important and reliable; parameters () and () are then set to the same value by default, as are () and (). If some prior information is available, one can also choose different values based on the importance of each individual view; the view that indicates the underlying structure more clearly can be given a higher weight to emphasize its effect.
Then we focus on the performance of RJNMF with varying weights of the factorizations relative to the regularization. Fixing the other parameters, we change () by multiplying by the varying coefficient ; note that the parameters for different domains change at the same pace. Similarly, , , and are varied in the same way. Figures 2–5 show the performance of RJNMF on the first four datasets with varying parameters. The lines for PCoSpec, CCoSpec, and CGC represent the best results of these methods and are used as baselines. As can be seen, on all four datasets, RJNMF has relatively stable performance as the parameters vary. RJNMF continues to outperform the best results of PCoSpec, CCoSpec, and CGC on , when is larger than . In terms of , RJNMF performs better across a range of to and performs poorly outside that range. It can also be observed that RJNMF is relatively stable on and across a wide range and performs better than the best results of the other algorithms. The results also suggest that the ratio of the parameters for regularization and factorizations can be set to to achieve good performance.
In this paper, we investigate community detection in heterogeneous networks systematically and propose a regularized joint NMF community detection framework. The key idea of the framework is to formulate a joint matrix factorization process with regularization that derives the common community structure directly and more accurately. In the framework, both the link data and the content data are considered in the integration strategy and analyzed simultaneously; furthermore, to make better use of the two types of information, we develop a consensus similarity regularization that pushes the individual community detection result of each data source towards a common solution. Experiments on six real-world datasets demonstrate that RJNMF can effectively utilize the compatible and complementary information of link and content data for the joint community detection task.
In this paper, we focus on detecting communities under the assumption that all the data sources share a common underlying structure. However, in some real applications this assumption may not hold, because the community structures of multiple data sources may differ; in future work, we plan to investigate community detection in such cases.
The authors declare that they have no competing interests.
This work was supported by the National Natural Science Foundation of China (no. 61473149) and Natural Science Foundation of Jiangsu Province, China (no. BK20140075).
Y. Pei, N. Chakraborty, and K. Sycara, “Nonnegative matrix tri-factorization with graph regularization for community detection in social networks,” in Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI '15), pp. 2083–2089, July 2015.
D. Kuang, H. Park, and C. H. Ding, “Symmetric nonnegative matrix factorization for graph clustering,” in Proceedings of the 12th SIAM International Conference on Data Mining (SDM '12), pp. 106–117, Anaheim, Calif, USA, April 2012.
S. Van Dongen, “A cluster algorithm for graphs,” in Centrum voor Wiskunde en Informatica (CWI), 2000.
L. Tang and H. Liu, Community Detection and Mining in Social Media, Morgan & Claypool Publishers, 2010.
N. Wang, P. Chen, and X. Li, “Community detection in heterogeneous multi-mode social network via Co-training,” in Foundations of Intelligent Systems: Proceedings of the Eighth International Conference on Intelligent Systems and Knowledge Engineering, Shenzhen, China, Nov 2013 (ISKE 2013), vol. 277 of Advances in Intelligent Systems and Computing, pp. 531–538, Springer, Berlin, Germany, 2014.
X. He, M. Y. Kan, P. Xie et al., “Comment-based multi-view clustering of web 2.0 items,” in Proceedings of the 23rd International Conference on World Wide Web, pp. 771–782, International World Wide Web Conferences Steering Committee, 2014.
Y. Ruan, D. Fuhry, and S. Parthasarathy, “Efficient community detection in large networks using content and links,” in Proceedings of the 22nd International Conference on World Wide Web, International World Wide Web Conferences Steering Committee (WWW '13), pp. 1089–1098, Rio de Janeiro, Brazil, May 2013.
D. Greene and P. Cunningham, Producing a Unified Graph Representation from Multiple Social Network Views, Association for Computing Machinery, 2013.
A. Kumar, P. Rai, and H. Daumé, Co-Regularized Multi-View Spectral Clustering, Advances in Neural Information Processing Systems, 2011.
C. Deng, Z. Lv, W. Liu et al., “Multi-view matrix decomposition: a new scheme for exploring discriminative information,” in Proceedings of the 24th International Conference on Artificial Intelligence, AAAI Press, 2015.
R. Xia, Y. Pan, L. Du, and J. Yin, “Robust multi-view spectral clustering via low-rank and sparse decomposition,” in Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI '14), pp. 2149–2155, July 2014.
S. Guesmi, C. Trabelsi, and C. Latiri, “Community detection in multi-relational bibliographic networks,” in Database and Expert Systems Applications, vol. 9828 of Lecture Notes in Computer Science, pp. 11–18, Springer International Publishing, Cham, Switzerland, 2016.
X. Zheng, S. Zhu, J. Gao, and H. Mamitsuka, “Instance-wise weighted nonnegative matrix factorization for aggregating partitions with locally reliable clusters,” in Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI '15), pp. 4091–4097, Buenos Aires, Argentina, July 2015.
D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Advances in Neural Information Processing Systems, pp. 556–562, 2001.
D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Advances in Neural Information Processing Systems, 2000.