Abstract

The nonnegative matrix factorization (NMF) model has been successfully applied to discover latent community structures owing to its good performance and its interpretability in extracting hidden patterns. However, most previous studies explore only the structural information of the network and ignore the rich node attributes. Moreover, they aim at detecting densely connected communities (also called community structures) and fail to identify general structures, such as bipartite structures and mixture structures. In this paper, we study general structure discovery and propose a new method, GCDNMF (General Community Detection based on Nonnegative Matrix Factorization), which integrates structural information and node attributes through a consistency module constraint to capture community interactions. It discovers the general community structures of nodes by iteratively updating the community-interaction matrix and the node-membership matrix. We also introduce a matrix initialization based on the centrality and dispersion of nodes for center selection, which reduces the sensitivity to random initialization. Experimental results on real-world networks with a variety of characteristics validate the performance of our approach, especially on networks with general structures. In addition, the associated initialization evaluations demonstrate the effectiveness of our method in obtaining stable results.

1. Introduction

Many complex systems in the real world can be described as networks, such as social networks, transportation networks, and citation networks. Community structure is an essential and common topological property of these networks. Identifying community structure is a fundamental issue in understanding network topology and functional modules, and it has attracted the attention of many researchers [19]. A comprehensive review of existing community detection methods can be found in [10]. Furthermore, with the rapid emergence of user-generated media (e.g., Microblog, WeChat, and Twitter), real-world networks contain not only structural connections between nodes, which indicate various interdependencies between individuals or organizations [6], but also rich attribute information that characterizes the nodes; such networks are referred to as attribute networks. As revealed in previous work, informative node attributes can help to find meaningful groups of users with similar interests, backgrounds, or purposes, which can in turn support applications in recommendation, sentiment analysis, and user profiling [11]. Moreover, realistic complex networks often contain multiple kinds of structure. Besides the traditional community structure, also known as assortative mixing, which is characterized by tight intra-community links and sparse inter-community links (as in the classical citation network Cora), they may contain bipartite structures [12], as in the English lexical link network Adjnoun, which are also called disassortative mixing [13], as well as mixture structures that contain both. Mining the various underlying structures and interaction patterns between communities in a network is of great theoretical and practical significance for understanding the function of networks, discovering hidden patterns, and predicting the behavior of individuals in the network.

In the past decades, several methods have been proposed to detect communities in attributed networks. They are mainly classified into modularity-based methods [14, 15], clustering-based methods [16-20], random-walk-based methods [21, 22], statistical inference models [13, 23, 24], and matrix-factorization-based methods [3, 25-27]. Among them, nonnegative matrix factorization (NMF) based methods have attracted much interest due to their good performance and strong interpretability. For example, Jia et al. [28] developed a modularized trifactor matrix factorization model, Mtrinmf, to exploit the topological and modularity information of the network. Zhang et al. [3] used the NMF method to improve density peak clustering in community detection. However, node attributes are not considered in these models. To introduce node attributes, Jin et al. [29] used the node attribute matrix to construct an NMF framework for the underlying community membership. Chen et al. [27] proposed CDCN, which combines node attribute information and community structure information in the NMF framework to identify communities with semantic annotation. However, the above attributed methods commonly assume that the community structure obtained from the link structure is consistent with the one obtained from node attributes. Hence, they embed the structure and attributes into the same space and obtain a common node-community matrix. In this way, they typically aim to extract traditional communities that are assortative, i.e., nodes are mostly connected with others in the same community. They may overlook intercommunity relationships, which makes it difficult to uncover generalized community structures, including assortative communities, disassortative communities (i.e., most connections run between different communities, as in bipartite networks), and mixed community structures.

To address these issues, in this paper, we study the discovery of general structures and propose a new nonnegative matrix factorization model named GCDNMF. It integrates the structural information and node attributes of networks through consistency module constraints to capture the interactions between communities. By iteratively updating the community-interaction matrix and the node-membership matrix, it captures the general community structures of nodes. In addition, we initialize the node-membership matrix using the centrality and dispersion of nodes to reduce the sensitivity caused by random initialization. In summary, the contributions of this paper are threefold: (1) We propose a novel NMF-based model to detect general structures in attribute networks, which naturally combines structural connections and node attributes into a joint decomposition model. To the best of our knowledge, we are the first to model general structures using an NMF-based model. (2) We propose consensus factorization to exploit general communities by studying the consistency between nodes and communities in terms of structural connections and node attributes. It is solved by alternately updating the community-interaction matrix in the link structure and the node-membership matrix in the node attributes. (3) Extensive experiments are conducted on benchmark networks to demonstrate the effectiveness of our proposed method by comparing it with state-of-the-art methods. The experimental results show the superior performance of our model in detecting general structures.

The remainder of this paper is organized as follows. Section 2 introduces the related work on community detection based on nonnegative matrix factorization. Section 3 presents our proposed GCDNMF model, which integrates the topological information and node attributes based on consistency module constraints. To verify the performance of our method, several experiments are carried out in Section 4. Section 5 concludes the paper and discusses future work.

2. Related Work

Nonnegative matrix factorization (NMF) is well suited to extracting hidden patterns and structures from high-dimensional data. Kumari et al. [30] pioneered a standard community detection method based on NMF, which can effectively mine the structural characteristics of communities by quantifying the link relationships between nodes. Owing to its simple implementation, innate interpretability, and outstanding performance, NMF has become a vital technique for community detection [31], and improving the performance of NMF-based community detection has attracted much attention from researchers.

For undirected and directed networks, classified according to the directionality of edges, Kuncheva et al. [32] proposed SNMF and ANMF, respectively, to extract the intrinsic community structures. Considering the modularity information of the network, Jia et al. [28] presented a trifactor NMF model that incorporates the modularity information as a regularization term. To capture the complex underlying network structure effectively and preserve the global and local structures, Li et al. [33] proposed a multilayer model based on NMF, which consists of an encoder module and a decoder module. Li et al. [34] explored the implicit association between nodes and presented a community detection method based on SNMF. However, the above conventional methods mainly explore the topology of the network to obtain communities.

Some of the most recently developed state-of-the-art methods use both topology and attributes to extract communities. Jin et al. [29] utilized the NMF technique to combine the observed network structure and node attributes, but the model does not focus on factorizing the node attribute matrix and ignores the various implications of edges in forming the community structure. Li and Liu [35] proposed a trifactor nonnegative matrix factorization clustering framework, NMTF, to combine three types of graph regularization in social networks. This approach utilizes additional content information to detect communities but fails to explore the relationship between communities and this content. Tang et al. [36] proposed a weighted nonnegative factorization method for attributed graph clustering, which incorporates a weighting scheme to distinguish the importance of attributes. Jin et al. utilized both the community structure matrix and the node attribute matrix in the NMF framework SCI [29]. Chen et al. [27] combined node attribute information and community structure information in the NMF framework to accurately find the relationships between networks. Some recent work focuses on building NMF models that learn low-dimensional representations of nodes for discovering communities in attributed networks [33, 37].

3. The GCDNMF Model

In this section, we present the proposed general community detection method GCDNMF, which incorporates topological information and node attribute in a collective NMF-based model. It consists of three main parts: network structure modeling, node attribute modeling, and joint modeling. Next, we describe each part in detail and give the optimization method.

3.1. Problem Formulation

We denote a network as , where is a set of nodes, is the set of edges between nodes, and is a set of attribute vectors. The nodes and their connections are interpreted by an adjacency matrix , if nodes and are connected, the corresponding entry , otherwise . The attributes of nodes in the network are represented by an attribute matrix , where is the dimension of node attributes. Our proposed method aims to partition the network into communities by jointly decomposing the adjacency matrix and the attribute matrix . In this paper, we summarize the notations and their definitions in Table 1.

3.2. Modeling Network Structures

Community detection refers to finding nodes that are relatively closely related in a network and dividing them into different communities. The idea in nonnegative matrix factorization that the parts constitute the whole provides an effective solution to this problem. To model the topological structure of nodes, we improve the traditional NMF method and propose a three-factor factorization method to decompose the adjacency matrix . The objective function can be expressed as: where is the community membership matrix, in which indicates the propensity of node belonging to community ; is the community relation matrix and is the probability of edges existing between community and community . Intuitively, is used as a measure of the strength of relationships between communities. Compared with the traditional NMF method, this method adopts a trifactor decomposition instead of a two-factor decomposition. On the one hand, the trifactor NMF model is suitable for both directed and undirected networks. On the other hand, and more importantly, it gives a clear physical meaning to and . The relation matrix is further combined with node attribute modeling to exploit the generalized community structure.
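To make the role of the community relation matrix concrete, the following minimal sketch (not the authors' implementation) builds the link pattern implied by a hard membership matrix and a relation matrix. It assumes the three-factor form A ≈ U S Uᵀ; the symbols U and S and the helper names transpose, matmul, and reconstruct are illustrative assumptions, since the original notation is not available here.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def reconstruct(U, S):
    """Link pattern U S U^T implied by membership U and relation matrix S
    (assumed three-factor form, not taken verbatim from the paper)."""
    return matmul(matmul(U, S), transpose(U))


# Four nodes, two communities: nodes 0-1 in community 0, nodes 2-3 in community 1.
U = [[1, 0], [1, 0], [0, 1], [0, 1]]

S_assortative = [[1, 0], [0, 1]]  # links stay inside communities
S_bipartite = [[0, 1], [1, 0]]    # links run only between communities

assortative = reconstruct(U, S_assortative)
bipartite = reconstruct(U, S_bipartite)

for row in bipartite:
    print(row)
```

With the same membership matrix, a diagonal relation matrix yields dense diagonal blocks (a traditional community structure), while an off-diagonal relation matrix yields a bipartite pattern. This is why a learned relation matrix can describe structures beyond assortative communities.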

3.3. Modeling Node Attributes

In attribute networks, the relationship between nodes and their attributes can be regarded as that between documents and keywords. Using the bag-of-words approach, the attribute matrix is denoted as , where represents the number of documents and is the number of keyword features. Assuming that the documents consist of clusters, based on NMF text clustering, can be decomposed into two nonnegative matrices and . We then have the following objective function related to the node attributes: where is the probability distribution matrix between nodes and communities, represents the membership degree of node belonging to community . is the probability distribution matrix between node attributes and communities, and indicates the propensity that community can be described by keyword . In this way, we can divide the communities by node attributes and obtain the attribute community matrix , in which nodes in the same attribute community have a large attribute similarity.

3.4. Joint Community Detection Model

In the above two subsections, the community detection results are obtained from two perspectives of structural information and node attributes, respectively. To ensure that the final result is consistent, GCDNMF introduces a consistency module to jointly formulate the above two aspects. Different from traditional methods that focus on embedding structures and attributes into the common node-community space, we also concentrate on the relationship matrix between communities. Intuitively, based on attribute community matrix , we can further obtain the matrix describing the relationship between communities in the attribute information by . Specifically, each entry in portrays two communities in which nodes have attribute similarity and indicates the propensity of edges existing between two communities based on attribute similarity.

Considering that the topology is consistent with the clustering structure of the node attributes, the structure community relation matrix and the attribute community relation matrix are expected to be approximately equal. Therefore, we derive the following objective function:

Then, we propose a weighted joint NMF-based framework to integrate the above objectives. Our goal is to solve the following minimization problem over , , , and : where and are positive weights that balance the structure/attribute fusion and the strength of the consistency constraint on the community relationship.
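For orientation, one plausible form of such a weighted joint objective, consistent with the verbal description in Sections 3.2 to 3.4, is sketched below. All symbols are assumptions introduced for illustration: A is the adjacency matrix, X the attribute matrix, U the structural membership matrix, S the community relation matrix, V the attribute-based membership matrix, W the attribute-community matrix, and α, β the two positive weights. The exact form used in the paper, including where the weights are placed and how the attribute-based relation matrix is computed, may differ.

```latex
\min_{U,\,S,\,V,\,W \,\ge\, 0}\;
\underbrace{\lVert A - U S U^{\top} \rVert_F^2}_{\text{structure}}
\;+\; \alpha\,\underbrace{\lVert X - V W^{\top} \rVert_F^2}_{\text{attributes}}
\;+\; \beta\,\underbrace{\lVert S - V^{\top} V \rVert_F^2}_{\text{consistency}}
```

Under this reading, the first two terms are the separate structure and attribute factorizations, and the third term ties them together only through the community-level relation matrices rather than by forcing a single shared node-community matrix.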

3.5. Model Optimization

Since the objective in Eq. (4) is not jointly convex in , , , and , we utilize an alternating iterative updating scheme that converges to a local minimum. First, according to matrix properties and , the objective function can be rewritten in the following form:
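As an illustration of this kind of rewriting, the two standard identities $\lVert M \rVert_F^2 = \operatorname{tr}(M^{\top} M)$ and $\operatorname{tr}(MN) = \operatorname{tr}(NM)$ turn a three-factor reconstruction error into a sum of traces. The symbols A, U, and S below are assumed names for the adjacency, membership, and relation matrices, and the expansion covers only a structure term of the assumed form, not the full joint objective.

```latex
\begin{aligned}
\lVert A - U S U^{\top} \rVert_F^2
  &= \operatorname{tr}\!\big( (A - U S U^{\top})^{\top} (A - U S U^{\top}) \big) \\
  &= \operatorname{tr}(A^{\top} A)
     - 2\,\operatorname{tr}(A^{\top} U S U^{\top})
     + \operatorname{tr}(U S^{\top} U^{\top} U S U^{\top})
\end{aligned}
```

The two cross terms are equal because a matrix and its transpose have the same trace. The trace form is convenient because derivatives with respect to each factor can then be read off term by term.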

To optimize Eq. (5) w.r.t. , , , and , four Lagrangian multipliers are introduced, , , , and . The Karush-Kuhn-Tucker (KKT) conditions, which are necessary conditions that a locally optimal solution must satisfy, are: where is the Hadamard product operator (like the operator .* in MATLAB), for example, . Thus, we derive the Lagrange function:

Setting the partial derivatives with respect to , , , and to zero, we have:

Eliminating the Lagrangian multipliers by Eq. (7), we obtain:

From Eq. (9), we obtain the following updating formulas:
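As a generic illustration of how a multiplicative update of this kind behaves, the sketch below applies the standard Lee-Seung-style rule to the relation matrix of a structure term alone, assuming the form ‖A − U S Uᵀ‖²_F with the membership matrix held fixed. It is not the full GCDNMF update, which also involves the attribute and consistency terms; the names U, S, update_S, and frob_loss are assumptions made for the example.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def frob_loss(A, U, S):
    """Squared Frobenius error of the assumed model A ~ U S U^T."""
    R = matmul(matmul(U, S), transpose(U))
    return sum((a - r) ** 2 for ra, rr in zip(A, R) for a, r in zip(ra, rr))


def update_S(A, U, S, eps=1e-12):
    """One multiplicative step for S with U fixed:
    S <- S * (U^T A U) / (U^T U S U^T U), applied elementwise.
    eps only guards against division by zero."""
    Ut = transpose(U)
    numer = matmul(matmul(Ut, A), U)
    UtU = matmul(Ut, U)
    denom = matmul(matmul(UtU, S), UtU)
    return [[s * n / (d + eps) for s, n, d in zip(rs, rn, rd)]
            for rs, rn, rd in zip(S, numer, denom)]


# Toy bipartite graph: nodes 0-1 link only to nodes 2-3.
A = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
U = [[1, 0], [1, 0], [0, 1], [0, 1]]   # fixed hard memberships
S = [[0.5, 0.5], [0.5, 0.5]]           # uninformative starting point

before = frob_loss(A, U, S)
S_new = update_S(A, U, S)
after = frob_loss(A, U, S_new)
print(before, after, S_new)
```

Because every factor in the update is nonnegative, the relation matrix stays nonnegative without any explicit projection. On this toy graph a single step already moves the relation matrix to an off-diagonal pattern, which is the signature of a bipartite structure.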

3.6. Initialization of GCDNMF

NMF-based approaches are sensitive to the random initial values of the variables. To overcome this problem, the node membership matrix of our model is initialized based on our previous work K-rank-D [38], which utilizes the centrality and dispersion of nodes in the network to determine the cluster centers.

To formulate the centrality of nodes in the network, following the algorithm in [38], we modify the transition probability matrix , where , and obtain the centrality of nodes by: where is the Euclidean distance between node and node . Nodes with higher centrality are more likely to be selected as center points. Moreover, since community centers in real networks are usually far from each other, we measure the degree of dispersion among centers by the distance between node and other nodes with higher centrality, defined as:

According to the centrality and dispersion of nodes formulated in Eq. (11) and Eq. (12), the CV (comprehensive value) of any node in the network can be defined as:

We sort the CV values of all nodes in descending order and select the nodes with the highest CV values as the community centers. Then we choose the columns (corresponding to the selected centers) of the similarity matrix to obtain the initialized matrix, where is an -dimensional identity matrix and is the step length of signal propagation, and we take this value as in this paper.
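The center-selection step can be sketched as follows. This is a simplified illustration, not the K-rank-D implementation: it assumes that centrality scores and pairwise distances are already available, that the dispersion of a node is its minimum distance to any node of strictly higher centrality, and that CV is the product of centrality and dispersion. The function name select_centers and the toy numbers are hypothetical.

```python
def select_centers(centrality, dist, k):
    """Pick k centers by a comprehensive value (CV).

    Assumptions (not taken from the paper's equations):
      * dispersion(i) = min distance from i to a node with higher centrality;
      * the most central node gets the largest distance in its row;
      * CV(i) = centrality(i) * dispersion(i).
    """
    n = len(centrality)
    dispersion = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if centrality[j] > centrality[i]]
        dispersion.append(min(higher) if higher else max(dist[i]))
    cv = [c * d for c, d in zip(centrality, dispersion)]
    order = sorted(range(n), key=lambda i: cv[i], reverse=True)
    return order[:k], cv


# Hypothetical toy data: six nodes on a line forming two groups.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in positions] for p in positions]
centrality = [0.2, 0.9, 0.3, 0.25, 0.8, 0.1]

centers, cv = select_centers(centrality, dist, k=2)
print(centers)
```

In this toy example the two selected centers are the most central node of each group. A node that is central but close to an even more central node receives a small dispersion and is therefore not chosen, which is the intended effect of combining the two quantities.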

3.7. The GCDNMF Algorithm

Algorithm 1 outlines our proposed GCDNMF. Taking the adjacency matrix , the attribute matrix and the number of communities as input, and after initializing the matrices, the membership matrix of nodes is obtained by iterative updating. Finally, the community partition of the nodes is obtained by maximum assignment, i.e., each node is assigned to the community with its largest membership value. The objective value of GCDNMF is nonincreasing over steps 4 to 7, and the algorithm converges to a local optimum. Since , , and are constant, the complexity of updating , , , and is for iterations. With good initialization based on the centrality and dispersion of nodes, GCDNMF converges quickly and requires far fewer iterations to partition the nodes of a network. Generally, 100 iterations give promising performance.
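The final maximum-assignment step can be illustrated with a few lines of code. The sketch assumes the membership matrix is stored as one row per node and one column per community; the function name assign_communities and the example values are hypothetical.

```python
def assign_communities(membership):
    """Assign each node to the community with its largest membership value.
    Ties are resolved in favour of the lowest community index."""
    return [max(range(len(row)), key=row.__getitem__) for row in membership]


# Hypothetical membership matrix: 3 nodes, 2 communities.
membership = [[0.8, 0.2],
              [0.1, 0.9],
              [0.5, 0.5]]

labels = assign_communities(membership)
print(labels)  # [0, 1, 0]
```

Hard assignment of this kind produces non-overlapping communities, which is what the evaluation metrics in Section 4.2 expect.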

4. Experimental Results and Analysis

In this paper, to verify the effectiveness of our proposed method, we compared it against state-of-the-art community detection methods based on the NMF model. The average results over 10 trials were recorded. All the algorithms were run on a PC with 8.0 GB of RAM and an Intel i7-4600U CPU, using MATLAB 2014b.

Algorithm 1: GCDNMF.
Input: Adjacency matrix ;
  Attribute matrix ;
  Number of communities ;
  Number of iterations ;
Output: Community label for each node
1: initialize , and randomly
2: initialize by K-rank-D
3: for do
4:
5:
6:
7:
8: end for
9: return

4.1. Data Description

We used both synthetic and real-world networks to test the effectiveness of our proposed algorithm. The datasets are described below, and their detailed parameters are given in Table 2.
(1) Cora: The Cora dataset is a subset of the large Cora citation dataset. It contains 2708 research papers from seven subfields of machine learning. In this network, each node is characterized by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary, which consists of 1703 unique words.
(2) Citeseer: The Citeseer dataset is a citation network of computer science publications. It contains 3312 publications, each of which is labeled as one of 6 categories. As in the Cora dataset, each publication is described by a binary vector indicating the presence or absence of the corresponding word from a dictionary of 3703 unique words.
(3) WebKB: The WebKB dataset consists of 877 scientific publications and 1608 links, and it includes the Web page networks of four universities: Cornell, Texas, Washington, and Wisconsin.

According to the block matrices of the networks, the first two datasets exhibit assortative mixing (traditional community structure), as shown in Figure 1(a) with the Cora dataset as an example, while the WebKB datasets have a mixture structure, which is neither purely assortative nor purely disassortative (e.g., a bipartite or multipartite structure), as shown in Figure 1(b) with the Washington dataset as an example.

4.2. Evaluation Measurements

In this study, three commonly used metrics are used to measure the performance of an algorithm: accuracy (ACC) [39], normalized mutual information (NMI) [39], and pairwise F-measure (PWF) [40]. These metrics are defined as follows.
(1) Accuracy (ACC). Given node , is the label assigned by an algorithm, and is the true label. The accuracy is defined as the fraction of all nodes whose predicted labels are the same as the true labels. The ACC of a particular division of a network is defined as follows: where is a Kronecker function whose value is 1 if , and 0 otherwise. is a permutation mapping function that maps the label of node to the corresponding label in the ground-truth. is the overall number of nodes in a network.
(2) Normalized mutual information (NMI). The NMI is defined by: where is the ground-truth cluster label, is the computed cluster label, is the number of communities, is the number of nodes in the ground-truth community , is the number of nodes in the computed community , and is the number of nodes in the ground-truth community that are assigned to the computed community . In general, the higher the NMI, the better the result of an algorithm.
(3) Pairwise F-measure (PWF). Let denote the set of node pairs having the same community label in the ground-truth, and be the set of node pairs placed in the same community by a given algorithm. is the cardinality of . The balanced PWF is the harmonic mean of precision and recall. It is defined as follows:

where and . The higher the PWF, the closer the division is to the ground-truth.
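The three metrics can be sketched in a few short functions. This is an illustrative implementation under stated assumptions, not the evaluation code used in the experiments: the label mapping for ACC is found by brute force over permutations (practical only for a handful of communities; the Hungarian algorithm is the usual choice at scale), NMI uses the square-root normalization that matches the node-count description above, and PWF is computed over node pairs. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, permutations
from math import log, sqrt


def accuracy(truth, pred):
    """ACC under the best one-to-one mapping of predicted to true labels."""
    pred_ids = sorted(set(pred))
    true_ids = sorted(set(truth))
    # Pad with None so every predicted label can receive a distinct target.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(1 for t, p in zip(truth, pred) if mapping[p] == t)
        best = max(best, hits)
    return best / len(truth)


def nmi(truth, pred):
    """NMI with sqrt normalisation; returns 0.0 if either labelling is trivial."""
    n = len(truth)
    ct, cp = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mi = sum(nij * log(n * nij / (ct[i] * cp[j])) for (i, j), nij in joint.items())
    ht = sum(ni * log(ni / n) for ni in ct.values())
    hp = sum(nj * log(nj / n) for nj in cp.values())
    denom = sqrt(ht * hp)
    return mi / denom if denom > 0 else 0.0


def pairwise_f(truth, pred):
    """Harmonic mean of pairwise precision and recall."""
    idx = range(len(truth))
    same_t = {(i, j) for i, j in combinations(idx, 2) if truth[i] == truth[j]}
    same_p = {(i, j) for i, j in combinations(idx, 2) if pred[i] == pred[j]}
    common = len(same_t & same_p)
    if common == 0:
        return 0.0
    precision = common / len(same_p)
    recall = common / len(same_t)
    return 2 * precision * recall / (precision + recall)


truth = [0, 0, 1, 1]
pred = [1, 1, 1, 0]   # label names differ from the ground truth
print(accuracy(truth, pred), nmi(truth, pred), pairwise_f(truth, pred))
```

In the toy example, three of the four nodes are correctly placed once predicted label 1 is mapped to true label 0, so ACC is 0.75 even though the raw label names disagree. All three metrics are invariant to renaming the predicted communities.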

4.3. Experimental Results and Analysis

In this section, to validate the effectiveness of GCDNMF for community detection, in addition to comparing with structure-based nonnegative matrix factorization methods, such as stdNMF, ANMF, and Mtrinmf, we also compare with state-of-the-art NMF-based methods that combine structure and attributes, such as CDCN and SCI. In addition, we compare GCDNMF with the promising attributed clustering method SA-cluster and the probabilistic models PCL-DC (focusing on traditional community detection) and GSB (aiming to detect general community structures). Moreover, we compare the performance of GCDNMF with different initialization settings, where GCDNMF is with random initialization and GCDNMF is with the centrality-based initialization.

The average results of ten random trials are shown in Table 3. From the table, we notice that GCDNMF performs well on most of the datasets and achieves the best or second-best results on all metrics (ACC, NMI, and PWF). Furthermore, we have the following observations.

Unlike methods that consider only topology, such as Mtrinmf, stdNMF, and ANMF, methods that integrate topological connections and node attributes can significantly improve the performance of community detection. The best and second-best results are obtained by methods that integrate the two types of information. For instance, on the Texas dataset, the highest ACC score among the methods that focus on network topology is 0.455 (stdNMF), and the best ACC score among the other methods that combine topology and attributes is 0.565 (GSB), whereas our method GCDNMF achieves 0.630.

Among all the compared methods that fuse structure and attributes, PCL-DC achieves the best results and GCDNMF is second on the Cora and Citeseer datasets. This is in accordance with the original intention of PCL-DC, which is mainly designed for mining traditional community structures. On the other four datasets, which have mixture community structures, methods based on joint matrix decomposition (such as GCDNMF and SCI) and the statistical inference model GSB perform better. Specifically, compared with PCL-DC, GCDNMF improves ACC and PWF by almost on the Washington dataset. For general community structure discovery, i.e., on the WebKB datasets (Cornell, Texas, Wisconsin, and Washington), GCDNMF and GSB are comparable overall, but GCDNMF achieves the best results on three of the four datasets in terms of ACC and PWF. In addition, GCDNMF extracts traditional community structures better than GSB, for example on the Cora and Citeseer datasets. This observation implies that GCDNMF can capture more complex community structures: it can not only discover traditional community structures but also detect mixture structures in networks. This verifies the effectiveness of introducing consistency constraints in GCDNMF to explore the community interaction between linkages and attributes.

4.4. Convergence and Stability Study

To solve the proposed joint formulation, we adopt an iterative update technique. In this subsection, we experimentally study the convergence of our proposed GCDNMF. The convergence curves on the six datasets are shown in Figure 2. From these figures, we can see that GCDNMF converges within 50 iterations on all datasets.

Further, we experimentally validate the effectiveness of the initialization strategy in our work by comparing it with random initialization. Due to limited space, we take the Citeseer and Texas networks as examples to test the results of GCDNMF with and without the initialization strategy. As shown in Figure 3, the blue, red, and green lines reflect the results of ACC, NMI, and PWF, respectively, over ten runs. The solid lines are the results with the initialization strategy, and the dotted lines are the results without it. From the figures, it can be noted that the initialization mechanism significantly improves the stability of the results. More specifically, the standard deviations of the ten-round results are 0.0006, 0.0002, and 1.07E-4 for the three metrics with the initialization strategy, whereas the standard deviations without it are 0.025, 0.021, and 0.314. Additionally, we find that the accuracy of community detection is significantly improved by the initialization strategy. The same conclusion can be drawn on the other networks. In conclusion, the experimental results show that, compared with random initialization, GCDNMF with the proposed initialization converges quickly while maintaining high community detection quality.

4.5. Parameters Analysis

The GCDNMF model has two hyperparameters: indicates the contribution of the attribute information to the community detection results, and controls the strength of consistency between the community-relation matrices derived from structure and attributes. Because the results on different networks show similar trends, here we analyze the effect of the hyperparameters on the Texas dataset. We vary and in the range of and observe the results while holding the other parameter fixed. Figures 4(a) and 4(b) show the performance of GCDNMF when and , with the other parameter ranging from 0.01 to 100, respectively. From the figures, we observe that GCDNMF is sensitive to and is relatively stable under different settings of when is fixed. Specifically, from Figure 4(a), as becomes larger, the performance of GCDNMF first rises slightly and then drops sharply at a certain value. This indicates that an overly large introduces noise through an excessive consistency constraint from the attribute clusters. Therefore, we suggest setting to 1 (giving equal attention to structure and attribute information) and properly tuning in to achieve high performance.

5. Summary

In this paper, GCDNMF is proposed for general community detection in attribute networks by exploring the consistency of the community relationships derived from structural connectivity and from node attributes. Comparisons with several state-of-the-art methods demonstrate that GCDNMF performs well in revealing the general community structure on the benchmark networks. In addition, we demonstrate that GCDNMF has stable performance after adopting the proposed initialization. However, there is still room for improvement in future work. Interesting directions include extending the proposed approach to overlapping community detection and to semisupervised general community detection.

Data Availability

The datasets used in the manuscript are from the hyperlink: https://linqs-data.soe.ucsc.edu/public/lbc/

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the Beijing Science and Technology Planning Project under grant KM202010005015 and the National Natural Science Foundation of China under grant 62006009.