Abstract

With the rapid development of the Internet and communication technologies, a large number of multitype relational networks have emerged in real-world applications. The bipartite network is a representative and important kind of complex network. Detecting community structure in bipartite networks is crucial for understanding their structure and function. Traditional nonnegative matrix factorization methods usually focus on homogeneous networks and suffer from problems such as slow convergence and high computational cost. It is challenging to effectively integrate network information of multiple dimensions in order to discover the hidden community structure underlying heterogeneous interactions. In this work, we present a novel fast nonnegative matrix trifactorization (F-NMTF) method to cocluster the 2-mode nodes in bipartite networks. By constructing the affinity matrices of 2-mode nodes as manifold regularizations of NMTF, we incorporate the intertype and intratype information of 2-mode nodes to reveal the latent community structure in bipartite networks. Moreover, we decompose the NMTF problem into two subproblems, which involve far fewer matrix multiplications and achieve faster convergence. Experimental results on synthetic and real bipartite networks show that the proposed method alleviates the slow convergence of NMTF and achieves high accuracy and stability in community detection.

1. Introduction

Community structure is an important feature of real-world networks, as it is crucial for studying and understanding the functional characteristics of real complex systems. A community is usually thought of as a group of nodes with more interactions among its members than with the rest of the network. With the rapid growth of the Internet and computational technologies in the past decade, community detection algorithms [1–3] have advanced swiftly from the simple clustering of homogeneous datasets (1-mode networks) to heterogeneous datasets (multiple-mode networks). Unlike homogeneous networks, which contain only 1-mode nodes and have explicit community structure, heterogeneous networks usually have obscure and complicated community structures owing to the coexistence of multiple-mode interactions. Therefore, it is challenging to effectively integrate network information of multiple dimensions in order to discover the hidden community structure underlying heterogeneous interactions.

The 2-mode network [4] is a simple and important kind of heterogeneous network, which is composed of two disjoint node sets (for simplicity, we call them the subject-node set and the object-node set). Links connect only nodes from different sets, and there are no links between nodes of the same set (Figure 1). In many studies, 2-mode networks are called bipartite networks. Bipartite networks are also among the most common complex networks in the real world. Many real-world networks may be naturally modeled as bipartite graphs, such as actor-movie, author-paper, word-document, product-consumer, and context-tag networks. It is therefore necessary and valuable to study community detection in bipartite networks.

Due to the special link patterns and multirelational nodes of bipartite networks, coclustering methods are well suited to mining their community structure. Recent works [5–7] have shown that clustering multimode datasets simultaneously, that is, coclustering, can effectively improve the clustering performance, in the sense that coclustering makes full use of the dual interdependence between heterogeneous nodes to discover hidden community structures. In contrast to traditional one-sided clustering (on either columns or rows), coclustering treats the data matrix symmetrically, so that a partitioning of rows can induce a partitioning of columns, and vice versa. By clustering both rows and columns of a data matrix simultaneously, coclustering can effectively deal with multimode networks and discover their structure.

Among the different approaches to coclustering, nonnegative matrix trifactorization (NMTF) provides a superior model, in which the relations between row and column clusters are explicitly embodied. NMTF methods have three major advantages. First, as shown in (1), NMTF introduces one more factor matrix, which absorbs the different scales of the data matrix and the two indicator matrices and provides additional degrees of freedom so that the low-rank matrix approximation remains accurate. Second, instead of being independent, the clustering tasks of multimode nodes in heterogeneous networks are often closely related. NMTF coclusters both the rows and columns of the original networks simultaneously by making efficient use of the duality (intratype and intertype) information between 2-mode nodes. Finally, in the NMTF-based coclustering scenario, the factor matrices jointly determine the latent community structure of bipartite networks: two of them, respectively, cluster the subject nodes and the object nodes, and the third is the correlation matrix reflecting the relationship between subject clusters and object clusters. Through the NMTF model, we can seamlessly integrate multiple kinds of node information to discover the underlying community structure in bipartite networks, which is highly valuable in many real-world applications.

Here we adopt nonnegative matrix trifactorization as a tool to find communities because of its interpretability and applicability to coclustering in heterogeneous networks. In this work, we present a novel fast NMTF solution for community detection in bipartite networks. The main contributions of this work are as follows. (1) We construct the affinity matrices of 2-mode nodes as manifold regularizations of NMTF, which efficiently incorporate the intertype and intratype information of the subject and object spaces, to enhance community detection in bipartite networks. (2) We present an optimization algorithm for fast nonnegative matrix trifactorization (F-NMTF), which decouples the original optimization problem into two smaller subproblems requiring far fewer matrix multiplications and then coclusters the 2-mode relational nodes from the heterogeneous interactions.

The remainder of the paper is organized as follows. In Section 2, we define the NMTF model of community detection and demonstrate the theoretical foundations of our approach along with an illustrative example. In Section 3, we formulate the coclustering NMTF method for community detection in bipartite networks and present an optimization algorithm that achieves fast convergence. We then test our algorithm on a variety of artificial and real bipartite networks and present the experimental results in Section 4. Finally, Section 5 concludes the paper.

2. Preliminary

2.1. Related Work

In recent years, NMF-based algorithms [8–11] for detecting network communities have gained great attention, because the data matrix factorization effectively reflects the community structure of the networks and promises a meaningful community interpretation that is independent of the network topology. In addition to quantifying how strongly each node participates in each community, nonnegative matrix factorization (NMF) does not suffer from the drawbacks of modularity optimization methods [12], such as the resolution limit [13]. Psorakis et al. [8] proposed a Bayesian NMF model to extract overlapping modules. This method can automatically determine the number of communities, but it may return a poor factorization when its estimate of the community number is wrong. Cao et al. [9] used nonnegative matrix factorization with I-divergence as the cost function and introduced two approaches, which are applied to directed and undirected networks, respectively. In [10], Nguyen et al. developed a novel NMF model in which vertices are measured by their centrality in communities, and overlapping communities, hubs, and outliers are detected within a single NMF framework. Based on the importance of each node when forming links in each community, He et al. [11] used nonnegative matrix factorization to form a generative model and treated the discovery of link communities as an optimization problem.

However, most community detection methods still focus on homogeneous networks and might not work well in bipartite networks. There are two main reasons. First, the special link patterns of bipartite networks greatly limit the effectiveness of these methods, which tend to cluster the heterogeneous nodes by constructing node or edge similarities, as they do in unipartite networks. In bipartite networks, however, the similarities among nodes of one mode can sometimes be defined only through the nodes of the other mode, so these methods no longer work well. Second, bipartite networks contain multiple types of nodes that are related to each other, and tackling each type independently loses these interactions. It is therefore necessary to utilize the nodes of both modes to gain a full understanding of bipartite networks.

Motivated by recent progress in coclustering and matrix factorization, several novel NMF-based algorithms have been proposed to detect the underlying community structure, including GNMF [14], DRCC [15], and BNMTF [16]. These studies showed that the NMTF model is more applicable to discovering the hidden structures in bipartite networks. Compared with classical clustering algorithms, nonnegative matrix trifactorization provides a good model for coclustering, in which the relations between row and column clusters are explicitly embodied. In view of these facts [17–20], our approach aims to cocluster both the rows and columns of the bipartite networks simultaneously by making full use of the background information of 2-mode nodes. Adding intratype information as additional constraints is expected to improve the clustering performance.

In this work, we extend the NMTF model by adding the intratype information of 2-mode nodes as manifold constraints and develop an optimization solution to improve the coclustering performance on bipartite networks. In addition, because NMTF-based methods often require the number of communities to be specified beforehand, several methods [10, 21, 22] have been developed to solve this problem. Owing to the simplicity and practicability of the existing method in [10], we use it here to obtain the community numbers.

2.2. An Illustrative Example

Before introducing our NMTF-based method for community discovery, let us look at an illustrative example of how NMTF works on bipartite networks. Consider a bipartite network consisting of a subject-node set, an object-node set, and a link set connecting the two. Given the asymmetric adjacency matrix of the bipartite network, NMTF methods approximate it by the product of three low-rank nonnegative factor matrices, as in (1): a subject indicator matrix, a correlation matrix, and an object indicator matrix. The two indicator matrices, respectively, cluster the subject nodes and the object nodes and jointly determine the latent community structure of the bipartite network. The inner dimensions of the factorization are the number of subject-node clusters and the number of object-node clusters; in most cases, both are set to the preset community number. In this way, NMTF can simultaneously group the subject-node set and the object-node set into clusters, where each community is a mixture of the 2-mode heterogeneous nodes.

Let each entry of the adjacency matrix represent the weight of the edge connecting the corresponding subject node and object node. Generally, the original NMTF can be computed using the multiplicative update rules of [23].
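To make the idea of multiplicative updates concrete, the following is a minimal sketch for the unconstrained Frobenius-norm problem of approximating a nonnegative matrix by a product of three nonnegative factors. The notation is an assumption, since the symbols of (1) are not recoverable here: `X` is the n × m adjacency matrix, `F` (n × k1) and `G` (m × k2) hold one row per subject and object node, and `S` (k1 × k2) is the correlation matrix, so that `X ≈ F @ S @ G.T`. The rules shown are the standard gradient-ratio updates for this loss; the rules of [23] may differ, for example by adding orthogonality constraints.

```python
import numpy as np

def nmtf_multiplicative(X, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative X (n x m) by F @ S @ G.T.

    Illustrative sketch only: plain ratio-style multiplicative updates
    for the loss ||X - F S G^T||_F^2, without any extra constraints.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random initialisation keeps every factor nonnegative.
    F = rng.random((n, k1)) + eps
    S = rng.random((k1, k2)) + eps
    G = rng.random((m, k2)) + eps
    for _ in range(n_iter):
        # Each update multiplies by (negative gradient part) / (positive part).
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G
```

Every iteration multiplies large dense matrices, which is the cost that the decoupled scheme of Section 3 is designed to avoid.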

The NMTF framework of community detection on bipartite networks is shown in Figure 2. Here the product of the three factor matrices can be regarded as an approximation of the network. Thus, the results of NMTF methods can be interpreted as follows: each entry of the subject indicator matrix indicates the membership strength of a subject node in a community, and each entry of the object indicator matrix denotes the membership strength of an object node in a community.

Consider a toy bipartite network with edges of varying weights (Figure 3(a)). NMTF coclusters the 2-mode node sets to yield a comprehensive network partition (Figure 3(b)), in which one indicator matrix gives the community membership of the subject nodes and the other reflects the community structure of the object nodes. A larger square indicates a larger value of the corresponding matrix element. In particular, it is worth noting that the correlation matrix represents how subject-clusters are related to object-clusters: each of its columns reflects which subject-clusters contribute to each object-cluster. In this case, subject-clusters 1 and 3 contribute to object-cluster 1, while subject-clusters 2 and 3 contribute to object-cluster 2.
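The column-wise reading of the correlation matrix can be illustrated with a small sketch. The matrix `S` below is hypothetical (its values are not taken from Figure 3); it is chosen only so that its pattern of large entries reproduces the relationships described above, with rows indexing subject-clusters and columns indexing object-clusters. The helper name and the threshold are likewise assumptions.

```python
import numpy as np

# Hypothetical 3 x 2 correlation matrix: rows are subject-clusters,
# columns are object-clusters. Values are illustrative, not from Figure 3.
S = np.array([[0.9, 0.0],
              [0.0, 0.8],
              [0.5, 0.6]])

def contributing_subject_clusters(S, threshold=0.1):
    """For each object-cluster (column), list the subject-clusters (rows)
    whose entry exceeds the threshold. Cluster ids are 1-based."""
    return {
        j + 1: [i + 1 for i in range(S.shape[0]) if S[i, j] > threshold]
        for j in range(S.shape[1])
    }

relations = contributing_subject_clusters(S)
```

With these values, `relations` maps object-cluster 1 to subject-clusters 1 and 3, and object-cluster 2 to subject-clusters 2 and 3.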

Hence, our NMTF framework can simultaneously cocluster both the subject set and the object set. By incorporating the duality information of 2-mode nodes, we can effectively capture the heterogeneous community structure in bipartite networks. From the above illustrative example, we can see that the coclustering results of NMTF agree intuitively with the real community structure of the bipartite network and directly indicate the relationship between subject clusters and object clusters.

3. Fast NMTF Algorithm for Coclustering Community Detection

In this section, we propose a fast nonnegative matrix trifactorization (F-NMTF) method for community detection on bipartite networks. First, we formulate the affinity matrices of the two node sets in bipartite networks and incorporate them into the objective function as two manifold regularizations, one for the subject set and one for the object set. Then we decompose the original optimization problem into two smaller subproblems and present an optimization algorithm that yields iterative updating rules for the three factor matrices. For convenience, the important notations used in this paper are listed in the Notations section.

3.1. Constructing the Affinity Matrices

The original NMTF framework (1) considers only the intertype information of 2-mode nodes. Such a formulation assumes each subnetwork to be independent and fails to model bipartite networks in a unified way. Recently, some researchers [15, 24] have found that coclustering data on manifolds works well for regularization-based clustering of bipartite networks, because it improves the discovery of intrinsic structure in multimode networks. Accordingly, by constructing the two affinity matrices, we incorporate the optional intratype information of 2-mode nodes into NMTF as manifold regularizations. More importantly, we can exploit the manifold structures in both the subject and object spaces to group like-minded users from different social perspectives, thus strengthening community detection in bipartite networks.

First, we construct the affinity matrix as [15, 25] whose entries correspond to subject nodes . Generally, if nodes and share most connections to the nodes of the other mode, it means they are close to each other, and then their corresponding indicator vectors and should be close as well. It is formulated as follows: where is the Frobenius norm, is the th row of the subject indicator matrix and indicates the community membership of subject-node . is the affinity relationship between community indicator vectors and . For simplicity, we define the affinity matrix as follows: where denotes the -nearest neighbor of . In addition, other kinds of affinity can also be adopted, for example, heat kernel [26].
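A k-nearest-neighbor affinity of this kind can be sketched as follows. The construction is an assumption based on the description above: each subject node is represented by its row of the adjacency matrix `X`, neighbors are found by Euclidean distance between rows, and an entry of the 0/1 affinity matrix is 1 when either node is among the other's k nearest neighbors. The function name `knn_affinity` is hypothetical.

```python
import numpy as np

def knn_affinity(X, k):
    """Symmetric 0/1 k-nearest-neighbour affinity between the rows of X.

    W[i, j] = 1 if row j is among the k nearest rows of row i, or vice versa.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    # Squared Euclidean distance between every pair of rows.
    dist = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dist, np.inf)            # a node is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest rows
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(W, W.T)                  # symmetrise ("or" rule)
```

Applying the same function to the transpose of `X` gives the corresponding affinity for the object nodes. A heat-kernel weighting could replace the 0/1 entries without changing the rest of the construction.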

And (3) can be further rewritten as where is the diagonal degree matrix and is the normalized graph Laplacian of the subject-node set .
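The rewriting of a pairwise smoothness penalty as a trace involving a graph Laplacian can be checked numerically. The sketch below assumes a symmetric affinity matrix `W` and an indicator-like matrix `F` with one row per node, and verifies the standard identity for the unnormalized Laplacian `D - W`; it also builds the normalized Laplacian `I - D^{-1/2} W D^{-1/2}`, which is the variant named in the text. Which exact form (3) takes in the paper is not recoverable here, so both function names and the choice of identity are assumptions.

```python
import numpy as np

def graph_laplacians(W):
    """Return the unnormalised (D - W) and normalised Laplacians of W."""
    d = W.sum(axis=1)                       # node degrees
    L = np.diag(d) - W
    safe = np.where(d > 0, d, 1.0)          # avoid division by zero
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(safe), 0.0)
    L_norm = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return L, L_norm

def pairwise_smoothness(W, F):
    """0.5 * sum_ij W[i, j] * ||F[i] - F[j]||^2, computed directly."""
    diff = F[:, None, :] - F[None, :, :]
    return 0.5 * (W * (diff ** 2).sum(axis=2)).sum()
```

For symmetric `W`, `pairwise_smoothness(W, F)` equals `np.trace(F.T @ L @ F)` with `L = D - W`, so nodes with high affinity are pushed toward similar indicator rows when this term is minimized.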

Likewise, we also define the affinity matrix whose entries correspond to object nodes as follows: where denotes the -nearest neighbor of . And (3) for object-node set can be further rewritten as where is the th column of the object indicator matrix and indicates the community membership of object-node . is the diagonal degree matrix, and is the normalized graph Laplacian of the object-node set .

Here we construct two affinity matrices based on the intratype information of 2-mode nodes and introduce them as manifold constraints to explore the hidden community structures of bipartite networks. In the following, we impose these two constraints on NMTF to achieve additional flexibility and incorporate the intratype information to enhance the orthogonality and accuracy of the matrix factorization.

3.2. Objective Function of F-NMTF

Applying the two manifold regularizations to (1), the objective of our F-NMTF approach becomes the minimization problem (8), in which two regularization parameters balance the reconstruction error of F-NMTF in the first term against the manifold regularizations in the second and third terms. Adding regularizations to NMTF is a common strategy, since it not only improves interpretability but also enhances the numerical stability of the estimation by making the NMTF optimization less underconstrained.
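As a rough illustration of such a three-term objective, the sketch below evaluates a reconstruction error plus two Laplacian trace penalties. All names are assumptions, because the symbols of (8) are not recoverable here: `X ≈ F @ S @ G.T`, `L_F` and `L_G` are the graph Laplacians of the subject and object affinity graphs, and `lam` and `mu` are the two regularization parameters. The exact form of (8) may differ.

```python
import numpy as np

def fnmtf_objective(X, F, S, G, L_F, L_G, lam, mu):
    """Reconstruction error plus two manifold (Laplacian) penalties.

    Assumed form: ||X - F S G^T||_F^2 + lam * Tr(F^T L_F F) + mu * Tr(G^T L_G G).
    """
    recon = np.linalg.norm(X - F @ S @ G.T) ** 2
    subject_term = np.trace(F.T @ L_F @ F)
    object_term = np.trace(G.T @ L_G @ G)
    return recon + lam * subject_term + mu * object_term
```

Setting `lam = mu = 0` recovers the plain NMTF reconstruction error, which makes the role of the two parameters as trade-off weights explicit.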

Since the subject and object factor matrices are both constrained to be cluster indicator matrices, (8) is difficult to solve in general. Hence we simplify the problem by using the following proposition.

Proposition 1. Let a symmetric matrix ; the term is equivalent to the following optimization subproblem:

Proof. is equivalent to that is further equivalent to . By definition, the low-rank approximation of is given by ; then the objective term becomes . approximating is equivalent to approximating . Hence, can be reasonably transformed to (9), thus completing the proof of Proposition 1.
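The proposition relies on a low-rank, spectral view of a symmetric matrix. One standard fact behind relaxations of this kind can be sketched numerically: over matrices with orthonormal columns, a Laplacian trace term is minimized by the eigenvectors belonging to the smallest eigenvalues. This is offered only as background under that assumption and is not a restatement of Proposition 1; the name `spectral_basis` is hypothetical.

```python
import numpy as np

def spectral_basis(L, c):
    """Eigenvectors of the c smallest eigenvalues of a symmetric matrix L.

    Among all n x c matrices B with orthonormal columns, these minimise
    Tr(B^T L B), and the minimum equals the sum of those eigenvalues.
    """
    vals, vecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
    return vecs[:, :c], vals[:c]
```

A basis of this kind is computed once from the affinity graph, so it adds no cost to the iterations that follow.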

The two manifold regularizations terms in (8) can be rewritten as

Then, applying Proposition 1 in (8), the objective of our F-NMTF approach is transformed to minimize where and are computed from and following the procedures described in Proposition 1.

This section has presented a general NMTF framework that introduces the affinity matrices as dual manifold regularizations. In this way, our F-NMTF method incorporates both the intratype and intertype information of the bipartite network to cocluster multitype relational networks.

3.3. Optimization Iteration

Because NMTF algorithms typically have high computational complexity, it is valuable to introduce fast iterative rules for the optimization. In this subsection, as a step toward accelerating the convergence of NMTF, we apply a fast iterative algorithm that alternately optimizes the objective, computing one factor matrix while fixing the other variables.

Theorem 2. Given a general optimization problem: when and are fixed, the optimum is given by , where and the singular value decomposition (SVD) of is given by .

Proof. When is fixed, is equivalent to . Let , where , , and are the th entry of and , respectively.
Note that is orthonormal; that is, , thus . is the th singular value of , . Therefore, . That is to say, reaches its maximum when . Therefore, the solution to is . Theorem 2 is proved.
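An SVD-based solution of this kind is an instance of the classical orthogonal Procrustes problem. The sketch below assumes the problem has the form of minimizing `||F - B @ Q||_F` over square orthonormal `Q`, with `F` and `B` fixed; this reading is an assumption, as the exact statement of Theorem 2 is not recoverable here. Under it, the optimum is `Q = U @ Vt`, where `U` and `Vt` come from the SVD of `B.T @ F`.

```python
import numpy as np

def best_orthonormal(B, F):
    """Return the square Q with Q.T @ Q = I that minimises ||F - B @ Q||_F.

    Since ||B @ Q||_F is constant for orthogonal Q, the problem reduces to
    maximising Tr(Q^T B^T F), which the SVD of B^T F solves in closed form.
    """
    U, _singular_values, Vt = np.linalg.svd(B.T @ F)
    return U @ Vt
```

The SVD is taken of a small c × c matrix rather than of anything the size of the network, which is why this step stays cheap.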

According to Theorem 2, the optimization problem of F-NMTF can be decoupled into two subproblems of much smaller size, which involve far fewer matrix multiplications. In this way our approach is computationally efficient and scales well to large real-world networks.

Now we alternately optimize the four variables of the objective function (11). First, fixing the two indicator matrices and setting the derivative of (11) with respect to the correlation matrix to 0, we obtain the closed-form update (13).
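Setting a derivative to zero in this way yields a least-squares solution. As a minimal sketch, assume the reconstruction term is `||X - F @ S @ G.T||_F^2` with `F` and `G` fixed (the notation is an assumption); the minimizing `S` is then `(F^T F)^{-1} F^T X G (G^T G)^{-1}`, written below with pseudo-inverses so that empty clusters do not cause a singular matrix. Whether (13) has exactly this form cannot be confirmed from the text.

```python
import numpy as np

def update_S(X, F, G):
    """Least-squares S for X ~ F @ S @ G.T with F and G held fixed.

    pinv(F) equals (F^T F)^{-1} F^T when F has full column rank.
    """
    return np.linalg.pinv(F) @ X @ np.linalg.pinv(G).T
```

When `F` and `G` are 0/1 cluster indicator matrices, each entry of `S` is simply the mean edge weight of the block between one subject cluster and one object cluster.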

Second, by fixing , , and , we can decouple (11) into two following subproblems:

Applying Theorem 2 to (11), where and are obtained by SVD on ; where and are obtained by SVD on .

Then, we fix , , and to update , and (11) is decoupled to the following simple problems: where ; denotes the th row of .

Due to orthogonality and sparsity, we emphasize that in each row (column) of () only one nonzero element can be set to 1, which clearly indicate the community the corresponding node belongs to. Suppose corresponds to th community of the object-node set; thus the solution is obtained by

When fixing , , and , let , , and (11) is decoupled to the following problems:

Similarly, we obtain as where corresponds to th community of the subject-node set. This procedure is repeated until convergence. The convergence and correctness of this alternating iterative procedure have been proved in [15, 23]; we omit the proofs for lack of space.
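Indicator updates of this kind amount to nearest-prototype assignments. The sketch below shows a simplified version that uses only the reconstruction term and omits the manifold terms of (11); it is therefore not the update rule of (16) or (18) itself. The notation `X ≈ F @ S @ G.T` and the function name are assumptions. Each object node (a column of `X`) is assigned to the column of `F @ S` closest to it, and each subject node (a row of `X`) to the closest row of `S @ G.T`.

```python
import numpy as np

def update_indicators(X, F, S):
    """Reassign object nodes, then subject nodes, to their nearest prototypes.

    Returns 0/1 indicator matrices with exactly one 1 per row.
    """
    k1, k2 = S.shape
    # Object nodes: compare each column of X with each column of F @ S.
    FS = F @ S                                                     # n x k2
    d_obj = ((X[:, :, None] - FS[:, None, :]) ** 2).sum(axis=0)     # m x k2
    G_new = np.eye(k2)[d_obj.argmin(axis=1)]
    # Subject nodes: compare each row of X with each row of S @ G_new.T.
    SGt = S @ G_new.T                                              # k1 x m
    d_sub = ((X[:, None, :] - SGt[None, :, :]) ** 2).sum(axis=2)    # n x k1
    F_new = np.eye(k1)[d_sub.argmin(axis=1)]
    return F_new, G_new
```

Only squared vector norms are compared here, which is the source of the speed advantage over updates that multiply full factor matrices.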

After iterations, we can infer the community membership of heterogeneous nodes based on the F-NMTF results. For simplicity, the community indices of subject/object nodes are determined by taking the maximum of each column/row in . The detailed procedure is illustrated as shown in Algorithm 1.

Input: matrix , ;
Output: Community detection results;
(1) Initialize the factor matrices , ;
(2) Calculate , , and with Proposition 1;
(3) % Obtain community indicator matrices , %
(4) repeat
(5)   Compute by (13);
(6)   Compute ; // and are obtained with SVD on ;
(7)   Compute ; // and are obtained with SVD on ;
(8)   Update by (16);
(9)   Update by (18);
(10) until convergence;
(11) % Infer community labels from , %
(12) , ;
(13) for do
(14)   add subject node to when ;
(15)   add object node to when ;
(16) end
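The overall alternating structure of the algorithm can be sketched in a simplified form. The version below drops the manifold regularization entirely (steps 2, 6, and 7 have no counterpart), so it illustrates only the loop of a least-squares update followed by nearest-prototype reassignments; it is not the F-NMTF algorithm itself. The notation `X ≈ F @ S @ G.T`, the random initialization, and all function names are assumptions.

```python
import numpy as np

def one_hot(labels, k):
    """Turn integer labels into a 0/1 indicator matrix with k columns."""
    return np.eye(k)[labels]

def fast_nmtf_unregularized(X, k1, k2, max_iter=50, seed=0):
    """Alternating coclustering without manifold terms (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = one_hot(rng.integers(k1, size=n), k1)   # random initial subject labels
    G = one_hot(rng.integers(k2, size=m), k2)   # random initial object labels
    history = []
    for _ in range(max_iter):
        # Least-squares correlation matrix for the current indicators.
        S = np.linalg.pinv(F) @ X @ np.linalg.pinv(G).T
        # Reassign object nodes, then subject nodes, to nearest prototypes.
        d_obj = ((X[:, :, None] - (F @ S)[:, None, :]) ** 2).sum(axis=0)
        G_new = one_hot(d_obj.argmin(axis=1), k2)
        d_sub = ((X[:, None, :] - (S @ G_new.T)[None, :, :]) ** 2).sum(axis=2)
        F_new = one_hot(d_sub.argmin(axis=1), k1)
        history.append(np.linalg.norm(X - F_new @ S @ G_new.T) ** 2)
        converged = np.array_equal(F_new, F) and np.array_equal(G_new, G)
        F, G = F_new, G_new
        if converged:
            break
    # Community label of each node = position of the 1 in its indicator row.
    return F.argmax(axis=1), G.argmax(axis=1), S, history
```

Each of the three steps can only lower the reconstruction error or leave it unchanged, so the recorded `history` is non-increasing; the result still depends on the random start and may be a local optimum.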

In our algorithm, the two indicator matrices are sparse, and computing them involves only the enumeration of vector norms, without matrix multiplication; it is therefore computationally efficient. Moreover, instead of optimizing each factor matrix with time-consuming multiplications of large matrices, F-NMTF decouples the original optimization problem into two smaller subproblems requiring far fewer matrix multiplications and coclusters the 2-mode relational nodes in bipartite networks. Compared with other representative NMF solutions, it optimizes the matrix factorization with faster convergence and lower computational complexity.

4. Experimental Results

In this section, we use a series of computer-generated synthetic networks and several real networks to validate the performance of the algorithms. For all the bipartite networks, we compare the experimental results with four other well-known community detection algorithms: Kmeans [27], GNMF [14], DRCC [15], and BNMTF [16]. All the experiments are performed on an Intel Core2 Duo 2.0 GHz PC with 1 GB RAM, running Windows XP.

For the NMF-based methods, including GNMF, DRCC, and our F-NMTF method, the number of nearest neighbors is chosen from a grid as in [25], and both regularization parameters are set to 0.1. In addition, we obtain the community numbers with the method suggested in [10], which has been shown to predict the number of network communities well. In our experiments, we run each method 50 times on every bipartite network and report the average results.

In the following tests, different measures are introduced to evaluate the partition quality of the classical algorithms for community detection in bipartite networks. Since the structures of synthetic networks are all known, we adopt two standard measures widely used for clustering: normalized mutual information (NMI) [28] and execution time to quantify the partition quality of the community detection methods. For the real networks with unknown structure, we use bipartite modularity [29] and execution time for the collective validation. In particular, bipartite modularity is extended from Newman–Girvan modularity [12] and regarded as an effective standard to measure the community structure of partition results.
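For reference, normalized mutual information between a detected partition and the known one can be computed as in the sketch below. The normalization used, twice the mutual information divided by the sum of the two entropies, is one common convention and is an assumption here; [28] may define the normalization differently.

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalised mutual information of two labelings of the same nodes.

    Uses 2 * I(A; B) / (H(A) + H(B)); returns 1.0 when both labelings
    put every node in a single community.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    entropy_a = -sum(c / n * log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0:
        return 1.0
    return 2 * mutual / (entropy_a + entropy_b)
```

The score does not depend on how communities are numbered, so a detected partition that matches the ground truth up to relabeling scores 1.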

(1) Synthetic Networks. Here we apply the model bipartite networks of [29] to generate 9 groups of benchmark datasets with known structures. Each group contains 10 networks generated with the same parameters. In this experiment, we randomly choose 5 networks from each group and average the results for comparison. The intracommunity link probability is fixed at 0.9, while the intercommunity link probability is varied from 0.1 to 0.9 in steps of 0.1. Generally, the higher the intracommunity link probability of the network is, the stronger the community structure that can be detected. Conversely, as the intercommunity link probability becomes larger, the community structure of the model bipartite network becomes more and more obscure. We set community number , subject-node number , and object-node number .
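A planted-partition generator in the spirit of this benchmark can be sketched as follows. The details are assumptions: nodes are assigned to communities in round-robin fashion, and each subject-object pair is linked independently with probability `p_in` if the two nodes share a community and `p_out` otherwise. The sizes passed in are placeholders, and the model of [29] may differ in its specifics.

```python
import numpy as np

def planted_bipartite(n, m, c, p_in, p_out, seed=0):
    """Random bipartite adjacency matrix with c planted communities.

    Returns the n x m 0/1 matrix and the ground-truth labels of the
    subject nodes and the object nodes.
    """
    rng = np.random.default_rng(seed)
    sub = np.arange(n) % c                     # round-robin community labels
    obj = np.arange(m) % c
    same = sub[:, None] == obj[None, :]        # True where communities match
    prob = np.where(same, p_in, p_out)
    X = (rng.random((n, m)) < prob).astype(int)
    return X, sub, obj
```

Sweeping `p_out` from 0.1 to 0.9 with `p_in = 0.9` reproduces the kind of progressively blurred community structure described above.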

Because Kmeans takes much less computation time and is only slightly affected by the intercommunity link probability, we mainly display the execution time of the NMF-based methods in Figure 4(a), where the intercommunity link probability of the model bipartite networks ranges from 0.1 to 0.9. As this probability increases, the community structure of the model bipartite networks becomes weaker, all the algorithms suffer varying degrees of performance degradation, and their execution time rises rapidly. NMF solutions usually require iterative methods, which update the matrix factors entry by entry in each iteration round until convergence. Meanwhile, we can see that F-NMTF converges faster than the other NMF methods and that the gap in execution time between them grows, because F-NMTF is decoupled into two subproblems of much smaller size and optimizes the factorization of each matrix. Even as the community structure of the synthetic bipartite networks becomes weaker and weaker, our algorithm shows much better robustness than the other three algorithms.

Figure 4(b) shows the average NMI values of the different algorithms as the intercommunity link probability changes. For small values of this probability, the NMI scores of all the algorithms exceed 0.8. Specifically, the three NMF methods perform similarly well under strong community structures, and only Kmeans is slightly inferior to the other methods. As the intercommunity link probability increases further, the community cohesion of the model bipartite networks degrades; F-NMTF nevertheless retains relatively high NMI scores and keeps its stability and accuracy in community detection, rather than declining quickly like the other algorithms. Our method is also superior in terms of stability and approximation accuracy, which means that it can achieve small recovery errors for various source networks. Specifically, the average NMI of F-NMTF is up to 5.83% better than that returned by GNMF and 3.13% better than that of DRCC. In summary, the performance of F-NMTF is highly competitive with that of the other methods on bipartite networks.

(2) Real Networks. Real networks are usually more irregular and varied than synthetic networks and have more complex community structures. Here we choose 5 popular real bipartite networks of different sizes: Southwomen [30], Scotland [31], Irvine forum [32], MovieLens [33], and Cond-mat [34]. The basic properties of these networks (numbers of nodes and edges) are presented in Table 1.

Southwomen shows the participation of 18 women in 14 social events over a nine-month period. The dataset was collected in the Southern United States of America in the 1930s. There is an edge for every woman who participates in an event. Irvine forum contains user posts to forums. The users are students at the University of California, Irvine. An edge represents a forum message. MovieLens consists of 100000 user-movie ratings from https://www.movielens.org/. An edge between a user and a movie represents a rating of the movie by the user. Cond-mat contains authorship links between authors and publications in the arXiv condensed matter section from 1995 to 1999. An edge represents an authorship connecting an author and a paper. Scotland dataset contains the corporate interlocks in Scotland in the beginning of the twentieth century. It lists the (136) multiple directors of the 108 largest joint stock companies in Scotland in 1904-1905, including 64 nonfinancial firms, 8 banks, 14 insurance companies, and 22 investment and property companies.

The average execution times of the different algorithms are shown in Table 2(a). We can see that Kmeans takes much less time than the NMF-based algorithms, as it does not need the matrix factorization iterations. For all the real bipartite networks, F-NMTF effectively accelerates the convergence of nonnegative matrix factorization and converges in fewer iterations and CPU seconds than the other NMF methods. Because the network scales are quite different, the corresponding performance of F-NMTF differs, too. For the larger networks, F-NMTF has a greater competitive advantage over the other methods. Our method is slower only than Kmeans, which, however, has much worse clustering performance.

Table 2(b) shows the bipartite modularity values found by the different algorithms. The methods using manifold constraints, including GNMF, DRCC, and F-NMTF, generally achieve better clustering results, which is consistent with the widely accepted hypothesis that using both intratype and intertype information helps the clustering of bipartite networks. F-NMTF attains the maximum-modularity community structure for most test cases, which means that our method has better partition quality and recovers accurate community structure on the real bipartite networks. More importantly, our method does not suffer from the problems of modularity optimization methods and clusters both the rows and columns of the networks simultaneously by making full use of the duality information between 2-mode nodes, which can greatly enhance the clustering performance. Therefore, compared with the other four methods, we conclude that F-NMTF is competitive with popular community detection methods in terms of both accuracy and partition quality while being much faster among the NMF-based methods.
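A bipartite modularity score of the kind reported here can be computed as in the sketch below. It assumes Barber's widely used definition, in which the observed weight of each subject-object pair in the same community is compared with the weight expected from the two node degrees; whether [29] uses exactly this definition is an assumption, as is the function name.

```python
import numpy as np

def bipartite_modularity(X, sub_labels, obj_labels):
    """Barber-style bipartite modularity of a joint partition.

    Q = (1 / E) * sum_ij (X[i, j] - k[i] * d[j] / E) * [label_i == label_j],
    where E is the total edge weight, k the subject degrees, d the object degrees.
    """
    X = np.asarray(X, dtype=float)
    total = X.sum()
    k = X.sum(axis=1)                       # subject-node degrees
    d = X.sum(axis=0)                       # object-node degrees
    same = np.asarray(sub_labels)[:, None] == np.asarray(obj_labels)[None, :]
    return ((X - np.outer(k, d) / total) * same).sum() / total
```

Placing all nodes in a single community gives a score of 0, so positive values indicate more within-community links than the degree sequence alone would predict.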

5. Conclusions

In this work, we proposed a novel fast nonnegative matrix trifactorization method for community detection in bipartite networks. Based on the idea of nonnegative matrix trifactorization, we introduced the intratype information of 2-mode nodes into NMTF via dual manifold regularizations, thus helping to extract meaningful communities from bipartite networks. Meanwhile, we decoupled the NMTF problem into two subproblems of much smaller size, which involve far fewer matrix multiplications and make our approach particularly useful for large-scale real-world data.

Unlike traditional NMF-based methods, our work is an instructive attempt to cocluster multitype nodes in bipartite networks. In practice, our coclustering NMTF framework jointly takes the intertype and intratype information of 2-mode nodes into consideration, thus making the partitioning results more reasonable and effective and detecting communities with high accuracy and quality. Experimental results on synthetic and real-world datasets show that our algorithm is a competitive method for exploring community structures in bipartite networks.

Notations

:Adjacency matrix of a bipartite network
:Subject-node set of a bipartite network
:Object-node set of a bipartite network
:Number of subject-nodes in
:Number of object-nodes in
:Prior-known community number
:Indicator matrix of subject-node partition
:Indicator matrix of object-node partition
:th row of
:th column of
:Affinity matrix of subject-node set
:Affinity matrix of object-node set.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant no. 2012ZX03006002 and the National High Technology Development 863 Program of China under Grant no. 2011AA010604.