Abstract

Community detection is an important task in network analysis, in which we aim to find a network partitioning that groups together vertices with similar community-level connectivity patterns. Bipartite networks are a common type of network in which there are two types of vertices, and only vertices of different types can be connected. While there is a range of powerful and flexible methods for dividing a bipartite network into a specified number of communities, how to determine the number of communities to use remains an open question, and the problem of estimating the number of pure-type communities in a bipartite network has not yet been solved. In this paper, we propose a method named “biCNEQ” (bipartite network communities number estimation based on quality of filtering coefficient), which ensures that all communities are pure type, for estimating the number of communities in a bipartite network. This paper makes the following contributions: (1) we show how a unipartite weighted network, which we call a similarity network, can be projected from a bipartite network using a measure of correlation; (2) we reveal the relation between the similarity correlation of vertices and the edges within communities of a unipartite network; (3) we design a measure of filtering quality named QFC (quality of filtering coefficient) to filter the similarity network and construct a binary network, which we call an approximation network; and (4) we estimate the number of communities in each type of unipartite network using Riolo’s method with the approximation network as input. Finally, biCNEQ is tested on both synthetic bipartite networks and a real-world network, and the results show that it can determine the correct number of communities and performs better than two classical one-mode projection methods.

1. Introduction

The bipartite network is a network whose vertices can be divided into two types, a and b, where every edge connects a vertex of type a to one of type b, and there are no edges connecting vertices of the same type. There are many examples of bipartite networks, such as those described in [1–3]. As with unipartite networks, a common task is to find groups or communities of vertices that connect to the rest of the network in similar ways. Finding this underlying group structure is significant because it can, for example, be used to divide a heterogeneous network into homogeneous subgraphs for subsequent analysis or modeling [4].

Beginning with Newman’s [5] study, community detection has attracted considerable attention from researchers [6], who aim to identify good ways to divide a network into communities. A range of powerful and flexible methods for dividing a bipartite network into a specified number of communities have been proposed in recent years [4, 7, 8]. However, most of them have one key shortcoming: they require us to know the number of communities of a network in advance. In the real world, we usually do not know this number a priori and thus need to estimate it from the data. Recently, several methods have been proposed for making such estimates for unipartite networks [9–12] and bipartite networks [13–16]. Barber [13] introduced bipartite modularity, a variant of the modularity proposed by Newman and Girvan [17]. A dual-projection approach proposed by Han et al. [14] aims to maximize Newman’s one-mode modularity. The authors of [15, 16] maximized Barber’s bipartite modularity for bipartite community detection. However, maximizing either of these modularities has been proved to be an NP-hard problem [6, 18]. The bipartite network communities generated in the previous studies are of mixed type, and so far, no study has explored how to infer the number of pure-type communities in a bipartite network.

In this paper, we propose a method named “biCNEQ” (bipartite network communities number estimation based on quality of filtering coefficient), which ensures that all communities are pure type. The main innovations and contributions of this study can be summarized as follows: (1) a percolation idea-based (PIB) method proposed by Lambiotte and Ausloos [19], who revealed the emergence of social communities and music genres by filtering correlation matrices, is used to project a bipartite network to unipartite correlation networks and (2) a first-principles method given by Riolo et al. [11] is used for inferring the number of communities in a unipartite network. The quality of filtering coefficient (QFC) is designed to select a threshold for filtering the correlation matrix when constructing a binary unipartite network. The resulting network roughly matches the structural features of the original network, namely the correlation and degree of its vertices, which cannot be done using PIB alone. Finally, we use the method of Riolo et al. [11] to estimate the number of communities in each type of unipartite network. In addition, biCNEQ is tested on both synthetic bipartite networks and a real-world network, and the results show that our method performs better than two classical one-mode projection methods.

2. Methods

Tests were performed on both synthetic bipartite networks and a real-world bipartite network with a known community structure.

2.1. Synthetic Networks

We construct a synthetic network based on a degree-corrected bipartite stochastic block model (biSBM) formulated by Larremore et al. [4]. Given a bipartite network with adjacency matrix $A$ (where $N_a$ and $N_b$ are the numbers of vertices of type a and type b), we divide the vertices of type a into $K_a$ groups and the type-b vertices into $K_b$ groups and express the matrix of group interrelationships as a $K \times K$ matrix $\omega$, where $K = K_a + K_b$. Let vertex i of type $t_i$ belong to group $g_i$ and $T_r$ be the type of group r, imposing the constraint $t_i = T_{g_i}$, which indicates that vertex types and group types must match and ensures that groups will be pure type. Let the number of edges between vertices i and j follow a Poisson distribution with mean $\theta_i \theta_j \omega_{g_i g_j}$ and choose the normalization $\sum_i \theta_i \delta_{g_i, r} = 1$, where $\theta_i$ controls the expected degree of vertex i, $\omega$ is a symmetric matrix of parameters to control the number of edges between groups r and s, and $\delta$ is the Kronecker delta. The probability of observing a network G with adjacency matrix A can be written as

$$P(G \mid \theta, \omega, g) = \prod_{i<j} \frac{(\theta_i \theta_j \omega_{g_i g_j})^{A_{ij}}}{A_{ij}!} \exp(-\theta_i \theta_j \omega_{g_i g_j}) = \frac{\prod_i \theta_i^{k_i} \prod_{r<s} \omega_{rs}^{m_{rs}} \exp(-\omega_{rs})}{\prod_{i<j} A_{ij}!}, \tag{1}$$

where $k_i$ is the observed degree of vertex i and $m_{rs}$ is the number of edges between groups r and s. After taking partial derivatives with respect to $\omega_{rs}$ on the logarithm of equation (1), we can get the maximum likelihood parameter as follows:

$$\hat{\omega}_{rs} = m_{rs}. \tag{2}$$

The maximum likelihood $\hat{\theta}$ can be found via the constrained maximization of the logarithm of equation (1) subject to $\sum_i \theta_i \delta_{g_i, r} = 1$ using Lagrange multipliers, i.e.,

$$\hat{\theta}_i = \frac{k_i}{\kappa_{g_i}}, \tag{3}$$

where $\kappa_r$ is the sum of the degrees in group r.
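The two maximum-likelihood estimates above can be illustrated with a short sketch. It assumes the standard degree-corrected block model results, namely that the estimate of each group-pair parameter equals the observed edge count between the two groups and that each vertex parameter equals the vertex degree divided by the total degree of its group. The function name `bisbm_mle` and the toy data are hypothetical and are not taken from the code of [4].

```python
from collections import defaultdict


def bisbm_mle(edges, group):
    """Maximum-likelihood omega and theta for a degree-corrected biSBM.

    edges: list of (i, j) pairs with i a type-a vertex and j a type-b vertex.
    group: dict mapping each vertex label to its (pure-type) group label.
    Vertex labels are assumed to be unique across the two types.
    """
    degree = defaultdict(int)   # k_i: observed degree of vertex i
    m_rs = defaultdict(int)     # m_rs: number of edges between groups r and s
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
        m_rs[(group[i], group[j])] += 1

    kappa = defaultdict(int)    # kappa_r: sum of the degrees in group r
    for vertex, k in degree.items():
        kappa[group[vertex]] += k

    omega = dict(m_rs)          # omega_rs = m_rs
    # theta_i = k_i / kappa_{g_i}; these sum to 1 within every group
    theta = {v: k / kappa[group[v]] for v, k in degree.items()}
    return omega, theta


edges = [("a1", "b1"), ("a1", "b2"), ("a2", "b2")]
group = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
omega, theta = bisbm_mle(edges, group)
```

By construction, the vertex parameters of each group sum to one, which is the normalization imposed in the model description.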

Empirically observed networks are often noisy, with missing or spurious edges. Therefore, we examine the ability of biCNEQ to analyze a range of synthetic networks generated by a mixed model, which is a combination of a planted structure $\omega^{\text{planted}}$ and a random network model $\omega^{\text{random}}$. The latter model is used to create various levels of uniformly random noise. We consider two forms, as in [4], an easy and a difficult case, to illustrate the performance of biCNEQ under different conditions.

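We specify and and create mixed networks using . Then,

$$\omega_{rs} = \lambda\, \omega^{\text{planted}}_{rs} + (1 - \lambda)\, \omega^{\text{random}}_{rs}, \tag{4}$$

where the mixing parameter $\lambda$ takes values between 0 (all noise) and 1 (all planted structure) and according to equation (2). We let , where $m$ is the total number of edges in the network.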
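As a small illustration of the mixing step, the sketch below forms an entrywise convex combination of a planted and a random group matrix. This assumes the linear interpolation form used in [4], in which a mixing parameter of 0 gives pure noise and 1 gives the pure planted structure. The function name `mix_omega` is hypothetical, and both matrices are passed in by the caller because the text does not fix how the random matrix is computed.

```python
def mix_omega(lam, omega_planted, omega_random):
    """Entrywise mix: lam * planted + (1 - lam) * random.

    lam = 0 returns the random (noise) matrix, lam = 1 the planted one.
    Both matrices are lists of rows with identical shapes.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("mixing parameter must lie in [0, 1]")
    return [
        [lam * p + (1.0 - lam) * q for p, q in zip(row_p, row_q)]
        for row_p, row_q in zip(omega_planted, omega_random)
    ]


planted = [[0.0, 4.0], [4.0, 0.0]]
random_part = [[2.0, 2.0], [2.0, 2.0]]
half_mixed = mix_omega(0.5, planted, random_part)
```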

2.1.1. An Easy Case

In the easy case, we define the mixed matrix to have an easily identifiable community structure which consists of four equally sized, unambiguous, and nonoverlapping components with each made up of one type-a and one type-b community. Let N = 60 for each type and divide these vertices evenly across the four components, where . The symmetric entry has the same value. We create networks using . Finally, we use the code, downloaded from http://www.danlarremore.com/bipartiteSBM/makeEasyCaseNetworks.m, to generate mixed synthetic networks of an easy case for testing with the specification above and with its degree distribution unchanged.

2.1.2. A Difficult Case

In the difficult case, the mixed matrix we define is given a less easily identifiable community structure by creating partially overlapping communities, , and has a broad degree distribution. We set different sizes for the communities with 70 type-a vertices, divided evenly into 2 communities {35, 35}, and 30 type-b vertices, divided into 3 communities {10, 15, 5}. Then, we let and ; can be obtained using equation (3). The symmetric entry has the same value. Finally, we use the code, downloaded from http://www.danlarremore.com/bipartiteSBM/makeDifficultCaseNetworks.m, to generate the mixed synthetic network of a difficult case for testing with the specification above and degree distribution unchanged.

2.2. Empirical Networks

The Southern women network collected by Davis et al. [20] contains the observed attendance at 14 social events by 18 Southern women. This network has commonly been used as a benchmark for bipartite network community detection algorithms [4, 13, 21, 22], much as the Zachary “karate club” network is used for benchmarking unipartite community detection algorithms.

2.3. Projection Procedure
2.3.1. Projection

Consider a bipartite network with an $N_a \times N_b$ bipartite adjacency matrix B, where $N_a$ and $N_b$ are the numbers of type-a and type-b vertices, respectively. $B_{ik} = 1$ if there is an edge between type-a vertex $i$ and type-b vertex $k$; otherwise, $B_{ik} = 0$.

A common way to represent and study bipartite networks consists of projecting them onto links of one kind of vertex [23]. The standard projection method simplifies the system to a unipartite network. For instance, from a bipartite network of scientists and papers, one can extract a network of scientists only, who are related by coauthorship. However, such a projection loses a lot of information and leads to an oversimplified and less useful representation [6, 19, 24]. Therefore, we refine it in an alternative way below.

We define for each type-a vertex $i$ the vector [19]:

$$\sigma_i = (B_{i1}, B_{i2}, \ldots, B_{iN_b}), \tag{5}$$

where $B_{ik}$ is equal to 1 if there exists one edge between $i$ and $k$; otherwise, it is 0. Then, we calculate the correlation between vertices $i$ and $j$ using the cosine similarity [25], which is a symmetric correlation measure. That is,

$$C_{ij} = \frac{\sigma_i \cdot \sigma_j}{|\sigma_i|\,|\sigma_j|}, \tag{6}$$

where $\sigma_i \cdot \sigma_j$ denotes the scalar product between $\sigma_i$ and $\sigma_j$. Besides,

$$|\sigma_i| = \sqrt{\sigma_i \cdot \sigma_i} = \sqrt{d_i}, \tag{7}$$

where $d_i$ is the degree of vertex $i$. This measure of correlation, which corresponds to the cosine of the two vectors in an $N_b$-dimensional space, is equal to 1 when their entries are strictly identical and vanishes when they have no common entries. Specifically, for each pair of type-a vertices, $C_{ij} = 0$ when $i$ and $j$ have no common edges with any type-b vertices, and $C_{ij}$ becomes 1 when they have identical edges. We call the matrix $C$ a similarity matrix, with its element $C_{ij}$, and the unipartite weighted network with similarity matrix C is a similarity network.
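The projection described above can be sketched in a few lines. The example computes the cosine similarity between every pair of rows of a 0/1 bipartite adjacency matrix, using the fact that the norm of a binary row is the square root of its degree. The function name `similarity_matrix` is hypothetical, and the convention of assigning similarity 0 to vertices with no edges is an assumption, since the text does not discuss isolated vertices.

```python
import math


def similarity_matrix(B):
    """Cosine similarity between the rows of a 0/1 bipartite adjacency matrix.

    B[i][k] is 1 if type-a vertex i is linked to type-b vertex k, else 0.
    Rows with no edges get similarity 0 with every vertex (an assumption).
    """
    n = len(B)
    degree = [sum(row) for row in B]          # d_i = |sigma_i|^2 for 0/1 rows
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if degree[i] == 0 or degree[j] == 0:
                continue
            common = sum(x * y for x, y in zip(B[i], B[j]))  # scalar product
            C[i][j] = common / math.sqrt(degree[i] * degree[j])
    return C


B = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
C = similarity_matrix(B)
```

Projecting onto the type-b vertices works the same way after transposing the matrix, for example with `[list(col) for col in zip(*B)]`.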

In [19], the authors revealed the emergence of social communities and music genres by filtering similarity matrices. However, the threshold of the filtering coefficient was selected arbitrarily by the authors, and the community structures they found were not unique. To avoid this issue, we first examine the relation between the similarity correlation and the community membership of vertices. We now analyze the relations among edge existence, similarity correlation, and community membership of the vertices of a unipartite network.

2.3.2. Relation between Similarity Correlation and Community’s Edges

According to the definition of a community [6, 26, 27], there are many edges within communities but few edges between communities. Modularity [17, 28] is the most popular function to measure the division quality of a network. Given a particular network with an adjacency matrix $A$, its modularity is defined as follows:

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j), \tag{8}$$

where $k_i$ is the degree of vertex i, m is the number of edges, and $c_i$ denotes the community to which vertex i is assigned. The function $\delta(c_i, c_j)$ yields 1 if vertices i and j are in the same community ($c_i = c_j$) and is 0 otherwise. Therefore, each pair of vertices with an edge between them is more likely to be in the same community than in a different community. This is because it will increase the value of Q if they are in the same community but makes no contribution to Q otherwise.
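For concreteness, the Newman and Girvan modularity can be evaluated directly from its definition. The sketch below is a plain double loop over vertex pairs for a symmetric 0/1 adjacency matrix; the function name `modularity` and the four-vertex example are illustrative only.

```python
def modularity(A, community):
    """Newman-Girvan modularity of a partition of an undirected network.

    A: symmetric adjacency matrix as a list of rows.
    community: community[i] is the community label of vertex i.
    """
    n = len(A)
    degree = [sum(row) for row in A]
    two_m = sum(degree)                      # twice the number of edges
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:
                # observed edge minus the expected value under the null model
                q += A[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


# Two disjoint edges: 0-1 and 2-3.
A = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
q_split = modularity(A, [0, 0, 1, 1])    # each edge in its own community
q_single = modularity(A, [0, 0, 0, 0])   # everything in one community
```

In this toy network, placing the endpoints of each edge in the same community gives a higher value than merging everything into one community, which matches the argument in the text.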

Now, we investigate whether a pair of vertices with a higher similarity correlation is more likely to be in the same community than in a different community. We let the ith row of A be the N-vector $\sigma_i$ of the ith vertex and use the cosine of the two vectors $\sigma_i$ and $\sigma_j$ in the N-dimensional space to quantify the correlation between vertices i and j. A pair of vertices with a higher correlation has more common neighbors. Moreover, we have argued in the previous paragraph that the two ends of an edge are more likely to be in the same community than in different communities. Therefore, a pair of vertices with a higher correlation is more likely to be in the same community than in a different community.

We test our inference on two widely used unipartite networks, the “karate club” network of Zachary [29] and the network of political blogs assembled by Adamic and Glance [30]. Both networks have a known community structure. We define the average similarity correlation of the ith vertex with the vertices in the same community as and the average similarity correlation of the ith vertex with the vertices of different communities as , where n is the total number of vertices and denotes the number of vertices in the community to which vertex i belongs. In Figure 1, we plot, for each vertex of the two networks, its average correlation with the vertices in the same community against its average correlation with the vertices in different communities. Figures 1(a) and 1(b) show that a vertex’s average similarity correlation with the vertices in the same community is greater than that with the vertices in different communities.
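The comparison plotted in Figure 1 can be sketched as follows. The normalization used here is an assumption, since the exact formulas are not recoverable from the text: the within-community average is taken over the other members of the vertex's community, and the between-community average is taken over all vertices in other communities. The function name `average_similarity` is hypothetical.

```python
def average_similarity(C, community, i):
    """Average similarity of vertex i inside and outside its community.

    C: similarity matrix as a list of rows.
    community: community[j] is the community label of vertex j.
    Returns (avg_same, avg_other); an empty set of partners gives 0.0.
    Assumed normalization: the mean over the other members of the
    community, and the mean over all vertices in other communities.
    """
    n = len(C)
    same = [C[i][j] for j in range(n)
            if j != i and community[j] == community[i]]
    other = [C[i][j] for j in range(n) if community[j] != community[i]]
    avg_same = sum(same) / len(same) if same else 0.0
    avg_other = sum(other) / len(other) if other else 0.0
    return avg_same, avg_other


C = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.0],
    [0.1, 0.3, 1.0, 0.9],
    [0.2, 0.0, 0.9, 1.0],
]
avg_same, avg_other = average_similarity(C, [0, 0, 1, 1], 0)
```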

Therefore, we form an edge between each pair of vertices when its similarity correlation value is higher than a given value to construct a binary network from the similarity network of the bipartite network. We will discuss how to select such a threshold in the following section.

2.4. Filtering Procedure

To derive such a binary network from the similarity network of a bipartite network, i.e., transform the correlation values in the continuous range [0, 1] to an edge valued 1 or 0 between a pair of vertices, we define a filtering coefficient $\phi \in [0, 1]$ as in [19]. We filter the similarity matrix elements using $\phi$, so that $F_{ij} = 1$ if $C_{ij} > \phi$ and is equal to 0 otherwise. We call the unweighted unipartite network obtained by filtering the similarity matrix a filtering network, whose adjacency matrix $F$, with one element denoted as $F_{ij}$, is named a filtering matrix.
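The filtering step amounts to thresholding the similarity matrix. The sketch below keeps an edge only when the similarity is strictly greater than the filtering coefficient, following the description above. Excluding the diagonal, so that the filtering network has no self-loops, is an assumption, and the function names `filtering_matrix` and `total_degree` are hypothetical.

```python
def filtering_matrix(C, phi):
    """Binary matrix with an edge wherever the similarity exceeds phi.

    C: similarity matrix as a list of rows, with values in [0, 1].
    phi: filtering coefficient in [0, 1].
    The diagonal is set to 0 (no self-loops), which is an assumption.
    """
    n = len(C)
    return [
        [1 if i != j and C[i][j] > phi else 0 for j in range(n)]
        for i in range(n)
    ]


def total_degree(F):
    """Sum of all vertex degrees, i.e. the sum of all matrix entries."""
    return sum(sum(row) for row in F)


C = [
    [1.0, 0.9, 0.5],
    [0.9, 1.0, 0.1],
    [0.5, 0.1, 1.0],
]
F = filtering_matrix(C, 0.2)
```

Because the comparison is strict, a coefficient of 1 removes every edge, and the total degree can only stay the same or shrink as the coefficient grows.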

We take the Southern women network as an example and plot the total degree of the filtering network as a function of the filtering coefficient $\phi$ on the women similarity network and the events similarity network, both projected from the Southern women dataset. As shown in Figure 2, the total degree of the filtering network reaches a maximum when $\phi = 0$, decreases or remains unchanged with increasing $\phi$, and reaches a minimum when $\phi = 1$. We find a total degree value of 88, which is nearest to the exact number of 89, at on the women filtering network and at on the events filtering network.

This raises the question of how we know when the filtering is good. To answer this, we first introduce the concept of a null model. A null model is a random network which matches the original in some of its structural features but does not have any community structure. The most popular null model is known as the standard null model of modularity [17]. It consists of a randomized version of the original graph, where the edges are rewired at random, under the constraint that the expected degree of each vertex matches the degree of the vertex in the original graph. We call the original network the real network, which is assumed to be a unipartite unweighted network projected from a bipartite network.

Let be the adjacency matrix of the real network of type-a vertices, then and is the total degree of the real network. Now, we build a null model R as in [17]:

We would like the degree of each vertex of the filtering network to approximate the degree of the vertices in the original graph. Firstly, we define a measure of degree difference between the filtering network and the null model, and we call it degree difference (DD):

From equation (10), we know the degree of each vertex of the filtering network approximately matches the degree of the vertices in the original graph when DD is minimized. By taking a derivative with respect to in equation (10) and setting it equal to 0, we havewhere is the total degree of the filtering network and m is the number of edges of the bipartite network. Now, we define a measure of the quality of a filtering network of a bipartite network, which we call QFC, based on equation (11):where is actually the absolute value of . That is to say, the best filtering network, whose total degree matches or is closest to that of the original network, occurs when the QFC reaches a minimum with . We call the best filtering network the approximation network, whose adjacency matrix is named an approximation matrix.

Next, we perform experiments to test the QFC criterion on the Southern women network. We plot the QFC of the filtering network as a function of the filtering coefficient on the type-a similarity network and the type-b similarity network projected from the real-world bipartite network mentioned above. The approximation networks are obtained where the QFC reaches its minimum value at the bottom of the curve, as shown in Figure 3.

In this procedure, we construct the as a function of the filtering coefficient and find the minimum value of and the corresponding . Then, we get the approximation matrix with elements of type-a vertices.
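The threshold search in this procedure can be sketched as a scan over candidate filtering coefficients. The exact QFC formula is not recoverable from the text, so the sketch assumes it is the absolute difference between the total degree of the filtering network and the number of edges m of the bipartite network, which is consistent with the description that the best filtering network is the one whose total degree is closest to that of the original network. The function names and the candidate grid are hypothetical.

```python
def qfc_curve(C, m, thresholds):
    """Return a list of (phi, qfc) pairs for the given candidate thresholds.

    C: similarity matrix as a list of rows.
    m: number of edges of the bipartite network.
    Assumption: QFC(phi) = |total degree of the filtering network - m|,
    where the filtering network keeps off-diagonal entries with C > phi.
    """
    n = len(C)
    curve = []
    for phi in thresholds:
        total = sum(
            1 for i in range(n) for j in range(n)
            if i != j and C[i][j] > phi
        )
        curve.append((phi, abs(total - m)))
    return curve


def best_threshold(C, m, thresholds):
    """Threshold with the smallest QFC; ties go to the first candidate."""
    return min(qfc_curve(C, m, thresholds), key=lambda pair: pair[1])


C = [
    [1.0, 0.9, 0.5],
    [0.9, 1.0, 0.1],
    [0.5, 0.1, 1.0],
]
phi_best, qfc_best = best_threshold(C, 4, [0.0, 0.2, 0.6, 0.95])
```

Thresholding the similarity matrix at the selected coefficient then yields the approximation matrix that is passed to the estimation procedure.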

2.5. Estimation Procedure

In the work of [11], the authors introduced a method for estimating the number of communities in a unipartite network. We can use this method to determine the number of communities in the approximation network of a bipartite network.

Riolo et al. [11] employed a more sophisticated approach, the degree-corrected stochastic block model, to overcome the shortcomings of the stochastic block model [31], which gives substantially better results for real-world network data. With the model specified, they find the probability that a particular network with adjacency matrix is found by the following equation:where k is the number of groups, denotes the group to which vertex i is assigned, r and s are the groups to which the vertices belong to, and is the number of edges running between groups r and s. The parameter is used to independently control the average degree of each node and hence match any desired distribution. The parameter is the expected value of the adjacency matrix entry for vertices i and j belonging to groups r and s, respectively, and they control the community structure. The parameters above have been discussed in detail in [31].

After integrating the parameters and , from equation (13), we havewhere is the number of vertices in group r and is the sum of the degrees of the vertices in group r.

We use equation (13) to derive the probability :where

The values k and $g$ define the “state” of a statistical mechanical system with the probability $P(g, k \mid A)$. States of this system are sampled in proportion to the probability $P(g, k \mid A)$ using Markov chain Monte Carlo sampling. Then, an estimate of the probability $P(k \mid A)$ of having k communities given the observed network A is found using the histogram of values of k over the Monte Carlo sample. The most likely value of k is the one for which $P(k \mid A)$ is greatest. For one network, we performed 10000 Monte Carlo sweeps, each of which includes n individual node moves of two types [11]. After each sweep, the values k and $g$ may change. For example, if k = 5, k = 3, and k = 2 appear in 5000, 3000, and 2000 sweeps, respectively, then the estimated probabilities are P(k = 5|A) = 5000/10000 = 0.5, P(k = 3|A) = 0.3, and P(k = 2|A) = 0.2. Thus, the most likely value of k is 5.
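The histogram step can be made concrete with a few lines of code. The sketch only covers the final counting stage and takes the sequence of k values recorded after each sweep as given; it does not implement the Monte Carlo sampler of [11]. The function name `posterior_over_k` is hypothetical.

```python
from collections import Counter


def posterior_over_k(k_samples):
    """Estimate P(k | A) as the fraction of sweeps in which each k appears.

    k_samples: the value of k recorded after each Monte Carlo sweep.
    Returns a dict mapping k to its estimated posterior probability.
    """
    if not k_samples:
        raise ValueError("at least one sample is required")
    counts = Counter(k_samples)
    total = len(k_samples)
    return {k: count / total for k, count in counts.items()}


# The worked example from the text: 10000 sweeps in total.
samples = [5] * 5000 + [3] * 3000 + [2] * 2000
posterior = posterior_over_k(samples)
most_likely_k = max(posterior, key=posterior.get)
```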

We set as the input to the unipartite network communities number estimating method of Riolo et al. [11] to estimate the number of communities in the network of type-a vertices. Then, by transposing the affiliation matrix B and using the same method as above, we can estimate the number of communities in the network of type-b vertices.

Now, let us analyze the time complexity of our method for type-a network. In the projection procedure, we take times to calculate cosine similarity. Then, we take time approximately to finish the filtering procedure, mainly finding the total degree of one filtering matrix. Finally, we take to move n nodes to k communities and to calculate the complete probability , so the estimation procedure takes time . Therefore, the complete time complexity of our method is .

3. Results

In this section, we compare the partitions generated by other one-mode projections with the performance of the proposed biCNEQ. There are two types of projections, which we call classical unweighted projection (CUP) and classical weighted projection (CWP) in order to distinguish from our method. An unweighted projection of a bipartite network onto its type-a vertices is obtained by letting two type-a vertices i and j be connected if they share any type-b neighbor k. Each edge of a weighted projection has a weight equal to the number of shared neighbors. Given an adjacency matrix , the classical weighted projection matrix P and the classical unweighted projection matrix are, respectively, given by and where the diagonal blocks of and correspond to the projections onto type-a and type-b vertices, respectively. The matrix P is equivalent to a “two-step” adjacency matrix, with each entry weighted by the number of length-2 paths between each pair of vertices [4]. Then, we set P or as the input to the unipartite network communities number estimating method of Riolo et al. [11]. We will demonstrate that our method performs better than CUP and CWP in the following sections.
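The two classical projections used as baselines can be sketched from their verbal definitions: the weighted projection counts the type-b neighbors shared by two type-a vertices, and the unweighted projection records only whether that count is positive. The sketch works on the rectangular bipartite adjacency matrix rather than on the full square adjacency matrix, and it sets the diagonal to zero, which is an assumption. The function names are hypothetical.

```python
def weighted_projection(B):
    """Classical weighted projection onto the row (type-a) vertices.

    B[i][k] is 1 if type-a vertex i is linked to type-b vertex k, else 0.
    Entry (i, j) is the number of type-b neighbors shared by i and j.
    The diagonal is set to 0 (no self-loops), which is an assumption.
    """
    n = len(B)
    return [
        [sum(x * y for x, y in zip(B[i], B[j])) if i != j else 0
         for j in range(n)]
        for i in range(n)
    ]


def unweighted_projection(B):
    """Classical unweighted projection: 1 if any neighbor is shared."""
    return [[1 if w > 0 else 0 for w in row] for row in weighted_projection(B)]


B = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]
P_weighted = weighted_projection(B)
P_unweighted = unweighted_projection(B)
```

The projection onto the type-b vertices follows by applying the same functions to the transpose of the matrix.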

3.1. Synthetic Network: The Easy Case

As the mixing parameter $\lambda$ increases, i.e., the level of noise decreases, the fraction of correct community numbers of the approximation networks of type-a and type-b vertices calculated by biCNEQ increases overall (blue line in Figure 4). However, CUP and CWP only give the correct community numbers in the noise-free situation ($\lambda = 1$), as shown by the red and green lines in Figure 4. Then, we use our method to derive the approximation networks of synthetic mixed networks generated with (red circles in Figure 4) and show posterior probabilities of the number of communities in the approximation network in Figure 5. As shown in Figures 4 and 5, when , our method can estimate the correct number of communities in the type-a vertices approximation network with the adjacency matrix for this easy case. Analysis of the type-b vertices approximation network is carried out in the same way.

Next, we test whether biCNEQ scales well when the parameters of the synthetic networks are set as in Table 1. First, we define the success estimation rate (SER) as

$$\mathrm{SER} = \frac{\text{number of runs that estimate the correct number of communities}}{\text{total number of runs}},$$

where each run includes 50000 Monte Carlo sweeps. The greater the SER is, the better biCNEQ performs. As can be seen from Figure 6(a), our method performs reliably when and whereas CUP and CWP can only deal with the network when it is noise free. As can be seen from Figure 6(b), biCNEQ performs less well when the size of communities grows. biCNEQ also performs less well when the level of noise increases and the number K of planted communities grows, and it hardly gives the correct community number when and .
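As a brief illustration, the success estimation rate can be computed from the most likely community number returned by each run. The sketch assumes the SER is the fraction of runs whose estimate equals the planted number of communities, which is what the name suggests; the function name and the sample values are hypothetical and are not results from the paper.

```python
def success_estimation_rate(estimates, true_k):
    """Fraction of runs whose estimated community number equals true_k.

    estimates: the most likely k returned by each independent run.
    true_k: the planted (known) number of communities.
    """
    if not estimates:
        raise ValueError("at least one run is required")
    correct = sum(1 for k in estimates if k == true_k)
    return correct / len(estimates)


# Hypothetical outcomes of four runs on a network with 4 planted communities.
ser = success_estimation_rate([4, 4, 3, 4], true_k=4)
```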

3.2. Synthetic Network: The Difficult Case

For our method, as shown in Figure 7 (blue line), when the level of noise is decreased, the fraction of correct estimates of the number of communities of the approximation network of type-a vertices remains stable with small fluctuations while . It increases sharply when . That of type-b vertices remains stable with small fluctuations when and increases sharply when . We used our method to derive the approximation networks of synthetic mixed networks generated with (see red circles in Figure 7) and show posterior probabilities of the number of communities in the approximation network in Figure 8. As seen from Figures 7 and 8, our method can estimate the correct number of communities in the type-a vertices approximation network with the adjacency matrix when and of communities in the type-b vertices approximation network with the adjacency matrix when for the difficult case.

However, CUP and CWP can only correctly identify three communities in the type-b unipartite network without noise ($\lambda = 1$) and fail to estimate the correct community number on the type-a unipartite network, as shown by the red and green lines in Figure 7, respectively. The reason is that the average size of type-a communities (35 nodes) is larger than that of type-b communities (10 nodes). Furthermore, we found that biCNEQ performs less well when the size of communities grows and fails to work even when reaches 40 nodes with and , which is a very small network.

3.3. Empirical Networks

We use our method, with a filtering coefficient of , as shown in Figure 3, to create the Southern women approximation network, whose adjacency matrix is denoted as Aw. Then, we use Aw as the input to the method of Riolo et al. [11]. As shown in Figure 9(a), we can estimate the correct number of Southern women communities, which matches the one determined in [4, 21, 22]. Figure 9(b) shows that, for the events, receives the most weight but comes a close second. It is interesting that the number of event groups is 2 in [21] but 3 in [4]. However, the community number of both women and events calculated by both classical projection methods is 1, which is not plausible.

4. Conclusions

In this paper, we developed a method called biCNEQ for inferring the number of pure-type communities into which a bipartite network can be divided. We designed a measure of filtering quality named QFC to select a threshold of the filtering coefficient, which is used to filter a weighted similarity network projected from a bipartite network and obtain a binary unipartite network. Then, we used the method of [11] to estimate the number of communities in the approximation network of each type of vertices. In tests on an empirical network with a known community structure and on mixed synthetic networks, including an easy case and a difficult case, biCNEQ gives correct answers and performs better than the classical unweighted and weighted projection methods.

As discussed in the last section, the performance of our method degrades when the community size grows, especially in the difficult case synthetic network. This shortcoming makes it hard for biCNEQ to scale to real-world networks with community structure. It may be due to information loss in the projection and filtering procedures or in other stages of the proposed method. Thus, in future work, the following two issues can be investigated: (1) an improved projection approach that minimizes the information lost in the biCNEQ method and (2) a projection-free approach using a bipartite degree-corrected stochastic block model and Markov chain Monte Carlo sampling.

Data Availability

The community and edge data of the “karate club” network and the “political blogs” network are obtained from Newman’s web pages (http://www-personal.umich.edu/~mejn/dcsbm/ZacharyCorrectOutput/DegreeCorrected/ActualComms.tsv, http://www-personal.umich.edu/~mejn/dcsbm/ZacharyCorrectOutput/DegreeCorrected/EdgeLists.tsv, http://www-personal.umich.edu/~mejn/dcsbm/PolBlogsCorrectOutput/DegreeCorrected/ActualComms.tsv, and http://www-personal.umich.edu/~mejn/dcsbm/PolBlogsCorrectOutput/DegreeCorrected/EdgeLists.tsv).

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61773250), the Shanghai Municipal Education Commission of China (no. 2019PD1-2-37), and the Program for Shanghai Youth Top-Notch Talent.