Abstract

Community detection is an important analysis task for complex networks, including bipartite networks, which consist of nodes of two types and edges connecting only nodes of different types. Many community detection methods take the number of communities in the network as a fixed, known quantity; however, this number is usually not known in advance for real-world networks. In this paper, we propose a projection-free Bayesian inference method to determine the number of pure-type communities in bipartite networks. This paper makes the following contributions: (1) we present a first-principles derivation of a practical method for estimating the number of pure-type communities in bipartite networks, using the degree-corrected bipartite stochastic block model, which can handle networks with broad degree distributions; (2) we propose a prior probability distribution over the partitions of a bipartite network; (3) we design a Monte Carlo algorithm that incorporates the proposed method and prior probability distribution. We demonstrate the algorithm on synthetic bipartite networks, including an easy case with a homogeneous degree distribution and a difficult case with a heterogeneous degree distribution. The results show that the algorithm gives the correct number of communities for the synthetic networks in most cases and outperforms the projection method, especially in networks with heterogeneous degree distributions.

1. Introduction

A bipartite network is a network with nodes of two types and edges connecting only nodes of different types. The decomposition of bipartite networks into communities (clusters, modules, or groups), i.e., community detection, plays an important role in revealing the structure of large networked systems, providing new insights into how the network is organized [1–4].

Many methods [5–9] have been developed for community detection in bipartite networks in recent years. A fundamental shortcoming of most community detection methods is that they partition networks into a fixed number of groups. However, this number is usually unknown in real-world networks, and we need to mine such information from the network data. A considerable amount of research [10–13] has recently sought to determine the number of communities in bipartite networks. These methods have three main problems. First, they perform the estimation by maximizing the modularity proposed in [10, 14], a problem that has been proved to be NP-hard [15, 16]; second, they give the number of mixed-type communities, which is nearly always substantially less efficient [6]; third, the projection method [15] performs poorly because of information loss. The heuristic methods proposed in [17, 18] for community detection in bipartite networks do not need the number of communities to be given a priori.

In this paper, we propose a projection-free Bayesian inference method for determining the number of pure-type communities in a bipartite network. Our method builds mainly on two lines of work: (i) the degree-corrected bipartite stochastic block model, proposed by Larremore et al. [6], which is used to find the community structure of empirical networks with broad degree distributions; and (ii) the prior probability distribution over divisions of a network into groups and a new prior probability distribution based on a queueing-type process, both proposed by Riolo et al. [4], which are used for calculating the number of communities in a unipartite network.

In Section 2, we first present a first-principles derivation of a practical method for estimating the number of pure-type communities in bipartite networks, using the degree-corrected bipartite stochastic block model, which can handle networks with broad degree distributions. Second, we propose a prior probability distribution over the partitions of a bipartite network, with the community-type parameter ensuring that each community is pure type. In Section 3, we design a Monte Carlo algorithm that incorporates the proposed method and prior probability distribution. In Section 4, we demonstrate our method on synthetic bipartite networks, including an easy case with a homogeneous degree distribution and a difficult case with a heterogeneous degree distribution. The results show that the proposed algorithm can determine the correct number of communities and performs better than the projection method in every case.

2. Methods

2.1. Degree-Corrected Bipartite Stochastic Block Model

The stochastic block model is a generative model used to produce networks containing blocks, groups, or communities. This model is very important in network science and is used for recovering the community structure in network data [2, 3]. The classic stochastic block model can be described as follows: divide the N vertices into K disjoint communities; any two vertices i and j are connected by an edge with probability $\omega_{g_i g_j}$, which is an entry of a symmetric K × K matrix $\omega$, and $g_i$ is the community of vertex i. However, the block model described above tends to find community structure that merely reflects the degree sequence and fails to detect the known communities in a real-world network that has a heterogeneous degree distribution [19]. Karrer and Newman [20] extended the classic stochastic block model to include heterogeneity in the degrees of vertices and proposed the degree-corrected stochastic block model, which has been shown to overcome the problems of the classic block model.

Most stochastic block model community detection methods can be naturally applied to bipartite networks [20, 21]. Unfortunately, the stochastic block model often overfits bipartite data by mixing nodes of different types within communities, and it is nearly always substantially less efficient [6]. Building on the work of Karrer and Newman [20], Larremore et al. [6] proposed the degree-corrected bipartite stochastic block model, which is employed in our calculations. In the degree-corrected bipartite stochastic block model, a bipartite network G is given with an $N_a \times N_b$ asymmetric bipartite adjacency matrix B, where $N_a$ is the number of nodes of type a and $N_b$ is the number of nodes of type b. Let A be the symmetric adjacency matrix of the network G with $N = N_a + N_b$ nodes. The type-a nodes are divided into some number $k_a$ of communities, labeled $1, \ldots, k_a$, and the nodes of type b are divided into $k_b$ communities, labeled $k_a + 1, \ldots, k_a + k_b$. We express the matrix of community interrelationships as a $k \times k$ matrix, where $k = k_a + k_b$. Let $g_i$ again encode the community node i belongs to. Let $t_i$ be the type of vertex i and $T_r$ be the type of community r, imposing the constraint

$$T_{g_i} = t_i, \qquad (1)$$

which indicates that node types and community types must match and ensures that communities will be pure type. We write

$$k_a = \sum_{r=1}^{k} \delta_{T_r, a}, \qquad (2)$$

$$k_b = \sum_{r=1}^{k} \delta_{T_r, b}, \qquad (3)$$

where $\delta$ is the Kronecker delta. Let $\theta_i$ control the expected degree of node i and $\omega$ be the symmetric $k \times k$ matrix of parameters that control the number of edges between communities r and s. Following [4], the normalization of $\theta_i$ can be fixed by imposing the constraint

$$\frac{1}{n_r} \sum_{i} \theta_i \, \delta_{g_i, r} = 1, \qquad (4)$$

where $n_r$ is the number of nodes in community r. Following [22], we let the number of edges between nodes i and j follow a Poisson distribution with mean $\theta_i \theta_j \omega_{g_i g_j}$. Enforcing the bipartite constraint of equation (1) produces a restriction on ω:

$$\omega_{rs} = 0 \quad \text{if } T_r = T_s. \qquad (5)$$

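Given parameters $\theta$, $\omega$, $g$, $k$, and T for the specification of the model, the probability of observing a bipartite network G with adjacency matrix A can be written as

$$P(A \mid \theta, \omega, g, k, T) = \prod_{i<j} \frac{\left(\theta_i \theta_j \omega_{g_i g_j}\right)^{A_{ij}}}{A_{ij}!} \exp\left(-\theta_i \theta_j \omega_{g_i g_j}\right). \qquad (6)$$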

Allowing for the constraint of equation (4), the probability can be simplified to the more convenient form

$$P(A \mid \theta, \omega, g, k, T) = \prod_{i} \theta_i^{d_i} \prod_{r<s,\; T_r \neq T_s} \omega_{rs}^{m_{rs}} \exp\left(-n_r n_s \omega_{rs}\right), \qquad (7)$$

where $d_i$ is the observed degree of vertex i and $m_{rs}$ is the number of edges between communities r and s. We have neglected an overall multiplicative constant in (7) since it cancels out in later calculations. Note that a similar probability given by Larremore et al. [6] has been modified in equation (7) as follows:

(i) The number k of communities, the quantity we aim to estimate, is incorporated as an unknown quantity.

(ii) The exponential expression is $\exp(-n_r n_s \omega_{rs})$ rather than $\exp(-\omega_{rs})$, because the normalization of $\theta$ is fixed under a different constraint condition.
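The simplified likelihood can be evaluated in log space from three sufficient statistics: the degrees, the community sizes, and the edge counts between communities. The following is a minimal sketch only, assuming that equation (7) has the form $\sum_i d_i \log\theta_i + \sum_{r<s} (m_{rs}\log\omega_{rs} - n_r n_s \omega_{rs})$ after taking logarithms, with the second sum restricted to pairs of communities of different types. The function name, the 0-based labels, and the edge-dictionary input format are illustrative assumptions, not part of the original method.

```python
import math

def log_likelihood(edges, theta, g, community_type, omega):
    """Log of the simplified likelihood, up to the neglected additive constant.

    edges          -- dict {(i, j): A_ij}, one entry per connected node pair
    theta          -- list of degree parameters theta_i (assumed positive)
    g              -- list of community labels g_i (0-based in this sketch)
    community_type -- list giving the type ('a' or 'b') of each community
    omega          -- symmetric k x k list of lists of block parameters
    """
    k = len(community_type)
    n = [0] * k                      # n_r: number of nodes in community r
    for r in g:
        n[r] += 1
    degree = [0] * len(g)            # d_i: observed degree of node i
    m = {}                           # m_rs: edges between communities r < s
    for (i, j), a in edges.items():
        degree[i] += a
        degree[j] += a
        r, s = sorted((g[i], g[j]))
        m[(r, s)] = m.get((r, s), 0) + a

    logp = sum(d * math.log(t) for d, t in zip(degree, theta) if d > 0)
    for r in range(k):
        for s in range(r + 1, k):
            if community_type[r] == community_type[s]:
                continue             # omega_rs = 0 for same-type communities
            m_rs = m.get((r, s), 0)
            if m_rs > 0:
                if omega[r][s] <= 0.0:
                    return float("-inf")   # observed edges are impossible
                logp += m_rs * math.log(omega[r][s])
            logp -= n[r] * n[s] * omega[r][s]
    return logp
```

For a network with one type-a node, one type-b node, a single edge, $\theta_i = 1$, and $\omega_{12} = 0.5$, the function returns $\log 0.5 - 0.5$, which matches a direct evaluation of the Poisson model.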

Then, we integrate out the irrelevant parameters θ and ω. We assume maximum-entropy (i.e., least informative) prior probability distributions on the parameters θ and ω. For θ, this means a uniform prior probability distribution over the regular simplex of values specified by equation (4). For ω, we let the expected value of the edge probability be equal to the observed average edge probability $p$ in the network as a whole, which is computed from $m$, the total number of edges in the bipartite network. Then, the maximum-entropy prior probability distribution is an exponential distribution, $P(\omega_{rs}) = \frac{1}{p} e^{-\omega_{rs}/p}$. We assume the priors to be independent (conditioned on g, k, and T) so that $P(\theta, \omega \mid g, k, T) = P(\theta \mid g)\, P(\omega \mid k, T)$ and

$$P(A \mid g, k, T) = \iint P(A \mid \theta, \omega, g, k, T)\, P(\theta \mid g)\, P(\omega \mid k, T)\, d\theta\, d\omega. \qquad (8)$$

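With these choices of priors, the integration in equation (8) can be performed. Then, we have

$$P(A \mid g, k, T) = \prod_{r} \frac{n_r^{\kappa_r} (n_r - 1)!}{(n_r + \kappa_r - 1)!} \prod_{r<s,\; T_r \neq T_s} \frac{m_{rs}!}{(p\, n_r n_s + 1)^{m_{rs}+1}}, \qquad (9)$$

where $\kappa_r = \sum_i d_i \delta_{g_i, r}$ is the sum of the degrees of the nodes in community r and an overall multiplicative constant has been discarded.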
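The integrated likelihood depends on the data only through the community sizes, the degree sums per community, and the edge counts between communities, so it is cheap to evaluate in log space. The sketch below assumes the integrated form analogous to Riolo et al. [4], restricted to pairs of communities of different types: $\sum_r [\kappa_r \log n_r + \log (n_r-1)! - \log (n_r+\kappa_r-1)!] + \sum_{r<s} [\log m_{rs}! - (m_{rs}+1)\log(p\,n_r n_s + 1)]$, where $\kappa_r$ is the sum of degrees in community r. The function name, the 0-based labels, and passing $p$ as an argument are illustrative assumptions.

```python
import math

def log_integrated_likelihood(edges, g, community_type, p):
    """Log of the likelihood with theta and omega integrated out (constants dropped).

    edges          -- dict {(i, j): A_ij}, one entry per connected node pair
    g              -- list of community labels g_i (0-based in this sketch)
    community_type -- list giving the type ('a' or 'b') of each community
    p              -- observed average edge probability (mean of the omega prior)
    """
    k = len(community_type)
    n = [0] * k                      # n_r: community sizes
    kappa = [0] * k                  # kappa_r: sum of degrees in community r
    for r in g:
        n[r] += 1
    m = {}                           # m_rs: edges between communities r < s
    for (i, j), a in edges.items():
        kappa[g[i]] += a
        kappa[g[j]] += a
        r, s = sorted((g[i], g[j]))
        m[(r, s)] = m.get((r, s), 0) + a
    if min(n) == 0:
        raise ValueError("every community must be nonempty")

    logp = 0.0
    for r in range(k):
        # lgamma(x) = log((x - 1)!) for positive integers x
        logp += kappa[r] * math.log(n[r]) + math.lgamma(n[r]) - math.lgamma(n[r] + kappa[r])
    for r in range(k):
        for s in range(r + 1, k):
            if community_type[r] == community_type[s]:
                continue             # same-type communities share no edges
            m_rs = m.get((r, s), 0)
            logp += math.lgamma(m_rs + 1) - (m_rs + 1) * math.log(p * n[r] * n[s] + 1.0)
    return logp
```

On a small network made of two disconnected complete bipartite blocks, this score is higher for the planted partition than for a partition that splits each block across communities, which is the behavior a model selection criterion should show.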

2.2. Prior on Community Partitions

Our goal is to estimate the correct values of $k_a$ and $k_b$ for a given bipartite network, using this model as the basis for a Bayesian model selection procedure. We have

$$P(g, k, T \mid A) = \frac{P(A \mid g, k, T)\, P(g, k, T)}{P(A)}, \qquad (10)$$

where $P(A \mid g, k, T)$ is given by equation (9), and the probability $P(A)$ in the denominator of equation (10) is unknown but has no effect on our results because it cancels out in later calculations. In this paper, our primary focus is to get the posterior distribution on $k$ by summing over $g$; then, we choose a value for $k$ and calculate $k_a$ and $k_b$ using equations (2) and (3). Now, we start to choose the prior $P(g, k, T)$, which is often the most important and difficult task of the calculation in Bayesian methods.

2.2.1. Prior on Community Partitions

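If we know the number of communities k in advance, we can choose the prior on community partitions of one type of node; let $n$ denote the number of nodes of that type and $n_r$ the number of nodes in community $r$. We first employ the most commonly used approach, which is described as follows. The prior on the community assignment probabilities γ is uniform under the constraint $\sum_{r=1}^{k} \gamma_r = 1$, where $\gamma_r$ is the probability with which nodes are assigned to community $r$ independently at random. We get a particular community partition $g$ with the probability

$$P(g \mid \gamma) = \prod_{r=1}^{k} \gamma_r^{n_r}. \qquad (11)$$

The values $\gamma_r$ fall on a regular $(k-1)$-dimensional simplex with volume $1/(k-1)!$, so the probability density is $P(\gamma \mid k) = (k-1)!$. We integrate equation (11) over the simplex and get the following equation [4, 23, 24]:

$$P(g \mid k) = \int P(g \mid \gamma)\, P(\gamma \mid k)\, d\gamma = \frac{(k-1)!}{(n+k-1)!} \prod_{r=1}^{k} n_r!. \qquad (12)$$

Since the process above generates a uniform distribution over possible community sizes, we then introduce an alternative and simpler way (used by Riolo et al. [4]) to derive the prior $P(g \mid k)$. We have $\binom{n+k-1}{k-1}$ possible ways to choose the sizes of k communities with $\sum_r n_r = n$ and $n!/\prod_r n_r!$ possible ways to place the nodes in the k communities; thus, any partition of nodes into communities is given with the probability

$$P(g \mid k) = \binom{n+k-1}{k-1}^{-1} \frac{\prod_{r} n_r!}{n!}, \qquad (13)$$

which is the same as equation (12) without the need for the parameters γ.

However, these two methods may generate partitions with empty communities. As in [4], we have the binomial coefficient $\binom{n-1}{k-1}$ possible choices of sizes for k nonempty communities and

$$P(g \mid k) = \binom{n-1}{k-1}^{-1} \frac{\prod_{r} n_r!}{n!}. \qquad (14)$$

Then, we allow for the two different types of nodes and get

$$P(g \mid k_a, k_b) = \binom{N_a-1}{k_a-1}^{-1} \binom{N_b-1}{k_b-1}^{-1} \frac{\prod_{r=1}^{k} n_r!}{N_a!\, N_b!}, \qquad (15)$$

where $N_a$ and $N_b$ are the numbers of type-a and type-b nodes, $k_a$ and $k_b$ are the numbers of type-a and type-b communities, and $k = k_a + k_b$.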
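A quick way to check a prior over partitions into nonempty communities is to confirm numerically that it sums to one. The sketch below assumes the single-type prior $P(g \mid k) = \binom{n-1}{k-1}^{-1} \prod_r n_r! / n!$ described in [4] and takes the bipartite prior to be the product of one such factor per node type; the function names are illustrative, and the brute-force enumeration is only feasible for very small $n$.

```python
import math
from itertools import product

def log_prior_nonempty(sizes):
    """Log prior of one partition of a single node type into nonempty communities."""
    n, k = sum(sizes), len(sizes)
    return (sum(math.lgamma(s + 1) for s in sizes)   # log prod_r n_r!
            - math.lgamma(n + 1)                     # - log n!
            - math.log(math.comb(n - 1, k - 1)))     # - log C(n-1, k-1)

def log_prior_bipartite(sizes_a, sizes_b):
    """Assumed bipartite prior: independent factors for type-a and type-b nodes."""
    return log_prior_nonempty(sizes_a) + log_prior_nonempty(sizes_b)

def total_probability(n, k):
    """Sum the prior over every assignment of n nodes to k nonempty communities."""
    total = 0.0
    for labels in product(range(k), repeat=n):
        sizes = [labels.count(r) for r in range(k)]
        if 0 in sizes:
            continue                                 # skip partitions with empty communities
        total += math.exp(log_prior_nonempty(sizes))
    return total
```

Calling `total_probability(5, 3)` enumerates all 243 labelings of five nodes and returns a value equal to 1 up to floating-point error, because each of the $\binom{4}{2}$ size compositions contributes the same total weight.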

2.2.2. Choice of the Number of Communities

Previous work has considered several choices for the prior P(k) over the number of communities itself [21, 23, 24]. We again follow [4] and take a different approach, in which the community partition $g$ and the number of communities k are generated simultaneously. We use a queueing-type mechanism for generating the community partition. For one type, such as type a, we order the $N_a$ nodes uniformly at random and place the first node in community 1. Then, we place each following node either (a) with probability $1-q$ in the same community as the previous node or (b) with probability $q$ in the next community; for the other type, we repeat the process above. This process ensures that none of the communities generated is empty.
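The queueing-type process is straightforward to simulate, and the same routine can serve to draw an initial partition for a sampler. The following is a minimal sketch, assuming 0-based community labels with type-a communities numbered first and type-b communities numbered after them; the function names and the `seed` argument are illustrative.

```python
import random

def queue_partition(n, q, rng, first_label=0):
    """Assign n nodes of one type to consecutive nonempty communities."""
    order = list(range(n))
    rng.shuffle(order)                      # order the nodes uniformly at random
    labels = [None] * n
    current = first_label
    labels[order[0]] = current              # the first node opens the first community
    for node in order[1:]:
        if rng.random() < q:                # with probability q, start the next community
            current += 1
        labels[node] = current              # otherwise join the previous node's community
    return labels, current - first_label + 1

def bipartite_queue_partition(n_a, n_b, q, seed=0):
    """Run the process once per node type; type-b labels follow the type-a labels."""
    rng = random.Random(seed)
    labels_a, k_a = queue_partition(n_a, q, rng)
    labels_b, k_b = queue_partition(n_b, q, rng, first_label=k_a)
    return labels_a + labels_b, k_a, k_b
```

With `q = 0.0` every node of a type lands in a single community, and with `q = 1.0` every node opens its own community, which are the two extremes of the process.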

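For type a, there are $N_a!$ ($N_b!$ for type b) possible ways to order the nodes, so each ordering occurs with the same probability $1/N_a!$ ($1/N_b!$ for type b). If $k$ communities are generated in total, we must create $k-2$ new communities ($k_a - 1$ new type-a ones and $k_b - 1$ new type-b ones). Because, for each type, every node except the first starts a new community with equal probability q, k communities with sizes $n_1, \ldots, n_k$ are generated with the probability

$$\frac{q^{k-2} (1-q)^{N-k}}{N_a!\, N_b!}, \qquad (16)$$

where $N = N_a + N_b$ is the total number of nodes.

For each community $r$ of the same partition $g$, the nodes can be rearranged in $n_r!$ ways. Thus, any given partition $g$ is generated in the process with the probability

$$P(g, k, T) = \frac{q^{k-2} (1-q)^{N-k}}{N_a!\, N_b!} \prod_{r=1}^{k} n_r!. \qquad (17)$$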

Given that ,

We let , where the expected number of new communities created is µ. Then,

In equation (19), has no effect on our result and cancels out in later calculations.

As in [4], we let µ = 1 and neglecting constants

Now, equation (10) can be written asand here we allow for equations (9) and (20).

Unfortunately, it is hard to sum over $g$ since the sum has $k^N$ terms [4]. Instead, we approximate the distribution over $k$ ($k_a$ and $k_b$ for the respective types) by Markov chain Monte Carlo sampling.

3. Monte Carlo Algorithm for Bipartite Networks

3.1. Our Algorithm

We design a Monte Carlo algorithm that incorporates the bipartite block model and the prior probability distribution discussed above for application to bipartite networks. We call our algorithm the bipartite network Monte Carlo algorithm (BMCA); it is built on the unipartite network analysis of Riolo et al. [4]. Our algorithm fulfills the requirements of ergodicity and detailed balance [25].

There are two types of steps used by BMCA (below, $k_a$ and $k_b$ denote the numbers of type-a and type-b communities and $k = k_a + k_b$):

Type 1: moving one node from its current community to a different existing community. There are two kinds of process in this type of rearrangement. In the first kind, the node moved is the last node of its community, so BMCA decreases the number of communities ($k_a$ or $k_b$) of the same type as that community, thereby decreasing the value of $k$ by one. In the second kind, the community the node moved from contains more than one node and the move does not change the number of communities.

Type 2: moving one node to a newly created community. The number of communities ($k_a$ or $k_b$) of the same type as the community the node moved from and the value of $k$ both increase by one.

The two types of steps described above make BMCA meet the requirement of ergodicity.
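When a move of type 1 empties a community, the remaining labels have to stay contiguous within each node type. The sketch below shows one relabeling scheme that achieves this, under the assumption that type-a communities carry the labels 1 to $k_a$ and type-b communities carry the labels $k_a+1$ to $k$. The dictionary-based storage and the function name are illustrative; an efficient implementation would keep per-community member lists instead of scanning all nodes.

```python
def remove_empty_community(labels, r, k_a, k):
    """Drop empty community r and keep labels contiguous within each type.

    labels -- dict mapping node -> community label (1-based)
    r      -- label of the community that has just become empty
    k_a    -- current number of type-a communities (labels 1..k_a)
    k      -- current total number of communities (type-b labels k_a+1..k)
    Returns the updated pair (k_a, k).
    """
    if r in labels.values():
        raise ValueError("community r is not empty")

    def relabel(old, new):
        for node, c in labels.items():
            if c == old:
                labels[node] = new

    if r <= k_a:                     # a type-a community disappeared
        if r != k_a:
            relabel(k_a, r)          # the last type-a community fills the hole
        relabel(k, k_a)              # the last type-b community takes label k_a
        return k_a - 1, k - 1
    if r != k:                       # a type-b community disappeared
        relabel(k, r)                # the last type-b community fills the hole
    return k_a, k - 1
```

For example, with three type-a communities, two type-b communities, and community 2 empty, the call relabels community 3 as 2 and community 5 as 3, leaving type-a labels 1 to 2 and type-b labels 3 to 4.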

Detailed balance requires that the rate goes from a state to another state and the opposite meetwhere we allow for equation (10). From (20), we have

We consider as , where the previous part of the product represents the probability of proposing a move and the latter represents the probability chosen to satisfy the detailed balance condition for accepting the move. Then,

BMCA is described as follows:

Input: the bipartite adjacency matrix B and the node-type vector.

Initial community partition: generated using the process described in Section 2.2.2.

Monte Carlo sampling:

(1) (a) In each step of BMCA, we carry out a rearrangement of type 1 with a given probability. If $k = 2$ (i.e., $k_a = 1$ and $k_b = 1$), we do nothing. Otherwise, when $k > 2$, we first randomly select a community label $r$ in the range $1, \ldots, k$. If the number of communities of the same type as $r$ is more than one, then we randomly select another community of that type, labeled $s$; otherwise, we turn to the communities of the other type, from which we randomly reselect a pair of communities, labeled $r$ and $s$, respectively. Then, we randomly select one node from community $r$ and move it to community $s$. The total number of communities remains constant.

(b) If, in this process, community $r$ becomes empty because its last node is removed, the number of communities decreases by one. In practice, for a type-a community with label $r \le k_a$, we can efficiently change community $k_a$ to have label $r$ and then change community $k$ to have label $k_a$; in particular, if $r = k_a$, we only perform the latter relabeling. In such a process, the number of type-a communities decreases by 1 and the number of type-b communities remains constant. For a type-b community with label $r > k_a$, we can efficiently change community $k$ to have label $r$; in particular, if $r = k$, no relabeling is necessary. In such a process, the number of type-b communities decreases by 1 and the number of type-a communities remains constant. The total number of communities decreases by 1.

(2) Otherwise, with the complementary probability, we carry out a rearrangement of type 2. We randomly select a community label $r$ in the range $1, \ldots, k$. If there is only one node in community $r$, we do nothing. Otherwise, if $r \le k_a$, we change community label $k_a + 1$ to $k + 1$ and create a new empty community $k_a + 1$. Then, we randomly select a node from community $r$ and move it to the newly created community $k_a + 1$. During this process, the number of type-a communities increases by 1 and the number of type-b communities remains constant. If $r > k_a$, we simply create a new empty community $k + 1$ and no relabeling is necessary. Then, we randomly select a node from community $r$ and move it to the newly created community $k + 1$; during this process, the number of type-b communities increases by 1 and the number of type-a communities remains constant. The total number of communities increases by 1.

(3) We accept the rearrangement proposed above with the acceptance probability of equation (25) [4].

(4) Repeat steps 1–3.

Output: the posterior probabilities $P(k_a \mid A)$ and $P(k_b \mid A)$.

Whichever type of rearrangement is performed, we always have the following equation (see Table 1):

Taking into consideration equations (22) and (24), the detailed balance condition can be written as

Therefore, our algorithm satisfies the detailed balance with acceptance probability of equation (25) and will sample correctly from the distribution .
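The accept/reject step of a sampler of this kind is usually computed in log space to avoid overflow when the posterior ratio is extreme. The following is a generic Metropolis-Hastings sketch, not the specific acceptance formula of this paper: the ratio of proposal probabilities is passed in by the caller as `log_proposal_ratio`, and all names are illustrative assumptions.

```python
import math
import random

def acceptance_probability(log_post_old, log_post_new, log_proposal_ratio=0.0):
    """Return min(1, [pi(new->old) / pi(old->new)] * P(new|A) / P(old|A)).

    log_proposal_ratio is log[pi(new->old) / pi(old->new)]; it is 0 for
    symmetric proposals. The unknown normalization P(A) cancels in the ratio.
    """
    log_alpha = log_proposal_ratio + log_post_new - log_post_old
    return 1.0 if log_alpha >= 0.0 else math.exp(log_alpha)

def accept_move(log_post_old, log_post_new, log_proposal_ratio, rng):
    """Draw the accept/reject decision for one proposed rearrangement."""
    return rng.random() < acceptance_probability(
        log_post_old, log_post_new, log_proposal_ratio)
```

With symmetric proposals and two states of posterior probability 0.2 and 0.8, the forward and backward acceptance probabilities are 1 and 0.25, so the probability flows 0.2 × 1 and 0.8 × 0.25 are equal, which is the detailed balance condition.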

3.2. Output of BMCA

In our implementation, a given number of steps per node is performed in a Monte Carlo run on a bipartite network, and then we write approximate posterior probabilities:

For type-a, we can get approximatelyand for type-b,

The most likely number of type-a communities in a bipartite network is the value of $k_a$ with the largest posterior probability $P(k_a \mid A)$, and the number of type-b communities can be obtained in the same way.

In order to avoid any bias in the results and improve the correctness of BMCA, instead of just using , we performed a given number of Monte Carlo runs for a bipartite network. Thus, we obtained the average value of each as the final posterior probabilities:for type-a andfor type-b.
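Estimating the posterior from sampler output amounts to building a histogram of the sampled community counts in each run and averaging the histograms over runs. The sketch below assumes that each run yields a list of sampled values of $k_a$ (or $k_b$), one per recorded step; the function names are illustrative.

```python
from collections import Counter

def posterior_from_samples(samples):
    """Approximate the posterior of k_a (or k_b) by sample frequencies in one run."""
    counts = Counter(samples)
    total = len(samples)
    return {k: c / total for k, c in counts.items()}

def average_posteriors(runs):
    """Average the per-run posteriors; a value missing from a run counts as 0."""
    keys = set().union(*runs)
    return {k: sum(run.get(k, 0.0) for run in runs) / len(runs) for k in keys}

def most_likely(posterior):
    """Return the number of communities with the largest posterior probability."""
    return max(posterior, key=posterior.get)
```

For two runs with samples `[4, 4, 3, 4]` and `[4, 5, 4, 4]`, the averaged posterior assigns probability 0.75 to four communities and 0.125 each to three and five, so `most_likely` returns 4.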

We take to calculate the acceptance probability of each node move, so the time complexity of our algorithm is .

4. Example Application

Here, we demonstrate our algorithm on synthetic bipartite networks generated by the bipartite stochastic block model [6] and find that it works well in most cases.
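A synthetic bipartite network can be drawn directly from the generative model by sampling a Poisson edge count for every pair of nodes in communities of different types. The sketch below is not the authors' generator (their code is referenced in [27]); it assumes Poisson edge counts with mean $\theta_i \theta_j \omega_{g_i g_j}$, uses 0-based labels, and loops over all node pairs, so it is only suitable for small illustrative networks. All names are illustrative.

```python
import math
import random

def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (small means only)."""
    if mean <= 0.0:
        return 0
    limit = math.exp(-mean)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def sample_bipartite_dcsbm(theta, g, community_type, node_type, omega, seed=0):
    """Return a dict {(i, j): A_ij} with i < j and A_ij >= 1."""
    rng = random.Random(seed)
    for i, r in enumerate(g):
        if node_type[i] != community_type[r]:
            raise ValueError("communities must be pure type")
    edges = {}
    for i in range(len(g)):
        for j in range(i + 1, len(g)):
            r, s = g[i], g[j]
            if community_type[r] == community_type[s]:
                continue                     # same-type communities never connect
            a = poisson(rng, theta[i] * theta[j] * omega[r][s])
            if a:
                edges[(i, j)] = a
    return edges
```

Because edges are only sampled between communities of different types, and every community is required to be pure type, each sampled edge necessarily joins a type-a node to a type-b node.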

There is often some noise in empirically observed networks because of both errors in the measurements and missing data [26]. Therefore, we employ a mixing model to generate noisy synthetic networks for testing the robustness of our algorithm. In this model, we specify , and the expected value of an edge is given bywhere the parameters and are used to generate a pure planted community structure and no community structure, respectively; and the mixing parameter is used to control various levels of uniformly random noise. Following Larremore et al. [6], we let and , where . We employ this model to create synthetic networks of an easy case with a homogeneous degree distribution and a difficult case with a heterogeneous degree distribution.
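The effect of the mixing parameter can be illustrated by interpolating between the two block matrices. The sketch below assumes the convex-combination form used by Larremore et al. [6], in which the planted matrix is weighted by the mixing parameter and the random matrix by its complement; the function and argument names are illustrative.

```python
def mix_block_matrices(planted, random_part, lam):
    """Return lam * planted + (1 - lam) * random_part, entry by entry.

    lam = 1 gives the pure planted structure (no noise);
    lam = 0 gives a network with no community structure.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("the mixing parameter must lie in [0, 1]")
    return [[lam * p + (1.0 - lam) * r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(planted, random_part)]
```

Increasing the mixing parameter therefore moves the expected block structure toward the planted one, which corresponds to decreasing the level of noise.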

We performed Monte Carlo runs for each network of steps per node and found that BMCA can determine the correct number of communities for synthetic networks and outperform the projection method in every case.

4.1. An Easy Case

In the easy case, we use the model above to create the synthetic networks including four type-a communities and four type-b communities, i.e., , and each node has the same degree (i.e., the network with planted community structure has a homogeneous degree distribution). We let the number of each type of node equal to 1000, and all communities are equally sized as 250. Then, we let (i.e., the total number of edges is 10000), and let the block structure matrix be defined with (the symmetric entry has the same value). Moreover, the random structure matrix can be defined with . Finally, with these specifications, we create networks of the easy case to test our algorithm (the code to create these networks can be downloaded from [27]).

As the mixing parameter increases, i.e., the level of noise is decreased, BMCA begins to estimate the correct number of communities for the network in an easy case when ; in addition, the fraction of correct communities number of the network calculated by BMCA increases as a whole (blue line in Figure 1). Then, we use our method to derive the synthetic mixing networks generated with and , indicated by red circles in Figure 1(a) and Figure 1(b) and showing the posterior probabilities of the number of communities in the network in Figure 2(a) and Figure 2(b). As shown in Figures 1 and 2, we are able to estimate the correct number of different types of communities when with (); then, when , the proposal probability of the correct number of communities and is equal to 1. However, the projection method gives poorer results (see the red lines in Figure 1), and it begins to estimate the correct number of different types of communities at with () and when (). When , () is always bigger than (), as shown in Figure 3.

4.2. A Difficult Case

In the difficult case, the synthetic networks are created with two type-a communities and three type-b communities , and the degree of each node is different (i.e., the network with planted community structure has a heterogeneous degree distribution). The communities are set with different sizes, and we divide 700 type-a nodes evenly into 2 communities {350, 350} and 300 type-b nodes into 3 communities {100, 150, 150}. Let and ; i.e., the total number of edges is 8000. Then, the block structure matrix can be defined with and (the symmetric entry has the same value), and the random network matrix can be defined with and . The symmetric entry has the same value. Finally, with the specification above, we create networks of the difficult case to test our algorithm (the code to create these networks can be downloaded from [27]).

As shown by the blue line in Figure 3, when the level of noise is decreased, BMCA begins to estimate the correct number of communities for the network in a difficult case when , and the posterior probabilities and increase sharply after . Especially, when , the proposal probability ; i.e., we are always able to determine the correct number of communities. We used our method to calculate the probability of the number of communities in the synthetic mixing networks generated with and , indicated as red circles in Figure 3, and show the result in Figure 4. As seen from Figures 3 and 4, our method can estimate the correct number of communities and in the synthetic network when for the difficult case. However, the projection method fails to estimate the correct number of communities even when and there is no noise, as shown by the red line in Figure 3.

4.3. Further Testing

We tested our method on synthetic networks of two different sizes of community as the number of network communities increases. The networks were generated using the bipartite stochastic block model with , and the other parameters are set as listed in Table 2, where is heterogeneous as real-world networks and the mean node degree of the network for the figures is 10. The number of communities or estimated using BMCA was correct until the actual number of communities increased to about 7, which is about 14 for k. The results are shown in Figure 5.

The results show a tendency to underestimate the number of communities for higher actual numbers of communities, especially when the community size changes from 250 (Figures 5(a) and 5(b)) to 500 (Figures 5(c) and 5(d)) and the size of the networks increases correspondingly. However, this underestimation appears when BMCA is run from a random initial partition of nodes into communities. When BMCA is started on the same network with a community partition corresponding exactly to the planted community structure (yellow triangles), we always find the correct number of communities.

Thus, the underestimation of $k_a$ (or $k_b$) occurs not because the correct community partition fails to maximize the posterior probability but rather because BMCA has not run for long enough to find the maximum. The method is theoretically sound, but because the number of possible community partitions, $k^N$, increases very rapidly with k, the numerical calculation becomes too demanding [4]. A more efficient Monte Carlo algorithm could possibly solve this problem; in the meantime, the number of communities given by BMCA still offers useful information as a lower bound on the number of communities in the network.

5. Conclusions

In this paper, we have introduced a new projection-free Bayesian inference method for determining the number of pure-type communities in a bipartite network. First, we presented a first-principles derivation of a practical method for estimating the number of pure-type communities in bipartite networks, using the degree-corrected bipartite stochastic block model, which can handle networks with broad degree distributions. Second, we proposed a prior probability distribution over the partitions of a bipartite network, with the type parameter T ensuring that each community is pure type. Third, we designed a Monte Carlo algorithm that incorporates the proposed method and prior probability distribution. We have illustrated the performance of the method with applications to a range of synthetic bipartite networks, including an easy case with a homogeneous degree distribution and a difficult case with a heterogeneous degree distribution. The results show that the proposed algorithm can determine the correct number of communities and performs better than the projection method, especially in networks with heterogeneous degree distributions.

However, our method underestimates the number of communities when the number of communities becomes large. The reason is that the number of possible community partitions increases very rapidly with the number of communities, so our algorithm does not run for long enough to find the maximum of the posterior probability. Thus, our future work will focus on (i) finding a method that can sample the posterior distribution over community partitions more efficiently, in order to correctly estimate a large number of communities in the network, and (ii) extending the applications to real-world data sets.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request. Our code and test data are available at http://my.shu.edu.cn/Web_LWZZ.aspx?TID=2887

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported by the Shanghai Municipal Education Commission of China (no. 2019PD1-2-37) and the Shanghai Youth Top-Notch Talent Development Program.