Scientific Programming

Volume 2019, Article ID 4310561, 12 pages

https://doi.org/10.1155/2019/4310561

## New Community Estimation Method in Bipartite Networks Based on Quality of Filtering Coefficient

¹School of Management, Shanghai University, Shanghai 200444, China

²College of Economics and Management, China Jiliang University, Hangzhou, Zhejiang 310018, China

Correspondence should be addressed to Hu-Chen Liu; huchenliu@foxmail.com

Received 20 January 2019; Revised 13 April 2019; Accepted 24 April 2019; Published 21 May 2019

Academic Editor: Emiliano Tramontana

Copyright © 2019 Li Xiong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Community detection is an important task in network analysis, in which we aim to find a network partitioning that groups together vertices with similar community-level connectivity patterns. Bipartite networks are a common type of network in which there are two types of vertices, and only vertices of different types can be connected. While there is a range of powerful and flexible methods for dividing a bipartite network into a specified number of communities, how to determine the number of communities to use remains an open question, and the problem of estimating the number of pure-type communities in a bipartite network has not yet been solved. In this paper, we propose a method named “biCNEQ” (bipartite network communities number estimation based on quality of filtering coefficient), which ensures that communities are all pure type, for estimating the number of communities in a bipartite network. This paper makes the following contributions: (1) we show how a unipartite weighted network, which we call a similarity network, can be projected from a bipartite network using a measure of correlation; (2) we reveal the relation between the similarity correlation of vertices and the community edges in a unipartite network; (3) we design a measure of filtering quality named QFC (quality of filtering coefficient) to filter the similarity network and construct a binary network, which we call an approximation network; and (4) the number of communities in each type of unipartite network is estimated using Riolo’s method with the approximation network as input. Finally, the proposed biCNEQ is demonstrated on both synthetic bipartite networks and a real-world network, and the results show that it can determine the correct number of communities and performs better than two classical one-mode projection methods.

#### 1. Introduction

A bipartite network is a network whose vertices can be divided into two types, *a* and *b*, where every edge connects a vertex of type *a* to one of type *b*, and there are no edges connecting vertices of the same type. There are many examples of bipartite networks, such as those described in [1–3]. As in unipartite networks, a common task is to find groups or communities of vertices that connect to the rest of the network in similar ways. Finding this underlying group structure is significant because it can, for example, divide a heterogeneous network into homogeneous subgraphs for subsequent analysis or modeling [4].

Beginning with Newman’s study [5], community detection has attracted considerable attention from researchers [6], who aim to identify good ways to divide a network into communities. A range of powerful and flexible methods for dividing a bipartite network into a specified number of communities has been proposed in recent years [4, 7, 8]. However, most of them share one key shortcoming: they require the number of communities in a network to be known in advance. In the real world, we usually do not know this number a priori and thus need to estimate it from the data. Recently, several methods have been proposed for making such estimates for unipartite networks [9–12] and bipartite networks [13–16]. Barber [13] introduced bipartite modularity, a variant of the modularity proposed by Newman and Girvan [17]. A dual-projection approach proposed by Han et al. [14] aims to maximize Newman’s one-mode modularity. The authors of [15, 16] maximized Barber’s bipartite modularity for bipartite community detection. However, maximizing either of the modularities noted above has been proved to be an NP-hard problem [6, 18]. Moreover, the bipartite network communities generated in the previous studies are of mixed type, and so far, no study has addressed inferring the number of pure-type communities in a bipartite network.

In this paper, we propose a method named “biCNEQ” (bipartite network communities number estimation based on quality of filtering coefficient), which ensures that communities are all pure type. The method builds on two existing techniques: (1) a percolation idea-based (PIB) method, proposed by Lambiotte and Ausloos [19], which projects a bipartite network to unipartite correlation networks and reveals the emergence of social communities and music genres by filtering correlation matrices, and (2) a first-principles method given by Riolo et al. [11] for inferring the number of communities in a unipartite network. Our main contribution is the quality of filtering coefficient (QFC), which is designed to select a threshold for filtering the correlation matrix when constructing a binary unipartite network. The resulting network can roughly match the structural features of the original one, namely, the correlation and degree of its vertices, which cannot be done using PIB. Finally, we use the method of Riolo et al. [11] to estimate the number of communities in each type of unipartite network. In addition, the proposed biCNEQ is demonstrated on both synthetic bipartite networks and a real-world network, and the results show that our method performs better than two classical one-mode projection methods.

#### 2. Methods

Tests were performed on both synthetic bipartite networks and a real-world bipartite network with a known community structure.

##### 2.1. Synthetic Networks

We construct a synthetic network based on the degree-corrected bipartite stochastic block model (biSBM) formulated by Larremore et al. [4]. Given a bipartite network with adjacency matrix $A$ (where $N_a$ and $N_b$ are the numbers of vertices of type *a* and type *b*, respectively), we divide the vertices of type *a* into $K_a$ groups and the type-*b* vertices into $K_b$ groups and express the matrix of group interrelationships as a $K \times K$ matrix $\omega$, where $K = K_a + K_b$. Let vertex *i* of type $t_i$ belong to group $g_i$ and let $T_r$ be the type of group *r*, imposing the constraint $t_i = T_{g_i}$, which indicates that vertex types and group types must match and ensures that groups will be pure type. Let the number of edges between vertices *i* and *j* follow a Poisson distribution with mean $\theta_i \theta_j \omega_{g_i g_j}$ and choose the normalization $\sum_i \theta_i \delta_{g_i, r} = 1$, where $\theta_i$ controls the expected degree of vertex *i*, $\omega$ is a symmetric matrix of parameters that control the number of edges between groups *r* and *s*, and $\delta$ is the Kronecker delta. The probability of observing a network *G* with adjacency matrix *A* can be written as

$$P(G \mid \theta, \omega, g) = C \prod_{i} \theta_i^{k_i} \prod_{r,s} \omega_{rs}^{m_{rs}/2} \exp\left(-\tfrac{1}{2}\omega_{rs}\right), \tag{1}$$

where $k_i$ is the observed degree of vertex *i*, $m_{rs}$ is the number of edges between groups *r* and *s*, and $C$ is a constant that does not depend on $\theta$ or $\omega$. After taking partial derivatives with respect to $\omega_{rs}$ on the logarithm of equation (1), we can get the maximum likelihood parameter as follows:

$$\hat{\omega}_{rs} = m_{rs}. \tag{2}$$

The maximum likelihood $\hat{\theta}_i$ can be found via the constrained maximization of the logarithm of equation (1) subject to $\sum_i \theta_i \delta_{g_i, r} = 1$ using Lagrange multipliers, i.e.,

$$\hat{\theta}_i = \frac{k_i}{\kappa_{g_i}}, \tag{3}$$

where $\kappa_r = \sum_i k_i \delta_{g_i, r}$ is the sum of the degrees in group *r*.
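The estimates above can be illustrated with a short Python sketch. It assumes the standard degree-corrected block model estimates, in which the block parameter for two groups equals the number of edges between them and a vertex's degree parameter equals its degree divided by the total degree of its group. The function name `dc_sbm_mle`, the vertex labels, and the toy edge list are hypothetical and are not taken from the paper or from the code of Larremore et al.

```python
from collections import defaultdict


def dc_sbm_mle(edges, group):
    """Return (theta, omega) estimates for a degree-corrected block model.

    edges: list of (i, j) pairs, one per edge of an undirected bipartite graph.
    group: dict mapping every vertex to its (pure-type) group label.
    """
    degree = defaultdict(int)  # k_i: observed degree of vertex i
    m = defaultdict(int)       # m_rs: number of edges between groups r and s
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
        r, s = group[i], group[j]
        m[(r, s)] += 1
        m[(s, r)] += 1         # store both orders so omega stays symmetric

    kappa = defaultdict(int)   # kappa_r: sum of the degrees in group r
    for v, k in degree.items():
        kappa[group[v]] += k

    # Vertices without edges never enter `degree`, so they get no theta here.
    theta = {v: degree[v] / kappa[group[v]] for v in degree}
    omega = dict(m)
    return theta, omega


# Toy bipartite graph (assumed data): a* are type-a vertices, b* are type-b.
toy_group = {"a1": "A1", "a2": "A1", "a3": "A2",
             "b1": "B1", "b2": "B1", "b3": "B2"}
toy_edges = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"),
             ("a3", "b3"), ("a3", "b2")]
theta, omega = dc_sbm_mle(toy_edges, toy_group)
print(theta["a1"], omega[("A1", "B1")])
```

Because of the normalization, the degree parameters within each group sum to 1, which gives a quick sanity check on any implementation.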

Empirically observed networks are often noisy, with missing or spurious edges. Therefore, we examine the ability of biCNEQ to analyze a range of synthetic networks generated by a mixed model, which is a combination of a planted structure and a random network model. The latter is used to create various levels of uniformly random noise. We consider two cases, as in [4], an easy one and a difficult one, to illustrate the performance of biCNEQ under different conditions.

We specify and and create mixed networks using . Then,where the mixed parameter takes values between 0 (all noise) and 1 (all planted structure) and according to equation (2). We let , where is the total number of edges in the network.

###### 2.1.1. An Easy Case

In the easy case, we define the mixed matrix to have an easily identifiable community structure which consists of four equally sized, unambiguous, and nonoverlapping components with each made up of one type-*a* and one type-*b* community. Let *N* = 60 for each type and divide these vertices evenly across the four components, where . The symmetric entry has the same value. We create networks using . Finally, we use the code, downloaded from http://www.danlarremore.com/bipartiteSBM/makeEasyCaseNetworks.m, to generate mixed synthetic networks of an easy case for testing with the specification above and with its degree distribution unchanged.

###### 2.1.2. A Difficult Case

In the difficult case, the mixed matrix we define is given a less easily identifiable community structure by creating partially overlapping communities, , and has a broad degree distribution. We set different sizes for the communities with 70 type-*a* vertices, divided evenly into 2 communities {35, 35}, and 30 type-*b* vertices, divided into 3 communities {10, 15, 5}. Then, we let and ; can be obtained using equation (3). The symmetric entry has the same value. Finally, we use the code, downloaded from http://www.danlarremore.com/bipartiteSBM/makeDifficultCaseNetworks.m, to generate the mixed synthetic network of a difficult case for testing with the specification above and degree distribution unchanged.

##### 2.2. Empirical Networks

The Southern women network collected by Davis et al. [20] contains the observed attendance at 14 social events by 18 Southern women. This network has commonly been used as a benchmark for bipartite network community detection algorithms [4, 13, 21, 22], much as the Zachary “karate club” network is used for benchmarking unipartite community detection algorithms.

##### 2.3. Projection Procedure

###### 2.3.1. Projection

Consider a bipartite network with an $N_a \times N_b$ bipartite adjacency matrix *B*, where $N_a$ and $N_b$ are the numbers of type-*a* and type-*b* vertices, respectively. The entry $B_{ij} = 1$ if there is an edge between type-*a* vertex $a_i$ and type-*b* vertex $b_j$; otherwise, $B_{ij} = 0$.

A common way to represent and study bipartite networks consists of projecting them onto links of one kind of vertex [23]. The standard projection method simplifies the system to a unipartite network. For instance, from a bipartite network of scientists and papers, one can extract a network of scientists only, who are related by coauthorship. However, such a projection loses a lot of information and leads to an oversimplified and less useful representation [6, 19, 24]. Therefore, we refine it in an alternative way below.
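The information loss of the standard projection can be seen in a small Python sketch. It is an illustration only: the function name `one_mode_projection` and the toy matrix are assumptions, and the projection shown is the usual shared-neighbor construction rather than any specific variant used in the cited works.

```python
def one_mode_projection(B):
    """Project an N_a x N_b 0/1 matrix B onto its type-a vertices.

    Returns (W, P): W[i][j] counts the type-b neighbors shared by type-a
    vertices i and j, and P[i][j] is 1 when at least one neighbor is shared.
    Diagonal entries are left at 0.
    """
    n = len(B)
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i][j] = sum(x * y for x, y in zip(B[i], B[j]))
    P = [[1 if w > 0 else 0 for w in row] for row in W]
    return W, P


# Toy data (assumed): rows are type-a vertices, columns are type-b vertices.
B = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
W, P = one_mode_projection(B)
print(W[0][1], W[0][2])  # shared-neighbor counts differ
print(P[0][1], P[0][2])  # the binary projection treats both pairs alike
```

In this toy case the first two rows share three neighbors while the first and third share only one, yet the binary projection links both pairs with an identical edge.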

We define for each type-*a* vertex $a_i$ the vector [19]:

$$\sigma_i = (\sigma_{i1}, \sigma_{i2}, \ldots, \sigma_{iN_b}),$$

where $\sigma_{ik}$ is equal to 1 if there exists an edge between $a_i$ and the type-*b* vertex $b_k$; otherwise, it is 0. Then, we calculate the correlation between vertices $a_i$ and $a_j$ using the cosine similarity [25], which is a symmetric correlation measure. That is,

$$C_{ij} = \frac{\sigma_i \cdot \sigma_j}{|\sigma_i|\,|\sigma_j|},$$

where $\sigma_i \cdot \sigma_j$ denotes the scalar product between $\sigma_i$ and $\sigma_j$. Besides,

$$|\sigma_i| = \sqrt{\sigma_i \cdot \sigma_i} = \sqrt{k_i},$$

where $k_i$ is the degree of vertex $a_i$. This measure of correlation, which corresponds to the cosine of the two vectors in an $N_b$-dimensional space, is equal to 1 when their entries are strictly identical and vanishes when they have no common entries. Specifically, for each pair of type-*a* vertices, $C_{ij} = 0$ when $a_i$ and $a_j$ have no common edges with any type-*b* vertices, and $C_{ij}$ becomes 1 when they have identical edges. We call the $N_a \times N_a$ matrix $C$ a similarity matrix, with its element $C_{ij}$, and the unipartite weighted network with similarity matrix *C* is a similarity network.
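The cosine similarity between type-*a* vertices described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `similarity_matrix`, the toy matrix, and the choice to leave the diagonal (and any vertex without edges) at zero are assumptions made here.

```python
import math


def similarity_matrix(B):
    """Cosine similarity between the 0/1 rows of a bipartite adjacency matrix.

    B[i][k] is 1 when type-a vertex i is linked to type-b vertex k.
    Diagonal entries and rows without any edge are assigned 0 (a convention
    chosen for this sketch).
    """
    n = len(B)
    # For a 0/1 row, the Euclidean norm is the square root of the degree.
    norms = [math.sqrt(sum(row)) for row in B]
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and norms[i] > 0 and norms[j] > 0:
                dot = sum(x * y for x, y in zip(B[i], B[j]))
                C[i][j] = dot / (norms[i] * norms[j])
    return C


# Toy data (assumed): four type-a vertices, four type-b vertices.
B = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],  # identical to row 0, so similarity is 1 up to rounding
    [0, 0, 1, 1],
    [0, 0, 0, 1],  # shares no neighbor with row 0, so similarity is 0
]
C = similarity_matrix(B)
for row in C:
    print([round(value, 3) for value in row])
```

The resulting matrix is symmetric, and its off-diagonal entries are the edge weights of the similarity network.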

In [19], the authors revealed the emergence of social communities and music genres by filtering similarity matrices. However, the threshold of the filtering coefficient was selected arbitrarily, and the community structures they found were not unique. To avoid this issue, we first examine the relation between the similarity correlation and the community membership of vertices. To this end, we analyze the relations among edge existence, similarity correlation, and community membership for the vertices of a unipartite network.

###### 2.3.2. Relation between Similarity Correlation and Community’s Edges

According to the definition of a community [6, 26, 27], there are many edges within communities but few edges between communities. Modularity [17, 28] is the most popular function to measure the division quality of a network. Given a particular network with an adjacency matrix $A$, its modularity is defined as follows:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(g_i, g_j),$$

where $k_i$ is the degree of vertex *i*, *m* is the number of edges, and $g_i$ denotes the community to which vertex *i* is assigned. The function $\delta(g_i, g_j)$ yields 1 if vertices *i* and *j* are in the same community ($g_i = g_j$) and is 0 otherwise. Therefore, each pair of vertices with an edge between them is more likely to be in the same community than in different communities. This is because the edge increases the value of *Q* if the two vertices are in the same community but makes no contribution to *Q* otherwise.
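As a worked illustration of the modularity function, the following Python sketch evaluates the standard Newman and Girvan modularity for a given partition. The function name `modularity` and the toy graph (two triangles joined by a single edge) are assumptions introduced for this example.

```python
def modularity(A, community):
    """Newman-Girvan modularity of a partition of an undirected graph.

    A: symmetric 0/1 adjacency matrix given as a list of lists.
    community: list in which community[i] is the label of vertex i.
    """
    n = len(A)
    degree = [sum(row) for row in A]
    two_m = sum(degree)  # twice the number of edges
    q = 0.0
    for i in range(n):
        for j in range(n):
            # Only pairs placed in the same community contribute to Q.
            if community[i] == community[j]:
                q += A[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


# Toy graph (assumed): triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(modularity(A, [0, 0, 0, 1, 1, 1]))  # the two triangles as communities
print(modularity(A, [0, 0, 1, 1, 1, 1]))  # vertex 2 moved away from its triangle
```

Moving vertex 2 out of its triangle turns two of its edges into between-community edges, and the modularity of the partition drops accordingly.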

Now, we investigate whether a pair of vertices with a higher similarity correlation is more likely to be in the same community than in different communities. We let the *i*th row of *A* be the *N*-vector of the *i*th vertex and use the cosine of the two vectors $A_i$ and $A_j$ in the *N*-dimensional space to quantify the correlation between vertices *i* and *j*. A pair of vertices with a higher correlation shares a larger proportion of common neighbors. Moreover, we have argued in the previous paragraph that the two ends of an edge are more likely to be in the same community than in different communities. Therefore, a pair of vertices with a higher correlation is more likely to be in the same community than in different communities.

We test our inference on two widely used unipartite networks, the “karate club” network of Zachary [29] and the network of political blogs assembled by Adamic and Glance [30]. Both networks have a known community structure. For the *i*th vertex, we compute two quantities: its average similarity correlation with the vertices in the same community and its average similarity correlation with the vertices of different communities. Here, *n* is the total number of vertices and $n_{g_i}$ denotes the number of vertices in the community to which vertex *i* belongs, so the second average runs over the remaining $n - n_{g_i}$ vertices. In Figure 1, we plot, for each vertex of the two networks, its average correlation with the vertices in the same community against its average correlation with the vertices in different communities. Figures 1(a) and 1(b) show that a vertex’s average similarity correlation with the vertices in its own community is greater than that with the vertices in different communities.
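The comparison of within-community and between-community averages can be reproduced on a toy graph with the Python sketch below. It is an illustration under stated assumptions: the function names, the toy graph (two triangles joined by one edge), and the decision to exclude a vertex from its own within-community average are choices made here, and the sketch does not use the karate club or political blogs data.

```python
import math


def cosine_rows(A):
    """Cosine similarity between the rows of a 0/1 adjacency matrix."""
    n = len(A)
    norms = [math.sqrt(sum(row)) for row in A]
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and norms[i] > 0 and norms[j] > 0:
                dot = sum(x * y for x, y in zip(A[i], A[j]))
                C[i][j] = dot / (norms[i] * norms[j])
    return C


def average_similarity(C, community):
    """Per-vertex average similarity inside and outside its community.

    The vertex itself is excluded from the inside average (an assumption
    of this sketch). An empty set of partners yields an average of 0.
    """
    n = len(C)
    inside, outside = [], []
    for i in range(n):
        same = [C[i][j] for j in range(n)
                if j != i and community[j] == community[i]]
        diff = [C[i][j] for j in range(n) if community[j] != community[i]]
        inside.append(sum(same) / len(same) if same else 0.0)
        outside.append(sum(diff) / len(diff) if diff else 0.0)
    return inside, outside


# Toy graph (assumed): triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
inside, outside = average_similarity(cosine_rows(A), [0, 0, 0, 1, 1, 1])
for v, (s_in, s_out) in enumerate(zip(inside, outside)):
    print(v, round(s_in, 3), round(s_out, 3))
```

On this toy graph every vertex has a larger within-community average than between-community average, which mirrors the pattern the text reports for Figure 1.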