Abstract

Bipartite graphs are widely used to model the complex relationships between two types of entities. Community detection (CD) is a fundamental tool for graph analysis, which aims to find all or the top-k densely connected subgraphs. However, existing studies of the CD problem usually focus on structural cohesiveness, such as the (α, β)-core, but ignore the attributes attached to the relationships, which can be modeled as attribute bipartite graphs. Moreover, the returned results usually suffer from rationality issues. To overcome these limitations, in this paper, we introduce a novel metric, named rational score, which takes both preference consistency and community size into consideration to evaluate a community. Based on the proposed rational score and the widely used (α, β)-core model, we propose and investigate rational (α, β)-core detection in attribute bipartite graphs (RCD-ABG), which aims to retrieve the connected (α, β)-core with the largest rational score. We prove that the problem is NP-hard and that the objective function is nonmonotonic and non-submodular. To tackle the RCD-ABG problem, a basic greedy framework is first proposed. To improve the quality of the returned results, two optimized strategies are further developed. Finally, extensive experiments are conducted on 6 real-world bipartite networks to evaluate the performance of the proposed model and techniques. As shown in the experiments, the returned community is significantly better than the result returned by the traditional (α, β)-core model.

1. Introduction

A bipartite graph is composed of two disjoint vertex sets, and edges only connect vertices from different sets. Due to its wide range of applications, such as fraudster detection [1] and collaboration group maintenance [2], many fundamental problems have been investigated to analyze bipartite graphs. Among these problems, community detection (CD) aims to find all or the top-k communities by leveraging different models like the (α, β)-core [3], bitruss [4], and so on. Due to its unique features, the (α, β)-core model is widely adopted in different domains. Given a bipartite graph, the (α, β)-core is the maximal subgraph where the degree of each vertex in the upper layer is at least α and the degree of each vertex in the lower layer is at least β. Nonetheless, previous models mainly focus on the structural cohesiveness of the graphs but neglect the attribute properties within a community.

In real applications, the relationships between different entities often have certain characteristics, which can be modeled as attribute bipartite graphs. For example, in the user-movie network of Figure 1, the upper layer denotes a set of users and the lower layer is the set of movies. Each edge is associated with a number denoting the score assigned by a user to a movie. A discussion group on the platform will have a more harmonious atmosphere if its users have highly consistent preferences (e.g., giving the same score or tag to the same movie). Besides, a small discussion group is more conducive to frequent communication among users. However, existing research cannot capture these properties. Motivated by this, in this paper, we introduce a novel metric, named rational score, which takes both preference consistency and community size into consideration to evaluate a community. Furthermore, we formally define the problem of rational community detection over attribute bipartite graphs (RCD-ABG), which attempts to find the connected (α, β)-core with the largest rational score. The following is a motivating example.

Example 1. Reconsider the user-movie network in Figure 1, where the number on each edge denotes the corresponding rating for the movie. Note that the scoring mechanism adopts a five-point system, so the score varies from 1 to 5 in the network. Suppose α = 2 and β = 2 here. Based on the definition, the subgraph induced by vertex set is an (α, β)-core, where the degree of each vertex is at least 2. However, in the (α, β)-core, many users have distinct scoring schemes for the same movie. For example, users , , and gave three different scores to the movie . Moreover, the community size is too large to facilitate communication between users. For instance, users and have never watched the same movie. Given α = 2 and β = 2, the vertices in the orange rectangle form our identified rational (α, β)-core community. Note that, due to the complex equation involved, the detailed definition of the rational (α, β)-core community is deferred to the preliminaries section. As we can observe, in this community, most users share the same movie taste and the number of people in the group is more reasonable.

1.1. Applications

The RCD-ABG problem can find many real-world applications. We list some examples as follows.
(i) Discussion Group Mining. In some real-world bipartite graphs such as BookCrossing, edges denote rating relationships between users and books. There are many discussion groups on these platforms. Users are more likely to stay active in a discussion group if the users inside share the same taste. Besides, users prefer to discuss different topics in a group of appropriate size. This is because too many users can make them uncomfortable and too few will make the discussion difficult to carry on. Hence, by retrieving the rational group, the platform can provide group recommendations more precisely, which is helpful for a better user experience.
(ii) Personalized Product Recommendation. In customer-movie bipartite networks, customers rate the movies based on their personal preference and movie performance. By retrieving the rational (α, β)-core, personalized movie recommendations can be provided to customers in the rational community. For instance, in the community found in the orange rectangle in Figure 1, the platform can recommend movie for user . This is because is given the common score from other customers (i.e., and ). Similarly, movie can be recommended for user .

1.2. Challenges

To the best of our knowledge, we are the first to investigate the rational (α, β)-core detection problem in attribute bipartite graphs. We prove that the problem is NP-hard, and we adopt a greedy framework that removes the best vertex iteratively. However, removing a vertex from the graph may make many other vertices drop from the result, which limits the effectiveness of the algorithm. Hence, it is necessary to develop optimized techniques to address these challenges.

1.3. Our Solution

Due to the NP-hardness of the problem, we first propose a basic greedy framework. In general, we remove the vertex with the smallest marginal gain at each iteration and calculate the remaining (α, β)-core together with its rational score. We repeat this process until there is no (α, β)-core left and return the (α, β)-core with the largest rational score as the result. To address the discussed drawbacks of the basic greedy framework, we further develop two improved strategies, namely, 2-hop neighbors-based optimization and followers-based optimization. Specifically, in 2-hop neighbors-based optimization, we approximate the marginal score by considering the 2-hop neighbors of the removed vertex in the same layer. In followers-based optimization, we consider the followers of the removed vertex and modify the marginal rational score accordingly.

1.4. Contributions

The contributions of this paper are summarized as follows.
(i) To better capture the properties within a bipartite graph community, we conduct the first research to propose and investigate the rational community detection problem over attribute bipartite graphs by leveraging the novel rational score metric.
(ii) Theoretically, we prove that the problem is NP-hard and that the rational score function is nonmonotonic and non-submodular.
(iii) A basic greedy framework is first presented. To further improve the quality of the returned results, two optimized strategies are proposed, namely, 2-hop neighbors-based optimization and followers-based optimization.
(iv) Experiments over 6 real-world bipartite graphs are conducted to show the superiority of the proposed techniques. Compared with the traditional (α, β)-core model, our model returns communities with much higher rational scores.

1.5. Roadmap

We organize the rest of this paper as follows. We first review the related work. Then, we introduce the problem investigated and the corresponding problem properties. Next, we will present the basic greedy framework and two optimized strategies. Finally, we report the performance of our algorithms over real datasets and conclude the paper.

2. Related Work

In this paper, we make the first attempt to propose and investigate the rational (α, β)-core problem. Thus, we present the related work from the following two aspects.

Cohesive Subgraphs Mining. In different domains, graphs are widely used to model the complex relationships among different entities. As a key problem in graph analysis, community search has been widely studied in the literature and different models have been proposed to measure the cohesiveness of a community, such as k-core, k-truss, and clique. In many real-world applications, both graph structures and attribute information are considered. For attribute graph processing, the community search problem uses both link relationships and attributes because the attributes usually make communities more meaningful and easier to interpret [5]. In [5], Fang et al. proposed the attributed community query (ACQ) problem, which returns an attributed community (AC) for an attributed graph. The returned community should satisfy both a structure cohesiveness constraint and a keyword cohesiveness constraint. In [6], Huang and Lakshmanan considered communities based on topics of interest and proposed the attributed truss communities (ATC) search problem. They aimed to find connected k-truss subgraphs that contain the query vertices and have the largest attribute relevance score. In [7], Zhang et al. proposed a keyword-centric community search (KCCS) problem over attribute graphs. They tried to find a community where the degree of each vertex is at least k and the distance between the vertices and all query keywords is minimized. Influential community search has also been studied in [8], where each vertex is associated with a number denoting its influence. Its goal was to find communities with the largest influence.

Bipartite Graph Analysis. Recently, the bipartite graph has attracted much attention due to its wide applications such as online group recommendation and fraudster detection [2]. In [9], Borgatti and Everett were the first to investigate cohesive communities in bipartite graphs for network analysis. To analyze the properties of bipartite networks, numerous models have been investigated, such as (α, β)-core [10], bitruss [11], and biclique [12]. In [13], the significant (α, β)-community search problem was proposed and studied on weighted bipartite graphs, where each edge is associated with a weight. The authors aimed to find the significant (α, β)-community that contains the query vertex and maximizes the minimum edge weight within the community. In [4], Wang et al. studied the bitruss model in bipartite graphs. Given a bipartite graph, the bitruss is the maximal subgraph where each edge is contained in at least k butterflies. In the literature, considering fairness constraints, the fair clustering problems [14–16] were investigated to find communities on bipartite graphs. However, none of the previous studies take the rationality of communities into consideration.

3. Preliminaries

In this section, we first introduce some necessary concepts and present the formal definition of the rational community detection problem over attribute bipartite graphs. Table 1 summarizes the notations that are frequently used in this paper.

3.1. Problem Definition

We consider an attribute bipartite graph G = (U, L, E, A) as an undirected graph without multiple edges and self-loops. U and L are the two disjoint and independent vertex sets in G; that is, U ∩ L = ∅. E is the edge set and each edge connects one vertex u ∈ U and one vertex v ∈ L; that is, E ⊆ U × L. A is the attribute set. Each edge e ∈ E is associated with an attribute (e.g., number/tag) from A. We use n and m to denote the number of vertices and edges in G, respectively. Given an attribute bipartite graph G, a subgraph S is an induced subgraph of G if its upper layer is a subset of U, its lower layer is a subset of L, and its edge set consists of all edges of E whose endpoints both belong to S. For a vertex u in S, the set of u’s neighbors is denoted by N(u, S) (i.e., the adjacent vertices of u in S). deg(u, S) denotes the degree of u in S (i.e., the number of u’s neighbor vertices).

Definition 1. ((α, β)-core). Given a bipartite graph G, a subgraph S is the (α, β)-core of G if it satisfies the following: (1) degree constraint (i.e., the degree in S of each upper-layer vertex is at least α and the degree in S of each lower-layer vertex is at least β); (2) S is maximal; that is, no proper supergraph of S is an (α, β)-core.
To compute the (α, β)-core, we iteratively remove the vertices in the two layers that violate the corresponding degree constraint until no such vertex remains in the graph; the details are shown in Algorithm 1. The time complexity is linear in the size of the graph [17]. As discussed before, the people in a rational discussion group are cohesive and have consistent preferences. In the following, we first introduce the consensus score of a vertex and of a community, respectively. Note that we only consider the consensus score of the vertices in the lower (e.g., movie) layer. The rational (α, β)-core model is then developed based on the rational score, which combines the consensus score and the community size. Finally, we present the formal definition of our problem.

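Algorithm 1
Input: G: a bipartite graph; α, β: degree constraints
Output: the (α, β)-core of G
(1) while there exists an upper-layer vertex u with deg(u, G) < α or a lower-layer vertex v with deg(v, G) < β do
(2)   remove that vertex and its incident edges from G
(3) return G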
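The peeling procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `alpha_beta_core` and the edge-list input format are assumptions made for the example, and vertices are identified by hashable labels.

```python
from collections import defaultdict

def alpha_beta_core(edges, alpha, beta):
    """Return the edge set of the (alpha, beta)-core.

    edges: iterable of (u, v) pairs, u in the upper layer, v in the lower layer.
    (Input format and function name are assumptions for this sketch.)
    """
    upper = defaultdict(set)  # upper vertex -> its lower-layer neighbours
    lower = defaultdict(set)  # lower vertex -> its upper-layer neighbours
    for u, v in edges:
        upper[u].add(v)
        lower[v].add(u)

    # Start from every vertex that already violates its degree constraint.
    stack = [("U", u) for u in upper if len(upper[u]) < alpha]
    stack += [("L", v) for v in lower if len(lower[v]) < beta]
    removed = set(stack)

    while stack:
        side, x = stack.pop()
        own, other = (upper, lower) if side == "U" else (lower, upper)
        other_side = "L" if side == "U" else "U"
        threshold = beta if side == "U" else alpha  # constraint of the neighbours' layer
        for y in own.pop(x):
            other[y].discard(x)
            # A neighbour that now violates its constraint is peeled as well.
            if len(other[y]) < threshold and (other_side, y) not in removed:
                removed.add((other_side, y))
                stack.append((other_side, y))

    return {(u, v) for u, nbrs in upper.items() for v in nbrs}
```

Each vertex is pushed at most once and each edge is touched a constant number of times, which matches the linear-time behaviour stated above.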

Definition 2. (Consensus score). Given an attribute bipartite graph and a subgraph S of it, the consensus score of each lower-layer vertex v in S is determined by the maximum number of its adjacent edges in S that carry the same attribute value. For the subgraph S, its consensus score is defined as the sum of the consensus scores of all lower-layer vertices in S divided by the number of vertices in the lower layer of S.

Example 2. Considering the vertices in the orange line of the bipartite graph in Figure 1, the consensus score of is . The consensus score of community in the orange line is .
To judge a community, we want to consider not only its consensus but also its size. This is because a traditional study group that is not too large makes it easier for its members to discuss and analyze problems. So, we also incorporate the size constraint into our rational score function, in which a parameter controls the trade-off between the consensus score and the community size. Based on this rational score function, we give the definition of the rational community.
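To make the consensus score of Definition 2 concrete, the following Python sketch computes it for a small rating graph. The normalization is an assumption: the per-vertex score is taken here as the largest number of incident edges sharing one attribute divided by the vertex degree, which may differ from the exact formula used in the paper. The function names and the dictionary input format are likewise hypothetical.

```python
from collections import Counter, defaultdict

def vertex_consensus(edge_attr):
    """Per-vertex consensus score for every lower-layer vertex.

    edge_attr: dict mapping (u, v) -> attribute, with v in the lower layer.
    ASSUMPTION: score(v) = (max number of incident edges with the same
    attribute) / degree(v). The normalization is illustrative only.
    """
    attrs_by_lower = defaultdict(list)
    for (_, v), attr in edge_attr.items():
        attrs_by_lower[v].append(attr)
    return {
        v: max(Counter(attrs).values()) / len(attrs)
        for v, attrs in attrs_by_lower.items()
    }

def community_consensus(edge_attr):
    """Sum of per-vertex scores divided by the number of lower-layer vertices."""
    per_vertex = vertex_consensus(edge_attr)
    if not per_vertex:
        return 0.0
    return sum(per_vertex.values()) / len(per_vertex)

# Toy user-movie ratings on a five-point scale (made-up data).
ratings = {
    ("u1", "m1"): 5, ("u2", "m1"): 5, ("u3", "m1"): 3,
    ("u1", "m2"): 4, ("u2", "m2"): 4,
}
scores = vertex_consensus(ratings)      # m1 -> 2/3, m2 -> 1.0
overall = community_consensus(ratings)  # (2/3 + 1) / 2
```

Under this assumed normalization, a movie on which all raters agree scores 1, and the community score is the average over its movies.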

Definition 3. (rational (α, β)-core). Given an attribute bipartite graph and two positive integers α and β, a subgraph S is the rational (α, β)-core if it meets the following three criteria:
(i) Connectivity: S is connected
(ii) Cohesiveness: S is an (α, β)-core
(iii) Rationality: S has the largest rational score among the subgraphs satisfying the above criteria

3.1.1. Problem Statement

Given an attribute bipartite graph and two positive integers α and β, we aim to develop efficient algorithms to find the rational (α, β)-core (i.e., the connected (α, β)-core with the largest rational score).

3.2. Problem Properties

As shown in Theorem 1, the problem studied is NP-hard. Besides, the rational score function is nonmonotonic and non-submodular, whose details are in Theorem 2.

Theorem 1. Given an attribute bipartite graph, the problem of computing the rational (α, β)-core is NP-hard.

Proof. When and , we reduce the biclique problem [17] to the RCD-ABG problem. Consider an attribute bipartite graph , where for each vertex in the lower layer , its adjacent edges have distinct attributes. This means that given a subgraph of , the consensus score of each vertex in is . Hence, our score function is converted to . In order to make the rational score large, for the first term of the function, namely, , we need to make the numerator as large as possible and the denominator as small as possible. Due to the degree constraint of the lower layer, the lower bound of is . So, the rational score function is . Given the trade-off parameter, α, and β, to find the rational (α, β)-core with the largest score, the numbers of upper-layer and lower-layer vertices need to be minimized, which means that they should be equal to β and α, respectively. As discussed, each upper-layer vertex (resp. lower-layer vertex) should have degree at least α (resp. β). The result is therefore a biclique, in which every upper-layer vertex is connected to every lower-layer vertex, and finding it is NP-hard [17]. Therefore, our problem is NP-hard. □

Theorem 2. The objective score function is nonmonotonic and non-submodular.

Proof. Nonmonotonic. By considering the example in Figure 1, we first prove nonmonotonicity. Note that we only keep two decimal places in the following. Suppose ; we can see that in the subgraph denoted by the solid line, that is, ,  = 0.5. After deleting vertex ,  = 0.53. However, after further deleting vertex , the score is  = 0.5. The score first increases and then decreases; therefore, the function is nonmonotonic.
Non-Submodular. Given two sets and , is submodular if . We show that the inequality does not hold by a counterexample in Figure 1. Suppose  =  and  = . We have  = 0.5,  = 0.5,  = 0.5, and  = 0.75. Thus, the inequality does not hold and is not submodular.

4. Solution

In this section, a greedy framework is first developed to find the result, based on the score function and the marginal gain that we define. Considering the limitations of the basic method, we further propose two novel strategies that return results of better quality.

4.1. A Basic Greedy Framework (BGF)

Intuitively, to find the (α, β)-core with the largest score, we can delete those vertices whose deletion will increase the score. Based on this, we present our basic greedy framework by introducing the rational marginal gain as follows.

Definition 4. (rational marginal gain). Given an attribute bipartite graph and a vertex v in it, the rational marginal gain of v is defined with respect to v and the set of v’s neighbors in the graph that violate the degree constraint after removing vertex v.

4.1.1. The Basic Greedy Framework (BGF)

The details of BGF are illustrated in Algorithm 2, which includes three main steps. We maintain a vector that stores all connected (α, β)-cores still to be processed. (1) We find all connected (α, β)-cores of the input graph and store them in this vector (Line 3). (2) At each iteration, we take the (α, β)-core currently being processed and greedily peel the vertex providing the smallest rational marginal gain (Line 5), which is called the best vertex. After removing the best vertex, we calculate the (α, β)-core in the remaining graph. If there are multiple connected (α, β)-cores, we push them back into the vector (Lines 6–8). We continue this process until there is no (α, β)-core left in the graph. Note that the rational marginal gain may be negative, which means that the rational score increases after the removal. (3) We output the result with the largest score among the obtained (α, β)-cores (Line 9).

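Algorithm 2
Input: G: attribute bipartite graph, α: degree constraint in upper layer, β: degree constraint in lower layer
Output: R: the connected (α, β)-core with the largest rational score
(1) R ← ∅
(2) Q ← an empty vector
//Step 1
(3) Q ← all connected (α, β)-cores of G
//Step 2
(4) while Q is not empty do
(5)   S ← the next (α, β)-core in Q; remove from S the vertex with the smallest rational marginal gain
(6)   for each connected (α, β)-core S′ in the remaining graph do
(7)     push back S′ into Q
(8)     record S′ as a candidate result
//Step 3
(9) return R ← the candidate with the largest rational score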
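The three steps of the greedy framework can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the authors' code: the rational score and the rational marginal gain are passed in as caller-supplied functions `score` and `gain` because their exact formulas are not reproduced here, and the helper names (`core_edges`, `components`, `greedy_peel`) are hypothetical. The core is recomputed from scratch after each removal for clarity rather than speed.

```python
from collections import defaultdict

def core_edges(edges, alpha, beta):
    """Repeatedly drop edges whose endpoints violate the degree constraints."""
    edges = set(edges)
    while True:
        deg_u, deg_l = defaultdict(int), defaultdict(int)
        for u, v in edges:
            deg_u[u] += 1
            deg_l[v] += 1
        keep = {(u, v) for u, v in edges if deg_u[u] >= alpha and deg_l[v] >= beta}
        if keep == edges:
            return edges
        edges = keep

def components(edges):
    """Split an edge set into connected components (frozensets of edges)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[("U", u)].add(("L", v))
        adj[("L", v)].add(("U", u))
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, nodes = [start], set()
        while stack:
            x = stack.pop()
            nodes.add(x)
            for y in adj[x] - seen:
                seen.add(y)
                stack.append(y)
        comps.append(frozenset((u, v) for u, v in edges if ("U", u) in nodes))
    return comps

def greedy_peel(edges, alpha, beta, score, gain):
    """Return (best community, its score); score and gain are supplied by the caller."""
    best, best_score = None, float("-inf")
    queue = components(core_edges(edges, alpha, beta))       # Step 1
    while queue:                                             # Step 2
        current = queue.pop()
        value = score(current)
        if value > best_score:
            best, best_score = current, value
        vertices = {("U", u) for u, _ in current} | {("L", v) for _, v in current}
        # Best vertex: smallest marginal gain (sorted only to break ties deterministically).
        side, x = min(sorted(vertices), key=lambda t: gain(current, t))
        rest = {(u, v) for u, v in current if (u if side == "U" else v) != x}
        queue.extend(components(core_edges(rest, alpha, beta)))
    return best, best_score                                  # Step 3
```

Every iteration deletes at least one vertex from the component being processed, so the loop terminates once no (α, β)-core remains.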

Example 3. Consider the user-movie network in Figure 1. Suppose α = 2 and β = 2. According to BGF, vertex is removed first and the rational score of the remaining (α, β)-core is 0.415476. Similarly, we remove vertices , , and , iteratively. The corresponding rational scores are 0.481667, 0.6, and 0. Therefore, the returned result is with a rational score of 0.6.

4.2. Optimized Strategies

The basic greedy framework is simple but suffers from the following drawback. When a vertex is removed from the subgraph, the degree of some other vertices may decrease, causing them to drop from the community in succession. These vertices, together with the removed vertex itself, are called the followers of the removed vertex. If the removed vertex has a large number of followers, it can severely limit the effectiveness of the algorithm. Hence, we need to consider the effect of each removed vertex. In the following, we propose two improved strategies to handle this limitation.
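The notion of followers can be illustrated with a short Python sketch: the followers of a vertex are obtained by comparing the (α, β)-core before and after deleting it. The function names and the `("U", name)` / `("L", name)` vertex encoding are assumptions made for this example, and the core is recomputed naively rather than incrementally.

```python
from collections import defaultdict

def core_edges(edges, alpha, beta):
    """Repeatedly drop edges whose endpoints violate the degree constraints."""
    edges = set(edges)
    while True:
        deg_u, deg_l = defaultdict(int), defaultdict(int)
        for u, v in edges:
            deg_u[u] += 1
            deg_l[v] += 1
        keep = {(u, v) for u, v in edges if deg_u[u] >= alpha and deg_l[v] >= beta}
        if keep == edges:
            return edges
        edges = keep

def vertex_set(edges):
    """Vertices of an edge set, tagged with their layer ("U" or "L")."""
    return {("U", u) for u, _ in edges} | {("L", v) for _, v in edges}

def followers(edges, alpha, beta, vertex):
    """Vertices that leave the (alpha, beta)-core once `vertex` is deleted.

    `vertex` is a (layer, name) pair; the returned set includes `vertex`
    itself whenever it belongs to the core.
    """
    core = core_edges(edges, alpha, beta)
    side, name = vertex
    rest = {(u, v) for u, v in core if (u if side == "U" else v) != name}
    return vertex_set(core) - vertex_set(core_edges(rest, alpha, beta))

# Complete bipartite graph with 3 upper and 2 lower vertices (toy data).
k32 = {(u, v) for u in ("u1", "u2", "u3") for v in ("v1", "v2")}
cascade = followers(k32, 2, 3, ("U", "u1"))  # whole core collapses
isolated = followers(k32, 2, 2, ("U", "u1"))  # only u1 leaves
```

With β = 3 every lower-layer vertex needs all three users, so deleting one user removes the entire community, which is exactly the situation the optimized strategies try to account for.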

4.2.1. 2-Hop Neighbors Optimization (OS-I)

As observed, if the removed vertex is in the lower layer, its 2-hop neighbors in the same layer may violate the degree constraint and be deleted, which significantly affects the rational score. Based on this, we approximate the marginal score function of a lower-layer vertex by taking its 2-hop neighbors in the lower layer into account. The best vertex in Line 5 of Algorithm 2 is then selected with this approximated score, and the other steps are the same.
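As a small illustration of the 2-hop neighbors used by OS-I, the following Python sketch collects, for a lower-layer vertex, all other lower-layer vertices that share at least one upper-layer neighbor with it. The function name and the edge-list input are assumptions for this example; how these neighbors enter the approximated marginal score is not shown here.

```python
from collections import defaultdict

def two_hop_lower_neighbors(edges, v):
    """Lower-layer vertices reachable from lower-layer vertex v in two hops.

    edges: iterable of (u, w) pairs, u in the upper layer, w in the lower layer.
    """
    upper_to_lower = defaultdict(set)
    lower_to_upper = defaultdict(set)
    for u, w in edges:
        upper_to_lower[u].add(w)
        lower_to_upper[w].add(u)

    result = set()
    for u in lower_to_upper[v]:      # first hop: users adjacent to v
        result |= upper_to_lower[u]  # second hop: their other movies
    result.discard(v)                # v is not its own 2-hop neighbor
    return result

# Toy user-movie edges (made-up data).
toy_edges = [("u1", "m1"), ("u1", "m2"), ("u2", "m2"), ("u2", "m3"), ("u3", "m4")]
near_m2 = two_hop_lower_neighbors(toy_edges, "m2")
```

In this toy graph, `m2` shares user `u1` with `m1` and user `u2` with `m3`, so both are its 2-hop neighbors, while `m4` is unrelated.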

Example 4. Reconsider the user-movie network in Figure 1. Suppose α = 2 and β = 2. According to OS-I, we remove vertex first and obtain the rational score of the remaining (α, β)-core. Then, we remove and obtain the corresponding score of 0.6556. After removing , the obtained score is 0. So, we return the result with a score of 0.6556.

4.2.2. Followers-Based Optimization (OS-II)

The second idea is motivated by the followers of each removed vertex. Instead of removing a single vertex and calculating its rational marginal gain, we remove a vertex together with all its followers from the current candidate graph, choosing the vertex for which this removal has the smallest rational marginal gain. Hence, the marginal score is modified to account for the followers of the removed vertex, and the other steps are the same as in BGF.

Example 5. Reconsider the example in Figure 1. Suppose α = 2 and β = 2. According to OS-II, vertex is removed first and the obtained score is 0.455333. Then, we remove and obtain a score of 0.65556. After deleting , the score is 0. Therefore, we return the result with a score of 0.65556.

4.2.3. Analysis

The main difference between BGF and the optimized algorithms lies in how the best vertex is selected. In BGF, calculating the marginal score of a vertex is time. In OS-II, the time complexity of identifying the followers of the vertex is , which may significantly increase the running time.

5. Experiments

5.1. Algorithms

To the best of our knowledge, there is no existing work on the RCD-ABG problem. In the experiments, we implement and evaluate the following algorithms.
(i) BGF. The baseline greedy framework presented in Algorithm 2, which iteratively peels the graph and returns the best result found during the search
(ii) OS-I. OS-I builds on the baseline framework BGF and further integrates the proposed 2-hop neighbor-based optimization
(iii) OS-II. OS-II builds on the baseline framework BGF and further integrates the proposed follower-based optimization
(iv) ORI. To evaluate the advantage of the proposed model, we also implement the traditional (α, β)-core search method [10], which iteratively removes the vertices that violate the degree constraints and returns the final subgraph

5.2. Datasets and Workloads

We employ 6 real-world bipartite graphs. Among these datasets, CiaoDVD and TripAdvisor can be obtained from KONECT (https://konect.uni-koblenz.de). The other datasets are publicly available on GroupLens (https://grouplens.org/datasets/). The statistics of the datasets, including the number of attributes in each bipartite graph, are shown in Table 2. HetRec (HR) [18] is a user-artist network, where the attribute of an edge denotes the number of times that the user listens to music by the artist. CiaoDVD (CD) and MovieLens (ML) [18] are user-movie networks in which the attributes of the relationships represent the ratings for movies. TripAdvisor is a user-hotel bipartite graph and the attribute of its edges denotes the rating given by users. BookCrossing (BC) is a user-book network and its edges denote the book ratings given by users. Due to the different densities of the graphs, α and β vary from 5 to 25 in HetRec, CiaoDVD, and TripAdvisor, from 15 to 35 in BookCrossing, and from 50 to 250 in MovieLens and Personality. The trade-off parameter of the rational score is set to 0.7 because the density of the community strengthens with the continuous deletion of vertices; thus, we focus on the consensus score. All the programs are implemented in standard C++. All the experiments are performed on a server with an Intel Xeon 2.4 GHz CPU and 128 GB main memory.

5.3. Efficiency Evaluation

To evaluate the efficiency, we report the response time of the algorithms by varying α and β in Figures 2(a)–2(f). As observed, OS-I and OS-II take more time than BGF. This is because OS-I needs to calculate the 2-hop neighbors of a vertex in the lower layer and OS-II needs to calculate the followers of a vertex. Although the time complexity of OS-I and OS-II is higher than that of BGF, there is not much difference in response time between them. We can also observe that when α and β increase, the response time decreases for all methods. This is because the community size decreases.

5.4. Effectiveness Evaluation

To evaluate the effectiveness, we compare BGF, OS-I, and OS-II with ORI and report the rational score of the returned community. ORI is based on the traditional (α, β)-core model. It first computes the (α, β)-core of the graph and then directly returns the connected component with the largest rational score. The results are shown in Figures 3(a)–3(f). We can observe that the original (α, β)-core has a very small rational score. OS-I and OS-II outperform BGF over all the datasets; that is, they find communities with a higher score than BGF. The score returned by OS-I is at least 0.01 higher than the one returned by BGF on all datasets. Due to the nature of the consensus score, this improvement of OS-I is already significant for the overall performance. The rational score decreases when α and β increase because of the tighter degree constraints.

5.5. Case Study

To further evaluate the advantage of the proposed model, we conduct a case study on the HetRec dataset. The results are shown in Figure 4. As shown, the movie and user vertices are marked with different colors, and edges of different colors denote different scores. The community inside the solid line, which consists of the enlarged vertices and bold edges, is the returned result. As we can see, the model finds a more rational community with high preference consistency and a dense structure.

6. Conclusion and Future Work

In this paper, we propose and investigate the rational (α, β)-core detection problem in attribute bipartite graphs. We formally define the problem and prove its NP-hardness. To solve this problem, a basic greedy framework is first presented, which iteratively removes the best vertex with the smallest marginal gain and calculates the remaining (α, β)-core. Two optimized strategies, namely, 2-hop neighbor-based optimization and follower-based optimization, are proposed to improve the quality of the results. Experiments are conducted on real bipartite graphs to demonstrate the advantages of the proposed model and techniques. As shown in the experiments, the proposed model significantly outperforms the traditional (α, β)-core model. In real-world applications, there are also attributes on the vertices of the graphs. In future work, we will consider such more complex scenarios when designing the model and the corresponding approaches.

Data Availability

The datasets in this paper are publicly available at https://konect.uni-koblenz.de and https://grouplens.org/datasets/.

Conflicts of Interest

The authors declare that they do not have any commercial or associative interest that represents conflicts of interest in connection with the work submitted.

Acknowledgments

This work was supported by ZJSSF 21NDQN247YB, ZJNSF Y202045024, ZJNSF LQ20F020007, and ZJNSF LY21F020012.