Abstract

Identifying community structures and the underlying semantic characteristics of communities is an essential task in complex network analysis. However, most methods proposed so far are applicable only to assortative community structures, that is, structures with more links within communities and fewer links between different communities, and thus ignore the rich diversity of community regularities in real networks. In addition, node attributes, which provide rich semantic information about communities and networks, can complement structural information and enable more in-depth community detection. In this paper, we propose a novel unified Bayesian generative model that detects generalized communities and provides semantic descriptions simultaneously by combining network topology and node attributes. The proposed model is composed of two parts closely correlated through a transition matrix: we first apply the concept of a mixture model to describe network regularities and then adapt the classic Latent Dirichlet Allocation (LDA) topic model to identify community semantics. Thus, the model can detect broad types of network structure regularities, including assortative structures, disassortative structures, and mixture structures, and provide multiple semantic descriptions for the communities. To optimize the objective function of the model, we use an effective Gibbs sampling algorithm. Experiments on a number of synthetic and real networks show that our model outperforms several baselines on community detection.

1. Introduction

With the advent of the era of big data and the diverse channels for acquiring data, we have obtained a large amount of data from complex systems in the real world [1]. In particular, we can obtain not only diversified entities in complex systems but also a variety of related descriptions (attributes) of them. Attributed complex networks are usually used to analyze and study these data [2, 3]. Taking social systems as an example, nodes denote individuals and edges represent interactions between them. At the same time, individuals have personal information about gender, age, country, job, race, and so on, which represent their unique attributes. The sufficient and effective application of structural and attribute information is of great value for complex network analysis.

At present, exploring the structural regularities and functions of the network is a significant part of complex network analysis [4]. One of the most essential tasks is community detection. It is believed that nodes within the same community typically have similar structural characteristics and properties. The detection of communities or modules in a network is conducive to understanding the organization rules of complex networks, exploring latent patterns, and predicting the behavior of complex systems. A number of successful community detection approaches have been proposed, which fall into different categories, such as hierarchical clustering algorithms [5, 6], modularity optimization approaches [7], statistical inference [8–10], spectral algorithms [11–14], generative models [15–18], and Markov dynamic algorithms [19–21]. For a review, readers can refer to [22].

However, most conventional community detection methods only consider the network structure and ignore the attributes of nodes. In fact, the attributes of nodes help to improve the performance of community detection, because nodes with similar attributes tend to belong to the same community [23, 24]. Unlike network structures, which specify node connectivity, node attributes provide the semantics of nodes and of the underlying network [15]. Therefore, when the nodes in the network are divided into different communities, the node attributes in the same community can reveal the community semantics, which is somewhat similar to the Latent Dirichlet Allocation (LDA) topic model. Thus, when semantic information and structural information are used complementarily, missing structural information can be supplemented and more in-depth community detection can be carried out. Recently, some methods have been proposed to combine attributes and structural information for better community detection. They include heuristic-based methods [3, 25] and probabilistic inference-based methods [26, 27]. In addition to yielding better community detection results, node attributes also provide semantic descriptions of the communities. These descriptions help to reveal why certain nodes are grouped together and to understand the functions of communities. Therefore, detecting communities and identifying their underlying semantics are both valuable for complex network analysis. Some methods for this purpose have been developed in [15, 28].

Most methods that have been proposed for community detection are appropriate only for assortative community structure; i.e., the nodes within a community are densely connected [22, 29]. They usually assume that this particular structural regularity exists in the target network. However, the assumption may not always correspond to the true intrinsic structure of the network, which limits the applicability of the existing methods. Beyond that, there are other important types of structural regularities in real networks, and a network may contain multiple structures simultaneously. Examples include the disassortative structure (bipartite structure) [30], in which most of the edges run between different communities, and the mixture structure, which contains both assortative and disassortative structures. Due to the rich diversity of community regularities in real-world networks, there may be several unknown types of structures in a network. Therefore, methods that adapt to these realistic situations and carry out generalized community detection are urgently needed. In this paper, following [30], we refer to these assortative and disassortative structures in complex networks as generalized communities. Some methods [4] have been proposed to detect generalized communities in complex networks.

In particular, although node attributes may carry essential semantic information of communities, there are few ways to detect generalized communities, that is, detecting broad types of network structural regularities and combining network structures and attributes. Chen et al. [31] developed a Bayesian nonparametric attribute (BNPA) model and explored various types of network structures, but the model did not provide multiple semantic descriptions of the communities.

As a result, given the rich diversity of community regularities in real networks, node attributes can not only improve the quality of generalized community detection but also reveal the latent semantic characteristics of communities. Identifying generalized communities and providing semantic descriptions for them is therefore a problem worth studying in complex network analysis. None of the above methods solves this twofold problem. Instead, we propose a unified generative model to detect communities in a wide variety of network structures without any prior knowledge of the type of intrinsic regularities in the networks. We also derive the semantic descriptions of the communities by combining the network structure and attributes at the same time. Our model is composed of two parts closely related through a probability transition matrix. The first is the topology part, in which communities are described based on a mixture model, assuming that nodes in the same group have similar link patterns (no matter whether there are more links within the communities or between communities). The second is the attribute part, in which semantic information is identified by the classic topic model (LDA) [23]. We assume that each community has several topics; i.e., each community has its own distribution over topics. A probability transition matrix is used to reveal the potential correlations between topics and communities. It can handle the problem that the topics from attributes and the communities from networks are not well matched. We finally use a Gibbs sampling algorithm to optimize the objective function. Extensive experiments on a number of synthetic and real networks show that our model performs better than several baselines on community detection.

In summary, the contributions of this paper are as follows:
(i) To the best of our knowledge, this is the first work to propose the generalized community in attributed networks, in which the nodes share link patterns with others as well as semantic similarity in the network
(ii) We propose a unified generative model to analyze attributed networks and detect the generalized community structure as well as its semantic description; it can describe the internal relationship between the topological structure and the node attributes of the network
(iii) We also develop an effective Gibbs sampling algorithm, and experiments show its better performance compared with several baselines

2. Related Work

To explore the structural regularities of networks, some methods for detecting generalized communities have been proposed. Recently, node attributes have also attracted extensive attention in complex network analysis.

Newman and Leicht [30] developed a mixture model to explore the network structure with only links. In this method, the nodes with the same link patterns were divided into the same groups. It modeled the relationships between communities and nodes. The probability that a node was connected to other nodes in the network was related to the community to which the node belonged. Closely connected nodes did not necessarily belong to the same community. Thus, a broad range of structural signatures could be explored without any prior assumptions about the structure of the network. Hua-Wei et al. [4] focused on identifying the intrinsic structural rules in networks. In this model, the nodes within the same group had a similar link preference to other groups. A block matrix was defined to denote the probability that a randomly selected edge linked two distinct groups. It could detect broad types of structural regularities by modeling network structures.

There were several methods for content analysis, such as Latent Dirichlet Allocation (LDA) [23]. The method focused on node attributes and identified the set of nodes whose attributes were similar. Several community detection approaches combining network topologies and node attributes have also been proposed. Some methods only used node attributes to improve the performance of community detection, while others provided the semantic descriptions of communities. Ruan et al. [25] proposed a method for determining the strength of the edges between nodes using content information, which is also applicable to graph clustering. Yang et al. [27] used a discriminative model that combines node attributes and network topologies to detect communities. However, this method focused on community detection without describing the relevant attributes of each community, so it did not provide a semantic description of the community. Pool et al. [28] proposed a heuristic method to detect communities by optimizing the community scores. This heuristic method reported too many relatively small communities, some of which had only two or three nodes. Chakraborty and Sycara [32] developed a model based on nonnegative matrix trifactorization to detect communities by modeling network structure and contents. However, this method mainly used the additional attribute information to identify communities and failed to infer the relationship between communities and attributes. Chen et al. [31] developed a Bayesian nonparametric attribute (BNPA) model to explore structural regularities in networks. This model combined network structures and node attributes for community detection and assumed that network structures and node attributes shared the same community memberships; i.e., attribute clusters and network communities were the same. However, attributes and community structures may not always align, and the model could not give multiple semantic descriptions of communities. Wang et al. [33] proposed a model that combined network topology and node semantic information to identify communities. It integrated topology-based community memberships and node-attribute-based community attributes (or semantics) in the framework of nonnegative matrix factorization. The model was based on two important observations: if the community memberships of two nodes are similar, they have a high probability of being connected by an edge, and if their attributes are related to the underlying community attributes, they are likely to be in the same community. The use of node contents improved the result of community detection and provided a semantic description of the resultant network communities. He et al. [15] introduced a generative model consisting of two parts, one for communities and the other for semantics, exploring the network structure and interpreting the functional modules semantically. The method was only applicable to networks with assortative structures and failed to detect generalized communities. More discussions on attributed networks can be found in the related surveys by Bothorel et al. [34] and Chunaev [35].

3. Model Formulation

In this section, we give a formal description of the proposed model, i.e., Generalized Semantic Community (GSC) identification, with the purpose of generalized community detection and semantic identification in the networks.

3.1. Notations

We define an attributed network by an adjacency matrix over its nodes and an attributes matrix over its nodes and attributes. In the adjacency matrix, an entry is 1 if there is an edge from the first node to the second node of the corresponding pair; otherwise, it is 0. In the attributes matrix, an entry is 1 if the corresponding node has the corresponding attribute; otherwise, it is 0. Our model is specified by three types of quantities:
(i) Observed quantities: the number of groups, the number of nodes, the number of attributes, the adjacency matrix, and the attribute matrix
(ii) Latent quantities: the group labels, which give the community membership of each node, and the content memberships, which give the topic label of each attribute of each node
(iii) Model parameters: the community proportions, i.e., the fraction of nodes in each community; the link preferences, i.e., the probability that a node in a given community connects to a given node; the community-topic transition probabilities, i.e., the probability that a node is in a given content cluster given its community label; and the topic-attribute probabilities, i.e., the probability that a given topic generates a given attribute of a node

Table 1 shows the notations of the parameters.

3.2. Problem Definition

Given the rich diversity of community regularities in real networks, encoding network structure and node attributes simultaneously and providing semantic descriptions of the resultant network communities remain open problems in community detection. Most existing methods address only some aspects of these problems. Given an attributed network, the goal is twofold:
(i) How can the nodes be divided into communities and content clusters regardless of the kind of structural regularity the network has?
(ii) How can the correlations between communities and attribute topics be identified to provide the best semantic descriptions of communities?

The problem can thus be formalized as follows: given the adjacency matrix, the attributes matrix, and the number of communities, our goal is to obtain the community assignment of each node and the topic distribution of each community.

3.3. Model Definition

To achieve the objective, we define a unified Bayesian probabilistic generative model to handle topologies and node attributes at the same time. Our goal is to divide the nodes in networks with extensive structural regularities into communities and content clusters, respectively, by using the adjacency matrix as well as the attributes matrix. To model network structure, we assume that the nodes in the same group have similar link patterns; i.e., the probability of a node connecting to other nodes in the network is the link tendency between the community to which the node belongs and the rest of the nodes. We also adopt a modified LDA model for node attributes. A transition matrix, which connects network communities and attribute topics, is used to jointly model network structures and node attributes. To be specific, a community may be characterized by multiple topics, and the topic of each node attribute is derived from the topic distribution of the community to which the node belongs. Then, by extracting the latent correlation between network communities and attribute clusters, multiple semantic interpretations can be provided for each community. Figure 1 shows a graphical representation of this model, and the generation process is as follows:
(1) Sample the community proportions
(2) For each community:
  (a) Sample its link preferences over the nodes
  (b) Sample its distribution over topics
(3) For each topic:
  (a) Sample its distribution over attributes
(4) For each node:
  (a) Sample a latent group assignment
  (b) For each node to which it has an edge:
    (i) Sample the edge
  (c) For each attribute that it has:
    (i) Sample a topic
    (ii) Sample the attribute

3.3.1. Generating Model Parameters

We introduce a Bayesian treatment into the model generation process. After the number of communities is given, the model parameters are treated as random variables, and we generate the four sets of model parameters (the community proportions, the link preferences, the community-topic transition probabilities, and the topic-attribute probabilities) from Dirichlet distributions. Each set is governed by its own hyperparameter vector. The generative process is as follows.

We use Dirichlet distributions to generate the following model parameters, respectively:where represents a Gamma function. All the communities share the same , and all the topics share the same and .

3.3.2. Generating Observed and Latent Quantities

At first, we sample the latent community membership for every node from a multinomial distribution independently. It is described as

After the latent community membership of nodes is explicit, we generate edge as the following definition:where denotes the “preferences” for any node in community to link to node , regardless of which community that node is in. Nodes in the same community have a common link “preference” without any assumptions about network structure regularities. Thus, generalized communities can be detected. Then, we sample the latent topics membership for each attribute of node from a multinomial distribution independently, defined as

As denotes the probability that node is in the -th semantic topic while it is divided into -th community, that is, provides the transition from communities to topics, the topic assignment and community membership of node do not always match well. This is why the community may have several topics.

We generate attributes as the following definition:

Then, the probability of the network with nodes and attributes is

It is subject to , , , and .
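The generative process of Section 3.3 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function name, the argument names, and the use of symmetric Dirichlet priors with scalar concentrations `alpha`, `gamma`, `lam`, and `eta` are all assumptions, and the per-node numbers of out-links and attributes are taken as given inputs.

```python
import numpy as np

def generate_gsc_network(out_degrees, attr_counts, n_communities, n_topics,
                         n_attributes, alpha=1.0, gamma=1.0, lam=1.0, eta=1.0,
                         seed=0):
    """Sample (adjacency, attributes, labels) from a GSC-style process.

    All names and the symmetric priors are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_nodes = len(out_degrees)

    # Model parameters, each drawn from a symmetric Dirichlet prior.
    pi = rng.dirichlet(np.full(n_communities, alpha))                 # community proportions
    theta = rng.dirichlet(np.full(n_nodes, gamma), size=n_communities)  # link preferences
    phi = rng.dirichlet(np.full(n_topics, lam), size=n_communities)   # community -> topic
    beta = rng.dirichlet(np.full(n_attributes, eta), size=n_topics)   # topic -> attribute

    z = rng.choice(n_communities, size=n_nodes, p=pi)  # latent group labels
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    X = np.zeros((n_nodes, n_attributes), dtype=int)
    for i in range(n_nodes):
        # Link targets depend only on the community of the source node.
        # Repeated draws collapse to one entry, so the realized out-degree
        # can be smaller than requested; self-loops are not excluded.
        targets = rng.choice(n_nodes, size=out_degrees[i], p=theta[z[i]])
        A[i, targets] = 1
        # Each attribute first picks a topic from the community's topic
        # distribution, then an attribute from that topic.
        topics = rng.choice(n_topics, size=attr_counts[i], p=phi[z[i]])
        for t in topics:
            X[i, rng.choice(n_attributes, p=beta[t])] = 1
    return A, X, z
```

Because the link targets are drawn from a distribution indexed only by the source node's community, nothing in the sketch favors within-community links, which is what allows disassortative and mixture structures to arise.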

4. Model Optimization

Exact inference of the latent variables is intractable, so we use Gibbs sampling [36] to sample the latent community and topic assignments and slice sampling [37] to sample the hyperparameters.

4.1. Inference

Because the Dirichlet and Multinomial distributions are conjugate, equation (2) can be simplified aswithwhere denotes the number of outlinks whose tail nodes belong to and whose head node is ; denotes the number of nodes in community ; denotes the number of which is generated by topic ; and denotes the total number of topics generated by community .

The inference process is in Algorithm 1.

Require: adjacency matrix, attributes matrix, number of iterations, and specified number of groups
Ensure: group assignment z
 initialize the hyperparameters and set all count statistics to 0
 initialize each node's latent community label
 // sample z, the topic assignments, and the hyperparameters
 for each iteration do
  for each node do
   // get the current community assignment of the node
   update the count statistics to remove the node's current assignment
   for each community do
    compute the community probability according to equation (8)
   end for
   Gibbs sampling for the node: draw and record its new community assignment
   update the count statistics to add the node's new assignment
   for each attribute of the node do
    // get the current topic assignment of the attribute
    update the topic count statistics to remove the attribute's current assignment
    for each topic do
     compute the topic probability according to equation (9)
    end for
    Gibbs sampling for the attribute: draw and record its new topic assignment
    update the topic count statistics to add the attribute's new assignment
   end for
  end for
  slice sampling for the hyperparameters in (0, 1)
 end for

4.1.1. Sampling

For each node , given the community assignment for all other nodes, the community probability of the node choosing community iswhere denotes the outlinks of nodes ; denotes the number of outlinks from community except node ; denotes the number of outlinks from community except edges ; denotes the number of nodes in community ; is total number of nodes; denotes the topic labels of the attributes of ; denotes the attributes of ; denotes the topic of ’s -th attribute; denotes the number of node attributes whose topic is except ’s attribute ; denotes the number of nodes’ attributes whose topic is except the attributes of ; denotes the total number of topics generated by community except ; and denotes the attributes of whose topic is .

4.1.2. Sampling

For node in community , given the topic assignment for all the attributes except the attribute , the topic probability of the attribute choosing topic iswhere denotes the number of whose topic is except ’s attribute ; denotes the number of all the attributes whose topic is except ’s attribute ; denotes the number of nodes’ attributes whose topic is and whose nodes belong to community except ’s attribute ; and denotes the number of node attributes that belong to community except ’s attribute .
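The count-based form of this topic update can be illustrated with a short function. It is a sketch under stated assumptions, not a transcription of equation (9): it uses the standard collapsed form for an LDA-style model with symmetric Dirichlet priors, and the array names and the scalar smoothing constants `eta` and `lam` are hypothetical.

```python
import numpy as np

def topic_conditional(attr_topic_counts, topic_totals, comm_topic_counts,
                      k, r, eta=0.1, lam=0.1):
    """Normalized probability of each topic for attribute k of a node in community r.

    attr_topic_counts[t, k]: times attribute k is assigned to topic t
    topic_totals[t]:         total attribute assignments to topic t
    comm_topic_counts[r, t]: assignments to topic t from nodes in community r
    All counts must already exclude the attribute being resampled.
    """
    n_topics, n_attributes = attr_topic_counts.shape
    # How strongly each topic explains attribute k.
    attr_term = (attr_topic_counts[:, k] + eta) / (topic_totals + n_attributes * eta)
    # How strongly community r prefers each topic (the transition part).
    comm_term = (comm_topic_counts[r] + lam) / (comm_topic_counts[r].sum() + n_topics * lam)
    p = attr_term * comm_term
    return p / p.sum()
```

For example, with two topics and two attributes, counts `[[3, 0], [0, 1]]`, topic totals `[3, 1]`, community counts `[[2, 2]]`, and both smoothing constants set to 1, the probability of the first topic for attribute 0 in community 0 is 12/17.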

4.2. GSC Models

Our model can also be restricted to use only the edges or only the node attributes of a network.

4.2.1. GSC-Link

The probability of only considering the links can be written as

The community probability of node choosing community is

4.2.2. GSC-Attr

The probability of only considering the attributes can be written as

The community probability of node choosing community is

The topic probability of the attribute choosing topic is the same as GSC.
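For the links-only variant of Section 4.2.1, one sweep structure of a collapsed Gibbs sampler can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes symmetric Dirichlet priors with scalar concentrations `alpha` and `gamma`, a binary adjacency matrix, and the standard Dirichlet-multinomial predictive for a node's out-links; hyperparameter slice sampling is omitted.

```python
import numpy as np

def gibbs_links_only(A, n_communities, n_iters=50, alpha=1.0, gamma=1.0, seed=0):
    """Collapsed Gibbs sampling of community labels from links alone (sketch)."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A)
    n = A.shape[0]
    z = rng.integers(n_communities, size=n)
    # link_counts[r, j]: out-links from members of community r that point to node j
    link_counts = np.zeros((n_communities, n))
    sizes = np.zeros(n_communities)
    for i in range(n):
        link_counts[z[i]] += A[i]
        sizes[z[i]] += 1
    for _ in range(n_iters):
        for i in range(n):
            # Remove node i from the count statistics.
            link_counts[z[i]] -= A[i]
            sizes[z[i]] -= 1
            targets = np.flatnonzero(A[i])
            steps = np.arange(len(targets))
            totals = link_counts.sum(axis=1)
            # Log-probability of each community: size prior times the
            # predictive probability of node i's out-links.
            log_p = np.log(sizes + alpha)
            log_p += np.log(link_counts[:, targets] + gamma).sum(axis=1)
            log_p -= np.log(totals[:, None] + n * gamma + steps[None, :]).sum(axis=1)
            p = np.exp(log_p - log_p.max())
            p /= p.sum()
            # Draw the new label and add node i back.
            z[i] = rng.choice(n_communities, p=p)
            link_counts[z[i]] += A[i]
            sizes[z[i]] += 1
    return z
```

A node is assigned by where its out-links point, not by whether they stay inside its own group, so the same loop applies to assortative, bipartite, and keystone-style structures.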

5. Experiments and Analysis

Firstly, we experiment on three different synthetic networks with different structure regularities (i.e., assortative, disassortative, and mixture structures) to evaluate the quality of community detection and analyze the superiority of modeling on the network with a rich diversity of structures. Then, we assess the interpretability of communities in an online music system. Finally, we evaluate on real networks and do a comparison with state-of-the-art methods.

As the ground truth of communities in the networks is known, we use the following Normalized Mutual Information (NMI) [38] to compare all the methods:where is the ground truth of communities in the network, and is the community identified by the method. and are the entropies of and , respectively, and denotes the mutual information between them. The higher NMI is, the better the result is.
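As a concrete reference, NMI can be computed from two label vectors as sketched below. The normalization shown, twice the mutual information divided by the sum of the two entropies, is an assumption; other normalizations (for example, by the geometric mean of the entropies) are also in common use, and the variant used in [38] should be checked before comparing numbers.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(a, b):
    """Mutual information between two label vectors of equal length."""
    _, a_idx = np.unique(np.asarray(a).ravel(), return_inverse=True)
    _, b_idx = np.unique(np.asarray(b).ravel(), return_inverse=True)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1)   # contingency table
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0                        # skip empty cells to avoid log(0)
    return np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz]))

def nmi(truth, found):
    """NMI with arithmetic-mean normalization (assumed variant)."""
    h = entropy(truth) + entropy(found)
    if h == 0:
        return 1.0  # both partitions are a single community
    return 2.0 * mutual_information(truth, found) / h
```

The score is invariant to relabeling, so `nmi([0, 0, 1, 1], [1, 1, 0, 0])` equals 1, while two independent partitions such as `[0, 0, 1, 1]` and `[0, 1, 0, 1]` score 0.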

To describe parameter estimation in GSC more adequately, we describe the changing trend of likelihood function with the number of iterations in Figure 2(a), and each curve in Figure 2(b) shows the changes of the log-likelihood of Cora with one of four hyperparameters when other hyperparameters are determined by slice sampling. It can be seen that the log-likelihood of GSC quickly converges at about 150th iteration. The log-likelihood probability is less sensitive to , , and while made a big difference.

5.1. Experiment on Synthetic Networks with Different Structure Regularities

We first conduct experiments on synthetic networks to evaluate the quality of community detection.

The first synthetic network is a random network in Newman’s method [15]. The network consists of 128 nodes divided into 4 disjoint communities with . As , (the edges linking to nodes within community) is much larger than (the edges linking to nodes in other communities). For every node , we generate a -dimensional binary attribute (i.e., ) to divide the nodes of 4 content clusters with . In this paper, denotes the number of attributes for every node with associated with its community and (noisy attribute) denotes the number of attributes for every node with corresponding to the other communities. In particular, we generate the -th to -th attributes for each node in the -th cluster by a binomial distribution with mean and generate the remaining attributes by the binomial distribution with mean .
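The block-structured attribute generation described above can be sketched as follows. The function and parameter names are hypothetical: `block_width` is the number of attributes reserved for each cluster, and `p_in` and `p_out` are the per-attribute success probabilities for a node's own block and for the remaining (noisy) attributes, drawn here as independent Bernoulli variables.

```python
import numpy as np

def block_attributes(labels, block_width, p_in, p_out, seed=0):
    """Binary attribute matrix with one block of attributes per content cluster."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_clusters = labels.max() + 1
    # Start from the noise probability everywhere.
    probs = np.full((len(labels), n_clusters * block_width), p_out, dtype=float)
    for i, c in enumerate(labels):
        # Attributes in the node's own block use the higher probability.
        probs[i, c * block_width:(c + 1) * block_width] = p_in
    return (rng.random(probs.shape) < probs).astype(int)

# 128 nodes in 4 clusters of 32, as in the first synthetic network.
labels = np.repeat(np.arange(4), 32)
X = block_attributes(labels, block_width=5, p_in=0.8, p_out=0.1)
```

Raising `p_out` toward `p_in` blurs the blocks, which corresponds to the setting in which the cluster structure of the attributes disappears.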

We set and consider that the topologies and contents share the same membership. The node attributes’ matrix and the community attributes’ matrix are shown in Figure 3. We first set and change from 0 to 12 with an increment of 1. We adopt GSC-link, which uses network topology alone, as the baseline method. The other comparison methods are NEMBP [15] and SCI [33], which use both network topologies and attributes. As shown in Figure 4(a), our method can use the complementary structural information in node attributes to improve the quality of community detection when . Even when , where the cluster structures of node attributes disappear, our model GSC still gets better results than the baseline method GSC-link. Then we set and change from 0 to 9 with an increment of 1. As shown in Figure 4(b), our method also performs better than GSC-attr. In general, the proposed method obtains better community detection results by using both topology and content information.

The second synthetic network is Newman’s model [30] of 108 nodes. It consists of 8 keystone nodes without community labels and other nodes link to them according to their community membership. The remaining 100 nodes are equally divided into 4 groups, and the edges between these nodes are randomly linked, with the mean degree of every node being 10. The keystone nodes are .

In particular, each community has a unique signature set of keystones, and only the link pattern to keystones can identify the community; thus the structure of this network is neither assortative nor disassortative.

At first, we study the influence of noise attributes on community detection. represents the proportion of noisy attributes of each node. We change the probability of noisy attributes from 0 to 1 with an increment of 0.1. The node attributes’ matrix is shown in Figure 5. When becomes larger, the attributes associated with each community are blurred and less discriminant information is provided for the network community. As shown in Figure 6(b), we almost divide the nodes into 3 communities while only considering network structure. The result gets better when using node attributes in Figure 6(c). As shown in Figure 7(a), our method outperforms GSC-link (even reaches 0.7) and significantly outperforms SCI and NEMBP. It shows that the quality of identified communities improves combining node attributes and network structures. Our model GSC is able to fully use network structure information even if the information of node attributes is erroneous. As increases beyond 0.7, GSC performs worse. It also reveals that node attributes with terrible quality can lower the result of community detection. Figure 7(a) also shows that NEMBP performs worse than GSC-link when reaches 0.2 and SCI performs always much worse. Figures 6(b) and 6(c) represent the results of GSC and NEMBP, respectively, when is 0.5. It can be concluded from the above analysis that GSC is more capable of identifying the networks with mixed structural regularities than SCI and NEMBP.

In this network, the propensity to link to the unique set of keystone nodes determines the group membership. We change the network structure by varying the keystone links of each group from 100 to 10 with a decrement of 10. We set the probability of noisy attributes . We adopt our model with only attributes as the baseline method and NEMBP for comparison. As can be seen in Figure 7(b), our method still performs well even if the keystone links are only 30. The new model shows strong robustness to changes in network structure. However, the erratic results of NEMBP indicate that it does not work well for this type of network.

The third network [31] has both a community and a bipartite structure with 100 nodes and 402 edges as shown in Figure 6(e). The 100 nodes are equally divided into 5 groups, three of which form an assortative structure, whereas the remaining two form a bipartite structure. For each node , we generate -dimensional binary attributes; each of the communities and nodes has 50-dimensional relevant attributes. We change the probability of noisy attributes from 0 to 1 with an increment of 0.1. As shown in Figure 7(c), our model always gets better results than NEMBP and SCI. Even when , the quality of identified communities is also improved compared with GSC-link, and the NMI is almost 1. Figure 6(f) shows the result of NEMBP when . Its performance is much worse than that of GSC.
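A network with this kind of mixed structure can be sketched with a small stochastic block model. The link probabilities `p_dense` and `p_sparse` are assumptions chosen for illustration and are not calibrated to reproduce the 402 edges of the network from [31].

```python
import numpy as np

def mixed_structure_network(group_size=20, p_dense=0.3, p_sparse=0.01, seed=0):
    """Undirected network with three assortative groups and one bipartite pair."""
    rng = np.random.default_rng(seed)
    n_groups = 5
    # Block link probabilities between groups.
    B = np.full((n_groups, n_groups), p_sparse)
    for g in range(3):
        B[g, g] = p_dense            # groups 0-2: dense inside (assortative)
    B[3, 4] = B[4, 3] = p_dense      # groups 3-4: dense between (bipartite)
    labels = np.repeat(np.arange(n_groups), group_size)
    P = B[labels][:, labels]         # per-pair link probability
    upper = np.triu(rng.random(P.shape) < P, k=1)  # sample each pair once
    A = (upper | upper.T).astype(int)              # symmetrize, no self-loops
    return A, labels
```

Groups 3 and 4 have almost no internal links, so a method that assumes dense within-community connectivity tends to merge or split them incorrectly, whereas grouping by link pattern separates them.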

5.2. Evaluating Efficiency

In this part, we evaluate the efficiency of community detection methods by measuring each method’s running time on synthetic networks as we increase the network size. The comparison methods are NEMBP and SCI. The synthetic networks include assortative and disassortative structures. The edges are placed uniformly at random within and between communities in fixed numbers. For some of the communities, the number of edges within each community is set to 1,200 and the number of edges between a community and the others is set to 600; these form a community structure. The rest of the communities are divided into pairs: the number of edges between the two communities in each pair is set to 2,400, and the number of edges between communities in different pairs is set to 1,200. Each pair of groups forms a bipartite structure. The largest synthetic network has 7,000 nodes, 126,000 edges, and 700 attributes. We vary the scale of the network (Syn-100, Syn-500, Syn-1000, Syn-2000, Syn-3000, Syn-5000, and Syn-7000). Syn-100 is the third synthetic network used above. For each synthetic network, we generate -dimensional binary attributes. We set the ratio of noise attributes to 0.5.

Figure 8 shows the running time of methods versus the network size. Our method is the fastest among the three. When the program runs to convergence, the running time of our method on Syn-7000 is about 5 minutes. For NEMBP, we set the number of iterations in the program to 10; the running time of the program can reach 11 hours even on Syn-2000. The running time of SCI is more than 19 hours.

5.3. A Case Study

In this paper, we use to correlate the communities and attribute topics and evaluate whether it contributes to the descriptions of the communities. We intensively analyze the underlying semantics of communities and provide particular descriptions for some of the communities detected by GSC. Thus, we use the LASTFM dataset, which is a social network from an online music system, that is, Last.fm. It includes 1,892 users and 11,946 attributes of user’s favorite music singers and tag assignment. In this network, the ground truth of community partition is unknown, so we decide to detect 38 communities as in [15]. We find that the communities may have one main topic or multiple topics; a detailed analysis of the three detected communities with different topics is shown in Figure 9.

The first example in Figure 9(a) is a community with one main topic. Its members appear to be fans of popular female singers such as “Rihanna” and “Britney Spears.” Their music is tagged “pop,” “rock,” and “dance,” and the singers are tagged “female vocalists” and “sexy.” The community in Figure 9(b) is a group of fans of “hardcore punk” music. Hardcore punk is also labeled as hard rock. Glam-sleaze music is a derivative of hard rock and alternative rock coming from a post-punk band. Grunge is a genre of indie rock that evolved from hardcore punk. Emotionally-Driven Hardcore Punk (EMO) is an indie rock style, and Screamo originated from EMO. The last community has two major topics, shown in Figures 9(c) and 9(d), and consists of fans of electronic music. One topic is mainly about Electronic Body Music (EBM), which combines elements of industrial music and electronic punk music. The other topic is about IDM. This kind of music was created in the late 80s, accompanied by hard edge dance and slow music.

5.4. Experiment on Real Networks

We use four real networks with both links and content: Cora, Citeseer, Terrorist, and Biology. Cora is a part of the Cora citation network, including 2,708 published articles and 5,429 edges. Each publication is represented by a 1,433-dimensional binary word vector whose entries indicate the absence or presence of the corresponding words. The publications are divided into seven communities. Citeseer is a subset of the Citeseer citation network. It includes 3,312 published articles and 4,732 edges. Each publication is represented by a 3,703-dimensional binary word vector. The publications are divided into six communities. The Terrorist dataset consists of 1,293 terrorist attacks; each attack is assigned one of 6 labels indicating the type of the attack. Each attack is described by a 106-dimensional binary vector whose entries indicate the absence or presence of a feature. Biology is a real paper citation network drawn from 435 different biological journals. It contains 10,000 papers connected by links. Each paper is described by a 9,944-dimensional 0/1-valued keyword vector; two papers are connected if one references the other. The network also contains 435 nodes representing the biological journals; each paper links to the journal in which it is published. The network therefore forms a mixture structure that is similar to the synthetic network of 108. All the papers are split into 435 groups; each group contains the papers published in a certain journal. We also use Syn-2000, which includes both community and bipartite structure. The five networks are shown in Table 2.
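The binary attribute representation shared by these datasets can be illustrated with a minimal sketch. The vocabulary, paper identifiers, and the helper name `to_binary_vector` are hypothetical and chosen only for the example; the real datasets ship with their own fixed vocabularies.

```python
def to_binary_vector(keywords, vocabulary):
    """Encode a node's keywords as a 0/1 vector over a fixed vocabulary.

    Entry i is 1 if vocabulary[i] occurs among the keywords and 0 otherwise;
    repeated keywords do not change the result.
    """
    present = set(keywords)
    return [1 if word in present else 0 for word in vocabulary]


# Toy vocabulary and papers, invented for illustration only.
vocabulary = ["network", "gene", "protein", "learning"]
papers = {
    "p1": ["gene", "protein"],
    "p2": ["learning", "network", "network"],
}

vectors = {paper: to_binary_vector(words, vocabulary) for paper, words in papers.items()}
for paper, vector in vectors.items():
    print(paper, vector)
```

Each vector has one entry per vocabulary word, so a 1,433-word vocabulary as in Cora would give 1,433-dimensional vectors.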

We compare our GSC model with methods from three categories: (1) models based only on network structure, that is, GSC-link; (2) models based only on node attributes, such as GSC-attr and LDA; (3) models based on both structure and attributes, such as PCL-DC, NMMA, SCI, and NEMBP.

The results of these models on three networks are shown in Table 3. Our model uses the information of network structure and node attributes simultaneously to identify communities. GSC outperforms the other models on Cora and achieves larger NMIs than most of the models on Citeseer and Terrorist. The result of GSC is lower than that of NMMA on Citeseer. This is mainly because, on Citeseer, network structures and node attributes are more likely to share the same community memberships. NMMA assumes that attribute clusters and network communities are the same, so it performs better on this network. Sometimes, the community structure is not obvious when only the structural information of the network is considered, and the nodes are divided into communities mainly by their attributes. In this situation, our model can make effective use of the attribute information. The models based on both structure and attributes usually outperform the models that use only links or only attributes.
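For readers who want to reproduce this kind of comparison, a minimal sketch of normalized mutual information (NMI) between a detected partition and the ground truth follows. The normalization used here, 2I(A;B)/(H(A)+H(B)), is one common choice and is an assumption; this section does not state which NMI variant was used for Table 3.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a labelling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labellings of the same nodes."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI = 2 * I(a; b) / (H(a) + H(b)); assumed normalization, see lead-in."""
    if len(a) != len(b):
        raise ValueError("both labellings must cover the same nodes")
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0:
        return 1.0  # both labellings put every node in a single group
    return 2 * mutual_information(a, b) / (h_a + h_b)


truth = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]    # same partition, different label names
independent = [0, 1, 0, 1]   # carries no information about truth

print(nmi(truth, relabelled))
print(nmi(truth, independent))
```

NMI depends only on how nodes are grouped, not on the label names, so a detected partition does not need to be aligned with the ground-truth labels before scoring.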

6. Conclusions

In this paper, we propose a novel Bayesian probabilistic model that detects generalized communities and identifies their semantics by combining network structure and node attributes, and we use an efficient Gibbs sampling algorithm to optimize the objective function. Even if the node attribute information is of poor quality, our method can use the complementary structural information to get better results. The model assumes that the network structure and the node attributes have different hidden variables and adopts a transition matrix to explore the hidden correlation between communities and topics. Thus, it can provide semantic descriptions of communities that better reveal their characteristics. We evaluate our method on a number of real and synthetic datasets and in a case study. The new method can detect various types of network structures and outperforms several state-of-the-art algorithms.

Like previously proposed methods, our model requires the number of communities to be provided in advance. This is a model selection problem, and in future work we will focus on determining the number of groups automatically.

Data Availability

The datasets used to support the results of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61902278), the National Key R&D Program of China (2018YFC0832101), the Livelihood Science and Technology Project of Qingdao (18-6-1-106-nsh), and the National Social Science Foundation of China (15BGL035).