Abstract

Community structure is one of the most important characteristics of complex networks and has important applications in sociology, biology, and computer science. Community detection based on local expansion is one of the most adaptable approaches to overlapping community detection. However, for lack of effective seed selection and community optimization methods, such algorithms often produce communities of low accuracy. To solve these problems, we propose a seed selection algorithm that fuses degree and clustering coefficient. The method uses the entropy weight method to calculate the weights of degree and clustering coefficient and then calculates a weight factor for each node, which determines the order in which seed nodes are selected. Based on the seed selection algorithm, we design a local expansion strategy that expands each community by optimizing a fitness function. Finally, community merging and isolated node adjustment strategies are adopted to obtain the final communities. Experimental results show that the proposed algorithm achieves better community partitioning results than other state-of-the-art algorithms.

1. Introduction

Complex networks are ubiquitous in the real world, such as social networks, academic cooperation networks, the World Wide Web, and biological networks [1]. They are generally composed of nodes (individuals) and edges (relationships between individuals). For example, in social networks, nodes represent people and edges represent relationships between people. Although these networks belong to different fields, they follow the same laws. (1) Small-world effect: complex networks have short average path lengths and large clustering coefficients. (2) Scale-free property: the degrees of the nodes in the network obey a power-law distribution. (3) Community structure: the network can be divided into multiple groups whose internal edges are relatively dense, while the connections between groups are relatively sparse.

Community structure is one of the most important structural features of complex networks [2]. It is ubiquitous in real-world complex networks. For example, in social networks, individuals with common interests have closer relationships and form communities of common interest, so community detection can help predict the interests of new network users. In protein networks [3], proteins with the same or similar functions constitute a community, so community detection can identify the group to which an unfamiliar protein belongs and thus help discover its function. In academic cooperation networks, scholars who have similar research directions or have participated in similar projects constitute a community, so community detection can support cross-research projects, for example between statistics departments.

According to its characteristics, community structure is divided into two categories: nonoverlapping communities [4] and overlapping communities [5, 6]. In a nonoverlapping community structure, each node in the network belongs to only one community, and each community exists independently. In the real world, however, network communities are often not independent. Different communities usually share some nodes; such communities are called overlapping communities, and the nodes shared among communities are called overlapping nodes. Figure 1 shows a network with nonoverlapping communities: the two communities have no nodes in common. Figure 2 shows an overlapping community structure, in which the black circle represents an overlapping node.

Overlapping community detection in complex networks has attracted the attention of many scholars and achieved many results [7], such as LFM [8], COPRA [9], and LINK [10]. However, most overlapping community detection algorithms lack effective seed selection and community optimization methods and therefore often produce communities of low accuracy [11]. To solve these problems, this paper proposes an overlapping community detection algorithm based on information fusion. The main contributions of this paper are as follows:
(i) We propose an overlapping community detection algorithm based on information fusion, which improves the quality of community detection through an effective seed selection method and community optimization methods.
(ii) We propose a seed selection method that fuses degree and clustering coefficient. The method combines the weight factor, degree, and clustering coefficient to calculate the node importance, which ensures that the seed has a large total node influence and that the nodes inside the seed are highly similar.
(iii) We verify the algorithm's performance on synthetic networks and real networks. The experimental results show that, compared with state-of-the-art algorithms, the proposed method finds more accurate community structures.

The remainder of this paper is organized as follows: in Section 2, the related work of locally extended overlapping community detection method is introduced; Section 3 describes the implementation of overlapping community detection algorithm based on information fusion; Section 4 gives the specific experimental results; and finally, the research work of this paper is summarized in Section 5.

2. Related Work

Overlapping community detection based on local expansion is one of the most important approaches to overlapping community detection in large-scale networks [12]. It consists of two steps. First, some nodes or node sets in the network are selected as the seeds of the communities. Second, each seed is expanded outward by optimizing a fitness function (optimization function) until a termination condition is met, at which point a community is formed. The usual termination condition is that the fitness function reaches a local optimum. Formally, given an undirected network $G = (V, E)$, a set of seed nodes $S$, and a fitness function $f(C)$, the goal of community expansion is to find a subgraph $C$ with $S \subseteq C$ such that $f(C)$ reaches a local optimum [13]. The subgraph $C$ is the community expanded from $S$. Many community quality functions can serve as the fitness function of community expansion, such as modularity [14], subgraph density [15], centrality [16], conductance [17], and edge-surplus [18].

Since the expansion process of each seed is independent, an overlapping community structure is formed when two seeds are expanded to the same node. Moreover, the expansion process generally needs only local network information, which makes the method efficient and well suited to large-scale networks. Lancichinetti et al. [19] proposed the LFM algorithm, a typical representative of the local expansion method. The fitness of a subgraph $G$ of the network is defined as follows:

$$f_G = \frac{k_{in}^{G}}{\left(k_{in}^{G} + k_{out}^{G}\right)^{\alpha}}, \tag{1}$$

where $k_{in}^{G}$ represents the total internal degree of all nodes in the subgraph and $k_{out}^{G}$ is the total external degree. The symbol $\alpha$ is an adjustable parameter. The fitness of a node $A$ relative to a subgraph $G$ is defined as follows:

$$f_G^{A} = f_{G + \{A\}} - f_{G - \{A\}}, \tag{2}$$

where $G + \{A\}$ and $G - \{A\}$ denote the subgraph with node $A$ included and excluded, respectively. The LFM algorithm first randomly selects a node in the network as an initial subgraph. It then repeatedly adds nodes around the subgraph and removes nodes inside the subgraph according to formula (1) so as to increase the fitness of the subgraph, until the fitness no longer changes. Next, it selects a node that does not belong to any community and repeats the above expansion process until all nodes in the network belong to at least one community.
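The two LFM fitness definitions above can be illustrated with a minimal Python sketch. The adjacency-dictionary representation, the function names, and the toy graph are assumptions made for illustration; they are not taken from the LFM implementation.

```python
def subgraph_fitness(adj, members, alpha=1.0):
    """Return k_in / (k_in + k_out) ** alpha for the node set `members`."""
    members = set(members)
    k_in = 0   # total internal degree (each internal edge counted from both ends)
    k_out = 0  # total external degree (edges leaving the subgraph)
    for u in members:
        for v in adj[u]:
            if v in members:
                k_in += 1
            else:
                k_out += 1
    total = k_in + k_out
    return k_in / total ** alpha if total else 0.0


def node_fitness(adj, members, node, alpha=1.0):
    """Fitness of `node` relative to `members`: f(G + {A}) - f(G - {A})."""
    members = set(members)
    return (subgraph_fitness(adj, members | {node}, alpha)
            - subgraph_fitness(adj, members - {node}, alpha))


# Hypothetical toy graph: a triangle 0-1-2 with a pendant path 2-3-4.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(subgraph_fitness(toy_adj, {0, 1, 2}))
print(node_fitness(toy_adj, {0, 1, 2}, 3))
```

A positive node fitness means that including the node raises the fitness of the subgraph, which is the criterion LFM uses when deciding whether to add or remove a node.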

Lee et al. [20] proposed the GCE algorithm, which selects the maximal cliques with at least K nodes as seeds and uses the same fitness function as the LFM algorithm to expand each community. Baumes et al. [21] proposed two overlapping community detection algorithms based on local expansion: IS (Iterative Scan) and RARE (Rank Removal). These two algorithms are mainly oriented to directed networks. The RARE algorithm takes some nonadjacent subgraphs of the network as seeds, removes the moderately large nodes in the network, and then adds each removed node to every community whose density function it increases. Andersen et al. [22] proposed a community expansion method based on PageRank. The algorithm takes an initial node $u$ in the network as the object to be expanded. First, the approximate PageRank vector $p$ starting from node $u$ is calculated, and then a sweep technique is used to select the node set with the best conductance around node $u$. The nodes in the network are sorted in descending order of $p(v)/d(v)$ (where $d(v)$ represents the degree of node $v$), which generates a node sequence. The first $k$ nodes in the sequence form a set. Different values of $k$ correspond to different sets, and the set with the lowest conductance is the community expanded from node $u$.
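The sweep step described above can be sketched as follows. The sketch assumes that an approximate PageRank vector is already available as a dictionary `p`; computing that vector is not shown, and the values used in the toy example are made up for illustration only.

```python
def conductance(adj, nodes):
    """Cut size divided by the smaller of the two volumes."""
    nodes = set(nodes)
    cut = sum(1 for u in nodes for v in adj[u] if v not in nodes)
    vol = sum(len(adj[u]) for u in nodes)
    total = sum(len(nbrs) for nbrs in adj.values())
    denom = min(vol, total - vol)
    return cut / denom if denom else float("inf")


def sweep_cut(adj, p):
    """Return the prefix of the p(v)/d(v) ordering with the lowest conductance."""
    order = sorted((v for v in p if p[v] > 0 and adj[v]),
                   key=lambda v: p[v] / len(adj[v]), reverse=True)
    best_set, best_phi = set(), float("inf")
    prefix = set()
    for v in order:
        prefix.add(v)
        phi = conductance(adj, prefix)
        if phi < best_phi:
            best_set, best_phi = set(prefix), phi
    return best_set, best_phi


# Hypothetical toy graph: two triangles joined by the edge 2-3.
sweep_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
             3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
# Made-up PageRank-like scores concentrated around node 0 (assumption).
toy_p = {0: 0.30, 1: 0.30, 2: 0.25, 3: 0.10, 4: 0.03, 5: 0.02}
community, phi = sweep_cut(sweep_adj, toy_p)
print(sorted(community), phi)
```

Recomputing the conductance for every prefix keeps the sketch short; an efficient implementation would update the cut and the volume incrementally as each node is added.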

Based on the above work, Silistre et al. [23] proposed two seed selection strategies: GRACLUS CENTERS and SPREAD HUBS. Both methods select a single node as a seed and use the community expansion method proposed by Andersen et al. [22]. Zhang et al. [24] proposed the CFCD algorithm and defined the core similarity. On this basis, they defined the core centrality of nodes and the core fitness of communities. The algorithm selects as a seed the node with the largest centrality that is not in the core of any existing community and takes the set composed of this node and its adjacent nodes as the initial community. However, for lack of fast and effective seed selection and community optimization methods, these algorithms often produce communities of low accuracy. To solve these problems, we propose a seed selection method based on node importance, which fuses degree and clustering coefficient. The method ensures that the seeds have a large total node influence and that the nodes inside each seed are highly similar.

3. Proposed Method

In this section, we introduce the implementation of the algorithm in detail. The main steps of the overlapping community detection algorithm based on information fusion (OCDIF) are seed node selection (Section 3.1), local community expansion (Section 3.2), community merging (Section 3.3), and isolated node adjustment (Section 3.4). The principle and process of each step are described in detail as follows.

3.1. Seed Node Selection

In the existing overlapping community detection methods based on local expansion, seed nodes are selected at random. Random selection cannot guarantee a good community structure, and it makes the result of community detection unstable. In this paper, we take the node influence value as the node importance index and select seed nodes in order of their influence values. A large influence value indicates that the node occupies an important position in the network, so a community expanded from this node has a certain reference value. Moreover, because the selection order of seed nodes is fixed, the community detection results are stable. This overcomes the poor stability of existing overlapping community detection methods based on local expansion.

The more neighbors a node has in the network, the greater its influence, so the propagation ability of a node mainly depends on the sum of its direct neighbor degrees. However, considering the local properties of network nodes, different nodes with the same sum of direct neighbor degrees may have different propagation capabilities. Therefore, in addition to the node degree, other node attributes need to be considered. In particular, the topological connections between a node and its neighbors also affect its propagation ability: the greater the clustering coefficient, the greater the importance of the node.

Combining the degree and clustering coefficient of nodes with entropy weights, we propose a seed selection method that fuses degree and clustering coefficient. First, the concepts of degree and clustering coefficient are introduced. The degree of a node in an undirected network is defined as the number of nodes directly connected to it. Given an undirected network $G = (V, E)$ with adjacency matrix $A = (a_{ij})$, the degree of node $i$ is

$$D(i) = \sum_{j=1}^{n} a_{ij}, \tag{4}$$

which indicates the ability of the node to communicate directly with other nodes. The larger the $D(i)$ value, the more important the node is. The clustering coefficient of node $i$ is

$$CC(i) = \frac{2 e_i}{k_i (k_i - 1)}, \tag{5}$$

where $k_i$ represents the total number of neighbors of node $i$ and $e_i$ represents the actual number of undirected edges between those neighbors.
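As a small worked illustration of the two quantities just defined, the following Python sketch computes the degree and the clustering coefficient from an adjacency dictionary. The data layout, the function names, and the toy graph are assumptions for illustration; nodes with fewer than two neighbors are given a clustering coefficient of 0 by convention.

```python
def degree(adj, i):
    """Number of nodes directly connected to node i."""
    return len(adj[i])


def clustering_coefficient(adj, i):
    """2 * e_i / (k_i * (k_i - 1)), with 0.0 for nodes with fewer than two neighbors."""
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    # e_i: number of edges that actually exist between the neighbors of i
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if nbrs[b] in adj[nbrs[a]])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy graph: a triangle 0-1-2 with a pendant path 2-3-4.
cc_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
for node in sorted(cc_adj):
    print(node, degree(cc_adj, node), clustering_coefficient(cc_adj, node))
```

In this toy graph, node 2 has the largest degree but a lower clustering coefficient than nodes 0 and 1, which is the kind of disagreement between the two indicators that motivates fusing them.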

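Firstly, the degree and clustering coefficient of the nodes are normalized to dimensionless values, and the matrix $R$ is created from the normalized values, as shown in

$$R = \begin{pmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \\ \vdots & \vdots \\ r_{n1} & r_{n2} \end{pmatrix}, \tag{6}$$

where $n$ represents the number of nodes in the network, $r_{j1}$ represents the normalized value of the degree of node $j$, and $r_{j2}$ represents the normalized value of the clustering coefficient of node $j$.

Secondly, the entropy value $E_i$ of indicator $i$ ($i = 1$ for degree and $i = 2$ for clustering coefficient) is calculated in the standard form of the entropy weight method, as shown in

$$E_i = -\frac{1}{\ln n} \sum_{j=1}^{n} p_{ji} \ln p_{ji}, \qquad p_{ji} = \frac{r_{ji}}{\sum_{j=1}^{n} r_{ji}}. \tag{7}$$

The weights of degree and clustering coefficient are obtained from the entropy values $E_i$, as shown in

$$w_i = \frac{1 - E_i}{\sum_{k=1}^{2} \left(1 - E_k\right)}, \qquad i = 1, 2.$$

The weight factor of node $i$ is calculated from the weights obtained above. The specific calculation formula is as follows:

$$CLC(i) = w_1 D(i) + w_2 CC(i),$$

where $CLC(i)$ represents the weight factor of node $i$, $D(i)$ and $CC(i)$ represent the degree value and the processed clustering coefficient of node $i$, respectively, and $w_1$ and $w_2$ represent the contribution weights of the degree value and the clustering coefficient, respectively.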

Finally, the weight factor defined above serves as the node importance CLC. The nodes are arranged in descending order of CLC and stored in the vector X. The algorithm then selects from X, in order, a node A that has not been allocated to any community and treats node A as the initial local community C. The pseudocode of the node importance evaluation is shown in Algorithm 1.

Algorithm 1: Node importance evaluation.
Input: a network G(V, E), the number of nodes in the network n
Output: the importance of each node
(1) Initialize D = ∅, CC = ∅
(2) for i = 1 to n do
(3)   Calculate D(i) using formula (4)
(4) end for
(5) for i = 1 to n do
(6)   Calculate CC(i) using formula (5)
(7) end for
(8) Create matrix R using formula (6)
(9) for i = 1 to 2 do
(10)   Calculate the entropy $E_i$ using formula (7) and the corresponding weight $w_i$
(11) end for
(12) for i = 1 to n do
(13)   Calculate CLC(i) using the weight factor formula
(14) end for
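The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under several assumptions that the text does not fix: min-max normalization is used to make the indicators dimensionless, the entropy and weights follow the standard entropy weight method, a constant indicator is treated as carrying no information, and the weight factor combines the normalized values of both indicators. All function names are hypothetical.

```python
import math


def clustering_coefficient(adj, i):
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if nbrs[b] in adj[nbrs[a]])
    return 2.0 * links / (k * (k - 1))


def min_max(values):
    """Min-max normalization (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def entropy_weights(columns):
    """Standard entropy weight method over the indicator columns of R."""
    n = len(columns[0])
    spreads = []
    for col in columns:
        total = sum(col)
        if total == 0 or n < 2:
            entropy = 1.0  # assumption: a constant indicator carries no information
        else:
            entropy = -sum((r / total) * math.log(r / total)
                           for r in col if r > 0) / math.log(n)
        spreads.append(max(0.0, 1.0 - entropy))
    s = sum(spreads)
    if s == 0:
        return [1.0 / len(columns)] * len(columns)
    return [d / s for d in spreads]


def rank_seeds(adj):
    """Return the seed order, the CLC score of each node, and the two weights."""
    nodes = sorted(adj)
    deg = min_max([len(adj[v]) for v in nodes])
    cc = min_max([clustering_coefficient(adj, v) for v in nodes])
    w_deg, w_cc = entropy_weights([deg, cc])
    clc = {v: w_deg * d + w_cc * c for v, d, c in zip(nodes, deg, cc)}
    order = sorted(nodes, key=lambda v: (-clc[v], v))  # ties broken by node id
    return order, clc, (w_deg, w_cc)


# Hypothetical toy graph: a triangle 0-1-2 with a pendant path 2-3-4.
seed_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
order, clc, weights = rank_seeds(seed_adj)
print(order, weights)
```

Breaking ties by node identifier keeps the seed order deterministic, in line with the stability argument made for a fixed selection order.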

3.2. Local Community Expansion

This section introduces the process of local community expansion. The expansion of a local community stops when the fitness (adaptation) function reaches a local maximum. The fitness function is

$$f_C = \frac{k_{in}^{C}}{\left(k_{in}^{C} + k_{out}^{C}\right)^{\alpha}},$$

where $k_{in}^{C}$ and $k_{out}^{C}$ represent the total internal and external degrees of the local community $C$, respectively, and $\alpha$ is a parameter greater than 0 that is used to adjust the community scale. Lancichinetti et al. pointed out that the result of community detection is best when $\alpha$ is 0.9, so all experiments in this paper set $\alpha$ to 0.9. Each time the algorithm expands the local community, a marker bit is set on every neighbor that is added, which indicates that the node has joined the community.
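The expansion loop can be illustrated with a simplified greedy sketch in Python. It assumes that, at each step, the neighbor giving the largest positive fitness gain is added and that expansion stops when no neighbor improves the fitness; the removal of nodes, the marker bits, and the "similar"-node shortcut are omitted. The function names and the toy graph are hypothetical.

```python
def fitness(adj, members, alpha=0.9):
    """k_in / (k_in + k_out) ** alpha for the node set `members`."""
    k_in = sum(1 for u in members for v in adj[u] if v in members)
    k_out = sum(1 for u in members for v in adj[u] if v not in members)
    total = k_in + k_out
    return k_in / total ** alpha if total else 0.0


def expand(adj, seed, alpha=0.9):
    """Greedily grow a community from `seed` until the fitness stops improving."""
    community = {seed}
    while True:
        frontier = {v for u in community for v in adj[u]} - community
        current = fitness(adj, community, alpha)
        best_gain, best_node = 0.0, None
        for v in sorted(frontier):
            gain = fitness(adj, community | {v}, alpha) - current
            if gain > best_gain:
                best_gain, best_node = gain, v
        if best_node is None:   # local maximum of the fitness function
            return community
        community.add(best_node)


# Hypothetical toy graph: two triangles joined by the edge 2-3.
exp_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(sorted(expand(exp_adj, 0)), sorted(expand(exp_adj, 5)))
```

On this toy graph each seed recovers its own triangle, because adding the node across the bridge lowers the fitness.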

If a node is connected only to the current local community, it is considered to have the same label as that community. Based on this idea, we introduce the concept of "similar" nodes, as shown in Figure 3. The edges of node 2 connect not only to nodes inside the existing local community A but also to nodes outside A. Therefore, node 2 cannot be added to A directly; its fitness function value must be calculated before deciding whether it can be added. All edges of node 1, by contrast, connect to nodes in A, so node 1 is added to A directly and its fitness function value is not calculated. This shortcut reduces the running time of the algorithm and improves the quality of community detection.
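The shortcut for "similar" nodes can be sketched as a split of the community's neighbors into two groups. The function name and the toy graph are hypothetical; the graph only loosely mirrors the situation described above, with node 1 attached solely to the community and node 2 also attached to an outside node.

```python
def split_frontier(adj, community):
    """Split the neighbors of `community` into nodes that can join directly
    (all of their edges lead into the community) and nodes whose fitness
    must be evaluated first."""
    community = set(community)
    frontier = {v for u in community for v in adj[u]} - community
    direct = {v for v in frontier if set(adj[v]) <= community}
    to_evaluate = frontier - direct
    return direct, to_evaluate


# Hypothetical toy graph: community A = {10, 11, 12}.
sim_adj = {
    10: {11, 12, 1}, 11: {10, 12, 1}, 12: {10, 11, 2},
    1: {10, 11},        # every edge of node 1 leads into A
    2: {12, 9},         # node 2 also has an edge to the outside node 9
    9: {2},
}
direct, to_evaluate = split_frontier(sim_adj, {10, 11, 12})
print(direct, to_evaluate)
```

Only the nodes in `to_evaluate` need a fitness computation, which is where the saving in running time comes from.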

3.3. Community Merging

After the expansion of local communities, the result contains many small-scale local communities. Real networks, however, follow the trend of "birds of a feather flock together" and form large-scale community structures rather than scattered small communities. To obtain an ideal community detection result, these small communities need to be merged. Silistre et al. [23] gave the concept of community overlap, which is used to judge whether two communities can be merged into one. The greater the degree of overlap, the more reasonable it is to merge the two communities. The community overlap $OS(C_i, C_j)$ of two communities $C_i$ and $C_j$ is calculated from $V_i$ and $V_j$, the sets of nodes in $C_i$ and $C_j$, and from $E_i$ and $E_j$, the sets of inner edges of $C_i$ and $C_j$.

The degree of community overlap determines whether two communities can be merged into one. For this purpose, the average value of community overlap, $\overline{OS}$, is calculated.

The algorithm first judges whether the overlap degree of any two communities is greater than $\overline{OS}$. When the overlap degree satisfies $OS(C_i, C_j) > \overline{OS}$, the two communities are merged.
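The merging rule can be sketched in Python as follows. The overlap measure used here, the number of shared nodes relative to the smaller community, is only a stand-in chosen for illustration and is not the overlap formula of the paper; any other measure can be passed through the `overlap` parameter. Taking the average over all community pairs and merging transitively through a union-find structure are likewise assumptions of this sketch.

```python
from itertools import combinations


def node_overlap(c1, c2):
    """Stand-in overlap (assumption): shared nodes relative to the smaller community."""
    return len(c1 & c2) / min(len(c1), len(c2))


def merge_communities(communities, overlap=node_overlap):
    """Merge every pair of non-empty communities whose overlap exceeds the average."""
    comms = [set(c) for c in communities]
    pairs = list(combinations(range(len(comms)), 2))
    if not pairs:
        return comms
    scores = {(i, j): overlap(comms[i], comms[j]) for i, j in pairs}
    avg = sum(scores.values()) / len(scores)

    parent = list(range(len(comms)))  # union-find over community indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in scores.items():
        if score > avg:
            parent[find(j)] = find(i)

    merged = {}
    for idx, c in enumerate(comms):
        merged.setdefault(find(idx), set()).update(c)
    return list(merged.values())


# Hypothetical local communities produced by the expansion step.
local = [{1, 2, 3, 4}, {3, 4, 5}, {8, 9, 10}]
result = merge_communities(local)
print(result)
```

Because the threshold is the average overlap of the current result, it adapts to the network and needs no manually tuned parameter.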

3.4. Isolated Nodes Adjustment

After community merging, there are still isolated nodes, and it is necessary to judge whether each isolated node should become a community of its own. Two cases are distinguished. In the first case, shown in Figure 4, node 1 is an isolated node that is not connected to any other node. Such a node can exist as an independent community.

In the second case, shown in Figure 5, node 1 is an isolated node but is connected to nodes 2 and 3. The similarity $sim(u, v)$ between the isolated node $u$ and each of its neighbors $v$ is calculated according to formula (12), where $k_u$ and $k_v$ represent the degrees of node $u$ and node $v$, respectively. Then, the average similarity $\overline{sim}(u)$ between the node and all of its neighbor nodes is calculated.

If $sim(u, v) > \overline{sim}(u)$, the isolated node is allocated to the community of neighbor $v$. The isolated node may therefore be allocated to multiple communities, which is in line with the requirements of overlapping communities. The pseudocode of community merging and isolated node adjustment is shown in Algorithm 2.

Algorithm 2: Community merging and isolated node adjustment.
Input: local subgraph LC
Output: community detection results OC
(1) Community merging
(2) avgOS = calculateAvgOS(LC)
(3) OC = []
(4) for i in LC do
(5)   j = i + 1
(6)   for j in LC do
(7)     if OS(i, j) > avgOS then
(8)       OC.append(i ∪ j)
(9)     end if
(10)   end for
(11) end for
(12) Isolated nodes adjustment
(13) for i in OC do
(14)   if len(i) == 1 and otherSide(i[0]) then
(15)     u = i[0]
(16)     neighbors[] = findNeighbors(u)
(17)     avgSim = average of sim(u, v) over all v in neighbors[]
(18)     for v in neighbors[] do
(19)       if sim(u, v) > avgSim then
(20)         addIsolateNode(u, v, OC)
(21)       end if
(22)     end for
(23)   end if
(24) end for
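The isolated node adjustment in the second half of Algorithm 2 can be sketched in Python as follows. The similarity function is a stand-in: a Salton-style index built from common neighbors and node degrees is used purely for illustration and is not the similarity formula of the paper; another function can be supplied through the `sim` parameter. The strict "greater than the average" test and all names are assumptions of this sketch.

```python
import math


def salton(adj, u, v):
    """Stand-in similarity (assumption): common neighbors over sqrt(k_u * k_v)."""
    denom = math.sqrt(len(adj[u]) * len(adj[v]))
    return len(adj[u] & adj[v]) / denom if denom else 0.0


def adjust_isolated(adj, communities, sim=salton):
    """Assign each singleton community to the communities of its most similar
    neighbors, or keep it as an independent community."""
    comms = [set(c) for c in communities]
    result = [c for c in comms if len(c) > 1]
    for c in comms:
        if len(c) != 1:
            continue
        (node,) = c
        sims = {v: sim(adj, node, v) for v in adj[node]}
        placed = False
        if sims:
            avg = sum(sims.values()) / len(sims)
            for v, score in sims.items():
                if score > avg:
                    for target in result:
                        if v in target:
                            target.add(node)
                            placed = True
        if not placed:
            result.append({node})  # no neighbors, or no neighbor above the average
    return result


# Hypothetical toy graph: node 1 is left over after merging, node 9 has no edges.
iso_adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 5, 6, 7}, 4: {2},
           5: {3, 6, 7}, 6: {3, 5, 7}, 7: {3, 5, 6}, 9: set()}
adjusted = adjust_isolated(iso_adj, [{2, 4}, {3, 5, 6, 7}, {1}, {9}])
print(adjusted)
```

Since a node is added to every community that contains a sufficiently similar neighbor, the same isolated node can end up in several communities, as the overlapping setting requires.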

4. Experimental Results and Analysis

The OCDIF algorithm is implemented in Python. The seed selection algorithm proposed in this paper is compared with similar methods on real networks, and the OCDIF algorithm is compared with other overlapping community detection algorithms on real large-scale networks. The experimental environment is an Intel Core(TM) i5-4590 3.3 GHz CPU with 16 GB of memory.

4.1. Selection Effect of Seed Node

This section verifies the effect and accuracy of the node importance ranking algorithm. The experimental data sets are the social friendly network (So-Colgate) [25] and the power grid network (PowerGrid) [25]. The social friendly network consists of 3,482 nodes and 14,241 edges, and the power grid network consists of 4,940 nodes and 6,595 edges. For these two networks of different sizes, the node importance is calculated separately, and the resulting ranking is compared with the ranking calculated by the SIR propagation model to determine the accuracy of each method (ControlRank [26], MBA [27], NIBNA [28], ODEF [28], and CRRank [28]).

Table 1 shows the experimental results. On both the So-Colgate network and the PowerGrid network, the accuracy of the proposed seed selection algorithm is slightly better than that of the other algorithms within part of the given propagation probability range. The MBA algorithm also performs well for some propagation probabilities, but overall its accuracy is lower than that of the proposed algorithm. In terms of average accuracy, the advantage of the proposed algorithm is even more obvious. On the So-Colgate network, compared with the ControlRank, MBA, NIBNA, ODEF, and CRRank algorithms, the proposed algorithm improves the average accuracy by 18.4%, 34.5%, 37.1%, 31.5%, and 20.3%, respectively. On the PowerGrid network, compared with the same algorithms, it improves the average accuracy by 10.4%, 4.8%, 24.4%, 22%, and 2.1%, respectively.

4.2. Community Detection Results

This section tests the performance of OCDIF on synthetic networks and real networks. As comparison objects, we select overlapping community detection algorithms based on the global structure and the local structure of static networks (CLPA [29], GREESE [30], ILPA [31], LMD [32], McFFMM [33], MCMOEA [34], MPEA [35], and SSLPA [36]). We use the following two common indicators to evaluate the quality of community detection: (1) F1-Score (average F1 value) and (2) NMI (normalized mutual information).
(1) F1-Score: this standard measures the accuracy of community detection by quantifying the degree of correspondence between the communities detected by the algorithm and the real communities. Given the real community structure $C^{*}$ of a network and the detected community structure $\hat{C}$, the average F1 value is defined as follows:

$$\overline{F1} = \frac{1}{2}\left(\frac{1}{|C^{*}|}\sum_{C_i \in C^{*}} \max_{\hat{C}_j \in \hat{C}} F1(C_i, \hat{C}_j) + \frac{1}{|\hat{C}|}\sum_{\hat{C}_j \in \hat{C}} \max_{C_i \in C^{*}} F1(C_i, \hat{C}_j)\right),$$

where $F1(C_i, \hat{C}_j)$ is the harmonic mean of the Precision and Recall between the two communities:

$$F1(C_i, \hat{C}_j) = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}.$$

(2) Generalized normalized mutual information (NMI): this standard is proposed to measure the accuracy of overlapping community detection algorithms. NMI evaluates the accuracy of an algorithm by quantifying the similarity between the communities discovered by the algorithm and the real communities. The value range of NMI is [0, 1]. The larger the value, the higher the quality of the detected communities.
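The average F1 value can be computed with a short Python sketch. It assumes the commonly used symmetric form, in which every real community is matched with its best detected community and vice versa and the two averages are combined with equal weight; the function names and the example communities are hypothetical.

```python
def f1(real, detected):
    """Harmonic mean of precision and recall between two node sets."""
    real, detected = set(real), set(detected)
    common = len(real & detected)
    if common == 0:
        return 0.0
    precision = common / len(detected)
    recall = common / len(real)
    return 2 * precision * recall / (precision + recall)


def average_f1(real_comms, detected_comms):
    """Symmetric average of best-match F1 scores (assumed definition)."""
    if not real_comms or not detected_comms:
        return 0.0
    real_side = sum(max(f1(r, d) for d in detected_comms)
                    for r in real_comms) / len(real_comms)
    detected_side = sum(max(f1(r, d) for r in real_comms)
                        for d in detected_comms) / len(detected_comms)
    return 0.5 * (real_side + detected_side)


# Hypothetical example: the second community misses one node.
real_example = [{1, 2, 3}, {4, 5, 6}]
detected_example = [{1, 2, 3}, {4, 5}]
print(average_f1(real_example, detected_example))
```

Matching in both directions penalizes both missed real communities and spurious detected ones, which a one-sided average would not do.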

4.3. Real Network

This paper uses the benchmark data set provided by SNAP [37] to conduct experiments. The networks provided by this data set contain real community structures, which is convenient for testing the algorithm. Table 2 shows the data of the four large-scale networks selected in this article. These four networks are briefly introduced as follows:
DBLP: it is a collaborative network of authors. Each node in the network represents an author. If two authors have published at least one article together, then there is an edge between them. A journal or conference represents a community, and the community is composed of the authors who have published articles in that journal or conference.
Amazon: it is a commodity network. Each node in the network represents a commodity. If two commodities are frequently purchased at the same time, there is an undirected edge between them. Each product category provided by Amazon corresponds to a real community.
YouTube: it is a YouTube social network. Each node represents a user of the network. If two users establish a friendship, then there is an edge between them. In this network, a community refers to a group created by users, and a community is the collection of users who join the group.
Orkut: it is an Orkut social network. Similar to the YouTube network, the nodes in this network represent users, and the edges represent friendships between users. A community also refers to a group created by a user, and a community is the collection of users who have joined the group.

Figures 6 and 7 show the average F1 value and the NMI value of the tested algorithms on the real networks. The experimental results show that, in terms of the accuracy of community detection, the OCDIF algorithm outperforms all the overlapping community detection algorithms based on global information and local information. On the DBLP and Amazon networks, the OCDIF algorithm obtained the highest average F1 value and NMI value. Our algorithm is also the only local algorithm among the tested algorithms that exceeds all global algorithms. The ILPA algorithm performed the worst, and the CLPA algorithm performed close to the OCDIF algorithm. On the DBLP network, the average F1 value of the OCDIF algorithm is 22.2% higher than ILPA, 20% higher than MOEA, and 6.6% higher than CLPA. On the Amazon network, the average F1 value of the OCDIF algorithm is 18.2% higher than ILPA, 17.3% higher than MOEA, and 6.4% higher than CLPA. On the DBLP network, the NMI value of the OCDIF algorithm is 27.8% higher than ILPA, 5.6% higher than MOEA, and 11.1% higher than CLPA. On the Amazon network, the NMI value of the OCDIF algorithm is 24.2% higher than ILPA, 6.1% higher than MOEA, and 9.1% higher than CLPA.

On the YouTube and Orkut networks, the MCMOEA and SSLPA algorithms perform relatively well, with accuracy close to that of the proposed algorithm, while the GREESE and MOEA algorithms perform the worst. On the YouTube network, the average F1 value of the OCDIF algorithm is 30.8% higher than MOEA, 7.7% higher than MCMOEA, and 15.4% higher than SSLPA. On the Orkut network, the average F1 value of the OCDIF algorithm is 20.8% higher than MOEA, 12.5% higher than MCMOEA, and 6.25% higher than SSLPA. On the YouTube network, the average NMI value of the OCDIF algorithm is 11.1% higher than MOEA, 16.7% higher than MCMOEA, and 10% higher than SSLPA. On the Orkut network, the average NMI value of the OCDIF algorithm is 18.2% higher than MOEA, 19.1% higher than MCMOEA, and 26.4% higher than SSLPA. Therefore, compared with these overlapping community detection algorithms for large-scale networks, the OCDIF algorithm obtains a more accurate community structure in real networks. The experimental results show that, compared with the current mainstream overlapping community detection algorithms, the OCDIF algorithm can complete overlapping community detection on large-scale networks quickly and with high quality.

4.4. Artificial Synthetic Network

In this section, the LFR overlapping benchmark proposed by Lancichinetti and Fortunato is used to generate the experimental networks. This benchmark is widely used to evaluate the performance of overlapping community detection algorithms. The node degrees and the community sizes of the generated networks follow power-law distributions. We use the benchmark to generate 4 groups of networks, which share the following parameter values: $N$ = 10,000, $k$ = 15, $k_{max}$ = 50, $c_{min}$ = 10, and $c_{max}$ = 50; the other parameter values are shown in Table 3. Each group of networks contains 6 types of networks, in which the number of overlapping nodes $on$ takes values in the range [0, 0.5 N]. The number of communities to which each overlapping node belongs, $om$, is set to 2 and 4, respectively, and the mixing parameter $\mu$ is set to 0.1 and 0.3, respectively, representing a low-mix network and a high-mix network.

This section also uses two evaluation indicators, the average F1 value and the generalized normalized mutual information (NMI), to analyze the accuracy of community detection by the OCDIF algorithm.

Figures 8 and 9 show the NMI value and the average F1 value on the given synthetic networks. As the number of overlapping nodes between communities increases, the community structure becomes more ambiguous, and the communities become harder to find. On the four groups of networks, the accuracy of every tested algorithm decreases to a different degree. The data show that the seed selection method proposed in this paper is more stable when dealing with networks with fuzzy community structures. Compared with the other algorithms, it is superior both in the accuracy of the expanded communities and in the stability with which it handles networks with fuzzy community structure. The seed selection method proposed in this paper therefore yields a more precise community structure.

5. Conclusion

We propose an overlapping community detection algorithm based on information fusion. The method is divided into four steps: seed node selection, local community expansion, community merging, and isolated node adjustment. Considering the local nature of network nodes, different nodes with the same direct neighbor degree may have different influences. We therefore propose a seed selection method that fuses degree and clustering coefficient, which ensures that the seeds have a large total node influence and that the nodes inside each seed are highly similar. The experimental results show that the algorithm improves the efficiency of community detection and obtains more accurate results.

Most networks in real life are not static and change over time, for example through the removal and addition of edges between nodes. As the nodes and edges in the network change, the community structure of the network changes accordingly. However, most of the existing community detection algorithms study static networks, so research on dynamic networks is necessary and has great practical significance.

Data Availability

Data will be available at http://snap.stanford.edu/data.

Conflicts of Interest

The authors declare that they have no conflicts of interest.