Abstract

To find local community structure more effectively, we propose an improved local community detection algorithm, ILCDSP, which improves the node selection strategy by setting a selection probability value for every candidate node. ILCDSP assigns different candidate nodes different selection probability values, which reflect the degree to which each node should be chosen. With this strategy, the proposed algorithm can detect local communities effectively, since it steers the search in a better direction and reduces the risk of being trapped in a local optimum. Experimental results on both synthetic and real networks show that the local communities detected by our algorithm are of higher quality than those detected by the compared methods.

1. Introduction

In the real world, many complex systems can be described by various kinds of networks, such as interpersonal networks, biological networks, neural networks, social networks, and the WWW. In these networks, individuals are represented by nodes and are linked by particular relationships. A large number of studies reveal that underlying communities exist in most complex networks. Community detection, as a key technology for network analysis, can discover the hidden structures and functions in complex networks, and it is attracting considerable attention from researchers in various domains.

In recent years, a large number of community detection algorithms have been proposed, such as modularity optimization algorithms [1–3], spectral clustering algorithms [4, 5], hierarchical clustering algorithms [6, 7], and label propagation algorithms [8, 9]. Most of them focus on finding the global community structure, which contains all the possible communities. However, real networks such as the WWW are large-scale and even dynamic. The information of the whole network is very difficult or even impossible to obtain, so these global algorithms are not suitable for detecting the community structures of large-scale networks with incomplete information. Because of these restrictions, the problem of local community detection has drawn increasing attention from scholars.

Some local community detection algorithms based on limited information have been proposed, such as Clauset [10], LWP [11], and LS [12]. These algorithms start from a source node and gradually add nodes to the local community until they obtain the local community containing the source node. The nodes in a community can usually be divided into core nodes and edge nodes. A core node is closely connected with the interior nodes of the community in which it is located, whereas an edge node is only sparsely connected with them. The effect of an algorithm depends on the choice of the source node: if a core node is the source node, the algorithm can usually recover a better local community structure. Most local community detection algorithms adopt a hill-climbing optimization strategy, adding at each step the node with the optimal value of the objective function, which makes them prone to falling into a local optimum. In the example of Figure 1, the source node is an edge node of its own community but is more closely connected with a neighboring community. The next step may therefore add a node of the neighboring community to the local community, and the following steps may continue to gather nodes of that community. The final local community is then composed of the neighboring community plus the source node, so the community to which the source node really belongs cannot be obtained.

To solve the above problem, we propose an improved local community detection algorithm using selection probability (ILCDSP). The main idea of the algorithm is to set a selection probability for the candidate nodes at each step, so that nodes with a high selection probability are more likely to be chosen. ILCDSP screens nodes randomly instead of always selecting the node with the highest modularity, which helps lead the search in a better direction. It can thus avoid the situation in which a single misstep ruins all subsequent steps. The performance of the algorithm is verified on both real and simulated data sets. The evaluation measures used are Precision, Recall, and F-Score. The experimental results show that this algorithm, compared with Clauset and LWP, can find local communities of higher quality.

The paper is organized as follows. Section 2 introduces concepts and algorithms related to local community detection; Section 3 introduces the proposed algorithm ILCDSP; Section 4 verifies the performance of ILCDSP through experiments on real and simulated networks; and Section 5 gives the conclusion and discusses future work.

2.1. Problem Definition

The definition of local community detection was first proposed by Clauset [10]. The problem is usually defined as follows: there is an undirected graph $G = (V, E)$, where $V$ represents the node set of the graph and $E$ represents the edge set of the graph. The connectivity information of part of the nodes in the graph is known or available. As shown in Figure 2, the known local community is defined as area $D$, and the set of nodes outside $D$ that are connected with nodes in $D$ is defined as area $S$; the set of nodes in $D$ which are connected with nodes in $S$ is defined as the edge nodes area $B$, where any node in $B$ has at least one neighbor node in $S$. The rest of $D$ is the core nodes area $C$.

The task of local community detection is to build a local community $D$ from a source node. During the process, the algorithm merges the nodes of the neighbor set $S$ that meet its conditions into $D$ and removes the nodes in $D$ that do not meet them. Different algorithms use different conditions to select nodes, as shown in the next section.
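As a minimal sketch of these regions, the following Python snippet derives the neighbor area S, the edge nodes area B, and the core nodes area C from a known community D. The adjacency-dictionary representation, the function name `local_regions`, and the toy graph (two triangles joined by one edge) are assumptions made for illustration and do not come from the paper.

```python
def local_regions(adj, community):
    """Return (S, B, C) for a known local community D.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    D = set(community)
    # S: nodes outside D that are adjacent to at least one node in D
    S = {u for v in D for u in adj[v]} - D
    # B: nodes in D that have at least one neighbor in S
    B = {v for v in D if adj[v] & S}
    # C: the remaining (core) nodes of D
    C = D - B
    return S, B, C


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
S, B, C = local_regions(adj, {0, 1, 2})
print("S =", S, "B =", B, "C =", C)
```

On this toy graph, node 2 is the only edge node of the first triangle because it is the only one with a neighbor outside it.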

2.2. Related Algorithms

We review several effective approaches to exploring local community structure. These algorithms are presented below in order of publication date.

2.2.1. Clauset Algorithm

To solve the problem of local community detection, Clauset [10] defined the local community modularity $R$ and proposed a fast-converging greedy algorithm to find the local community with maximum modularity.

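Local community modularity $R$ is defined as follows:

$$R = \frac{\sum_{ij} B_{ij}\,\delta(i,j)}{\sum_{ij} B_{ij}},$$

where $i$ and $j$ represent node $v_i$ and node $v_j$, and $B_{ij} = 1$ if $v_i$ and $v_j$ are connected and either node is in the edge nodes area $B$, and $B_{ij} = 0$ otherwise; $\delta(i,j) = 1$ if $v_i \in B$ and $v_j \in D$ or vice versa, and $\delta(i,j) = 0$ otherwise.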

The process of the Clauset algorithm is similar to that of a web crawler. First, it starts from a source node, merges the source node into the subgraph $D$, merges all neighbor nodes of the source node into $S$, and then iterates the following steps.

Step 1. Calculate $\Delta R$ for all the nodes in $S$. $\Delta R$ is the increment of modularity $R$.

Step 2. Merge the node $v_i$ with the largest modularity increment into $D$.

Step 3. Merge all the new neighbor nodes of node $v_i$ into $S$ and then update $R$ and $B$.

The algorithm merges, one by one, the neighbor nodes that bring the biggest increment of $R$ into the local community until the size of the local community reaches a predetermined value. That is to say, this algorithm needs a parameter, set in advance, that decides the size of the local community. The clustering result is highly affected by the source node.
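The greedy procedure above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: it assumes that R is computed as the fraction of edges touching the edge nodes that stay inside the community (Clauset's I/T form), and the names `local_modularity_r` and `clauset_greedy`, the tie-breaking by smallest node id, and the toy graph are hypothetical.

```python
def local_modularity_r(adj, D):
    """Fraction of edges touching the edge nodes of D that stay inside D."""
    # Edge nodes: members of D with at least one neighbor outside D
    B = {v for v in D if adj[v] - D}
    # Each undirected edge with an endpoint in B, counted once
    boundary_edges = {frozenset((v, u)) for v in B for u in adj[v]}
    if not boundary_edges:
        return 0.0  # assumption: with no edge nodes, R is treated as 0
    inside = sum(1 for edge in boundary_edges if edge <= D)
    return inside / len(boundary_edges)


def clauset_greedy(adj, source, size):
    """Grow a community from source until it contains `size` nodes."""
    D = {source}
    S = set(adj[source])
    while len(D) < size and S:
        # Maximizing R after the merge is the same as maximizing its increment
        best = max(sorted(S), key=lambda v: local_modularity_r(adj, D | {v}))
        D.add(best)
        S = (S | adj[best]) - D
    return D


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(clauset_greedy(adj, 0, 3))
```

The `size` argument plays the role of the predetermined community size that the algorithm requires.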

2.2.2. LWP Algorithm

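The LWP algorithm [11] is an improved algorithm. Compared with Clauset, it has definite termination conditions. The algorithm defines the community modularity $M$ as follows:

$$M = \frac{e_{in}}{e_{out}},$$

where $e_{in}$ is the number of edges inside the community $D$ and $e_{out}$ is the number of edges connecting nodes in $D$ with nodes outside $D$.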

Given an undirected graph $G$, the LWP algorithm starts from a source node and then finds a subgraph $D$ with maximum $M$. If the subgraph is a community ($M > 1$), the algorithm returns the community consisting of the subgraph. Otherwise, it is considered that no community can be found starting from the source node. For one source node, the LWP algorithm finds the subgraph with maximum $M$ through two main steps. First, a subgraph $D$ containing only the source node is constructed, and the neighbor nodes of $D$ are merged into $S$. The incremental step and the pruning step of the algorithm then follow.

In the incremental step, the algorithm iteratively merges nodes of $S$ into $D$ until $D$ is extended to a certain degree. Each time, the algorithm selects the node which brings the biggest increment of $M$. In the pruning step, the algorithm iteratively removes the nodes of $D$ whose removal increases the local community modularity $M$ of $D$, until no more nodes can be removed. This algorithm turns out to result in high recall but low accuracy.
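The modularity used by LWP and its community test can be sketched as below. The sketch assumes that M is the number of edges inside the subgraph divided by the number of edges leaving it; the function names, the handling of a subgraph with no outgoing edges, and the toy graph are hypothetical choices for illustration.

```python
def edge_counts(adj, D):
    """Return (edges inside D, edges between D and the rest of the graph)."""
    # Every internal edge is seen from both endpoints, hence the division by 2
    e_in = sum(1 for v in D for u in adj[v] if u in D) // 2
    e_out = sum(1 for v in D for u in adj[v] if u not in D)
    return e_in, e_out


def modularity_m(adj, D):
    """M = e_in / e_out; treated as infinite when no edge leaves D."""
    e_in, e_out = edge_counts(adj, D)
    return e_in / e_out if e_out else float("inf")


def is_community(adj, D):
    """LWP accepts the subgraph as a community only when M > 1."""
    return modularity_m(adj, D) > 1


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(modularity_m(adj, {0, 1, 2}), is_community(adj, {0, 1, 2}))
print(modularity_m(adj, {0, 1}), is_community(adj, {0, 1}))
```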

2.2.3. Bagrow Algorithm

Bagrow and Bollt proposed a local community detection algorithm [13]. Starting from a source node, the algorithm gradually merges nodes into the community boundary by boundary; here the boundary is the set of nodes whose distance to the source node is a fixed value. The effect of the algorithm depends on the choice of the source node. If the source node is a boundary node rather than a core node, the final clustering results will be very different. To overcome this problem, the authors suggest taking each node in turn as the source node and repeating the calculation to find the optimal result; however, this makes the algorithm very slow.
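The notion of a boundary at a fixed distance from the source node can be illustrated with a breadth-first search. The sketch below only groups nodes by their distance to the source; it does not implement the stopping rule of the Bagrow algorithm, and the function name `shells` and the toy graph are assumptions made for illustration.

```python
from collections import deque


def shells(adj, source):
    """Group the reachable nodes by their hop distance to the source node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:          # first visit gives the shortest distance
                dist[u] = dist[v] + 1
                queue.append(u)
    by_distance = {}
    for v, d in dist.items():
        by_distance.setdefault(d, set()).add(v)
    return by_distance


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(shells(adj, 0))
print(shells(adj, 2))
```

Starting from node 0 and from node 2 gives different boundaries at the same distance, which is one way to see why the result depends on the source node.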

3. Improved Local Community Detection Algorithm Using Selection Probability

Most of the existing algorithms use a greedy strategy that selects the currently optimal node to join the local community. This can easily trap the algorithm in a locally optimal solution. To avoid this, this paper introduces, on the basis of the local community modularity, another criterion for selecting nodes: the selection probability.

3.1. Selection Probability

The LWP algorithm selects the subsequent node according to the largest increment of $M$. The nodes which make the increment of $M$ greater than zero are regarded as candidate nodes. Here, we set the selection probability for the candidate nodes according to $\Delta M$, which is the increment of $M$. Consider

$$P_i = \frac{\Delta M_i}{\sum_{j=1}^{n} \Delta M_j},$$

where $P_i$ is the selection probability of candidate node $v_i$, $\Delta M_i$ is the increment of $M$ obtained by merging $v_i$, and $n$ is the number of candidate nodes.

The selection probability is determined by $\Delta M$. The larger $\Delta M$ is, the larger the selection probability is; that is to say, the greater the increment of $M$ is, the more likely the node is to be added to the local community.
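As a small illustration, the snippet below turns a set of increments of M into selection probabilities. It assumes that the probability of a candidate is its increment divided by the sum of the increments of all candidates; the function name and the numerical increments are hypothetical and are not taken from the paper's figures.

```python
def selection_probabilities(delta_m):
    """Map each candidate node to its selection probability.

    delta_m maps a neighbor node to its increment of M. Only nodes with a
    positive increment are candidates; the probabilities sum to 1.
    """
    candidates = {v: d for v, d in delta_m.items() if d > 0}
    total = sum(candidates.values())
    return {v: d / total for v, d in candidates.items()}


# Hypothetical increments for five neighbor nodes (node 5 is not a candidate)
delta_m = {1: 1.0, 2: 1.0, 3: 2.0, 4: 4.0, 5: -0.5}
print(selection_probabilities(delta_m))
```

With these made-up values, node 4 has the largest increment and a probability of 0.5, so a greedy rule would always pick it, whereas a probabilistic rule picks one of the other candidates half of the time.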

3.2. Steps of the Algorithm

The ILCDSP algorithm generates a random number $r$ between 0 and 1 and uses it, together with the selection probabilities, to pick a candidate node $v_i$, which is then merged into community $D$. A node with a greater selection probability therefore has a greater chance of being selected.

Steps of the algorithm are shown in Algorithm 1.

Input: A source node $v_0$ and a network $G$
Output: A local community $D$ containing source node $v_0$
(1) Merge the source node $v_0$ into $D$ and merge the neighbor nodes of $v_0$ into $S$
(2) Set local modularity $M = 0$
(3) Calculate $P_i$ for each node $v_i$ in $S$ which makes $\Delta M_i > 0$
(4) Do
(5) While the set of candidate nodes is not empty Do
(6) Generate $r$ randomly, find the candidate node $v_i$
(7) Merge node $v_i$ into $D$, and remove it from $S$
(8) Update $M$, $S$, and the selection probabilities
(9) Until no new node is merged into $D$
(10) Return $D$

The ILCDSP algorithm first merges the source node into the local community $D$ and then merges all of its neighbor nodes into $S$ (Step 1). It initializes the community modularity $M$ as zero (Step 2), goes through each node in $S$, chooses the nodes which make $\Delta M$ larger than zero as candidate nodes, and, at the same time, calculates their selection probabilities (Step 3). The algorithm then generates a random number between 0 and 1, selects a node according to the node selection probabilities, and merges it into $D$. It updates $M$, $S$, and the selection probabilities and repeats the above steps until no node can be merged into $D$ (Steps 4–9). Finally, the algorithm returns the local community $D$ containing the source node (Step 10).
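The steps described above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' C#.NET implementation: M is taken as the number of edges inside the community divided by the number of edges leaving it, the random draw is a roulette-wheel selection with weights proportional to the increment of M (via the standard-library `random.Random.choices`), and a node whose merge leaves no outgoing edges is merged directly. The function names and the toy graph are hypothetical.

```python
import random


def modularity_m(adj, D):
    """M = (edges inside D) / (edges leaving D); infinite if none leave."""
    e_in = sum(1 for v in D for u in adj[v] if u in D) // 2
    e_out = sum(1 for v in D for u in adj[v] if u not in D)
    return e_in / e_out if e_out else float("inf")


def ilcdsp(adj, source, rng=None):
    """Grow a local community around source with probabilistic selection."""
    rng = rng or random.Random()
    D = {source}            # Step 1: the local community
    S = set(adj[source])    # Step 1: the neighbors of the community
    M = 0.0                 # Step 2
    while True:
        # Step 3: candidates are the nodes of S with a positive increment
        gains = {}
        for v in sorted(S):
            gain = modularity_m(adj, D | {v}) - M
            if gain > 0:
                gains[v] = gain
        if not gains:       # Step 9: no node can be merged any more
            break
        nodes = list(gains)
        unbounded = [v for v in nodes if gains[v] == float("inf")]
        if unbounded:
            # Assumption: a node that absorbs the last outgoing edges is merged
            chosen = unbounded[0]
        else:
            # Step 6: roulette-wheel draw, weights proportional to the increment
            chosen = rng.choices(nodes, weights=[gains[v] for v in nodes])[0]
        # Steps 7-8: merge the node, then refresh S and M
        D.add(chosen)
        S = (S | adj[chosen]) - D
        M = modularity_m(adj, D)
    return D


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(ilcdsp(adj, 0, random.Random(0)))
```

Because the selection is random, different runs can return different communities. On this toy graph a run returns either the triangle containing the source node or, in rare runs that cross the bridge early, the whole graph.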

Here, we give an example to illustrate the application of the algorithm, as shown in Figure 3.

The source node is an edge node of its community. To find the local community in which the source node is located, the next step should select one of the candidate nodes 1, 2, 3, and 4 to join the local community. Each candidate node has its own increment of $M$. According to LWP, node 4, which has the largest increment of $M$, should be joined into the local community. However, node 4 clearly does not belong to the community of the source node. If node 4 is joined into the local community, the following steps may continue to gather nodes of the community to which node 4 belongs, and the final local community will be composed of that community plus the source node. The community in which the source node is really located therefore cannot be obtained. In our algorithm, by contrast, we calculate the selection probability for nodes 1, 2, 3, and 4, and one of them is randomly selected according to these probabilities. Although the selection of node 4 cannot be ruled out completely, the probability of every candidate node is taken into account, so our algorithm reduces the probability of a wrong step such as selecting node 4 and can obtain a better result.

To improve the efficiency of the algorithm, for any node $v_i$ in $S$, formula (6) [12] can be used to speed up the calculation of $\Delta M$:

$$\Delta M = \frac{x\,(e_{in} + e_{out}) - y\,e_{in}}{e_{out}\,(e_{out} - x + y)}, \quad (6)$$

where $e_{in}$ denotes the number of edges inside $D$ and $e_{out}$ denotes the number of edges between $D$ and $S$; $x$ represents the number of edges that will be added into $D$ because of the agglomeration of node $v_i$, and $y$ is the number of edges which connect $v_i$ and nodes outside $D$. It is easy to see that the degree of $v_i$ equals $x + y$.

The original calculation of $\Delta M$ takes the value of $M$ after joining node $v_i$ minus the value of $M$ before joining $v_i$. This is equal to formula (6), which can be proved as follows.

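Firstly, after node $v_i$ joins $D$, the $x$ edges between $v_i$ and $D$ become internal edges and the $y$ edges between $v_i$ and nodes outside $D$ become external edges, so the modularity after joining is

$$M' = \frac{e_{in} + x}{e_{out} - x + y}.$$

And then

$$\Delta M = M' - M = \frac{e_{in} + x}{e_{out} - x + y} - \frac{e_{in}}{e_{out}} = \frac{x\,(e_{in} + e_{out}) - y\,e_{in}}{e_{out}\,(e_{out} - x + y)}.$$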

Compared with the original calculation of $\Delta M$, which needs to traverse all the nodes each time, formula (6) only needs the values $x$ and $y$ of the node which will be joined into the local community.
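A quick numerical check of this shortcut is sketched below. It assumes that M is the number of edges inside the community divided by the number of edges leaving it, and it writes the shortcut as a single fraction in terms of the two edge counts and of the numbers of edges from the joining node into and out of the community; the function names and the toy graph are hypothetical.

```python
def modularity_m(adj, D):
    """M = (edges inside D) / (edges leaving D)."""
    e_in = sum(1 for v in D for u in adj[v] if u in D) // 2
    e_out = sum(1 for v in D for u in adj[v] if u not in D)
    return e_in / e_out


def delta_m_direct(adj, D, v):
    """Increment of M computed by re-evaluating M on the whole community."""
    return modularity_m(adj, D | {v}) - modularity_m(adj, D)


def delta_m_fast(e_in, e_out, x, y):
    """Increment of M from the edge counts of D and of the joining node only."""
    return (x * (e_in + e_out) - y * e_in) / (e_out * (e_out - x + y))


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
D, v = {0, 1}, 2
x = len(adj[v] & D)   # edges from v into D
y = len(adj[v] - D)   # edges from v to nodes outside D
print(delta_m_direct(adj, D, v), delta_m_fast(1, 2, x, y))
```

Here the community {0, 1} has one internal edge and two outgoing edges, and node 2 contributes x = 2 and y = 1, so both computations give the same increment.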

4. Experiments

This section verifies the performance of the improved algorithm ILCDSP. We compare ILCDSP with several typical local community detection algorithms, namely, Clauset_R, Clauset_M, and LWP. Clauset_R and Clauset_M are the Clauset algorithms taking $R$ and $M$, respectively, as the objective functions. The experimental environment is an Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz with 2 GB of memory, running Windows 7; the programming language is C#.NET.

4.1. Datasets

We select three real networks and LFR benchmark networks as experimental data.

(1) The LFR benchmark network [14] is the most commonly used data set in current studies of community detection. LFR benchmark networks mainly include the following parameters: the parameter of the node degree distribution, the average degree of the nodes in the network, the biggest node degree, the number of nodes that the smallest community contains, the number of nodes that the biggest community contains, and the mixing parameter $\mu$, which is the probability that a node is connected with nodes of external communities. The bigger $\mu$ is, the more difficult the community detection is. We produce four groups of LFR benchmark networks, B1, B2, B3, and B4: two of the groups share one setting of the basic parameters and the other two groups share another. Each group contains nine simulation networks, and the detailed parameter settings are shown in Table 1. The minimum and maximum community sizes of B1, B2, B3, and B4 are set to two different ranges, which represent small community networks and big community networks. The value of $\mu$ in each group is set from 0.1 to 0.9, which generates nine simulation networks; these values mean that the networks change from weakly mixed networks to highly mixed networks.

(2) The detailed information of real network data is shown in Table 2.

4.2. Evaluation

We use Precision, Recall, and F-Score as the evaluation criteria. They are used frequently in many areas such as statistics, information retrieval, and machine learning [16–18]. Some other articles about community detection also use these criteria [19–23]. Precision is the fraction of the nodes in the detected community that are correctly classified; Recall is the ratio of the number of correctly classified nodes to the total number of nodes that should be agglomerated into the community; and F-Score is the harmonic mean of Precision and Recall. The specific formulas are as follows:

$$\text{Precision} = \frac{|C_T \cap C_F|}{|C_F|}, \qquad \text{Recall} = \frac{|C_T \cap C_F|}{|C_T|}, \qquad F\text{-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where $C_T$ is the local community to which the source node belongs in the true partition, and $C_F$ is the local community of the source node identified by the local community detection algorithm. A well-performing algorithm should achieve high Precision, Recall, and F-Score at the same time. In the experiments, every node in the network is taken as a source node to discover its local community. We average the Precision, Recall, and F-Score over all nodes to evaluate and compare the accuracy of our algorithm against the others.
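The three criteria can be computed from the true and the detected community as in the following sketch. The function name `evaluate`, the convention of returning 0 for empty inputs, and the two example node sets are assumptions made for illustration.

```python
def evaluate(true_community, found_community):
    """Return (precision, recall, f_score) for one source node."""
    true_set, found_set = set(true_community), set(found_community)
    correct = len(true_set & found_set)
    precision = correct / len(found_set) if found_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical example: two of the four detected nodes are correct
precision, recall, f_score = evaluate({0, 1, 2}, {0, 1, 3, 4})
print(precision, recall, f_score)
```

Averaging these three values over all source nodes gives the network-level scores used in the comparison.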

4.3. Analysis of Simulation Network

Figures 4, 5, 6, and 7 show the results of the algorithms detecting local communities in the four groups of LFR benchmark networks (B1–B4), respectively. The horizontal axis is the value of $\mu$, which ranges between 0.1 and 0.9; the vertical axis shows the three evaluation measures of local community detection. From these figures, it can be seen that the results of Clauset_R and Clauset_M overlap completely. Although the forms of the community modularity $R$ and $M$ are different, the algorithms using them have the same effect. At the same time, the improved algorithm is clearly better than the original algorithm LWP. Although ILCDSP does not improve the value of Precision, it greatly improves the value of Recall, thus making its F-Score significantly greater than that of LWP. A well-performing algorithm should have high Precision, Recall, and F-Score at the same time; thus the algorithm proposed in this paper is superior to LWP. That is to say, our algorithm can detect the structure of local communities better than the original algorithm.

4.4. Analysis of Real Network

In our experiments, we apply each algorithm to the three real networks. Table 3 shows the results. The bold figures are the maximum values of each evaluation measure on each network. The results of Clauset_R and Clauset_M are the same. Most of the maximum values appear in ILCDSP. Although the Precision of Clauset_R and Clauset_M on karate and polbooks is greater than that of ILCDSP, their Recall is lower than that of ILCDSP, which makes the F-Score of the former two algorithms lower than that of ILCDSP. That is to say, the overall effect of ILCDSP is the best. Figure 8 shows intuitively that the result of our algorithm is better than those of the other algorithms. It further shows that our algorithm can not only detect local communities of high quality in the simulated networks but can also be applied to detect community structure effectively in real networks.

5. Conclusions

This paper proposed an improved local community detection algorithm, ILCDSP. The algorithm first sets a selection probability for each candidate node, making the nodes with a high selection probability more likely to be selected, and then screens nodes randomly. This leads the search in a better direction and thereby improves the accuracy of local community detection. Experimental results show that ILCDSP can detect the structure of local communities more effectively than the other algorithms in both real and simulated networks.

Although the proposed algorithm improves the accuracy of community detection, it is not yet stable enough, and it remains to be further studied and improved.

Conflict of Interests

The authors declare that they have no financial and personal relationships with other people or organizations that can inappropriately influence their work; there is no professional or other personal interest of any nature or kind in any product, service, or company that could be construed as influencing the position presented in or the review of the paper entitled.

Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities, the National High Technology Research and Development Program of China (863 Program) (no. 2012AA011004 and no. 2012AA0622022), the Fundamental Research Funds for the Central Universities under Grant (no. 2013XK10), and the Doctoral Fund of the Ministry of Education (no. 20100095110003 and no. 20110095110010).