Abstract

Semisupervised community detection, which uses prior information to guide the discovery of community structure, has attracted considerable research interest in the past few years. Most previous works assume that the exact labels of some nodes are known in advance and are presented in the form of individual labels and pairwise constraints. In this paper, we propose a novel type of prior information called negative information, which indicates that a node does not belong to a specific community. We then present a semisupervised community detection algorithm that uses this type of information efficiently to assist the process of community detection. The proposed algorithm is evaluated on several artificial and real-world networks and shows high effectiveness in recovering communities.

1. Introduction

Many networked systems, including social and biological networks, are found to divide naturally into communities, that is, groups of vertices which are densely connected to each other while less connected to the vertices outside [1]. The community structure in real networks usually has a specific function, such as cycles or pathways in metabolic networks or collections of pages on the same or related topics on the web [2]. To comprehensively understand the function of different networks, much research effort has been devoted to developing methods that can extract community structure from networks.

Many models and algorithms have been proposed for community detection, such as betweenness-based algorithms [1, 3], modularity-based methods [2, 4–6], the spin model [7], and stochastic blockmodels [8]; see [9, 10] for a more comprehensive review. However, almost all existing approaches for community detection only make use of the network topology and completely ignore the background information of the network. In many real-world applications, we may know some prior information that could be useful in detecting the community structure. For instance, a few proteins are known to belong to certain functional classes in protein-protein interaction networks [11]. Therefore, how to utilize prior information to guide the discovery of community structure is an interesting question that is worth studying.

In recent years, a variety of semisupervised community detection algorithms have been proposed. Ma et al. [12] proposed a semisupervised method based on symmetric nonnegative matrix factorization, which incorporates pairwise constraints (via must-link and cannot-link constraints) on the cluster assignments of nodes for identifying community structure in networks. Eaton and Mansbach [13] presented a semisupervised algorithm based on the spin-glass model, which can incorporate prior knowledge in the form of individual labels (via known cluster assignments for a fraction of nodes) and pairwise constraints into the process of extracting community structure. Zhang et al. [14, 15] developed methods that implicitly encode the pairwise constraints by modifying the adjacency matrix of the network, which can also be regarded as a denoising process for the consensus matrix of the community structures. Liu et al. [16, 17] put forward two semisupervised algorithms based on discrete potential and label propagation, respectively. Both algorithms are especially suitable for networks with obscure community structure and exhibit almost linear time complexity.

Although these approaches can improve the accuracy and noise resistance of community detection, they mostly focus on one kind of prior information; that is, the exact labels of a small portion of nodes are given. In some real applications, it may not be easy to identify the exact community of a node, whereas we can easily point out a community that the node does not belong to. As a simplified example, assume that the web can be grouped into communities which represent pages on related topics. Further, suppose that a web page describes a women's soccer game; it is hard to determine whether the page belongs to the sport community or the feminism community. However, it clearly does not belong to the automobile community.

In machine learning, negative information was first proposed by Hou et al. [18]. In their work, negative information indicates that a point does not belong to a specific category. They utilized negative information to guide the process of semisupervised learning and conducted experiments on image, digit, spoken letter, and text classification tasks. The experimental results showed the effectiveness of negative information. As far as we know, there is no community detection method that uses negative information, although this information arises naturally in some applications.

In this paper, we propose a novel semisupervised community detection approach based on negative information. It has near-linear time complexity and can incorporate negative information into community detection. The algorithm has been evaluated on synthetic LFR benchmark networks [19] and on various real-world networks with community structure. The results show that negative information helps to improve the accuracy of identifying communities.

The rest of the paper is structured as follows. Section 2 includes reviews of the basic formulation and notations used in our approach. In Section 3, we describe our new semisupervised community detection algorithm in detail. Experimental results on artificial and real-world networks are given in Section 4. Finally, a conclusion is presented in Section 5.

2. Problem Formulation and Notations

We first give the notations of network representation which will be used throughout this paper. Let $G = (V, E)$ denote an unweighted and undirected network, where $V$ is the set of nodes and $E$ is the set of edges. Multiple edges and self-connections are not allowed. The network structure is determined by the adjacency matrix $A$. Each element $A_{ij}$ of $A$ is equal to 1 if there is an edge connecting nodes $v_i$ and $v_j$, and it is 0 otherwise. If there are $c$ communities, a community-number (label) set $L = \{1, 2, \ldots, c\}$ is defined.

Assume that there are three kinds of nodes, that is, traditional label (TL) nodes , negative label (NL) nodes , and unlabeled (UL) nodes . Define the set of TL nodes with cardinality , the set of NL nodes with cardinality , and the set of UL nodes with , where typically , , and . Further, suppose that we are given a set of nodes . The label indicator matrix of is defined as follows: if and only if belongs to the th community; otherwise . We define the label indicator matrix of as if and only if does not belong to the th community; otherwise . Note that, different from , the row vectors of may have more than one element which is equal to . The goal of our approach is to infer the exact labels for nodes in .

In this paper, the label propagation task is to propagate the TLs, under the guidance of the NL information, to all of the nodes in $V$, thereby predicting the labels of the nodes without a TL. The result of label propagation for community detection depends on the weights of the edges of the network, so the construction of the weight matrix $W$ plays a decisive role. In this work, a simple weight matrix is defined as

$$W_{ij} = \frac{A_{ij}}{d_i}, \tag{1}$$

where $d_i = \sum_j A_{ij}$ represents the degree of node $v_i$. Consequently, in the label propagation process, the labeled nodes propagate seed labels to their neighbours with uniform probability.
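The row normalisation of the adjacency matrix described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `weight_matrix`, the 4-node path graph, and the handling of isolated nodes are assumptions made here and are not part of the paper.

```python
def weight_matrix(adj):
    """Row-normalise a 0/1 adjacency matrix: W[i][j] = A[i][j] / d_i."""
    weights = []
    for row in adj:
        degree = sum(row)  # d_i, the degree of node i
        # Assumption: an isolated node (degree 0) keeps an all-zero row.
        weights.append([a / degree if degree else 0.0 for a in row])
    return weights


# Hypothetical example: the path graph 0-1-2-3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
W = weight_matrix(A)
```

Each row of `W` sums to 1 for a node with at least one neighbour, so a node weights all of its neighbours equally when it absorbs label information.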

3. The Proposed Algorithm

In this section, the details of our proposed algorithm based on negative information are presented, and then the time complexity and the convergence property of the algorithm are analyzed. There are mainly two steps of the algorithm. The first is to determine the particular parameter matrices, and the second is to propagate labels via an iterative process.

3.1. Parameter Matrices Construction

Following the idea of Hou et al. [18], we introduce two matrices, that is, the initial label matrix $Y \in \mathbb{R}^{n \times c}$ and the parameter matrix $\alpha \in \mathbb{R}^{n \times c}$, where $n$ is the number of nodes and $c$ is the number of communities. $Y_{ij}$ represents the probability that $v_i$ belongs to the $j$th community, and $\alpha$ is a matrix that encodes the role of each node: it indicates when an NL node can be regarded as a TL node and when it is considered as an unlabeled node. We also define two parameters $\alpha_l$ and $\alpha_u$, which are the values taken by the entries of $\alpha$ for labeled nodes (including TL and NL) and unlabeled nodes, respectively.

For any node $v_i$, the entries $Y_{ij}$ and $\alpha_{ij}$ are defined as follows.

(1) If $v_i$ Has a TL. Based on the indicator matrix of the TL nodes, if $v_i$ belongs to the $k$th community, then

(2) If $v_i$ Has NLs. According to the indicator matrix of the NL nodes, we can define an index set $S_i$, which contains the communities that $v_i$ does not belong to; then

(3) If $v_i$ Is an Unlabeled Node. Consider

How these two matrices are used in the proposed algorithm will be explained in the next subsection. Note that $\alpha_l$ is close to 0 and $\alpha_u$ is close to 1.
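One possible way to build the two matrices for the three node types is sketched below. The exact entries of equations (2) to (7) are not reproduced here, so several choices are assumptions made only for illustration: entries without prior information start at 0, a TL node uses $\alpha_l$ in its whole row, and the default values 0.01 and 0.99 are merely "close to 0" and "close to 1". The function name and the dictionary-based inputs are hypothetical as well.

```python
def build_label_matrices(n, c, tl, nl, alpha_l=0.01, alpha_u=0.99):
    """Build the initial label matrix Y and the parameter matrix alpha.

    tl: dict mapping a TL node to its known community index.
    nl: dict mapping an NL node to the set of communities it is excluded from.
    Assumption: tl and nl have no node in common.
    """
    # Assumption: entries without prior information start at 0 and use alpha_u.
    Y = [[0.0] * c for _ in range(n)]
    alpha = [[alpha_u] * c for _ in range(n)]

    for i, k in tl.items():
        Y[i][k] = 1.0              # the known community gets probability 1
        alpha[i] = [alpha_l] * c   # assumption: the whole row keeps its initial label

    for i, excluded in nl.items():
        for j in excluded:
            alpha[i][j] = alpha_l  # Y[i][j] stays 0 and is retained during propagation
        # Entries outside the excluded set keep alpha_u (treated as unlabeled).

    return Y, alpha


# Hypothetical example: 4 nodes, 3 communities.
Y, alpha = build_label_matrices(4, 3, tl={0: 1}, nl={2: {0}})
```

In this sketch an NL node behaves like a labeled node only in the columns of the communities it is excluded from, and like an unlabeled node everywhere else.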

3.2. Description of the Algorithm

The algorithm is motivated by the fact that nodes having the same traditional label are grouped together as one community through the label propagation process. We initialize a small number of nodes with user-defined labels based on prior information (including TLs and NLs) and let the TLs propagate through the network. As the labels propagate, the exact labels of the NL and unlabeled nodes are obtained. We now show how to iteratively propagate the TLs under the guidance of the NL information.

This process is performed iteratively: at every step, each node absorbs some label information from its neighbors and retains some label information from its initial state. Let $F(t) \in \mathbb{R}^{n \times c}$, $t = 0, 1, 2, \ldots$, denote the sequence of computed label matrices; the $i$th row vector of $F(t)$ gives the possibilities of node $v_i$ belonging to each of the communities. The exact label of a node is determined by the index of the largest element of the corresponding row vector of $F$. The iterative formula is defined as follows:

$$F_{ij}(t+1) = \alpha_{ij} \sum_{k} W_{ik} F_{kj}(t) + (1 - \alpha_{ij}) Y_{ij}, \tag{8}$$

where $t$ denotes the number of iterations. The first term is the label information that $v_i$ absorbs from its neighbors, and the second term is the label information retained from its initial label.

Specifically, if $v_i$ has a TL which indicates that it belongs to the $k$th community, then $Y_{ik} = 1$ and $\alpha_{ik} = \alpha_l$. In this case $Y_{ij} = 0$ for $j \neq k$ and $\alpha_l$ is close to 0. Thus, the second term in (8) plays a major role in each iteration; that is, the predicted label is consistent with the given TL. If $v_i$ has an NL indicating that $v_i$ does not belong to the $j$th community, that is, $j$ is in the index set of $v_i$, then $Y_{ij} = 0$ and $\alpha_{ij} = \alpha_l$, so the second term again dominates and the $j$th entry stays close to 0. On the contrary, if $j$ is not in the index set, there is little prior information to help determine whether $v_i$ belongs to the $j$th community or not. Therefore, we regard it as an unlabeled entry and set $\alpha_{ij} = \alpha_u$. In this case, the first term in (8) plays a major role in each iteration. If $v_i$ is unlabeled, there is no prior information about its label and $\alpha_{ij} = \alpha_u$ for all $j$. Thus (8) is dominated by its first term. In summary, the iteration equation can be rewritten in matrix form as (9), where $I$ denotes an identity matrix.

To summarize, the main procedure of the method is presented in Algorithm 1.

input: adjacency matrix $A$, the label indicator matrices of the TL and NL nodes, the constants $\alpha_l$ and $\alpha_u$
output: the labels of all the nodes
(1)   construct the weight matrix $W$ by (1).
(2)   for each node $v_i \in V$
(3)     if $v_i$ has a TL
(4)       construct $Y_{ij}$ and $\alpha_{ij}$ by (2) and (3), respectively.
(5)     if $v_i$ has NLs
(6)       construct $Y_{ij}$ and $\alpha_{ij}$ by (4) and (5), respectively.
(7)     if $v_i$ is an unlabeled node
(8)       construct $Y_{ij}$ and $\alpha_{ij}$ by (6) and (7), respectively.
(9)   iterate (9) until convergence.
(10)  output the label of each node $v_i$ by $\arg\max_j F_{ij}$.
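The procedure can be illustrated end to end on a toy network. The sketch below is not the authors' implementation: it assumes an entrywise update in which each entry mixes the neighbour average (weight $\alpha_{ij}$) with the initial label (weight $1 - \alpha_{ij}$), it uses illustrative values 0.01 and 0.99 for the two constants, and the graph (two triangles joined by one edge), the stopping tolerance, and all function names are hypothetical.

```python
def propagate(W, Y, alpha, tol=1e-9, max_iter=10000):
    """Iterate F <- alpha * (W F) + (1 - alpha) * Y entrywise until it settles."""
    n, c = len(Y), len(Y[0])
    F = [row[:] for row in Y]  # start from the initial label matrix
    for _ in range(max_iter):
        F_new = []
        for i in range(n):
            row = []
            for j in range(c):
                absorbed = sum(W[i][k] * F[k][j] for k in range(n))
                row.append(alpha[i][j] * absorbed + (1 - alpha[i][j]) * Y[i][j])
            F_new.append(row)
        change = max(abs(F_new[i][j] - F[i][j]) for i in range(n) for j in range(c))
        F = F_new
        if change < tol:
            break
    return F


def labels(F):
    """Assign each node the community with the largest entry in its row."""
    return [max(range(len(row)), key=row.__getitem__) for row in F]


# Hypothetical toy network: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n, c = 6, 2
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1
W = [[a / sum(row) for a in row] for row in A]  # row-normalised weights

alpha_l, alpha_u = 0.01, 0.99  # illustrative values only
Y = [[0.0] * c for _ in range(n)]
alpha = [[alpha_u] * c for _ in range(n)]
Y[0][0] = 1.0                   # TL: node 0 belongs to community 0
alpha[0] = [alpha_l] * c
Y[5][1] = 1.0                   # TL: node 5 belongs to community 1
alpha[5] = [alpha_l] * c
alpha[1][1] = alpha_l           # NL: node 1 does not belong to community 1

F = propagate(W, Y, alpha)
predicted = labels(F)
```

The dense triple loop is kept for readability; on a large sparse network one would sum only over the actual neighbours of each node, which is what makes a single iteration cheap.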

3.3. Analysis of the Algorithm

In this subsection we will analyze our method theoretically. First we will discuss the time complexity of the algorithm. Second we will analyze the convergence property of the iteration of the algorithm.

The algorithm mainly contains three computational parts: constructing the weight matrix , constructing the label and parameter matrices , , and iterating (9) until convergence. In the first part, time is required to construct the weight matrices, where denotes the number of edges. In the second part, the label and parameter matrices can be derived with computational complexity , where denotes the number of nodes. In the last part, the time complexity of each iteration is . Assuming (9) is converged at iterations, the last part of the algorithm requires time. Since the time complexity of the algorithm depends on the highest complexity of the three parts involved in it, the overall time complexity is for the proposed algorithm.

The convergence of the algorithm is analyzed as follows. According to the initial condition that , (9) can be rewritten as Since and , from the theorem of Perron-Frobenius [20], the spectral radius of satisfies . Recall that the elements of are either or , where and ; thus Obviously, (9) will converge to

4. Experiments and Discussion

In this section, we give a set of experiments to show the effectiveness of the proposed algorithm. The relevant data sets involving the experiments are LFR artificial networks [19] and real-world networks including the Zachary’s network of karate club [21] and the Lusseau’s network of bottlenose dolphins [22]. In all the experiments of this section, and are set to and , respectively.

4.1. Artificial Networks

In this subsection, the ability of the algorithm to identify communities is tested in LFR benchmark networks. Our experiments include evaluating the performance of the algorithm with various amounts of NL nodes, measuring the ability of the algorithm to recover communities with different parameter in benchmark networks, comparing the accuracy of our algorithm with label propagation algorithm (LPA) [23] and Infomap algorithm [24] and analyzing the relationship between the percentage of NL nodes and the percentage of TL nodes in the proposed algorithm. In the following experiments, the choice of NL and TL nodes is random, and the number of NLs of each NL node is set to percent of the number of communities. Note that the NLs of each NL node are selected randomly.
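The random choice of TL nodes, NL nodes, and the NLs of each NL node can be sketched as follows. This is a simplified illustration under stated assumptions: nodes are sampled over the whole network rather than per community, the fractions are free parameters because the specific percentages are not assumed here, and the function names are hypothetical.

```python
import random


def sample_negative_labels(true_label, c, fraction, rng):
    """Pick about fraction * c communities that the node does NOT belong to."""
    # Keep at least one NL and never exclude all c - 1 wrong communities plus more.
    count = max(1, min(c - 1, round(fraction * c)))
    wrong = [k for k in range(c) if k != true_label]
    return set(rng.sample(wrong, count))


def choose_prior(true_labels, c, tl_fraction, nl_fraction, nl_per_node, seed=0):
    """Randomly split off TL nodes and NL nodes; the rest stay unlabeled."""
    rng = random.Random(seed)
    nodes = list(range(len(true_labels)))
    rng.shuffle(nodes)
    n_tl = round(tl_fraction * len(nodes))
    n_nl = round(nl_fraction * len(nodes))
    tl = {i: true_labels[i] for i in nodes[:n_tl]}
    nl = {
        i: sample_negative_labels(true_labels[i], c, nl_per_node, rng)
        for i in nodes[n_tl:n_tl + n_nl]
    }
    return tl, nl


# Hypothetical ground truth: 10 nodes in 3 communities.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
tl, nl = choose_prior(truth, c=3, tl_fraction=0.2, nl_fraction=0.3, nl_per_node=0.5)
```

Because the NLs are drawn only from the wrong communities, a sampled NL never contradicts the ground truth.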

The LFR benchmark network is an artificial network for community detection, which is claimed to possess some basic statistical properties found in real networks, such as heterogeneous distributions of degree and community size. Several parameters specify the properties of the generated networks in this benchmark: $N$ (number of nodes), $\langle k \rangle$ (average degree), $k_{\max}$ (maximum degree), $c_{\min}$ and $c_{\max}$ (minimum and maximum community size), $\gamma$ (exponent of the power-law distribution of node degrees), $\beta$ (exponent of the power-law distribution of community sizes), and $\mu$ (mixing parameter).

To evaluate the performance of our algorithm on discovering community structures, the normalized mutual information (NMI) measure [25] is used to quantitatively compare the known partition with the partition found by the algorithm:

$$\mathrm{NMI} = \frac{-2 \sum_{i=1}^{c_A} \sum_{j=1}^{c_B} N_{ij} \log\left(\frac{N_{ij} N}{N_{i\cdot} N_{\cdot j}}\right)}{\sum_{i=1}^{c_A} N_{i\cdot} \log\left(\frac{N_{i\cdot}}{N}\right) + \sum_{j=1}^{c_B} N_{\cdot j} \log\left(\frac{N_{\cdot j}}{N}\right)},$$

where $c_A$ is the real number of communities and $c_B$ denotes the number of found communities. The matrix with entries $N_{ij}$ is the confusion matrix, where $N_{ij}$ is simply the number of nodes in the real community $i$ that appear in the found community $j$. $N_{i\cdot}$ and $N_{\cdot j}$ are the sums over row $i$ and column $j$ of the confusion matrix, respectively. $N$ is the number of nodes. If the found partition is identical to the real communities, then NMI takes its maximum value, 1. However, if the found partition is entirely independent of the real partition, then $\mathrm{NMI} = 0$, which corresponds to the situation that the entire network is found to be one community. The closer the NMI is to 1, the better the partition of the network.
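As an illustration of how this measure is computed from a confusion matrix, the following sketch evaluates the NMI of two label lists. The function name `nmi` is hypothetical, and the value returned when both partitions consist of a single community (where the ratio is 0/0) is a convention chosen here, not taken from the paper.

```python
import math


def nmi(true_labels, found_labels):
    """Normalized mutual information between two partitions of the same nodes."""
    n = len(true_labels)
    rows = sorted(set(true_labels))
    cols = sorted(set(found_labels))

    # Confusion matrix: N[(a, b)] = nodes in real community a and found community b.
    N = {(a, b): 0 for a in rows for b in cols}
    for a, b in zip(true_labels, found_labels):
        N[(a, b)] += 1
    row_sum = {a: sum(N[(a, b)] for b in cols) for a in rows}
    col_sum = {b: sum(N[(a, b)] for a in rows) for b in cols}

    numerator = -2.0 * sum(
        N[(a, b)] * math.log(N[(a, b)] * n / (row_sum[a] * col_sum[b]))
        for a in rows for b in cols if N[(a, b)] > 0  # empty cells contribute 0
    )
    denominator = sum(row_sum[a] * math.log(row_sum[a] / n) for a in rows) + sum(
        col_sum[b] * math.log(col_sum[b] / n) for b in cols
    )
    if denominator == 0:
        # Assumption: both partitions are one single community, treated as identical.
        return 1.0
    return numerator / denominator


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])     # identical up to renaming of labels
merged = nmi([0, 0, 1, 1], [0, 0, 0, 0])   # everything found as one community
```

The measure ignores the names of the communities, so a found partition that only permutes the labels of the real one still scores the maximum value.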

The number of NL nodes in each community is an important factor that affects the ability of the algorithm to identify communities. In order to quantify the relationship between the accuracy of the algorithm and the number of NL nodes in each community, the experiment is performed on the LFR benchmark networks (see Figure 1). The following parameters were employed: , , , , , , and . We fix the percentage of TL nodes in each community. The result of the experiment suggests that the partition accuracy of the algorithm increases with increase of the percentage of NL nodes in each community.

In the LFR networks, the mixing parameter represents the ratio between the external degree of each vertex with respect to its community and the total degree of the node. The larger the value of the network is, the harder its community structure is detected. Then the experiment is designed for testing the accuracy of our algorithm with the various parameter (see Figure 2). In the experiment, we randomly select and labeled nodes in each community, respectively. The following parameters about the LFR benchmark networks were employed: , , , , , , and . We fix the percentage of TL nodes 20% in each community and determine the percentage of NL nodes in each community by searching the grid .

To evaluate the effectiveness of our algorithm, we compare it to label propagation algorithm (LPA) and Infomap algorithm. Both algorithms can discover community structure without prior knowledge. We generate benchmark networks with the following parameters: , , , , , , and . Then, we randomly select TL nodes and, respectively, select and NL nodes in each community (see Figure 3).

Figure 3 shows the advantages of our algorithm in certain situations. Our proposed algorithm gives almost the same results as LPA and Infomap when the mixing parameter ranges from 0.1 to 0.3, and it also achieves better quality than LPA. Although our algorithm is no better than the Infomap algorithm when the mixing parameter ranges from 0.4 to 0.7, it outperforms Infomap when the mixing parameter is high. In particular, when the mixing parameter reaches 0.8, the NMI value of the Infomap algorithm is almost 0, while it is more than 0.3 for our algorithm, as depicted in Figure 3. This means that our algorithm is particularly suitable for community detection when the mixing parameter is high. In other words, our algorithm is more favorable for networks with obscure community structure. The result of the experiment also shows that the NL nodes help increase the accuracy of community partition.

In the proposed algorithm, the percentage of NL nodes and the percentage of TL nodes are important factors that influence the accuracy of community partition. To analyze the relationship between NL and TL, we, respectively, set the percentage of TL nodes to and the percentage of NL nodes to (see Figure 4). We generate benchmark networks with the following parameters: , , , , , , and .

As can be seen from Figure 4, the proposed algorithm performs better with the increase of TL nodes. It is consistent with intuition, since there is more exact label information available. Moreover, with the increase of NL nodes, the algorithm can achieve higher accuracies. This means that NL is actually helpful to community detection and the algorithm can use this information effectively. In particular, NL is more beneficial when TL nodes are rare, since the increase of accuracy brought by NL will become smaller with the increase of TL nodes.

4.2. Real-World Networks

In this subsection, we verify our algorithm on empirical networks, namely, the karate club network and the dolphin social network, which have been used as benchmarks to evaluate many community detection algorithms since the true community structures of the two networks are known. In general, the karate club network can be split into two disjoint groups due to the disagreement between the administrator and the instructor of the club, and the dolphin social network can be separated into two groups due to the temporary disappearance of a dolphin. However, an NL node is equivalent to a TL node when a network is divided into only two communities, because excluding one community determines the other. Therefore, in the following experiments, we assume that Donetti’s result [26] is the true partition of the karate club network and Pan’s conclusion [27] is the true community structure of the dolphin social network.

The karate club network is constructed by Zachary over a period of two years and is composed of nodes corresponding to members of the club and edges representing the connections of the individuals outside the activities of the club. In Donetti’s result, the network is split into four communities. We select the nodes as TL nodes and the nodes as NL nodes for four different communities, respectively. Each NL node has one NL. The parameters and are set to and , respectively. Applying the proposed algorithm, the results of community detection for karate club network are shown in Figure 5. It is clear that the result of our proposed method is in agreement with the partition of Donetti’s method.

The dolphin social network, consisting of nodes indicating bottlenose dolphins and edges representing the associations between dolphin pairs occurring more often than expected by chance, is constructed by Lusseau over a period of seven years from 1994 to 2001. In Pan’s conclusion, the network is divided into four communities. We select the nodes as TL nodes and the nodes as NL nodes for four different communities, respectively. Each NL node has one NL. The parameters and are and , respectively. Applying the proposed algorithm, the results of community detection for dolphin social network are shown in Figure 6. It is obvious that the result of our proposed method is approximately consistent with the result of Pan’s methods.

5. Conclusions

In this paper, a semisupervised community detection algorithm is proposed based on negative information, which indicates that a node does not belong to a specific community. It has near-linear time complexity and can incorporate both NLs and TLs into community detection. As seen from our experimental results on both real and artificial networks, incorporating NLs into the community detection procedure can significantly improve performance, especially when traditional labels are rare. Moreover, the more TLs and NLs are supplied to our algorithm, the better the community partition result.

Unfortunately, the algorithm carries an implicit restriction: the number of communities must be known in advance, since the selected TL nodes should cover all the communities. Our future work will concentrate on detecting communities without knowing the number of communities beforehand. In other words, we will work on an improved semisupervised community detection algorithm which is capable of identifying communities accurately even when some communities have no labeled nodes.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National High-Tech Research and Development Program (863 Program) (no. 2014AA015103), the Joint Funds of the National Natural Science Foundation of China (no. U1404604), the Natural Science Plan Project of the Education Department of Henan Province (nos. 2010B520014 and 2010A520024), and the Soft Science Research Program of Science and Technology Department of Henan Province (no. 112400450405).