Mathematical Problems in Engineering

Volume 2015 (2015), Article ID 109671, 8 pages

http://dx.doi.org/10.1155/2015/109671

## Effective Semisupervised Community Detection Using Negative Information

^{1}School of Computer and Information Engineering, Henan Normal University, Xinxiang 453007, China

^{2}School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China

Received 5 June 2014; Accepted 13 October 2014

Academic Editor: Qinggang Meng

Copyright © 2015 Dong Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The semisupervised community detection method, which can utilize prior information to guide the discovery of community structure, has attracted considerable research interest in the past few years. Most previous works assume that the exact labels of some nodes are known in advance and are presented in the form of individual labels or pairwise constraints. In this paper, we propose a novel type of prior information called negative information, which indicates that a node does not belong to a specific community. We then present a semisupervised community detection algorithm that makes efficient use of this type of information to assist the process of community detection. The proposed algorithm is evaluated on several artificial and real-world networks and shows high effectiveness in recovering communities.

#### 1. Introduction

Many networked systems, including social and biological networks, are found to divide naturally into communities, that is, groups of vertices that are densely connected to each other but sparsely connected to the vertices outside [1]. The community structure in real networks often has a specific function, such as cycles or pathways in metabolic networks or collections of pages on the same or related topics on the web [2]. To comprehensively understand the function of different networks, much research effort has been devoted to developing methods that can extract community structure from networks.

Many models and algorithms have been proposed for community detection, such as betweenness-based algorithms [1, 3], modularity-based methods [2, 4–6], the spin model [7], and stochastic blockmodels [8]; see [9, 10] for a more comprehensive review. However, almost all existing approaches use only the network topology and completely ignore background information about the network. In many real-world applications, we may know some prior information that could be useful in detecting the community structure. For instance, a few proteins are known to belong to certain functional classes in protein-protein interaction networks [11]. Therefore, how to utilize prior information to guide the discovery of community structure is an interesting question that is worth studying.

In recent years, a variety of semisupervised community detection algorithms have been proposed. Ma et al. [12] proposed a semisupervised method based on symmetric nonnegative matrix factorization, which incorporates pairwise constraints (must-link and cannot-link) on the cluster assignments of nodes to identify community structure in a network. Eaton and Mansbach [13] presented a semisupervised algorithm based on the spin-glass model, which can incorporate prior knowledge in the form of individual labels (known cluster assignments for a fraction of nodes) and pairwise constraints into the process of extracting community structure. Zhang et al. [14, 15] developed methods that implicitly encode the pairwise constraints by modifying the adjacency matrix of the network, which can also be regarded as a denoising process on the consensus matrix of the community structures. Liu et al. [16, 17] put forward two semisupervised algorithms based on discrete potential and label propagation, respectively. Both algorithms are especially suitable for networks with obscure community structure and exhibit almost linear time complexity.

Although these approaches can improve the accuracy and noise resistance of community detection, they mostly focus on one kind of prior information; that is, the exact labels of a small portion of nodes are given. In some real applications, it may not be easy to identify the exact community of a node, whereas we can easily point out a community that the node does not belong to. As a simplified example, assume that the web can be grouped into communities that represent pages on related topics. Suppose further that a web page describes a women's soccer game; it is hard to determine whether the page belongs to the sports community or the feminism community. However, it clearly does not belong to the automobile community.

In machine learning, negative information was first proposed by Hou et al. [18]. In their work, negative information indicates that a point does not belong to a specific category. They utilized negative information to guide the process of semisupervised learning and conducted experiments on image, digit, spoken letter, and text classification tasks. The experimental results showed the effectiveness of negative information. As far as we know, no existing community detection method uses negative information, although this information arises naturally in some applications.

In this paper, we propose a novel semisupervised community detection approach based on negative information. It has near-linear time complexity and can incorporate negative information into community detection. The algorithm has been evaluated on synthetic LFR benchmark networks [19] and on various real-world networks with community structure. The results show that negative information helps to improve the accuracy of identifying communities.

The rest of the paper is structured as follows. Section 2 reviews the basic formulation and notation used in our approach. In Section 3, we describe our new semisupervised community detection algorithm in detail. Experimental results on artificial and real-world networks are given in Section 4. Finally, a conclusion is presented in Section 5.

#### 2. Problem Formulation and Notations

We first give the notation for the network representation used throughout this paper. Let $G = (V, E)$ denote an unweighted and undirected network, where $V = \{v_1, \ldots, v_n\}$ is the set of nodes and $E$ is the set of edges. Multiple edges and self-connections are not allowed. The network structure is determined by the $n \times n$ adjacency matrix $A$. Each element $A_{ij}$ of $A$ is equal to 1 if there is an edge connecting nodes $v_i$ and $v_j$, and it is 0 otherwise. If there are $c$ communities, a community-number (label) set $L = \{1, \ldots, c\}$ is defined.

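Assume that there are three kinds of nodes, that is, traditional label (TL) nodes, negative label (NL) nodes, and unlabeled (UL) nodes. Define the set of TL nodes $V_{\mathrm{TL}}$ with cardinality $n_{\mathrm{TL}}$, the set of NL nodes $V_{\mathrm{NL}}$ with cardinality $n_{\mathrm{NL}}$, and the set of UL nodes $V_{\mathrm{UL}}$ with cardinality $n_{\mathrm{UL}}$, where typically $n_{\mathrm{TL}} \ll n_{\mathrm{UL}}$ and $n_{\mathrm{NL}} \ll n_{\mathrm{UL}}$. Further, suppose that we are given the set of nodes $V = V_{\mathrm{TL}} \cup V_{\mathrm{NL}} \cup V_{\mathrm{UL}}$. The label indicator matrix $Y^{\mathrm{TL}}$ of $V_{\mathrm{TL}}$ is defined as follows: $Y^{\mathrm{TL}}_{ij} = 1$ if and only if $v_i$ belongs to the $j$th community; otherwise $Y^{\mathrm{TL}}_{ij} = 0$. We define the label indicator matrix $Y^{\mathrm{NL}}$ of $V_{\mathrm{NL}}$ by $Y^{\mathrm{NL}}_{ij} = 1$ if and only if $v_i$ does not belong to the $j$th community; otherwise $Y^{\mathrm{NL}}_{ij} = 0$. Note that, different from $Y^{\mathrm{TL}}$, the row vectors of $Y^{\mathrm{NL}}$ may have more than one element equal to 1. The goal of our approach is to infer the exact labels for nodes in $V_{\mathrm{NL}} \cup V_{\mathrm{UL}}$.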
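The two indicator matrices can be illustrated with a short sketch. The function name `indicator_matrices`, the dictionary inputs, and the choice to index rows by all nodes (with all-zero rows for nodes outside the respective set) are assumptions made here for illustration, not the paper's implementation.

```python
import numpy as np

def indicator_matrices(n_nodes, n_comms, tl, nl):
    """Build TL and NL indicator matrices (hypothetical helper).

    tl: {node: community the node belongs to}
    nl: {node: iterable of communities the node does NOT belong to}
    Rows cover all nodes; nodes without that kind of label keep a zero row
    (an assumption made for convenience).
    """
    Y_tl = np.zeros((n_nodes, n_comms), dtype=int)
    Y_nl = np.zeros((n_nodes, n_comms), dtype=int)
    for node, comm in tl.items():
        Y_tl[node, comm] = 1           # exactly one 1 per TL row
    for node, excluded in nl.items():
        Y_nl[node, list(excluded)] = 1  # possibly several 1s per NL row
    return Y_tl, Y_nl

# Node 0 is known to be in community 0; node 1 is known to be in neither
# community 1 nor community 2; node 2 is known not to be in community 0.
Y_tl, Y_nl = indicator_matrices(5, 3, tl={0: 0}, nl={1: [1, 2], 2: [0]})
```

A TL row carries a single 1, whereas an NL row such as that of node 1 can carry several.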

In this paper, the label propagation task is to propagate the TLs, under the guidance of NL information, to all of the nodes in $V$, accomplishing label prediction for the nodes without a TL. The result of label propagation for community detection depends on the weights of the edges of the network, so how to construct the weight matrix plays a decisive role. In this work, the simple weight matrix $W$ can be defined as

$$W_{ij} = \frac{A_{ij}}{d_i},$$

where $A_{ij}$ is the corresponding element of the adjacency matrix and $d_i$ represents the degree of node $v_i$. Thus, in the label propagation process, labels spread between neighbouring nodes with uniform probability.
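As a minimal sketch of a degree-normalised weight matrix, the snippet below divides each row of the adjacency matrix by the degree of the corresponding node. The row-wise normalisation, the function name `transition_weights`, and the handling of isolated nodes are assumptions of this sketch.

```python
import numpy as np

def transition_weights(adjacency):
    """Return W with W[i, j] = A[i, j] / d_i (row-normalised adjacency)."""
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)
    # Isolated nodes (degree 0) keep an all-zero row instead of dividing by zero.
    safe_degrees = np.where(degrees > 0, degrees, 1.0)
    return A / safe_degrees[:, None]

# Toy path graph 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
W = transition_weights(A)
```

With this choice every row of `W` for a non-isolated node sums to 1, so each node weights its neighbours equally.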

#### 3. The Proposed Algorithm

In this section, the details of our proposed algorithm based on negative information are presented, and then the time complexity and the convergence property of the algorithm are analyzed. The algorithm consists of two main steps. The first is to determine the parameter matrices, and the second is to propagate labels via an iterative process.

##### 3.1. Parameter Matrices Construction

Using the idea of the work by Hou et al. [18], we introduce two matrices, that is, the initial label matrix $Y \in \mathbb{R}^{n \times c}$ and the parameter matrix $\alpha \in \mathbb{R}^{n \times c}$, where $n$ is the number of nodes and $c$ is the number of communities. $Y_{ij}$ represents the probability that $v_i$ belongs to the $j$th community, and $\alpha$ is a matrix that shows the role of each node and indicates when an NL node can be regarded as a TL node and when it is considered as an unlabeled node. We also define two parameters $\alpha_l$ and $\alpha_u$, which take different values for labeled nodes (including TL and NL) and unlabeled nodes.

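For any node $v_i$, the entries $Y_{ij}$ of the initial label matrix and $\alpha_{ij}$ of the parameter matrix are defined as follows.

*(1) If $v_i$ Has the TL.* Based on the indicator matrix of the TL nodes, if $v_i$ belongs to the $j$th community, then $Y_{ij} = 1$, $Y_{ik} = 0$ for $k \neq j$, and $\alpha_{ij} = \alpha_l$.

*(2) If $v_i$ Has the NL.* According to the indicator matrix of the NL nodes, we can define an index set $S_i$, which contains the communities that $v_i$ does not belong to; then, for $j \in S_i$, $Y_{ij} = 0$ and $\alpha_{ij} = \alpha_l$, whereas for $j \notin S_i$ the entry is treated as unlabeled and $\alpha_{ij} = \alpha_u$.

*(3) If $v_i$ Is an Unlabeled Node.* Consider $\alpha_{ij} = \alpha_u$ for all $j$.

How these two matrices are used in the proposed algorithm is explained in the next subsection. Note that $\alpha_l$, the parameter for labeled nodes, is close to 0 and $\alpha_u$, the parameter for unlabeled nodes, is close to 1.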
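The three cases can be sketched in code as follows. This is only one possible reading: the function name `parameter_matrices`, the default values 0.01 and 0.99, the use of 0 as the initial value for every entry without positive information, and the choice to give a TL node the labeled parameter across its whole row are all assumptions of this sketch rather than details confirmed by the text.

```python
import numpy as np

def parameter_matrices(n_nodes, n_comms, tl, nl, alpha_l=0.01, alpha_u=0.99):
    """Build an initial label matrix Y and a parameter matrix alpha.

    tl: {node: community the node belongs to}
    nl: {node: iterable of communities the node does NOT belong to}
    alpha_l / alpha_u: illustrative values, close to 0 and 1 respectively.
    """
    Y = np.zeros((n_nodes, n_comms))              # assumed: 0 where nothing is known
    alpha = np.full((n_nodes, n_comms), alpha_u)  # default: behave as unlabeled
    for node, comm in tl.items():
        Y[node, comm] = 1.0                       # case (1): known community
        alpha[node, :] = alpha_l                  # assumed: whole row is "labeled"
    for node, excluded in nl.items():
        alpha[node, list(excluded)] = alpha_l     # case (2): excluded entries stay at 0
    return Y, alpha

Y, alpha = parameter_matrices(4, 3, tl={0: 1}, nl={1: [2]})
```

Here node 1 acts as a labeled node only for community 2, which it is known not to belong to, and as an unlabeled node for the other two communities.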

##### 3.2. Description of the Algorithm

The algorithm is motivated by the fact that nodes having the same traditional label are grouped together into one community through the label propagation process. We initialize a small number of nodes with user-defined labels based on prior information (including TLs and NLs) and let the TLs propagate through the network. As the labels propagate, the exact labels of the NL and unlabeled nodes are obtained. We now show how to iteratively propagate the TLs, under the guidance of NL information, to the NL and unlabeled nodes.

This process is performed iteratively, where, at every step, each node absorbs some label information from its neighbors and retains some label information of its initial state. Let $F(t) \in \mathbb{R}^{n \times c}$ denote the computed label matrix at iteration $t$, for all $t \geq 0$; its $i$th row vector corresponds to the possibilities of node $v_i$ belonging to each of the communities. The exact label of a node is determined by the index of the largest element of the corresponding row vector of $F$. The iterative formula is defined as follows:

$$F_{ij}(t+1) = \alpha_{ij} \sum_{k=1}^{n} W_{ik} F_{kj}(t) + (1 - \alpha_{ij}) Y_{ij}, \tag{8}$$

where $W$ is the weight matrix, $Y$ is the initial label matrix, $\alpha$ is the parameter matrix, and $t$ denotes the number of iterations. The first term is the label information that $v_i$ absorbs from its neighbors and the second term represents the label information retained from its initial label.

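Specifically, if $v_i$ has a TL which indicates that it belongs to the $j$th community, then $Y_{ij} = 1$ and $\alpha_{ij} = \alpha_l$. In this case $Y_{ik} = 0$ for $k \neq j$ and $\alpha_l$ is close to 0. Thus, the second term in (8) plays a major role in each iteration; that is, the predicted label is consistent with the given TL. If $v_i$ has an NL indicating that $v_i$ does not belong to the $j$th community, then $Y_{ij} = 0$ and $\alpha_{ij} = \alpha_l$. On the contrary, if the NL says nothing about the $j$th community, not much prior information can help to determine whether $v_i$ belongs to the $j$th community or not. Therefore, we regard it as an unlabeled node and set $\alpha_{ij} = \alpha_u$. In this case, the first term in (8) plays a major role in each iteration. If $v_i$ is unlabeled, there is no prior information about its label and $\alpha_{ij} = \alpha_u$ for all $j$. Thus (8) is again dominated by its first term. In summary, the iteration equation can be rewritten column by column as

$$F_{\cdot j}(t+1) = \Lambda_j W F_{\cdot j}(t) + (I - \Lambda_j) Y_{\cdot j},$$

where $F_{\cdot j}$ and $Y_{\cdot j}$ are the $j$th columns of the computed and initial label matrices, $W$ is the weight matrix, $\Lambda_j = \mathrm{diag}(\alpha_{1j}, \ldots, \alpha_{nj})$ collects the parameters of the $j$th community, and $I$ denotes an identity matrix.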
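The propagation step described above can be sketched as a fixed-point iteration in which each entry mixes the neighbour average with the initial label. The update form `alpha * (W @ F) + (1 - alpha) * Y`, the row-normalised weights, the parameter values, the zero initial values for unknown entries, and the stopping rule are assumptions of this sketch; the function name `propagate_labels` and the toy graph are hypothetical.

```python
import numpy as np

def propagate_labels(A, Y, alpha, max_iter=1000, tol=1e-9):
    """Iterate F <- alpha * (W @ F) + (1 - alpha) * Y until F stops changing."""
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    W = A / np.where(degrees > 0, degrees, 1.0)[:, None]  # assumed W_ij = A_ij / d_i
    F = Y.astype(float).copy()
    for _ in range(max_iter):
        # Entry-wise mix of neighbour information and the initial label.
        F_next = alpha * (W @ F) + (1.0 - alpha) * Y
        converged = np.abs(F_next - F).max() < tol
        F = F_next
        if converged:
            break
    # Each node is assigned the community with the largest score in its row.
    return F, F.argmax(axis=1)

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

alpha_l, alpha_u = 0.01, 0.99   # illustrative values only
Y = np.zeros((6, 2))
alpha = np.full((6, 2), alpha_u)
Y[0, 0] = 1.0                   # TL: node 0 is in community 0
alpha[0, :] = alpha_l
Y[5, 1] = 1.0                   # TL: node 5 is in community 1
alpha[5, :] = alpha_l
alpha[3, 0] = alpha_l           # NL: node 3 is not in community 0

F, labels = propagate_labels(A, Y, alpha)
```

On this toy graph the NL entry keeps node 3's score for community 0 near zero, so the label arriving from node 5 decides its assignment.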

To summarize, the main procedure of the method is presented in Algorithm 1.