Mathematical Problems in Engineering

Volume 2016 (2016), Article ID 9850927, 10 pages

http://dx.doi.org/10.1155/2016/9850927

## Semisupervised Community Detection by Voltage Drops

^{1}School of Computer Science and Technology, Liaoning Normal University, Dalian, Liaoning 116081, China

^{2}School of Urban and Environmental Science, Liaoning Normal University, Dalian, Liaoning 116029, China

^{3}College of Business Administration, Dalian University of Finance and Economics, Dalian, Liaoning 116622, China

Received 3 November 2015; Revised 28 December 2015; Accepted 30 March 2016

Academic Editor: Pubudu N. Pathirana

Copyright © 2016 Min Ji et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Semisupervised community detection is an important topic in the study of complex networks and has attracted considerable attention in many applications. In this paper, based on the notion of voltage drops and discrete potential theory, a simple and fast semisupervised community detection algorithm is proposed. Label propagation is accomplished through discrete potential transmission by using voltage drops. The complexity of the proposed algorithm is $O(|V| + |E|)$ for a sparse network with $|V|$ vertices and $|E|$ edges. The voltage value obtained for a vertex clearly reflects the relationship between the vertex and a community. The experimental results on four real networks and three benchmarks indicate that the proposed algorithm is effective and flexible. Furthermore, the algorithm can easily be applied to graph-based machine learning methods.

#### 1. Introduction

From a mathematical point of view, many real-world systems in nature and society can be effectively modeled as complex networks or graphs. Specifically, the entities of the system are represented by the vertices and the interactions between the entities are represented by the edges. Examples include social relationships, the spreading of viruses and diseases, the World Wide Web, author cooperation networks, citation networks, and biochemical networks. It has been shown that many real-world networks have a structure of modules or communities, where the nodes within a community are more densely connected to each other than to the nodes in other communities. Community structures play an important role in the functional properties of complex networks, and finding such a structure can be of significant practical importance.

Identifying community structure in specific networks has considerable practical merit because it gives us insight into the relationship between structure and functionality. In the past decades, plenty of techniques have been proposed to detect the community structure hidden in networks. The more typical algorithms for community detection can be found in [1]. Very recently, Chen et al. [2] defined antimodularity as a quantitative measure of anticommunity partitioning on a network and showed its reliability as a measurement of the quality of an anticommunity partitioning. A vertex similarity probability model that finds community structure without prior knowledge of the type of complex network structure was presented in [3]. By studying the community structure of a Chinese character network, Zhang et al. [4] observed that community structure is one of the most significant features of complex networks and plays an important role in the topology and function of the networks. Palla et al. [5] revealed that complex network models exhibit an overlapping community structure, also called fuzzy community. These complicated structures make it harder to construct appropriate algorithms to uncover them. Along these lines, researchers have made great contributions to community detection [6–10].

The methods mentioned above are unsupervised community detection methods, since only the topological information of the network is used and its background knowledge is ignored. In fact, some prior information is of great value in identifying the community structure. Based on a discussion of the equivalence between the objective functions of symmetric nonnegative matrix factorization and the maximization of modularity density, Ma et al. [11] introduced a semisupervised clustering algorithm for community structure detection. In [12], Silva and Zhao presented a technique for semisupervised classification tasks that uses the modularity measure of complex networks, originally designed for unsupervised learning tasks. Zhang [13, 14] developed a method that implicitly encodes pairwise constraints by modifying the adjacency matrix of the network, which can also be regarded as a denoising process for the consensus matrix of the community structures. A novel semisupervised community detection algorithm based on discrete potential theory was proposed in [15]. It effectively incorporates individual labels, that is, the labels of the corresponding communities, to guide the community detection process and achieve better accuracy. Although these existing semisupervised community detection methods can improve the accuracy of community identification, some of them are limited by high time complexity. Therefore, it is worthwhile to introduce a novel algorithm that identifies community structures in complex networks rapidly.

The application of discrete potential theory to community detection in networks can be traced back to Wu and Huberman’s work [16]. They presented a method that allows for the discovery of communities within graphs of arbitrary size in times that scale linearly with their size. Their method is based on notions of voltage drops across networks that are both intuitive and easy to solve regardless of the complexity of the graph involved. Zhang et al. [17] applied it to directed networks; they presented a new mechanism for the local organization of directed networks and designed a corresponding link prediction algorithm. Wang and Zhang [18] proposed a semisupervised clustering method based on generalized point charge models for text data classification. Liu et al. [15] recently proposed a linear time algorithm to find the communities in a network based on discrete potential theory. As data sets get larger and larger, it remains necessary to develop efficient semisupervised learning methods.

Motivated by Wu and Huberman’s work [16] and Liu et al.’s method [15], in this paper we present a simple and fast semisupervised algorithm for detecting communities in complex networks by discrete potential theory and voltage drops. The complexity of the proposed algorithm is $O(|V| + |E|)$ for a sparse network with $|V|$ vertices and $|E|$ edges. Similar to the membership degree in the fuzzy $c$-means algorithm, the voltage value of each vertex in the network indicates the relationship between this vertex and a community. The main contributions of the proposal are as follows: (1) The proposed algorithm is a simple and fast semisupervised approach to discover community structures in complex networks. (2) The proposal removes the requirement of a positive definite coefficient matrix, which is needed when a linear system is solved by the conjugate gradient descent algorithm. To some extent, this approach remedies the deficiency of Liu et al.’s work [15]. (3) The unsupervised Wu-Huberman algorithm is extended to the semisupervised learning case. The experimental results demonstrate the effectiveness of the proposed algorithm.

#### 2. The Graph and Discrete Potential Method

The graph can be mathematically represented as $G = (V, E)$, where $V$ is the set of vertices and $E$ denotes the set of edges. Generally, the graph can be expressed by its adjacency matrix $A = (a_{ij})$, whose elements $a_{ij}$ are equal to 1 if $v_i$ points to $v_j$ and 0 otherwise. We denote by $d_i = \sum_{j} a_{ij}$ the degree of vertex $v_i$. The degree matrix $D$ is a diagonal matrix containing the vertex degrees of the graph on its diagonal. Then the Laplacian matrix can be defined as

$$L = D - A. \tag{1}$$

Denote by $x_i^s$ the potential of vertex $v_i$ in the electrostatic field generated by vertices with label $s$. Assign the potentials of all labeled vertices with labels other than $s$ to zero and the $s$-labeled vertices to have a unit potential. The process of potential transmission for each electrostatic field is a circuit theory problem and can be modeled by the combinatorial Dirichlet problem [15]. By using the Laplacian matrix $L$, a combinatorial formulation of the Dirichlet integral is in the form [15, 19]

$$\mathcal{D}[x] = \frac{1}{2} x^{T} L x, \tag{2}$$

where $x$ is the vector of potentials of all vertices minimizing (2). Reassigning the order of all vertices of the graph and putting the labeled vertices forward, (2) can be rewritten into

$$\mathcal{D}[x_U] = \frac{1}{2} \begin{bmatrix} x_L^{T} & x_U^{T} \end{bmatrix} \begin{bmatrix} L_L & B \\ B^{T} & L_U \end{bmatrix} \begin{bmatrix} x_L \\ x_U \end{bmatrix} = \frac{1}{2} \left( x_L^{T} L_L x_L + 2 x_U^{T} B^{T} x_L + x_U^{T} L_U x_U \right), \tag{3}$$

where $x_L$ and $x_U$ are two vectors whose elements represent the potentials of labeled vertices and unlabeled vertices, respectively, and $L_L$, $B$, and $L_U$ are the corresponding blocks of $L$. Setting the derivative of $\mathcal{D}[x_U]$ with respect to $x_U$ equal to zero, one can obtain a system of linear equations

$$L_U x_U = -B^{T} x_L, \tag{4}$$

where $x_U$ is a vector with one entry per unlabeled vertex, whose elements are the unknown quantities to be solved for. If the graph is connected, or if every connected component contains a seed, then (4) will be nonsingular.

For each label $s$, a system of linear equations can be established as

$$L_U x_U^{s} = -B^{T} x_L^{s}. \tag{5}$$

If one assigns a unit potential to the labeled vertices with label $s$ and zero to the other labeled vertices, an electrostatic field is generated. The potentials of the unlabeled vertices can be obtained from the solution of (5). By comparing the potentials of each unlabeled vertex over all labels, the vertex is assigned the label for which its potential is greatest. Thus the community structure is detected.
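The labeling procedure described above can be illustrated on a toy graph. The following is a minimal sketch, not the implementation of [15]: it assumes an undirected, unweighted graph, solves the linear system for the unlabeled vertices by plain Gauss-Jordan elimination rather than by conjugate gradients, and uses exact rational arithmetic so the potentials are easy to read. The function names `laplacian`, `solve`, and `dirichlet_potentials`, as well as the four-vertex path graph, are hypothetical choices made for this illustration.

```python
from fractions import Fraction

def laplacian(n, edges):
    """Return L = D - A for an undirected, unweighted graph on vertices 0..n-1."""
    L = [[Fraction(0)] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] -= 1
        L[j][i] -= 1
        L[i][i] += 1
        L[j][j] += 1
    return L

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination (M is assumed nonsingular)."""
    m = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(M, b)]
    for col in range(m):
        pivot = next(r for r in range(col, m) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(m):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def dirichlet_potentials(n, edges, seeds, s):
    """Potentials of the unlabeled vertices for label s.

    seeds maps each labeled vertex to its label. Labeled vertices with
    label s get potential 1, all other labeled vertices get potential 0.
    """
    L = laplacian(n, edges)
    unlabeled = [v for v in range(n) if v not in seeds]
    x_L = {v: Fraction(1 if lab == s else 0) for v, lab in seeds.items()}
    # Block of L restricted to the unlabeled vertices.
    L_U = [[L[i][j] for j in unlabeled] for i in unlabeled]
    # Right-hand side: minus the coupling between unlabeled and labeled vertices.
    rhs = [-sum(L[i][j] * x_L[j] for j in seeds) for i in unlabeled]
    return dict(zip(unlabeled, solve(L_U, rhs)))

# Path graph 0-1-2-3 with vertex 0 labeled 1 and vertex 3 labeled 2.
edges = [(0, 1), (1, 2), (2, 3)]
seeds = {0: 1, 3: 2}
potentials = {s: dirichlet_potentials(4, edges, seeds, s) for s in (1, 2)}
# Each unlabeled vertex takes the label with the greatest potential.
labels = {v: max(potentials, key=lambda s: potentials[s][v]) for v in (1, 2)}
print(potentials[1])  # {1: Fraction(2, 3), 2: Fraction(1, 3)}
print(labels)         # {1: 1, 2: 2}
```

On this path the potentials fall off linearly between the two seeds, so each unlabeled vertex is assigned to the seed it is closer to.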

From the perspective of discrete potential theory, the solution to (5) can be interpreted in terms of circuit theory. Based on the three fundamental equations of circuit theory, namely, Kirchhoff’s Current Law, Ohm’s Law, and Kirchhoff’s Voltage Law, one can also obtain a system equivalent to (5) [15, 19].

In [15], the solutions of (5) are obtained by the conjugate gradient descent algorithm, and a novel semisupervised community detection algorithm is proposed. Several experimental results demonstrate the effectiveness of that approach.

#### 3. The Proposed Algorithm

It should be noted that the coefficient matrix in (5) must be a symmetric positive definite matrix when the nonhomogeneous linear equations (5) are solved by the conjugate gradient descent algorithm. Obviously, the Laplacian matrix is not a positive definite matrix: since every row of the Laplacian matrix sums to zero, $0$ is always one of its eigenvalues, and the corresponding eigenvector is $\mathbf{1} = (1, 1, \ldots, 1)^{T}$. This fact compels us to develop a new method to detect communities in a network while considering the network as an electric circuit. In [16], Wu and Huberman introduced an unsupervised method that solves a system like (5) to discover the communities in a complex network in linear time. Since there is no class information in advance, they employed a bipartition strategy and some ingenious techniques for the case of multiple communities. In this work, we extend their work to the case of semisupervised community detection.
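As a small illustration of the zero row sums, consider a path on three vertices (a hypothetical example chosen only for this check). Its Laplacian matrix and the product with the all-ones vector are

$$L = \begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{bmatrix}, \qquad L \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 - 1 + 0 \\ -1 + 2 - 1 \\ 0 - 1 + 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix},$$

so the all-ones vector is an eigenvector with eigenvalue $0$, and this $L$ is positive semidefinite but not positive definite.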

In what follows, we would like to present a novel method to find community structure in complex networks by the process of voltage transmission.

For a given network, we suppose each edge to be a resistor with the same resistance. One attaches all the labeled vertices with label $s$ to the positive pole of a battery and the other labeled vertices to the negative pole so that they have fixed voltages, say 1 and 0. Based on these assumptions, the network can be viewed as an electric circuit with current flowing through each edge (resistor). By solving the Kirchhoff equations, one can obtain the voltage value of each unlabeled vertex, which of course lies within $[0, 1]$. In this case, the voltage value of each vertex can be thought of as a membership degree similar to that in the FCM algorithm, which clearly reflects the relationship between a vertex and the $s$th community. In turn, we can get $c$ voltages of a vertex, one for each label, if there are $c$ classes. In semisupervised learning methods, it is required that at least one sample be labeled in each class. This indicates that the class parameter $c$ is known in advance.

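Physically, if node $C$ connects to $n$ neighbors $D_1, \ldots, D_n$ in an electric circuit, the Kirchhoff equation [20] tells us that the total current flowing into $C$ should sum up to zero; that is,

$$\sum_{i=1}^{n} I_i = \sum_{i=1}^{n} \frac{U_{D_i} - U_C}{R} = 0, \tag{6}$$

where $I_i$ is the current flowing from $D_i$ to $C$, $U_{D_i}$ is the voltage at neighbor node $D_i$, $U_C$ is the voltage at $C$, and $R$ is the common resistance of the edges.

It is easy to rewrite (6) into the following form:

$$U_C = \frac{1}{n} \sum_{i=1}^{n} U_{D_i}. \tag{7}$$

That is to say, the voltage of a node is the average of the voltages of its neighbors.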
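As a quick numerical check of this averaging rule, consider a hypothetical node with three neighbors held at voltages $1$, $0.5$, and $0$ (values chosen only for illustration), and write $U_C$ for the unknown voltage of the node. Multiplying the current balance by the common edge resistance gives

$$(1 - U_C) + (0.5 - U_C) + (0 - U_C) = 0 \quad\Longrightarrow\quad U_C = \frac{1 + 0.5 + 0}{3} = 0.5,$$

which is exactly the mean of the three neighbor voltages.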

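Suppose the number of communities to be $c$; then the label set is $S = \{1, 2, \ldots, c\}$. In addition, we also assume that there is at least one labeled vertex in each community. Divide the vertex set $V$ into two parts, $V_L$ (labeled vertices), where $s_i \in S$ is the label of vertex $v_i \in V_L$, and $V_U$ (unlabeled vertices), such that $V = V_L \cup V_U$. One also defines the set $V_L^{s} = \{ v_i \in V_L : s_i = s \}$ of labeled vertices with label $s$. Denote by $U_i^{s}$ the voltage of vertex $v_i$ in the electrostatic field generated by vertices with label $s$ and by $N(v_i)$ the set of neighbors of $v_i$.

If we reassign the order of all vertices of the graph and put the labeled vertices forward, with the labeled vertices with label $s$ first, then following (6), one can get the system:

$$U_i^{s} = 1, \quad v_i \in V_L^{s}, \tag{8}$$

$$U_i^{s} = 0, \quad v_i \in V_L \setminus V_L^{s}, \tag{9}$$

$$U_i^{s} = \frac{1}{d_i} \sum_{v_j \in N(v_i)} U_j^{s}, \quad v_i \in V_U, \tag{10}$$

where $d_i$ is the degree of $v_i$. Equation (10) is a linear system with $|V_U|$ variables and can be put into a symmetrical form as follows:

$$d_i U_i^{s} - \sum_{v_j \in V_U} a_{ij} U_j^{s} = \sum_{v_j \in V_L^{s}} a_{ij}, \quad v_i \in V_U, \tag{11}$$

where $a_{ij}$ are the elements of the adjacency matrix.

Define

$$U^{s} = \left( U_i^{s} \right)_{v_i \in V_U}, \tag{12}$$

$$b^{s} = \Bigl( \sum_{v_j \in V_L^{s}} a_{ij} \Bigr)_{v_i \in V_U}, \tag{13}$$

and let $L_U$ be the block of the Laplacian matrix indexed by the unlabeled vertices; then the matrix form of the Kirchhoff equation is

$$L_U U^{s} = b^{s}, \tag{14}$$

which has the solution

$$U^{s} = L_U^{-1} b^{s}. \tag{15}$$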

Generally, solving this system directly takes time cubic in the number of unknowns. The Wu-Huberman algorithm [16] skillfully avoids this difficulty by solving (8)–(10) iteratively for the case of two poles, one held at unit voltage and the other at zero. This method naturally resembles a semisupervised learning method, and we now extend it to the semisupervised learning case.

Specifically, we first set $U_i^{s} = 1$ for $v_i \in V_L^{s}$ and $U_i^{s} = 0$ for $v_i \in V \setminus V_L^{s}$.

Starting from the first unlabeled vertex, one consecutively updates the voltage of each $v_i \in V_U$ to

$$U_i^{s} = \frac{1}{d_i} \sum_{v_j \in N(v_i)} U_j^{s}, \tag{16}$$

where $d_i$ is the degree of $v_i$ and $N(v_i)$ is its set of neighbors. The updating process follows a breadth-first search and ends when the voltages of all vertices in $V_U$ have been updated. This process is called a round. One spends $O(d_i)$ time calculating the average neighbor voltage of vertex $v_i$ and $O(|V|)$ time setting the initial voltages; therefore the complexity of one round is $O(|V| + |E|)$. After repeating the updating process for a finite number of rounds, one reaches an approximate solution of (14) within a certain precision, which depends only on the number of iteration rounds.
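The round-based updating process can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' Java implementation: the graph is undirected and given as an adjacency dictionary, the breadth-first order is started from the labeled vertices (the text does not fix the starting point precisely), and the number of rounds is a free parameter. The names `bfs_order` and `propagate_voltages` are hypothetical.

```python
from collections import deque

def bfs_order(adj, sources):
    """Visit order of a breadth-first search started from all source vertices."""
    seen = set(sources)
    queue = deque(sources)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order

def propagate_voltages(adj, seeds, s, rounds=100):
    """Approximate the voltages for label s by repeated neighbor averaging.

    adj   : dict mapping each vertex to a list of its neighbors
    seeds : dict mapping each labeled vertex to its label
    """
    # Initial voltages: 1 for vertices labeled s, 0 for every other vertex.
    voltage = {v: (1.0 if seeds.get(v) == s else 0.0) for v in adj}
    # Unlabeled vertices in breadth-first order (assumption: BFS starts
    # from the labeled vertices). Labeled vertices keep their fixed voltage.
    order = [v for v in bfs_order(adj, sorted(seeds)) if v not in seeds]
    for _ in range(rounds):  # one pass over `order` is one round
        for v in order:
            # Each update costs time proportional to the degree of v.
            voltage[v] = sum(voltage[w] for w in adj[v]) / len(adj[v])
    return voltage

# Path graph 0-1-2-3 with vertex 0 labeled 1 and vertex 3 labeled 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
seeds = {0: 1, 3: 2}
voltages = propagate_voltages(adj, seeds, s=1)
```

On this path the iteration converges to voltages of 2/3 and 1/3 at the two unlabeled vertices, the same values a direct solution of the linear system would give. Because each update reuses voltages already refreshed in the current round, this is a Gauss-Seidel style sweep.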

Unlike Wu and Huberman’s method [20], we do not need to compute the ideal voltage gap or to know roughly the size of each community. As a result, for a given label $s$ we get a voltage vector with one component per vertex. The component $U_i^{s}$ reflects the relationship between vertex $v_i$ and the $s$th community. We repeat this process for each label $s$ in the label set $S = \{1, 2, \ldots, c\}$. Therefore, for each vertex $v_i$, we obtain a voltage vector $(U_i^{1}, U_i^{2}, \ldots, U_i^{c})$. The element $U_i^{s}$ can be considered as the membership degree with which vertex $v_i$ belongs to the $s$th community. The vertex $v_i$ is assigned to the $s^{*}$th community if $s^{*} = \arg\max_{s \in S} U_i^{s}$. That is to say, the largest voltage of each vertex indicates the community to which the vertex should belong.
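The assignment step can be made concrete with a short sketch. It assumes an undirected graph in which every unlabeled vertex has at least one neighbor, sweeps the vertices in dictionary order rather than breadth-first order for brevity, and breaks ties in favor of the first label. The names `voltages_for_label` and `detect_communities` and the two-triangle test graph are hypothetical.

```python
def voltages_for_label(adj, seeds, s, rounds=200):
    """Voltages of all vertices when label s is held at 1 and other labels at 0."""
    voltage = {v: (1.0 if seeds.get(v) == s else 0.0) for v in adj}
    for _ in range(rounds):
        for v in adj:
            if v not in seeds:  # labeled vertices keep their fixed voltage
                voltage[v] = sum(voltage[w] for w in adj[v]) / len(adj[v])
    return voltage

def detect_communities(adj, seeds, rounds=200):
    """Return (membership, assignment) for every vertex of the graph."""
    labels = sorted(set(seeds.values()))
    per_label = {s: voltages_for_label(adj, seeds, s, rounds) for s in labels}
    # Voltage vector of each vertex, one component per label.
    membership = {v: [per_label[s][v] for s in labels] for v in adj}
    # The largest component decides the community (ties go to the first label).
    assignment = {
        v: labels[max(range(len(labels)), key=lambda k: membership[v][k])]
        for v in adj
    }
    return membership, assignment

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3,
# with one labeled vertex in each triangle.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
seeds = {0: "a", 5: "b"}
membership, assignment = detect_communities(adj, seeds)
print(assignment)  # {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'}
```

For vertex 2 the voltage vector is approximately (5/7, 2/7), so the vertex belongs mainly to community "a" while the smaller component still records its link to the other triangle.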

#### 4. Experiments

To validate the proposed algorithm, we test it on four real networks and three benchmarks that are widely used to test the validity of community division methods. The experimental platform is Windows 7 Ultimate Service Pack 1 (64-bit) with an Intel® Core™ i5-3470 CPU at 3.20 GHz, 4.00 GB of memory, and Java 1.8 with Eclipse RCP Luna SR1.

##### 4.1. Three Evaluation Indices of Clustering

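To assess the quality of a partition, we use the $F$-measure, the purity ($P$-measure), and the modularity to quantify the clustering results. The $F$-measure is a harmonic combination of the precision and recall values used in information retrieval [21].

If $n_i$ is the number of members of class $i$, $n_j$ is the number of members of cluster $j$, and $n_{ij}$ is the number of members of class $i$ in cluster $j$, then the precision $P(i, j)$ and recall $R(i, j)$ can be defined as

$$P(i, j) = \frac{n_{ij}}{n_j}, \qquad R(i, j) = \frac{n_{ij}}{n_i}.$$

The $F$-measure of class $i$ with respect to cluster $j$ is denoted by

$$F(i, j) = \frac{2 P(i, j) R(i, j)}{P(i, j) + R(i, j)}.$$

The corresponding $F$-measure ($F$) of the whole clustering result is defined as

$$F = \sum_{i} \frac{n_i}{n} \max_{j} F(i, j),$$

where $n$ is the total number of members in the data set.

In general, a higher value of the $F$-measure indicates a better clustering result.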
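The computation of the overall $F$-measure can be sketched as follows. This is a minimal illustration that assumes the usual class-weighted definition, in which each class contributes its best $F$ value over all clusters; the function name `f_measure` and the six-item example are hypothetical.

```python
from collections import Counter

def f_measure(classes, clusters):
    """Overall F-measure of a clustering against reference classes.

    classes[k] and clusters[k] are the class and the cluster of item k.
    """
    n = len(classes)
    n_i = Counter(classes)                  # class sizes
    n_j = Counter(clusters)                 # cluster sizes
    n_ij = Counter(zip(classes, clusters))  # class/cluster overlaps
    total = 0.0
    for i in n_i:
        best = 0.0
        for j in n_j:
            overlap = n_ij[(i, j)]
            if overlap == 0:
                continue  # precision and recall are both zero here
            precision = overlap / n_j[j]
            recall = overlap / n_i[i]
            best = max(best, 2 * precision * recall / (precision + recall))
        total += n_i[i] / n * best  # weight each class by its size
    return total

# Six items in two classes; one item of class 0 lands in the wrong cluster.
classes = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 1]
score = f_measure(classes, clusters)
```

In this example class 0 is best matched by cluster 0 with $F = 0.8$ and class 1 by cluster 1 with $F = 6/7$, so the overall value is about 0.829; a perfect clustering gives 1.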

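The purity of a cluster represents the fraction of the cluster corresponding to the largest class of data assigned to that cluster; thus, with $n_j$ the number of members of cluster $j$ and $n_{ij}$ the number of members of class $i$ in cluster $j$, the purity of cluster $j$ is defined as

$$P(j) = \frac{1}{n_j} \max_{i} n_{ij}.$$

The purity of the whole clustering result is defined as

$$P = \sum_{j} \frac{n_j}{n} P(j),$$

where $n$ is the total number of members in the data set. In general, the larger the purity value is, the better the clustering result is.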
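A minimal sketch of the purity computation is given below. It assumes the size-weighted average of cluster purities, which simplifies to the sum of the largest class count in each cluster divided by the total number of items; the function name `purity` and the example data are hypothetical.

```python
from collections import Counter, defaultdict

def purity(classes, clusters):
    """Purity of a clustering: size-weighted average of the cluster purities."""
    n = len(classes)
    by_cluster = defaultdict(Counter)
    for cls, clu in zip(classes, clusters):
        by_cluster[clu][cls] += 1  # count class members inside each cluster
    # Weighting each cluster purity by cluster size cancels the cluster size,
    # leaving the sum of the dominant class counts divided by n.
    return sum(max(counts.values()) for counts in by_cluster.values()) / n

classes = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 1]
value = purity(classes, clusters)
```

Here cluster 0 is pure (2 of 2 items) and cluster 1 holds 3 items of its dominant class out of 4, so the overall purity is 5/6.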

In order to quantify the validity of a community division of a complex network and to optimize the chosen splitting, we use, following [22], the concept of modularity. It is defined as follows: given a network division, let $e_{ij}$ be the fraction of edges in the network that connect vertices in group $i$ to those in group $j$, and let $a_i = \sum_{j} e_{ij}$. Then the modularity is defined as

$$Q = \sum_{i} \left( e_{ii} - a_i^{2} \right).$$

It measures the fraction of edges that fall within communities minus the expected value of the same quantity in a random graph with the same community division. A larger $Q$ corresponds to a stronger community structure.
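The modularity of a given division can be computed directly from an edge list. The sketch below assumes an undirected, unweighted graph without self-loops and splits each edge between two different groups evenly between the two groups when accumulating the group totals; the function name `modularity` and the two-triangle graph are hypothetical.

```python
def modularity(edges, community):
    """Modularity of a division: sum over groups of (internal fraction - a_i^2).

    edges     : list of undirected edges (u, v)
    community : dict mapping each vertex to its group
    """
    m = len(edges)
    e_in = {}  # fraction of edges with both ends inside group i
    a = {}     # fraction of edge ends attached to group i
    for u, v in edges:
        cu, cv = community[u], community[v]
        a[cu] = a.get(cu, 0.0) + 0.5 / m
        a[cv] = a.get(cv, 0.0) + 0.5 / m
        if cu == cv:
            e_in[cu] = e_in.get(cu, 0.0) + 1.0 / m
    return sum(e_in.get(c, 0.0) - a[c] ** 2 for c in a)

# Two triangles joined by a single edge, divided into the two triangles.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, community)
```

For this division each group holds 3 of the 7 edges internally and half of all edge ends, giving $Q = 2(3/7 - 1/4) = 5/14 \approx 0.357$; placing all vertices in a single group gives $Q = 0$.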

##### 4.2. Experiment on Four Real Networks

Testing an algorithm essentially means analyzing a network with a well-defined community structure and recovering its communities. In this subsection, four classical complex networks with known community structures are selected to test the proposed algorithm. Descriptions of these four networks can be found in [1, 11, 16, 23]. Taking the Zachary Karate Club network with two communities as an example, we first randomly choose one node in each community and label it. The algorithm is then run on this network and a community division is detected. The values of $F$, $P$, and $Q$ can be computed according to the obtained partition. The community division may change with a different selection of initial labeled nodes. To evaluate the validity of the proposal objectively, we calculate the average values of the three indices over 10 randomly chosen groups of initial labeled nodes. In the same way, we also compute the three index values while increasing the number of labeled nodes in each community. In Table 1, we list the average values of the three indices over 10 randomly selected groups of labeled nodes, where the label number (the number of labeled nodes in each community) varies from 1 to 10.