Abstract

Semisupervised community detection is an important topic in the study of complex networks and has attracted considerable attention across many applications. In this paper, based on the notions of voltage drops and discrete potential theory, a simple and fast semisupervised community detection algorithm is proposed. Label propagation is accomplished through discrete potential transmission driven by voltage drops. The complexity of the proposal is $O(|V| + |E|)$ for a sparse network with $|V|$ vertices and $|E|$ edges. The voltage value obtained for a vertex clearly reflects the relationship between that vertex and a community. Experimental results on four real networks and three benchmarks indicate that the proposed algorithm is effective and flexible. Furthermore, the algorithm can easily be applied to graph-based machine learning methods.

1. Introduction

From a mathematical point of view, many real-world systems in nature and society can be effectively modeled as complex networks or graphs. Specifically, the entities of the system are represented by vertices and the interactions between entities are represented by edges. Examples include social relationships, the spreading of viruses and diseases, the World Wide Web, author cooperation networks, citation networks, and biochemical networks. It has been shown that many real-world networks have a structure of modules or communities, where the nodes within a community are more densely connected to each other than to nodes in other communities. Community structure plays an important role in the functional properties of a complex network, and finding such a structure can be of significant practical importance.

Identifying community structure in specific networks has considerable practical merit because it gives us insight into the structure-functionality relationship. In the past decades, plenty of techniques have been proposed to detect the community structure hidden in networks. The more typical algorithms for community detection can be found in [1]. Very recently, Chen et al. [2] defined antimodularity as a quantitative measure of anticommunity partitioning on a network and showed its reliability as a measure of the quality of an anticommunity partitioning. A vertex similarity probability model that finds community structure without prior knowledge of the type of complex network structure was presented in [3]. By studying the community structure of a Chinese character network, Zhang et al. [4] found that community structure is one of the most significant features of complex networks and plays an important role in their topology and function. Palla et al. [5] revealed that complex network models exhibit overlapping community structure, also called fuzzy community structure. These complicated structures make it harder to construct algorithms that uncover them. Along this line, researchers have made great contributions to community detection [6–10].

The methods mentioned above are unsupervised community detection methods, since only the topological information of the network is used and background knowledge is ignored. In fact, prior information can be of great value in identifying the community structure. Based on the equivalence between the objective functions of symmetric nonnegative matrix factorization and the maximization of modularity density, Ma et al. [11] introduced a semisupervised clustering algorithm for community structure detection. In [12], Silva and Zhao presented a technique for semisupervised classification tasks that uses the modularity measure of complex networks, originally designed for unsupervised learning tasks. Zhang [13, 14] developed a method that implicitly encodes pairwise constraints by modifying the adjacency matrix of the network, which can also be regarded as a denoising process on the consensus matrix of the community structures. A novel semisupervised community detection algorithm based on discrete potential theory was proposed in [15]. It effectively incorporates individual labels, that is, the labels of the corresponding communities, to guide the community detection process and achieve better accuracy. Although these existing semisupervised community detection methods can improve community identification accuracy, some of them are limited by high time complexity. Therefore, it is worthwhile to introduce a novel algorithm that identifies community structures in complex networks rapidly.

The application of discrete potential theory to community detection in networks can be traced back to Wu and Huberman’s work [16]. They presented a method that allows for the discovery of communities within graphs of arbitrary size in times that scale linearly with their size. Their method is based on notions of voltage drops across networks that are both intuitive and easy to solve regardless of the complexity of the graph involved. Zhang et al. [17] applied it to directed networks; they presented a new mechanism for the local organization of directed networks and designed a corresponding link prediction algorithm. Wang and Zhang [18] proposed a semisupervised clustering method based on generalized point charge models for text data classification. Liu et al. [15] recently proposed a linear time algorithm for finding communities in networks based on discrete potential theory. As data sets grow larger and larger, it remains necessary to develop efficient semisupervised learning methods.

Motivated by Wu and Huberman’s work [16] and Liu’s method [15], in this paper, we present a simple and fast semisupervised algorithm for detecting communities in complex networks by means of discrete potential theory and voltage drops. The complexity of the proposed algorithm is $O(|V| + |E|)$ for a sparse network with $|V|$ vertices and $|E|$ edges. Similar to the membership degree in the fuzzy $c$-means algorithm, the voltage value of each vertex in the network reflects the relationship between this vertex and a community. The main contributions of the proposal are as follows: (1) The proposed algorithm is a simple and fast semisupervised approach to discovering community structures in complex networks. (2) The proposal removes the requirement of a positive definite matrix, which is needed when a linear system is solved by the conjugate gradient descent algorithm. To some extent, this approach remedies the deficiency of Liu et al.’s work [15]. (3) The unsupervised Wu-Huberman algorithm is extended to the semisupervised learning case. The experimental results demonstrate the effectiveness of the proposed algorithm.

2. The Graph and Discrete Potential Method

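The graph can be mathematically represented as $G = (V, E)$, where $V$ is the set of vertices and $E$ denotes the set of edges. Generally, the graph can be expressed by its adjacency matrix $A = (a_{ij})$, whose elements $a_{ij}$ are equal to 1 if $v_i$ points to $v_j$ and 0 otherwise. We denote $d_i$ as the degree of vertex $v_i$. The degree matrix $D$ is a diagonal matrix containing the vertex degrees of the graph on its diagonal. Then the Laplacian matrix can be defined as

$$L = D - A. \tag{1}$$

Denote $x_i^s$ as the potential of vertex $v_i$ in the electrostatic field generated by vertices with label $s$. Assign the potentials of all labeled vertices with labels other than $s$ to zero and the $s$-labeled vertices to have a unit potential. The process of potential transmission for each electrostatic field is a circuit theory problem and can be modeled by the combinatorial Dirichlet problem [15]. By using the Laplacian matrix $L$, a combinatorial formulation of the Dirichlet integral is in the form [15, 19]

$$\mathcal{D}[x] = \frac{1}{2} x^{T} L x, \tag{2}$$

where $x$ is the vector of potentials of all vertices minimizing (2). Reassigning the order of all vertices of the graph and putting the labeled vertices forward, (2) can be rewritten into

$$\mathcal{D}[x_U] = \frac{1}{2} \begin{bmatrix} x_L^{T} & x_U^{T} \end{bmatrix} \begin{bmatrix} L_L & L_{LU} \\ L_{LU}^{T} & L_U \end{bmatrix} \begin{bmatrix} x_L \\ x_U \end{bmatrix} = \frac{1}{2} \left( x_L^{T} L_L x_L + 2 x_U^{T} L_{LU}^{T} x_L + x_U^{T} L_U x_U \right), \tag{3}$$

where $x_L$ and $x_U$ are two vectors whose elements represent the potentials of labeled vertices and unlabeled vertices, respectively, and $L_L$, $L_{LU}$, and $L_U$ are the corresponding blocks of $L$. Setting the derivative of $\mathcal{D}[x_U]$ with respect to $x_U$ equal to zero, one can obtain a system of linear equations

$$L_U x_U = -L_{LU}^{T} x_L, \tag{4}$$

where $x_U$ is a vector, with one entry per unlabeled vertex, whose elements are the unknown quantities to be solved for. If the graph is connected, or if every connected component contains a seed, then (4) will be nonsingular.

For each label $s$, a system of linear equations can be established as

$$L_U x_U^{s} = -L_{LU}^{T} x_L^{s}, \tag{5}$$

where the entries of $x_L^{s}$ are 1 for the labeled vertices with label $s$ and 0 for the other labeled vertices.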

If one assigns a unit potential to the labeled vertices with label $s$ and zero to the other labeled vertices, an electrostatic field is generated. The potentials of the unlabeled vertices are then obtained as the solution of (5). By comparing the potentials of each unlabeled vertex across all labels, the vertex is assigned the label for which its potential is greatest. Thus the community structure is detected.
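As an illustration of how system (5) yields labels, the following minimal Python sketch builds the Laplacian block for the unlabeled vertices of a small undirected graph, solves one linear system per label, and assigns each unlabeled vertex the label with the greatest potential. The function names `solve_linear` and `dirichlet_labels`, the toy graph, and the use of exact Gauss-Jordan elimination (rather than the conjugate gradient method used in [15]) are assumptions made for this example only.

```python
from fractions import Fraction


def solve_linear(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A must be nonsingular)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [row[n] for row in M]


def dirichlet_labels(edges, n, seeds):
    """seeds maps a labeled vertex to its label; vertices are 0..n-1 (hypothetical layout)."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    deg = [sum(row) for row in adj]
    unlabeled = [v for v in range(n) if v not in seeds]
    # Laplacian block restricted to the unlabeled vertices.
    L_U = [[(deg[i] if i == j else 0) - adj[i][j] for j in unlabeled]
           for i in unlabeled]
    potentials = {}
    for s in sorted(set(seeds.values())):
        # Right-hand side of (5): number of neighbours that carry label s.
        rhs = [sum(adj[i][j] for j, lab in seeds.items() if lab == s)
               for i in unlabeled]
        potentials[s] = dict(zip(unlabeled, solve_linear(L_U, rhs)))
    labels = dict(seeds)
    for v in unlabeled:
        labels[v] = max(potentials, key=lambda s: potentials[s][v])
    return labels, potentials


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
toy_labels, toy_potentials = dirichlet_labels(toy_edges, 6, {0: "a", 5: "b"})
print(toy_labels)
```

With one seed in each triangle, every vertex of a triangle receives the label of that triangle's seed. Exact rational arithmetic is used here only to keep the toy example easy to check; it would not scale to large networks.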

From the perspective of discrete potential theory, the solution to (5) can be interpreted in terms of circuit theory. Based on the three fundamental equations of circuit theory, namely, Kirchhoff’s Current Law, Ohm’s Law, and Kirchhoff’s Voltage Law, one can also obtain a system equivalent to (5) [15, 19].

In [15], the solutions of (5) were obtained by the conjugate gradient descent algorithm, and a novel semisupervised community detection algorithm was proposed. Several experimental results demonstrate the effectiveness of that approach.

3. The Proposed Algorithm

It should be noted that the coefficient matrix in (5) must be symmetric positive definite when the nonhomogeneous linear equations (5) are solved by the conjugate gradient descent algorithm. The Laplacian matrix $L$, however, is not positive definite: since every row of the Laplacian matrix sums to zero, $0$ is always an eigenvalue, with corresponding eigenvector $(1, 1, \dots, 1)^{T}$. This fact compels us to develop a new method for detecting communities while considering the network as an electric circuit. In [16], Wu and Huberman introduced an unsupervised method that solves a system like (5) to discover the communities in a complex network in linear time. Since no class information is available in advance, they employed a bipartition strategy together with several additional techniques to handle the case of multiple communities. In this work, we extend their work to the case of semisupervised community detection.

In what follows, we would like to present a novel method to find community structure in complex networks by the process of voltage transmission.

For a given network, we suppose each edge to be a resistor with the same resistance. One attaches all the labeled vertices with label $s$ to the positive pole of a battery and the other labeled vertices to the negative pole so that they have fixed voltages, say 1 and 0. Based on these assumptions, the network can be viewed as an electric circuit with current flowing through each edge (resistor). By solving the Kirchhoff equations, one can obtain the voltage value of each unlabeled vertex, which of course should lie within $[0, 1]$. In this case, the voltage value of each vertex can be thought of as a membership degree similar to that in the FCM algorithm, which clearly reflects the relationship between a vertex and the $s$th community. In turn, we can get $k$ voltages of a vertex for the different labels if there are $k$ classes. In semisupervised learning methods, it is required that at least one sample be labeled in each class. This indicates that the class parameter $k$ is known in advance.

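Physically, if node $C$ connects to $n$ neighbors $D_1, \dots, D_n$ in an electric circuit, the Kirchhoff equation [20] tells us that the total current flowing into $C$ should sum up to zero; that is,

$$\sum_{i=1}^{n} I_i = \sum_{i=1}^{n} \frac{V_{D_i} - V_C}{R} = 0, \tag{6}$$

where $I_i$ is the current flowing from $D_i$ to $C$, $V_{D_i}$ is the voltage at neighbor node $D_i$, $V_C$ is the voltage at $C$, and $R$ is the common resistance of the edges.

It is easy to rewrite (6) into the following form:

$$V_C = \frac{1}{n} \sum_{i=1}^{n} V_{D_i}. \tag{7}$$

That is to say, the voltage of a node is the average of the voltages of its neighbors.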

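Suppose the number of communities to be $k$; then the label set is $\{1, 2, \dots, k\}$. In addition, we also assume that there must be at least one labeled vertex in each community. Divide the vertex set $V$, with $n = |V|$, into two parts, $V_L = \{v_1, \dots, v_m\}$ (labeled vertices), where $y_i$ is the label of vertex $v_i$, and $V_U = \{v_{m+1}, \dots, v_n\}$ (unlabeled vertices), such that $V = V_L \cup V_U$. One also defines the set $S_s = \{v_i \in V_L : y_i = s\}$ of vertices with label $s$. Denote $V_i^s$ as the voltage of vertex $v_i$ in the electrostatic field generated by vertices with label $s$ and $N(v_i)$ as the set of neighbors of $v_i$.

If we reassign the order of all vertices of the graph and put the labeled vertices forward and the labeled vertices with label $s$ first, so that $S_s = \{v_1, \dots, v_p\}$, then, following (6), one can get the system:

$$V_i^s = 1, \quad i = 1, \dots, p, \tag{8}$$

$$V_i^s = 0, \quad i = p+1, \dots, m, \tag{9}$$

$$V_i^s = \frac{1}{d_i} \sum_{v_j \in N(v_i)} V_j^s = \frac{1}{d_i} \sum_{j=1}^{n} a_{ij} V_j^s, \quad i = m+1, \dots, n. \tag{10}$$

Equation (10) is a linear system with $n - m$ variables and can be put into a symmetrical form as follows:

$$V_i^s = \frac{1}{d_i} \sum_{j=m+1}^{n} a_{ij} V_j^s + \frac{1}{d_i} \sum_{j=1}^{p} a_{ij}, \quad i = m+1, \dots, n. \tag{11}$$

Define

$$\mathbf{v}^s = \begin{bmatrix} V_{m+1}^s \\ \vdots \\ V_n^s \end{bmatrix}, \quad B = \begin{bmatrix} \frac{a_{m+1,m+1}}{d_{m+1}} & \cdots & \frac{a_{m+1,n}}{d_{m+1}} \\ \vdots & \ddots & \vdots \\ \frac{a_{n,m+1}}{d_n} & \cdots & \frac{a_{n,n}}{d_n} \end{bmatrix}, \quad \mathbf{c}^s = \begin{bmatrix} \frac{1}{d_{m+1}} \sum_{j=1}^{p} a_{m+1,j} \\ \vdots \\ \frac{1}{d_n} \sum_{j=1}^{p} a_{n,j} \end{bmatrix}, \tag{12}$$

and then the matrix form of the Kirchhoff equation is

$$\mathbf{v}^s = B \mathbf{v}^s + \mathbf{c}^s, \tag{13}$$

which has the solution

$$\mathbf{v}^s = (I - B)^{-1} \mathbf{c}^s. \tag{14}$$

Generally, it takes $O(n^3)$ time to solve this system directly. The Wu-Huberman algorithm [16] skillfully avoids this difficulty by solving (8)–(10) iteratively for the case of two poles with fixed voltages 1 and 0. Because the poles play the role of labeled vertices, the method lends itself naturally to semisupervised learning, and we now extend it to that case.

Specifically, we first set $V_i^s = 1$ for $v_i \in S_s$ and $V_i^s = 0$ for $v_i \notin S_s$.

Starting from $v_{m+1}$, one consecutively updates the voltages of $v_{m+1}, \dots, v_n$ to

$$V_i^s = \frac{1}{d_i} \sum_{v_j \in N(v_i)} V_j^s. \tag{15}$$

The updating process visits the vertices in breadth-first search order and ends when the voltages of all vertices in $V_U$ have been updated. This process is called a round. One spends $O(d_i)$ time calculating the average neighbor voltage of vertex $v_i$ and $O(|V|)$ time setting the initial voltages; therefore the complexity of one round is $O(|V| + |E|)$. After repeating the updating process for a finite number of rounds, one reaches an approximate solution of (14) within a precision that depends only on the number of iteration rounds.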

Unlike Wu and Huberman’s method [20], we need neither to compute the ideal voltage gap nor to know roughly the size of each community. As a result, for a given label $s$ we get a voltage vector with one component per vertex; the component $V_i^s$ reflects the relationship between vertex $v_i$ and the $s$th community. We repeat this process for each label $s$ in the label set. Therefore, for each vertex $v_i$, we obtain a voltage vector $(V_i^1, V_i^2, \dots, V_i^k)$, where $k$ is the number of communities. The element $V_i^s$ can be considered as the membership degree with which vertex $v_i$ belongs to the $s$th community. The vertex $v_i$ is assigned to the $t$th community if $V_i^t = \max_{1 \le s \le k} V_i^s$. That is to say, the largest voltage of each vertex indicates the community to which the vertex belongs.
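The procedure described in this section can be sketched in a few lines of Python. The sketch below is only one possible reading of the text: it fixes the voltages of the labeled vertices, sweeps the unlabeled vertices in breadth-first order starting from the labeled ones, replaces each voltage by the average of its neighbors' voltages, repeats this for a fixed number of rounds, and finally assigns each vertex to the label with the largest voltage. The names `bfs_order` and `voltage_communities`, the `rounds` parameter, and the toy graph are assumptions introduced for illustration.

```python
from collections import deque


def bfs_order(adj, sources, skip):
    """Vertices reachable from `sources`, in BFS order, excluding those in `skip`."""
    seen = set(sources)
    queue = deque(sources)
    order = []
    while queue:
        v = queue.popleft()
        if v not in skip:
            order.append(v)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order


def voltage_communities(adj, seeds, rounds=100):
    """adj: vertex -> list of neighbours; seeds: labeled vertex -> label."""
    labels = sorted(set(seeds.values()))
    # Only unlabeled vertices are updated; labeled vertices keep fixed voltages.
    order = bfs_order(adj, sorted(seeds), set(seeds))
    voltages = {}
    for s in labels:
        # Initial voltages: 1 on vertices labeled s, 0 everywhere else.
        volt = {v: 1.0 if seeds.get(v) == s else 0.0 for v in adj}
        for _ in range(rounds):
            # One round: each unlabeled vertex takes the mean of its neighbours.
            for v in order:
                volt[v] = sum(volt[w] for w in adj[v]) / len(adj[v])
        voltages[s] = volt
    assignment = dict(seeds)
    for v in order:
        assignment[v] = max(labels, key=lambda s: voltages[s][v])
    return assignment, voltages


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
toy_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
toy_assignment, toy_voltages = voltage_communities(toy_adj, {0: "a", 5: "b"})
print(toy_assignment)
```

Each round touches every edge a constant number of times, which matches the linear per-round cost stated above. Unlabeled vertices that cannot be reached from any labeled vertex are left unassigned in this sketch.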

4. Experiments

To validate the proposed algorithm, we test it on four real networks and three benchmarks that are widely used to test the validity of community division methods. The experimental platform is Windows 7 Ultimate Service Pack 1 (x64 operating system) with an Intel® Core i5-3470 CPU at 3.20 GHz and 4.00 GB of memory, using Java 1.8 and Eclipse RCP Luna SR1.

4.1. Three Evaluation Indices of Clustering

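To assess the quality of a partition, we use the $F$-measure, the purity, and the modularity to quantify the clustering results. The $F$-measure is a harmonic combination of the precision and recall values used in information retrieval [21].

If $n_i$ is the number of members of class $i$, $n_j$ is the number of members of cluster $j$, and $n_{ij}$ is the number of members of class $i$ in cluster $j$, then the precision $P(i, j)$ and recall $R(i, j)$ can be defined as

$$P(i, j) = \frac{n_{ij}}{n_j}, \qquad R(i, j) = \frac{n_{ij}}{n_i}.$$

The $F$-measure of class $i$ with respect to cluster $j$, $F(i, j)$, is given by

$$F(i, j) = \frac{2 P(i, j) R(i, j)}{P(i, j) + R(i, j)}.$$

The corresponding $F$-measure ($F$) of the whole clustering result is defined as

$$F = \sum_{i} \frac{n_i}{n} \max_{j} F(i, j),$$

where $n$ is the total number of members in the data set.

In general, a higher value of the $F$-measure indicates a better clustering result.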
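To make the computation concrete, here is a small Python sketch of the class-weighted $F$-measure in its standard information-retrieval form: for each class, take the best harmonic mean of precision and recall over all clusters, then average with weights proportional to class size. The function name `f_measure` and the toy labels are assumptions for illustration.

```python
from collections import Counter


def f_measure(classes, clusters):
    """classes[i] is the true class of item i, clusters[i] its assigned cluster."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(classes, clusters))
    total = 0.0
    for i, size_i in class_sizes.items():
        best = 0.0
        for j, size_j in cluster_sizes.items():
            n_ij = overlap[(i, j)]
            if n_ij == 0:
                continue
            precision = n_ij / size_j
            recall = n_ij / size_i
            best = max(best, 2 * precision * recall / (precision + recall))
        # Weight the best score of class i by its share of the data set.
        total += size_i / n * best
    return total


toy_classes = [0, 0, 0, 1, 1, 1]
toy_clusters = ["x", "x", "y", "y", "y", "y"]
print(f_measure(toy_classes, toy_clusters))
```

For the toy input, class 0 is matched best by cluster `x` and class 1 by cluster `y`; a clustering that reproduces the classes exactly scores 1.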

The purity of a cluster represents the fraction of the cluster corresponding to the largest class of data assigned to that cluster; thus the purity of cluster $j$ is defined as

$$P_j = \frac{1}{n_j} \max_{i} n_{ij}.$$

The purity of the whole clustering result is defined as

$$P = \sum_{j} \frac{n_j}{n} P_j.$$

In general, the larger the purity value is, the better the clustering result is.
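Purity as described above, a size-weighted average over clusters of the share taken by each cluster's largest class, reduces to counting the dominant class in each cluster. A minimal Python sketch follows; the function name `purity` and the toy labels are assumptions for illustration.

```python
from collections import Counter


def purity(classes, clusters):
    """classes[i] is the true class of item i, clusters[i] its assigned cluster."""
    n = len(classes)
    overlap = Counter(zip(classes, clusters))
    dominant = Counter()
    for (_, j), n_ij in overlap.items():
        # Keep the size of the largest class found in cluster j.
        dominant[j] = max(dominant[j], n_ij)
    # Weighting each cluster's purity by n_j / n cancels the 1 / n_j factor.
    return sum(dominant.values()) / n


print(purity([0, 0, 0, 1, 1, 1], ["x", "x", "y", "y", "y", "y"]))
```

In the toy input, cluster `x` is pure (2 of 2 items from class 0) and cluster `y` holds 3 of 4 items from class 1, so 5 of the 6 items fall in their cluster's dominant class.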

In order to quantify the validity of a community division of a complex network and to optimize the chosen splitting, we use, following [22], the concept of modularity. It is defined as follows: given a network division, let $e_{ij}$ be the fraction of edges in the network that connect vertices in group $i$ to those in group $j$, and let $a_i = \sum_{j} e_{ij}$. Then the modularity is defined as

$$Q = \sum_{i} \left( e_{ii} - a_i^{2} \right).$$

It measures the fraction of edges that fall within communities minus the expected value of the same quantity in a random graph with the same community division. A larger $Q$ corresponds to a more pronounced community structure.
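The modularity can be computed directly from an edge list and a community assignment. The Python sketch below follows the usual convention for undirected graphs, in which an edge between two different groups contributes half of its weight to each of the two symmetric entries of the group-to-group matrix; the function name `modularity` and the toy graph are assumptions for illustration.

```python
def modularity(edges, community):
    """edges: list of undirected (u, v) pairs; community: vertex -> group."""
    m = len(edges)
    groups = sorted(set(community.values()))
    e = {(g, h): 0.0 for g in groups for h in groups}
    for u, v in edges:
        g, h = community[u], community[v]
        # Split each edge evenly between e[g, h] and e[h, g];
        # an edge inside one group therefore adds 1 / m to e[g, g].
        e[(g, h)] += 0.5 / m
        e[(h, g)] += 0.5 / m
    q = 0.0
    for g in groups:
        a_g = sum(e[(g, h)] for h in groups)
        q += e[(g, g)] - a_g ** 2
    return q


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
toy_community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(toy_edges, toy_community))
```

For the toy graph, six of the seven edges lie inside a group and each group accounts for half of all edge ends, which gives a modularity of 5/14. Placing all vertices in a single group gives 0.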

4.2. Experiment on Four Real Networks

Testing an algorithm essentially means analyzing a network with a well-defined community structure and recovering its communities. In this subsection, four classical complex networks with known community structures are selected to test the introduced algorithm. Descriptions of these four networks can be found in [1, 11, 16, 23]. Taking the Zachary Karate Club network with two communities as an example, we first randomly choose one node in each community and label it. The algorithm is then run on this network and a community division is detected. The values of the three indices can be computed from the obtained partition. The community division may change with the selection of the initial labeled nodes. To evaluate the validity of the proposal objectively, we calculate the average values of the three indices over 10 randomly chosen groups of initial labeled nodes. In the same way, we also compute the values of the three indices as the number of labeled nodes in each community is increased. Table 1 lists the average values of the three indices over 10 randomly selected groups of labeled nodes, with the label number (the number of labeled nodes in each community) varying from 1 to 10.

From Table 1, it is easy to see that the proposed algorithm detects an ideal community division for these four networks when 3 nodes are labeled in each community. The accuracy of the network partition is greater than or equal to 94% except for the polbooks network. The values of the three indices increase or vary only slightly as the number of labeled nodes increases. These results also show that one can detect a good network partition by labeling a small number of nodes in each community. For the football network, we obtain the same partition accuracy as in [15] when the number of randomly selected labeled vertices ranges from 1 to 4.

In Table 2, the average run times of the proposed algorithm for four real networks are presented. It is shown that the run times decrease with the increase of labeled nodes. This is reasonable because the number of nodes that need to be divided is reduced.

Figures 1 and 2 show how the run time of the proposed algorithm and the values of the three indices vary for the dolphins network and the karate network, respectively.

4.3. Experiment on Three Benchmarks

For testing community detection algorithms on graphs with overlapping communities, several artificial networks, or benchmarks, have been introduced. Among them, the most famous benchmark for community detection is the class of networks introduced by Girvan and Newman (GN) [24]. Each network has 128 nodes, divided into four communities of 32 nodes each. The average degree of the network is 16 and the nodes have approximately the same degree, as in a random graph.
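A GN-style network of this kind can be generated with a simple planted-partition model. The Python sketch below is an assumption-laden illustration rather than the generator used in the experiments: it links two nodes of the same community with probability `p_in` and two nodes of different communities with probability `p_out`, chosen so that each node has on average `z_out` neighbors outside its community and 16 neighbors in total. The function name `gn_benchmark` and the parameters `z_out` and `seed` are hypothetical.

```python
import random


def gn_benchmark(z_out, seed=None):
    """Return (edges, community) for a 128-node, 4-community planted partition."""
    rng = random.Random(seed)
    n, size = 128, 32
    z_in = 16 - z_out                 # expected intra-community degree
    p_in = z_in / (size - 1)          # 31 possible neighbours inside
    p_out = z_out / (n - size)        # 96 possible neighbours outside
    community = {v: v // size for v in range(n)}
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, community


gn_edges, gn_community = gn_benchmark(z_out=2, seed=0)
print(len(gn_edges), 2 * len(gn_edges) / 128)
```

Smaller values of `z_out` give more clearly separated communities; as `z_out` grows toward 16, the planted structure fades.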

In what follows, we apply the proposed algorithm to detect the communities on this benchmark. For each fixed number of labeled nodes, we again randomly select 10 groups of different initial labeled nodes to compute the average values of the three indices. The benchmark can be thought of as the network with apparent community structure if mixing parameter . From Table 3, one can see that four communities in this benchmark are detected accurately when mixing parameter and the number of labeled nodes is equal to or greater than 4. If we take and label 10 nodes in each community, 90% of nodes in this benchmark can be partitioned correctly. When , this benchmark is with overlapping community structures. Although the partition accuracy becomes higher and higher as the number of labeled nodes increases, we cannot find ideal communities in this network. In particular, our algorithm fails to divide it into four groups when and number of labeled nodes in each group is less than 10.

Assuming that both the degree and the community size distributions are power laws, Lancichinetti et al. [25] designed a more general benchmark for testing community detection algorithms on graphs. Some parameters used in this benchmark are explained as follows:

N: number of nodes,
k: average degree,
maxk: maximum degree,
mu: mixing parameter,
t1: minus exponent for the degree sequence,
t2: minus exponent for the community size distribution,
minc: minimum for the community sizes,
maxc: maximum for the community sizes,
on: number of overlapping nodes,
om: number of memberships of the overlapping nodes,
C: average clustering coefficient (not mandatory).

In this benchmark, N, k, maxk, and mu have to be specified. For the others, the program can use default values: t1 = 2; t2 = 1; on = 0; om = 0; minc and maxc will be chosen close to the degree sequence extremes.

If we set the parameters N = 128, k = 16, maxk = 16, minc = 32, and maxc = 32, a kind of Girvan-Newman benchmark is obtained.

To test the validity of our algorithm on a large network, we apply the proposed algorithm to this benchmark with parameters , , , and . The mixing parameter is varied from to . For each fixed , one takes and , respectively. Unlike in the GN benchmark, the community sizes in this network follow a power law. Therefore, it is appropriate to label nodes in each community in proportion to the community size. The minimal proportion we take is 10%, because at least one labeled node is required in each community and there are small communities in this network. Applying the proposed algorithm to this benchmark with two randomly labeled groups of different initial nodes, one obtains the results reported in Table 4. There are nearly 90% of nodes which can be classified correctly in this network while and 10% nodes in each community are labeled. In this case, the values of the three indices do not vary noticeably as the label proportion increases. This fact indicates that one can detect a good community division on a network with apparent community structure even though only a few nodes are labeled. The values in each column decrease as the mixing parameter $\mu$ increases. This shows that the proposed algorithm will not find a good partition for a network whose communities overlap heavily.

Figure 3 compares the run time of our algorithm on two benchmarks with different parameters and different numbers or proportions of labeled nodes. As the number or proportion of labeled nodes increases, the number of unlabeled nodes in the benchmarks decreases, and therefore less and less time is needed to partition the network into groups.

We now present our experimental results on the LFR benchmark and further compare our proposal with the GN algorithm [24], the spectral clustering algorithm [1], the NMF algorithm [20], and the SNMF-SS algorithm [11] using the normalized mutual information (NMI) index.

The LFR benchmark was designed by Lancichinetti et al. [25] and is widely employed to test the performance of community structure identification. It allows the user to specify distributions for both the community sizes and the node degrees and then generates vertices and communities by sampling from those distributions. The mix parameter $\mu$ represents the average ratio of intracommunity adjacencies to total adjacencies. A large $\mu$ corresponds to a network with apparent community structure. In this paper, the input parameters of the LFR benchmark are the same for our algorithm and the comparative algorithms. For each value of $\mu$, we generated 50 instances of LFR benchmark graphs whose node degrees are taken from a power law distribution with exponent 2 and whose community sizes are taken from a power law distribution with exponent 1. Each graph has 1000 vertices, an average degree of 15, a maximum degree of 50, a maximum community size of 50, and a minimum community size of 5. The definition of NMI can be found in [11, 15, 26].
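For readers who want to reproduce the comparison, the following Python sketch computes one common form of NMI between two partitions, namely twice the mutual information divided by the sum of the two entropies. This normalization is an assumption here; the exact definition used in [11, 15, 26] should be checked before comparing numbers. The function name `nmi` is hypothetical.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """NMI of two partitions given as equal-length label sequences."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information between the two label assignments.
    mutual = sum(c / n * log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        # Both partitions consist of a single group and are identical.
        return 1.0
    return 2 * mutual / (h_a + h_b)


print(nmi([0, 0, 1, 1], ["x", "x", "y", "y"]))
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

The value is 1 when the two partitions coincide up to a renaming of the groups and 0 when they are statistically independent, as in the second call above.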

From Figure 4, we can see that the NMI values obtained by our algorithm are larger than those obtained by the other four algorithms. The peak value of our approach is 0.732 at . This value is larger than the value of 0.7 computed by the SNMF-SS algorithm. Because a decrease of $\mu$ means that the LFR benchmark has a more obscure community structure, it is difficult for the five algorithms to detect communities correctly. It is therefore reasonable that the NMI values obtained by the five algorithms become smaller and smaller as $\mu$ decreases. The NMF algorithm seems to be stable since its NMI decreases slowly. The performance of our proposal decreases greatly while is greater than 0.6. This fact implies that our algorithm cannot be applied to networks without apparent community structure. However, compared with the other four algorithms, our algorithm achieves the best performance.

5. Conclusions

In this paper, we propose a semisupervised community detection algorithm for partitioning a network into groups. This approach combines discrete potential theory with the Wu-Huberman algorithm. The low complexity of the introduced approach indicates that it can be applied to detect communities in large networks. The validity of our proposal is demonstrated by applying it to four real networks and three benchmarks. The experimental results show that a good community division of a complex network can be obtained by labeling a small number of nodes in each community. However, it is difficult for our method to correctly classify networks with heavily overlapping communities or obscure community structure. This fact can be seen from the experimental results on the LFR benchmark. Therefore, it is worthwhile to develop new and fast algorithms to deal with this case.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

This work is supported by NSFC (under Grants nos. 61373127 and 41471140).