Abstract

Community detection in dynamic networks is an important research topic that has received a great deal of attention in recent years. Previous detection methods use modularity to quantify the quality of a community partition. However, modularity suffers from a resolution limit. In this paper, we propose a novel multiobjective evolutionary algorithm for community detection in dynamic networks, based on the framework of the nondominated sorting genetic algorithm. Modularity density, which addresses the limitations of the modularity function, is adopted to measure the snapshot cost, and normalized mutual information is selected to measure the temporal cost. Problem-specific knowledge is used in designing the genetic operators. Furthermore, a local search operator is designed to improve the effectiveness and efficiency of community detection. Experimental studies on synthetic datasets show that the proposed algorithm achieves better performance than the compared algorithms.

1. Introduction

Many real-world complex systems take the form of networks. Acquaintance networks, the Internet, power grids, and neural networks are some examples. Networks can be modeled as graphs, where the individual objects are represented by nodes and the interactions among these objects are represented by edges. Community structure is one important property of these networks: the vertices are often found to cluster into tightly knit groups with a high density of within-group edges and a lower density of between-group edges [1]. Detecting such community structure is of great practical significance.

Traditional community detection treats the network as a static graph, either derived by aggregating data over all time or taken as a snapshot of data at a particular time. Researchers have proposed many effective detection methods for static networks. Dynamic networks, in contrast, capture the changes of interconnections over time, which allows tracing the changes of network structure at different time steps. Community detection in dynamic networks is therefore attracting increasing interest.

Recently, a framework called temporal smoothness has been applied to community detection in dynamic networks. This framework assumes that significant changes of the cluster structure within a short time period are not desirable [2]. To smooth each community over time, two competing objectives must be traded off: snapshot quality, which requires that the clustering reflect as accurately as possible the data arriving during the current time step, and temporal cost, which requires that each clustering not shift dramatically from one time step to the next. Using this idea, Folino and Pizzuti proposed a multiobjective approach named DYN-MOGA to discover communities in dynamic networks by employing genetic algorithms [3]. In DYN-MOGA, Community Score (CS) and Normalized Mutual Information (NMI) are the two objectives optimized simultaneously. At the end of each timestamp, DYN-MOGA returns the set of solutions contained in the Pareto front, and modularity is used as the criterion to automatically select one solution among them. Following this work, Gong et al. introduced a novel multiobjective immune algorithm with local search to solve the community detection problem in dynamic networks [4]. They adopted modularity and NMI as the two objectives to optimize.

However, modularity suffers from a resolution limit [5-7]. Fortunato and Barthélemy [5] found that modularity optimization may fail to identify modules smaller than a scale which depends on the total size of the network and on the degree of interconnectedness of the modules, even in cases where modules are unambiguously defined. To overcome the limitations of the modularity function, a new measure, named modularity density, was presented in [8].

In this paper, we present a novel multiobjective algorithm, named DNCD-MOEA, to detect communities in dynamic networks. The algorithm adopts modularity density as one objective to measure how well the clustering found represents the data at the current time, and it adopts NMI as the other objective to measure the distance between two clusterings at consecutive time steps. Based on problem-specific knowledge, new genetic operators are designed to solve the proposed model. To improve the quality of the solutions, a local search operator is also designed.

The outline of the paper is as follows. Section 2 introduces related background. In Section 3 we present our algorithm and explain its steps. Experimental studies are presented in Section 4. Section 5 concludes this paper.

2. Background

2.1. Notation

We define a static network at time $t$ as $N^t = (V^t, E^t)$, where $V^t$ is a set of objects, each denoting a node, and $E^t$ is a set of links, each representing an edge that connects two nodes $u$ and $v$ at time $t$. A dynamic network is defined as a sequence of static networks, that is, $\mathcal{N} = \{N^1, N^2, \ldots, N^T\}$, where each $N^t$ represents the snapshot of nodes and edges at time $t$. Let $CR^t = \{C_1^t, C_2^t, \ldots, C_k^t\}$ be the set of all communities in network $N^t$ at time $t$, where $C_i^t$ denotes the $i$th community at time $t$.

2.2. Evolutionary Clustering

Chakrabarti et al. first introduced evolutionary clustering in [2] as the problem of clustering data arriving at different time steps to produce a sequence of clusterings. We adopt this framework to analyze communities and their evolution by using the community structure $CR^{t-1}$ at time $t-1$ to regularize the community structure $CR^t$ at time $t$. At each time step, two conflicting criteria must be optimized simultaneously to produce a new clustering. The first criterion is that the clustering at any point in time should remain as faithful as possible to the current data. The second is that the clustering should not shift dramatically from one time step to the next. A framework called temporal smoothness is defined to satisfy the second property and smooth out each community over time. This framework assumes that a dramatic change of clustering in a short time is not desirable; thus a cost function consisting of two parts is defined as follows:

$$cost = \alpha \cdot SC + (1 - \alpha) \cdot TC, \tag{1}$$

where $SC$ denotes the snapshot cost, which measures how well a community structure $CR^t$ represents the data at time $t$, and $TC$ denotes the temporal cost, which measures how similar the community structure $CR^t$ is to the previous community structure $CR^{t-1}$. The input parameter $\alpha$ is used to control the level of emphasis on each of the two parts. When $\alpha = 1$, the framework returns the clustering without temporal smoothing. When $\alpha = 0$, however, the framework produces the same clustering structure as the previous time step, that is, $CR^t = CR^{t-1}$. The preference for each subcost can thus be controlled by changing the value of $\alpha$ between 0 and 1.

3. Proposed Algorithm

As in [3], we adopt a multiobjective framework for dynamic community detection which treats snapshot cost and temporal cost as two competing objectives. Modularity density $D$ and $NMI$ are employed to denote snapshot cost $SC$ and temporal cost $TC$, respectively. The main advantage of this method is that it does not need to fix the control parameter $\alpha$.

The proposed algorithm DNCD-MOEA is realized under the framework of NSGA-II [9]. The details of objective functions and representation method are given as follows.

3.1. Objective Functions

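To denote snapshot cost $SC$, we adopt modularity density as an objective function to measure the quality of the community. The modularity density is defined as follows:

$$D = \sum_{i=1}^{m} \frac{L(V_i, V_i) - L(V_i, \overline{V_i})}{|V_i|}. \tag{2}$$

In detail, consider an undirected network $G = (V, E)$ consisting of the vertex set $V$ ($n$ is the cardinality of $V$) and the edge set $E$. Here $\{V_1, V_2, \ldots, V_m\}$ is a partition of the vertex set into $m$ groups, $\overline{V_i}$ is the complement of $V_i$ with respect to $V$, $L(V_a, V_b) = \sum_{i \in V_a, j \in V_b} A_{ij}$ ($A$ is the adjacency matrix of $G$), and $|V_i|$ is the cardinality of $V_i$.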
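As an illustration of the modularity density measure, the following minimal Python sketch sums, for each community, the internal link count minus the outgoing link count, divided by the community size. The adjacency-dictionary input format, the function name `modularity_density`, and the toy graph are assumptions made for this example and are not part of the proposed algorithm.

```python
from typing import Dict, List, Set


def modularity_density(adj: Dict[int, Set[int]], communities: List[Set[int]]) -> float:
    """Sum over communities of (internal links - external links) / size."""
    total = 0.0
    for members in communities:
        if not members:
            continue
        # L(V_i, V_i): each internal edge is counted once per endpoint,
        # matching the double sum over the adjacency matrix.
        internal = sum(1 for u in members for v in adj[u] if v in members)
        # L(V_i, complement of V_i): edges leaving the community.
        external = sum(1 for u in members for v in adj[u] if v not in members)
        total += (internal - external) / len(members)
    return total


# Hypothetical toy graph: two triangles joined by the single edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
adj: Dict[int, Set[int]] = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

d_split = modularity_density(adj, [{1, 2, 3}, {4, 5, 6}])
d_merged = modularity_density(adj, [{1, 2, 3, 4, 5, 6}])
```

On this toy graph the two-triangle partition scores higher than the single merged community, which is the behavior expected of a measure that rewards dense, well-separated groups.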

Normalized Mutual Information (NMI) was employed to denote the second objective function, that is, temporal cost. Danon et al. have proved NMI to be a reliable similarity measure [10].

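Let $\mathcal{A}$ and $\mathcal{B}$ denote two partitions of a network in communities, and let $C$ denote the confusion matrix whose element $C_{ij}$ is the number of nodes of the community $i$ of $\mathcal{A}$ that are also in the community $j$ of $\mathcal{B}$. The $NMI(\mathcal{A}, \mathcal{B})$ is defined as follows:

$$NMI(\mathcal{A}, \mathcal{B}) = \frac{-2 \sum_{i=1}^{c_{\mathcal{A}}} \sum_{j=1}^{c_{\mathcal{B}}} C_{ij} \log\left(\frac{C_{ij} N}{C_{i\cdot} C_{\cdot j}}\right)}{\sum_{i=1}^{c_{\mathcal{A}}} C_{i\cdot} \log\left(\frac{C_{i\cdot}}{N}\right) + \sum_{j=1}^{c_{\mathcal{B}}} C_{\cdot j} \log\left(\frac{C_{\cdot j}}{N}\right)}, \tag{3}$$

where $c_{\mathcal{A}}$ ($c_{\mathcal{B}}$) is the number of communities in partition $\mathcal{A}$ ($\mathcal{B}$), $C_{i\cdot}$ ($C_{\cdot j}$) is the sum of the elements of $C$ in row $i$ (column $j$), and $N$ is the number of nodes. If $\mathcal{A} = \mathcal{B}$, then $NMI(\mathcal{A}, \mathcal{B}) = 1$. If $\mathcal{A}$ and $\mathcal{B}$ are completely different, then $NMI(\mathcal{A}, \mathcal{B}) = 0$. In this paper, the two objectives $D$ and $NMI$ are maximized simultaneously.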
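The NMI between two partitions can be computed directly from the confusion matrix. The sketch below is one possible implementation of the standard formulation of Danon et al.; the node-to-label dictionary format and the function name `nmi` are assumptions made for illustration, as is the convention of returning 1 when both partitions consist of a single community.

```python
import math
from collections import Counter
from typing import Dict, Hashable


def nmi(part_a: Dict[Hashable, int], part_b: Dict[Hashable, int]) -> float:
    """NMI of two partitions given as node -> community label maps."""
    nodes = list(part_a)
    n = len(nodes)
    confusion = Counter((part_a[v], part_b[v]) for v in nodes)  # C_ij
    row = Counter(part_a[v] for v in nodes)                     # row sums C_i.
    col = Counter(part_b[v] for v in nodes)                     # column sums C_.j
    numerator = -2.0 * sum(
        c * math.log(c * n / (row[i] * col[j])) for (i, j), c in confusion.items()
    )
    denominator = sum(r * math.log(r / n) for r in row.values()) + sum(
        c * math.log(c / n) for c in col.values()
    )
    if denominator == 0.0:
        # Assumed convention: both partitions are a single community.
        return 1.0
    return numerator / denominator


same = nmi({1: 0, 2: 0, 3: 1, 4: 1}, {1: 5, 2: 5, 3: 9, 4: 9})
independent = nmi({1: 0, 2: 0, 3: 1, 4: 1}, {1: 0, 2: 1, 3: 0, 4: 1})
```

Note that the measure depends only on how nodes are grouped, not on the label values, so relabeled but otherwise identical partitions give the maximum value.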

3.2. Solution Selection

The proposed algorithm DNCD-MOEA returns a Pareto front at the end of each timestamp, which contains a set of solutions. Each of these solutions corresponds to a different tradeoff between the two objectives and thus to a different partitioning of the network, possibly with a different number of clusters. A criterion is therefore needed to automatically select which solution denotes the best partitioning of the current network at each time step. Community score, introduced in [11], has been shown to be very effective in detecting communities. In this paper, we adopt community score as the criterion and select, among the solutions found, the one having the highest community score. It is defined as follows.

Let $CR^t = \{C_1^t, C_2^t, \ldots, C_k^t\}$ be the set of all communities in network $N^t$ at time $t$, where $C_i^t$ denotes the $i$th community at time $t$. The community score of $CR^t$ is defined as

$$CS = \sum_{i=1}^{k} score(C_i^t), \tag{4}$$

where

$$score(C) = \frac{\sum_{i \in C} \mu_i}{|C|} \times \sum_{i, j \in C} A_{ij}. \tag{5}$$

The parameter $\mu_i$ denotes the fraction of edges which connect node $i$ of $C$ to the nodes in the same community $C$. The community score takes into account both the fraction of interconnections among the nodes and the number of interconnections contained in the module $C$. It gives a global measure of the network division in communities by summing up the local score of each module found. Thus a larger community score indicates a stronger community structure, and we select the solution with the maximum community score in the set of solutions as the best solution.
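The selection criterion can be sketched in Python under explicit assumptions: the score of a community is taken as the mean within-community edge fraction of its nodes multiplied by the number of within-community adjacency entries, and the fraction for a node is assumed to be its internal degree divided by the community size. The function name `community_score` and the toy graph are hypothetical.

```python
from typing import Dict, List, Set


def community_score(adj: Dict[int, Set[int]], communities: List[Set[int]]) -> float:
    """Sum of per-community scores: mean internal edge fraction times volume."""
    total = 0.0
    for members in communities:
        size = len(members)
        if size == 0:
            continue
        # Internal degree of each node: neighbors inside the same community.
        internal_deg = {i: sum(1 for j in adj[i] if j in members) for i in members}
        # Assumed definition of the fraction: internal degree / community size.
        mean_fraction = sum(d / size for d in internal_deg.values()) / size
        # Volume: sum of adjacency entries with both endpoints in the community.
        volume = sum(internal_deg.values())
        total += mean_fraction * volume
    return total


# Hypothetical toy graph: two triangles joined by the single edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
adj: Dict[int, Set[int]] = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

cs_split = community_score(adj, [{1, 2, 3}, {4, 5, 6}])
cs_merged = community_score(adj, [{1, 2, 3, 4, 5, 6}])
```

Under these assumptions, selecting the solution with the larger score would prefer the two-triangle partition over the merged one.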

3.3. Genetic Representation

The locus-based adjacency representation proposed by Park and Song [12] is used in our community detection algorithm, similar to [11].

As the graph has $n$ vertices, an arbitrary individual of the population consists of $n$ genes and can be represented as the following string:

$$g = (g_1, g_2, \ldots, g_n), \tag{6}$$

where $g_i = j$ denotes that there exists a link between the nodes $i$ and $j$. This means that the nodes $i$ and $j$ will be in the same community. In the decoding step, all connected components of the corresponding graph must be identified, and the nodes belonging to the same component are assigned to the same community. The main advantage of this representation is that the decoding step can be done in linear time, and it has been verified as an effective encoding scheme for community detection, as shown in [3]. Additionally, the number of clusters is determined automatically by the decoding step as the number of components contained in an individual [11].

An example of an encoded genotype and the corresponding network is shown in Figure 1. As shown in Figure 1(a), the example network consists of seven nodes numbered from 1 to 7. The network can clearly be partitioned into two groups, visualized by different shapes. A possible genotype, shown in Figure 1(b), which corresponds to the optimal solution, is translated into the graph structure given in Figure 1(c). Each connected component provides a grouping of nodes that corresponds to the partitioning of the network in Figure 1(a).
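The decoding step of the locus-based adjacency representation amounts to finding connected components. The sketch below uses a union-find structure, which is one common way to do this in near-linear time; the function name `decode` and the seven-node genotype are hypothetical and are not taken from Figure 1.

```python
from typing import Dict, List, Set


def decode(genotype: Dict[int, int]) -> List[Set[int]]:
    """Group nodes into communities: gene i -> j puts i and j together."""
    parent = {v: v for v in genotype}

    def find(v: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for i, j in genotype.items():
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j

    groups: Dict[int, Set[int]] = {}
    for v in genotype:
        groups.setdefault(find(v), set()).add(v)
    # Order communities by their smallest node for a stable result.
    return sorted(groups.values(), key=min)


# Hypothetical genotype over nodes 1..7 (gene at position i holds a neighbor of i).
genotype = {1: 2, 2: 3, 3: 1, 4: 5, 5: 7, 6: 7, 7: 4}
communities = decode(genotype)
```

The number of communities is simply the length of the returned list, so it never has to be fixed in advance.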

3.4. Initialization

If an individual is created randomly, it may not be a feasible solution. A randomly generated individual could contain an allele value $j$ in the $i$th position although no connection exists between the two nodes $i$ and $j$; that is, the edge $(i, j)$ is not present. To overcome this limitation, each individual is checked for safety after it is created. When the individual is not safe, that is, a gene $i$ contains a value $j$ but the link $(i, j)$ does not exist, it is repaired so that it becomes safe. Safe individuals improve the convergence of the method because the space of possible solutions is restricted.
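A minimal sketch of the safety check and repair is given below. It assumes that an unsafe gene is repaired by replacing its value with a randomly chosen neighbor of the corresponding node, which is one natural choice and is not specified in the text; the function names and the toy graph are hypothetical, and every node is assumed to have at least one neighbor.

```python
import random
from typing import Dict, Set


def is_safe(individual: Dict[int, int], adj: Dict[int, Set[int]]) -> bool:
    """An individual is safe if every gene i -> j corresponds to an edge (i, j)."""
    return all(j in adj[i] for i, j in individual.items())


def repair(individual: Dict[int, int], adj: Dict[int, Set[int]],
           rng: random.Random) -> Dict[int, int]:
    """Replace each unsafe allele with a random neighbor (assumed repair rule)."""
    return {
        i: (j if j in adj[i] else rng.choice(sorted(adj[i])))
        for i, j in individual.items()
    }


# Hypothetical toy graph: two triangles joined by the single edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
adj: Dict[int, Set[int]] = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

rng = random.Random(0)
# A fully random individual may point to non-neighbors (or to the node itself).
raw = {i: rng.choice(sorted(adj)) for i in adj}
safe = repair(raw, adj, rng)
```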

3.5. Crossover and Mutation

As in [3], we use uniform crossover, which guarantees that the child individual maintains the effective connections of the nodes in the network. Given two arbitrary safe parent individuals and a randomly created binary vector, a gene is taken from the first parent if the corresponding vector element is 1 and from the second parent if the vector element is 0. The genes are then combined to form a child. Because of the biased initialization, a child created from two safe parents is also safe: if a position $i$ in the child contains a value $j$, then the edge $(i, j)$ exists. The uniform crossover is shown in Figure 2.

The crossover operator is regarded as a macroscopic operation on individuals, while the mutation operator is regarded as a microscopic operation. A mutation operator that changes the value of the $i$th gene completely at random causes a useless exploration of the search space. To guarantee that the mutated child is safe, as with the crossover operation, the possible values of an allele after mutation are restricted to the neighbors of the node corresponding to the mutated gene.
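The two variation operators can be sketched as follows. This is an illustrative implementation under assumptions: individuals are dictionaries mapping each node to one of its neighbors, the per-gene mutation rate `rate` is a hypothetical parameter, and the function names and toy graph are not taken from the original work.

```python
import random
from typing import Dict, Set


def uniform_crossover(p1: Dict[int, int], p2: Dict[int, int],
                      rng: random.Random) -> Dict[int, int]:
    """Take each gene from parent 1 or parent 2 according to a random binary mask."""
    return {i: (p1[i] if rng.random() < 0.5 else p2[i]) for i in p1}


def mutate(individual: Dict[int, int], adj: Dict[int, Set[int]],
           rate: float, rng: random.Random) -> Dict[int, int]:
    """Reassign some genes, choosing only among the neighbors of the node."""
    child = dict(individual)
    for i in child:
        if rng.random() < rate:
            child[i] = rng.choice(sorted(adj[i]))
    return child


# Hypothetical toy graph: two triangles joined by the single edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
adj: Dict[int, Set[int]] = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

rng = random.Random(42)
parent1 = {i: min(adj[i]) for i in adj}  # safe by construction
parent2 = {i: max(adj[i]) for i in adj}  # safe by construction
child = uniform_crossover(parent1, parent2, rng)
mutant = mutate(child, adj, rate=0.3, rng=rng)
```

Because every gene of the child is copied from a safe parent and every mutated gene is drawn from the node's neighbors, neither operator needs a repair step afterwards.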

3.6. The Pseudocode of the Proposed Algorithm

The pseudocode of our DNCD-MOEA algorithm is described in Algorithm 1.

Program  DNCD-MOEA
Input:  The number of time steps $T$, the sequence of
dynamic network $\mathcal{N} = \{N^1, N^2, \ldots, N^T\}$
Output:  The sequence of community structure $\{CR^1, CR^2, \ldots, CR^T\}$ detected in
the dynamic network
Begin
   Step 1: Set $t = 1$. Generate the initial community structure
        $CR^1$ of the network $N^1$ using
        GA-Net algorithm. Set $t = 2$.
   Step 2: If $t > T$, return the sequence of community
        structure as the output,
        algorithm stops; Else, go to Step 3.
   Step 3: Set $g = 0$. Randomly generate $N_p$ individuals whose
        length equals the number of nodes of
        network $N^t$ as an initial population $P_g$;
   Step 4: While termination condition is not satisfied do
      Step 4.1: Create a new population of offspring $Q_g$ by
            applying the variation operators on
            population $P_g$;
      Step 4.2: Combine the parents and offspring into a
            new pool $R_g$;
      Step 4.3: Decode each individual of the population
            $R_g$ to generate the partitioning
            $CR^t = \{C_1^t, \ldots, C_k^t\}$ of the network $N^t$ in
            $k$ connected components;
      Step 4.4: Evaluate the two fitness values of the translated
            individuals;
      Step 4.5: Partition $R_g$ into fronts, assign a rank to each
            individual and sort them according to
            nondomination rank;
      Step 4.6: Select $N_p$ individuals based on rank and crowding
            distance to comprise new population $P_{g+1}$;
      Step 4.7: Select the dominant individuals $DP_g$ in $P_{g+1}$;
      Step 4.8: Perform the local search algorithm on the
            selected individuals in $DP_g$ to generate the new
            dominant population $DP'_g$. Update the dominant
            population with $DP'_g$ in $P_{g+1}$;
      Step 4.9: Set $g = g + 1$.
        End while
   Step 5: Select the individual which has the maximum
        Community Score on the Pareto front. Decode the
        selected individual to get the community structure
        $CR^t$ of the network $N^t$.
   Step 6: Set $t = t + 1$, go to Step 2.
End
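The front-partitioning step of the algorithm relies on Pareto dominance over the two maximized objectives. The following sketch extracts only the first nondominated front from a list of objective pairs; it is a simplified quadratic-time illustration, not the fast nondominated sorting procedure of NSGA-II, and the function names and sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse in every objective and better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (modularity density, NMI) pairs for four candidate partitions.
scores = [(3.3, 0.60), (2.9, 0.95), (2.5, 0.70), (3.3, 0.50)]
front = pareto_front(scores)
```

The two surviving pairs represent different tradeoffs between snapshot quality and temporal smoothness, which is why a separate selection criterion is applied to the front afterwards.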

3.7. Local Search Strategy

Local search has proved to be an effective technique. The mutation operator is a microscopic operation on individuals and can serve as a local search by moving single nodes between communities. Inspired by this idea, we adopt the mutation operator in our local search algorithm.

The local search algorithm needs to convert the multiple objectives into a single objective function. In our study, we select a weighted objective as follows [3]:

$$f = \sum_{i=1}^{2} w_i f_i, \tag{7}$$

where $f_i$ ($i = 1, 2$) is the objective function described in (2) and (3), respectively, and $w_i$ is the nonnegative weight of the $i$th objective. The weights are calculated from the extreme values of each objective, where $f_i^{\max}$ and $f_i^{\min}$ denote the maximum and minimum value of the $i$th objective function in the obtained dominant population. The detailed pseudocode of the local search algorithm is given in Algorithm 2.

Program: Local Search Algorithm
Input: Given a dominant population $DP_g$ before local search
at the $g$th generation, the size $N_d$ of dominant population,
the number $N_{ls}$ of local search.
Output: the improved population $DP'_g$ in the $g$th generation.
Begin
   for $i = 1$ to $N_d$
       $count = 0$; $improved = false$;
      while $count < N_{ls}$ and not $improved$
     Create a new individual applying the mutation
       operator on the $i$th individual of dominant
       population $DP_g$;
     Calculate the objective function value of the new
       individual according to formula (7). If its value is
       greater than that before local search, add the new
       individual to $DP'_g$, $improved = true$;
      $count = count + 1$;
      end while
      if not $improved$, adds the current individual to $DP'_g$
   end for
End
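The control flow of the local search can be sketched in Python as below. The sketch treats the weights of the scalarized objective as given inputs rather than computing them, and it uses a generic `mutate_fn` callback in place of the neighbor-restricted mutation; all names and the integer toy problem are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def local_search(dominant: List[T],
                 objectives: Sequence[Callable[[T], float]],
                 weights: Sequence[float],
                 mutate_fn: Callable[[T], T],
                 n_ls: int) -> List[T]:
    """Try up to n_ls mutations per individual; keep the first improvement."""

    def fitness(ind: T) -> float:
        # Weighted sum of the objectives (weights assumed to be supplied).
        return sum(w * f(ind) for w, f in zip(weights, objectives))

    improved_population: List[T] = []
    for ind in dominant:
        baseline = fitness(ind)
        result = ind  # fall back to the current individual
        for _ in range(n_ls):
            candidate = mutate_fn(ind)
            if fitness(candidate) > baseline:
                result = candidate
                break
        improved_population.append(result)
    return improved_population


# Toy problem: individuals are integers and both objectives reward larger values.
objs = [lambda x: float(x), lambda x: 2.0 * x]
better = local_search([1, 5, 9], objs, [0.5, 0.5], lambda x: x + 1, n_ls=3)
unchanged = local_search([1, 5, 9], objs, [0.5, 0.5], lambda x: x - 1, n_ls=3)
```

The output population always has the same size as the input, since each individual is replaced by either its first improving mutant or itself.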

3.8. Computational Complexity

The computational complexity of the DNCD-MOEA algorithm is investigated in this subsection. The DNCD-MOEA algorithm contains two major components: the main program based on the NSGA-II framework and the subprogram of the local search operation. The running time of the main program mainly consists of three parts: nondominated sorting, crowding distance assignment, and construction of the partially ordered set. Let denote the size of population and denote the number of objects. The computational complexity of the three parts is , , and , respectively. So the total time complexity of the main program is . The local search operation mainly contains two loops and its computational complexity is . So, the total computational complexity of the DNCD-MOEA algorithm is . Because , , and in practice , the time complexity can be simplified to .

4. Experiments

To check the ability of our approach on a dynamic network, we adopt the method proposed in [3] to generate data simulating a dynamic network. First, we generate synthetic datasets by following the procedure suggested by Girvan and Newman [1]. The datasets are generated using the software package designed by Lancichinetti et al. [13]. The data have 128 nodes, which are divided into 4 communities of 32 vertices each. Every node has an average degree of 16, of which $z$ edges on average connect it to nodes in other communities. Increasing $z$ raises the noise level of the network, and decreasing $z$ lowers it.

A parameter $\mu$, which represents the average ratio of external degree to total degree for each node, is used to control the noise level in the dynamic networks. If the value of $\mu$ is increased, the network becomes noisier in the sense that the community structure becomes less obvious and harder to detect. In this study, datasets under two different noise levels are generated by using two different values of $\mu$. To introduce dynamics into the network, we let the community structure of the network evolve in the following way. After time step 1, at each time step 5% of the nodes are randomly chosen to leave their original community and are randomly assigned to one of the other three communities. After the community memberships are decided, links are generated according to the parameter $\mu$. We generate the network with community evolution in this way for 10 time steps.
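The membership evolution described above can be sketched as follows. Only the reassignment of nodes is shown; regenerating the links from the new memberships is left to the benchmark generator and is not reproduced here. The function name `evolve_membership`, the rounding of the 5% share to a whole number of nodes, and the fixed random seed are assumptions made for this example.

```python
import random
from typing import Dict


def evolve_membership(membership: Dict[int, int], n_communities: int,
                      fraction: float, rng: random.Random) -> Dict[int, int]:
    """Move a fraction of the nodes to a different, randomly chosen community."""
    nodes = sorted(membership)
    movers = rng.sample(nodes, round(fraction * len(nodes)))
    updated = dict(membership)
    for v in movers:
        others = [c for c in range(n_communities) if c != membership[v]]
        updated[v] = rng.choice(others)
    return updated


# 128 nodes in 4 communities of 32 nodes each, evolved over 10 time steps.
rng = random.Random(1)
snapshots = [{v: v // 32 for v in range(128)}]
for _ in range(9):
    snapshots.append(evolve_membership(snapshots[-1], 4, 0.05, rng))

moved_first_step = sum(
    1 for v in snapshots[0] if snapshots[0][v] != snapshots[1][v]
)
```

With 128 nodes, a 5% share rounds to 6 nodes per step in this sketch, so consecutive snapshots differ only slightly, which is what makes temporal smoothing meaningful.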

Figure 3 shows the average value of normalized mutual information with respect to the ground truth over the 10 networks for the 10 timestamps, for the two values of $\mu$ (Figures 3(a) and 3(b)). Both figures show that the proposed algorithm DNCD-MOEA achieves better accuracy than the compared method. In particular, in one of the two settings the average values of NMI obtained by DNCD-MOEA at each time step are close to 1.

Figure 4 reports the community score obtained by the two algorithms at each time step for the two values of $\mu$ (Figures 4(a) and 4(b)). A larger community score indicates that the corresponding network is more densely connected within each community. Figure 4 shows that the algorithm DNCD-MOEA outperforms the algorithm DYN-MOGA. That is to say, the community structure obtained by our algorithm is closer to the true community structure.

5. Conclusions

In this paper, a novel multiobjective algorithm for detecting communities in dynamic networks is proposed. By optimizing modularity density and NMI, the algorithm automatically provides a solution representing the best tradeoff between the accuracy of the community structure with respect to the data of the current time step and the deviation from one time step to the next. To solve the proposed model, new genetic operators are designed based on problem-specific knowledge. A local search operator is designed to improve the quality of the solutions. Experimental results on synthetic datasets show that the proposed algorithm obtains better performance than the compared algorithm. Future research will aim at improving efficiency and applying the proposed algorithm to real-life networks.

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grants no. 61272119 and no. 61203372.