Mathematical Problems in Engineering

Volume 2014, Article ID 502809, 15 pages

http://dx.doi.org/10.1155/2014/502809

## Performance Evaluation of Modularity Based Community Detection Algorithms in Large Scale Networks

^{1}Department of Computer Science, Federal University of São João del Rei (UFSJ), 36301-360 São João del Rei, MG, Brazil

^{2}COPPE, Federal University of Rio de Janeiro (UFRJ), P.O. Box 68506, 21941-972 Rio de Janeiro, RJ, Brazil

Received 28 August 2014; Accepted 27 November 2014; Published 28 December 2014

Academic Editor: Mohamed A. Seddeek

Copyright © 2014 Vinícius da Fonseca Vieira et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Community structure detection is one of the major research areas of network science and is particularly useful in applications involving large real networks. This work presents an in-depth study of the most discussed modularity-based algorithms for community detection: Newman’s spectral method with a fine-tuning stage and the method of Clauset, Newman, and Moore (CNM) with its variants. The computational complexity of the algorithms is analysed in order to develop high performance code that accelerates their execution without compromising the quality of the results, as measured by modularity. The implemented code generates partitions with modularity values consistent with the literature and handles networks with more than 1 million nodes using Newman’s spectral method. The code was applied to a wide range of real networks and the performance of the algorithms is evaluated.

#### 1. Introduction

Community detection is of great interest in the field of complex networks and has been the subject of many works [1–6]. A commonly accepted characterization of a community in a network is a subset of nodes with high internal density and low external density.

Several works can be found in the literature analyzing and comparing different measures for quality of partitions. For instance, in the work of Yang and Leskovec [7], the authors investigate the suitability of several measures to characterize ground-truth based communities. The work of Moradi et al. [8] compares different quality functions regarding their ability to classify useful and spam messages in an email network.

Currently, modularity, proposed by Newman and Girvan [9], is the most widely adopted measure for assessing the quality of communities in networks. For a particular community, modularity can be understood, in a general way, as the difference between the fraction of edges inside the community and the fraction of edges expected in a random version of the network that preserves the degree distribution of the nodes.

In one of the first works with the purpose of investigating community structures in networks, Girvan and Newman [10] propose a method based on edge centrality [11], able to handle small-scale networks (up to nodes). Later, Newman proposes a modularity based heuristic method, able to handle networks on a larger scale (up to hundreds of thousands of nodes) [12]. In order to adapt the heuristic method to large scale networks, Clauset et al. define, thus, a methodology which allows it to be executed more efficiently [1].

In a different direction, some works use approximate optimization methods for community detection in complex networks based on the modularity measure [13, 14], such as Newman’s work [15], which proposes a relaxed optimization method based on spectral graph theory [16].

Recently, there has been much discussion about some negative aspects of using modularity as a measure of the quality of the division of a network into communities. The work of Fortunato and Barthélemy [17] is worth mentioning: it shows that modularity can fail to identify intuitive communities (for instance, cliques of nodes). This problem is broadly addressed in the literature [4, 18–20] and is frequently referred to as the resolution limit problem. Another important aspect is that some partitions can show high modularity values even when the network deviates only slightly from random connections, as observed by Guimerà et al. [21] and also discussed by Kehagias [22]. Furthermore, very distinct partitions can lead to similar modularity values, as discussed by Good et al. [18], which is a drawback in the use of modularity.

The discussion raised by these aspects, concerning the use of modularity as a measure of the quality of partitions, is of primary importance in the community detection area. Nevertheless, several works, such as [4, 18, 20], argue that modularity is still a very appropriate measure for the assessment of community structures in networks and emphasize the importance of investigating methods for modularity maximization.

Currently, the growing possibility of storing and processing data in high performance computing environments raises the demand for the analysis of increasingly larger networks, with millions (or even billions) of nodes. A great challenge in the complex networks area is the identification of communities in large scale networks.

Computational methods capable of detecting community structure in large scale networks are therefore in great demand, and developing them is currently one of the most important problems in the area. Several works in the literature propose and study such methods, including [1, 2, 4, 9, 15, 23–27].

The method of Clauset, Newman, and Moore (CNM) can be considered one of the most important methods for community detection in networks and is currently one of the most studied methods for this purpose. Some modifications can be found in the literature that accelerate its execution and make it possible to investigate larger networks, including the works of Wakita and Tsurumi [28], Leon-Suematsu and Yuta [4], and Danon et al. [24]. Another important heuristic method for community detection in large scale networks is reported in the work of Blondel et al. [29], which uses an agglomerative multistep process during its execution. There has also been great recent interest in nonparametric methods, which aim at fitting networks to statistical models according to their structural properties.

This work aims at the investigation of the computational issues of methods for community detection which enable them to deal with large scale networks. Two of the most adopted modularity based methods for community detection are addressed: the spectral method of Newman [15] combined with a variation of the Kernighan-Lin method [30], which is called fine-tuning, and the method of Clauset, Newman, and Moore (CNM) [1]. Some variations on the fine-tuning stage are also proposed in order to accelerate its execution without harming the quality of the result obtained.

The computational implementation of the studied methods is discussed with respect to the computational complexity of the algorithms. The implemented algorithms are used in a qualitative and quantitative comparative study of the spectral method of Newman and the CNM method, adjusting their application to large scale networks. All of the developed code is freely available for download in a GitHub repository (http://www.github.com/vfvieira/).

The remainder of the work is organized as follows. Section 2 presents the problem of community detection and the methods addressed in this work. In Section 3, the main computational issues concerning the implementation of the methods are presented. The experiments performed, as well as the obtained results and discussion, are presented in Section 4. Section 5 presents some conclusions and future works.

#### 2. Community Detection in Networks

##### 2.1. Problem Statement

A community structure in a network can be identified when the network can be divided into groups with a high density of internal connections and, at the same time, a low density of external connections. The community structure becomes more evident as the difference between the intragroup and intergroup densities increases. Thus, quantifying the quality of a particular division of the network into communities is a central concern.

Consider a graph $G = (V, E)$, where $V$ represents the set of nodes and $E$ represents the set of edges. In this work, the edges of the graph are unweighted and undirected. Thus, the graph is represented by an adjacency matrix $A$, where an element $A_{ij} = 1$ if a node $i$ is connected to a node $j$, and $A_{ij} = 0$ otherwise.

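A community structure is defined as a partition of $V$ into $n_c$ communities:

$$\mathcal{C} = \{C_1, C_2, \ldots, C_{n_c}\}, \tag{1}$$

where each community $C_k$ is a subset of nodes of $V$ such that

$$\bigcup_{k=1}^{n_c} C_k = V, \qquad C_k \cap C_l = \emptyset \quad \text{for } k \neq l. \tag{2}$$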

Equations (2) determine that a community structure defines a partition of the set of nodes, such that there is no overlap between the communities. Alternatively, several approaches that consider the overlapping of communities can be found in the literature [7, 31].

The quality of a community structure can be assessed by modularity, a measure proposed by Newman and Girvan [6], which considers the difference between the fraction of edges in a community and the fraction of edges expected in a network with the same degree distribution but with edges placed at random.

Consider $m$ as the number of edges in the network and $k$ as the degree vector, where $k_i$ is the degree of a node $i$. The number of connections expected between all pairs of nodes inside the same community is $\frac{1}{2}\sum_{ij}\frac{k_i k_j}{2m}\,\delta(c_i, c_j)$, where $c_i$ denotes the community to which the node $i$ belongs and $\delta$ is the Kronecker delta, which returns $1$ if the operands are equal and $0$ otherwise. The factor $\frac{1}{2}$ is used to avoid double counting of edges.

Modularity of a community structure can be defined as

$$Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j). \tag{3}$$
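As an illustration, modularity (3), taken here in its usual form $Q = \frac{1}{2m}\sum_{ij}[A_{ij} - k_i k_j/(2m)]\,\delta(c_i, c_j)$, can be evaluated directly from an adjacency matrix and a vector of community labels. The sketch below is a minimal dense NumPy version, not the high performance code discussed in this work; the function name `modularity` and the toy network are assumptions introduced only for the example.

```python
import numpy as np

def modularity(A, labels):
    """Q = (1/2m) * sum_ij [A_ij - k_i k_j / (2m)] * delta(c_i, c_j)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    k = A.sum(axis=1)                            # degree vector
    two_m = k.sum()                              # 2m: every undirected edge is counted twice
    same = labels[:, None] == labels[None, :]    # Kronecker delta as a boolean mask
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

# Hypothetical toy network: triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

q_two = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_one = modularity(A, [0, 0, 0, 0, 0, 0])   # all nodes in a single community
```

For this toy network, `q_two` equals 5/14 (about 0.357) and `q_one` equals 0, since a single community contains exactly the number of edges expected at random. The dense $n \times n$ mask makes this version quadratic in memory, so it is only suitable for small examples.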

From the definition of modularity, it can be said that, for a particular network, the community structure with the maximum modularity value is the best partition of the set of nodes. This principle is an important motivation for using modularity maximization to solve the community detection problem in complex networks. Modularity optimization in networks has been the subject of several works in the literature, including [4, 15, 24, 27–29, 32, 33].

This work focuses on two of them: the spectral method of Newman and the heuristic method of Clauset, Newman, and Moore. The next sections are dedicated to such methods.

##### 2.2. Spectral Optimization of Modularity

The spectral approach was applied by Newman and Girvan to the community detection problem [6] and, to this end, they define a modularity matrix $B$, in which each element $B_{ij}$ can be defined as

$$B_{ij} = A_{ij} - \frac{k_i k_j}{2m}. \tag{4}$$

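Considering the division of the network into just two generic communities $C_1$ and $C_2$, the division can be represented by a vector $s$ with one element per node, such that $s_i = +1$ if $i \in C_1$ and $s_i = -1$ if $i \in C_2$, redefining modularity (3) in terms of $s$ as

$$Q = \frac{1}{4m}\sum_{ij} B_{ij}\, s_i s_j = \frac{1}{4m}\, s^{T} B s. \tag{5}$$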
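The step from (3) to the two-community form relies on two facts that are worth making explicit. The short derivation below assumes the notation $B_{ij} = A_{ij} - k_i k_j/(2m)$ and $s_i = \pm 1$ (symbols reconstructed for this illustration): the Kronecker delta can be written in terms of $s$, and the elements of $B$ sum to zero.

```latex
\begin{align}
\delta(c_i, c_j) &= \tfrac{1}{2}\,(s_i s_j + 1), \\
\sum_{ij} B_{ij} &= \sum_{ij} A_{ij} - \frac{1}{2m}\sum_{i} k_i \sum_{j} k_j
                  = 2m - \frac{(2m)^2}{2m} = 0, \\
Q &= \frac{1}{2m}\sum_{ij} B_{ij}\,\frac{s_i s_j + 1}{2}
   = \frac{1}{4m}\sum_{ij} B_{ij}\, s_i s_j
   = \frac{1}{4m}\, s^{T} B s .
\end{align}
```

The constant term vanishes precisely because the rows of $B$ sum to zero, which is the same property that later allows the method to be generalized to subdivisions of a community.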

Relaxing the vector $s$ into a vector $x$ that allows any real number, the solution of the modularity maximization problem can be obtained by solving the eigenproblem

$$B u_1 = \lambda_1 u_1, \tag{6}$$

where $\lambda_1$ is the largest eigenvalue of $B$ and $u_1$ is its corresponding eigenvector. For the sake of simplicity, $\lambda_1$ and $u_1$ will be treated, respectively, as $\lambda$ and $u$ for the remainder of the work.

The solution of (6) maximizes the relaxed approximation of $Q$. From that, the community structure is defined by the eigenvector $u$ corresponding to the largest eigenvalue of $B$, according to the signs of the elements of $u$: the nodes corresponding to the positive elements of $u$ are assigned to one group and the nodes corresponding to the negative elements of $u$ are assigned to the other group, which can be better described as

$$s_i = \begin{cases} +1, & \text{if } u_i > 0, \\ -1, & \text{if } u_i < 0. \end{cases} \tag{7}$$

This method is known in the literature as Newman’s bisection method, which aims at dividing a network into two communities (generically defined as $C_1$ and $C_2$), and can be summarized by the following steps: (1) calculate the eigenvector corresponding to the largest eigenvalue of the modularity matrix; (2) assign the nodes to the communities according to the signs of its elements (positive elements are assigned to community $C_1$ and negative elements are assigned to the other community, $C_2$).
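The two steps of the bisection can be sketched in a few lines of NumPy. This is a minimal dense illustration, assuming the modularity matrix $B = A - k k^{T}/(2m)$ and using `numpy.linalg.eigh`, whose cubic cost restricts it to small networks; a large scale implementation would need a sparse or iterative eigensolver instead. The function name `spectral_bisection`, the handling of zero elements, and the toy network are assumptions of this example.

```python
import numpy as np

def spectral_bisection(A):
    """Split a network in two using the leading eigenvector of B = A - k k^T / (2m)."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                       # degree vector
    two_m = k.sum()                         # 2m
    B = A - np.outer(k, k) / two_m          # modularity matrix
    # B is symmetric, so eigh applies; eigenvalues are returned in ascending order.
    _, eigvecs = np.linalg.eigh(B)
    u = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
    # Sign rule. Sending exact zeros to the negative group is an arbitrary
    # choice of this sketch, not something prescribed by the method.
    s = np.where(u > 0, 1, -1)
    q = float(s @ B @ s) / (2.0 * two_m)    # Q = s^T B s / (4m)
    return s, q

# Hypothetical toy network: triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

s, q = spectral_bisection(A)
group_of_node_0 = sorted(np.flatnonzero(s == s[0]).tolist())
```

The overall sign of an eigenvector is arbitrary, so which triangle receives $+1$ may vary between runs or libraries; only the grouping itself is meaningful. Here the nodes grouped with node 0 are `[0, 1, 2]`, and the resulting modularity is 5/14.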

In order to generalize the method to the division of the network into several communities, modularity can be maximized through a process of successive bisections. The method evaluates whether dividing the network (or a community) yields a gain in modularity and, if so, divides the nodes into two subsets. Recursively, the method then evaluates whether each of the two subsets should be divided, and performs the division only if it increases the modularity. The process stops when no division increases the modularity.

However, the strategy of simply removing the edges which connect two communities and applying the method to each community separately leads to an essential mistake in the definition of the modularity. As defined by (3), the modularity to be maximized must consider the whole network.

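In this sense, Newman defines a community modularity matrix $B^{(C)}$, which concerns only a particular community, $C$ in this case, and can be defined as

$$B^{(C)}_{ij} = B_{ij} - \delta_{ij}\sum_{l \in C} B_{il}, \qquad i, j \in C, \tag{8}$$

where $\delta_{ij}$ is the Kronecker delta. Then, Newman defines a measure $\Delta Q$, which evaluates the modularity variation caused by the division of a generic community $C$ and can be written as

$$\Delta Q = \frac{1}{4m}\, s^{T} B^{(C)} s, \tag{9}$$

where $s$ is restricted to the nodes of $C$.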

This definition allows the method to be applied to any generic community, since the sum of the rows of $B^{(C)}$ is still zero and $\Delta Q$ is also zero when the community remains undivided.
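Both properties can be checked numerically. The sketch below assumes the usual form of the community modularity matrix, $B^{(C)}_{ij} = B_{ij} - \delta_{ij}\sum_{l \in C} B_{il}$, and of the modularity variation, $\Delta Q = s^{T} B^{(C)} s / (4m)$; the function names and the toy network are hypothetical and serve only this illustration.

```python
import numpy as np

def community_modularity_matrix(B, members):
    """Restrict B to the community and subtract the in-community row sums from the diagonal."""
    members = np.asarray(members)
    Bc = B[np.ix_(members, members)]          # advanced indexing returns a copy
    return Bc - np.diag(Bc.sum(axis=1))

def delta_q(B, two_m, members, s):
    """Modularity variation caused by splitting `members` according to s (+1 / -1)."""
    Bc = community_modularity_matrix(B, members)
    s = np.asarray(s, dtype=float)
    return float(s @ Bc @ s) / (2.0 * two_m)   # s^T B^(C) s / (4m)

# Hypothetical toy network: triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
k = A.sum(axis=1)
two_m = k.sum()
B = A - np.outer(k, k) / two_m

triangle = [0, 1, 2]
row_sums = community_modularity_matrix(B, triangle).sum(axis=1)
dq_undivided = delta_q(B, two_m, triangle, [1, 1, 1])    # community left as it is
dq_split = delta_q(B, two_m, triangle, [1, -1, -1])      # node 0 separated from nodes 1 and 2
```

In this example `row_sums` and `dq_undivided` are zero up to rounding, while `dq_split` is negative (-36/196), which matches the difference between the modularity (3) of the partitions after and before the split. Splitting the triangle would therefore lower the modularity of the whole network.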

In summary, generalized Newman’s method works as follows: (i) the communities are repeatedly divided according to the signs of the leading eigenvector of their corresponding modularity matrices; (ii) when a division does not result in a positive change $\Delta Q$ for a community, the community must remain undivided; (iii) when there is no community in which the division increases $Q$, the process is finished. Algorithm 1 shows an algorithm for generalized Newman’s method.
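The repeated division summarized above can be sketched as a simple loop over a list of communities still to be tested. This is a minimal dense version under stated assumptions: it uses `numpy.linalg.eigh` on each community modularity matrix, omits the fine-tuning stage, requires a network with at least one edge, and introduces a hypothetical tolerance `tol` to decide whether $\Delta Q$ counts as positive in floating point arithmetic.

```python
import numpy as np

def newman_communities(A, tol=1e-10):
    """Repeated spectral bisection; a community is split only while delta Q > tol."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)
    two_m = k.sum()
    B = A - np.outer(k, k) / two_m               # modularity matrix of the whole network
    pending = [np.arange(len(A))]                # communities still to be tested
    final = []                                   # communities that remain undivided
    while pending:
        g = pending.pop()
        Bg = B[np.ix_(g, g)]
        Bg = Bg - np.diag(Bg.sum(axis=1))        # community modularity matrix
        _, vecs = np.linalg.eigh(Bg)             # ascending eigenvalues
        s = np.where(vecs[:, -1] > 0, 1, -1)     # signs of the leading eigenvector
        dq = float(s @ Bg @ s) / (2.0 * two_m)   # modularity variation of this split
        if dq > tol:
            pending.append(g[s == 1])
            pending.append(g[s == -1])
        else:
            final.append(sorted(g.tolist()))     # no gain: keep the community undivided
    return sorted(final)

# Hypothetical toy network: triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

communities = newman_communities(A)
```

For the toy network the first bisection separates the two triangles, and neither triangle can be divided further with a positive $\Delta Q$, so `communities` is `[[0, 1, 2], [3, 4, 5]]`. Note that every $\Delta Q$ is computed from the modularity matrix of the whole network, restricted to the community, rather than from the modularity matrix of the isolated subgraph.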