Abstract

Community detection in social networks plays an important role in cluster analysis. Many traditional techniques for one-dimensional problems have proven inadequate for high-dimensional or mixed type datasets because of data sparseness and attribute redundancy. In this paper we propose a graph-based clustering method for multidimensional datasets. The method has two distinguishing features: a nonbinary hierarchical tree and multimembership clusters. The nonbinary hierarchical tree clearly highlights meaningful clusters, while the multimembership feature may support more useful service strategies. Experimental results on a customer relationship management problem confirm the effectiveness of the new method.

1. Introduction

A social network is a set of people or groups, each of which has connections of some kind to some or all of the others. Although the general concept of social networks seems simple, the underlying structure of a network implies a set of characteristics typical of all complex systems. Social networks play an extremely important role in many systems and processes and have been intensively studied over the past few years in order to understand both local phenomena, such as clique formation and its dynamics, and network-wide processes, for example, the flow of data in computer networks [1], energy flow in food webs [2], customer relationship management [3–6], and so forth. Modern information and communication technology has offered new interaction modes between individuals, such as mobile phone communications and online interactions. These new social exchanges can be accurately monitored for very large systems comprising millions of individuals, representing a huge opportunity for the study of social science.

Clustering analysis is a data mining technique developed for the purpose of identifying groups of entities that are similar to each other with respect to certain similarity measures. Many different clustering methods have been proposed and used in a variety of fields. Jain [7] broadly divided these methods into two groups: hierarchical clustering and partitioned clustering. Hierarchical clustering groups objects of interest according to their similarity into a hierarchy, with different levels reflecting the degree of inter-object resemblance. The most well-known hierarchical methods are single-link and complete-link. In single-link hierarchical methods, the two clusters whose two closest members have the smallest distance are merged in each step; in complete-link methods, the two clusters whose merger has the smallest diameter are merged in each step. Compared to hierarchical clustering methods, partitioned clustering methods find all the clusters simultaneously as a partition of the data. A typical example is K-means, which is widely used for its ease of implementation, simplicity, and efficiency; in K-means a data point cannot be simultaneously included in more than one cluster [8]. Based on differences in their capabilities, applicability, and computational requirements, clustering methods can be categorized into several approaches: partitioning, hierarchical, density-based, grid-based, and model-based. No particular clustering method has been shown to be superior to all its competitors in all aspects [9].
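The difference between the two merge rules can be made concrete with a minimal sketch (the helper names and the 1-D toy data are ours, purely for illustration):

```python
# Sketch of one merge step in single-link vs. complete-link hierarchical
# clustering on 1-D points (illustrative helper names, not from the paper).

def dist(a, b):
    return abs(a - b)

def single_link(c1, c2):
    # distance between the two closest members of the clusters
    return min(dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # distance between the two farthest members (diameter after merging)
    return max(dist(a, b) for a in c1 for b in c2)

def merge_once(clusters, linkage):
    # merge the pair of clusters with the smallest linkage distance
    pairs = [(linkage(c1, c2), i, j)
             for i, c1 in enumerate(clusters)
             for j, c2 in enumerate(clusters) if i < j]
    _, i, j = min(pairs)
    merged = clusters[i] + clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

clusters = [[0.0], [1.0], [10.0], [11.0]]
step1 = merge_once(clusters, single_link)   # merges the two nearby points
step2 = merge_once(step1, single_link)
```

Repeating `merge_once` until one cluster remains yields the full dendrogram; recording the linkage distance at each merge gives the levels of the hierarchy.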

In recent years, community detection based on clustering has become a growing research field, partly as a result of the increasing availability of a huge number of real-world networks. The most intuitive and common definition of community structure is that such networks contain communities: subsets of vertices within which vertex-vertex connections are dense, but between which connections are relatively sparse. Yang and Luo [10] show that community structure has a close relationship with functionality such as robustness and fast diffusion. It is an important network property and is able to reveal many hidden features of a given network [11]. The detection and analysis of communities in social networks have played an important role in the mining of different kinds of networks, including the World Wide Web [12, 13], communication networks [14], and biological networks [15].

Most traditional community detection algorithms based on clustering are limited to handling one-dimensional datasets [16, 17]. However, the datasets to be mined in real life often contain millions of objects described by many different types of attributes or variables. For example, in customer relationship management, a customer can be depicted by multidimensional or mixed type data such as gender, age, income, education level, and so forth. In such cases, data mining operations and methods are required to be scalable as well as capable of dealing with datasets' complex structures and dimensions. Previous research mainly focused on representing a set of items with a single attribute, which is apparently unsuitable for the scenarios described above: (i) a single attribute cannot accurately represent all the dimensions of the items; (ii) clustering according to a single attribute often fails to capture the inherent dependency among multiple attributes and leads to meaningless clusters.

Under such considerations, in this paper we first introduce two pretreatment methods for multidimensional and mixed type data, followed by a new clustering approach for community detection in social networks. In this approach, individuals and their relationships are denoted by weighted graphs, and the graph density we define gives a better quantitative description of the overall correlation among individuals in a community, so that a reasonable clustering output can be presented. In particular, our method produces trees of simple hierarchy and allows for fuzzy (overlapping) clusters, which distinguishes it from other methods. In order to verify the effectiveness of our method, we performed a preliminary evaluation on a mobile customer segmentation use case; the numerical results provide supporting evidence for further application.

The rest of the paper is organized as follows. In Section 2 we summarize related work on community detection in social networks. In Section 3, we introduce the details of the novel clustering approach for multiattribute datasets. As an application in customer relationship management, this approach is used to analyze a mobile customer segmentation problem in Section 4. Finally, a summary and conclusions are given in Section 5.

2. Related Work
The detection of communities has brought about significant advances in the understanding of many real-world complex networks. Plenty of detection algorithms and techniques have been proposed, drawing on methods and principles from many different areas, including physics, artificial intelligence, graph theory, and even electrical circuits [11]. The spectral bisection methods [18] and the Kernighan-Lin algorithm [19] are early solutions to this problem in computer science. The spectral approach bisects a graph iteratively, which makes it unsuitable for general networks. The Kernighan-Lin algorithm requires a priori knowledge about the sizes of the initial divisions. In 2002, Girvan and Newman [20] proposed a divisive hierarchical clustering algorithm, referred to as GN, which produces a division of a network by iteratively cutting the edge with the greatest betweenness value. However, a disadvantage of GN is its time complexity: O(m²n) on a network of n nodes and m edges, or O(n³) on a sparse network. Newman [21] then proposed a faster algorithm, referred to as NM, with time complexity O((m + n)n), or O(n²) on a sparse network. Much work has been done to improve GN and NM; for example, Radicchi et al. [22] proposed an algorithm similar to GN that uses the edge-clustering coefficient as a new metric and has a smaller time complexity, and Clauset et al. [23] proposed a fast clustering algorithm with time complexity O(n log² n) on sparse graphs. In 2007, Ou and Zhang [24] proposed a new clustering method featuring a hierarchical tree and overlapping clusters, with complexity depending on the height of the hierarchical structure. This method was used to cluster extremist web pages [25] and some classic social networks [26] with single weighted edges.

Random walks have also been successfully used in finding network communities [27, 28]. The idea of this method is that a walk tends to be trapped in dense parts of a network corresponding to communities. Pons and Latapy [27] proposed a measure of similarity between vertices based on random walks which has several important advantages: it captures the community structure in a network well, it can be computed efficiently, and it can be used in an agglomerative algorithm to compute the community structure of a network efficiently. Their algorithm, called Walktrap, runs in O(mn²) time and O(n²) space in the worst case, and in O(n² log n) time and O(n²) space in most real-world cases. Hu et al. [29] proposed a method for the identification of community structure based on a signaling process on complex networks. Each node is taken in turn as the initial signal source to excite the whole network, and the source node is associated with an n-dimensional vector which records the effects of the signaling process. Through this process, the topological relationships of nodes in the network are transferred into a geometrical structure of vectors in n-dimensional Euclidean space. The best partition into groups is then determined by F statistics, and the final community structure is given by the K-means clustering method.

Spectral clustering techniques have seen explosive development and proliferation over the past few years [30–32]. Previous work indicated that a robust approach to community detection is the maximization of the benefit function known as "modularity" over possible divisions of a network. Newman and Girvan [30] showed that the maximization process can be written in terms of the modularity matrix, which plays a role in community detection similar to that played by the graph Laplacian in graph partitioning calculations. They also proposed an objective function for graph clustering, called the Q function, which allows for automatic selection of the number of clusters; higher values of the Q function were shown to correlate well with good graph clusterings. White and Smyth [31] showed how the Q function can be reformulated as a spectral relaxation problem and proposed two new spectral clustering algorithms that seek to maximize Q. Capocci et al. [32] developed a spectral-based algorithm to reveal the structure of a complex network, which could otherwise be blurred by the bias artificially imposed by the iterative bisection constraint. Such a method conjugates the power of spectral analysis with the caution needed to reveal an underlying structure when there is no clear-cut partitioning, as is often the case in real networks.

Many other community detection algorithms have also been proposed in the recent literature. For example, Wu and Huberman [33] proposed a method which partitions a network into two communities, where the network is viewed as an electric circuit and a battery is attached to two random nodes assumed to lie in the two communities. Shi et al. [11] proposed a new genetic algorithm for community detection, using the fundamental measure criterion, modularity, as the fitness function; a special locus-based adjacency encoding scheme is applied to represent the community partition. Shi et al. [34] proposed a novel method based on particle swarm optimization to detect community structures by optimizing network modularity.

3. Multidimensional and Multimembership Clustering Method for Social Networks

3.1. Similarity of Multidimensional Data

Traditional distance functions include Euclidean distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, Weighted Minkowski distance, and Cosine distance. Among these distance functions, Mahalanobis distance is based on correlations between variables by which different patterns can be identified and analyzed. It gauges similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant. In other words, it is a multivariate effect size.
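The scale-invariance of Mahalanobis distance noted above can be demonstrated with a small sketch (the 2-D data and helper functions are ours, purely for illustration): rescaling one coordinate changes the Euclidean distance between two points but leaves the Mahalanobis distance, computed with respect to the rescaled data's covariance, unchanged.

```python
import math

# Euclidean vs. Mahalanobis distance on illustrative 2-D data.
# Mahalanobis distance uses the inverse sample covariance of the data set,
# which makes it invariant under rescaling of the coordinates.

def mean(xs):
    return sum(xs) / len(xs)

def covariance_2d(points):
    # sample covariance matrix [[sxx, sxy], [sxy, syy]]
    mx, my = mean([p[0] for p in points]), mean([p[1] for p in points])
    n = len(points) - 1
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return [[sxx, sxy], [sxy, syy]]

def inv_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def mahalanobis(a, b, cov):
    # sqrt(d^T S^{-1} d) for the difference vector d
    d = (a[0] - b[0], a[1] - b[1])
    s = inv_2x2(cov)
    q = (d[0] * (s[0][0] * d[0] + s[0][1] * d[1])
         + d[1] * (s[1][0] * d[0] + s[1][1] * d[1]))
    return math.sqrt(q)

def euclidean(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
d_m = mahalanobis(data[0], data[4], covariance_2d(data))

# Rescale the first coordinate by 10: Euclidean distance changes,
# Mahalanobis distance (w.r.t. the rescaled data) does not.
scaled = [(10 * x, y) for x, y in data]
d_m_scaled = mahalanobis(scaled[0], scaled[4], covariance_2d(scaled))
```

This invariance is exactly why Mahalanobis distance is described as a multivariate effect size: it measures separation in units of the data's own spread.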

All these distance functions have their own advantages and disadvantages in practical applications. Some research results show that Euclidean distance performs well in vector models, while other numerical examples show that in high-dimensional spaces the farthest and nearest distances become almost equal when Euclidean distance is used to measure the similarity between data points. That is, for high-dimensional data, the traditional similarity measures used in conventional clustering algorithms are usually not meaningful. This problem and related phenomena require adapting clustering approaches to the nature of high-dimensional data, and this area of research has been highly active in recent years. Common approaches are known as, for example, subspace clustering, projected clustering, pattern-based clustering, or correlation clustering. Subspace clustering is the task of detecting all clusters in all subspaces, which means that a point may be a member of multiple clusters, each existing in a different subspace; subspaces can be either axis-parallel or affine. Projected clustering seeks to assign each point to a unique cluster, but clusters may exist in different subspaces; the general approach is to use a special distance function together with a regular clustering algorithm. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance.

In 2011, a new function "Close()" was presented, based on improvements to traditional distance functions, to compensate for their inadequacy in high-dimensional spaces [35]. Let x and y denote two points in n-dimensional space; the function "Close()" is defined in [35] on a per-dimension basis.

It depicts the degree of similarity between two data points and has the following properties.
(a) The minimum value of the function is 0, attained in the limit as the difference between the two points in each dimension tends to infinity; the similarity degree is then smallest.
(b) The maximum value of the function is 1, attained when the two points coincide in each dimension; the similarity degree is then largest.

Analogously to the weighted operators in traditional distance functions, the Close function can be weighted, with a weight denoting the importance degree of the data in each dimension. The advantages of the new function in high-dimensional similarity measurement are evident from the comparison in [35]. Quantitative analysis also showed that this function can avoid the effects of noise and the curse of dimensionality.
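The exact formula of Close() from [35] is not reproduced above; the following sketch therefore uses an assumed per-dimension form, 1/(1 + |x_i − y_i|), averaged with importance weights. It satisfies properties (a) and (b) but should be read as an illustration, not as the definition from [35].

```python
# Hypothetical sketch of a Close()-style similarity (NOT the exact formula
# from [35]): per-dimension closeness 1 / (1 + |x_i - y_i|), combined with
# importance weights w_i that sum to 1 (uniform weights by default).

def close(x, y, w=None):
    n = len(x)
    w = w or [1.0 / n] * n
    return sum(wi / (1.0 + abs(xi - yi)) for wi, xi, yi in zip(w, x, y))

a = (1.0, 2.0, 3.0)
far = (1e9, 1e9, 1e9)
s_same = close(a, a)    # identical points -> similarity at its maximum, 1
s_far = close(a, far)   # huge per-dimension gaps -> similarity near 0
```

Unlike a raw distance, the value stays bounded in [0, 1] regardless of the number of dimensions, which is one way such a function sidesteps the concentration-of-distances problem discussed above.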

3.2. Similarity of Mixed Type Data

For clustering multiattribute datasets, we first introduce a method for measuring the similarity between items [36]. A multiattribute dataset can be separated into two parts: a purely numerical dataset and a purely categorical dataset. Existing efficient clustering methods designed for these two types of datasets are employed to produce corresponding clusters. For the similarity matrix, each entry is defined as the number of times the given sample pair has co-occurred in a cluster [37].

For the purely numerical dataset, the similarity of a pattern pair is defined as the number of times the pair is assigned to the same cluster, divided by the total number of clusterings: each clustering contributes 1 if the pair is assigned to the same cluster and 0 otherwise.

For the purely categorical dataset, the similarity of a pair is defined as the fraction of attributes on which the two items agree. The similarity over the full multiattribute dataset is then a combination of the two parts, governed by a user-defined parameter: a larger value weights the categorical part more heavily, a smaller value the numerical part. The two similarities can also be used as two-dimensional (or multidimensional) data representing the similarity between a pair of items.
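The co-occurrence construction described above can be sketched as follows (the function names, the linear blending rule, and the parameter alpha are our illustrative assumptions about [36, 37], not the papers' exact notation):

```python
# Sketch of mixed-type similarity: the numerical part is the fraction of
# clusterings placing a pair in the same cluster; the categorical part is
# the fraction of matching attributes; alpha blends the two (assumption).

def sim_numerical(labelings, i, j):
    # labelings: list of N cluster-label lists over the same items
    same = sum(1 for lab in labelings if lab[i] == lab[j])
    return same / len(labelings)

def sim_categorical(cat_i, cat_j):
    # fraction of categorical attributes on which the two items agree
    return sum(a == b for a, b in zip(cat_i, cat_j)) / len(cat_i)

def sim_mixed(labelings, cats, i, j, alpha=0.5):
    # alpha weights the categorical part; (1 - alpha) the numerical part
    return (alpha * sim_categorical(cats[i], cats[j])
            + (1 - alpha) * sim_numerical(labelings, i, j))

# three clusterings of four items; items 0 and 1 co-occur in 2 of 3
labelings = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 0]]
cats = [("M", "high"), ("M", "low"), ("F", "low"), ("F", "high")]
s = sim_mixed(labelings, cats, 0, 1, alpha=0.5)
```

Here items 0 and 1 agree on 1 of 2 categorical attributes and co-occur in 2 of 3 clusterings, so the blended similarity is 0.5 · 0.5 + 0.5 · 2/3.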

3.3. Multidimensional and Multimembership Clustering Method for Social Networks

A graph, or network, is one of the most commonly used models to represent real-valued relationships within a set of input items. Since many traditional techniques for one-dimensional problems have proven inadequate for high-dimensional or mixed type datasets due to data sparseness and attribute redundancy, the graph-based clustering method for single-dimensional datasets proposed in [24–26] can be extended as follows to directly cluster multidimensional datasets.

Let G be a graph with vertex set V, in which each edge is associated with a vector of weights.

For a subgraph H of G, we define the ith density of H as the sum of the ith edge weights over H, normalized by the number of vertex pairs in H.

In a single weighted graph G, if the density of a subgraph H equals 1 and the weight of every edge in H equals 1, then H induces a clique. For a multiweighted graph G, a subgraph H is called a λ-quasi-clique if, for some positive real number λ, every one of its densities is at least λ (one density per weight carried by the edges).

Clustering is a process that detects all dense subgraphs in G and constructs a hierarchically nested system to illustrate their inclusion relation.

A heuristic process is applied here to find all quasi-cliques with densities at various levels. The core of the algorithm is deciding whether or not to add a vertex to an already selected dense subgraph. For a candidate vertex, we define its contribution to the subgraph in terms of the weights of the edges joining it to the subgraph; the vertex is added if its contribution exceeds a user-specified parameter.
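The growth heuristic can be sketched as follows. The toy graph, the exact density and contribution formulas (average edge weight over pairs, and average weight of edges into the subgraph), and the threshold value are illustrative assumptions rather than the paper's precise definitions:

```python
from itertools import combinations

# Greedy growth of a dense subgraph (a sketch of the idea in Section 3.3).
# density(H): average edge weight over all vertex pairs in H.
# contribution(v, H): average weight of edges joining v to H.
# A vertex is added while the grown subgraph's density stays >= lam.

w = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.85,
     ("c", "d"): 0.2, ("d", "e"): 0.3}

def weight(u, v):
    return w.get((u, v), w.get((v, u), 0.0))

def density(H):
    pairs = list(combinations(sorted(H), 2))
    return sum(weight(u, v) for u, v in pairs) / len(pairs)

def contribution(v, H):
    return sum(weight(v, u) for u in H) / len(H)

def grow(seed_edge, vertices, lam):
    H = set(seed_edge)
    while True:
        rest = vertices - H
        if not rest:
            return H
        best = max(rest, key=lambda v: contribution(v, H))
        if density(H | {best}) >= lam:
            H.add(best)
        else:
            return H

# seed with the heaviest edge; the loosely attached vertices d, e are rejected
community = grow(("a", "b"), {"a", "b", "c", "d", "e"}, lam=0.7)
```

Starting from the heaviest edge (a, b), vertex c is absorbed (density 0.85 ≥ 0.7), while adding d would drop the density below the threshold, so the search stops with the dense triangle.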

In short, the main steps of our algorithm can be described as shown in Algorithm 1.

Input: a multiweighted graph G.
Output: meaningful community sets in G.
Algorithm: detect λ-quasi-cliques in G at various levels of λ, and construct a hierarchically nested system to illustrate their
inclusion relation.
While G contains more than one vertex
 begin
 Determine the value of λ.
 Decompose(G):
  for each edge of G in decreasing order of weight, if the two vertices of the edge are not in any community, create a new
  community from them; then repeatedly choose, among the remaining vertices, the one with maximum contribution to the community and add it.
 Merging:
  Merge communities according to their common vertices;
  contract each community to a vertex and redefine the weights of the corresponding edges;
  store the resulting graph as G.
 End.

Trace the process of each vertex, and obtain the hierarchical tree.

Our detailed community detection algorithm, which finds λ-quasi-cliques in G at various levels of λ, is as follows. A hierarchically nested system is constructed to illustrate their inclusion relations.

Step 0. Initialize the level indicator of the hierarchical system to its first level, and fix the user-specified cut-off threshold for λ.

Step 1 (the initial step). Consider the set of all edges of G whose weights exceed the current threshold, and sort them as a sequence in decreasing order of weight; initialize the community sets of the current hierarchical level to be empty.

Step 2 (starting a new search).

Step 3 (growing)

Substep 3.1. If the current candidate set is empty, go to Step 4; otherwise continue. (*) Pick the vertex whose contribution to the current community is maximum.

If Inequality (15), which involves two user-specified parameters, is satisfied for the chosen vertex, add it to the current community and go back to Substep 3.1.

If Inequality (15) is not satisfied, the chosen vertex is discarded. If candidate vertices remain, repeat (*); otherwise, go to Substep 3.2.

Substep 3.2. Advance to the next edge in the sequence; if the sequence is exhausted, go to Step 4; otherwise continue.

Substep 3.3. Consider the current edge; if at least one of its endpoints is not yet in any community, go to Step 2; otherwise go to Substep 3.2.

Step 4 (merging).

Substep 4.1. List all members of the current community collection as a sequence, sorted in decreasing order.

Substep 4.2. If two communities in the sequence overlap sufficiently (as measured against a user-specified parameter), merge them, and rearrange the sequence as follows:

deleting the two merged communities from the sequence.

Update the sequence indices accordingly, and go to Substep 4.4.

Substep 4.3. If the merging condition holds for another pair, go to Substep 4.2.

Substep 4.4. Update the indices; if unprocessed members remain, go to Substep 4.2.

Step 5. Contract each community to a vertex: each vertex of the contracted graph is obtained by contracting a community, and the weight of an edge between two such vertices is defined from the set of crossing edges between the corresponding communities. Other cases are defined similarly.

If the termination condition is not yet met, go to Step 6; otherwise go to End.

Step 6. Increase the level indicator and update λ according to a user-specified parameter, and go to Step 1 (to start a new search at a higher level of the hierarchical system).

End.

Trace the movement of each vertex and generate the hierarchical tree.

If the input data is an unweighted graph G, the adjacency information is used to establish the similarity matrix of G. Let A be the adjacency matrix of G; the inner product of the ith and the jth rows of A is used to describe the similarity between nodes i and j and is stored as the (i, j) entry of the similarity matrix.
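A direct reading of this construction, on a small illustrative graph: the inner product of two adjacency rows counts the common neighbors of the two nodes.

```python
# Similarity matrix from an unweighted graph: entry (i, j) is the inner
# product of rows i and j of the adjacency matrix, i.e. the number of
# common neighbors of nodes i and j (the small graph is illustrative).

A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

n = len(A)
S = [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(n)]
     for i in range(n)]
```

Note that the diagonal entry S[i][i] is simply the degree of node i, and S is symmetric, as a similarity matrix should be.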

4. Simulation Examples

In order to validate the feasibility of the proposed approach for clustering multidimensional datasets, we randomly took 3000 customers' consumption lists for August 2012 from Shandong Mobile Corporation and used our new approach to divide these customers into distinct clusters according to 4 evaluation indices: local call fee, long distance call fee, roaming fee, and text message and WAP fee. The original data of the 3000 customers are listed in Table 1.

We have applied our approach to this problem, and the results of segmentation and their average consumption are listed in Table 2 and Figure 1.

As we can see from the clustering results, the long distance fee of Group 1 accounts for a high proportion of their total expenses; Groups 3 and 4 have high roaming fees; Group 8 has lower costs in each index; Groups 2, 3, and 4 have higher text message and WAP fees. Mobile corporations can initiate corresponding policies according to the clustering results. For example, for the customers in Groups 2, 3, and 4, a mobile corporation should provide discounted text message packages; for the customers in Groups 3, 4, and 6, discounted roaming packages will also help to increase customer loyalty and stability.

On the other hand, we noticed that the sum of the last column of Table 2 is larger than 3000. This is because our method allows multimembership clustering, so some customers can belong to more than one group. For instance, Groups 8 and 1 are low-value and high-value customer groups, respectively, and special policies should be recommended for the 39 customers who belong to both Groups 1 and 8 to help them become loyal, higher-value customers.

5. Conclusions

In this paper, a new graph-based clustering method for multidimensional datasets is proposed. Due to the inherent sparsity of data points, most existing clustering algorithms do not work efficiently for multidimensional datasets, and it is not feasible to find interesting clusters in the original full space of all dimensions. Previous research mainly focused on representing a set of items with a single attribute, which cannot accurately represent all the attributes or capture the inherent dependency among multiple attributes. The new clustering method proposed in this paper overcomes this problem by directly clustering items according to multidimensional information. Since it needs no data preprocessing, the new method may significantly improve clustering efficiency. It also has two distinguishing features: a nonbinary hierarchical tree and multimembership clusters. The application in customer relationship management demonstrates the efficiency and feasibility of the new clustering method.

Conflict of Interests

Peixin Zhao, Cun-Quan Zhang, Di Wan, and Xin Zhang certify that there is no actual or potential conflict of interests in relation to this paper.

Acknowledgments

The first author is partially supported by the China Postdoctoral Science Foundation funded project (2011M501149), the Humanity and Social Science Foundation of the Ministry of Education of China (12YJCZH303), the Special Fund Project for Postdoctoral Innovation of Shandong Province (201103061), the Informationization Research Project of Shandong Province (2013EI153), and the Independent Innovation Foundation of Shandong University, IIFSDU (IFW12109). The second author is partially supported by NSA Grant H98230-12-1-0233 and NSF Grant DMS-1264800.