Abstract

A semantic social network contains an enormous number of nodes and complex semantic information, and traditional community detection algorithms fail to produce cogent communities for it. To address community detection in semantic social networks, we present a clustering community detection algorithm based on the PSO-LDA model. With LDA as the semantic model, we use Gibbs sampling to map quantitative parameters from semantic information into a semantic space. We then present a PSO strategy that uses semantic relations to detect overlapping communities. Finally, we establish a semantic modularity (SimQ) for evaluating the detected semantic communities. The validity and feasibility of the PSO-LDA model and the semantic modularity are verified by experimental analysis.

1. Introduction

With the development of society and the improvement of science and technology, semantic social networks have developed rapidly, and many semantic networks, like Twitter and Weibo, have made a significant impact on our lives. In these networks, different individuals have different small social “worlds” which are called communities [1]. Thus, researchers focus attention on community detection not only to divide networks into modules but also to gain deep insight into interesting properties within the semantic social network. In practical applications, semantic communities greatly promote intelligent information retrieval, marketing management, individualized service, and other information management domains [2]. To date, research on community detection mainly falls into the following three categories: topological community detection [3], community detection on overlapping constructions [4], and semantic community detection.

Topological community detection represents the pioneering work, whose goal is to study topological constructions and divide social networks into several separate subnetworks. Representative algorithms include Modularity Optimization [5], GN [6], and FN [7]. Researchers then gradually focused on overlapping communities, which model real networks more faithfully than disjoint partitions. CPM [8] was proposed to detect overlapping communities, and community detection on overlapping constructions subsequently received more attention in social networks, with many representative algorithms proposed, including LFM [9], EAGLE [10], COPRA [11], DEMON [12], and so forth. Neuman and Yair [13] proposed an agglomerative spectral clustering method with conductance and edge weights, in which the most similar nodes are agglomerated based on the eigenvector space and edge weights. However, this method is only suitable for nonsemantic social networks. As interest in semantic networks grew, semantic community detection attracted researchers’ attention. Yang and McAuley [14] proposed the CESNA model to detect communities by using edge structure and node attributes. This method leads to more accurate community detection as well as improved robustness in the presence of noise in the network structure, but when applied to semantic networks it performs unstably. Reihanian and Ali [15] proposed a generic framework for overlapping community detection in social networks with a special focus on rating-based social networks; this framework considers the information shared by users in order to find meaningful communities. The most important feature of semantic communities is that their nodes not only have topological relationships but also carry semantic content.
Because semantic data mining must build on text analysis, many semantic community detection algorithms adopt the Latent Dirichlet Allocation (LDA) [16] model as their core model.

In the last few years, the analysis of semantic social networks has become popular, and most of these algorithms utilize the LDA model as the basic model. The SVM-DTW method proposed by Solera, Calderara, and Cucchiara [17] can work on hierarchical networks; it has a simple structure and needs few input parameters, but the semantic context is not considered and the detected communities bear little relation to the real semantic network. Li, Ming, and She [18] proposed the GRTM model, which not only simulates users’ interests as latent variables through their information but also treats the connections between users as a result of their information. This method combines context analysis with topological analysis, and the similarity of the detected communities is close to the real semantic social network, but it lacks a sampling step, which can produce fuzzy, irrelevant communities. Xiao and Liu [19] proposed the GLDA-FP model, which can be extended with a prediscretizing method that helps the LDA model detect topic evolution automatically, but the required calculation is large. The LCTA model proposed by Yin, Cao, and Gu [20] assigns different topic distributions to different communities to make the model reasonable; it achieves high accuracy, but the number of communities needs to be preset and some hidden parameters need to be set by experience.

In this paper, we propose a novel community detection algorithm with the objective of dividing nodes into clusters. The main characteristic of the communities detected by this algorithm is that members of the same community have common or similar interests. We take into account the topic and keyword information in the text of individuals’ words through the LDA model, then quantize the semantic nodes and map them into a semantic space. Next, we obtain ideal virtual social communities using the Particle Swarm Optimization algorithm. Last but not least, we build a novel modularity model and use the new function to evaluate the virtual social communities we obtain.

Compared with other models for semantic social networks, such as the Louvain method [21] and the stochastic block model [22], the LDA model provides a probabilistic formulation and thus a firmer mathematical foundation. For the subsequent sampling, Gibbs sampling gives an accurate and powerful mathematical proof for the convergence and solution of the LDA model, which the other semantic models lack. Combined with the PSO algorithm, the probability function derived from the LDA model can be closely integrated with the inertia weight and the constriction factor of the particles [23]. For performance measurement, we propose a new module-detection evaluation model based on semantic information using the cosine function, which enriches the classic semantic detection evaluation models.

The rest of the paper is organized as follows: Section 2 introduces the LDA model in semantic networks. Section 3 presents Gibbs sampling and the proposed algorithm. In order to verify our approach, we conducted extensive experiments on real data sets; the performance evaluation and experimental results are shown and discussed in Sections 4 and 5. Finally, in Section 6 we draw conclusions and envision further work.

2. Preliminaries

2.1. Community Detection Process

The problem of community detection is NP-hard [24]: solutions are initialized at the beginning and optimized constantly on the way to the most satisfying solution. The main goal of detecting semantic communities is to form communities in which individuals share common interests and probably have similar characteristics [25]. We therefore focus on the textual data of individuals’ words. Given the complexity of community detection, we utilize the probabilistic graphical model LDA to model the network. This model has a clear hierarchical structure [26], and the size of its parameter space is independent of the number of training documents.

First, we select topics and words from individuals’ semantic information through the LDA model. Then, we map semantic nodes into a semantic space via the Gibbs sampling method [27]. Finally, in order to get more accurate communities, we use the Particle Swarm Optimization (PSO) algorithm to form semantic communities. The proposed community detection algorithm is explained in the following steps.

2.1.1. Similar Semantic Information Discovery

Every individual says different words, so each node in a semantic social network has its own information content [28]. We therefore abstract the semantic context into topics and then extract keywords from each topic. Through the semantic information, we impose distributions that constrain the otherwise unstructured context [29]. In this way, dividing communities in a semantic social network based on similar documents, topics, and keywords from the social semantic content makes the communities realistic [30]. The LDA probability model is shown in Figure 1.

In this section, we study the LDA model on information content. The relevant mathematical symbols for illustrating the LDA model are given in Table 1. The LDA model assumes the following generative process for each node:

(1) The parameter θ, which pertains to the topic distribution, is subject to the Dirichlet distribution over the prior parameter α.

(2) The parameter φ, which pertains to the keyword distribution, is subject to the Dirichlet distribution over the prior parameter β.

(3) The topic z is subject to the multinomial distribution with topic distribution probability θ.

(4) The keyword w is subject to the multinomial distribution with keyword distribution probability φ_z over topic z.

The process of forming the LDA model is shown in Algorithm 1, where M denotes the number of documents in the process.

(1) Extract the keyword distribution for each topic, φ_k ∼ Dirichlet(β);
(2) for each document d ∈ {1, …, M} do
(3) extract the number of keywords N_d;
(4) extract the topic distribution θ_d ∼ Dirichlet(α);
(5) for each keyword position n ∈ {1, …, N_d} do
(6) extract a topic z_{d,n} ∼ Multinomial(θ_d);
(7) extract a keyword w_{d,n} ∼ Multinomial(φ_{z_{d,n}});
(8) end for
(9) end for
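As an illustration, the generative process above can be sketched in Python. This is a toy sketch with NumPy, not the paper's implementation; the fixed per-document keyword count `N` and the function name are our simplifications:

```python
import numpy as np

def generate_corpus(M, K, V, N, alpha, beta, seed=0):
    """Toy corpus sampled from the LDA generative process of Algorithm 1."""
    rng = np.random.default_rng(seed)
    # (1) keyword distribution for each topic: phi_k ~ Dirichlet(beta)
    phi = rng.dirichlet([beta] * V, size=K)
    docs = []
    for _ in range(M):
        # (4) topic distribution for the document: theta_d ~ Dirichlet(alpha)
        theta = rng.dirichlet([alpha] * K)
        words = []
        for _ in range(N):
            z = rng.choice(K, p=theta)   # (6) draw a topic from theta_d
            w = rng.choice(V, p=phi[z])  # (7) draw a keyword from phi_z
            words.append(int(w))
        docs.append(words)
    return docs, phi
```

Running `generate_corpus(M=5, K=3, V=20, N=10, alpha=0.5, beta=0.1)` yields five ten-word documents whose words are drawn through the latent topics.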

3. Gibbs Sampling and PSO Strategy

3.1. Gibbs Sampling

Gibbs sampling [31] is a simple case of Markov chain Monte Carlo (MCMC) [32]; it extracts a set of approximate samples from a Markov chain whose target probability distribution is designed to converge to the optimal solution in high-dimensional models [33] such as LDA. Given the Markov chain property, the full conditional probability distribution becomes the key to Gibbs sampling [34]. For LDA in this paper, we only sample topics in the semantic social network; that is, we only need to consider the hidden variable z. We denote by z_{¬i} the topic set excluding z_i and by w_{¬i} the set of keywords excluding w_i in order to derive the posterior probability p(z_i | z_{¬i}, w). For each z_i we can find the corresponding keyword w_i, so the probability can be described as in the following equation. When z_i = k and w_i = t (t is one of the keywords in w; k, which corresponds to t, is one of the topics in z), the probability only involves the conjugate Dirichlet-multinomial distributions of the document and topic.

Let n_d^k be the number of keywords in document d assigned to topic k; its multinomial distribution can be described as in the following equation. The number of occurrences of keyword t assigned to topic k, denoted n_k^t, follows a multinomial distribution as shown below. The posterior distributions of θ and φ can then be obtained in the following equations, where K is the number of topics and V is the number of keywords.

The distribution probability p(z_i = k | z_{¬i}, w) can be calculated by (6)–(11), where n_d^k and n_k^t are the topic and keyword counts defined above, K is the number of topics, and V is the number of keywords.
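A minimal sketch of the resulting collapsed Gibbs update, under the standard Dirichlet-multinomial conditional and the count notation above (array names such as `n_dk` are our own):

```python
import numpy as np

def topic_conditional(d, t, n_dk, n_kt, n_k, alpha, beta):
    """Full conditional p(z_i = k | z_-i, w_i = t) for one keyword occurrence.

    All counts exclude the occurrence currently being resampled:
    n_dk[d, k] -- keywords in document d assigned to topic k,
    n_kt[k, t] -- occurrences of keyword t assigned to topic k,
    n_k[k]     -- total keywords assigned to topic k.
    """
    V = n_kt.shape[1]
    doc_part = n_dk[d] + alpha                          # document-topic term
    word_part = (n_kt[:, t] + beta) / (n_k + V * beta)  # topic-keyword term
    p = doc_part * word_part
    return p / p.sum()
```

The returned vector is the normalized distribution over the K topics from which the new assignment z_i is drawn at each sweep.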

3.2. PSO Class Dependent LDA (PSO-LDA)

Particle Swarm Optimization (PSO) is an intelligent optimization algorithm, first proposed by J. Kennedy and R. C. Eberhart [35]. The PSO algorithm has the advantages of simplicity, fast convergence speed [36], few control parameters, and so forth.

Compared with other optimization algorithms, such as the Genetic Algorithm (GA), Ant Colony Optimization (ACO), and Simulated Annealing (SA), the PSO algorithm has two attractive features: firstly, PSO optimizes from local optima first and runs fast, which makes the algorithm more adaptable to evolving networks; secondly, particles in PSO map naturally to nodes in the semantic network, and the process of finding the optimal solution in PSO is consistent with the formation of semantic communities.

PSO generates a set of random solutions at startup and uses iterative search to find the optimal solution [37]. In PSO, each candidate solution of the optimization problem is called a “particle”, and each particle has its own fitness value. We design a heuristic method to detect communities based on PSO, in which each particle searches for the optimal solution by sharing social information among individuals.

In PSO-LDA, LDA semantic features are embedded into PSO. We map nodes in the semantic social network to “particles” in PSO and map the semantic information vector of each node to the velocity of each particle. As the fitness value, we use the information similarity function instead. In PSO, the nodes in the semantic social network simulate the behavior of a “bird flock”, in which social sharing of information takes place and individuals gain from the discoveries and previous experience of all other nodes during the search for food [38]. Thus each node (particle) in the semantic social network (swarm) is assumed to “fly” over the search space looking for promising regions of the landscape.

First, we assume the search space is D-dimensional; the position of particle i of the swarm is denoted by the vector x_i = (x_{i1}, x_{i2}, …, x_{iD}). Each particle keeps two pieces of information during the process: its “best” position with the smallest value so far (i.e., its personal best position p_i) and the best function value found by all particles in the swarm (i.e., the global best position g). At each iteration, particle i of the swarm updates its velocity and position according to the following equations, where t is the current iteration, i = 1, 2, …, N, N is the size of the population, D is the dimension of the search space, ω is the inertia weight, and c1 and c2 are two positive constants called study factors; r1 and r2 are two random numbers drawn from the range [0, 1] for each dimension.

In the search space, once the velocity is updated, the particle position is changed as in the following equation, where χ is a constriction factor that manages and regulates the magnitude of the velocity to maintain a balance between exploration and exploitation; it can be calculated from the study factors with φ = c1 + c2, φ > 4. The constriction factor influences the proposed algorithm; we discuss this issue in Section 5.3. The pseudocode for PSO is described in Algorithm 2 [39].

Input:
The semantic social network graph processed by LDA;
Output:
A useful transformable probability matrix;
Step  0. Initialize the parameters: inertia weight ω, constriction factor χ, study
factors c1, c2, population size (the size of the network) N, particle dimension (the number of
nodes in the semantic social network) D, and maximum iteration t_max.
Step  1. Initialize all particles and let t = 0;
Step  2. Evaluate the fitness of each particle;
Step  3. Judge whether the termination criterion is satisfied. If t > t_max, stop and jump to Final; otherwise
refresh the variables according to the following steps;
Step  4. Refresh p_i by comparing the current fitness of each particle with its own historical best position
p_i; if the fitness gets smaller, replace p_i with the current position;
Step  5. Refresh g by comparing the current best fitness of all particles with the historical best
position of the whole swarm; if it gets smaller, replace g with the current best position;
Step 6. Refresh v_i and x_i using Eq. (12) and Eq. (13);
Step 7. t = t + 1, return to Step 2;
Final.
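The update step of Algorithm 2 can be sketched as follows. This reflects our reading of Eqs. (12) and (13), with the constriction factor applied in the position update as the text describes; the parameter values are illustrative defaults, not the paper's tuned settings:

```python
import numpy as np

def constriction_factor(c1=2.05, c2=2.05):
    """Clerc-Kennedy constriction factor, assuming phi = c1 + c2 > 4."""
    phi = c1 + c2
    return 2.0 / abs(2.0 - phi - np.sqrt(phi * phi - 4.0 * phi))

def pso_step(x, v, pbest, gbest, w=0.7, c1=2.05, c2=2.05, seed=None):
    """One iteration: velocity update (Eq. (12)), then position update (Eq. (13))."""
    rng = np.random.default_rng(seed)
    r1 = rng.random(x.shape)  # fresh random factors per dimension
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x_new = x + constriction_factor(c1, c2) * v_new
    return x_new, v_new
```

With c1 = c2 = 2.05, the constriction factor evaluates to roughly 0.73, damping the velocity so the swarm converges rather than oscillating.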

4. Performance Measure

Generally speaking, the performance measure of a semantic social network is mostly based on topological construction. The extended modularity EQ proposed by Shen et al. [40] is widely used in evaluating overlapping communities and is described in the following equation, where k_i is the degree of node i and k_j is the degree of node j, 2m is the total degree of the network, A_{ij} is the element of the adjacency matrix of the network, O_i is the number of communities to which node i belongs and O_j is the number of communities to which node j belongs, and c is a community in the network.

Because we use both topological construction and semantic context to detect communities, a novel evaluation model named SimQ, in which we add information similarity to the topological evaluation index, is given by the following equation. Here i and j are nodes, O_i and O_j are the numbers of communities to which nodes i and j pertain, 2m is the total degree of the network, A_{ij} is the element of the adjacency matrix, and SimQ takes values in [-1, 1]. For the information similarity S(i, j), we consider a normal social graph G = (V, E), where V is the set of nodes in the network and E is the set of edges linking the graph nodes. The point of SimQ is to measure the structural correlation of nodes while adding a semantic correlation component at the same time, which is more suitable to the basic characteristics of semantic communities. Each node i is associated with an information vector I_i; S(i, j) is the information similarity of two neighboring nodes i and j, calculated as the cosine of their information vectors, where D is the dimension of the vectors. In our method, if the semantic components of two nodes are close, the projection angle between the two nodes is relatively small; otherwise, the projection vectors point in contrary directions.
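A minimal sketch of SimQ under our reading of the equations above; the function names and the exact placement of the 1/(O_i O_j) weight are our assumptions:

```python
import numpy as np

def info_similarity(Ii, Ij):
    """S(i, j): cosine similarity of two nodes' information vectors."""
    denom = np.linalg.norm(Ii) * np.linalg.norm(Ij)
    return float(np.dot(Ii, Ij) / denom) if denom else 0.0

def sim_q(A, communities, info):
    """SimQ: overlapping modularity (Shen et al.) weighted by S(i, j).

    A: adjacency matrix, communities: list of node lists (may overlap),
    info: information vector I_i for each node.
    """
    k = A.sum(axis=1)          # node degrees k_i
    m2 = A.sum()               # total degree 2m
    O = np.zeros(len(A))       # number of communities each node belongs to
    for c in communities:
        for i in c:
            O[i] += 1
    total = 0.0
    for c in communities:
        for i in c:
            for j in c:
                s = info_similarity(info[i], info[j])
                total += s * (A[i, j] - k[i] * k[j] / m2) / (O[i] * O[j])
    return total / m2
```

When every node carries the same information vector (S(i, j) = 1 everywhere) and the communities are disjoint, SimQ reduces to Newman's modularity, which is a useful sanity check on the implementation.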

5. Experimental Results

In this part, we present and discuss experiments on topic number analysis, evaluation criteria, real datasets, and different community detection algorithms, based on three datasets (the American College Football network, the Krebs polbooks network, and the dolphins network).

5.1. The Analysis on Topics Number

The number of topics K, which is one of the input parameters of the PSO-LDA model, can influence the compactness of communities. We therefore choose the following three datasets to verify the effect of the topic number on the result. The American College Football network is shown in Figure 2. This network, created by Newman, is a complex social network of the American College Football league: nodes are football teams, and an edge between two nodes represents that the two teams have played a match. It contains 115 nodes and 616 edges. The Krebs polbooks network, established by V. Krebs, is shown in Figure 3. The nodes represent political books sold on Amazon; books are approximately divided into three classes by political tendency. In order to obtain the topic distribution, Newman collected the political tendency within 3 steps around each node. The dolphins network collected by Newman is shown in Figure 4. It is made up of two families, including 62 nodes and 159 edges. We simulate semantic information for each node to fit the Dirichlet distribution.

In this section, we vary the topic number K in experiments on the three datasets (football, polbooks, and dolphins). Figure 5 shows the comparison of EQ and SimQ on the three datasets as K varies. While the topic number grows and the topic distribution rises, the number of detected communities increases with K. In Figure 5, when the topic number becomes large enough, the topic distribution tends to be stable, which limits the further increment of communities. From the comparison of EQ and SimQ, the two performance measures tend to decrease as K increases once the topic number passes an optimal point. The optimal value of K is 6 in Figure 5.

For the sake of a more intuitive view of the communities, Figure 6 shows the detected communities of the three datasets when K is 6, 12, and 18.

5.2. The Comparison on Different Optimization Algorithms

In this section, we compare different optimization algorithms on the three network datasets above (dolphins, polbooks, and football). We compare the number of communities, the size of communities, the runtime, and the semantic concentration of the PSO algorithm, Genetic Algorithm (GA), Ant Colony Optimization (ACO), and Simulated Annealing (SA). The result is shown in Figure 7, from which we can see that the PSO algorithm produces more communities with smaller sizes than the others. The semantic concentration (SC) [41] is a function for measuring and testing the degree of coagulation on a specific topic and is shown in the following equation, where δ(i, j) measures community links: there is a link between i and j if and only if i and j belong to the same community. Compared with the similarity function S, SC focuses on the stability of social groups in the local environment. What needs to be noted is that a higher SC does not mean a higher SimQ in the communities, nor does a higher SC mean that we get the best division; this is because the overlapping parts of communities can affect the semantic cohesion. So the ideal community construction should balance SimQ and SC, which also fits the performance measures of overall optimization and local optimization. Compared with GA, ACO, and SA in Figure 7, the communities detected by PSO are slightly smaller and slightly more numerous, which is in accordance with the topic distribution. As for runtime, PSO runs a bit slower than ACO but much better than GA and SA. Figure 8 shows the four optimization algorithms run on the dolphins network, and, similar to Figure 7, PSO works much better than the other algorithms on community detection.

5.3. The Comparison on the Constriction Factor with SimQ and EQ

In this section, we compare SimQ and EQ over the three datasets as the constriction factor χ varies; the corresponding run diagrams are shown in Figure 9. From (16), the information similarity function enters SimQ, so generally the curve of SimQ lies above that of EQ. The maximum value of SimQ on the football dataset is 0.4233 (χ = 0.52) while that of EQ is 0.4064 (χ = 0.53); there exists a bias at χ = 0.53, where the value of SimQ is 0.4112 (not the maximum one). There is also a bias on the polbooks and dolphins datasets: the maximum value of SimQ is 0.4154 (χ = 0.54) and that of EQ is 0.3982 (χ = 0.55) on polbooks, while the maximum value of SimQ is 0.4639 (χ = 0.60) and that of EQ is 0.4489 (χ = 0.62) on dolphins.

5.4. The Comparison on Community Detection Algorithms

Considering the bias in semantic community detection, we utilize classical nonsemantic algorithms to illuminate the issue, taking the football dataset as an example.

We choose GN, FN, LFM, and COPRA as classical nonsemantic algorithms, where LFM and COPRA are overlapping community detection algorithms. The EQ and SimQ of the above algorithms are covered in Table 2, and the detected communities on the football dataset are shown in Figure 10.

From the results in Table 2, the EQ of the classical nonsemantic algorithms is higher than that of PSO-LDA (0.5132), but their SimQ is lower than that of PSO-LDA (0.4258). This suggests that the classical nonsemantic algorithms achieve a higher EQ in topological construction detection and a lower SimQ in semantic detection; there is a bias in community detection by nonsemantic algorithms compared with semantic algorithms in obtaining the ideal communities. On the one hand, we verify the performance of these algorithms; on the other hand, we use this experiment to verify the relation among EQ, SimQ, and SC. As for SC in Table 2, PSO-LDA performs better in SimQ, has a high EQ, and is higher than the other algorithms in SC. This means PSO-LDA performs well in overall search (EQ and SimQ) and better than the others in local search (SC).

5.5. The Comparison on Real Datasets

In this section, we compare different real datasets, including the Quantifying Link Semantics-Publication (QLSP) dataset (805 nodes), the Academic Social Network (ASN) dataset (2500 nodes extracted) (https://www.aminer.cn/aminernetwork), 10000 and 20000 nodes extracted from the DBLP (December 31, 2014) dataset (2839219 nodes) (http://dblp.uni-trier.de/db/) as DBLP(A) and DBLP(B), and the Enron email network (Enron) dataset (25000 nodes extracted) (http://snap.stanford.edu/data/email-Enron.html). The EQ, SimQ, and C (the number of detected communities) of the above datasets detected by the various algorithms are reported in Table 3, with the PSO-LDA topic number K set to its optimal value of 6. The histogram of EQ is shown in Figure 11 and that of SimQ in Figure 12. From Figures 11 and 12, the PSO-LDA model is more suitable for semantic community detection than the classical nonsemantic algorithms.

6. Conclusion

In this paper, we presented a novel community detection algorithm, PSO-LDA, that combines topological construction with semantic information. It avoids presetting the number and the size of the communities. With Gibbs sampling solving the hidden parameters of the proposed model, the sampling result approaches the realistic state. The main contribution of this research is the use of a similarity measure between nodes based on both topological construction and semantic information. In future work, we will apply the model to fields such as privacy protection and worm containment in semantic social networks.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is sponsored by National Natural Science Foundation of China (61402126), Nature Science Foundation of Heilongjiang province of China (F2016024), Heilongjiang Postdoctoral Science Foundation (LBH-Z15095), University Nursing Program for Young Scholars with Creative Talents in Heilongjiang Province (UNPYSCT-2017094), Heilongjiang Province Foundation for Returned Scholars (LC2018030), and National Training Programs of Innovation and Entrepreneurship for Undergraduates (201810214020). The paper is also supported by China Natural Science Fund.