Research Article | Open Access
Multityped Community Discovery in Time-Evolving Heterogeneous Information Networks Based on Tensor Decomposition
Heterogeneous information networks, which consist of multiple types of objects connected by links with rich semantic meanings, are omnipresent in real-world applications. Community discovery is an effective method for extracting the hidden structures in networks. Heterogeneous information networks are usually time-evolving: their objects and links are dynamic and vary gradually. In such time-evolving heterogeneous information networks, community discovery is a challenging topic and considerably more difficult than in traditional static homogeneous information networks. In contrast to communities in traditional approaches, which contain only one type of objects and links, communities in heterogeneous information networks contain multiple types of dynamic objects and links. Recently, some studies have focused on dynamic heterogeneous information networks and achieved satisfactory results. However, they assume that heterogeneous information networks follow simple schemas, such as the bityped network schema and the star network schema. In this paper, we propose a multityped community discovery method for time-evolving heterogeneous information networks with general network schemas. A tensor decomposition framework, which integrates tensor CP factorization with a temporal evolution regularization term, is designed to model the multityped communities and address their evolution. Experimental results on both synthetic and real-world datasets demonstrate the efficiency of our framework.
Most artificial online systems, such as World Wide Web, social networks, and collaboration networks, can be represented as information networks, which describe the interactions and relationships between numerous objects, for example, hyperlinks between web pages, friendships between users, and coauthorships between researchers. The information network analysis is attracting an increasing number of researchers from a variety of fields, such as social science [1, 2], machine learning [3–5], and recommendation systems [6, 7]. Community discovery is one of the most significant focuses in information network analysis, which aims to discover interpretable hidden structures, patterns of interactions among objects, and their evolution along with time in such network. Although community detection in networks has been studied for many years, most existing approaches are designed to analyze static information network [1, 8, 9] and homogeneous information network [10–12]. That is, there is only one type of objects and links contained in the network, and the objects and links are not time-varying.
However, in real-world scenarios, information networks are typically heterogeneous and time-evolving. In contrast to communities in traditional approaches, which contain only one type of static objects and links, communities in time-evolving heterogeneous information networks contain multiple types of dynamic objects and links. For example, the DBLP network, an open resource including most bibliographic information on computer science, is a typical time-evolving heterogeneous information network. The DBLP network contains four types of objects: author (A), paper (P), venue (i.e., conference or journal) (V), and term (T). The links between different object types represent different semantic relationships, such as “an author wrote a paper” and “a paper published in a conference.” The most intriguing communities in DBLP are research areas, which contain the authors with similar research interests, the papers they wrote, the conferences they attended, and the terms they used. As new authors and new hot topics are added, the structures of the communities change gradually.
Although the traditional community discovery methods can be applied to time-evolving heterogeneous information network by converting such network into a set of homogeneous information networks and aggregating the time-evolving objects and links along with all timestamps into one snapshot, the rich semantic relationships among different object types and the dynamic property of the communities are lost. In recent years, community discovery in time-evolving heterogeneous information networks has emerged as an outstanding challenge and attracted the attention of many researchers. For instance, Sun et al. used net-clusters  to describe the communities and proposed a Dirichlet Process Mixture Model based algorithm named Evo-NetClus [14, 15] to detect the communities in heterogeneous information networks with star network schema. In the star network schema, the links only appear between target objects and attribute objects.
In this paper, we focus on community discovery in time-evolving heterogeneous information networks with general network schemas, which presents several challenges as follows:
(i) Heterogeneity: the communities in heterogeneous information networks are also heterogeneous; they contain multityped objects and links.
(ii) Time-varying: the communities are constantly changing, with new objects coming and old objects vanishing. We assume that the evolution of communities at two adjacent snapshots should be smooth.
(iii) Suitability for general network schemas: the network schema of a heterogeneous information network is often more complex than the star network schema. The community discovery method should be able to handle general network schemas.
(iv) Online mode: although some offline frameworks can produce a global view of community evolution over time by capturing all historical information, an online framework is more realistic.
To overcome the aforementioned challenges, we propose a tensor decomposition framework for modeling the multityped communities and address their evolution in time-evolving heterogeneous information networks with general network schemas. Essentially, a time-evolving heterogeneous information network consists of a sequence of network snapshots. We model the time-evolving heterogeneous information network as a sequence of multiway arrays, that is, tensors. Tensor is a highly effective and veracious approach for modeling high-mode data, which can naturally express the complex structures and interactions in heterogeneous information networks. By integrating the tensor CP factorization with a temporal evolution regularization term, the multityped communities and their evolution along time can be formalized as a tensor decomposition problem. A second-order stochastic gradient descent algorithm is presented to solve the problem, and the experimental results on both synthetic and real-world datasets demonstrate the efficiency of our framework.
The rest of this paper is organized as follows. In Section 2, we discuss the related work on community discovery in time-evolving heterogeneous information networks. Section 3 formalizes the problem as tensor decomposition, which integrates tensor CP factorization with a temporal evolution regularization term. A second-order stochastic gradient descent algorithm is presented in Section 4. Section 5 discusses some implementation issues, including dead and new objects, online deployment, and time complexity analysis. The experimental results on both synthetic and real-world datasets are presented in Section 6. Finally, the conclusions are drawn in Section 7.
2. Related Work
Community discovery is a fundamental technique of information network analysis. Many creative methods for discovering communities in static and homogeneous networks have been developed in the past decades. The stochastic block model [16, 17] and the mixed membership model are powerful probabilistic community discovery models for analyzing static networks. These two models, however, cannot handle time-evolving networks and cannot be directly used for heterogeneous information networks.
Tracking the evolution of communities [11, 19] takes the dynamic properties in time-evolving networks into consideration. A commonly used framework [20–22] is to apply the static community detection algorithms for each snapshot of the time-evolving networks and then generate the evolution of communities by computing the match between two adjacent snapshots. Another attempt to track community evolution in time-evolving networks is multiobjective optimization model [23–25], which integrates the measurement of community quality and temporal smoothness into a multiobjective cost function. Nevertheless, these methods are designed for homogeneous networks.
Recently, the community discovery in heterogeneous information networks has become a hot topic. Tang et al. introduced the community evolution in multimode network and proposed a framework which partitioned the multimode network into a set of bityped networks [26, 27]. Sun et al. used net-clusters  to describe the communities and proposed Evo-NetClus [14, 15] to detect the communities automatically. However, the net-clusters and Evo-NetClus are only suitable for star network schema, where the links only appear between target objects and attribute objects.
To analyze the heterogeneous information networks with general network schemas, tensor factorization offers a promising way for extracting hidden communities in such networks. Tensor is an effective expression of complicated and interpretable structures among different dimensions in heterogeneous information network. For instance, Lin et al. proposed MetaGraph Factorization [28, 29] to detect the communities from dynamic social networks. In addition, a tensor factorization based mixed membership framework  simulates the generation of communities as Dirichlet distribution, which can identify the communities automatically. However, this method needs to partition the heterogeneous network into four parts artificially and organize them as a 3-star network. Meanwhile, the 3-star count tensor must be converted to an orthogonal symmetric tensor. Thus the capability of this method to deal with time-evolving heterogeneous information networks could be degraded.
Our prior works in [31–33] have also focused on clustering heterogeneous information networks based on tensor decomposition, which can cluster multityped objects simultaneously in heterogeneous information networks. However, these methods treat the heterogeneous information networks as static networks and integrate the time-evolving networks into one snapshot, which lose the dynamic properties among multityped objects and links.
Another line related to our work is on the incremental tensor factorization . Though tensor factorization has been widely studied in many domains, such as image processing  and computer vision , the incremental tensor factorization is still a challenging intellectual task . Sun et al. proposed a general framework of incremental tensor analysis  for mining higher-order data streaming, which included three methods: dynamic tensor analysis, streaming tensor analysis, and window-based tensor analysis. Even though the higher-order data streaming can be effectively analyzed in such framework, the smooth evolution of latent patterns cannot be guaranteed.
3. Problem Formulation
Following the works by Sun et al. in  and our prior work , we first introduce some definitions of heterogeneous information networks and tensor construction from a given heterogeneous information network.
A heterogeneous information network $G = (V, E)$ is a graph consisting of more than one type of objects $V$ or links $E$. Assume that $V$ belongs to $N$ object types $\mathcal{V} = \{V_n\}_{n=1}^{N}$, and $E$ belongs to $M$ link types $\mathcal{E} = \{E_m\}_{m=1}^{M}$. That is, in a heterogeneous information network, $N > 1$ or $M > 1$. Otherwise, the network becomes a homogeneous information network.
The set $V_n$ contains the objects of the $n$th type. We denote an arbitrary object in $V_n$ as $v^{n}_{i_n}$, for $i_n = 1, 2, \ldots, I_n$; $n = 1, 2, \ldots, N$, where $I_n$ is the number of objects in type $n$; that is, $I_n = |V_n|$. Thus, the total number of objects in the heterogeneous information network is given by $I = \sum_{n=1}^{N} I_n$.
The network schema  for a given heterogeneous information network is a metatemplate that indicates the formation of object types and link types in the network. The network schema is denoted by . In other words, is an instance of . For example, the star network schema shown in Figure 1 is a typical network schema, in which four types of objects are contained, that is, author, paper, venue, and term. In Figure 1, paper is target object, and the others are attribute objects. The feature of star network schema is that the links in the network only appear between target object and attribute objects.
A gene-network , denoted by , is a minimum instance of in the set of subnetworks of . It is noteworthy that a gene-network is an integrated semantic relation in the heterogeneous information network, which is quite different from gene regulatory network in Bioinformatics . For example, a gene-network in DBLP network, denoted by , represents an integrated semantic relation; that is, “an author writes a paper , which contains the term and is published in the venue .” For simplicity, we can mark the gene-network by the subscripts of objects in , that is, .
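Following our prior work, an $N$th-order tensor $\mathcal{X} \in \{0, 1\}^{I_1 \times I_2 \times \cdots \times I_N}$ can be constructed according to the distribution of gene-networks, where each mode of $\mathcal{X}$ represents one type of objects in the network $G$. An arbitrary element $x_{i_1 i_2 \ldots i_N}$ is an indicator of whether the corresponding gene-network exists, where $i_n$, for $n = 1, 2, \ldots, N$, is the index of an object in type $n$:

$$x_{i_1 i_2 \ldots i_N} = \begin{cases} 1, & \text{if the gene-network indexed by } (i_1, i_2, \ldots, i_N) \text{ exists}, \\ 0, & \text{otherwise}. \end{cases} \tag{1}$$

The time-evolving heterogeneous information networks can be segmented into a network sequence according to a series of snapshots. The heterogeneous information network associated with timestamp $t$ can be denoted as $G^{(t)}$; then the network sequence is $\{G^{(1)}, G^{(2)}, \ldots, G^{(t)}, \ldots\}$. Thereby, the tensor representation of the network sequence is $\{\mathcal{X}^{(1)}, \mathcal{X}^{(2)}, \ldots, \mathcal{X}^{(t)}, \ldots\}$. Actually, $\mathcal{X}^{(t)}$ is the hyper-adjacency tensor of the given heterogeneous information network at the $t$th timestamp, which indicates the distribution of gene-networks.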
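The tensor construction for one snapshot can be illustrated with a small sketch. It is not the authors' implementation (the experiments in this paper use MATLAB with the Tensor Toolbox); the function name `build_snapshot_tensor`, the toy gene-networks, and the mode order are hypothetical, and a dense NumPy array is used only because the example is tiny. A realistic network would need a sparse representation.

```python
import numpy as np

def build_snapshot_tensor(gene_networks, shape):
    """Return the 0/1 hyper-adjacency tensor of one network snapshot.

    gene_networks: iterable of index tuples, one object index per type.
    shape: number of objects in each type (one entry per mode).
    """
    tensor = np.zeros(shape, dtype=np.int8)
    for idx in gene_networks:
        tensor[tuple(idx)] = 1  # this gene-network exists in the snapshot
    return tensor

# Hypothetical toy snapshot with modes (author, paper, venue, term).
shape = (3, 4, 2, 5)
gene_networks = [(0, 0, 0, 1), (0, 1, 0, 2), (2, 3, 1, 4)]
X_t = build_snapshot_tensor(gene_networks, shape)
nnz = int(X_t.sum())  # number of gene-networks in the snapshot
```

Each nonzero entry corresponds to exactly one gene-network, so the number of nonzeros equals the number of gene-networks in the snapshot.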
The community in heterogeneous information network is called multityped community, which is more complex than that in homogeneous information network. A multityped community is a set of gene-networks that share the same features and connect together. In other words, a multityped community contains all associated types of objects and links. As shown in Figure 2, the multityped communities about research areas in DBLP network consist of the authors with similar research interests, the papers they wrote, the conferences they attended, and the terms they used. In each multityped community, the authors, papers, venues, and terms are connected to each other and organized as gene-networks. In fact, the objects may belong to several multityped communities since some gene-networks coming from different multityped communities may share the same objects. For example, a famous scientist can cooperate with other researchers within different areas by publishing many interdisciplinary papers; that is, the famous scientist will be contained in many gene-networks across different multityped communities.
The problem of multityped community discovery from such a network sequence can be decomposed into two subproblems: (A) detect the multityped communities in each network snapshot, and (B) model the evolution of multityped communities over time.
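(A) Multityped Community Discovery in Each Network Snapshot. Without loss of generality, we take the $t$th network snapshot as an example. Let $\{\mathcal{C}^{(t)}_k\}_{k=1}^{K}$ denote the $K$ hidden multityped communities in the network $G^{(t)}$, and let $u^{(n,t)}_{i_n k}$ represent the probability that the $i_n$th object in type $n$ belongs to the $k$th community at the $t$th timestamp. Denote

$$\mathbf{u}^{(n,t)}_{k} = \left[u^{(n,t)}_{1k}, u^{(n,t)}_{2k}, \ldots, u^{(n,t)}_{I_n k}\right]^{\mathsf{T}}. \tag{2}$$

Following our prior work, a multityped community can be represented as

$$\mathcal{C}^{(t)}_k = \mathbf{u}^{(1,t)}_{k} \circ \mathbf{u}^{(2,t)}_{k} \circ \cdots \circ \mathbf{u}^{(N,t)}_{k}, \tag{3}$$

where $\circ$ is the outer product of two vectors. Actually, the multityped community $\mathcal{C}^{(t)}_k$ is a rank-one tensor with the same size as $\mathcal{X}^{(t)}$. Equation (3) indicates the gene-networks and the probability of associated objects belonging to the $k$th community. Thereby, we can approximate $\mathcal{X}^{(t)}$ through a sum of $K$ rank-one tensors; that is,

$$\mathcal{X}^{(t)} \approx \sum_{k=1}^{K} \mathcal{C}^{(t)}_k = \sum_{k=1}^{K} \mathbf{u}^{(1,t)}_{k} \circ \mathbf{u}^{(2,t)}_{k} \circ \cdots \circ \mathbf{u}^{(N,t)}_{k}. \tag{4}$$

Obviously, (4) is a tensor CP factorization. Let the factor matrix $\mathbf{U}^{(n,t)} = \left[\mathbf{u}^{(n,t)}_{1}, \mathbf{u}^{(n,t)}_{2}, \ldots, \mathbf{u}^{(n,t)}_{K}\right]$ be the latent community membership matrix for the $n$th type of objects at timestamp $t$, where $\mathbf{U}^{(n,t)} \in \mathbb{R}^{I_n \times K}$. We denote

$$[\![\mathbf{U}^{(1,t)}, \mathbf{U}^{(2,t)}, \ldots, \mathbf{U}^{(N,t)}]\!] \equiv \sum_{k=1}^{K} \mathbf{u}^{(1,t)}_{k} \circ \mathbf{u}^{(2,t)}_{k} \circ \cdots \circ \mathbf{u}^{(N,t)}_{k}. \tag{5}$$

By minimizing the Frobenius norm of the difference between $\mathcal{X}^{(t)}$ and its CP approximation, the multityped community discovery in each network snapshot can be formulated as an optimization problem:

$$\begin{aligned} \min_{\mathbf{U}^{(1,t)}, \ldots, \mathbf{U}^{(N,t)}} \quad & \left\| \mathcal{X}^{(t)} - [\![\mathbf{U}^{(1,t)}, \mathbf{U}^{(2,t)}, \ldots, \mathbf{U}^{(N,t)}]\!] \right\|_F^2 \\ \text{s.t.} \quad & \sum_{k=1}^{K} u^{(n,t)}_{i_n k} = 1, \\ & u^{(n,t)}_{i_n k} \geq 0, \\ & \sum_{i_n=1}^{I_n} u^{(n,t)}_{i_n k} > 0, \end{aligned} \tag{6}$$

where $n = 1, 2, \ldots, N$; $i_n = 1, 2, \ldots, I_n$; and $k = 1, 2, \ldots, K$. The first and second constraints in (6) guarantee that $u^{(n,t)}_{i_n k}$ is a probability. The last constraint in (6) ensures that each multityped community consists of all associated types of objects.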
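As an illustration of the snapshot model, the sketch below rebuilds a tensor from factor matrices as a sum of rank-one tensors and evaluates the squared Frobenius fit together with the three constraints. The helper names (`cp_reconstruct`, `snapshot_loss`, `is_feasible`) and the toy membership matrices are assumptions made for this example, and the squared norm is used here as one common convention.

```python
import numpy as np

def cp_reconstruct(factors):
    """Sum of K rank-one tensors built from matching columns of the factors."""
    K = factors[0].shape[1]
    out = np.zeros(tuple(U.shape[0] for U in factors))
    for k in range(K):
        rank_one = factors[0][:, k]
        for U in factors[1:]:
            rank_one = np.multiply.outer(rank_one, U[:, k])
        out += rank_one  # one multityped community
    return out

def snapshot_loss(X, factors):
    """Squared Frobenius norm of the CP approximation error."""
    return float(np.sum((X - cp_reconstruct(factors)) ** 2))

def is_feasible(factors, tol=1e-9):
    """Rows are probability vectors and every community has mass in every type."""
    return all(
        np.all(U >= -tol)
        and np.allclose(U.sum(axis=1), 1.0)
        and np.all(U.sum(axis=0) > tol)
        for U in factors
    )

# Hypothetical hard memberships for three object types and K = 2 communities.
U1 = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
U2 = np.array([[1.0, 0.0], [0.0, 1.0]])
U3 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
factors = [U1, U2, U3]
X = cp_reconstruct(factors)
loss = snapshot_loss(X, factors)
feasible = is_feasible(factors)
```

With hard memberships, the reconstructed tensor is itself a 0/1 indicator tensor, which is why the fit is exact in this toy case.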
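(B) Multityped Community Evolution over Time. Equation (6) just performs the multityped community discovery at each timestamp independently and does not consider their smooth evolution at two adjacent snapshots. We denote the objective function in (6) as $f_1$; that is,

$$f_1 = \left\| \mathcal{X}^{(t)} - [\![\mathbf{U}^{(1,t)}, \mathbf{U}^{(2,t)}, \ldots, \mathbf{U}^{(N,t)}]\!] \right\|_F^2. \tag{7}$$

In order to ensure that the evolution of the multityped communities is smooth, a temporal evolution regularization term is introduced:

$$f_2 = \lambda \sum_{n=1}^{N} \left\| \mathbf{U}^{(n,t)} - \mathbf{U}^{(n,t-1)} \right\|_F^2, \tag{8}$$

where $\lambda$ is a temporally regularized parameter. Indeed, $f_2$ is a first-order Markov assumption, which forces the multityped communities at the current timestamp to resemble those at the previous snapshot.

Denote the objective function as

$$f = f_1 + f_2. \tag{9}$$

Therefore, the problem of multityped community discovery in time-evolving heterogeneous information networks can be formulated as

$$\begin{aligned} \min_{\mathbf{U}^{(1,t)}, \ldots, \mathbf{U}^{(N,t)}} \quad & f \\ \text{s.t.} \quad & \sum_{k=1}^{K} u^{(n,t)}_{i_n k} = 1, \\ & u^{(n,t)}_{i_n k} \geq 0, \\ & \sum_{i_n=1}^{I_n} u^{(n,t)}_{i_n k} > 0. \end{aligned} \tag{10}$$

Here, $\mathbf{U}^{(n,t-1)}$ are constants at the current timestamp $t$, which are solved at the previous timestamp. When $t = 1$, we have no a priori knowledge about the multityped communities. We set $\mathbf{U}^{(n,0)} = \mathbf{0}$, for $n = 1, 2, \ldots, N$. Thus, $f_2$ becomes

$$f_2 = \lambda \sum_{n=1}^{N} \left\| \mathbf{U}^{(n,1)} \right\|_F^2. \tag{11}$$

It is worth noting that $f_2$ is also a Tikhonov regularization term, which ensures the sparsity of the factor matrices and makes the optimization solution easy to find. Moreover, when $\lambda = 0$, problem (10) degrades into the same form as in our prior work on static networks. That is, the prior work is the special case for static networks.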
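The regularized objective can be evaluated with a short sketch. It is restricted to three object types to keep the `einsum` call readable, uses squared Frobenius norms without any constant factor (the scaling in the original equations is an assumption here), and the names `objective`, `curr`, `prev`, and `lam` are hypothetical.

```python
import numpy as np

def objective(X, curr, prev, lam):
    """CP fit of the current snapshot plus the temporal smoothness penalty."""
    A, B, C = curr
    approx = np.einsum("ik,jk,lk->ijl", A, B, C)  # sum of rank-one tensors
    fit = np.sum((X - approx) ** 2)
    smooth = sum(np.sum((U - P) ** 2) for U, P in zip(curr, prev))
    return float(fit + lam * smooth)

# Hypothetical memberships for three object types and two communities.
curr = [
    np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]),
]
X = np.einsum("ik,jk,lk->ijl", *curr)

# Unchanged communities: neither term contributes.
f_same = objective(X, curr, curr, lam=0.5)

# First timestamp: previous matrices are all zero, so the penalty is
# lam times the sum of squared entries of the current factor matrices.
zeros = [np.zeros_like(U) for U in curr]
f_first = objective(X, curr, zeros, lam=0.5)
```

In the first-timestamp case the penalty depends only on the current factor matrices, which is the Tikhonov form mentioned in the text.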
The stochastic gradient descent algorithm is an efficient tool for optimizing tensor factorization [33, 39]. However, the first-order stochastic gradient descent algorithm has a poor convergence speed near the optimal point. It has been proven that the second-order stochastic algorithm has not only a faster convergence speed but also better robustness with respect to the learning rate . The SOSClus proposed in  is a second-order stochastic algorithm which has been well studied for the case of in (10), that is, static heterogeneous information networks. Here, we present a second-order stochastic gradient descent algorithm, named SOSComm, for the time-evolving case, which is an extension of SOSClus. In this section, some multilinear operators and tensor algebra for tensor factorization will be used, which can be found in .
When , the snapshot of the current heterogeneous information network and the previous community membership matrices are known. To compute the factor matrix , we can rewrite in (10) by matricization of along the th mode. According to (7), (8), and (9), we havewhereThe is the matricization of along the th mode, and the symbol indicates the Khatri-Rao product of two matrices. Given two matrices and , their Khatri-Rao product is a matrix of size and defined bywhere , , , is an element of , and , , is a column of . In particular, we denote the Khatri-Rao product of a series of matrices except as
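The Khatri-Rao product described above is the column-wise Kronecker product. The sketch below is a plain NumPy illustration (the function name `khatri_rao` is chosen for this example). It also checks the standard identity that the Gram matrix of a Khatri-Rao product equals the Hadamard product of the individual Gram matrices, which is what allows a long chain of Khatri-Rao products to be replaced by small matrix products.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x K) and B (J x K): (I*J) x K."""
    I, K = A.shape
    J, K_b = B.shape
    if K != K_b:
        raise ValueError("both matrices need the same number of columns")
    # Row i*J + j of the result holds A[i, :] * B[j, :].
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, K)

rng = np.random.default_rng(0)
A = rng.random((3, 2))
B = rng.random((4, 2))
P = khatri_rao(A, B)

# Each column is the Kronecker product of the matching columns.
columns_match = all(
    np.allclose(P[:, k], np.kron(A[:, k], B[:, k])) for k in range(2)
)
# Gram identity: (A kr B)^T (A kr B) = (A^T A) * (B^T B), element-wise.
gram_identity = np.allclose(P.T @ P, (A.T @ A) * (B.T @ B))
```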
Since the partial derivative of with respect to has been given in , we introduce the result directly.whereand symbol is Hadamard product, also named element-wise product of two matrices with the same dimension.
The partial derivative of with respect to is
Therefore, the partial derivative of with respect to is given bywhere is a unit matrix. And the second-order partial derivative of with respect to can be obtained as
When , (21) has the same form as SOSClus. That is, the SOSComm is an extension of SOSClus for time-evolving heterogeneous information networks. To satisfy the constraints in (10), the factor matrices derived by (21) should be normalized as
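Because the displayed equations of this derivation did not survive in this copy, the following is only a sketch of how a second-order update of this kind can be derived. It is not the authors' exact formulas. The notation is assumed: $\mathbf{X}_{(n)}$ is the mode-$n$ matricization of the current tensor, $\boldsymbol{\Psi}$ is the Khatri-Rao product of all current factor matrices except the $n$th (in the order matching the unfolding convention), $\mathbf{U}$ and $\mathbf{U}_{\text{prev}}$ are the current and previous factor matrices of mode $n$, $\eta$ is the learning rate, and squared Frobenius norms are used without a constant factor.

```latex
\begin{align*}
f &= \bigl\| \mathbf{X}_{(n)} - \mathbf{U}\boldsymbol{\Psi}^{\mathsf T} \bigr\|_F^2
   + \lambda \bigl\| \mathbf{U} - \mathbf{U}_{\text{prev}} \bigr\|_F^2
   + \text{(terms independent of } \mathbf{U}\text{)}, \\
\boldsymbol{\Gamma} &= \boldsymbol{\Psi}^{\mathsf T}\boldsymbol{\Psi}
   \quad\text{(Hadamard product of the Gram matrices of the other factors)}, \\
\frac{\partial f}{\partial \mathbf{U}}
  &= 2\bigl(\mathbf{U}\boldsymbol{\Gamma} - \mathbf{X}_{(n)}\boldsymbol{\Psi}\bigr)
   + 2\lambda\bigl(\mathbf{U} - \mathbf{U}_{\text{prev}}\bigr), \\
\frac{\partial^2 f}{\partial \mathbf{u}_{i}\,\partial \mathbf{u}_{i}^{\mathsf T}}
  &= 2\bigl(\boldsymbol{\Gamma} + \lambda \mathbf{I}\bigr)
   \quad\text{for every row } \mathbf{u}_i \text{ of } \mathbf{U}, \\
\mathbf{U} &\leftarrow \mathbf{U}
   - \eta\,\frac{\partial f}{\partial \mathbf{U}}
     \bigl[2(\boldsymbol{\Gamma} + \lambda \mathbf{I})\bigr]^{-1}
   = (1-\eta)\,\mathbf{U}
   + \eta\,\bigl(\mathbf{X}_{(n)}\boldsymbol{\Psi} + \lambda \mathbf{U}_{\text{prev}}\bigr)
     \bigl(\boldsymbol{\Gamma} + \lambda \mathbf{I}\bigr)^{-1}, \\
u_{ik} &\leftarrow \frac{u_{ik}}{\sum_{k'=1}^{K} u_{ik'}}
   \quad\text{(row normalization to satisfy the sum-to-one constraint)}.
\end{align*}
```

Under these assumptions, any common constant factor in the objective cancels in the Newton-type step, and setting $\lambda = 0$ leaves $(1-\eta)\,\mathbf{U} + \eta\,\mathbf{X}_{(n)}\boldsymbol{\Psi}\boldsymbol{\Gamma}^{-1}$, a step that uses only the current snapshot.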
For the current network , based on the tensor representation and the previous community membership matrices , the alternating optimization can be used to update according to (21) and (22), while all other variables are fixed. The community membership matrices obtained by (21) and (22) are the approximations. We also need to recover the discrete community membership matrices from the approximations in some cases, which can be achieved by applying -means to the factor matrices. Conveniently, we can simply assign each object to the multityped community which has the largest entry in the corresponding row of factor matrix. After that, the multityped communities consist of gene-networks that can be extracted according to (3). Therefore, the pseudocode of SOSComm is given in Algorithm 1.
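One sweep of the alternating update can be sketched as follows. This is an illustrative reading of the procedure and not the SOSComm pseudocode of Algorithm 1: the update formula is the Newton-type step with a ridge term derived from a squared-error objective, negative entries are clipped before row normalization (an assumption), and all names (`mttkrp`, `update_mode`, `lam`, `eta`) are hypothetical.

```python
import numpy as np

def mttkrp(X, factors, n):
    """Mode-n unfolding of X times the Khatri-Rao product of the other factors."""
    letters = "abcdefgh"[: X.ndim]
    subs, operands = [letters], [X]
    for m, U in enumerate(factors):
        if m != n:
            subs.append(letters[m] + "z")
            operands.append(U)
    return np.einsum(",".join(subs) + "->" + letters[n] + "z", *operands)

def update_mode(X, factors, prev, n, lam, eta):
    """One second-order step for mode n, followed by row normalization."""
    K = factors[n].shape[1]
    gram = np.ones((K, K))
    for m, U in enumerate(factors):
        if m != n:
            gram *= U.T @ U  # Hadamard product of Gram matrices
    rhs = mttkrp(X, factors, n) + lam * prev[n]
    # rhs @ inv(gram + lam*I), using symmetry of the system matrix.
    target = np.linalg.solve(gram + lam * np.eye(K), rhs.T).T
    U_new = np.clip((1 - eta) * factors[n] + eta * target, 0.0, None)
    row_sums = U_new.sum(axis=1, keepdims=True)
    return np.divide(U_new, row_sums, out=np.zeros_like(U_new), where=row_sums > 0)

# Hypothetical exact memberships: they should be a fixed point of the update.
factors = [
    np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]),
]
X = np.einsum("ik,jk,lk->ijl", *factors)
is_fixed_point = all(
    np.allclose(update_mode(X, factors, factors, n, lam=0.1, eta=1.0), factors[n])
    for n in range(3)
)
# Discrete memberships: assign each object to its largest entry.
labels = factors[0].argmax(axis=1)
```

Solving a linear system instead of forming the inverse explicitly is a numerical convenience; the system matrix is only K by K, so this step is cheap compared with the tensor product.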
5. Implementation Issues
5.1. New Objects Coming and Old Objects Vanishing
In realistic scenarios, objects in time-evolving heterogeneous information networks have various lifecycles. As lifecycles begin and end, new objects are born and join the network while old objects die and leave. The framework designed above does not consider these lifecycles; it assumes that the objects in a network remain unchanged and active. Here, we discuss the more realistic case in which new objects arrive and old objects vanish in a time-evolving heterogeneous information network.
Note that the tensor representation is a distribution of gene-networks in the heterogeneous information network, whose elements indicate whether the gene-networks exist or not. If the lifecycle of a new object begins at the th timestamp, it will join the network and become active. Since the size of becomes and only the previous factor matrix is used to regularize the temporal smoothness, we can add an all-zero row to the corresponding position on when updating .
If the lifecycle of a specified object ends at the th timestamp, it will not appear in any gene-network in the network. According to (1), . That is, each element in the hyperplane, which is perpendicular to the th dimensionality and passes the th point of the th dimensionality in the tensor space, is zero. Therefore, we set all entries in the th row of equal to zero; that is, for . However, this operation makes the factor matrix dissatisfy the first constraint in (10). Since our framework is an approximation and the dead objects will never appear in any multityped community (according to (3)), we can loosen the first constraint in (10) aswhich does not affect the performance of recovering the discrete community membership matrices from and extracting the multityped communities .
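The two bookkeeping operations for object lifecycles can be sketched in a few lines. The function names and the toy membership matrix are hypothetical; the sketch only illustrates the idea of padding the previous factor matrix with an all-zero row for a new object and zeroing the row of a dead object, after which each row sums to either 1 (active object) or 0 (new or dead object).

```python
import numpy as np

def add_new_object(prev_U, position):
    """Insert an all-zero row so the previous matrix matches the grown mode."""
    return np.insert(prev_U, position, 0.0, axis=0)

def retire_object(U, index):
    """Zero the membership row of an object whose lifecycle has ended."""
    out = U.copy()
    out[index, :] = 0.0
    return out

def satisfies_relaxed_constraint(U):
    """Every row sums to 1 (active) or 0 (new or dead object)."""
    sums = U.sum(axis=1)
    return bool(np.all(np.isclose(sums, 1.0) | np.isclose(sums, 0.0)))

prev_U = np.array([[0.7, 0.3], [0.2, 0.8]])
grown = add_new_object(prev_U, 2)   # a third object joins the network
retired = retire_object(grown, 0)   # the first object leaves the network
relaxed_ok = satisfies_relaxed_constraint(retired)
```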
5.2. Online Deployment
The snapshots in the network sequence of time-evolving heterogeneous information networks are coming in a stream way, which makes the storage of the whole network sequence unrealistic. Fortunately, we only use the new network snapshot and the previous community membership matrices to update the model, which makes SOSComm easy to deploy online. However, three issues should be taken into account.
Firstly, the initialization of factor matrices has a large impact on the efficiency of SOSComm. A good initialization may reduce the number of iterations significantly. In practice, using the previous community membership matrices as the starting point when updating the current factor matrices is a good choice. That is, at the beginning of the algorithm, each current factor matrix is set equal to the previous community membership matrix of the same object type. See line (1) in Algorithm 1.
Secondly, the second-order stochastic gradient descent algorithm has a fast convergence speed [33, 41] with good initialization, which will be proven in the experiments in Section 6. And the factor matrices obtained by SOSComm are the approximations to community membership matrices. Therefore, we can set the maximum iteration to be a very small positive integer.
Finally, the sparsity of heterogeneous information network should be used to speed up the calculation. According to (21), the primary computation cost for updating is calculating a series of Khatri-Rao products, that is, . If we store all the elements of and calculate the Khatri-Rao product of the factor matrices orderly, it will be a very expensive calculation because the largest scale of intermediate results will reach . Actually, the heterogeneous information networks are usually very sparse; namely, a great amount of elements in tensor are zeros. By considering as a whole, the elements of are given byObviously, when , we can directly set ; that is, the following calculation of Khatri-Rao products is unnecessary. Thus, by considering the sparsity, only nonzero elements in need to be stored and calculated.
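The sparsity argument can be made concrete with a sketch that computes the product of the unfolded tensor and the Khatri-Rao product directly from the nonzero entries, without ever forming the Khatri-Rao product. The coordinate-list storage (`indices`, `values`) and the function name `sparse_mttkrp` are assumptions for illustration; the dense `einsum` is included only to check the result on a toy example.

```python
import numpy as np

def sparse_mttkrp(indices, values, factors, n):
    """Accumulate, per nonzero entry, the product of the other modes' rows.

    indices: (nnz, N) integer array, one row per nonzero tensor entry.
    values:  (nnz,) array with the corresponding entry values.
    Returns an (I_n, K) matrix.
    """
    K = factors[0].shape[1]
    rows = values[:, None] * np.ones((1, K))
    for m, U in enumerate(factors):
        if m != n:
            rows = rows * U[indices[:, m], :]
    out = np.zeros((factors[n].shape[0], K))
    np.add.at(out, indices[:, n], rows)  # handles repeated mode-n indices
    return out

# Hypothetical sparse 3 x 2 x 3 tensor with four gene-networks.
indices = np.array([[0, 1, 2], [1, 0, 0], [2, 1, 1], [0, 0, 2]])
values = np.ones(4)
rng = np.random.default_rng(0)
factors = [rng.random((3, 2)), rng.random((2, 2)), rng.random((3, 2))]

dense = np.zeros((3, 2, 3))
dense[tuple(indices.T)] = values
ref0 = np.einsum("abc,bz,cz->az", dense, factors[1], factors[2])
ref1 = np.einsum("abc,az,cz->bz", dense, factors[0], factors[2])
match0 = np.allclose(sparse_mttkrp(indices, values, factors, 0), ref0)
match1 = np.allclose(sparse_mttkrp(indices, values, factors, 1), ref1)
```

The work in this sketch grows with the number of nonzero entries times the number of modes and communities, and not with the full size of the tensor.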
5.3. Time Complexity Analysis
The primary computation cost for updating the factor matrices in each iteration of SOSComm is calculating three part: , , and the product of them. Firstly, for calculating , only nonzero elements in need to be concerned. Therefore, the time complexity is , where is the number of nonzero elements in and also is the total number of gene-networks in the network. Secondly, according to (17), since a series of matrix-matrix multiplications and Hadamard products are used to replace numerous Khatri-Rao products, calculating costs , where is the total number of objects in the network. Thus, the time complexity for calculating the inverse matrix of is . Finally, the product of and is a matrix-matrix multiplication, where and , so, the time complexity is .
To summarize, the time complexity for SOSComm in each iteration is , where is the total number of gene-networks, is the total number of objects, is the number of object types, and is the number of multityped communities. Since and , the time complexity for SOSComm is nearly .
6. Experiments and Results
In this section, the proposed SOSComm is evaluated on both synthetic and real-world datasets. We demonstrate the efficiency of SOSComm for multityped community discovery in time-evolving heterogeneous information networks with general network schemas and compare its performance with that of several state-of-the-art community discovery methods. The experiments are run in MATLAB R2015a (version 8.5.0, 64-bit) with the MATLAB Tensor Toolbox (version 2.6, http://www.sandia.gov/~tgkolda/TensorToolbox/). The code and datasets used in the experiments are available online at https://github.com/tianshuilideyu/SOSComm.
6.1. Experiments on Synthetic Datasets
6.1.1. Dataset Description
Typically, real-world heterogeneous information networks come without ground-truth community memberships. Furthermore, due to their large scale and sparsity, it is impossible to manually assign community labels to the objects in a real-world network. Therefore, we resort to several synthetic networks with detailed community structures to demonstrate the effectiveness of SOSComm.
We construct four synthetic networks with different parameters as the initial networks, that is, the network snapshots at . In order to obtain more realistic synthetic networks, the interactions between objects are assumed to follow Zipf’s law (see details online: https://en.wikipedia.org/wiki/Zipf’s_law), which denotes the distribution of gene-networks in networks. The parameters are as follows, and the details of the synthetic networks at the first timestamp are shown in Table 1:(i) is the number of object types in networks.(ii) is the number of multityped communities.(iii) is the network scale, and .(iv) is the tensor density, and .
To simulate the smooth evolution of multityped communities, each synthetic network is evolved into a network sequence with 10 timestamps. Within each evolution, a percentage (from 5% to 10%) of the objects from each type change their community memberships by interacting with other objects in different communities randomly at each timestamp.
For completeness, we also randomly generate from 10% to 15% new objects coming and old objects vanishing in Syn4 at each timestamp. With new objects coming and interacting with other objects, many new gene-networks are generated. Meanwhile, with old objects vanishing, they will not appear in any gene-network in the network.
6.1.2. Comparative Methods and Experimental Setting
The performances of SOSComm on synthetic networks are compared with two state-of-the-art baselines:
(1) SOSClus: an offline clustering framework for static heterogeneous information networks, which treats every snapshot in the network sequence independently without the temporal evolution regularization term.
(2) CEMNTR (see [26, 27]): a framework of community evolution in multimode networks with a temporal evolution regularization term. CEMNTR partitions the multimode network into a set of bityped networks and detects communities in each bityped network via block model approximation with temporal regularization.
Both the baselines and SOSComm share the same stopping conditions; that is, the change of corresponding objective function is less than and the maximum iterations . The experiments in our prior work  have shown that the second-order stochastic gradient descent has good robustness with respect to the learning rate. Hence, we set the learning rate for both SOSClus and SOSComm. As CEMNTR needs to partition the networks into a set of bityped networks, we divide each network snapshot in Syn3 and Syn4 into 3 bityped networks and construct the adjacent matrices for each pair of object types.
Since the ground-truth of the community structures in the synthetic networks is known, we adopt the Normalized Mutual Information (NMI)  as the metric to evaluate the performances. NMI is a measurement of mutual dependence information between multityped community membership and the ground-truth, which ranges from 0 to 1. The larger the value of NMI is, the better the result is.
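For reference, NMI between two hard labelings can be computed as in the sketch below. The normalization by the geometric mean of the two entropies is one common variant and is an assumption here, since the exact variant used in the experiments is not stated; the function name `nmi` is likewise chosen for this example.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Mutual information divided by the geometric mean of the entropies."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)  # contingency table
    joint /= a.size
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa * np.log(pa))
    hb = -np.sum(pb * np.log(pb))
    if ha == 0.0 or hb == 0.0:
        # Degenerate case: at least one labeling has a single community.
        return 1.0 if ha == hb else 0.0
    return float(mi / np.sqrt(ha * hb))

# Same partition with renamed communities versus an unrelated partition.
score_same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
score_unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI is invariant to how the communities are named, so a detected partition does not need to be matched to the ground-truth labels before scoring.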
6.1.3. Experimental Results
We set the temporally regularized parameter for SOSComm and CEMNTR. Since the number of multicommunities is an important parameter for SOSComm, we evaluate the performance with different on the 4 synthetic networks firstly. With varying from to , the average values of NMIs of SOSComm on the 4 synthetic networks are shown in Figure 3. Obviously, on Syn1 and Syn3, SOSComm performs best when , and on Syn2 and Syn4, SOSComm performs best when . The results are consistent with the real setting for synthetic networks in Table 1; that is, the real number of multityped communities is for Syn1 and Syn3, and for Syn2 and Syn4. With the widening gap between and the real number of multityped communities, SOSComm performs worse and worse in all synthetic networks.
In the following experiments, we fix the number of communities to the real number of multityped communities in each synthetic network. The comparison of NMIs for SOSComm and the two baselines on the 4 synthetic networks is shown in Figure 4, where each subgraph shows the NMIs of the three methods on each network snapshot of the corresponding synthetic network. The trend of the NMI curve reflects the ability to trace community evolution. From the 4 subgraphs in Figure 4, we find that SOSComm performs best in terms of both NMI and tracing community evolution. Since no knowledge of previous community membership is available at the first timestamp, SOSComm and SOSClus share the same starting point on the 4 synthetic networks. Moreover, as time evolves, SOSComm traces the evolution of multityped communities closely, while the NMIs of SOSClus and CEMNTR on the 4 synthetic networks decline steadily.
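Figure 4: NMIs of SOSComm, SOSClus, and CEMNTR on each network snapshot of the synthetic networks. (a) Experimental results on Syn1. (b) Experimental results on Syn2. (c) Experimental results on Syn3. (d) Experimental results on Syn4.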
As shown in Figure 4(d), with the new objects coming and old objects vanishing in the network at each timestamp, the NMIs of SOSClus and CEMNTR on Syn4 drop sharply; in detail, NMI of SOSClus drops from 1.0 to 0.2865 and NMI of CEMNTR drops from 0.8099 to 0.0976. Meanwhile, NMI of SOSComm keeps smooth relatively. This reveals that SOSComm can handle the time-evolving heterogeneous information networks with new objects coming and old objects vanishing effectively.
The convergence speed is also a significant focus for studying the performances of our framework. We run SOSComm on Syn3 and Syn4 with and analyze the changes of the objective function between adjacent iterations, denoted as , for all timestamps in the two network sequences. When the s almost keep constant, the algorithm converges. Figure 5 shows the experimental results of , where each subgraph displays the convergence speed of SOSComm on Syn3 and Syn4 at corresponding timestamp. In Figure 5, we can see that SOSComm converges quickly on both Syn3 and Syn4 at all timestamps. Particularly, SOSComm has converged when the iterations are less than 10 in all subgraphs, which is a good property for online deployment.
The temporally regularized parameter in (10) controls the impact of historical information on the current community distribution: the larger the parameter, the stronger the impact. To study the influence of tuning this parameter, we apply SOSComm to Syn4 with the parameter varying from 0.1 to 100. The average NMIs and numbers of iterations over all network snapshots are shown in Figure 6, where the axis for the parameter uses a logarithmic scale. As shown in Figure 6, the NMIs and the numbers of iterations remain satisfactory when the parameter is less than 10. However, as the parameter increases beyond 10, both deteriorate quickly. That is, when the temporally regularized parameter is too large, the historical information dominates and the algorithm consumes more resources to smooth the time-evolving communities. When the parameter ranges from 0.1 to 10, it contributes to multityped community detection by taking the temporal information into account.
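The qualitative effect of the temporally regularized parameter can be illustrated with a toy problem. The sketch below assumes a simple quadratic smoothing penalty, that is, it minimizes ||current - a||^2 + lam * ||a - previous||^2 for a single membership vector; this is an assumption made for illustration and is not the objective in (10). The names `smoothed_estimate`, `current`, and `previous` are hypothetical.

```python
def smoothed_estimate(current, previous, lam):
    """Closed-form minimizer of ||current - a||^2 + lam * ||a - previous||^2.

    Assumed toy objective (not the paper's): setting the gradient to zero
    gives a = (current + lam * previous) / (1 + lam), element by element.
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return [(c + lam * p) / (1.0 + lam) for c, p in zip(current, previous)]


current = [1.0, 0.0]   # membership suggested by the current snapshot alone
previous = [0.0, 1.0]  # membership carried over from the previous timestamp

# Sweep the parameter over the same range as the experiment, on a log grid.
for lam in (0.1, 1.0, 10.0, 100.0):
    print(lam, smoothed_estimate(current, previous, lam))
```

In this toy setting, a small value keeps the estimate close to the current snapshot, while a large value pulls it almost entirely toward the previous timestamp, which mirrors the observation that historical information dominates when the parameter is too large.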
To conclude, the experiments on the 4 time-evolving synthetic networks demonstrate that SOSComm outperforms SOSClus and CEMNTR. With a fast convergence speed, SOSComm traces the evolution of multityped communities in the 4 synthetic networks accurately. In particular, on Syn4, where new objects arrive and old objects vanish, SOSComm still detects the evolution of multityped communities well, while the performances of SOSClus and CEMNTR deteriorate rapidly as time goes on. The NMIs of SOSComm on the 4 synthetic networks under different numbers of communities show that SOSComm is sensitive to this setting: the closer it is to the real number of multityped communities, the better SOSComm performs. Moreover, when the temporally regularized parameter ranges from 0.1 to 10, the performance of SOSComm is satisfactory.
6.2. Experiments on Real-World Dataset
6.2.1. Dataset Description
Here, we compare the performance of SOSComm with the baselines on a real-world dataset. The real-world dataset is a 25-year DBLP network sequence, which was collected by Tang et al. and is available online: http://www.leitang.net/heterogeneous_network.html. The 25-year DBLP dataset contains the papers published from 1980 to 2004, together with all related authors, terms (words contained in the papers' titles), and venues (the conferences or journals in which the papers were published). Low-frequency words and stop words have been removed. The 25-year DBLP network is segmented into 25 network snapshots according to the publication year of each paper. After that, we construct a 4-mode tensor for each network snapshot, where the 4 modes represent the papers, authors, venues, and terms, respectively. Table 2 shows the number of papers, authors, venues, terms, and gene-networks in each network snapshot of the 25-year DBLP dataset; each row of Table 2 thus gives the size of the corresponding tensor. For example, the tensor of one snapshot has size 69,021 × 105,292 × 1,238 × 9,153, with 1,182,458 nonzero elements. It is worth noting that the real-world dataset has no ground truth of community memberships, because labeling the massive number of objects in a real-world network, whether automatically or manually, is difficult and unrealistic.
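The snapshot construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a list of hypothetical (year, paper, author, venue, term) records, each distinct tuple becomes one nonzero entry with value 1, and object indices are assigned locally within each snapshot. How indices are aligned across snapshots is not specified in this section, so that part is left out.

```python
from collections import defaultdict


def build_snapshot_tensors(records):
    """Group records by year into sparse 4-mode tensors in coordinate form.

    Each tensor is returned as (shape, entries), where `entries` maps an
    index tuple (paper, author, venue, term) to 1, one nonzero per tuple.
    """
    by_year = defaultdict(list)
    for year, paper, author, venue, term in records:
        by_year[year].append((paper, author, venue, term))

    tensors = {}
    for year, tuples in sorted(by_year.items()):
        # One name-to-index map per mode, local to this snapshot (assumption).
        index_maps = [{} for _ in range(4)]
        entries = {}
        for tup in tuples:
            idx = tuple(
                index_maps[mode].setdefault(obj, len(index_maps[mode]))
                for mode, obj in enumerate(tup)
            )
            entries[idx] = 1
        shape = tuple(len(index_map) for index_map in index_maps)
        tensors[year] = (shape, entries)
    return tensors


# Hypothetical toy records, not taken from the DBLP dataset.
records = [
    (1980, "p1", "a1", "v1", "mining"),
    (1980, "p1", "a2", "v1", "mining"),
    (1981, "p2", "a1", "v1", "tensor"),
]
tensors = build_snapshot_tensors(records)
```

Storing only the nonzero coordinates matters at this scale: a dense tensor of the size quoted above would be far too large to materialize, whereas the coordinate form grows with the number of nonzero elements.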
6.2.2. Evaluation Metrics
Unlike on the synthetic networks, NMI cannot be adopted as the evaluation metric on the real-world dataset because no ground truth of community membership is available. In fact, evaluating the detection of community evolution is challenging. Alternatively, we extend modularity [43, 44], a widely used metric for measuring the quality of communities in a homogeneous network, to the high-order tensor space, so that the extended modularity is suitable for heterogeneous information networks. In a network, a high modularity reflects dense connections among vertices within a community and sparse connections among vertices across different communities.
Following [43, 44], modularity is defined as the fraction of the edges that fall within the given communities minus the expected fraction if the edges were placed at random with the degree of each vertex fixed. We directly give the calculation of modularity:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j), \tag{26}$$
where $m$ is the total number of edges in the whole network, $A_{ij}$ is an element of the adjacency matrix $A$, $k_i$ denotes the degree of vertex $v_i$, and $c_i$ is the community of $v_i$. The function $\delta(c_i, c_j)$ indicates whether the vertices $v_i$ and $v_j$ are in the same community or not. The value of $Q$ is at most 1 and can be negative. In practice, when the value of $Q$ ranges from 0.3 to 0.7, the quality of the communities is satisfactory.
Without loss of generality, we take the heterogeneous information network at the $t$-th timestamp as an example and omit the timestamp superscript in the following discussion. In our framework, each nonzero element of the tensor corresponds to a gene-network in the given heterogeneous information network, while the outer product of the $k$-th columns of the factor matrices, a rank-one tensor, indicates the distribution of the $k$-th multityped community over gene-networks. In other words, a gene-network is the minimum unit in our framework. Then, a new graph $G$ reflecting the connections among gene-networks is formed, in which each gene-network of the original heterogeneous information network is treated as a vertex. If two vertices $v_i$ and $v_j$ are connected by an edge in $G$, the gene-networks denoted by $v_i$ and $v_j$ share one or more objects in the original heterogeneous information network.
Accordingly, modularity can be used to evaluate the quality of communities in $G$. Since the vertices in $G$ are in one-to-one correspondence with the gene-networks in the original heterogeneous information network, the multityped communities, which consist of gene-networks, also partition $G$ into communities. Let $n$ denote the total number of vertices in $G$; that is, $n$ equals the number of gene-networks, or equivalently the number of nonzero elements of the tensor. The adjacency matrix of $G$ is $A \in \{0, 1\}^{n \times n}$, whose element $A_{ij}$ indicates whether $v_i$ is connected to $v_j$. Here, $A$ is a symmetric matrix with an all-zero diagonal; that is, $A_{ij} = A_{ji}$ and $A_{ii} = 0$.
Thereby, the total number of edges in $G$ is $m = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}$, and the degree of $v_i$ is $k_i = \sum_{j=1}^{n} A_{ij}$. According to (26), the extended modularity (also denoted by $Q$) can be calculated by
$$Q = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j).$$
If $v_i$ and $v_j$ are in the same multityped community, $\delta(c_i, c_j) = 1$. Otherwise, $\delta(c_i, c_j) = 0$.
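The construction above can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the evaluation code used in the experiments: each gene-network is a tuple with one object per mode, two gene-networks are linked when they agree in at least one mode, and each gene-network is assigned to exactly one community (for instance, the one with the largest weight, which is an assumption). The function names are hypothetical.

```python
from itertools import combinations


def gene_network_graph(gene_networks):
    """Adjacency sets of the graph whose vertices are gene-networks.

    Two gene-networks are linked if they share at least one object; objects
    are compared mode by mode, so equal names in different modes do not count.
    """
    n = len(gene_networks)
    adj = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if any(a == b for a, b in zip(gene_networks[i], gene_networks[j])):
            adj[i].add(j)
            adj[j].add(i)
    return adj


def modularity(adj, community):
    """Q = (1 / 2m) * sum_ij [A_ij - k_i * k_j / (2m)] * delta(c_i, c_j)."""
    degree = [len(neighbors) for neighbors in adj]
    two_m = sum(degree)  # every edge is counted once from each endpoint
    if two_m == 0:
        return 0.0
    q = 0.0
    n = len(adj)
    for i in range(n):
        for j in range(n):
            if community[i] != community[j]:
                continue  # delta is 0 across communities
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - degree[i] * degree[j] / two_m
    return q / two_m


# Hypothetical toy gene-networks: (paper, author, venue, term).
toy = [
    ("p1", "a1", "v1", "t1"),
    ("p1", "a2", "v1", "t2"),
    ("p2", "a3", "v2", "t3"),
    ("p2", "a4", "v2", "t4"),
]
toy_adj = gene_network_graph(toy)
toy_q = modularity(toy_adj, [0, 0, 1, 1])
```

The pairwise loops are quadratic in the number of gene-networks, which is acceptable for a toy example only; for snapshots with more than a million nonzero elements, one would group gene-networks by shared object and accumulate the degree term per community instead.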
6.2.3. Experimental Results
First, the baselines and SOSComm are deployed in offline mode to learn their best performance on multityped community discovery. That is, the baselines and SOSComm are iterated on each network snapshot until they converge. In the offline mode, we use the same comparative methods and experimental settings as in the experiments on synthetic networks, including the convergence threshold on the change of the corresponding objective function and the maximum number of iterations. We set the temporally regularized parameter for SOSComm and CEMNTR.
To find a suitable number of multityped communities, we run SOSComm on the 25-year DBLP network with different numbers of communities. Figure 7 gives the average modularity over the 25 timestamps as this number varies. Although the average modularity is almost equal for several of the tested settings, the maximum is obtained when the number of communities is 20. Therefore, in the following experiments, the number of multityped communities in the 25-year DBLP network is fixed to 20.
The comparison of modularity for the baselines and SOSComm in offline mode is shown in Figure 8. SOSComm achieves the highest modularity on each network snapshot. As time evolves, SOSComm traces the evolution of multityped communities more and more closely, while the modularity of SOSClus stays low throughout and the modularity of CEMNTR declines steadily.
Second, we study the performance of SOSComm in online mode, in which the maximum number of iterations is limited to 5. The comparison of modularity for the baselines and SOSComm in online mode is shown in Figure 9. Although the modularity of SOSComm is lower than in offline mode, it is still the best among the three methods.
In addition, Figure 10 compares the modularity of SOSComm in offline mode and online mode. The performance of SOSComm in online mode is close to that in offline mode: before 2000, the two curves almost overlap, and with the explosive growth of the tensors in the last 5 years, the modularity in online mode is only slightly lower than that in offline mode. Table 3 summarizes the running time of the baselines and SOSComm in online mode. As shown in Table 3, CEMNTR and SOSComm have a clear advantage in running time, and SOSComm is the fastest most of the time.
To summarize, the experiments on the 25-year DBLP dataset show that SOSComm outperforms SOSClus and CEMNTR. With a larger modularity, SOSComm can detect the multityped communities and trace their evolution in the 25-year DBLP network. In particular, the experimental results in online mode demonstrate that SOSComm performs best on both modularity and running time. That is, SOSComm is well suited to online deployment.
In this paper, a novel online framework is proposed for multityped community discovery in time-evolving heterogeneous information networks without any restriction on the network schema. Each snapshot of the network sequence is expressed as a tensor, and each multityped community is modeled as a rank-one tensor. Then, the problem of multityped community discovery is formalized as a tensor decomposition, which integrates the tensor CP factorization with a temporal evolution regularization term. In addition, a second-order stochastic gradient descent algorithm, named SOSComm, is designed to solve the tensor decomposition. In this framework, the community membership matrices of all types of objects, the multityped communities, and their evolution over time can be obtained simultaneously. In both offline and online modes, the proposed algorithm outperforms the other state-of-the-art methods.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This study was supported by the National Science Foundation of China (no. 61401482 and no. 61401483).
- D. Cai, Z. Shao, X. He, X. Yan, and J. Han, “Mining hidden community in heterogeneous social network,” in Proceedings of the 3rd International Workshop on Link Discovery (LinkKDD '05), pp. 58–65, USA.
- H. Ma, H. Yang, M. R. Lyu, and I. King, “Mining social networks using heat diffusion processes for marketing candidates selection,” in Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08), pp. 233–242, ACM, Napa Valley, CA, USA, October 2008.
- A. H. Doan, J. Madhavan, P. Domingos, and A. Halevy, “Ontology matching: A machine learning approach,” in International Handbooks on Information Systems, pp. 397–416, 2003.
- F. Tao, G. Brova, J. Han et al., “NewsNetExplorer: Automatic construction and exploration of news information networks,” in Proceedings of the International Conference on Management of Data (SIGMOD '14), ACM, USA, 2014.
- M. Gomez Rodriguez and L. Song, “Diffusion in social and information networks: Research problems, probabilistic models & machine learning methods,” in Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '15), pp. 2315-2316, Australia, 2015.
- X. Yu, X. Ren, Y. Sun et al., “Recommendation in heterogeneous information networks with implicit user feedback,” in Proceedings of the 7th ACM Conference on Recommender Systems (RecSys '13), pp. 347–350, China, October 2013.
- X. Yu, X. Ren, Y. Sun et al., “Personalized entity recommendation: A heterogeneous information network approach,” in Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pp. 283–292, New York, NY, USA, February 2014.
- X. Wang, L. Tang, H. Liu, and L. Wang, “Learning with multi-resolution overlapping communities,” Knowledge and Information Systems, vol. 36, no. 2, pp. 517–535, 2013.
- L. Tang, X. Wang, and H. Liu, “Community detection via heterogeneous interaction analysis,” Data Mining and Knowledge Discovery, vol. 25, no. 1, pp. 1–33, 2012.
- R. Jin, C. Kou, and R. Liu, “Improving community detection in time-evolving networks through clustering fusion,” Cybernetics and Information Technologies, vol. 15, no. 2, pp. 63–74, 2015.
- C. C. Aggarwal and P. S. Yu, “Online analysis of community evolution in data streams,” in Proceedings of the SIAM International Conference on Data Mining (SDM '05), 2005.
- M. Revelle, C. Domeniconi, M. Sweeney, and A. Johri, “Finding community topics and membership in graphs,” ECML PKDD, pp. 625–640, 2015.
- Y. Sun, Y. Yu, and J. Han, “Ranking-based clustering of heterogeneous information networks with star network schema,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), pp. 797–806, Paris, France, July 2009.
- Y. Sun, J. Tang, J. Han, M. Gupta, and B. Zhao, “Community evolution detection in dynamic heterogeneous information networks,” in Proceedings of the 8th Workshop on Mining and Learning with Graphs (MLG '10), pp. 137–146, July 2010.
- Y. Sun, J. Tang, J. Han, C. Chen, and M. Gupta, “Co-evolution of multi-typed objects in dynamic star networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 12, pp. 2942–2955, 2014.
- P. W. Holland and K. B. Laskey, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
- K. Nowicki, “Estimation and prediction for stochastic blockstructures,” Journal of the American Statistical Association, vol. 96, no. 455, pp. 1077–1087, 2001.
- E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, “Mixed membership stochastic blockmodels,” Journal of Machine Learning Research, vol. 9, no. 5, pp. 1981–2014, 2008.
- J. Sun, S. Papadimitriou, P. S. Yu, and C. Faloutsos, “Community evolution and change point detection in time-evolving graphs,” in Link Mining: Models, Algorithms, and Applications, pp. 73–104, Springer, New York, NY, USA, 2010.
- Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng, “Blog community discovery and evolution based on mutual awareness expansion,” in Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI '07), pp. 48–56, USA, November 2007.
- G. Palla, A. Barabási, and T. Vicsek, “Quantifying social group evolution,” Nature, vol. 446, no. 7136, pp. 664–667, 2007.
- A. Cuzzocrea, F. Folino, and C. Pizzuti, “Dynamicnet: An effective and efficient algorithm for supporting community evolution detection in time-evolving information networks,” in Proceedings of the 17th International Database Engineering and Applications Symposium (IDEAS '13), pp. 148–153, ACM, New York, NY, USA, 2013.
- Y. Lin, Y. Chi, S. Zhu, H. Sundaram, and B. L. Tseng, “Analyzing communities and their evolutions in dynamic social networks,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 3, no. 2, pp. 1–31, 2009.
- C. Tantipathananandh, T. Berger-Wolf, and D. Kempe, “A framework for community identification in dynamic social networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pp. 717–726, USA, August 2007.
- F. Folino and C. Pizzuti, “An evolutionary multiobjective approach for community discovery in dynamic networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 8, pp. 1838–1852, 2014.
- L. Tang, H. Liu, J. Zhang, and Z. Nazeri, “Community evolution in dynamic multi-mode networks,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 677–685, USA, August 2008.
- L. Tang, H. Liu, and J. Zhang, “Identifying evolving groups in dynamic multimode networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, pp. 72–85, 2012.
- Y.-R. Lin, J. Sun, P. Castro, R. Konuru, H. Sundaram, and A. Kelliher, “MetaFac: Community discovery via relational hypergraph factorization,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), pp. 527–535, France, July 2009.
- Y.-R. Lin, J. Sun, H. Sundaram, A. Kelliher, P. Castro, and R. Konuru, “Community discovery via MetaGraph Factorization,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 5, no. 3, article 17, 2011.
- A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade, “A tensor spectral approach to learning mixed membership community models,” in JMLR: Workshop and Conference Proceedings, 2013.
- J. Wu, Y. Wu, S. Deng, and H. Huang, “Multi-way clustering for heterogeneous information networks with general network schema,” in Proceedings of the 16th IEEE International Conference on Computer and Information Technology (CIT '16), pp. 339–346, December 2016.
- J. Wu, Q. Meng, S. Deng, H. Huang, Y. Wu, and A. Badii, “Generic, network schema agnostic sparse tensor factorization for single-pass clustering of heterogeneous information networks,” PLoS ONE, vol. 12, no. 2, Article ID e0172323, 2017.
- J. Wu, Z. Wang, Y. Wu, L. Liu, S. Deng, and H. Huang, “A Tensor CP decomposition method for clustering heterogeneous information networks via stochastic gradient descent algorithms,” Scientific Programming, vol. 2017, Article ID 2803091, 13 pages, 2017.
- J. Sun, D. Tao, S. Papadimitriou, P. S. Yu, and C. Faloutsos, “Incremental tensor analysis: theory and applications,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 2, no. 3, article 11, 2008.
- M. Zhang and C. Ding, “Robust tucker tensor decomposition for effective image representation,” in Proceedings of the 14th IEEE International Conference on Computer Vision (ICCV '13), pp. 2448–2455, Australia, December 2013.
- X. Cao, X. Wei, Y. Han, and D. Lin, “Robust face clustering via tensor decomposition,” IEEE Transactions on Cybernetics, vol. 45, no. 11, pp. 2546–2557, 2015.
- E. Davidson and M. Levine, “Gene regulatory networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 14, p. 4935, 2005.
- P. Paatero, “Construction and analysis of degenerate PARAFAC models,” Journal of Chemometrics, vol. 14, no. 3, pp. 285–299, 2000.
- E. Acar, D. M. Dunlavy, and T. G. Kolda, “A scalable optimization approach for fitting canonical tensor decompositions,” Journal of Chemometrics, vol. 25, no. 2, pp. 67–86, 2011.
- T. G. Kolda, “Multilinear operators for higher-order decompositions,” Tech. Rep. SAND2006-2081, 2006.
- L. Bottou and N. Murata, “Stochastic approximations and efficient learning,” in The Handbook of Brain Theory and Neural Networks, 2nd edition, 2002.
- A. Strehl and J. Ghosh, “Cluster ensembles—a knowledge reuse framework for combining multiple partitions,” Journal of Machine Learning Research, vol. 3, no. 3, pp. 583–617, 2003.
- M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.
- M. E. J. Newman, “Modularity and community structure in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.
Copyright © 2018 Jibing Wu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.