Abstract
Heterogeneous information networks, which consist of multiple types of objects connected by semantically rich links, are ubiquitous in real-world applications. Community discovery is an effective method for extracting the hidden structures in networks. Heterogeneous information networks are usually time-evolving: their objects and links are dynamic and vary gradually. Community discovery in such time-evolving heterogeneous information networks is considerably more difficult than in traditional static homogeneous information networks. In contrast to communities in traditional approaches, which contain only one type of objects and links, communities in heterogeneous information networks contain multiple types of dynamic objects and links. Some recent studies focus on dynamic heterogeneous information networks and achieve satisfactory results. However, they assume that heterogeneous information networks follow simple schemas, such as the bityped network schema and the star network schema. In this paper, we propose a multityped community discovery method for time-evolving heterogeneous information networks with general network schemas. A tensor decomposition framework, which integrates tensor CP factorization with a temporal evolution regularization term, is designed to model the multityped communities and address their evolution. Experimental results on both synthetic and real-world datasets demonstrate the efficiency of our framework.
1. Introduction
Most artificial online systems, such as World Wide Web, social networks, and collaboration networks, can be represented as information networks, which describe the interactions and relationships between numerous objects, for example, hyperlinks between web pages, friendships between users, and coauthorships between researchers. The information network analysis is attracting an increasing number of researchers from a variety of fields, such as social science [1, 2], machine learning [3–5], and recommendation systems [6, 7]. Community discovery is one of the most significant focuses in information network analysis, which aims to discover interpretable hidden structures, patterns of interactions among objects, and their evolution along with time in such network. Although community detection in networks has been studied for many years, most existing approaches are designed to analyze static information network [1, 8, 9] and homogeneous information network [10–12]. That is, there is only one type of objects and links contained in the network, and the objects and links are not timevarying.
However, in real-world scenarios, information networks are typically heterogeneous and time-evolving. In contrast to communities in traditional approaches, which contain only one type of static objects and links, communities in time-evolving heterogeneous information networks contain multiple types of dynamic objects and links. For example, the DBLP network, an open resource covering most bibliographic information on computer science, is a typical time-evolving heterogeneous information network. The DBLP network contains four types of objects: author, paper, venue (i.e., conference or journal), and term. The links between different object types represent different semantic relationships, such as “an author wrote a paper” and “a paper was published in a conference.” The most intriguing communities in DBLP are research areas, which contain the authors with similar research interests, the papers they wrote, the conferences they attended, and the terms they used. As new authors and new hot topics are added, the structures of the communities change gradually.
Although the traditional community discovery methods can be applied to timeevolving heterogeneous information network by converting such network into a set of homogeneous information networks and aggregating the timeevolving objects and links along with all timestamps into one snapshot, the rich semantic relationships among different object types and the dynamic property of the communities are lost. In recent years, community discovery in timeevolving heterogeneous information networks has emerged as an outstanding challenge and attracted the attention of many researchers. For instance, Sun et al. used netclusters [13] to describe the communities and proposed a Dirichlet Process Mixture Model based algorithm named EvoNetClus [14, 15] to detect the communities in heterogeneous information networks with star network schema. In the star network schema, the links only appear between target objects and attribute objects.
In this paper, we focus on community discovery in time-evolving heterogeneous information networks with general network schemas, which presents several challenges:
(i) Heterogeneity: the communities in heterogeneous information networks are themselves heterogeneous and contain multityped objects and links.
(ii) Time-varying: the communities are constantly changing, with new objects coming and old objects vanishing. We assume that the evolution of communities between two adjacent snapshots should be smooth.
(iii) Suitability for general network schemas: the network schema of a heterogeneous information network is often more complex than the star network schema. The community discovery method should be able to handle general network schemas.
(iv) Online mode: although some offline frameworks can produce a global view of community evolution over time by capturing all historical information, an online framework is more realistic.
To overcome the aforementioned challenges, we propose a tensor decomposition framework for modeling the multityped communities and addressing their evolution in time-evolving heterogeneous information networks with general network schemas. Essentially, a time-evolving heterogeneous information network consists of a sequence of network snapshots. We model the time-evolving heterogeneous information network as a sequence of multiway arrays, that is, tensors. A tensor is an effective and faithful representation of high-mode data, which can naturally express the complex structures and interactions in heterogeneous information networks. By integrating the tensor CP factorization with a temporal evolution regularization term, the multityped communities and their evolution over time can be formalized as a tensor decomposition problem. A second-order stochastic gradient descent algorithm is presented to solve the problem, and the experimental results on both synthetic and real-world datasets demonstrate the efficiency of our framework.
The rest of this paper is organized as follows. In Section 2, we discuss the related work on community discovery in timeevolving heterogeneous information networks. Section 3 formalizes the problem as tensor decomposition, which integrates tensor CP factorization with a temporal evolution regularization term. A secondorder stochastic gradient descent algorithm is presented in Section 4. Section 5 discusses some implementation issues, including dead and new objects, online deployment, and time complexity analysis. The experimental results on both synthetic and realworld datasets are presented in Section 6. Finally, the conclusions are drawn in Section 7.
2. Related Work
Community discovery is a fundamental technique of information network analysis. Many creative methods for discovering communities in static and homogeneous networks have been developed in the past decades. The stochastic block model [16, 17] and the mixed membership model [18] are powerful probabilistic community discovery models for analyzing static networks. These two models, however, cannot handle time-evolving networks and cannot be directly used for heterogeneous information networks.
Tracking the evolution of communities [11, 19] takes the dynamic properties in timeevolving networks into consideration. A commonly used framework [20–22] is to apply the static community detection algorithms for each snapshot of the timeevolving networks and then generate the evolution of communities by computing the match between two adjacent snapshots. Another attempt to track community evolution in timeevolving networks is multiobjective optimization model [23–25], which integrates the measurement of community quality and temporal smoothness into a multiobjective cost function. Nevertheless, these methods are designed for homogeneous networks.
Recently, the community discovery in heterogeneous information networks has become a hot topic. Tang et al. introduced the community evolution in multimode network and proposed a framework which partitioned the multimode network into a set of bityped networks [26, 27]. Sun et al. used netclusters [13] to describe the communities and proposed EvoNetClus [14, 15] to detect the communities automatically. However, the netclusters and EvoNetClus are only suitable for star network schema, where the links only appear between target objects and attribute objects.
To analyze the heterogeneous information networks with general network schemas, tensor factorization offers a promising way for extracting hidden communities in such networks. Tensor is an effective expression of complicated and interpretable structures among different dimensions in heterogeneous information network. For instance, Lin et al. proposed MetaGraph Factorization [28, 29] to detect the communities from dynamic social networks. In addition, a tensor factorization based mixed membership framework [30] simulates the generation of communities as Dirichlet distribution, which can identify the communities automatically. However, this method needs to partition the heterogeneous network into four parts artificially and organize them as a 3star network. Meanwhile, the 3star count tensor must be converted to an orthogonal symmetric tensor. Thus the capability of this method to deal with timeevolving heterogeneous information networks could be degraded.
Our prior works in [31–33] have also focused on clustering heterogeneous information networks based on tensor decomposition, which can cluster multityped objects simultaneously in heterogeneous information networks. However, these methods treat the heterogeneous information networks as static networks and integrate the timeevolving networks into one snapshot, which lose the dynamic properties among multityped objects and links.
Another line of research related to our work is incremental tensor factorization [34]. Though tensor factorization has been widely studied in many domains, such as image processing [35] and computer vision [36], incremental tensor factorization is still a challenging task [34]. Sun et al. proposed a general framework of incremental tensor analysis [34] for mining higher-order data streams, which included three methods: dynamic tensor analysis, streaming tensor analysis, and window-based tensor analysis. Even though higher-order data streams can be effectively analyzed in such a framework, the smooth evolution of latent patterns cannot be guaranteed.
3. Problem Formulation
Following the works by Sun et al. in [15] and our prior work [33], we first introduce some definitions of heterogeneous information networks and tensor construction from a given heterogeneous information network.
A heterogeneous information network [15] is a graph consisting of more than one type of objects or links . Assume that belongs to object types , and belongs to link types . That is, in a heterogeneous information network, or . Otherwise, the network becomes a homogeneous information network.
The indicates the set of objects from the th type. We denote an arbitrary object in as , for ; , where is the number of objects in type ; that is, . Thus, the total number of objects in the heterogeneous information network is given by .
The network schema [15] for a given heterogeneous information network is a metatemplate that indicates the formation of object types and link types in the network. The network schema is denoted by . In other words, is an instance of . For example, the star network schema shown in Figure 1 is a typical network schema, in which four types of objects are contained, that is, author, paper, venue, and term. In Figure 1, paper is target object, and the others are attribute objects. The feature of star network schema is that the links in the network only appear between target object and attribute objects.
A genenetwork [33], denoted by , is a minimum instance of in the set of subnetworks of . It is noteworthy that a genenetwork is an integrated semantic relation in the heterogeneous information network, which is quite different from gene regulatory network in Bioinformatics [37]. For example, a genenetwork in DBLP network, denoted by , represents an integrated semantic relation; that is, “an author writes a paper , which contains the term and is published in the venue .” For simplicity, we can mark the genenetwork by the subscripts of objects in , that is, .
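Following our prior work [33], an $N$th-order tensor $\mathcal{X} \in \{0,1\}^{I_1 \times I_2 \times \cdots \times I_N}$ can be constructed according to the distribution of gene-networks, where $N$ is the number of object types, $I_n$ is the number of objects in type $n$, and each mode of $\mathcal{X}$ represents one type of objects in the network. An arbitrary element $x_{i_1 i_2 \cdots i_N}$ is an indicator of whether the corresponding gene-network exists:
$$x_{i_1 i_2 \cdots i_N} = \begin{cases} 1, & \text{if the corresponding gene-network exists}, \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$
where $i_n$, for $n = 1, 2, \ldots, N$, is the index of an object in type $n$.
The time-evolving heterogeneous information networks can be segmented into a network sequence according to a series of snapshots. The heterogeneous information network associated with timestamp $t$ can be denoted as $G_t$; then the network sequence is $\{G_1, G_2, \ldots, G_T\}$. Thereby, the tensor representation of the network sequence is $\{\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_T\}$. Actually, $\mathcal{X}_t$ is the hyperadjacency tensor of the given heterogeneous information network at the $t$th timestamp, which indicates the distribution of gene-networks.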
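The tensor construction described above can be illustrated with a minimal Python sketch. It assumes each gene-network is given as a tuple of object indices, one per object type; the function names, the mode order (author, paper, venue, term), and the toy data are hypothetical and not taken from the paper. Only the coordinates of existing gene-networks are kept, since the tensors are very sparse.

```python
import numpy as np

def gene_networks_to_coords(gene_networks):
    """Return the sorted, de-duplicated coordinates of the nonzero tensor entries."""
    coords = np.asarray(sorted(set(map(tuple, gene_networks))), dtype=int)
    return coords

def coords_to_dense(coords, shape):
    """Materialize the 0/1 indicator tensor (only sensible for toy sizes)."""
    X = np.zeros(shape, dtype=np.int8)
    X[tuple(coords.T)] = 1  # one entry per existing gene-network
    return X

# Toy snapshot; the mode order (author, paper, venue, term) is an assumption.
gene_networks = [(0, 0, 0, 1), (1, 0, 0, 1), (2, 1, 1, 0), (0, 0, 0, 1)]
coords = gene_networks_to_coords(gene_networks)
X = coords_to_dense(coords, shape=(3, 2, 2, 2))
```

A sequence of snapshots is then simply a list of such coordinate arrays, one per timestamp.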
The community in heterogeneous information network is called multityped community, which is more complex than that in homogeneous information network. A multityped community is a set of genenetworks that share the same features and connect together. In other words, a multityped community contains all associated types of objects and links. As shown in Figure 2, the multityped communities about research areas in DBLP network consist of the authors with similar research interests, the papers they wrote, the conferences they attended, and the terms they used. In each multityped community, the authors, papers, venues, and terms are connected to each other and organized as genenetworks. In fact, the objects may belong to several multityped communities since some genenetworks coming from different multityped communities may share the same objects. For example, a famous scientist can cooperate with other researchers within different areas by publishing many interdisciplinary papers; that is, the famous scientist will be contained in many genenetworks across different multityped communities.
The problem of multityped community discovery from such a network sequence can be decomposed into two subproblems: (A) detect the multityped communities in each network snapshot, and (B) model the evolution of multityped communities over time.
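(A) Multityped Community Discovery in Each Network Snapshot. Without loss of generality, we take the $t$th network snapshot as an example, whose tensor representation is $\mathcal{X}_t \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, where $N$ is the number of object types and $I_n$ is the number of objects in type $n$. Let $\mathcal{C}_{t,1}, \ldots, \mathcal{C}_{t,K}$ denote the $K$ hidden multityped communities in the network and let $a_{t,i_n k}^{(n)}$ represent the probability that the $i_n$th object in type $n$ belongs to the $k$th community at the $t$th timestamp. Denote
$$\mathbf{a}_{t,k}^{(n)} = \left[a_{t,1k}^{(n)}, a_{t,2k}^{(n)}, \ldots, a_{t,I_n k}^{(n)}\right]^{\top}. \tag{2}$$
Following our prior work [33], a multityped community can be represented as
$$\mathcal{C}_{t,k} = \mathbf{a}_{t,k}^{(1)} \circ \mathbf{a}_{t,k}^{(2)} \circ \cdots \circ \mathbf{a}_{t,k}^{(N)}, \tag{3}$$
where $\circ$ is the outer product of two vectors. Actually, the multityped community $\mathcal{C}_{t,k}$ is a rank-one tensor with the same size as $\mathcal{X}_t$. Equation (3) indicates the gene-networks and the probability of the associated objects belonging to the $k$th community. Thereby, we can approximate $\mathcal{X}_t$ through a sum of $K$ rank-one tensors; that is,
$$\mathcal{X}_t \approx \sum_{k=1}^{K} \mathbf{a}_{t,k}^{(1)} \circ \mathbf{a}_{t,k}^{(2)} \circ \cdots \circ \mathbf{a}_{t,k}^{(N)}. \tag{4}$$
Obviously, (4) is a tensor CP factorization. Let the factor matrix $\mathbf{A}_t^{(n)} = \left[\mathbf{a}_{t,1}^{(n)}, \ldots, \mathbf{a}_{t,K}^{(n)}\right] \in \mathbb{R}^{I_n \times K}$ be the latent community membership matrix for the $n$th type of objects at timestamp $t$, where $n = 1, 2, \ldots, N$. We denote
$$[\![\mathbf{A}_t^{(1)}, \mathbf{A}_t^{(2)}, \ldots, \mathbf{A}_t^{(N)}]\!] = \sum_{k=1}^{K} \mathbf{a}_{t,k}^{(1)} \circ \mathbf{a}_{t,k}^{(2)} \circ \cdots \circ \mathbf{a}_{t,k}^{(N)}. \tag{5}$$
By minimizing the Frobenius norm of the difference between $\mathcal{X}_t$ and its CP approximation, the multityped community discovery in each network snapshot can be formulated as an optimization problem:
$$\min_{\mathbf{A}_t^{(1)}, \ldots, \mathbf{A}_t^{(N)}} \frac{1}{2}\left\| \mathcal{X}_t - [\![\mathbf{A}_t^{(1)}, \ldots, \mathbf{A}_t^{(N)}]\!] \right\|_F^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} a_{t,i_n k}^{(n)} = 1, \quad a_{t,i_n k}^{(n)} \in [0,1], \quad \sum_{i_n=1}^{I_n} a_{t,i_n k}^{(n)} > 0, \tag{6}$$
where $n = 1, \ldots, N$; $i_n = 1, \ldots, I_n$; and $k = 1, \ldots, K$. The first and second constraints in (6) guarantee that $a_{t,i_n k}^{(n)}$ is a probability. The last constraint in (6) ensures that each multityped community consists of all associated types of objects.
(B) Multityped Community Evolution over Time. Equation (6) performs the multityped community discovery at each timestamp independently and does not consider the smooth evolution between two adjacent snapshots. We denote the objective function in (6) as $f_1$; that is,
$$f_1 = \frac{1}{2}\left\| \mathcal{X}_t - [\![\mathbf{A}_t^{(1)}, \ldots, \mathbf{A}_t^{(N)}]\!] \right\|_F^2. \tag{7}$$
In order to ensure that the evolution of the multityped communities is smooth, a temporal evolution regularization term is introduced:
$$f_2 = \frac{\lambda}{2} \sum_{n=1}^{N} \left\| \mathbf{A}_t^{(n)} - \mathbf{A}_{t-1}^{(n)} \right\|_F^2, \tag{8}$$
where $\lambda$ is the temporally regularized parameter. Indeed, $f_2$ is a first-order Markov assumption, which forces the multityped communities at the current timestamp to resemble those at the previous snapshot.
Denote the objective function as
$$f = f_1 + f_2. \tag{9}$$
Therefore, the problem of multityped community discovery in time-evolving heterogeneous information networks can be formulated as
$$\min_{\mathbf{A}_t^{(1)}, \ldots, \mathbf{A}_t^{(N)}} f \quad \text{s.t.} \quad \sum_{k=1}^{K} a_{t,i_n k}^{(n)} = 1, \quad a_{t,i_n k}^{(n)} \in [0,1], \quad \sum_{i_n=1}^{I_n} a_{t,i_n k}^{(n)} > 0. \tag{10}$$
Here, $\mathbf{A}_{t-1}^{(n)}$ are constants at the current timestamp $t$, which are solved at the previous timestamp. When $t = 1$, we have no a priori knowledge about the multityped communities. We set $\mathbf{A}_0^{(n)} = \mathbf{0}$, for $n = 1, \ldots, N$. Thus, $f_2$ becomes
$$f_2 = \frac{\lambda}{2} \sum_{n=1}^{N} \left\| \mathbf{A}_1^{(n)} \right\|_F^2. \tag{11}$$
It is worth noting that (11) is also a Tikhonov regularization term [38], which ensures the sparsity of the factor matrices and makes the optimization solution easy to find. Moreover, when $\lambda = 0$, problem (10) degrades into the same form as we proposed in [33]. That is, the work in [33] is the special case for static networks.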
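To make the formulation concrete, the following Python sketch evaluates a CP reconstruction and a temporally regularized objective for one snapshot. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the tensor is dense (toy scale only), and the 1/2 scaling of both terms is an assumption.

```python
import numpy as np

def cp_reconstruct(factors):
    """Sum of K rank-one tensors built from matching columns of the factor matrices."""
    K = factors[0].shape[1]
    X_hat = np.zeros(tuple(A.shape[0] for A in factors))
    for k in range(K):
        component = factors[0][:, k]
        for A in factors[1:]:
            # outer product with the k-th column of the next factor matrix
            component = np.multiply.outer(component, A[:, k])
        X_hat += component
    return X_hat

def objective(X, factors, prev_factors, lam):
    """Fit term plus temporal smoothness term (1/2 scaling is an assumption)."""
    fit = 0.5 * np.sum((X - cp_reconstruct(factors)) ** 2)
    smooth = 0.5 * lam * sum(
        np.sum((A - P) ** 2) for A, P in zip(factors, prev_factors)
    )
    return fit + smooth
```

Passing all-zero matrices as `prev_factors` mimics the first timestamp, where the smoothness term reduces to a Tikhonov-style penalty on the factor matrices.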
4. Algorithm
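The stochastic gradient descent algorithm is an efficient tool for optimizing tensor factorization [33, 39]. However, the first-order stochastic gradient descent algorithm converges slowly near the optimal point. It has been shown that the second-order stochastic algorithm has not only a faster convergence speed but also better robustness with respect to the learning rate [33]. SOSClus, proposed in [33], is a second-order stochastic algorithm which has been well studied for the case of $\lambda = 0$ in (10), that is, static heterogeneous information networks. Here, we present a second-order stochastic gradient descent algorithm, named SOSComm, for the time-evolving case, which is an extension of SOSClus. In this section, some multilinear operators and tensor algebra for tensor factorization will be used, which can be found in [40].
At timestamp $t$, the snapshot $\mathcal{X}_t$ of the current heterogeneous information network and the previous community membership matrices $\mathbf{A}_{t-1}^{(n)}$ are known. To compute the factor matrix $\mathbf{A}_t^{(n)}$, we can rewrite $f$ in (10) by matricization of $\mathcal{X}_t$ along the $n$th mode. According to (7), (8), and (9), we have
$$f = \frac{1}{2}\left\| \mathbf{X}_{t(n)} - \mathbf{A}_t^{(n)} \left(\mathbf{\Phi}_t^{(n)}\right)^{\top} \right\|_F^2 + \frac{\lambda}{2} \sum_{m=1}^{N} \left\| \mathbf{A}_t^{(m)} - \mathbf{A}_{t-1}^{(m)} \right\|_F^2, \tag{12}$$
where
$$\mathbf{\Phi}_t^{(n)} = \mathbf{A}_t^{(N)} \odot \cdots \odot \mathbf{A}_t^{(n+1)} \odot \mathbf{A}_t^{(n-1)} \odot \cdots \odot \mathbf{A}_t^{(1)}. \tag{13}$$
Here $\mathbf{X}_{t(n)}$ is the matricization of $\mathcal{X}_t$ along the $n$th mode, and the symbol $\odot$ indicates the Khatri-Rao product of two matrices. Given two matrices $\mathbf{A} \in \mathbb{R}^{I \times K}$ and $\mathbf{B} \in \mathbb{R}^{J \times K}$, their Khatri-Rao product is a matrix of size $IJ \times K$ defined by
$$\mathbf{A} \odot \mathbf{B} = \left[\mathbf{a}_1 \otimes \mathbf{b}_1, \mathbf{a}_2 \otimes \mathbf{b}_2, \ldots, \mathbf{a}_K \otimes \mathbf{b}_K\right], \tag{14}$$
where $\mathbf{a}_k \otimes \mathbf{b}_k = \left[a_{1k}\mathbf{b}_k^{\top}, a_{2k}\mathbf{b}_k^{\top}, \ldots, a_{Ik}\mathbf{b}_k^{\top}\right]^{\top}$, $a_{ik}$ is an element of $\mathbf{A}$, and $\mathbf{b}_k$, $k = 1, \ldots, K$, is a column of $\mathbf{B}$. In particular, we denote the Khatri-Rao product of a series of matrices except $\mathbf{A}_t^{(n)}$ as
$$\bigodot_{m \neq n} \mathbf{A}_t^{(m)} = \mathbf{A}_t^{(N)} \odot \cdots \odot \mathbf{A}_t^{(n+1)} \odot \mathbf{A}_t^{(n-1)} \odot \cdots \odot \mathbf{A}_t^{(1)}. \tag{15}$$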
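As a small illustration of the Khatri-Rao product defined above, the following Python sketch computes it as a column-wise Kronecker product with NumPy. The function name and the example matrices are hypothetical. The last line checks a standard identity: the Gram matrix of a Khatri-Rao product equals the Hadamard product of the individual Gram matrices, which is what allows large Khatri-Rao products to be replaced by small matrix products.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x K) and B (J x K), giving (I*J x K)."""
    I, K = A.shape
    J, K_b = B.shape
    if K != K_b:
        raise ValueError("A and B must have the same number of columns")
    # Row i*J + j of the result holds A[i, k] * B[j, k] for every column k.
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, K)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
P = khatri_rao(A, B)

# Gram identity: (A kr B)^T (A kr B) == (A^T A) * (B^T B), elementwise product.
gram_matches = np.allclose(P.T @ P, (A.T @ A) * (B.T @ B))
```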
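Since the partial derivative of $f_1$ with respect to $\mathbf{A}_t^{(n)}$ has been given in [33], we introduce the result directly:
$$\frac{\partial f_1}{\partial \mathbf{A}_t^{(n)}} = -\mathbf{X}_{t(n)} \mathbf{\Phi}_t^{(n)} + \mathbf{A}_t^{(n)} \mathbf{\Gamma}_t^{(n)}, \tag{16}$$
where
$$\mathbf{\Gamma}_t^{(n)} = \left(\mathbf{\Phi}_t^{(n)}\right)^{\top} \mathbf{\Phi}_t^{(n)} = \left(\mathbf{A}_t^{(1)\top} \mathbf{A}_t^{(1)}\right) \ast \cdots \ast \left(\mathbf{A}_t^{(n-1)\top} \mathbf{A}_t^{(n-1)}\right) \ast \left(\mathbf{A}_t^{(n+1)\top} \mathbf{A}_t^{(n+1)}\right) \ast \cdots \ast \left(\mathbf{A}_t^{(N)\top} \mathbf{A}_t^{(N)}\right), \tag{17}$$
$\mathbf{\Phi}_t^{(n)}$ is the Khatri-Rao product of all factor matrices except $\mathbf{A}_t^{(n)}$, $\mathbf{X}_{t(n)}$ is the mode-$n$ matricization of $\mathcal{X}_t$, and the symbol $\ast$ is the Hadamard product, also named the elementwise product of two matrices with the same dimensions.
The partial derivative of $f_2$ with respect to $\mathbf{A}_t^{(n)}$ is
$$\frac{\partial f_2}{\partial \mathbf{A}_t^{(n)}} = \lambda \left(\mathbf{A}_t^{(n)} - \mathbf{A}_{t-1}^{(n)}\right). \tag{18}$$
Therefore, the partial derivative of $f$ with respect to $\mathbf{A}_t^{(n)}$ is given by
$$\frac{\partial f}{\partial \mathbf{A}_t^{(n)}} = -\mathbf{X}_{t(n)} \mathbf{\Phi}_t^{(n)} - \lambda \mathbf{A}_{t-1}^{(n)} + \mathbf{A}_t^{(n)} \left(\mathbf{\Gamma}_t^{(n)} + \lambda \mathbf{I}\right), \tag{19}$$
where $\mathbf{I}$ is a unit matrix. The second-order partial derivative of $f$ with respect to $\mathbf{A}_t^{(n)}$ can be obtained as
$$\frac{\partial^2 f}{\partial \left(\mathbf{A}_t^{(n)}\right)^2} = \mathbf{\Gamma}_t^{(n)} + \lambda \mathbf{I}. \tag{20}$$
Recalling the update rule of the second-order stochastic algorithm [33, 41], we have
$$\mathbf{A}_t^{(n)} \leftarrow \mathbf{A}_t^{(n)} - \eta \, \frac{\partial f}{\partial \mathbf{A}_t^{(n)}} \left(\frac{\partial^2 f}{\partial \left(\mathbf{A}_t^{(n)}\right)^2}\right)^{-1}, \tag{21}$$
where $\eta > 0$ is the learning rate (step size).
When $\lambda = 0$, (21) has the same form as SOSClus. That is, SOSComm is an extension of SOSClus for time-evolving heterogeneous information networks. To satisfy the constraints in (10), the factor matrices derived by (21) should be normalized as
$$a_{t,i_n k}^{(n)} \leftarrow \frac{a_{t,i_n k}^{(n)}}{\sum_{k'=1}^{K} a_{t,i_n k'}^{(n)}}. \tag{22}$$
For the current network snapshot, based on the tensor representation $\mathcal{X}_t$ and the previous community membership matrices $\mathbf{A}_{t-1}^{(n)}$, alternating optimization can be used to update each $\mathbf{A}_t^{(n)}$ according to (21) and (22) while all other variables are fixed. The community membership matrices obtained by (21) and (22) are approximations. In some cases we also need to recover the discrete community membership matrices from the approximations, which can be achieved by applying $k$-means to the factor matrices. More conveniently, we can simply assign each object to the multityped community that has the largest entry in the corresponding row of the factor matrix. After that, the multityped communities, which consist of gene-networks, can be extracted according to (3). The pseudocode of SOSComm is given in Algorithm 1.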
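The alternating update described in this section can be sketched in Python as follows. This is a hedged illustration rather than the authors' Algorithm 1: the function names are hypothetical, the tensor is dense (toy scale only), the objective is assumed to carry a 1/2 scaling, and clipping negative entries to zero before row normalization is an added assumption used to keep entries in the range of probabilities.

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization: rows index mode n, other modes flattened in C order."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def khatri_rao_all(mats):
    """Khatri-Rao product of a list of matrices, ordered to match unfold()."""
    out = mats[0]
    for M in mats[1:]:
        out = (out[:, None, :] * M[None, :, :]).reshape(-1, out.shape[1])
    return out

def soscomm_sweep(X, factors, prev_factors, lam, eta=1.0, eps=1e-12):
    """One pass of second-order updates over all factor matrices."""
    factors = [A.copy() for A in factors]
    K = factors[0].shape[1]
    for n in range(len(factors)):
        others = [factors[m] for m in range(len(factors)) if m != n]
        Phi = khatri_rao_all(others)
        Gamma = np.ones((K, K))
        for M in others:
            Gamma *= M.T @ M                      # Hadamard product of Gram matrices
        H = Gamma + lam * np.eye(K)               # second-order term
        grad = factors[n] @ H - unfold(X, n) @ Phi - lam * prev_factors[n]
        A = factors[n] - eta * np.linalg.solve(H, grad.T).T  # grad @ inv(H), H symmetric
        A = np.clip(A, 0.0, None)                 # assumption: keep entries nonnegative
        row_sums = A.sum(axis=1, keepdims=True)
        factors[n] = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > eps)
    return factors

def hard_assign(A):
    """Assign each object to the community with the largest membership entry."""
    return A.argmax(axis=1)
```

In an online setting, one would call `soscomm_sweep` a small number of times per snapshot, starting from the factor matrices of the previous timestamp.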

5. Implementation Issues
5.1. New Objects Coming and Old Objects Vanishing
In realistic scenarios, objects in time-evolving heterogeneous information networks have various lifecycles. As lifecycles begin and end, new objects are born and join the network while old objects die and leave. The framework designed above does not consider these lifecycles; it assumes that the objects in a network remain unchanged and active. Here, we discuss the more realistic case in which new objects arrive and old objects vanish in a time-evolving heterogeneous information network.
Note that the tensor representation is a distribution of genenetworks in the heterogeneous information network, whose elements indicate whether the genenetworks exist or not. If the lifecycle of a new object begins at the th timestamp, it will join the network and become active. Since the size of becomes and only the previous factor matrix is used to regularize the temporal smoothness, we can add an allzero row to the corresponding position on when updating .
If the lifecycle of a specified object ends at the th timestamp, it will not appear in any genenetwork in the network. According to (1), . That is, each element in the hyperplane, which is perpendicular to the th dimensionality and passes the th point of the th dimensionality in the tensor space, is zero. Therefore, we set all entries in the th row of equal to zero; that is, for . However, this operation makes the factor matrix dissatisfy the first constraint in (10). Since our framework is an approximation and the dead objects will never appear in any multityped community (according to (3)), we can loosen the first constraint in (10) aswhich does not affect the performance of recovering the discrete community membership matrices from and extracting the multityped communities .
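The two bookkeeping steps described above (padding the previous factor matrix for new objects, and zeroing the rows of dead objects) can be sketched in Python as follows. The function names are hypothetical, and the positions of new objects are assumed to be given as row indices in the enlarged matrix.

```python
import numpy as np

def pad_previous_factor(A_prev, new_positions):
    """Insert an all-zero row into the previous factor matrix for each new object."""
    A = A_prev.copy()
    for pos in sorted(new_positions):   # ascending order keeps final indices valid
        A = np.insert(A, pos, 0.0, axis=0)
    return A

def zero_dead_rows(A_cur, dead_rows):
    """Set the membership rows of vanished objects to zero in the current factor matrix."""
    A = A_cur.copy()
    A[np.asarray(dead_rows, dtype=int), :] = 0.0
    return A

def satisfies_relaxed_row_constraint(A, tol=1e-9):
    """Check that every row sums to either 0 (dead object) or 1 (active object)."""
    s = A.sum(axis=1)
    return bool(np.all((np.abs(s) < tol) | (np.abs(s - 1.0) < tol)))
```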
5.2. Online Deployment
The snapshots in the network sequence of a time-evolving heterogeneous information network arrive as a stream, which makes storing the whole network sequence unrealistic. Fortunately, we only use the new network snapshot and the previous community membership matrices to update the model, which makes SOSComm easy to deploy online. However, three issues should be taken into account.
Firstly, the initialization of the factor matrices has a large impact on the efficiency of SOSComm. A good initialization may reduce the number of iterations significantly. In practice, using the previous community membership matrices as the starting point when updating the current factor matrices is a good choice. That is, set
$$\mathbf{A}_t^{(n)} = \mathbf{A}_{t-1}^{(n)}, \quad n = 1, 2, \ldots, N, \tag{24}$$
at the beginning of the algorithm, where $\mathbf{A}_t^{(n)}$ is the factor matrix of the $n$th object type at timestamp $t$. See line (1) in Algorithm 1.
Secondly, the second-order stochastic gradient descent algorithm converges quickly [33, 41] given a good initialization, which will be demonstrated in the experiments in Section 6. In addition, the factor matrices obtained by SOSComm are approximations to the community membership matrices. Therefore, we can set the maximum number of iterations to a very small positive integer.
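Finally, the sparsity of the heterogeneous information network should be used to speed up the calculation. According to (21), the primary computation cost for updating $\mathbf{A}_t^{(n)}$ is calculating a series of Khatri-Rao products, that is, $\mathbf{X}_{t(n)} \mathbf{\Phi}_t^{(n)}$. If we store all the elements of $\mathcal{X}_t$ and calculate the Khatri-Rao products of the factor matrices in order, the calculation will be very expensive because the largest intermediate result will reach a size of $\left(\prod_{m \neq n} I_m\right) \times K$. Actually, heterogeneous information networks are usually very sparse; namely, a great number of elements in the tensor $\mathcal{X}_t$ are zeros. By considering $\mathbf{X}_{t(n)} \mathbf{\Phi}_t^{(n)}$ as a whole, its elements are given by
$$\left(\mathbf{X}_{t(n)} \mathbf{\Phi}_t^{(n)}\right)_{i_n k} = \sum_{i_1} \cdots \sum_{i_{n-1}} \sum_{i_{n+1}} \cdots \sum_{i_N} x_{i_1 i_2 \cdots i_N} \prod_{m \neq n} a_{t,i_m k}^{(m)}. \tag{25}$$
Obviously, when $x_{i_1 i_2 \cdots i_N} = 0$, we can directly set the corresponding term to zero; that is, the subsequent calculation of Khatri-Rao products is unnecessary. Thus, by considering the sparsity, only the nonzero elements in $\mathcal{X}_t$ need to be stored and calculated.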
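The sparsity argument above can be sketched in Python: the product of the matricized tensor with the Khatri-Rao product is accumulated from the nonzero entries only, so the large intermediate matrix is never formed. The function name and the coordinate-list storage format are assumptions made for illustration.

```python
import numpy as np

def sparse_mttkrp(coords, values, factors, n):
    """Accumulate the matricized-tensor-times-Khatri-Rao product from nonzeros only.

    coords: (nnz, N) integer array of nonzero positions; values: (nnz,) array.
    """
    K = factors[0].shape[1]
    rows = np.ones((len(values), K))
    for m, A in enumerate(factors):
        if m != n:
            rows *= A[coords[:, m], :]      # pick the membership row of each object
    rows *= values[:, None]
    out = np.zeros((factors[n].shape[0], K))
    np.add.at(out, coords[:, n], rows)      # unbuffered add handles repeated indices
    return out
```

The cost of this routine grows with the number of nonzero entries rather than with the full size of the tensor.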
5.3. Time Complexity Analysis
The primary computation cost for updating the factor matrices in each iteration of SOSComm is calculating three part: , , and the product of them. Firstly, for calculating , only nonzero elements in need to be concerned. Therefore, the time complexity is , where is the number of nonzero elements in and also is the total number of genenetworks in the network. Secondly, according to (17), since a series of matrixmatrix multiplications and Hadamard products are used to replace numerous KhatriRao products, calculating costs , where is the total number of objects in the network. Thus, the time complexity for calculating the inverse matrix of is . Finally, the product of and is a matrixmatrix multiplication, where and , so, the time complexity is .
To summarize, the time complexity for SOSComm in each iteration is , where is the total number of genenetworks, is the total number of objects, is the number of object types, and is the number of multityped communities. Since and , the time complexity for SOSComm is nearly .
6. Experiments and Results
In this section, the proposed SOSComm is evaluated on both synthetic and real-world datasets. We demonstrate the efficiency of SOSComm for multityped community discovery in time-evolving heterogeneous information networks with general network schemas and compare its performance with several other state-of-the-art community discovery methods. The experiments are run in MATLAB R2015a (version 8.5.0, 64-bit) with the MATLAB Tensor Toolbox (version 2.6, http://www.sandia.gov/~tgkolda/TensorToolbox/). The code and datasets used in the experiments are available online at https://github.com/tianshuilideyu/SOSComm.
6.1. Experiments on Synthetic Datasets
6.1.1. Dataset Description
Real-world heterogeneous information networks typically come without ground truth for community membership. Furthermore, due to their large scale and sparsity, it is impractical to manually assign community labels to the objects in a real-world network. Therefore, we resort to several synthetic networks with known community structures to demonstrate the effectiveness of SOSComm.
We construct four synthetic networks with different parameters as the initial networks, that is, the network snapshots at . In order to obtain more realistic synthetic networks, the interactions between objects are assumed to follow Zipf’s law (see details online: https://en.wikipedia.org/wiki/Zipf’s_law), which denotes the distribution of genenetworks in networks. The parameters are as follows, and the details of the synthetic networks at the first timestamp are shown in Table 1:(i) is the number of object types in networks.(ii) is the number of multityped communities.(iii) is the network scale, and .(iv) is the tensor density, and .
To simulate the smooth evolution of multityped communities, each synthetic network is evolved into a network sequence with 10 timestamps. Within each evolution, a percentage (from 5% to 10%) of the objects from each type change their community memberships by interacting with other objects in different communities randomly at each timestamp.
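The membership drift described above can be imitated with a short Python sketch. It only moves community labels and does not regenerate the interactions between objects, so it is a simplification of the procedure in the text; the function name and its parameters are hypothetical.

```python
import numpy as np

def evolve_labels(labels, n_communities, rng, low=0.05, high=0.10):
    """Move a random 5%-10% of objects of one type to a different community."""
    labels = labels.copy()
    fraction = rng.uniform(low, high)
    n_move = max(1, int(round(fraction * labels.size)))
    movers = rng.choice(labels.size, size=n_move, replace=False)
    # A shift in 1..n_communities-1 guarantees the new community differs.
    shift = rng.integers(1, n_communities, size=n_move)
    labels[movers] = (labels[movers] + shift) % n_communities
    return labels

rng = np.random.default_rng(0)
labels_t0 = np.repeat(np.arange(4), 25)          # 100 objects, 4 communities
labels_t1 = evolve_labels(labels_t0, 4, rng)
```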
For completeness, we also randomly generate from 10% to 15% new objects coming and old objects vanishing in Syn4 at each timestamp. With new objects coming and interacting with other objects, many new genenetworks are generated. Meanwhile, with old objects vanishing, they will not appear in any genenetwork in the network.
6.1.2. Comparative Methods and Experimental Setting
The performance of SOSComm on synthetic networks is compared with two state-of-the-art baselines:
(1) SOSClus (see [33]): an offline clustering framework for static heterogeneous information networks, which treats every snapshot in the network sequence independently without the temporal evolution regularization term.
(2) CEMNTR (see [26, 27]): a framework of community evolution in multimode networks with a temporal evolution regularization term, denoted as CEMNTR. CEMNTR partitions the multimode network into a set of bityped networks and detects communities in each bityped network via block model approximation with temporal regularization.
Both the baselines and SOSComm share the same stopping conditions; that is, the change of corresponding objective function is less than and the maximum iterations . The experiments in our prior work [33] have shown that the secondorder stochastic gradient descent has good robustness with respect to the learning rate. Hence, we set the learning rate for both SOSClus and SOSComm. As CEMNTR needs to partition the networks into a set of bityped networks, we divide each network snapshot in Syn3 and Syn4 into 3 bityped networks and construct the adjacent matrices for each pair of object types.
Since the ground truth of the community structures in the synthetic networks is known, we adopt the Normalized Mutual Information (NMI) [42] as the metric to evaluate performance. NMI measures the mutual dependence between the detected multityped community membership and the ground truth, and it ranges from 0 to 1. The larger the NMI, the better the result.
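For reference, NMI between two hard labelings can be computed as in the following Python sketch. Several normalizations of mutual information are in use; this sketch assumes the geometric-mean normalization, which may differ from the variant used in [42]. The function name is hypothetical.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized mutual information with geometric-mean normalization (assumed)."""
    _, t = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    contingency = np.zeros((t.max() + 1, p.max() + 1))
    np.add.at(contingency, (t, p), 1)            # joint label counts
    joint = contingency / contingency.sum()
    pt, pp = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pt, pp)[nz]))
    h_t = -np.sum(pt * np.log(pt))
    h_p = -np.sum(pp * np.log(pp))
    if h_t == 0.0 or h_p == 0.0:                 # a labeling with a single community
        return 1.0 if h_t == h_p else 0.0
    return float(mi / np.sqrt(h_t * h_p))
```

The score is invariant to permutations of the community labels, which is why it suits comparisons against a ground-truth partition.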
6.1.3. Experimental Results
We set the temporally regularized parameter for SOSComm and CEMNTR. Since the number of multicommunities is an important parameter for SOSComm, we evaluate the performance with different on the 4 synthetic networks firstly. With varying from to , the average values of NMIs of SOSComm on the 4 synthetic networks are shown in Figure 3. Obviously, on Syn1 and Syn3, SOSComm performs best when , and on Syn2 and Syn4, SOSComm performs best when . The results are consistent with the real setting for synthetic networks in Table 1; that is, the real number of multityped communities is for Syn1 and Syn3, and for Syn2 and Syn4. With the widening gap between and the real number of multityped communities, SOSComm performs worse and worse in all synthetic networks.
In the following experiments, we fix the number of communities to the real number of multityped communities in each synthetic network. The comparison of NMIs for SOSComm and the two baselines on the 4 synthetic networks is shown in Figure 4, where each subgraph shows the NMIs of the three methods on each network snapshot in the corresponding synthetic network. The trend of the NMI curve reflects the ability to trace community evolution. From the 4 subgraphs in Figure 4, we find that SOSComm performs best both on NMI and on tracing community evolution. Since no knowledge of previous community membership is available at the first timestamp, SOSComm and SOSClus share the same starting point on the 4 synthetic networks. Moreover, as time evolves, SOSComm traces the evolution of multityped communities closely, while the NMIs of SOSClus and CEMNTR on the 4 synthetic networks decline steadily.
Figure 4: NMIs of SOSComm, SOSClus, and CEMNTR on each network snapshot. (a) Experimental results on Syn1. (b) Experimental results on Syn2. (c) Experimental results on Syn3. (d) Experimental results on Syn4.
As shown in Figure 4(d), with new objects coming and old objects vanishing in the network at each timestamp, the NMIs of SOSClus and CEMNTR on Syn4 drop sharply: the NMI of SOSClus drops from 1.0 to 0.2865, and the NMI of CEMNTR drops from 0.8099 to 0.0976. Meanwhile, the NMI of SOSComm remains relatively stable. This shows that SOSComm can effectively handle time-evolving heterogeneous information networks with new objects coming and old objects vanishing.
The convergence speed is also a significant focus for studying the performances of our framework. We run SOSComm on Syn3 and Syn4 with and analyze the changes of the objective function between adjacent iterations, denoted as , for all timestamps in the two network sequences. When the s almost keep constant, the algorithm converges. Figure 5 shows the experimental results of , where each subgraph displays the convergence speed of SOSComm on Syn3 and Syn4 at corresponding timestamp. In Figure 5, we can see that SOSComm converges quickly on both Syn3 and Syn4 at all timestamps. Particularly, SOSComm has converged when the iterations are less than 10 in all subgraphs, which is a good property for online deployment.
Figure 5: Convergence of SOSComm on Syn3 and Syn4 at each timestamp, panels (a) to (j).
The temporal regularization parameter in (10) controls the impact of historical information on the current community distribution. The larger the parameter is, the more significant the impact is. To study the influence of tuning this parameter, we apply SOSComm on Syn4 with the parameter varying from 0.1 to 100. The average values of NMIs and iterations over all network snapshots are shown in Figure 6, where the axis of the parameter is on a logarithmic scale. As shown in Figure 6, the NMIs and iterations remain satisfactory when the parameter is less than 10. However, when the parameter exceeds 10 and keeps increasing, both the NMIs and the numbers of iterations deteriorate quickly. That is, when the temporal regularization parameter is too large, the historical information dominates and the algorithm consumes more resources to smooth the time-evolving communities. When the parameter ranges from 0.1 to 10, it contributes to multityped community detection by taking the temporal information into account.
To conclude, the experiments on the 4 time-evolving synthetic networks demonstrate that SOSComm outperforms SOSClus and CEMNTR. With a fast convergence speed, SOSComm can accurately trace the evolution of multityped communities in the 4 synthetic networks. In particular, on Syn4, where new objects arrive and old objects vanish, SOSComm detects the evolution of multityped communities well, while the performances of SOSClus and CEMNTR deteriorate rapidly as time goes on. The NMIs of SOSComm on the 4 synthetic networks with different numbers of communities show that SOSComm is sensitive to this setting: the closer it is to the real number of multityped communities, the better SOSComm performs. Moreover, when the temporal regularization parameter ranges from 0.1 to 10, the performances of SOSComm are satisfactory.
6.2. Experiments on Real-World Dataset
6.2.1. Dataset Description
Here, we compare the performances of SOSComm with the baselines on a real-world dataset. The real-world dataset is a 25-year DBLP network sequence, which was collected by Tang et al. [27] and is available online: http://www.leitang.net/heterogeneous_network.html. In the 25-year DBLP dataset, the papers published from 1980 to 2004 are extracted, and all related authors, terms (words contained in the papers' titles), and venues (the conferences or journals in which the papers were published) are included. Low-frequency words and stop words have been removed. The 25-year DBLP network is segmented into 25 network snapshots according to the publication year associated with each paper. After that, we construct a 4-mode tensor for each network snapshot, where the 4 modes of the tensor represent the papers, authors, venues, and terms, respectively. Table 2 shows the number of papers, authors, venues, terms, and genenetworks in each network snapshot of the 25-year DBLP dataset. Each row in Table 2 thus indicates the size of the corresponding tensor. For example, the tensor of one snapshot has size 69,021 × 105,292 × 1,238 × 9,153, with 1,182,458 nonzero elements. It is worth noting that there is no ground truth of community memberships in the real-world dataset, because it is difficult and unrealistic to label the massive number of objects in a real-world network automatically or even manually.
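The snapshot construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the record layout `(year, paper, author, venue, term)`, the function name `build_snapshots`, the toy records, and the choice of indexing objects separately within each snapshot are all hypothetical.

```python
from collections import defaultdict

def build_snapshots(records):
    """Group (year, paper, author, venue, term) records into yearly sparse tensors.

    Returns {year: (coords, shape)}, where coords is a list of
    (paper_idx, author_idx, venue_idx, term_idx) tuples marking the nonzero
    entries (one per genenetwork) and shape is the size of each of the 4 modes.
    Assumption: objects are indexed locally within each snapshot.
    """
    by_year = defaultdict(set)
    for year, paper, author, venue, term in records:
        by_year[year].add((paper, author, venue, term))

    snapshots = {}
    for year, entries in by_year.items():
        index = [dict() for _ in range(4)]   # one object-to-index map per mode
        coords = []
        for entry in sorted(entries):
            # Assign the next free index the first time an object is seen.
            coords.append(tuple(
                index[mode].setdefault(obj, len(index[mode]))
                for mode, obj in enumerate(entry)
            ))
        shape = tuple(len(mapping) for mapping in index)
        snapshots[year] = (coords, shape)
    return snapshots

# Hypothetical toy records, not taken from the DBLP dataset.
records = [
    (1980, "p1", "alice", "VLDB", "query"),
    (1980, "p1", "bob", "VLDB", "query"),
    (1980, "p1", "alice", "VLDB", "index"),
    (1981, "p2", "bob", "KDD", "cluster"),
]
snapshots = build_snapshots(records)
```

Storing only the coordinates of nonzero entries matters at this scale: a dense tensor of the size quoted above would be far too large to hold in memory, while the coordinate list grows only with the number of genenetworks.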
6.2.2. Evaluation Metrics
Unlike on the synthetic networks, NMI cannot be adopted as the evaluation metric here, because the real-world dataset lacks ground truth of community membership. In fact, evaluating the detection of community evolution is challenging. As an alternative, we extend the modularity [43, 44], a widely used metric for measuring the quality of communities in a homogeneous network, to the high-order tensor space, so that the extended modularity is suitable for heterogeneous information networks. In a network, a high modularity reflects dense connections among vertices within a community and sparse connections among vertices across different communities.
Following the work of [44], modularity is defined as the fraction of the edges that fall within the given communities minus the expected fraction if the edges were placed at random with the degree of each vertex fixed. We directly give the calculation of modularity in [44]:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(v_i, v_j), \tag{26}$$
where $m$ is the total number of edges in the whole network, $A_{ij}$ is an element of the adjacency matrix $\mathbf{A}$, and $k_i$ denotes the degree of vertex $v_i$. The function $\delta(v_i, v_j)$ indicates whether the vertices $v_i$ and $v_j$ are in the same community or not. The value of $Q$ falls in the range $[-1/2, 1]$, so it can be negative. In practice, when the value of $Q$ ranges from 0.3 to 0.7, the quality of the communities is satisfactory.
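As an illustration of this definition, the following sketch computes the standard Newman modularity of a partition from a dense adjacency matrix. The function name `modularity`, the toy graph, and the zero value returned for an edgeless graph are assumptions made for the example.

```python
import numpy as np

def modularity(adj, labels):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    adj    : (n, n) symmetric 0/1 adjacency matrix with zero diagonal
    labels : length-n sequence; labels[i] is the community of vertex i
    """
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    degrees = adj.sum(axis=1)                      # k_i
    two_m = degrees.sum()                          # 2m: each edge is counted twice
    if two_m == 0:
        return 0.0                                 # assumed convention for no edges
    same = labels[:, None] == labels[None, :]      # delta(v_i, v_j)
    expected = np.outer(degrees, degrees) / two_m  # k_i * k_j / (2m)
    return float(((adj - expected) * same).sum() / two_m)

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one community
```

Splitting the toy graph along the bridging edge gives $Q = 5/14 \approx 0.357$, while placing all vertices in a single community gives $Q = 0$, since the observed and expected edge fractions then cancel.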
Without loss of generality, we take the heterogeneous information network at the $t$th timestamp as an example and ignore the timestamp superscript in the following discussion. In our framework, each nonzero element of the tensor corresponds to a genenetwork in the given heterogeneous information network, while the outer product of the $r$th columns of the corresponding factor matrices indicates the distribution of the $r$th multityped community over genenetworks. In other words, a genenetwork is the minimum unit in our framework. Then, a new graph $G$ reflecting the connections of genenetworks is formed, in which each genenetwork in the original heterogeneous information network is treated as a vertex. If two vertices $v_i$ and $v_j$ are connected, that is, an edge between $v_i$ and $v_j$ exists in $G$, the genenetworks denoted by $v_i$ and $v_j$ in the original heterogeneous information network share one or more objects.
Accordingly, the modularity can be used to evaluate the quality of communities in $G$. Since the vertices in $G$ are in one-to-one correspondence with the genenetworks in the original heterogeneous information network, the multityped communities consisting of genenetworks also form a partition of $G$ into communities. Let $n$ denote the total number of vertices in $G$; that is, $n$ equals the number of genenetworks, or nonzero tensor elements. The adjacency matrix of $G$ becomes $\mathbf{A} \in \{0, 1\}^{n \times n}$, whose element $A_{ij}$ indicates whether $v_i$ connects to $v_j$ or not. Here, the adjacency matrix is symmetric with an all-zero diagonal; that is, $A_{ij} = A_{ji}$ and $A_{ii} = 0$.
Thereby, the total number of edges in $G$ is $m = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}$, and the degree of $v_i$ is $k_i = \sum_{j=1}^{n} A_{ij}$. According to (26), the extended modularity (also denoted by $Q$) can be calculated by
$$Q = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(v_i, v_j). \tag{27}$$
If $v_i$ and $v_j$ are in the same multityped community, $\delta(v_i, v_j) = 1$. Otherwise, $\delta(v_i, v_j) = 0$.
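The construction above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names, the toy 3-mode coordinates, and the use of a hard community label per genenetwork are hypothetical, and how those labels are derived from the factor matrices is left outside the example.

```python
import numpy as np

def genenetwork_adjacency(coords):
    """Adjacency matrix of the graph whose vertices are genenetworks.

    coords : (n, N) integer array; row i holds the N mode indices of the
             i-th nonzero tensor element (one genenetwork).
    Two genenetworks are linked when they share at least one object, i.e.
    they have the same index in at least one mode.
    """
    coords = np.asarray(coords)
    # shared[i, j] is True when rows i and j agree in any mode.
    shared = (coords[:, None, :] == coords[None, :, :]).any(axis=2)
    np.fill_diagonal(shared, False)    # zero diagonal: no self-loops
    return shared.astype(float)

def extended_modularity(coords, labels):
    """Modularity of a partition of genenetworks, computed on their graph."""
    adj = genenetwork_adjacency(coords)
    labels = np.asarray(labels)
    degrees = adj.sum(axis=1)          # degree of each genenetwork vertex
    two_m = adj.sum()                  # twice the number of edges
    if two_m == 0:
        return 0.0                     # assumed convention for no edges
    same = labels[:, None] == labels[None, :]
    expected = np.outer(degrees, degrees) / two_m
    return float(((adj - expected) * same).sum() / two_m)

# Hypothetical nonzero coordinates of a 3-mode (paper, author, venue) tensor.
coords = [(0, 0, 0), (0, 1, 0), (1, 0, 0),    # genenetworks around venue 0
          (2, 2, 1), (2, 3, 1), (3, 3, 1)]    # genenetworks around venue 1
labels = [0, 0, 0, 1, 1, 1]                   # assumed community assignment
adj = genenetwork_adjacency(coords)
q = extended_modularity(coords, labels)
```

In this toy case the two groups share no objects, so the graph is two disconnected triangles and $Q = 0.5$. The dense pairwise comparison needs memory quadratic in the number of genenetworks, so a snapshot with over a million nonzero elements would call for a sparse construction, for example grouping genenetworks by shared object index.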
6.2.3. Experimental Results
First, the baselines and SOSComm are deployed in offline mode in order to learn their best performances on multityped community discovery. That is, the baselines and SOSComm are iterated on each network snapshot until they converge. In the offline mode, we use the same comparative methods and experimental settings as in the experiments on synthetic networks, namely, the same threshold on the change of the corresponding objective function and the same maximum number of iterations. We set the temporally regularized parameter for SOSComm and CEMNTR.
To find a suitable number of multityped communities, we run SOSComm on the 25-year DBLP network with different numbers of communities. Figure 7 gives the average values of modularity over the 25 timestamps as the number of communities varies. Though the average values of modularity are almost equal for several settings, the maximum is obtained when the number of communities is 20. Therefore, in the following experiments, the number of multityped communities in the 25-year DBLP network is fixed to 20.
The comparison of modularity for the baselines and SOSComm in offline mode is shown in Figure 8. SOSComm achieves the highest modularity on each network snapshot. As time evolves, SOSComm traces the evolution of multityped communities more and more closely, while the modularity of SOSClus stays low throughout and the modularity of CEMNTR declines steadily.
Second, we examine the performances of SOSComm in online mode. In the online mode, the maximum number of iterations is limited to 5. The comparison of modularity for the baselines and SOSComm in online mode is shown in Figure 9. Although the modularity of SOSComm declines relative to that in offline mode, its performance is still the best.
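The difference between the two modes comes down to the stopping rule. The sketch below shows a generic iteration that stops when the objective changes by less than a tolerance between adjacent iterations or when an iteration cap is reached. It is not SOSComm itself: the toy quadratic objective, the gradient step, and the default values of `tol` and `max_iter` are assumptions chosen for illustration. Only the cap of 5 iterations for the online mode comes from the text.

```python
def run_until_converged(step, objective, x0, tol=1e-6, max_iter=1000):
    """Iterate x <- step(x) until the objective changes by less than tol
    between adjacent iterations, or until max_iter iterations have run.

    Returns the final iterate and the number of iterations performed.
    """
    x = x0
    prev = objective(x)
    for iteration in range(1, max_iter + 1):
        x = step(x)
        cur = objective(x)
        if abs(prev - cur) < tol:
            return x, iteration
        prev = cur
    return x, max_iter

# Hypothetical toy problem: minimize f(x) = (x - 3)^2 by gradient descent.
def objective(x):
    return (x - 3.0) ** 2

def step(x):
    return x - 0.25 * 2.0 * (x - 3.0)   # learning rate 0.25 times the gradient

# "Offline": iterate until the objective stops changing.
x_off, it_off = run_until_converged(step, objective, x0=0.0)
# "Online": same loop, but capped at 5 iterations.
x_on, it_on = run_until_converged(step, objective, x0=0.0, max_iter=5)
```

Capping the iterations bounds the cost per snapshot at the price of a less refined solution, which matches the slightly lower modularity reported for the online mode.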
In addition, Figure 10 compares the modularity of SOSComm in offline mode and online mode. In Figure 10, we can see that the performance of SOSComm in online mode is close to that in offline mode. Before 2000, the two curves almost overlap. With the explosive growth of the tensors in the last 5 years, the modularity of SOSComm in online mode is slightly lower than that in offline mode. Table 3 summarizes the running time of the baselines and SOSComm in online mode. As shown in Table 3, CEMNTR and SOSComm have clear advantages in running time, and SOSComm is the fastest most of the time.
To summarize, the experiments on the 25-year DBLP dataset show that SOSComm outperforms SOSClus and CEMNTR. With a larger modularity, SOSComm can detect the multityped communities and trace their evolution in the 25-year DBLP network. In particular, the experimental results in online mode demonstrate that SOSComm has the best performance in both modularity and running time. That is, SOSComm is well suited to online deployment.
7. Conclusion
In this paper, a novel online framework is proposed for multityped community discovery in time-evolving heterogeneous information networks without any restriction on the network schema. Each snapshot of the network sequence is expressed as a tensor, and each multityped community is modeled as a rank-one tensor. Then, the problem of multityped community discovery is formalized as a tensor decomposition, which integrates the tensor CP factorization with a temporal evolution regularization term. In addition, a second-order stochastic gradient descent algorithm, named SOSComm, is designed to solve the tensor decomposition. In this framework, the community membership matrices of all types of objects, the multityped communities, and their evolution over time can be obtained simultaneously. In both offline and online modes, the proposed algorithm outperforms the other state-of-the-art methods in our experiments.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This study was supported by the National Science Foundation of China (no. 61401482 and no. 61401483).