Research Article  Open Access
Context Attention Heterogeneous Network Embedding
Abstract
Network embedding (NE), which maps nodes into a low-dimensional latent Euclidean space to represent effective features of each node in the network, has attracted considerable attention in recent years. Many popular NE methods, such as DeepWalk, Node2vec, and LINE, are capable of handling homogeneous networks. However, nodes in real-world networks are usually accompanied by heterogeneous information (e.g., text descriptions, node properties, and hashtags), and because of this heterogeneity it remains a great challenge to jointly project the topological structure and different types of information into a fixed-dimensional embedding space. Besides, accurately quantifying the strength of edges (the tightness of connections between nodes) in unweighted networks is also a difficulty faced by existing methods. To bridge this gap, we propose CAHNE (context attention heterogeneous network embedding), a novel network embedding method that aims to learn more accurate node representations. Specifically, we propose the concept of node importance to measure the strength of edges, which better preserves the context relations of a node in unweighted networks. Moreover, text information is a ubiquitous feature of real-world networks, e.g., online social networks and citation networks. To account for the sophisticated interactions between the network structure and the text features of nodes, CAHNE learns context embeddings for nodes by introducing the context node sequence, and an attention mechanism is integrated into the model to better reflect the impact of context nodes on the current node. To corroborate the efficacy of CAHNE, we apply our method and various baseline methods to several real-world datasets. The experimental results show that CAHNE learns higher-quality embeddings than a number of state-of-the-art network embedding methods on the tasks of network reconstruction, link prediction, node classification, and visualization.
1. Introduction
Nowadays, information networks are ubiquitous in our daily life, for example, social and communication networks, citation networks, and co-occurrence networks. Most real-world networks are very large. Thus, analyzing large-scale networks has attracted considerable research attention in recent years. Network embedding (NE), also known as network representation learning, aims to generate informative numerical representations for nodes in the network that preserve network structures and alleviate the inconveniences caused by sparsity. Network embedding methods have been demonstrated to be effective in many network analysis tasks, including link prediction [1], node classification [2], and clustering [3].
Many approaches have been proposed toward this goal, such as DeepWalk [4], LINE [5], Node2vec [6], and PPNE [7]. In particular, network embedding aims to project the network into a low-dimensional space, where each node is represented by a corresponding embedding vector and the relationships among nodes are preserved. Nodes with “high similarity” are mapped onto adjacent points (“high similarity” means nodes have similar properties and are more likely to have edges between them). The embedding vectors contain the semantic information transcribed from the network structure and can easily be applied in various network mining applications. However, most of the existing NE methods take only the network structure as input to learn representations for nodes, without considering any other information.
In reality, a network usually has rich heterogeneous information, such as text descriptions and other metadata. For instance, Wikipedia (https://www.wikipedia.org/) entries connect with each other and build an encyclopedia network. At the same time, each entry, as a node, has substantial text information such as keywords and an introduction, which describe the node in detail. Furthermore, in a real-world social network like Twitter (https://twitter.com), shown in Figure 1, users as nodes also have their own text descriptions, which may reflect the properties of each node. Hence, text information is typical and critical heterogeneous semantic information that exists widely in real-world networks. However, most NE models treat all networks as homogeneous networks. In other words, most works learn representations only from network structures and ignore text information. Because of the heterogeneity in networks, we put forward the idea of embedding a network from both network structures and text information.
To this end, a direct way is to learn representations from the text information of nodes and from network structures independently, which can be called text-aware embedding. However, this approach ignores the complicated interactions between network structures and text information, which limits the quality of the learned representations. CANE [8] is an efficient method to capture the correlation between the text feature of a node and its neighbors’ in a network, which achieves the purpose we stated before. However, CANE only preserves the local relations in a network, while we need to take the global network structures into consideration rather than node pairs independently. For example, in Figure 1, Bob may have connections to other NLP researchers who are also his colleagues, and Alice has not followed these researchers. There may therefore be potential relationships between these researchers and Alice in the text aspect because they have similar properties, but CANE cannot capture these relationships. Thus, how to reconcile network structures and text information in the network should be explored to better represent nodes.
In addition to the problem stated above, typical NE methods are insensitive to the strength of the relationship between nodes in unweighted networks. As an intuitive example, Figure 1 shows some relationships from real-world online networks. On Twitter, Trump is a celebrity who has plenty of followers, and each follower links to him by an edge. Alice and Bob are ordinary users, and they link with each other because they are colleagues. They also follow Trump just because they are Americans. In this case, the relationship between Alice and Bob should be stronger than that between Alice and Trump. As shown in Figure 1, we use dotted lines and solid lines to describe the strength of relationships (edges). A strong connection means high similarity between pairwise nodes, and a weak connection means low similarity. In unweighted networks, classical NE methods generally treat the weight of the edge between nodes as a binary variable and ignore the rich semantics of edges illustrated above. Therefore, the strength of connections is underlying structural information that we need to take into consideration when learning network representations in real-world networks, which remains a great challenge.
These problems show that the heterogeneity and structural complexity of real-world networks pose specific hurdles for network representation learning. In this paper, we propose a context attention heterogeneous network embedding (CAHNE) method with an emphasis on leveraging the rich and intrinsic information in heterogeneous networks. Specifically, CAHNE reconstructs the classical network, represented as $G = (V, E)$, to form the heterogeneous text network, denoted as $G = (V, E, T)$. We extract a context node sequence for each node by breadth-first search (BFS) on the redesigned network, and the root node is deemed the anchor node. Through a series of operations elaborated in a later section, we combine the text information in a sequence to obtain a representation for the anchor of the context node sequence, which is the context embedding of the anchor node. In this way, CAHNE integrates text information into the global structures of the network to learn the potential intertextual associations in the network. Moreover, the influence of context nodes on the anchor node can vary with different anchor nodes, and thus, we further adopt the attention mechanism to enhance the expressiveness of the influence of the context nodes on the specific anchor node. Besides, for unweighted networks, CAHNE is expected to preserve the underlying structural information on the strength of edges. Based on this idea, we define node importance, which quantifies the strength of the relationship between nodes, and integrate it into the network embedding method to learn a structure-based representation for each node. Finally, we concatenate the context embedding and the structure-based embedding of the node as the complete representation for the node.
Empirically, we apply CAHNE to four network analysis tasks, i.e., network reconstruction, link prediction, node classification, and visualization, using seven real-world networks as datasets. Experimental results demonstrate that our method learns better node embeddings than a variety of state-of-the-art baselines in the field of NE.
The main contributions of our method are summarized as follows:
(i) We propose a novel network embedding model, namely, CAHNE. The method is able to learn comprehensive representations for different types of real-world networks, which confirms the flexibility and robustness of our model.
(ii) We provide a key insight regarding the strength of relationships in unweighted real-world networks. We thereby propose the definition of node importance for optimizing the objective, which more closely reflects the actual situation of the network.
(iii) We integrate heterogeneous information into network representation and mitigate the incompatibility between network structures and text information by extracting context node sequences accompanied by the attention mechanism to learn context embeddings.
The source code is available at https://github.com/zhuo931077127/CAHNE.
2. Related Works
Network representation learning (NRL) has been well researched for many years, for example, in earlier works such as Isomap [9], multidimensional scaling (MDS) [10], and Laplacian eigenmap (LE) [11]. These approaches represent the network as an affinity graph by using the feature vectors of the network nodes. For a given large-scale information network, e.g., a social network or a citation network, these methods are inefficient and inflexible at generating node representations.
In recent years, inspired by the development of machine learning and the word embedding method Word2vec [12], many NRL methods have been proposed for large-scale information network representation. For example, DeepWalk [4] performs random walks on the graph to obtain sequences of nodes and introduces the Skip-Gram model to learn vertex representations. Based on DeepWalk, Node2vec [6] defines a flexible notion of a node’s network neighborhood and designs a biased random walk procedure to explore the network structure more efficiently. Some other methods focus on finding multivariate structure features in the network. For example, LINE [5] embeds the network into a low-dimensional latent space to approximate the first-order proximity and second-order proximity of the network. Nevertheless, most of these network embedding models only focus on homogeneous networks, without taking heterogeneous information into consideration.
Different from homogeneous networks, heterogeneous networks consist of complex node and edge attributes. Several attempts have been made at heterogeneous information network (HIN) embedding and have achieved promising performance in various tasks. Hin2Vec [13] learns the embeddings of a HIN by conducting multiple prediction training tasks jointly. CANE [8] learns network embeddings from network structures and text descriptions with mutual relations of pairwise nodes. ANRL [14] proposes a neighbor enhancement autoencoder to incorporate both the network structure and node attribute information in a principled way. Paper2vec [15] aims to learn the paper node embeddings from the paper citation network.
In summary, existing methods in homogeneous network embedding use either affinity matrix models or deep models to preserve network structural features in a low-dimensional space, and existing HIN embedding methods focus on different types of heterogeneous information. They have been proven useful for network analysis, but they cannot maintain the sophisticated interaction between network structures and heterogeneous information (in this paper, we consider text information). Additionally, to the best of our knowledge, all existing NE models ignore the relationship strength between nodes in unweighted real-world networks discussed above. In contrast, our proposed model CAHNE can learn more comprehensive information than existing methods.
3. Preliminaries
In this section, we introduce basic definitions and formalize the problem of context attention heterogeneous network embedding.
3.1. Context Node Sequence (CNS)
Forming a context node sequence for an anchor node in the network can be viewed as a sampling process that detects the nodes most likely to have an impact on the anchor node. Figure 2 shows the process of obtaining a context node sequence. Concretely, we first perform breadth-first search (BFS) on the original graph G starting from a node $v_i$, which we regard as the anchor node; this provides us with a BFS tree rooted at $v_i$ that can be considered the unique relational tree of $v_i$. Context nodes include not only the neighbors of the anchor node but also nodes in deeper layers. Hence, we control the number of layers by setting the parameter k when sampling context nodes. The value of k is not fixed and is determined by the type of the given network. At last, for a given node $v_i$, we obtain its context node sequence, which lists the m context nodes of the first layer, followed by the n context nodes of the second layer, and so on. It is worth noting that each node appears at most once in a context node sequence, and building BFS trees for all nodes is not computationally expensive because of the sparsity of real-world networks.
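The sampling procedure described above can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the adjacency-dictionary input format, the function name `context_node_sequence`, and the choice to leave the anchor itself out of the returned sequence are assumptions made for illustration.

```python
from collections import deque


def context_node_sequence(adj, anchor, k):
    """Collect the nodes reached from `anchor` within k BFS layers.

    adj maps each node to a list of its neighbours (hypothetical input format).
    Nodes are returned layer by layer, and each node appears at most once.
    """
    visited = {anchor}
    sequence = []
    queue = deque([(anchor, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == k:  # do not expand beyond the k-th layer
            continue
        for neighbour in adj.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                sequence.append(neighbour)
                queue.append((neighbour, depth + 1))
    return sequence


# Toy graph with edges a-b, a-c, b-d, c-d, d-e
toy_adj = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
print(context_node_sequence(toy_adj, "a", 2))  # ['b', 'c', 'd']
```

With k = 2, node `e` is three hops from the anchor and is therefore not sampled; raising k to 3 would append it to the sequence.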
3.2. Problem Formulation
Now, we formally define the problem of CAHNE. Compared to conventional homogeneous network embedding methods such as DeepWalk and Node2vec, which only focus on a single network structure, our goal is to learn a representation for each node in a network graph that also integrates heterogeneous associated information. Text information is widely available in real-world networks, e.g., social networks and citation networks, so we integrate it into the traditional graph definition $G = (V, E)$ [16]. We first define a heterogeneous text network as follows.
Definition 1 (heterogeneous text network (HTN)). The HTN is denoted as $G = (V, E, T)$, where $V$ represents the set of nodes, $E$ represents the set of edges, and each edge in $E$ is the relationship between two nodes linked with each other, with an associated weight (in this paper, we only consider unweighted networks). $T$ denotes the text information of nodes. The text information of a specific node $v_i$ is represented as a word sequence whose length is the number of words in the text of $v_i$.
Compared with the conventional network $G = (V, E)$, the heterogeneous text network contains richer information. Empirically, the weight often indicates the strength of the edge between two nodes. In practice, for unweighted real-world network datasets, weights are only binary variables. For example, if $v_i$ has a neighbor $v_j$, the weight of the edge between them is 1; otherwise, it is 0. However, we expect to measure the strength of the relations in a way that is more in line with the actual situation of real-world online networks. Thus, we propose the definition of node importance as follows.
Definition 2 (node importance). Node importance is denoted as NI, which is a quantitative representation for each node in the network. It measures the strength of the edge between a given node and its neighbors. For an anchor node $v_i$, NI takes a single value that represents the node importance of $v_i$.
In real-world networks such as citation networks and social networks, each node has its own context node sequence. We can integrate the CNSs of all nodes to get a global sequence for G. The more CNSs a node belongs to, in other words, the more times a node appears in the global sequence, the less important this node is to its neighbors. For instance, on Twitter, a celebrity has thousands of followers, which means this celebrity appears in a large number of CNSs. However, for ordinary users, the relationship with a celebrity is less important than the relationship with their real friends.
Definition 3 (network embedding). Given a heterogeneous text network denoted as $G = (V, E, T)$, network embedding aims to map the network data into a low-dimensional latent space, where each node learns a low-dimensional embedding according to its graph structure and other information. Note that $d$ is the dimension of the latent embedding space.
Embedding a network into a low-dimensional space is helpful for many analysis tasks. In this process, the structures and properties of the network are preserved and encoded. In a heterogeneous text network, structure-based network embedding is not enough, because the heterogeneous information is usually highly correlated with the network structure. Thus, we further propose the definition of context embedding.
Definition 4 (context embedding). Aiming to learn a vector representation for the text information of each node in an HTN, context embedding learns a mapping function from each node $v_i$ to a real-valued vector whose length is the dimension of context embedding.
It is worth mentioning that, beyond integrating the text features of the anchor node, context embedding also takes the context node sequence into consideration. For instance, the context embedding of the anchor node $v_i$ is determined by its CNS and its own text description. In this paper, our method CAHNE introduces the attention mechanism to weight the context nodes for each anchor node so that we can mitigate the incompatibility between network topologies and text features and obtain more comprehensive and accurate representations for the network.
4. CAHNE: The Proposed Method
In this section, we will give a detailed introduction to our method CAHNE.
4.1. Overall Framework
For CAHNE, we need to make full use of network structures and the associated text information. We propose two types of embedding for a node $v$, i.e., structure-based embedding $v^s$ and context embedding $v^c$. Structure-based embedding captures the network structural information and incorporates node importance, while context embedding captures the textual meaning of an anchor node together with the text information of its context node sequence. We concatenate the two types of embeddings to obtain the overall node embedding for a node as $v = v^s \oplus v^c$, where $\oplus$ indicates the concatenation operation. In the following sections, we will give a detailed introduction to the two types of embeddings, respectively.
4.2. Structure-Based Embedding
Without loss of generality, we assume the heterogeneous text network is directed. For an undirected network, we treat each edge as two directed edges with opposite directions and equal weights. CAHNE then uses node importance as the weight for each node in the network.
4.2.1. Node Importance
As noted in Definition 2, in a realistic network, the more times a node appears in the global sequence, the less important it is to its neighbors. The quantitative representation of the importance of a node is the product of two statistics, node frequency and inverse CNS frequency. The node frequency refers to the frequency with which a given node appears in a context node sequence and is based on a binary variable. To get the node frequency of a node, we first introduce an indicator of whether the node appears in a given context node sequence:
We denote as the total number of nodes in the sequence . And then, we define as the node frequency of in , which can be formulated as .
The inverse CNS frequency can be considered a measure of the universal importance of a node because it captures the distribution of importance in real-world networks. After incorporating the node frequency and the inverse CNS frequency, the node importance (NI) of a given node is measured as the product of the two.
Note that NI is a context-based measure for each node in the network, and it extends the idea of TF-IDF to network node analysis. Compared with the degree-based PageRank [17], NI incorporates richer contextual semantic structures rather than only pairwise nodes, which enables our model to measure the importance of a node in the high-order neighborhood [18].
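To make the TF-IDF analogy concrete, the following sketch computes an importance score from a collection of context node sequences. It is only an illustration under stated assumptions: the node frequency is taken as 1/|sequence| for nodes that appear in a sequence, and the inverse CNS frequency as the logarithm of the number of sequences divided by the number of sequences containing the node. Both formulas are borrowed from the standard TF-IDF definitions and may differ from the exact formulas used by CAHNE; the function name `node_importance` is hypothetical.

```python
import math


def node_importance(sequences):
    """TF-IDF-style importance of each context node for each anchor.

    sequences maps an anchor node to its context node sequence.
    Returns a dict keyed by (anchor, context_node).
    The NF and ICF formulas are assumptions borrowed from standard TF-IDF.
    """
    n_sequences = len(sequences)

    # Number of context node sequences that contain each node.
    containing = {}
    for seq in sequences.values():
        for node in set(seq):
            containing[node] = containing.get(node, 0) + 1

    importance = {}
    for anchor, seq in sequences.items():
        for node in seq:
            nf = 1.0 / len(seq)                               # node frequency
            icf = math.log(n_sequences / containing[node])    # inverse CNS frequency
            importance[(anchor, node)] = nf * icf
    return importance


# "celebrity" appears in every sequence, so its importance collapses to zero.
toy_sequences = {
    "alice": ["bob", "celebrity"],
    "bob": ["alice", "celebrity"],
    "carol": ["celebrity", "dave"],
    "dave": ["carol", "celebrity"],
}
scores = node_importance(toy_sequences)
print(scores[("alice", "bob")] > scores[("alice", "celebrity")])  # True
```

The toy data mirrors the Twitter example: a node that shows up in many context node sequences contributes little to any single anchor, while a node shared by few anchors receives a larger score.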
For a node $v_i$ in an unweighted network, its NI value can serve as the weight of the edges starting from $v_i$. We can also regard NI as a ranking of node popularity in the network: the smaller the value, the higher the prevalence of a node. After obtaining the NI values for a given network, we can obtain the empirical distribution of the network, which can be defined as follows:
4.2.2. Structure-Based Objective
Formally, we model the conditional probability of $v_j$ generated by $v_i$ as
This equation can be interpreted as the probability of detecting the edge from $v_i$ to $v_j$, which denotes the reconstructed distribution.
Given the empirical distribution of the coincident probability between nodes and the reconstructed distribution, a straightforward way to preserve the node importance and network structures is to minimize the distance between the two distributions. We choose the KL divergence of the two probability distributions to measure this difference. Thus, replacing the distance with the KL divergence, we can obtain the following objective:
With this formulation, we can minimize the objective in equation (8) to obtain vectors that represent nodes in the $d$-dimensional latent space based on the network structure. We summarize the structure-based embedding method in Algorithm 1.
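The objective described above can be illustrated with a small, self-contained sketch. It assumes the common LINE-style form: the conditional probability is a softmax over inner products of node vectors, and minimizing the KL divergence reduces (up to a constant) to a weighted negative log-likelihood over edges, with the edge weight playing the role of node importance. These are assumptions for illustration, and the function names are hypothetical.

```python
import math


def conditional_prob(emb, i):
    """Softmax over inner products between node i and every node in emb.

    emb maps node -> embedding vector (list of floats). For simplicity the
    softmax runs over all nodes, including i itself (an assumption).
    """
    scores = {j: sum(a * b for a, b in zip(emb[i], emb[j])) for j in emb}
    top = max(scores.values())  # subtract the max for numerical stability
    total = sum(math.exp(s - top) for s in scores.values())
    return {j: math.exp(s - top) / total for j, s in scores.items()}


def structure_loss(emb, weighted_edges):
    """Weighted negative log-likelihood of the observed directed edges.

    weighted_edges maps (i, j) -> weight, e.g. a node-importance value.
    """
    loss = 0.0
    for (i, j), weight in weighted_edges.items():
        loss -= weight * math.log(conditional_prob(emb, i)[j])
    return loss


toy_emb = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [-1.0, 0.0]}
# An edge to a similar node (b) costs less than an edge to a dissimilar one (c).
print(structure_loss(toy_emb, {("a", "b"): 1.0}) < structure_loss(toy_emb, {("a", "c"): 1.0}))  # True
```

Because the softmax sums over every node, the cost of one evaluation grows with the size of the network, which is the motivation for the negative-sampling approximation discussed in Section 4.4.2.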

4.3. Context Embedding
CAHNE is expected to integrate typical heterogeneous information like text features in the network. A straightforward way is to learn representations from text information of nodes and network structures independently. However, it ignores the complex interactions and associations between topological structures and heterogeneous information. To bridge this gap, we introduce context embedding to fuse information of context nodes for an anchor in the network so that we can overcome the incompatibility problem.
As shown in Figure 2, we sample context nodes for the anchor node and obtain a context node sequence when setting k as 2. In a CNS, text features of different context nodes have various impacts on the anchor node. Thus, we expect to give a weight to each context node in a CNS, and the weights can reflect the impact trend of context nodes. To this end, we introduce exponentially weighted moving average [19].
4.3.1. Exponentially Weighted Moving Average (EWMA)
Moving average (MA) is a calculation for analyzing sequential data that reflects the changing trend in the sequence. Based on MA, the exponentially weighted moving average (EWMA) applies weighting factors that decrease exponentially. Older data receive lower weights, but the weights never reach zero. The EWMA for a sequence Y can be formulated recursively, with a parameter γ that represents the degree of weight decrease; the recursion relates the current data to its EWMA value. In the BFS tree, nodes in deep layers need to be given small weights because they are farther away from the anchor node. As a result, we can attach a weight to each context node in the tree. However, the nodes in the same layer need to be sorted first. For consistency, we sort the nodes of the same layer according to their NI values. A normalized context node sequence, which consists of the sampled context nodes, can then be generated for the anchor node. Afterwards, we apply EWMA to the context nodes of the normalized sequence.
Analogously to the EWMA introduced above, we treat the EWMA value of each context node as the weight of that context node.
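As a reference point, the textbook EWMA recursion and the exponentially decaying weights it implies can be written as follows. This sketch assumes the standard form in which the first smoothed value equals the first observation and each later value is γ times the current observation plus (1 − γ) times the previous smoothed value, with 0 < γ < 1. How CAHNE maps these weights onto positions in the normalized context node sequence is not reproduced here, so `decay_weights` (a hypothetical helper) simply treats the position in the sequence as the "age" of an observation.

```python
def ewma(values, gamma):
    """Standard EWMA recursion: s_0 = y_0, s_t = gamma*y_t + (1-gamma)*s_{t-1}."""
    smoothed = []
    for y in values:
        if not smoothed:
            smoothed.append(y)
        else:
            smoothed.append(gamma * y + (1.0 - gamma) * smoothed[-1])
    return smoothed


def decay_weights(n, gamma):
    """Weight gamma*(1-gamma)**age that EWMA gives to an observation `age` steps back.

    Used here as an illustrative weight for the node at each position of a
    normalized context node sequence (an assumption, not the paper's formula).
    """
    return [gamma * (1.0 - gamma) ** age for age in range(n)]


print(ewma([2.0, 4.0, 8.0], 0.5))   # [2.0, 3.0, 5.5]
print(decay_weights(4, 0.5))        # [0.5, 0.25, 0.125, 0.0625]
```

The weights shrink geometrically but stay strictly positive, which matches the requirement that nodes in deeper layers receive smaller, yet non-zero, weights.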
4.3.2. Text Information Representation
With the development of deep learning, there are many neural network models to learn text representations, e.g., convolutional neural network (CNN) [8, 20, 21], recurrent neural network (RNN) [22], long short-term memory (LSTM) [23], and gated recurrent units (GRUs) [24]. In this paper, we investigate different Word2vec models and find that the CNN has the best performance on our tasks and can capture comprehensive semantics in the heterogeneous text network.
Figure 3 shows the framework of the process that generates the context embedding. Given a normalized context node sequence rooted at an anchor node, we take the word sequence of each node in the sequence as the input, and the CNN obtains the text embedding through three layers, i.e., encoder and looking-up, convolution, and mean-pooling. We then compute a weighted sum of the representation vectors of the anchor node and its context nodes to obtain the context embedding for the anchor node.
(1) Encoder and Looking-Up. First, we map all words in the heterogeneous text network to a sequence of word IDs. Hence, we can obtain an ID sequence for the text of each node. The looking-up layer then transforms each word into a vector whose length is the dimension of word embeddings. Finally, we can obtain an embedding sequence for the text of each node. As shown in Figure 3, after the encoder and looking-up layer, we get a matrix sequence for the normalized context node sequence.
(2) Convolution. After the encoder and looking-up layer, we use the convolution layer to extract the features of the input matrix sequence. We perform the convolution operation by sliding a kernel row by row over each matrix in the sequence, which yields a feature vector for each node whose size depends on the number of words in the text of that node; b is the bias vector.
(3) Mean-Pooling. We test different pooling strategies. To get full-scale features of the text information for a node, we perform mean-pooling to get the text embedding. We then apply a nonlinear activation function to the pooled features, which gives a vector whose length is the dimension of text embedding. At last, we obtain the embedding of the text information for each node.
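The three layers above can be sketched without any deep learning framework. The code below is a toy illustration, not the authors' network: the lookup table, the kernels, and the window size are hypothetical inputs, and tanh is assumed as the nonlinear activation because the original choice is not recoverable from the text.

```python
import math


def text_embedding(word_ids, table, kernels, bias, window):
    """Look up word vectors, convolve over word windows, mean-pool, apply tanh.

    table   : dict word_id -> word vector (looking-up layer)
    kernels : list of flat kernels, each of length window * word_dim
    bias    : one bias value per kernel
    window  : number of consecutive words covered by a kernel
    tanh is an assumed activation.
    """
    if len(word_ids) < window:
        raise ValueError("text is shorter than the convolution window")

    # (1) Looking-up: one embedding row per word.
    rows = [table[w] for w in word_ids]

    # (2) Convolution: slide the window row by row over the embedding matrix.
    features = []
    for start in range(len(rows) - window + 1):
        flat = [x for row in rows[start:start + window] for x in row]
        features.append([
            sum(k * x for k, x in zip(kernel, flat)) + b
            for kernel, b in zip(kernels, bias)
        ])

    # (3) Mean-pooling over window positions, then the nonlinearity.
    pooled = [sum(column) / len(features) for column in zip(*features)]
    return [math.tanh(p) for p in pooled]


toy_table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
toy_kernels = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
print(text_embedding([0, 1, 2], toy_table, toy_kernels, [0.0, 0.0], window=2))
```

The output has one entry per kernel, so the number of kernels determines the dimension of the text embedding regardless of how many words the text contains.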
So far, we have obtained the text embedding by the CNN for each node in a context node sequence. Following this, we compute a weighted sum of the context node embeddings, and this operation is the sum-pooling in Figure 3. The strategy of generating the context embedding for the anchor node is as follows:
Through the method stated above, we establish correlations between the anchor node and its context nodes in terms of representation vectors and maintain text relevance. Eventually, we obtain the context embedding for a given node, and the whole representation of the node is obtained by concatenating its structure-based embedding with this context embedding.
The text embedding part of the context embedding framework shown in Figure 3 looks like the convolution method of CANE. The difference is that the input of our model is the CNS of a node, while the input of CANE is a pair of nodes. In addition, we sort the nodes in the CNS according to NI and weight each node in CNS with EWMA values, as shown in equation (13).
4.3.3. Context Embedding Objective
The context embedding objective aims to measure the log-likelihood of a given directed edge as
Thus, the loss function of generating context embedding can be represented as . With above formulations, CAHNE aims to minimize the overall loss function as
At last, the workflow of the context embedding method is summarized in Algorithm 2.

4.4. Optimization of CAHNE
4.4.1. Attention for Context Node Sequence
In the context-embedding-generating strategy of equation (13), the vector representation of the anchor node is decomposed as the affinity between the anchor node and the representations of its context nodes. Intuitively, this affinity should depend on the specific anchor node. For instance, two anchor nodes in a real-world network may have different properties; as a result, they have varied intensity of affinity with their context nodes. Therefore, it is necessary to incorporate such characteristics of the anchor nodes in modeling the unique excitation effects α.
In line with the attention mechanism [25], a novel and popular model for machine translation, we define the weights between the anchor node and its context nodes with the softmax unit as follows:
Therefore, equation (13) can be reformulated as
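A softmax attention over context nodes, followed by the weighted sum that produces the context embedding, can be sketched as below. The scoring function is an assumption: the inner product between the anchor's text embedding and each context node's text embedding is used because the exact score is not recoverable from the text. How the anchor's own text embedding enters the sum is likewise left out of this sketch, and the function names are hypothetical.

```python
import math


def attention_weights(anchor_vec, context_vecs):
    """Softmax over inner-product scores between the anchor and each context node."""
    scores = [sum(a * c for a, c in zip(anchor_vec, vec)) for vec in context_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_embedding(anchor_vec, context_vecs):
    """Attention-weighted sum of the context nodes' text embeddings."""
    weights = attention_weights(anchor_vec, context_vecs)
    dim = len(anchor_vec)
    return [
        sum(w * vec[d] for w, vec in zip(weights, context_vecs))
        for d in range(dim)
    ]


anchor = [1.0, 0.0]
contexts = [[1.0, 0.0], [0.0, 1.0]]
print(attention_weights(anchor, contexts))
print(context_embedding(anchor, contexts))
```

Unlike the fixed EWMA weights, these weights change with the anchor: the context node whose text embedding is closer to the anchor's receives the larger share.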
4.4.2. Negative Sampling
For equation (8) and equation (14), CAHNE aims to maximize the conditional probability between $v_i$ and $v_j$, which is computationally expensive because of the softmax function over all nodes. To address this problem, we adopt negative sampling [26] to approximate the objective function, where $\sigma$ represents the logistic function and n is the number of randomly sampled vertices. The sampling distribution is set according to the out-degree of each vertex. At last, we adopt the Adam algorithm [27] for optimizing equation (18) and set the learning rate as 0.001.
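The negative-sampling approximation can be sketched as follows. The form used here is the one popularized by Word2vec and LINE: the log-sigmoid of the positive pair's inner product plus the log-sigmoid of the negated inner product for each sampled negative. The noise distribution proportional to out-degree raised to the power 3/4 is an assumption taken from those methods, since the exact setting is not recoverable from the text; the function names are hypothetical.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log of the logistic function."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neg_sampling_objective(u_i, u_j, negatives):
    """log sigma(u_j . u_i) + sum over negatives of log sigma(-u_k . u_i)."""
    value = log_sigmoid(dot(u_j, u_i))
    for u_k in negatives:
        value += log_sigmoid(-dot(u_k, u_i))
    return value


def sample_negatives(out_degree, n, rng):
    """Draw n nodes with probability proportional to out_degree ** 0.75 (assumed)."""
    nodes = list(out_degree)
    weights = [out_degree[v] ** 0.75 for v in nodes]
    return rng.choices(nodes, weights=weights, k=n)


rng = random.Random(0)
print(sample_negatives({"a": 3, "b": 1, "c": 0}, 5, rng))
print(neg_sampling_objective([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]]))
```

Each evaluation touches only the positive pair and n sampled nodes, so its cost no longer depends on the total number of nodes in the network.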
5. Experiment
In this section, we empirically evaluate the performance of the proposed framework CAHNE.
5.1. Dataset Descriptions
In order to comprehensively evaluate the effectiveness of our model CAHNE, we use seven real-world datasets, including two social networks, two citation networks, one language network, one co-occurrence network, and one communication network, for four applications, i.e., network reconstruction, link prediction, node classification, and visualization. The detailed descriptions are listed as follows:
(i) Zhihu [28] is a social network drawn from an online Q&A platform in China. Users follow each other, asking and answering questions on Zhihu. The text information is the topics each user is concerned with, expressed as full text. We filter out 10000 users from Zhihu who have information on concerned topics. The size of the vocabulary is 9035, and the average length of the text is 89. We evaluate this dataset on the link prediction task.
(ii) HEPTH [8] is a citation network from arXiv. After filtering out the papers without an abstract, 1038 papers are preserved. The text information is expressed as full text. The size of the vocabulary is 2970, and the average length of the text is 54. We evaluate this dataset on the link prediction task.
(iii) Cora (https://linqs.soe.ucsc.edu/data) is also a citation network, containing 2708 machine learning papers with text information, each classified into one of seven classes. The citation network consists of 5429 links. The text information is expressed as full text. The size of the vocabulary is 16426, and the average length of the text is 88. Cora is used for the link prediction task and the node classification task.
(iv) BlogCatalog (http://leitang.net/social_dimension.html) is a large social network of online users listed on the BlogCatalog website. There are 39 different categories of labels for this dataset, and each label represents the metadata provided by a user. Since this dataset does not contain text information, it is evaluated on the node classification task and the network reconstruction task for CAHNE (without context embedding).
(v) Wikipedia [29] is a co-occurrence network which contains 2045 nodes, 17981 edges, and 19 different labels. The TF-IDF matrix of the Wikipedia dataset describes the text information for this dataset. There are 4973 columns that correspond to 4973 different words. This dataset is evaluated on the node classification task.
(vi) 20NewsGroup (http://qwone.com/~jason/20Newsgroups/) is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. We choose the news documents labeled as comp.graphics, rec.sport.baseball, and talk.politics.guns to evaluate our model on the visualization task. There are 1720 pieces of news, expressed as full text. The size of the vocabulary is 30127, and the average length of the text is 206. Besides, 20NewsGroup is a weighted network.
(vii) EmailEnron (https://snap.stanford.edu/data/email-Enron.html) is a communication network that covers the email communication within a dataset. Nodes are email addresses, and edges denote interactions between emails. Text descriptions of this dataset are full email message text. The size of the vocabulary is 29523, and the average length of the text is 149. We filter 6820 nodes and 23968 edges from the original dataset.
The detailed statistics are summarized in Table 1.

5.2. Baselines
We consider the following eight NE methods as baselines to demonstrate the effectiveness and robustness of CAHNE:
(i) DeepWalk [4]: it adopts truncated random walks and the Skip-Gram model to learn node representations.
(ii) LINE [5]: it preserves the first-order and second-order proximity among nodes in the network.
(iii) Node2vec [6]: it proposes a biased random walk based on DeepWalk to learn node representations.
(iv) GraRep [30]: it integrates global structural information of the graph and uses SVD to train the model.
(v) Naive Combination: we directly concatenate the text feature embeddings learned by the CNN and the node representations learned by LINE. We choose LINE to learn the structure embedding because it exploits both first-order and second-order proximity in the network, which is more comprehensive than DeepWalk and Node2vec.
(vi) TADW [29]: it integrates text features into network embedding by employing matrix factorization.
(vii) TENE [31]: it learns the representations of nodes under the guidance of both the proximity matrix, which captures the network structure, and the text cluster membership matrix, which is derived from clustering the text information.
(viii) ASNE [32]: it learns representations of nodes by preserving both the structural proximity and the attribute (text) proximity.
5.3. Experimental Settings
To be fair, we set the embedding dimension for all methods on HEPTH, Cora, EmailEnron, and 20NewsGroup. And for Zhihu, BlogCatalog, and Wikipedia, we set . For DeepWalk, we set the window size as 10, the walk length as 80, and the number of walks for each node as 10. For LINE, we set the learning rate as 0.001 and the number of negative samples as 5. For Node2vec, we choose the hyperparameters p and q to obtain the best performance by grid search. For GraRep, we set the maximum matrix transition step as 5. For TENE, we set the parameter of the contribution of text information and the parameter β to guarantee the accuracy of the text cluster membership matrix as .
For our model CAHNE, we set the number of negative samples as 5 to speed up the training process. Besides, we set and for all datasets. Hereinafter, “CAHNEa” denotes our method with the attention mechanism, and “CAHNE(w/o context)” denotes CAHNE without context embedding.
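To make the DeepWalk settings above concrete, the following is a minimal Python sketch of truncated random-walk generation with walk length 80 and 10 walks per node. The function name, the adjacency-list input format, and the toy graph are assumptions made for illustration; this is not the reference DeepWalk implementation, and the Skip-Gram training step (window size 10) that would consume these walks is omitted.

```python
import random


def truncated_random_walks(adjacency, walk_length=80, walks_per_node=10, seed=0):
    """Generate uniform truncated random walks from every node.

    adjacency: dict mapping node -> list of neighbor nodes (assumed format).
    Returns a list of walks, each a list of nodes of length <= walk_length.
    """
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: stop the walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical toy graph with four nodes.
toy_adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = truncated_random_walks(toy_adjacency)
```

Each walk plays the role of a sentence, and each node the role of a word, when the walks are passed to a Skip-Gram model.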
5.4. Network Reconstruction
Reconstructing the network and preserving the original network structure are fundamental objectives for network embedding methods. Specifically, we train an NE method to obtain vector representations of nodes and rank node pairs by the inner product similarity of their representations. Since a larger similarity means a higher probability that an edge exists between the pair, the top-ranked node pairs are used to reconstruct the network. The precision@k [33] is used as the evaluation metric, which is formulated as
$$\text{precision@}k = \frac{1}{k}\sum_{i=1}^{k}\xi_i,$$
where $k$ is the number of evaluated node pairs and $\xi_i$ is a binary variable: $\xi_i = 1$ denotes that the $i$th reconstructed pair of nodes is correct; otherwise, $\xi_i = 0$.
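As an illustration of this evaluation, the sketch below ranks all node pairs by inner product and reports precision@k, i.e., the fraction of the top-k ranked pairs that are real edges. The function name, the dictionary-based embedding format, and the toy data are assumptions introduced here; enumerating every pair is only practical for small networks.

```python
from itertools import combinations


def precision_at_k(embeddings, edges, ks):
    """Rank all node pairs by inner product and score the top-k pairs.

    embeddings: dict mapping node -> list of floats (assumed input format).
    edges: iterable of (u, v) pairs of the original undirected network.
    ks: iterable of cut-off values k.
    """
    true_edges = {frozenset(e) for e in edges}

    def inner(u, v):
        return sum(a * b for a, b in zip(embeddings[u], embeddings[v]))

    # A larger inner product is treated as a higher chance that an edge exists.
    ranked = sorted(combinations(sorted(embeddings), 2),
                    key=lambda pair: inner(*pair), reverse=True)

    # xi[i] is 1 if the i-th ranked pair is a real edge, else 0.
    xi = [1 if frozenset(pair) in true_edges else 0 for pair in ranked]
    return {k: sum(xi[:k]) / k for k in ks}


# Hypothetical toy embeddings: a and b are close, c and d are close.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
}
toy_edges = [("a", "b"), ("c", "d")]
scores = precision_at_k(toy_embeddings, toy_edges, ks=[1, 2, 4])
```

In this toy case the two real edges occupy the top two ranks, so the precision is 1.0 at k = 1 and k = 2 and drops to 0.5 at k = 4.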
We use the real-world social network BlogCatalog and the communication network EmailEnron as representatives. The precision@k results are shown in Figure 4, from which we make the following observations:
(i) Figure 4 shows that the precision@k of CAHNE is higher than that of the other methods for almost all values of k, which verifies that CAHNE preserves the network structure well.
(ii) Because there is no text information in BlogCatalog, Figure 4(a) clearly reveals that using node importance to weight edges is effective.
(iii) Figure 4(b) shows that our method has comparable performance on EmailEnron. Methods that integrate text information are clearly better than the other methods, and CAHNEa ranks relatively high among them.
Figure 4: precision@k on network reconstruction: (a) BlogCatalog; (b) EmailEnron.
From the above observations, we conclude that CAHNE and its extension CAHNEa achieve a clear improvement on the task of network reconstruction.
5.5. Link Prediction
For link prediction, we use AUC [34] to evaluate the performance, which is the probability that the nodes of a randomly chosen existing edge are scored as more similar than those of a randomly chosen nonexistent edge. In this task, as shown in Tables 2–4, we randomly hide a percentage of the edges, ranging from 85% to 5%, on HEPTH, Cora, and Zhihu and use the remaining graph for training. We use logistic regression to predict the probability that a given pair of nodes has an edge between them.
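The protocol above can be sketched in Python as follows. The helper names, the toy embeddings, and the use of a plain inner product as the edge score are assumptions for illustration; in the experiments the score comes from a logistic regression classifier, which is not reproduced here. AUC is computed exhaustively as the fraction of (hidden edge, nonexistent edge) pairs in which the hidden edge receives the higher score, with ties counted as one half.

```python
import random


def hide_edges(edges, train_ratio, seed=0):
    """Randomly split edges into a training set and a hidden (test) set."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_train = int(round(train_ratio * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


def link_prediction_auc(score, hidden_edges, non_edges):
    """Fraction of (hidden edge, non-edge) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for pos in hidden_edges:
        for neg in non_edges:
            s_pos, s_neg = score(*pos), score(*neg)
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(hidden_edges) * len(non_edges))


# Hypothetical embeddings; the inner product stands in for a learned classifier.
emb = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}


def inner_product_score(u, v):
    return sum(x * y for x, y in zip(emb[u], emb[v]))


auc = link_prediction_auc(
    inner_product_score,
    hidden_edges=[("a", "b"), ("b", "c")],
    non_edges=[("a", "c"), ("b", "d")],
)

# Keeping 15% of 20 toy edges for training hides the other 85%.
train, hidden = hide_edges([(i, i + 1) for i in range(20)], train_ratio=0.15)
```

Here three of the four (hidden edge, non-edge) comparisons are ranked correctly, so the AUC is 0.75. For large networks, the exhaustive double loop would normally be replaced by random sampling of pairs.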



From these tables, we make the following observations:
(i) The fewer the training edges, the more nodes are ignored and the lower the performance of all methods. The results on Zhihu are worse than those on the other datasets, probably because real-world social networks carry more complex structural and attribute information than citation networks. Nevertheless, CAHNEa achieves the best performance among all methods on every dataset. In particular, when the ratio of training edges reaches 95% on Cora and HEPTH, the AUC values of CAHNEa are higher than 95.
(ii) CAHNE(w/o context) performs better than the other structure-only methods (DeepWalk, LINE, Node2vec, and GraRep). This demonstrates that incorporating node importance when learning network representations is effective and improves the prediction of new links.
(iii) TADW, TENE, ASNE, and CAHNE perform better than all structure-only methods, which supports our assumption that text information cannot be neglected in heterogeneous text networks. However, CAHNE does not always outperform TADW, e.g., at the 15% training ratio in Table 2 and in Table 3. This occurs only when the training ratio is under 35%, which we attribute to the context node sequence (CNS) failing to contain most context nodes of the anchor node when the training ratio is too low; an overly incomplete CNS loses much of the context information. Table 5 shows the average length of CNSs when different ratios of edges are extracted as training sets in the three datasets. The completeness of the CNSs thus affects the effectiveness of CAHNE.

Thus, the results in Tables 2–4 serve as evidence that CAHNEa performs stably and best across all datasets and training ratios. This demonstrates the flexibility and robustness of CAHNE and shows that the attention mechanism is important when learning representations for real-world networks.
5.6. Node Classification
For this task, we choose BlogCatalog, Cora, and Wikipedia as datasets, in which each node is assigned a label. Given the node embeddings obtained by different NE methods as node features, we train a logistic regression classifier to predict the node labels. We use Macro-F1 and Micro-F1 to evaluate the performance. We vary the size of the training set from 50% to 90%, and the remaining nodes form the testing set. We repeat each classification experiment ten times and report the average Macro-F1 and Micro-F1 scores. The results on BlogCatalog, Cora, and Wikipedia are shown and compared in Figure 5. Since BlogCatalog has no text information, we only consider CAHNE(w/o context) on this dataset.
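For reference, the two measures can be computed as in the following minimal Python sketch. It assumes the single-label multi-class setting, in which each node has exactly one true and one predicted label; the function name and the toy labels are hypothetical. Macro-F1 averages the per-class F1 scores, whereas Micro-F1 pools the true positives, false positives, and false negatives of all classes before computing F1.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p gains a false positive
            fn[t] += 1  # true class t gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical labels for six test nodes in three classes.
macro_f1, micro_f1 = f1_scores([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

In the single-label case Micro-F1 coincides with accuracy (4 of 6 nodes correct in the toy example), while Macro-F1 gives every class equal weight regardless of its size.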
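Figure 5: Macro-F1 and Micro-F1 scores for node classification on BlogCatalog, Cora, and Wikipedia (panels (a)–(c)).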
From the results, we obtain the following observations:
(i) The performance on BlogCatalog is worse than that on the other datasets because of the complexity of social networks and because BlogCatalog has the most nodes, which may make classification harder. Nevertheless, our proposed model CAHNE(w/o context) still obtains the most satisfactory results.
(ii) Among the structure-only methods, CAHNE(w/o context) is the most effective on all datasets. This demonstrates that network representations that incorporate node importance generalize better to the classification task.
(iii) CAHNE(w/o context) performs better than CAHNE and CAHNEa on Wikipedia as measured by Macro-F1, which indicates that this dataset is not sensitive to text information. We believe this is because the text descriptions of different entries vary widely.
5.7. Visualization
Another intuitive way to investigate the quality of network embedding methods is visualization. In this experiment, we reduce the dimensionality of each representation vector to 2. There are many ways to visualize high-dimensional vectors, e.g., PCA [35], Isomap [9], and t-SNE [36]. In this paper, we adopt t-SNE for dimension reduction because it can preserve both local and global structures of the data. We therefore use the baselines and our method CAHNEa to learn representations of the 20NewsGroup network and input them into t-SNE. Since the 20NewsGroup graph over all categories is fully connected, we simplify the computation and improve the visualization by selecting three categories of news and their documents, comp.graphics, rec.sport.baseball, and talk.politics.guns, as our training set.
The resulting visualizations of the baselines and CAHNEa are illustrated in Figure 6, from which we have the following observations:
(i) For DeepWalk and GraRep, the points of different categories are chaotic and mixed with each other. The network is weighted, but DeepWalk cannot take edge weights into account during its random walks, which leads to the chaotic result. GraRep integrates the weights of edges into representation learning by using ESGNS, which is unable to capture the nonlinear relationships between nodes.
(ii) For LINE, ASNE, TENE, and Naive Combination, the clusters can be found intuitively, but the boundary of each category is not clear.
(iii) For Node2vec, the three categories can be distinguished more clearly than for LINE because of the larger space between clusters. However, the lower parts of these clusters are not separable.
(iv) For TADW, the shapes of the clusters are not regular, and the blue points do not gather together.
Figure 6: t-SNE visualizations of the 20NewsGroup network for the baselines and CAHNEa (panels (a)–(i)).
By contrast, the visualization produced by our model CAHNEa has clear boundaries, and the shapes of its clusters are more regular than those produced by the baselines.
6. Conclusions
In this paper, we propose CAHNE, a novel method to learn node representations for heterogeneous networks. By formulating the context node sequence for each node in a real-world network and redefining the conventional network to integrate text information, CAHNE learns node embeddings that capture comprehensive semantic information while maintaining the compatibility between network structure and text information. For unweighted networks, we analyze the strength of the relationship between nodes and propose the definition of node importance to quantify it as the weight between nodes. We integrate node importance into the learning process of the structure-based embedding to explore the potential structural information in the network. Furthermore, by applying an attention mechanism to the influence rate of the context nodes, CAHNE can decide the degree of influence of the context nodes for different anchor nodes. Extensive experiments show the competitiveness of CAHNE against the baselines and demonstrate its flexibility, stability, and robustness. Future work includes incorporating more types of heterogeneous information, such as attributes of nodes and edges, and optimizing the training process on larger networks.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Social Science Foundation of China under Grant 17CG209 and the National Natural Science Foundation of China under Grant 61872166. The work was also supported in part by the Natural Science Foundation of Jiangsu Province under Grant BK20180600 and Fundamental Research Funds for the Central Universities under Grant JUSRP11852.
References
[1] L. Lü and T. Zhou, “Link prediction in complex networks: a survey,” Physica A: Statistical Mechanics and Its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.
[2] C. Li, Z. Li, S. Wang, Y. Yang, X. Zhang, and J. Zhou, “Semi-supervised network embedding,” in Proceedings of the International Conference on Database Systems for Advanced Applications, pp. 131–147, Springer, Suzhou, China, March 2017.
[3] Y. Jiang, F.-L. Chung, S. Wang, Z. Deng, J. Wang, and P. Qian, “Collaborative fuzzy clustering from multiple weighted views,” IEEE Transactions on Cybernetics, vol. 45, no. 4, pp. 688–701, 2015.
[4] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: online learning of social representations,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710, ACM, New York, NY, USA, August 2014.
[5] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077, International World Wide Web Conferences Steering Committee, Florence, Italy, May 2015.
[6] A. Grover and J. Leskovec, “Node2vec: scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864, ACM, San Francisco, CA, USA, August 2016.
[7] C. Li, S. Wang, D. Yang et al., “PPNE: property preserving network embedding,” in Proceedings of the International Conference on Database Systems for Advanced Applications, pp. 163–179, Springer, Suzhou, China, March 2017.
[8] C. Tu, H. Liu, Z. Liu, and M. Sun, “CANE: context-aware network embedding for relation modeling,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 1722–1731, Vancouver, BC, Canada, August 2017.
[9] M. Balasubramanian and E. L. Schwartz, “The isomap algorithm and topological stability,” Science, vol. 295, no. 5552, p. 7a, 2002.
[10] T. F. Cox and M. A. A. Cox, Multidimensional Scaling, Chapman and Hall/CRC, Boca Raton, FL, USA, 2000.
[11] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 585–591, Vancouver, Canada, December 2002.
[12] Y. Goldberg and O. Levy, “Word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method,” 2014, http://arxiv.org/abs/1402.3722.
[13] T.-y. Fu, W.-C. Lee, and Z. Lei, “Hin2vec: explore meta-paths in heterogeneous information networks for representation learning,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1797–1806, Singapore, November 2017.
[14] Z. Zhang, H. Yang, J. Bu et al., “Attributed network representation learning via deep neural networks,” in Proceedings of the International Joint Conference on Artificial Intelligence, vol. 18, pp. 3155–3161, Stockholm, Sweden, July 2018.
[15] S. Ganguly and V. Pudi, “Paper2vec: combining graph and text information for scientific paper representation,” in Proceedings of the European Conference on Information Retrieval, pp. 383–395, Springer, Aberdeen, Scotland, April 2017.
[16] F. B. Viégas and J. Donath, “Social network visualization: can we go beyond the graph,” in Proceedings of the CSCW Workshop on Social Networks, vol. 4, pp. 6–10, Chicago, IL, USA, November 2004.
[17] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: bringing order to the web,” Stanford InfoLab, Stanford, CA, USA, 1999, Technical report.
[18] C. Yang, M. Sun, Z. Liu, and C. Tu, “Fast network embedding enhancement via high order proximity approximation,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 3894–3900, Melbourne, Australia, August 2017.
[19] J. M. Lucas and M. S. Saccucci, “Exponentially weighted moving average control schemes: properties and enhancements,” Technometrics, vol. 32, no. 1, pp. 1–12, 1990.
[20] Y. Kim, “Convolutional neural networks for sentence classification,” 2014, http://arxiv.org/abs/1408.5882.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 1097–1105, Lake Tahoe, NV, USA, December 2012.
[22] D. P. Mandic and J. A. Chambers, Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability, John Wiley & Sons, Hoboken, NJ, USA, 2001.
[23] M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural networks for language modeling,” in Proceedings of the Thirteenth Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, September 2012.
[24] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” 2014, http://arxiv.org/abs/1412.3555.
[25] A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 5998–6008, Long Beach, CA, USA, December 2017.
[26] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 3111–3119, Lake Tahoe, NV, USA, December 2013.
[27] D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” 2014, http://arxiv.org/abs/1412.6980.
[28] X. Sun, J. Guo, X. Ding, and T. Liu, “A general framework for content-enhanced network representation learning,” 2016, http://arxiv.org/abs/1610.02906.
[29] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Chang, “Network representation learning with rich text information,” in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, pp. 2111–2117, Buenos Aires, Argentina, July 2015.
[30] S. Cao, W. Lu, and Q. Xu, “GraRep: learning graph representations with global structural information,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 891–900, Melbourne, Australia, October 2015.
[31] S. Yang and B. Yang, “Enhanced network embedding with text information,” in Proceedings of the 24th International Conference on Pattern Recognition (ICPR), pp. 326–331, IEEE, Beijing, China, August 2018.
[32] L. Liao, X. He, H. Zhang, and T.-S. Chua, “Attributed social network embedding,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 12, pp. 2257–2270, 2018.
[33] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234, ACM, San Francisco, CA, USA, August 2016.
[34] J. M. Lobo, A. Jiménez-Valverde, and R. Real, “AUC: a misleading measure of the performance of predictive distribution models,” Global Ecology and Biogeography, vol. 17, no. 2, pp. 145–151, 2008.
[35] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and Intelligent Laboratory Systems, vol. 2, no. 1–3, pp. 37–52, 1987.
[36] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
Copyright
Copyright © 2019 Wei Zhuo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.