Abstract

In this paper, we propose a novel semisupervised learning framework to learn the flows of the edges of a graph. Given the flow values of the labeled edges, the task is to learn the unknown flow values of the remaining unlabeled edges. To this end, we introduce a value amount held by each node and require that the amount of value flowing through the edges incident to each node be consistent with the node's own value. We propose to embed the nodes into a continuous vector space so that the embedding vector of each node can be reconstructed from those of its neighbors by a recursive neural network model, the linear normalized long short-term memory (ln-LSTM). Moreover, we argue that the value of each node is also embedded in the embedding vectors of its neighbors, and thus propose to approximate the node value from the output of the neighborhood recursive network. We build a unified learning framework by formulating a minimization problem composed of three subproblems: (1) minimizing the embedding reconstruction error of each node from the recursive network, (2) minimizing the loss of the reconstruction of the value amount of each node, and (3) minimizing the difference between the value amount of each node and the value estimated from the edge flows. We develop an iterative algorithm to learn the node embeddings, edge flows, and node values jointly. We conduct experiments on several network datasets, including a transportation network and an innovation network. The experimental results indicate that our algorithm is more effective than the state-of-the-art methods.

1. Introduction

1.1. Background

Learning the flow direction and amount of the edges of a network has been a critical problem in network analysis. An edge connects two nodes, and the flow is defined from one node to the other connected by the corresponding edge [13]. This problem is called edge flow estimation. In this problem, we already know the network structure, including the set of nodes and the set of edges between the nodes. We also know the directions and amounts of the flows of some edges, but for the remaining edges, the flows are still unknown. The target is to predict the flow directions and amounts of these edges. Some examples of the applications of edge flow estimation are given as follows:

(i) In a transportation network, each intersection is a node and each road is an edge, while the flows along the roads need to be estimated for the purpose of traffic control. According to the historical data of the traffic flows, we know the flows of some roads; however, the flows of the other roads need to be estimated. This use case raises the problem of learning edge flows from both the traffic network and the existing road flows of the network [4, 5].

(ii) Another example is innovation network analysis, where each research article is a node and each citation is an edge between two articles. For the task of innovation analysis, it is important to know how knowledge flows from one article to other articles. Please note that the fact that one article cites another does not necessarily mean there is an amount of “knowledge” flowing from the cited article to the citing article. By reading the content of the citing article, we can annotate the “knowledge flow” if one work is inspired by, or is a follow-up of, the other research works. However, such annotation by reading is time-consuming and subject to individual annotators; thus, it is very necessary to develop an automatic system to estimate the knowledge flow from the article citation network [6–14].

Although edge flow prediction/estimation is such a critical problem, the works in this direction are very few. Most recently, Jia et al. [1] designed a new algorithm to predict the edge flows by balancing the amount of flow moving into and out of the nodes. Another condition of this algorithm is that the predicted flows should be consistent with the known flows for those edges whose flows are already known. In this paper, we propose a novel edge flow estimation algorithm based on both semisupervised learning and network embedding.

1.2. Our Contributions

In this paper, we build a novel method for the problem of edge flow prediction. This method performs both network embedding and edge flow prediction, and the two problems are solved at the same time. We design a new framework for the learning problem. In this framework, for the first time, we bridge node embedding and edge flow estimation by introducing a node value for each node. On the one side, the node value is used to balance the flowing-in and flowing-out amounts of a node. It plays the role of measuring the balance of the amount at the node at any moment of the flowing process, i.e., with the incoming and outgoing flows changing, the amount of value in the node remains equal to the node value. On the other side, the node value is employed to regularize the learning of the embedding vectors; thus, we impose that the node value can be estimated from the embedding vector by a linear function.

We propose a novel algorithm to learn the embedding vectors and edge flows simultaneously. We model the learning problem as a minimization problem where the embedding vectors reconstruction error of the LSTM embedding model, the node value estimation error from the embedding vectors, and the node value flow amount error are minimized. In the iterative algorithm, the node embeddings, node values, and the edge flow amounts are optimized alternately.

We evaluated the proposed method over benchmark datasets of networks, conducted experiments to reveal the properties of the proposed algorithms, and show its advantage over state-of-the-art methods.

Remark: the superiority of the proposed semisupervised edge flow learning method compared with traditional semisupervised learning methods is as follows:

(1) Traditional semisupervised learning methods can only predict the labels of the nodes but are not able to predict the flows of the edges. For the applications discussed in this paper, the traditional semisupervised learning methods are therefore not suitable, while our method is especially designed for these applications.

(2) Traditional semisupervised learning methods can only use the edge information of a graph but ignore the flow information, which is critical for both node and flow label prediction. In contrast, our method can effectively use both to learn better node embeddings and flow amounts.

1.3. Paper Organization

This paper is organized as follows. In Section 2, we introduce the joint learning framework, with its objective and optimization solution. In Section 3, we experimentally evaluate the performance of the proposed method and study its properties. In Section 4, we conclude this paper with some directions for future work.

2. Proposed Method

In this section, we introduce the proposed joint network embedding and edge flow learning framework. In this framework, we propose a deep learning-based network embedding method [15–23] and further use the embedding vectors to estimate the flow amount regarding each node [15]. The flow amount is then used to regularize the edge flow learning process.

2.1. Problem Definition

Suppose we have an input network, denoted as $G = (V, E)$, where $V = \{v_1, \ldots, v_n\}$ is the set of $n$ nodes, and $e_{ij} \in E$ is an edge linking the $i$-th and $j$-th nodes. Here, we assume that $e_{ij} = e_{ji}$. For a group of edges, we already know their flows; this set of edges is denoted as $E_L$. For such an edge $e_{ij} \in E_L$, we define its flow as $f_{ij}$. The direction of the flow is the sign of $f_{ij}$, and the amount of the flow is the absolute value $|f_{ij}|$. For the other edges, the flows are unknown, and we want to predict them; we denote this set of edges as $E_U$. We define a vector $f \in \mathbb{R}^{|E|}$ that collects the flows of all the edges. Thus, the prediction of the flows of the edges is transformed into the problem of solving for $f$.
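To make the setup concrete, here is a minimal NumPy sketch of how the network, the labeled edge set, and the flow vector might be represented; the graph, flow values, and all variable names are illustrative, not from the paper.

```python
import numpy as np

# A small undirected network: 4 nodes, 5 edges.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]   # edge (i, j) with i < j

# Flows of the labeled edges E_L: the sign encodes the direction (from i to j
# if positive) and the absolute value the amount.  The rest belong to E_U.
labeled_flows = {(0, 1): 2.0, (1, 3): -1.5}

# The flow vector f collects one entry per edge; the unlabeled entries are
# unknown (initialized to 0 here) and are what the algorithm must solve for.
f = np.zeros(len(edges))
labeled_mask = np.zeros(len(edges), dtype=bool)
for k, e in enumerate(edges):
    if e in labeled_flows:
        f[k] = labeled_flows[e]
        labeled_mask[k] = True

print(f)             # [ 2.   0.   0.  -1.5  0. ]
print(labeled_mask)  # [ True False False  True False]
```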

2.2. Problem Modeling

We first embed each node into a vector, whose dimension is denoted as $d$. Moreover, for each node, we define a node value, $\phi_i$. We also propose to calculate the node value from the node's embedding vector. After $\phi_i$ is given, we use it to regularize the estimation of the flows of the edges connected to the node. The flow chart of the proposed semisupervised edge flow learning method is given in Figure 1.

2.2.1. Recursive Node Embedding

The network embedding converts the nodes to a set of vectors, denoted as $X = \{x_1, \ldots, x_n\}$, where $x_i \in \mathbb{R}^d$. To this end, we want to reconstruct the embedding vector of one node from those of its linked nodes, which are presented as a sequence of nodes. Accordingly, the set of neighbouring nodes of the $i$-th node is given as

$$N_i = \left\{ v_j \mid e_{ij} \in E \right\}.$$

We firstly sort the nodes in the neighbouring set, $N_i$, and the sorted set's embedding vector sequence is

$$\left( x_{i_1}, \ldots, x_{i_{|N_i|}} \right),$$

where $i_1, \ldots, i_{|N_i|}$ are the indices of the sorted neighbouring nodes.

We apply an ln-LSTM to this sequence of embedding vectors. In this model, for an input node's embedding vector, $x_{i_k}$, the inputs of the cell are $x_{i_k}$ itself and the output of the previous step,

$$h_{i_k} = g\left( x_{i_k}, h_{i_{k-1}}; \theta \right),$$

where $g$ is the cell function of the ln-LSTM and $\theta$ is its parameter.

Then, we use the output of the cell function at the last node to approximate the embedding vector of the $i$-th node itself,

$$x_i \approx h_{i_{|N_i|}}.$$

Thus, the minimization of the approximation error is modelled as

$$\min_{X, \theta} \sum_{i=1}^{n} \left\| x_i - h_{i_{|N_i|}} \right\|_2^2.$$

By solving this problem, the learning of the embedding vectors and the parameter of the cell function are performed together. With good-quality embeddings and cell function parameters, we should be able to approximate the embedding vector of each node from the embedding vectors of its neighbouring nodes.
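To make the recursive reconstruction concrete, here is a minimal NumPy sketch in which a standard LSTM cell stands in for the ln-LSTM cell function (the exact cell structure, the dimension d, and all variable names are assumptions): the cell consumes the sorted neighbour embeddings one at a time, and its final hidden state serves as the approximation of the node's own embedding vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # embedding dimension (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Cell parameters theta: one weight matrix per gate, acting on [x; h].
theta = {gate: rng.normal(scale=0.1, size=(d, 2 * d)) for gate in "ifoc"}

def lstm_cell(x, h, c, theta):
    """One step of a plain LSTM cell (stand-in for the ln-LSTM cell g)."""
    z = np.concatenate([x, h])
    i = sigmoid(theta["i"] @ z)              # input gate
    f = sigmoid(theta["f"] @ z)              # forget gate
    o = sigmoid(theta["o"] @ z)              # output gate
    c = f * c + i * np.tanh(theta["c"] @ z)  # cell state update
    return o * np.tanh(c), c                 # new hidden state, new cell state

# Sorted embeddings of the neighbours of node i; the final hidden state
# approximates x_i itself.
neighbor_embeddings = [rng.normal(size=d) for _ in range(3)]
h, c = np.zeros(d), np.zeros(d)
for x_k in neighbor_embeddings:
    h, c = lstm_cell(x_k, h, c, theta)

x_i = rng.normal(size=d)
reconstruction_error = np.sum((x_i - h) ** 2)  # one term of the reconstruction objective
print(reconstruction_error)
```

Training would minimize this error over all nodes with respect to both the embeddings and the cell parameters; here only a single forward pass is shown.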

2.2.2. Node Value Estimation from Recursive Embedding

In our learning framework, the embedding of each node approximated by the recursive model plays two roles: representing the node's neighbourhood structure and estimating the amount of value held by the node. We define a node value for the $i$-th node, $\phi_i$. The function of this value is to balance the incoming and outgoing flows of the node. Moreover, it can, to some degree, measure the nature of the node. For example, a node in a traffic network may have more incoming flow during working hours because it is in an office area; the node value can indicate this nature. Moreover, we want to use the recursive approximation of the node's embedding vector to calculate the value as follows:

$$\phi_i \approx \psi\left( h_{i_{|N_i|}}; w \right),$$

where $\psi$ is the function of node value approximation and $w$ is its parameter. This function is actually a single-layer neural network. We propose to minimize the squared error of the approximation over all the nodes,

$$\min_{\phi, X, \theta, w} \sum_{i=1}^{n} \left( \phi_i - \psi\left( h_{i_{|N_i|}}; w \right) \right)^2.$$

The minimization is performed with respect to the node values, the embedding vectors, the recursive model parameter, and the single-layer neural network parameter. In this way, we bridge the learning of the flow amounts and the embedding of the nodes by using the LSTM recursive model.
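As a minimal illustration (not the paper's implementation), the single-layer network and the squared node-value loss can be sketched as follows; the tanh activation and all names are assumptions, and the rows of H stand for the recursive approximations of the node embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 6

H = rng.normal(size=(n, d))    # recursive embedding outputs, one row per node
phi = rng.normal(size=n)       # node values, one per node

# psi: a single-layer network mapping an embedding to a scalar node value.
w, b = rng.normal(size=d), 0.0

def psi(h, w, b):
    return np.tanh(h @ w + b)  # tanh activation is an assumption

# Squared approximation error over all nodes (the node-value loss term).
loss = np.sum((phi - psi(H, w, b)) ** 2)
print(loss)
```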

2.2.3. Flow Prediction by Using Node Value

Given an edge $e_{ij} \in E_U$, we want to predict its flow $f_{ij}$. To this end, we use the node values as references. Our assumption is that, for any node, the value of the node controls the incoming and outgoing flows. To be more specific, for a node, the difference between its incoming flow and its outgoing flow should be equal to the node value itself. For this purpose, we define two sets of edges for the $i$-th node, the incoming set $E_i^{in}$ and the outgoing set $E_i^{out}$. The net incoming flow amount is $\sum_{e_{ji} \in E_i^{in}} f_{ji} - \sum_{e_{ij} \in E_i^{out}} f_{ij}$. Since we hope this amount is equal to the node value,

$$\sum_{e_{ji} \in E_i^{in}} f_{ji} - \sum_{e_{ij} \in E_i^{out}} f_{ij} = \phi_i, \quad i = 1, \ldots, n.$$

We also collect the node values of all nodes in a vector $\phi = \left[ \phi_1, \ldots, \phi_n \right]^\top \in \mathbb{R}^n$. A matrix $B \in \mathbb{R}^{n \times |E|}$ is also introduced, whose entry for a node and an edge is $+1$ if the edge flows into the node, $-1$ if it flows out of the node, and $0$ otherwise. It is the node-flow mapping matrix. So, equation (8) is transformed to

$$Bf = \phi.$$
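The node-flow mapping matrix can be materialized as a signed node-edge incidence matrix; the sketch below assumes the convention that a positive flow on edge (i, j) runs from i to j, so it counts as outgoing (-1) at i and incoming (+1) at j.

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n = 4

# Signed node-edge incidence matrix B: for edge k = (i, j), a positive flow
# f_k runs out of node i (-1) and into node j (+1).
B = np.zeros((n, len(edges)))
for k, (i, j) in enumerate(edges):
    B[i, k] = -1.0
    B[j, k] = +1.0

f = np.array([2.0, 0.5, 1.0, -1.5, 0.0])   # illustrative flow vector
net_in = B @ f                              # incoming minus outgoing flow per node
print(net_in)                               # [-2.5  2.5  1.5 -1.5]
```

The vector B f is exactly the left-hand side of the per-node balance condition, so requiring B f = phi enforces the node values as net-flow targets.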

We minimize the squared approximation error with respect to both the edge flow vector and the node value vector,

$$\min_{f, \phi} \left\| Bf - \phi \right\|_2^2, \quad \text{s.t.} \; f_{ij} = \hat{f}_{ij}, \; \forall e_{ij} \in E_L,$$

where $\hat{f}_{ij}$ denotes the known flow of a labeled edge.

The final objective and minimization problem is the combination of (5), (7), and (10) as follows:

$$\min_{X, \theta, w, f, \phi} \sum_{i=1}^{n} \left\| x_i - h_{i_{|N_i|}} \right\|_2^2 + \sum_{i=1}^{n} \left( \phi_i - \psi\left( h_{i_{|N_i|}}; w \right) \right)^2 + \left\| Bf - \phi \right\|_2^2 + \lambda \left( \|\theta\|_2^2 + \|w\|_2^2 \right), \quad \text{s.t.} \; f_{ij} = \hat{f}_{ij}, \; \forall e_{ij} \in E_L,$$

where $\lambda$ is the regularization weight.

Here, we also add the $\ell_2$ norm regularization terms of the parameters to prevent the over-fitting problem.

2.3. Problem Solution

The minimization problem is solved by an iterative algorithm. In each iteration, the parameters are updated sequentially: when one parameter is being updated, the others are fixed. In the following subsections, we introduce the optimization of each parameter in turn.

2.3.1. Optimization of LSTM Parameters, θ

When the other parameters are fixed and only the LSTM model parameters $\theta$ are considered, we have the following minimization problem:

$$\min_{\theta} \sum_{i=1}^{n} \left\| x_i - h_{i_{|N_i|}} \right\|_2^2 + \sum_{i=1}^{n} \left( \phi_i - \psi\left( h_{i_{|N_i|}}; w \right) \right)^2 + \lambda \|\theta\|_2^2,$$

which can be solved by the ADAM algorithm. To use the ADAM algorithm, we calculate the gradient of the objective with respect to $\theta$ as follows:

$$\nabla_\theta = -2 \sum_{i=1}^{n} \left[ \left( x_i - h_{i_{|N_i|}} \right) + \left( \phi_i - \psi\left( h_{i_{|N_i|}}; w \right) \right) \nabla_h \psi\left( h_{i_{|N_i|}}; w \right) \right]^\top \frac{\partial h_{i_{|N_i|}}}{\partial \theta} + 2 \lambda \theta,$$

where $\partial h_{i_{|N_i|}} / \partial \theta$ is the derivative of the cell output with respect to the variable $\theta$, computed by back-propagation through the recursive steps.
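For reference, one ADAM update step can be sketched in NumPy as follows; the hyperparameter values are the commonly used defaults, not values reported by the paper, and the toy objective is only there to exercise the update rule.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update; m and v are running first/second moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)           # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy use: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)   # close to [0, 0]
```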

2.3.2. Optimization of Embedding Vectors, X

When the other variables are fixed and only the embedding vectors are considered, we have the following minimization problem:

$$\min_{X} \sum_{i=1}^{n} \left( \left\| x_i - h_{i_{|N_i|}} \right\|_2^2 + \left( \phi_i - \psi\left( h_{i_{|N_i|}}; w \right) \right)^2 \right).$$

To solve for the vectors, we apply a sequential optimization method and optimize the embedding vectors of the nodes one by one. When one node's embedding vector is optimized, the others are fixed. When the $i$-th vector, $x_i$, is considered, we have the following minimization problem:

$$\min_{x_i} \left\| x_i - h_{i_{|N_i|}} \right\|_2^2 + \sum_{j: v_i \in N_j} \left( \left\| x_j - h_{j_{|N_j|}} \right\|_2^2 + \left( \phi_j - \psi\left( h_{j_{|N_j|}}; w \right) \right)^2 \right),$$

where the sum collects the terms of the neighbouring nodes whose recursive inputs contain $x_i$.

Again, we use the ADAM algorithm to solve this problem, similarly to the optimization of θ.

2.3.3. Optimization of Edge Flow Vector, f

When only the edge flow vector is considered, we have the following minimization problem:

$$\min_{f} \left\| Bf - \phi \right\|_2^2, \quad \text{s.t.} \; f_{ij} = \hat{f}_{ij}, \; \forall e_{ij} \in E_L.$$

This is a linearly constrained quadratic programming problem; we employ the active-set algorithm to solve it.
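The active-set solver itself is not reproduced here; as a simplified alternative under the same equality constraints, one can eliminate the labeled (fixed) entries of f and solve the reduced unconstrained least-squares problem, which minimizes the same flow-balance residual while respecting the known flows. The graph and values are the same illustrative ones used above.

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n = 4
B = np.zeros((n, len(edges)))
for k, (i, j) in enumerate(edges):
    B[i, k], B[j, k] = -1.0, 1.0        # signed node-edge incidence matrix

phi = np.array([-2.5, 2.5, 1.5, -1.5])  # target node values
labeled = np.array([True, False, False, True, False])
f = np.zeros(len(edges))
f[labeled] = [2.0, -1.5]                # known flows on E_L, held fixed

# Eliminate the fixed entries: min over f_U of ||B_U f_U - (phi - B_L f_L)||^2.
residual = phi - B[:, labeled] @ f[labeled]
f[~labeled], *_ = np.linalg.lstsq(B[:, ~labeled], residual, rcond=None)
print(np.round(B @ f - phi, 6))         # near-zero balance residual
```

This variable-elimination approach handles only the equality constraints; the active-set method used in the paper additionally supports more general linear constraints.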

2.3.4. Optimization of Node Value Vector, ϕ

To solve for the node value vector, we fix the other variables and obtain the following minimization problem:

$$\min_{\phi} \sum_{i=1}^{n} \left( \phi_i - \psi\left( h_{i_{|N_i|}}; w \right) \right)^2 + \left\| Bf - \phi \right\|_2^2.$$

We set the derivative of the objective with respect to $\phi$ to zero and obtain the solution of $\phi$ as

$$\phi = \frac{1}{2} \left( \bar{\psi} + Bf \right),$$

where $\bar{\psi} = \left[ \psi\left( h_{1_{|N_1|}}; w \right), \ldots, \psi\left( h_{n_{|N_n|}}; w \right) \right]^\top$.
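The closed-form update can be checked numerically. The sketch below assumes the node-value term and the flow term enter the objective with weights alpha and beta (with equal weights, the update reduces to the simple average of the network outputs and B f); all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
alpha, beta = 1.0, 1.0              # tradeoff weights (assumed)

psi_vals = rng.normal(size=n)       # psi(.; w) evaluated at every node
Bf = rng.normal(size=n)             # current net flows B @ f

# Closed-form minimizer of alpha*||phi - psi||^2 + beta*||B f - phi||^2.
phi = (alpha * psi_vals + beta * Bf) / (alpha + beta)

# Check: the gradient 2*alpha*(phi - psi) - 2*beta*(B f - phi) vanishes.
grad = 2 * alpha * (phi - psi_vals) - 2 * beta * (Bf - phi)
print(np.max(np.abs(grad)))         # ~0
```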

2.3.5. Optimization of the Node Value Function Parameter, w

To solve for the parameter $w$ of the function $\psi$, we have the following minimization problem:

$$\min_{w} \sum_{i=1}^{n} \left( \phi_i - \psi\left( h_{i_{|N_i|}}; w \right) \right)^2 + \lambda \|w\|_2^2.$$

We still employ the ADAM algorithm to solve this minimization problem.

3. Experiments

In this section, we present the experimental setting and results. The algorithm developed and tested is called edge flow prediction by network embedding (EFPNF).

3.1. Benchmark Dataset and Experimental Setting

Four network datasets are used in our experiments. We summarize the datasets in Table 1.

10-fold cross-validation is used to generate the training and test sets. The Pearson correlation coefficient is used to measure the quality of the flows predicted by the algorithms [28].
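The evaluation metric can be computed directly with NumPy; the flow values below are made-up numbers for illustration only.

```python
import numpy as np

# Predicted vs. ground-truth flows on the held-out (test) edges.
f_true = np.array([1.0, -0.5, 2.0, 0.3, -1.2])
f_pred = np.array([0.9, -0.4, 1.7, 0.5, -1.0])

# Pearson correlation coefficient between prediction and ground truth;
# values near 1 indicate that both direction and relative amount are recovered.
r = np.corrcoef(f_true, f_pred)[0, 1]
print(round(r, 4))
```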

3.2. Experimental Results
3.2.1. Comparison to State-of-the-Arts

As shown in Figure 2, our algorithm is compared to other algorithms that are also used to predict the flows of edges: the flow SSL algorithm proposed by Jia et al. [1] and the LineGraph algorithm. According to the figure, the proposed algorithm EFPNF performs better than the compared algorithms on all four datasets. Especially on the Balerma dataset, EFPNF is the only algorithm whose correlation score is larger than 0.7. We can also see that the knowledge network dataset is the hardest task, and EFPNF still gives the best performance on it.

3.2.2. Convergence Analysis

Since our algorithm is iterative, we are also interested in its convergence. Thus, we plot the curves of the correlation score as the number of iterations increases (Figure 3). We can see from these curves that our algorithm's performance is boosted when the iteration number increases from 5 to 20. Since our algorithm aims to minimize the objective function, more iterations result in a smaller value of the objective; this verifies the effectiveness of the objective in achieving better edge flow estimation performance. However, when the iteration number further increases from 20 to 100, the performance improvement is not significant, which means our algorithm does not need a large number of iterations to give good performance.

4. Conclusion

We developed an iterative algorithm to learn the node embeddings and the missing flows of the edges of a network. The algorithm is based on a deep recurrent network for the embedding purpose. Moreover, it uses the embeddings to calculate the node values and further uses the node values to approximate the flows around each node. A unified learning framework is built for the joint learning of the node embeddings, node values, and edge flows. The learning process is guided by the minimization of the reconstruction errors of the embedding vectors, the node values, and the incoming/outgoing flows. Experimental results show the advantage of the proposed algorithm.

Data Availability

All the datasets used in this paper to produce the experimental results are publicly accessible online.

Conflicts of Interest

The authors declare no conflicts of interest regarding the work reported in this paper.

Acknowledgments

This paper was funded by the National Natural Science Foundation of China (Project nos. 71704036 and 71473062); Social Science Foundation of Ministry of Education of China (Project no. 19YJA790087); and the Talents Plan of Harbin University of Science and Technology: Outstanding Youth Project (Project no. 2019-KYYWF-0216).