Abstract

Low-rank matrix approximations have been used for link prediction in networks, but they are usually globally optimal methods that make little use of local information. Block structure is a significant local feature of matrices: entries in the same block have similar values, which implies that links are more likely to be found within dense blocks. We use this insight to give a probabilistic latent variable model for finding missing links by convex nonnegative matrix factorization with block detection. Experiments show that this method gives better prediction accuracy than the original method alone. Unlike the low-rank matrix approximation methods previously used for link prediction, it yields sparse solutions, which accords with the sparsity of most real complex networks. To scale to massive networks, we use the block information to map matrices onto distributed architectures and give a divide-and-conquer prediction method. Experiments show that it gives better results than the common neighbors method when the networks have a large number of missing links.

1. Introduction

As a fundamental problem in network research, link prediction attempts to estimate the likelihood of a relationship between two individuals by studying the observed links and the properties of nodes. Research on this problem can benefit a variety of fields. For example, researchers in different areas can efficiently find cooperative partners or assistants. Security agencies can focus their efforts more precisely on probable relationships in malicious networks. In social networks, people can find companions based on predictions made from their surrounding networks.

The natural framework for link prediction is the similarity-based algorithm. Simple similarity-based measures, such as the neighborhood-based Adamic-Adar score [1] and common neighbors [2], consider only the local structure of the network. Recently, a considerable amount of work drawing on community structure and scalable proximity estimation [3, 4] has given good prediction accuracy. Path-based similarity measures, for example, Katz [5] and Rooted PageRank [6, 7], focus on the global structure of the network; they are more effective but have a high computational complexity. A new measure based on neighbor communities performs well with a low complexity [8]. Maximum likelihood methods, such as the hierarchical structure model [3] and the stochastic block model [9, 10], presuppose some organizing principles of the network structure. Some algorithms, such as probabilistic relational models [11], probabilistic entity-relationship models [12], and stochastic relational models [13], learn the underlying structure from the observed network and then predict the missing links. Lichtenwalter et al. [14] designed a flow-based method for link prediction, which is more localized. Low-rank matrix approximations can also be used for link prediction in networks [15-17]. Based on the technique of clustered low-rank approximation for massive graphs, Shin et al. proposed a multiscale link prediction method [18], which captures information about the global structure of the network and can handle massive networks quickly.

In order to capture information about both the global structure and the clustering structure of a network, we consider low-rank approximations as well as blocks in the network's adjacency matrix. Low-rank approximation algorithms are good techniques for extracting the global information of a matrix. Meanwhile, block structure is an important feature of matrices, and links in the same block often have similar properties. Indeed, links are more easily found in dense blocks. Good block detection algorithms are error tolerant: they are unaffected by a few missing edges in a network. This suggests that the principle of block detection can be applied to edge prediction.

On the theoretical side, we propose a probabilistic latent variable model that combines the concepts of block structure and low-rank approximation for matrices. The model provides a framework for predicting links. First, any modularity clustering algorithm can be used to generate blocks; the only limit is the computational complexity. Then, unlike the low-rank matrix approximation algorithms already used for link prediction, we use a low-rank approximation algorithm named convex nonnegative matrix factorization (CNMF) [19] to obtain the predictions within the blocks. The reason we use CNMF is that the sparseness of its solutions accords with the sparsity of most real complex networks, so the predictions are more reliable. In small networks we use k-means to detect the block structure of a network's adjacency matrix and combine the prediction matrices obtained for several values of k to get the final prediction. Experiments show that our method achieves better performance. For massive networks, it is infeasible to use CNMF directly because of its high computational complexity. In this case, we use the block information to map matrices onto distributed architectures and give a divide-and-conquer prediction method suited to distributed computing.

2. Background

The network of $n$ nodes can be represented mathematically by an $n \times n$ adjacency matrix $X$. Here we set the diagonal entries to be 1, which means each node has a link to itself. This adjacency matrix can be treated as an object-feature matrix. The reduced-rank method CNMF [19] gives an approximation $X \approx XWG^{T}$ with nonnegative factors $W$ and $G$, which has an interesting property: if $X$ is sparse, the factors $W$ and $G$ both tend to be sparse.

CNMF has a direct interpretation: the matrix $F = XW$ is a convex combination of the columns of $X$; thus we can interpret its columns as weighted sums of certain objects' coordinates (the coordinates of object $i$ are given by column $i$ of $X$). So the columns of $F$ can be treated as cluster centroids of objects, and $W$ weights the association between objects and clusters. Meanwhile, $G$ measures the strength of the relationship between clusters and features; that is, $G_{jk} > 0$ if cluster $k$ has feature $j$, and $G_{jk} = 0$ otherwise. So $(XWG^{T})_{ij}$ measures the strength of the relationship between object $i$ and feature $j$ and can then be used to predict a link between $i$ and $j$.
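The factorization described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the usual Convex-NMF form $X \approx XWG^{T}$ from [19] and uses the multiplicative updates for a nonnegative input, in which case the negative part of $X^{T}X$ vanishes. The names `cnmf`, `reconstruct`, and `frob_error`, the random initialization, and the iteration count are all assumptions made for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def cnmf(x, rank, iters=200, seed=0, eps=1e-9):
    """Approximate a nonnegative matrix x by x @ w @ g^T with w, g >= 0.

    Assumes x is entrywise nonnegative, so X^T X has no negative part and
    the Convex-NMF multiplicative updates reduce to the two ratios below.
    """
    rng = random.Random(seed)
    n = len(x[0])
    w = [[rng.random() for _ in range(rank)] for _ in range(n)]
    g = [[rng.random() for _ in range(rank)] for _ in range(n)]
    y = matmul(transpose(x), x)  # Gram matrix X^T X
    for _ in range(iters):
        # G <- G * sqrt((Y W) / (G W^T Y W))
        yw = matmul(y, w)
        den = matmul(g, matmul(transpose(w), yw))
        g = [[g[i][k] * math.sqrt(yw[i][k] / (den[i][k] + eps))
              for k in range(rank)] for i in range(n)]
        # W <- W * sqrt((Y G) / (Y W G^T G)), using the updated G
        yg = matmul(y, g)
        den = matmul(yw, matmul(transpose(g), g))
        w = [[w[i][k] * math.sqrt(yg[i][k] / (den[i][k] + eps))
              for k in range(rank)] for i in range(n)]
    return w, g


def reconstruct(x, w, g):
    """Return the score matrix x @ w @ g^T."""
    return matmul(matmul(x, w), transpose(g))


def frob_error(x, approx):
    """Frobenius norm of the difference between two matrices."""
    return math.sqrt(sum((p - q) ** 2
                         for rp, rq in zip(x, approx)
                         for p, q in zip(rp, rq)))
```

Calling `reconstruct(x, *cnmf(x, rank))` on an adjacency matrix with unit diagonal yields a matrix of nonnegative scores; entries that are zero in `x` but receive a large score are the candidate missing links. The updates only guarantee a non-increasing approximation error, so different seeds may give different local solutions.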

3. A Probabilistic Latent Variable Model

Although the background gives an intuitive interpretation of CNMF as used in link prediction, we still need a theoretical guarantee. Here we propose a probabilistic latent variable model, which ensures that the probability of a link between two nodes can be expressed as a combination of CNMF and the block structure of a given adjacency matrix.

From a probabilistic view, the observed network data are a realization of an underlying probabilistic model, either because the data are themselves the result of a stochastic process or because the sampling has uncertainty. We can think of the adjacency matrices of network data as the object-feature matrices for objects and features . In this paper, contains an adjacency matrix and its blocks found by clustering methods such as k-means. The joint occurrence probability of an object and a feature can be factorized as where is a variable for the index of objects, is a variable for the index of features, and is a variable for the index of sampling. is the joint occurrence conditional probability given the observation , and is the prior probability that is observed.

Objects in real data are often organized into modules or clusters, and the probability that an object has a feature depends on the groups to which they belong. These cluster memberships are unknown to us; in the language of statistical inference, they are latent variables. Assuming each cluster is a combination of objects, the joint occurrence probability can be factorized as where is the variable for the index of cluster. Here, we assume that the random variables , are conditionally independent given .

Define a random variable If observing once, the expected value is Let be a random variable of the occurrence frequency of in observing times. Then the expected value is

For the reason of interpretability, we suppose the joint concurrent probability of th object, th cluster, and th data sampling is given by a combination of data as follows: where and Constraint (8) ensures that the probability defined by (7) is well defined.

This constraint has the advantage that we could interpret as weighted sums of certain joint concurrence probability of object, features, and data, given by Therefore, (6) can be expressed as where . Our goal is to compute two -order tensors and .

If we are only inputting the adjacency matrix, we can drop the index for sampling; then Equation (11) can be expressed by matrix as Now, we show that this factorization is equivalent to CNMF.

In fact, for any CNMF solution , does not hold and holds for any . Let be the diagonal matrix, containing the row sums of . We say that the matrix approximately satisfies (8). This can be proved as follows: So the factorization satisfies the condition of (11). If we are also inputting blocks, (10) can be solved as where and are solutions of CNMF for . This can be proved as the case of only inputting the adjacency matrix.

The algorithm of CNMF gives a global optimal solution to . The computational complexity of it (the most time-consuming step in our method) for matrix is of order for factor and is of order for factor , where is the number of iterations [19].

4. Algorithms

Given the network data, missing link prediction by calculating (15) has three steps. First, partition the observed adjacency matrix into blocks, using any modularity clustering algorithm. Second, obtain the predicting matrix by using CNMF to approximate each block. Third, sum the corresponding entries of the predicting matrices for all to make the final prediction. For small networks, we call our method k-CNMF, as we use k-means to partition the matrix. The diagram of k-CNMF for is shown in Figure 1.

The purpose of the first step is to use structural information of the observed network at several scales. For small networks, the CNMF approximation can also be computed directly on the original matrix (the block generated by k-means with a single cluster) to use the global information. A simple interpretation of our method is that if an edge is predicted to exist at many scales, it should be a missing link with high probability. The input of k-CNMF also needs two parameters: desired rank and scale parameter . Algorithm 1 shows the algorithm for k-CNMF.

input: //observed network
     , //desired rank, scale parameter
output: //predicting matrix
for in range of
 do -means to partition matrix into
 for block
  do CNMF with rank min
   //sum the corresponding entities
 end for
end for
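The accumulation step of Algorithm 1 can be sketched as follows. This is a hedged illustration rather than the published procedure: it assumes the partitions at each scale are supplied by the caller (for example, from k-means), that a block is the submatrix on one cluster's nodes, and that the block approximator is pluggable. The default `density_approx` is a deliberately trivial stand-in that replaces each entry by the block mean; it is not CNMF and is only there to keep the example self-contained. All function names are hypothetical.

```python
def submatrix(a, nodes):
    """Extract the square block of matrix a on the given node indices."""
    return [[a[i][j] for j in nodes] for i in nodes]


def density_approx(block):
    """Stand-in approximator (NOT CNMF): every entry becomes the block mean."""
    size = len(block)
    mean = sum(map(sum, block)) / (size * size)
    return [[mean] * size for _ in range(size)]


def multiscale_scores(a, partitions, approximate=density_approx):
    """Sum block-wise approximations over several partitions of the nodes.

    partitions holds one partition per scale; each partition is a list of
    blocks, and each block is a list of node indices.
    """
    n = len(a)
    scores = [[0.0] * n for _ in range(n)]
    for blocks in partitions:
        for nodes in blocks:
            approx = approximate(submatrix(a, nodes))
            # Add the block's prediction back at its position in the full matrix.
            for bi, i in enumerate(nodes):
                for bj, j in enumerate(nodes):
                    scores[i][j] += approx[bi][bj]
    return scores
```

A pair of nodes that falls inside a dense block at several scales accumulates a larger score than a pair that only shares the coarsest block, which matches the interpretation that an edge predicted at many scales is likely to be a missing link. In practice, `approximate` would be replaced by a function that runs CNMF on the block.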

When predicting links in massive networks, k-means is unsuitable because it performs poorly when clustering high-dimensional data. Meanwhile, the high computational complexity of CNMF makes it infeasible to apply to the large adjacency matrix directly. So we use a fast modularity clustering algorithm [20] to generate blocks. Based on the block structure, we give a divide-and-conquer algorithm (-CNMF) to predict links, which is shown in Algorithm 2. The algorithm works by partitioning a matrix into blocks that are small enough for CNMF to be applied directly. The predictions for the small blocks are then combined to give the final prediction for the original matrix. To balance the CPU load, the blocks should be of similar size, which is achieved by splitting the large blocks and combining the small blocks so that their sizes lie in the neighborhood of a given threshold.

input: //observed network
     , //desired rank, scale parameter
output: //predicting matrix
find community structure
for
 if size
  divide into int size equal parts
  append each part to
  delete
 end if
end for
for
 if size
  for ,
   if size
    
    delete
   end if
  end for
 end if
end for
partition matrix into by
for block
 do CNMF with rank min
//sum the corresponding entities
end for
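The load-balancing step of Algorithm 2 (split the large blocks, combine the small ones) can be illustrated with a short sketch. It is one possible reading under stated assumptions, not the authors' exact rule: blocks larger than the threshold are cut into near-equal parts using a ceiling division, and the remaining pieces are merged greedily in order of size as long as the merged block does not exceed the threshold. The function name `balance_blocks` and both of these choices are assumptions.

```python
import math


def balance_blocks(communities, threshold):
    """Split blocks larger than threshold, then greedily merge small ones.

    communities is a list of blocks, each a list of node indices.
    Every returned block has at most threshold nodes.
    """
    pieces = []
    for nodes in communities:
        nodes = list(nodes)
        if len(nodes) > threshold:
            # Cut an oversized block into near-equal parts (assumed rule).
            parts = math.ceil(len(nodes) / threshold)
            step = math.ceil(len(nodes) / parts)
            pieces.extend(nodes[s:s + step] for s in range(0, len(nodes), step))
        else:
            pieces.append(nodes)

    # Merge small pieces, smallest first, while the result stays within threshold.
    pieces.sort(key=len)
    balanced = []
    current = []
    for nodes in pieces:
        if current and len(current) + len(nodes) > threshold:
            balanced.append(current)
            current = []
        current = current + nodes
    if current:
        balanced.append(current)
    return balanced
```

Each resulting block can then be factorized independently, for example on a separate worker, and the block predictions written back into the full predicting matrix.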

5. Experiments and Comparison

In general, links between different nodes may have different weights, representing their relative importance in the network. In our experiments, we set all weights to one to obtain the original adjacency matrix of the network. The observed network is generated by randomly removing a fraction of links from the original network; the removed links are called the missing edges. Then we use the two algorithms of Section 4 (Algorithms 1 and 2) to get the probability of links between nodes, which plays the role of link weights in the observed network.
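The construction of the observed network can be sketched as below. This is a minimal illustration assuming the network is given as a list of undirected edges; the function name `split_links`, the seeded generator, and the rounding of the number of removed links are assumptions made for the example.

```python
import random


def split_links(edges, fraction, seed=0):
    """Randomly hide a fraction of links; return (observed, missing)."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_missing = int(round(fraction * len(shuffled)))
    # The first n_missing shuffled links are hidden; the rest stay observed.
    return shuffled[n_missing:], shuffled[:n_missing]
```

The observed links are used to build the adjacency matrix passed to the prediction algorithm, while the missing links are kept aside for evaluation.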

5.1. Evaluation Method

To measure the accuracy of link prediction methods, the main metric we use is AUC [21], the area under the receiver operating characteristic (ROC) curve, which is widely used in the machine learning and social network communities. If we rank pairs of nodes in decreasing order of score, AUC is the probability that a randomly chosen missing link has a higher ranking than a randomly chosen nonexistent link. In practice, we do $n$ independent comparisons. Each time we randomly pick a missing link and a nonexistent link and compare their rankings. If there are $n'$ times that the missing link has a higher ranking than the nonexistent one and $n''$ times that they have the same ranking, the AUC value is $\mathrm{AUC} = (n' + 0.5\,n'')/n$.
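The sampling procedure for AUC can be sketched as follows. The example assumes the scores of the missing links and of the nonexistent links are available as two lists and uses the standard estimate (n' + 0.5 n'') / n, where n' counts comparisons won by the missing link and n'' counts ties; the function name and default number of comparisons are assumptions.

```python
import random


def auc_by_sampling(missing_scores, nonexistent_scores, n=10000, seed=0):
    """Estimate AUC from n random (missing, nonexistent) score comparisons."""
    rng = random.Random(seed)
    higher = 0
    ties = 0
    for _ in range(n):
        m = rng.choice(missing_scores)
        x = rng.choice(nonexistent_scores)
        if m > x:
            higher += 1
        elif m == x:
            ties += 1
    # A win counts 1, a tie counts one half, a loss counts 0.
    return (higher + 0.5 * ties) / n
```

A value near 0.5 means the scores are no better than chance, and a value of 1 means every sampled missing link outranked the sampled nonexistent link.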

The fraction of missing links ranges from 0.05 to 0.95 in steps of 0.05.

5.2. Methods Used to Compare

We compare our algorithm with three prediction methods: Common Neighbors, Block Model, and Hierarchical Random Graphs.

(1) Common Neighbors (CN) [2]. If two nodes $x$ and $y$ have many common neighbors, they are more likely to have a link. The measure of this is $s_{xy} = |\Gamma(x) \cap \Gamma(y)|$, where $\Gamma(x)$ is the set of neighbors of $x$.
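The common neighbors score is simple enough to state in code. The sketch below assumes an undirected network given as a list of edges and scores only the pairs that are not already linked; the function name is hypothetical.

```python
def common_neighbors_scores(edges):
    """Score each non-adjacent node pair by its number of shared neighbors."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    nodes = sorted(neighbors)
    scores = {}
    for idx, u in enumerate(nodes):
        for v in nodes[idx + 1:]:
            if v not in neighbors[u]:
                # Size of the intersection of the two neighbor sets.
                scores[(u, v)] = len(neighbors[u] & neighbors[v])
    return scores
```

When many links are missing, most pairs share no observed neighbor and receive a score of zero, so the ranking carries little information.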

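(2) Block Model (BM) [4]. In block models, nodes are partitioned into groups and the connecting probability of two nodes depends only on the groups they belong to. Given a partition $P$ of the network, $l^{O}_{\alpha\beta}$ is the number of edges in the observed network between nodes in groups $\alpha$ and $\beta$, and $r_{\alpha\beta}$ is the maximum possible number of links between $\alpha$ and $\beta$. The reliability of an individual link is $R_{ij} = \frac{1}{Z} \sum_{P \in \mathcal{P}} \frac{l^{O}_{\sigma_i \sigma_j} + 1}{r_{\sigma_i \sigma_j} + 2} \exp[-H(P)]$, where the sum is over partitions $P$ in the space $\mathcal{P}$ of all possible partitions of the network, $\sigma_i$ is node $i$'s group, $H(P) = \sum_{\alpha \le \beta} \left[ \ln(r_{\alpha\beta} + 1) + \ln \binom{r_{\alpha\beta}}{l^{O}_{\alpha\beta}} \right]$, and $Z = \sum_{P \in \mathcal{P}} \exp[-H(P)]$.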

(3) Hierarchical Random Graphs (HRG) [3]. The hierarchical structure of a network can be represented by a dendrogram with $n$ leaves (the vertices of the given network) and $n - 1$ internal nodes. A probability $p_r$ is associated with each internal node $r$, and the connecting probability of a pair of leaves is equal to $p_r$, where $r$ is the deepest common ancestor of these two leaves. HRG combines the maximum likelihood method and the Markov chain Monte Carlo method to sample hierarchical structures from the observed network with probability proportional to their likelihood and then calculates the mean connecting probability $\langle p_{ij} \rangle$.

5.3. Performance of k-CNMF

We evaluate the performance of k-CNMF using high-quality small networks, which are listed in Table 1: the social network of interactions between people in a karate club [22], the social network of frequent associations between dolphins [23], the air transportation network of the USA, the coappearance network of characters in the novel Les Miserables [24], and a network of hyperlinks between weblogs on US politics [25]. Each AUC is obtained by averaging over 100 independent realizations.

Communities are a basic structure in networks and are widely used to predict missing links. Because it uses block structure, our combined method also depends on the communities of the networks. As Figures 2(a) and 2(b) show, k-CNMF (, ) performs much better than CNMF () alone on Karate and Les-Mis, because both of those networks have more than two communities. However, the enhancement from block structure is small on PB with desired rank (see Figure 2(c)), which has two main communities. As the desired rank increases, the enhancement from block structure decreases. That is because the local information can be revealed by the richness of the clustering structure given by CNMF with a high desired rank. So the enhancement is also small on US-Air (nodes: 332) with desired rank (see Figure 2(d)).

Will using more block information bring more accuracy?

If the matrix is partitioned into blocks that are too small, k-CNMF will have too many parameters relative to the observed data, and overfitting will occur. An overfitted model describes noise instead of the underlying relationship and generally has poor predictive performance. From the performance of k-CNMF () with different scale parameters on Dolphins (see Table 2), we can see that k-CNMF () has revealed enough local information, and increasing the scale parameter further caused overfitting.

Figure 3 shows the comparison of k-CNMF with CN, BM, and HRG on Karate (inputting parameters of k-CNMF: , ), Dolphins (, ), Les-Mis (, ), and US-Air (, ). k-CNMF performs better than CN, because it takes both local and global information into account. The performance of k-CNMF is comparable with that of BM and HRG, but it is faster, as it does not need Monte Carlo sampling.

5.4. Performance of -CNMF

We examine the performance of -CNMF on the main components of four real-world networks: the Arxiv GR-QC collaboration network (inputting parameters: , ) [26], the Western States Power Grid of the United States (, ) [27], the Enron email network (, ) [28], and the subnetwork of the EU email communication network generated by emails from the first 5000 nodes to the first 10000 nodes (, ) [26].

Comparisons are made only between -CNMF and CN, because BM and HRG are not suitable for large networks. Figure 4 shows the comparison of -CNMF with CN. -CNMF performs better than CN when the observed networks are very sparse, because few common neighbors remain in the sparse case and CN relies only on this property. In the power network, our method is clearly better than CN, because the original network is already sparse.

Figures 5(a) and 5(b) show the comparison of AUC results between Algorithm 1 and Algorithm 2 on Karate and Dolphins, respectively, where , . There is no obvious pattern in how the choice of modularity clustering algorithm influences the AUC results.

6. Conclusions

We have introduced a probabilistic latent variable model for finding missing edges, which combines convex nonnegative matrix factorization with block structure detection. It is inspired by two properties of block structure in matrices: entries in the same block tend to be similar, and good block detection algorithms are tolerant of missing edges. To scale to massive networks, we use a fast modularity clustering algorithm to generate blocks and give a divide-and-conquer algorithm (-CNMF) for predicting links. To balance the CPU load, we split the large blocks and combine the small blocks so that their sizes lie in the neighborhood of a given threshold.

Since most applications of link prediction, such as personalized recommendation, face the problem of sparse data, we plan to combine other sparse low-rank approximation algorithms with block detection methods to obtain effective link prediction algorithms for massive networks in the future.

Conflict of Interests

The authors declare that they do not have any commercial or associative interest that represents a conflict of interests in connection with the work.