Abstract

Link prediction is an important task in complex network analysis. Traditional link prediction methods are limited by network topology and by the lack of node property information, which makes predicting links challenging. In this study, we address link prediction using a sparse Gaussian graphical model and demonstrate its theoretical and practical effectiveness. In theory, link prediction is executed by estimating the inverse covariance matrix of samples to overcome information limits. The proposed method was evaluated with four small and four large real-world datasets. The experimental results show that, compared to 13 mainstream similarity methods, the area under the curve (AUC) value obtained by the proposed method improved by an average of 3% on the small datasets and 12.5% on the large datasets. The method outperforms the baseline methods, and its prediction accuracy is superior to mainstream methods when using only 80% of the training set. The method also provides significantly higher AUC values when using only 60% of the training set on the Dolphin and Taro datasets. Furthermore, the proposed method achieves a lower error rate than mainstream methods on all datasets.

1. Introduction

Since 2010, link prediction has become an increasingly distinctive and important part of complex network analysis. Link prediction refers to the prediction of a possible link between two nodes when links are unknown [1]. Such prediction involves the prediction of existing yet unknown links and future links. Link prediction is the basis of data mining problems and lays the foundation for complex network research. Link prediction provides a mechanism for both structure and evolution of networks. Studying this problem is important from both theoretical and practical perspectives [2]. Existing community detection research is primarily based on an adjacency matrix, and community detection typically depends on the adequacy and completeness of the adjacency matrix. Link prediction is instrumental for accurately analysing social-network structures, helping community detection, and improving the accuracy of community detection [2, 3]. Link prediction can be used to predict missing data and can help analyse network evolution [4]. For example, we can use the current network structure to predict users who have not been recognized as friends or can develop into friends.

Link prediction methods have made remarkable achievements in various fields, including biology, social science, security, and medicine. Ermiş et al. [2] address the link prediction problem by data fusion formulated as simultaneous factorization of several observation tensors where latent factors are shared among each observation; some studies turn to multirelational link prediction [3]; Yang et al. also studied evaluation of link prediction [5].

In 2014, Gong et al. extended the Social-Attribute Network framework with several supervised and unsupervised link-prediction algorithms and demonstrated the resulting performance improvement [6]. However, such methods have limitations when processing social datasets. First, social datasets are low-quality datasets that include faulty links and noise. Such datasets must be preprocessed before similarity measurement, set partitioning, and common neighbour (CN) counting. Moreover, node properties are cumbersome to obtain, and most social network data can only be used to obtain a raw adjacency matrix that does not include specific attribute information because user information is private in most online systems. Consequently, many prediction methods cannot use such properties, and property-based features cannot be calculated. Therefore, using only an adjacency matrix avoids interference from node properties, which is convenient and feasible.

Many community detection methods are based on an adjacency matrix. Thus, the adjacency matrix's integrity directly affects the results. Through link prediction using an adjacency matrix, we can determine the relationships between unconnected nodes, and the entire community structure can be obtained by analysing these relationships. Thus, an effective link prediction method is required. This issue presents a series of challenges such as the following: (1) link prediction must function with an adjacency matrix that does not contain properties, (2) a graph structure model for estimating the network is required to determine the role of different types of connections, and (3) verification and evaluation of link prediction are required.

Existing link prediction methods use similarities, node properties, edge properties, and so forth. However, properties require a large amount of test data and heavily rely on network connectivity and structure; thus, link prediction without properties is less robust. When the network structure changes, it is difficult to mine the relationships between nodes. Thus, determining how to use limited test data to predict a network edge is the motivation of this study.

To solve the above problems, this paper presents a link prediction method based on the application of a Gaussian graphical model (GGM) to an adjacency matrix. This concept references Friedman et al.'s [7] sparse inverse covariance estimation theory. The study uses the original adjacency matrix for sampling, thereby obtaining a sample matrix. We then use the inverse covariance matrix of a sparse GGM (SGGM) for link prediction. The main contributions of this study are as follows:
(1) Sampling of a network to build a feature matrix, seeking maximum likelihood estimation using an SGGM and estimating an inverse covariance matrix (precision matrix) of the adjacency matrix.
(2) Establishing conditional independence between nodes to complete link prediction using the Markov random field independence principle.
(3) Proving that the proposed method is more effective than previous methods by testing the methods using four real-world datasets.

The remainder of this paper is organised as follows. We introduce related work in Section 2, including many previous link prediction methods. In Section 3, we present our SGGM-based link prediction method. In Section 4, we introduce eight real-world datasets and test the methods using these datasets to prove that the proposed method is more effective than previous methods. Finally, we conclude the paper and present suggestions for future work in Section 5.

2.1. Problem Description

Existing link prediction methods can be divided into three categories.

2.1.1. Similarity Link Prediction Employs Different Methods

One method is based on node properties, such as sex, age, occupation, preferences, and other properties, to compute node similarity. It is more probable that edges will exist between high-similarity nodes. Another method is based on the structural similarity of the network, for example, the use of CN nodes. However, this method is only applicable to a network with a high clustering coefficient.

2.1.2. Estimates Based on the Maximum Likelihood Estimation of a Link Can Be Divided into Two Categories

One method is based on network hierarchy, but it has high complexity because it generates many network samples. The other method is based on stochastic block model prediction, wherein nodes are divided into some sets and the probability of an edge depends on corresponding sets.

2.1.3. A Link Prediction Model Based on Probability Builds a Model by Adjusting the Parameters

This can fit the structure of the relationships in real networks. A pair of nodes will generate an edge determined by probability using the optimum parameter. A probabilistic model considers the probability of existing edges as a property. It transforms edge prediction into property issues. This method takes advantage of the network structure and node properties with high precision but offers poor universality.

Due to the poor universality of maximum likelihood estimation and the probability model, which depend highly on node properties, these methods cannot be applied to many networks. Herein, we consider a link prediction method which is only based on similarity and discuss experiments performed to compare the proposed and previous methods.

2.2. Similarity-Based Link Prediction

Here, we compare the prediction accuracies of 13 similarity measures. All of these measures are based on the local structural information contained in a test set. We first introduce each measure briefly. The formulas are shown in Table 1. Here, $G(V, E)$ is an undirected network, $V$ is a set of nodes, and $E$ is a set of edges. The total number of nodes for the network is $N$ and the number of edges is $M$. For a node $x$ and its neighbours $\Gamma(x)$, the degree of $x$ is $k_x = |\Gamma(x)|$. The network has $N(N-1)/2$ node pairs, that is, a universal set $U$. When given a link prediction method, each pair of nodes $(x, y)$ without an edge will have a score $s_{xy}$. Then, all unconnected pairs of nodes are ordered by the score value in descending order, and the pairs at the top are the most likely to be linked.

The CN index is a similarity index based on local information, and it is one of the simplest methods [8]. In other words, it is more likely that a link will exist between two nodes that share many neighbours. For node $x$, $\Gamma(x)$ represents its neighbours, and if nodes $x$ and $y$ have many CNs, they are more likely to have a link. Clearly, structural equivalence pays more attention to whether two nodes are in the same circumstance. For example, in a social network context, if two people share many friends, they are more likely to be friends themselves. We consider the impact of the degree of both nodes and generate six similarity indices, namely, Salton index (cosine-based similarity) [9], Jaccard index [10], Sørensen index [11], hub promoted index (HPI), hub depressed index (HDI), and Leicht-Holme-Newman (LHN) index [12]. These indices are based on CN similarity. Another degree-based similarity method is preferential attachment (PA) [13], which can generate a scale-free network. Note that the complexity of this algorithm is lower than that of others because it requires less information.

If we consider the degree information of two nodes' CNs, we have the Adamic-Adar (AA) index [14], which assumes that a CN with a small degree contributes more than a CN with a large degree. Liu et al. [15] proposed the RA index, which takes the viewpoint of network resource allocation. The RA and AA indices determine the weight of CN nodes in different ways: RA weights a CN of degree $k$ by $1/k$, whereas AA weights it by $1/\log k$. The RA index performs better than the AA index in weighted networks and community mining. Traditional CN methods do not distinguish the different roles of CNs. Liu et al. [15] proposed a local naive Bayes (LNB) model. This model creates a role parameter for marking the different roles of CNs. However, it is only applied to certain types of networks.
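The local indices described above can be illustrated with a minimal Python sketch. The function names and the toy graph are hypothetical and not part of the original study, and the natural logarithm is assumed for the AA index; the exact formulas used in the experiments are those of Table 1.

```python
import math

def neighbours(edges):
    """Build an adjacency map {node: set of neighbours} from an undirected edge list."""
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj

def similarity_scores(adj, x, y):
    """Return several local similarity scores for the node pair (x, y)."""
    common = adj[x] & adj[y]
    union = adj[x] | adj[y]
    kx, ky = len(adj[x]), len(adj[y])
    return {
        "CN": len(common),
        "Jaccard": len(common) / len(union) if union else 0.0,
        "Salton": len(common) / math.sqrt(kx * ky) if kx and ky else 0.0,
        "PA": kx * ky,
        # A common neighbour of two distinct nodes has degree >= 2, so log(k) > 0.
        "AA": sum(1.0 / math.log(len(adj[z])) for z in common),
        "RA": sum(1.0 / len(adj[z]) for z in common),
    }

# Hypothetical toy graph: a square a-b-c-d with one diagonal a-c.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
toy_adj = neighbours(toy_edges)
scores_bd = similarity_scores(toy_adj, "b", "d")
```

In this toy graph, nodes b and d are unconnected but share the two neighbours a and c, each of degree 3, so RA sums two weights of 1/3 while AA sums two weights of 1/log 3.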

Most similarity algorithms perform well with ideal datasets and large training datasets. Such algorithms do not consider missing, incomplete, and polluted data in an adjacency matrix. As previously mentioned, the accuracy of similarity-based methods depends on whether the method can determine the characteristics of the network structure. For example, CN-based algorithms utilise a node's CNs, and a pair of nodes with many CNs is more likely to connect. Such algorithms perform well, sometimes even better than complex algorithms, in a network with a high clustering coefficient. However, in networks with a low clustering coefficient, such as router or power networks, accuracy is significantly worse.

This study attempts to determine links in an adjacency matrix to reveal actual links between nodes. The proposed SGGM method is based on an undirected graphical model and transforms the adjacency matrix without using node properties. The SGGM method estimates the precision matrix of a network to predict unknown edges. To verify link prediction accuracy, the area under the curve (AUC) and an error metric are used to demonstrate the proposed method's effectiveness.

3.1. Sparse Graphical Model

Biologists interested in genetic connections use the GGM to estimate genetic interaction. Edge relationships in an undirected graph are represented by the joint distribution of random variables. For example, genome work is based on biological functions, and some regulatory relationships exist between genes. In the corresponding graph, nodes represent genes and edges represent these regulatory relationships. The graphical model provides a method to model such relationships. We assume that the variables follow a Gaussian distribution, which is why the GGM is most frequently used. The problem is then equivalent to estimating the inverse covariance matrix (precision matrix $\Theta = \Sigma^{-1}$), and the nonzero off-diagonal elements of the precision matrix represent the edges in the graph [16]. The GGM denotes the statistical dependencies between variables. If there is no link between two nodes, such nodes are conditionally independent. In the GGM, a precision matrix can parameterise each edge.

A popular modern method for estimating a GGM is the graphical lasso, for which Friedman et al. [7] added an $\ell_1$ norm to penalise each off-diagonal element of the inverse covariance matrix.

The GGM encodes the conditional dependence relationships among a set of variables, given observations that are independent and identically Gaussian distributed. Motivated by network terminology, we refer to the variables in a graphical model as nodes. If a pair of variables (or features) is conditionally dependent, then there is an edge between the corresponding pair of nodes; otherwise, no edge is present. The basic model for continuous data assumes that the observations have a multivariate Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$, and $\Theta = \Sigma^{-1}$. If the $(i, j)$th component of $\Theta$ is zero, then the variables $i$ and $j$ are conditionally independent given the other variables. Specifically, the nonzero elements in $\Theta$ represent the graph's structure. In other words, if $\Theta_{ij} = 0$, nodes $i$ and $j$ have no connecting edge in the graph. The precision matrix is sparse due to the conditional independence of the variables. To estimate a sparse GGM, many methods are based on maximum likelihood estimation (MLE), and one such method is the graphical lasso, which is expressed as

$$\hat{\Theta} = \arg\max_{\Theta \succ 0} \left\{ \log\det\Theta - \operatorname{tr}(S\Theta) - \lambda \|\Theta\|_{1} \right\}. \tag{1}$$

Here, $S$ is the empirical covariance matrix and $\lambda$ is a positive tuning parameter. The constraint $\Theta \succ 0$ requires $\Theta$ to be a positive definite matrix. $\lambda \|\Theta\|_{1}$ is the penalty term, and $\hat{\Theta}$ becomes increasingly sparse with increasing $\lambda$. An $\ell_1$-norm is utilised to consider a variable rather than a fixed parameter.
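As a hedged illustration of the penalised likelihood and of the link between zeros in the precision matrix and conditional independence, the following NumPy sketch evaluates the graphical lasso objective for a hypothetical three-node chain. The function name and the toy matrices are assumptions made for this example, and the penalty here is taken over every entry of the matrix, diagonal included.

```python
import numpy as np

def glasso_objective(theta, s, lam):
    """Evaluate log det(Theta) - tr(S Theta) - lam * sum |Theta_ij|."""
    try:
        # For a symmetric matrix, Cholesky succeeds only if it is positive definite.
        chol = np.linalg.cholesky(theta)
    except np.linalg.LinAlgError:
        return -np.inf
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    # Assumption: the penalty covers every entry, diagonal included.
    return logdet - np.trace(s @ theta) - lam * np.abs(theta).sum()

# Hypothetical 3-node chain 0 - 1 - 2: no edge between nodes 0 and 2,
# so the (0, 2) entry of the precision matrix is zero.
theta_chain = np.array([[2.0, -1.0, 0.0],
                        [-1.0, 2.0, -1.0],
                        [0.0, -1.0, 2.0]])
sigma_chain = np.linalg.inv(theta_chain)  # the covariance matrix is dense
```

Although nodes 0 and 2 are conditionally independent given node 1, the corresponding covariance entry is nonzero, which is why the graph is read from the precision matrix rather than from the covariance matrix.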

3.2. Graphical Lasso Algorithm

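In 2008, Banerjee et al. [17] proved that formula (1) is convex. They estimated $\Sigma$ via $W$ only (rather than $\Sigma^{-1}$) and subsequently solved the problem by optimising over each row and the corresponding column of $W$ in a block coordinate descent fashion. Partition $W$ and $S$ as

$$W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^{T} & w_{22} \end{pmatrix}, \qquad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^{T} & s_{22} \end{pmatrix}.$$

With this partition, and with $\lambda$ denoting the tuning parameter of (1), the solution for $w_{12}$ satisfies

$$w_{12} = \arg\min_{y} \left\{ y^{T} W_{11}^{-1} y : \|y - s_{12}\|_{\infty} \le \lambda \right\}. \tag{5}$$

Banerjee et al. [17] proved that solving (5) is equivalent to solving its dual problem:

$$\min_{\beta} \left\{ \tfrac{1}{2} \left\| W_{11}^{1/2} \beta - b \right\|^{2} + \lambda \|\beta\|_{1} \right\}, \qquad b = W_{11}^{-1/2} s_{12}. \tag{6}$$

Here, if $\beta$ is the solution to (6), then $w_{12} = W_{11}\beta$ is the solution to (5). Therefore, (6) is similar to lasso regression and is the basis for the graphical lasso algorithm. First, we must prove that (6) and (1) are equivalent. Given $W\Theta = I$, that is,

$$\begin{pmatrix} W_{11} & w_{12} \\ w_{12}^{T} & w_{22} \end{pmatrix} \begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^{T} & \theta_{22} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0^{T} & 1 \end{pmatrix}, \tag{7}$$

the stationarity condition of the MLE problem (1) is written as

$$W - S - \lambda \Gamma = 0. \tag{8}$$

Note that the derivative of $\log\det\Theta$ equals $\Theta^{-1} = W$. Here, $\Gamma_{ij} \in \operatorname{sign}(\Theta_{ij})$; that is, $\Gamma_{ij} = \operatorname{sign}(\Theta_{ij})$ if $\Theta_{ij} \neq 0$; else $\Gamma_{ij} \in [-1, 1]$ if $\Theta_{ij} = 0$. Thus, the upper-right block of (8) is expressed as

$$w_{12} - s_{12} - \lambda \gamma_{12} = 0. \tag{9}$$

The subgradient equation of (6) is expressed as follows:

$$W_{11}\beta - s_{12} + \lambda \nu = 0, \tag{10}$$

where $\nu \in \operatorname{sign}(\beta)$. Here, suppose that $(W, \Gamma)$ solves (8); consequently, $(w_{12}, \gamma_{12})$ solves (9). Then, $\beta = W_{11}^{-1} w_{12}$ and $\nu = -\gamma_{12}$ solve (10). For the sign terms, $W_{11}\theta_{12} + w_{12}\theta_{22} = 0$ is derived from (7); therefore, we obtain $\theta_{12} = -\theta_{22} W_{11}^{-1} w_{12}$. Since $\theta_{22} > 0$, it follows that $\operatorname{sign}(\theta_{12}) = -\operatorname{sign}(W_{11}^{-1} w_{12})$, which proves the equivalence. The solution to the lasso problem (6) gives, up to a negative constant, the relevant part of $\Theta$, namely $\theta_{12} = -\theta_{22}\beta$.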

Problem (6) appears similar to a lasso ($\ell_1$-regularised) least-squares problem. In fact, if $W_{11} = S_{11}$, the solutions $\hat{\beta}$ are equal to the lasso estimates for the $p$th variable regressed on the others. In 2007, Friedman et al. used fast coordinate descent algorithms to solve the lasso problem. Hsieh et al. [18, 19] proposed a novel block-coordinate descent method via clustering to solve the optimization problem. The method is designed for large networks and can reach quadratic convergence rates.
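The coordinate descent idea for the lasso subproblem can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments; it minimises the equivalent quadratic form $\tfrac{1}{2}\beta^{T} W_{11} \beta - s_{12}^{T}\beta + \lambda\|\beta\|_{1}$, which differs from (6) only by a constant. The function names, the iteration limit, and the tolerance are assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def lasso_coordinate_descent(w11, s12, lam, n_iter=200, tol=1e-10):
    """Minimise 0.5 * b^T W11 b - s12^T b + lam * ||b||_1 by cyclic coordinate descent."""
    beta = np.zeros(len(s12))
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(len(beta)):
            # Partial residual: remove the contribution of every coordinate except j.
            r = s12[j] - w11[j] @ beta + w11[j, j] * beta[j]
            beta[j] = soft_threshold(r, lam) / w11[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

# Hypothetical 2x2 block and cross-covariance vector.
w11_demo = np.array([[2.0, 0.5], [0.5, 1.0]])
s12_demo = np.array([1.0, 0.2])
beta_demo = lasso_coordinate_descent(w11_demo, s12_demo, 0.1)
```

With no penalty the update reduces to a Gauss-Seidel solve of $W_{11}\beta = s_{12}$, and with a sufficiently large penalty every coefficient is thresholded to zero, which is the mechanism that produces a sparse precision matrix.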

3.3. Link Prediction Scheme Based on Sparse Inverse Covariance Estimation

Here, we consider an undirected simple network $G(V, E)$, where $E$ is the set of links and $V$ is the set of nodes. Moreover, $A$ is the adjacency matrix, $\lambda$ is the graphical lasso parameter, $N$ denotes the number of nodes, and the universal set $U$ consists of $N(N-1)/2$ possible edges. The set of links $E$ is randomly divided into two parts: training ($E^{T}$) and testing ($E^{P}$) sets. Note that only the information in the training set can be used for link prediction. The training ratio is the fraction of the edges in $E$ that are assigned to the training set. Clearly, $E^{T} \cup E^{P} = E$ and $E^{T} \cap E^{P} = \emptyset$. In the training set sampling process, we generated independent observations, each drawn from a normal distribution. Here, $S$ denotes the sample covariance matrix of these observations. We then apply the SGGM for link prediction. The pseudocode of the link prediction scheme is given in Algorithm 1.

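Require: adjacency matrix $A$, graphical lasso parameter $\lambda$, training ratio, number of observations
Ensure: estimated adjacency matrix $\hat{A}$
$E^{T}$: randomly select the training-ratio fraction of edges from $E$.
$X$: independent observations from the normal distribution; $S$: sample covariance matrix of $X$.
$\hat{\Theta}$ = Sparse Inverse Covariance Estimation($S$, $\lambda$)
if $\hat{\Theta}_{ij} \neq 0$ and $i \neq j$ then
  $\hat{A}_{ij} = 1$
else
  $\hat{A}_{ij} = 0$
end if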
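Two steps of the scheme, the random split of the edge set and the conversion of an estimated precision matrix into a predicted adjacency matrix, can be sketched in plain Python. The function names, the seed, and the tolerance `tol` are hypothetical, and the rule of predicting an edge wherever an off-diagonal precision entry is nonzero is one reading of Algorithm 1 rather than a confirmed detail of the original implementation. The sampling and estimation steps are omitted.

```python
import random

def split_edges(edges, ratio, seed=0):
    """Randomly split an edge list into training and testing (probe) sets."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_train = round(ratio * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

def precision_to_adjacency(theta, tol=1e-8):
    """Predict an edge (i, j), i != j, wherever |theta[i][j]| exceeds tol."""
    n = len(theta)
    return [[1 if i != j and abs(theta[i][j]) > tol else 0 for j in range(n)]
            for i in range(n)]

# Hypothetical 4-node network with 5 edges and an 80% training ratio.
demo_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
train_edges, test_edges = split_edges(demo_edges, 0.8)

# Hypothetical precision matrix of a 3-node chain.
demo_theta = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
demo_adjacency = precision_to_adjacency(demo_theta)
```

The split keeps the two sets disjoint and exhaustive, matching the conditions on the training and testing sets stated above.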

4. Experiment and Analysis

4.1. Evaluation Metrics
4.1.1. AUC

In link prediction, we focus on accurately recovering missing links. Here, we have chosen the area under the receiver operating characteristic (ROC) curve as a metric. An ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold changes. The AUC ranges from 0 to 1, where 0.5 corresponds to random guessing, and a higher value means a better model.
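The AUC can be computed without drawing the curve, because the area under the ROC curve equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The sketch below applies this to link prediction by comparing held-out (probe) links with nonexistent links; the function name and the example scores are hypothetical.

```python
def auc_score(probe_scores, nonexistent_scores):
    """Probability that a probe link outranks a nonexistent link; ties count 0.5."""
    higher = ties = 0
    for p in probe_scores:
        for q in nonexistent_scores:
            if p > q:
                higher += 1
            elif p == q:
                ties += 1
    total = len(probe_scores) * len(nonexistent_scores)
    return (higher + 0.5 * ties) / total

# Hypothetical scores: two probe links and three nonexistent links.
demo_auc = auc_score([0.9, 0.4], [0.4, 0.1, 0.2])
```

This exhaustive comparison is quadratic in the number of pairs, so for large networks a random sample of comparisons is a common substitute.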

4.1.2. Error Rate

We also defined an error metric to evaluate the difference between the original and estimated networks, whose adjacency matrices are denoted by $A$ and $\hat{A}$, respectively. The error metric is computed from the difference between $A$ and $\hat{A}$, measured in the Frobenius norm $\|\cdot\|_{F}$.
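A Frobenius-norm error between two adjacency matrices can be sketched as follows. The normalisation by the norm of the original matrix is an assumption made for this example, as are the function names; the exact definition used in the experiments may differ.

```python
import math

def frobenius_norm(m):
    """Square root of the sum of squared entries of a matrix given as nested lists."""
    return math.sqrt(sum(v * v for row in m for v in row))

def recovery_error(a, a_hat):
    """Relative error ||A - A_hat||_F / ||A||_F (the normalisation is an assumption)."""
    diff = [[x - y for x, y in zip(row_a, row_h)] for row_a, row_h in zip(a, a_hat)]
    return frobenius_norm(diff) / frobenius_norm(a)

# Hypothetical 3-node path and an estimate that misses the edge between nodes 1 and 2.
path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
path_a_hat = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
demo_error = recovery_error(path_a, path_a_hat)
```

Under this normalisation, a perfect recovery gives 0 and an empty estimate gives 1.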

4.2. Real-World Datasets

In this paper, we consider four representative networks drawn from disparate fields (data sources: (1) http://www-personal.umich.edu/~mejn/netdata/ (Football and Dolphin) and (2) http://vlado.fmf.uni-lj.si/pub/networks/data/ucinet/ucidata.htm (Tailor Shop and Taro)).

We selected four real-world datasets to compare the proposed method with the 13 similarity measures. Table 2 shows the basic topological features of the four real-world networks, wherein $N$ and $M$ are the total numbers of nodes and links, respectively. $e$ is the network efficiency [20], which is defined as $e = \frac{1}{N(N-1)} \sum_{x \neq y} \frac{1}{d_{xy}}$, where $d_{xy}$ denotes the shortest distance between nodes $x$ and $y$, and $d_{xy} = +\infty$ if $x$ and $y$ are in two different components. $C$ and $\langle k \rangle$ denote the clustering coefficient [21] and average degree, respectively, and $r$ is the network assortative coefficient [22]. The network is assortative if a node tends to connect with nodes of similar degree. $r > 0$ means that the entire network has assortative structure, and a node with large degree tends to connect to other nodes with large degree. $r < 0$ means that the entire network has disassortative structure, and $r = 0$ implies that there is no degree correlation in the network structure. Figure 1 shows the degree distributions of the four datasets. It is clear that the degree distributions of the four datasets are considerably different.
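Three of the topological features listed above can be computed with a short plain-Python sketch. The function names and the toy graph are hypothetical; efficiency is averaged over ordered node pairs, unreachable pairs contribute zero, and the clustering coefficient is the average of the local coefficients, with nodes of degree below two contributing zero.

```python
from collections import deque

def average_degree(adj):
    """Mean number of neighbours per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def efficiency(adj):
    """Mean of 1/d_xy over ordered pairs x != y; unreachable pairs contribute 0."""
    n = len(adj)
    total = 0.0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search yields shortest path lengths
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for node, d in dist.items() if node != source)
    return total / (n * (n - 1))

def clustering_coefficient(adj):
    """Average over nodes of (links among neighbours) / (possible links among neighbours)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # fewer than two neighbours: local coefficient is 0
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# Hypothetical toy graphs with integer node labels.
path_adj = {0: {1}, 1: {0, 2}, 2: {1}}
triangle_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
```

For the three-node path, the efficiency is 5/6 and the clustering coefficient is 0, whereas the triangle attains 1 for both.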

4.3. Experimental Settings

To evaluate the performance of the algorithms with different sized training sets, we randomly selected 60%, 70%, 80%, and 90% of the links from the original network as training sets. Only the information in the training set was used for estimation. Each training set was executed 10 times. We chose 10-fold cross validation because it is widely accepted in machine learning and data mining research. Thirteen similarity measures were implemented following Zhou et al. [14], and the SGGM was implemented using the SLEP toolbox [23]. We evaluated the performance of the SGGM algorithm with four different sample scales, expressed relative to the number of nodes $N$. The parameter $\lambda$ in the SGGM controls the sparsity of the estimated model: a larger $\lambda$ gives a sparser estimated network. Here, $\lambda$ was varied over a fixed range in uniform steps.

4.4. Results
4.4.1. Tuning Parameter for the SGGM

Using the training set that was scaled to 90%, we tested the proposed SGGM with different values of $\lambda$ and different sample scales. As shown in Figure 2, the AUC of the SGGM increased with increasing sample scale for the four datasets. One advantage of the SGGM is that it is not sensitive to the parameter $\lambda$. We then fixed the sample scale. Figure 3 shows that the AUC increased as the training set scale increased. For the Tailor Shop dataset, however, the proposed algorithm performs best with the 70% training set.

4.4.2. Comparison on AUC

Table 3 lists the results for the 14 methods with the Football dataset. The SGGM performs best when 80% or 90% of the links are used for training and the number of samples is sufficiently large. The prediction accuracy of the SGGM increased by 5% compared to the other 13 methods. However, with lower proportions of the training set, the SGGM's AUC is very close to that of the other methods. Note that PA yielded the poorest result with this dataset.

As shown in Table 4, the SGGM method performs optimally with the Dolphin dataset. The prediction accuracy of the SGGM method increased by at least 3% compared to the other thirteen methods. PA is optimal among the similarity measures, and Jaccard yielded the poorest result.

As shown in Table 5, the SGGM method outperforms the other 13 methods with the Taro dataset. The AUC improved by at least 10%. For the Taro dataset, 70% of the nodes have degree 3, which indicates a low degree of heterogeneity; thus, the performances of the other 13 methods are very close to one another. One might expect PA to show good performance on assortative networks and poor performance on disassortative networks. However, the experimental results show no obvious correlation between PA performance and the assortative coefficient.

For the Tailor Shop dataset, PA performs best when the 60% and 70% training sets are used, while the SGGM method performs best with the 80% and 90% training sets (Table 6). The SGGM's prediction accuracy gradually increased as the number of samples increased. Thus, when sampling conditions permit, drawing more samples could improve prediction accuracy.

4.4.3. Comparison on ROC

The ratio of the training set was set to 90%, and the sample scale of the SGGM was varied over four values. CN was chosen as the representative neighbourhood-based measure because six other measures (Salton, Jaccard, Sørensen, HPI, HDI, and LHN) are variants of CN.

The ROC of the RA index is similar to that of the AA index. As observed in Figure 4, in most cases, the SGGM shows an improvement over existing similarity measures when sufficient samples were used. Note that more samples lead to higher accuracy.

4.4.4. Comparison on Error Rate

The error rate is a metric to evaluate the difference between two matrices. A lower error rate indicates that the estimated matrix approximates the original matrix. From Table 7, we observe that the SGGM has the lowest error rate among all methods shown. To show the effectiveness of the proposed SGGM, both the original adjacency matrix and estimated adjacency matrix are presented as a coloured graph in Figure 5. We used the Taro dataset with a 90% training set to recover the original network. Graphs with fewer red points indicate good recovery of the original matrix. As can be seen from Figure 5, using the SGGM method can restore the original image, whereas the other similarity measures return greater error rates. Note that the SGGM error rate decreased with increasing sample scale.

4.5. Large Datasets

The main challenge of link prediction is dealing with large networks. We carefully chose four large datasets from four different domains. For the SGGM method, a single parameter setting was used, and 90% of the data was used as the training set.

To assess the performance of the SGGM, we compare the proposed method with the other 13 methods on these large datasets (for details see Table 8) in terms of AUC and error rate (data sources: (1) http://konect.uni-koblenz.de/networks/ (Email), (2) http://www3.nd.edu/~networks/resources.htm (Protein), (3) http://www-personal.umich.edu/~mejn/netdata/power.zip (Grid), and (4) http://konect.uni-koblenz.de/networks/ (Astro-ph)).

To estimate the large networks efficiently, we used QUIC [19] instead of glasso [7] to solve the sparse inverse covariance estimation problem (formula (1)). Overall, the proposed method can efficiently recover the original network and predict the missing links accurately. As observed from Table 9, the AUC of the SGGM is the highest on all four datasets. For the Email, Protein, Grid, and Astro-ph datasets, the SGGM improves the AUC by 9.2%, 12.3%, 23.1%, and 5.3%, respectively. For the error rate metric, as shown in Table 10, the SGGM method performs best on all four large datasets. This indicates that the SGGM method can accurately recover the original network.

4.6. Analysis

In the comparative analysis of the performance of the 14 methods using the eight real-world datasets, the SGGM method was outstanding in terms of AUC and error rate. This method can accurately recover the original adjacency matrix and, in most cases, requires few samples, that is, far fewer than the actual number of nodes. The SGGM method's prediction precision gradually increased with the number of samples; moreover, the recovery error rate decreased as matrix sparsity increased. Therefore, the SGGM method can be applied to small samples; however, it performs better with more samples. These results demonstrate that the SGGM method can be implemented in a scalable fashion.

5. Conclusions

Link prediction is a basic problem in complex network analysis. In real-world networks, obtaining node and edge properties is cumbersome. Link prediction methods based on network structure similarity do not depend on node properties; however, these methods are limited by network structure and are difficult to adapt in different network structures.

Our method does not depend on node properties and extends link prediction to diverse network topologies that existing methods handle poorly. We sampled the original network adjacency matrix and used the SGGM method to depict the network structure.

Most nodes are independent in real networks; thus, we used the SGGM method to estimate a precision matrix of the adjacency matrix to predict links. Our experimental results show that the SGGM method recovers network structure better than 13 mainstream similarity methods. We tested the proposed method with eight real-world datasets and used the AUC and error rate as evaluation indexes. We found that this method obtains higher prediction precision. We also analysed the influence of different parameters on the SGGM method and found no significant influence. The SGGM method returns a high AUC value within a certain range. Furthermore, the proposed method retains high prediction precision with fewer training samples, and it can be applied to large networks.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank Longqi Yang, Zhisong Pan, and Guyu Hu for their contributions to work on link prediction and Zhen Li and Junyang Qiu for their valuable discussions, comments, and suggestions. This work was supported by the National Technology Research and Development Program of China (863 Program) (Grant no. 2012AA01A510), the National Natural Science Foundation of China (Grant no. 61473149), and the Natural Science Foundation of Jiangsu Province, China (Grant no. BK20140073).