Mathematical Problems in Engineering

Volume 2016, Article ID 7213432, 11 pages

http://dx.doi.org/10.1155/2016/7213432

## Link Prediction via Sparse Gaussian Graphical Model

College of Command Information System, PLA University of Science and Technology, Nanjing 210007, China

Received 10 November 2015; Revised 27 January 2016; Accepted 27 January 2016

Academic Editor: David Bigaud

Copyright © 2016 Liangliang Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Link prediction is an important task in complex network analysis. Traditional link prediction methods are limited by network topology and by the lack of node property information, which makes predicting links challenging. In this study, we address link prediction using a sparse Gaussian graphical model and demonstrate its theoretical and practical effectiveness. In theory, link prediction is executed by estimating the inverse covariance matrix of samples to overcome information limits. The proposed method was evaluated with four small and four large real-world datasets. The experimental results show that, compared to 13 mainstream similarity methods, the area under the curve (AUC) value obtained by the proposed method improved by an average of 3% on the small datasets and 12.5% on the large datasets. The method outperforms the baseline method, and its prediction accuracy is superior to that of mainstream methods even when only 80% of the training set is used. It also provides significantly higher AUC values when only 60% of the training set is used on the Dolphin and Taro datasets. Furthermore, the proposed method achieves a better error rate than mainstream methods on all datasets.

#### 1. Introduction

Since 2010, link prediction has become an increasingly distinctive and important part of complex network analysis. Link prediction refers to predicting whether a link is possible between two nodes when links are unknown [1]. It covers both links that exist but are not yet observed and links that may form in the future. Link prediction underlies many data mining problems and lays the foundation for complex network research, because it helps explain both the structure and the evolution of networks. Studying this problem is important from both theoretical and practical perspectives [2]. Existing community detection research is primarily based on an adjacency matrix, and community detection typically depends on the adequacy and completeness of that matrix. Link prediction is therefore instrumental for accurately analysing social-network structures and for improving the accuracy of community detection [2, 3]. It can also be used to predict missing data and to help analyse network evolution [4]. For example, we can use the current network structure to identify users who are friends but have not yet been recognized as such, or users who may become friends.

Link prediction methods have made remarkable achievements in various fields, including biology, social science, security, and medicine. Ermiş et al. [2] address the link prediction problem by data fusion formulated as simultaneous factorization of several observation tensors where latent factors are shared among each observation; some studies turn to multirelational link prediction [3]; Yang et al. also studied evaluation of link prediction [5].

In 2014, Gong et al. extended the Social-Attribute Network framework with several supervised and unsupervised link prediction algorithms and demonstrated the resulting performance improvement [6]. However, such methods have limitations when processing social datasets. First, social datasets are low-quality datasets that include faulty links and noise. Such datasets must be preprocessed before similarity measurement, set partitioning, and common neighbour (CN) counting. Moreover, node properties are cumbersome to obtain, and most social network data yield only a raw adjacency matrix without specific attribute information, because user information is private in most online systems. Consequently, many prediction methods cannot use such properties, and property-based features cannot be calculated. Using only an adjacency matrix avoids the dependence on node properties, which is convenient and feasible.

Many community detection methods are based on an adjacency matrix. Thus, the integrity of the adjacency matrix directly affects the results. Through link prediction using an adjacency matrix, we can determine the relationships between unconnected nodes, and the entire community structure can be obtained by analysing these relationships. Thus, an effective link prediction method is required. This issue presents a series of challenges such as the following: (1) link prediction must function with an adjacency matrix that does not contain properties, (2) a graph structure model for estimating the network is required to determine the role of different types of connections, and (3) verification and evaluation of link prediction are required.

Existing link prediction methods use similarities, node properties, edge properties, and so forth. However, these methods require a large amount of test data and rely heavily on network connectivity and structure; as a result, link prediction without properties is less robust. When the network structure changes, it is difficult to mine the relationships between nodes. Determining how to predict network edges from limited test data is therefore the motivation of this study.

To solve the above problems, this paper presents a link prediction method based on the application of a Gaussian graphical model (GGM) to an adjacency matrix. This concept draws on Friedman et al.’s [7] sparse inverse covariance estimation theory. The study samples from the original adjacency matrix to obtain a sample matrix and then uses the inverse covariance matrix of a sparse GGM (SGGM) for link prediction. The main contributions of this study are as follows:

(1) Sampling of a network to build a feature matrix, seeking maximum likelihood estimation using an SGGM, and estimating an inverse covariance matrix (precision matrix) of the adjacency matrix.

(2) Establishing conditional independence between nodes to complete link prediction using the Markov random field independence principle.

(3) Demonstrating that the proposed method is more effective than previous methods by testing the methods using eight real-world datasets.
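The link between a precision matrix and conditional independence can be illustrated with a small sketch. It is not the sparse estimation procedure used in this paper: it inverts a known covariance matrix exactly instead of estimating a sparse one from samples, and the function names `invert` and `edges_from_precision` as well as the three-variable chain are assumptions made for illustration only.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a small non-singular square matrix exactly (Gauss-Jordan)."""
    n = len(matrix)
    # Build the augmented matrix [A | I] with exact rational entries.
    aug = [
        [Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Swap a row with a non-zero pivot into position `col`.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate the pivot column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def edges_from_precision(precision):
    """Return the pairs (i, j), i < j, whose precision entry is non-zero."""
    n = len(precision)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if precision[i][j] != 0]


# Hypothetical chain X0 -> X1 -> X2, each step adding unit-variance noise.
# Every pair is marginally correlated, including (X0, X2).
covariance = [[1, 1, 1],
              [1, 2, 2],
              [1, 2, 3]]
precision = invert(covariance)
edges = edges_from_precision(precision)
print(precision)  # entry (0, 2) is zero although covariance[0][2] is not
print(edges)
```

In this toy chain the covariance between the first and third variables is non-zero, yet the corresponding precision entry vanishes, so the graph read from the precision matrix contains only the two chain edges. This is the property that makes the precision matrix, and not the covariance matrix, the natural object for deciding which node pairs are directly linked.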

The remainder of this paper is organised as follows. We introduce related work in Section 2, including many previous link prediction methods. In Section 3, we present our SGGM-based link prediction method. In Section 4, we introduce eight real-world datasets and test the methods using these datasets to prove that the proposed method is more effective than previous methods. Finally, we conclude the paper and present suggestions for future work in Section 5.

#### 2. Related Work

##### 2.1. Problem Description

Existing link prediction methods can be divided into three categories.

###### 2.1.1. Similarity Link Prediction Employs Different Methods

One approach computes node similarity from node properties, such as sex, age, occupation, and preferences. Edges are more likely to exist between nodes with high similarity. Another approach is based on the structural similarity of the network, for example, the number of common neighbours (CN) shared by two nodes. However, this approach is only applicable to networks with a high clustering coefficient.

###### 2.1.2. Estimates Based on the Maximum Likelihood Estimation of a Link Can Be Divided into Two Categories

One method is based on network hierarchy, but it has high complexity because it generates many network samples. The other method is based on stochastic block model prediction, wherein nodes are divided into several sets and the probability of an edge between two nodes depends on the sets to which they belong.
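The block-model idea can be sketched as follows, under the simplifying assumption that the partition of nodes into sets is already known; a full stochastic block model would also have to search over partitions. The function name `block_edge_probabilities` and the toy network are hypothetical. For a fixed partition, the maximum likelihood estimate of each block-pair probability is the fraction of possible node pairs between the two sets that are actually connected.

```python
from collections import Counter
from itertools import combinations


def block_edge_probabilities(edges, block_of):
    """Estimate the edge probability for each pair of blocks.

    edges    : iterable of undirected node pairs
    block_of : dict mapping every node to its block label (assumed known)
    """
    edge_set = {frozenset(edge) for edge in edges}
    present = Counter()   # observed edges per block pair
    possible = Counter()  # node pairs per block pair
    for x, y in combinations(sorted(block_of), 2):
        key = tuple(sorted((block_of[x], block_of[y])))
        possible[key] += 1
        if frozenset((x, y)) in edge_set:
            present[key] += 1
    return {key: present[key] / possible[key] for key in possible}


# Toy network: nodes 1-3 in block "A", nodes 4-5 in block "B".
block_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
edges = [(1, 2), (2, 3), (4, 5), (3, 4)]
probs = block_edge_probabilities(edges, block_of)
print(probs)
```

Under these estimates, a missing pair inside block "A", such as nodes 1 and 3, is scored higher than a missing pair across blocks, such as nodes 1 and 5.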

###### 2.1.3. A Link Prediction Model Based on Probability Builds a Model by Adjusting the Parameters

Such a model can fit the structure of the relationships in real networks. Whether a pair of nodes generates an edge is determined by a probability computed with the optimal parameters. A probabilistic model treats the probability that an edge exists as a property, which transforms edge prediction into a property prediction problem. This method takes advantage of both the network structure and node properties and achieves high precision, but it offers poor universality.

Due to the poor universality of maximum likelihood estimation and the probability model, which depend highly on node properties, these methods cannot be applied to many networks. Herein, we consider a link prediction method which is only based on similarity and discuss experiments performed to compare the proposed and previous methods.

##### 2.2. Similarity-Based Link Prediction

Here, we compare the prediction accuracies of 13 similarity measures. All of these measures are based on the local structural information contained in a test set. We first introduce each measure briefly; the formulas are shown in Table 1. Here, $G(V, E)$ is an undirected network, $V$ is the set of nodes, and $E$ is the set of edges. The total number of nodes in the network is $N$ and the number of edges is $M$. For a node $x$ with neighbours $\Gamma(x)$, the degree of $x$ is $k_x = |\Gamma(x)|$. The network has $N(N-1)/2$ node pairs, which form the universal set $U$. Given a link prediction method, each pair of nodes $(x, y)$ without an edge is assigned a score $s_{xy}$. All unconnected node pairs are then sorted by score in descending order, and the pairs at the top of the ranking are considered the most likely to be linked.
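The scoring-and-ranking procedure described above can be sketched with the common neighbours (CN) count as the similarity measure. CN is chosen here only as an illustration; Table 1 lists the measures actually compared. The function name `common_neighbour_scores` and the toy edge list are assumptions.

```python
from itertools import combinations


def common_neighbour_scores(edges):
    """Score every unconnected node pair by its number of common neighbours
    and return the pairs sorted by descending score."""
    neighbours = {}
    for x, y in edges:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)
    scores = {}
    for x, y in combinations(sorted(neighbours), 2):
        if y not in neighbours[x]:  # only pairs without an observed edge
            scores[(x, y)] = len(neighbours[x] & neighbours[y])
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy undirected network with five nodes and six edges.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"),
             ("b", "d"), ("c", "d"), ("d", "e")]
ranking = common_neighbour_scores(toy_edges)
for pair, score in ranking:
    print(pair, score)
```

The pair at the top of the ranking, nodes a and d, shares two neighbours and is therefore the candidate link this measure considers most likely. Other local measures differ only in how the score for each unconnected pair is computed; the ranking step is the same.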