Dynamic Influence Prediction of Social Network Based on Partial Autoregression Single Index Model
Everything in the world is connected. From small groups to global societies, the interactions among people, technology, and policies require sophisticated techniques to be understood and forecast. In social networks, it has been shown that a microblog user's influence and microblog grade are nonlinearly dependent. However, to the best of our knowledge, nonlinear influence prediction in social networks has not been explored in the existing literature. This article proposes a partial autoregression single index model that flexibly combines network structure (linear) and static covariates (nonparametric). Compared with previous work, our model imposes fewer restrictions and applies to more settings. Profile least squares estimation is employed to fit this semiparametric model, and variable selection is performed via the smoothly clipped absolute deviation (SCAD) penalty. Simulations are conducted to demonstrate the finite-sample behavior.
The 21st century has seen an explosion of data collection in the era of Big Data. The focus has shifted from time series and longitudinal data to network data, which capture the details of complex relationships rather than isolated individuals. The rise of such data has been fundamental in many areas, including biomedical sciences, transportation, socialization, and physics.
For traditional dynamic data, time series analysis has played an important role. Common univariate models include ARMA, ARIMA, ARCH, and GARCH. Multivariate time series analysis also pays attention to the relationships between variables; classical models such as vector autoregression (VAR) and the state space model have been well studied, as illustrated in Fan (2003) and Box (2010). Longitudinal data sets, which arise often in the applied sciences, attract scientific interest either in the pattern of change over time or simply in the dependence of the outcome on the covariates. Statistical methods for longitudinal data are well developed. The earliest research used parametric regression methods, including the linear model, the generalized linear model, and the mixed effect model. For cases where the true model is unclear, Yang et al. (2009) applied a single index model to longitudinal data and established large-sample properties of the estimated index coefficients. The semiparametric regression model combines the advantages of parametric and nonparametric regression models; Li (2015) analyzed longitudinal data with a partial linear single index model. In addition, partial differential equations (PDEs), which contain unknown multivariable functions and their partial derivatives, can also describe multidimensional dynamic systems; see Wu (2014) and Chen (2017).
Turning to another type of data, network analysis offers a wide variety of tools for observing complex connected systems and has changed the way we perceive and analyze networks. The ability to store and operate on network data in a digital environment has enabled a multiplicity of new analytic methods. Beyond the well-developed static network analysis, researchers have made a substantial effort to explore the inherent evolution of dynamic networks. Recently, anomaly detection, specifically the incremental community detection of dynamic networks, has attracted a lot of attention. Bansal et al. developed iterative execution at the vertex level based on the CNM algorithm, and Nguyen et al. proposed the rule-based QCA algorithm.
For complex dynamic networks, Lu (2016) extended the H-index to quantify nodal influence, with emphasis on the spreading influence of a node. We focus instead on predicting nodal influence in social networks. The current frontier is the Network Vector Autoregression (NAR) model (Zhu, 2017), which uses a linear regression approach that synthesizes time series dynamics and network structure. However, such a linear model cannot always fit complex and volatile configurations. Motivated by the NAR model, this paper introduces a competitive semiparametric approach that applies partially linear single index models to dynamic autoregressive network analysis and inherits the advantages of both models.
When conducting quantitative analysis, the true model and its variables are usually unknown, so the working model may be misspecified. A crucial problem in building a multiple regression model is the selection of predictors to include. When the number of predictors is large compared with the sample size, it is desirable to produce sparse models that involve only a small subset of them. To improve prediction accuracy, variable selection should be done in advance. Many selection criteria and penalties are in common use, such as AIC, BIC, Cp, LASSO, group LASSO, SCAD, and elastic net. It is worth noting that SCAD, which is employed in our paper, enjoys the oracle property.
The rest of the paper is organized as follows. Section 2 constructs the partial autoregression single index model and introduces its related properties. Section 3 adopts profile least squares for parameter estimation. Simulations are carried out in Section 4 to assess the finite-sample performance of this method and to show that our model attains a smaller MSE than the NAR model. Our proposed model can be applied to dynamic influence (response variable) prediction for users. Specifically, nonlinear static covariates, such as a user's age, gender, and registration time, are modeled in the single index part. The current average effect of other users is quantified linearly, while the previous nodal influences enter autoregressively. However, since to our knowledge no appropriate dataset is available, we have not performed this empirical analysis. Relevant technical proofs are deferred to the appendix.
Define the number of nodes in the social networks as , which usually takes large value. To illustrate the relationship among nodes, we denote the adjacency matrix , where . If , node can be affected by nodes ; otherwise, . represents the response of the th node at time . For each node , assume that there exists -dimensional time-independent feature covariate . Considering the flexible correlation between above explanatory and response variables, we present the following partial autoregression single index (PASI) model. We merely discuss the case of this model for simplicity since they share similar inferences and properties. where , called out-degree , denotes the total number of edges from node to others. The error term is independent of ; and for .
This PASI model is divided into linear and nonlinear parts. Nonlinear characterizes the influence of on the dependent variable, which extends the model range of nodal effect. Let denote the covariable coefficient (i.e., nodal effect coefficient). In the linear part, indicates the average effect of others on the node at time , and is the network effect coefficient. shows the standard autoregression effect, which means the observation at moment is correlated to , and is autoregression coefficient.
For convenience, define , , , , . Formula (2) can be rewritten as where , and is a row-normalized adjacency matrix. is the identity matrix with matching dimensions. Suppose that is a nonrandom matrix, and are nonrandom but is random.
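To make the structure of the model concrete, the following Python sketch row-normalizes an adjacency matrix and advances the response vector by one time step under a first-order recursion with a single index part, a network part, and an autoregressive part. The function names `row_normalize` and `pasi_step`, the symbols `g`, `gamma`, `beta1`, `beta2`, and the choice to leave the rows of isolated nodes at zero are assumptions of this sketch rather than notation fixed by the paper.

```python
import numpy as np

def row_normalize(A):
    """Divide each row of a 0/1 adjacency matrix by its out-degree."""
    A = np.asarray(A, dtype=float)
    out_degree = A.sum(axis=1)
    # Assumption: a node with no outgoing edges keeps an all-zero row.
    safe_degree = np.where(out_degree > 0, out_degree, 1.0)
    return A / safe_degree[:, None]

def pasi_step(y_prev, W, Z, gamma, beta1, beta2, g, eps):
    """One time step: nodal effect + network effect + autoregression + noise.

    y_prev : responses at the previous time point, shape (N,)
    W      : row-normalized adjacency matrix, shape (N, N)
    Z      : static covariates, shape (N, p)
    gamma  : single index coefficients, shape (p,)
    g      : link function applied elementwise to Z @ gamma
    eps    : error vector for the current time point, shape (N,)
    """
    nodal = g(Z @ gamma)              # nonlinear single index part
    network = beta1 * (W @ y_prev)    # average effect of followed nodes
    autoreg = beta2 * y_prev          # effect of the node's own past
    return nodal + network + autoreg + eps
```

Iterating `pasi_step` from an initial response vector yields a simulated response sequence for a given network.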
It is worth noting that the nonparametric/nonlinear effect of the static covariate is in fact nontrivial. Ma et al. (2013) concluded that a microblog user's influence (i.e., ) and microblog grade (i.e., certain component of ) are nonlinearly dependent. Thus, the purely linear NAR model is not sufficient, and the PASI model is established, which has fewer restrictions and can be applied to more areas. Similar to the NAR model, we address the identification of PASI model (3). The proof of the following theorem is deferred to the appendix.
Theorem 1. Suppose that and is a fixed value. If , there exists a unique strictly stationary solution for model (3), and this analytic solution can be expressed as Based on the form of solution (4), it is convenient to deduce the conditional distribution of given nodal information . Denote and for simplicity. For any integer , the conditional autocovariance is . It is easy to prove for and for . The conditional mean and variance of are given in the following proposition.
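A quick numerical stability check can be sketched as follows. It computes the spectral radius of beta1*W + beta2*I for a row-normalized W; because each row of W sums to at most one, this radius never exceeds |beta1| + |beta2|. Reading `beta1` and `beta2` as the network and autoregression coefficients, and using a spectral radius below one as the stability criterion, are assumptions of this sketch in the spirit of the NAR model, not a restatement of the condition in Theorem 1.

```python
import numpy as np

def spectral_radius(beta1, beta2, W):
    """Largest absolute eigenvalue of G = beta1 * W + beta2 * I."""
    W = np.asarray(W, dtype=float)
    G = beta1 * W + beta2 * np.eye(W.shape[0])
    return float(np.max(np.abs(np.linalg.eigvals(G))))

def looks_stable(beta1, beta2, W):
    """True when the first-order recursion driven by G does not explode."""
    return spectral_radius(beta1, beta2, W) < 1.0
```

For example, with two nodes that follow each other, `spectral_radius(0.3, 0.5, [[0, 1], [1, 0]])` equals |0.3| + |0.5| up to rounding, so the bound is attained in that case.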
Proposition 2. Assume the conditions of Theorem 1 hold. Conditioned on a given , the strictly stationary solution (4) follows the normal distribution with mean and covariance as follows: where is the vectorization of matrix and marks Kronecker product. According to formula (5), the conditional mean of depends on four factors: the effect of the node , the effect coefficient of the network , the autoregression coefficient , and the structure of the network . In order to explain this proposition, we will discuss several special cases as below. The proof for Proposition 2 including the following three cases is given in the appendix.
Case 1. Suppose for ; that is to say, each node has the same nodal effect. Without losing generality, set . It can be proved that eigenvector of is 1, and the corresponding eigenvalue is , where 1= with matching dimensions. Then 1=1. Through formula (5), we obtain . Obviously, the conditional mean is irrelevant to the network structure , and the stronger the network effect (i.e., ) or the autoregression coefficient (i.e., ), the larger the value the node mean takes. In this case, when .
Case 2. Assuming for , every node is connected to the others, which makes the network fully connected and extremely dense. Generally, this kind of network does not exist. Therefore, we only intend to complete the theoretical integrity. In this case, it can be proved that where , . Regarding (5), single node mean depends on its nodal effect apparently. Then the node mean will increase with rising. Under the stability condition , can be easily proved too.
Case 3. First-order Taylor expansion: It is difficult to explain the general network structure through (5) and (6). Accordingly, we use the first-order Taylor expansion of to approximate and . Note that if is relatively large (small), this approximation performs poorly (well). Based on (8) and (9), It should be noted that the of formula (10) represents the average nodal impact from its neighbors. We attribute this to the local impact of the node and denote . In addition, (10) shows that a node’s large impact and local impact will amplify its mean. By formula (11), the conditional variance of any node is determined only by the covariance of the autoregression coefficient of and the variance of . Finally, (12) implies that the correlation of the interconnected nodes () is stronger than that of unconnected ones ().
3. Parameter Estimation
As shown in the previous section, the PASI model combines the advantages of the partial linear single index model and the NAR model. We now employ the profile least squares method to estimate its parameters.
Denote as the th column of . Then Model (2) can be reconfigured as follows. Let , . For simplicity, we drop the subscript of , , , and in the above formula: where , with the first element positive to ensure identifiability.
Define , , . Then (14) can be further expressed as . Suppose that the unknown link function is twice differentiable. In a neighborhood of , we have , and local linear regression can be used to estimate it by minimizing the objective function below: where , is a kernel function, and is the bandwidth. We use cross-validation (CV) to select the optimal bandwidth.
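The local linear step can be illustrated with a short Python sketch. It estimates the link function and its derivative at a point `u0` by kernel-weighted least squares on the index values. The Gaussian kernel, the function names, and the argument names `u`, `y`, `h` are assumptions of this sketch; the paper does not state which kernel was used.

```python
import numpy as np

def gaussian_kernel(t):
    """Standard normal density, used here as an illustrative kernel."""
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def local_linear(u0, u, y, h, kernel=gaussian_kernel):
    """Minimize sum_i K_h(u_i - u0) * (y_i - a - b * (u_i - u0))**2.

    Returns (a, b): a estimates the link function at u0 and
    b estimates its first derivative at u0.
    """
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=float)
    d = u - u0
    w = kernel(d / h) / h                      # kernel weights K_h
    X = np.column_stack([np.ones_like(d), d])  # local design matrix
    XtW = X.T * w                              # weight each observation
    a, b = np.linalg.solve(XtW @ X, XtW @ y)   # weighted normal equations
    return a, b
```

A cross-validated bandwidth could be chosen by evaluating the leave-one-out prediction error of `local_linear` over a grid of `h` values.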
Herein we use the Newton-Raphson iterative algorithm, which alternately updates the estimates of the nonparametric and parametric parts through their corresponding objective functions. The procedure continues until the parameters converge. can be obtained when the iteration stops. Accordingly, we get the following.
Besides, as the consistency and asymptotic normality of the NAR model have been confirmed, the large sample properties of our estimations behave well since nonlinear has linear expansion. The related proofs are omitted.
Now we consider the original PASI model (1) with -order autoregression. When the numbers of covariates and autoregression terms are large compared with the sample size , it is desirable to produce sparse models that involve only a small subset of predictors. With such models one can improve the prediction accuracy and enhance the model interpretability. To this end, we use the following objective function to reduce the dimension of models: where is a penalized function with tuning parameter . It is noteworthy that each component of and has different penalty functions with different tuning parameters. In order to select autoregression terms only, we set and adjust the objective as below: Analogous work can be done to select Z-variables: A number of penalty functions have been studied by scholars. In this paper, we use SCAD penalty for the sake of oracle property, whose first-order derivative is and , where and is the hinge loss function. The tuning parameters and are chosen by BIC. In the end, we obtain the penalized estimators by minimizing with respect to and .
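The SCAD derivative is simple to evaluate numerically. The sketch below uses the standard form due to Fan and Li: for a nonnegative argument t, the derivative equals lam when t <= lam, equals max(a*lam - t, 0) / (a - 1) when t > lam, and is therefore zero once t >= a*lam. The function name `scad_derivative` and the default `a = 3.7`, a commonly recommended value, are assumptions here.

```python
import numpy as np

def scad_derivative(theta, lam, a=3.7):
    """First derivative of the SCAD penalty, evaluated at |theta|."""
    t = np.abs(np.asarray(theta, dtype=float))
    # Small coefficients are penalized at the constant LASSO rate.
    small = (t <= lam).astype(float)
    # The rate tapers linearly and reaches zero once t >= a * lam.
    taper = np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam) * (t > lam)
    return lam * (small + taper)
```

Because the derivative vanishes for large coefficients, large effects are left essentially unpenalized, which is the mechanism behind the oracle property mentioned above.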
4. Numerical Simulation and Results
We demonstrate the advantage of our proposal by three different generation mechanisms corresponding to different adjacency matrices in the following subsections. We ran a simulation with data generated according to where and . The random error (i.e., ) comes from the standard normal distribution . The covariate follows a multivariate normal distribution. Its mean and covariance matrix are and , where . We set the parameter of the covariate to be . The above model has been analyzed by Carroll et al. (1997) . In order to generate a sequence of response variables, we need to generate the initial value of dependent variable based on the strictly stationary distribution given in Proposition 2. Once the initial value of dependent variable () is generated, the response sequence () can be generated based on formula (3).
4.1. Social Networking Generation Mechanism
4.1.1. Dyad Independence Model
Define the dyad: , . The dyad independent model  assumes that is independent. represents ; namely, node and node are interconnected. indicates ; that is, node points to node , but is opposite to . and belong to one-side connection. means node has nothing to do with node .
For the purpose of ensuring the sparsity of network, we set and , and it is easy to calculate . Finally, set and . Figure 1 shows the network graph obtained by mechanism of dyad independence model. The histograms of the in-degree and out-degree for the network nodes are given in Figure 2 when N = 100. From Figure 2, it can be seen that both in-degree and out-degree distribution are asymptotically normal.
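A dyad independence network can be generated with a few lines of Python. In this sketch each unordered pair of nodes is drawn independently as mutual, one-way in either direction, or unconnected. The function name and the probabilities `p_mutual` and `p_oneway` are hypothetical parameters of the illustration; the values used in the simulation are not restated here.

```python
import numpy as np

def dyad_independence_network(n, p_mutual, p_oneway, rng):
    """Draw each dyad (a_ij, a_ji) independently.

    p_mutual : probability of (1, 1)
    p_oneway : probability of (1, 0), and also of (0, 1)
    Requires p_mutual + 2 * p_oneway <= 1.
    """
    A = np.zeros((n, n), dtype=int)
    i, j = np.triu_indices(n, k=1)   # every unordered pair with i < j
    u = rng.random(i.size)           # one uniform draw per dyad
    mutual = u < p_mutual
    i_to_j = (u >= p_mutual) & (u < p_mutual + p_oneway)
    j_to_i = (u >= p_mutual + p_oneway) & (u < p_mutual + 2 * p_oneway)
    fwd = mutual | i_to_j
    bwd = mutual | j_to_i
    A[i[fwd], j[fwd]] = 1
    A[j[bwd], i[bwd]] = 1
    return A
```

Choosing probabilities that shrink as `n` grows keeps the generated network sparse.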
4.1.2. Stochastic Block Model
A stochastic block model  divides the N individuals in the network into community locations (). In other words, there exists a mapping : . For example, if an individual is in the community , then . As can be seen, the stochastic block model is a simplified representation of the multiple relationships of the network, and it explains the overall structure characteristics of the network. It is noteworthy that stochastic block model is a study at the location level rather than an individual level.
We randomly assign a block label (i.e., ) to each network node. Set ; it indicates that there are communities in the network. If the node and the node belong to the same community, then we set ; otherwise . Nodes in the same community are more likely to be connected by an edge than nodes in different communities. Set , . Figure 4 shows the network graph generated by a stochastic block model with 100 nodes and 5 communities. Under the same conditions, the histograms of the out-degree and in-degree of the network nodes generated by the stochastic block model are given in Figure 5. In this figure, the horizontal and vertical axes are the out-degree (or in-degree) and frequency, respectively. From Figure 5, it can be seen that the distributions of the out-degree and in-degree are right-skewed.
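The block mechanism described above can be sketched in Python as follows. Labels are assigned uniformly at random, and an edge is drawn with a higher probability when two nodes share a label. The function name and the edge probabilities `p_within` and `p_between` are hypothetical parameters of the illustration, not the settings of the simulation.

```python
import numpy as np

def stochastic_block_network(n, k, p_within, p_between, rng):
    """Generate a directed network with k randomly assigned communities."""
    labels = rng.integers(0, k, size=n)              # block label per node
    same_block = labels[:, None] == labels[None, :]  # pairwise label match
    edge_prob = np.where(same_block, p_within, p_between)
    A = (rng.random((n, n)) < edge_prob).astype(int)
    np.fill_diagonal(A, 0)                           # no self-loops
    return A, labels
```

Setting `p_within` well above `p_between` produces the dense within-community and sparse between-community pattern that characterizes this model.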
4.1.3. Power Law Distribution Model
The power law distribution model reveals the scale-free property of social networks; that is, most users have few social relations, and only a very small number of users have many. The scale-free characteristic of the network means that the node degree follows a power law distribution; that is, , where is the power law index and represents the proportion of nodes with a degree in the total network.
Let the in-degree of nodes in the network follow the power law distribution; that is, ( is a constant). We set the power index to 2, 3, and 5. As shown in Figure 9, the smaller the power index is, the longer the tail of the in-degree distribution is. Set , . Figure 7 is a network graph generated with node degrees that obey the power law distribution (). Figure 8 shows histograms of the network's in-degree and out-degree distributions.
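One way to generate such a network is sketched below in Python. Each node's in-degree is drawn from a discrete power law truncated at `k_max`, and that many followers are then chosen uniformly at random. The function name, the truncation point, and the uniform choice of followers are assumptions of this sketch. The code adopts the convention that `A[i, j] = 1` means node `i` is affected by node `j`, so the in-degree of node `j` is the `j`-th column sum.

```python
import numpy as np

def power_law_network(n, alpha, rng, k_max=None):
    """In-degrees follow P(d = k) proportional to k**(-alpha), k = 1..k_max."""
    k_max = n - 1 if k_max is None else k_max
    ks = np.arange(1, k_max + 1)
    pmf = ks.astype(float) ** (-alpha)
    pmf /= pmf.sum()                          # normalizing constant
    in_degree = rng.choice(ks, size=n, p=pmf)
    A = np.zeros((n, n), dtype=int)
    for j in range(n):
        others = np.delete(np.arange(n), j)   # exclude self-loops
        followers = rng.choice(others, size=in_degree[j], replace=False)
        A[followers, j] = 1                   # each follower points to node j
    return A, in_degree
```

A smaller `alpha` puts more mass on large degrees, which matches the longer tail noted above for smaller power indices.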
Figures 3, 6, and 10 show sequence diagrams and histograms under the three network generation mechanisms. The vertical axis of the left graph in each figure denotes the average response at each time point; that is, . The horizontal axis denotes the 11 time points. The right graph in each figure is a histogram of the nodes' dependent variable. Its horizontal axis represents the sum of the response variables of each node over the 11 time points, and its vertical axis denotes the frequency. The distribution appears approximately normal; namely, the middle part is high, and the sides are low.
Based on the parameter settings in Section 4.1, we next explore the consistency of parameter estimation via the mean squared error (MSE) criterion. First, we simulate networks generated by the three generation mechanisms with distinct network sizes (i.e., ). Then, multiple experiments are carried out with M = 30 repetitions, where () denotes the parameter estimate in the th experiment. Finally, we derive the MSE by averaging the squared difference between the true parameter and the estimated parameter over the repetitions.
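The MSE criterion can be computed as in the following Python sketch, which averages the squared estimation error over the M repetitions separately for each parameter. The function name and the array layout (one row per repetition) are assumptions of this sketch.

```python
import numpy as np

def parameter_mse(estimates, truth):
    """Per-parameter MSE over repeated experiments.

    estimates : array of shape (M, d), one row of estimates per repetition
    truth     : array of shape (d,), the true parameter values
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # Average the squared error over the M repetitions (axis 0).
    return np.mean((estimates - truth) ** 2, axis=0)
```

Calling `parameter_mse` for each network size and comparing the results gives the kind of size-by-size comparison used to assess consistency.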
Before constructing the PASI model, we need to examine the correlation between the explanatory variables and the response variable to determine the linear and nonlinear parts of the model. A correlation matrix can be used to represent the linear relationship between the response variable and the explanatory variables. If the correlation coefficient between the response variable and an explanatory variable is large, that variable is assigned to the linear part of the model. Otherwise, the variable is assigned to the nonlinear part. In addition, we test whether it is reasonable to fit a linear regression, as presented in Table 1.
Figure 11 is a correlation coefficient matrix. The columns in this figure represent the dependent variable , the covariables , and independent variable from left to right. We can see correlation coefficients and . That is to say, the correlation of and with is stronger than the other three variables. Table 1 shows the results of linear regression analysis between dependent and independent variables. As a result, coefficients and are significant while , , and are not significant. Therefore, it is not appropriate to use , , and as linear parts. Combining Figure 11 with Table 1, we set and as linear part and the remaining covariables are used as variables for nonparametric part in the PASI model.
The MSEs of the parameter estimates are displayed in Tables 2, 3, and 4, which correspond to the dyad independence model, the stochastic block model, and the power law distribution model, respectively. These tables present consistent results. As expected, the MSE of the parameter estimates decreases as the network size grows, which supports the consistency of the estimators.
To compare the results provided by the PASI and NAR models, we summarize the MSE of the dependent variable over when in Figure 12. One can see that the MSEs of the PASI model are smaller than those of the NAR model under all three generation mechanisms of the adjacency matrix. This suggests that PASI is a competitive tool for analyzing the nodal effect with finite samples in social networks.
Proof of Theorem 1. We review formula (3) . Based on time series theory, we firstly center the sequence because . we get the expectation of . is centralized sequence of ; we have where denotes delay operator.
Proof of Proposition 2.
Proof of Case 2.
Proof (9). Define
Proof (16). The two derivatives are set to zero. According to (A.12), we have the following. Plugging (A.14) into (A.13), we can get (here define ) the following. Note that ; (16) can be obtained. Here .
Since the data used in the Network Vector Autoregression (NAR) study are not public, we have not conducted an empirical analysis.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This paper is supported by NNSF projects of China (11501314, 11701318, 11871294), NSF project of Shandong Province of China (ZR2019PF012, ZR2019BA028), and a project of Shandong Province Higher Educational Science and Technology Program (J18KA356).
J. D. Cryer and K. S. Chan, “Time series analysis: with applications in R,” Journal of the Royal Statistical Society, vol. 174, no. 2, pp. 507–507, 2011.
N. P. Nguyen, T. N. Dinh, Y. Xuan, and M. T. Thai, “Adaptive algorithms for detecting community structure in dynamic social networks,” in Proceedings of the IEEE INFOCOM 2011, pp. 2282–2290, China, April 2011.
L. Y. Lü, T. Zhou, Q. M. Zhang, and H. E. Stanley, “The h-index of a network node and its relation to degree and coreness,” Nature Communications, vol. 7, Article ID 10168, 2016.
C. Gerda and H. N. Lid, “Model selection and model averaging,” International Encyclopedia of the Social & Behavioral Sciences, vol. 172, no. 4, pp. 937–937, 2010.
J. E. Taylor, K. J. Worsley, and F. Gosselin, “Model selection and estimation in the gaussian graphical model,” Biometrika, vol. 94, no. 1, pp. 19–35, 2007.
J. Ma, G. Zhou, B. Xu, and Y. Z. Huang, “Analysis of user influence in microblog based on individual attribute features,” Application Research of Computers, vol. 30, no. 8, pp. 2483–2487, 2013.
Y. Virkar and A. Clauset, “Power-law distributions in empirical data,” SIAM Review, vol. 51, no. 4, pp. 661–703, 2009.