Complexity

Volume 2019, Article ID 6902027, 17 pages

https://doi.org/10.1155/2019/6902027

## Mining the Hidden Link Structure from Distribution Flows for a Spatial Social Network

^{1}School of Finance, Zhejiang University of Finance and Economics, China^{2}School of Data Science, Zhejiang University of Finance and Economics, China^{3}Urban Informatics-Spatial Computing Lab & College of Computing, New Jersey Institute of Technology, USA^{4}School of Economics & Management, Guangxi Normal University, China

Correspondence should be addressed to Xiaoqi Zhang; ude.olaffub@hziqoaix

Received 30 December 2018; Revised 3 March 2019; Accepted 31 March 2019; Published 2 May 2019

Academic Editor: Giulio Cimini

Copyright © 2019 Yanqiao Zheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This study aims at developing a non-(semi-)parametric method to extract the hidden network structure from the -valued distribution flow data with missing observations on the links between nodes. Such an input data type widely exists in the studies of information propagation process, such as the rumor spreading through social media. In that case, a social network does exist as the media of the spreading process, but its link structure is completely unobservable; therefore, it is important to make inference of the structure (links) of the hidden network. Unlike the previous studies on this topic which only consider abstract networks, we believe that apart from the link structure, different social-economic features and different geographic locations of nodes can also play critical roles in shaping the spreading process, which has to be taken into account. To uncover the hidden link structure and its dependence on the external social-economic features of the node set, a multidimensional spatial social network model is constructed in this study with the spatial dimension large enough to account for all influential social-economic factors. Based on the spatial network, we propose a nonparametric mean-field equation to govern the rumor spreading process and apply the likelihood estimator to make inference of the unknown link structure from the observed rumor distribution flows. Our method turns out easily extendible to cover the class of block networks that are useful in most real applications. The method is tested through simulated data and demonstrated on a data set of rumor spreading on Twitter.

#### 1. Introduction

Flow data has been widely studied by different disciplines [1–6]. Especially in recent years, the development of internet makes an increasing amount of flow data sets publicly available, among them new types of flows are emerging and attracted more and more attentions from scholars [7, 8].

Unlike the physical movement, such as the trajectory of taxi, the information flow data, such as the time series of the retweet status of a class of tweet articles within a population, does not contain any trajectory-level information, because a user may tweet after he saw many friends had done so. In that case, a group of friends can contribute to the spreading of the tweet, and it becomes impossible to figure out which one is the real single source, neither is it possible to track the trajectory of retweeting. Therefore, this flow data are no longer stored as a collection of well-defined trajectories; instead, they consist of a time series of distributions of a given kind of information within entire population. In addition, the distribution flows are highly “context-dependent”, which means the social-economic factors behind every agent joining the spreading process (such as the education, income, and the neighborhood) might significantly affect the speed, extent, and coverage of spreading, suggesting a spatial social network to be uncovered from the distribution flows.

Of course, the emergence of new types and new features of flow data inevitably brings unprecedented opportunities to improve our understanding of interaction patterns between people and thus enrich relevant theories, but the missing observation on the trajectory-level information and the add-in of social-economic context make it challenging to uncover the agent-to-agent links, or equivalently the entire hidden interaction network. As a result, it becomes necessary to develop more data-driven approaches tailored to uncover the hidden spatial social network behind the distribution flows.

Distribution flows are frequently studied in the field of rumor and/or flu spreading. Existing methods, in broad terms, can be suppressed into two classes, the agent-based modelling/simulation/calibration (ABM) techniques [9–15] and the differential equation (DE) based approaches [16–19]. The class of DE approaches is helpful to derive qualitative conclusions regarding the steady state distribution of the spreading processes and how the equilibrium depends on model parameter in a coarse sense. However, to guarantee the meaningful qualitative results are achievable, the setup of differential equations is often oversimplified, but it would cause the loss of insights into the complex reality. In addition, due to the lack of explicit solution in most cases, it is not possible to apply the DE techniques to fit the real data and generate detailed quantitative results. In contrast, ABM approaches are more realistic and suitable for quantitative research on the real distribution flows. However, there are still a couple of shortages in the existing ABM models [20].

First, ABM often assumes that the spreading process is carried on a network where nodes represent agents that can potentially spread out or be infected with a certain type of object (e.g., rumor); edges are the links between agents. Rumor can only be spread between agents linked by edges. Under ABM framework, this interaction network is supposed to be known and prescribed in prior. Prior network may lose critical information of the interaction patterns of population [15, 21, 22]. For instance, in the Twitter network, a natural interaction network structure is the network formed by friendship or followership relation between users which is also frequently used as the prior network for rumor spreading studies [14, 18]. However, rumor does not have to follow this network to spread [23]. In fact, the retweet action of big name users is more likely to be visible through other channels to those users who are not linked in the friendship network, such as by TV shows, and newspapers. Therefore, the spreading between a big name user and an ordinary user is still possible even if they are not linked at all by merely counting the friend or follower relation. The existence of hidden links makes prior network fail to capture all structural features of interactions, a data-driven or posterior network would be helpful to overcome this issue.

Second, given the prior network, ABM assumes the spreading occurs through interaction mechanisms between two randomly picked agents. Widely used interaction mechanisms include the independent cascade model and the linear threshold model and so on [24–26]. These mechanisms are often parametrized and assumed homogeneous for all agents; i.e., the mechanism is determined by a set of parameters that are constant and invariant for different agents. In reality, both of the relative positions of an agent within the network, such as the degree, centrality, betweenness of an agent [17, 18, 27], and many social-economic factors external to the entire network, such as the geographic location, social status, education level, and wealth [22, 27], can drastically affect the likelihood that agents get infected by the rumor. But the heterogeneity among agents is often missing from the standard ABM framework.

To resolve the above issues, we propose a novel and completely data-driven modelling approach to characterize the hidden interaction network and the spreading process. Our study contributes to the existing literature in the following aspects.

First, we consider the interaction network as a weighted multidimensional spatial social network, which is an extension to the standard spatial network and the nodes in the network are embedded into a multidimensional feature space . The weighted edge between nodes is considered as a continuous function on . Within such a network, the value of edge weight function can depend on features of both the start nodes and end nodes, so it gives full respect to the heterogeneity of nodes and its effect on shaping the spreading process and distribution flow.

Second, we link the interaction network with the distribution flows by the classical mean-field models [9, 16–18] and the law of distribution transition is realized by a kernel operator with its kernel function given by the edge weight function. Such a construction allows the infection status of a given node to depend on all other nodes in the network in a smooth manner, which avoids the arbitrariness of distinguishing the impact of neighbor and nonneighbor nodes, while also facilitating the inclusion of the context information embedded in the spatial social network into the analysis of spreading.

Third, we adopt the kernel smoothing technique and nonparametric likelihood estimation from statistics [28, 29] to fit our model into real distribution flows, where the entire edge weight function is supposed to be unknown and needs to be estimated from the distribution flow data from the real world. The nonparametricity makes our method a powerful tool of information mining for distribution flow data. Finally, the widely used block models [30–33] can be easily incorporated into our framework, which helps better uncover hidden social-economic connections between individuals from distribution flows.

The paper is organized as follows. In Section 2, we give an overview of existing methods of network estimation. Section 3 formally presents the setup of our method, including the definition of feature space network, mean-field models and their simulation techniques, and the design of our likelihood estimators. Section 4 validates the effectiveness of our estimators to the hidden network by synthetic data and numerical experiments. Section 5 applies our method to a distribution flow dataset of the information spreading on Twitter relevant to the event “Unite the Right rally”, 2017.

#### 2. Relevant Methods

The proposed method in this paper is essentially a network estimation tool, while network estimation is a long-standing topic in many different fields.

In the studies of agent-based model (ABM), simulation-based estimation is usually adopted to calibrate the unknown parameters involved in model setup [13–15, 18, 34, 35]. Simulation-based estimation is efficient in dealing with the estimation of ABMs as it is often impossible to derive an analytic expression for the standard error functions in ABM setting; simulation can help generate an empirical version of the error function and facilitate the application of standard ordinary least square (OLS) and maximum likelihood (ML) estimation strategy. However, the simulation-based estimation is more frequently applied to parametric ABM where only a finite-dimensional parameter vector is to be estimated; it is rarely used to estimate the hidden network structure as the unknown network is essentially nonparametric, which is less tractable than the parametric models. To our best knowledge, the only exception comes from Grazzini and Richiardi [35]; Kukacka and Barunik [36], in which the interaction mechanism when two agents meet is allowed to include a nonparametric component, and the kernel smoothing method and nonparametric likelihood (or least square) estimators are applied to cope with model estimation. However, Grazzini and Richiardi [35]; Kukacka and Barunik [36] do not include the interaction network between agents into their analysis, nor the model identifiability issue is resolved; thus, further exploration is needed in this direction.

The other related works deal with link prediction by stochastic-network models. In this field, nonparametric tricks are more often adopted to make inference of hidden features of stochastic network [23, 31, 32, 37, 38]. Lü and Zhou [31] review the main-stream heuristic algorithms to forecast the missing links within a partially observed network. Bickel et al. [39], from the perspective of statistic inference, summarize and validate the application of variational expectation maximization (VEM) algorithm to infer the probability of existence of a link between two nodes from observed edge data. Matias et al. [38] extend the VEM method to deal with the future occurrence probability of edges given a dynamic linked network and the historic edge data; this extended method can handle the case where the evolution of occurrence probability depends nonparametrically on an unknown hazard function. All these methods were developed under a common assumption that at least the edge information of part of the network has already been observed, which is possible for trajectory data, but not possible for distribution flows. Thus, a further extension is needed to handle the case that all edge data are missing.

In the literature of physics, the task of detecting the hidden network link structure from node-level time-series data is phrased as “network reconstruction”. Taking distribution flows as the input, two outstanding network reconstruction methodologies are directly comparable to ours. One is based on the compressive sensing technique as proposed in Shen et al. [40]; the other is based on the combination of likelihood estimation and the mean-field approximation technique as discussed in Roudi and Hertz [41]. The basic idea in Shen et al. [40] is to convert the network reconstruction problem to a classical convex optimization problem with linear constraints, which is the so-called compressive sensing (CS) problem. In the CS problem, the linear constraints come from the transition probability of nodes within the network from the uninfected state to the infected state, while the objective function arises from the sparsity assumption regarding the network link structure. Unlike the applications of CS approach to the network reconstruction from continuous time-series data [42–44] where the feature variables associated with every node are directly observable, in the case of distribution flows, the key variable, transition probability, is not observable from the data. Therefore it has to be calculated so as to form the required linear constraints. Inferring the transition probability from the -valued distribution flow data requires a stationary assumption on the underlying model which is too restrictive in many applications. For instance in the spreading of virus, an agent might die immediately after it is infected, in which case the infected agent is censored in the sense that its infectious status is constantly one since the time of being infected. When censored agents exist in the network, stationarity of the transition is impossible and the CS framework in Shen et al. [40] is no longer applicable. The other problem of the CS framework is its incapability of handling the spatial heterogeneity among different nodes. As we have highlighted that the education, wealth, and many other social-economic factors can play critical roles to determine the link strength among people and therefore affect the information spreading dynamics, modelling the dependence of the hidden link structure on those social-economic factors is necessary in the studies of social network. The inclusion of social-economic factors would introduce heterogeneity among nodes, which makes it challenging to identify which two nodes are relatively homogeneous and can be grouped together. In the CS framework, grouping different nodes is the premise to calculate the transition probability. In an abstract network, all nodes are homogeneous, and the grouping can be simply taken as the set of all nodes as done in Shen et al. [40], while in a spatial network with heterogeneity widely existing such a simple grouping trick is meaningless. How to extend the CS framework to spatial social network becomes a tough job and extensive studies are needed.

The deep reason that restricts the CS framework is its reliance on the unobservable transition probability. That restriction can be effectively resolved by applying the likelihood technique as suggested in Roudi and Hertz [41]. The goodness of likelihood-based approach is that it can compute the unknown transition probability simultaneously with the other model parameters. But the computation usually takes too much time because there is no explicit solution for the first-order condition of the maximum likelihood; numerical solution is required. To make the computation easier, a mean-filed approximation technique is presented in Roudi and Hertz [41], which can definitely increase the computation speed. However, the approximation can only work for the case that all link strengths have to be close to zero, which restricts its usefulness in many applications of social network. On the other hand, the current version of the approximation technique in Roudi and Hertz [41] still assumes an abstract network structure, and no dependence of the link strength on social-economic factors is allowed; it is unclear whether the approximation is extendible to account for the reconstruction of spatial social networks. Finally, Roudi and Hertz [41] are only concerned with the situation that the number of nodes () is relatively small, and the computation complexity comes mainly from numerically solving the maximum likelihood problem. But when is large, the computation complexity would be dominated by the matrix multiplication for the adjacency matrix. Since the approximation technique in Roudi and Hertz [41] still requires the matrix multiplication, its speed-up effect for giant networks may not be that significant. More explorations on the fast reconstruction of giant spatial social networks are needed.

#### 3. Model Setup

##### 3.1. Feature Space Network

We consider a weighted multidimensional spatial social network, where nodes of the network are considered as elements in a -dimensional Euclidean space and every dimension of is interpreted as a feature of nodes; thus, is interpretable as a feature space. Edges between nodes are assumed to depend on features of nodes in a smooth way, i.e., edge set of the graph is equivalent to a smooth function (up to a certain order of derivatives) or an almost-everywhere smooth function (i.e., the function is smooth for all points except those contained in a zero-measure set), denoted as , where, without loss of generality, edge weight between two nodes is restrained within the unit interval. Such a specification admits a stochastic-network interpretation of our model; the weight can be thought of as the probability that two nodes share an edge. Since the nodes of the network may not be evenly distributed within the entire space , without loss of generality, we assume the node’s distribution is characterized by a probability measure on , and is supposed to be known from the data. In sum, the -dimensional spatial network can be recorded as , or shortly when there is no ambiguity regarding its nodes space, distribution, and edge function.

There are several advantages to assume that the spreading process and distribution flows occurred within . First, the embedding of the node set into feature space allows us to characterize the feature information of nodes that are external to the network structure [21, 22, 27], which are usually as important as the network structure itself in determining the spreading process and distribution flows. Luo et al. [22] argue that including social-economic factors such as the intensity of population gathering in a set of locations can significantly increase the capacity of forecast of illness spreading among residents. Viboud et al. [45] report similar findings. Second, allowing nodes unevenly distributed within the feature space admits us to include more general network into analysis. For instance, by proper choice of the measure (e.g., finitely supported), it is even possible to consider a network with only finitely many nodes but sitting in the infinite feature space ; this allows us to include most of networks that we can meet in practice. Finally, allowing the edge weight to smoothly depend on features of both the flow-in and flow-out nodes makes it possible to incorporate the background information into the interaction mechanism; this is critical when the network itself is only a small component of a larger background system [27]. In addition, a by-product of treating edges as a smooth function is its induced computational efficiency. In fact, when a network consists of a giant number of nodes, even a simple summation operation can take a long time and huge memory, but when edges vary smoothly along with nodes, it becomes possible to only do calculation on a small set of nodes and the global features of edges then can be inferred from the result on the relatively small set by the kernel smoothing technique from nonparametric statistics [28, 29]. Based on these advantages, we will concentrate on the spatial network, , instead of a more general concept of network.

##### 3.2. Mean-Field Models

To model spreading processes within a spatial network , we follow the convention in the studies in rumor spreading literature [10, 17] and adopt the common assumption that a rumor can be spread out from a node to the other if and only if (1) the initial node must have been infected with the rumor, recorded as the event ; (2) there is an edge between them, or equivalently, ; and (3) when condition (1) and (2) hold, whether or not the spreading actually happens is purely random up to a probability . Different spreading models impose different requirement on the probability . In the current studies, we adopt the mean-field model to determine , as suggested in most of previous studies. Formally, for every fixed time , the probability of node being infected is determined by the following mean-field equation:The interpretation of (1) is that at the temporal variation rate of the probability that node is infected (represented as ) is a proportion to the probability that node has not yet been infected by time (represented as ), and the proportion is determined through a weighted sum of the probability of all other nodes in the network having been infected by . The weight function describes the strength of connection between nodes and , thus can be formulated as the edge function . Using the classical result of mean-field equations [46–50], it can be easily verified that the infection probability in (1) is exactly equal to the probability of for a given right-continuous mean-field point process satisfying the following:where is the left-limit of process . The interpretation of (2) is more straightforward than (1); (2) points out that the average rate of node being infected is contributed by all those nodes that (1) have a connection to and (2) have been infected by the current time. These two conditions are often imposed in literature.

Let be a function satisfying the functional differential equation (1), also denote as the density or mass function associated with probability ; then, the event that a given node is observed at time and its infectious status is observed to be infected has the probability density:in contrast, the density for the event that is observed to be uninfected at is given asSuppose that given a time , the infectious status of a set of randomly picked nodes is observable and represented aswith being not infected and being infected; then, the likelihood function of the observations can be written in the following way by using (3) and (4):where we add the edge function into likelihood because it affects through determining the functional form of . Maximizing (6) can yield the classical maximum likelihood (ML) estimator of .

##### 3.3. Nonparametric Likelihood Estimator and Kernel Smoothing

In the study of spreading process, only the distribution flows of the form (5) are available; the details of link structure between nodes represented by edge function are not observable, thus need to be estimated. In this section, we construct a nonparametric simulated maximum likelihood estimator (NPSML) to the functional form of given the observed distribution flows on a sequence of time. The NPSML is an efficient nonparametric inference technique proposed by Kristensen and Shin [29]. NPSML applies well to the case where an explicit expression of the likelihood function is not achievable, which is exactly what we need to handle because the distribution function in (6) is the solution to the functional differential equation (1); there is no clean analytic expression available for it.

However, our task is different from the situation discussed originally in Kristensen and Shin [29]. First, the original NPSML applies nonparametric kernel smoothing to approximate the unknown likelihood function; the model generating the likelihood function is still parametric, but in (6) the likelihood depends on the nonparametric edge function . To this situation, one extra kernel smoothing step is needed to approximate . Second, in Kristensen and Shin [29]; Kukacka and Barunik [36], simulation is conducted on the level of random variable, while, in our case, simulation is on the level of distribution that is equivalent to numerically solve the mean-field equation (1). Finally, due to the involvement of nonparametric model setup, the model identifiability has to be checked in order to guarantee the correctness of the resulting estimation.

Due to the first and second differences, we provide the following algorithm to generate the simulated likelihood function (in the following constructions, we always use to denote the -dimensional standard Gaussian kernel function, for some positive constant .).

*Step 1*. Select constant , large positive integer and ( is the length of every time step used for numerically solving the functional differential equation (1), and are the number of random samples that will be drawn to generate the kernel smoothing approximation to the unknown likelihood function and edge weight function).

*Step 2*. Draw random samples from distribution , and random samples from the product measure .

*Step 3*. Given , construct function as follows:

*Step 4*. Given , let denote the observation set at time whose cardinality is , constructing function as follows:

*Step 5*. Solve mean-field equation (1) over interval at the set of sample point drawn in Step 2 by Euler’s method with time step subject to the initial condition as follows:where , is the greatest integer less than .

*Step 6*. For the observation set at with cardinality , generate the simulated density at the sample nodes as follows:and construct the simulated likelihood function as follows:

The full information likelihood function for all observation time can be constructed from (11) in the following way:The estimator of unknown edge function can be derived from maximizing the simulated full information likelihood function (12) by selecting appropriate ; the final estimator is constructed from the optimal in the way of (7).

Comparing to NPSML in Kristensen and Shin [29], the algorithm in our study includes one extra sampling step to draw random points from which are used for approximating unknown . In addition, there are two kernel smoothing steps (Steps 4 and 6) regarding the density function , one for the initial density in the starting time and the other for the end-time density at . The two kernel smoothing steps are not required when the total number of nodes are small (a few hundred or a few thousand), in which case the whole set of nodes is directly used as the samples drawn in Step 2. However, when the system has a giant node set (say millions), the sample size can be applied in order to lift the computation efficiency. Moreover, the node sets being observed at different observation time may not always be identical; it is more often the case that when a node is tracked to be uninfected at some time , it will be regarded as safe and missing from the consecutive tracking in the next few observation time points. In this interval-censor situation, the sampled nodes and the two kernel smoothing steps are needed to avoid the noise induced by censoring.

As documented in Kristensen and Shin [29]; Kukacka and Barunik [36], the NPSML estimator does not suffer from the “curse of dimension” despite its nonparametric essence because the number of simulation samples is independent from the number of observation samples. When the latter is large, the inefficiency induced by kernel smoothing vanishes during the aggregation involved in the likelihood function. By the same argument and the fact that in most real-world applications the number of observed nodes is giant, our modified NPSML estimator is free from the curse of dimensionality as well.

##### 3.4. A Fast Algorithm

As shown in (9), the estimation procedure requires repeated evaluation of the multiplication between a matrix and a dimensional vector; the computation complexity is of the order . Although can be taken as much smaller than the number of nodes in observations (), it still has to increase as increases. So when is a giant number, has to be large as well; the computation complexity of the entire estimation procedure will be dominated by . In this section, we propose a fast algorithm which can reduce the computation complexity in (9) to be linearly dependent on that is reasonable and implementable in practice.

The idea of the fast algorithm comes from the technique of agent-based simulation (ABS). In every iteration of ABS, every agent in the network is only required to interact with another agent randomly picked from its neighbor. In our setting, there is no strict “neighbor” defined, while it is still possible to randomly pick one agent from the entire population and the interaction is only counted on the given agent and its randomly picked partner. Formally, Step 5 in previous paragraph is split to three substeps.

*Step 5(1)*. For fixed and fixed , randomly pick one from ;

*Step 5(2)*. Compute

*Step 5(3)*. Repeat the above two steps for all , , and for all s.

Comparing (9) and (13), the main difference is that the inner product of vectors (i.e., the sum over ) is replaced with a scalar multiple, so the resulting computation complexity for all nodes linearly depends on which is significantly faster than the original algorithm.

For the accuracy of the fast algorithm, we claim that compared to the original algorithm, the accuracy loss induced by the fastness is controlled by a constant multiple of . In fact, due to the randomness of s, it is easily to verify the following:(i)the expectation of the left hand side of (9) is identical to the expectation of left hand side of (13);(ii)denote as the increment; , then for , and all s, .

The property (i) and the identity for in (ii) are quite trivial. For , then can be decomposed as the sum of the following two components where is the norm of a bounded valued function. The above inequality holds straightforwardly from the fact is bounded by and its temporal derivative is given by (1) which is also uniformly bounded by ; then, the statement (ii) follows immediately.

Using Property (i), (ii) and the law of large number, it is straightforward that the difference between the likelihood function constructed from (9) and by (13) is bounded by a constant multiple of as the number of nodes . If we further require along with , the two types of calculation of the likelihood function would be asymptotically identical, which leads to the same estimator to the hidden network.

Also notice that by the fast algorithm, the choice of is independent with the estimation accuracy, so in practice it can be selected directly as to increase the speed.

##### 3.5. Block Network

The NPSML algorithm constructed in previous section can be further extended to make inference for the block network model. As in many applications [33, 38, 39], the existence of connection between two agents is only relevant to the groups they belong to, and the features of agents only affect which group they are assigned to. Without loss of generality, the set of groups can be considered as a partition of the set of all nodes; then, the edge function can be decomposed as two components:(i)the group weight function ;(ii)the group-level edge weight , which is a matrix with each entry valued in ;

The edge function for the block network model can be recovered from (i) and (ii) as follows:where the image of is viewed as a -dimensional columns vector and the subscript represents vector transpose. The group weight function is required to satisfy that for every and , there exist only one with , which means every node can only have positive probability to belong to at most one group which guarantees the requirement that groups constitute a partition of the node set.

The estimation of block network is equivalent to the estimation of (1) the group weight function , which is unknown and consists of the fully nonparametric component of the network, and (2) the interaction matrix , which is the parametric component of the network. So the estimation is essentially semiparametric. The six-step algorithm discussed in Section 3.3 and the fast algorithm in Section 3.4 are still applicable to that case. The only modification is for Step 3, where the kernel smoothing method is no longer applied to the unknown edge weight . Instead, it is applied to generate the estimate to group weight . Then the hidden weight function is constructed from the kernel smoothed and the given interaction matrix in the way of (15).

Block network model has many advantages. For instance, when the number of groups involved is small and does not depend on the number of nodes, the number of parameters to solve is only , while the number is when there is no block structure at all. To generate good approximation to the true edge function, has to increase along with the number (although slowly); when the node number in observation is giant, has to be large as well; then . Through block network, we can sharply reduce the dimension of parameter space when solving the maximum likelihood problem, which can significantly lift the computation efficiency.

In addition, block network is much easier to identify than the general fully nonparametric networks, which will be discussed in the next section. Finally, under block network, the equilibrium infectious distribution of the spreading process has a clear analytic expression as stated in the following proposition (proof for Proposition 1 is quite trivial, hence omitted).

Proposition 1. *Denote as the projection of vector to its th coordinate. Define that consists of the set of nodes belonging to group ; then, within a mean-field model of the form (2) with edge function given by (15), every equilibrium infection distribution (i.e., satisfying ) must have the following form:where is the prescribed initial distribution of infectious status, is the projection of a vector to its th dimension, and denotes the th power of matrix .*

Proposition 1 is meaningful in the sense that it links the types of equilibria infectious distribution with the matrix algebra, facilitating the qualitative analysis of the equilibria distribution. For instance, when is an upper triangle matrix with all its lower off-diagonal entries being zero and all diagonal and upper off-diagonal entries being strictly positive, such as in (17),then the equilibrium distribution and the initial distribution satisfy the relation:

##### 3.6. Validity of NPSML

Due to the nonparametric nature of the edge function , its identifiability is tricky. When the spreading process can be observed for multiple times ( times) with random initializations and is large as assumed in Roudi and Hertz [41]; Shen et al. [40], both of the fully nonparametric network and the block network are identifiable. However, in real applications, a spreading process can at most be observed for a few times; it is not expected that can be very large. In that case, the fully nonparametric edge function is no longer fully identifiable; i.e., there exists that leads to the same likelihood function (6) in the limit case. However, it can be shown that is identifiable up to compact convex set; i.e., the set is a compact convex set within the function space , where stands for the true value of edge function. It can also be proved that the set also varies along with the initial infectious status . Formally, we have that , if and only if the following holds for all :where is a bounded operator over the functional space defined through as for every with being the default node distribution; is the multiplicative operator determined by such that ; the th power in (19) represents the self-composition of an operator for times. (19) implies that the identifiability of the true edge function is limited by the extent of the ergodicity of the spreading process within the node space . For instance, when there exists a small open set such that all nodes are infected before the initial time , i.e., for all , then it can be verified by (19) that all functions that deviate from only within the band set are contained in . On the other hand, if there exists open such that for all and all , then all functions that deviate from only within are contained in . In both of the two cases, nodes in or are not in the ergodic range of the spreading process; hence, the transmission of their infectious status is not observable. For nodes in , their infections occur ahead of the observation period hence not observable after the start of spreading, while for nodes in it can be verified that they will never be infected over the entire spreading process. Therefore, the identifiability of is restricted by the experience of the spreading process, which is reasonable.

It is still an open question what conditions added to and/or can guarantee the identifiability of the fully nonparametric . But in the special case of block networks, one simple identifiability condition can be figured out. In fact, for block networks, it is straightforward that is identifiable, if and only if there does not exist a pair that differs from the true but leads to the same likelihood function (6) in the limit case, if and only if, for the true , the vector space spanned by the family of vectors is the entire feature space , i.e., has full rank. is the number of blocks, is a -dimensional column vector for every and for each , , is the th entry of . To reach the full rank condition, the well-known Wronskian determinant [51] can be applied leading to the following clean-form identifiability condition:where is the other -dimensional column vector determined by the true function such that for ; is the operation that convert a -dimensional vector to a matrix with its diagonal elements being the given vector. By the polynomial nature of the determinant function, it can be verified that (20) holds “generically” in the sense that the set of s that forces (20) to be constantly equal to is contained in an dimensional surface within , and for those s that (20) is not constantly , the set of that forces (20) to be is only contained in a dimensional surface within . Therefore, (20) holds for almost all and except for some extreme cases that have measure under the standard Lebesgue measure.

The “almost” identifiability for block networks guarantees that in most cases when the number of observed nodes is large and the distribution of observation time is dense, the estimated and from the NPSML asymptotically converge to their true values and point-wisely follow multivariate normal distributions. This asymptotic result follows straightforwardly from Kristensen and Shin [29]; Kukacka and Barunik [36] and the general properties of maximum likelihood estimator. So, the theoretical validity of the estimators developed in previous sections is established.

*Remark 2 (sparsity). *Although in general the complete identifiability for both the general network and the block network is hard to achieve, but if we follow the idea in the network reconstruction literature, Shen et al. [40] only concentrate on the case that the hidden network is as sparse as possible in the sense: the norm of the edge weight function for the general network and/or the entry-wise square sum of the block network (this is the norm on the discrete set with cardinality ) is as small as possible. To automate the selection of the sparsest network, we can consider the norm function as a penalty and subtract it from the log-likelihood function (6), and then optimizing (6) would guarantee the solution converging to the sparsest network. It is easily verified that such a sparse solution is always asymptotically unique, because as we discussed in previous paragraphs, all networks that can lead to exactly the same log-likelihood function form a compact convex set in the functional space; by the compactness and convexity, there always exists a unique (or ) such that its -distance to the origin reaches the minimum.

#### 4. Numerical Experiment with Synthetic Data

Two synthetic data sets are generated from simulation to test the effectiveness of the NPSML estimator designed in previous sections, one for the fully nonparametric network and the other for the block network. For both examples, the node set consists of 200 nodes which are drawn purely randomly from the unit cube ; thus, these nodes follow the uniform distribution. Consider the following model setup.

*Example 1 (full nonparametric network). *Edge function is negatively proportional to the standard Euclidean distance between two nodes, i.e.,

*Example 2 (block network). *Set , block membership function satisfiesMatrix is given as follows:

For both examples, the spreading process is initialized as that 30% of all nodes are infected at the very beginning and the infected nodes are randomly picked from the node set. The full spreading process is generated from a discrete version of (2) with sufficiently small time step (e.g., that makes the resulting distribution flows as the first-order approximation to the true flows); a coarse time step () is used for the estimation procedure (9) in order to test the robustness. The process is followed up until day 5; i.e., the time horizon in this simulation study is with . The observation of the distribution flows is supposed to be available only at the initial time and the end of every day; i.e., there are 6 chances to observe the distribution of infections at .

For the fully nonparametric Example 1, the spreading process is regenerated for 100 times with 100 random initializations; this is necessary to address the identification issues as pointed out in Section 3.6. For the 100 trails, both the node set and the initial infectious subset are regenerated although their distributions are held constant. For the block network Example 2, the spreading process is generated only once in order to evaluate the fitting performance under the situation that no repeated observation of the spreading process is available. For both examples, the estimated edge function is evaluated on a fixed set of grids for easy comparison where the grid set forms a lattice of the unit cube, i.e., .

If all nodes are included in the computation of the NPSML estimator, there are in principle a 40,000()-dimensional parameter space for full nonparametric network Example 1 and a ()-dimensional parameter space for block network Example 2 to be searched which are too time consuming. As in the introduction of NPSML estimator, by the smoothness of edge function, the number of nodes actually used to evaluate the edge function can be much smaller than the size of the entire node set. So, to reduce computation load, we generate another nodes from the uniform distribution which will be used in Step 3 (Section 3.3) for simulating the distribution function . Accordingly, the node pairs will be selected as the product of the 20 nodes for the fully nonparametric Example 1; then, there are parameters to optimize in Example 1, and the size is quite reasonable for most nonparametric tasks. For the block network Example 2, as no node pairs are needed for block networks, there are only () parameters to optimize. As for the selection of kernel width , and , we set , , and . This is because the kernel smooth method requires kernel width to satisfy and in order to guarantee the consistency and asymptotic normality [28, 29, 36, 52], where is input sample size and is the dimension of the data. By a rule of thumb, we select the kernel width as . For , it is only used in Example 1 to estimate the edge function, where the sample size is and the data dimension is two times of the dimension of node space; thus, is . For and , they are used in both examples for estimating the distribution function ; thus, data dimension is always . The sample size for is because it is used to turn the real observed on 200 nodes to its kernel smooth version and the sample size for is because it turns the estimated on 20 sampled nodes to its values on the full node set.

For the inference of the block network, the number of block is usually not known in prior, so it is also a parameter to estimate. As determines the model dimension, we adopt the classical Bayesian information criteria (BIC) introduced in Schwarz [53] to detect the correct model dimension. As defined in Schwarz [53], the greater BIC for a fitted model implies the better explanatory power [53]; therefore, the best choice of corresponds to the maximal BIC. In practice, it is not possible to calculate the BIC value for all positive , so we follow the convention and only compute the BIC on a small set of . The associated with the maximal BIC and the corresponding estimates of , are selected as the final estimators and reported in the following. In our example, the correct is always achieved, so we omit this trivial result.

In Figure 1, we plot the difference between the real edge function and the NPSML estimated edge function on the set of node pairs for both examples, where the horizontal axis represents the true value of edge weight on every node pair and the vertical axis represents the estimated weight on the same node pair. To facilitate visualization, Figure 1 is sorted according to the horizontal axis in an ascending manner. The red dots represent the pairs of (estimated weight, true weight), the blue line sketches the identity function ; therefore, a red dot being closer to the blue line means the better fitting accuracy. Apparently, for most of node pairs, the difference is negligible. To further verify this visual judgement, test is carried out for every node pair with the null hypothesis . Following the asymptotic normality of NPSML estimator at every , the distribution of test statistics under null hypothesis should be a distribution with degree of freedom , where is the asymptotic variance of estimator , which can be calculated by bootstrap method. We count the number of node pairs that fail to support the null hypothesis at 90% credential level; the result shows that, in both examples, only less than 10% out of all 10,000 evaluation pairs in fail to support the null hypothesis. So our estimation accuracy is quite satisfactory, which agrees with the visualization in Figure 1.