Abstract

This study aims at developing a non-(semi-)parametric method to extract the hidden network structure from $\{0,1\}$-valued distribution flow data with missing observations on the links between nodes. This type of input data arises widely in studies of information propagation processes, such as rumor spreading through social media. In that case, a social network does exist as the medium of the spreading process, but its link structure is completely unobservable; therefore, it is important to make inference of the structure (links) of the hidden network. Unlike previous studies on this topic, which only consider abstract networks, we believe that, apart from the link structure, the different social-economic features and geographic locations of nodes can also play critical roles in shaping the spreading process and have to be taken into account. To uncover the hidden link structure and its dependence on the external social-economic features of the node set, a multidimensional spatial social network model is constructed in this study, with the spatial dimension large enough to account for all influential social-economic factors. Based on the spatial network, we propose a nonparametric mean-field equation to govern the rumor spreading process and apply the likelihood estimator to make inference of the unknown link structure from the observed rumor distribution flows. Our method turns out to be easily extensible to the class of block networks that are useful in most real applications. The method is tested on simulated data and demonstrated on a data set of rumor spreading on Twitter.

1. Introduction

Flow data have been widely studied by different disciplines [16]. Especially in recent years, the development of the Internet has made an increasing number of flow data sets publicly available, among which new types of flows are emerging and attracting more and more attention from scholars [7, 8].

Unlike physical movement data, such as the trajectory of a taxi, information flow data, such as the time series of the retweet status of a class of tweet articles within a population, do not contain any trajectory-level information, because a user may tweet after seeing that many friends have done so. In that case, a group of friends can contribute to the spreading of the tweet, and it becomes impossible to figure out which one is the real single source; neither is it possible to track the trajectory of retweeting. Therefore, these flow data are no longer stored as a collection of well-defined trajectories; instead, they consist of a time series of distributions of a given kind of information within the entire population. In addition, the distribution flows are highly “context-dependent”, which means the social-economic factors behind every agent joining the spreading process (such as education, income, and neighborhood) might significantly affect the speed, extent, and coverage of spreading, suggesting a spatial social network to be uncovered from the distribution flows.

Of course, the emergence of new types and new features of flow data inevitably brings unprecedented opportunities to improve our understanding of interaction patterns between people and thus enrich relevant theories, but the missing observations on trajectory-level information and the addition of social-economic context make it challenging to uncover the agent-to-agent links, or equivalently the entire hidden interaction network. As a result, it becomes necessary to develop more data-driven approaches tailored to uncover the hidden spatial social network behind the distribution flows.

Distribution flows are frequently studied in the field of rumor and/or flu spreading. Existing methods, in broad terms, can be grouped into two classes: the agent-based modelling/simulation/calibration (ABM) techniques [9–15] and the differential equation (DE) based approaches [16–19]. The class of DE approaches is helpful for deriving qualitative conclusions regarding the steady-state distribution of spreading processes and how the equilibrium depends on model parameters in a coarse sense. However, to guarantee that meaningful qualitative results are achievable, the setup of the differential equations is often oversimplified, which causes a loss of insight into the complex reality. In addition, due to the lack of explicit solutions in most cases, it is not possible to apply the DE techniques to fit real data and generate detailed quantitative results. In contrast, ABM approaches are more realistic and suitable for quantitative research on real distribution flows. However, there are still a couple of shortcomings in the existing ABM models [20].

First, ABM often assumes that the spreading process is carried out on a network where nodes represent agents that can potentially spread out or be infected with a certain type of object (e.g., a rumor); edges are the links between agents. A rumor can only be spread between agents linked by edges. Under the ABM framework, this interaction network is supposed to be known and prescribed a priori. A prior network may lose critical information on the interaction patterns of the population [15, 21, 22]. For instance, on Twitter, a natural interaction network structure is the network formed by the friendship or followership relation between users, which is also frequently used as the prior network in rumor spreading studies [14, 18]. However, a rumor does not have to follow this network to spread [23]. In fact, the retweet actions of big name users are more likely to be visible, through other channels such as TV shows and newspapers, to users who are not linked to them in the friendship network. Therefore, the spreading between a big name user and an ordinary user is still possible even if they are not linked at all when merely counting the friend or follower relation. The existence of hidden links makes the prior network fail to capture all structural features of interactions; a data-driven, or posterior, network would be helpful to overcome this issue.

Second, given the prior network, ABM assumes the spreading occurs through interaction mechanisms between two randomly picked agents. Widely used interaction mechanisms include the independent cascade model, the linear threshold model, and so on [24–26]. These mechanisms are often parametrized and assumed homogeneous for all agents; i.e., the mechanism is determined by a set of parameters that are constant and invariant across different agents. In reality, both the relative position of an agent within the network, such as its degree, centrality, and betweenness [17, 18, 27], and many social-economic factors external to the entire network, such as geographic location, social status, education level, and wealth [22, 27], can drastically affect the likelihood that agents get infected by the rumor. But this heterogeneity among agents is often missing from the standard ABM framework.

To resolve the above issues, we propose a novel and completely data-driven modelling approach to characterize the hidden interaction network and the spreading process. Our study contributes to the existing literature in the following aspects.

First, we consider the interaction network as a weighted multidimensional spatial social network, which is an extension of the standard spatial network in which the nodes are embedded into a multidimensional feature space $\mathbb{R}^d$. The weighted edge between nodes is considered as a continuous function on $\mathbb{R}^d \times \mathbb{R}^d$. Within such a network, the value of the edge weight function can depend on the features of both the start node and the end node, so it fully respects the heterogeneity of nodes and its effect on shaping the spreading process and distribution flow.

Second, we link the interaction network with the distribution flows by the classical mean-field models [9, 16–18], and the law of distribution transition is realized by a kernel operator with its kernel function given by the edge weight function. Such a construction allows the infection status of a given node to depend on all other nodes in the network in a smooth manner, which avoids the arbitrariness of distinguishing the impact of neighbor and nonneighbor nodes, while also facilitating the inclusion of the context information embedded in the spatial social network into the analysis of spreading.

Third, we adopt the kernel smoothing technique and nonparametric likelihood estimation from statistics [28, 29] to fit our model to real distribution flows, where the entire edge weight function is supposed to be unknown and needs to be estimated from real-world distribution flow data. The nonparametricity makes our method a powerful information mining tool for distribution flow data. Finally, the widely used block models [30–33] can be easily incorporated into our framework, which helps better uncover hidden social-economic connections between individuals from distribution flows.

The paper is organized as follows. In Section 2, we give an overview of existing methods of network estimation. Section 3 formally presents the setup of our method, including the definition of feature space network, mean-field models and their simulation techniques, and the design of our likelihood estimators. Section 4 validates the effectiveness of our estimators to the hidden network by synthetic data and numerical experiments. Section 5 applies our method to a distribution flow dataset of the information spreading on Twitter relevant to the event “Unite the Right rally”, 2017.

2. Relevant Methods

The proposed method in this paper is essentially a network estimation tool, while network estimation is a long-standing topic in many different fields.

In studies of agent-based models (ABM), simulation-based estimation is usually adopted to calibrate the unknown parameters involved in the model setup [13–15, 18, 34, 35]. Simulation-based estimation is efficient in dealing with the estimation of ABMs, as it is often impossible to derive an analytic expression for the standard error functions in the ABM setting; simulation can help generate an empirical version of the error function and facilitate the application of the standard ordinary least squares (OLS) and maximum likelihood (ML) estimation strategies. However, simulation-based estimation is more frequently applied to parametric ABMs where only a finite-dimensional parameter vector is to be estimated; it is rarely used to estimate a hidden network structure, as the unknown network is essentially nonparametric, which is less tractable than parametric models. To the best of our knowledge, the only exceptions are Grazzini and Richiardi [35] and Kukacka and Barunik [36], in which the interaction mechanism when two agents meet is allowed to include a nonparametric component, and the kernel smoothing method and nonparametric likelihood (or least squares) estimators are applied to cope with model estimation. However, Grazzini and Richiardi [35] and Kukacka and Barunik [36] do not include the interaction network between agents in their analysis, nor is the model identifiability issue resolved; thus, further exploration is needed in this direction.

The other related works deal with link prediction by stochastic-network models. In this field, nonparametric tricks are more often adopted to make inference of the hidden features of a stochastic network [23, 31, 32, 37, 38]. Lü and Zhou [31] review the mainstream heuristic algorithms for forecasting the missing links within a partially observed network. Bickel et al. [39], from the perspective of statistical inference, summarize and validate the application of the variational expectation maximization (VEM) algorithm to infer the probability of existence of a link between two nodes from observed edge data. Matias et al. [38] extend the VEM method to deal with the future occurrence probability of edges given a dynamic linked network and historic edge data; this extended method can handle the case where the evolution of the occurrence probability depends nonparametrically on an unknown hazard function. All these methods were developed under the common assumption that at least the edge information of part of the network has already been observed, which is possible for trajectory data but not for distribution flows. Thus, a further extension is needed to handle the case where all edge data are missing.

In the physics literature, the task of detecting the hidden network link structure from node-level time-series data is phrased as “network reconstruction”. Taking distribution flows as the input, two outstanding network reconstruction methodologies are directly comparable to ours. One is based on the compressive sensing technique, as proposed in Shen et al. [40]; the other is based on the combination of likelihood estimation and the mean-field approximation technique, as discussed in Roudi and Hertz [41]. The basic idea in Shen et al. [40] is to convert the network reconstruction problem to a classical convex optimization problem with linear constraints, which is the so-called compressive sensing (CS) problem. In the CS problem, the linear constraints come from the transition probabilities of nodes within the network from the uninfected state to the infected state, while the objective function arises from the sparsity assumption regarding the network link structure. Unlike the applications of the CS approach to network reconstruction from continuous time-series data [42–44], where the feature variables associated with every node are directly observable, in the case of distribution flows, the key variable, the transition probability, is not observable from the data. Therefore, it has to be calculated so as to form the required linear constraints. Inferring the transition probability from $\{0,1\}$-valued distribution flow data requires a stationarity assumption on the underlying model, which is too restrictive in many applications. For instance, in the spreading of a virus, an agent might die immediately after it is infected, in which case the infected agent is censored in the sense that its infectious status is constantly one from the time of being infected. When censored agents exist in the network, stationarity of the transition is impossible and the CS framework in Shen et al. [40] is no longer applicable. The other problem of the CS framework is its incapability of handling the spatial heterogeneity among different nodes. As we have highlighted, education, wealth, and many other social-economic factors can play critical roles in determining the link strength among people and therefore affect the information spreading dynamics; modelling the dependence of the hidden link structure on those social-economic factors is necessary in studies of social networks. The inclusion of social-economic factors introduces heterogeneity among nodes, which makes it challenging to identify which nodes are relatively homogeneous and can be grouped together. In the CS framework, grouping different nodes is the premise for calculating the transition probability. In an abstract network, all nodes are homogeneous, and the grouping can be simply taken as the set of all nodes, as done in Shen et al. [40], while in a spatial network where heterogeneity widely exists, such a simple grouping trick is meaningless. How to extend the CS framework to spatial social networks thus becomes a tough job, and extensive studies are needed.

The deep reason that restricts the CS framework is its reliance on the unobservable transition probability. That restriction can be effectively resolved by applying the likelihood technique, as suggested in Roudi and Hertz [41]. The strength of the likelihood-based approach is that it can compute the unknown transition probability simultaneously with the other model parameters. But the computation usually takes too much time because there is no explicit solution to the first-order condition of the maximum likelihood; a numerical solution is required. To make the computation easier, a mean-field approximation technique is presented in Roudi and Hertz [41], which can definitely increase the computation speed. However, the approximation only works in the case where all link strengths are close to zero, which restricts its usefulness in many applications to social networks. On the other hand, the current version of the approximation technique in Roudi and Hertz [41] still assumes an abstract network structure, and no dependence of the link strength on social-economic factors is allowed; it is unclear whether the approximation is extendible to account for the reconstruction of spatial social networks. Finally, Roudi and Hertz [41] are only concerned with the situation where the number of nodes ($N$) is relatively small and the computation complexity comes mainly from numerically solving the maximum likelihood problem. But when $N$ is large, the computation complexity is dominated by the matrix multiplication for the adjacency matrix. Since the approximation technique in Roudi and Hertz [41] still requires the matrix multiplication, its speed-up effect for giant networks may not be that significant. More explorations of the fast reconstruction of giant spatial social networks are needed.

3. Model Setup

3.1. Feature Space Network

We consider a weighted multidimensional spatial social network, where the nodes of the network are considered as elements in a $d$-dimensional Euclidean space $\mathbb{R}^d$ and every dimension of $\mathbb{R}^d$ is interpreted as a feature of nodes; thus, $\mathbb{R}^d$ is interpretable as a feature space. Edges between nodes are assumed to depend on the features of nodes in a smooth way; i.e., the edge set of the graph is equivalent to a smooth function (up to a certain order of derivatives) or an almost-everywhere smooth function (i.e., a function that is smooth at all points except those contained in a zero-measure set), denoted as $W : \mathbb{R}^d \times \mathbb{R}^d \to [0,1]$, where, without loss of generality, the edge weight between two nodes is restrained within the unit interval. Such a specification admits a stochastic-network interpretation of our model: the weight $W(x,y)$ can be thought of as the probability that two nodes $x$ and $y$ share an edge. Since the nodes of the network may not be evenly distributed within the entire space $\mathbb{R}^d$, without loss of generality, we assume the node distribution is characterized by a probability measure $\mu$ on $\mathbb{R}^d$, and $\mu$ is supposed to be known from the data. In sum, the $d$-dimensional spatial network can be recorded as $G = (\mathbb{R}^d, \mu, W)$, or shortly $G$ when there is no ambiguity regarding its node space, distribution, and edge function.

There are several advantages to assuming that the spreading process and distribution flows occur within $G$. First, the embedding of the node set into the feature space allows us to characterize the feature information of nodes that is external to the network structure [21, 22, 27], which is usually as important as the network structure itself in determining the spreading process and distribution flows. Luo et al. [22] argue that including social-economic factors, such as the intensity of population gathering at a set of locations, can significantly increase the capacity to forecast illness spreading among residents. Viboud et al. [45] report similar findings. Second, allowing nodes to be unevenly distributed within the feature space admits more general networks into the analysis. For instance, by a proper choice of the measure $\mu$ (e.g., finitely supported), it is even possible to consider a network with only finitely many nodes sitting in the infinite feature space $\mathbb{R}^d$; this allows us to include most networks that we meet in practice. Finally, allowing the edge weight to depend smoothly on the features of both the flow-in and flow-out nodes makes it possible to incorporate background information into the interaction mechanism; this is critical when the network itself is only a small component of a larger background system [27]. In addition, a by-product of treating edges as a smooth function is its induced computational efficiency. In fact, when a network consists of a giant number of nodes, even a simple summation operation can take a long time and huge memory, but when edges vary smoothly along with nodes, it becomes possible to only do the calculation on a small set of nodes; the global features of edges can then be inferred from the result on the relatively small set by the kernel smoothing technique from nonparametric statistics [28, 29]. Based on these advantages, we will concentrate on the spatial network $G$ instead of a more general concept of network.

3.2. Mean-Field Models

To model spreading processes within a spatial network $G$, we follow the convention in the rumor spreading literature [10, 17] and adopt the common assumption that a rumor can be spread from a node $y$ to another node $x$ if and only if (1) the initial node $y$ has been infected with the rumor, recorded as the event $X_t(y) = 1$; (2) there is an edge between them, or equivalently, $W(x,y) > 0$; and (3) when conditions (1) and (2) hold, whether or not the spreading actually happens is purely random up to a probability $p$. Different spreading models impose different requirements on the probability $p$. In the current study, we adopt the mean-field model to determine $p$, as suggested in most previous studies. Formally, for every fixed time $t$, the probability $p(x,t)$ of node $x$ being infected is determined by the following mean-field equation:

$$\frac{\partial p(x,t)}{\partial t} = \big(1 - p(x,t)\big) \int_{\mathbb{R}^d} W(x,y)\, p(y,t)\, \mu(dy). \tag{1}$$

The interpretation of (1) is that, at time $t$, the temporal variation rate of the probability that node $x$ is infected (represented as $\partial p(x,t)/\partial t$) is proportional to the probability that node $x$ has not yet been infected by time $t$ (represented as $1 - p(x,t)$), and the proportion is determined through a weighted sum of the probabilities of all other nodes in the network having been infected by $t$. The weight function describes the strength of connection between nodes $x$ and $y$, thus can be formulated as the edge function $W(x,y)$. Using the classical results on mean-field equations [46–50], it can be easily verified that the infection probability $p(x,t)$ in (1) is exactly equal to the probability of $X_t(x) = 1$ for a right-continuous mean-field point process $X_t$ satisfying the following:

$$dX_t(x) = \big(1 - X_{t-}(x)\big)\, dN_t(x), \qquad \lambda_t(x) = \int_{\mathbb{R}^d} W(x,y)\, \mathbb{E}\big[X_{t-}(y)\big]\, \mu(dy), \tag{2}$$

where $N_t(x)$ is a counting process with intensity $\lambda_t(x)$ and $X_{t-}$ is the left-limit of the process $X_t$. The interpretation of (2) is more straightforward than that of (1); (2) points out that the average rate of node $x$ being infected is contributed by all those nodes that (1) have a connection to $x$ and (2) have been infected by the current time. These two conditions are often imposed in the literature.
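To make the dynamics (1) concrete, the following is a minimal sketch of an explicit Euler scheme for the mean-field equation (Python; the function and variable names, such as `W` and `nodes`, are illustrative assumptions on our part, not part of the data or a definitive implementation):

```python
import numpy as np

def simulate_mean_field(W, nodes, p0, T, dt):
    """Euler scheme for dp/dt = (1 - p) * mean_y[ W(x, y) p(y) ], cf. (1).

    W     : callable edge weight function on pairs of feature vectors
    nodes : (N, d) array of node features sampled from mu
    p0    : (N,) array of initial infection probabilities
    T, dt : time horizon and Euler step length
    """
    N = nodes.shape[0]
    # Precompute the N x N weight matrix W(x_i, x_j).
    Wmat = np.array([[W(x, y) for y in nodes] for x in nodes])
    p = p0.copy()
    history = [p.copy()]
    for _ in range(int(T / dt)):
        # Monte Carlo approximation of the integral over mu by the sample mean.
        drift = (1.0 - p) * (Wmat @ p) / N
        p = np.clip(p + dt * drift, 0.0, 1.0)
        history.append(p.copy())
    return np.array(history)
```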

Let $p(\cdot, t)$ be a function satisfying the functional differential equation (1), and denote $f$ as the density or mass function associated with the probability measure $\mu$; then, the event that a given node $x$ is observed at time $t$ and its infectious status is observed to be infected has the probability density

$$q_1(x,t) = p(x,t)\, f(x); \tag{3}$$

in contrast, the density for the event that $x$ is observed to be uninfected at $t$ is given as

$$q_0(x,t) = \big(1 - p(x,t)\big)\, f(x). \tag{4}$$

Suppose that, given a time $t$, the infectious status of a set of $n$ randomly picked nodes is observable and represented as

$$\{(x_i, s_i)\}_{i=1}^{n}, \qquad s_i \in \{0, 1\}, \tag{5}$$

with $s_i = 0$ being not infected and $s_i = 1$ being infected; then, the likelihood function of the observations can be written in the following way by using (3) and (4):

$$L(W) = \prod_{i=1}^{n} \big(p(x_i, t)\big)^{s_i} \big(1 - p(x_i, t)\big)^{1 - s_i} f(x_i), \tag{6}$$

where we add the edge function $W$ into the likelihood because it affects $L$ through determining the functional form of $p$. Maximizing (6) can yield the classical maximum likelihood (ML) estimator of $W$.
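Given a solution $p(\cdot, t)$ of (1), evaluating (6) reduces to a standard Bernoulli log-likelihood at the observed nodes. A minimal sketch (the names are hypothetical; `p_at_obs` is assumed to hold the model-implied probabilities $p(x_i, t)$):

```python
import numpy as np

def log_likelihood(p_at_obs, status):
    """Bernoulli log-likelihood of observed infection statuses, cf. (6).

    p_at_obs : (n,) model-implied infection probabilities p(x_i, t)
    status   : (n,) observed statuses s_i in {0, 1}
    """
    eps = 1e-12  # guard against log(0)
    p = np.clip(p_at_obs, eps, 1.0 - eps)
    return np.sum(status * np.log(p) + (1.0 - status) * np.log(1.0 - p))
```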

3.3. Nonparametric Likelihood Estimator and Kernel Smoothing

In the study of a spreading process, only distribution flows of the form (5) are available; the details of the link structure between nodes, represented by the edge function $W$, are not observable and thus need to be estimated. In this section, we construct a nonparametric simulated maximum likelihood (NPSML) estimator of the functional form of $W$ given the distribution flows observed at a sequence of time points. The NPSML is an efficient nonparametric inference technique proposed by Kristensen and Shin [29]. NPSML applies well to the case where an explicit expression of the likelihood function is not achievable, which is exactly what we need to handle here, because the distribution function $p$ in (6) is the solution to the functional differential equation (1); there is no clean analytic expression available for it.

However, our task is different from the situation discussed originally in Kristensen and Shin [29]. First, the original NPSML applies nonparametric kernel smoothing to approximate the unknown likelihood function, while the model generating the likelihood function is still parametric; in (6), however, the likelihood depends on the nonparametric edge function $W$. In this situation, one extra kernel smoothing step is needed to approximate $W$. Second, in Kristensen and Shin [29] and Kukacka and Barunik [36], simulation is conducted at the level of random variables, while, in our case, simulation is at the level of distributions, which is equivalent to numerically solving the mean-field equation (1). Finally, due to the involvement of the nonparametric model setup, the model identifiability has to be checked in order to guarantee the correctness of the resulting estimation.

Due to the first and second differences, we provide the following algorithm to generate the simulated likelihood function (in the following constructions, we always use $K_h$ to denote the $k$-dimensional standard Gaussian kernel function with bandwidth $h$, for some positive constant $h$).

Step 1. Select a constant $\delta > 0$ and large positive integers $N_1$ and $N_2$ ($\delta$ is the length of every time step used for numerically solving the functional differential equation (1); $N_1$ and $N_2$ are the numbers of random samples that will be drawn to generate the kernel smoothing approximations to the unknown likelihood function and edge weight function).

Step 2. Draw $N_1$ random samples $\{y_j\}_{j=1}^{N_1}$ from the distribution $\mu$, and $N_2$ random samples $\{(u_l, v_l)\}_{l=1}^{N_2}$ from the product measure $\mu \otimes \mu$.

Step 3. Given a candidate weight vector $\theta = (\theta_1, \ldots, \theta_{N_2}) \in [0,1]^{N_2}$ assigned to the sampled node pairs, construct the function $\widehat{W}_{\theta}$ as follows:

$$\widehat{W}_{\theta}(x,y) = \frac{\sum_{l=1}^{N_2} \theta_l\, K_{h_1}\big((x,y) - (u_l, v_l)\big)}{\sum_{l=1}^{N_2} K_{h_1}\big((x,y) - (u_l, v_l)\big)}. \tag{7}$$

Step 4. Given the initial observation time $t_0$, let $\mathcal{O}_{t_0} = \{(x_i, s_i)\}_{i=1}^{n_0}$ denote the observation set at $t_0$, whose cardinality is $n_0$; construct the function $\widehat{p}(\cdot, t_0)$ as follows:

$$\widehat{p}(x, t_0) = \frac{\sum_{i=1}^{n_0} s_i\, K_{h_2}(x - x_i)}{\sum_{i=1}^{n_0} K_{h_2}(x - x_i)}. \tag{8}$$

Step 5. Solve the mean-field equation (1) over the interval $[t_0, t]$ at the set of sample points $\{y_j\}_{j=1}^{N_1}$ drawn in Step 2 by Euler's method with time step $\delta$, subject to the initial condition (8), as follows:

$$\widehat{p}(y_j, t_{k+1}) = \widehat{p}(y_j, t_k) + \delta\, \big(1 - \widehat{p}(y_j, t_k)\big)\, \frac{1}{N_1} \sum_{j'=1}^{N_1} \widehat{W}_{\theta}(y_j, y_{j'})\, \widehat{p}(y_{j'}, t_k), \tag{9}$$

where $t_k = t_0 + k\delta$ for $k = 0, 1, \ldots, \lfloor (t - t_0)/\delta \rfloor$, and $\lfloor a \rfloor$ is the greatest integer less than or equal to $a$.

Step 6. For the observation set $\mathcal{O}_t = \{(x_i, s_i)\}_{i=1}^{n_t}$ at $t$ with cardinality $n_t$, generate the simulated infection probability at the observed nodes as follows:

$$\widetilde{p}(x_i, t) = \frac{\sum_{j=1}^{N_1} \widehat{p}(y_j, t)\, K_{h_3}(x_i - y_j)}{\sum_{j=1}^{N_1} K_{h_3}(x_i - y_j)}, \tag{10}$$

and construct the simulated likelihood function as follows:

$$\widehat{L}_t(\theta) = \prod_{i=1}^{n_t} \widetilde{p}(x_i, t)^{s_i}\, \big(1 - \widetilde{p}(x_i, t)\big)^{1 - s_i}. \tag{11}$$

The full-information likelihood function for all observation times $t_1 < t_2 < \cdots < t_T$ can be constructed from (11) in the following way:

$$\widehat{L}(\theta) = \prod_{\tau=1}^{T} \widehat{L}_{t_\tau}(\theta). \tag{12}$$

The estimator of the unknown edge function can be derived by maximizing the simulated full-information likelihood function (12) through selecting an appropriate $\theta$; the final estimator is constructed from the optimal $\theta^{*}$ in the way of (7).
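The six steps above can be condensed into a single objective function. The following is a minimal sketch in Python under illustrative assumptions (Gaussian product kernels, one scalar bandwidth per smoothing step, and placeholder array shapes); the names `smooth_W`, `smooth_p`, and `neg_sim_log_lik` are ours, and the returned value is meant to be passed to a generic numerical optimizer (e.g., `scipy.optimize.minimize`) rather than to constitute a definitive implementation:

```python
import numpy as np

def gaussian_kernel(u, h):
    """Product Gaussian kernel with bandwidth h; u has shape (..., k)."""
    return np.exp(-0.5 * np.sum((u / h) ** 2, axis=-1))

def smooth_W(theta, pairs, samples, h1):
    """Step 3: Nadaraya-Watson smoother (7) evaluated on all sample-node pairs."""
    N1 = samples.shape[0]
    grid = np.concatenate([np.repeat(samples, N1, 0), np.tile(samples, (N1, 1))], 1)
    k = gaussian_kernel(grid[:, None, :] - pairs[None, :, :], h1)  # (N1*N1, N2)
    w = (k @ theta) / (k.sum(axis=1) + 1e-12)
    return w.reshape(N1, N1)

def smooth_p(values, src, dst, h):
    """Steps 4 and 6: kernel-smooth node-level values from src points to dst."""
    k = gaussian_kernel(dst[:, None, :] - src[None, :, :], h)
    return (k @ values) / (k.sum(axis=1) + 1e-12)

def neg_sim_log_lik(theta, pairs, samples, obs_times, obs_sets, dt, h1, h2, h3):
    """Negative log of the simulated full-information likelihood (12)."""
    Wmat = smooth_W(np.clip(theta, 0, 1), pairs, samples, h1)
    x0, s0 = obs_sets[0]
    p = smooth_p(s0.astype(float), x0, samples, h2)   # initial condition (8)
    ll, t = 0.0, obs_times[0]
    for t_next, (x_obs, s_obs) in zip(obs_times[1:], obs_sets[1:]):
        while t < t_next - 1e-9:                      # Euler recursion (9)
            p = np.clip(p + dt * (1 - p) * (Wmat @ p) / len(samples), 0, 1)
            t += dt
        p_obs = np.clip(smooth_p(p, samples, x_obs, h3), 1e-12, 1 - 1e-12)
        ll += np.sum(s_obs * np.log(p_obs) + (1 - s_obs) * np.log(1 - p_obs))
    return -ll
```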

Compared to the NPSML in Kristensen and Shin [29], the algorithm in our study includes one extra sampling step to draw $N_2$ random node pairs from $\mu \otimes \mu$, which are used for approximating the unknown $W$. In addition, there are two kernel smoothing steps (Steps 4 and 6) regarding the density function $p$, one for the initial density at the starting time $t_0$ and the other for the end-time density at $t$. The two kernel smoothing steps are not required when the total number of nodes is small (a few hundred or a few thousand), in which case the whole set of nodes is directly used as the samples drawn in Step 2. However, when the system has a giant node set (say millions), a sample of size $N_1 \ll N$ can be applied in order to lift the computation efficiency. Moreover, the node sets observed at different observation times may not always be identical; it is more often the case that when a node is tracked to be uninfected at some time $t$, it will be regarded as safe and be missing from the consecutive tracking at the next few observation time points. In this interval-censored situation, the sampled nodes and the two kernel smoothing steps are needed to avoid the noise induced by censoring.

As documented in Kristensen and Shin [29] and Kukacka and Barunik [36], the NPSML estimator does not suffer from the “curse of dimension” despite its nonparametric essence, because the number of simulation samples is independent of the number of observation samples. When the latter is large, the inefficiency induced by kernel smoothing vanishes during the aggregation involved in the likelihood function. By the same argument and the fact that in most real-world applications the number of observed nodes is giant, our modified NPSML estimator is free from the curse of dimensionality as well.

3.4. A Fast Algorithm

As shown in (9), the estimation procedure requires repeated evaluation of the multiplication between an $N_1 \times N_1$ matrix and an $N_1$-dimensional vector; the computation complexity is of the order $O(N_1^2)$. Although $N_1$ can be taken much smaller than the number of nodes in observation ($N$), it still has to increase as $N$ increases. So when $N$ is a giant number, $N_1$ has to be large as well; the computation complexity of the entire estimation procedure will then be dominated by $O(N_1^2)$. In this section, we propose a fast algorithm which can reduce the computation complexity of (9) to depend linearly on $N_1$, which is reasonable and implementable in practice.

The idea of the fast algorithm comes from the technique of agent-based simulation (ABS). In every iteration of ABS, every agent in the network is only required to interact with one other agent randomly picked from its neighbors. In our setting, there is no strict “neighbor” defined, but it is still possible to randomly pick one agent from the entire population so that the interaction is only counted between the given agent and its randomly picked partner. Formally, Step 5 in the previous section is split into three substeps.

Step 5(1). For a fixed sample node $y_j$ and a fixed time step $t_k$, randomly pick one node $y_{r(j,k)}$ from $\{y_1, \ldots, y_{N_1}\}$;

Step 5(2). Compute

$$\widehat{p}(y_j, t_{k+1}) = \widehat{p}(y_j, t_k) + \delta\, \big(1 - \widehat{p}(y_j, t_k)\big)\, \widehat{W}_{\theta}\big(y_j, y_{r(j,k)}\big)\, \widehat{p}\big(y_{r(j,k)}, t_k\big); \tag{13}$$

Step 5(3). Repeat the above two steps for all $j = 1, \ldots, N_1$, all time steps $t_k$, and all candidate $\theta$s.

Comparing (9) and (13), the main difference is that the inner product of vectors (i.e., the sum over $j'$) is replaced with a single scalar multiplication, so the resulting computation complexity for all nodes depends linearly on $N_1$, which is significantly faster than the original algorithm.
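A sketch of the fast update (13) is given below (Python; names are illustrative assumptions on our part). Each node interacts with a single randomly drawn partner per step, so the cost per Euler step falls from $O(N_1^2)$ to $O(N_1)$:

```python
import numpy as np

def euler_step_fast(p, Wmat, dt, rng):
    """One fast-algorithm step: each node interacts with one random partner.

    p    : (N1,) current infection probabilities at the sampled nodes
    Wmat : (N1, N1) edge weights between sampled nodes
    rng  : numpy random generator, e.g., np.random.default_rng()
    """
    N1 = p.shape[0]
    partners = rng.integers(0, N1, size=N1)      # Step 5(1): one partner per node
    # Step 5(2), eq. (13): scalar multiplication replaces the inner product in (9).
    drift = (1.0 - p) * Wmat[np.arange(N1), partners] * p[partners]
    return np.clip(p + dt * drift, 0.0, 1.0)     # O(N1) instead of O(N1^2)
```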

For the accuracy of the fast algorithm, we claim that, compared to the original algorithm, the accuracy loss induced by the speed-up is controlled by a constant multiple of $\delta$. In fact, due to the randomness of the picked partners $y_{r(j,k)}$, it is easy to verify the following: (i) the expectation of the left-hand side of (9) is identical to the expectation of the left-hand side of (13); (ii) denote $\Delta_k(y_j)$ as the increment between the updates (9) and (13) at step $k$; then $\Delta_0(y_j) = 0$, and for $k \ge 1$ and all $\theta$s, $|\Delta_k(y_j)| \le C\delta$ for some constant $C$.

Property (i) and the identity for $k = 0$ in (ii) are quite trivial. For $k \ge 1$, $\Delta_k(y_j)$ can be decomposed as the sum of two components, one collecting the deviation accumulated up to step $k - 1$ and the other the deviation newly generated at step $k$, each controlled in terms of $\|\cdot\|_{\infty}$, the sup norm of a bounded $[0,1]$-valued function. The bound holds straightforwardly from the fact that $\widehat{p}$ is bounded by $1$ and its temporal derivative, given by (1), is also uniformly bounded by $1$; then, statement (ii) follows immediately.

Using properties (i) and (ii) and the law of large numbers, it is straightforward that the difference between the likelihood function constructed from (9) and that constructed from (13) is bounded by a constant multiple of $\delta$ as the number of sampled nodes $N_1 \to \infty$. If we further require $\delta \to 0$ along with $N_1 \to \infty$, the two types of calculation of the likelihood function would be asymptotically identical, which leads to the same estimator of the hidden network.

Also notice that, by the fast algorithm, the choice of $\delta$ is largely independent of the estimation accuracy, so in practice it can be selected directly as the gap between two consecutive observation times to increase the speed.

3.5. Block Network

The NPSML algorithm constructed in the previous sections can be further extended to make inference for the block network model. As in many applications [33, 38, 39], the existence of a connection between two agents is only relevant to the groups they belong to, and the features of agents only affect which group they are assigned to. Without loss of generality, the set of $m$ groups can be considered as inducing a partition of the set of all nodes; then, the edge function can be decomposed into two components:(i)the group weight function $g : \mathbb{R}^d \to [0,1]^m$;(ii)the group-level edge weight $B$, which is an $m \times m$ matrix with each entry valued in $[0,1]$.

The edge function for the block network model can be recovered from (i) and (ii) as follows:

$$W(x,y) = g(x)^{\top} B\, g(y), \tag{15}$$

where the image of $g$ is viewed as an $m$-dimensional column vector and the superscript $\top$ represents vector transpose. The group weight function is required to satisfy that, for every $x \in \mathbb{R}^d$, there exists only one group $c$ with $g_c(x) > 0$, which means every node can only have positive probability of belonging to at most one group; this guarantees the requirement that groups constitute a partition of the node set.
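As an illustration of (15), the following sketch evaluates the block edge weight; the one-hot membership function `g_example` is a hypothetical choice for $m = 2$ groups, not the function used in our experiments:

```python
import numpy as np

def block_edge_weight(x, y, g, B):
    """Edge weight (15): W(x, y) = g(x)^T B g(y).

    g : callable mapping a feature vector to an m-dimensional group weight
        vector with at most one positive entry (one group per node)
    B : (m, m) group-level interaction matrix with entries in [0, 1]
    """
    return float(g(x) @ B @ g(y))

# Illustrative membership function: nodes on the unit cube are assigned to
# one of m = 2 groups by thresholding their first feature.
def g_example(x, m=2):
    out = np.zeros(m)
    out[0 if x[0] < 0.5 else 1] = 1.0
    return out
```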

The estimation of a block network is equivalent to the estimation of (1) the group weight function $g$, which is unknown and constitutes the fully nonparametric component of the network, and (2) the interaction matrix $B$, which is the parametric component of the network. So the estimation is essentially semiparametric. The six-step algorithm discussed in Section 3.3 and the fast algorithm in Section 3.4 are still applicable to this case. The only modification is to Step 3, where the kernel smoothing method is no longer applied to the unknown edge weight $W$. Instead, it is applied to generate the estimate of the group weight $g$. Then the hidden weight function $W$ is constructed from the kernel-smoothed $\widehat{g}$ and the given interaction matrix $B$ in the way of (15).

The block network model has many advantages. For instance, when the number of groups $m$ involved is small and does not depend on the number of nodes, the number of parameters to solve is only of the order $m^2$, while the number is $N_2$ when there is no block structure at all. To generate a good approximation to the true edge function, $N_2$ has to increase along with the node number $N$ (although slowly); when the node number in observation is giant, $N_2$ has to be large as well; then $m^2 \ll N_2$. Through the block network, we can sharply reduce the dimension of the parameter space when solving the maximum likelihood problem, which can significantly lift the computation efficiency.

In addition, the block network is much easier to identify than the general fully nonparametric network, which will be discussed in the next section. Finally, under the block network, the equilibrium infectious distribution of the spreading process has a clear analytic expression, as stated in the following proposition (the proof of Proposition 1 is quite trivial, hence omitted).

Proposition 1. Denote $\pi_c$ as the projection of a vector to its $c$th coordinate. Define $A_c = \{x \in \mathbb{R}^d : g_c(x) > 0\}$, which consists of the set of nodes belonging to group $c$; then, within a mean-field model of the form (2) with edge function given by (15), every equilibrium infection distribution $p^{*}$ (i.e., satisfying $\partial p^{*}/\partial t = 0$) must have the following form:

$$\pi_c\big(\bar{p}^{*}\big) = \begin{cases} 1, & \text{if } \pi_c\Big(\sum_{k=1}^{m} B^{k}\, \bar{p}_0\Big) > 0, \\ \pi_c\big(\bar{p}_0\big), & \text{otherwise,} \end{cases} \tag{16}$$

where $p_0$ is the prescribed initial distribution of infectious status, $\bar{p}_0 = \big(\int_{A_1} p_0\, d\mu, \ldots, \int_{A_m} p_0\, d\mu\big)^{\top}$ and $\bar{p}^{*}$ are the group-level versions of $p_0$ and $p^{*}$, $\pi_c$ is the projection of a vector to its $c$th dimension, and $B^{k}$ denotes the $k$th power of the matrix $B$.

Proposition 1 is meaningful in the sense that it links the types of equilibrium infectious distributions with matrix algebra, facilitating the qualitative analysis of the equilibrium distribution. For instance, when $B$ is an upper triangular matrix with all its lower off-diagonal entries being zero and all diagonal and upper off-diagonal entries being strictly positive, such as in (17),

$$B = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ 0 & b_{22} & b_{23} \\ 0 & 0 & b_{33} \end{pmatrix}, \qquad b_{cc'} > 0 \ \text{ for all } c \le c', \tag{17}$$

then the equilibrium distribution and the initial distribution satisfy the relation:

$$\pi_c\big(\bar{p}^{*}\big) = 1 \iff \pi_{c'}\big(\bar{p}_0\big) > 0 \ \text{ for some } c' \ge c. \tag{18}$$

3.6. Validity of NPSML

Due to the nonparametric nature of the edge function $W$, its identifiability is tricky. When the spreading process can be observed multiple times ($R$ times) with random initializations and $R$ is large, as assumed in Roudi and Hertz [41] and Shen et al. [40], both the fully nonparametric network and the block network are identifiable. However, in real applications, a spreading process can at most be observed a few times; it is not expected that $R$ can be very large. In that case, the fully nonparametric edge function is no longer fully identifiable; i.e., there exists $W \ne W_0$ that leads to the same likelihood function (6) in the limit case. However, it can be shown that $W_0$ is identifiable up to a compact convex set; i.e., the set $\mathcal{W}_0$ of all edge functions leading to the same likelihood as $W_0$ is a compact convex set within the function space $L^2(\mu \otimes \mu)$, where $W_0$ stands for the true value of the edge function. It can also be proved that the set $\mathcal{W}_0$ varies along with the initial infectious status $p_0$. Formally, we have that $W \in \mathcal{W}_0$ if and only if the following holds for all $k \ge 1$:

$$\big(M_{1-p_0} \circ \mathcal{T}_{W}\big)^{k}\, p_0 = \big(M_{1-p_0} \circ \mathcal{T}_{W_0}\big)^{k}\, p_0, \tag{19}$$

where $\mathcal{T}_W$ is a bounded operator over the functional space $L^2(\mu)$ defined through $W$ as $(\mathcal{T}_W f)(x) = \int_{\mathbb{R}^d} W(x,y) f(y)\, \mu(dy)$ for every $f \in L^2(\mu)$, with $\mu$ being the default node distribution; $M_{1-p_0}$ is the multiplicative operator determined by $p_0$ such that $M_{1-p_0} f = (1 - p_0) \cdot f$; the $k$th power in (19) represents the self-composition of an operator for $k$ times. (19) implies that the identifiability of the true edge function $W_0$ is limited by the extent of the ergodicity of the spreading process within the node space $\mathbb{R}^d$. For instance, when there exists a small open set $U$ such that all nodes in $U$ are infected before the initial time $t_0$, i.e., $p_0(x) = 1$ for all $x \in U$, then it can be verified by (19) that all functions that deviate from $W_0$ only within the band set $U \times \mathbb{R}^d$ are contained in $\mathcal{W}_0$. On the other hand, if there exists an open set $V$ such that $p_0(x) = 0$ and $W_0(x, y) = 0$ for all $x \in V$ and all $y$, then all functions that deviate from $W_0$ only within $\mathbb{R}^d \times V$ are contained in $\mathcal{W}_0$. In both of the two cases, nodes in $U$ or $V$ are not in the ergodic range of the spreading process; hence, the transmission of their infectious status is not observable. For nodes in $U$, their infections occur ahead of the observation period and hence are not observable after the start of spreading, while for nodes in $V$ it can be verified that they will never be infected over the entire spreading process. Therefore, the identifiability of $W_0$ is restricted by the experience of the spreading process, which is reasonable.

It is still an open question what conditions added to $W_0$ and/or $p_0$ can guarantee the identifiability of the fully nonparametric $W_0$. But in the special case of block networks, one simple identifiability condition can be figured out. In fact, for block networks, it is straightforward that $(g_0, B_0)$ is identifiable if and only if there does not exist a pair $(g, B)$ that differs from the true $(g_0, B_0)$ but leads to the same likelihood function (6) in the limit case, if and only if, for the true $(g_0, B_0)$, the vector space spanned by the family of vectors $\{\mathrm{diag}(u)^{k} w : k = 0, 1, \ldots\}$ is the entire space $\mathbb{R}^m$, i.e., the matrix $\big[w, \mathrm{diag}(u) w, \ldots, \mathrm{diag}(u)^{m-1} w\big]$ has full rank. Here $m$ is the number of blocks, and $w$ is an $m$-dimensional column vector determined by the initial distribution, with $w_c = \int_{A_c} p_0\, d\mu$ being the $c$th entry of $w$. To reach the full rank condition, the well-known Wronskian determinant [51] can be applied, leading to the following clean-form identifiability condition:

$$\det\big[w,\ \mathrm{diag}(u)\, w,\ \ldots,\ \mathrm{diag}(u)^{m-1} w\big] = \prod_{c=1}^{m} w_c \prod_{1 \le c < c' \le m} \big(u_{c'} - u_c\big) \ne 0, \tag{20}$$

where $u$ is the other $m$-dimensional column vector determined by the true function such that $u_c = (B_0 w)_c$ for $c = 1, \ldots, m$; $\mathrm{diag}(\cdot)$ is the operation that converts an $m$-dimensional vector to an $m \times m$ matrix with its diagonal elements being the given vector. By the polynomial nature of the determinant function, it can be verified that (20) holds “generically” in the sense that the set of $u$s that forces (20) to be constantly equal to $0$ is contained in an $(m-1)$-dimensional surface within $\mathbb{R}^m$, and for those $u$s for which (20) is not constantly $0$, the set of $w$ that forces (20) to be $0$ is only contained in an $(m-1)$-dimensional surface within $\mathbb{R}^m$. Therefore, (20) holds for almost all $u$ and $w$ except for some extreme cases that have measure $0$ under the standard Lebesgue measure.

The “almost” identifiability for block networks guarantees that, in most cases, when the number of observed nodes is large and the distribution of observation times is dense, the estimated $\widehat{g}$ and $\widehat{B}$ from the NPSML asymptotically converge to their true values and point-wise follow multivariate normal distributions. This asymptotic result follows straightforwardly from Kristensen and Shin [29], Kukacka and Barunik [36], and the general properties of maximum likelihood estimators. So, the theoretical validity of the estimators developed in the previous sections is established.

Remark 2 (sparsity). Although in general the complete identifiability for both the general network and the block network is hard to achieve, if we follow the idea in the network reconstruction literature (Shen et al. [40]), we can concentrate only on the case where the hidden network is as sparse as possible in the following sense: the $L^2(\mu \otimes \mu)$ norm of the edge weight function for the general network and/or the entry-wise square sum of the block matrix $B$ (this is the same norm on the discrete set with cardinality $m^2$) is as small as possible. To automate the selection of the sparsest network, we can consider the norm function as a penalty and subtract it from the log-likelihood function (6); then, optimizing (6) would guarantee the solution converging to the sparsest network. It is easily verified that such a sparse solution is always asymptotically unique because, as we discussed in the previous paragraphs, all networks that can lead to exactly the same log-likelihood function form a compact convex set in the functional space; by the compactness and convexity, there always exists a unique $W^{*}$ (or $B^{*}$) such that its $L^2$-distance to the origin reaches the minimum.
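As an illustration, the penalized criterion described above can be written as follows (with $\lambda > 0$ a tuning constant of our choosing, not fixed by the method):

$$\widehat{W} = \arg\max_{W} \Big\{ \log L(W) - \lambda\, \|W\|^{2}_{L^2(\mu \otimes \mu)} \Big\}, \qquad \widehat{B} = \arg\max_{B} \Big\{ \log L(B) - \lambda \sum_{c, c' = 1}^{m} B_{cc'}^{2} \Big\}.$$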

4. Numerical Experiment with Synthetic Data

Two synthetic data sets are generated from simulation to test the effectiveness of the NPSML estimator designed in the previous sections, one for the fully nonparametric network and the other for the block network. For both examples, the node set consists of 200 nodes which are drawn purely randomly from the unit cube $[0,1]^3$; thus, these nodes follow the uniform distribution. Consider the following model setup.

Example 1 (fully nonparametric network). The edge function is negatively proportional to the standard Euclidean distance between two nodes, i.e.,

$$W(x,y) = 1 - \frac{\|x - y\|}{\sqrt{3}}, \tag{21}$$

so that the weights fall within $[0,1]$ over the unit cube.

Example 2 (block network). Set the number of blocks $m$; the block membership function $g$ satisfies (22), and the matrix $B$ is given as in (23).

For both examples, the spreading process is initialized such that 30% of all nodes are infected at the very beginning, with the infected nodes randomly picked from the node set. The full spreading process is generated from a discrete version of (2) with a sufficiently small time step (so that the resulting distribution flows are a first-order approximation to the true flows); a coarser time step $\delta$ is used for the estimation procedure (9) in order to test the robustness. The process is followed up until day 5; i.e., the time horizon in this simulation study is $[0, T]$ with $T = 5$. The observation of the distribution flows is supposed to be available only at the initial time and at the end of every day; i.e., there are 6 chances to observe the distribution of infections, at $t = 0, 1, \ldots, 5$.
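The generation of the synthetic flows can be sketched as follows (Python; the 200 uniform nodes, the 30% random initial infections, and the daily observations follow the text, while the fine simulation step `dt_fine = 0.01` and the exact edge function are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
nodes = rng.uniform(size=(N, d))                 # uniform nodes in the unit cube

def W_true(x, y):                                # Example 1 style edge function
    return max(0.0, 1.0 - np.linalg.norm(x - y) / np.sqrt(3.0))

Wmat = np.array([[W_true(x, y) for y in nodes] for x in nodes])

X = np.zeros(N)                                  # infection status in {0, 1}
X[rng.choice(N, size=int(0.3 * N), replace=False)] = 1.0

dt_fine, T = 0.01, 5.0
flows = [X.copy()]                               # observed at t = 0, 1, ..., 5
for step in range(1, int(T / dt_fine) + 1):
    rate = (Wmat @ X) / N                        # discrete version of (2)
    infect = rng.uniform(size=N) < rate * dt_fine
    X = np.maximum(X, infect.astype(float))      # infection is absorbing
    if step % int(1.0 / dt_fine) == 0:           # record at the end of every day
        flows.append(X.copy())
```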

For the fully nonparametric Example 1, the spreading process is regenerated 100 times with 100 random initializations; this is necessary to address the identification issues pointed out in Section 3.6. For the 100 trials, both the node set and the initial infectious subset are regenerated, although their distributions are held constant. For the block network Example 2, the spreading process is generated only once in order to evaluate the fitting performance in the situation where no repeated observation of the spreading process is available. For both examples, the estimated edge function is evaluated on a fixed set of grid node pairs for easy comparison, where the grid set forms a lattice of the unit cube $[0,1]^3$.

If all nodes are included in the computation of the NPSML estimator, there are in principle a 40,000 ($= 200 \times 200$)-dimensional parameter space for the fully nonparametric network in Example 1 and an ($m^2 + 200m$)-dimensional parameter space for the block network in Example 2 to be searched, which is too time consuming. As noted in the introduction of the NPSML estimator, by the smoothness of the edge function, the number of nodes actually used to evaluate the edge function can be much smaller than the size of the entire node set. So, to reduce the computation load, we generate another $N_1 = 20$ nodes from the uniform distribution, which will be used in the algorithm of Section 3.3 for simulating the distribution function $\widehat{p}$. Accordingly, the node pairs will be selected as the product of the 20 nodes for the fully nonparametric Example 1; then, there are $N_2 = 400$ parameters to optimize in Example 1, and this size is quite reasonable for most nonparametric tasks. For the block network Example 2, as no node pairs are needed for block networks, there are only $m^2 + 20m$ parameters to optimize. As for the selection of the kernel widths $h_1$, $h_2$, and $h_3$, we set $h_1 = 400^{-1/10}$, $h_2 = 200^{-1/7}$, and $h_3 = 20^{-1/7}$. This is because the kernel smoothing method requires the kernel width to satisfy $h \to 0$ and $n h^{d} \to \infty$ in order to guarantee the consistency and asymptotic normality [28, 29, 36, 52], where $n$ is the input sample size and $d$ is the dimension of the data. By a rule of thumb, we select the kernel width as $h = n^{-1/(4+d)}$. For $h_1$, it is only used in Example 1 to estimate the edge function, where the sample size is 400 and the data dimension is two times the dimension of the node space, i.e., 6; thus, $h_1$ is $400^{-1/10}$. For $h_2$ and $h_3$, they are used in both examples for estimating the distribution function $p$; thus, the data dimension is always 3. The sample size for $h_2$ is 200 because it is used to turn the real observed infection statuses on the 200 nodes into their kernel-smoothed version, and the sample size for $h_3$ is 20 because it turns the estimated $\widehat{p}$ on the 20 sampled nodes into its values on the full node set.
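The bandwidth choices follow directly from the stated rule of thumb; a minimal sketch (the leading constant of the rule is taken as 1 here, an assumption on our part):

```python
def rule_of_thumb(n, d):
    """Kernel bandwidth h = n^(-1/(4+d)) for sample size n and data dimension d."""
    return n ** (-1.0 / (4 + d))

h1 = rule_of_thumb(400, 6)   # edge function: 400 node pairs, dimension 2 * 3
h2 = rule_of_thumb(200, 3)   # initial density: 200 observed nodes, dimension 3
h3 = rule_of_thumb(20, 3)    # simulated density: 20 sampled nodes, dimension 3
```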

For the inference of the block network, the number of blocks $m$ is usually not known a priori, so it is also a parameter to estimate. As $m$ determines the model dimension, we adopt the classical Bayesian information criterion (BIC) introduced in Schwarz [53] to detect the correct model dimension. As defined in Schwarz [53], a greater BIC for a fitted model implies better explanatory power; therefore, the best choice of $m$ corresponds to the maximal BIC. In practice, it is not possible to calculate the BIC value for all positive $m$, so we follow the convention and only compute the BIC on a small set of candidate $m$. The $m$ associated with the maximal BIC and the corresponding estimates of $g$ and $B$ are selected as the final estimators and reported in the following. In our examples, the correct $m$ is always achieved, so we omit this trivial result.
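A sketch of the BIC-based selection loop (Python; `fit_block_network` is a hypothetical wrapper around the NPSML optimization that returns the maximized log-likelihood and the parameter count):

```python
import numpy as np

def bic(log_lik, n_params, n_obs):
    """Schwarz BIC in the 'greater is better' convention used in the text."""
    return log_lik - 0.5 * n_params * np.log(n_obs)

def select_block_number(candidates, data, fit_block_network):
    best_m, best_score = None, -np.inf
    for m in candidates:                         # e.g., candidates = [2, 3, 4, 5]
        log_lik, n_params = fit_block_network(data, m)
        score = bic(log_lik, n_params, n_obs=len(data))
        if score > best_score:
            best_m, best_score = m, score
    return best_m
```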

In Figure 1, we plot the difference between the real edge function and the NPSML-estimated edge function on the set of grid node pairs for both examples, where the horizontal axis represents the true value of the edge weight on every node pair and the vertical axis represents the estimated weight on the same node pair. To facilitate visualization, Figure 1 is sorted according to the horizontal axis in an ascending manner. The red dots represent the pairs of (estimated weight, true weight), and the blue line sketches the identity function; therefore, a red dot being closer to the blue line means better fitting accuracy. Apparently, for most node pairs, the difference is negligible. To further verify this visual judgement, a $\chi^2$ test is carried out for every node pair with the null hypothesis that the estimated and true weights are identical. Following the asymptotic normality of the NPSML estimator at every node pair, the distribution of the test statistic under the null hypothesis should be a $\chi^2$ distribution with degree of freedom 1, where the asymptotic variance of the estimator entering the statistic can be calculated by the bootstrap method. We count the number of node pairs that fail to support the null hypothesis at the 90% confidence level; the result shows that, in both examples, fewer than 10% of all 10,000 evaluation pairs in the grid set fail to support the null hypothesis. So our estimation accuracy is quite satisfactory, which agrees with the visualization in Figure 1.

For the block network Example 2, Table 1 presents the entry-wise accuracy of the estimated $\widehat{B}$ relative to the true matrix in (23); the first column presents the estimation bias, and the second and third columns are the empirical standard deviations and the empirical P values of the estimates, from which we can conclude that the fitting accuracy is highly satisfactory.

For a robustness check, we also consider synthetic data generated with different $\delta$ and the implementation of the NPSML estimation on node samples of different sizes $N_1$ and $N_2$. When $N_1$ and $N_2$ are increased, no significant difference can be detected in terms of the estimation accuracy measured by the entry-wise bias between the true and the estimated edge weights, so we omit the plot of this result. For the rejection ratio, at the 90% confidence level, of the null hypothesis that the true and estimated edge weights are identical, this ratio is lowered a bit for the block network, to less than 6%, but no significant decrease can be detected for the general network example. This observation might be caused by the fact that, for the general network, there are many more free parameters to estimate, which reduces the convergence speed. As for the different $\delta$, the variation of the estimation accuracy is not significant in all aspects; this fact agrees with the discussion at the end of Section 3.4.

5. Experiment with Rumor Spreading on Twitter

To demonstrate the usefulness of the NPSML method in real-world applications, we carry out an experiment with the distribution flow data of a real rumor spreading process on Twitter. We collect a data set of tweet articles related to the famous event “Unite the Right rally”. The “Unite the Right rally”, also known as the Charlottesville rally or Charlottesville riots, was a white supremacist rally that occurred in Charlottesville, Virginia, from August 11 to 12, 2017. The rally occurred amidst the backdrop of controversy generated by the removal of Confederate monuments throughout the country in response to the Charleston church shooting in 2015. The event turned violent after protesters clashed with counter-protesters, leaving over 30 injured. The rally also attracted wide attention on Twitter. Twitter users led vigilante campaigns on the platform to personally identify and denounce individual marchers in the rally; following the start of the campaign, many of the marchers were shamed and vilified by the social media community, with several of the rally attendees being dismissed from their jobs as a result of the campaign.

Although the rally originally occurred in Charlottesville, messages and/or comments related to it immediately spread through Twitter to users in many other places, including all major cities in the US, which inspired subsequent vigils and demonstrations in a number of cities across the country in the days following Aug. 11 and 12, 2017. For this event, we collect a time series of user-level information (during the time from Aug. 11 to Sep. 4, 2017) that records all Twitter user accounts in 20+ cities that spread, at least once, any message/comment related to the rally during the collection period. We also collect the reaction time of every user to relevant messages and user-specific information such as the numbers of followers and friends that a user has and how many tweets the user has published in the past (history posts). In addition, the registration location of the Twitter account and its corresponding latitude and longitude are also collected.

Similar to most rumor spreading data, it is not possible to track how every single message is spread from user to user from our collected data; thus, there is no way to directly identify the interaction network among users. But it is possible to generate the distribution flows of users who have joined the spreading process. Formally, we can define, at each time point $t$, that a user has joined the process if and only if by $t$ he/she has reacted at least once to the messages/comments related to the rally; then, the data set can be easily converted to day-by-day distribution flows where, at every time (day) $t$ since the origin (Aug. 11, 2017), we have an $N$-dimensional $\{0,1\}$-valued vector, with $N$ being the number of all users in record. The $i$th coordinate takes value 1 if and only if the $i$th user has reacted to the rally messages at least once by $t$.

For such a distribution flow data set, we are interested in making inference of the features of the interaction network between users, because they are useful for making predictions for other spreading processes on Twitter regarding similar social events. To that end, we apply the NPSML method to estimate the hidden interaction network from the flow data. Since there are 100,000+ users in our record, and it is likely that many users belong to the same latent group so that their response pattern is similar to that of their common group members, it is more appropriate to assume the interaction network behind our flow data is a block network and then apply the NPSML to the block network model discussed in Section 3.5.

To uncover the dependence of the interaction links between users on their geographical features and/or friendship/followership relations, we embed the nodes (users) of the interaction network into a 5-dimensional feature space with the coordinates representing the latitude and longitude of the account location and the numbers of friends, followers, and history posts, respectively. To reduce the computation burden, we adopt the bootstrap method: we randomly pick 10,000 users from the full set of users 10 times and estimate the block network on each of the subsamples. For every subsample, an estimator for the membership weight function $g$ and the interaction matrix $B$ can be derived. The aggregated estimator for the interaction matrix is averaged over all subsample estimators; for the block membership weight $g$, the aggregated estimator is derived by maximum a posteriori from the set of subsample estimators.

For a robustness check, we select a coarse time step $\delta$ to solve (9). As a block network is used, there is no need to draw samples of node pairs; only sampled nodes are needed for evaluating $g$. To reduce the computation burden, we take an $N_1$ much smaller than the number of all users in record (10,000+) to approximate the membership weight function $g$ and the distribution function $p$. To check the robustness of our estimation with respect to different choices of $N_1$, we preliminarily run the estimation program on a set of different $N_1$. The feature vectors of the $N_1$ nodes in each trial are selected by conducting a K-means clustering on the full sample with the number of clusters equal to $N_1$; then, the set of cluster centres is selected as the feature vectors. The feature vectors so selected distribute asymptotically in the same way within the feature space as the full sample of nodes. The preliminary result shows that the estimators are not sensitive to the choice of $N_1$ and become stable when $N_1$ is greater than 100. Therefore, we will fix $N_1 = 100$; the cluster centres are also used as the evaluation nodes for the estimated function $g$.
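The node-selection step can be sketched as follows (Python, using scikit-learn's `KMeans`; the feature standardization is an illustrative assumption on our part):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_sample_nodes(features, n_nodes=100, seed=0):
    """Pick N1 representative nodes as K-means cluster centres.

    features : (N, 5) array of user features (latitude, longitude,
               numbers of friends, followers, and history posts)
    """
    # Standardize so that no single feature dominates the distance metric.
    mu, sd = features.mean(axis=0), features.std(axis=0) + 1e-12
    z = (features - mu) / sd
    km = KMeans(n_clusters=n_nodes, n_init=10, random_state=seed).fit(z)
    return km.cluster_centers_ * sd + mu         # back to the original scale
```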

The choice of the best block number $m$ is still based on the maximization of the BIC value. We plot the BIC for the three cases where the block number equals 3, 4, and 5 in Figure 2; the BIC reaches its maximum when the block number is 4, so we consider a block network with 4 blocks as the final model for further analysis.

Different visualizations of the block network are provided. Figure 3 sketches the geographic range of every block/community of the Twitter network; the numbers of followers, friends, and history posts are plotted along with the locations of every user within every community in subfigures (a), (b), and (c), respectively. Note that the 100 users in Figure 3 are synthetic in the sense that their attributes are described by the centre vectors of the 100 clusters yielded from applying K-means clustering to the full set of 10,000+ users. Because the clustering is taken in a 5-dimensional feature space, the location of every synthetic user may not lie exactly within a city in the US, nor around a group of neighboring cities. Although the deviation between synthetic users and real users may seem anomalous, it reflects the information loss when a higher-dimensional cluster is projected to a low-dimensional space; this lost information can play a critical role in determining the community membership of both the synthetic and real users. To see this, consider the synthetic user represented by the largest green dot in Figure 3(a); its geographic location is obviously not close to any city or city group within our record. To be grouped into the same cluster by the K-means method, all real users corresponding to this synthetic user must be quite far away from each other geographically but highly analogous in the other dimensions of features, such as, in this case, the number of followers. Consequently, the community membership of the large green-dot user and the real users it represents is not fully determined by geographic factors; it is more likely to depend on the extra social factors, such as the number of followers, which are not directly related to users' locations. This observation also justifies the necessity of including extra information in the analysis of information spreading processes on Twitter.

From the mean value of every feature reported in Table 2, the four user communities can be roughly summarized by their activeness as follows: (1) the big name community, within which the users are more likely to have a giant group of followers and friends while also being highly active on Twitter; (2) the nobody community, within which users have a fairly small number of followers and friends compared to the other three communities, and their history posting is not very active either; (3) the famous inactive community, whose users have quite a lot of followers but only a few friends and a relatively small number of history posts, so this group of users might be “stars” in some fields (large follower group) but are less likely to interact with others on Twitter and therefore are not active; (4) the famous active community, whose users do have many followers but, differently from the inactive community, also have huge average numbers of friends and history posts, which indicates that they are very active on Twitter.

If we further examine the spatial distribution of features within every community in Figure 3, we find the following: (1) the numbers of followers and friends are distributed highly unevenly within every community, with only one or two synthetic users taking extremely large values; this uneven pattern suggests a classical centre-periphery structure within a community, where the users with the largest follower and/or friend counts lead the spreading of opinions both within their own community and across communities; (2) the number of history posts is much more evenly distributed within all four communities, reflecting an important characteristic of social media: every user has the same right to express an opinion, famous or influential in real life or not; (3) although users within every community are not spatially clustered, there is a weak spatial segregation pattern among the four communities (better visualized in Figure 4); further studies are needed to understand the source of this segregation.

The link strength between different communities is presented in Table 3 (the “From” label in the column header indicates that the values in each column represent the impact strength exerted by the community named in that column on the other communities; the “To” label in the row names indicates that the values in each row represent the impact strength received by the community named in that row) and visualized in Figure 4. A significant hierarchical structure can be read off the link matrix: the big name community dominates all the other communities in shaping social opinions, followed by the famous active community. Compared with the famous active community, however, the big name community is more likely to accept arguments originating from the nobody and famous inactive communities. Users in the famous inactive community read only the tweets posted by members of the big name and famous active communities and receive nothing from their own insiders or from the nobody community, which reflects a kind of opinion discrimination. Finally, the nobody community appears isolated from all the other communities and hears only from its own insiders, forming another kind of opinion discrimination [54].
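Because the From/To orientation of such a matrix is easy to misread, the snippet below pins down the convention used in Table 3; the community names follow Table 2, and the matrix entries are placeholders, not the estimated values.

```python
# Reading convention for the estimated link matrix (Table 3): columns are
# source ("From") communities, rows are target ("To") communities, so
# L.loc[i, j] is the influence community j exerts on community i.
# The zero entries below are placeholders, not the estimated values.
import pandas as pd

communities = ["big name", "nobody", "famous inactive", "famous active"]
L = pd.DataFrame(0.0, index=communities, columns=communities)

def influence(L: pd.DataFrame, source: str, target: str) -> float:
    return L.loc[target, source]  # row = To, column = From
```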

From the above analysis, quite a few interesting features of the information spreading process on Twitter can be drawn out. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it would be helpful to apply textual mining to the tweets involved in the spreading process, add the extracted information as covariates of the spreading process, and re-estimate the hidden block network. Doing so requires a semiparametric extension of the network estimators in this paper; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. The approach is designed to handle an important feature of such processes: specific spreading trajectories do not exist, and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed, and a nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure also applies to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable; for spreading processes that cannot be repeatedly observed, the estimator remains valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for block networks, consistency and asymptotic normality can always be established in the standard way, which suffices for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure, and its practical usefulness is illustrated by a real-data application, in which the spreading process of tweets concerning the “Unite the Right rally” event is studied and a block network is fitted. The fitted network shows that the Twitter users involved in the spreading process can be divided into four communities, corresponding to big name, famous active, famous inactive, and nobody users. Connections among these four communities display a remarkable hierarchical structure, and opinion discrimination exists, as expected, among different communities.

There are some limitations to the current study. First, we only show that the fast algorithm is effective in raising computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. How to balance estimation accuracy against computation cost in practice is tricky and needs further study. Second, high-frequency observation may not always be possible. In the Twitter data analyzed in this paper, the exact posting time is available, which makes it possible to extract arbitrarily high-frequency distribution flows from the given data. In many other applications, however, the distribution flows are stored as a series of snapshots with a fixed observation interval. In that case, the observation frequency is strictly limited by the interval length, and how to develop a reasonable algorithm for such data remains an open question. Third, as mentioned in Section 3.6, complete identifiability is not achievable for the fully nonparametric network, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a constraint that yields identifiability, it may not always be reasonable. A further study of feasible and proper identification conditions would therefore be meaningful in both theoretical and practical respects.

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author at xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors’ Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology was developed by Xiaoqi Zhang and Xiaobing Zhao; software was written by Xiaoqi Zhang; validation was performed by Yanqiao Zheng and Xinyue Ye; formal analysis was carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation was conducted by Yanqiao Zheng; resources were contributed by Xiaobing Zhao and Xinyue Ye; data curation was done by Xinyue Ye; the original draft was prepared by Xiaoqi Zhang and Yanqiao Zheng; review and editing were done by Xinyue Ye and Yanqiao Zheng; visualization was done by Qiwen Dai; supervision was provided by Xiaobing Zhao; project administration was handled by Xiaobing Zhao and Xinyue Ye; funding acquisition was carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). It was presented at the 15th Xiang’Zhang Economic Forum Seminar (Beijing), where the authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.