Complexity

Volume 2017, Article ID 4518429, 13 pages

https://doi.org/10.1155/2017/4518429

## Sparse Causality Network Retrieval from Short Time Series

^{1}Department of Computer Science, UCL, London, UK
^{2}UCL Centre for Blockchain Technologies, UCL, London, UK
^{3}Systemic Risk Centre, London School of Economics and Political Sciences, London, UK
^{4}Department of Mathematics, King’s College London, London, UK

Correspondence should be addressed to Tomaso Aste; t.aste@ucl.ac.uk

Received 25 May 2017; Accepted 6 September 2017; Published 6 November 2017

Academic Editor: Diego Garlaschelli

Copyright © 2017 Tomaso Aste and T. Di Matteo. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We investigate how efficiently a known underlying sparse causality structure of a simulated multivariate linear process can be retrieved from the analysis of time series of short lengths. Causality is quantified from conditional transfer entropy and the network is constructed by retaining only the statistically validated contributions. We compare results from three methodologies: two commonly used regularization methods, Glasso and ridge, and a newly introduced technique, LoGo, based on the combination of information filtering networks and graphical modelling. For these three methodologies we explore the regions of time series lengths and model parameters where a significant fraction of true causality links is retrieved. We conclude that, when time series are short, with lengths smaller than the number of variables, sparse models are better suited to uncover true causality links, with LoGo retrieving the true causality network more accurately than Glasso and ridge.

#### 1. Introduction

Establishing causal relations between variables from observations of their behaviour in time is central to scientific investigation, and it is at the core of data-science, where these causal relations are the basis for the construction of useful models and tools capable of prediction. The capability to predict (future) outcomes from the analytics of (past) input data is crucial in modelling, and it should be the main property to take into consideration in model selection, when the validity and meaningfulness of a model are assessed. From a high-level perspective, we can say that the whole scientific method is constructed around a circular procedure consisting of observation, modelling, prediction, and testing. In such a procedure, the accuracy of prediction is used as a selection tool between models. In addition, the principle of parsimony favours the simplest model when two models have similar predictive power.

The scientific method is the rational process that, for the last 400 years, has mostly contributed to scientific discoveries, technological progress, and the advancement of human knowledge. Machine learning and data-science are nowadays pursuing the ambition to mechanize this discovery process by feeding machines with data and using different methodologies to build systems able to make models and predictions by themselves. However, the automation of this process requires identifying, without the help of human intuition, the relevant variables and the relations between these variables out of a large quantity of data. Predictive models are methodologies, systems, or equations which identify and make use of such relations between sets of variables in a way that the knowledge about one set of variables provides information about the values of the other set of variables. This problem is intrinsically high-dimensional, with many input and output data. Any model that aims to explain the underlying system will involve a number of elements which must be of the order of magnitude of the number of relevant relations between the system’s variables. In complex systems, such as financial markets or the brain, prediction is probabilistic in nature and modelling concerns inferring the probability of the values of a set of variables given the values of another set. This requires the estimation of the joint probability of all variables in the system and, in complex systems, the number of variables with potential macroscopic effects on the whole system is very large. This poses a great challenge for the model construction/selection and its parameter estimation because the number of relations between variables scales with, at least, the square of the number of variables but, for a given fixed observation window, the amount of information gathered from such variables scales, at most, linearly with the number of variables [1, 2].

For instance, a linear model for a system with $p$ variables requires the estimation from observation of $p(p+1)/2$ parameters (the distinct elements of the covariance matrix). In order to estimate $O(p^2)$ parameters one needs a comparable number of observations, requiring time series of length $T \sim p$ or larger to gather a sufficient information content from a number of observations which scales as $p\,T$. However, the number of parameters in the model can be reduced by considering only $O(p)$ out of the $O(p^2)$ relations between the variables, reducing in this way the required time series length to values that do not grow with $p$. Such models with reduced numbers of parameters are referred to in the literature as sparse models. In this paper we consider two instances of linear sparse modelling: Glasso [3], which penalizes nonzero parameters by introducing an $L_1$-norm penalization, and LoGo [4], which reduces the inference network to an $O(p)$ number of links selected by using information filtering networks [5–7]. The results from these two sparse models are compared with the $L_2$-norm penalization (nonsparse) ridge model [8, 9].
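The counting argument above can be illustrated numerically (a minimal sketch assuming Gaussian data; the values of $p$ and $T$ are arbitrary): when $T < p$, the sample covariance matrix is rank-deficient and cannot be inverted directly.

```python
import numpy as np

# With T observations of p variables, the sample covariance has rank at
# most T - 1 (one degree of freedom is lost to mean-centring), so for
# T < p it is singular and a direct inversion is impossible.
rng = np.random.default_rng(0)
p, T = 100, 50                      # more variables than observations
X = rng.standard_normal((T, p))     # T observations of p variables
S = np.cov(X, rowvar=False)         # p x p sample covariance matrix
rank = np.linalg.matrix_rank(S)
print(S.shape, rank)                # rank <= T - 1 < p
```

This is precisely the regime where regularization (ridge, Glasso) or a sparse structural assumption (LoGo) becomes necessary.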

This paper is an exploratory attempt to map the parameter regions of time series length, number of variables, penalization parameters, and kinds of models to define the boundaries where probabilistic models can be reasonably constructed from the analytics of observation data. In particular, we investigate empirically, by means of a linear autoregressive model with sparse inference structure, the true causality link retrieval performances in the region of short time series and large number of variables which is the most critical region—and the most interesting—in many practical cases. Causality is defined in information theoretic sense as a significant reduction on uncertainty over the present values of a given variable provided by the knowledge of the past values of another variable obtained in excess to the knowledge provided by the past of the variable itself and—in the conditional case—the past of all other variables [10]. We measure such information by using transfer entropy and, within the present linear modelling, this coincides with the concept of Granger causality and conditional Granger causality [11]. The use of transfer entropy has the advantage of being a concept directly extensible to nonlinear modelling. However, nonlinearity is not tackled within the present paper. Linear models with multivariate normal distributions have the unique advantage that causality and partial correlations are directly linked, largely simplifying the computation of transfer entropy, and directly mapping the problem into the sparse inverse covariance problem [3, 4].

Results are reported for artificially generated time series from an autoregressive model of $p$ variables and time series lengths between 10 and 20,000 data points. Robustness of the results has been verified over a wider range of $p$ from 20 to 200 variables. Our results demonstrate that sparse models are superior in retrieving the true causality structure for short time series. Interestingly, this is despite considerable inaccuracies in the inference network of these sparse models. We indeed observe that statistical validation of causality is crucial in identifying the true causal links, and this identification is highly enhanced in sparse models.

The paper is structured as follows. In Section 2 we briefly review the basic concepts of mutual information and conditional transfer entropy and their estimation from data that will then be used in the rest of the paper. We also introduce the concepts of sparse inverse covariance, inference networks, and causality networks. Section 3 concerns the retrieval of the causality network from the computation and statistical validation of conditional transfer entropy. Results are reported in Section 4, where the retrieval of the true causality network from the analytics of time series from an autoregressive process of $p$ variables is discussed. Conclusions and perspectives are given in Section 5.

#### 2. Estimation of Conditional Transfer Entropy from Data

In this paper causality is quantified by means of statistically validated transfer entropy. Transfer entropy quantifies the amount of uncertainty on a random variable, $X$, explained by the past of another variable, $Y$, conditioned to the knowledge about the past of $X$ itself. Conditional transfer entropy, $TE_{Y \to X \mid Z}$, includes an extra conditioning also on a set of variables $Z$. These quantities are introduced in detail in Appendix A (see also [11–13]). Let us here just report the main expression for the conditional transfer entropy that we shall use in this paper:

$$TE_{Y \to X \mid Z} = H\left(X_t \mid X_t^{-}, Z\right) - H\left(X_t \mid X_t^{-}, Y_t^{-}, Z\right), \tag{1}$$

where $H(\cdot \mid \cdot)$ is the conditional entropy, $X_t$ is the random variable $X$ at time $t$, whereas $Y_t^{-} = \{Y_{t-1}, \dots, Y_{t-\ell}\}$ is the lagged set of the random variable “$Y$” considering $\ell$ previous times, and $Z$ are all other variables and their lags (see Appendix A, (A.5)).

In this paper we use Shannon entropy and restrict to linear modelling in a multivariate normal setting (see Appendix B). In this context the conditional transfer entropy can be expressed in terms of the determinants of conditional covariances (see (B.5) in Appendix B):

$$TE_{Y \to X \mid Z} = \frac{1}{2} \log \frac{\det \mathrm{Cov}\left(X_t \mid X_t^{-}, Z\right)}{\det \mathrm{Cov}\left(X_t \mid X_t^{-}, Y_t^{-}, Z\right)}. \tag{2}$$

Conditional covariances can be conveniently computed in terms of the inverse covariance of the whole set of variables (see Appendix C). Such inverse covariance matrix, $J$, represents the structure of conditional dependencies among all couples of variables in the system and their lags. Each subpart of $J$ is associated with the conditional covariances of the variables in that part with respect to all others. In terms of $J$, the expression for the conditional transfer entropy becomes

$$TE_{Y \to X \mid Z} = \frac{1}{2} \log \frac{J_{i,i}}{J_{i,i} - J_{i,j}\left(J_{j,j}\right)^{-1} J_{j,i}}, \tag{3}$$

where the indices “$i$” and “$j$” refer to submatrices of $J$, respectively, associated with the variables $X_t$ and the lagged $Y_t^{-}$.
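As an illustration of (3), the following sketch computes a Gaussian conditional transfer entropy directly from a joint precision matrix; the function name, index layout, and test covariance are hypothetical choices for illustration, not the paper's code.

```python
import numpy as np

def conditional_transfer_entropy(J, i, j):
    """Gaussian conditional transfer entropy from the joint precision J.

    i : index in J of the (non-lagged) target variable X_t
    j : indices in J of the lagged driver variable Y
    Implements TE = 1/2 * log( J_ii / (J_ii - J_ij J_jj^{-1} J_ji) ),
    i.e. causality conditioned on all remaining variables in J.
    """
    j = np.atleast_1d(j)
    Jii = J[i, i]
    Jij = J[i, j]
    Jjj = J[np.ix_(j, j)]
    schur = Jii - Jij @ np.linalg.solve(Jjj, Jij)  # marginal precision entry
    return 0.5 * np.log(Jii / schur)

# quick sanity check on a random well-conditioned covariance:
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
Sigma = A @ A.T + 6 * np.eye(6)
J = np.linalg.inv(Sigma)
te = conditional_transfer_entropy(J, 0, [4, 5])   # "driver" at indices 4, 5
```

By the Schur-complement identity, the denominator in (3) is exactly the precision entry of variable $i$ after marginalizing out the driver, so the same value can be obtained by inverting the covariance restricted to the remaining variables.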

##### 2.1. Causality and Inference Networks

The inverse covariance $J$, also known as the precision matrix, represents the structure of conditional dependencies. If we interpret the structure of $J$ as a network, where nodes are the variables and nonzero entries correspond to edges of the network, then any two subsets of nodes that are not directly connected by one or more edges are conditionally independent, where conditioning is with respect to all other variables.
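A minimal numerical illustration of this point (the 3-variable precision matrix below is an arbitrary example): a zero entry in $J$ yields a vanishing partial correlation even when the corresponding marginal correlation is nonzero.

```python
import numpy as np

# Precision matrix with J[0,2] = 0: variables 0 and 2 are conditionally
# independent given variable 1, even though they are marginally correlated.
J = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(J)

# Marginal correlation between 0 and 2 is nonzero ...
corr_02 = Sigma[0, 2] / np.sqrt(Sigma[0, 0] * Sigma[2, 2])

# ... but the partial correlation given variable 1 vanishes:
partial_02 = -J[0, 2] / np.sqrt(J[0, 0] * J[2, 2])
print(corr_02, partial_02)
```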

Links between variables at different lags are associated with causality, with direction going from larger to smaller lags. The network becomes therefore a directed graph. In such a network, entropies can be associated with nodes, conditional mutual information can be associated with edges between variables with the same lag, and conditional transfer entropy can be associated with edges between variables with different lags. A nice property of this mapping of information measures onto directed networks is that there is a simple way to aggregate information which is directly associated with topological properties of the network. Entropy, mutual information, and transfer entropies can be defined for any aggregated subset of nodes, with their values directly associated with the presence, direction, and weight of network edges between these subparts.

Nonzero transfer entropies indicating, for instance, variable $Y$ causing variable $X$ are associated with some nonzero entries in the inverse covariance matrix between the lagged variables (i.e., $Y_{t-\tau}$, with $\tau \geq 1$) and the variable itself (i.e., $X_t$). In linear models these nonzero entries define the estimated *inference network*. However, not all edges in this inference network correspond to transfer entropies that are significantly different from zero. To extract the structure of the *causality network* we shall retain only the edges in the inference network which correspond to statistically validated transfer entropies.

Conditioning eliminates the effect of the other variables, retaining only the exclusive contribution from the two variables in consideration. This should provide estimations of transfer entropy that are less affected by spurious effects from other variables. On the other hand, conditioning in itself can introduce spurious effects; indeed, two independent variables can become dependent due to conditioning [13]. In this paper we explore two extreme conditioning cases: (i) conditioning to all other variables and their lags; (ii) no conditioning.

In principle, one would like to identify the maximal value of the transfer entropy over all lags and all possible conditioning sets $Z$. However, the use of multiple lags and conditionings increases the dimensionality of the problem, making the estimation of transfer entropy very hard, especially when only a limited amount of measurements is available (i.e., short time series). This is because the calculation of the conditional covariance requires the estimation of the inverse covariance of the whole set of variables, and such an estimation is strongly affected by noise and uncertainty. Therefore, a standard approach is to reduce the number of variables and lags to keep dimensionality low and estimate conditional covariances with appropriate penalizers [3, 8, 9, 14]. An alternative approach is to invert the covariance matrix only locally, on low dimensional subsets of variables selected by using information filtering networks [5–7], and then reconstruct the global inversion by means of the LoGo approach [4]. Let us here briefly account for these two approaches.

##### 2.2. Penalized Inversions

The estimate of the inverse covariance is a challenging task to which a large body of literature has been dedicated [14]. From an intuitive perspective, one can say that the problem lies in the fact that uncertainty is associated with the nearly zero eigenvalues of the covariance matrix. Variations in these small eigenvalues have relatively small effects on the entries of the covariance matrix itself but have major effects on the estimation of its inverse. Indeed, small fluctuations of small eigenvalues can yield unbounded contributions to the inverse. A way to cure such near-singular matrices is by adding finite positive terms to the diagonal which move the eigenvalues away from zero: $\hat{\Sigma} \to \hat{\Sigma} + \gamma I$, where $\hat{\Sigma}$ is the covariance matrix of the set of variables estimated from data, $I$ is the identity matrix, and $\gamma > 0$ is the regularizer parameter (see later). This is what is performed in the so-called ridge regression [9], also known as the shrinkage mean-square-error estimator [15] or Tikhonov regularization [8]. The effect of the additional positive diagonal elements is equivalent to computing the inverse covariance $J$ which maximizes a log-likelihood penalized by an $L_2$-norm term that suppresses large off-diagonal coefficients in the inverse covariance [16]. The regularizer parameter $\gamma$ tunes the strength of this penalization. This regularization is very simple and effective. However, with this method insignificant elements in the precision matrix are penalized toward small values but they are never set to zero. By using instead an $L_1$-norm penalization, $\gamma \|J\|_1$, insignificant elements are forced to zero, leading to a sparse inverse covariance. This is the so-called lasso regularization [3, 14, 17]. The advantage of a sparse inverse covariance consists in the provision of a network representing a conditional dependency structure. Indeed, let us recall that in linear models zero entries in the inverse covariance are associated with couples of variables that are conditionally independent.
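A minimal sketch of the ridge-regularized inversion described above (pure NumPy; the function name and the value of the regularizer $\gamma$ are illustrative):

```python
import numpy as np

def ridge_precision(S, gamma):
    """L2-regularized ('ridge') precision estimate: inv(S + gamma * I).

    Shifting every eigenvalue of S by gamma > 0 moves near-zero
    eigenvalues away from zero, so the inverse exists and is stable
    even when S is singular (e.g. T < p). Off-diagonal entries are
    shrunk toward zero but never set exactly to zero: the result is dense.
    """
    return np.linalg.inv(S + gamma * np.eye(S.shape[0]))

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 100))    # T = 50 < p = 100: S is singular
S = np.cov(X, rowvar=False)
J_ridge = ridge_precision(S, gamma=0.1)
```

For the $L_1$ (lasso) alternative, an off-the-shelf implementation such as scikit-learn's `GraphicalLasso` estimator can be used instead; unlike the ridge inverse above, it returns a precision matrix with exact zeros, that is, a sparse conditional dependency network.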

##### 2.3. Information Filtering Network Approach: LoGo

An alternative approach to obtain a sparse inverse covariance is by using information filtering networks generated by keeping the elements that contribute most to the covariance by means of a greedy process. This approach, named LoGo, proceeds by first constructing a chordal information filtering graph such as a Maximum Spanning Tree (MST) [18, 19] or a Triangulated Maximally Filtered Graph (TMFG) [7]. These graphs are built by retaining edges that maximally contribute to a given gain function which, in this case, is the log-likelihood or—more simply—the sum of the squared correlation coefficients [5–7]. Then, this chordal structure is interpreted as the inference structure of the joint probability distribution function, with nonzero conditional dependency only between variables that are directly connected by an edge. On this structure the sparse inverse covariance is computed in such a way as to preserve the values of the correlation coefficients between couples of variables that are directly connected by an information filtering graph edge. The main advantage of this approach is that inversion is performed at a local level, on small subsets of variables, and then the global inverse is reconstructed by joining the local parts through the information filtering network. Because of this Local-Global construction this method is named LoGo. It has been shown that the LoGo method yields statistically significant sparse precision matrices that outperform the ones with the same sparsity computed with the lasso method [4].
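The Local-Global idea can be sketched in a few lines for the simplest chordal structure, a spanning tree, where the cliques are the edges and the separators are the shared nodes (a simplified illustration; the published LoGo uses TMFG 4-cliques with triangular separators):

```python
import numpy as np

def logo_tree_precision(Sigma, edges):
    """Sparse precision assembled from local inversions on a tree.

    For a decomposable (here: tree) inference structure the global
    inverse is the sum of the embedded local inverses of the cliques
    (edges) minus the inverses of the separators (each shared node,
    counted deg - 1 times on the diagonal).
    """
    p = Sigma.shape[0]
    J = np.zeros((p, p))
    deg = np.zeros(p, dtype=int)
    for a, b in edges:
        idx = np.ix_([a, b], [a, b])
        J[idx] += np.linalg.inv(Sigma[idx])   # local 2x2 inversion only
        deg[a] += 1
        deg[b] += 1
    for v in range(p):                        # subtract separator terms
        J[v, v] -= (deg[v] - 1) / Sigma[v, v]
    return J

# AR(1)-style covariance Sigma_ij = rho^|i-j| is exactly Markov on the
# chain 0-1-2-..., so LoGo on the chain recovers the exact inverse.
rho, p = 0.6, 5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
J = logo_tree_precision(Sigma, edges=[(i, i + 1) for i in range(p - 1)])
```

On this chain-Markov covariance the local-global assembly reproduces the exact tridiagonal inverse while only ever inverting 2×2 matrices, which is the computational point of the construction.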

#### 3. Causality Network Retrieval

##### 3.1. Simulated Multivariate Autoregressive Linear Process

In order to be able to test if causality measures can retrieve the true causality network in the underlying process, we generated artificial multivariate normal time series with known sparse causality structure by using the following autoregressive multivariate linear process [20]:

$$X_t = \sum_{\tau=1}^{L} A_\tau X_{t-\tau} + \varepsilon_t, \tag{4}$$

where the $A_\tau$ are $p \times p$ matrices with random entries drawn from a normal distribution. The matrices are made upper triangular (diagonal included) by putting to zero all lower-triangular coefficients and made sparse by keeping only a limited total number of entries different from zero in the upper and diagonal part. The $\varepsilon_t$ are random, normally distributed, uncorrelated variables. This process produces autocorrelated, cross-correlated, and causally dependent time series. We chose it because it is among the simplest processes that can generate this kind of structured dataset. The dependency and causality structure is determined by the nonzero entries of the matrices $A_\tau$. The upper-triangular structure of these matrices simplifies the causality structure, eliminating causality cycles. Their sparsity reduces dependency and causality interactions among variables. The process is made autoregressive and stationary by keeping the eigenvalues of all $A_\tau$ smaller than one in absolute value. For the tests, sparsity is enforced so that the number of nonzero entries (causality links) is comparable with the number of variables $p$. We reconstructed the network from time series of different lengths between 5 and 20,000 points. To test statistical reliability the process was repeated 100 times, each time with a different set of randomly generated matrices $A_\tau$. We verified that the results are robust and consistent by varying the sample size $p$ from 20 to 200 variables, by changing the sparsity (the number of links), and by varying the number of lags $L$ from 1 to 10. We verified that the presence of isolated nodes or highly connected hub nodes does not affect results significantly.
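A sketch of a generator in the spirit of (4) (the parameter values, link-selection rule, and stabilising rescaling below are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def simulate_sparse_var(p=20, n_lags=1, n_links=20, T=1000, seed=0):
    """Simulate a sparse upper-triangular VAR as in Eq. (4) (sketch)."""
    rng = np.random.default_rng(seed)
    A = []
    for _ in range(n_lags):
        M = np.triu(rng.standard_normal((p, p)))     # upper triangular, diag included
        keep = rng.choice(np.flatnonzero(M), size=n_links, replace=False)
        mask = np.zeros(p * p, dtype=bool)
        mask[keep] = True
        M = (M.ravel() * mask).reshape(p, p)         # keep only n_links entries
        # eigenvalues of a triangular matrix sit on its diagonal: rescale
        # so that the lag polynomials have all roots inside the unit circle
        M *= 0.9 / (n_lags * max(1.0, np.abs(np.diag(M)).max()))
        A.append(M)
    X = np.zeros((T + 100, p))                       # 100 burn-in steps
    for t in range(n_lags, T + 100):
        X[t] = sum(A[l] @ X[t - 1 - l] for l in range(n_lags))
        X[t] += rng.standard_normal(p)               # epsilon_t innovations
    return X[100:], A

X, A = simulate_sparse_var()
```

The true causality network against which retrieval is scored is then simply the set of nonzero entries of the matrices `A`.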

##### 3.2. Causality and Inference Network Retrieval

We tested the agreement between the causality structure of the underlying process and the one inferred from the analysis of time series of different lengths $T$, with $T$ between 10 and 20,000, generated by using (4). We have $p$ different variables and $L$ lags. The dimensionality of the problem is therefore $p(L+1)$ variables at all lags including zero.

To estimate the inference and causality networks we started by computing the inverse covariance, $\hat{J}$, for all variables at all lags by using the following three different estimation methods:

(1) $L_1$-norm penalization (Glasso [14]);
(2) $L_2$-norm penalization (ridge [8]);
(3) information filtering network (LoGo [4]).

We retrieved the inference network by looking at all couples of variables, with indices $i$ and $j$, which have nonzero entries in the inverse covariance matrix between the lagged set of variable $j$ and the nonlagged variable $i$. Clearly, for the ridge method the result is a complete graph, but for Glasso and LoGo the results are sparse networks with edges corresponding to nonzero conditional transfer entropies between variables $j$ and $i$. For the LoGo calculation we make use of the regularizer parameter $\gamma$ as a local shrinkage factor to improve the local inversion of the covariance of the 4-cliques and triangular separators (see [4]).

We then estimated the transfer entropy between couples of variables, conditioned to all other variables in the system. This is obtained by estimating the relevant submatrices of the inverse covariance matrix (indicated with a “hat” symbol) by using (C.7) (see Appendix C.2), with the conditioning set

$$Z = \left\{ X_{k,t-\tau} \right\}_{k \neq i,j;\; \tau = 1,\dots,L}, \tag{5}$$

that is, a conditioning to all lagged variables except those of the couple $i$, $j$ under consideration. The result is a matrix of conditional transfer entropies $\widehat{TE}_{j \to i \mid Z}$. Finally, to retrieve the causality network we retained the network of statistically validated conditional transfer entropies only. Statistical validation was performed as follows.

##### 3.3. Statistical Validation of Causality

Statistical validation has been performed by means of a likelihood ratio statistical test. Indeed, entropy and likelihood are intimately related: entropy measures uncertainty and likelihood measures the reduction in uncertainty provided by the model. Specifically, the Shannon entropy associated with a set of random variables, $X$, with probability distribution $f(x)$ is $H(X) = -\mathbb{E}[\log f(X)]$ (see (B.1)), whereas the log-likelihood for a model $\hat{f}$ associated with a set of $q$ independent observations $\{x_1, \dots, x_q\}$ is $\ell = \sum_{t=1}^{q} \log \hat{f}(x_t)$, which can be written as $\ell = -q\,\hat{H}(X)$, with $\hat{H}$ the empirical entropy estimate. Note that $q$ is the total available number of observations which, in practice, is the length of the time series minus the maximum number of lags. It is evident from these expressions that entropy and log-likelihood are strictly related, though this link might be nontrivial. In the case of linear modelling this connection is quite evident because the entropy estimate is $\hat{H} = \frac{1}{2}\log\left((2\pi e)^p \det \hat{\Sigma}\right)$ and, for the three models we study in this paper, the log-likelihood is equal to $q$ times the opposite of this entropy estimate. Transfer entropy and conditional transfer entropy are differences between two entropies: the one of a set of variables conditioned to their own past minus the one conditioned also to the past of another variable. This, in turn, is the difference of the unitary log-likelihoods of two models and therefore it is the logarithm of a likelihood ratio. As Wilks pointed out [21, 22], the null distribution of such a ratio is asymptotically quite universal. Following the likelihood ratio formalism, the test statistic is $\lambda = 2 q \widehat{TE}$, and the probability of observing a transfer entropy larger than the estimated $\widehat{TE}$ under the null hypothesis is given by $1 - F_{\chi^2_d}(2 q \widehat{TE})$, with $F_{\chi^2_d}$ the chi-square cumulative distribution function with $d$ degrees of freedom, where $d$ is the difference between the number of parameters in the two models (in our case, the number of lagged terms of the driving variable).
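The validation step can be sketched as follows (hypothetical function name; `scipy.stats.chi2` provides the chi-square survival function):

```python
import numpy as np
from scipy.stats import chi2

def te_pvalue(te_hat, q, df):
    """p value of an estimated transfer entropy under the null of no
    causality: by Wilks' theorem, 2*q*TE is asymptotically chi-square
    distributed, with df equal to the difference in the number of
    parameters between the two nested models."""
    return chi2.sf(2.0 * q * te_hat, df)

# a larger estimated TE (or a longer series) gives a smaller p value
p1 = te_pvalue(0.01, q=500, df=1)
p2 = te_pvalue(0.02, q=500, df=1)
```

Note how the same estimated transfer entropy can be significant or not depending on $q$, which is why the retrieval frontier moves with the time series length.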

##### 3.4. Statistical Validation of the Network

The procedures described in the previous two subsections produce the inference network and the causality network. Such networks are then compared with the known network of true causalities in the underlying process, which is defined by the nonzero elements in the matrices $A_\tau$ (see (4)). The overlapping between the retrieved links in the inference or causality networks and the ones in the true network underlying the process is an indication of a discovery of a true causality relation. However, some discoveries can be obtained just by chance, or some methodologies might discover more links only because they produce denser networks. We therefore tested the hypothesis that the matching links in the retrieved networks are not obtained just by chance by computing the null-hypothesis probability of obtaining the same or a larger number of matches randomly. Such probability is given by the conjugate cumulative hypergeometric distribution for a number equal to or larger than $k$ of “true positive” matching causality links between an inferred network of $n$ links and a process network of $K$ true causality links, from a population of $M$ possible links:

$$p_H = \sum_{m=k}^{\min(n,K)} \frac{\binom{K}{m}\binom{M-K}{n-m}}{\binom{M}{n}}. \tag{6}$$

Small values of $p_H$ indicate that the $k$ retrieved links out of $n$ are unlikely to be found by randomly picking $n$ edges from the $M$ possibilities. Note that in the confusion matrix notation [23] we have $k = \mathrm{TP}$, $n = \mathrm{TP} + \mathrm{FP}$, and $K = \mathrm{TP} + \mathrm{FN}$, with $\mathrm{TP}$ the number of true positives, $\mathrm{FP}$ the number of false positives, $\mathrm{FN}$ the number of false negatives, and $\mathrm{TN}$ the number of true negatives. The total number of “negatives” (unlinked couples of vertices) in the true model is instead $M - K = \mathrm{FP} + \mathrm{TN}$.
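Equation (6) is the survival function of a hypergeometric distribution, so it can be evaluated directly (hypothetical function name; cross-checked against the explicit sum on small numbers):

```python
from math import comb
from scipy.stats import hypergeom

def network_match_pvalue(k, n, K, M):
    """Probability, as in Eq. (6), of finding k or more true-positive
    links by randomly drawing n links out of M possible ones, of which
    K are the true causality links (hypergeometric survival function)."""
    return hypergeom.sf(k - 1, M, K, n)

# cross-check against the explicit sum in (6) on small numbers
k, n, K, M = 2, 6, 5, 20
direct = sum(comb(K, m) * comb(M - K, n - m)
             for m in range(k, min(n, K) + 1)) / comb(M, n)
```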

#### 4. Results

##### 4.1. Computation and Validation of Conditional Transfer Entropies

By using (4) we generated 100 multivariate autoregressive processes with known causality structures. We here report results for a fixed dimensionality, but analogous outcomes were observed for dimensionalities between $p = 20$ and 200 variables. Conditional transfer entropies between all couples of variables, conditioned to all other variables in the system, were computed by estimating the inverse covariances with the three methodologies, ridge, Glasso, and LoGo, and applying (3). Conditional transfer entropies were statistically validated with respect to the null hypothesis (no causality) at a given $p$ value threshold. Results for the Bonferroni-adjusted $p$ value at 1% are reported in Appendix E. We also tested other values of the threshold up to 0.1, obtaining consistent results. We observe that small thresholds reduce the number of validated causality links but increase the chance that these links match the true network of the process. Conversely, large values of the threshold increase the number of mismatched links but also the number of true link discoveries. Let us note that here we use the $p$ value as a thresholding criterion and we are not claiming any evidence of statistical significance of the causality. We assess the goodness of this choice a posteriori by comparing the resulting causality network with the known causality network of the process.

##### 4.2. Statistical Significance of the Recovered Causality Network

Results for the contour frontiers of significant causality links for the three models are reported in Figure 1 for a range of time series lengths between 10 and 20,000 and a range of values of the regularizer parameter $\gamma$. Statistical significance is computed by using (6), and results are reported for two significance thresholds (continuous and dotted lines, respectively). As one can see, the overall behaviours of the three methodologies are little affected by the choice of threshold. We observe that the LoGo significance region extends well beyond the Glasso and ridge regions.