Complexity

Volume 2018, Article ID 6047846, 16 pages

https://doi.org/10.1155/2018/6047846

## Information Processing Features Can Detect Behavioral Regimes of Dynamical Systems

^{1}Computational Science Lab, University of Amsterdam, Amsterdam, Netherlands; ^{2}Department of Computer Science, University of Geneva, Geneva, Switzerland; ^{3}ITMO University, Saint Petersburg, Russia; ^{4}Complexity Institute, Nanyang Technological University, Singapore

Correspondence should be addressed to Rick Quax; r.quax@uva.nl

Received 11 September 2017; Revised 24 December 2017; Accepted 7 February 2018; Published 16 April 2018

Academic Editor: Dimitri Volchenkov

Copyright © 2018 Rick Quax et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In dynamical systems, local interactions between dynamical units generate correlations which are stored and transmitted throughout the system, generating the macroscopic behavior. However, a framework to quantify exactly how these correlations are stored, transmitted, and combined at the microscopic scale is missing. Here we propose to characterize the notion of “information processing” based on all possible Shannon mutual information quantities between a future state and all possible sets of initial states. We apply it to the 256 elementary cellular automata (ECA), which are the simplest possible dynamical systems and exhibit behaviors ranging from simple to complex. Our main finding is that only a few information features are needed for full predictability of the systemic behavior and that the “information synergy” feature is always the most predictive. Finally, we apply the idea to foreign exchange (FX) and interest-rate swap (IRS) time-series data. When applied to the information features, as opposed to the data itself, we find an effective “slowing down” leading indicator of the 2008 financial crisis in all three markets. Our work suggests that the proposed characterization of the local information processing of units may be a promising direction for predicting emergent systemic behaviors.

#### 1. Introduction

Emergent, complex behavior can arise from the interactions among (simple) dynamical units. An example is the brain, whose complex behavior as a whole cannot be explained by the dynamics of a single neuron. In such a system, each dynamical unit receives input from other (upstream) units and then decides its next state, reflecting these correlated interactions. This new state is then used by (downstream) neighboring units to decide their new states, and so on, eventually generating a macroscopic behavior with systemic correlations. A quantitative framework is missing for fully tracing how correlations are stored, transmitted, and integrated, let alone for predicting whether a given system of local interactions will eventually generate complex systemic behavior.

Our hypothesis is that Shannon’s information theory [1] can eventually be used to construct such a framework. In this viewpoint, a unit’s new state reflects its past interactions in the sense that it stores mutual information about the past states of upstream neighboring units. In the next time instant, a downstream neighboring unit interacts with this state, implicitly transferring this information and integrating it together with other information into its new state, and so on. In effect, each interaction among dynamical units is interpreted as a Shannon communication channel, and we aim to trace the onward transmission and integration of information (synergy) through this network of “communication channels.”

In this paper we characterize the information in a single unit’s state at time $t$ by enumerating its mutual information quantities with all possible sets of initial unit states (at $t = 0$). We generate the initial unit states independently for the elementary cellular automata (ECA) application. Then we characterize “information processing” as the progression of a unit’s vector of information quantities over time (see Methods). The rationale behind this is as follows. The information in each initial unit state will be unique by construction, that is, have zero redundancy with all other initial unit states. Future unit states depend only on previous unit states and ultimately on the initial unit states (there are no outside forces). “Processing” refers, by our definition, to the fact that the initial (unique) pieces of information can be considered to disperse through the system in different directions and at different levels (synergy), while some of it dissipates and is lost. Because the initial information is unique by construction, we can exactly trace all these directions and levels, that is, every bit of information in the ECA. Therefore we would argue that we can then fully quantify the “information processing” of a system, implicitly, without knowing exactly which (physical) mechanism is actually responsible for it. We anticipate that this is a useful abstraction which will aid in distinguishing different emergent behaviors without being distracted by physical or mechanistic details. We first test whether this notion of information processing can be used to predict complex emergent behavior in the theoretical framework of ECA, under ideal conditions by construction. Next we also test whether information processing can be used to detect a difference of systemic behavior in real financial time-series data, namely, the regimes before and after the 2008 crisis, even though this data obviously does not obey the strict ideal conditions.

The study of “information processing” in complex dynamical systems is a recently growing research topic. Although information theory has already been applied to dynamical systems such as elementary cellular automata, including, for instance, important work by Langton and Grassberger [2, 3], here we mean by “information processing” a more holistic perspective of capturing *all* forms of information simultaneously present in a system. As illustrative examples, Lizier et al. propose a framework to formulate dynamical systems in terms of distributed “local” computation: information storage, transfer, and modification [4] defined by individual terms of the Shannon mutual information sum (see (3)). For cellular automata they provide evidence for the long-held conjecture that so-called particle collisions are the primary mechanism for locally modifying information, and for a networked variant they show that a phase transition is characterized by the shifting balance of local information storage over transfer [5]. A crucial difference with our work is that we operate in the ensemble setting, as is usual for Shannon information theory, whereas Lizier et al. study a single realization of a dynamical system, for a particular initial state. (Although time-series data is strictly speaking a single realization, ensemble estimates are routinely made from such data by using sliding windows; see Methods.) Beer and Williams trace how task-relevant information flows through a minimally cognitive agent’s neurons and environment to ultimately be combined into a categorization decision [6] or sensorimotor behavior [7], using ensemble methods. Studying how local interactions lead to multiscale systemic behavior is also a domain which benefits from information-theoretic approaches, such as those by Bar-Yam et al. [8, 9], Quax et al. [10, 11], and Lindgren [12].
Finally, extending information theory itself to deal with complexity, multiple authors are concerned with decomposing a single information quantity into multiple constituents, such as synergistic information, including James et al. [13], Williams and Beer [14], Olbrich et al. [15], Quax et al. [16], Chliamovitch et al. [17], and Griffith et al. [18, 19]. Although a general consensus on the definition of “information synergy” is thus still elusive, in this paper we circumvent this problem by focusing on the special case of independent input variables, in which case a closed-form formula (“whole-minus-sum”) is well-known and used.

#### 2. Methods

##### 2.1. Notational Conventions

Constants and functions are denoted by lower-case Roman letters. Stochastic variables are denoted by capital Roman letters. Feature vectors are denoted by Greek letters.

##### 2.2. Model of Dynamical Systems

In general we consider discrete-time, discrete-state Markov dynamics. Let $X^t = (X_1^t, \ldots, X_n^t)$ denote the stochastic variable of the system state, defined as the sequence of unit states at time $t$. Each unit chooses its new state locally according to the conditional probability distribution $\Pr(X_i^{t+1} \mid X^t)$, encoding the microscopic system mechanics, where $i$ identifies the unit. The state space of each unit is equal and denoted by the set $\Sigma$. We assume that the number of units, the system mechanics, and the state space remain unchanged over time. Finally we assume that all unit states are initialized independently and identically (i.i.d.); that is, $\Pr(X^0) = \prod_i \Pr(X_i^0)$. The latter ensures that all correlations in future system states are generated by the interacting units and are not an artifact of the initial conditions.

###### 2.2.1. Elementary Cellular Automata

Specifically we focus on the set of 256 elementary cellular automata (ECA), which are the simplest discrete spatiotemporal dynamical systems possible [20]. Each unit (cell) has two possible states and chooses its next state deterministically using the same transition rule as all other cells. The next state of a cell deterministically depends only on its own previous state and that of its two nearest neighbors, forming a line network of interactions. That is,
$$\Pr\left(X_i^{t+1} \mid X^t\right) = \Pr\left(X_i^{t+1} \mid X_{i-1}^t, X_i^t, X_{i+1}^t\right). \tag{1}$$
There are 256 possible transition rules, numbered 0 through 255 and denoted by the rule number $r$. As initial state we take the fully random state so that no correlations exist already at $t = 0$; that is, $\Pr(X_i^0 = x) = 1/2$ for all cells $i$ and all states $x \in \{0, 1\}$. The evolution of each cellular automaton is fully deterministic for a given rule, implying that the conditional probabilities in (1) can only be either $0$ or $1$. (This is nevertheless not a necessary condition in general.)
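To make the local update rule concrete, the following minimal sketch (Python; the function names are ours and not part of the paper) performs one synchronous update of an ECA with periodic boundaries, using the standard Wolfram rule-number encoding:

```python
import numpy as np

def eca_step(state, rule):
    """One synchronous update of an elementary cellular automaton.

    `state` is a 1-D array of 0/1 cell states with periodic boundaries;
    `rule` is the Wolfram rule number (0-255).
    """
    # Lookup table: bit k of the rule number gives the next state for
    # the neighborhood whose three bits (left, center, right) encode k.
    table = [(rule >> k) & 1 for k in range(8)]
    left = np.roll(state, 1)
    right = np.roll(state, -1)
    idx = 4 * left + 2 * state + right
    return np.array([table[k] for k in idx], dtype=int)

rng = np.random.default_rng(0)
state = rng.integers(0, 2, size=16)  # fully random initial state
state = eca_step(state, 110)         # one step of rule 110
```

Iterating `eca_step` from the random initial state produces the space-time diagrams whose long-term behavior is classified in Section 2.4.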

##### 2.3. Quantifying the Information Processing in a Dynamical Model

###### 2.3.1. Basics of Information Theory

We characterize each new unit state, determined probabilistically by $\Pr(X_i^{t+1} \mid X^t)$, by a sequence of Shannon communication channels, where each channel communicates information from a subset of the initial states $X^0$ to $X_i^{t+1}$. In general, a communication channel between two stochastic variables is defined by the one-way interaction $A \to B$ and is characterized by the amount of information about the state $A$ which transfers to the state $B$ due to this interaction. The average amount of information stored in the sender’s state $A$ is determined by its marginal probability distribution $\Pr(A)$, which is known as its Shannon entropy:
$$H(A) = -\sum_a \Pr(A = a) \log_2 \Pr(A = a). \tag{2}$$
After a perfect, noiseless transmission, the information at the receiver $B$ would share exactly $H(A)$ bits with the information stored at the sender $A$. After a failed transmission the receiver would share zero information with the sender, and for a noisy transmission their mutual information is somewhere in between. This is quantified by the so-called mutual information:
$$I(A : B) = H(A) - H(A \mid B). \tag{3}$$
The conditional variant $H(A \mid B)$ obeys the chain rule and is written explicitly as
$$H(A \mid B) = -\sum_{a, b} \Pr(a, b) \log_2 \Pr(a \mid b). \tag{4}$$

This denotes the remaining entropy (uncertainty) of $A$ given that the value for $B$ is observed. For intuition, it is easily verified that the case of statistical independence, that is, $\Pr(a \mid b) = \Pr(a)$, leads to $H(A \mid B) = H(A)$, which makes $I(A : B) = 0$, meaning that $B$ contains zero information about $A$. At the other extreme, a deterministic dependence of $A$ on $B$ would make $H(A \mid B) = 0$ so that $I(A : B) = H(A)$, meaning that $B$ contains the maximal amount of information needed to determine a unique value of $A$.
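These quantities are straightforward to compute for discrete variables. A minimal sketch (Python; the function names are ours), covering both extremes described above:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector or matrix."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

def mutual_information(pxy):
    """I(A:B) = H(A) + H(B) - H(A,B) from a joint probability matrix."""
    pxy = np.asarray(pxy, dtype=float)
    return entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0)) - entropy(pxy)

# Perfect, noiseless channel: the receiver copies the sender, so the
# mutual information equals the sender's entropy H(A) = 1 bit.
perfect = np.array([[0.5, 0.0], [0.0, 0.5]])

# Statistically independent sender and receiver: 0 bits.
independent = np.array([[0.25, 0.25], [0.25, 0.25]])
```

Here `mutual_information(perfect)` evaluates to 1 bit and `mutual_information(independent)` to 0 bits, matching the two extremes discussed above.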

###### 2.3.2. Characterizing the Information Stored in a Unit’s State

First we characterize the information stored in a unit’s state at time step $t$, denoted by $X_i^t$, as the ordered sequence of mutual information quantities with all possible sets of unit states at time $t = 0$; that is,
$$\nu_{r,i}^t \equiv \left( I\left(X_i^t : S\right) \right)_{S \in \mathcal{P}(X^0)}. \tag{5}$$

Here $\mathcal{P}(X^0)$ denotes the (ordered) power set notation for all subsets of stochastic variables of initial cell states. (Note though that in practice not infinitely many initial cell states are needed; for instance, for an ECA at time $t$ only the $2t + 1$ nearest initial cell states are relevant.) We will refer to $\nu_{r,i}^t$ as the sequence of information features of unit $i$ at time $t$. The subscript $r$ implies that the rule-specific (conditional) probabilities are used to compute the mutual information. We use the subscript $i$ for generality to emphasize that this feature vector pertains to each single unit (cell) in the system, even though in the specific case of ECA this subscript could be dropped as all cells are indistinguishable.

In particular we highlight the following three types of information features. The “memory” of unit $i$ at time $t$ is defined as the feature $I(X_i^t : X_i^0)$, that is, the amount of information that the unit retains about its own initial state. The “transfer” of information is defined as nonlocal mutual information such as $I(X_i^t : X_j^0)$ ($j \neq i$). Nonlocal mutual information must be due to interactions because the initial states are independent (all pairs of units have zero mutual information). Finally we define the integration of information as “information synergy,” an active research topic in information theory [4, 14, 16, 19, 21–23]. The information synergy in $X_i^t$ about a set of initial states $S \subseteq X^0$ is calculated here by the well-known whole-minus-sum (WMS) formula
$$\mathrm{WMS}\left(X_i^t : S\right) = I\left(X_i^t : S\right) - \sum_{X_j^0 \in S} I\left(X_i^t : X_j^0\right). \tag{6}$$
The WMS measure directly implements the intuition of subtracting the information carried by individual variables from the total information. However, the presence of correlations among the $X_j^0$ would be problematic for this measure, in which case it can become negative. In this paper we prevent this by ensuring that the $X_j^0$ are uncorrelated. In this case it fulfills the various proposed axiomatizations for synergistic information known thus far, particularly PID [14, 15] and SRV [16].

Information synergy (or “synergy” for short) is not itself a member of $\nu_{r,i}^t$, but it is fully redundant given $\nu_{r,i}^t$ since each of its terms is in $\nu_{r,i}^t$. Therefore we will treat synergy features as separate single features in our results analysis while we do not add them to $\nu_{r,i}^t$.
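The canonical fully synergistic example for the whole-minus-sum measure is the XOR of two independent fair coins: each input alone carries zero information about the output, yet both together determine it completely. A minimal sketch (Python; function names are ours) that computes the WMS synergy from an enumerated joint distribution:

```python
import numpy as np

def entropy(dist):
    """Shannon entropy in bits of an {outcome: probability} dict."""
    p = np.array(list(dist.values()), dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def marginal(joint, keep):
    """Marginalize a joint {tuple: prob} dict onto the indices in `keep`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[k] for k in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def mutual_info(joint, a, b):
    """I(A:B) = H(A) + H(B) - H(A,B) over index lists a and b."""
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))

def wms_synergy(joint, inputs, output):
    """Whole-minus-sum synergy of `output` about the set of `inputs`."""
    whole = mutual_info(joint, inputs, output)
    parts = sum(mutual_info(joint, [i], output) for i in inputs)
    return whole - parts

# XOR of two independent fair coins: outcomes are (x1, x2, x1 XOR x2).
xor = {(x1, x2, x1 ^ x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
```

For this distribution `wms_synergy(xor, [0, 1], [2])` returns 1 bit: the whole carries 1 bit while each individual input carries 0 bits.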

##### 2.4. Predicting the Class of Dynamical Behavior Using Information Processing Features

We require a classification of (long-term) dynamical behavior of the systems under scrutiny which is to be predicted by the information features. In this paper we choose the popular Wolfram classification for the special case of ECA.

###### 2.4.1. Behavioral Class of a Rule

Wolfram observed empirically that each rule tends to evolve from a random initial state to one of only four different classes of dynamical behavior [20]. These de facto established behavioral classes are

1. homogeneous (all cells end up in the same state);
2. periodic (a small cycle of repeating patterns);
3. chaotic (pseudorandom patterns);
4. complex (locally stable behavior and long-range interactions among patterns).

These classes are conventionally numbered 1 through 4, respectively. We obtained the class number for all 256 rules from Wolfram Alpha [24] and denote it by $c_r$. When the rule number is treated as a stochastic variable it will be denoted $R$; similarly, if the class number is treated as a stochastic variable it will be denoted $C$.

###### 2.4.2. Predictive Power of the Information Processing Features

We are interested in the inference problem of predicting the class number $c_R$ based on the observed information features $\nu_R^t$. Here the rule number $R$ is considered a uniformly random stochastic variable, making in turn functions of it such as $\nu_R^t$ also stochastic variables. We formalize the prediction problem by the conditional probabilities $\Pr(C = c \mid \nu_R^t)$. That is, given only the sequence of information features of a specific (but unknown) rule at time $t$, what is the probability that the ECA will eventually exhibit behavior of class $c$? We can interpret this problem as a communication channel and quantify the *predictive power* of $\nu_R^t$ using the mutual information $I(\nu_R^t : C)$. The predictive power is thus zero in case the information features do not reduce the uncertainty about $C$, whereas it achieves its maximum value $H(C)$ in case a sequence of information features always uniquely identifies the behavioral class. We will normalize the predictive power as $I(\nu_R^t : C) / H(C)$. For the Wolfram classification, $H(C)$ is lower than the maximum possible $\log_2 4 = 2$ bits since there are relatively many rules with class 2 behavior and not many complex rules.

Note that a normalized predictive power of, say, $0.8$ does not necessarily mean that $80\%$ of the rules can be correctly classified. Our definition yields merely a relative measure where $0$ means zero predictive power, $1$ means perfect prediction, and intermediate values are ordered such that a higher value implies that a more accurate classification algorithm could in principle be constructed. The benefit of our definition based on mutual information is that it does not depend on a specific classifier algorithm; that is, it is model-free. Indeed, the use of mutual information as a predictor of classification accuracy has become the de facto standard in machine learning applications [25, 26].
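The normalized predictive power can be sketched for paired discrete observations as follows (Python; illustrative names, plug-in estimation rather than the exact enumeration used for the ECA):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Plug-in Shannon entropy in bits of a list of discrete labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def predictive_power(feature, classes):
    """Normalized predictive power I(feature : class) / H(class)."""
    mi = (entropy(feature) + entropy(classes)
          - entropy(list(zip(feature, classes))))
    return mi / entropy(classes)

classes = [1, 2, 3, 4]
identifying = ["a", "b", "c", "d"]  # uniquely identifies each class
constant = ["a", "a", "a", "a"]     # carries no information
```

On this toy data `predictive_power(identifying, classes)` is 1.0 (perfect prediction) and `predictive_power(constant, classes)` is 0.0, illustrating the two ends of the relative scale.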

###### 2.4.3. Selecting the Principal Features

Some information features are more predictive than others for determining the behavioral class of a rule. Therefore we perform a feature selection process at each time $t$ to find these “principal features” as follows. First we extend the set of information features by the following set of synergy features:
$$\sigma_{r,i}^t \equiv \left( \mathrm{WMS}\left(X_i^t : S\right) \right)_{S \in \mathcal{P}(X^0),\, |S| \geq 2}. \tag{7}$$

Their concatenation makes the extended ordered feature set, now written in the form of stochastic variables:
$$\mu_{R,i}^t \equiv \left( \nu_{R,i}^t, \sigma_{R,i}^t \right). \tag{8}$$

The extended feature set $\mu_{R,i}^t$ has no additional predictive power compared to $\nu_{R,i}^t$, so for any inference task $\mu_{R,i}^t$ and $\nu_{R,i}^t$ are equivalent. That is, the synergy features are completely redundant given $\nu_{R,i}^t$ since each of their terms is a member of $\nu_{R,i}^t$. The reason for adding them separately to form $\mu_{R,i}^t$ is that they have a clear meaning as information which is stored in a collection of variables while not being stored in any individual variable. We are interested to see whether this phenomenon plays a significant role in generating dynamical behaviors.

We define the first principal feature $\pi_1^t$ at time $t$ as the feature maximizing its individual predictive power, quantified by a mutual information term as explained before:
$$\pi_1^t \equiv \arg\max_{F \in \mu_R^t} I(F : C). \tag{9}$$

Here, again, the rule number is treated as a uniformly random stochastic variable $R$ with $\Pr(R = r) = 1/256$, which in turn makes $\mu_R^t$ and $C$ stochastic variables. In words, $\pi_1^t$ is the single most predictive information feature about the behavioral class that will eventually be generated. More generally, the principal set of $k$ features is identified in similar spirit; namely,
$$\pi_k^t \equiv \arg\max_{\{F_1, \ldots, F_k\} \subseteq \mu_R^t} I\left(F_1, \ldots, F_k : C\right). \tag{10}$$
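For small feature sets, this principal-feature selection can be sketched as an exhaustive search over subsets of feature columns (Python; plug-in estimates over observed rule/class pairs, illustrative names):

```python
import numpy as np
from collections import Counter
from itertools import combinations

def entropy(labels):
    """Plug-in Shannon entropy in bits of a list of discrete labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def principal_features(rows, classes, k):
    """Return the k-subset of feature columns maximizing I(features : class).

    `rows` is a list of per-rule feature tuples and `classes` the
    corresponding behavioral class labels.
    """
    def mi(cols):
        f = [tuple(row[c] for c in cols) for row in rows]
        return entropy(f) + entropy(classes) - entropy(list(zip(f, classes)))
    n = len(rows[0])
    return max(combinations(range(n), k), key=mi)

# Toy data: column 1 determines the class, column 0 is constant.
rows = [(0, 1), (0, 2), (0, 1), (0, 2)]
classes = ["a", "b", "a", "b"]
```

On this toy data `principal_features(rows, classes, 1)` selects column 1, the feature that alone determines the class.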

###### 2.4.4. Information-Based Classification of Rules

The fact that Wolfram’s classification relies on the behavior exhibited by a particular initial configuration makes the complexity class of an automaton dependent on the initial condition. Moreover, there is no universal agreement regarding how “complexity” should be defined and various alternatives to Wolfram’s classification have been proposed, although Wolfram’s remains by far the most popular. Our hypothesis is that the complexity of a system has very much to do with the way it processes information. Therefore we also attempt to classify ECA rules using only their informational features.

We use a classification algorithm which takes as input the 256 vectors of information features and computes the Euclidean distance between these vectors. The two vectors nearest to each other are clustered together; then, iteratively, the nearest remaining elements or clusters are merged. The distance between two clusters is defined as the distance between the two most distant elements, one from each cluster (complete linkage). The result is a hierarchy of clusters at different distances, which we visualize as a dendrogram.
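This agglomerative procedure with complete linkage can be sketched as follows (Python; a naive cubic-time implementation with illustrative names, which is sufficient for 256 vectors):

```python
import numpy as np

def complete_linkage(vectors):
    """Naive agglomerative clustering with complete linkage.

    Returns the sequence of merges as (cluster_a, cluster_b, distance),
    where clusters are frozensets of original point indices and the
    inter-cluster distance is that of the two most distant members.
    """
    points = [np.asarray(v, dtype=float) for v in vectors]
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: most distant pair across clusters.
                d = max(np.linalg.norm(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merges[-1][0] | merges[-1][1])
    return merges

# Two tight pairs far apart: each pair merges first, then the two pairs
# merge at the distance between their most distant members.
merges = complete_linkage([(0.0,), (0.1,), (10.0,), (10.1,)])
```

The returned merge sequence is exactly what a dendrogram visualizes: which clusters join, and at what distance.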

##### 2.5. Computing Information Processing Features in Foreign Exchange Time-Series

In the previous section we defined information processing features for the simplest (one-dimensional) model of discrete dynamical systems. In the second part of this paper we aim to investigate whether information features can distinguish “critical” regimes in a real complex dynamical system: the foreign exchange market. Most importantly, we are interested in the behavior of the information features before, at, and after the start of the 2008 financial crisis, which is commonly taken to coincide with the bankruptcy of Lehman Brothers on September 15, 2008. We consider two types of time-series datasets in which the dynamical variables can be interpreted to form a one-dimensional system, in order to stay as close as possible to the ECA modeling approach.

The information features can then be computed as discussed above, except that each mutual information term is now estimated directly from the data. This estimation is performed within a sliding window of length $w$ up to time point $t$, which enables us to see how the information measures evolve over time $t$. For instance, the memory of variable $i$ at time $t$ will be measured as $I(X_i^{t-1} : X_i^t)$, where the joint probability distribution is estimated using only the data points inside the window ending at $t$. Details regarding the estimation procedure are given in the following subsection. The $i$th time-series in a dataset will be denoted by the subscript $i$, as in $X_i$.

###### 2.5.1. Estimating Information Processing Features from the Data

The mutual information between two financial variables (time-series) at time $t$ is estimated using the $k$-nearest-neighbor algorithm with its typical parameter setting [27]. This estimation is calculated using a sliding window of size $w$ leading up to and including time point $t$, after first detrending each time-series using log-returns. For all results we will use 200 uniformly spaced values for $t$ over the dataset, starting at datapoint $w$ and ending at the length of the dataset. Thus windows partially overlap. The parameter $w$ is evaluated for robustness in the Supplementary Materials (available here).

We calculate the “memory” (M) of a time-series as $I(X_i^{t-1} : X_i^t)$ and the average “transfer” (T) as the mean of $I(X_{i-1}^{t-1} : X_i^t)$ and $I(X_{i+1}^{t-1} : X_i^t)$. That is, whereas in the ECA model we calculated the mutual information quantities with respect to the initial state of the model, here we use consecutive time points, effectively treating $t - 1$ as the initial state and characterizing only the single time step from $t - 1$ to $t$.
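The sliding-window procedure for the memory feature can be sketched as follows (Python; illustrative names and default parameters, and a simple binned plug-in estimator as a crude stand-in for the k-nearest-neighbor estimator used in the paper):

```python
import numpy as np

def binned_mi(x, y, bins=8):
    """Plug-in mutual information (bits) from a 2-D histogram.

    A crude stand-in for the k-nearest-neighbor estimator, sufficient
    to illustrate the sliding-window procedure.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px * py)[nz])))

def sliding_memory(prices, window=252, step=50):
    """Memory feature I(X(t-1) : X(t)) of one series over sliding windows."""
    r = np.diff(np.log(prices))  # detrend with log-returns
    out = []
    for t in range(window, len(r) + 1, step):
        w = r[t - window:t]
        out.append(binned_mi(w[:-1], w[1:]))  # consecutive-day pairs
    return np.array(out)
```

As a sanity check, a series with strongly autocorrelated returns yields systematically higher memory values than one with independent returns.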

For calculating the synergy measure (S) we apply a correction which makes this measure distributed around zero. The reason is that the WMS measure of (6) assumes independence among the stochastic (initial) state variables, which for the real data are taken to be the previous day’s time-series values. When this assumption is violated, the measure can become strongly negative and, more importantly, cointegrated with the memory and transfer features, whose sum will then dominate the synergy feature. We remedy this by rescaling the sum of the memory and transfer features which are subtracted in (6) so that it equals, on average, the total information (the positive term in (6)). In formula, a constant $\alpha$ is inserted into the WMS formula, leading to
$$I\left(X_i^t : S\right) - \alpha \sum_{X_j^{t-1} \in S} I\left(X_i^t : X_j^{t-1}\right)$$
for a given set of “initial” states $S$. $\alpha$ is fitted such that this WMS measure is on average 0 over all sliding windows of the dataset. This rejects the cointegration null-hypothesis between the total information and the subtracted term at the 0.05 significance level in this dataset. This results in the synergy feature being distributed around zero and being independent of the sum of the other two features, so that it may functionally be used as part of the feature space for feature selection; however, the value itself should not be trusted as quantifying precisely the notion of synergy.
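The fitting of the correction constant can be sketched as follows (Python; illustrative names, operating on precomputed per-window information terms):

```python
import numpy as np

def corrected_wms(total_info, summed_parts):
    """Corrected whole-minus-sum synergy per sliding window.

    `total_info[t]` is the total-information term and `summed_parts[t]`
    the summed individual mutual-information terms of window t. The
    constant alpha rescales the subtracted sum so that the corrected
    measure is zero on average over all windows.
    """
    total = np.asarray(total_info, dtype=float)
    parts = np.asarray(summed_parts, dtype=float)
    alpha = total.mean() / parts.mean()  # mean(total - alpha * parts) = 0
    return total - alpha * parts, alpha

synergy, alpha = corrected_wms([1.0, 1.2, 0.8], [2.0, 2.2, 1.8])
```

In this toy example `alpha` is 0.5 and the corrected synergy fluctuates around zero, as intended; per the caveat above, the corrected values are useful for feature selection but are not a calibrated synergy quantity.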

###### 2.5.2. Description of the Foreign Exchange Data

The first data we consider are time-series of five foreign exchange (FX) daily closing rates (EUR/USD, USD/JPY, JPY/GBP, GBP/CHF, and CHF/EUR) for the period from January 1, 1999, to April 21, 2017 [28]. Each currency pair has a causal dependence on its direct neighbors in the order listed because they share a common currency. For instance, if the EUR/USD rate changes then USD/JPY will quickly adjust accordingly (among others) because the rate imbalance can be structurally exploited for profit. The rate EUR/USD will in turn be adjusted due to profit-making imbalances, among others through the rate JPY/EUR (not observed in this dataset), eventually returning all neighboring rates to a balanced situation.

###### 2.5.3. Description of the Interest-Rate Swap Data

The second data are interest-rate swap (IRS) daily rates for the EUR and USD market [11]. The data spans over twelve years: the EUR data from January 12, 1998, to August 12, 2011, and the USD data from April 29, 1999, to June 6, 2011. The datasets consist of 14 and 15 times to maturity (durations), respectively, ranging from 1 year to 30 years. Rates for nearby maturities have a dependency because the higher maturity can be constructed by the lower maturity plus a delayed (“forward”) short-term swap. This basic mechanism between maturities leads to generally monotonically upward “swap curves.”

#### 3. Results

##### 3.1. Predicting the Wolfram Class of ECA Rules Using Information Processing Features

###### 3.1.1. Information Processing in the First Time Step

The information processing occurring in the first time step of each ECA rule is characterized by the corresponding feature set $\mu_r^1$, consisting of 7 time-delayed mutual information quantities ($\nu_r^1$, (5)) and 4 synergy quantities ($\sigma_r^1$, (6)). We show three selected features (memory, total transfer, and total synergy) for all 256 rules as points in a vector-space in Figure 1, along with each rule’s Wolfram class as a color code. It is apparent that the three features already partially separate the behavioral classes. Namely, it turns out that chaotic and complex rules tend to have high synergy, low information memory, and low information transfer. Figure 1 also relates intuitively to the classic categorization problem in machine learning; namely, perfect prediction would be equivalent to the existence of hyperplanes that perfectly separate all four behavior classes. In the case of ECA the information features are deterministic calculations for each rule number. Thus $\mu_R^1$ forms a discrete distribution of 256 points such that separability implies that no two rule numbers fall on exactly the same point in this information space.