Complexity

Volume 2017 (2017), Article ID 7046359, 14 pages

https://doi.org/10.1155/2017/7046359

## Information Integration from Distributed Threshold-Based Interactions

Programa de Engenharia de Sistemas e Computação, COPPE, Universidade Federal do Rio de Janeiro, Caixa Postal 68511, 21941-972 Rio de Janeiro, RJ, Brazil

Correspondence should be addressed to Valmir C. Barbosa

Received 7 July 2016; Accepted 28 September 2016; Published 11 January 2017

Academic Editor: Dimitri Volchenkov

Copyright © 2017 Valmir C. Barbosa. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We consider distributed units that interact by message-passing. Each message carries a tag and causes the receiving unit to send out messages as a function of the tags it has received and a threshold. This simple model abstracts some of the essential characteristics of several artificial intelligence systems and of biological systems epitomized by the brain. We study the integration of information inside a temporal window as the dynamics unfolds. We quantify information integration by the total correlation, relative to the window’s duration, of a set of random variables valued as a function of message arrival. Total correlation refers to the rise of information gain above that which the units achieve individually, being therefore related to some models of consciousness. We report on extensive computational experiments exploring the interrelations of the model’s parameters (two probabilities and the threshold). We find that total correlation can occur at significant fractions of the maximum possible value and reinterpret the model’s parameters in terms of the current best estimates of some quantities pertaining to cortical structure and dynamics. We find the resulting possibilities to be well aligned with the time frames within which percepts are thought to be processed and eventually rendered conscious.

#### 1. Introduction

A threshold-based system is a collection of loosely coupled units, each characterized by a state function that depends on how the various inputs to the unit relate to a threshold parameter. The coupling in question refers to how the units interrelate, which is by each unit communicating its state to some of the other units whenever that state changes. Given a set of timing assumptions and how they relate to the exchange of states among units as well as to state updates, each individual unit processes inputs (states communicated to it by other units) and produces a threshold-dependent output (its own new state, which gets communicated to other units).

The quintessential threshold-based system is undoubtedly the brain, where each neuron’s state function determines whether an action potential is to be fired down the neuron’s axon. This depends on how the combined action potentials the neuron perceives through the synapses connecting other neurons’ axons to its dendrites (its synaptic potentials) relate to its threshold potential [1]. The greatly simplified model of the natural neuron known as the McCulloch-Pitts neuron [2], introduced over seventy years ago, retained this fundamental property of being threshold-based and so did generalizations thereof such as generalized Petri nets [3] and threshold automata [4]. In fact, this holds for much of the descent from the McCulloch-Pitts neuron, which has extended through the present day in a succession of ever more influential dynamical systems.

Such descent includes the essentially deterministic Hopfield networks of the 1980s [5, 6] and moves on through generalizations of those networks’ Ising-model type of energy function and the associated need for stochastic sampling. The resulting networks include the so-called Boltzmann machines [7] and Bayesian networks [8, 9], as well as the more general Markov (or Gibbs) Random Fields [10–12] and several of the probabilistic graphical models based on them [13]. A measure of the eventual success of such networks can be gained by considering, for example, the restricted form of Boltzmann machines [14, 15] used in the construction of deep belief networks [16], as well as some of the other deep networks that have led to landmark successes in the field of artificial intelligence recently [17–19].

Our interest in this paper is the study of how information gets integrated as the dynamics of a threshold-based system is played out. The meaning we attach to the term information integration is similar to the one we used previously in other contexts [20, 21]. Given a certain amount of time and a set of random variables, one per unit, each related to the firing activity of the corresponding unit inside a temporal window of duration $w$, we quantify integrated information as the amount of information the system generates as a whole (relative to a global state of maximum entropy) beyond that which accounts for the aggregated information the units generate individually (now relative to local states of maximum entropy). This surplus of information is known as total correlation [22] and is fundamentally dependent on how the units interact with one another.

Our understanding of information integration, therefore, lies in between those used by approaches that seek it in the synchronization of input/output signals (see, e.g., [23]) and those that share our view but would consider not just the whole and the individual units but all partitions in between as well [24]. By virtue of this, we remain aligned with the latter theory by acknowledging that information gets integrated only when it is generated by the whole in excess of the total its parts can generate individually. On the other hand, by sticking with total correlation as an information-theoretic quantity that requires only two partitions of the set of units to be considered (one that is fully cohesive and another that is maximally disjointed), we ensure tractability far beyond that of the all-partitions theory.

We conduct our entire study on a simple model of threshold-based systems. In this model, the units are placed inside a cube and exchange messages whose delivery depends on their propagation speed and the Euclidean distance between sender and receiver. Every message is tagged and, upon reaching its destination, its tag is used to move a local accumulator either toward or away from a threshold. Reaching the threshold makes the unit send out messages, whereupon the accumulator is reset. There are three parameters in the model. Two of them are probabilities (that a message is tagged so that the accumulator at the destination gets decreased upon its arrival and that a unit sends a message to each of the other units), the other being the value of the threshold. Parameter values are the same for all units.

This model is by no means offered as an accurate representation of any particular threshold-based system but nevertheless summarizes some key aspects of such systems through its relatively few parameters. In particular, it gives rise to three possible expected global regimes of message traffic. One of them is perfectly balanced, in the sense that on average as much traffic reaches the units as that leaving them. In this case, message traffic is sustained indefinitely. In each of the other two regimes, by contrast, either more traffic reaches the units than that leaving them or the opposite. Message traffic dies out in the former of these two (unless the units receive further external input) but grows indefinitely in the latter one.

We find that information integration is guaranteed to occur at high levels for some window durations whenever message traffic is sustained at the perfect-balance level or grows. We also find that this happens nearly independently of parameter variations. On the other hand, we also find that information integration is strongly dependent on the model’s parameters, with significant levels occurring only for some combinations, whenever message traffic is imbalanced toward the side that prevents it from being sustained. Here we once again turn to the brain, whose cortical activity is in many accounts characterized as tending to be sparse [25, 26], as an emblematic example.

We proceed as follows. Our message-passing model is laid out in Section 2, where its geometry and distributed algorithm are detailed and the question of message imbalance is introduced. An account of our use of total correlation is given in Section 3, followed by our methodology in Section 4. This methodology is based on the carefully planned computational experiments described in Section 4.2, all based on the distributed algorithm of Section 2.2, using the analyses in Sections 2.3 and 4.1 for guidance. Results, discussion, and conclusion follow, respectively, in Sections 5, 6, and 7.

#### 2. Model

Our system model comprises a structural component and an algorithmic one. The two are described in what follows, along with some analysis of how they interrelate.

##### 2.1. Underlying Geometry

For $d \in \{1, 2, 3\}$, our model is based on $n$ simple processing units, henceforth referred to as nodes, each one placed at a fixed position inside the $d$-dimensional cube of side $\ell$. The position of node $i$ has coordinates $x_i^{(1)}, \ldots, x_i^{(d)}$, so the Euclidean distance between nodes $i$ and $j$ is

$$d_{ij} = \sqrt{\sum_{k=1}^{d} \bigl(x_i^{(k)} - x_j^{(k)}\bigr)^2}. \quad (1)$$

We assume that nodes can communicate with one another by sending messages that propagate at the fixed speed $c$ on a straight line. The delay incurred by a message sent between nodes $i$ and $j$ in either direction is therefore $d_{ij}/c$.

Our computational experiments will all be such that the $n$ nodes are placed in the $d$-dimensional cube uniformly at random. In this case and in the limit of infinite $n$, the expected distance $\delta_d$ between two randomly chosen nodes $i$ and $j$ is given by

$$\delta_d = \int_0^\ell \cdots \int_0^\ell d_{ij}\, f\bigl(x_i^{(1)}\bigr) \cdots f\bigl(x_j^{(d)}\bigr)\, dx_i^{(1)} \cdots dx_j^{(d)}, \quad (2)$$

where $f = 1/\ell$ is the probability density for each of the $2d$ coordinate variables. Letting $y = x/\ell$ for each coordinate $x$ in this equation yields

$$\Delta_d = \int_0^1 \cdots \int_0^1 d_{ij}\, dy_i^{(1)} \cdots dy_j^{(d)}, \quad (3)$$

where $d_{ij}$ now has $y$'s in place of $x$'s. We then have

$$\delta_d = \ell \Delta_d, \quad (4)$$

with the expected distances in the unit cube for the numbers of dimensions of interest being well-known: $\Delta_1 = 1/3$, $\Delta_2 \approx 0.5214$ [27], and $\Delta_3 \approx 0.6617$ [28].
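As a quick numerical sanity check on these unit-cube constants (not part of the original experiments; the function name is illustrative), the expected distance can be estimated by Monte Carlo:

```python
import random
import math

def mean_unit_cube_distance(d, samples=200_000, seed=1):
    """Monte Carlo estimate of the expected distance between two
    points drawn uniformly at random in the d-dimensional unit cube."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += math.sqrt(sum((rng.random() - rng.random()) ** 2
                               for _ in range(d)))
    return total / samples

# Known values: 1/3 for d = 1, about 0.5214 for d = 2, about 0.6617 for d = 3.
for d, ref in [(1, 1 / 3), (2, 0.5214), (3, 0.6617)]:
    print(d, round(mean_unit_cube_distance(d), 4), ref)
```

With 200,000 samples the estimates typically land within a few thousandths of the known constants.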

In addition to expected distances, the associated variances will also at one point be useful. Analytical expressions for most of them seem to have remained unknown thus far, but the underlying probability densities have been found to be more concentrated around the expected values given above as $d$ grows [29]. That is, variance is the greatest for $d = 1$.

##### 2.2. Network Algorithmics

We view nodes as running an asynchronous message-passing algorithm collectively. By asynchronous we mean that each node remains idle until a message arrives. When this happens, the arriving message is processed by the node, which may result in messages being sent out as well. Such a purely reactive stance on the part of the nodes requires at least one node to send out at least one message without having received one, for startup. We assume that this is done by all nodes initially, after which they start behaving reactively.

We assume that each message carries a signed-unit tag (i.e., either $+1$ or $-1$) with it. The specific tag to go with the message is chosen probabilistically by its sender at send time, with $-1$ being chosen with probability $p$. The processing done by node $i$ upon arrival of a message is the heart of the system's thresholding nature and involves manipulating an accumulator $s_i$, initially equal to $0$, to which every tag received is added (unless $s_i = 0$ and the tag is $-1$, in which case $s_i$ remains unchanged). Whenever $s_i$ reaches a preestablished integer value $\tau > 0$, node $i$ sends out messages of its own and $s_i$ is reset to $0$. Thus, the integer $\tau$ acts as a threshold governing the sending of messages by (the firing of) node $i$. The values of $p$ and $\tau$ are the same for all nodes.

It follows from this simple rule that the value of $s_i$ is perpetually confined to the interval $[0, \tau]$. The expected number of messages that node $i$ has to receive in order for $s_i$ to be increased all the way from $0$ to $\tau$ is the node's expected number of message arrivals between firings, henceforth denoted by $m$. The value of $m$ can be calculated easily once we recognize that $m$ is simply the expected number of steps for the following discrete-time Markov chain to reach state $\tau$ having started at state $0$. The chain has states $0, 1, \ldots, \tau$ and transition probability $P_{kk'}$, from state $k$ to state $k'$, given by

$$P_{kk'} = \begin{cases} 1 - p, & \text{if } k' = k + 1; \\ p, & \text{if } k' = k - 1 \ge 0 \text{ or } k' = k = 0; \\ 0, & \text{otherwise}. \end{cases} \quad (5)$$

It is easily solved and yields

$$m = \frac{\tau}{1 - 2p} - \frac{p}{(1 - 2p)^2}\left[1 - \left(\frac{p}{1 - p}\right)^{\tau}\right] \quad (6)$$

for $p \neq 1/2$ [30, page 348]. (For $p = 1/2$ we have $m = \tau(\tau + 1)$ [30, page 349], but this case arises in none of our computational experiments.)
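To make the derivation concrete, the closed form for $m$ can be cross-checked against a direct simulation of the accumulator rule. This is an illustrative sketch, with function names chosen for this example:

```python
import random

def expected_arrivals(p, tau):
    """Closed-form expected number of message arrivals between firings:
    the mean number of steps for the chain to reach state tau from 0."""
    if p == 0.5:
        return tau * (tau + 1)
    r = p / (1 - p)
    return tau / (1 - 2 * p) - p * (1 - r ** tau) / (1 - 2 * p) ** 2

def simulate_arrivals(p, tau, runs=100_000, seed=1):
    """Average number of arrivals until the accumulator, starting at 0,
    first reaches tau; a -1 tag arriving at 0 leaves it unchanged."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        s = 0
        while s < tau:
            total += 1
            if rng.random() < p:      # tag -1 with probability p
                s = max(s - 1, 0)
            else:                     # tag +1 otherwise
                s += 1
    return total / runs

print(expected_arrivals(0.2, 3))   # exact expectation
print(simulate_arrivals(0.2, 3))   # simulated estimate, should be close
```

For $\tau = 1$ the formula reduces to $1/(1-p)$, the mean of a geometric number of arrivals until the first $+1$ tag, as expected.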

The sending of messages when a node fires is based on another parameter, $q$, which is the probability with which the node sends a message to each of the $n - 1$ other nodes. It follows that the expected number of messages that get sent out at each firing is $q(n - 1)$. The value of $q$ is the same for all nodes as well.
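Putting Sections 2.1 and 2.2 together, the whole distributed algorithm can be sketched as a small event-driven simulation. Everything below (function names, the cap on the number of messages sent, the returned trace format) is illustrative rather than taken from the paper:

```python
import heapq
import math
import random

def simulate(n, ell, d, c, p, q, tau, max_messages=50_000, seed=1):
    """Event-driven sketch of the model: n nodes at random positions in
    the d-cube of side ell; messages travel at speed c; a message carries
    a -1 tag with probability p; on firing, a node sends to each other
    node with probability q; accumulators fire at threshold tau."""
    rng = random.Random(seed)
    pos = [[rng.uniform(0, ell) for _ in range(d)] for _ in range(n)]
    acc = [0] * n
    events = []  # min-heap of (arrival_time, receiver, tag)
    sent = 0

    def fire(sender, now):
        nonlocal sent
        for j in range(n):
            if j != sender and rng.random() < q:
                delay = math.dist(pos[sender], pos[j]) / c
                tag = -1 if rng.random() < p else +1
                heapq.heappush(events, (now + delay, j, tag))
                sent += 1

    for i in range(n):          # startup: every node fires once at time 0
        fire(i, 0.0)
    trace = []                  # (node, arrival_time) events, in time order
    while events and sent < max_messages:
        t, j, tag = heapq.heappop(events)
        trace.append((j, t))
        acc[j] = max(acc[j] + tag, 0)   # a -1 tag at 0 has no effect
        if acc[j] >= tau:
            acc[j] = 0
            fire(j, t)
    return trace, sent

trace, sent = simulate(n=20, ell=1.0, d=3, c=1.0, p=0.2, q=0.3, tau=2)
print(sent, len(trace))
```

The run ends either when no message remains in transit or when the message cap is hit, mirroring the termination rule described later in Section 4.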

##### 2.3. Local Imbalance and Global Message Traffic

At node $i$, a balance exists between message output and message input when the expected number $q(n-1)$ of messages sent out at each firing is the same as the expected number $m$ of messages received between two successive firings. That is, message traffic is locally balanced when $q(n-1) = m$. It is locally imbalanced otherwise, which can be quantified by the relative difference $\varphi$, defined to be

$$\varphi = \frac{q(n-1) - m}{m} = \frac{q(n-1)}{m} - 1. \quad (7)$$

Given $\varphi$, clearly the instantaneous density $\sigma(t)$ of global message output at time $t$ is expected to remain constant with $t$ if $\varphi = 0$ or to decrease or increase exponentially with $t$ depending on whether $\varphi < 0$ or $\varphi > 0$, respectively. This behavior is described by

$$\frac{d\sigma(t)}{dt} = \frac{\varphi}{\Delta}\,\sigma(t), \quad (8)$$

where $\Delta$, participating in the time constant $\Delta/\varphi$, is some fundamental amount of time related to the system's geometry and kinetics. In Section 5, we provide empirical evidence that $\Delta$ is the expected delay undergone by a message, given by

$$\Delta = \frac{\delta_d}{c} = \frac{\ell \Delta_d}{c}. \quad (9)$$

Equation (8) is of immediate solution, yielding

$$\sigma(t) = \frac{N_0}{\Delta}\, e^{\varphi t/\Delta}, \quad (10)$$

where

$$N_0 = n q (n - 1) \quad (11)$$

is the expected number of messages that all nodes, collectively, send out initially.

Similarly, the cumulative global message output inside a temporal window of duration $w$ starting at time $t$ is

$$M(t, w) = \int_t^{t+w} \sigma(s)\, ds. \quad (12)$$

In the case of locally balanced message traffic ($\varphi = 0$), this expression is easily seen to yield

$$M(t, w) = \frac{N_0 w}{\Delta}, \quad (13)$$

therefore independent of $t$. Otherwise ($\varphi \neq 0$), we get

$$M(t, w) = \frac{N_0}{\varphi}\, e^{\varphi t/\Delta}\bigl(e^{\varphi w/\Delta} - 1\bigr),$$

which either decreases or increases exponentially with $t$, depending, respectively, on whether $\varphi < 0$ or $\varphi > 0$.
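A minimal numerical sketch of these traffic regimes, taking the local imbalance in the normalized form $\varphi = q(n-1)/m - 1$ (an assumption of this sketch) and the exponential solution for $\sigma(t)$:

```python
import math

def imbalance(q, n, m):
    """phi = q(n-1)/m - 1: zero at perfect balance, negative when
    traffic dies out, positive when it grows."""
    return q * (n - 1) / m - 1

def cumulative_output(t, w, n0, phi, delta):
    """M(t, w): expected number of messages sent in the window [t, t+w],
    integrating sigma(s) = (n0/delta) * exp(phi*s/delta)."""
    if phi == 0:
        return n0 * w / delta
    return (n0 / phi) * math.exp(phi * t / delta) * math.expm1(phi * w / delta)

# Balanced traffic: the same expected output in every window, regardless of t.
print(cumulative_output(0.0, 2.0, n0=100, phi=0.0, delta=1.0))  # 200.0
print(cumulative_output(5.0, 2.0, n0=100, phi=0.0, delta=1.0))  # 200.0
```

The $\varphi \neq 0$ branch reduces continuously to the balanced case as $\varphi \to 0$, which makes it easy to check both formulas against each other.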

##### 2.4. Graph-Theoretic Interpretations

The model described so far in Section 2 can be regarded as a directed geometric graph, that is, a graph whose nodes are positioned in some region of interest (here the $d$-dimensional cube of side $\ell$) and whose edges are directed. It is moreover a complete graph without self-loops, in the sense that an edge exists directed from every node $i$ to every node $j \neq i$.

Our use of the model in the sequel will require the nodes to be positioned at random before each new run of the distributed algorithm of Section 2.2, so an alternative interpretation that one might wish to consider views the model as a variation of the traditional random geometric graph [31]. In this variation, an edge exists directed from $i$ to $j$ with fixed probability $q$, independently of any other node pair. That is, aside from node positioning, the graph underlying our model is an Erdős–Rényi random graph [32] as extended to the directed case [33].

This interpretation is somewhat loose, though, because it requires that we view each individual run of the algorithm as being equivalent to several runs on independent instances of the underlying random graph, with multiple further runs serving to validate any statistics that one may come up with at the end. This is hard to justify, however, particularly when one considers the nonlinearities characterizing the quantities we will average over all runs of the algorithm (see Section 3). Even so, interpreting our model in terms of random graphs remains tantalizing in some contexts. For example, it allows the parameter $p$ to be regarded as the expected fraction of a node's in-neighbors from which messages with negative tags are received. In the context of networks of the brain at the neuronal level, this yields an abstract rendering of the fraction of neurons that are inhibitory (see Section 6.1).
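Under this interpretation, the underlying directed Erdős–Rényi graph can be sampled directly (an illustrative sketch; the function name is chosen for this example):

```python
import random

def directed_er_graph(n, q, seed=1):
    """Directed Erdos-Renyi graph: an edge i -> j exists with
    probability q, independently for every ordered pair of distinct nodes."""
    rng = random.Random(seed)
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and rng.random() < q}

edges = directed_er_graph(100, 0.2)
# The expected number of edges is q * n * (n - 1) = 1980 here.
print(len(edges))
```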

#### 3. Total Correlation

We use the total correlation of $n$ random variables [22], each corresponding to one of the $n$ nodes, as a measure of information integration. Each of these variables is relative to a temporal window of fixed duration $w$, the variable corresponding to node $i$ being denoted by $X_i$ and taking up values from the set $\{0, 1\}$. The intended semantics is that $X_i = 1$ if and only if node $i$ receives at least one message in a time interval of duration $w$. We also use the shorthands $X$ and $x$ to denote the sequence of variables $X_1, \ldots, X_n$ and the sequence of values $x_1, \ldots, x_n$, respectively. Thus, $X = x$ stands for the joint valuation $X_1 = x_1, \ldots, X_n = x_n$.

Given the marginal Shannon entropy of each variable $X_i$,

$$H(X_i) = -\sum_{x_i \in \{0,1\}} \Pr(X_i = x_i) \log_2 \Pr(X_i = x_i),$$

and the joint Shannon entropy,

$$H(X) = -\sum_{x \in \{0,1\}^n} \Pr(X = x) \log_2 \Pr(X = x),$$

the total correlation $C$ of the variables in $X$ given $w$ is defined as follows:

$$C = \sum_{i=1}^{n} H(X_i) - H(X).$$

(When $n = 2$ this formula coincides with that for mutual information, but one is to note that in the general case the two formulas are completely different [34].) To see the significance of this definition in our context, consider the flat joint probability mass function, $\Pr(X = x) = 1/2^n$ for all $x \in \{0,1\}^n$. This mass function entails maximum uncertainty of the variables' values, hence the maximum possible value of the joint entropy, $H(X) = n$. It also implies flat marginals, $\Pr(X_i = x_i) = 1/2$ for all $i$ and all $x_i$, and again the maximum possible value of each marginal entropy, $H(X_i) = 1$. The difference from the actual joint entropy to its maximum reflects a reduction of uncertainty, or an information gain, $n - H(X)$, the same holding for each of the marginals, $1 - H(X_i)$.

Thus, it is possible to rewrite the expression for $C$ in such a way that

$$C = \bigl(n - H(X)\bigr) - \sum_{i=1}^{n} \bigl(1 - H(X_i)\bigr).$$

That is, the total correlation of all $n$ variables is the information gain that surpasses their combined individual gains. This surplus is zero if and only if the variables are independent of one another, that is, precisely when $\Pr(X = x) = \prod_{i=1}^{n} \Pr(X_i = x_i)$ for all $x$, since in this case we have $H(X) = \sum_{i=1}^{n} H(X_i)$. It is strictly positive otherwise, with a maximum possible value of $n - 1$.
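The definition can be sketched directly in code. The two mass functions below reproduce the zero (independence) and maximum ($n - 1$) cases; names are chosen for illustration:

```python
import math
from itertools import product

def total_correlation(joint):
    """Total correlation C = sum_i H(X_i) - H(X) in bits, for a joint
    probability mass function over n binary variables, given as a dict
    mapping tuples in {0,1}^n to probabilities."""
    n = len(next(iter(joint)))
    h_joint = -sum(p * math.log2(p) for p in joint.values() if p > 0)
    c = -h_joint
    for i in range(n):
        marg = [0.0, 0.0]
        for x, p in joint.items():
            marg[x[i]] += p
        c += -sum(p * math.log2(p) for p in marg if p > 0)
    return c

n = 4
# Independent fair coins: C = 0.
indep = {x: 1 / 2 ** n for x in product((0, 1), repeat=n)}
# Two equiprobable, complementary valuations: C = n - 1, the maximum.
maxed = {(0,) * n: 0.5, (1,) * n: 0.5}
print(total_correlation(indep))  # 0.0
print(total_correlation(maxed))  # 3.0
```

The second mass function is exactly the stringent scenario described next: all probability mass on two complementary valuations, each with probability one-half.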

Achieving this maximum requires a joint probability mass function assigning no mass to any but two members of $\{0,1\}^n$, say $x'$ and $x''$, and moreover that these two be equiprobable (i.e., $\Pr(X = x') = \Pr(X = x'') = 1/2$) and complementary to each other (i.e., $x_i' \neq x_i''$ for every $i$). Referring back to the intended meaning of the random variables in $X$, total correlation is maximized in those runs of the distributed algorithm of Section 2.2 for which a partition of the set of nodes into two sets, $A$ and $B$, exists with the following two properties. First, no matter which particular window of duration $w$ we concentrate on, the set of nodes that receive at least one message inside the window is either $A$ or $B$. Second, the first property holds with $A$ for exactly half such windows.

While these are exceedingly stringent conditions both spatially and temporally, perhaps implying that values of total correlation equal to or near $n - 1$ are practically unachievable, they serve to delineate those scenarios with a chance of generating substantial amounts of total correlation. Specifically, such scenarios will on average have a pattern of global message traffic, inside a window, that is neither too sparse nor too dense. Furthermore, sustaining such an amount of total correlation as time elapses will also require traffic patterns that deviate only negligibly from the ones yielding the average, possibly entailing some variability on window sizes. Our methodology to track and validate the values of $w$ leading to noteworthy total correlation is described next. It involves computational experiments for a variety of values for $w$ and also gauging the cumulative global message output that results from each experiment against a function of the reference number of messages and reference delay embodied in $N_0$ and $\Delta$, respectively.

#### 4. Methods

Our results are based on running the distributed algorithm of Section 2.2 for a fixed geometry (i.e., fixed number of dimensions $d$, fixed value of the cube side $\ell$, and fixed positioning of the $n$ nodes in the cube) and a fixed set of values for the parameters ($p$, $q$, and $\tau$). Each run of the algorithm terminates either when no more messages are in transit (so none will ever be, thenceforth, given the algorithm's reactive nature) or when a preestablished maximum number of messages in transit has been reached, whichever comes first. Imposing the latter upper bound is important because it serves to size the data structures where messages in transit are kept for later processing.

Node positioning is achieved uniformly at random, so multiple runs are needed for each fixed configuration of $n$, $d$, $\ell$, $p$, $q$, and $\tau$. Each run leaves a trace of all events taking place as it unfolds, each event referring to the arrival of a message at a node and comprising the node's identification and the message's arrival time. A series of values for $w$ is then considered and, for each of them, each of the traces is analyzed, yielding the total correlation produced by the corresponding run. The average total correlation over all the runs is then reported.
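The per-window bookkeeping behind this trace analysis can be sketched as follows; the fixed grid of window start times and the data layout are assumptions of this illustration, not details taken from the paper:

```python
def window_valuation(trace, n, t, w):
    """Binary valuation (x_1, ..., x_n) for a window of duration w
    starting at time t: x_i = 1 iff node i receives at least one
    message in [t, t+w). The trace is a list of (node, arrival_time)."""
    x = [0] * n
    for node, arrival in trace:
        if t <= arrival < t + w:
            x[node] = 1
    return tuple(x)

def empirical_joint(trace, n, w, t_max, step):
    """Empirical joint mass function over valuations, obtained by
    sliding the window start from 0 to t_max in increments of step."""
    counts = {}
    for k in range(int(t_max / step) + 1):
        x = window_valuation(trace, n, k * step, w)
        counts[x] = counts.get(x, 0) + 1
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}

# Toy trace for n = 2: node 0 hears messages early, node 1 late.
trace = [(0, 0.5), (0, 1.5), (1, 3.5), (1, 4.5)]
print(empirical_joint(trace, n=2, w=1.0, t_max=4.0, step=1.0))
```

A mass function obtained this way can then be fed to a total-correlation computation such as the one sketched in Section 3.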

Following our discussion at the end of Section 3, for each value of $w$ we gauge (12) against the approximation to it given by $\beta N_0$, according to which $\beta N_0$ messages get sent at time $t$ and received at time $t + \Delta$. We do this by postulating a proportionality constant $\beta$ between them, that is, by assuming

$$M(t, w) = \beta N_0. \quad (19)$$

Doing this allows us to express $w$ as a function of $t$ for each $\beta$ and, whenever possible, to characterize traffic regimes giving rise to substantial amounts of total correlation.

##### 4.1. Supporting Analysis

We denote the value of $t$ upon termination of a run by $T$ and the total number of messages sent by $M_T$. An approximation to (12) similar to the one above can be used to relate $T$ and $M_T$ as $M_T = N_0 T / \Delta$, whose right-hand side quantifies what would be expected to happen if all messages were sent at time $0$ and received at time $\Delta$. Combined with (19), (12) leads to

$$e^{\varphi t/\Delta}\bigl(e^{\varphi w/\Delta} - 1\bigr) = \beta \varphi. \quad (20)$$

Solving (20) for $w$ given $t$ yields

$$w(t) = \frac{\Delta}{\varphi} \ln\bigl(1 + \beta \varphi\, e^{-\varphi t/\Delta}\bigr), \quad (21)$$

whose value for $t = 0$ is the duration of the first window,

$$w_{\text{first}} = \frac{\Delta}{\varphi} \ln(1 + \beta \varphi). \quad (22)$$

As for the duration of the last window, which we denote by $w_{\text{last}}$, it can likewise be found by solving (20) for $w$, now letting the window's start time be $t = T - w_{\text{last}}$. We obtain

$$w_{\text{last}} = -\frac{\Delta}{\varphi} \ln\bigl(1 - \beta \varphi\, e^{-\varphi T/\Delta}\bigr). \quad (23)$$

The average window duration between time $t = 0$ and time $t = T$, denoted by $\bar{w}$, is also of interest and comes from the indefinite integral

$$\int w(t)\, dt = \frac{\Delta^2}{\varphi^2} \operatorname{Li}_2\bigl(-\beta \varphi\, e^{-\varphi t/\Delta}\bigr) + \text{constant},$$

where $\operatorname{Li}_2(z)$ is the dilogarithm of $z$. Given this, we obtain

$$\bar{w} = \frac{1}{T} \int_0^T w(t)\, dt = \frac{\Delta^2}{\varphi^2 T} \Bigl[\operatorname{Li}_2\bigl(-\beta \varphi\, e^{-\varphi T/\Delta}\bigr) - \operatorname{Li}_2(-\beta \varphi)\Bigr].$$
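With the dilogarithm implemented by its power series (valid only for $|z| \le 1$, which the illustrative parameter values below respect), the closed-form average can be cross-checked against direct quadrature of $w(t)$; this is a numerical sketch under those assumptions:

```python
import math

def dilog(z, terms=200):
    """Dilogarithm Li2(z) by its power series (valid for |z| <= 1)."""
    return sum(z ** k / k ** 2 for k in range(1, terms + 1))

def window(t, beta, phi, delta):
    """w(t) = (delta/phi) * ln(1 + beta*phi*exp(-phi*t/delta)), phi != 0."""
    return (delta / phi) * math.log1p(beta * phi * math.exp(-phi * t / delta))

def average_window(T, beta, phi, delta):
    """Closed-form average of w(t) on [0, T] via the dilogarithm."""
    a = -beta * phi
    return (delta ** 2 / (phi ** 2 * T)) * (
        dilog(a * math.exp(-phi * T / delta)) - dilog(a))

# Cross-check the closed form against a plain midpoint Riemann sum.
beta, phi, delta, T = 2.0, 0.3, 1.0, 5.0
steps = 100_000
riemann = sum(window((k + 0.5) * T / steps, beta, phi, delta)
              for k in range(steps)) * (1 / steps)
print(average_window(T, beta, phi, delta), riemann)
```

The two printed values agree to several decimal places, which is a useful sanity check on the sign conventions in the antiderivative.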

While for $\varphi = 0$ we have $w(t) = \beta \Delta$ for every $t$ and

$$\bar{w} = \beta \Delta,$$

for $\varphi \neq 0$ everything depends on the sign of $\varphi$. If $\varphi < 0$, then we need $\beta \varphi\, e^{-\varphi T/\Delta} > -1$ in order for $w(t)$ to remain well defined up to time $T$. Moreover, $w(t)$ then increases with $t$, so we have

$$w_{\text{first}} < \bar{w} < w_{\text{last}},$$

where the first inequality holds if $\beta > 0$, this being necessary and sufficient for the last inequality to hold as well. For $\varphi > 0$, on the other hand, $w(t)$ is always well defined and decreases with $t$, and we get

$$w_{\text{last}} < \bar{w} < w_{\text{first}}.$$

In this case, too, the constraint $\beta > 0$ is necessary and sufficient for both the first and the last inequality to hold.

##### 4.2. Computational Experiments

We organize our computational experiments into settings, numbered I–IV, each comprising all configurations for which $d$, $\ell$, and $n$ are fixed. In each setting, there are three possibilities for the value of $q$: one ensuring $\varphi < 0$, one for $\varphi = 0$ (extracted from (7), i.e., $q = m/(n-1)$), and one for $\varphi > 0$. The four settings are summarized in Table 1. Each of settings II–IV is derived from setting I by a change in the value of $d$, $\ell$, or $n$, respectively.