## Information flow on graphs

Models of complex systems built from simple, locally interacting components arise in many fields, including statistical physics, biology, artificial intelligence, and communication networks. The quest to understand and to quantify the fundamental limits on the ability of such systems to store and process information has led to a variety of interesting and insightful results that draw upon probability, combinatorics, information theory, and the theory of discrete and continuous dynamical systems. In this post, I would like to focus on a model of distributed storage that was analyzed in 1975 by Donald Dawson in a very nice paper that deserves to be more widely known.

The system Dawson analyzed is an instance of a Probabilistic Cellular Automaton (PCA), i.e., a system consisting of a discrete collection of nodes, each of which has a discrete state. The configuration of the automaton at each time $t$ is specified by listing the state of each node. The state of each node at time $t$ is determined stochastically by its own state, as well as by the states of its neighbors, at time $t-1$, independently of everything else. One of the key questions in the theory of PCAs is whether a given automaton is *ergodic*, i.e., whether its configuration eventually “forgets” the initial condition. Thus, Dawson’s result is a sufficient condition for ergodicity of a PCA.
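To make the definition concrete, here is a minimal PCA sketch in Python (purely illustrative, and not the specific automaton Dawson studied): binary states on a ring, with each node updating to a noisy majority vote over its closed neighborhood.

```python
import random

# Minimal sketch of a probabilistic cellular automaton on a ring (illustrative only).
# Each node looks at its closed neighborhood at time t-1 and, with probability 1 - eps,
# adopts the majority symbol; with probability eps it picks a symbol uniformly at random.

def pca_step(config, eps=0.1):
    n = len(config)
    new_config = []
    for v in range(n):
        neighborhood = [config[(v - 1) % n], config[v], config[(v + 1) % n]]
        majority = 1 if sum(neighborhood) >= 2 else 0
        new_config.append(majority if random.random() > eps else random.randint(0, 1))
    return new_config

config = [random.randint(0, 1) for _ in range(20)]  # arbitrary initial configuration
for t in range(5):
    config = pca_step(config)
print(config)
```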

Now let’s move on to precise definitions. Consider an undirected graph $G = (V, E)$. The *neighborhood* of a vertex $v \in V$, denoted by $N(v)$, is the set of all vertices that are connected to $v$ by an edge. We will also denote the set consisting of $v$ and all of its neighbors by $N^+(v) := \{v\} \cup N(v)$. Let us fix a finite alphabet $\mathsf{X}$; for each vertex $v$, let $\mathsf{X}_v$ denote a copy of $\mathsf{X}$. A *configuration* is an assignment of a symbol $x_v \in \mathsf{X}_v$ to each vertex $v \in V$. Thus, $\mathsf{X}^V$ is the space of all configurations. We will denote a generic configuration by $x = (x_v)_{v \in V}$; for any set $S \subseteq V$, we will denote by $x_S$ the natural projection of $x$ onto $\mathsf{X}^S$: $x_S = (x_v)_{v \in S}$.

Now let’s consider the following model of a distributed one-bit memory cell. Let $U \in \{0,1\}$ be a random variable, which is the bit we wish to store. We also pick two probability measures $\mu_0$ and $\mu_1$ on $\mathsf{X}^V$. Given $U = u$, we draw the initial configuration $X^0$ at random according to $\mu_u$. We allow arbitrary correlations between the different $X^0_v$’s, so our storage is distributed in this sense. Now let’s suppose that the state of each vertex evolves stochastically in discrete time according to a local Markov model: for each $v \in V$, we have a channel (random transformation) $K_v$ with input alphabet $\mathsf{X}^{N^+(v)}$ and output alphabet $\mathsf{X}_v$. Then the configuration $X^t = (X^t_v)_{v \in V}$ at time $t$ evolves according to the law

$$\Pr\big[X^t = y \,\big|\, X^{t-1} = x\big] = \prod_{v \in V} K_v\big(y_v \,\big|\, x_{N^+(v)}\big), \qquad t = 1, 2, \ldots \tag{1}$$

with the initial conditions $\Pr[X^0 = x \,|\, U = 0] = \mu_0(x)$ and $\Pr[X^0 = x \,|\, U = 1] = \mu_1(x)$. Thus, we end up with a Markov chain

$$U \longrightarrow X^0 \longrightarrow X^1 \longrightarrow X^2 \longrightarrow \cdots$$

where at each time the evolution from $X^{t-1}$ to $X^t$ is also Markov “in space”: the state of vertex $v$ at time $t$ is determined only by the states of $v$ and its neighbors at time $t-1$, i.e., for each $t$ and each $v \in V$, we have a Markov chain

$$U \longrightarrow X^{t-1}_{N^+(v)} \longrightarrow X^t_v.$$
Now the question is: does the configuration $X^t$ at an arbitrary time $t$ retain any information about $U$? One way to answer this is to pick an arbitrary finite set $J \subset V$ and to examine the evolution of the mutual information $I(U; X^t_J)$, where

$$I(U; X^t_J) := H(U) - H(U \,|\, X^t_J).$$

If we can show that there exists some $c > 0$ such that $I(U; X^t_J) \ge c$ for all $t$ and at least one nonempty $J$, then our memory cell is capable of storing one bit. If, on the other hand, $I(U; X^t_J) \to 0$ as $t \to \infty$ for every finite $J$, then eventually the memory cell will forget $U$.
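As a sanity check, one can simulate a toy instance of this question and track a crude plug-in estimate of $I(U; X^t_J)$ for a small window $J$. Everything in the sketch below is an illustrative assumption rather than part of Dawson’s model: the ring graph, the noisy-majority local channels, and the encoding that sets every node to the stored bit $U$.

```python
import random
from collections import Counter
from math import log2

# Monte Carlo sketch: store a uniform bit U by initializing every node of a ring PCA to U,
# run t steps of a noisy-majority update, and form a plug-in estimate of I(U; X^t_J)
# from the empirical joint distribution of (U, X^t_J). All modeling choices are illustrative.

def step(config, eps):
    n = len(config)
    return [
        (1 if config[(v - 1) % n] + config[v] + config[(v + 1) % n] >= 2 else 0)
        if random.random() > eps else random.randint(0, 1)
        for v in range(n)
    ]

def estimate_mi(t, n=30, eps=0.2, J=(0, 1, 2), samples=20000):
    joint = Counter()
    for _ in range(samples):
        u = random.randint(0, 1)
        config = [u] * n          # initial configuration drawn from mu_u (here: all nodes = u)
        for _ in range(t):
            config = step(config, eps)
        joint[(u, tuple(config[v] for v in J))] += 1
    # plug-in estimate of the mutual information (in bits) from the empirical counts
    pu, pj = Counter(), Counter()
    for (u, xs), c in joint.items():
        pu[u] += c
        pj[xs] += c
    return sum(
        (c / samples) * log2(c * samples / (pu[u] * pj[xs]))
        for (u, xs), c in joint.items()
    )

for t in range(6):
    print(t, round(estimate_mi(t), 3))
```

With these (arbitrary) parameters, the estimate starts at one bit at $t = 0$ and typically shrinks as $t$ grows; whether it decays to zero or levels off at some positive value is exactly the dichotomy described above.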

**1. Dawson’s impossibility result**

Roughly speaking, Dawson proved the following: if the local channels $K_v$ are sufficiently noisy and if the graph $G$ is sufficiently sparse, then the mutual information $I(U; X^t_J)$ will decay to zero exponentially fast. In order to state his result precisely, we need some preliminary definitions. First of all, in order to quantify the noisiness of each channel $K_v$, Dawson introduces the following quantity, which he calls the *transmission coefficient* of $K_v$:

$$\theta_v := \sup \frac{I(W; Y)}{I(W; X)}, \tag{2}$$

where the supremum is over all Markov chains $W \to X \to Y$, where $W$ is an arbitrary (finite-valued) random variable, the channel from $W$ to $X$ is arbitrary, and the channel from $X$ to $Y$ is given by $K_v$. Moreover, let $\theta := \sup_{v \in V} \theta_v$. Clearly, $\theta_v \le 1$ for every $v$, by the data processing inequality. Consequently, $\theta \le 1$.

**Remark 1** Dawson’s definition of the transmission coefficient is reminiscent of the best constant in the data processing inequality of Erkip and Cover (corrected by Anantharam et al.): to define that constant in our setting, one would fix a distribution $P$ on the input alphabet $\mathsf{X}^{N^+(v)}$ of $K_v$ and let

$$\eta(P, K_v) := \sup \frac{I(W; Y)}{I(W; X)}, \tag{3}$$

where the supremum is over all Markov chains $W \to X \to Y$ satisfying the constraints $X \sim P$ and $P_{Y|X} = K_v$. The difference is that in (2) we constrain only the stochastic kernel $P_{Y|X}$, whereas in (3) we constrain both the marginal distribution of $X$ and the kernel $P_{Y|X}$. As shown by Anantharam et al.,

$$\eta(P, K_v) = \sup_{Q \neq P} \frac{D(Q K_v \,\|\, P K_v)}{D(Q \,\|\, P)},$$

where the supremum is over all marginal distributions $Q$ of $X$ distinct from $P$, and $Q K_v$ denotes the marginal distribution of $Y$ when $X \sim Q$. Both $\theta_v$ and $\eta(P, K_v)$ measure the “noisiness” of $K_v$, but in different ways. In particular, the motivation behind Dawson’s definition is as follows: the state of a given node $v$ at time $t$ depends only on $X^{t-1}_{N^+(v)}$ and not on anything else. Thus, if the boundary condition, i.e., the configuration of all the nodes except for $v$ and its neighbors, is “frozen” at some value $z$, then the amount of information about $U$ that node $v$ will receive at time $t$ will be controlled by the quantity

$$\frac{I\big(U; X^t_v \,\big|\, X^{t-1}_{V \setminus N^+(v)} = z\big)}{I\big(U; X^{t-1}_{N^+(v)} \,\big|\, X^{t-1}_{V \setminus N^+(v)} = z\big)},$$

which, by definition, is no more than $\theta_v$. We will elaborate on this point later on.
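To make the transmission coefficient less abstract, here is a rough numerical sketch (my own illustration, not from Dawson’s paper): take a binary symmetric channel with crossover probability $\varepsilon$ as a stand-in for a local channel $K_v$, and grid-search over binary $W$ and channels from $W$ to $X$ to lower-bound the supremum in (2).

```python
import itertools
import numpy as np

# Crude numerical exploration of a Dawson-style transmission coefficient
#   theta(K) = sup I(W; Y) / I(W; X)  over Markov chains W -> X -> Y with P_{Y|X} = K.
# The binary alphabets, the BSC playing the role of K, and the grid search are all
# illustrative choices; the search only certifies a lower bound on the supremum.

def mutual_information(p_w, channel):
    """I(W; Z) in nats, for W ~ p_w and Z | W = w ~ channel[w, :]."""
    joint = p_w[:, None] * channel                    # joint p(w, z)
    p_z = joint.sum(axis=0)
    indep = p_w[:, None] * p_z[None, :]
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / indep[mask])).sum())

eps = 0.1
K = np.array([[1 - eps, eps], [eps, 1 - eps]])        # fixed channel from X to Y (a BSC here)

best = 0.0
grid = np.linspace(0.01, 0.99, 25)
for pw in grid:                                        # marginal of W
    p_w = np.array([pw, 1 - pw])
    for a, b in itertools.product(grid, repeat=2):     # arbitrary channel from W to X
        P_x_given_w = np.array([[a, 1 - a], [b, 1 - b]])
        i_wx = mutual_information(p_w, P_x_given_w)
        if i_wx < 1e-9:
            continue
        i_wy = mutual_information(p_w, P_x_given_w @ K)  # P_{Y|W} = P_{X|W} composed with K
        best = max(best, i_wy / i_wx)

print("grid-search lower bound on theta for this K:", round(best, 4))
print("BSC strong-data-processing benchmark (1 - 2*eps)**2:", round((1 - 2 * eps) ** 2, 4))
```

The data processing inequality guarantees that every ratio in the search is at most one; for a binary symmetric channel the lower bound should come out near $(1 - 2\varepsilon)^2$, the usual strong data processing benchmark for that channel.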

Also, Dawson assumes that $G$ has *bounded degree*, i.e.,

$$\deg(v) := |N(v)| \le b < \infty \qquad \text{for all } v \in V.$$

Now we are in a position to state the following:

**Theorem 1 (Dawson)** For any finite set $J \subset V$ and any $t \ge 0$,

$$I(U; X^t_J) \,\le\, |J| \,\big(\theta (b+1)\big)^t \log 2. \tag{4}$$

Thus, if $\theta (b+1) < 1$, then $I(U; X^t_J) \to 0$ as $t \to \infty$, and the convergence is exponentially fast.
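For a sense of scale, here is a tiny numeric illustration of the right-hand side of (4) for some arbitrary example values of $\theta$, $b$, and $|J|$; once $\theta(b+1) < 1$, the bound shrinks geometrically in $t$.

```python
from math import log

# Evaluate the bound |J| * (theta*(b+1))**t * log 2 from Theorem 1 for example values.
# theta, b, and |J| below are arbitrary illustrative numbers, not tied to any channel or graph.
theta, b, J_size = 0.2, 3, 5
rate = theta * (b + 1)                     # per-step contraction factor

assert rate < 1, "the bound is only informative when theta * (b + 1) < 1"
for t in range(6):
    print(t, J_size * rate ** t * log(2))
```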

**2. The proof**

The proof of Theorem 1 consists of several steps. The first step gives an upper bound on the information value of the state at a given node if we already observed the states of some other nodes. This bound is stated in Lemmas 2 and 3 below.

**Lemma 2** Fix two arbitrary, possibly empty finite sets $S, S' \subset V$. Then, for any $v \in V$ and any $t \ge 1$,

*Proof:* We begin by noting that, since ,

is a Markov chain. Now, for any fixed choices of the configurations and , consider two probability distributions and of :

From (6), if we consider a channel from to with transition probabilities given by , and then let be the output of with input , then

and is a Markov chain satisfying the conditions in the definition (2) of the transmission coefficient . Therefore,

Taking the expectation with respect to and , we obtain (5).

**Lemma 3** For any finite set $J \subset V$, any $t \ge 1$, and an arbitrary (possibly empty) finite set $S \subset V$,

*Proof:* Using the chain rule for mutual information and Lemma 2,

Now let’s examine the mutual information . Let us use the chain rule to write the conditional mutual information in two ways:

Therefore, we can write

Substituting this into our bound on , we obtain the statement of the lemma.

The main message of Lemma 3 can be stated more succinctly if we introduce shorthand notation

for all finite, possibly empty sets and for all , where we also let

for all finite sets . A nice way to think about the above definitions is in terms of prediction and correction steps in Bayesian filtering: Eq. (7) corresponds to prediction since it quantifies the amount of information about that the nodes in will contain at time given what we know about the state of the nodes in some other set at time , while Eq. (8) quantifies the impact of knowing the state of the nodes in at time , and so represents correction. Moreover, we can write the bound of Lemma 3 as

The second step of the proof is to show that the exponential decay of information, according to Eq. (4), is a consequence of (9). To that end, Dawson employs a clever inductive argument. Let denote the collection of all finite subsets of , and for any define a set function by

The quantity defined in Eq. (8) can be computed in terms of : indeed, by the chain rule we have

In addition, if we know , we can also compute :

Moreover, using our knowledge of and the fact that for all , we can do the following: Let us fix an arbitrary node . Then, from (9), we derive

Fixing another node , we have

where the first step uses (9); the second uses our bound on ; in the third, we add and subtract ; and the last step uses the chain rule. Continuing inductively, we see that, for any ,

In particular, using the formula with , we obtain

The final step is to complete the proof using the fact that $I(U; X^0_S) \le H(U)$ for every finite $S \subset V$. Since $U$ is binary, $H(U) \le \log 2$. Therefore,

$$I(U; X^0_S) \le \log 2 \qquad \text{for every finite } S \subset V.$$

This can be interpreted as follows: if we define the *specific information* at time $t$ by

$$i_t := \sup_{J} \frac{I(U; X^t_J)}{|J|},$$

where the supremum is over all nonempty finite sets $J \subset V$, then $i_0 \le \log 2$. Now, we write

where the first step follows from (10); the second uses (11); the third step is by (over)counting; the fourth step uses the fact that $|N^+(v)| \le b + 1$ for every $v \in V$; and the remaining steps are simple combinatorics. What we have shown is that the specific information at time $t$ can be bounded as $i_t \le \theta (b+1)\, i_{t-1}$. Thus, if $\theta (b+1) < 1$, then in one step the specific information will strictly decrease. Iterating, we get the bound

$$i_t \le \big(\theta (b+1)\big)^t i_0 \le \big(\theta (b+1)\big)^t \log 2.$$

If $\theta (b+1) < 1$, then $I(U; X^t_J) \le |J|\, i_t \to 0$ as $t \to \infty$ for every finite $J$. This completes the proof of Theorem 1.
