Information flow on graphs
Models of complex systems built from simple, locally interacting components arise in many fields, including statistical physics, biology, artificial intelligence, and communication networks. The quest to understand and to quantify the fundamental limits on the ability of such systems to store and process information has led to a variety of interesting and insightful results that draw upon probability, combinatorics, information theory, and the theory of discrete and continuous dynamical systems. In this post, I would like to focus on a model of distributed storage that was analyzed in 1975 by Donald Dawson in a very nice paper, which deserves to be more widely known.
The system Dawson analyzed is an instance of a Probabilistic Cellular Automaton (PCA), i.e., a system consisting of a discrete collection of nodes, each of which has a discrete state. The configuration of the automaton at time $t$ is specified by listing the states of all of the nodes. The state of each node at time $t+1$ is determined stochastically by its own state, as well as by the states of its neighbors, at time $t$, independently of everything else. One of the key questions in the theory of PCAs is whether a given automaton is ergodic, i.e., whether its configuration eventually “forgets” the initial condition. Thus, Dawson’s result is a sufficient condition for ergodicity of a PCA.
Now let’s move on to precise definitions. Consider an undirected graph $G = (V,E)$. The neighborhood of a vertex $v \in V$, denoted by $\partial v$, is the set of all vertices that are connected to $v$ by an edge. We will also denote by $\partial_+ v \triangleq \{v\} \cup \partial v$ the set consisting of $v$ and all of its neighbors. Let us fix a finite alphabet ${\mathsf X}$; for each vertex $v \in V$, let ${\mathsf X}_v$ denote a copy of ${\mathsf X}$. A configuration is an assignment of a symbol $x_v \in {\mathsf X}_v$ to each vertex $v \in V$. Thus, ${\mathsf X}^V = \prod_{v \in V} {\mathsf X}_v$ is the space of all configurations. We will denote a generic configuration by $x = (x_v)_{v \in V}$; for any set $U \subseteq V$, we will denote by $x_U$ the natural projection of $x$ onto ${\mathsf X}^U$: $x_U \triangleq (x_v)_{v \in U}$.
Now let’s consider the following model of a distributed one-bit memory cell. Let $W \in \{0,1\}$ be a random variable, which is the bit we wish to store. We also pick two probability measures $\mu_0$ and $\mu_1$ on ${\mathsf X}^V$. Given $W = w$, we draw the initial configuration $X(0) = (X_v(0))_{v \in V}$ at random according to $\mu_w$. We allow arbitrary correlations between the different $X_v(0)$’s, so our storage is distributed in this sense. Now let’s suppose that the state of each vertex evolves stochastically in discrete time according to a local Markov model: for each $v \in V$, we have a channel (random transformation) $K_v$ with input alphabet ${\mathsf X}^{\partial_+ v}$ and output alphabet ${\mathsf X}_v$. Then the configuration at time $t+1$ is generated from the configuration at time $t$ according to the law
$$ {\mathbb P}\left[ X(t+1) = y \,\middle|\, X(t) = x \right] = \prod_{v \in V} K_v\left( y_v \,\middle|\, x_{\partial_+ v} \right) \qquad (1)$$
with the initial conditions ${\mathbb P}[X(0) \in \cdot \,|\, W = 0] = \mu_0$ and ${\mathbb P}[X(0) \in \cdot \,|\, W = 1] = \mu_1$. Thus, we end up with a Markov chain
$$ W \to X(0) \to X(1) \to X(2) \to \ldots,$$
where at each time $t$ the evolution from $X(t)$ to $X(t+1)$ is also Markov “in space:” the state of vertex $v$ at time $t+1$ is determined only by the states of the vertices in $\partial_+ v$ at time $t$, i.e., for each $v \in V$ and each $t$, we have a Markov chain
$$ W \to X_{\partial_+ v}(t) \to X_v(t+1).$$
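To make the dynamics concrete, here is a minimal sketch of one synchronous update in Python. Everything in it — the ring graph, the binary alphabet, and the “noisy majority” local channel — is my own illustrative choice, not something from Dawson’s paper:

```python
import random

def pca_step(x, delta, rng):
    """One synchronous PCA update on a ring of len(x) binary nodes.

    Each node v looks at the states of its closed neighborhood
    {v-1, v, v+1} (indices mod n), takes the majority bit, and flips
    it with probability delta.  This noisy-majority rule plays the
    role of the local channel K_v applied to x restricted to the
    closed neighborhood of v.
    """
    n = len(x)
    y = []
    for v in range(n):
        s = x[(v - 1) % n] + x[v] + x[(v + 1) % n]
        majority = 1 if s >= 2 else 0
        flip = 1 if rng.random() < delta else 0
        y.append(majority ^ flip)
    return y

rng = random.Random(0)
x = [0, 0, 1, 0, 1, 1, 1, 0]          # configuration at time t
y = pca_step(x, delta=0.1, rng=rng)   # configuration at time t+1
```

Since all nodes are updated simultaneously using the same time-$t$ configuration and independent noise, the transition probability factorizes over vertices exactly as in the evolution law above.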
Now the question is: does the configuration at an arbitrary time $t$ retain any information about $W$? One way to answer this is to pick an arbitrary finite set $U \subset V$ and to examine the evolution of the mutual information $I(W; X_U(t))$, where
$$ X_U(t) \triangleq \left( X_v(t) \right)_{v \in U}.$$
If we can show that there exists some $c > 0$ such that $I(W; X_U(t)) \ge c$ for all $t$ and at least one nonempty $U$, then our memory cell is capable of storing one bit. If, on the other hand, $I(W; X_U(t)) \to 0$ as $t \to \infty$ for every finite $U$, then eventually the memory cell will forget $W$.
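For intuition, one can track this mutual information exactly in a toy instance. The following sketch (my own setup, not Dawson’s: a 4-node binary ring, point-mass initial measures $\mu_0 = \delta_{0000}$ and $\mu_1 = \delta_{1111}$, and a noisy-majority local channel) computes $I(W; X(t))$ by brute-force enumeration of all configurations:

```python
import itertools, math

def mi_history(n=4, delta=0.2, t_max=6):
    """Track I(W; X(t)) in nats for a binary PCA on an n-ring.

    W ~ Bern(1/2); given W = w, X(0) is the all-w configuration.
    Each node applies a noisy-majority channel with flip prob. delta.
    """
    configs = list(itertools.product([0, 1], repeat=n))

    def step_prob(y, x):
        # P[X(t+1) = y | X(t) = x] = product over v of K_v(y_v | ...)
        p = 1.0
        for v in range(n):
            maj = 1 if x[(v - 1) % n] + x[v] + x[(v + 1) % n] >= 2 else 0
            p *= (1 - delta) if y[v] == maj else delta
        return p

    # joint[w][x] = P[W = w, X(t) = x]; note P[W = w] = 1/2 for all t
    joint = {w: {x: 0.0 for x in configs} for w in (0, 1)}
    joint[0][(0,) * n] = 0.5
    joint[1][(1,) * n] = 0.5

    def mutual_info():
        total = 0.0
        for x in configs:
            px = joint[0][x] + joint[1][x]
            for w in (0, 1):
                if joint[w][x] > 0:
                    total += joint[w][x] * math.log(joint[w][x] / (0.5 * px))
        return total

    history = [mutual_info()]
    for _ in range(t_max):
        joint = {w: {y: sum(joint[w][x] * step_prob(y, x) for x in configs)
                     for y in configs} for w in (0, 1)}
        history.append(mutual_info())
    return history
```

By the data processing inequality applied along the chain $W \to X(0) \to X(1) \to \cdots$, the resulting sequence is nonincreasing, and in this small example it decays toward zero.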
1. Dawson’s impossibility result
Roughly speaking, Dawson proved the following: if the local channels $K_v$ are sufficiently noisy and if the graph $G$ is sufficiently sparse, then the mutual information $I(W; X_U(t))$ will decay to zero exponentially fast. In order to state his result precisely, we need some preliminary definitions. First of all, in order to quantify the noisiness of each channel $K_v$, Dawson introduces the following quantity, which he calls the transmission coefficient of $K_v$:
$$ \theta_v \triangleq \sup \frac{I(W; Y)}{I(W; X)} \qquad (2)$$
where the supremum is over all Markov chains $W \to X \to Y$, where $W$ takes values in $\{0,1\}$, the channel from $W$ to $X$ is arbitrary (subject only to $I(W;X) > 0$), and the channel from $X$ to $Y$ is given by $K_v$. Moreover, let $\theta \triangleq \sup_{v \in V} \theta_v$. Clearly, $\theta_v \le 1$ for every $v$, by the data processing inequality. Consequently, $\theta \le 1$.
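It is instructive to estimate a transmission coefficient numerically in the simplest possible case. In the sketch below (my own illustration, not Dawson’s), the input alphabet of the channel is a single bit, so $K_v$ is just a binary symmetric channel BSC($\delta$), and the supremum in (2) is approximated from below by a crude grid search over binary-input distributions:

```python
import itertools, math

def h2(p):
    """Binary entropy in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def mi(p_w1, q0, q1):
    """I(W; X) for binary W, X with P[W=1] = p_w1, P[X=1 | W=w] = q_w."""
    px1 = (1 - p_w1) * q0 + p_w1 * q1
    return h2(px1) - (1 - p_w1) * h2(q0) - p_w1 * h2(q1)

def theta_bsc(delta, grid=21):
    """Grid-search approximation (from below) of the transmission
    coefficient of BSC(delta): sup I(W;Y)/I(W;X) over chains
    W -> X -> Y with binary W, arbitrary W -> X, and the X -> Y
    channel fixed to the BSC."""
    best = 0.0
    pts = [i / (grid - 1) for i in range(grid)]
    for p_w1, q0, q1 in itertools.product(pts, repeat=3):
        iwx = mi(p_w1, q0, q1)
        if iwx < 1e-6:
            continue  # skip near-degenerate inputs (ratio tends to 0/0)
        r0 = q0 * (1 - delta) + (1 - q0) * delta  # P[Y=1 | W=0]
        r1 = q1 * (1 - delta) + (1 - q1) * delta  # P[Y=1 | W=1]
        best = max(best, mi(p_w1, r0, r1) / iwx)
    return best
```

For the BSC, the supremum is known to equal $(1-2\delta)^2$ (the strong data processing constant of the BSC), and the grid search lands close to that value; it also never exceeds $1$, in accordance with the data processing inequality.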
Remark 1 Dawson’s definition of the transmission coefficient is reminiscent of the best constant in the data processing inequality of Erkip and Cover (corrected by Anantharam et al.): to define that constant in our setting, one would fix a distribution $P_X$ of the input $X$ and let
$$ \eta(P_X, K_v) \triangleq \sup \frac{I(W; Y)}{I(W; X)} \qquad (3)$$
where the supremum is over all Markov chains $W \to X \to Y$ satisfying the constraints $X \sim P_X$ and $P_{Y|X} = K_v$. The difference is that in (2) we constrain the distribution of $W$ (it is supported on two points) and the stochastic kernel $P_{Y|X}$, whereas in (3) we constrain the marginal distribution of $X$ and the kernel $P_{Y|X}$. As shown by Anantharam et al.,
$$ \eta(P_X, K_v) = \sup_{Q_X \neq P_X} \frac{D(Q_X K_v \| P_X K_v)}{D(Q_X \| P_X)},$$
where the supremum is over all marginal distributions $Q_X$ distinct from $P_X$, and $Q_X K_v$ denotes the marginal distribution of $Y$ when $X \sim Q_X$. Both $\theta_v$ and $\eta(P_X, K_v)$ measure the “noisiness” of $K_v$, but in different ways. In particular, the motivation behind Dawson’s definition is as follows: the state of a given node $v$ at time $t+1$ depends only on $X_{\partial_+ v}(t)$ and not on anything else. Thus, if the boundary condition, i.e., the configuration of all the nodes except for $v$ and its neighbors, is “frozen” at some value $x_{V \setminus \partial_+ v}$, then the amount of information about $W$ that node $v$ will receive at time $t+1$ will be controlled by the quantity
$$ \sup \frac{I\left(W; X_v(t+1) \,\middle|\, X_{V \setminus \partial_+ v}(t) = x_{V \setminus \partial_+ v}\right)}{I\left(W; X_{\partial_+ v}(t) \,\middle|\, X_{V \setminus \partial_+ v}(t) = x_{V \setminus \partial_+ v}\right)},$$
which, by definition, is no more than $\theta_v$. We will elaborate on this point later on.
Also, Dawson assumes that $G$ has bounded degree, i.e.,
$$ d \triangleq \sup_{v \in V} |\partial v| < \infty.$$
Now we are in a position to state the following:

Theorem 1 (Dawson) For every finite set $U \subset V$ and every $t \ge 0$,
$$ I(W; X_U(t)) \le |U| \left( \theta (d+1) \right)^t \log 2. \qquad (4)$$

Thus, if $\theta < \frac{1}{d+1}$, then $I(W; X_U(t)) \to 0$ as $t \to \infty$, and the convergence is exponentially fast.
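To get a feel for the rate in the bound (4), here is a throwaway calculation; the parameter values are arbitrary, chosen only so that $\theta(d+1) < 1$:

```python
import math

theta, d, size_U = 0.1, 2, 5            # contraction rate theta*(d+1) = 0.3

def bound(t):
    """Right-hand side of (4): |U| * (theta*(d+1))**t * log 2."""
    return size_U * (theta * (d + 1)) ** t * math.log(2)

# first time at which the bound certifies I(W; X_U(t)) < 1e-6 nats
t_star = next(t for t in range(1000) if bound(t) < 1e-6)
print(t_star)  # -> 13
```

So with these (made-up) numbers, after a dozen or so steps the five observed nodes provably retain less than a microdit of information about the stored bit.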
2. The proof
The proof of Theorem 1 consists of several steps. The first step gives an upper bound on the information value of the state at a given node if we have already observed the states of some other nodes. This bound is stated in Lemmas 2 and 3 below.

Lemma 2 Fix a vertex $v \in V$, a time $t$, and two (possibly empty) finite sets $S \subset V$ and $T \subset V \setminus \{v\}$. Then
$$ I\left(W; X_v(t+1) \,\middle|\, X_T(t+1), X_S(t)\right) \le \theta_v\, I\left(W; X_{\partial_+ v}(t) \,\middle|\, X_T(t+1), X_S(t)\right). \qquad (5)$$

Proof: The key observation is that, conditioned on any realization $X_S(t) = x_S$ and $X_T(t+1) = y_T$,
$$ W \to X_{\partial_+ v}(t) \to X_v(t+1) \qquad (6)$$
is a Markov chain, and the channel from $X_{\partial_+ v}(t)$ to $X_v(t+1)$ is still given by $K_v$; this is because the noise driving node $v$ at time $t+1$ is independent of everything else, so that, given $X_{\partial_+ v}(t)$, the state $X_v(t+1)$ is conditionally independent of $(W, X_S(t), X_T(t+1))$. Now, for any fixed choices of the configurations $x_S$ and $y_T$, consider two probability distributions $P^0_{x_S, y_T}$ and $P^1_{x_S, y_T}$ of $X_{\partial_+ v}(t)$:
$$ P^w_{x_S, y_T}(\cdot) \triangleq {\mathbb P}\left[ X_{\partial_+ v}(t) = \cdot \,\middle|\, W = w,\, X_S(t) = x_S,\, X_T(t+1) = y_T \right], \qquad w \in \{0,1\}.$$
From (6), if we consider a channel from $W$ to $X_{\partial_+ v}(t)$ with transition probabilities given by $P^w_{x_S, y_T}$, and then let $X_v(t+1)$ be the output of the channel $K_v$ with input $X_{\partial_+ v}(t)$, then we reproduce exactly the conditional joint law of $(W, X_{\partial_+ v}(t), X_v(t+1))$, and $W \to X_{\partial_+ v}(t) \to X_v(t+1)$ is a Markov chain satisfying the conditions in the definition (2) of the transmission coefficient $\theta_v$. Therefore,
$$ I\left(W; X_v(t+1) \,\middle|\, X_S(t) = x_S, X_T(t+1) = y_T\right) \le \theta_v\, I\left(W; X_{\partial_+ v}(t) \,\middle|\, X_S(t) = x_S, X_T(t+1) = y_T\right).$$
Taking the expectation with respect to $X_S(t)$ and $X_T(t+1)$, we obtain (5). $\Box$
Lemma 3 Fix a time $t$, a finite set $U = \{u_1, \ldots, u_n\} \subset V$, and a finite set $S \subset V$. Then
$$ I\left(W; X_U(t+1) \,\middle|\, X_S(t)\right) \le \theta \sum_{i=1}^n \left[ I\left(W; X_{\partial_+ u_i}(t) \,\middle|\, X_S(t)\right) + I\left(W; X_{U_{i-1}}(t+1) \,\middle|\, X_{\partial_+ u_i}(t), X_S(t)\right) - I\left(W; X_{U_{i-1}}(t+1) \,\middle|\, X_S(t)\right) \right],$$
where $U_{i-1} \triangleq \{u_1, \ldots, u_{i-1}\}$, with $U_0 = \varnothing$.

Proof: Using the chain rule for mutual information and Lemma 2 (applied to $v = u_i$ with $T = U_{i-1}$),
$$ I\left(W; X_U(t+1) \,\middle|\, X_S(t)\right) = \sum_{i=1}^n I\left(W; X_{u_i}(t+1) \,\middle|\, X_{U_{i-1}}(t+1), X_S(t)\right) \le \theta \sum_{i=1}^n I\left(W; X_{\partial_+ u_i}(t) \,\middle|\, X_{U_{i-1}}(t+1), X_S(t)\right).$$
Now let’s examine the mutual information $I(W; X_{\partial_+ u_i}(t) \mid X_{U_{i-1}}(t+1), X_S(t))$. Let us use the chain rule to write the conditional mutual information $I(W; X_{\partial_+ u_i}(t), X_{U_{i-1}}(t+1) \mid X_S(t))$ in two ways:
$$\begin{aligned} I\left(W; X_{\partial_+ u_i}(t), X_{U_{i-1}}(t+1) \,\middle|\, X_S(t)\right) &= I\left(W; X_{U_{i-1}}(t+1) \,\middle|\, X_S(t)\right) + I\left(W; X_{\partial_+ u_i}(t) \,\middle|\, X_{U_{i-1}}(t+1), X_S(t)\right) \\ &= I\left(W; X_{\partial_+ u_i}(t) \,\middle|\, X_S(t)\right) + I\left(W; X_{U_{i-1}}(t+1) \,\middle|\, X_{\partial_+ u_i}(t), X_S(t)\right). \end{aligned}$$
Therefore, we can write
$$ I\left(W; X_{\partial_+ u_i}(t) \,\middle|\, X_{U_{i-1}}(t+1), X_S(t)\right) = I\left(W; X_{\partial_+ u_i}(t) \,\middle|\, X_S(t)\right) + I\left(W; X_{U_{i-1}}(t+1) \,\middle|\, X_{\partial_+ u_i}(t), X_S(t)\right) - I\left(W; X_{U_{i-1}}(t+1) \,\middle|\, X_S(t)\right).$$
Substituting this into our bound on $I(W; X_U(t+1) \mid X_S(t))$, we obtain the statement of the lemma. $\Box$
The main message of Lemma 3 can be stated more succinctly if we introduce the shorthand notation
$$ I_{t+1|t}(U \mid S) \triangleq I\left(W; X_U(t+1) \,\middle|\, X_S(t)\right) \qquad (7)$$
for all finite, possibly empty sets $U, S \subset V$ and for all $t$, where we also let
$$ I_t(U \mid S) \triangleq I\left(W; X_U(t) \,\middle|\, X_S(t)\right) \qquad (8)$$
for all finite sets $U, S \subset V$. A nice way to think about the above definitions is in terms of prediction and correction steps in Bayesian filtering: Eq. (7) corresponds to prediction since it quantifies the amount of information about $W$ that the nodes in $U$ will contain at time $t+1$ given what we know about the state of the nodes in some other set $S$ at time $t$, while Eq. (8) quantifies the impact of knowing the state of the nodes in $S$ at the same time $t$, and so represents correction. Moreover, we can write the bound of Lemma 3 as
$$ I_{t+1|t}(U \mid S) \le \theta \sum_{i=1}^n \left[ I_t(\partial_+ u_i \mid S) + I_{t+1|t}(U_{i-1} \mid \partial_+ u_i \cup S) - I_{t+1|t}(U_{i-1} \mid S) \right] \qquad (9)$$
for any $t$, any finite $U = \{u_1, \ldots, u_n\} \subset V$, and any finite $S \subset V$.
The second step of the proof is to show that the exponential decay of information, according to Eq. (4), is a consequence of (9). To that end, Dawson employs a clever inductive argument. Let ${\mathcal F}$ denote the collection of all finite subsets of $V$, and for any $t$ define a set function $\Phi_t : {\mathcal F} \to {\mathbb R}^+$ by
$$ \Phi_t(U) \triangleq I\left(W; X_U(t)\right), \qquad \Phi_t(\varnothing) \triangleq 0.$$
The quantity defined in Eq. (8) can be computed in terms of $\Phi_t$: indeed, by the chain rule we have
$$ I_t(U \mid S) = \Phi_t(U \cup S) - \Phi_t(S).$$
In addition, if we know the prediction quantities in Eq. (7), we can also compute $\Phi_{t+1}$:
$$ \Phi_{t+1}(U) = I_{t+1|t}(U \mid \varnothing).$$
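Both identities are elementary consequences of the chain rule, but it is easy to sanity-check the first one numerically. The sketch below (entirely my own toy example: a random joint distribution of $(W, X_a, X_b, X_c)$ on $\{0,1\}^4$) verifies that $I_t(U \mid S) = \Phi_t(U \cup S) - \Phi_t(S)$:

```python
import itertools, math, random

rng = random.Random(42)

# A random joint distribution of (W, X_a, X_b, X_c) on {0,1}^4,
# standing in for the bit W and a three-node configuration X.
atoms = list(itertools.product([0, 1], repeat=4))
raw = [rng.random() for _ in atoms]
joint = {a: r / sum(raw) for a, r in zip(atoms, raw)}

def mi_w(idx, dist):
    """I(W; X_idx) under dist (a dict atom -> weight, renormalized)."""
    z = sum(dist.values())
    pj, pw, pu = {}, {}, {}
    for a, q in dist.items():
        q /= z
        key = tuple(a[i] for i in idx)
        pj[(a[0], key)] = pj.get((a[0], key), 0.0) + q
        pw[a[0]] = pw.get(a[0], 0.0) + q
        pu[key] = pu.get(key, 0.0) + q
    return sum(q * math.log(q / (pw[wv] * pu[k]))
               for (wv, k), q in pj.items() if q > 0)

def phi(idx):
    """Phi(U) = I(W; X_U); idx lists coordinate positions (1, 2 or 3)."""
    return mi_w(idx, joint)

def cond_mi(u_idx, s_idx):
    """I(W; X_U | X_S), computed directly from its definition."""
    out = 0.0
    for s in itertools.product([0, 1], repeat=len(s_idx)):
        sub = {a: q for a, q in joint.items()
               if tuple(a[i] for i in s_idx) == s}
        ps = sum(sub.values())
        if ps > 0:
            out += ps * mi_w(u_idx, sub)
    return out

# chain rule: I(W; X_U | X_S) = Phi(U u S) - Phi(S)
U, S = (1, 2), (3,)
assert abs(cond_mi(U, S) - (phi(U + S) - phi(S))) < 1e-9
```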
Moreover, using our knowledge of $\Phi_t$ and the fact that $I_t(U \mid \varnothing) = \Phi_t(U)$ for all $U \in {\mathcal F}$, we can do the following. Let us fix an arbitrary node $u_1 \in V$. Then, from (9), we derive
$$ \Phi_{t+1}(\{u_1\}) = I_{t+1|t}(\{u_1\} \mid \varnothing) \le \theta\, I_t(\partial_+ u_1 \mid \varnothing) = \theta\, \Phi_t(\partial_+ u_1),$$
and, more generally, $I_{t+1|t}(\{u_1\} \mid S) \le \theta\, I_t(\partial_+ u_1 \mid S)$ for any finite $S$. Fixing another node $u_2$, we have
$$\begin{aligned} \Phi_{t+1}(\{u_1, u_2\}) &\le \theta \left[ \Phi_t(\partial_+ u_1) + \Phi_t(\partial_+ u_2) + I_{t+1|t}(\{u_1\} \mid \partial_+ u_2) - \Phi_{t+1}(\{u_1\}) \right] \\ &\le \theta\, \Phi_t(\partial_+ u_1) + \theta\, \Phi_t(\partial_+ u_2) + \theta^2\, I_t(\partial_+ u_1 \mid \partial_+ u_2) \\ &= \theta\, \Phi_t(\partial_+ u_1) + \theta (1-\theta)\, \Phi_t(\partial_+ u_2) + \theta^2\, \Phi_t(\partial_+ u_1 \cup \partial_+ u_2), \end{aligned}$$
where the first step uses (9); the second uses our bound on $I_{t+1|t}(\{u_1\} \mid \partial_+ u_2)$ and discards the nonpositive term $-\theta\,\Phi_{t+1}(\{u_1\})$; in the third, we add and subtract $\theta^2\, \Phi_t(\partial_+ u_2)$; and the last step uses the chain rule, $\Phi_t(\partial_+ u_2) + I_t(\partial_+ u_1 \mid \partial_+ u_2) = \Phi_t(\partial_+ u_1 \cup \partial_+ u_2)$. Continuing inductively, we see that, for any $U = \{u_1, \ldots, u_n\} \in {\mathcal F}$,
$$ \Phi_{t+1}(U) \le \theta\, \Phi_t(\partial_+ U), \qquad \partial_+ U \triangleq \bigcup_{v \in U} \partial_+ v. \qquad (10)$$
This can be interpreted as follows: if we define the specific information at time $t$ by
$$ \bar{I}(t) \triangleq \sup_{U \in {\mathcal F}:\, U \neq \varnothing} \frac{\Phi_t(U)}{|U|}, \qquad (11)$$
then $\Phi_t(U) \le |U|\, \bar{I}(t)$ for every nonempty $U \in {\mathcal F}$. Now, we write
$$ \Phi_{t+1}(U) \le \theta\, \Phi_t(\partial_+ U) \le \theta\, |\partial_+ U|\, \bar{I}(t) \le \theta \sum_{v \in U} |\partial_+ v|\; \bar{I}(t) \le \theta\, (d+1)\, |U|\, \bar{I}(t),$$
where the first step follows from (10); the second uses (11); the third step is by (over)counting, since $|\partial_+ U| \le \sum_{v \in U} |\partial_+ v|$; the fourth step uses the fact that $|\partial_+ v| \le d+1$ for every $v \in V$; and the rest is simple arithmetic. Dividing by $|U|$ and taking the supremum over all nonempty $U \in {\mathcal F}$, we have shown that the specific information at time $t+1$ can be bounded as $\bar{I}(t+1) \le \theta (d+1)\, \bar{I}(t)$. Thus, if $\theta (d+1) < 1$, then in one step the specific information will strictly decrease. Iterating, we get the bound
$$ \bar{I}(t) \le \left( \theta (d+1) \right)^t \bar{I}(0) \le \left( \theta (d+1) \right)^t \log 2,$$
where the last step uses the fact that $\Phi_0(U) = I(W; X_U(0)) \le H(W) \le \log 2$ for every $U$, so that $\bar{I}(0) \le \log 2$.
Since $I(W; X_U(t)) = \Phi_t(U) \le |U|\, \bar{I}(t)$, this establishes the bound (4). In particular, if $\theta (d+1) < 1$, then $I(W; X_U(t)) \to 0$ as $t \to \infty$ for every finite $U$. This completes the proof of Theorem 1. $\Box$