# The Information Structuralist

## Information flow on graphs

Posted in Information Theory, Models of Complex Stochastic Systems, Probability by mraginsky on May 3, 2014

Models of complex systems built from simple, locally interacting components arise in many fields, including statistical physics, biology, artificial intelligence, and communication networks. The quest to understand and to quantify the fundamental limits on the ability of such systems to store and process information has led to a variety of interesting and insightful results that draw upon probability, combinatorics, information theory, and the theory of discrete and continuous dynamical systems. In this post, I would like to focus on a model of distributed storage that was analyzed in 1975 by Donald Dawson in a very nice paper, which deserves to be more widely known.

The system Dawson analyzed is an instance of a Probabilistic Cellular Automaton (PCA), i.e., a system consisting of a discrete collection of nodes, each of which has a discrete state. The configuration of the automaton at each time is specified by listing the state of each node. The state of each node at time ${t+1}$ is determined stochastically by its own state, as well as by the states of its neighbors, at time ${t}$, independently of everything else. One of the key questions in the theory of PCAs is whether a given automaton is ergodic, i.e., whether its configuration eventually “forgets” the initial condition. As we will see, Dawson’s result gives a sufficient condition for ergodicity of a PCA.

Now let’s move on to precise definitions. Consider an undirected graph ${G = ({\cal V},{\cal E})}$. The neighborhood of a vertex ${v \in {\cal V}}$, denoted by ${\partial v}$, is the set of all vertices ${u \in {\cal V}}$ that are connected to ${v}$ by an edge. We will also denote the set consisting of ${v}$ and all of its neighbors by ${\partial_+ v}$. Let us fix a finite alphabet ${{\mathsf X}}$; for each vertex ${v \in {\cal V}}$, let ${{\mathsf X}_v}$ denote a copy of ${{\mathsf X}}$. A configuration is an assignment of a symbol ${x_v \in {\mathsf X}_v}$ to each vertex ${v \in {\cal V}}$. Thus, ${\Omega = {\mathsf X}^{\cal V}}$ is the space of all configurations. We will denote a generic configuration by ${x = (x_v)_{v \in {\cal V}}}$; for any set ${{\cal J} \subset {\cal V}}$, we will denote by ${x_{\cal J}}$ the natural projection of ${x}$ onto ${{\mathsf X}^{\cal J} = \prod_{v \in {\cal J}}{\mathsf X}_v}$: ${x_{\cal J} = (x_v)_{v \in {\cal J}}}$.
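To make the notation concrete, here is a minimal Python sketch of these objects on a small graph; the path graph, the binary alphabet, and all function names are my own illustrative choices, not anything from Dawson's paper:

```python
from itertools import product

# A small undirected graph G = (V, E): a path on four vertices.
V = ["a", "b", "c", "d"]
E = {("a", "b"), ("b", "c"), ("c", "d")}

def neighborhood(v):
    """The neighborhood of v: all vertices joined to v by an edge."""
    return {u for u in V if (u, v) in E or (v, u) in E}

def closed_neighborhood(v):
    """The set consisting of v together with all of its neighbors."""
    return {v} | neighborhood(v)

# With the alphabet X = {0, 1}, a configuration assigns a symbol to every
# vertex, so the configuration space Omega = X^V has |X|^|V| elements.
X = (0, 1)
Omega = [dict(zip(V, symbols)) for symbols in product(X, repeat=len(V))]

def project(x, J):
    """The natural projection x_J of a configuration x onto a subset J."""
    return {v: x[v] for v in J}
```

Here, for instance, ${\partial_+ b = \{a,b,c\}}$ and ${|\Omega| = 2^4 = 16}$.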

Now let’s consider the following model of a distributed one-bit memory cell. Let ${W}$ be a ${{\rm Bernoulli}(1/2)}$ random variable, which is the bit we wish to store. We also pick two probability measures ${\mu_0}$ and ${\mu_1}$ on ${\Omega}$. Given ${W=w}$, we draw the initial ${(t=1)}$ configuration ${X_1 = (X_{v,1})_{v \in {\cal V}}}$ at random according to ${\mu_w}$. We allow arbitrary correlations between the different ${X_{v,1}}$‘s, so our storage is distributed in this sense. Now let’s suppose that the state of each vertex evolves stochastically in discrete time according to a local Markov model: for each ${v}$, we have a channel (random transformation) ${K_v}$ with input alphabet ${{\mathsf X}^{\partial_+ v} = \prod_{u \in \partial_+ v} {\mathsf X}_u}$ and output alphabet ${{\mathsf X}_v}$. Then the configuration ${X_t = (X_{v,t})_{v \in {\cal V}}}$ at time ${t}$ evolves according to the law

$\displaystyle {\mathbb P}[X_{t+1}=x_{t+1}|W=w,X^t = x^t] = \prod_{v \in {\cal V}}K_v(x_{v,t+1}|x_{\partial_+ v,t}), \qquad t=1,2,\ldots \ \ \ \ \ (1)$

with the initial conditions ${{\mathbb P}[W=w]=1/2}$, ${w \in \{0,1\}}$, and ${{\mathbb P}[X_1 = x_1|W=w]=\mu_w(x_1)}$, ${x_1 \in \Omega}$. Thus, we end up with a Markov chain

$\displaystyle W \longrightarrow X_1 \longrightarrow X_2 \longrightarrow X_3 \longrightarrow \ldots,$

where at each time ${t}$ the evolution from ${X_t}$ to ${X_{t+1}}$ is also Markov “in space”: the state of vertex ${v}$ at time ${t+1}$ is determined only by the states of ${v}$ and its neighbors at time ${t}$, i.e., for each ${v}$ and ${t}$, we have a Markov chain

$\displaystyle \left( W; X^{t-1}; X_{{\cal V} \backslash \partial_+ v,t} \right) \longrightarrow X_{\partial_+ v,t} \longrightarrow X_{v,t+1}.$
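The dynamics (1) are easy to simulate. Below is a minimal sketch of the whole memory cell; the cycle graph, the point-mass initial measures ${\mu_0,\mu_1}$ (all zeros vs. all ones), and the particular local channel (majority vote over ${\partial_+ v}$, flipped with probability ${\delta}$) are all my own illustrative choices — Dawson's setup allows arbitrary ${\mu_w}$ and ${K_v}$:

```python
import random

def cycle_neighborhoods(n):
    """Closed neighborhoods (v and its two neighbors) on the n-cycle."""
    return {v: [(v - 1) % n, v, (v + 1) % n] for v in range(n)}

def K(neighborhood_states, delta, rng):
    """A local channel K_v: majority vote over the closed neighborhood,
    with the output flipped with probability delta."""
    vote = int(2 * sum(neighborhood_states) > len(neighborhood_states))
    return vote ^ int(rng.random() < delta)

def run_memory_cell(n=5, T=10, delta=0.3, rng=None):
    """Draw W ~ Bernoulli(1/2), X_1 ~ mu_W, then evolve according to Eq. (1)."""
    rng = rng or random.Random()
    W = rng.randrange(2)
    x = [W] * n                  # mu_w = point mass on the all-w configuration
    nbhd = cycle_neighborhoods(n)
    for _ in range(T - 1):
        # all vertices update simultaneously, independently given X_t
        x = [K([x[u] for u in nbhd[v]], delta, rng) for v in range(n)]
    return W, x
```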

Now the question is: does the configuration at an arbitrary time ${t}$ retain any information about ${W}$? One way to answer this is to pick an arbitrary finite set ${{\cal J} \subset {\cal V}}$ and to examine the evolution of the mutual information ${I(W; X_{{\cal J},t})}$, where

$\displaystyle X_{{\cal J},t} = (X_{v,t})_{v \in {\cal J}}.$

If we can show that there exists some ${\varepsilon > 0}$ such that ${I(W; X_{{\cal J},t}) \ge \varepsilon}$ for all ${t}$ and at least one nonempty ${{\cal J}}$, then our memory cell is capable of storing one bit. If, on the other hand, ${I(W; X_{{\cal J},t}) \rightarrow 0}$ as ${t \rightarrow \infty}$ for every finite ${{\cal J}}$, then eventually the memory cell will forget ${W}$.
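This question can be probed empirically. The sketch below Monte-Carlo-estimates ${I(W; X_{v,t})}$ for a single vertex on a 3-cycle, using noisy majority-vote channels and a plug-in estimate of mutual information; every concrete choice here (graph, channels, flip probability) is mine, for illustration only:

```python
import math
import random

def step(x, delta, rng):
    """One synchronous update on the 3-cycle: majority vote, then flip w.p. delta."""
    n = len(x)
    out = []
    for v in range(n):
        nb = [x[(v - 1) % n], x[v], x[(v + 1) % n]]
        out.append(int(sum(nb) >= 2) ^ int(rng.random() < delta))
    return out

def mi_estimate(t, delta=0.3, trials=20000, seed=1):
    """Plug-in estimate of I(W; X_{v,t}) in bits, for the single vertex v = 0."""
    rng = random.Random(seed)
    counts = {(w, s): 0 for w in (0, 1) for s in (0, 1)}
    for _ in range(trials):
        W = rng.randrange(2)
        x = [W, W, W]            # X_1 ~ mu_W: all zeros or all ones
        for _ in range(t - 1):
            x = step(x, delta, rng)
        counts[(W, x[0])] += 1
    info = 0.0
    for (w, s), c in counts.items():
        if c == 0:
            continue
        p = c / trials
        pw = (counts[(w, 0)] + counts[(w, 1)]) / trials
        ps = (counts[(0, s)] + counts[(1, s)]) / trials
        info += p * math.log2(p / (pw * ps))
    return info
```

On this instance ${I(W; X_{v,1})}$ is exactly one bit (the initial configuration determines ${W}$), while the estimate at, say, ${t=10}$ is close to zero: the cell forgets.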

1. Dawson’s impossibility result

Roughly speaking, Dawson proved the following: if the local channels ${K_v}$ are sufficiently noisy and if the graph ${G}$ is sufficiently sparse, then the mutual information ${I(W; X_{{\cal J},t})}$ will decay to zero exponentially fast. In order to state his result precisely, we need some preliminary definitions. First of all, in order to quantify the noisiness of each channel ${K_v}$, Dawson introduces the following quantity, which he calls the transmission coefficient of ${K_v}$:

$\displaystyle \eta_v := \sup_{W \rightarrow Y \rightarrow Z} \frac{I(W; Z)}{I(W; Y)}, \ \ \ \ \ (2)$

where the supremum is over all Markov chains ${W \rightarrow Y \rightarrow Z}$ in which ${W \sim {\rm Bernoulli}(1/2)}$, the channel from ${W}$ to ${Y \in {\mathsf X}^{\partial_+ v}}$ is arbitrary, and the channel from ${Y}$ to ${Z \in {\mathsf X}_v}$ is given by ${K_v}$. Moreover, let ${\eta := \sup_{v \in {\cal V}} \eta_v}$. By the data processing inequality, ${\eta_v \le 1}$ for every ${v \in {\cal V}}$, and consequently ${\eta \le 1}$.
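For intuition, ${\eta_v}$ can be estimated numerically. The sketch below grid-searches (2) in the simplest possible case, where the “neighborhood” is a single binary symbol and ${K_v}$ is a binary symmetric channel ${{\rm BSC}(\delta)}$; in that case the supremum is known to be ${(1-2\delta)^2}$, the contraction coefficient of the BSC, which the search approaches from below. The setup and all names are my own illustration, not Dawson's:

```python
import math

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mi(a, b):
    """I(W; Y) for W ~ Bernoulli(1/2), P(Y=1|W=0) = a, P(Y=1|W=1) = b."""
    return h((a + b) / 2) - (h(a) + h(b)) / 2

def eta_bsc(delta, grid):
    """Grid search over binary channels W -> Y, with Z = BSC(delta) applied to Y."""
    best = 0.0
    for a in grid:
        for b in grid:
            iy = mi(a, b)
            if iy < 1e-12:
                continue          # a == b: Y carries no information about W
            # composing with BSC(delta): P(Z=1|W=w) = (1-2*delta)*P(Y=1|W=w) + delta
            iz = mi((1 - 2 * delta) * a + delta, (1 - 2 * delta) * b + delta)
            best = max(best, iz / iy)
    return best
```

On a grid of step 0.01, the estimate for ${\delta = 0.1}$ comes out just below ${(1 - 2\cdot 0.1)^2 = 0.64}$, while ${{\rm BSC}(1/2)}$ gives ${\eta = 0}$: a totally noisy channel transmits nothing.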

Remark 1 Dawson’s definition of the transmission coefficient is reminiscent of the best constant in the data processing inequality of Erkip and Cover (corrected by Anantharam et al.): to define that constant in our setting, one would fix a distribution ${P_Y}$ and let

$\displaystyle \delta^*_v(P_Y) := \sup_{U \rightarrow Y \rightarrow Z} \frac{I(U; Z)}{I(U;Y)}, \ \ \ \ \ (3)$

where the supremum is over all Markov chains ${U \rightarrow Y \rightarrow Z}$ satisfying the constraints ${Y \sim P_Y}$ and ${P_{Z|Y} = K_v}$. The difference is that in (2) we constrain the marginal distribution of ${W}$ and the stochastic kernel ${P_{Z|Y}}$, whereas in (3) we constrain the marginal of ${Y}$ and the kernel ${P_{Z|Y}}$. As shown by Anantharam et al.,

$\displaystyle \delta^*_v(P_Y) = \sup_{Q_Y \neq P_Y} \frac{D(Q_Z \| P_Z)}{D(Q_Y \| P_Y)},$

where the supremum is over all marginal distributions ${Q_Y}$ of ${Y}$ distinct from ${P_Y}$, and ${Q_Z}$ denotes the marginal distribution of ${Z}$ when ${Y \sim Q_Y}$. Both ${\eta_v}$ and ${\delta^*_v}$ measure the “noisiness” of ${K_v}$, but in different ways. In particular, the motivation behind Dawson’s definition is as follows: At any time ${t}$, the state ${X_{v,t+1}}$ of a given node at time ${t+1}$ depends only on ${X_{\partial_+v,t}}$ and not on anything else. Thus, if the boundary condition, i.e., the configuration of all the nodes except for ${v}$ and its neighbors, is “frozen” at some value ${x_{{\cal V}\backslash\partial_+v}}$, then the amount of information about ${W}$ that node ${v}$ will receive at time ${t+1}$ will be controlled by the quantity

$\displaystyle \frac{I(W; X_{v,t+1} | X_{{\cal V}\backslash \partial_+v,t} = x_{{\cal V}\backslash\partial_+v})}{I(W; X_{\partial_+v,t} | X_{{\cal V}\backslash \partial_+v,t} = x_{{\cal V}\backslash\partial_+v})},$

which, by definition, is no more than ${\eta_v}$. We will elaborate on this point later on.

Also, Dawson assumes that ${G}$ has bounded degree, i.e.,

$\displaystyle \Delta := \sup_{v \in {\cal V}}|\partial_+ v| < \infty.$

Now we are in a position to state the following:

Theorem 1 (Dawson) For any finite set ${{\cal J} \subset {\cal V}}$,

$\displaystyle I(W; X_{{\cal J},t}) \le |{\cal J}| (\Delta\eta)^{t-1} \log 2. \ \ \ \ \ (4)$

Thus, if ${\Delta \eta < 1}$, then ${I(W; X_{{\cal J},t}) \rightarrow 0}$ as ${t \rightarrow \infty}$, and the convergence is exponentially fast.
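To get a feel for the condition ${\Delta\eta < 1}$, consider (my example, not Dawson's) the two-dimensional lattice ${{\mathbb Z}^2}$, where each vertex has four neighbors, so ${\Delta = |\partial_+ v| = 5}$. If each ${K_v}$ is some deterministic update rule composed with a binary symmetric channel ${{\rm BSC}(\delta)}$, then ${\eta \le (1-2\delta)^2}$, the contraction coefficient of the BSC, and the theorem guarantees forgetting whenever

$\displaystyle 5(1-2\delta)^2 < 1 \quad\Longleftrightarrow\quad \delta > \frac{1}{2}\left(1 - \frac{1}{\sqrt{5}}\right) \approx 0.276.$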

2. The proof

The proof of Theorem 1 consists of several steps. The first step gives an upper bound on the information value of the state at a given node when the states of some other nodes have already been observed. This bound is stated in Lemmas 2 and 3 below.

Lemma 2 Fix two arbitrary, possibly empty finite sets ${{\cal J},{\cal K} \subset {\cal V}}$. Then, for any ${v \not\in {\cal J}}$,

$\displaystyle I(W; X_{v,t+1} | X_{{\cal J},t+1}, X_{{\cal K},t}) \le \eta I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1}, X_{{\cal K},t}). \ \ \ \ \ (5)$

Proof: We begin by noting that, since ${v \not\in {\cal J}}$,

$\displaystyle (W,X_{{\cal K} \, \cup\, \partial_+ v,t},X_{{\cal J},t+1}) \rightarrow X_{\partial_+ v,t} \rightarrow X_{v,t+1} \ \ \ \ \ (6)$

is a Markov chain. Now, for any fixed choices of the configurations ${x_{{\cal K},t}}$ and ${x_{{\cal J},t+1}}$, consider two probability distributions ${\nu_0}$ and ${\nu_1}$ of ${Y \in {\mathsf X}^{\partial_+ v}}$:

$\displaystyle \nu_w(y) := {\mathbb P}\Big[ Y = y\Big| W=w, X_{{\cal K},t} = x_{{\cal K},t}, X_{{\cal J},t+1} = x_{{\cal J},t+1}\Big], \quad w \in \{0,1\}.$

From (6), if we consider a channel from ${W}$ to ${Y}$ with transition probabilities given by ${\nu_w}$, and then let ${Z \in {\mathsf X}_v}$ be the output of ${K_v}$ with input ${Y}$, then

$\displaystyle \begin{array}{rcl} I(W;Y) &=& I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1} = x_{{\cal J},t+1}, X_{{\cal K},t}=x_{{\cal K},t}) \\ I(W;Z) &=& I(W; X_{v,t+1} | X_{{\cal J},t+1} = x_{{\cal J},t+1}, X_{{\cal K},t} = x_{{\cal K},t}), \end{array}$

and ${W \rightarrow Y \rightarrow Z}$ is a Markov chain satisfying the conditions in the definition (2) of the transmission coefficient ${\eta_v}$. Therefore,

$\displaystyle \begin{array}{rcl} && I(W; X_{v,t+1} | X_{{\cal J},t+1} = x_{{\cal J},t+1}, X_{{\cal K},t} = x_{{\cal K},t}) \nonumber\\ &&\quad \le \eta_v I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1} = x_{{\cal J},t+1}, X_{{\cal K},t}=x_{{\cal K},t}) \\ &&\quad \le \eta I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1} = x_{{\cal J},t+1}, X_{{\cal K},t}=x_{{\cal K},t}) . \end{array}$

Taking the expectation with respect to ${X_{{\cal J},t+1}}$ and ${X_{{\cal K},t}}$, we obtain (5). $\Box$

Lemma 3 For any finite set ${{\cal J}\subset{\cal V}}$, any ${v \not\in {\cal J}}$, and an arbitrary (possibly empty) finite set ${{\cal K}}$,

$\displaystyle \begin{array}{rcl} I(W; X_{{\cal J} \cup \{v\},t+1}|X_{{\cal K},t}) &\le& \eta I(W; X_{{\cal J},t+1} | X_{{\cal K}\, \cup\, \partial_+ v,t}) \nonumber\\ && \qquad \qquad + \eta I(W; X_{\partial_+ v,t}|X_{{\cal K},t}) \nonumber\\ && \qquad \qquad + (1-\eta) I(W; X_{{\cal J},t+1}|X_{{\cal K},t}). \end{array}$

Proof: Using the chain rule for mutual information and Lemma 2,

$\displaystyle \begin{array}{rcl} I(W; X_{{\cal J} \cup \{v\},t+1}|X_{{\cal K},t}) &=& I(W; X_{{\cal J},t+1} | X_{{\cal K},t}) + I(W; X_{v,t+1} | X_{{\cal J},t+1}, X_{{\cal K},t}) \nonumber \\ &\le& I(W; X_{{\cal J},t+1} | X_{{\cal K},t}) + \eta I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1},X_{{\cal K},t}). \end{array}$

Now let’s examine the mutual information ${ I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1},X_{{\cal K},t})}$. Let us use the chain rule to write the conditional mutual information ${I(W; X_{\partial_+ v,t}, X_{{\cal J},t+1} | X_{{\cal K},t})}$ in two ways:

$\displaystyle \begin{array}{rcl} I(W; X_{\partial_+ v,t}, X_{{\cal J},t+1} | X_{{\cal K},t}) &=& I(W; X_{\partial_+ v,t} | X_{{\cal K},t}) + I(W; X_{{\cal J},t+1} | X_{{\cal K} \cup \partial_+ v,t}) \\ &=& I(W; X_{{\cal J},t+1} | X_{{\cal K},t}) + I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1},X_{{\cal K},t}). \end{array}$

Therefore, we can write

$\displaystyle I(W; X_{\partial_+ v,t} | X_{{\cal J},t+1},X_{{\cal K},t}) = I(W; X_{\partial_+ v,t} | X_{{\cal K},t}) + I(W; X_{{\cal J},t+1} | X_{{\cal K} \cup \partial_+ v,t}) - I(W; X_{{\cal J},t+1} | X_{{\cal K},t}).$

Substituting this into our bound on ${I(W; X_{{\cal J} \cup \{v\},t+1}|X_{{\cal K},t})}$, we obtain the statement of the lemma. $\Box$

The main message of Lemma 3 can be stated more succinctly if we introduce shorthand notation

$\displaystyle I^+_t({\cal J} |{\cal K}) := I(W; X_{{\cal J},t} | X_{{\cal K},t-1}) \ \ \ \ \ (7)$

and

$\displaystyle I_t({\cal J} | {\cal K}) := I(W; X_{{\cal J},t} | X_{{\cal K},t}) \ \ \ \ \ (8)$

for all finite, possibly empty sets ${{\cal J},{\cal K}\subset{\cal V}}$ and for all ${t=1,2,\ldots}$, where we also let

$\displaystyle \begin{array}{rcl} && I^+_{t}(\emptyset|{\cal K}) = I_t(\emptyset|{\cal K}) = 0;\\ && I^+_{t}({\cal J} | \emptyset) = I(W; X_{{\cal J},t}); \\ && I^+_1({\cal J} |{\cal K}) = I(W; X_{{\cal J},1}) \end{array}$

for all finite sets ${{\cal J},{\cal K} \subset {\cal V}}$. A nice way to think about the above definitions is in terms of prediction and correction steps in Bayesian filtering: Eq. (7) corresponds to prediction since it quantifies the amount of information about ${W}$ that the nodes in ${{\cal J}}$ will contain at time ${t}$ given what we know about the state of the nodes in some other set ${{\cal K}}$ at time ${t-1}$, while Eq. (8) quantifies the impact of knowing the state of the nodes in ${{\cal K}}$ at time ${t}$, and so represents correction. Moreover, we can write the bound of Lemma 3 as

$\displaystyle I^+_{t+1}({\cal J} \cup \{v\} |{\cal K}) \le (1-\eta) I^+_{t+1}({\cal J} | {\cal K}) + \eta I^+_{t+1}({\cal J} | {\cal K} \cup \partial_+ v) + \eta I_t (\partial_+ v|{\cal K}), \ \ \ \ \ (9)$

for any ${{\cal J},{\cal K}}$, any ${v \not\in{\cal J}}$, and any ${t}$.

The second step of the proof is to show that the exponential decay of information, according to Eq. (4), is a consequence of (9). To that end, Dawson employs a clever inductive argument. Let ${2^{{\cal V}}_0}$ denote the collection of all finite subsets of ${{\cal V}}$, and for any ${t}$ define a set function ${I_t : 2^{{\cal V}}_0 \rightarrow {\mathbb R}}$ by

$\displaystyle I_t({\cal J}) := I(W; X_{{\cal J},t}) \equiv I\big(W; X_{v,t}, v \in {\cal J}\big).$

The quantity ${I_t(\cdot|\cdot)}$ defined in Eq. (8) can be computed in terms of ${I_t(\cdot)}$: indeed, by the chain rule we have

$\displaystyle \begin{array}{rcl} I_t({\cal J}|{\cal K}) &=& I(W; X_{{\cal J},t} | X_{{\cal K},t}) \nonumber \\ &=& I(W; X_{{\cal J} \cup {\cal K},t}) - I(W; X_{{\cal K},t}) \nonumber \\ &=& I_t({\cal J} \cup {\cal K}) - I_t({\cal K}). \end{array}$
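This identity is nothing but the chain rule. Here is a quick numerical sanity check on a random toy joint distribution, computing the conditional mutual information directly from its definition (illustrative scaffolding, not part of the proof; all names are mine):

```python
import itertools
import math
import random

rng = random.Random(42)

# A random joint pmf over binary (W, A, B); A plays the role of X_{J,t}
# and B the role of X_{K,t}.
support = list(itertools.product((0, 1), repeat=3))
weights = [rng.random() for _ in support]
total_w = sum(weights)
p = {s: v / total_w for s, v in zip(support, weights)}

def marginal(idx):
    """Marginal pmf over the coordinates listed in idx."""
    out = {}
    for s, q in p.items():
        k = tuple(s[i] for i in idx)
        out[k] = out.get(k, 0.0) + q
    return out

def I(X, Y):
    """Unconditional mutual information (bits) between coordinate groups X and Y."""
    pxy, px, py = marginal(X + Y), marginal(X), marginal(Y)
    return sum(q * math.log2(q / (px[k[:len(X)]] * py[k[len(X):]]))
               for k, q in pxy.items() if q > 0)

def I_cond(X, Y, Z):
    """I(X;Y|Z) straight from the definition: average MI of the conditional pmfs."""
    result = 0.0
    for z, qz in marginal(Z).items():
        cond, cx, cy = {}, {}, {}
        for s, q in p.items():
            if tuple(s[i] for i in Z) == z:
                k = tuple(s[i] for i in X + Y)
                cond[k] = cond.get(k, 0.0) + q / qz
        for k, q in cond.items():
            cx[k[:len(X)]] = cx.get(k[:len(X)], 0.0) + q
            cy[k[len(X):]] = cy.get(k[len(X):], 0.0) + q
        result += qz * sum(q * math.log2(q / (cx[k[:len(X)]] * cy[k[len(X):]]))
                           for k, q in cond.items() if q > 0)
    return result

# chain rule: I(W; A | B) = I(W; (A, B)) - I(W; B)
lhs = I_cond([0], [1], [2])
rhs = I([0], [1, 2]) - I([0], [2])
```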

In addition, if we know ${I^+_{t+1}(\cdot|\cdot)}$, we can also compute ${I_{t+1}({\cal J} | {\cal K})}$:

$\displaystyle \begin{array}{rcl} I_{t+1}({\cal J} | {\cal K}) &=& I(W; X_{{\cal J},t+1} | X_{{\cal K},t+1}) \nonumber \\ &=& I(W; X_{{\cal J} \cup {\cal K},t+1}) - I(W; X_{{\cal K},t+1}) \nonumber \\ &=& I_{t+1}({\cal J} \cup {\cal K} | \emptyset) - I_{t+1}({\cal K} | \emptyset) \nonumber \\ &=& I^+_{t+1}({\cal J} \cup {\cal K} | \emptyset) - I^+_{t+1}({\cal K} | \emptyset). \end{array}$

Moreover, using our knowledge of ${I_t(\cdot)}$ and the fact that ${I^+_t(\emptyset|{\cal K}) = 0}$ for all ${{\cal K}}$, we can do the following: Let us fix an arbitrary node ${u \in {\cal V}}$. Then, from (9), we derive

$\displaystyle \begin{array}{rcl} I^+_{t+1}(\{u\} | {\cal K}) &\le& (1-\eta) \underbrace{I^+_{t+1}(\emptyset | {\cal K})}_{=0} + \eta \underbrace{I^+_{t+1}(\emptyset | {\cal K} \cup \partial_+ u)}_{=0} + \eta I_t(\partial_+ u | {\cal K}) \nonumber \\ &=& \eta I_t(\partial_+ u | {\cal K}). \end{array}$

Fixing another node ${v \neq u}$, we have

$\displaystyle \begin{array}{rcl} I^+_{t+1}(\{u,v\} | {\cal K}) &\le& (1-\eta)I^+_{t+1}(\{u\}|{\cal K}) + \eta I^+_{t+1}(\{u\} | {\cal K} \cup \partial_+ v) + \eta I_t(\partial_+ v | {\cal K}) \\ &\le& (1-\eta) \eta I_t(\partial_+ u | {\cal K}) + \eta^2 I_t(\partial_+ u | {\cal K} \cup \partial_+ v) + \eta I_t(\partial_+ v | {\cal K}) \\ &=& (1-\eta)\eta \left[I_t(\partial_+ u | {\cal K}) + I_t(\partial_+ v | {\cal K})\right] + \eta^2 I_t(\partial_+ u | {\cal K} \cup \partial_+ v) \nonumber\\ && \qquad \qquad \qquad - (1-\eta)\eta I_t(\partial_+ v | {\cal K}) + \eta I_t(\partial_+ v | {\cal K}) \\ &=& (1-\eta)\eta \left[I_t(\partial_+ u | {\cal K}) + I_t(\partial_+ v | {\cal K})\right] + \eta^2 \left[I_t(\partial_+ v | {\cal K}) + I_t(\partial_+ u | {\cal K} \cup \partial_+ v) \right] \\ &=& (1-\eta)\eta \left[I_t(\partial_+ u | {\cal K}) + I_t(\partial_+ v | {\cal K})\right] + \eta^2 I_t(\partial_+ u \cup \partial_+ v | {\cal K}), \end{array}$

where the first step uses (9); the second uses our bound on ${I^+_{t+1}(\{u\}|{\cal K})}$; in the third, we add and subtract ${(1-\eta)\eta I_t(\partial_+ v | {\cal K})}$; and the last step uses the chain rule. Continuing inductively, we see that, for any ${{\cal J}, {\cal K} \in 2^{{\cal V}}_0}$,

$\displaystyle I^+_{t+1}({\cal J} | {\cal K}) \le \displaystyle\sum^{|{\cal J}|-1}_{k=0} \eta^{|{\cal J}|-k}(1-\eta)^k \displaystyle\sum_{{\cal J}' \subseteq {\cal J}:\, |{\cal J}\backslash{\cal J}'| = k} I_t\Bigg( \bigcup_{v \in {\cal J}'} \partial_+ v \Bigg| {\cal K}\Bigg).$

In particular, using the formula ${I_{t+1}({\cal J}|{\cal K}) = I^+_{t+1}({\cal J}\cup{\cal K}|\emptyset) - I^+_{t+1}({\cal K}|\emptyset)}$ with ${{\cal K} = \emptyset}$, we obtain

$\displaystyle I(W; X_{{\cal J},t+1}) \le \displaystyle\sum^{|{\cal J}|-1}_{k=0} \eta^{|{\cal J}|-k}(1-\eta)^k \displaystyle\sum_{{\cal J}' \subseteq {\cal J}:\, |{\cal J} \backslash {\cal J}'| = k} I\Bigg(W; X_{w,t}, w \in \bigcup_{v \in {\cal J}'} \partial_+ v \Bigg). \ \ \ \ \ (10)$

The final step is to iterate the bound (10), starting from ${t = 1}$. Since ${W}$ is binary, ${I(W; X_{{\cal J},1}) \le H(W) \le \log 2}$. Therefore,

$\displaystyle I(W; X_{{\cal J},1}) \le |{\cal J}| \log 2, \qquad \forall {\cal J} \in 2^{{\cal V}}_0. \ \ \ \ \ (11)$

This can be interpreted as follows: if we define the specific information at time ${t}$ by

$\displaystyle U_t := \sup_{{\cal J}} \frac{I(W; X_{{\cal J},t})}{|{\cal J}|},$

then ${U_1 \le \log 2}$. Now, we write

$\displaystyle \begin{array}{rcl} I(W; X_{{\cal J},2}) &\le& \displaystyle\sum^{|{\cal J}|-1}_{k=0} \eta^{|{\cal J}|-k}(1-\eta)^k \displaystyle\sum_{{\cal J}' \subseteq {\cal J}:\, | {\cal J} \backslash {\cal J}' | = k}I\Bigg(W; X_{w,1}, w \in \bigcup_{v \in {\cal J}'} \partial_+ v \Bigg) \nonumber \\ &\le &\displaystyle\sum^{|{\cal J}|-1}_{k=0} \eta^{|{\cal J}|-k}(1-\eta)^k \displaystyle\sum_{{\cal J}' \subseteq {\cal J}:\, | {\cal J} \backslash {\cal J}' | = k} \left| \bigcup_{v \in {\cal J}'} \partial_+ v \right| \log 2 \nonumber \\ &\le &\displaystyle\sum^{|{\cal J}|-1}_{k=0} \eta^{|{\cal J}|-k}(1-\eta)^k \displaystyle\sum_{{\cal J}' \subseteq {\cal J}:\, | {\cal J} \backslash {\cal J}' | = k} \displaystyle\sum_{v \in {\cal J}'} |\partial_+ v| \log 2 \nonumber \\ &\le& \displaystyle\sum^{|{\cal J}|-1}_{k=0} \eta^{|{\cal J}|-k}(1-\eta)^k {|{\cal J}| \choose |{\cal J}|-k} (|{\cal J}|-k) \Delta \log 2 \nonumber \\ &=& |{\cal J}| \Delta \eta \log 2 \displaystyle\sum^{|{\cal J}|-1}_{k=0} {|{\cal J}|-1 \choose |{\cal J}|-1-k} \eta^{|{\cal J}|-1-k}(1-\eta)^k \nonumber \\ &=& |{\cal J}| \Delta \eta \log 2, \end{array}$

where the first step follows from (10); the second uses (11); the third step is by (over)counting; the fourth step uses the fact that ${|\partial_+ v| \le \Delta}$ for every ${v}$; and the remaining steps are simple combinatorics. What we have shown is that the specific information at ${t=2}$ can be bounded as ${U_2 \le \Delta \eta \log 2}$. Thus, if ${\Delta \eta < 1}$, then in one step the specific information will strictly decrease. Iterating, we get the bound

$\displaystyle I(W; X_{{\cal J},t}) \le |{\cal J}| (\Delta \eta)^{t-1} \log 2.$

If ${\Delta \eta < 1}$, then ${U_t \le (\Delta\eta)^{t-1}\log 2 \rightarrow 0}$ as ${t \rightarrow \infty}$. This completes the proof of Theorem 1.
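As a final sanity check, the combinatorial collapse in the display above — ${\sum_{k=0}^{n-1}\eta^{n-k}(1-\eta)^k\binom{n}{n-k}(n-k)\,\Delta\log 2 = n\Delta\eta\log 2}$ with ${n = |{\cal J}|}$, via ${m\binom{n}{m} = n\binom{n-1}{m-1}}$ and the binomial theorem — can be checked numerically (a throwaway sketch; function names are mine):

```python
import math

def collapsed_sum(n, eta, Delta):
    """LHS: sum over k = 0..n-1 of eta^(n-k) (1-eta)^k C(n, n-k) (n-k) Delta log 2."""
    return sum(eta ** (n - k) * (1 - eta) ** k * math.comb(n, n - k)
               * (n - k) * Delta * math.log(2) for k in range(n))

def closed_form(n, eta, Delta):
    """RHS after m*C(n, m) = n*C(n-1, m-1) and the binomial theorem."""
    return n * Delta * eta * math.log(2)
```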