# Some notes on Massey’s “Causality, feedback and directed information”

These are some notes on the first paper for our Information Theory reading group, but they may be of use to anyone with an interest in the way information theorists deal with feedback.

Warning: lots of equations and disjointed muttering about stochastic kernels and conditional independence below the fold!

This paper by Massey appeared twenty years ago and drew attention to a number of subtle issues that arise when feedback is present in a communication system. Back then, information-theoretic results concerning feedback were few and far between, and what Massey had to say challenged the conventional wisdom of the day. But now we have the benefit of hindsight and can actually see that what Massey was getting at was the idea that the model of a communication system is an interconnection of the stochastic kernels that describe the behavior of each component (source, encoder, channel, decoder). Hmmm, haven’t we seen this somewhere?

According to this “modern” view, there are four types of events that can occur in the system: generation of the message, generation of a channel input symbol, generation of a channel output symbol, and decoding of the message. Let’s suppose that it takes ${T}$ channel uses to transmit all the input symbols. Then we have the following ordering of events, or the causal ordering:

$\displaystyle W,X_1,Y_1,\ldots,X_t,Y_t,\ldots,X_T,Y_T,\hat{W},$

where ${W}$ is the message, ${X_t}$ is the channel input symbol at time ${t}$, ${Y_t}$ is the channel output symbol at time ${t}$, and ${\hat{W}}$ is the decoded message. A complete system specification amounts to specifying a joint probability distribution on the variables associated to all the events. We start with the channel. The fact that the channel is memoryless simply means that, for each ${t}$, the conditional distribution of ${Y_t}$ given ${W}$, ${X^t}$ and ${Y^{t-1}}$, is a function of the most recent channel input symbol ${X_t}$ only: ${P_{Y_t|W,X^t,Y^{t-1}} = P_{Y_t|X_t}}$. In other words, ${(W,X^{t-1},Y^{t-1}) \rightarrow X_t \rightarrow Y_t}$ is a Markov chain. The stochastic kernel ${P_{Y_t|X_t}}$ that describes the channel is fixed. The system designer must specify the encoder policy ${\{P_{X_t|W,X^{t-1},Y^{t-1}}\}^T_{t=1}}$, i.e., the conditional distributions of the channel inputs given the message and all past inputs and outputs, as well as the decoder policy ${P_{\hat{W}|W,X^T,Y^T}}$ that specifies the conditional distribution of the decoded message given all other variables. Now, Massey says that the channel is used without feedback if

$\displaystyle P_{X_t|W,X^{t-1},Y^{t-1}} = P_{X_t|W,X^{t-1}},$

in other words, if at each ${t}$ the current channel input symbol depends only on the message and on the past channel input symbols, but not on any output symbols. Moreover, the decoder does not see the actual message ${W}$ (that would defeat the purpose of communicating) or the channel input symbols ${X^T}$, so

$\displaystyle P_{\hat{W}|W,X^T,Y^T} = P_{\hat{W}|Y^T}.$

We won’t worry about the decoder, though. Let’s see what the joint distribution of ${W}$, ${X^T}$ and ${Y^T}$ looks like. We start by writing down the factorization according to the causal ordering and then using all the conditional independence relations:

$\displaystyle \begin{array}{rcl} P_{W,X^T,Y^T}(w,x^T,y^T) &=& P_W(w) \cdot \prod^T_{t=1} P_{X_t|W,X^{t-1}}(x_t|w,x^{t-1}) \cdot \prod^T_{t=1} P_{Y_t|X_t}(y_t|x_t) \\ &=& P_W(w) \cdot P_{X^T|W}(x^T|w) \cdot \prod^T_{t=1}P_{Y_t|X_t}(y_t|x_t) \\ &=& P_{W,X^T}(w,x^T) \cdot \prod^T_{t=1} P_{Y_t|X_t}(y_t|x_t). \end{array}$

Note that it is precisely the fact that the channel was used without feedback that has allowed us to factor out the joint distribution of the message and the channel input symbols. Now if ${w}$ and ${x^T}$ are such that ${P_{W,X^T}(w,x^T) > 0}$, then we can write down the conditional distribution of the channel output sequence ${Y^T}$ given the message ${W = w}$ and input sequence ${X^T = x^T}$:

$\displaystyle P_{Y^T|W,X^T}(y^T|w,x^T) = \prod^T_{t=1}P_{Y_t|X_t}(y_t|x_t).$

At the time Massey’s paper came out, this was often used as the general definition of a DMC, even though it is actually a consequence of the fact that the channel was used without feedback. In fact, this definition appears on page 73 of Robert Gallager‘s classic textbook, Information Theory and Reliable Communication, but what it really does is rule out feedback. In all fairness, this is a fairly subtle point that can be easily overlooked if one does not think in terms of kernels and interconnections, but that framework was not yet in vogue in the early 1990’s.

Another way to think about channels without feedback is this: consider the first two input-output pairs and write the causal ordering

$\displaystyle W,X_1,Y_1,X_2,Y_2,\ldots,X_T,Y_T,\hat{W}$

Note that, by the no-feedback property, ${X_2}$ and ${Y_1}$ are conditionally independent given ${W,X_1}$. Thus we can apply our useful little lemma (cf.) on the equivalence of causal ordeirngs and exchange ${Y_1}$ and ${X_2}$ to get an equivalent ordering

$\displaystyle W,X_1,X_2,Y_1,Y_2,X_3,Y_3,\ldots,X_T,Y_T,\hat{W}.$

Now we can pull the same trick to switch ${X_3}$ with ${Y_1,Y_2}$: since the channel is used without feedback, ${X_3}$ is conditionally independent of ${(Y_1,Y_2)}$ given ${(W,X_1,X_2)}$. This gives us

$\displaystyle W,X_1,X_2,X_3,Y_1,Y_2,Y_3,X_4,Y_4,\ldots,X_T,Y_T,\hat{W}.$

Continuing in this fashion, we arrive at the causal ordering

$\displaystyle W,X_1,\ldots,X_T,Y_1,\ldots,Y_T,\hat{W}$

which is equivalent to the original one. What this means is that, without feedback, all the inputs might as well have been generated first, given the message, and then launched through the channel one at a time. This is actually important since it brings in the arrow of time and, therefore, causality into the picture. Essentially, the no-feedback set-up means that the present channel outputs cannot affect the future channel inputs — indeed, all the channel inputs might as well have been specified before the channel was even activated. On the other hand, they are certainly statistically dependent on the channel outputs. As Massey himself puts it,

The lesson to be learned here is that probabilistic dependence is quite distinct from causal dependence. Whether ${X}$ causes ${Y}$ or ${Y}$ causes ${X}$, the random variables ${X}$ and ${Y}$ will be statistically dependent. Statistical dependence, unlike causality, has no inherent directivity.

Since ${X^T}$ and ${Y^T}$ are statistically dependent on one another, their mutual information ${I(X^T; Y^T)}$ is nonzero. But how can we quantify the presence of feedback?

From now on, we do not assume that the channel is memoryless. Let’s understand what feedback is in operational terms. When feedback is present, the encoder gets to see the past channel outputs and, based on this information and on the message, decides what the next input symbol should be. Let’s look at this from the decoder’s point of view. Let’s forget about the message ${W}$ and focus simply on making sure that the channel outputs accurately convey the channel inputs. At time ${t-1}$, the decoder has already seen ${Y^{t-1}}$, while the encoder has seen ${Y^{t-2}}$ (remember, there’s feedback, but it’s delayed by one time step). At time ${t}$, the encoder receives ${Y_{t-1}}$ and generates ${X_t}$. The decoder receives ${Y_t}$. How much information has the decoder gained about ${X^t}$? This much:

$\displaystyle I(X^t ; Y_t | Y^{t-1}) = H(X^t|Y^{t-1}) - H(X^t|Y^t).$

The first term on the right is the decoder’s average uncertainty about ${X^t}$ after he has learned ${Y_{t-1}}$. The second term is the average uncertainty about ${X^t}$ given ${Y^{t-1}}$ and the additional symbol ${Y_t}$. Summing these terms from ${t=1}$ to ${t=T}$, we get what Massey called the directed information from ${X^T}$ to ${Y^T}$:

$\displaystyle I(X^T \rightarrow Y^T) = \sum^T_{t=1} I(X^t; Y_t | Y^{t-1}). \ \ \ \ \ (1)$

Note the difference between the usual mutual information, which by the chain rule can be expanded as

$\displaystyle I(X^T; Y^T) = \sum^T_{t=1} I(X^T; Y_t | Y^{t-1}). \ \ \ \ \ (2)$

The difference between (1) and (2) is seemingly minor: the term corresponding to time ${t}$ has ${X^t}$ in one and ${X^T}$ in the other. But this distinction is very important: when feedback is present, the future channel inputs ${X_{t+1},\ldots,X_T}$ will be causally determined by the past channel outputs. When there is no feedback, ${X^T}$ is generated before transmission can even begin. Thus, in the no-feedback case it makes sense to talk about uncertainty remaining about the whole ${X^T}$ as more and more channel outputs are seen.

Massey then goes on to prove that ${I(X^T \rightarrow Y^T) \le I(X^T; Y^T)}$, with equality if the channel is used without feedback. What I will do is prove much more — namely, that ${I(X^T \rightarrow Y^T) = I(X^T; Y^T)}$ if and only if the channel is used without feedback. The proof will be based on an equivalent characterization of the directed information from the paper of Sekhar Tatikonda and Sanjoy Mitter on the capacity of channels with feedback. (This result was proved by James and Peter Massey in 2005 using different techniques; they called it the conservation law for information.)

To do this, let’s consider our abstract model from before: we have ${N}$ causally ordered random variables ${Z_1,\ldots,Z_N}$. Let us split the set ${\{1,\ldots,N\}}$ into two disjoint subsets, ${S = \{i_1,\ldots,i_K\}}$ and ${S^c = \{j_1,\ldots,j_L\}}$ with ${1 \le i_1 < i_2 < \ldots < i_K \le N}$ and ${1 \le j_1 < j_2 < \ldots < j_L \le N}$. Let us also define the directed stochatsic kernel of ${Z^S = (Z_i)_{i \in S}}$ with respect to ${Z^{S^c} = (Z_i)_{i \in S^c}}$ as

$\displaystyle \vec{P}_{Z^S|Z^{S^c}}(z^S|z^{S^c}) = \prod^K_{k=1} P_{Z_{i_k}|Z^{i_k-1}}(z_{i_k}|z^{i_k-1}).$

We can define the directed kernel from ${Z^{S^c}}$ to ${Z^S}$ in exactly the same way. The first thing to notice is that we can write the joint distribution of ${Z^N}$ as

$\displaystyle P_{Z^N}(z^N) = \vec{P}_{Z^S|Z^{S^c}}(z^S|z^{S^c}) \vec{P}_{Z^{S^c}|Z^S}(z^{S^c}|z^S). \ \ \ \ \ (3)$

Now let’s define the directed information from ${Z^S}$ to ${Z^{S^c}}$ by

$\displaystyle I(Z^S \rightarrow Z^{S^c}) = D\Big( P_{Z^N} \Big\| \vec{P}_{Z^S|Z^{S^c}} \times P_{Z^{S^c}} \Big),$

where ${D(\cdot \| \cdot)}$ is the relative entropy (aka information divergence, aka Kullback–Leibler divergence). Expanding the definition, we get

$\displaystyle I(Z^S \rightarrow Z^{S^c}) = \mathop{\mathbb E} \left[ \log \frac{P_{Z^S|Z^{S^c}}(Z^S|Z^{S^c})}{\vec{P}_{Z^S|Z^{S^c}}(Z^S|Z^{S^c})}\right],$

so the directed information from ${Z^S}$ to ${Z^{S^c}}$ can be thought of as the expected log-likelihood ratio between the full posterior distribution of ${Z^S}$ given ${Z^{S^c}}$ and the causally constructed posterior, i.e., ${\vec{P}_{Z^S|Z^{S^c}}}$.

The first thing I will do is show that this definition is actually equivalent to Massey’s. Let’s write down our usual causal ordering ${X_1,Y_1,\ldots,X_T,Y_T}$ and define ${Z^S = (X_1,\ldots,X_T)}$ and ${Z^{S^c} = (Y_1,\ldots,Y_T)}$. Then

$\displaystyle \vec{P}_{X^T|Y^T}(x^T|y^T) = \prod^T_{t=1}P_{X_t|X^{t-1},Y^{t-1}}(x_t|x^{t-1}, y^{t-1}),$

and, using the definition of ${D(\cdot\|\cdot)}$ and the causal factorization, we get

$\displaystyle \begin{array}{rcl} D\Big(P_{X^T,Y^T} \Big\| \vec{P}_{X^T|Y^T} \times P_{Y^T}\Big) &=& \mathop{\mathbb E} \left[ \log \frac{P_{X^T,Y^T}(X^T,Y^T)}{\vec{P}_{X^T|Y^T}(X^T|Y^T)P_{Y^T}(Y^T)}\right] \\ &=& \sum^T_{t=1}\mathop{\mathbb E} \left[ \log \frac{P_{X_t|X^{t-1},Y^{t-1}}(X_t|X^{t-1},Y^{t-1}) P_{Y_t|X^t,Y^{t-1}}(Y_t|X^t,Y^{t-1})}{P_{X_t|X^{t-1},Y^{t-1}}(X_t|X^{t-1},Y^{t-1}) P_{Y_t|Y^{t-1}}(Y_t|Y^{t-1})}\right] \\ &=& \sum^T_{t=1} \mathop{\mathbb E} \left[ \log \frac{P_{Y_t|X^t,Y^{t-1}}(Y_t|X^t,Y^{t-1})}{P_{Y_t|Y^{t-1}}(Y_t|Y^{t-1})}\right] \\ &=& \sum^T_{t=1}\left[ H(Y_t|Y^{t-1}) - H(Y_t|X^t,Y^{t-1})\right] \\ &=& \sum^T_{t=1} I(X^t; Y_t | Y^{t-1}), \end{array}$

which is exactly Massey’s definition.

Now I will prove the following

Lemma 1

$\displaystyle I(X^T; Y^T) = I(X^T \rightarrow Y^T) + I(Y^T \rightarrow X^T),$

where the directed information from ${Y^T}$ to ${X^T}$ is zero if and only if the channel is used without feedback.

Proof: Use (3) to write

$\displaystyle \begin{array}{rcl} I(X^T; Y^T) &=& D \Big( P_{X^T,Y^T} \Big\| P_{X^T} \times P_{Y^T} \Big) \\ &=& \mathop{\mathbb E} \left[ \log \frac{P_{X^T,Y^T}(X^T,Y^T)}{P_{X^T}(X^T)P_{Y^T}(Y^T)}\right] \\ &=& \mathop{\mathbb E} \left[ \log \frac{P_{X^T,Y^T}(X^T,Y^T)}{\vec{P}_{Y^T|X^T}(Y^T|X^T)P_{X^T}(X^T)}\right] + \mathop{\mathbb E} \left[ \log \frac{P_{X^T,Y^T}(X^T,Y^T)}{\vec{P}_{X^T|Y^T}(X^T|Y^T)P_{Y^T}(Y^T)}\right] \\ &=& I(Y^T \rightarrow X^T) + I(X^T \rightarrow Y^T). \end{array}$

Now, by definition

$\displaystyle \begin{array}{rcl} I(Y^T \rightarrow X^T) &=& \sum^T_{t=1}\mathop{\mathbb E} \left[ \frac{ P_{Y_t|X^t,Y^{t-1}} (Y_t|X^t,Y^{t-1}) P_{X_t|X^{t-1},Y^{t-1}} (X_t|X^{t-1},Y^{t-1}) } {P_{Y_t|X^t,Y^{t-1}}(Y_t|X^t,Y^{t-1})P_{X_t|X^{t-1}}(X_t|X^{t-1})}\right] \\ &=& \sum^T_{t=1} \mathop{\mathbb E} \left[ \frac{P_{X_t|X^{t-1},Y^{t-1}}(X_t|X^{t-1},Y^{t-1})}{P_{X_t|X^{t-1}}(X_t|X^{t-1})}\right] \\ &=& \sum^T_{t=1} I(X_t; Y^{t-1} | X^{t-1}). \end{array}$

Since the mutual information is nonnegative, ${I(Y^T \rightarrow X^T) = 0}$ if and only if ${I(X_t; Y^{t-1}|X^{t-1}) = 0}$ for all ${t}$, or, equivalently, if ${X_t}$ and ${Y^{t-1}}$ are conditionally independent given ${X^{t-1}}$. But means precisely that the channel is used without feedback. $\Box$

And this gets us to the main result:

Theorem 2 ${I(X^T \rightarrow Y^T) \le I(X^T; Y^T)}$, with equality if and only if there is no feedback.

Proof: From the lemma,

$\displaystyle I(X^T \rightarrow Y^T) = I(X^T;Y^T) - I(Y^T \rightarrow X^T).$

Hence, ${I(X^T \rightarrow Y^T) \le I(X^T; Y^T)}$. Equality holds if and only if ${I(Y^T \rightarrow X^T) = 0}$, which in turn holds if and only if the channel is used without feedback. $\Box$

These days, feedback information theory is a cottage industry with directed information as the driving force behind it. Apart from supplying the fundamental limits on communication systems with feedback, it has already been applied to sequential investment and hypothesis testing with causal side information, as well as to inference of causality in neuronal assemblies in the brain and in regulatory biological networks.

## 4 thoughts on “Some notes on Massey’s “Causality, feedback and directed information””

1. simona says:

How can I cite you on these notes?