## Some notes on Massey’s “Causality, feedback and directed information”

These are some notes on the first paper for our Information Theory reading group, but they may be of use to anyone with an interest in the way information theorists deal with feedback.

**Warning:** lots of equations and disjointed muttering about stochastic kernels and conditional independence below the fold!

This paper by Massey appeared twenty years ago and drew attention to a number of subtle issues that arise when feedback is present in a communication system. Back then, information-theoretic results concerning feedback were few and far between, and what Massey had to say challenged the conventional wisdom of the day. But now we have the benefit of hindsight and can actually see that what Massey was getting at was the idea that the model of a communication system is an *interconnection* of the stochastic kernels that describe the behavior of each component (source, encoder, channel, decoder). Hmmm, haven’t we seen this somewhere?

According to this “modern” view, there are four types of events that can occur in the system: generation of the message, generation of a channel input symbol, generation of a channel output symbol, and decoding of the message. Let’s suppose that it takes channel uses to transmit all the input symbols. Then we have the following ordering of events, or the causal ordering:

where is the message, is the channel input symbol at time , is the channel output symbol at time , and is the decoded message. A complete system specification amounts to specifying a joint probability distribution on the variables associated to all the events. We start with the channel. The fact that the channel is *memoryless* simply means that, for each , the conditional distribution of given , and , is a function of the most recent channel input symbol only: . In other words, is a Markov chain. The stochastic kernel that describes the channel is fixed. The system designer must specify the encoder policy , i.e., the conditional distributions of the channel inputs given the message and all past inputs and outputs, as well as the decoder policy that specifies the conditional distribution of the decoded message given all other variables. Now, Massey says that the channel is *used without feedback* if

in other words, if at each the current channel input symbol depends only on the message and on the past channel input symbols, but not on any output symbols. Moreover, the decoder does not see the actual message (that would defeat the purpose of communicating) or the channel input symbols , so

We won’t worry about the decoder, though. Let’s see what the joint distribution of , and looks like. We start by writing down the factorization according to the causal ordering and then using all the conditional independence relations:

Note that it is precisely the fact that the channel was used without feedback that has allowed us to factor out the joint distribution of the message and the channel input symbols. Now if and are such that , then we can write down the *conditional distribution* of the channel output sequence given the message and input sequence :

At the time Massey’s paper came out, this was often used as the *general definition* of a DMC, even though it is actually a consequence of the fact that the channel was used without feedback. In fact, this definition appears on page 73 of Robert Gallager‘s classic textbook, *Information Theory and Reliable Communication*, but what it really does is rule out feedback. In all fairness, this is a fairly subtle point that can be easily overlooked if one does not think in terms of kernels and interconnections, but that framework was not yet in vogue in the early 1990’s.

Another way to think about channels without feedback is this: consider the first two input-output pairs and write the causal ordering

Note that, by the no-feedback property, and are conditionally independent given . Thus we can apply our useful little lemma (cf.) on the equivalence of causal ordeirngs and exchange and to get an equivalent ordering

Now we can pull the same trick to switch with : since the channel is used without feedback, is conditionally independent of given . This gives us

Continuing in this fashion, we arrive at the causal ordering

which is equivalent to the original one. What this means is that, without feedback, all the inputs might as well have been generated first, given the message, and then launched through the channel one at a time. This is actually important since it brings in the arrow of time and, therefore, causality into the picture. Essentially, the no-feedback set-up means that the present channel outputs cannot affect the future channel inputs — indeed, all the channel inputs might as well have been specified before the channel was even activated. On the other hand, they are certainly *statistically dependent* on the channel outputs. As Massey himself puts it,

The lesson to be learned here is that

probabilistic dependence is quite distinct from causal dependence. Whether causes or causes , the random variables and will be statistically dependent. Statistical dependence, unlike causality, has no inherent directivity.

Since and are *statistically dependent* on one another, their mutual information is nonzero. But how can we quantify the presence of feedback?

From now on, we do not assume that the channel is memoryless. Let’s understand what feedback is in operational terms. When feedback is present, the encoder gets to see the past channel outputs and, based on this information and on the message, decides what the next input symbol should be. Let’s look at this from the decoder’s point of view. Let’s forget about the message and focus simply on making sure that the channel outputs accurately convey the channel inputs. At time , the decoder has already seen , while the encoder has seen (remember, there’s feedback, but it’s delayed by one time step). At time , the encoder receives and generates . The decoder receives . How much information has the decoder gained about ? This much:

The first term on the right is the decoder’s average uncertainty about after he has learned . The second term is the average uncertainty about given and the additional symbol . Summing these terms from to , we get what Massey called the *directed information* from to :

Note the difference between the usual mutual information, which by the chain rule can be expanded as

The difference between (1) and (2) is seemingly minor: the term corresponding to time has in one and in the other. But this distinction is very important: when feedback is present, the future channel inputs will be causally determined by the past channel outputs. When there is no feedback, is generated before transmission can even begin. Thus, in the no-feedback case it makes sense to talk about uncertainty remaining about the whole as more and more channel outputs are seen.

Massey then goes on to prove that , with equality if the channel is used without feedback. What I will do is prove much more — namely, that *if and only if* the channel is used without feedback. The proof will be based on an equivalent characterization of the directed information from the paper of Sekhar Tatikonda and Sanjoy Mitter on the capacity of channels with feedback. (This result was proved by James and Peter Massey in 2005 using different techniques; they called it the *conservation law for information*.)

To do this, let’s consider our abstract model from before: we have causally ordered random variables . Let us split the set into two disjoint subsets, and with and . Let us also define the *directed stochatsic kernel* of with respect to as

We can define the directed kernel from to in exactly the same way. The first thing to notice is that we can write the joint distribution of as

Now let’s define the *directed information* from to by

where is the relative entropy (aka information divergence, aka Kullback–Leibler divergence). Expanding the definition, we get

so the directed information from to can be thought of as the expected log-likelihood ratio between the full posterior distribution of given and the *causally constructed* posterior, i.e., .

The first thing I will do is show that this definition is actually equivalent to Massey’s. Let’s write down our usual causal ordering and define and . Then

and, using the definition of and the causal factorization, we get

which is exactly Massey’s definition.

Now I will prove the following

Lemma 1

where the directed information from to is zero if and only if the channel is used without feedback.

*Proof:* Use (3) to write

Now, by definition

Since the mutual information is nonnegative, if and only if for all , or, equivalently, if and are conditionally independent given . But means precisely that the channel is used without feedback.

And this gets us to the main result:

Theorem 2, with equality if and only if there is no feedback.

*Proof:* From the lemma,

Hence, . Equality holds if and only if , which in turn holds if and only if the channel is used without feedback.

These days, feedback information theory is a cottage industry with directed information as the driving force behind it. Apart from supplying the fundamental limits on communication systems with feedback, it has already been applied to sequential investment and hypothesis testing with causal side information, as well as to inference of causality in neuronal assemblies in the brain and in regulatory biological networks.

assorted links and news « An Ergodic Walksaid, on September 12, 2010 at 4:11 pm[…] Max really digs in to directed information. […]

Directed stochastic kernels and causal interventions « The Information Structuralistsaid, on September 23, 2010 at 11:34 pm[…] (which we had started looking at during the last meeting of our reading group), I realized that directed stochastic kernels that feature so prominently in the general definition of directed information are known in the […]

simonasaid, on August 12, 2013 at 11:00 amHow can I cite you on these notes?

mraginskysaid, on August 12, 2013 at 1:31 pmThe best way would be by the URL.