Conditional mutual information and the best Markov approximation
I came across a neat and useful result about conditional mutual information while reading a paper on quantum information theory.
Consider three jointly distributed random variables $X, Y, Z$. If the conditional mutual information $I(X;Y|Z) = 0$, then $X$ and $Y$ are conditionally independent given $Z$ (i.e., $X \to Z \to Y$ is a Markov chain). In fact, $I(X;Y|Z) = 0$ if and only if $X \to Z \to Y$. Now, what if $I(X;Y|Z)$ is nonzero, but very small? Intuitively, one would expect that the joint distribution $P_{XYZ}$ is somehow “close” to being Markov. It turns out that this intuition is, indeed, correct: in fact, the conditional mutual information is the divergence between $P_{XYZ}$ and the nearest distribution under which $X \to Z \to Y$ is a Markov chain,

$$
I(X;Y|Z) = \min_{Q_{XYZ}\,:\, X \to Z \to Y} D(P_{XYZ} \,\|\, Q_{XYZ}).
$$

Moreover, the minimum is achieved by

$$
Q^*_{XYZ}(x,y,z) = P_Z(z)\, P_{X|Z}(x|z)\, P_{Y|Z}(y|z).
$$
Proof: To keep things simple, I will assume that all random variables take values in finite sets. Consider any distribution $Q_{XYZ}$ for which $X \to Z \to Y$, i.e., $Q_{XYZ}(x,y,z) = Q_Z(z)\,Q_{X|Z}(x|z)\,Q_{Y|Z}(y|z)$. Then

$$
D(P_{XYZ}\,\|\,Q_{XYZ}) = \sum_{x,y,z} P_{XYZ}(x,y,z)\log\frac{P_Z(z)}{Q_Z(z)}
+ \sum_{x,y,z} P_{XYZ}(x,y,z)\log\frac{P_{X|Z}(x|z)}{Q_{X|Z}(x|z)}
+ \sum_{x,y,z} P_{XYZ}(x,y,z)\log\frac{P_{XY|Z}(x,y|z)}{P_{X|Z}(x|z)\,Q_{Y|Z}(y|z)}.
$$

The first two terms on the right-hand side are equal, respectively, to $D(P_Z\|Q_Z)$ and to the conditional divergence $D(P_{X|Z}\|Q_{X|Z}\,|\,P_Z) \triangleq \sum_z P_Z(z)\, D(P_{X|Z=z}\|Q_{X|Z=z})$. For the third term, we can write

$$
\sum_{x,y,z} P_{XYZ}(x,y,z)\log\frac{P_{XY|Z}(x,y|z)}{P_{X|Z}(x|z)\,Q_{Y|Z}(y|z)}
= \sum_{x,y,z} P_{XYZ}(x,y,z)\log\frac{P_{Y|Z}(y|z)}{Q_{Y|Z}(y|z)}
+ \sum_{x,y,z} P_{XYZ}(x,y,z)\log\frac{P_{XY|Z}(x,y|z)}{P_{X|Z}(x|z)\,P_{Y|Z}(y|z)}
= D(P_{Y|Z}\|Q_{Y|Z}\,|\,P_Z) + I(X;Y|Z).
$$

Putting everything together, we obtain

$$
D(P_{XYZ}\,\|\,Q_{XYZ}) = D(P_Z\|Q_Z) + D(P_{X|Z}\|Q_{X|Z}\,|\,P_Z) + D(P_{Y|Z}\|Q_{Y|Z}\,|\,P_Z) + I(X;Y|Z). \qquad (1)
$$
Now we observe two things:
- The right-hand side of (1) is greater than or equal to $I(X;Y|Z)$, since divergences are nonnegative.
- The first three terms on the right-hand side are zero if and only if $Q_{XYZ}$ is such that $Q_Z = P_Z$, and $Q_{X|Z}(\cdot|z) = P_{X|Z}(\cdot|z)$ and $Q_{Y|Z}(\cdot|z) = P_{Y|Z}(\cdot|z)$, for all $z$ that satisfy $P_Z(z) > 0$.
This concludes the proof.
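To see the identity in action, here is a quick numerical sketch; it is not from the post, and it assumes NumPy, small finite alphabets, and a randomly drawn joint pmf. It forms the minimizer $Q^*_{XYZ}$ given above, checks that $D(P_{XYZ}\|Q^*_{XYZ}) = I(X;Y|Z)$, and confirms that a generic Markov chain $X \to Z \to Y$ gives a larger divergence:

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, nz = 3, 4, 2

P = rng.random((nx, ny, nz))
P /= P.sum()                        # joint pmf P_{XYZ}(x, y, z)
P_z = P.sum(axis=(0, 1))            # P_Z
P_xz = P.sum(axis=1)                # P_{XZ}
P_yz = P.sum(axis=0)                # P_{YZ}

def kl(p, q):
    """Relative entropy D(p || q) in nats."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# I(X;Y|Z) computed directly from its definition
I_xyz = float(np.sum(P * (np.log(P) + np.log(P_z)
                          - np.log(P_xz[:, None, :]) - np.log(P_yz[None, :, :]))))

# The claimed minimizer Q*(x,y,z) = P_Z(z) P_{X|Z}(x|z) P_{Y|Z}(y|z)
Q_star = np.einsum('xz,yz->xyz', P_xz / P_z, P_yz / P_z) * P_z
print(np.isclose(kl(P, Q_star), I_xyz))      # True: D(P || Q*) = I(X;Y|Z)

# Any other Markov chain X -> Z -> Y gives a larger divergence
Qz = rng.random(nz); Qz /= Qz.sum()
Qxz = rng.random((nx, nz)); Qxz /= Qxz.sum(axis=0)   # Q_{X|Z}
Qyz = rng.random((ny, nz)); Qyz /= Qyz.sum(axis=0)   # Q_{Y|Z}
Q = np.einsum('xz,yz,z->xyz', Qxz, Qyz, Qz)
print(kl(P, Q) >= I_xyz)                     # True
```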
Here is one interesting consequence. Consider three random variables $X$, $Y$ and $Z$, where $X$ is real-valued with $|X| \le 1$ almost surely. Let

$$
\mathrm{mmse}(X|Y) \triangleq \inf_{g}\, \mathbb{E}\big[(X - g(Y))^2\big] = \mathbb{E}\big[\big(X - \mathbb{E}[X|Y]\big)^2\big]
$$

be the minimum mean-square error (MMSE) achievable by any (measurable) estimator of $X$ based on $Y$, and do the same for $\mathrm{mmse}(X|Y,Z)$. Then $\mathrm{mmse}(X|Y,Z) \le \mathrm{mmse}(X|Y)$, i.e., using more information for estimation never hurts. However, how much do we gain by using both $Y$ and $Z$? The following inequality, due to Yihong Wu and Sergio Verdú, gives an upper bound on the excess MMSE in terms of conditional mutual information:

$$
\mathrm{mmse}(X|Y) - \mathrm{mmse}(X|Y,Z) \le 2\, I(X;Z|Y),
$$

where the mutual information is measured in nats.
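As a quick sanity check of the bound, here is a small sketch of my own; it assumes NumPy, a randomly drawn finite joint pmf, and a made-up alphabet for $X$ contained in $[-1,1]$. It computes both MMSEs by brute force and compares the excess MMSE with $2\,I(X;Z|Y)$ in nats:

```python
import numpy as np

rng = np.random.default_rng(1)
x_vals = np.array([-1.0, -0.25, 0.5, 1.0])     # hypothetical alphabet of X, so |X| <= 1
P = rng.random((len(x_vals), 3, 3))
P /= P.sum()                                   # joint pmf P_{XYZ}(x, y, z)
x = x_vals[:, None, None]

P_xy = P.sum(axis=2, keepdims=True)            # P_{XY}
P_y = P_xy.sum(axis=0, keepdims=True)          # P_Y
P_yz = P.sum(axis=0, keepdims=True)            # P_{YZ}

# Conditional means E[X|Y] and E[X|Y,Z], then the two MMSEs
E_x_given_y = (x * P_xy).sum(axis=0, keepdims=True) / P_y
E_x_given_yz = (x * P).sum(axis=0, keepdims=True) / P_yz
mmse_y = float(np.sum(P * (x - E_x_given_y) ** 2))       # mmse(X|Y)
mmse_yz = float(np.sum(P * (x - E_x_given_yz) ** 2))     # mmse(X|Y,Z)

# I(X;Z|Y) in nats
I_xz_given_y = float(np.sum(P * np.log(P * P_y / (P_xy * P_yz))))

print(mmse_y - mmse_yz <= 2 * I_xz_given_y)    # True: the excess MMSE is bounded
```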
This inequality implies (a generalization of) Tao's inequality,

$$
\begin{aligned}
\mathbb{E}\big|\mathbb{E}[X|Y,Z] - \mathbb{E}[X|Y]\big|
&\le \sqrt{\mathbb{E}\big[\big(\mathbb{E}[X|Y,Z] - \mathbb{E}[X|Y]\big)^2\big]} \\
&= \sqrt{\mathrm{mmse}(X|Y) - \mathrm{mmse}(X|Y,Z)} \\
&\le \sqrt{2\, I(X;Z|Y)},
\end{aligned}
$$

where the equality in the second line is due to the orthogonality principle, whereby

$$
\mathrm{mmse}(X|Y) - \mathrm{mmse}(X|Y,Z) = \mathbb{E}\big[\big(\mathbb{E}[X|Y,Z] - \mathbb{E}[X|Y]\big)^2\big]
$$

(expand the square in $\mathrm{mmse}(X|Y) = \mathbb{E}\big[(X - \mathbb{E}[X|Y,Z] + \mathbb{E}[X|Y,Z] - \mathbb{E}[X|Y])^2\big]$ and note that $X - \mathbb{E}[X|Y,Z]$ is orthogonal to every square-integrable function of $(Y,Z)$, so the cross term vanishes).
Therefore, if the addition of $Z$ improves the MMSE significantly, then the joint distribution $P_{XYZ}$ is far from the set of all distributions under which $X \to Y \to Z$ is a Markov chain.
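Finally, here is a companion sketch, under the same assumptions as the one above (NumPy, a random finite joint pmf, $|X|\le 1$), that checks the orthogonality identity and the resulting Tao-type bound numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
x_vals = np.linspace(-1.0, 1.0, 5)             # hypothetical alphabet of X, |X| <= 1
P = rng.random((len(x_vals), 3, 3))
P /= P.sum()                                   # joint pmf P_{XYZ}(x, y, z)
x = x_vals[:, None, None]

P_xy = P.sum(axis=2, keepdims=True)            # P_{XY}
P_y = P_xy.sum(axis=0, keepdims=True)          # P_Y
P_yz = P.sum(axis=0, keepdims=True)            # P_{YZ}

E_y = (x * P_xy).sum(axis=0, keepdims=True) / P_y      # E[X|Y]
E_yz = (x * P).sum(axis=0, keepdims=True) / P_yz       # E[X|Y,Z]
mmse_y = float(np.sum(P * (x - E_y) ** 2))
mmse_yz = float(np.sum(P * (x - E_yz) ** 2))

# Orthogonality principle: the MMSE gap equals E[(E[X|Y,Z] - E[X|Y])^2]
print(np.isclose(mmse_y - mmse_yz, float(np.sum(P * (E_yz - E_y) ** 2))))   # True

# Tao-type bound: E|E[X|Y,Z] - E[X|Y]| <= sqrt(2 I(X;Z|Y)), with I in nats
I_xz_given_y = float(np.sum(P * np.log(P * P_y / (P_xy * P_yz))))
print(float(np.sum(P * np.abs(E_yz - E_y))) <= np.sqrt(2 * I_xz_given_y))   # True
```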