# The Information Structuralist

## Conditional mutual information and the best Markov approximation

Posted in Information Theory by mraginsky on January 4, 2013

I came across a neat and useful result about conditional mutual information while reading a paper on quantum information theory.

Consider three jointly distributed random variables ${X,Y,Z}$. If the conditional mutual information ${I(X;Y|Z) = 0}$, then ${X}$ and ${Y}$ are conditionally independent given ${Z}$ (i.e., ${X \rightarrow Z \rightarrow Y}$ is a Markov chain). In fact, ${I(X;Y|Z) = 0}$ if and only if ${X \rightarrow Z \rightarrow Y}$. Now, what if ${I(X;Y|Z)}$ is nonzero, but very small? Intuitively, one would expect that the joint distribution ${P_{XYZ}}$ is somehow “close” to being Markov. It turns out that this intuition is, indeed, correct — in fact, the conditional mutual information ${I(X;Y|Z)}$ is the divergence between ${P_{XYZ}}$ and the nearest distribution ${Q_{XYZ}}$ under which ${X \rightarrow Z \rightarrow Y}$ is a Markov chain.

Theorem 1 Let ${X,Y,Z}$ be three jointly distributed random variables. Then $\displaystyle I(X;Y|Z) = \min \left\{ D(P_{XYZ} \| Q_{XYZ}) : X \rightarrow Z \rightarrow Y \text{ under } Q \right\}$

Moreover, the minimum is achieved by $\displaystyle Q^*_{XYZ}(x,y,z) = P_Z(z)P_{X|Z}(x|z)P_{Y|Z}(y|z).$
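The identity is easy to spot-check numerically. Here is a minimal sketch (the alphabet sizes and random seed are arbitrary choices, not from the post) that draws a random joint pmf ${P_{XYZ}}$, computes ${I(X;Y|Z)}$ directly from its definition, and compares it with ${D(P_{XYZ} \| Q^*_{XYZ})}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random strictly positive joint pmf P(x, y, z); axes are (x, y, z).
P = rng.random((3, 4, 2))
P /= P.sum()

Pz = P.sum(axis=(0, 1))   # P(z)
Pxz = P.sum(axis=1)       # P(x, z)
Pyz = P.sum(axis=0)       # P(y, z)

# I(X;Y|Z) = sum_{x,y,z} P(x,y,z) log[ P(x,y,z) P(z) / (P(x,z) P(y,z)) ]
I = np.sum(P * np.log(P * Pz[None, None, :] / (Pxz[:, None, :] * Pyz[None, :, :])))

# Best Markov approximation Q*(x,y,z) = P(z) P(x|z) P(y|z) = P(x,z) P(y,z) / P(z)
Qstar = Pxz[:, None, :] * Pyz[None, :, :] / Pz[None, None, :]

# Divergence D(P || Q*) agrees with I(X;Y|Z)
D = np.sum(P * np.log(P / Qstar))
```

The agreement here is exact (not just approximate), since ${Q^*}$ differs from ${P}$ by precisely the ratio appearing inside the logarithm in the definition of ${I(X;Y|Z)}$.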

Proof: To keep things simple, I will assume that all random variables take values in finite sets. Consider any distribution ${Q_{XYZ}}$ for which ${X \rightarrow Z \rightarrow Y}$. Then $\displaystyle \begin{array}{rcl} D(P_{XYZ} \| Q_{XYZ}) &=&\, {\mathbb E}\left[ \log \frac{P_{XYZ}(X,Y,Z)}{Q_{XYZ}(X,Y,Z)}\right] \\ &=&\, {\mathbb E}\left[ \log \left(\frac{P_Z(Z)}{Q_Z(Z)} \cdot \frac{P_{X|Z}(X|Z)}{Q_{X|Z}(X|Z)} \cdot \frac{P_{Y|XZ}(Y|X,Z)}{Q_{Y|Z}(Y|Z)}\right)\right] \\ &=&\, {\mathbb E}\left[\log \frac{P_Z(Z)}{Q_Z(Z)}\right] + {\mathbb E}\left[\log \frac{P_{X|Z}(X|Z)}{Q_{X|Z}(X|Z)}\right] + {\mathbb E}\left[\log \frac{P_{Y|XZ}(Y|X,Z)}{Q_{Y|Z}(Y|Z)}\right]. \end{array}$

The first two terms on the right-hand side are equal, respectively, to ${D(P_Z \| Q_Z)}$ and ${D(P_{X|Z} \| Q_{X|Z} | P_Z)}$. For the third term, we can write $\displaystyle \begin{array}{rcl} {\mathbb E}\left[\log \frac{P_{Y|XZ}(Y|X,Z)}{Q_{Y|Z}(Y|Z)}\right] &=&\, {\mathbb E}\left[ \log \frac{P_{XYZ}(X,Y,Z)}{P_{XZ}(X,Z)Q_{Y|Z}(Y|Z)}\right] \\ &=&\, {\mathbb E}\left[ \log \left( \frac{P_{X|YZ}(X|Y,Z)}{P_{X|Z}(X|Z)} \cdot \frac{P_{Y|Z}(Y|Z)}{Q_{Y|Z}(Y|Z)}\right)\right] \\ &=&\, I(X;Y | Z) + D(P_{Y|Z} \| Q_{Y|Z} | P_Z). \end{array}$

Putting everything together, we have $\displaystyle D(P_{XYZ} \| Q_{XYZ}) = D(P_Z \| Q_Z) + D(P_{X|Z} \| Q_{X|Z} | P_Z) + D(P_{Y|Z} \| Q_{Y|Z} | P_Z) + I(X;Y|Z) \ \ \ \ \ (1)$

Now we observe two things:

1. The right-hand side of (1) is greater than or equal to ${I(X;Y|Z)}$, since each of the three divergence terms is nonnegative.
2. The first three terms on the right-hand side are zero if and only if ${Q_{XYZ}}$ is such that ${P_Z = Q_Z}$ and ${P_{X|Z=z} = Q_{X|Z=z}}$, ${P_{Y|Z=z} = Q_{Y|Z=z}}$ for all ${z}$ that satisfy ${P_Z(z) > 0}$.

The distribution ${Q^*_{XYZ}}$ given in the statement of the theorem satisfies all of these conditions, so the minimum is achieved by ${Q^*_{XYZ}}$ and equals ${I(X;Y|Z)}$. This concludes the proof. $\Box$
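The decomposition (1) itself can also be verified numerically. The sketch below (alphabet sizes and seeds are again arbitrary) builds a random ${P_{XYZ}}$ and a random Markov ${Q_{XYZ} = Q_Z Q_{X|Z} Q_{Y|Z}}$, then checks that ${D(P_{XYZ}\|Q_{XYZ})}$ splits into the four terms of (1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random strictly positive joint pmf P(x, y, z); axes are (x, y, z).
P = rng.random((3, 4, 2))
P /= P.sum()

# Arbitrary Markov chain Q(x,y,z) = Q(z) Q(x|z) Q(y|z)
Qz = rng.random(2);       Qz /= Qz.sum()
Qx_z = rng.random((3, 2)); Qx_z /= Qx_z.sum(axis=0)
Qy_z = rng.random((4, 2)); Qy_z /= Qy_z.sum(axis=0)
Q = Qz[None, None, :] * Qx_z[:, None, :] * Qy_z[None, :, :]

Pz = P.sum(axis=(0, 1)); Pxz = P.sum(axis=1); Pyz = P.sum(axis=0)
Px_z = Pxz / Pz          # P(x|z)
Py_z = Pyz / Pz          # P(y|z)

# The four terms of decomposition (1)
D_PQ = np.sum(P * np.log(P / Q))
D_Z = np.sum(Pz * np.log(Pz / Qz))
D_XgZ = np.sum(Pxz * np.log(Px_z / Qx_z))   # D(P_{X|Z} || Q_{X|Z} | P_Z)
D_YgZ = np.sum(Pyz * np.log(Py_z / Qy_z))   # D(P_{Y|Z} || Q_{Y|Z} | P_Z)
I = np.sum(P * np.log(P * Pz[None, None, :] / (Pxz[:, None, :] * Pyz[None, :, :])))
```

In particular, since the three divergence terms are nonnegative, `D_PQ >= I` for every Markov `Q`, which is the content of the theorem.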

Here is one interesting consequence. Consider three random variables ${X}$, ${Y}$ and ${Z}$, where ${X}$ is bounded: ${|X| \le 1}$ almost surely. Let $\displaystyle {\rm MMSE}(X|Y) = \inf_{f} {\mathbb E}\left[(X-f(Y))^2\right] = {\mathbb E}\left[\left(X - {\mathbb E}[X|Y]\right)^2\right]$

be the minimum mean-square error (MMSE) achievable by any (measurable) estimator of ${X}$ based on ${Y}$, and do the same for ${{\rm MMSE}(X|Y,Z)}$. Then ${{\rm MMSE}(X|Y) \ge {\rm MMSE}(X|Y,Z)}$, i.e., using more information for estimation never hurts. However, how much do we gain by using both ${Y}$ and ${Z}$? The following inequality, due to Yihong Wu and Sergio Verdú, gives an upper bound on the excess MMSE in terms of conditional mutual information: $\displaystyle {\rm MMSE}(X|Y) - {\rm MMSE}(X|Y,Z) \le 2 I(X;Z|Y), \ \ \ \ \ (2)$

where the mutual information is measured in nats. This inequality implies (a generalization of) Tao’s inequality, $\displaystyle {\mathbb E}\big|{\mathbb E}[X|Y]-{\mathbb E}[X|Y,Z]\big| \le \sqrt{2 I(X;Z|Y)} \ \ \ \ \ (3)$

Indeed, the square root of the left-hand side of (2) is greater than or equal to the left-hand side of (3): $\displaystyle \begin{array}{rcl} {\rm MMSE}(X|Y) - {\rm MMSE}(X|Y,Z) &=&\, {\mathbb E}\left[(X-{\mathbb E}[X|Y])^2\right] - {\mathbb E}\left[(X-{\mathbb E}[X|Y,Z])^2\right] \\ &=&\, {\mathbb E}\left({\mathbb E}[X|Y,Z]^2\right) - {\mathbb E}\left({\mathbb E}[X|Y]^2\right) \\ &=&\, {\mathbb E}\left[\left( {\mathbb E}[X|Y] - {\mathbb E}[X|Y,Z] \right)^2\right] \\ &\ge&\, \left( {\mathbb E}\big|{\mathbb E}[X|Y] - {\mathbb E}[X|Y,Z]\big|\right)^2, \end{array}$

where the third line is due to the orthogonality principle, whereby $\displaystyle {\mathbb E}\big[(X-{\mathbb E}[X|Y,Z]){\mathbb E}[X|Y]\big] = 0,$

and the last line uses Jensen's inequality. Combining (2) and (3) with Theorem 1 (applied with the roles of ${Y}$ and ${Z}$ interchanged, so that ${I(X;Z|Y)}$ is the minimal divergence), we see that the divergence between ${P_{XYZ}}$ and any other distribution ${Q_{XYZ}}$ under which ${X \rightarrow Y \rightarrow Z}$ is a Markov chain can be lower-bounded as $\displaystyle \begin{array}{rcl} D(P_{XYZ} \| Q_{XYZ}) &\ge&\, \frac{1}{2}\left[ {\rm MMSE}(X|Y) - {\rm MMSE}(X|Y,Z) \right] \\ &\ge&\, \frac{1}{2}\left( {\mathbb E}\big|{\mathbb E}[X|Y] - {\mathbb E}[X|Y,Z]\big| \right)^2. \end{array}$

Therefore, if the addition of ${Z}$ improves the MMSE significantly, then the joint distribution ${P_{XYZ}}$ is far from the set of all distributions ${Q_{XYZ}}$ under which ${X \rightarrow Y \rightarrow Z}$ is a Markov chain.
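Both the Wu–Verdú inequality (2) and Tao's inequality (3) can be spot-checked on a small discrete example. In the sketch below, ${X}$ takes values in ${\{-1,+1\}}$ (so ${|X| \le 1}$ holds), and the alphabets and seed are arbitrary choices; all conditional expectations and ${I(X;Z|Y)}$ (in nats) are computed directly from the joint pmf:

```python
import numpy as np

rng = np.random.default_rng(2)

# Joint pmf over X in {-1, +1} (so |X| <= 1 a.s.), and Y, Z on small alphabets.
xvals = np.array([-1.0, 1.0])
P = rng.random((2, 3, 3))   # axes: (x, y, z)
P /= P.sum()

Py = P.sum(axis=(0, 2))    # P(y)
Pyz = P.sum(axis=0)        # P(y, z)
Pxy = P.sum(axis=2)        # P(x, y)

# Conditional means E[X|Y=y] and E[X|Y=y, Z=z]
Exy = (xvals[:, None] * Pxy).sum(axis=0) / Py
Exyz = (xvals[:, None, None] * P).sum(axis=0) / Pyz

# MMSE(X|Y) = E[X^2] - E[E[X|Y]^2], and likewise with (Y, Z)
EX2 = (xvals**2 * P.sum(axis=(1, 2))).sum()
mmse_y = EX2 - (Py * Exy**2).sum()
mmse_yz = EX2 - (Pyz * Exyz**2).sum()
lhs = mmse_y - mmse_yz

# I(X;Z|Y) in nats
I = np.sum(P * np.log(P * Py[None, :, None] / (Pxy[:, :, None] * Pyz[None, :, :])))

# E| E[X|Y] - E[X|Y,Z] |, the left-hand side of (3)
gap = (Pyz * np.abs(Exy[:, None] - Exyz)).sum()
```

By the argument above, `gap**2 <= lhs` (Jensen) and `lhs <= 2 * I` (inequality (2)), so `gap <= sqrt(2 * I)` as in (3).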