The Information Structuralist

Conditional mutual information and the best Markov approximation

Posted in Information Theory by mraginsky on January 4, 2013

I came across a neat and useful result about conditional mutual information while reading a paper on quantum information theory.

Consider three jointly distributed random variables {X,Y,Z}. If the conditional mutual information {I(X;Y|Z) = 0}, then {X} and {Y} are conditionally independent given {Z} (i.e., {X \rightarrow Z \rightarrow Y} is a Markov chain). In fact, {I(X;Y|Z) = 0} if and only if {X \rightarrow Z \rightarrow Y}. Now, what if {I(X;Y|Z)} is nonzero, but very small? Intuitively, one would expect that the joint distribution {P_{XYZ}} is somehow “close” to being Markov. It turns out that this intuition is, indeed, correct — in fact, the conditional mutual information {I(X;Y|Z)} is the divergence between {P_{XYZ}} and the nearest distribution {Q_{XYZ}} under which {X \rightarrow Z \rightarrow Y} is a Markov chain.

Theorem 1 Let {X,Y,Z} be three jointly distributed random variables. Then

\displaystyle  		I(X;Y|Z) = \min \left\{ D(P_{XYZ} \| Q_{XYZ}) : X \rightarrow Z \rightarrow Y \text{ under } Q \right\}

Moreover, the minimum is achieved by

\displaystyle  		Q^*_{XYZ}(x,y,z) = P_Z(z)P_{X|Z}(x|z)P_{Y|Z}(y|z).
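
As a quick sanity check of this formula, here is a small Python sketch (purely illustrative: the alphabet sizes and the random joint pmf below are arbitrary choices) that verifies numerically that {I(X;Y|Z)} coincides with {D(P_{XYZ} \| Q^*_{XYZ})}:

```python
# Sketch: verify that I(X;Y|Z) equals the divergence between P_XYZ and the
# "Markov projection" Q*_XYZ(x,y,z) = P_Z(z) P_{X|Z}(x|z) P_{Y|Z}(y|z).
import numpy as np

rng = np.random.default_rng(0)
nx, ny, nz = 3, 4, 2

# Random joint pmf P_XYZ, indexed as P[x, y, z].
P = rng.random((nx, ny, nz))
P /= P.sum()

# Marginals and conditionals appearing in the minimizer Q*.
P_z = P.sum(axis=(0, 1))            # P_Z(z)
P_x_given_z = P.sum(axis=1) / P_z   # P_{X|Z}(x|z), shape (nx, nz)
P_y_given_z = P.sum(axis=0) / P_z   # P_{Y|Z}(y|z), shape (ny, nz)

# Q*(x, y, z) = P_Z(z) P_{X|Z}(x|z) P_{Y|Z}(y|z)
Q_star = P_x_given_z[:, None, :] * P_y_given_z[None, :, :] * P_z[None, None, :]

def kl(p, q):
    """D(p || q) in nats, assuming q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def H(p):
    """Shannon entropy in nats of a pmf given as an array."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z), computed from joint entropies.
I_xy_given_z = H(P.sum(axis=1)) + H(P.sum(axis=0)) - H(P_z) - H(P)

print(I_xy_given_z, kl(P, Q_star))  # the two numbers agree up to rounding
```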

Proof: To keep things simple, I will assume that all random variables take values in finite sets. Consider any distribution {Q_{XYZ}} for which {X \rightarrow Z \rightarrow Y}. Then

\displaystyle  \begin{array}{rcl}
	D(P_{XYZ} \| Q_{XYZ}) &=&\, {\mathbb E}\left[ \log \frac{P_{XYZ}(X,Y,Z)}{Q_{XYZ}(X,Y,Z)}\right] \\
	&=&\, {\mathbb E}\left[ \log \left(\frac{P_Z(Z)}{Q_Z(Z)} \cdot \frac{P_{X|Z}(X|Z)}{Q_{X|Z}(X|Z)} \cdot \frac{P_{Y|XZ}(Y|X,Z)}{Q_{Y|Z}(Y|Z)}\right)\right] \\
	&=&\, {\mathbb E}\left[\log \frac{P_Z(Z)}{Q_Z(Z)}\right] + {\mathbb E}\left[\log \frac{P_{X|Z}(X|Z)}{Q_{X|Z}(X|Z)}\right] + {\mathbb E}\left[\log \frac{P_{Y|XZ}(Y|X,Z)}{Q_{Y|Z}(Y|Z)}\right].
\end{array}

The first two terms on the right-hand side are equal, respectively, to {D(P_Z \| Q_Z)} and {D(P_{X|Z} \| Q_{X|Z} | P_Z)}. For the third term, we can write

\displaystyle  \begin{array}{rcl}
	{\mathbb E}\left[\log \frac{P_{Y|XZ}(Y|X,Z)}{Q_{Y|Z}(Y|Z)}\right] &=&\, {\mathbb E}\left[ \log \frac{P_{XYZ}(X,Y,Z)}{P_{XZ}(X,Z)Q_{Y|Z}(Y|Z)}\right] \\
	&=&\, {\mathbb E}\left[ \log \left( \frac{P_{X|YZ}(X|Y,Z)}{P_{X|Z}(X|Z)} \cdot \frac{P_{Y|Z}(Y|Z)}{Q_{Y|Z}(Y|Z)}\right)\right] \\
	&=&\, I(X;Y | Z) + D(P_{Y|Z} \| Q_{Y|Z} | P_Z).
\end{array}

Putting everything together, we have

\displaystyle  		D(P_{XYZ} \| Q_{XYZ}) = D(P_Z \| Q_Z) + D(P_{X|Z} \| Q_{X|Z} | P_Z) + D(P_{Y|Z} \| Q_{Y|Z} | P_Z) + I(X;Y|Z) 	\ \ \ \ \ (1)

Now we observe two things:

  1. The right-hand side of (1) is greater than or equal to {I(X;Y|Z)}.
  2. The first three terms on the right-hand side are zero if and only if {Q_{XYZ}} is such that {P_Z = Q_Z} and {P_{X|Z=z} = Q_{X|Z=z}}, {P_{Y|Z=z} = Q_{Y|Z=z}} for all {z} that satisfy {P_Z(z) > 0}.

In particular, the distribution {Q^*_{XYZ}} from the statement of the theorem is of the required Markov form and matches {P} on exactly these marginals and conditionals, so the first three terms vanish for it, the minimum is attained there, and its value is {I(X;Y|Z)}. This concludes the proof. \Box
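
To see the decomposition (1) in action, here is another small Python sketch (again purely illustrative, with arbitrary alphabet sizes and random distributions): it draws a random {P_{XYZ}} together with a random {Q_{XYZ}} of the Markov form {Q_Z Q_{X|Z} Q_{Y|Z}} and confirms that the two sides of (1) agree, so that in particular {D(P_{XYZ} \| Q_{XYZ}) \ge I(X;Y|Z)}:

```python
# Sketch: check the identity (1) for an arbitrary Markov Q_XYZ = Q_Z Q_{X|Z} Q_{Y|Z}.
import numpy as np

rng = np.random.default_rng(1)
nx, ny, nz = 3, 4, 2

# Random joint pmf P_XYZ, indexed as P[x, y, z].
P = rng.random((nx, ny, nz))
P /= P.sum()

P_z = P.sum(axis=(0, 1))            # P_Z(z)
P_x_given_z = P.sum(axis=1) / P_z   # P_{X|Z}(x|z), columns indexed by z
P_y_given_z = P.sum(axis=0) / P_z   # P_{Y|Z}(y|z), columns indexed by z

# Random Q_XYZ of the Markov form Q_Z(z) Q_{X|Z}(x|z) Q_{Y|Z}(y|z).
Q_z = rng.random(nz); Q_z /= Q_z.sum()
Q_x_given_z = rng.random((nx, nz)); Q_x_given_z /= Q_x_given_z.sum(axis=0)
Q_y_given_z = rng.random((ny, nz)); Q_y_given_z /= Q_y_given_z.sum(axis=0)
Q = Q_x_given_z[:, None, :] * Q_y_given_z[None, :, :] * Q_z[None, None, :]

def kl(p, q):
    """D(p || q) in nats, assuming q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def H(p):
    """Shannon entropy in nats of a pmf given as an array."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Conditional divergences D(P_{.|Z} || Q_{.|Z} | P_Z) = sum_z P_Z(z) D(P_{.|z} || Q_{.|z}).
D_x = sum(P_z[z] * kl(P_x_given_z[:, z], Q_x_given_z[:, z]) for z in range(nz))
D_y = sum(P_z[z] * kl(P_y_given_z[:, z], Q_y_given_z[:, z]) for z in range(nz))

# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z).
I_xy_given_z = H(P.sum(axis=1)) + H(P.sum(axis=0)) - H(P_z) - H(P)

lhs = kl(P, Q)
rhs = kl(P_z, Q_z) + D_x + D_y + I_xy_given_z
print(lhs, rhs)   # identical up to floating-point rounding, and both >= I(X;Y|Z)
```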

Here is one interesting consequence. Consider three random variables {X}, {Y} and {Z}, where {X} is bounded: {|X| \le 1} with probability one. Let

\displaystyle  		{\rm MMSE}(X|Y) = \inf_{f} {\mathbb E}\left[(X-f(Y))^2\right] = {\mathbb E}\left[\left(X - {\mathbb E}[X|Y]\right)^2\right]

be the minimum mean-square error (MMSE) achievable by any (measurable) estimator of {X} based on {Y}, and define {{\rm MMSE}(X|Y,Z)} analogously. Then {{\rm MMSE}(X|Y) \ge {\rm MMSE}(X|Y,Z)}, i.e., using more information for estimation never hurts. However, how much do we gain by using both {Y} and {Z}? The following inequality, due to Yihong Wu and Sergio Verdú, gives an upper bound on the excess MMSE in terms of conditional mutual information:

\displaystyle  	{\rm MMSE}(X|Y) - {\rm MMSE}(X|Y,Z) \le 2 I(X;Z|Y), \ \ \ \ \ (2)

where the mutual information is measured in nats. This inequality implies (a generalization of) Tao’s inequality,

\displaystyle  	{\mathbb E}\big|{\mathbb E}[X|Y]-{\mathbb E}[X|Y,Z]\big| \le \sqrt{2 I(X;Z|Y)} \ \ \ \ \ (3)

— indeed, the square root of the left-hand side of (2) is greater than or equal to the left-hand side of (3):

\displaystyle  \begin{array}{rcl}
	{\rm MMSE}(X|Y) - {\rm MMSE}(X|Y,Z) &=&\, {\mathbb E}\left[(X-{\mathbb E}[X|Y])^2\right] - {\mathbb E}\left[(X-{\mathbb E}[X|Y,Z])^2\right] \\
	&=&\, {\mathbb E}\left[{\mathbb E}[X|Y,Z]^2\right] - {\mathbb E}\left[{\mathbb E}[X|Y]^2\right] \\
	&=&\, {\mathbb E}\left[\left( {\mathbb E}[X|Y] - {\mathbb E}[X|Y,Z] \right)^2\right] \\
	&\ge&\, \left( {\mathbb E}\big|{\mathbb E}[X|Y] - {\mathbb E}[X|Y,Z]\big|\right)^2,
\end{array}

where the third line is due to the orthogonality principle, whereby

\displaystyle  {\mathbb E}\big[(X-{\mathbb E}[X|Y,Z]){\mathbb E}[X|Y]\big] = 0,

and the last line uses Jensen’s inequality. Combining (2) and the chain of inequalities above with Theorem 1 (applied with the roles of {Y} and {Z} interchanged, so that the relevant quantity is {I(X;Z|Y)} and the Markov constraint is {X \rightarrow Y \rightarrow Z}), we see that the divergence between {P_{XYZ}} and any distribution {Q_{XYZ}} under which {X \rightarrow Y \rightarrow Z} is a Markov chain can be lower-bounded as

\displaystyle  \begin{array}{rcl}
	D(P_{XYZ} \| Q_{XYZ}) &\ge&\, \frac{1}{2}\left[ {\rm MMSE}(X|Y) - {\rm MMSE}(X|Y,Z) \right] \\
	&\ge&\, \frac{1}{2}\left( {\mathbb E}\big|{\mathbb E}[X|Y] - {\mathbb E}[X|Y,Z]\big| \right)^2.
\end{array}

Therefore, if the addition of {Z} improves the MMSE significantly, then the joint distribution {P_{XYZ}} is far from the set of all distributions {Q_{XYZ}} under which {X \rightarrow Y \rightarrow Z} is a Markov chain.
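
Here is a quick numerical sanity check of (2) and (3), again as an illustrative Python sketch: the binary alphabet for {X} (so that {|X| \le 1}), the alphabet sizes for {Y} and {Z}, and the random joint pmf are arbitrary choices.

```python
# Sketch: check inequalities (2) and (3) for a random joint pmf with |X| <= 1.
import numpy as np

rng = np.random.default_rng(2)
x_vals = np.array([-1.0, 1.0])        # X takes values in {-1, +1}
nx, ny, nz = len(x_vals), 3, 3

# Random joint pmf P_XYZ, indexed as P[x, y, z].
P = rng.random((nx, ny, nz))
P /= P.sum()

P_y = P.sum(axis=(0, 2))              # P_Y(y)
P_yz = P.sum(axis=0)                  # P_{YZ}(y, z)
P_x_given_y = P.sum(axis=2) / P_y     # P_{X|Y}(x|y), shape (nx, ny)
P_x_given_yz = P / P_yz               # P_{X|YZ}(x|y,z), shape (nx, ny, nz)

# Conditional-mean estimators and the corresponding MMSEs.
E_x_given_y = x_vals @ P_x_given_y                            # E[X|Y=y], shape (ny,)
E_x_given_yz = np.einsum('x,xyz->yz', x_vals, P_x_given_yz)   # E[X|Y=y,Z=z], shape (ny, nz)

mmse_y = float(np.sum(P * (x_vals[:, None, None] - E_x_given_y[None, :, None]) ** 2))
mmse_yz = float(np.sum(P * (x_vals[:, None, None] - E_x_given_yz[None, :, :]) ** 2))

# I(X;Z|Y) in nats: E[ log P_{X|YZ}(X|Y,Z) / P_{X|Y}(X|Y) ].
I_xz_given_y = float(np.sum(P * np.log(P_x_given_yz / P_x_given_y[:, :, None])))

# Inequality (2): excess MMSE is at most 2 I(X;Z|Y).
print(mmse_y - mmse_yz, "<=", 2 * I_xz_given_y)

# Inequality (3): E|E[X|Y] - E[X|Y,Z]| <= sqrt(2 I(X;Z|Y)).
tao_lhs = float(np.sum(P_yz * np.abs(E_x_given_y[:, None] - E_x_given_yz)))
print(tao_lhs, "<=", np.sqrt(2 * I_xz_given_y))
```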
