# The Information Structuralist

## Divergence in everything: Cramér-Rao from data processing

Posted in Information Theory, Statistical Learning and Inference by mraginsky on July 27, 2011

Gather round for another tale of the mighty divergence and its adventures!

(image yoinked from Sergio Verdú’s 2007 Shannon Lecture slides)

This time I want to show how the well-known Cramér–Rao lower bound (or the information inequality) for parameter estimation can be derived from one of the most fundamental results in information theory: the data processing theorem for divergence. What is particularly nice about this derivation is that, along the way, we get to see how the Fisher information controls local curvature of the divergence between two members of a parametric family of distributions.

This will be our starting point:

Theorem 1 (Data processing theorem for divergence) Let ${P}$ and ${Q}$ be two probability distributions for a random variable ${W}$ taking values in some space, and let ${P_{Z|W}}$ be a conditional probability distribution (channel, random transformation) that takes ${W}$ to some other random variable ${Z}$. Then

$\displaystyle D(\bar{P} \| \bar{Q}) \le D(P \| Q),$

where ${\bar{P}}$ and ${\bar{Q}}$ are the marginal distributions of ${Z}$ induced by the channel ${P_{Z|W}}$ and input distributions ${P}$ and ${Q}$.
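
A quick numerical sanity check of Theorem 1 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the alphabets are finite (5 inputs, 3 outputs), the distributions and the channel are drawn at random, and the helper names `kl`, `push_through`, and `random_dist` are made up for this sketch.

```python
import math
import random

def kl(p, q):
    """D(p || q) in nats for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def push_through(p, channel):
    """Output marginal of Z: sum over w of p[w] * channel[w][z]."""
    n_out = len(channel[0])
    return [sum(p[w] * channel[w][z] for w in range(len(p)))
            for z in range(n_out)]

def random_dist(n, rng):
    """A random, strictly positive probability vector of length n."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
P = random_dist(5, rng)
Q = random_dist(5, rng)
# Row w of the channel is the conditional distribution P_{Z|W=w}.
channel = [random_dist(3, rng) for _ in range(5)]

d_in = kl(P, Q)
d_out = kl(push_through(P, channel), push_through(Q, channel))
print(f"D(P||Q) = {d_in:.4f}, D(Pbar||Qbar) = {d_out:.4f}")
```

Whatever seed is used, the second number should never exceed the first.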

I will keep things simple. Let ${\{ P_\theta : \theta \in \Theta\}}$ be a parametric family of probability distributions of a real-valued random variable. I will make several assumptions:

1. The parameter space ${\Theta}$ is an open interval on the real line.
2. Each ${P_\theta}$ has a density ${p_\theta}$.
3. The log-likelihood ratios

$\displaystyle \log \frac{p_\theta(x)}{p_{\theta'}(x)}, \qquad \theta, \theta' \in \Theta; x \in {\mathbb R}$

are sufficiently well-behaved to permit things like interchanging limits and expectations or derivatives and expectations. The precise regularity conditions are listed, e.g., in Chapter 2, Section 6 of Solomon Kullback’s Information Theory and Statistics.

The parameter estimation problem is as follows: We observe a random sample ${X \sim P_{\theta^*}}$, where the parameter ${\theta^* \in \Theta}$ is unknown, and we wish to estimate ${\theta^*}$ by means of a possibly randomized estimator ${P_{Z|X}}$, ${Z \in {\mathbb R}}$. We are interested in the variance of ${Z}$:

$\displaystyle \sigma^2_{\theta^*}(Z) = {\mathbb E}_{\theta^*}\left[ (Z - {\mathbb E}_{\theta^*}Z)^2\right]$

Note, by the way, that if the estimator is unbiased, i.e., ${{\mathbb E}_{\theta^*}Z = \theta^*}$, then the variance of ${Z}$ is simply its mean squared error (MSE):

$\displaystyle \sigma^2_{\theta^*}(Z) = {\mathbb E}_{\theta^*}[(Z - \theta^*)^2].$

We would like to derive a lower bound on the variance of any such estimator ${P_{Z|X}}$.

Step 1: data processing for divergence. For any ${\theta \in \Theta}$, the estimator ${P_{Z|X}}$ induces a distribution ${\bar{P}_\theta}$:

$\displaystyle \bar{P}_\theta(A) = \int_A \int_{{\mathbb R}} P_{Z|X}(dz|x) p_\theta(x) dx$

for all Borel sets ${A \subseteq {\mathbb R}}$. In the following, I will assume that ${P_{Z|X}}$ is also sufficiently well-behaved, so that each ${\bar{P}_\theta}$ has a density ${\bar{p}_\theta}$, and the log-likelihood ratios ${\log (\bar{p}_\theta/\bar{p}_{\theta'})}$ obey the same set of regularity conditions as above.

By the data processing theorem we have

$\displaystyle D(\bar{P}_{\theta^*} \| \bar{P}_\theta) \le D(P_{\theta^*} \| P_\theta), \qquad \forall \theta \in \Theta. \ \ \ \ \ (1)$

Since ${\Theta}$ is an open interval, this inequality holds in particular for all ${\theta}$ in an arbitrarily small neighborhood of ${\theta^*}$.

Step 2: data processing for Fisher information. Now let us look at the second-order Taylor expansions of the two divergences in (1) as ${\theta}$ gets closer to ${\theta^*}$. I will show the details only for ${D(\bar{P}_{\theta^*} \| \bar{P}_\theta)}$. To keep the notation simple, I will write ${\Delta(\theta)}$ for ${D(\bar{P}_{\theta^*} \| \bar{P}_\theta)}$. Then

$\displaystyle \begin{array}{rcl} \Delta(\theta) &=& \Delta(\theta^*) + \Delta'(\theta^*)(\theta - \theta^*) + \frac{1}{2} \Delta''(\theta^*)(\theta - \theta^*)^2 + o(|\theta - \theta^*|^2), \end{array}$

where the regularity conditions ensure that the ${o(|\theta - \theta^*|^2)}$ error term converges to zero uniformly over a sufficiently small neighborhood of ${\theta^*}$. First of all, ${\Delta(\theta^*) = D(\bar{P}_{\theta^*} \| \bar{P}_{\theta^*}) = 0}$. Next,

$\displaystyle \begin{array}{rcl} \Delta'(\theta) &=& \frac{\partial}{\partial \theta} D(\bar{P}_{\theta^*} \| \bar{P}_\theta) \\ &=& \frac{\partial}{\partial \theta} \bar{{\mathbb E}}_{\theta^*}\left[ \log \frac{\bar{p}_{\theta^*}(Z)}{\bar{p}_\theta(Z)}\right] \\ &=& -\int \bar{p}_{\theta^*}(z) \frac{\partial}{\partial \theta} \log \bar{p}_{\theta}(z) dz \\ &=& - \int \bar{p}_{\theta^*}(z) \frac{\frac{\partial}{\partial \theta} \bar{p}_\theta(z)}{\bar{p}_\theta(z)}dz, \end{array}$

which implies that

$\displaystyle \Delta'(\theta^*) = - \frac{\partial}{\partial \theta} \underbrace{\int \bar{p}_\theta(z) dz}_{=1} \Big|_{\theta = \theta^*} = 0.$

Finally,

$\displaystyle \begin{array}{rcl} \Delta''(\theta) &=& - \frac{\partial}{\partial \theta} \int \bar{p}_{\theta^*}(z) \frac{\frac{\partial}{\partial \theta} \bar{p}_\theta(z)}{\bar{p}_\theta(z)}dz \\ &=& - \int \bar{p}_{\theta^*}(z) \frac{\partial}{\partial \theta} \left( \frac{\frac{\partial}{\partial \theta} \bar{p}_\theta(z)}{\bar{p}_\theta(z)} \right) dz \\ &=& \int \bar{p}_{\theta^*}(z) \left(\frac{\frac{\partial}{\partial \theta}\bar{p}_\theta(z)}{\bar{p}_\theta(z)}\right)^2 dz - \int \frac{\bar{p}_{\theta^*}(z)}{\bar{p}_\theta(z)} \frac{\partial^2}{\partial \theta^2} \bar{p}_{\theta}(z) dz \\ &=& \bar{{\mathbb E}}_{\theta^*} \left[ \left( \frac{\partial}{\partial \theta}\log \bar{p}_\theta(Z)\right)^2\right] - \int \frac{\bar{p}_{\theta^*}(z)}{\bar{p}_\theta(z)} \frac{\partial^2}{\partial \theta^2} \bar{p}_{\theta}(z) dz \end{array}$

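(by the way, the partial derivative of the log-likelihood ${\log \bar{p}_\theta(z)}$ with respect to ${\theta}$ is known in statistics as the score function). When ${\theta = \theta^*}$, the first term is precisely the Fisher information ${\bar{J}(\theta^*)}$ contained in the estimate ${Z}$ obtained from ${X \sim P_{\theta^*}}$, while the second term is zero: the ratio ${\bar{p}_{\theta^*}/\bar{p}_\theta}$ equals ${1}$ there, so the integral reduces to ${\frac{\partial^2}{\partial \theta^2} \int \bar{p}_\theta(z) dz}$, the second derivative of the constant ${1}$. Putting all of this together, we can write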

$\displaystyle D(\bar{P}_{\theta^*} \| \bar{P}_\theta) = \frac{1}{2} \bar{J}(\theta^*) (\theta - \theta^*)^2 + o(|\theta - \theta^*|^2). \ \ \ \ \ (2)$

Similarly,

$\displaystyle D(P_{\theta^*} \| P_\theta) = \frac{1}{2} J(\theta^*) (\theta - \theta^*)^2 + o(|\theta - \theta^*|^2), \ \ \ \ \ (3)$

where ${J(\theta^*)}$ is the Fisher information contained in the sample ${X \sim P_{\theta^*}}$.
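
To see the expansion (3) at work, here is a minimal sketch for one concrete family. The exponential family ${p_\theta(x) = \theta e^{-\theta x}}$, ${x \ge 0}$, is an assumption chosen for illustration because both quantities have closed forms: ${D(P_{\theta^*} \| P_\theta) = \log(\theta^*/\theta) + \theta/\theta^* - 1}$ and ${J(\theta) = 1/\theta^2}$. The function names are hypothetical.

```python
import math

def kl_exponential(theta_star, theta):
    """D(Exp(theta_star) || Exp(theta)) in nats, rate parametrization."""
    return math.log(theta_star / theta) + theta / theta_star - 1.0

def fisher_exponential(theta):
    """Fisher information of Exp(theta) with respect to the rate theta."""
    return 1.0 / theta ** 2

def curvature_ratio(theta_star, h):
    """2 * D / h^2, which should approach J(theta_star) as h -> 0."""
    return 2.0 * kl_exponential(theta_star, theta_star + h) / h ** 2

theta_star = 2.0
for h in (1e-1, 1e-2, 1e-3):
    print(f"h = {h:g}: 2D/h^2 = {curvature_ratio(theta_star, h):.6f}, "
          f"J = {fisher_exponential(theta_star):.6f}")
```

As ${h}$ shrinks, the rescaled divergence should settle on the Fisher information, which is exactly the local-curvature statement.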

Substituting (2) and (3) into (1), dividing both sides by ${(\theta - \theta^*)^2}$, and taking the limit ${\theta \rightarrow \theta^*}$, we obtain the data processing theorem for Fisher information:

$\displaystyle \bar{J}(\theta^*) \le J(\theta^*). \ \ \ \ \ (4)$
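
As a quick worked example of (4) (not part of the derivation itself; the Gaussian location family and the additive-noise estimator are assumptions chosen for illustration), take ${P_\theta = \mathcal{N}(\theta, 1)}$ and let the estimator add independent noise: ${Z = X + V}$ with ${V \sim \mathcal{N}(0, s^2)}$. Then ${\bar{P}_\theta = \mathcal{N}(\theta, 1 + s^2)}$, and both divergences are exactly quadratic:

$\displaystyle D(P_{\theta^*} \| P_\theta) = \frac{(\theta - \theta^*)^2}{2}, \qquad D(\bar{P}_{\theta^*} \| \bar{P}_\theta) = \frac{(\theta - \theta^*)^2}{2(1 + s^2)},$

so that ${J(\theta^*) = 1}$ and ${\bar{J}(\theta^*) = 1/(1 + s^2) \le J(\theta^*)}$, with equality only when ${s = 0}$.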

Step 3: use the Schwarz! Once we have (4), the rest is standard. For any ${\theta \in \Theta}$, let us define

$\displaystyle m(\theta) = {\mathbb E}_\theta Z = \int z \bar{p}_\theta(z) dz.$

Then

$\displaystyle \begin{array}{rcl} m'(\theta) &=& \frac{\partial}{\partial \theta} \int z \bar{p}_\theta(z) dz \\ &=& \int z \frac{\partial}{\partial \theta} \bar{p}_\theta(z) dz \\ &=& \int (z - m(\theta)) \frac{\partial}{\partial \theta} \bar{p}_\theta(z) dz, \end{array}$

where the last step uses the fact (which we have already proved) that ${\int \left(\partial \bar{p}_\theta(z)/\partial \theta\right) dz = 0}$. Dividing and multiplying the integrand by ${\bar{p}_\theta(z)}$, we can write

$\displaystyle \begin{array}{rcl} m'(\theta) &=& \int (z - m(\theta)) \frac{\partial}{\partial \theta} \left(\log \bar{p}_\theta(z)\right) \bar{p}_\theta(z) dz \\ &=& \bar{\mathbb E}_\theta \left[ (Z - m(\theta)) \frac{\partial}{\partial \theta}\log \bar{p}_\theta(Z)\right] \end{array}$

Then by the Cauchy–Schwarz inequality we get

$\displaystyle \left[m'(\theta)\right]^2 \le \bar{\mathbb E}_\theta [(Z - m(\theta))^2] \cdot \bar{\mathbb E}_\theta \left[ \left( \frac{\partial}{\partial \theta} \log \bar{p}_\theta(Z) \right)^2\right] = \sigma^2_\theta(Z) \cdot \bar{J}(\theta).$

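Setting ${\theta = \theta^*}$ gives ${\sigma^2_{\theta^*}(Z) \ge \left[m'(\theta^*)\right]^2 / \bar{J}(\theta^*)}$, and since ${\bar{J}(\theta^*) \le J(\theta^*)}$ by (4), we obtain the desired result: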

$\displaystyle \sigma^2_{\theta^*}(Z) \ge \frac{\left[m'(\theta^*)\right]^2}{J(\theta^*)}.$

If the estimator is unbiased, then ${m(\theta) = \theta}$ for all ${\theta}$, so ${m'(\theta^*) = 1}$ and the inequality particularizes to

$\displaystyle \sigma^2_{\theta^*}(Z) \ge \frac{1}{J(\theta^*)},$

which is the usual Cramér–Rao lower bound.
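
The bound is easy to probe by simulation. The following Monte Carlo sketch rests on assumptions made purely for illustration: a single observation ${X \sim \mathcal{N}(\theta^*, \sigma^2)}$, so that ${J(\theta^*) = 1/\sigma^2}$, and three hypothetical estimators: the identity ${Z = X}$, a randomized one ${Z = X + V}$ with ${V \sim \mathcal{N}(0, 0.25)}$, and a biased shrinkage rule ${Z = X/2}$ with ${m'(\theta) = 1/2}$.

```python
import random
import statistics

def simulate_variance(estimator, theta_star, sigma, n_trials, rng):
    """Empirical variance of Z = estimator(X, rng) with X ~ N(theta_star, sigma^2)."""
    estimates = [estimator(rng.gauss(theta_star, sigma), rng)
                 for _ in range(n_trials)]
    return statistics.variance(estimates)

theta_star, sigma, n_trials = 1.5, 1.0, 50_000
fisher = 1.0 / sigma ** 2  # J(theta*) for the Gaussian location family

# name -> (estimator, m'(theta*)); all three are illustrative choices.
estimators = {
    "identity": (lambda x, rng: x, 1.0),
    "noisy": (lambda x, rng: x + rng.gauss(0.0, 0.5), 1.0),
    "shrinkage": (lambda x, rng: 0.5 * x, 0.5),
}

rng = random.Random(2011)
results = {}
for name, (estimator, m_prime) in estimators.items():
    empirical = simulate_variance(estimator, theta_star, sigma, n_trials, rng)
    bound = m_prime ** 2 / fisher  # [m'(theta*)]^2 / J(theta*)
    results[name] = (empirical, bound)
    print(f"{name:>9}: variance ~ {empirical:.4f}, bound = {bound:.4f}")
```

Up to sampling error, the identity and shrinkage estimators should sit right on their bounds, while the extra randomization in the noisy estimator should push its variance strictly above ${1/J(\theta^*)}$.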

Note: I have not been able to find an original reference for deriving the CRLB via data processing for divergence, but I first came across it in this paper.