## Divergence in everything: Cramér-Rao from data processing

Gather round for another tale of the mighty divergence and its adventures!

(image yoinked from Sergio Verdú's 2007 Shannon Lecture slides)

This time I want to show how the well-known Cramér–Rao lower bound (or the information inequality) for parameter estimation can be derived from one of the most fundamental results in information theory: the *data processing theorem for divergence*. What is particularly nice about this derivation is that, along the way, we get to see how the Fisher information controls local curvature of the divergence between two members of a parametric family of distributions.

This will be our starting point:

**Theorem 1 (Data processing theorem for divergence).** Let $P$ and $Q$ be two probability distributions for a random variable $W$ taking values in some space $\mathsf{W}$, and let $P_{Z|W}$ be a conditional probability distribution (channel, random transformation) that takes $W$ to some other random variable $Z$. Then

$$D(\bar{P} \,\|\, \bar{Q}) \le D(P \,\|\, Q),$$

where $\bar{P}$ and $\bar{Q}$ are the marginal distributions of $Z$ induced by the channel $P_{Z|W}$ and the input distributions $P$ and $Q$, respectively.
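For the skeptical reader, here is a quick numerical sanity check of the theorem in the finite-alphabet case. This is only a sketch: the alphabet sizes, the random seed, and the names `kl_divergence`, `P`, `Q`, and `channel` are my own choices for illustration and do not come from any particular reference.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in nats, for discrete distributions with q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)

# Two hypothetical input distributions on a 5-letter alphabet.
P = rng.dirichlet(np.ones(5))
Q = rng.dirichlet(np.ones(5))

# A hypothetical channel: row w is the conditional distribution of Z given W = w.
channel = rng.dirichlet(np.ones(3), size=5)  # shape (5, 3)

# Output marginals induced by pushing P and Q through the same channel.
P_bar = P @ channel
Q_bar = Q @ channel

d_in = kl_divergence(P, Q)
d_out = kl_divergence(P_bar, Q_bar)
print(f"D(P||Q) = {d_in:.4f}, D(P_bar||Q_bar) = {d_out:.4f}")
```

Whatever channel you plug in, the output divergence should never exceed the input divergence.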

I will keep things simple. Let $\{P_\theta : \theta \in \Theta\}$ be a parametric family of probability distributions of a real-valued random variable $X$. I will make several assumptions:

- The *parameter space* $\Theta$ is an open interval on the real line.
- Each $P_\theta$ has a density $p_\theta$.
- The log-likelihood ratios $\log \dfrac{p_\theta(x)}{p_{\theta'}(x)}$ are sufficiently well-behaved to permit things like interchanging limits and expectations or derivatives and expectations. The precise regularity conditions are listed, e.g., in Chapter 2, Section 6 of Solomon Kullback's *Information Theory and Statistics*.

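The parameter estimation problem is as follows: We observe a random sample $X \sim P_\theta$, where the parameter $\theta \in \Theta$ is unknown, and we wish to estimate $\theta$ by means of a possibly randomized estimator $P_{\hat{\theta}|X}$, $\hat{\theta} \in \mathbb{R}$. We are interested in the *variance* of $\hat{\theta}$:

$$\mathrm{Var}_\theta(\hat{\theta}) = \mathbb{E}_\theta\big[(\hat{\theta} - \mathbb{E}_\theta \hat{\theta})^2\big].$$

Note, by the way, that if the estimator is unbiased, i.e., $\mathbb{E}_\theta \hat{\theta} = \theta$, then the variance of $\hat{\theta}$ is simply its mean squared error (MSE):

$$\mathrm{Var}_\theta(\hat{\theta}) = \mathbb{E}_\theta\big[(\hat{\theta} - \theta)^2\big].$$

We would like to derive a lower bound on the variance of any such estimator $P_{\hat{\theta}|X}$.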

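**Step 1: data processing for divergence.** For any $\theta \in \Theta$, the estimator $P_{\hat{\theta}|X}$ induces a distribution $Q_\theta$ of $\hat{\theta}$:

$$Q_\theta(A) = \int_{\mathbb{R}} P_{\hat{\theta}|X}(A \mid x)\, p_\theta(x)\, dx$$

for all Borel sets $A \subseteq \mathbb{R}$. In the following, I will assume that $P_{\hat{\theta}|X}$ is also sufficiently well-behaved, so that each $Q_\theta$ has a density $q_\theta$, and the log-likelihood ratios $\log \dfrac{q_\theta(u)}{q_{\theta'}(u)}$ obey the same set of regularity conditions as above.

By the data processing theorem we have

$$D(Q_\theta \,\|\, Q_{\theta'}) \le D(P_\theta \,\|\, P_{\theta'}) \qquad \text{for all } \theta, \theta' \in \Theta. \tag{1}$$

Since $\Theta$ is an open interval, this inequality holds in particular for all $\theta'$ in an arbitrarily small neighborhood of $\theta$.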

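**Step 2: data processing for Fisher information.** Now let us look at the second-order Taylor expansions of the two divergences in (1) as $\theta'$ gets closer to $\theta$. I will show the details only for $D(Q_\theta \,\|\, Q_{\theta'})$. To keep the notation simple, I will write $D(\theta \,\|\, \theta')$ for $D(Q_\theta \,\|\, Q_{\theta'})$. Then

$$D(\theta \,\|\, \theta') = D(\theta \,\|\, \theta) + (\theta' - \theta)\, \frac{\partial}{\partial \theta'} D(\theta \,\|\, \theta') \Big|_{\theta' = \theta} + \frac{1}{2} (\theta' - \theta)^2\, \frac{\partial^2}{\partial \theta'^2} D(\theta \,\|\, \theta') \Big|_{\theta' = \theta} + o\big((\theta' - \theta)^2\big),$$

where the regularity conditions ensure that the error term converges to zero uniformly over a sufficiently small neighborhood of $\theta$. First of all, $D(\theta \,\|\, \theta) = 0$. Next,

$$\frac{\partial}{\partial \theta'} D(\theta \,\|\, \theta') = \frac{\partial}{\partial \theta'} \int q_\theta(u) \log \frac{q_\theta(u)}{q_{\theta'}(u)}\, du = -\int \frac{q_\theta(u)}{q_{\theta'}(u)}\, \frac{\partial q_{\theta'}(u)}{\partial \theta'}\, du,$$

which implies that

$$\frac{\partial}{\partial \theta'} D(\theta \,\|\, \theta') \Big|_{\theta' = \theta} = -\int \frac{\partial q_\theta(u)}{\partial \theta}\, du = -\frac{\partial}{\partial \theta} \int q_\theta(u)\, du = 0.$$

Finally,

$$\frac{\partial^2}{\partial \theta'^2} D(\theta \,\|\, \theta') = -\int q_\theta(u)\, \frac{\partial^2}{\partial \theta'^2} \log q_{\theta'}(u)\, du = \int q_\theta(u) \left( \frac{\partial}{\partial \theta'} \log q_{\theta'}(u) \right)^2 du - \int \frac{q_\theta(u)}{q_{\theta'}(u)}\, \frac{\partial^2 q_{\theta'}(u)}{\partial \theta'^2}\, du$$

(by the way, the partial derivative of the log-likelihood with respect to the parameter is known in statistics as the *score function*). When $\theta' = \theta$, the first term is precisely the *Fisher information* $J_{\hat{\theta}}(\theta) = \mathbb{E}_\theta\big[ \big( \tfrac{\partial}{\partial \theta} \log q_\theta(\hat{\theta}) \big)^2 \big]$ contained in the estimate $\hat{\theta}$ obtained from $X$, while the second term is zero. Putting all of this together, we can write

$$D(Q_\theta \,\|\, Q_{\theta'}) = \frac{1}{2} J_{\hat{\theta}}(\theta)\, (\theta' - \theta)^2 + o\big((\theta' - \theta)^2\big) \tag{2}$$

and, by exactly the same argument applied to the densities $p_\theta$,

$$D(P_\theta \,\|\, P_{\theta'}) = \frac{1}{2} J_X(\theta)\, (\theta' - \theta)^2 + o\big((\theta' - \theta)^2\big), \tag{3}$$

where $J_X(\theta) = \mathbb{E}_\theta\big[ \big( \tfrac{\partial}{\partial \theta} \log p_\theta(X) \big)^2 \big]$ is the Fisher information contained in the sample $X$.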
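To see the "Fisher information is the local curvature of divergence" statement in action, here is a small numerical sketch. I am assuming a Bernoulli family with success probability $\theta$, which is not the example used in this post; for that family the Fisher information is $1/(\theta(1-\theta))$. The function names and the value $\theta = 0.3$ are hypothetical choices of mine.

```python
import numpy as np

def kl_bernoulli(theta, theta_prime):
    """D(Bernoulli(theta) || Bernoulli(theta_prime)) in nats."""
    return (theta * np.log(theta / theta_prime)
            + (1.0 - theta) * np.log((1.0 - theta) / (1.0 - theta_prime)))

def fisher_bernoulli(theta):
    """Fisher information of the Bernoulli(theta) family at theta."""
    return 1.0 / (theta * (1.0 - theta))

theta = 0.3  # hypothetical base point in the open interval (0, 1)
ratios = []
for delta in [1e-1, 1e-2, 1e-3]:
    exact = kl_bernoulli(theta, theta + delta)
    quadratic = 0.5 * fisher_bernoulli(theta) * delta**2
    ratios.append(exact / quadratic)
    print(f"delta = {delta:g}: exact / quadratic = {ratios[-1]:.5f}")
```

The ratio of the exact divergence to its quadratic approximation should approach 1 as the perturbation shrinks, which is all that the little-o term is claiming.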

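Substituting (2) and (3) into (1), dividing both sides by $(\theta' - \theta)^2$, and taking the limit $\theta' \to \theta$, we obtain the *data processing theorem for Fisher’s information*:

$$J_{\hat{\theta}}(\theta) \le J_X(\theta). \tag{4}$$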
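Here is a toy illustration of how processing can only lose Fisher information. The setup is my own assumption, not part of the derivation: $X \sim N(\theta, 1)$, so the sample carries Fisher information 1, and the "processing" is a one-bit quantizer $Z = \mathbf{1}\{X > 0\}$. Then $Z$ is Bernoulli with success probability $\Phi(\theta)$, and its Fisher information is $\phi(\theta)^2 / [\Phi(\theta)(1 - \Phi(\theta))]$, where $\phi$ and $\Phi$ are the standard normal density and distribution function.

```python
import math

def std_normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def fisher_info_one_bit(theta):
    """Fisher information about theta in Z = 1{X > 0}, with X ~ N(theta, 1).

    Z is Bernoulli(p(theta)) with p(theta) = Phi(theta), and a Bernoulli(p(theta))
    observation carries Fisher information p'(theta)^2 / (p(theta) (1 - p(theta))).
    """
    p = std_normal_cdf(theta)
    return std_normal_pdf(theta) ** 2 / (p * (1.0 - p))

fisher_x = 1.0  # Fisher information in X ~ N(theta, 1), for every theta

values = {theta: fisher_info_one_bit(theta) for theta in [-2.0, -1.0, 0.0, 1.0, 2.0]}
for theta, j_z in values.items():
    print(f"theta = {theta:+.1f}: J_Z = {j_z:.4f} <= J_X = {fisher_x:.4f}")
```

At $\theta = 0$ the quantized observation retains a fraction $2/\pi$ of the Fisher information, and the fraction only gets smaller as $\theta$ moves away from the quantizer threshold.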

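**Step 3: use the Schwarz!** Once we have (4), the rest is standard. For any $\theta \in \Theta$, let us define

$$m(\theta) = \mathbb{E}_\theta \hat{\theta} = \int u\, q_\theta(u)\, du.$$

Then

$$m'(\theta) = \int u\, \frac{\partial q_\theta(u)}{\partial \theta}\, du = \int \big(u - m(\theta)\big)\, \frac{\partial q_\theta(u)}{\partial \theta}\, du,$$

where the last step uses the fact (which we have already proved) that $\int \frac{\partial q_\theta(u)}{\partial \theta}\, du = 0$. Dividing and multiplying the integrand by $q_\theta(u)$, we can write

$$m'(\theta) = \int \big(u - m(\theta)\big)\, \frac{\partial \log q_\theta(u)}{\partial \theta}\, q_\theta(u)\, du = \mathbb{E}_\theta\left[ \big(\hat{\theta} - m(\theta)\big)\, \frac{\partial}{\partial \theta} \log q_\theta(\hat{\theta}) \right].$$

Then by the Cauchy–Schwarz inequality we get

$$m'(\theta)^2 \le \mathrm{Var}_\theta(\hat{\theta})\, J_{\hat{\theta}}(\theta),$$

where $J_{\hat{\theta}}(\theta) = \mathbb{E}_\theta\big[ \big( \tfrac{\partial}{\partial \theta} \log q_\theta(\hat{\theta}) \big)^2 \big]$ is the Fisher information contained in $\hat{\theta}$. Using this with $J_{\hat{\theta}}(\theta) \le J_X(\theta)$ from (4), we obtain the desired result:

$$\mathrm{Var}_\theta(\hat{\theta}) \ge \frac{m'(\theta)^2}{J_X(\theta)}.$$

If the estimator is unbiased, then $m'(\theta) = 1$ for all $\theta$, and the inequality particularizes to

$$\mathrm{Var}_\theta(\hat{\theta}) \ge \frac{1}{J_X(\theta)},$$

which is the usual Cramér–Rao lower bound.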
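Since the derivation allows randomized estimators, here is a Monte Carlo sketch of the bound for one. Everything in it is an assumption of mine for illustration: a single observation $X \sim N(\theta, 1)$ (so the Fisher information in the sample is 1), and the estimator $\hat{\theta} = X + N$, where $N$ is independent zero-mean Gaussian noise with standard deviation `tau`. This estimator is unbiased with variance $1 + \tau^2$, so it should sit strictly above the bound.

```python
import numpy as np

rng = np.random.default_rng(1)

theta = 0.7         # true parameter (hypothetical value)
tau = 0.5           # std of the estimator's internal randomization (hypothetical)
n_trials = 200_000  # number of independent Monte Carlo repetitions

# One observation per trial: X ~ N(theta, 1), hence J_X(theta) = 1.
x = rng.normal(loc=theta, scale=1.0, size=n_trials)
fisher_x = 1.0

# Randomized unbiased estimator: add independent N(0, tau^2) noise to X.
theta_hat = x + rng.normal(loc=0.0, scale=tau, size=n_trials)

empirical_mean = theta_hat.mean()
empirical_var = theta_hat.var()
crlb = 1.0 / fisher_x

print(f"mean of theta_hat  = {empirical_mean:.4f} (theta = {theta})")
print(f"variance           = {empirical_var:.4f} (theory: {1.0 + tau**2:.4f})")
print(f"Cramer-Rao bound   = {crlb:.4f}")
```

Setting `tau = 0` recovers the estimator $\hat{\theta} = X$, which meets the bound with equality in this Gaussian example.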

**Note:** I have not been able to find an original reference for deriving the CRLB via data processing for divergence, but I first came across it in this paper.
