Divergence in everything: Cramér–Rao from data processing
Gather round for another tale of the mighty divergence and its adventures!
This time I want to show how the well-known Cramér–Rao lower bound (or the information inequality) for parameter estimation can be derived from one of the most fundamental results in information theory: the data processing theorem for divergence. What is particularly nice about this derivation is that, along the way, we get to see how the Fisher information controls local curvature of the divergence between two members of a parametric family of distributions.
This will be our starting point:
Theorem 1 (Data processing theorem for divergence) Let $P$ and $Q$ be two probability distributions for a random variable $X$ taking values in some space $\mathsf{X}$, and let $K$ be a conditional probability distribution (channel, random transformation) that takes $X$ to some other random variable $Y$. Then

$$D(P_Y \| Q_Y) \le D(P \| Q),$$

where $P_Y$ and $Q_Y$ are the marginal distributions of $Y$ induced by the channel $K$ and the input distributions $P$ and $Q$.
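To make the theorem concrete, here is a small numerical check (a toy example of my own, not from the original): a three-symbol input pushed through a binary-output channel can only lose divergence.

```python
import math

def kl(p, q):
    """KL divergence D(p||q) in nats between discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two input distributions on a three-symbol alphabet.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]

# A channel K: row i is the conditional law of the binary output given input i.
K = [[0.9, 0.1],
     [0.5, 0.5],
     [0.1, 0.9]]

def push(dist, chan):
    """Marginal distribution of the channel output for the given input law."""
    return [sum(dist[i] * chan[i][j] for i in range(len(dist)))
            for j in range(len(chan[0]))]

PY, QY = push(P, K), push(Q, K)

# Data processing: the output divergence never exceeds the input divergence.
assert kl(PY, QY) <= kl(P, Q)
```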
I will keep things simple. Let $\{ P_\theta : \theta \in \Theta \}$ be a parametric family of probability distributions of a real-valued random variable $X$. I will make several assumptions:
- The parameter space $\Theta$ is an open interval on the real line.
- Each $P_\theta$ has a density $p_\theta$.
- The log-likelihood ratios

$$\log \frac{p_\theta(x)}{p_{\theta'}(x)}, \qquad \theta, \theta' \in \Theta,$$

are sufficiently well-behaved to permit things like interchanging limits and expectations or derivatives and expectations. The precise regularity conditions are listed, e.g., in Chapter 2, Section 6 of Solomon Kullback's Information Theory and Statistics.
The parameter estimation problem is as follows: We observe a random sample $X \sim P_\theta$, where the parameter $\theta \in \Theta$ is unknown, and we wish to estimate $\theta$ by means of a possibly randomized estimator $\hat{\theta} = \hat{\theta}(X)$. We are interested in the variance of $\hat{\theta}$:

$$\mathrm{Var}_\theta[\hat{\theta}] = \mathbb{E}_\theta\left[ \big( \hat{\theta} - \mathbb{E}_\theta[\hat{\theta}] \big)^2 \right].$$

Note, by the way, that if the estimator is unbiased, i.e., $\mathbb{E}_\theta[\hat{\theta}] = \theta$ for all $\theta \in \Theta$, then the variance of $\hat{\theta}$ is simply its mean squared error (MSE):

$$\mathrm{Var}_\theta[\hat{\theta}] = \mathbb{E}_\theta\left[ \big( \hat{\theta} - \theta \big)^2 \right].$$

We would like to derive a lower bound on the variance of any such estimator $\hat{\theta}$.
Step 1: data processing for divergence. For any $\theta \in \Theta$, the estimator $\hat{\theta}$ induces a distribution $\tilde{P}_\theta$:

$$\tilde{P}_\theta(B) = P_\theta\big( \hat{\theta} \in B \big)$$

for all Borel sets $B \subseteq \mathbb{R}$. In the following, I will assume that $\hat{\theta}$ is also sufficiently well-behaved, so that each $\tilde{P}_\theta$ has a density $\tilde{p}_\theta$, and the log-likelihood ratios $\log \frac{\tilde{p}_\theta(y)}{\tilde{p}_{\theta'}(y)}$ obey the same set of regularity conditions as above. Since $\hat{\theta}$ is obtained from $X$ by a (possibly randomized) transformation, the data processing theorem gives, for any $\theta, \theta' \in \Theta$,

$$D\big( \tilde{P}_\theta \,\|\, \tilde{P}_{\theta'} \big) \le D\big( P_\theta \,\|\, P_{\theta'} \big). \qquad (1)$$

Since $\Theta$ is an open interval, this inequality holds in particular for all $\theta'$ in an arbitrarily small neighborhood of $\theta$.
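As a quick illustration of Step 1 (a toy example of my own, with a discrete sample standing in for the real-valued one), take two i.i.d. Bernoulli observations: the sample mean is a sufficient statistic, so (1) holds with equality for it, while an estimator that discards the second observation loses divergence strictly.

```python
import math

def kl(p, q):
    """KL divergence (nats) between discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def joint(t):
    # Law of (X1, X2) for i.i.d. Bernoulli(t): outcomes 00, 01, 10, 11.
    return [(1 - t) ** 2, (1 - t) * t, t * (1 - t), t * t]

def induced_mean(t):
    # Law of the estimator (X1 + X2) / 2: outcomes 0, 1/2, 1.
    return [(1 - t) ** 2, 2 * t * (1 - t), t * t]

def induced_first(t):
    # Law of the lossy estimator that just reports X1 and ignores X2.
    return [1 - t, t]

t, s = 0.3, 0.4
d_joint = kl(joint(t), joint(s))
d_mean = kl(induced_mean(t), induced_mean(s))
d_first = kl(induced_first(t), induced_first(s))

assert d_mean <= d_joint + 1e-12   # (1) holds; equality, since the mean is sufficient
assert d_first < d_joint           # strict inequality: X2 has been thrown away
```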
Step 2: data processing for Fisher information. Now let us look at the second-order Taylor expansions of the two divergences in (1) as $\theta'$ gets closer to $\theta$. I will show the details only for $D(\tilde{P}_\theta \| \tilde{P}_{\theta'})$. To keep the notation simple, I will write $\tilde{D}(\theta \| \theta')$ for $D(\tilde{P}_\theta \| \tilde{P}_{\theta'})$. Then

$$\tilde{D}(\theta \| \theta') = \tilde{D}(\theta \| \theta) + (\theta' - \theta)\, \frac{\partial}{\partial \theta'} \tilde{D}(\theta \| \theta') \Big|_{\theta' = \theta} + \frac{(\theta' - \theta)^2}{2}\, \frac{\partial^2}{\partial \theta'^2} \tilde{D}(\theta \| \theta') \Big|_{\theta' = \theta} + o\big( (\theta' - \theta)^2 \big),$$

where the regularity conditions ensure that the error term converges to zero uniformly over a sufficiently small neighborhood of $\theta$. First of all, $\tilde{D}(\theta \| \theta) = 0$. Next,

$$\frac{\partial}{\partial \theta'} \tilde{D}(\theta \| \theta') = -\int \tilde{p}_\theta(y)\, \frac{\partial}{\partial \theta'} \log \tilde{p}_{\theta'}(y)\, dy = -\int \frac{\tilde{p}_\theta(y)}{\tilde{p}_{\theta'}(y)}\, \frac{\partial \tilde{p}_{\theta'}(y)}{\partial \theta'}\, dy,$$

which implies that

$$\frac{\partial}{\partial \theta'} \tilde{D}(\theta \| \theta') \Big|_{\theta' = \theta} = -\int \frac{\partial \tilde{p}_\theta(y)}{\partial \theta}\, dy = -\frac{\partial}{\partial \theta} \int \tilde{p}_\theta(y)\, dy = 0$$

(by the way, the partial derivative of the log-likelihood with respect to $\theta$ is known in statistics as the score function). Differentiating once more, we get

$$\frac{\partial^2}{\partial \theta'^2} \tilde{D}(\theta \| \theta') = \int \frac{\tilde{p}_\theta(y)}{\tilde{p}_{\theta'}(y)^2} \left( \frac{\partial \tilde{p}_{\theta'}(y)}{\partial \theta'} \right)^2 dy - \int \frac{\tilde{p}_\theta(y)}{\tilde{p}_{\theta'}(y)}\, \frac{\partial^2 \tilde{p}_{\theta'}(y)}{\partial \theta'^2}\, dy.$$

When $\theta' = \theta$, the first term is precisely the Fisher information $\tilde{J}(\theta) = \mathbb{E}_\theta\big[ \big( \frac{\partial}{\partial \theta} \log \tilde{p}_\theta(\hat{\theta}) \big)^2 \big]$ contained in the estimate $\hat{\theta}$ obtained from $X$, while the second term is zero, by the same interchange of differentiation and integration as before. Putting all of this together, we can write

$$D(\tilde{P}_\theta \| \tilde{P}_{\theta'}) = \frac{(\theta' - \theta)^2}{2}\, \tilde{J}(\theta) + o\big( (\theta' - \theta)^2 \big) \qquad (2)$$

and, similarly,

$$D(P_\theta \| P_{\theta'}) = \frac{(\theta' - \theta)^2}{2}\, J(\theta) + o\big( (\theta' - \theta)^2 \big), \qquad (3)$$

where $J(\theta)$ is the Fisher information contained in the sample $X$. Substituting (2) and (3) into (1), dividing through by $(\theta' - \theta)^2 / 2$, and letting $\theta' \to \theta$, we obtain a data processing inequality for Fisher information:

$$\tilde{J}(\theta) \le J(\theta). \qquad (4)$$
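The quadratic behavior of the divergence described in Step 2 can be checked numerically. In this sketch (my own, not from the post) I use the exponential family $p_\theta(x) = \theta e^{-\theta x}$, for which the divergence has the closed form $D(P_\theta \| P_{\theta'}) = \log(\theta/\theta') + \theta'/\theta - 1$ and the Fisher information is $J(\theta) = 1/\theta^2$:

```python
import math

# Toy check of D(P_t || P_t') ~ (1/2) J(t) (t' - t)^2 for the exponential
# family p_t(x) = t * exp(-t x) on x > 0.

def kl_exp(a, b):
    # Closed-form divergence D(Exponential(a) || Exponential(b)).
    return math.log(a / b) + b / a - 1.0

theta = 2.0
fisher = 1.0 / theta ** 2   # J(theta) = 1 / theta^2 for this family

# As theta' approaches theta, 2 D / (theta' - theta)^2 approaches J(theta).
for gap in (1e-1, 1e-2, 1e-3):
    curvature = 2.0 * kl_exp(theta, theta + gap) / gap ** 2
    print(gap, curvature)

assert abs(2.0 * kl_exp(theta, theta + 1e-3) / 1e-6 - fisher) < 1e-3
```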
Step 3: from Fisher information to variance. Let $\psi(\theta) = \mathbb{E}_\theta[\hat{\theta}]$ denote the expectation of the estimator. Then, using the regularity conditions,

$$\psi'(\theta) = \frac{\partial}{\partial \theta} \int y\, \tilde{p}_\theta(y)\, dy = \int y\, \frac{\partial \tilde{p}_\theta(y)}{\partial \theta}\, dy = \int y\, \frac{\partial \log \tilde{p}_\theta(y)}{\partial \theta}\, \tilde{p}_\theta(y)\, dy = \int \big( y - \psi(\theta) \big)\, \frac{\partial \log \tilde{p}_\theta(y)}{\partial \theta}\, \tilde{p}_\theta(y)\, dy,$$

where the last step uses the fact (which we have already proved) that $\mathbb{E}_\theta\big[ \frac{\partial}{\partial \theta} \log \tilde{p}_\theta(\hat{\theta}) \big] = 0$. Dividing and multiplying the integrand by $\sqrt{\tilde{p}_\theta(y)}$, we can write

$$\psi'(\theta) = \int \Big[ \big( y - \psi(\theta) \big) \sqrt{\tilde{p}_\theta(y)} \Big] \Big[ \frac{\partial \log \tilde{p}_\theta(y)}{\partial \theta} \sqrt{\tilde{p}_\theta(y)} \Big]\, dy.$$

Then by the Cauchy–Schwarz inequality we get

$$\big| \psi'(\theta) \big|^2 \le \left( \int \big( y - \psi(\theta) \big)^2\, \tilde{p}_\theta(y)\, dy \right) \left( \int \left( \frac{\partial \log \tilde{p}_\theta(y)}{\partial \theta} \right)^2 \tilde{p}_\theta(y)\, dy \right) = \mathrm{Var}_\theta[\hat{\theta}]\, \tilde{J}(\theta).$$

Using this together with the bound $\tilde{J}(\theta) \le J(\theta)$ in (4), we obtain the desired result:

$$\mathrm{Var}_\theta[\hat{\theta}] \ge \frac{\big| \psi'(\theta) \big|^2}{J(\theta)}.$$
If the estimator is unbiased, then $\psi(\theta) = \theta$ for all $\theta \in \Theta$, so that $\psi'(\theta) = 1$, and the inequality particularizes to

$$\mathrm{Var}_\theta[\hat{\theta}] \ge \frac{1}{J(\theta)},$$

which is the usual Cramér–Rao lower bound.
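To close the loop, here is a quick sanity check of the unbiased bound on a standard textbook example (my own illustration, not from the post): for $n$ i.i.d. Bernoulli($\theta$) observations, the per-sample Fisher information is $1/(\theta(1-\theta))$, and the sample mean, which is unbiased, attains the bound with equality.

```python
# Cramér–Rao sanity check: n i.i.d. Bernoulli(theta) samples.
theta, n = 0.3, 50

# Score function d/dtheta log p_theta(x) for a single Bernoulli observation.
def score(x):
    return x / theta - (1 - x) / (1 - theta)

# Per-sample Fisher information: expectation of the squared score.
J1 = theta * score(1) ** 2 + (1 - theta) * score(0) ** 2

var_mean = theta * (1 - theta) / n   # variance of the (unbiased) sample mean
crlb = 1.0 / (n * J1)                # Cramér–Rao lower bound for n samples

assert abs(J1 - 1.0 / (theta * (1 - theta))) < 1e-9  # J1 = 1 / (theta (1 - theta))
assert var_mean >= crlb - 1e-15                      # the bound holds...
assert abs(var_mean - crlb) < 1e-12                  # ...with equality here
```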
Note: I have not been able to find an original reference for deriving the CRLB via data processing for divergence, but I first came across it in this paper.