Divergence in everything: Cramér-Rao from data processing
Gather round for another tale of the mighty divergence and its adventures!

(image yoinked from Sergio Verdú's 2007 Shannon Lecture slides)
This time I want to show how the well-known Cramér–Rao lower bound (or the information inequality) for parameter estimation can be derived from one of the most fundamental results in information theory: the data processing theorem for divergence. What is particularly nice about this derivation is that, along the way, we get to see how the Fisher information controls the local curvature of the divergence between two members of a parametric family of distributions.
This will be our starting point:
Theorem 1 (Data processing theorem for divergence) Let $P$ and $Q$ be two probability distributions for a random variable $W$ taking values in some space $\mathsf{W}$, and let $P_{Z|W}$ be a conditional probability distribution (channel, random transformation) that takes $W$ to some other random variable $Z$. Then
$$D(P_Z \,\|\, Q_Z) \le D(P \,\|\, Q),$$
where $P_Z$ and $Q_Z$ are the marginal distributions of $Z$ induced by the channel $P_{Z|W}$ and input distributions $P$ and $Q$, respectively.
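Here is a quick numerical sanity check of Theorem 1 on finite alphabets. It is only a sketch: the distributions `P`, `Q` and the `channel` matrix are made-up values, and the helper names `kl` and `push_forward` are hypothetical.

```python
import math

def kl(p, q):
    """Divergence D(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def push_forward(p, channel):
    """Marginal of Z when W ~ p goes through channel[w][z] = P(Z = z | W = w)."""
    n_out = len(channel[0])
    return [sum(p[w] * channel[w][z] for w in range(len(p))) for z in range(n_out)]

# Hypothetical input distributions on a 3-letter alphabet.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]

# Hypothetical noisy channel from 3 input letters to 2 output letters (rows sum to 1).
channel = [
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
]

P_Z = push_forward(P, channel)
Q_Z = push_forward(Q, channel)

d_in = kl(P, Q)
d_out = kl(P_Z, Q_Z)
print(f"D(P || Q)     = {d_in:.4f}")
print(f"D(P_Z || Q_Z) = {d_out:.4f}")
```

Whatever row-stochastic matrix is used for `channel`, the second number should never exceed the first.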
I will keep things simple. Let $\{P_\theta : \theta \in \Theta\}$ be a parametric family of probability distributions of a real-valued random variable $X$. I will make several assumptions:
- The parameter space $\Theta$ is an open interval on the real line.
- Each $P_\theta$ has a density $p_\theta$.
- The log-likelihood ratios $\log \frac{p_\theta(x)}{p_{\theta'}(x)}$ are sufficiently well-behaved to permit things like interchanging limits and expectations or derivatives and expectations. The precise regularity conditions are listed, e.g., in Chapter 2, Section 6 of Solomon Kullback's Information Theory and Statistics.
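The parameter estimation problem is as follows: We observe a random sample $X \sim P_\theta$, where the parameter $\theta \in \Theta$ is unknown, and we wish to estimate $\theta$ by means of a possibly randomized estimator $P_{\hat{\theta}|X}$, i.e., a channel that takes $X$ to an estimate $\hat{\theta}$. We are interested in the variance of $\hat{\theta}$:
$$\mathrm{Var}_\theta[\hat{\theta}] = \mathbb{E}_\theta\big[(\hat{\theta} - \mathbb{E}_\theta \hat{\theta})^2\big].$$
Note, by the way, that if the estimator is unbiased, i.e., $\mathbb{E}_\theta \hat{\theta} = \theta$ for all $\theta \in \Theta$, then the variance of $\hat{\theta}$ is simply its mean squared error (MSE):
$$\mathrm{Var}_\theta[\hat{\theta}] = \mathbb{E}_\theta\big[(\hat{\theta} - \theta)^2\big].$$
We would like to derive a lower bound on the variance of any such estimator $P_{\hat{\theta}|X}$.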
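Step 1: data processing for divergence. For any $\theta \in \Theta$, the estimator $P_{\hat{\theta}|X}$ induces a distribution $Q_\theta$ of the estimate $\hat{\theta}$:
$$Q_\theta(A) = \int_{\mathbb{R}} P_{\hat{\theta}|X}(A|x)\, p_\theta(x)\, dx$$
for all Borel sets $A \subseteq \mathbb{R}$. In the following, I will assume that $P_{\hat{\theta}|X}$ is also sufficiently well-behaved, so that each $Q_\theta$ has a density $q_\theta$, and the log-likelihood ratios $\log \frac{q_\theta(u)}{q_{\theta'}(u)}$ obey the same set of regularity conditions as above.
By the data processing theorem we have
$$D(Q_{\theta'} \,\|\, Q_\theta) \le D(P_{\theta'} \,\|\, P_\theta) \tag{1}$$
for all $\theta, \theta' \in \Theta$. Since $\Theta$ is an open interval, this inequality holds in particular for all $\theta'$ in an arbitrarily small neighborhood of $\theta$.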
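Step 2: data processing for Fisher information. Now let us look at the second-order Taylor expansions of the two divergences in (1) as $\theta'$ gets closer to $\theta$. I will show the details only for $D(Q_{\theta'} \,\|\, Q_\theta)$. To keep the notation simple, I will write $D(\theta' \,\|\, \theta)$ for $D(Q_{\theta'} \,\|\, Q_\theta)$. Then
$$D(\theta' \,\|\, \theta) = D(\theta \,\|\, \theta) + (\theta' - \theta)\, \frac{\partial}{\partial \theta'} D(\theta' \,\|\, \theta)\Big|_{\theta' = \theta} + \frac{1}{2} (\theta' - \theta)^2\, \frac{\partial^2}{\partial \theta'^2} D(\theta' \,\|\, \theta)\Big|_{\theta' = \theta} + o\big((\theta' - \theta)^2\big),$$
where the regularity conditions ensure that the error term converges to zero uniformly over a sufficiently small neighborhood of $\theta$. First of all, $D(\theta \,\|\, \theta) = 0$. Next,
$$\frac{\partial}{\partial \theta'} D(\theta' \,\|\, \theta) = \frac{\partial}{\partial \theta'} \int q_{\theta'}(u) \log \frac{q_{\theta'}(u)}{q_\theta(u)}\, du = \int \frac{\partial q_{\theta'}(u)}{\partial \theta'} \log \frac{q_{\theta'}(u)}{q_\theta(u)}\, du + \int \frac{\partial q_{\theta'}(u)}{\partial \theta'}\, du = \int \frac{\partial q_{\theta'}(u)}{\partial \theta'} \log \frac{q_{\theta'}(u)}{q_\theta(u)}\, du,$$
where the last step uses $\int \frac{\partial q_{\theta'}(u)}{\partial \theta'}\, du = \frac{\partial}{\partial \theta'} \int q_{\theta'}(u)\, du = 0$, which implies that
$$\frac{\partial}{\partial \theta'} D(\theta' \,\|\, \theta)\Big|_{\theta' = \theta} = 0.$$
Finally,
$$\frac{\partial^2}{\partial \theta'^2} D(\theta' \,\|\, \theta) = \int q_{\theta'}(u) \left( \frac{\partial}{\partial \theta'} \log q_{\theta'}(u) \right)^2 du + \int \frac{\partial^2 q_{\theta'}(u)}{\partial \theta'^2} \log \frac{q_{\theta'}(u)}{q_\theta(u)}\, du$$
(by the way, the partial derivative of the log-likelihood with respect to the parameter is known in statistics as the score function). When $\theta' = \theta$, the first term is precisely the Fisher information
$$J_{\hat{\theta}}(\theta) = \int q_\theta(u) \left( \frac{\partial}{\partial \theta} \log q_\theta(u) \right)^2 du$$
contained in the estimate $\hat{\theta}$ obtained from $X$, while the second term is zero. Putting all of this together, we can write
$$D(Q_{\theta'} \,\|\, Q_\theta) = \frac{1}{2} J_{\hat{\theta}}(\theta)\, (\theta' - \theta)^2 + o\big((\theta' - \theta)^2\big). \tag{2}$$
Exactly the same argument gives
$$D(P_{\theta'} \,\|\, P_\theta) = \frac{1}{2} J_X(\theta)\, (\theta' - \theta)^2 + o\big((\theta' - \theta)^2\big), \tag{3}$$
where $J_X(\theta) = \int p_\theta(x) \left( \frac{\partial}{\partial \theta} \log p_\theta(x) \right)^2 dx$ is the Fisher information contained in the sample $X$.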
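To see the local quadratic behavior of the divergence concretely, here is a small sketch for one family where everything is available in closed form. The choice of the exponential family $p_\theta(x) = \theta e^{-\theta x}$, $x \ge 0$, is my own illustration, not part of the derivation above; for it, $D(P_{\theta'} \| P_\theta) = \log(\theta'/\theta) + \theta/\theta' - 1$ and the Fisher information is $1/\theta^2$. The function names are hypothetical.

```python
import math

def kl_exponential(theta_prime, theta):
    """Closed-form D(P_theta' || P_theta) for densities theta * exp(-theta * x), x >= 0."""
    return math.log(theta_prime / theta) + theta / theta_prime - 1.0

def fisher_exponential(theta):
    """Fisher information of the exponential family with rate theta."""
    return 1.0 / theta ** 2

theta = 2.0
J = fisher_exponential(theta)

# As h shrinks, D(theta + h || theta) / (h^2 / 2) should approach J.
ratios = []
for h in (1e-1, 1e-2, 1e-3):
    d = kl_exponential(theta + h, theta)
    ratio = d / (0.5 * h ** 2)
    ratios.append(ratio)
    print(f"h = {h:g}: D / (h^2 / 2) = {ratio:.6f}, Fisher information = {J:.6f}")
```

The printed ratio should get closer to the Fisher information as `h` decreases, which is the content of the second-order expansion.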
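Substituting (2) and (3) into (1), dividing both sides by $\frac{1}{2}(\theta' - \theta)^2$, and taking the limit $\theta' \to \theta$, we obtain the data processing theorem for Fisher's information:
$$J_{\hat{\theta}}(\theta) \le J_X(\theta). \tag{4}$$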
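As a sanity check on the Fisher information inequality, here is a sketch under assumptions of my own choosing: the sample is $X \sim N(\theta, 1)$ and the randomized estimator adds independent Gaussian noise to $X$, so the estimate is distributed as $N(\theta, 1 + \texttt{noise\_var})$. The Fisher information is computed by brute-force numerical integration of the squared score, with the score approximated by a central difference. All names and parameter values are hypothetical.

```python
import math

def normal_logpdf(u, mean, var):
    """Log-density of N(mean, var) at u."""
    return -0.5 * math.log(2 * math.pi * var) - (u - mean) ** 2 / (2 * var)

def fisher_information(logpdf, theta, lo, hi, n=20000, eps=1e-4):
    """Midpoint-rule integral of (d/dtheta log f_theta(u))^2 * f_theta(u) over [lo, hi]."""
    du = (hi - lo) / n
    total = 0.0
    for i in range(n):
        u = lo + (i + 0.5) * du
        # Central-difference approximation of the score function at u.
        score = (logpdf(u, theta + eps) - logpdf(u, theta - eps)) / (2 * eps)
        total += score ** 2 * math.exp(logpdf(u, theta)) * du
    return total

theta = 0.7
noise_var = 0.5  # assumed variance of the estimator's internal randomness

# Fisher information in the sample X ~ N(theta, 1).
J_X = fisher_information(lambda u, t: normal_logpdf(u, t, 1.0),
                         theta, theta - 12.0, theta + 12.0)

# Fisher information in the noisy estimate, distributed as N(theta, 1 + noise_var).
J_hat = fisher_information(lambda u, t: normal_logpdf(u, t, 1.0 + noise_var),
                           theta, theta - 12.0, theta + 12.0)

print(f"Fisher information in X:        {J_X:.6f}")
print(f"Fisher information in estimate: {J_hat:.6f}")
```

For this Gaussian example the exact values are $1$ and $1/(1 + \texttt{noise\_var})$, so the extra randomness can only lose information about $\theta$.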
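Step 3: use the Schwarz! Once we have (4), the rest is standard. For any $\theta \in \Theta$, let us define
$$m(\theta) = \mathbb{E}_\theta \hat{\theta} = \int u\, q_\theta(u)\, du.$$
Then
$$m'(\theta) = \int u\, \frac{\partial q_\theta(u)}{\partial \theta}\, du = \int \big(u - m(\theta)\big)\, \frac{\partial q_\theta(u)}{\partial \theta}\, du,$$
where the last step uses the fact (which we have already proved) that $\int \frac{\partial q_\theta(u)}{\partial \theta}\, du = 0$. Dividing and multiplying the integrand by $q_\theta(u)$, we can write
$$m'(\theta) = \int \big(u - m(\theta)\big)\, \frac{\partial \log q_\theta(u)}{\partial \theta}\, q_\theta(u)\, du = \mathbb{E}_\theta\left[ \big(\hat{\theta} - m(\theta)\big)\, \frac{\partial}{\partial \theta} \log q_\theta(\hat{\theta}) \right].$$
Then by the Cauchy–Schwarz inequality we get
$$\big(m'(\theta)\big)^2 \le \mathrm{Var}_\theta[\hat{\theta}] \cdot J_{\hat{\theta}}(\theta),$$
where $J_{\hat{\theta}}(\theta) = \mathbb{E}_\theta\big[ \big( \frac{\partial}{\partial \theta} \log q_\theta(\hat{\theta}) \big)^2 \big]$ is the Fisher information contained in the estimate. Combining this with the bound $J_{\hat{\theta}}(\theta) \le J_X(\theta)$ in (4), we obtain the desired result:
$$\mathrm{Var}_\theta[\hat{\theta}] \ge \frac{\big(m'(\theta)\big)^2}{J_X(\theta)}.$$
If the estimator is unbiased, then $m(\theta) = \theta$ and hence $m'(\theta) = 1$ for all $\theta \in \Theta$, and the inequality particularizes to
$$\mathrm{Var}_\theta[\hat{\theta}] \ge \frac{1}{J_X(\theta)},$$
which is the usual Cramér–Rao lower bound.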
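A short Monte Carlo sketch can illustrate both the general bound and its unbiased special case. The setup is an assumption made for illustration only: $X \sim N(\theta, 1)$, so the Fisher information in the sample is $1$, and three hypothetical estimators are compared: the identity (unbiased), a randomized one that adds independent noise (unbiased), and a shrunk one, $\hat{\theta} = X/2$, whose mean is $\theta/2$ and hence has derivative $1/2$.

```python
import math
import random

def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(0)  # fixed seed so the run is reproducible
theta = 1.3             # hypothetical true parameter
n = 200_000
J_X = 1.0               # Fisher information of X ~ N(theta, 1)

xs = [rng.gauss(theta, 1.0) for _ in range(n)]

# Each entry: (estimates, derivative of the estimator's mean with respect to theta).
estimators = {
    "identity": (list(xs), 1.0),
    "randomized": ([x + rng.gauss(0.0, math.sqrt(0.5)) for x in xs], 1.0),
    "shrunk": ([0.5 * x for x in xs], 0.5),
}

results = {}
for name, (estimates, m_prime) in estimators.items():
    var = sample_variance(estimates)
    bound = m_prime ** 2 / J_X  # lower bound on the variance for this mean function
    results[name] = (var, bound)
    print(f"{name:>10}: empirical variance = {var:.4f}, lower bound = {bound:.4f}")
```

Up to Monte Carlo error, the identity and shrunk estimators should sit on their bounds, while the randomized estimator should exceed its bound by roughly the variance of the added noise.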
Note: I have not been able to find an original reference for deriving the CRLB via data processing for divergence, but I first came across it in this paper.
