Divergence in everything: erasure divergence and concentration inequalities
It’s that time again, the time to savor the dreamy delights of divergence!
In this post, we will look at a powerful information-theoretic method for deriving concentration-of-measure inequalities (i.e., tail bounds) for general functions of independent random variables.
The basic problem is as follows. Let be a discrete space (everything below will work for arbitrary measurable spaces, but I will spare you the unilluminating technical details). Let be independent (not necessarily identically distributed) -valued random variables. We have some function , and we would like to see how sharply the random variable concentrates around its mean . In other words, we are interested in the tail bounds for , i.e., bounds on the probability that , for every . The first step involves what is known as the Chernoff bounding trick. For simplicity, let us assume that is zero-mean. Then for any we have
where the last step uses Markov’s inequality. Now everything turns on the availability of good estimates for the moment-generating function of . When is almost surely bounded, i.e., for some , we have the classic Hoeffding bound:
Hence, for any ; optimizing over , we get and
But what if is not bounded?
It turns out that a very general technique that relies heavily on some information-theoretic results can be used to develop tail bounds in a large number of situations. This technique, known as the “entropy method,” was inspired by the work of Michel Ledoux on log-Sobolev inequalities (there are interesting connections to convex geometry and statistical mechanics as well). I have first learned about this technique a few years ago from Gábor Lugosi‘s excellent lecture notes on the concentration-of-measure inequalities. A couple of days ago I was re-reading his notes and then realized that an alternate route to the key result underlying the entropy method, the so-called tensorization inequality, lies through the quantity called the erasure divergence, which was recently introduced and extensively studied by Sergio Verdú and Tsachy Weissman.
1. Detour into information theory: erasure divergence
Let be a collection of discrete random variables (not necessarily independent!), each taking values in some space . For each let denote the -tuple
where the terms on the right-hand side are the conditional divergences. If is a finite set, then we can let be the uniform distribution on ; then the erasure divergence between an arbitrary and this is equal to
where the quantity is the erasure entropy of . Note that for every we have
Observe, by the way, that Han’s inequality follows from the fact that the erasure entropy is always lower than the ordinary Shannon entropy (and they are equal for product distributions): . Indeed,
Theorem 1 (Verdú–Weissman) The erasure divergence (2) satisfies
Proof: The proof is simple: for each , we can factorize as
and the same can be done for . Then
Summing this over all from to and rearranging, we get (4).
The ordinary divergence between and may be larger or smaller than the erasure divergence. However, it is always smaller when is a product distribution, i.e., when are independent (but not necessarily identically distributed) under . To see this, we use the fact that, when the ‘s are independent, for every . Then
2. Tensorization inequality
Now let us focus on the setting of interest: concentration of measure. Let be an arbitrary probability distribution on and consider a positive real-valued function , such that . Then
is a probability distribution. For brevity, we will denote by . Then we have the following:
Remark 1 Since is convex, each of the terms on the right-hand side of (5) is nonnegative by Jensen’s inequality.
Proof: Let us look at the erasure divergence between and . First of all,
Moreover, for any , we have
Now note that
(the second step follows from the fact that ), and suppose that is a product distribution, i.e., are independent. Then
and, applying Theorem 2, we obtain the so-called tensorization inequality. It is called that because of the appearance of the tensor product of measures defining . Here goes:
Proof: First we note that if the bound (6) holds for , then it also holds for the rescaled function , where is an arbitrary positive constant. Then we may as well assume that and apply the foregoing results.
In Lugosi’s notes, is upper-bounded by the right-hand side of (6) directly by appealing to Han’s inequality. However, what is lost is that the upper bound is exactly equal to the erasure divergence between and regardless of whether or not is a product distribution.
3. Back to concentration
Now let us return to the original problem of deriving tail bounds for . We want to get a handle on the moment-generating function of the random variable , where with a product distribution. Let us therefore apply the tensorization inequality of Theorem 3 to . Then
Then the tensorization inequality (6) can be rewritten as
Then all we need to do is to further bound the right-hand side of (7). Let us consider a specific example.
Proof: I will give the essentials, but skip over some of the technical details. First of all, let be an independent copy of and for each define
Then from (8) we see that
From this, it can be shown that
Let us divide both sides by to get
But the left-hand side is just the derivative of , i.e., , so we can integrate to get . Since (an application of L’Hopital’s rule), we finally obtain the bound . Using this in (1) and optimizing over , we get . The same argument shows that , so we get (9).
An astute reader has probably noted the similarity of the condition (8) with the bounded difference condition needed for McDiarmid’s inequality to hold. Note, however, that (8) is a much weaker condition than McDiarmid’s.
More examples can be found in a nice paper by Boucheron, Lugosi, and Massart.