## Divergence in everything: erasure divergence and concentration inequalities

It’s that time again, the time to savor the dreamy delights of divergence!

(image yoinked from Sergio Verdú‘s 2007 Shannon Lecture slides)

In this post, we will look at a powerful information-theoretic method for deriving concentration-of-measure inequalities (i.e., tail bounds) for general functions of independent random variables.

The basic problem is as follows. Let be a discrete space (everything below will work for arbitrary measurable spaces, but I will spare you the unilluminating technical details). Let be independent (not necessarily identically distributed) -valued random variables. We have some function , and we would like to see how sharply the random variable concentrates around its mean . In other words, we are interested in the *tail bounds* for , i.e., bounds on the probability that , for every . The first step involves what is known as the Chernoff bounding trick. For simplicity, let us assume that is zero-mean. Then for any we have

where the last step uses Markov’s inequality. Now everything turns on the availability of good estimates for the *moment-generating function* of . When is almost surely bounded, i.e., for some , we have the classic Hoeffding bound:

Hence, for any ; optimizing over , we get and

But what if is not bounded?

It turns out that a very general technique that relies heavily on some information-theoretic results can be used to develop tail bounds in a large number of situations. This technique, known as the “entropy method,” was inspired by the work of Michel Ledoux on log-Sobolev inequalities (there are interesting connections to convex geometry and statistical mechanics as well). I have first learned about this technique a few years ago from Gábor Lugosi‘s excellent lecture notes on the concentration-of-measure inequalities. A couple of days ago I was re-reading his notes and then realized that an alternate route to the key result underlying the entropy method, the so-called *tensorization inequality*, lies through the quantity called the *erasure divergence*, which was recently introduced and extensively studied by Sergio Verdú and Tsachy Weissman.

**1. Detour into information theory: erasure divergence **

Let be a collection of discrete random variables (not necessarily independent!), each taking values in some space . For each let denote the -tuple

Now let and be two probability distributions on . The *erasure divergence* between them is defined as

where the terms on the right-hand side are the conditional divergences. If is a finite set, then we can let be the uniform distribution on ; then the erasure divergence between an arbitrary and this is equal to

where the quantity is the *erasure entropy* of . Note that for every we have

where the inequality is due to the fact that conditioning reduces entropy. Summing from to and rearranging, we get *Han’s inequality*

Observe, by the way, that Han’s inequality follows from the fact that the erasure entropy is always lower than the ordinary Shannon entropy (and they are equal for product distributions): . Indeed,

Rearranging, we get (3). Lugosi actually uses (3) to arrive at Ledoux’s tensorization inequality. However, I will use the erasure divergence.

Theorem 1 (Verdú–Weissman)The erasure divergence (2) satisfies

*Proof:* The proof is simple: for each , we can factorize as

and the same can be done for . Then

Summing this over all from to and rearranging, we get (4).

The ordinary divergence between and may be larger or smaller than the erasure divergence. However, it is always smaller when is a *product distribution*, i.e., when are independent (but not necessarily identically distributed) under . To see this, we use the fact that, when the ‘s are independent, for every . Then

**2. Tensorization inequality **

Now let us focus on the setting of interest: concentration of measure. Let be an arbitrary probability distribution on and consider a positive real-valued function , such that . Then

is a probability distribution. For brevity, we will denote by . Then we have the following:

Theorem 2Define the function by . Then

Remark 1Since is convex, each of the terms on the right-hand side of (5) is nonnegative by Jensen’s inequality.

*Proof:* Let us look at the erasure divergence between and . First of all,

Moreover, for any , we have

Therefore,

Applying Theorem 1, we obtain (5).

Now note that

(the second step follows from the fact that ), and suppose that is a *product* distribution, i.e., are independent. Then

and, applying Theorem 2, we obtain the so-called *tensorization inequality*. It is called that because of the appearance of the tensor product of measures defining . Here goes:

Theorem 3 (Tensorization inequality (Ledoux))Let be a product distribution on . Let be a positive function, and denote . Then

*Proof:* First we note that if the bound (6) holds for , then it also holds for the rescaled function , where is an arbitrary positive constant. Then we may as well assume that and apply the foregoing results.

In Lugosi’s notes, is upper-bounded by the right-hand side of (6) directly by appealing to Han’s inequality. However, what is lost is that the upper bound is exactly equal to the erasure divergence between and *regardless* of whether or not is a product distribution.

**3. Back to concentration **

Now let us return to the original problem of deriving tail bounds for . We want to get a handle on the moment-generating function of the random variable , where with a product distribution. Let us therefore apply the tensorization inequality of Theorem 3 to . Then

and

Then the tensorization inequality (6) can be rewritten as

Then all we need to do is to further bound the right-hand side of (7). Let us consider a specific example.

Theorem 4Assume that the function satisfies the following condition:

*Proof:* I will give the essentials, but skip over some of the technical details. First of all, let be an independent copy of and for each define

Then from (8) we see that

From this, it can be shown that

Let us divide both sides by to get

But the left-hand side is just the derivative of , i.e., , so we can integrate to get . Since (an application of L’Hopital’s rule), we finally obtain the bound . Using this in (1) and optimizing over , we get . The same argument shows that , so we get (9).

An astute reader has probably noted the similarity of the condition (8) with the bounded difference condition needed for McDiarmid’s inequality to hold. Note, however, that (8) is a much weaker condition than McDiarmid’s.

More examples can be found in a nice paper by Boucheron, Lugosi, and Massart.

ITA Workshop 2012 : More Talks « An Ergodic Walksaid, on February 15, 2012 at 3:40 pm[…] but I am getting lazy, and one of them I wanted to mention was Max’s, but he already blogged a lot of it. I still don’t get what the “Herbst argument” is, […]

Anand Sarwatesaid, on March 26, 2013 at 6:31 pmSo is there an issue with extending this approach to continuous variables? I was looking at concentration for Laplace variables but you’re explicitly making X discrete here.

mraginskysaid, on March 26, 2013 at 6:44 pmIf you work directly with erasure divergence, there shouldn’t be any problem (modulo usual measure-theoretic murmurings and incantations). I made X discrete to keep things simple, mostly.

Linkage | An Ergodic Walksaid, on October 17, 2013 at 1:00 pm[…] I am traveling all over India at the moment so I’m not really able to write contentful posts. Here are even more links instead, sigh. Maybe later I’ll talk about log-Sobolev inequalities so I can be cool like Max. […]