The Information Structuralist

Divergence in everything: erasure divergence and concentration inequalities

Posted in Information Theory, Probability, Statistical Learning and Inference by mraginsky on March 18, 2011

It’s that time again, the time to savor the dreamy delights of divergence!

(image yoinked from Sergio Verdú‘s 2007 Shannon Lecture slides)

In this post, we will look at a powerful information-theoretic method for deriving concentration-of-measure inequalities (i.e., tail bounds) for general functions of independent random variables.

The basic problem is as follows. Let {{\mathsf X}} be a discrete space (everything below will work for arbitrary measurable spaces, but I will spare you the unilluminating technical details). Let {X_1,\ldots,X_n} be independent (not necessarily identically distributed) {{\mathsf X}}-valued random variables. We have some function {g : {\mathsf X}^n \rightarrow {\mathbb R}}, and we would like to see how sharply the random variable {Z = g(X_1,\ldots,X_n)} concentrates around its mean {\mathop{\mathbb E} Z}. In other words, we are interested in the tail bounds for {Z}, i.e., bounds on the probability that {|Z - \mathop{\mathbb E} Z| \ge t}, for every {t > 0}. The first step involves what is known as the Chernoff bounding trick. For simplicity, let us assume that {Z} is zero-mean. Then for any {t,\lambda > 0} we have

\displaystyle  \Pr (Z \ge t) = \Pr (e^{\lambda Z} \ge e^{\lambda t}) \le e^{-\lambda t} \mathop{\mathbb E}[e^{\lambda Z}], \ \ \ \ \ (1)

where the last step uses Markov’s inequality. Now everything turns on the availability of good estimates for the moment-generating function {\Lambda(\lambda) = \mathop{\mathbb E}[e^{\lambda Z}]} of {Z}. When {Z} is almost surely bounded, i.e., {\Pr(|Z| \le c) = 1} for some {c > 0}, we have the classic Hoeffding bound:

\displaystyle  \Lambda(\lambda) \le e^{\lambda^2 c^2/2}.

Hence, {\Pr(Z \ge t) \le e^{-\lambda t + \lambda^2 c^2/2}} for any {\lambda > 0}; optimizing over {\lambda} (the optimal choice is {\lambda = t/c^2}), we get {\Pr(Z \ge t) \le e^{-t^2/(2c^2)}} and

\displaystyle  \Pr\left( |Z| \ge t\right) \le 2e^{-t^2/(2c^2)}.
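
For the numerically inclined, here is a quick NumPy sketch (a sanity check, not part of the argument) of the two steps above: it estimates the moment-generating function and the two-sided tail by Monte Carlo for one illustrative bounded, zero-mean choice of {Z}, namely {Z = U - 1/2} with {U} uniform on {[0,1]}, so that {c = 1/2}. The sample size and the grids of {\lambda} and {t} are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Z = U - 1/2 with U ~ Uniform[0, 1]: zero mean and |Z| <= c = 1/2 almost surely
c = 0.5
Z = rng.uniform(size=1_000_000) - 0.5

# Hoeffding's bound on the MGF: E[exp(lambda * Z)] <= exp(lambda^2 * c^2 / 2)
for lam in [0.5, 1.0, 2.0, 5.0]:
    mgf = np.mean(np.exp(lam * Z))
    print(f"lambda={lam:3.1f}  MGF ~ {mgf:.4f}  bound = {np.exp(lam**2 * c**2 / 2):.4f}")

# Optimized Chernoff bound: P(|Z| >= t) <= 2 exp(-t^2 / (2 c^2))
for t in [0.1, 0.25, 0.4]:
    empirical = np.mean(np.abs(Z) >= t)
    bound = 2 * np.exp(-t**2 / (2 * c**2))
    print(f"t={t:.2f}  P(|Z|>=t) ~ {empirical:.4f}  bound = {bound:.4f}")
```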

But what if {Z} is not bounded?

It turns out that a very general technique, relying heavily on information-theoretic results, can be used to derive tail bounds in a large number of situations. This technique, known as the “entropy method,” was inspired by the work of Michel Ledoux on log-Sobolev inequalities (there are interesting connections to convex geometry and statistical mechanics as well). I first learned about it a few years ago from Gábor Lugosi‘s excellent lecture notes on concentration-of-measure inequalities. A couple of days ago I was re-reading his notes and realized that an alternative route to the key result underlying the entropy method, the so-called tensorization inequality, runs through a quantity called the erasure divergence, which was recently introduced and extensively studied by Sergio Verdú and Tsachy Weissman.

1. Detour into information theory: erasure divergence

Let {X_1,\ldots,X_n} be a collection of discrete random variables (not necessarily independent!), each taking values in some space {{\mathsf X}}. For each {i \in \{1,\ldots,n\}} let {X_{\backslash i}} denote the {(n-1)}-tuple

\displaystyle  (X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n).

Now let {P_{X^n}} and {Q_{X^n}} be two probability distributions on {{\mathsf X}^n}. The erasure divergence between them is defined as

\displaystyle  	D^-(P_{X^n} \| Q_{X^n}) = \sum^n_{i=1} D(P_{X_i|X_{\backslash i}} \| Q_{X_i|X_{\backslash i}} | P_{X_{\backslash i}}), \ \ \ \ \ (2)

where the terms on the right-hand side are conditional divergences. If {{\mathsf X}} is a finite set and we take {Q_{X^n}} to be the uniform distribution on {{\mathsf X}^n}, then the erasure divergence between an arbitrary {P_{X^n}} and this {Q_{X^n}} is equal to

\displaystyle  n \log |{\mathsf X}| - H^-(X^n) \equiv n \log |{\mathsf X}| - \sum^n_{i=1} H(X_i|X_{\backslash i}),

where the quantity {H^-(X^n)} is the erasure entropy of {P_{X^n}}. Note that for every {i} we have

\displaystyle  \begin{array}{rcl}  H(X^n) &=& H(X_i,X_{\backslash i}) \\ &=& H(X_{\backslash i}) + H(X_i|X_{\backslash i}) \\ &=& H(X_{\backslash i}) + H(X_i|X^{i-1},X^n_{i+1}) \\ &\le& H(X_{\backslash i}) + H(X_i|X^{i-1}), \end{array}

where the inequality is due to the fact that conditioning reduces entropy. Summing from {i=1} to {i=n} and rearranging, we get Han’s inequality

\displaystyle  H(X^n) \le \frac{1}{n-1}\sum^n_{i=1}H(X_{\backslash i}). \ \ \ \ \ (3)

Observe, by the way, that Han’s inequality follows from the fact that the erasure entropy never exceeds the ordinary Shannon entropy (they are equal for product distributions): {H^-(X^n) \le H(X^n)}. Indeed,

\displaystyle  H^-(X^n) = \sum^n_{i=1} H(X_i|X_{\backslash i}) = \sum^n_{i=1} \left(H(X^n) - H(X_{\backslash i})\right) \le \sum^n_{i=1} H(X_i|X^{i-1}) = H(X^n),

where the inequality is again due to conditioning reducing entropy and the last equality is the chain rule. Rearranging {nH(X^n) - \sum^n_{i=1}H(X_{\backslash i}) \le H(X^n)}, we recover (3). Lugosi actually uses (3) to arrive at Ledoux’s tensorization inequality; I will use the erasure divergence instead.
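
Here is a small numerical sketch (my own, purely for illustration) that checks both {H^-(X^n) \le H(X^n)} and Han’s inequality (3) on a randomly drawn joint distribution of three binary random variables; the alphabet, the block length, and the random seed are arbitrary choices.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, A = 3, 2  # three random variables over the binary alphabet {0, 1}

# A random joint pmf P on {0,1}^3, indexed by tuples x^n
p = rng.random(A**n)
P = dict(zip(itertools.product(range(A), repeat=n), p / p.sum()))

def H(axes):
    """Shannon entropy (in nats) of the marginal of P on the given coordinates."""
    marg = {}
    for x, prob in P.items():
        key = tuple(x[j] for j in axes)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * np.log(q) for q in marg.values() if q > 0)

H_joint = H(range(n))
# Erasure entropy: H^-(X^n) = sum_i H(X_i | X_\i) = sum_i (H(X^n) - H(X_\i))
H_erasure = sum(H_joint - H([j for j in range(n) if j != i]) for i in range(n))
# Right-hand side of Han's inequality (3)
han_rhs = sum(H([j for j in range(n) if j != i]) for i in range(n)) / (n - 1)

print(f"H(X^n)   = {H_joint:.4f}")
print(f"H^-(X^n) = {H_erasure:.4f}")
print(f"(1/(n-1)) sum_i H(X_\\i) = {han_rhs:.4f}")
assert H_erasure <= H_joint + 1e-9   # erasure entropy never exceeds Shannon entropy
assert H_joint <= han_rhs + 1e-9     # Han's inequality (3)
```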

Theorem 1 (Verdú–Weissman) The erasure divergence (2) satisfies

\displaystyle  		D^-(P_{X^n} \| Q_{X^n}) = \sum^n_{i=1} \left( D(P_{X^n} \| Q_{X^n}) - D(P_{X_{\backslash i}} \| Q_{X_{\backslash i}}) \right). 	\ \ \ \ \ (4)

Proof: The proof is simple: for each {i}, we can factorize {P_{X^n}} as

\displaystyle  	P_{X^n} = P_{X_{\backslash i}} P_{X_i|X_{\backslash i}},

and the same can be done for {Q_{X^n}}. Then

\displaystyle  \begin{array}{rcl}  	D(P_{X^n} \| Q_{X^n}) &=& \mathop{\mathbb E} \left[ \log \frac{P_{X_{\backslash i}}P_{X_i|X_{\backslash i}}}{Q_{X_{\backslash i}} Q_{X_i|X_{{\backslash i}}}} \right] \\ 	&=& \mathop{\mathbb E}\left[ \log \frac{P_{X_{\backslash i}}}{Q_{X_{\backslash i}}}\right] + \mathop{\mathbb E}\left[ \log \frac{P_{X_i|X_{\backslash i}}}{Q_{X_i|X_{\backslash i}}}\right] \\ 	&=& D(P_{X_{\backslash i}} \| Q_{X_{\backslash i}}) + D(P_{X_i|X_{{\backslash i}}} \| Q_{X_i|X_{\backslash i}} | P_{X_{\backslash i}}). \end{array}

Summing this over all {i} from {1} to {n} and rearranging, we get (4). \Box
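
The identity (4) is easy to check numerically. The following sketch (again, just an illustration with an arbitrarily chosen alphabet and randomly generated {P_{X^n}}, {Q_{X^n}}) computes the erasure divergence both from the definition (2) and from the right-hand side of (4), and confirms that they agree.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, A = 3, 2  # three binary random variables
support = list(itertools.product(range(A), repeat=n))

def random_pmf():
    p = rng.random(len(support))
    return dict(zip(support, p / p.sum()))

def marginal(R, axes):
    """Marginal of the pmf R on the given coordinates."""
    out = {}
    for x, r in R.items():
        key = tuple(x[j] for j in axes)
        out[key] = out.get(key, 0.0) + r
    return out

def D(R, S):
    """KL divergence D(R || S) in nats; R and S share the same support keys."""
    return sum(r * np.log(r / S[x]) for x, r in R.items() if r > 0)

P, Q = random_pmf(), random_pmf()
rest = lambda i: [j for j in range(n) if j != i]

# Erasure divergence from the definition (2): sum of conditional divergences
lhs = 0.0
for i in range(n):
    Pm, Qm = marginal(P, rest(i)), marginal(Q, rest(i))
    for x, p in P.items():
        key = tuple(x[j] for j in rest(i))
        lhs += p * np.log((p / Pm[key]) / (Q[x] / Qm[key]))

# Right-hand side of (4)
rhs = sum(D(P, Q) - D(marginal(P, rest(i)), marginal(Q, rest(i))) for i in range(n))

print(f"D^- via (2): {lhs:.6f}")
print(f"D^- via (4): {rhs:.6f}")
assert abs(lhs - rhs) < 1e-9
```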

The ordinary divergence {D(P_{X^n} \| Q_{X^n})} between {P_{X^n}} and {Q_{X^n}} may be larger or smaller than the erasure divergence. However, it never exceeds the erasure divergence when {Q_{X^n}} is a product distribution, i.e., when {X_1,\ldots,X_n} are independent (but not necessarily identically distributed) under {Q_{X^n}}. To see this, we use the fact that, when the {X_i}‘s are independent under {Q}, {Q_{X_i|X^{i-1}} = Q_{X_i|X_{\backslash i}} = Q_{X_i}} for every {i}. Then

\displaystyle  \begin{array}{rcl}  	D(P_{X^n} \| Q_{X^n}) - D^-(P_{X^n} \| Q_{X^n}) &=& \sum^n_{i=1}\mathop{\mathbb E}_P \left[ \log\frac{P_{X_i|X^{i-1}}}{Q_{X_i}} - \log \frac{P_{X_i|X_{\backslash i}}}{Q_{X_i}}\right] \\ 	&=& \sum^n_{i=1} \mathop{\mathbb E}_P \left[ \log \frac{P_{X_i|X^{i-1}}}{P_{X_i|X_{\backslash i}}}\right] \\ 	&=& -\sum^n_{i=1} D\big( P_{X_i|X_{\backslash i}} \big\| P_{X_i|X^{i-1}} \big | P_{X_{\backslash i}}\big) \\ 	&\le& 0. \end{array}
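
And here is the corresponding numerical check (an illustration only) that the ordinary divergence does not exceed the erasure divergence when {Q_{X^n}} is a product distribution; the joint {P_{X^n}} and the marginals of {Q_{X^n}} are random, and the erasure divergence is computed via the identity (4).

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, A = 3, 2
support = list(itertools.product(range(A), repeat=n))

# Arbitrary joint P; Q a product of independent (non-identical) marginals
p = rng.random(len(support))
P = dict(zip(support, p / p.sum()))
factors = [rng.random(A) for _ in range(n)]
factors = [f / f.sum() for f in factors]
Q = {x: np.prod([factors[i][x[i]] for i in range(n)]) for x in support}

def marginal(R, axes):
    out = {}
    for x, r in R.items():
        key = tuple(x[j] for j in axes)
        out[key] = out.get(key, 0.0) + r
    return out

def D(R, S):
    return sum(r * np.log(r / S[x]) for x, r in R.items() if r > 0)

rest = lambda i: [j for j in range(n) if j != i]
D_ordinary = D(P, Q)
# Erasure divergence, computed via the identity (4)
D_erasure = sum(D_ordinary - D(marginal(P, rest(i)), marginal(Q, rest(i)))
                for i in range(n))

print(f"D(P||Q)   = {D_ordinary:.6f}")
print(f"D^-(P||Q) = {D_erasure:.6f}")
assert D_ordinary <= D_erasure + 1e-9  # ordinary <= erasure for product Q
```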

2. Tensorization inequality

Now let us focus on the setting of interest: concentration of measure. Let {Q_{X^n}} be an arbitrary probability distribution on {{\mathsf X}^n} and consider a positive real-valued function {h : {\mathsf X}^n \rightarrow {\mathbb R}_+}, such that {\mathop{\mathbb E}_Q[h(X^n)] = 1}. Then

\displaystyle  P_{X^n}(x^n) = h(x^n)Q_{X^n}(x^n)

is a probability distribution. For brevity, we will denote {h(X^n)} by {Y}. Then we have the following:

Theorem 2 Define the function {\phi : {\mathbb R}_+ \rightarrow {\mathbb R}} by {\phi(u) = u \log u}. Then

\displaystyle  	D^-(P_{X^n} \| Q_{X^n}) = \sum^n_{i=1} \mathop{\mathbb E}\left[ \mathop{\mathbb E}[\phi(Y)|X_{\backslash i}] - \phi(\mathop{\mathbb E}[Y|X_{\backslash i}])\right] \ \ \ \ \ (5)

Remark 1 Since {\phi} is convex, each of the terms on the right-hand side of (5) is nonnegative by Jensen’s inequality.

Proof: Let us look at the erasure divergence between {P_{X^n}} and {Q_{X^n}}. First of all,

\displaystyle  \begin{array}{rcl}  	D(P_{X^n} \| Q_{X^n}) &=& \sum_{x^n} h(x^n)Q_{X^n}(x^n) \log h(x^n) \\ 	&=& \mathop{\mathbb E}[Y\log Y]. \end{array}

Moreover, for any {i \in \{1,\ldots,n\}}, we have

\displaystyle  \begin{array}{rcl}  D(P_{X_{\backslash i}} \| Q_{X_{\backslash i}}) &=& \sum_{x^n} Q_{X^n}(x^n)h(x^n) \log \frac{\sum_{x_i} P_{X^n}(x^n)}{\sum_{x_i}Q_{X^n}(x^n)} \\ 	&=& \sum_{x^n} Q_{X^n}(x^n)h(x^n) \log \frac{\sum_{x_i}h(x^n)Q_{X_i|X_{\backslash i}}(x_i|x_{{\backslash i}})Q_{X_{\backslash i}}(x_{\backslash i})}{\sum_{x_i}Q_{X_i|X_{\backslash i}}(x_i|x_{{\backslash i}})Q_{X_{\backslash i}}(x_{\backslash i})} \\ 	&=& \sum_{x^n} Q_{X^n}(x^n)h(x^n) \log \left(\sum_{x_i}h(x^n)Q_{X_i|X_{\backslash i}}(x_i|x_{\backslash i})\right) \\ 	&=& \mathop{\mathbb E}\big[Y \log \mathop{\mathbb E}[Y|X_{\backslash i}] \big] \\ 	&=& \mathop{\mathbb E}\big[\mathop{\mathbb E}[Y|X_{\backslash i}] \log \mathop{\mathbb E}[Y|X_{\backslash i}] \big]. \end{array}

Therefore,

\displaystyle  D(P_{X^n} \| Q_{X^n}) = \mathop{\mathbb E}[\phi(Y)] \qquad \text{and} \qquad D(P_{X_{\backslash i}} \| Q_{X_{\backslash i}}) = \mathop{\mathbb E}[\phi(\mathop{\mathbb E}[Y|X_{\backslash i}])].

Applying Theorem 1, we obtain (5). \Box
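
The identity (5) can also be checked numerically. In the sketch below (an illustration only), {Q_{X^n}} is a random, non-product distribution on three binary variables, {h} is a random positive function normalized so that {\mathop{\mathbb E}_Q[h(X^n)] = 1}, the left-hand side is computed via the identity (4), and the right-hand side via explicit conditional expectations under {Q}.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n, A = 3, 2
support = list(itertools.product(range(A), repeat=n))

# Base measure Q (NOT a product) and a positive h normalized so that E_Q[h] = 1
q = rng.random(len(support))
Q = dict(zip(support, q / q.sum()))
h_vals = rng.random(len(support)) + 0.1
h_vals /= sum(v * Q[x] for v, x in zip(h_vals, support))
h = dict(zip(support, h_vals))
P = {x: h[x] * Q[x] for x in support}   # the tilted measure P = h * Q

phi = lambda u: u * np.log(u)
rest = lambda i: [j for j in range(n) if j != i]

def marginal(R, axes):
    out = {}
    for x, r in R.items():
        key = tuple(x[j] for j in axes)
        out[key] = out.get(key, 0.0) + r
    return out

def D(R, S):
    return sum(r * np.log(r / S[x]) for x, r in R.items() if r > 0)

# Left-hand side of (5): the erasure divergence, via the identity (4)
lhs = sum(D(P, Q) - D(marginal(P, rest(i)), marginal(Q, rest(i))) for i in range(n))

# Right-hand side of (5): expectations taken under Q, with Y = h(X^n)
rhs = 0.0
for i in range(n):
    Qm = marginal(Q, rest(i))
    for key, w in Qm.items():                      # condition on X_\i = key
        block = [x for x in support if tuple(x[j] for j in rest(i)) == key]
        cond = {x: Q[x] / w for x in block}        # Q_{X_i | X_\i = key}
        E_phi_Y = sum(cond[x] * phi(h[x]) for x in block)
        phi_E_Y = phi(sum(cond[x] * h[x] for x in block))
        rhs += w * (E_phi_Y - phi_E_Y)

print(f"LHS of (5) = {lhs:.6f}")
print(f"RHS of (5) = {rhs:.6f}")
assert abs(lhs - rhs) < 1e-9
```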

Now note that

\displaystyle  D(P_{X^n} \| Q_{X^n}) = \mathop{\mathbb E}[\phi(Y)] \equiv \mathop{\mathbb E}[\phi(Y)] - \phi(\mathop{\mathbb E}[Y])

(the second step follows from the fact that {\phi(\mathop{\mathbb E}[Y]) = \phi(1) = 0}), and suppose that {Q} is a product distribution, i.e., {X_1,\ldots,X_n} are independent. Then

\displaystyle  D(P_{X^n} \| Q_{X^n}) \le D^-(P_{X^n} \| Q_{X^n}),

and, applying Theorem 2, we obtain the so-called tensorization inequality. The name comes from the tensor-product (i.e., product-measure) structure of {Q}. Here goes:

Theorem 3 (Tensorization inequality (Ledoux)) Let {Q_{X^n}} be a product distribution on {{\mathsf X}^n}. Let {h : {\mathsf X}^n \rightarrow {\mathbb R}_+} be a positive function, and denote {Y = h(X^n)}. Then

\displaystyle  		\mathop{\mathbb E}[\phi(Y)] - \phi(\mathop{\mathbb E}[Y]) \le \sum^n_{i=1} \mathop{\mathbb E}\left[ \mathop{\mathbb E}[\phi(Y)|X_{\backslash i}] - \phi(\mathop{\mathbb E}[Y|X_{\backslash i}])\right]. 	\ \ \ \ \ (6)

Proof: First we note that if the bound (6) holds for {h}, then it also holds for the rescaled function {c \cdot h}, where {c} is an arbitrary positive constant. Then we may as well assume that {\mathop{\mathbb E}[Y] = \mathop{\mathbb E}[h(X^n)] = 1} and apply the foregoing results. \Box

In Lugosi’s notes, {D(P_{X^n} \| Q_{X^n})} is upper-bounded by the right-hand side of (6) directly by appealing to Han’s inequality. However, what is lost is that the upper bound is exactly equal to the erasure divergence between {P_{X^n}} and {Q_{X^n}} regardless of whether or not {Q_{X^n}} is a product distribution.
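
For completeness, here is a direct numerical check of the tensorization inequality (6) itself (again just a sketch, with arbitrary illustrative choices): {Q_{X^n}} is a random product distribution on three binary variables and {h} is an arbitrary, unnormalized positive function.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n, A = 3, 2
support = list(itertools.product(range(A), repeat=n))

# Product distribution Q = Q_1 x Q_2 x Q_3 and an arbitrary positive h
factors = [rng.random(A) for _ in range(n)]
factors = [f / f.sum() for f in factors]
Q = {x: np.prod([factors[i][x[i]] for i in range(n)]) for x in support}
h = dict(zip(support, rng.random(len(support)) + 0.1))

phi = lambda u: u * np.log(u)
rest = lambda i: [j for j in range(n) if j != i]

# Left-hand side of (6): E[phi(Y)] - phi(E[Y]) with Y = h(X^n), X^n ~ Q
lhs = sum(Q[x] * phi(h[x]) for x in support) - phi(sum(Q[x] * h[x] for x in support))

# Right-hand side of (6): sum over coordinates of conditional "Jensen gaps"
rhs = 0.0
for i in range(n):
    for key in {tuple(x[j] for j in rest(i)) for x in support}:
        block = [x for x in support if tuple(x[j] for j in rest(i)) == key]
        w = sum(Q[x] for x in block)               # Q(X_\i = key)
        cond = {x: Q[x] / w for x in block}
        rhs += w * (sum(cond[x] * phi(h[x]) for x in block)
                    - phi(sum(cond[x] * h[x] for x in block)))

print(f"LHS of (6) = {lhs:.6f}")
print(f"RHS of (6) = {rhs:.6f}")
assert lhs <= rhs + 1e-9
```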

3. Back to concentration

Now let us return to the original problem of deriving tail bounds for {Z = g(X^n)}. We want to get a handle on the moment-generating function {\Lambda(\lambda) = \mathop{\mathbb E}[e^{\lambda Z}]} of {Z = g(X_1,\ldots,X_n)}, where {X^n \sim Q_{X^n}} with {Q} a product distribution; as before, we assume for simplicity that {\mathop{\mathbb E} Z = 0}. Let us therefore apply the tensorization inequality of Theorem 3 to {Y = e^{\lambda Z}}. Then

\displaystyle  \mathop{\mathbb E}[\phi(Y)] = \mathop{\mathbb E}[Y \log Y] = \lambda\mathop{\mathbb E}[Z e^{\lambda Z}] = \lambda \Lambda'(\lambda)

and

\displaystyle  \phi(\mathop{\mathbb E}[Y]) = \mathop{\mathbb E}[Y]\log \mathop{\mathbb E}[Y] = \Lambda(\lambda) \log \Lambda(\lambda).

Then the tensorization inequality (6) can be rewritten as

\displaystyle  	\lambda \Lambda'(\lambda) - \Lambda(\lambda) \log \Lambda(\lambda) \le \sum^n_{i=1} \mathop{\mathbb E}\left[ \mathop{\mathbb E}[\phi(Y)|X_{\backslash i}] - \phi(\mathop{\mathbb E}[Y|X_{\backslash i}])\right]. \ \ \ \ \ (7)

Then all we need to do is to further bound the right-hand side of (7). Let us consider a specific example.

Theorem 4 Assume that the function {g} satisfies the following condition:

\displaystyle  		\sup_{x^n,(x')^n \in {\mathsf X}^n} \sum^n_{i=1} \left|g(x^n) - g(x^{i-1},x'_i,x^n_{i+1})\right|^2 \le C 	\ \ \ \ \ (8)

for some {C > 0}. Then

\displaystyle  		\Pr\left( |Z| \ge t\right) \le 2e^{-t^2/(4C)} 	\ \ \ \ \ (9)

Proof: I will give the essentials, but skip over some of the technical details. First of all, let {X'_1,\ldots,X'_n} be an independent copy of {X_1,\ldots,X_n} and for each {i} define

\displaystyle  Z_i = g(X^{i-1},X'_i,X^n_{i+1}).

Then from (8) we see that

\displaystyle  \sum^n_{i=1}(Z - Z_i)^2 \le C.

From this, it can be shown that

\displaystyle  \lambda \Lambda'(\lambda) - \Lambda(\lambda) \log \Lambda(\lambda) \le \mathop{\mathbb E} \left[ e^{\lambda Z} \sum^n_{i=1} \lambda^2 (Z - Z_i)^2 \right] \le \lambda^2 C \Lambda(\lambda).

Let us divide both sides by {\lambda^2 \Lambda(\lambda)} to get

\displaystyle  \frac{\Lambda'(\lambda)}{\lambda \Lambda(\lambda)} - \frac{\log \Lambda(\lambda)}{\lambda^2} \le C.

But the left-hand side is just the derivative of {H(\lambda) = (1/\lambda) \log \Lambda(\lambda)}, i.e., {\frac{d}{d\lambda} H(\lambda) \le C}, so we can integrate to get {H(\lambda) - H(0) \le C\lambda}. Since {H(0) = \lim_{\lambda \rightarrow 0} \frac{\log \Lambda(\lambda)}{\lambda} = \mathop{\mathbb E} Z = 0} (an application of L’Hôpital’s rule), we finally obtain the bound {\Lambda(\lambda) \le e^{C\lambda^2}}. Using this in (1) and optimizing over {\lambda} (the optimal choice is {\lambda = t/(2C)}), we get {\Pr(Z \ge t) \le e^{-t^2/(4C)}}. The same argument applied to {-Z} shows that {\Pr(Z \le - t) \le e^{-t^2/(4C)}}, so we get (9). \Box
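
As a final sanity check (not part of the proof), here is a Monte Carlo sketch for one concrete choice of {g}: the normalized sum {g(x^n) = (x_1 + \cdots + x_n)/\sqrt{n}} of Rademacher variables, for which condition (8) holds with {C = 4}. The bound (9) is quite loose for this particular {g}, but the simulation confirms that it holds; the values of {n}, the number of trials, and the grid of {t} are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 50, 200_000

# g(x^n) = (x_1 + ... + x_n) / sqrt(n) with Rademacher coordinates, so E[Z] = 0.
# Replacing coordinate i changes g by at most 2/sqrt(n), and all n coordinates can
# change by that much simultaneously, so (8) holds with C = n * (2/sqrt(n))^2 = 4.
C = 4.0
X = rng.choice([-1, 1], size=(trials, n))
Z = X.sum(axis=1) / np.sqrt(n)

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(Z) >= t)
    bound = 2 * np.exp(-t**2 / (4 * C))
    print(f"t = {t:.1f}   P(|Z| >= t) ~ {empirical:.4f}   bound (9) = {bound:.4f}")
    assert empirical <= bound
```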

An astute reader has probably noted the similarity of condition (8) to the bounded difference condition needed for McDiarmid’s inequality. Note, however, that (8) is a weaker requirement: if McDiarmid’s condition holds with constants {c_1,\ldots,c_n}, then (8) holds with {C = \sum^n_{i=1} c^2_i} (the supremum of the sum is at most the sum of the coordinate-wise suprema), while (8) can hold with a much smaller constant.

More examples can be found in a nice paper by Boucheron, Lugosi, and Massart.

4 Responses

  1. […] but I am getting lazy, and one of them I wanted to mention was Max’s, but he already blogged a lot of it. I still don’t get what the “Herbst argument” is, […]

  2. Anand Sarwate said, on March 26, 2013 at 6:31 pm

    So is there an issue with extending this approach to continuous variables? I was looking at concentration for Laplace variables but you’re explicitly making X discrete here.

    • mraginsky said, on March 26, 2013 at 6:44 pm

      If you work directly with erasure divergence, there shouldn’t be any problem (modulo usual measure-theoretic murmurings and incantations). I made X discrete to keep things simple, mostly.

  3. Linkage | An Ergodic Walk said, on October 17, 2013 at 1:00 pm

    […] I am traveling all over India at the moment so I’m not really able to write contentful posts. Here are even more links instead, sigh. Maybe later I’ll talk about log-Sobolev inequalities so I can be cool like Max. […]

