ISIT 2013: two plenaries on concentration of measure

Of the five plenary talks at this year’s ISIT, two were about concentration of measure: Katalin Marton’s Shannon lecture on “Distance-divergence inequalities” and Gabor Lugosi’s talk on “Concentration inequalities and the entropy method” the next morning. Since the topic of measure concentration is dear to my heart, I thought I would write down a few unifying themes.

When we speak of concentration of measure, we mean the following. Let {X} be a random variable taking values in some metric space {{\mathsf X}}. Then we say that {X} (or, more correctly, the distribution of {X}) has the concentration property if, for any set {A \subset {\mathsf X}} such that {{\mathbb P}(X \in A) \ge 1/2}, we have

\displaystyle  {\mathbb P}\left( d(X,A) \le r\right) \xrightarrow{r \rightarrow \infty} 1. \ \ \ \ \ (1)

Here, {d(x,A)} is the distance from the point {x \in {\mathsf X}} to the set {A}:

\displaystyle  d(x,A) := \inf_{y \in A} d(x,y).

Another way to express (1) is as follows: for any set {A \subset {\mathsf X}} and any {r \ge 0}, define the {r}-blowup of {A} by

\displaystyle  A_r := \left\{ y \in {\mathsf X}: d(y,A) \le r \right\} \equiv \left\{ y \in {\mathsf X}: \exists x \in A \text{ such that } d(x,y) \le r \right\}.

Then {X} has the concentration property if

\displaystyle  {\mathbb P}(X \in A) \ge 1/2 \quad \Longrightarrow \quad \lim_{r \rightarrow \infty} {\mathbb P}(X \in A_r) = 1.

In other words, {X} has the concentration property if any set containing {X} with not too small a probability can be “blown up” to contain {X} with near-certainty.

Here are two classic examples of concentration:

  1. Gaussian distribution in Euclidean space. Let {{\mathsf X} = {\mathbb R}^n} and take {d(x,y) = \| x - y \|_2} — the usual Euclidean distance. Let {X} be a standard {n}-dimensional Gaussian random variable, i.e., {X \sim N(0,I_n)}, where {I_n} is the {n\times n} identity matrix. Then for any {r \ge 0} we have

    \displaystyle  	{\mathbb P}(X \in A) \ge 1/2 \quad \Longrightarrow \quad {\mathbb P}(X \in A_r) \ge 1 - e^{-r^2/2}.

  2. Uniform distribution in Hamming space. Let {{\mathsf X}} be the Hamming cube {\{0,1\}^n} equipped with the normalized Hamming distance

    \displaystyle  	d(x,y) = \frac{1}{n}\sum^n_{i=1}1_{\{x_i \neq y_i\}}

    that counts the fraction of bits in which {x = (x_1,\ldots,x_n)} and {y = (y_1,\ldots,y_n)} disagree. Let {X} have the uniform distribution on {\{0,1\}^n}, i.e., {{\mathbb P}(X=x) = 2^{-n}} for all {x}. Then

    \displaystyle  	{\mathbb P}(X \in A) \ge 1/2 \quad \Longrightarrow \quad {\mathbb P}(X \in A_r) \ge 1 - e^{-2nr^2}.
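The Hamming-cube bound is easy to probe numerically. Here is a quick Monte Carlo sanity check (my own sketch, not from the talks), using the half-cube {A = \{x : x_1 + \cdots + x_n \le n/2\}}, whose measure under the uniform distribution is at least {1/2}:

```python
import math, random

def empirical_blowup(n=100, r=0.1, trials=20000, seed=0):
    """Estimate P(X in A_r) for uniform X on {0,1}^n and the half-cube
    A = {x : x_1 + ... + x_n <= n/2}, which has P(A) >= 1/2."""
    rng = random.Random(seed)
    # d(x, A) = max(0, k - floor(n/2)) / n, where k is the number of ones
    # in x, so x lies in A_r iff k <= floor(n/2) + r*n
    cutoff = n // 2 + int(r * n + 1e-9)
    hits = sum(
        bin(rng.getrandbits(n)).count("1") <= cutoff for _ in range(trials)
    )
    return hits / trials

n, r = 100, 0.1
est = empirical_blowup(n, r)
bound = 1 - math.exp(-2 * n * r * r)
print(f"empirical P(A_r) = {est:.4f} >= 1 - exp(-2 n r^2) = {bound:.4f}")
```

For this particular {A} the empirical blowup probability comfortably exceeds the general-purpose lower bound, as it should: the bound holds uniformly over all sets of measure at least {1/2}.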

These two examples suggest that we should aim for “hard” statements in the form of sharp bounds on the concentration function

\displaystyle  \alpha_X(r) := \sup_{A:\, {\mathbb P}(X \in A) \ge 1/2} {\mathbb P}(X \not\in A_r)

as opposed to merely “soft” statements of the form {\alpha_X(r) \rightarrow 0} as {r \rightarrow \infty}. The $64,000 question is: how do we get such bounds?

There are two ways to accomplish this goal, and the idea underlying both is to replace sets with other objects that are hopefully easier to handle. The first way is to replace sets by probability measures; the second is to replace them by functions. Here is what I mean:

Fix a set {A \subset {\mathsf X}} with {{\mathbb P}(X \in A) > 0}. Let {P} denote the distribution of {X}, and let {P_A} denote the conditional distribution of {X} given {X \in A}. That is, for any (measurable) set {B \subset {\mathsf X}} we have

\displaystyle  P_A(B) := \frac{P(A \cap B)}{P(A)}.

I am using the subscript notation {P_A} instead of the more usual {P(\cdot|A)} to emphasize that {P_A} is a probability measure in its own right. In this way, we can associate to each non-null set {A} a probability measure {P_A}. Now, here is a very simple observation that turns out to be quite consequential:

\displaystyle  	D(P_A \| P) = \log \frac{1}{P(A)}. \ \ \ \ \ (2)

This is very easy to prove: for any set {B} we have

\displaystyle  	P_A(B) = \frac{1}{P(A)}\int_B 1_A(x) P(dx),

so {P_A} is absolutely continuous with respect to {P} with the Radon-Nikodym derivative {dP_A/dP = 1_A/P(A)}. Therefore, by definition of the divergence,

\displaystyle  D(P_A \| P) = \int dP_A \log \frac{dP_A}{dP} = \frac{1}{P(A)}\int_A dP \log\frac{1}{P(A)} = \log \frac{1}{P(A)}.

So if we are interested in bounding the probabilities of various sets {A}, we may hope to get some mileage out of the relationship (2).
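The identity (2) can be confirmed numerically in a couple of lines. The six-point distribution below is made up for illustration; {A} is the first three points, chosen so that {P(A) = 1/2}:

```python
import math

# six-point sanity check of (2); the distribution P is made up for illustration
P = [0.1, 0.25, 0.15, 0.2, 0.05, 0.25]
A = {0, 1, 2}
PA_mass = sum(P[i] for i in A)                                    # P(A) = 0.5
P_A = [P[i] / PA_mass if i in A else 0.0 for i in range(len(P))]  # conditional law

# D(P_A || P) = sum over the support of P_A of P_A(x) log(P_A(x)/P(x))
div = sum(p * math.log(p / q) for p, q in zip(P_A, P) if p > 0)
print(div, math.log(1 / PA_mass))  # both equal log 2
```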

On the other hand, we may also associate to a set {A} with {P(A) > 0} the function

\displaystyle  f_A(x) := d(x,A) \equiv \inf_{y \in A} d(x,y).

This function is Lipschitz: for any {x,x' \in {\mathsf X}},

\displaystyle  f_A(x) - f_A(x') = \inf_{y \in A}d(x,y) - \inf_{y \in A} d(x',y) \le \sup_{y \in A} \left[d(x,y) - d(x',y)\right] \le d(x,x'),

where the last step is by the triangle inequality. Interchanging the roles of {x} and {x'}, we get the Lipschitz property. Moreover, let us consider the random variable {Z = f_A(X)}, where {X} is our original {{\mathsf X}}-valued random variable. Then we immediately notice two things:

  1. For any {r \ge 0}, {{\mathbb P}(Z \le r) = {\mathbb P}\left(d(X,A) \le r\right) = {\mathbb P}(A_r)}.
  2. If {P(A) = {\mathbb P}(X \in A) \ge 1/2}, then {0} is a median of {Z}, in the sense that

\displaystyle  	{\mathbb P}(Z \le 0) = P(A) \ge 1/2 \qquad \text{and} \qquad {\mathbb P}(Z \ge 0) \ge 1/2

    (the second inequality is obviously true since {Z} is nonnegative with probability {1}).

These two observations suggest that we may obtain concentration bounds by bounding the probability that a given Lipschitz function of {X} deviates from its median by more than {r}. In fact, it is easy to derive an alternative expression for the concentration function {\alpha_X}:

\displaystyle  	\alpha_X(r) = \sup_{{\rm 1-Lipschitz\,} f} {\mathbb P}\Big( f(X) > m_f + r\Big), \ \ \ \ \ (3)

where {m_f} denotes any median of {f(X)}. We already showed, by passing from {A} to {f_A = d(\cdot,A)}, that {\alpha_X} is bounded from above by the quantity on the right-hand side of (3):

\displaystyle  \alpha_X(r) = \sup_{A :\, P(A) \ge 1/2} {\mathbb P} \Big( f_A(X) > \underbrace{m_{f_A}}_{=0} + r \Big) \le \sup_{{\rm 1-Lipschitz\,}f} {\mathbb P}\Big( f(X) > m_f + r\Big).

To prove the reverse inequality, fix any {1}-Lipschitz function {f} and consider the set {A_f := \left\{ x \in {\mathsf X} : f(x) \le m_f \right\}}, where {m_f} is any median of {f(X)}. Then, by definition,

\displaystyle  {\mathbb P}(X \in A_f) = {\mathbb P}\Big( f(X) \le m_f \Big) \ge 1/2.

Moreover, if we consider the {r}-blowup

\displaystyle  [A_f]_r = \Big\{ x \in {\mathsf X}: d(x,A_f) \le r \Big\},

then for any {x \in {\mathsf X}} and any {y \in A_f} we must have

\displaystyle  f(x) - m_f \le f(x) - f(y) \le d(x,y),

where the last step is by the Lipschitz property of {f}. Taking the infimum over {y \in A_f} gives {f(x) - m_f \le d(x,A_f)}, so the event {\{f(X) > m_f + r\}} implies {\{d(X,A_f) > r\}}. Consequently, by definition of the concentration function,

\displaystyle  {\mathbb P}\Big( f(X) > m_f + r \Big) \le {\mathbb P}\Big( d(X,A_f) > r \Big) = 1 - P\left([A_f]_r\right) \le \alpha_X(r).

By passing to the functional viewpoint, we obtain another equivalent characterization of the concentration property: a random variable {X} taking values in a metric space {({\mathsf X},d)} has the concentration property if real-valued Lipschitz functions of {X} are “nearly constant.”
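This “nearly constant” behavior is easy to see numerically. The sketch below (my own illustration) samples the {1}-Lipschitz function {f(x) = \|x\|_2} of a standard Gaussian vector: the median of {f(X)} grows like {\sqrt{n}}, while the spread around it stays bounded as the dimension grows:

```python
import math, random

def norm_deviation_spread(n, trials=2000, seed=1):
    """Sample Z = ||X||_2 for X ~ N(0, I_n); return (median, 90% spread)."""
    rng = random.Random(seed)
    samples = sorted(
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)))
        for _ in range(trials)
    )
    med = samples[trials // 2]
    lo, hi = samples[int(0.05 * trials)], samples[int(0.95 * trials)]
    return med, hi - lo

for n in (10, 100, 1000):
    med, spread = norm_deviation_spread(n)
    print(f"n={n:5d}  median ~ {med:6.2f}  90% spread ~ {spread:.2f}")
# the median grows like sqrt(n), but the spread stays O(1): the 1-Lipschitz
# function f(x) = ||x||_2 is "nearly constant" in the sense described above
```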

Let’s look at the first, probabilistic viewpoint, which was born out of a 1996 breakthrough paper by Marton. Given a metric space {({\mathsf X},d)}, let us define the {L_1} Wasserstein distance (or transportation distance) between any two probability measures {P} and {Q} on it:

\displaystyle  W_1(P,Q) := \inf_{X \sim P, Y \sim Q} {\mathbb E}[d(X,Y)],

where the infimum is over all jointly distributed random variables {X,Y \in {\mathsf X}}, such that {P_X = P} and {P_Y = Q}. Now consider a random variable {X \in {\mathsf X}}, for which we wish to establish concentration. What Marton showed is the following: Suppose the distribution {P} of {X} satisfies the {L_1} transportation inequality

\displaystyle  W_1(P,Q) \le \sqrt{2c\, D(Q \| P)} \ \ \ \ \ (4)

for some constant {c > 0}. Then {X} has the concentration property, and moreover

\displaystyle  P(A) \ge 1/2 \quad \Longrightarrow \quad P(A_r) \ge 1 - \exp\left(- \frac{1}{2c}\left( r - \sqrt{2c \log 2}\right)^2 \right), \forall r > \sqrt{2 c \log 2}.

Marton’s proof is breathtakingly beautiful. Consider any two sets {A,B} with {P(A),P(B) \neq 0}. Recalling our notation for conditional distributions, we can write

\displaystyle  \begin{array}{rcl}  W_1(P_A,P_B) &\le& W_1(P_A,P) + W_1(P_B,P) \\ &\le& \sqrt{2c\, D(P_A \| P)} + \sqrt{2c\, D(P_B \| P)} \\ &=& \sqrt{2c \log \frac{1}{P(A)}} + \sqrt{2c \log \frac{1}{P(B)}}, \end{array}

where in the first step we have used the triangle inequality, in the second we have used the fact that {P} satisfies the transportation inequality (4), and in the last step we have used the formula (2). Now suppose that {P(A) \ge 1/2} and let {B = A^c_r} for some {r}, where the superscript {c} denotes set-theoretic complement. Then {W_1(P_A,P_B) \ge d(A,B) \ge r}: any coupling of {P_A} and {P_B} puts all of its mass on pairs {(x,y)} with {x \in A} and {y \in B}, and every {y \not\in A_r} satisfies {d(y,A) > r}. On the other hand,

\displaystyle  \log \frac{1}{P(A)} \le \log 2 \qquad \text{and} \qquad \log\frac{1}{P(B)} = \log\frac{1}{1-P(A_r)}.

Combining these facts gives us the bound

\displaystyle  r \le \sqrt{2c \log 2} + \sqrt{2c \log \frac{1}{1-P(A_r)}}

that holds for all {r}. If {r > \sqrt{2 c \log 2}}, then we get

\displaystyle  P(A_r) \ge 1 - \exp\left( - \frac{1}{2c}\left(r - \sqrt{2c \log 2}\right)^2\right),

so we indeed have concentration and a sharp bound on {\alpha_X(r)}, at least for large enough values of {r}. The main message here is that, in order to study concentration, it suffices to work on the level of probability measures and to focus one’s effort on showing that the distribution of {X} satisfies a suitable transportation inequality. Since Marton’s original work, there have been many refinements and extensions, which I will not go into here. One such result, due to Sergey Bobkov and Friedrich Götze, says that {P} satisfying a transportation inequality (4) is equivalent to the Gaussian concentration property

\displaystyle  \alpha_X(r) \le e^{-r^2/2c}, \qquad \forall r \ge 0.
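In the simplest one-coordinate case, the transportation inequality (4) can be checked directly: on the two-point space {\{0,1\}} with the Hamming metric, {W_1(Q,P)} reduces to the total variation distance {|p-q|} between {Q = \mathrm{Bern}(q)} and {P = \mathrm{Bern}(p)}, and (4) with {c = 1/4} is exactly Pinsker’s inequality. A quick grid check (my own sketch):

```python
import math

def kl(p, q):
    """D(Bernoulli(p) || Bernoulli(q)) in nats."""
    return sum(a * math.log(a / b)
               for a, b in ((p, q), (1 - p, 1 - q)) if a > 0)

# On {0,1} with the Hamming metric, W_1(Q, P) = |p - q| (total variation),
# so (4) with c = 1/4 reads |p - q| <= sqrt(D(Q||P)/2): Pinsker's inequality.
worst_slack = min(
    math.sqrt(0.5 * kl(q, p)) - abs(p - q)
    for p in (i / 20 for i in range(1, 20))
    for q in (j / 20 for j in range(1, 20))
)
print(f"worst slack over the grid: {worst_slack:.4f}")  # 0 at p = q
```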

Now let’s look at the complementary functional viewpoint. Recall that we seek tight upper bounds on deviation probabilities of the form

\displaystyle  {\mathbb P}\Big( f(X) \ge m_f + r \Big), \qquad \forall r > 0.

It is easier to work with means instead of medians, and indeed it can be shown that concentration around the mean is equivalent to concentration around any median. So let’s focus on the mean. Let {X}, as before, be a random variable over some metric space {({\mathsf X},d)}, and consider a Lipschitz function {f : {\mathsf X} \rightarrow {\mathbb R}} such that {{\mathbb E}[f(X)] = 0}. We can apply the well-known Chernoff trick: for any {r,\lambda > 0} we have

\displaystyle  {\mathbb P}\Big( f(X) \ge r \Big) = {\mathbb P}\Big( e^{\lambda f(X)} \ge e^{\lambda r} \Big) \le e^{-\lambda r} {\mathbb E}[e^{\lambda f(X)}].

Now the whole affair hinges on the availability of tight upper bounds on the logarithmic moment-generating function {\Lambda(\lambda) := \log {\mathbb E}[e^{\lambda f(X)}]}. The entropy method, which was the subject of Gabor Lugosi’s plenary, is the name for a set of techniques for bounding {\Lambda(\lambda)} by means of inequalities that relate the relative entropy between various tilted distributions derived from {P} and {P} itself. The entropy method has roots in the work of Michel Ledoux, who in turn distilled it from some very deep results of Michel Talagrand.

The simplest version of the entropy method goes something like this. Let us define, for any {t \in {\mathbb R}}, the tilted distribution {P^{(t)}} via

\displaystyle  \frac{dP^{(t)}}{dP}(x) = \frac{e^{tf(x)}}{e^{\Lambda(t)}}

(assuming, of course, that {\Lambda(t)} exists and is finite). Then we have

\displaystyle  \begin{array}{rcl}  	D(P^{(t)} \| P) &=& \int dP^{(t)} \left[ tf - \Lambda(t)\right] \\ 	&=& \frac{1}{e^{\Lambda(t)}} t\, {\mathbb E}\left[f(X)e^{tf(X)}\right] - \Lambda(t) \\ 	&=& t \Lambda'(t) - \Lambda (t) \\ 	&=& t^2 \frac{d}{dt} \left(\frac{\Lambda(t)}{t}\right). \end{array}

Integrating and using the fact that {\Lambda(t)/t \rightarrow \Lambda'(0) = {\mathbb E}[f(X)] = 0} as {t \downarrow 0}, we get

\displaystyle  \Lambda(\lambda) = \lambda \int^\lambda_0 \frac{D(P^{(t)} \| P)}{t^2} dt. \ \ \ \ \ (5)

Now suppose that we can bound {D(P^{(t)} \| P) \le ct^2/2} for some {c > 0}. Then from (5) we have

\displaystyle  \Lambda(\lambda) \le \frac{c\lambda^2}{2}, \qquad \forall \lambda > 0,

which in turn gives the Gaussian tail bound

\displaystyle  {\mathbb P}\Big( f(X) \ge r \Big) \le \inf_{\lambda > 0} \exp\left(-\lambda r + \frac{c\lambda^2}{2}\right) = \exp\left(-\frac{r^2}{2c}\right).

This is the so-called Herbst argument. Of course, I have glossed over the most nontrivial part — namely, showing that we can bound the relative entropy {D(P^{(t)} \| P)} by a quadratic function of {t}. Gabor’s talk contained numerous examples of how one could get such bounds in the special case when {X} is an {n}-tuple of independent random variables taking values in some base space {{\mathcal X}}, {X = (X_1,\ldots,X_n) \in {\mathcal X}^n}, and the function {f} is not “too sensitive” to local modifications of its arguments. This takes us back to probability on metric spaces.
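The integral representation (5) is exact and can be verified numerically on a toy example. The sketch below uses a made-up three-point distribution and a centered {f}, and compares {\Lambda(\lambda)} with the right-hand side of (5) via a midpoint rule:

```python
import math

# xs, P below are made up for illustration; f is centered so that E[f(X)] = 0
xs = [0.0, 1.0, 3.0]
P  = [0.5, 0.3, 0.2]
mean = sum(p * x for p, x in zip(P, xs))
f = [x - mean for x in xs]

def Lambda(t):
    """Logarithmic moment-generating function of f(X)."""
    return math.log(sum(p * math.exp(t * fi) for p, fi in zip(P, f)))

def D_tilted(t):
    """D(P^(t) || P), where dP^(t)/dP = exp(t f) / exp(Lambda(t))."""
    Z = math.exp(Lambda(t))
    Pt = [p * math.exp(t * fi) / Z for p, fi in zip(P, f)]
    return sum(q * math.log(q / p) for q, p in zip(Pt, P))

lam, N = 0.7, 20000
# midpoint rule; the integrand D(P^(t)||P)/t^2 tends to Var[f(X)]/2
# as t -> 0, so there is no singularity at the origin
h = lam / N
integral = sum(D_tilted((k + 0.5) * h) / ((k + 0.5) * h) ** 2
               for k in range(N)) * h
print(Lambda(lam), lam * integral)  # the two sides of (5) agree
```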

Here is one classic example. Suppose that our function {f} has the bounded difference property, i.e., there exist some constants {c_1,\ldots,c_n \ge 0}, such that changing the {i}th argument of {f} while keeping others constant will change the value of {f} by at most {c_i}:

\displaystyle  \sup_{x_1,\ldots,x_n}\sup_{x'_i}\left| f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x'_i,\ldots,x_n) \right| \le c_i.

We can express this more succinctly as a Lipschitz property of {f} if we define the weighted Hamming metric

\displaystyle  d(x,y) = \sum^n_{i=1}c_i 1_{\{x_i \neq y_i\}} \ \ \ \ \ (6)

(we can assume without loss of generality that all the {c_i}’s are strictly positive, because we can simply ignore those coordinates of {x} that do not affect the value of {f}). Then the bounded difference property is equivalent to saying that {f} is {1}-Lipschitz with respect to this weighted Hamming metric. Moreover, it is possible to show that any product probability measure {P = P_1 \otimes \ldots \otimes P_n} on the product space {{\mathsf X} = {\mathcal X}^n} satisfies the transportation inequality

\displaystyle  W_1(Q,P) \le \sqrt{2c\, D(Q \| P)},

where the Wasserstein distance is computed with respect to the weighted Hamming metric (6), and {c = \frac{1}{4}\sum^n_{i=1}c^2_i}. By the Bobkov–Götze result quoted above, this is equivalent to the concentration bound

\displaystyle  {\mathbb P}\left( f(X) \ge {\mathbb E}[f(X)] + r \right) \le \exp\left( - \frac{r^2}{2c}\right) = \exp\left( - \frac{2r^2}{\sum^n_{i=1}c^2_i}\right).

This is the well-known McDiarmid’s inequality, which was originally derived using martingale techniques, but here we have arrived at it through a completely different route that took us back to where we started: concentration of Lipschitz functions around their means and/or medians, which (as we saw) is the same thing as the original, “stochastic-geometric” view of the concentration of measure phenomenon.
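As a closing sanity check (my own sketch, not from the talks), here is McDiarmid’s inequality in action for the sample mean of {n} fair bits, where each {c_i = 1/n}, so {c = 1/4n} and the bound reads {e^{-2nr^2}}:

```python
import math, random

def tail_estimate(n=100, r=0.1, trials=50000, seed=2):
    """Monte Carlo estimate of P(f(X) >= E[f(X)] + r) for f = sample
    mean of n i.i.d. fair bits, so that E[f(X)] = 1/2."""
    rng = random.Random(seed)
    cutoff = round(n * 0.5 + n * r)   # f(x) >= 1/2 + r  <=>  (# of ones) >= cutoff
    exceed = sum(
        bin(rng.getrandbits(n)).count("1") >= cutoff for _ in range(trials)
    )
    return exceed / trials

n, r = 100, 0.1
emp = tail_estimate(n, r)
bound = math.exp(-2 * n * r * r)      # c_i = 1/n gives c = 1/(4n), r^2/(2c) = 2nr^2
print(f"empirical tail = {emp:.4f} <= McDiarmid bound = {bound:.4f}")
```

The empirical tail probability sits well below the bound, which is expected: McDiarmid’s inequality is uniform over all bounded-difference functions, while the sample mean of fair bits is about as well-behaved as they come.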

5 thoughts on “ISIT 2013: two plenaries on concentration of measure”

  1. dear Prof Raginsky,

the peasant concentration-of-measure results I learned were, e.g., of LLN and large-deviation type. can you explain how this is directly related to the r-fattening of a set? i understand you will get to it later, after (1). but intuitively how is (1) related to how a random variable concentrates around its mean?

    1. Dear Salazar —

      at least to me, there is no immediate intuitive route from (1) to concentration around the mean. You have to first spot the equivalence between (1) and concentration of Lipschitz functions around their medians (which is neither too obvious nor too complicated). Once you get there, there is one more step to take: the absolute difference between the mean and any median of a Lipschitz function can be uniformly bounded (in case of Gaussian-like concentration, by a quantity proportional to the Lipschitz constant). With these preliminaries out of the way, consider the sample mean. If the sample size is n, the sample mean is Lipschitz with constant 1/n.

  2. Prof. Raginsky,

    Thank you for writing this neat tour through these ideas. I have a very simple question that I am stumped by early in the post. That is, in example (1), I am having trouble seeing how to prove the Gaussian tail-like bound for the r-enlargement of A for all A. I see that the bound is plausible by Chernoff bounds for A = (-r,r), but I don’t see how to easily move to general A with P(A) >= 1/2. Is there an easy way to see this or does it require a more technical, perhaps geometric, analysis?
