Of the five plenary talks at this year’s ISIT, two were about concentration of measure: Katalin Marton’s Shannon lecture on “Distance-divergence inequalities” and Gabor Lugosi’s talk on “Concentration inequalities and the entropy method” the next morning. Since the topic of measure concentration is dear to my heart, I thought I would write down a few unifying themes.
When we speak of concentration of measure, we mean the following. Let be a random variable taking values in some metric space . Then we say that (or, more correctly, the distribution of ) has the concentration property if, for any set such that , we have
Here, is the distance from the point to the set :
Another way to express (1) is as follows: for any set and any , define the -blowup of by
Then has the concentration property if
In other words, has the concentration property if any set containing with not too small a probability can be “blown up” to contain with near-certainty.
Here are two classic examples of concentration:
- Gaussian distribution in Euclidean space. Let and take — the usual Euclidean distance. Let be a standard -dimensional Gaussian random variable, i.e., , where is the identity matrix. Then for any we have
- Uniform distribution in Hamming space. Let be the Hamming cube equipped with the normalized Hamming distance
that counts the fraction of bits in which and disagree. Let have the uniform distribution on , i.e., for all . Then
These two examples suggest that we should aim for “hard” statements in the form of sharp bounds on the concentration function
as opposed to merely “soft” statements of the form as . The $64,000 question is: how do we get such bounds?
There are two ways to accomplish this goal, and the main idea underlying these two ways is to replace sets with some other objects that are hopefully easier to handle. The first way is to replace sets by probability measures, the second is to replace them by functions. Here is what I mean:
Fix a set with . Let denote the distribution of , and let denote the conditional distribution of given . That is, for any (measurable) set we have
I am using the subscript notation instead of the more usual to indicate the fact that is a probability measure in its own right. In this way, we can associate to each non-null set a probability measure . Now, here is a very simple observation that turns out to be very consequential:
This is very easy to prove: for any set we have
so is absolutely continuous with respect to with the Radon-Nikodym derivative . Therefore, by definition of the divergence,
So if we are interested in bounding the probabilities of various sets , we may hope to get some mileage out of the relationship (2).
On the other hand, we may also associate to a set with the function
This function is Lipschitz: for any ,
where the last step is by the triangle inequality. Interchanging the roles of and , we get the Lipschitz property. Moreover, let us consider the random variable , where is our original -valued random variable. Then we immediately notice two things:
- For any , .
- If , then is a median of , in the sense that
(the second inequality is obviously true since is nonnegative with probability ).
These two observations suggest that we may obtain concentration bounds by bounding the probability that a given Lipschitz function of deviates from its median by more than . In fact, it is easy to derive an alternative expression for the concentration function :
where denotes any median of . We already showed, by passing from to , that is bounded from above by the quantity on the right-hand side of (3):
To prove the reverse inequality, fix any -Lipschitz function and consider the set , where is any median of . Then, by definition,
Moreover, if we consider the -blowup
then for any and any we must have
where the last step is by the Lipschitz property of . Consequently, by definition of the concentration function,
By passing to the functional viewpoint, we obtain another equivalent characterization of the concentration property: a random variable taking values in a metric space has the concentration property if real-valued Lipschitz functions are “nearly constant.”
Let’s look at the first, probabilistic viewpoint, which was born out of a 1996 breakthrough paper by Marton. Given a metric space , let us define the Wasserstein distance (or transportation distance) between any two probability measures and on it:
where the infimum is over all jointly distributed random variables , such that and . Now consider a random variable , for which we wish to establish concentration. What Marton showed is the following: Suppose the distribution of satisfies the transportation inequality
for some constant . Then has the concentration property, and moreover
Marton’s proof is breathtakingly beautiful. Consider any two sets with . Recalling our notation for conditional distributions, we can write
where in the first step we have used the triangle inequality, in the second we have used the fact that satisfies the transportation inequality (4), and in the last step we have used the formula (2). Now suppose that and let for some , where denotes set-theoretic complement. Then we can show that . On the other hand,
Combining these facts gives us the bound
that holds for all . If , then we get
so we indeed have concentration and a sharp bound on , at least for large enough values of . The main message here is that, in order to study concentration, it suffices to work on the level of probability measures and to focus one’s effort on showing that the distribution of satisfies a suitable transportation inequality. Since Marton’s original work, there have been many refinements and extensions, which I will not go into here. One such result, due to Sergey Bobkov and Friedrich Götze, says that satisfying a transportation inequality (4) is equivalent to the Gaussian concentration property
Now let’s look at the complementary functional viewpoint. Recall that we seek tight upper bounds on deviation probabilities of the form
It is easier to work with means instead of medians, and indeed it can be shown that concentration around the mean is equivalent to concentration around any median. So let’s focus on the mean. Let , as before, be a random variable over some metric space , and consider a Lipschitz function such that . We can apply the well-known Chernoff trick: for any we have
Now the whole affair hinges on the availability of tight upper bounds on the logarithmic moment-generating function . The entropy method, which was the subject of Gabor Lugosi’s plenary, is the name for a set of techniques for bounding by means of various inequalities involving the relative entropy between various tilted distributions derived from and itself. The entropy method has roots in the work of Michel Ledoux, who in turn distilled it from some very deep results of Michel Talagrand.
The simplest version of the entropy method goes something like this. Let us define, for any , the tilted distribution via
(assuming, of course, that exists and is finite). Then we have
Integrating and using the fact that , we get
Now suppose that we can bound for some . Then from (5) we have
which in turn gives the Gaussian tail bound
This is the so-called Herbst argument. Of course, I have glossed over the most nontrivial part — namely, showing that we can bound the relative entropy by a quadratic function of . Gabor’s talk contained numerous examples of how one could get such bounds in the special case when is an -tuple of independent random variables taking values in some base space , , and the function is not “too sensitive” to local modifications of its arguments. This takes us back to probability on metric spaces.
Here is one classic example. Suppose that our function has the bounded difference property, i.e., there exist some constants , such that changing the th argument of while keeping others constant will change the value of by at most :
We can express this more succinctly as a Lipschitz property of if we define the weighted Hamming metric
(we can assume without loss of generality that all the ‘s are strictly positive, because we can simply ignore those coordinates of that do not affect the value of ). Then the bounded difference property is equivalent to saying that is -Lipschitz with respect to this weighted Hamming metric. Moreover, it is possible to show that any product probability measure on the product space satisfies the transportation inequality
where the Wasserstein distance is computed with respect to the weighted Hamming metric (6), and . By the Bobkov–Götze result quoted above, this is equivalent to the concentration bound
This is the well-known McDiarmid’s inequality, which was originally derived using martingale techniques, but here we have arrived at it through a completely different route that took us back to where we started: concentration of Lipschitz functions around their means and/or medians, which (as we saw) is the same thing as the original, “stochastic-geometric” view of the concentration of measure phenomenon.
dear Prof Raginsky,
the peasant concentration-of-measure results I learned were, e.g., of LLN and large deviation-type. can you explain how this is directly related to the r-fattening of a set? i understand you will get to it in the later after (1). but intuitively how is (1) related to how a random variable concentrates around its mean?
Dear Salazar —
at least to me, there is no immediate intuitive route from (1) to concentration around the mean. You have to first spot the equivalence between (1) and concentration of Lipschitz functions around their medians (which is neither too obvious nor too complicated). Once you get there, there is one more step to take: the absolute difference between the mean and any median of a Lipschitz function can be uniformly bounded (in case of Gaussian-like concentration, by a quantity proportional to the Lipschitz constant). With these preliminaries out of the way, consider the sample mean. If the sample size is n, the sample mean is Lipschitz with constant 1/n.
Prof. Raginsky,
Thank you for writing this neat tour through these ideas. I have a very simple question that I am stumped by early in the post. That is, in example (1), I am having trouble seeing how to prove the Gaussian tail-like bound for the r-enlargement of A for all A. I see that the bound is plausible by Chernoff bounds for A = (-r,r), but I don’t see how to easily move to general A with P(A) >= 1/2. Is there an easy way to see this or does it require a more technical, perhaps geometric, analysis?
This particular tail bound is a consequence of the Gaussian isoperimetric inequality: https://en.wikipedia.org/wiki/Gaussian_isoperimetric_inequality