Larry Wasserman’s recent post about misinterpretation of p-values is a good reminder about a fundamental distinction anyone working in information theory, control or machine learning should be aware of, namely the distinction between stochastic kernels and conditional probability distributions.
Igal Sason and I have just posted to arXiv our tutorial paper “Concentration of Measure Inequalities in Information Theory, Communications and Coding”, which was submitted to Foundations and Trends in Communications and Information Theory. Here is the abstract:
This tutorial article is focused on some of the key modern mathematical tools that are used for the derivation of concentration inequalities, on their links to information theory, and on their various applications to communications and coding.
The first part of this article introduces some classical concentration inequalities for martingales, and it also derives some recent refinements of these inequalities. The power and versatility of the martingale approach is exemplified in the context of binary hypothesis testing, codes defined on graphs and iterative decoding algorithms, and some other aspects that are related to wireless communications and coding.
The second part of this article introduces the entropy method for deriving concentration inequalities for functions of many independent random variables, and it also exhibits its multiple connections to information theory. The basic ingredients of the entropy method are discussed first in conjunction with the closely related topic of logarithmic Sobolev inequalities. This discussion is complemented by a related viewpoint based on probability in metric spaces. This viewpoint centers around the so-called transportation-cost inequalities, whose roots are in information theory. Some representative results on concentration for dependent random variables are briefly summarized, with emphasis on their connections to the entropy method.
Finally, the tutorial addresses several applications of the entropy method and related information-theoretic tools to problems in communications and coding. These include strong converses for several source and channel coding problems, empirical distributions of good channel codes with non-vanishing error probability, and an information-theoretic converse for concentration of measure.
There are already many excellent sources on concentration of measure; what makes ours different is the emphasis on information-theoretic aspects, both in the general theory and in applications. Comments, suggestions, thoughts are very welcome.
Every once in a while you come across a mathematical argument of such incredible beauty that you feel compelled to tell the whole world about it. This post is about one such gem: David Blackwell’s 1946 proof of Wald’s identity on the expected value of a randomly stopped random walk. In fact, even forty years after the publication of that paper, in a conversation with Morris DeGroot, Blackwell said: “That’s a paper I’m still very proud of. It just gives me pleasant feelings every time I think about it.”
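For readers who want to see the identity at work before reading the proof: Wald's identity says that for i.i.d. steps with finite mean and a stopping time N with finite expectation, E[S_N] = E[N] E[X_1]. The following is only a rough Monte Carlo sanity check, not Blackwell's argument. The biased ±1 walk, the symmetric exit barrier, and the names `stopped_walk` and `check_wald` are illustrative assumptions, not taken from the paper.

```python
import random


def stopped_walk(p, barrier, rng):
    """Run a +/-1 random walk until |S| first reaches `barrier`.

    Returns (S_N, N): the stopped position and the stopping time.
    """
    s, n = 0, 0
    while abs(s) < barrier:
        s += 1 if rng.random() < p else -1
        n += 1
    return s, n


def check_wald(p=0.6, barrier=3, trials=20000, seed=0):
    """Estimate both sides of Wald's identity by simulation.

    Returns (estimate of E[S_N], E[X_1] * estimate of E[N]).
    """
    rng = random.Random(seed)
    total_s = 0
    total_n = 0
    for _ in range(trials):
        s, n = stopped_walk(p, barrier, rng)
        total_s += s
        total_n += n
    mean_step = 2 * p - 1  # E[X_1] for a +/-1 step with P(+1) = p
    return total_s / trials, mean_step * total_n / trials


if __name__ == "__main__":
    lhs, rhs = check_wald()
    print(f"E[S_N] ~ {lhs:.3f}   E[X_1] E[N] ~ {rhs:.3f}")
```

The two printed estimates should agree up to Monte Carlo error. The exit time of a bounded interval is a stopping time with finite mean, so the hypotheses of the identity hold here.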
It’s that time again, the time to savor the dreamy delights of divergence!
In this post, we will look at a powerful information-theoretic method for deriving concentration-of-measure inequalities (i.e., tail bounds) for general functions of independent random variables.
The idea of developing statistical procedures that minimize an expected loss goes back to Laplace … [and] reappears in papers of Edgeworth. According to Neyman in his Lectures and Conferences: “After Edgeworth, the idea of the loss function was lost from sight for more than two decades …” It was truly revived only by the appearance on the statistical scene of Wald. Wald’s books Sequential Analysis and Statistical Decision Functions are based on that very idea of describing experiments by families of probability measures either on one given σ-field or on sequence of σ-fields to be chosen by the statistician. The idea seems logical enough if one is used to it. However, there is a paper by Fisher where he seems to express the opinion that such concepts are misleading and good enough only for Russian or American engineers.
Follow the link to Fisher’s paper for more curmudgeonly remarks about “Russians” and their “five-year plans.”
This passage from “The Glivenko-Cantelli problem, ten years later” by Michel Talagrand (J. Theoretical Probability, vol. 9, no. 2, pp. 371-384, 1996) will most likely be remembered forever as the best example of wry self-deprecating wit in an academic paper:
Over 10 years ago I wrote a paper that describes in great detail Glivenko-Cantelli classes. Despite the fact that Glivenko-Cantelli classes are certainly natural and important, this paper apparently has not been understood. The two main likely reasons are that the proofs are genuinely difficult; and that the paper displays bad taste of monumental proportion, in the sense that a lot of energy is devoted to extremely arcane measurability questions, which increases the difficulty of the proofs even more.
This is the first post in a series about the versatility and the power of the good old Kullback-Leibler divergence.
Today I will describe a beautiful application of the divergence to the problem of determining mixing rates of finite Markov chains. This argument is quite recent, and comes from a nice paper by Sergey Bobkov and Prasad Tetali (“Modified logarithmic Sobolev inequalities in discrete settings,” Journal of Theoretical Probability, vol. 19, no. 2, pp. 289-336, 2006; preprint). Since my interest here is information-theoretic, I will take for granted certain facts from the theory of finite Markov chains.
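To make the object of study concrete, here is a minimal sketch that tracks the divergence D(μP^t ‖ π) between the time-t distribution of a small chain and its stationary distribution π. The three-state transition matrix, the starting distribution, and the function names are assumptions chosen for illustration, not taken from the Bobkov-Tetali paper. The sketch only exhibits the monotone decay guaranteed by the data processing inequality; it does not compute the modified logarithmic Sobolev constant that controls the rate.

```python
import math


def step(mu, P):
    """One step of the chain: returns the row vector mu P."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]


def kl(mu, pi):
    """Kullback-Leibler divergence D(mu || pi) in nats (0 log 0 = 0)."""
    return sum(m * math.log(m / p) for m, p in zip(mu, pi) if m > 0)


def divergence_path(mu, P, pi, steps):
    """Return [D(mu P^t || pi) for t = 0, ..., steps]."""
    path = [kl(mu, pi)]
    for _ in range(steps):
        mu = step(mu, P)
        path.append(kl(mu, pi))
    return path


# Illustrative lazy walk on three states (an assumption for this sketch).
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
pi = [0.25, 0.50, 0.25]   # stationary: pi P = pi
mu0 = [1.0, 0.0, 0.0]     # start deterministically in state 0

if __name__ == "__main__":
    for t, d in enumerate(divergence_path(mu0, P, pi, 10)):
        print(t, d)
```

Because π is stationary, D(μP ‖ π) = D(μP ‖ πP) ≤ D(μ ‖ π), so the printed sequence is non-increasing. How fast it decays is exactly the mixing-rate question.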