## Concentrate, concentrate!

Igal Sason and I have just posted to arXiv our tutorial paper “Concentration of Measure Inequalities in Information Theory, Communications and Coding”, which was submitted to Foundations and Trends in Communications and Information Theory. Here is the abstract:

This tutorial article is focused on some of the key modern mathematical tools that are used for the derivation of concentration inequalities, on their links to information theory, and on their various applications to communications and coding.

The first part of this article introduces some classical concentration inequalities for martingales, and it also derives some recent refinements of these inequalities. The power and versatility of the martingale approach is exemplified in the context of binary hypothesis testing, codes defined on graphs and iterative decoding algorithms, and some other aspects that are related to wireless communications and coding.

The second part of this article introduces the entropy method for deriving concentration inequalities for functions of many independent random variables, and it also exhibits its multiple connections to information theory. The basic ingredients of the entropy method are discussed first in conjunction with the closely related topic of logarithmic Sobolev inequalities. This discussion is complemented by a related viewpoint based on probability in metric spaces. This viewpoint centers around the so-called transportation-cost inequalities, whose roots are in information theory. Some representative results on concentration for dependent random variables are briefly summarized, with emphasis on their connections to the entropy method.

Finally, the tutorial addresses several applications of the entropy method and related information-theoretic tools to problems in communications and coding. These include strong converses for several source and channel coding problems, empirical distributions of good channel codes with non-vanishing error probability, and an information-theoretic converse for concentration of measure.

There are already many excellent sources on concentration of measure; what makes ours different is the emphasis on information-theoretic aspects, both in the general theory and in applications. Comments, suggestions, thoughts are very welcome.

## ITA 2012: some evanescent reflections

Getting settled in at my new academic home, with all the attendant administrivia, has sucked up almost two months of missed blogging opportunities. So, here goes.

(more…)

## Lower bounds for passive and active learning

Sasha Rakhlin and I will be presenting our paper “Lower bounds for passive and active learning” at this year’s NIPS, which will be taking place in Granada, Spain from December 12 to December 15. The proofs of our main results rely heavily on information-theoretic techniques, specifically the data processing inequality for -divergences and a certain type of constant-weight binary codes.

## Updates! Get your updates here!

Just a couple of short items, while I catch my breath.

1. First of all, starting January 1, 2012 I will find myself amidst the lovely cornfields of Central Illinois, where I will be an assistant professor in the Department of Electrical and Computer Engineering at UIUC. This will be a homecoming of sorts, since I have spent three years there as a Beckman Fellow. My new home will be in the Coordinated Science Laboratory, where I will continue doing (and blogging about) the same things I do (and blog about).

2. Speaking of Central Illinois, last week I was at the Allerton Conference, where I had tried my best to preach Uncle Judea‘s gospel to ~~anyone willing to listen~~ information theorists and their fellow travelers. The paper, entitled “Directed information and Pearl’s causal calculus,” is now up on arxiv, and here is the abstract:

Probabilistic graphical models are a fundamental tool in statistics, machine learning, signal processing, and control. When such a model is defined on a directed acyclic graph (DAG), one can assign a partial ordering to the events occurring in the corresponding stochastic system. Based on the work of Judea Pearl and others, these DAG-based “causal factorizations” of joint probability measures have been used for characterization and inference of functional dependencies (causal links). This mostly expository paper focuses on several connections between Pearl’s formalism (and in particular his notion of “intervention”) and information-theoretic notions of causality and feedback (such as causal conditioning, directed stochastic kernels, and directed information). As an application, we show how conditional directed information can be used to develop an information-theoretic version of Pearl’s “back-door” criterion for identifiability of causal effects from passive observations. This suggests that the back-door criterion can be thought of as a causal analog of statistical sufficiency.

If you had seen my posts on stochastic kernels, directed information, and causal interventions, you will, more or less, know what to expect.

Incidentally, due to my forthcoming move to UIUC, this will be my last Allerton paper!

## ISIT 2011: favorite talks

Obligatory disclaimer: YMMV, “favorite” does not mean “best,” etc. etc.

- Emmanuel Abbe and Andrew Barron, “Polar coding schemes for the AWGN channel” (pdf)
- Tom Cover, “On the St. Petersburg paradox”
- Paul Cuff, Tom Cover, Gowtham Kumar, Lei Zhao, “A lattice of gambles”
- Ioanna Ioannou, Charalambos Charalambous, Sergey Loyka, “Outage probability under channel distribution uncertainty” (pdf; longer version: arxiv:1102.1103)
- Mohammad Naghshvar, Tara Javidi, “Performance bounds for active sequential hypothesis testing”
- Chris Quinn, Negar Kiyavash, Todd Coleman, “Equivalence between minimal generative model graphs and directed information graphs” (pdf)
- Ofer Shayevitz, “On Rényi measures and hypothesis testing” (long version: arxiv:1012.4401)

The problem of constructing polar codes for channels with continuous input and output alphabets can be reduced, in a certain sense, to the problem of constructing finitely supported approximations to capacity-achieving distributions. This work analyzes several such approximations for the AWGN channel. In particular, one approximation uses quantiles and approaches capacity at a rate that decays *exponentially* with support size. The proof of this fact uses a neat trick of upper-bounding the Kullback-Leibler divergence by the chi-square distance and then exploiting the law of large numbers.

A fitting topic, since this year’s ISIT took place in St. Petersburg! Tom has presented a reformulation of the problem underlying this (in)famous paradox in terms of finding the best allocation of initial capital so as to optimize various notions of relative wealth. This reformulation obviates the need for various extra assumptions, such as diminishing marginal returns (i.e., concave utilities), and thus provides a means of resolving the paradox from first principles.

There is a well-known correspondence between martingales and “fair” gambling systems. Paul and co-authors explore another correspondence, between fair gambles and Lorenz curves used in econometric modeling, to study certain stochastic orderings and transformations of martingales. There are nice links to the theory of majorization and, through that, to Blackwell’s framework for comparing statistical experiments in terms of their expected risks.

The outage probability of a general channel with stochastic fading is the probability that the conditional input-output mutual information given the fading state falls below the given rate. In this paper, it is assumed that the state distribution is not known exactly, but there is an upper bound on its divergence from some fixed “nominal” distribution (this model of statistical uncertainty has been used previously in the context of robust control). The variational representation of the divergence (as a Legendre-Fenchel transform of the moment-generating function) then allows for a clean asymptotic analysis of the outage probability.

Mohammad and Tara show how dynamic programming techniques can be used to develop tight converse bounds for sequential hypothesis testing problems with feedback, in which it is possible to adaptively control the quality of the observation channel. This viewpoint is a lot cleaner and more conceptually straightforward than “classical” proofs based on martingales (à la Burnashev). This new technique is used to analyze asymptotically optimal strategies for sequential -ary hypothesis testing, variable-length coding with feedback, and noisy dynamic search.

For networks of interacting discrete-time stochastic processes possessing a certain conditional independence structure (motivating example: discrete-time approximations of smooth dynamical systems), Chris, Negar and Todd show the equivalence between two types of graphical models for these networks: (1) generative models that are minimal in a certain “combinatorial” sense and (2) information-theoretic graphs, in which the edges are drawn based on directed information.

Ofer obtained a new variational characterization of Rényi entropy and divergence that considerably simplifies their analysis, in many cases completely replacing delicate arguments based on Taylor expansions with purely information-theoretic proofs. He also develops a new operational characterization of these information measures in terms of distributed composite hypothesis testing.

## Linkage

In lieu of serious posting, which will resume in the new year, a few links:

- Several videos of David Blackwell, over at The Inherent Uncertainty
- The (in)famous Witsenhausen counterexample in decentralized control theory now has its own Wikipedia entry (and I think I know who is behind it).
- Emmanuel Abbe, guest-blogging at Combinatorics and More, presents his perspective on Erdal Arikan’s polar codes. Much of what he says makes me think of Terence Tao‘s work on structure and randomness in “large” combinatorial objects.
- Markov decision processes make a surprising appearance in a paper on subexponential lower bounds for certain randomized pivot rules for the simplex algorithm.
- Computation and Control: a new blog by Jerome Le Ny.
- And, last but not least, exorcising Laplace’s demon.

## Bad taste of monumental proportions

This passage from “The Glivenko-Cantelli problem, ten years later” by Michel Talagrand (*J. Theoretical Probability*, vol. 9, no. 2, pp. 371-384, 1996) will most likely be remembered forever as the best example of wry self-deprecating wit in an academic paper:

Over 10 years ago I wrote a paper that describes in great detail Glivenko-Cantelli classes. Despite the fact that Glivenko-Cantelli classes are certainly natural and important, this paper apparently has not been understood. The two main likely reasons are that the proofs are genuinely difficult; and that the paper displays bad taste of monumental proportion, in the sense that a lot of energy is devoted to extremely arcane measurability questions, which increases the difficulty of the proofs even more.

## Sincerely, your biggest Fano

It’s time to fire up the Shameless Self-Promotion Engine again, for I am about to announce a preprint and a paper to be published. Both deal with more or less the same problem — i.e., fundamental limits of certain sequential procedures — and both rely on the same set of techniques: metric entropy, Fano’s inequality, and bounds on the mutual information through divergence with auxiliary probability measures.

So, without further ado, I give you: (more…)

## Typical, just typical

In the spirit of shameless self-promotion, I would like to announce a new preprint (preliminary version was presented in July at ISIT 2010 in Austin, TX):

Maxim Raginsky, “Empirical processes, typical sequences and coordinated actions in standard Borel spaces”, arXiv:1009.0282, submitted to *IEEE Transactions on Information Theory*

Abstract:This paper proposes a new notion of typical sequences on a wide class of abstract alphabets (so-called standard Borel spaces), which is based on approximations of memoryless sources by empirical distributions uniformly over a class of measurable “test functions.” In the finite-alphabet case, we can take all uniformly bounded functions and recover the usual notion of strong typicality (or typicality under the total variation distance). For a general alphabet, however, this function class turns out to be too large, and must be restricted. With this in mind, we define typicality with respect to any Glivenko–Cantelli function class (i.e., a function class that admits a Uniform Law of Large Numbers) and demonstrate its power by giving simple derivations of the fundamental limits on the achievable rates in several source coding scenarios, in which the relevant operational criteria pertain to reproducing empirical averages of a general-alphabet stationary memoryless source with respect to a suitable function class.

1comment