The Information Structuralist

Sampling Using Diffusion Processes, from Langevin to Schrödinger

Posted in Control, Feedback, Models of Complex Stochastic Systems, Probability by mraginsky on September 2, 2021

These notes are based on the tutorial I gave at the Geometric Methods in Optimization and Sampling Boot Camp at the Simons Institute in Berkeley.

Suppose we wish to obtain samples from some probability measure {\mu} on {{\mathbb R}^d}. If {\mu} has a sufficiently well-behaved density {f} with respect to the Lebesgue measure, i.e., {\mu(dx) = f(x) dx}, then we can use the (overdamped) continuous-time Langevin dynamics, governed by the Ito stochastic differential equation (SDE)

\displaystyle  d X_t = \frac{1}{2}\nabla \log f(X_t) dt + dW_t, \qquad t \ge 0 \ \ \ \ \ (1)

where the initial condition {X_0} is generated according to some probability law {\mu_0}, and {(W_t)_{t \ge 0}} is the standard {d}-dimensional Brownian motion. Let {\mu_t} denote the probability law of {X_t}. Then, under appropriate regularity conditions on {f}, one can establish the following:

  • {\mu} is the unique invariant distribution of (1), i.e., if {\mu_0 = \mu}, then {\mu_t = \mu} for all {t}.
  • {\mu_t} converges to {\mu} in a suitable sense as {t \rightarrow \infty} — in fact, it is often possible to show that there exists a constant {c > 0} that depends only on {\mu}, such that one has the exponential convergence to equilibrium

    \displaystyle  		{\rm dist}(\mu_t, \mu) \le e^{-t/c}{\rm dist}(\mu_0, \mu)

    for some distance between probability measures on {{\mathbb R}^d}.

In this sense, the Langevin process (1) gives only approximate samples from {\mu}. I would like to discuss an alternative approach that uses diffusion processes to obtain exact samples in finite time. This approach is based on ideas that appeared in two papers from the 1930s by Erwin Schrödinger in the context of physics, and is now referred to as the Schrödinger bridge problem.
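
To make the "approximate samples" point concrete, here is a minimal sketch of the Euler-Maruyama discretization of (1), often called the unadjusted Langevin algorithm. Everything specific in it is an assumption chosen for illustration and is not taken from the tutorial: the target {\mu} is the standard Gaussian on {{\mathbb R}} (so {\nabla \log f(x) = -x}), all chains start at {x_0 = 5}, and the step size and horizon are arbitrary. The function names are hypothetical.

```python
import math
import random


def grad_log_density(x):
    """Score of the assumed target N(0, 1): d/dx log f(x) = -x."""
    return -x


def langevin_sample(x0, step, n_steps, rng):
    """Euler-Maruyama discretization of dX = (1/2) grad log f(X) dt + dW."""
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Drift uses the factor 1/2 from (1); the Brownian increment over a
        # time step of length `step` has standard deviation sqrt(step).
        x = x + 0.5 * step * grad_log_density(x) + math.sqrt(step) * noise
    return x


def run_chains(n_chains=1000, x0=5.0, step=0.01, n_steps=1200, seed=0):
    """Run independent chains and return the empirical mean and variance."""
    rng = random.Random(seed)
    samples = [langevin_sample(x0, step, n_steps, rng) for _ in range(n_chains)]
    mean = sum(samples) / n_chains
    var = sum((s - mean) ** 2 for s in samples) / n_chains
    return mean, var
```

Calling `run_chains()` should return a mean near 0 and a variance near 1, but never exactly: the law at any finite time still remembers the initial condition, and the time discretization adds a small bias of its own.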


Information flow on graphs

Posted in Information Theory, Models of Complex Stochastic Systems, Probability by mraginsky on May 3, 2014

Models of complex systems built from simple, locally interacting components arise in many fields, including statistical physics, biology, artificial intelligence, communication networks, etc. The quest to understand and to quantify the fundamental limits on the ability of such systems to store and process information has led to a variety of interesting and insightful results that draw upon probability, combinatorics, information theory, discrete and continuous dynamical systems, etc. In this post, I would like to focus on a model of distributed storage that was analyzed in 1975 by Donald Dawson in a very nice paper, which deserves to be more widely known.


ISIT 2013: two plenaries on concentration of measure

Posted in Conference Blogging, Information Theory, Mathematics, Probability by mraginsky on July 29, 2013

Of the five plenary talks at this year’s ISIT, two were about concentration of measure: Katalin Marton’s Shannon lecture on “Distance-divergence inequalities” and Gabor Lugosi’s talk on “Concentration inequalities and the entropy method” the next morning. Since the topic of measure concentration is dear to my heart, I thought I would write down a few unifying themes.


Stochastic kernels vs. conditional probability distributions

Posted in Control, Feedback, Information Theory, Probability by mraginsky on March 17, 2013

Larry Wasserman’s recent post about misinterpretation of p-values is a good reminder about a fundamental distinction anyone working in information theory, control or machine learning should be aware of: namely, the distinction between stochastic kernels and conditional probability distributions.


Concentrate, concentrate!

Posted in Information Theory, Mathematics, Narcissism, Papers and Preprints, Probability by mraginsky on December 19, 2012

Igal Sason and I have just posted to arXiv our tutorial paper “Concentration of Measure Inequalities in Information Theory, Communications and Coding”, which was submitted to Foundations and Trends in Communications and Information Theory. Here is the abstract:

This tutorial article is focused on some of the key modern mathematical tools that are used for the derivation of concentration inequalities, on their links to information theory, and on their various applications to communications and coding.

The first part of this article introduces some classical concentration inequalities for martingales, and it also derives some recent refinements of these inequalities. The power and versatility of the martingale approach is exemplified in the context of binary hypothesis testing, codes defined on graphs and iterative decoding algorithms, and some other aspects that are related to wireless communications and coding.

The second part of this article introduces the entropy method for deriving concentration inequalities for functions of many independent random variables, and it also exhibits its multiple connections to information theory. The basic ingredients of the entropy method are discussed first in conjunction with the closely related topic of logarithmic Sobolev inequalities. This discussion is complemented by a related viewpoint based on probability in metric spaces. This viewpoint centers around the so-called transportation-cost inequalities, whose roots are in information theory. Some representative results on concentration for dependent random variables are briefly summarized, with emphasis on their connections to the entropy method.

Finally, the tutorial addresses several applications of the entropy method and related information-theoretic tools to problems in communications and coding. These include strong converses for several source and channel coding problems, empirical distributions of good channel codes with non-vanishing error probability, and an information-theoretic converse for concentration of measure.

There are already many excellent sources on concentration of measure; what makes ours different is the emphasis on information-theoretic aspects, both in the general theory and in applications. Comments, suggestions, thoughts are very welcome.

Blackwell’s proof of Wald’s identity

Posted in Mathematics, Probability by mraginsky on April 29, 2011

Every once in a while you come across a mathematical argument of such incredible beauty that you feel compelled to tell the whole world about it. This post is about one such gem: David Blackwell’s 1946 proof of Wald’s identity on the expected value of a randomly stopped random walk. In fact, even forty years after the publication of that paper, in a conversation with Morris DeGroot, Blackwell said: “That’s a paper I’m still very proud of. It just gives me pleasant feelings every time I think about it.”
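
For readers who want to see the identity in action: for i.i.d. steps {X_1, X_2, \ldots} with finite mean and a stopping time {N} with {{\mathbb E}[N] < \infty}, Wald's identity says that {{\mathbb E}[X_1 + \ldots + X_N] = {\mathbb E}[N] \, {\mathbb E}[X_1]}. The sketch below is only a Monte Carlo sanity check, not Blackwell's argument. The setup is an assumption made for illustration: the steps are fair die rolls (mean 3.5), and {N} is the first time the running sum reaches 20. The function names are hypothetical.

```python
import random


def stopped_walk(threshold, rng):
    """Roll a fair die until the running sum reaches `threshold`.

    Returns (N, S_N): the number of rolls and the stopped sum.
    """
    total, n_rolls = 0, 0
    while total < threshold:
        total += rng.randint(1, 6)
        n_rolls += 1
    return n_rolls, total


def check_wald(n_trials=20000, threshold=20, seed=0):
    """Return empirical E[S_N] and E[N] * E[X_1], which should nearly agree."""
    rng = random.Random(seed)
    sum_n, sum_s = 0, 0
    for _ in range(n_trials):
        n_rolls, total = stopped_walk(threshold, rng)
        sum_n += n_rolls
        sum_s += total
    mean_n = sum_n / n_trials
    mean_s = sum_s / n_trials
    return mean_s, 3.5 * mean_n
```

The two numbers returned by `check_wald()` should agree up to Monte Carlo error, even though {N} depends on the rolls themselves.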


Divergence in everything: erasure divergence and concentration inequalities

Posted in Information Theory, Probability, Statistical Learning and Inference by mraginsky on March 18, 2011

It’s that time again, the time to savor the dreamy delights of divergence!

(image yoinked from Sergio Verdú’s 2007 Shannon Lecture slides)

In this post, we will look at a powerful information-theoretic method for deriving concentration-of-measure inequalities (i.e., tail bounds) for general functions of independent random variables.


In Soviet Russia, the sigma-field conditions on you

Quote of the day, from Asymptotics in Statistics: Some Basic Concepts by Lucien Le Cam and Grace Yang (emphasis mine):

The idea of developing statistical procedures that minimize an expected loss goes back to Laplace … [and] reappears in papers of Edgeworth. According to Neyman in his Lectures and Conferences: “After Edgeworth, the idea of the loss function was lost from sight for more than two decades …” It was truly revived only by the appearance on the statistical scene of Wald. Wald’s books Sequential Analysis and Statistical Decision Functions are based on that very idea of describing experiments by families of probability measures either on one given \sigma-field or on sequence of \sigma-fields to be chosen by the statistician. The idea seems logical enough if one is used to it. However, there is a paper by Fisher where he seems to express the opinion that such concepts are misleading and good enough only for Russian or American engineers.

Follow the link to Fisher’s paper for more curmudgeonly remarks about “Russians” and their “five-year plans.”

Bad taste of monumental proportions

Posted in Papers and Preprints, Probability by mraginsky on October 16, 2010

This passage from “The Glivenko-Cantelli problem, ten years later” by Michel Talagrand (J. Theoretical Probability, vol. 9, no. 2, pp. 371-384, 1996) will most likely be remembered forever as the best example of wry self-deprecating wit in an academic paper:

Over 10 years ago I wrote a paper that describes in great detail Glivenko-Cantelli classes. Despite the fact that Glivenko-Cantelli classes are certainly natural and important, this paper apparently has not been understood. The two main likely reasons are that the proofs are genuinely difficult; and that the paper displays bad taste of monumental proportion, in the sense that a lot of energy is devoted to extremely arcane measurability questions, which increases the difficulty of the proofs even more.

Divergence in everything: mixing rates for finite Markov chains

Posted in Information Theory, Probability by mraginsky on October 14, 2010

This is the first post in a series about the versatility and the power of the good old Kullback-Leibler divergence.

(image yoinked from Sergio Verdú’s 2007 Shannon Lecture slides)

Today I will describe a beautiful application of the divergence to the problem of determining mixing rates of finite Markov chains. This argument is quite recent, and comes from a nice paper by Sergey Bobkov and Prasad Tetali (“Modified logarithmic Sobolev inequalities in discrete settings,” Journal of Theoretical Probability, vol. 19, no. 2, pp. 289-336, 2006; preprint). Since my interest here is information-theoretic, I will take for granted certain facts from the theory of finite Markov chains.
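
As a small illustration of the quantity involved (not the Bobkov-Tetali argument itself), the sketch below tracks the divergence {D(\mu P^n \| \pi)} between the law of a finite chain after {n} steps and its invariant distribution {\pi}. By the data processing inequality this sequence is nonincreasing, and how fast it decays is what a mixing-rate bound quantifies. The three-state transition matrix, its invariant distribution and the function names are all assumptions made for this example.

```python
import math


def step(dist, P):
    """One step of the chain: row vector `dist` times transition matrix `P`."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def kl_divergence(p, q):
    """D(p || q) in nats, with the convention 0 * log 0 = 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def divergence_trajectory(mu0, P, pi, n_steps):
    """Return [D(mu0 P^n || pi) for n = 0, ..., n_steps]."""
    dist, divs = list(mu0), []
    for _ in range(n_steps + 1):
        divs.append(kl_divergence(dist, pi))
        dist = step(dist, P)
    return divs


# Assumed example: a lazy birth-death chain on three states.
# Its invariant distribution is (1/4, 1/2, 1/4), which can be checked by hand.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
PI = [0.25, 0.5, 0.25]
```

For example, `divergence_trajectory([1.0, 0.0, 0.0], P, PI, 15)` starts at {\log 4} (a point mass on the first state) and decreases toward zero at a geometric rate.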