# The Information Structuralist

## Sampling Using Diffusion Processes, from Langevin to Schrödinger

Posted in Control, Feedback, Models of Complex Stochastic Systems, Probability by mraginsky on September 2, 2021

These notes are based on the tutorial I gave at the Geometric Methods in Optimization and Sampling Boot Camp at the Simons Institute in Berkeley.

Suppose we wish to obtain samples from some probability measure ${\mu}$ on ${{\mathbb R}^d}$. If ${\mu}$ has a sufficiently well-behaved density ${f}$ with respect to the Lebesgue measure, i.e., ${\mu(dx) = f(x) dx}$, then we can use the (overdamped) continuous-time Langevin dynamics, governed by the Ito stochastic differential equation (SDE)

$\displaystyle d X_t = \frac{1}{2}\nabla \log f(X_t) dt + dW_t, \qquad t \ge 0 \ \ \ \ \ (1)$

where the initial condition ${X_0}$ is generated according to some probability law ${\mu_0}$, and ${(W_t)_{t \ge 0}}$ is the standard ${d}$-dimensional Brownian motion. Let ${\mu_t}$ denote the probability law of ${X_t}$. Then, under appropriate regularity conditions on ${f}$, one can establish the following:

• ${\mu}$ is the unique invariant distribution of (1), i.e., if ${\mu_0 = \mu}$, then ${\mu_t = \mu}$ for all ${t}$.
• ${\mu_t}$ converges to ${\mu}$ in a suitable sense as ${t \rightarrow \infty}$ — in fact, it is often possible to show that there exists a constant ${c > 0}$ that depends only on ${\mu}$, such that one has the exponential convergence to equilibrium

$\displaystyle {\rm dist}(\mu_t, \mu) \le e^{-t/c}{\rm dist}(\mu_0, \mu)$

for some distance between probability measures on ${{\mathbb R}^d}$.

In this sense, the Langevin process (1) gives only approximate samples from ${\mu}$. I would like to discuss an alternative approach that uses diffusion processes to obtain exact samples in finite time. This approach is based on ideas that appeared in two papers from the 1930s by Erwin Schrödinger in the context of physics, and is now referred to as the Schrödinger bridge problem.

## Information flow on graphs

Posted in Information Theory, Models of Complex Stochastic Systems, Probability by mraginsky on May 3, 2014

Models of complex systems built from simple, locally interacting components arise in many fields, including statistical physics, biology, artificial intelligence, communication networks, etc. The quest to understand and to quantify the fundamental limits on the ability of such systems to store and process information has led to a variety of interesting and insightful results that draw upon probability, combinatorics, information theory, discrete and continuous dynamical systems, etc. In this post, I would like to focus on a model of distributed storage that was analyzed in 1975 by Donald Dawson in a very nice paper, which deserves to be more widely known.

Just a couple of short items, while I catch my breath.

1. First of all, starting January 1, 2012 I will find myself amidst the lovely cornfields of Central Illinois, where I will be an assistant professor in the Department of Electrical and Computer Engineering at UIUC. This will be a homecoming of sorts, since I have spent three years there as a Beckman Fellow. My new home will be in the Coordinated Science Laboratory, where I will continue doing (and blogging about) the same things I do (and blog about).

2. Speaking of Central Illinois, last week I was at the Allerton Conference, where I had tried my best to preach Uncle Judea‘s gospel to anyone willing to listen information theorists and their fellow travelers. The paper, entitled “Directed information and Pearl’s causal calculus,” is now up on arxiv, and here is the abstract:

Probabilistic graphical models are a fundamental tool in statistics, machine learning, signal processing, and control. When such a model is defined on a directed acyclic graph (DAG), one can assign a partial ordering to the events occurring in the corresponding stochastic system. Based on the work of Judea Pearl and others, these DAG-based “causal factorizations” of joint probability measures have been used for characterization and inference of functional dependencies (causal links). This mostly expository paper focuses on several connections between Pearl’s formalism (and in particular his notion of “intervention”) and information-theoretic notions of causality and feedback (such as causal conditioning, directed stochastic kernels, and directed information). As an application, we show how conditional directed information can be used to develop an information-theoretic version of Pearl’s “back-door” criterion for identifiability of causal effects from passive observations. This suggests that the back-door criterion can be thought of as a causal analog of statistical sufficiency.

If you had seen my posts on stochastic kernels, directed information, and causal interventions, you will, more or less, know what to expect.

Incidentally, due to my forthcoming move to UIUC, this will be my last Allerton paper!

## ISIT 2011: favorite talks

Obligatory disclaimer: YMMV, “favorite” does not mean “best,” etc. etc.

• Emmanuel Abbe and Andrew Barron, “Polar coding schemes for the AWGN channel” (pdf)
• The problem of constructing polar codes for channels with continuous input and output alphabets can be reduced, in a certain sense, to the problem of constructing finitely supported approximations to capacity-achieving distributions. This work analyzes several such approximations for the AWGN channel. In particular, one approximation uses quantiles and approaches capacity at a rate that decays exponentially with support size. The proof of this fact uses a neat trick of upper-bounding the Kullback-Leibler divergence by the chi-square distance and then exploiting the law of large numbers.

• Tom Cover, “On the St. Petersburg paradox”
• A fitting topic, since this year’s ISIT took place in St. Petersburg! Tom has presented a reformulation of the problem underlying this (in)famous paradox in terms of finding the best allocation of initial capital so as to optimize various notions of relative wealth. This reformulation obviates the need for various extra assumptions, such as diminishing marginal returns (i.e., concave utilities), and thus provides a means of resolving the paradox from first principles.

• Paul Cuff, Tom Cover, Gowtham Kumar, Lei Zhao, “A lattice of gambles”
• There is a well-known correspondence between martingales and “fair” gambling systems. Paul and co-authors explore another correspondence, between fair gambles and Lorenz curves used in econometric modeling, to study certain stochastic orderings and transformations of martingales. There are nice links to the theory of majorization and, through that, to Blackwell’s framework for comparing statistical experiments in terms of their expected risks.

• Ioanna Ioannou, Charalambos Charalambous, Sergey Loyka, “Outage probability under channel distribution uncertainty” (pdf; longer version: arxiv:1102.1103)
• The outage probability of a general channel with stochastic fading is the probability that the conditional input-output mutual information given the fading state falls below the given rate. In this paper, it is assumed that the state distribution is not known exactly, but there is an upper bound on its divergence from some fixed “nominal” distribution (this model of statistical uncertainty has been used previously in the context of robust control). The variational representation of the divergence (as a Legendre-Fenchel transform of the moment-generating function) then allows for a clean asymptotic analysis of the outage probability.

• Mohammad Naghshvar, Tara Javidi, “Performance bounds for active sequential hypothesis testing”
• Mohammad and Tara show how dynamic programming techniques can be used to develop tight converse bounds for sequential hypothesis testing problems with feedback, in which it is possible to adaptively control the quality of the observation channel. This viewpoint is a lot cleaner and more conceptually straightforward than “classical” proofs based on martingales (à la Burnashev). This new technique is used to analyze asymptotically optimal strategies for sequential $M$-ary hypothesis testing, variable-length coding with feedback, and noisy dynamic search.

• Chris Quinn, Negar Kiyavash, Todd Coleman, “Equivalence between minimal generative model graphs and directed information graphs” (pdf)
• For networks of interacting discrete-time stochastic processes possessing a certain conditional independence structure (motivating example: discrete-time approximations of smooth dynamical systems), Chris, Negar and Todd show the equivalence between two types of graphical models for these networks: (1) generative models that are minimal in a certain “combinatorial” sense and (2) information-theoretic graphs, in which the edges are drawn based on directed information.

• Ofer Shayevitz, “On Rényi measures and hypothesis testing” (long version: arxiv:1012.4401)
• Ofer obtained a new variational characterization of Rényi entropy and divergence that considerably simplifies their analysis, in many cases completely replacing delicate arguments based on Taylor expansions with purely information-theoretic proofs. He also develops a new operational characterization of these information measures in terms of distributed composite hypothesis testing.

## Deadly ninja weapons: Blackwell’s principle of irrelevant information

Having more information when making decisions should always help, it seems. However, there are situations in which this is not the case. Suppose that you observe two pieces of information, ${x}$ and ${y}$, which you can use to choose an action ${u}$. Suppose also that, upon choosing ${u}$, you incur a cost ${c(x,u)}$. For simplicity let us assume that ${x}$, ${y}$, and ${u}$ take values in finite sets ${{\mathsf X}}$, ${{\mathsf Y}}$, and ${{\mathsf U}}$, respectively. Then it is obvious that, no matter which “strategy” for choosing ${u}$ you follow, you cannot do better than ${u^*(x) = \displaystyle{\rm arg\,min}_{u \in {\mathsf U}} c(x,u)}$. More formally, for any strategy ${\gamma : {\mathsf X} \times {\mathsf Y} \rightarrow {\mathsf U}}$ we have

$\displaystyle c(x,u^*(x)) = \min_{u \in {\mathsf U}} c(x,u) \le c(x,\gamma(x,y)).$

Thus, the extra information ${y}$ is irrelevant. Why? Because the cost you incur does not depend on ${y}$ directly, though it may do so through ${u}$.

Interestingly, as David Blackwell has shown in 1964 in a three-page paper, this seemingly innocuous argument does not go through when ${{\mathsf X}}$, ${{\mathsf Y}}$, and ${{\mathsf U}}$ are Borel subsets of Euclidean spaces, the cost function ${c}$ is bounded and Borel-measurable, and the strategies ${\gamma}$ are required to be measurable as well. However, if ${x}$ and ${y}$ are random variables with a known joint distribution ${P}$, then ${y}$ is indeed irrelevant for the purpose of minimizing expected cost.

Warning: lots of measure-theoretic noodling below the fold; if that is not your cup of tea, you can just assume that all sets are finite and go with the poor man’s version stated in the first paragraph. Then all the results below will hold.

## Autumn travels

I have been on the road for the past few days. First I went to Washington DC to visit University of Maryland at College Park and to present my work on empirical processes and typical sequences at their Information and Coding Theory Seminar. A scientist’s dream — two hours in front of a blackboard, no slides!

And now I find myself amid the luscious cornfields of Central Illinois. That’s right, until Friday I’m in Urbana-Champaign for the annual Allerton conference. This year, Todd Coleman (UIUC), Giacomo Como (MIT), and I have co-organized a session on Information Divergence and Stochastic Dynamical Systems, which promises to be quite interesting — it will feature invited talks on Bayesian inference and evolutionary dynamics, reinforcement learning, optimal experimentation, opinion dynamics in social networks, signaling in decentralized control, and optimization of observation channels in control problems. If you happen to be attending Allerton this year, come on by!

## Directed stochastic kernels and causal interventions

As I was thinking more about Massey’s paper on directed information and about the work of Touchette and Lloyd on the information-theoretic study of control systems (which we had started looking at during the last meeting of our reading group), I realized that directed stochastic kernels that feature so prominently in the general definition of directed information are known in the machine learning and AI communities under another name, due to Judea Pearlinterventional distributions.

Tagged with:

## Some notes on Massey’s “Causality, feedback and directed information”

Posted in Feedback, Information Theory, Models of Complex Stochastic Systems by mraginsky on September 12, 2010

These are some notes on the first paper for our Information Theory reading group, but they may be of use to anyone with an interest in the way information theorists deal with feedback.

Warning: lots of equations and disjointed muttering about stochastic kernels and conditional independence below the fold!

Tagged with:

## Probability-schmobability, just gimme a good model

In the past few days I have come across two short but thought-provoking papers that seem to voice some of the same concerns, yet arrive at very different conclusions.

One is by Jan Willems (among other things, the founder of the behavioral approach to systems and control):

JW, Probability in control? (to appear in European Journal of Control)

Probability is one of the success stories of applied mathematics. It is universally used, from statistical physics to quantum mechanics, from econometrics to financial math- ematics, from information theory to control, from psychology and social sciences to medicine. Unfortunately, in many applications of probability, very little attention is paid to the modeling aspect. That is, the interpretation of the probability used in the model is seldom discussed, and it is rarely explained how one comes to the numeri- cal values of the distributions of the random variables used in the model. The aim of this communication is to put forward some remarks related to the use of probability in Systems and Control.

The other is by Andrew Gelman and Christian Robert (two high priests of the Temple of Bayes):

William Feller has a Note on Bayes’ rule in his classic probability book in which he expresses doubts about the Bayesian approach to statistics and decries it as a method of the past. We analyze in this note the motivations for Feller’s attitude, without aiming at a complete historical coverage of the reasons for this dismissal.

What are we to make of this?

## Information theory reading group at Duke

Posted in Control, Information Theory, Models of Complex Stochastic Systems by mraginsky on August 31, 2010

To shake things up a bit, I have started an information theory reading group in my department at Duke. We will meet every Monday to read and discuss papers on information theory, both recent and the classics, particularly those pertaining to estimation and inference, learning, decision-making, and control.

We’ve had our first meeting yesterday, where it was decided that the theme for the fall semester of 2010 will be Control. We have also settled on the following papers (all available for download from the reading group website):

1. J. Massey, “Causality, feedback and directed information”
2. H. Touchette and S. Lloyd, “Information-theoretic approach to the study of control systems”
3. J. Walrand and P. Varaiya, “Optimal causal coding-decoding problems”
4. N. Elia, “When Bode meets Shannon: control-oriented feedback communication schemes”
5. N. Martins and M. Dahleh, “Feedback control in the presence of noisy channels: ‘Bode-like’ fundamental limitations of performance”
6. V. Anantharam and V. Borkar, “An information-theoretic view of stochastic resonance”

Most likely, as the meetings continue, I will be posting various blurbs, notes and musings on these papers and any related matters. We may also shuffle things around a bit — for example, spend some time discussing the basics of stochastic control for the benefit of those not familiar with it.

Tagged with: