# The Information Structuralist

## Sampling Using Diffusion Processes, from Langevin to Schrödinger

Posted in Control, Feedback, Models of Complex Stochastic Systems, Probability by mraginsky on September 2, 2021

These notes are based on the tutorial I gave at the Geometric Methods in Optimization and Sampling Boot Camp at the Simons Institute in Berkeley.

Suppose we wish to obtain samples from some probability measure ${\mu}$ on ${{\mathbb R}^d}$. If ${\mu}$ has a sufficiently well-behaved density ${f}$ with respect to the Lebesgue measure, i.e., ${\mu(dx) = f(x) dx}$, then we can use the (overdamped) continuous-time Langevin dynamics, governed by the Ito stochastic differential equation (SDE) $\displaystyle d X_t = \frac{1}{2}\nabla \log f(X_t) dt + dW_t, \qquad t \ge 0 \ \ \ \ \ (1)$

where the initial condition ${X_0}$ is generated according to some probability law ${\mu_0}$, and ${(W_t)_{t \ge 0}}$ is the standard ${d}$-dimensional Brownian motion. Let ${\mu_t}$ denote the probability law of ${X_t}$. Then, under appropriate regularity conditions on ${f}$, one can establish the following:

• ${\mu}$ is the unique invariant distribution of (1), i.e., if ${\mu_0 = \mu}$, then ${\mu_t = \mu}$ for all ${t}$.
• ${\mu_t}$ converges to ${\mu}$ in a suitable sense as ${t \rightarrow \infty}$ — in fact, it is often possible to show that there exists a constant ${c > 0}$ that depends only on ${\mu}$, such that one has the exponential convergence to equilibrium $\displaystyle {\rm dist}(\mu_t, \mu) \le e^{-t/c}{\rm dist}(\mu_0, \mu)$

for some distance between probability measures on ${{\mathbb R}^d}$.
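To make the dynamics (1) concrete, here is a minimal sketch of its simplest time discretization, the Euler-Maruyama scheme (often called the unadjusted Langevin algorithm). The target here is a hypothetical standard Gaussian, so that ${\nabla \log f(x) = -x}$; the step size, iteration count, and burn-in length are arbitrary illustrative choices, not tuned recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_f(x):
    # Score function of a standard Gaussian target: grad log f(x) = -x.
    # (Illustrative choice; in practice f is the density you want to sample.)
    return -x

def langevin_path(x0, step, n_steps):
    """Euler-Maruyama discretization of dX = (1/2) grad log f(X) dt + dW."""
    x = np.array(x0, dtype=float)
    path = np.empty((n_steps, x.size))
    for k in range(n_steps):
        noise = rng.standard_normal(x.size)
        x = x + 0.5 * step * grad_log_f(x) + np.sqrt(step) * noise
        path[k] = x
    return path

path = langevin_path([5.0], step=0.05, n_steps=50000)
tail = path[10000:, 0]          # discard a burn-in prefix
print(tail.mean(), tail.var())  # roughly 0 and 1 for the Gaussian target
```

Note that the discretization introduces its own bias on top of the finite-time error: even as the number of steps grows, the chain equilibrates to a distribution that is only ${O(\text{step})}$-close to ${\mu}$, which is again consistent with the point that Langevin-type schemes give approximate samples.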

In this sense, the Langevin process (1) gives only approximate samples from ${\mu}$. I would like to discuss an alternative approach that uses diffusion processes to obtain exact samples in finite time. This approach is based on ideas that appeared in two papers from the 1930s by Erwin Schrödinger in the context of physics, and is now referred to as the Schrödinger bridge problem.

## Stochastic kernels vs. conditional probability distributions

Posted in Control, Feedback, Information Theory, Probability by mraginsky on March 17, 2013

Larry Wasserman's recent post about misinterpretation of p-values is a good reminder about a fundamental distinction anyone working in information theory, control or machine learning should be aware of — namely, the distinction between stochastic kernels and conditional probability distributions.

Just a couple of short items, while I catch my breath.

1. First of all, starting January 1, 2012 I will find myself amidst the lovely cornfields of Central Illinois, where I will be an assistant professor in the Department of Electrical and Computer Engineering at UIUC. This will be a homecoming of sorts, since I spent three years there as a Beckman Fellow. My new home will be in the Coordinated Science Laboratory, where I will continue doing (and blogging about) the same things I do (and blog about).

2. Speaking of Central Illinois, last week I was at the Allerton Conference, where I tried my best to preach Uncle Judea's gospel to anyone willing to listen: information theorists and their fellow travelers. The paper, entitled “Directed information and Pearl’s causal calculus,” is now up on arxiv, and here is the abstract:

Probabilistic graphical models are a fundamental tool in statistics, machine learning, signal processing, and control. When such a model is defined on a directed acyclic graph (DAG), one can assign a partial ordering to the events occurring in the corresponding stochastic system. Based on the work of Judea Pearl and others, these DAG-based “causal factorizations” of joint probability measures have been used for characterization and inference of functional dependencies (causal links). This mostly expository paper focuses on several connections between Pearl’s formalism (and in particular his notion of “intervention”) and information-theoretic notions of causality and feedback (such as causal conditioning, directed stochastic kernels, and directed information). As an application, we show how conditional directed information can be used to develop an information-theoretic version of Pearl’s “back-door” criterion for identifiability of causal effects from passive observations. This suggests that the back-door criterion can be thought of as a causal analog of statistical sufficiency.

If you have seen my posts on stochastic kernels, directed information, and causal interventions, you will, more or less, know what to expect.

Incidentally, due to my forthcoming move to UIUC, this will be my last Allerton paper!

## Missing all the action

Posted in Control, Feedback, Games and Decisions, Information Theory, Optimization by mraginsky on July 25, 2011

Update: I fixed a couple of broken links.

I want to write down some thoughts inspired by Chernoff’s memo on backward induction that may be relevant to feedback information theory and networked control. Some of these points were brought up in discussions with Serdar Yüksel two years ago.

## The lost art of writing

Posted in Academic Snark, Control, Games and Decisions by mraginsky on July 19, 2011

From the opening paragraph of Herman Chernoff's unpublished 1963 memo “Backward induction in dynamic programming” (thanks to Armand Makowski for a scanned copy):

The solution of finite sequence dynamic programming problems involve a backward induction argument, the foundations of which are generally understood hazily. The purpose of this memo is to add some clarification which may be slightly redundant and whose urgency may be something less than vital.

Alas, nobody writes like that anymore.

## ECE 299: regression with quadratic loss; stochastic simulation via Rademacher bootstrap

Posted in Control, Corrupting the Young, Statistical Learning and Inference by mraginsky on April 20, 2011

I gave the last lecture earlier today, wrapping up the semester. Here are the notes from the last two weeks:

Monday’s lecture was on stochastic gradient descent as an alternative to batch empirical risk minimization. I will post the notes soon.
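Since the lecture contrasted the two approaches, here is a minimal sketch of both for regression with quadratic loss: batch empirical risk minimization (which for quadratic loss reduces to ordinary least squares) against plain stochastic gradient descent with one random sample per step. The synthetic data, step size, and iteration count are all illustrative choices, not anything from the lecture notes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear-regression data (illustrative): y = <w_true, x> + noise.
n, d = 2000, 5
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Batch empirical risk minimization: quadratic loss means least squares,
# solved in one shot over the whole sample.
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stochastic gradient descent: one randomly drawn sample per step.
w_sgd = np.zeros(d)
step = 0.01
for _ in range(40000):
    i = rng.integers(n)
    residual = X[i] @ w_sgd - y[i]
    w_sgd -= step * residual * X[i]  # gradient of (1/2)(<w, x_i> - y_i)^2

print(np.linalg.norm(w_sgd - w_erm))  # the two minimizers nearly coincide
```

The trade-off is the usual one: the batch solve touches all ${n}$ samples at once, while SGD pays ${O(d)}$ per step and, with a constant step size, hovers in a small neighborhood of the empirical risk minimizer rather than converging to it exactly.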


Posted in Control, Information Theory, Papers and Preprints by mraginsky on December 26, 2010

In lieu of serious posting, which will resume in the new year, a few links:

## Value of information, Bayes risks, and rate-distortion theory

Posted in Control, Games and Decisions, Information Theory by mraginsky on December 1, 2010

In the previous post we saw that access to additional information is not always helpful in decision-making. On the other hand, extra information can never hurt, assuming one is precise about the quantitative meaning of “extra information.” In this post, I will show how Shannon’s information theory can be used to speak meaningfully about the value of information for decision-making. This particular approach was developed in the 1960s and 1970s by Ruslan Stratonovich (of Stratonovich integral fame, among other things) and described in his book on information theory, which was published in Russian in 1975. As far as I know, it was never translated into English, which is a shame, since Stratonovich was an extremely original thinker, and the book contains a deep treatment of the three fundamental problems of information theory (lossless source coding, noisy channel coding, lossy source coding) from the viewpoint of statistical physics.

## Deadly ninja weapons: Blackwell’s principle of irrelevant information

Having more information when making decisions should always help, it seems. However, there are situations in which this is not the case. Suppose that you observe two pieces of information, ${x}$ and ${y}$, which you can use to choose an action ${u}$. Suppose also that, upon choosing ${u}$, you incur a cost ${c(x,u)}$. For simplicity let us assume that ${x}$, ${y}$, and ${u}$ take values in finite sets ${{\mathsf X}}$, ${{\mathsf Y}}$, and ${{\mathsf U}}$, respectively. Then it is obvious that, no matter which “strategy” for choosing ${u}$ you follow, you cannot do better than ${u^*(x) = \displaystyle{\rm arg\,min}_{u \in {\mathsf U}} c(x,u)}$. More formally, for any strategy ${\gamma : {\mathsf X} \times {\mathsf Y} \rightarrow {\mathsf U}}$ we have $\displaystyle c(x,u^*(x)) = \min_{u \in {\mathsf U}} c(x,u) \le c(x,\gamma(x,y)).$

Thus, the extra information ${y}$ is irrelevant. Why? Because the cost you incur does not depend on ${y}$ directly, though it may do so through ${u}$.
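In the finite-set setting this claim can be checked by brute force: enumerate every strategy ${\gamma: {\mathsf X} \times {\mathsf Y} \rightarrow {\mathsf U}}$ and verify the pointwise inequality above. A toy sketch, with hypothetical two- and three-element sets and a randomly generated cost table:

```python
import itertools
import random

# Toy finite sets (illustrative choices).
Xs = ['x0', 'x1']
Ys = ['y0', 'y1', 'y2']
Us = ['u0', 'u1']

random.seed(0)
# Cost depends only on (x, u), not on y.
c = {(x, u): random.random() for x in Xs for u in Us}

# The y-blind strategy u*(x) = argmin_u c(x, u).
u_star = {x: min(Us, key=lambda u, x=x: c[(x, u)]) for x in Xs}

# Enumerate every strategy gamma : X x Y -> U and check that
# c(x, u*(x)) <= c(x, gamma(x, y)) pointwise.
pairs = [(x, y) for x in Xs for y in Ys]
for choices in itertools.product(Us, repeat=len(pairs)):
    gamma = dict(zip(pairs, choices))
    for (x, y) in pairs:
        assert c[(x, u_star[x])] <= c[(x, gamma[(x, y)])]

print("y is irrelevant: no strategy using y beats u*(x)")
```

Of course, this check succeeds precisely because the sets are finite, so the pointwise minimum is attained and the minimizing map is trivially well-defined; the measure-theoretic subtleties discussed next are exactly about what breaks when this is no longer the case.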

Interestingly, as David Blackwell showed in a three-page paper in 1964, this seemingly innocuous argument does not go through when ${{\mathsf X}}$, ${{\mathsf Y}}$, and ${{\mathsf U}}$ are Borel subsets of Euclidean spaces, the cost function ${c}$ is bounded and Borel-measurable, and the strategies ${\gamma}$ are required to be measurable as well. However, if ${x}$ and ${y}$ are random variables with a known joint distribution ${P}$, then ${y}$ is indeed irrelevant for the purpose of minimizing expected cost.

Warning: lots of measure-theoretic noodling below the fold; if that is not your cup of tea, you can just assume that all sets are finite and go with the poor man’s version stated in the first paragraph. Then all the results below will hold.

## Bell System Technical Journal: now online

The Bell System Technical Journal is now online. Mmmm, seminal articles… Shannon, Wyner, Slepian, Witsenhausen — they’re all here! (h/t Anand Sarwate)