The Information Structuralist

Sampling Using Diffusion Processes, from Langevin to Schrödinger

Posted in Control, Feedback, Models of Complex Stochastic Systems, Probability by mraginsky on September 2, 2021

These notes are based on the tutorial I gave at the Geometric Methods in Optimization and Sampling Boot Camp at the Simons Institute in Berkeley.

Suppose we wish to obtain samples from some probability measure {\mu} on {{\mathbb R}^d}. If {\mu} has a sufficiently well-behaved density {f} with respect to the Lebesgue measure, i.e., {\mu(dx) = f(x) dx}, then we can use the (overdamped) continuous-time Langevin dynamics, governed by the Ito stochastic differential equation (SDE)

\displaystyle  d X_t = \frac{1}{2}\nabla \log f(X_t) dt + dW_t, \qquad t \ge 0 \ \ \ \ \ (1)

where the initial condition {X_0} is generated according to some probability law {\mu_0}, and {(W_t)_{t \ge 0}} is the standard {d}-dimensional Brownian motion. Let {\mu_t} denote the probability law of {X_t}. Then, under appropriate regularity conditions on {f}, one can establish the following:

  • {\mu} is the unique invariant distribution of (1), i.e., if {\mu_0 = \mu}, then {\mu_t = \mu} for all {t}.
  • {\mu_t} converges to {\mu} in a suitable sense as {t \rightarrow \infty} — in fact, it is often possible to show that there exists a constant {c > 0} that depends only on {\mu}, such that one has the exponential convergence to equilibrium

    \displaystyle  		{\rm dist}(\mu_t, \mu) \le e^{-t/c}{\rm dist}(\mu_0, \mu)

    for some distance between probability measures on {{\mathbb R}^d}.

In this sense, the Langevin process (1) gives only approximate samples from {\mu}. I would like to discuss an alternative approach that uses diffusion processes to obtain exact samples in finite time. This approach is based on ideas that appeared in two papers from the 1930s by Erwin Schrödinger in the context of physics, and is now referred to as the Schrödinger bridge problem.

We will now assume that the target probability measure {\mu} has a density {f} with respect to the canonical Gaussian measure on {{\mathbb R}^d}, i.e.,

\displaystyle  	\mu(dx) = f(x)\gamma(dx), \qquad \gamma(dx) := \frac{1}{(2\pi)^{d/2}} e^{-|x|^2/2}dx.

We will also assume that {f} is everywhere strictly positive. Our goal is to construct an Ito process governed by the SDE

\displaystyle  	dX_t = b(X_t,t)dt + dW_t, \qquad t \in [0,1] \ \ \ \ \ (2)

starting at {X_0 = 0}, such that:

  • {X_1} has probability law {\mu};
  • {Q}, the probability law of the process {(X_t)_{t \in [0,1]}}, is optimal in the sense that, among all the diffusion processes with values in {{\mathbb R}^d} starting at the origin at {t=0} and having the marginal distribution {\mu} at {t=1}, it remains “as close as possible” to the standard Brownian motion in the sense to be made precise below.

Note that the drift in (2) is time-dependent. This problem has been studied from many angles, and the optimal drift has an explicit form. My goal here is to give what is, to the best of my knowledge, a somewhat new and rather transparent derivation of the solution to the Schrödinger bridge problem. The main idea is based on a rather neat trick that I picked up from a paper by Hiroshi Ezawa, John Klauder, and Larry Shepp (although it was already present in some form in an earlier short paper by Václav Beneš and Shepp). The presentation will be rather informal, but every step can be made rigorous.

1. The Schrödinger bridge problem

The basic problem can be stated as follows. Let {\Omega} denote the space of continuous functions (paths) from {[0,1]} into {{\mathbb R}^d}, and let {W = (W_t)_{t \in [0,1]}} denote the canonical coordinate process on {\Omega}: For any {\omega \in \Omega}, {W_t(\omega) := \omega(t)}. Let {P} denote the Wiener measure on {\Omega}, under which {W} is the standard {d}-dimensional Brownian motion on {[0,1]}. Given any probability measure {Q} on {\Omega}, we will denote by {Q_t} the marginal probability law of the value of the path at time {t}. With this, we consider the set

\displaystyle  	{\cal M}^\mu := \left\{ Q : Q_0 = \delta_0 \text{ and } Q_1 = \mu \right\},

where {\delta_0} denotes the Dirac probability measure on {{\mathbb R}^d} centered at the origin. Our goal is to find

\displaystyle  	Q^\mu = \mathop{\text{arg\,min}}_{Q \in {\cal M}^\mu} D(Q \| P),

where {D(\cdot \| \cdot)} denotes the relative entropy between two probability measures on {\Omega}. We will need to do two things: (a) show that {Q^\mu} exists and is unique, and (b) show that it is the probability law of a diffusion process governed by an Ito SDE of the form (2).

To take care of (a), all we need is the chain rule for the relative entropy. Fix an arbitrary {Q \in {\cal M}^\mu}; without loss of generality, we can assume that {Q} is absolutely continuous with respect to the Wiener measure {P}, i.e., {D(Q \| P ) < \infty}. The chain rule then gives

\displaystyle  	D(Q \| P) = D(Q_1 \| P_1) + \int_{{\mathbb R}^d} P_1(dx) D(Q^x \| P^x), \ \ \ \ \ (3)

where {Q^x} (respectively, {P^x}) denotes the conditional probability law of {X = (X_t)_{t \in [0,1]}} given {X_1 = x} under {Q} (respectively, {P}). The conditional probability measure {P^x} is well-known, and gives the probability law of the Brownian bridge process, i.e., the standard Brownian motion pinned to the point {x} at {t=1}. Now, since {Q \in {\cal M}^\mu}, we have {Q_1 = \mu}, whereas {P_1 = \gamma} for the Wiener measure {P}. Thus, we have the following for the right-hand side of (3):

\displaystyle  \begin{array}{rcl}  	D(Q_1 \| P_1) + \int_{{\mathbb R}^d} P_1(dx) D(Q^x \| P^x) &=& D(\mu \| \gamma) + \int_{{\mathbb R}^d}\gamma(dx) D(Q^x \| P^x) \\ 	&\ge& D(\mu \| \gamma), \end{array}

where equality holds if and only if {Q^x = P^x} almost everywhere. This immediately implies two things:

  • {D(\mu \| \gamma) = \inf_{Q \in {\cal M}^\mu} D(Q \| P)};
  • the above infimum is attained by the probability measure

    \displaystyle  		Q^\mu(\cdot) = \int_{{\mathbb R}^d} \mu(dx) P^x(\cdot), 	\ \ \ \ \ (4)

    i.e., by the {\mu}-mixture of the Brownian bridge processes {P^x}.

From (4), it is easy to obtain an explicit expression for the Radon–Nikodym derivative {\frac{d Q^\mu}{d P}}: Let {F} be an arbitrary bounded measurable real-valued function on {\Omega}. Then, using the fact that {d\mu = fd\gamma}, (4), and the law of iterated expectation, we can write

\displaystyle  \begin{array}{rcl}  	{\mathbb E}_{Q^\mu}[F(X)] &=& {\mathbb E}_\mu[{\mathbb E}_{Q^\mu}[F(X)|X_1]] \\ 	&=& {\mathbb E}_\mu[{\mathbb E}_P[F(X)|X_1]] \\ 	&=& {\mathbb E}_\gamma[f(X_1){\mathbb E}_P[F(X)|X_1]] \\ 	&=& {\mathbb E}_\gamma[E_P[f(X_1)F(X)|X_1]] \\ 	&=& {\mathbb E}_P[f(W_1)F(W)], \end{array}

where the first equation is due to the fact that {Q^\mu_1 = \mu}, in the second equation we have used the fact that the conditional laws {Q^\mu[\cdot|X_1 = x]} and {P[\cdot|X_1 = x]} coincide, and in the last equation {W = (W_t)_{t \in [0,1]}} denotes the standard Brownian motion process with process law {P}. That is, the Radon–Nikodym derivative of {Q^\mu} with respect to {P}, as a function of the path {W}, depends only on the terminal point {W_1} and takes the form

\displaystyle  	\frac{dQ^\mu}{dP}(W) = f(W_1), \ \ \ \ \ (5)

where, as we recall, {f} denotes the density of the target measure {\mu} with respect to the standard Gaussian measure {\gamma} on {{\mathbb R}^d}. Thus, the optimal measure {Q^\mu} is simply the law of the standard {d}-dimensional Brownian motion on {[0,1]}, “conditioned” to have law {d\mu = fd\gamma} at {t=1}.

2. The gradient ansatz and the Girsanov transformation

In principle, (5) already tells us that we can obtain samples from {\mu} by “reweighting” the Brownian paths {W} using the weight factor {f(W_1)} that depends only on {W_1}. What we are after, however, is something different: Instead of reweighting, we would like to construct a “transport” map {T : \Omega \rightarrow \Omega} such that {Q^\mu = T_*P}, i.e., the optimal measure {Q^\mu} is the pushforward of the Wiener measure by {T}:

\displaystyle  	{\mathbb E}_{Q^\mu}[F(X)] = {\mathbb E}_P[F\circ T(W)]

for any bounded measurable {F : \Omega \rightarrow {\mathbb R}}. Moreover, we want this {T} to be such that the “transported” process {X := T \circ W} is an Ito diffusion process. In order to construct such a map, we will use a fundamental result in stochastic calculus, the Girsanov theorem, which gives the necessary and sufficient condition for a probability measure {Q} on {\Omega} to be absolutely continuous with respect to the Wiener measure {P}.

In order to make use of it, we will first make the gradient ansatz, i.e., we will assume that the diffusion process we seek is governed by an Ito SDE of the form

\displaystyle  	dX_t = -\nabla v(X_t,t)dt + dW_t, \qquad X_0 = 0, t \in [0,1] \ \ \ \ \ (6)

where {\nabla} denotes the gradient with respect to the space variable. Then we will show that we can choose a suitable function {v : {\mathbb R}^d \times [0,1] \rightarrow {\mathbb R}} which is twice continuously differentiable in {x} and once continuously differentiable in {t}, such that the probablity law of the resulting process {X} will be given by {Q^\mu}. To motivate the introduction of the gradient ansatz, it is useful to compare the SDE (6) with the Langevin SDE (1) — the latter has a time-invariant drift given by the gradient of the log-density of {\mu} with respect to the Lebesgue measure, while in the former we are using a time-varying drift since our goal is to obtain a sample from {\mu} at time {t=1}, and intuition suggests that a nonstationary process is needed to make this work.

At any rate, denoting by {Q} the probability law of the process (6), we can use the Girsanov theorem to write down the Radon–Nikodym derivative of {Q} with respect to {P}:

\displaystyle  	\frac{dQ}{dP}(W) = \exp\left\{-\int^1_0 \langle \nabla v(W_t,t), dW_t \rangle + \frac{1}{2}\int^1_0 |\nabla v(W_t,t)|^2 dt\right\}. \ \ \ \ \ (7)

Comparing (7) with (5), the path forward is now clear: We need to show that {v} can be chosen in such a way that the right-hand side of (7) is equal to {f(W_1)}. This is where the Ezawa–Klauder–Shepp trick comes in. The key idea is to eliminate the Ito integral in (7) and then to reduce the problem to solving a suitable partial differential equation (PDE).

Let us define the process {(V_t)_{t \in [0,1]}} by {V_t := v(W_t,t)}. Ito’s lemma then gives

\displaystyle  	dV_t = \left(\frac{\partial}{\partial t}v(W_t,t) + \frac{1}{2}\Delta v(W_t,t)\right) dt + \langle \nabla v(W_t,t), dW_t \rangle,

where {\Delta = \nabla \cdot \nabla} is the Laplacian. Integrating and rearranging, we obtain

\displaystyle  	-\int^1_0 \langle \nabla v(W_t,t), dW_t \rangle = V_0 - V_1 + \int^1_0 \left(\frac{\partial}{\partial t}v(W_t,t) + \frac{1}{2}\Delta v(W_t,t)\right) dt.

Substituting this into (7) and using the definition of {V_t} gives

\displaystyle  	\frac{dQ}{dP}(W) = \exp\left\{v(0,0) - v(W_1,1) - \int^1_0 \left( \frac{\partial}{\partial t}v(W_t,t) + \frac{1}{2}\Delta v(W_t,t) -\frac{1}{2} |\nabla v(W_t,t)|^2\right) dt\right\}.

If we are to have any hope of reducing the right-hand side to an expression that depends only on {W_1}, we need to ensure that the integral over {t} is identically zero.

3. The PDE connection

One way of guaranteeing this is to assume that {v} solves the PDE

\displaystyle  	\frac{\partial}{\partial t}v(x,t) + \frac{1}{2}\Delta v(x,t) = \frac{1}{2} |\nabla v(x,t)|^2 \ \ \ \ \ (8)

for all {(x,t) \in {\mathbb R}^d \times [0,1]}, subject to the terminal condition {v(x,1) = - \log \psi(x)} for some strictly positive function {\psi : {\mathbb R}^d \rightarrow (0,\infty)} to be chosen later. Assuming that such a {v} exists, we will then have

\displaystyle  	\frac{dQ}{dP}(W) = e^{v(0,0)}\psi(W_1) \ \ \ \ \ (9)

which means that {\psi} should be chosen in such a way that {e^{v(0,0)}\psi(x) = f(x)} for all {x \in {\mathbb R}^d}. Now, the PDE (8) is nonlinear in {v} due to the presence of the squared norm of the gradient of {v} on the right-hand side. However, it is known from the theory of PDEs that we can convert it into a linear PDE that can be solved explicitly; this is accomplished by making the logarithmic (or Cole–Hopf) transformation {h(x,t) := e^{-v(x,t)}}. It is then a straightforward if tedious exercise in multivariate calculus to show that the function {h} solves a much nicer linear PDE

\displaystyle  	\frac{\partial}{\partial t}h(x,t) + \frac{1}{2}\Delta h(x,t) = 0 \ \ \ \ \ (10)

on {{\mathbb R}^d \times [0,1]} subject to the terminal condition {h(x,1) = \psi(x)}. The solution of (10) is given by the Feynman–Kac formula, one of the remarkable connections between the theory of (parabolic) PDEs and diffusion processes: For any {x \in {\mathbb R}^d} and {t \in [0,1]},

\displaystyle  	h(x,t) = {\mathbb E}_P[\psi(W_1)|W_t = x]. \ \ \ \ \ (11)

The proof of (11) is a simple application of Ito’s lemma: for any {t \in [0,1]},

\displaystyle  	h(W_1,1) = h(W_t,t) + \int^1_t \left( \frac{\partial}{\partial s}h(W_s,s) + \frac{1}{2}\Delta h(W_s,s)\right)ds + \int^1_t \langle \nabla h(W_s,s), dW_s \rangle.

Taking the conditional expectation given {W_t = x} and using the fact that {h} solves (10), we obtain (11).

Going back to (8), we can now write

\displaystyle  	v(x,t) = - \log h(x,t) = - \log {\mathbb E}_P[\psi(W_1)|W_t = x],

and, in particular, {v(0,0) = - \log {\mathbb E}_P[\psi(W_1)|W_0 = 0] = -\log {\mathbb E}_P[\psi(W_1)] = - \log {\mathbb E}_\gamma[\psi]}. Substituting all of this into (9) gives

\displaystyle  	\frac{dQ}{dP}(W) = \frac{h(W_1,1)}{h(0,0)} = \frac{\psi(W_1)}{{\mathbb E}_\gamma[\psi]},

which means that we should evidently choose {\psi = f} so that {Q = Q^\mu}. Moreover, using the known Gaussian form of the transition probability densities of the Brownian motion, we can give the drift in (6) explicitly as

\displaystyle  \begin{array}{rcl}  	-\nabla v(x,t) &=& \nabla \log {\mathbb E}_\gamma[f(x + \sqrt{1-t}Z)] \\ 	&=& \nabla \log \int_{{\mathbb R}^d} f(x + \sqrt{1-t}z) e^{-|z|^2/2} dz \end{array}

where {Z} is distributed according to the canonical Gaussian measure {\gamma}. This is known as the Föllmer drift (see this paper by Ronen Eldan and James Lee for more information and for further references).

4. The optimal control formulation

Those of you who are familiar with the theory of controlled diffusion processes should recognize the PDE (8) as the Hamilton–Jacobi–Bellman equation for the value function of a certain finite-horizon optimal stochastic control problem. While I do not want to get too much into details, I will sketch the basic idea. The first step is to use the identity

\displaystyle  	-\frac{1}{2}|\xi|^2 = \min_{u \in {\mathbb R}^d} \left\{ \langle \xi,u \rangle + \frac{1}{2}|u|^2\right\},

where the minimum is achieved uniquely by {u^* = - \xi}, to rewrite (8) in the equivalent form

\displaystyle  	\frac{\partial}{\partial t}v(x,t) + \frac{1}{2}\Delta v(x,t) = - \min_{u \in {\mathbb R}^d} \left\{ \langle \nabla v(x,t), u \rangle + \frac{1}{2}|u|^2\right\} \ \ \ \ \ (12)

for {(x,t) \in {\mathbb R}^d \times [0,1]} with {v(x,1) = -\log f(x)}. The stochastic control connection is as follows: We seek an adapted control process {u = (u_t)_{t \in [0,1]}} to minimize the expected cost

\displaystyle  	J(u) := {\mathbb E}^u \left[ \frac{1}{2}\int^1_0 |u_t|^2 dt - \log f(X^u_1) \right], \ \ \ \ \ (13)

where {{\mathbb E}^u[\cdot]} denotes expectation with respect to the controlled diffusion process

\displaystyle  	dX^u_t = u_t dt + dW_t, \qquad X^u_0 = 0, t \in [0,1]. \ \ \ \ \ (14)

Here, the integral {\frac{1}{2}\int^1_0 |u_t|^2 dt} is the control cost, while {-\log f(X^u_1)} is the terminal cost at {t=1}. It can then be shown using the so-called verification theorem in the theory of controlled diffusions that the solution {v(x,t)} of the PDE (12) gives the value function (or the optimal cost-to-go function)

\displaystyle  	v(x,t) := \min_{u} {\mathbb E}^u \Bigg[\frac{1}{2}\int^1_t |u_s|^2 ds - \log f(X^u_1) \Bigg|X^u_t = x\Bigg],

so that, in particular, {v(0,0) \equiv \min_u J(u)}. Here, the idea is that we add the drift {u} to the standard Brownian motion {W} to steer the process towards the target distribution {\mu} at {t=1}, while keeping the total “control effort” small. Of course, from the preceding derivations we know that {v(0,0) = - \log {\mathbb E}_\gamma[f(Z)] = 0}, and that the optimal control is of the state feedback form, given explicitly by the Föllmer drift. Moreover, one can show using the Girsanov theorem that every {Q \in {\cal M}^\mu} can be realized as a controlled process of the form (14) for some admissible drift {u}, so that {X^u_1} has law {\mu} and

\displaystyle  	D(Q \| P) = {\mathbb E}^u \left[ \frac{1}{2}\int^1_0 |u_t|^2 dt \right] = J(u) + {\mathbb E}^u[\log f(X^u_1)] = J(u) + D(\mu \| \gamma).

This implies, in turn, that the Föllmer drift that gives the optimal measure {Q^\mu} is the most efficient control, in the sense that the sum of the expected control cost and the expected terminal cost is identically zero, and therefore none of the control effort is wasted along the way. Of course, one could have started with this optimal control formulation (as I do in this paper with Belinda Tzen), but I feel that the above derivation is more “elementary” in the sense that it does not rely on any specialized knowledge beyond basic stochastic calculus.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: