## Sampling Using Diffusion Processes, from Langevin to Schrödinger

These notes are based on the tutorial I gave at the Geometric Methods in Optimization and Sampling Boot Camp at the Simons Institute in Berkeley.

Suppose we wish to obtain samples from some probability measure on . If has a sufficiently well-behaved density with respect to the Lebesgue measure, i.e., , then we can use the (overdamped) continuous-time Langevin dynamics, governed by the Ito stochastic differential equation (SDE)

where the initial condition is generated according to some probability law , and is the standard -dimensional Brownian motion. Let denote the probability law of . Then, under appropriate regularity conditions on , one can establish the following:

- is the unique invariant distribution of (1), i.e., if , then for all .
- converges to in a suitable sense as — in fact, it is often possible to show that there exists a constant that depends only on , such that one has the exponential convergence to equilibrium
for some distance between probability measures on .

In this sense, the Langevin process (1) gives only *approximate samples* from . I would like to discuss an alternative approach that uses diffusion processes to obtain *exact* samples in *finite* time. This approach is based on ideas that appeared in two papers from the 1930s by Erwin Schrödinger in the context of physics, and is now referred to as the *Schrödinger bridge problem*.

We will now assume that the target probability measure has a density with respect to the canonical Gaussian measure on , i.e.,

We will also assume that is everywhere strictly positive. Our goal is to construct an Ito process governed by the SDE

- has probability law ;
- , the probability law of the process , is optimal in the sense that, among all the diffusion processes with values in starting at the origin at and having the marginal distribution at , it remains “as close as possible” to the standard Brownian motion in the sense to be made precise below.

Note that the drift in (2) is *time-dependent*. This problem has been studied from many angles, and the optimal drift has an explicit form. My goal here is to give what is, to the best of my knowledge, a somewhat new and rather transparent derivation of the solution to the Schrödinger bridge problem. The main idea is based on a rather neat trick that I picked up from a paper by Hiroshi Ezawa, John Klauder, and Larry Shepp (although it was already present in some form in an earlier short paper by Václav Beneš and Shepp). The presentation will be rather informal, but every step can be made rigorous.

**1. The Schrödinger bridge problem **

The basic problem can be stated as follows. Let denote the space of continuous functions (paths) from into , and let denote the canonical coordinate process on : For any , . Let denote the *Wiener measure* on , under which is the standard -dimensional Brownian motion on . Given any probability measure on , we will denote by the marginal probability law of the value of the path at time . With this, we consider the set

where denotes the Dirac probability measure on centered at the origin. Our goal is to find

where denotes the relative entropy between two probability measures on . We will need to do two things: (a) show that exists and is unique, and (b) show that it is the probability law of a diffusion process governed by an Ito SDE of the form (2).

To take care of (a), all we need is the chain rule for the relative entropy. Fix an arbitrary ; without loss of generality, we can assume that is absolutely continuous with respect to the Wiener measure , i.e., . The chain rule then gives

where (respectively, ) denotes the conditional probability law of given under (respectively, ). The conditional probability measure is well-known, and gives the probability law of the *Brownian bridge process*, i.e., the standard Brownian motion pinned to the point at . Now, since , we have , whereas for the Wiener measure . Thus, we have the following for the right-hand side of (3):

where equality holds if and only if almost everywhere. This immediately implies two things:

From (4), it is easy to obtain an explicit expression for the Radon–Nikodym derivative : Let be an arbitrary bounded measurable real-valued function on . Then, using the fact that , (4), and the law of iterated expectation, we can write

where the first equation is due to the fact that , in the second equation we have used the fact that the conditional laws and coincide, and in the last equation denotes the standard Brownian motion process with process law . That is, the Radon–Nikodym derivative of with respect to , as a function of the path , depends only on the terminal point and takes the form

where, as we recall, denotes the density of the target measure with respect to the standard Gaussian measure on . Thus, the optimal measure is simply the law of the standard -dimensional Brownian motion on , “conditioned” to have law at .

**2. The gradient ansatz and the Girsanov transformation **

In principle, (5) already tells us that we can obtain samples from by “reweighting” the Brownian paths using the weight factor that depends only on . What we are after, however, is something different: Instead of reweighting, we would like to construct a “transport” map such that , i.e., the optimal measure is the pushforward of the Wiener measure by :

for any bounded measurable . Moreover, we want this to be such that the “transported” process is an Ito diffusion process. In order to construct such a map, we will use a fundamental result in stochastic calculus, the *Girsanov theorem*, which gives the necessary and sufficient condition for a probability measure on to be absolutely continuous with respect to the Wiener measure .

In order to make use of it, we will first make the gradient ansatz, i.e., we will assume that the diffusion process we seek is governed by an Ito SDE of the form

where denotes the gradient with respect to the space variable. Then we will show that we can choose a suitable function which is twice continuously differentiable in and once continuously differentiable in , such that the probablity law of the resulting process will be given by . To motivate the introduction of the gradient ansatz, it is useful to compare the SDE (6) with the Langevin SDE (1) — the latter has a time-invariant drift given by the gradient of the log-density of with respect to the Lebesgue measure, while in the former we are using a time-varying drift since our goal is to obtain a sample from at time , and intuition suggests that a nonstationary process is needed to make this work.

At any rate, denoting by the probability law of the process (6), we can use the Girsanov theorem to write down the Radon–Nikodym derivative of with respect to :

Comparing (7) with (5), the path forward is now clear: We need to show that can be chosen in such a way that the right-hand side of (7) is equal to . This is where the Ezawa–Klauder–Shepp trick comes in. The key idea is to eliminate the Ito integral in (7) and then to reduce the problem to solving a suitable partial differential equation (PDE).

Let us define the process by . Ito’s lemma then gives

where is the Laplacian. Integrating and rearranging, we obtain

Substituting this into (7) and using the definition of gives

If we are to have any hope of reducing the right-hand side to an expression that depends only on , we need to ensure that the integral over is identically zero.

**3. The PDE connection **

One way of guaranteeing this is to assume that solves the PDE

for all , subject to the terminal condition for some strictly positive function to be chosen later. Assuming that such a exists, we will then have

which means that should be chosen in such a way that for all . Now, the PDE (8) is *nonlinear* in due to the presence of the squared norm of the gradient of on the right-hand side. However, it is known from the theory of PDEs that we can convert it into a *linear* PDE that can be solved explicitly; this is accomplished by making the *logarithmic* (or *Cole–Hopf*) *transformation* . It is then a straightforward if tedious exercise in multivariate calculus to show that the function solves a much nicer linear PDE

on subject to the terminal condition . The solution of (10) is given by the *Feynman–Kac formula*, one of the remarkable connections between the theory of (parabolic) PDEs and diffusion processes: For any and ,

The proof of (11) is a simple application of Ito’s lemma: for any ,

Taking the conditional expectation given and using the fact that solves (10), we obtain (11).

Going back to (8), we can now write

and, in particular, . Substituting all of this into (9) gives

which means that we should evidently choose so that . Moreover, using the known Gaussian form of the transition probability densities of the Brownian motion, we can give the drift in (6) explicitly as

where is distributed according to the canonical Gaussian measure . This is known as the *Föllmer drift* (see this paper by Ronen Eldan and James Lee for more information and for further references).

**4. The optimal control formulation **

Those of you who are familiar with the theory of controlled diffusion processes should recognize the PDE (8) as the *Hamilton–Jacobi–Bellman equation* for the value function of a certain finite-horizon optimal stochastic control problem. While I do not want to get too much into details, I will sketch the basic idea. The first step is to use the identity

where the minimum is achieved uniquely by , to rewrite (8) in the equivalent form

for with . The stochastic control connection is as follows: We seek an adapted control process to minimize the expected cost

where denotes expectation with respect to the controlled diffusion process

Here, the integral is the *control cost*, while is the *terminal cost* at . It can then be shown using the so-called *verification theorem* in the theory of controlled diffusions that the solution of the PDE (12) gives the *value function* (or the optimal cost-to-go function)

so that, in particular, . Here, the idea is that we add the drift to the standard Brownian motion to steer the process towards the target distribution at , while keeping the total “control effort” small. Of course, from the preceding derivations we know that , and that the optimal control is of the state feedback form, given explicitly by the Föllmer drift. Moreover, one can show using the Girsanov theorem that every can be realized as a controlled process of the form (14) for some admissible drift , so that has law and

This implies, in turn, that the Föllmer drift that gives the optimal measure is the most efficient control, in the sense that the sum of the expected control cost and the expected terminal cost is identically zero, and therefore none of the control effort is wasted along the way. Of course, one could have started with this optimal control formulation (as I do in this paper with Belinda Tzen), but I feel that the above derivation is more “elementary” in the sense that it does not rely on any specialized knowledge beyond basic stochastic calculus.

leave a comment