These notes are based on the tutorial I gave at the Geometric Methods in Optimization and Sampling Boot Camp at the Simons Institute in Berkeley.
Suppose we wish to obtain samples from some probability measure on . If has a sufficiently well-behaved density with respect to the Lebesgue measure, i.e., , then we can use the (overdamped) continuous-time Langevin dynamics, governed by the Ito stochastic differential equation (SDE)
where the initial condition is generated according to some probability law , and is the standard -dimensional Brownian motion. Let denote the probability law of . Then, under appropriate regularity conditions on , one can establish the following:
- is the unique invariant distribution of (1), i.e., if , then for all .
- converges to in a suitable sense as — in fact, it is often possible to show that there exists a constant that depends only on , such that one has the exponential convergence to equilibrium
for some distance between probability measures on .
In this sense, the Langevin process (1) gives only approximate samples from . I would like to discuss an alternative approach that uses diffusion processes to obtain exact samples in finite time. This approach is based on ideas that appeared in two papers from the 1930s by Erwin Schrödinger in the context of physics, and is now referred to as the Schrödinger bridge problem.
We will now assume that the target probability measure has a density with respect to the canonical Gaussian measure on , i.e.,
- has probability law ;
- , the probability law of the process , is optimal in the sense that, among all the diffusion processes with values in starting at the origin at and having the marginal distribution at , it remains “as close as possible” to the standard Brownian motion in the sense to be made precise below.
Note that the drift in (2) is time-dependent. This problem has been studied from many angles, and the optimal drift has an explicit form. My goal here is to give what is, to the best of my knowledge, a somewhat new and rather transparent derivation of the solution to the Schrödinger bridge problem. The main idea is based on a rather neat trick that I picked up from a paper by Hiroshi Ezawa, John Klauder, and Larry Shepp (although it was already present in some form in an earlier short paper by Václav Beneš and Shepp). The presentation will be rather informal, but every step can be made rigorous.
1. The Schrödinger bridge problem
The basic problem can be stated as follows. Let denote the space of continuous functions (paths) from into , and let denote the canonical coordinate process on : For any , . Let denote the Wiener measure on , under which is the standard -dimensional Brownian motion on . Given any probability measure on , we will denote by the marginal probability law of the value of the path at time . With this, we consider the set
where denotes the Dirac probability measure on centered at the origin. Our goal is to find
where denotes the relative entropy between two probability measures on . We will need to do two things: (a) show that exists and is unique, and (b) show that it is the probability law of a diffusion process governed by an Ito SDE of the form (2).
To take care of (a), all we need is the chain rule for the relative entropy. Fix an arbitrary ; without loss of generality, we can assume that is absolutely continuous with respect to the Wiener measure , i.e., . The chain rule then gives
where (respectively, ) denotes the conditional probability law of given under (respectively, ). The conditional probability measure is well-known, and gives the probability law of the Brownian bridge process, i.e., the standard Brownian motion pinned to the point at . Now, since , we have , whereas for the Wiener measure . Thus, we have the following for the right-hand side of (3):
where equality holds if and only if almost everywhere. This immediately implies two things:
- the above infimum is attained by the probability measure
i.e., by the -mixture of the Brownian bridge processes .
From (4), it is easy to obtain an explicit expression for the Radon–Nikodym derivative : Let be an arbitrary bounded measurable real-valued function on . Then, using the fact that , (4), and the law of iterated expectation, we can write
where the first equation is due to the fact that , in the second equation we have used the fact that the conditional laws and coincide, and in the last equation denotes the standard Brownian motion process with process law . That is, the Radon–Nikodym derivative of with respect to , as a function of the path , depends only on the terminal point and takes the form
where, as we recall, denotes the density of the target measure with respect to the standard Gaussian measure on . Thus, the optimal measure is simply the law of the standard -dimensional Brownian motion on , “conditioned” to have law at .
2. The gradient ansatz and the Girsanov transformation
In principle, (5) already tells us that we can obtain samples from by “reweighting” the Brownian paths using the weight factor that depends only on . What we are after, however, is something different: Instead of reweighting, we would like to construct a “transport” map such that , i.e., the optimal measure is the pushforward of the Wiener measure by :
for any bounded measurable . Moreover, we want this to be such that the “transported” process is an Ito diffusion process. In order to construct such a map, we will use a fundamental result in stochastic calculus, the Girsanov theorem, which gives the necessary and sufficient condition for a probability measure on to be absolutely continuous with respect to the Wiener measure .
where denotes the gradient with respect to the space variable. Then we will show that we can choose a suitable function which is twice continuously differentiable in and once continuously differentiable in , such that the probablity law of the resulting process will be given by . To motivate the introduction of the gradient ansatz, it is useful to compare the SDE (6) with the Langevin SDE (1) — the latter has a time-invariant drift given by the gradient of the log-density of with respect to the Lebesgue measure, while in the former we are using a time-varying drift since our goal is to obtain a sample from at time , and intuition suggests that a nonstationary process is needed to make this work.
At any rate, denoting by the probability law of the process (6), we can use the Girsanov theorem to write down the Radon–Nikodym derivative of with respect to :
Comparing (7) with (5), the path forward is now clear: We need to show that can be chosen in such a way that the right-hand side of (7) is equal to . This is where the Ezawa–Klauder–Shepp trick comes in. The key idea is to eliminate the Ito integral in (7) and then to reduce the problem to solving a suitable partial differential equation (PDE).
Let us define the process by . Ito’s lemma then gives
where is the Laplacian. Integrating and rearranging, we obtain
Substituting this into (7) and using the definition of gives
If we are to have any hope of reducing the right-hand side to an expression that depends only on , we need to ensure that the integral over is identically zero.
3. The PDE connection
which means that should be chosen in such a way that for all . Now, the PDE (8) is nonlinear in due to the presence of the squared norm of the gradient of on the right-hand side. However, it is known from the theory of PDEs that we can convert it into a linear PDE that can be solved explicitly; this is accomplished by making the logarithmic (or Cole–Hopf) transformation . It is then a straightforward if tedious exercise in multivariate calculus to show that the function solves a much nicer linear PDE
on subject to the terminal condition . The solution of (10) is given by the Feynman–Kac formula, one of the remarkable connections between the theory of (parabolic) PDEs and diffusion processes: For any and ,
The proof of (11) is a simple application of Ito’s lemma: for any ,
Going back to (8), we can now write
and, in particular, . Substituting all of this into (9) gives
which means that we should evidently choose so that . Moreover, using the known Gaussian form of the transition probability densities of the Brownian motion, we can give the drift in (6) explicitly as
where is distributed according to the canonical Gaussian measure . This is known as the Föllmer drift (see this paper by Ronen Eldan and James Lee for more information and for further references).
4. The optimal control formulation
Those of you who are familiar with the theory of controlled diffusion processes should recognize the PDE (8) as the Hamilton–Jacobi–Bellman equation for the value function of a certain finite-horizon optimal stochastic control problem. While I do not want to get too much into details, I will sketch the basic idea. The first step is to use the identity
where the minimum is achieved uniquely by , to rewrite (8) in the equivalent form
Here, the integral is the control cost, while is the terminal cost at . It can then be shown using the so-called verification theorem in the theory of controlled diffusions that the solution of the PDE (12) gives the value function (or the optimal cost-to-go function)
so that, in particular, . Here, the idea is that we add the drift to the standard Brownian motion to steer the process towards the target distribution at , while keeping the total “control effort” small. Of course, from the preceding derivations we know that , and that the optimal control is of the state feedback form, given explicitly by the Föllmer drift. Moreover, one can show using the Girsanov theorem that every can be realized as a controlled process of the form (14) for some admissible drift , so that has law and
This implies, in turn, that the Föllmer drift that gives the optimal measure is the most efficient control, in the sense that the sum of the expected control cost and the expected terminal cost is identically zero, and therefore none of the control effort is wasted along the way. Of course, one could have started with this optimal control formulation (as I do in this paper with Belinda Tzen), but I feel that the above derivation is more “elementary” in the sense that it does not rely on any specialized knowledge beyond basic stochastic calculus.