These notes are based on the tutorial I gave at the Geometric Methods in Optimization and Sampling Boot Camp at the Simons Institute in Berkeley.
Suppose we wish to obtain samples from some probability measure $\mu$ on $\mathbb{R}^d$. If $\mu$ has a sufficiently well-behaved density $p$ with respect to the Lebesgue measure, i.e., $\mu(\mathrm{d}x) = p(x)\,\mathrm{d}x$, then we can use the (overdamped) continuous-time Langevin dynamics, governed by the Itô stochastic differential equation (SDE)

$$\mathrm{d}X_t = \frac{1}{2}\nabla \log p(X_t)\,\mathrm{d}t + \mathrm{d}W_t, \qquad t \ge 0, \tag{1}$$

where the initial condition $X_0$ is generated according to some probability law $\mu_0$, and $(W_t)_{t \ge 0}$ is the standard $d$-dimensional Brownian motion. Let $\mu_t$ denote the probability law of $X_t$. Then, under appropriate regularity conditions on $p$, one can establish the following:

- $\mu$ is the unique invariant distribution of (1), i.e., if $\mu_0 = \mu$, then $\mu_t = \mu$ for all $t \ge 0$.
- $\mu_t$ converges to $\mu$ in a suitable sense as $t \to \infty$ — in fact, it is often possible to show that there exists a constant $c > 0$ that depends only on $p$, such that one has the exponential convergence to equilibrium $d(\mu_t, \mu) \le e^{-ct}\, d(\mu_0, \mu)$ for some distance $d(\cdot,\cdot)$ between probability measures on $\mathbb{R}^d$.
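To make this concrete, here is a minimal simulation sketch of (1) — my own illustration, not something from the tutorial. It uses an Euler–Maruyama discretization for a hypothetical one-dimensional Gaussian-mixture target; the mixture parameters, step size, and function names are all choices made just for this example.

```python
# Euler-Maruyama discretization of the Langevin SDE (1) -- an illustrative sketch.
# The 1-D target p = 0.5*N(-2, 0.25) + 0.5*N(2, 0.25) is a hypothetical example.
import numpy as np

rng = np.random.default_rng(0)

def grad_log_p(x):
    """Score (gradient of log p) of the illustrative Gaussian-mixture target."""
    w1 = np.exp(-0.5 * ((x + 2.0) / 0.5) ** 2)   # unnormalized component weights
    w2 = np.exp(-0.5 * ((x - 2.0) / 0.5) ** 2)
    g1 = -(x + 2.0) / 0.25                       # per-component scores
    g2 = -(x - 2.0) / 0.25
    return (w1 * g1 + w2 * g2) / (w1 + w2)

def langevin(n_steps=20_000, dt=1e-3, n_chains=1_000):
    x = rng.standard_normal(n_chains)            # X_0 ~ mu_0 = N(0, 1)
    for _ in range(n_steps):
        # dX_t = (1/2) grad log p(X_t) dt + dW_t
        x += 0.5 * grad_log_p(x) * dt + np.sqrt(dt) * rng.standard_normal(n_chains)
    return x

samples = langevin()
print(samples.mean(), samples.std())  # ~0 and ~2.06 for this mixture, up to errors
```

In practice the discretization adds its own bias on top of the finite-time error of the continuous dynamics.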
In this sense, the Langevin process (1) gives only approximate samples from $\mu$. I would like to discuss an alternative approach that uses diffusion processes to obtain exact samples in finite time. This approach is based on ideas that appeared in two papers from the 1930s by Erwin Schrödinger in the context of physics, and is now referred to as the Schrödinger bridge problem.
We will now assume that the target probability measure $\mu$ has a density $f$ with respect to the canonical Gaussian measure $\gamma$ on $\mathbb{R}^d$, i.e.,

$$\mu(\mathrm{d}x) = f(x)\,\gamma(\mathrm{d}x), \qquad \gamma(\mathrm{d}x) = (2\pi)^{-d/2} e^{-|x|^2/2}\,\mathrm{d}x.$$

We will also assume that $f$ is everywhere strictly positive. Our goal is to construct an Itô process governed by the SDE

$$\mathrm{d}X_t = b_t(X_t)\,\mathrm{d}t + \mathrm{d}W_t, \qquad X_0 = 0, \quad 0 \le t \le 1, \tag{2}$$

such that:

- $X_1$ has probability law $\mu$;
- $Q$, the probability law of the process $(X_t)_{0 \le t \le 1}$, is optimal in the sense that, among all the diffusion processes with values in $\mathbb{R}^d$ starting at the origin at $t = 0$ and having the marginal distribution $\mu$ at $t = 1$, it remains “as close as possible” to the standard Brownian motion in the sense to be made precise below.
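For use in the numerical sketches throughout these notes, here is one concrete and entirely illustrative choice of such a target in one dimension: the Gaussian mixture from the Langevin sketch above, now expressed through its density $f$ with respect to $\gamma$. The function names and parameters are assumptions for the example; note that this $f$ is indeed everywhere strictly positive.

```python
# An illustrative 1-D target for the later sketches: f = p / gamma w.r.t. the
# standard Gaussian, for the hypothetical mixture p = 0.5*N(-2,0.25) + 0.5*N(2,0.25).
import numpy as np

def p(x):
    """Lebesgue density of the example target mixture."""
    c = 1.0 / (0.5 * np.sqrt(2.0 * np.pi))
    return 0.5 * c * (np.exp(-0.5 * ((x + 2.0) / 0.5) ** 2)
                      + np.exp(-0.5 * ((x - 2.0) / 0.5) ** 2))

def gamma_pdf(x):
    """Density of the canonical Gaussian measure gamma on R."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def f(x):
    """Density of mu w.r.t. gamma: mu(dx) = f(x) gamma(dx); strictly positive."""
    return p(x) / gamma_pdf(x)
```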
Note that the drift in (2) is time-dependent. This problem has been studied from many angles, and the optimal drift has an explicit form. My goal here is to give what is, to the best of my knowledge, a somewhat new and rather transparent derivation of the solution to the Schrödinger bridge problem. The main idea is based on a rather neat trick that I picked up from a paper by Hiroshi Ezawa, John Klauder, and Larry Shepp (although it was already present in some form in an earlier short paper by Václav Beneš and Shepp). The presentation will be rather informal, but every step can be made rigorous.
1. The Schrödinger bridge problem
The basic problem can be stated as follows. Let $\Omega := C([0,1]; \mathbb{R}^d)$ denote the space of continuous functions (paths) from $[0,1]$ into $\mathbb{R}^d$, and let $(X_t)_{0 \le t \le 1}$ denote the canonical coordinate process on $\Omega$: For any $\omega \in \Omega$ and $t \in [0,1]$, $X_t(\omega) := \omega(t)$. Let $\mathsf{W}$ denote the Wiener measure on $\Omega$, under which $(X_t)_{0 \le t \le 1}$ is the standard $d$-dimensional Brownian motion on $[0,1]$. Given any probability measure $P$ on $\Omega$, we will denote by $P_t$ the marginal probability law of the value of the path at time $t$. With this, we consider the set

$$\mathcal{P}(\mu) := \{ P :\ P_0 = \delta_0,\ P_1 = \mu \},$$

where $\delta_0$ denotes the Dirac probability measure on $\mathbb{R}^d$ centered at the origin. Our goal is to find

$$Q := \operatorname*{arg\,min}_{P \in \mathcal{P}(\mu)} D(P \,\|\, \mathsf{W}),$$

where $D(\cdot \| \cdot)$ denotes the relative entropy between two probability measures on $\Omega$. We will need to do two things: (a) show that $Q$ exists and is unique, and (b) show that it is the probability law of a diffusion process governed by an Itô SDE of the form (2).
To take care of (a), all we need is the chain rule for the relative entropy. Fix an arbitrary $P \in \mathcal{P}(\mu)$; without loss of generality, we can assume that $P$ is absolutely continuous with respect to the Wiener measure $\mathsf{W}$, i.e., $P \ll \mathsf{W}$. The chain rule then gives

$$D(P \,\|\, \mathsf{W}) = D(P_1 \,\|\, \mathsf{W}_1) + \int_{\mathbb{R}^d} D(P^x \,\|\, \mathsf{W}^x)\, P_1(\mathrm{d}x), \tag{3}$$

where $P^x$ (respectively, $\mathsf{W}^x$) denotes the conditional probability law of $(X_t)_{0 \le t \le 1}$ given $X_1 = x$ under $P$ (respectively, $\mathsf{W}$). The conditional probability measure $\mathsf{W}^x$ is well-known, and gives the probability law of the Brownian bridge process, i.e., the standard Brownian motion pinned to the point $x$ at $t = 1$. Now, since $P \in \mathcal{P}(\mu)$, we have $P_1 = \mu$, whereas $\mathsf{W}_1 = \gamma$ for the Wiener measure $\mathsf{W}$. Thus, we have the following for the right-hand side of (3):

$$D(P \,\|\, \mathsf{W}) = D(\mu \,\|\, \gamma) + \int_{\mathbb{R}^d} D(P^x \,\|\, \mathsf{W}^x)\,\mu(\mathrm{d}x) \ge D(\mu \,\|\, \gamma),$$
where equality holds if and only if $P^x = \mathsf{W}^x$ $\mu$-almost everywhere. This immediately implies two things:

- $\min_{P \in \mathcal{P}(\mu)} D(P \,\|\, \mathsf{W}) = D(\mu \,\|\, \gamma)$;
- the above infimum is attained by the probability measure

$$Q(\cdot) = \int_{\mathbb{R}^d} \mathsf{W}^x(\cdot)\,\mu(\mathrm{d}x), \tag{4}$$

i.e., by the $\mu$-mixture of the Brownian bridge processes $\mathsf{W}^x$ (see the sketch right after this list).
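Formula (4) is already an algorithm, at least when exact samples from $\mu$ are available by some other means. Here is a minimal sketch, my illustration for the example mixture target above: draw the endpoint $x \sim \mu$, then run a Brownian bridge from the origin to $x$; the resulting path is a draw from $Q$.

```python
# Sampling from Q as the mu-mixture of Brownian bridges, per (4) -- illustrative sketch.
import numpy as np

rng = np.random.default_rng(1)

def sample_mu(n):
    """Exact sampler for the example mixture target (possible here by construction)."""
    means = rng.choice([-2.0, 2.0], size=n)
    return means + 0.5 * rng.standard_normal(n)

def brownian_bridge(x_end, n_steps=1_000):
    """Brownian motion on [0,1] started at 0 and pinned to x_end at t = 1.
    Sequential rule: given X_t = y, X_{t+dt} is Gaussian with mean
    y + dt*(x_end - y)/(1-t) and variance dt*(1 - t - dt)/(1 - t)."""
    dt = 1.0 / n_steps
    path = np.zeros(n_steps + 1)
    for k in range(n_steps):
        t = k * dt
        mean = path[k] + dt * (x_end - path[k]) / (1.0 - t)
        var = dt * (1.0 - t - dt) / (1.0 - t)
        path[k + 1] = mean + np.sqrt(max(var, 0.0)) * rng.standard_normal()
    return path

paths = [brownian_bridge(x) for x in sample_mu(5)]   # five path draws from Q
```

Of course, this begs the question, since it assumes we can already sample from $\mu$; that is exactly why the rest of the derivation looks for a drift that transports the Brownian motion to $\mu$ instead.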
From (4), it is easy to obtain an explicit expression for the Radon–Nikodym derivative $\frac{\mathrm{d}Q}{\mathrm{d}\mathsf{W}}$: Let $F$ be an arbitrary bounded measurable real-valued function on $\Omega$. Then, using the fact that $\mu(\mathrm{d}x) = f(x)\,\gamma(\mathrm{d}x)$, (4), and the law of iterated expectation, we can write

$$\mathbb{E}_Q[F] = \int_{\mathbb{R}^d} \mathbb{E}_{Q^x}[F]\,\mu(\mathrm{d}x) = \int_{\mathbb{R}^d} \mathbb{E}_{\mathsf{W}^x}[F]\, f(x)\,\gamma(\mathrm{d}x) = \mathbb{E}\big[ F\big( (B_t)_{0 \le t \le 1} \big)\, f(B_1) \big],$$

where the first equation is due to the fact that $Q \in \mathcal{P}(\mu)$, in the second equation we have used the fact that the conditional laws $Q^x$ and $\mathsf{W}^x$ coincide (together with $\mu = f\gamma$), and in the last equation $(B_t)_{0 \le t \le 1}$ denotes the standard Brownian motion process with process law $\mathsf{W}$, so that $B_1$ has law $\gamma = \mathsf{W}_1$. That is, the Radon–Nikodym derivative of $Q$ with respect to $\mathsf{W}$, as a function of the path $\omega$, depends only on the terminal point $\omega(1)$ and takes the form

$$\frac{\mathrm{d}Q}{\mathrm{d}\mathsf{W}}(\omega) = f(\omega(1)), \tag{5}$$

where, as we recall, $f$ denotes the density of the target measure $\mu$ with respect to the standard Gaussian measure $\gamma$ on $\mathbb{R}^d$. Thus, the optimal measure $Q$ is simply the law of the standard $d$-dimensional Brownian motion on $[0,1]$, “conditioned” to have law $\mu$ at $t = 1$.
2. The gradient ansatz and the Girsanov transformation
In principle, (5) already tells us that we can obtain samples from $\mu$ by “reweighting” the Brownian paths $(B_t)_{0 \le t \le 1}$ using the weight factor $f(B_1)$ that depends only on $B_1$. What we are after, however, is something different: Instead of reweighting, we would like to construct a “transport” map $T : \Omega \to \Omega$ such that $Q = \mathsf{W} \circ T^{-1}$, i.e., the optimal measure $Q$ is the pushforward of the Wiener measure by $T$:

$$\mathbb{E}_Q[F] = \mathbb{E}_{\mathsf{W}}[F \circ T]$$

for any bounded measurable $F$. Moreover, we want this $T$ to be such that the “transported” process $T\big( (B_t)_{0 \le t \le 1} \big)$ is an Itô diffusion process. In order to construct such a map, we will use a fundamental result in stochastic calculus, the Girsanov theorem, which gives the necessary and sufficient condition for a probability measure $P$ on $\Omega$ to be absolutely continuous with respect to the Wiener measure $\mathsf{W}$.
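Before doing so, note that the reweighting reading of (5) is easy to check numerically. In this sketch, my illustration reusing the function `f` for the example target defined earlier, the expectation of a test function under $\mu$ is recovered by weighting standard Gaussian draws of $B_1$ by $f(B_1)$.

```python
# Importance-reweighting check of (5): E_mu[phi] = E_W[phi(B_1) f(B_1)].
import numpy as np

rng = np.random.default_rng(2)

b1 = rng.standard_normal(1_000_000)   # B_1 ~ gamma under the Wiener measure
phi = lambda x: x ** 2                # a test function

print(np.mean(phi(b1) * f(b1)))       # ~E_mu[phi] = 4.25 for the example mixture
```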
In order to make use of the Girsanov theorem, we will first make the gradient ansatz, i.e., we will assume that the diffusion process we seek is governed by an Itô SDE of the form

$$\mathrm{d}X_t = \nabla V_t(X_t)\,\mathrm{d}t + \mathrm{d}B_t, \qquad X_0 = 0, \quad 0 \le t \le 1, \tag{6}$$

where $\nabla$ denotes the gradient with respect to the space variable. Then we will show that we can choose a suitable function $V : [0,1] \times \mathbb{R}^d \to \mathbb{R}$, $(t,x) \mapsto V_t(x)$, which is twice continuously differentiable in $x$ and once continuously differentiable in $t$, such that the probability law of the resulting process $(X_t)_{0 \le t \le 1}$ will be given by $Q$. To motivate the introduction of the gradient ansatz, it is useful to compare the SDE (6) with the Langevin SDE (1) — the latter has a time-invariant drift given by the gradient of the log-density of $\mu$ with respect to the Lebesgue measure, while in the former we are using a time-varying drift since our goal is to obtain a sample from $\mu$ at time $t = 1$, and intuition suggests that a nonstationary process is needed to make this work.
At any rate, denoting by $P$ the probability law of the process (6), we can use the Girsanov theorem to write down the Radon–Nikodym derivative of $P$ with respect to $\mathsf{W}$:

$$\frac{\mathrm{d}P}{\mathrm{d}\mathsf{W}} = \exp\left( \int_0^1 \langle \nabla V_t(X_t), \mathrm{d}X_t \rangle - \frac{1}{2}\int_0^1 |\nabla V_t(X_t)|^2\,\mathrm{d}t \right). \tag{7}$$

Comparing (7) with (5), the path forward is now clear: We need to show that $V$ can be chosen in such a way that the right-hand side of (7) is equal to $f(X_1)$. This is where the Ezawa–Klauder–Shepp trick comes in. The key idea is to eliminate the Itô integral in (7) and then to reduce the problem to solving a suitable partial differential equation (PDE).
Let us define the process $(Z_t)_{0 \le t \le 1}$ by $Z_t := V_t(X_t)$. Itô’s lemma then gives

$$\mathrm{d}Z_t = \partial_t V_t(X_t)\,\mathrm{d}t + \langle \nabla V_t(X_t), \mathrm{d}X_t \rangle + \frac{1}{2}\Delta V_t(X_t)\,\mathrm{d}t,$$

where $\Delta$ is the Laplacian. Integrating and rearranging, we obtain

$$\int_0^1 \langle \nabla V_t(X_t), \mathrm{d}X_t \rangle = V_1(X_1) - V_0(X_0) - \int_0^1 \left( \partial_t V_t + \frac{1}{2}\Delta V_t \right)(X_t)\,\mathrm{d}t.$$

Substituting this into (7) and using the definition of $Z_t$ gives

$$\frac{\mathrm{d}P}{\mathrm{d}\mathsf{W}} = \exp\left( V_1(X_1) - V_0(0) - \int_0^1 \left( \partial_t V_t + \frac{1}{2}\Delta V_t + \frac{1}{2}|\nabla V_t|^2 \right)(X_t)\,\mathrm{d}t \right).$$

If we are to have any hope of reducing the right-hand side to an expression that depends only on $X_1$, we need to ensure that the integral over $t$ is identically zero.
3. The PDE connection
One way of guaranteeing this is to assume that $V$ solves the PDE

$$\partial_t V_t(x) + \frac{1}{2}\Delta V_t(x) = -\frac{1}{2}|\nabla V_t(x)|^2 \tag{8}$$

for all $(t,x) \in [0,1) \times \mathbb{R}^d$, subject to the terminal condition $V_1 = \log g$ for some strictly positive function $g$ to be chosen later. Assuming that such a $V$ exists, we will then have

$$\frac{\mathrm{d}P}{\mathrm{d}\mathsf{W}} = \exp\big( V_1(X_1) - V_0(0) \big) = e^{-V_0(0)}\, g(X_1), \tag{9}$$

which means that $g$ should be chosen in such a way that $e^{-V_0(0)}\, g(x) = f(x)$ for all $x \in \mathbb{R}^d$. Now, the PDE (8) is nonlinear in $V$ due to the presence of the squared norm of the gradient of $V$ on the right-hand side. However, it is known from the theory of PDEs that we can convert it into a linear PDE that can be solved explicitly; this is accomplished by making the logarithmic (or Cole–Hopf) transformation $u_t(x) := e^{V_t(x)}$. It is then a straightforward if tedious exercise in multivariate calculus to show that the function $u$ solves a much nicer linear PDE

$$\partial_t u_t(x) + \frac{1}{2}\Delta u_t(x) = 0 \tag{10}$$

on $[0,1) \times \mathbb{R}^d$, subject to the terminal condition $u_1 = g$.
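For completeness, here is the short computation behind that exercise (with all arguments suppressed): since $u_t = e^{V_t}$,

$$\partial_t u_t = e^{V_t}\,\partial_t V_t, \qquad \nabla u_t = e^{V_t}\,\nabla V_t, \qquad \Delta u_t = e^{V_t}\big( \Delta V_t + |\nabla V_t|^2 \big),$$

so that

$$\partial_t u_t + \frac{1}{2}\Delta u_t = e^{V_t}\left( \partial_t V_t + \frac{1}{2}\Delta V_t + \frac{1}{2}|\nabla V_t|^2 \right) = 0$$

by (8), and $u_1 = e^{V_1} = g$.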
The solution of (10) is given by the Feynman–Kac formula, one of the remarkable connections between the theory of (parabolic) PDEs and diffusion processes: For any $t \in [0,1]$ and $x \in \mathbb{R}^d$,

$$u_t(x) = \mathbb{E}\big[ g(B_1) \mid B_t = x \big] = \mathbb{E}\big[ g(x + B_{1-t}) \big]. \tag{11}$$

The proof of (11) is a simple application of Itô’s lemma: for any $s \in [t,1]$,

$$u_s(B_s) = u_t(B_t) + \int_t^s \left( \partial_r u_r + \frac{1}{2}\Delta u_r \right)(B_r)\,\mathrm{d}r + \int_t^s \langle \nabla u_r(B_r), \mathrm{d}B_r \rangle.$$

Taking the conditional expectation given $B_t = x$ (with $s = 1$; the Itô integral is a martingale and drops out) and using the fact that $u$ solves (10), we obtain (11).
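As a quick numerical sanity check of (11), my illustration: taking the terminal condition $g = f$ for the example target defined earlier, $u_0(0) = \mathbb{E}[f(B_1)]$ should equal $\int f\,\mathrm{d}\gamma = 1$, since $f$ is a probability density with respect to $\gamma$ and $B_1 \sim \gamma$.

```python
# Monte Carlo check of the Feynman-Kac representation (11) at (t, x) = (0, 0), g = f:
# u_0(0) = E[f(B_1)] = integral of f d(gamma) = 1, since f is a density w.r.t. gamma.
import numpy as np

rng = np.random.default_rng(3)
print(np.mean(f(rng.standard_normal(1_000_000))))   # ~1.0, up to MC error
```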
Going back to (8), we can now write

$$V_t(x) = \log u_t(x) = \log \mathbb{E}\big[ g(x + B_{1-t}) \big],$$

and, in particular, $e^{V_0(0)} = u_0(0) = \mathbb{E}[g(B_1)]$. Substituting all of this into (9) gives

$$\frac{\mathrm{d}P}{\mathrm{d}\mathsf{W}} = \frac{g(X_1)}{\mathbb{E}[g(B_1)]},$$

which means that we should evidently choose $g = f$: indeed, since $f$ is a probability density with respect to $\gamma$ and $B_1$ has law $\gamma$, we have $\mathbb{E}[f(B_1)] = 1$, so the right-hand side reduces to $f(X_1)$, matching (5). Moreover, using the known Gaussian form of the transition probability densities of the Brownian motion, we can give the drift in (6) explicitly as

$$\nabla V_t(x) = \nabla \log \mathbb{E}\big[ f(x + \sqrt{1-t}\,Z) \big] = \frac{\mathbb{E}\big[ \nabla f(x + \sqrt{1-t}\,Z) \big]}{\mathbb{E}\big[ f(x + \sqrt{1-t}\,Z) \big]},$$

where $Z$ is distributed according to the canonical Gaussian measure $\gamma$. This is known as the Föllmer drift (see this paper by Ronen Eldan and James Lee for more information and for further references).
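Here is a minimal end-to-end sketch, my own implementation rather than code from the tutorial, that samples from the example target by discretizing (6) with the Föllmer drift. The drift is estimated by plain Monte Carlo, using Gaussian integration by parts to avoid coding $\nabla f$: $\frac{\mathrm{d}}{\mathrm{d}x}\mathbb{E}[f(x+\sigma Z)] = \mathbb{E}[Z f(x+\sigma Z)]/\sigma$ in one dimension. The step and sample counts are arbitrary choices, and `f` is the function defined earlier.

```python
# Sampling via the Foellmer drift: Euler-Maruyama for dX = grad V dt + dB -- a sketch.
import numpy as np

rng = np.random.default_rng(4)

def follmer_drift(x, t, n_mc=1_000):
    """MC estimate of grad log E[f(x + s Z)], s = sqrt(1-t), Z ~ N(0, 1).
    Numerator uses Gaussian integration by parts: d/dx E[f(x+sZ)] = E[Z f(x+sZ)]/s."""
    s = np.sqrt(1.0 - t)
    z = rng.standard_normal((n_mc, 1))
    vals = f(x[None, :] + s * z)           # shape (n_mc, len(x))
    num = np.mean(z * vals, axis=0) / s
    den = np.mean(vals, axis=0)
    return num / den

def follmer_sample(n_particles=1_000, n_steps=500):
    dt = 1.0 / n_steps
    x = np.zeros(n_particles)              # X_0 = 0
    for k in range(n_steps - 1):
        x += follmer_drift(x, k * dt) * dt \
             + np.sqrt(dt) * rng.standard_normal(n_particles)
    # final step: drop the (noisy) drift estimate; its contribution is O(dt)
    x += np.sqrt(dt) * rng.standard_normal(n_particles)
    return x

samples = follmer_sample()
print(samples.mean(), samples.std())       # ~0 and ~2.06 for the example mixture
```

Unlike the Langevin sketch, this chain is run over the fixed horizon $[0,1]$, and, modulo discretization and Monte Carlo error, $X_1$ is an exact draw from $\mu$.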
4. The optimal control formulation
Those of you who are familiar with the theory of controlled diffusion processes should recognize the PDE (8) as the Hamilton–Jacobi–Bellman equation for the value function of a certain finite-horizon optimal stochastic control problem. While I do not want to get too much into details, I will sketch the basic idea. The first step is to use the identity

$$-\frac{1}{2}|a|^2 = \min_{u \in \mathbb{R}^d} \left( \frac{1}{2}|u|^2 + \langle u, a \rangle \right),$$

where the minimum is achieved uniquely by $u = -a$, to rewrite (8) in the equivalent form

$$\partial_t V_t(x) + \frac{1}{2}\Delta V_t(x) - \min_{u \in \mathbb{R}^d} \left( \frac{1}{2}|u|^2 + \langle u, \nabla V_t(x) \rangle \right) = 0 \tag{12}$$

for $(t,x) \in [0,1) \times \mathbb{R}^d$ with $V_1 = \log f$. The stochastic control connection is as follows: We seek an adapted control process $(u_t)_{0 \le t \le 1}$ to minimize the expected cost

$$J(u) := \mathbb{E}^u\left[ \frac{1}{2}\int_0^1 |u_t|^2\,\mathrm{d}t - \log f(X_1) \right], \tag{13}$$

where $\mathbb{E}^u[\cdot]$ denotes expectation with respect to the controlled diffusion process

$$\mathrm{d}X_t = u_t\,\mathrm{d}t + \mathrm{d}B_t, \qquad X_0 = 0, \quad 0 \le t \le 1. \tag{14}$$

Here, the integral $\frac{1}{2}\int_0^1 |u_t|^2\,\mathrm{d}t$ is the control cost, while $-\log f(X_1)$ is the terminal cost at $t = 1$. It can then be shown using the so-called verification theorem in the theory of controlled diffusions that the solution $V$ of the PDE (12) gives the value function (or the optimal cost-to-go function)

$$-V_t(x) = \inf_{u} \mathbb{E}^u\left[ \frac{1}{2}\int_t^1 |u_s|^2\,\mathrm{d}s - \log f(X_1)\,\Big|\, X_t = x \right],$$

so that, in particular, $\inf_u J(u) = -V_0(0)$. Here, the idea is that we add the drift $u_t$ to the standard Brownian motion $B_t$ to steer the process towards the target distribution $\mu$ at $t = 1$, while keeping the total “control effort” small. Of course, from the preceding derivations we know that $V_0(0) = \log \mathbb{E}[f(B_1)] = 0$, and that the optimal control is of the state feedback form $u^*_t = \nabla V_t(X_t)$, given explicitly by the Föllmer drift. Moreover, one can show using the Girsanov theorem that every $P \in \mathcal{P}(\mu)$ with $D(P \,\|\, \mathsf{W}) < \infty$ can be realized as a controlled process of the form (14) for some admissible drift $(u_t)_{0 \le t \le 1}$, so that $X_1$ has law $\mu$ and

$$D(P \,\|\, \mathsf{W}) = \mathbb{E}^u\left[ \frac{1}{2}\int_0^1 |u_t|^2\,\mathrm{d}t \right].$$

This implies, in turn, that the Föllmer drift that gives the optimal measure $Q$ is the most efficient control, in the sense that the sum of the expected control cost and the expected terminal cost is identically zero, and therefore none of the control effort is wasted along the way. Of course, one could have started with this optimal control formulation (as I do in this paper with Belinda Tzen), but I feel that the above derivation is more “elementary” in the sense that it does not rely on any specialized knowledge beyond basic stochastic calculus.
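This zero-total-cost property can also be checked numerically. In the sketch below, my illustration reusing `follmer_drift` and `f` from the earlier snippets, the expected control cost along the Föllmer-controlled process should approach $D(\mu \,\|\, \gamma)$, the expected terminal cost should approach $-D(\mu \,\|\, \gamma)$, and their sum should be close to zero.

```python
# Numerical check that J(u*) = 0: control cost ~ D(mu||gamma), terminal ~ -D(mu||gamma).
import numpy as np

rng = np.random.default_rng(5)

def follmer_cost(n_particles=1_000, n_steps=400):
    dt = 1.0 / n_steps
    x = np.zeros(n_particles)
    control_cost = np.zeros(n_particles)
    for k in range(n_steps - 1):
        u = follmer_drift(x, k * dt)           # state-feedback control u* = grad V
        control_cost += 0.5 * u ** 2 * dt
        x += u * dt + np.sqrt(dt) * rng.standard_normal(n_particles)
    x += np.sqrt(dt) * rng.standard_normal(n_particles)   # final step, drift dropped
    return control_cost.mean(), -np.log(f(x)).mean()

cc, tc = follmer_cost()
print(cc, tc, cc + tc)    # ~D(mu||gamma), ~-D(mu||gamma), ~0
```

Both estimates inherit the discretization and Monte Carlo errors of the earlier sketches.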