Divergence in everything: mixing rates for finite Markov chains

This is the first post in a series aobut the versatility and the power of the good old Kullback-Leibler divergence.

(image yoinked from Sergio Verdú‘s 2007 Shannon Lecture slides)

Today I will describe a beautiful application of the divergence to the problem of determining mixing rates of finite Markov chains. This argument is quite recent, and comes from a nice paper by Sergey Bobkov and Prasad Tetali (“Modified logarithmic Sobolev inequalities in discrete settings,” Journal of Theoretical Probability, vol. 19, no. 2, pp. 209-336, 2006; preprint). Since my interest here is information-theoretic, I will take for granted certain facts from the theory of finite Markov chains.

Let {{\mathsf X}} be a finite state space, and consider a stochastic matrix {P} on {{\mathsf X}}, i.e., {P(x,y) \ge 0} for all {x,y \in {\mathsf X}} and

\displaystyle  \sum_{y \in {\mathsf X}}P(x,y) = 1.

Following standard usage, I will represent probability distributions on {{\mathsf X}} by row vectors and real-valued functions on {{\mathsf X}} by column vectors. With these conventions, {P} acts on probability distributions from the right, {\pi \mapsto \pi P}, and on functions from the left, {f \mapsto P f}.

It is easiest to work with continuous-time Markov chains. Any such chain is an {{\mathsf X}}-valued random process {\{ X_t \}_{t \ge 0}} with some specified initial distribution of {X_0}, say, {\mu_0}. Then the distribution of {X_t} is given by

\displaystyle  \mu_t = \mu_0 P_t, \qquad \text{where } P_t = e^{-t (I - P)} = e^{-t} \sum^\infty_{n=0} \frac{t^n P^n}{n!}.

Here {I} is the identity matrix, {L = -(I-P)} is the infinitesimal generator of the Markov semigroup {\{P_t\}_{t \ge 0}}, and {P^n} denotes the {n}th matrix power of {P}.

We are interested in the asymptotic behavior of the random variables {X_t} for large values of {t}. In order to be able to make sensible statements about it, we will start with the following basic assumption:

Irreducibility — For any pair {x,y \in {\mathsf X}}, there exists some positive integer {n \ge 1}, such that {P^n (x,y) > 0}.

From this it can be shown that:

  1. There exists a stationary distribution {\pi}, such that {\pi = \pi P}. In other words, if {X_0 \sim \pi}, then {X_t \sim \pi} for all {t}.
  2. This {\pi} is unique, and {\pi(x) > 0} for all {x \in {\mathsf X}}.

Let {\mu_0} be the initial distribution, and let {\mu_t} denote the distribution of {X_t}. We will show that, no matter what {\mu_0} is, {\mu_t} will converge to {\pi} in total variation, and moreover the convergence is exponentially fast. This property, often referred to as geometric ergodicity, is a fundamental statement about stability, and so finds applications in many areas, from mathematical statistical mechanics (in connection with the second law of thermodynamics) to stochastic control. Therefore, it is not surprising that there are numerous papers and even books devoted to geometric ergodicity. A standard way of deriving geometric ergodicity is through spectral methods, pertaining to the eigenvalues of {L} viewed as a linear operator on the Hilbert space {L^2(\pi)}. There is quite a bit of work along this direction; see, for example, papers by Persi Diaconis and Daniel Stroock (for the reversible case) and by James Allen Fill (for the nonreversible case).

The argument due to Bobkov and Tetali, which I am about to describe, is information-theoretic in nature. Instead of studying the spectral properties of {L}, they look at the optimal rate at which the divergence {D(\mu_t \| \pi)} converges to zero.

Since {\pi} is strictly positive, any probability distribution {\mu} on {{\mathsf X}} has a density relative to it, i.e., the function {f(x) = \mu(x)/\pi(x)} is nonnegative and finite. To study the asymptotic behavior of {\mu_t}, we will examine the evolution of the density {f_t(x) = \mu_t(x)/\pi(x)}. To that end, let us define the time reversal of {P} (what Fill calls the reversibilization of {P}), which we denote by {P^*}, by

\displaystyle  \pi(x) P^*(x,y) = \pi(y) P(y,x), \qquad \forall x,y \in {\mathsf X}.

Note that we did not assume that the Markov chain is reversible, i.e., {P^* = P}. We have the following basic fact:

Lemma 1 For every positive integer {n \ge 1} and any two {x,y \in {\mathsf X}},

\displaystyle  	\pi(x) (P^*)^n(x,y) = \pi(y) P^n(y,x).

Proof: The proof is by induction. The base case, {n = 1}, holds by the definition of {P^*}. Assume then that the statement is true for some {n \ge 1}. Then

\displaystyle  \begin{array}{rcl}  		\pi(x)(P^*)^{n+1}(x,y) &=& \sum_{u \in {\mathsf X}} \pi(x) (P^*)^n(x,u) P^*(u,y) \\ 		&=& \sum_{u \in {\mathsf X}} \pi(u) P^n(u,x) P^*(u,y) \\ 		&=& \sum_{u \in {\mathsf X}} P^n(u,x) \pi(u) P^*(u,y) \\ 		&=& \sum_{u \in {\mathsf X}} P^n(u,x) \pi(y) P(y,u) \\ 		&=& \pi(y) \sum_{u \in {\mathsf X}} P(y,u)P^n(u,x) \\ 		&\equiv& \pi(y) P^{n+1}(y,x), 	\end{array}

where we have used the rule for matrix multiplication, the induction hypothesis, and the definition of {P^*}. This finishes the proof. \Box

Let us also define the dual semigroup {\{P^*_t\}_{t \ge 0}} by

\displaystyle  P^*_t = e^{-t(I - P^*)} = e^{-t}\sum^\infty_{n=0}\frac{t^n (P^*)^n}{n!}.

Armed with this definition, we can proceed to establish the following key result:

Lemma 2 For any {\mu_0} and any {t \ge 0}, {f_t = P^*_t f_0}. Moreover, for any {x \in {\mathsf X}}

\displaystyle  	\frac{df_t(x)}{dt} = L^* f_t(x),

where {L^* = -(I - P^*)} is the generator of the dual Markov semigroup.

Proof: Starting with {\mu_t = \mu_0 P_t}, we can write

\displaystyle  \begin{array}{rcl}  		\mu_t(x) &=& \sum_{y \in {\mathsf X}} \mu_0(y) P_t(y,x) \\ 		&=& \sum_{y \in {\mathsf X}} f_0(y) \pi(y) P_t(y,x) \\ 		&=& \sum_{y \in {\mathsf X}} f_0(y) e^{-t}\sum^\infty_{n=0}\frac{t^n \pi(y) P^n(y,x)}{n!} \\ 		&=& \sum_{y \in {\mathsf X}} f_0(y)e^{-t}\sum^\infty_{n=0}\frac{t^n \pi(x) (P^*)^n(x,y)}{n!} \qquad \text{(by Lemma 1)}\\ 		&=& \pi(x) \sum_{y \in {\mathsf X}} e^{-t} \left( \sum^\infty_{n=0} \frac{t^n (P^*)^n(x,y)}{n!}\right) f_0(y) \\ 		&=& \pi(x) \sum_{y \in {\mathsf X}} P^*_t(x,y) f_0(y) \\ 		&=& \pi(x) \cdot P^*_t f_0 (x). 	\end{array}

Dividing both sides by {\pi(x)} (which is strictly positive by irreducibility), we obtain the first statement of the lemma.

Next, by linearity we have

\displaystyle  	\frac{df_t(x)}{dt} = \left( \frac{d}{dt}P^*_t\right) f_0 (x) = L^* P^*_t f_0(x) = L^* f_t(x).

Note: the time derivative of {P^*_t} is rigorously defined via

\displaystyle  \frac{dP^*_t}{dt} = \lim_{\varepsilon \rightarrow 0}\frac{P^*_{t+\varepsilon} - P^*_t}{\varepsilon},

and it is not hard to show, using the infinite series representation of {P^*_t}, that {(d/dt)P^*_t = L^* P_t}. \Box

Before we can finally get to business, we need to introduce one more definition. For any two functions {f, g : {\mathsf X} \rightarrow {\mathbb R}}, define the so-called Dirichlet form

\displaystyle  {\mathcal E}(f,g) = - \sum_{x \in {\mathsf X}} f(x) Lg(x) \pi(x).

Here is why this definition comes in handy:

Theorem 3 For any {t > 0},

\displaystyle  	\frac{d}{dt} D(\mu_t \| \pi) = -{\mathcal E} (f_t, \log f_t).

Proof: Using the fact that {P} is irreducible and Lemma~1, it is easy to show that {f_t > 0} for all {t}. Hence, the function {t \mapsto D(\mu_t \| \pi)} is differentiable for all {0 < t < +\infty}. Now, by definition,

\displaystyle  \begin{array}{rcl}  	D(\mu_t \| \pi) &=& \sum_{x \in {\mathsf X}} \mu_t(x) \log \frac{\mu_t(x)}{\pi(x)} \\ 	&=& \sum_{x \in {\mathsf X}} \pi(x) f_t(x) \log f_t(x) \\ 	&=& \mathop{\mathbb E}{}_\pi [f_t \log f_t]. \end{array}

Therefore, by linearity of expectation

\displaystyle  \begin{array}{rcl}  	\frac{d}{dt} D(\mu_t \| \pi) &=& \mathop{\mathbb E}{}_\pi \left[ \frac{d}{dt}(f_t \log f_t) \right] \\ 	&=& \mathop{\mathbb E}{}_\pi\left[ (\log f_t + 1) \frac{df_t}{dt} \right] \\ 	&=& \mathop{\mathbb E}{}_\pi \left[ (\log f_t + 1) L^* f_t\right] \qquad \qquad \qquad \qquad \text{(by Lemma 2)} \\ 	&=& \mathop{\mathbb E}{}_\pi \left[ (\log f_t) (L^* f_t) \right] + L^* \mathop{\mathbb E}{}_\pi [f_t] \\ 	&=& \mathop{\mathbb E}{}_\pi \left[ (\log f_t) (L^* f_t)\right] - (I - P^*) 1 \\ 	&=& \mathop{\mathbb E}{}_\pi \left[ (\log f_t)(L^* f_t)\right]. \end{array}

Next, we use the fact that {L^*} is the adjoint of {L} w.r.t. the {L^2(\pi)} inner product {\langle f, g \rangle_\pi = \mathop{\mathbb E}{}_\pi [fg]}:

\displaystyle  \begin{array}{rcl}  	\langle f, Lg \rangle_\pi &=& \sum_{x \in {\mathsf X}} f(x) Lg(x) \pi(x) \\ 	&=& \sum_{x \in {\mathsf X}} f(x) Pg(x) \pi(x) - \sum_{x \in {\mathsf X}} f(x) g(x) \pi(x) \\ 	&=& \sum_{x \in {\mathsf X}} \sum_{y \in {\mathsf X}} f(x) P(x,y) g(y) \pi(x) - \sum_{x \in {\mathsf X}} f(x) g(x) \pi(x) \\ 	&=& \sum_{x \in {\mathsf X}} \sum_{y \in {\mathsf X}} f(x) P^*(y,x) g(y) \pi(y) - \sum_{x \in {\mathsf X}} f(x) g(x) \pi(x) \\ 	&=& \sum_{y \in {\mathsf X}} P^*f (y) g(y) \pi(y) - \sum_{y \in {\mathsf X}} f(y) g(y) \pi(y) \\ 	&=& \sum_{y \in {\mathsf X}} L^*f(y) g(y) \pi(y) \\ 	&=& \langle L^* f, g \rangle_\pi \end{array}

(note that when {P} is reversible, {L} is self-adjoint, {L = L^*}). Hence,

\displaystyle  	\mathop{\mathbb E}{}_\pi \left[ (L^* f_t)(\log f_t)\right] = \langle L^* f_t, \log f_t \rangle_\pi = \langle f_t, L \log f_t \rangle_\pi \equiv - {\mathcal E} (f_t,\log f_t),

and the theorem is proved. \Box

And here is the rub. Let us define, for any strictly positive {f : {\mathsf X} \rightarrow (0,+\infty)}, the entropy functional

\displaystyle  H_\pi(f) = \mathop{\mathbb E}{}_\pi \left[f \log \frac{f}{\mathop{\mathbb E}{}_\pi f}\right].

Note that if {f} is of the form {f(x) = \mu(x)/\pi(x)} for some probability distribution {\mu}, then {\mathop{\mathbb E}{}_\pi f = 1}, and so

\displaystyle  H_\pi(f) = \mathop{\mathbb E}{}_\pi [f \log f] = \sum_{x \in {\mathsf X}} \mu(x) \log f(x) \equiv D(\mu \| \pi).

Now let

\displaystyle  \rho_0 = \sup_{f > 0} \frac{{\mathcal E}(f, \log f)}{2 H_\pi(f)}. \ \ \ \ \ (1)

With this definition, we finally obtain the desired exponential convergence result:

Theorem 4 For any initial distribution {\mu_0},

\displaystyle  	\| \mu_t - \pi \|^2_{TV} \le 2 \log \left(\frac{1}{\pi_*}\right) e^{-2\rho_0 t}, \qquad \forall t \ge 0,

where {\pi_* = \min_{x \in {\mathsf X}}\pi(x)}.

Proof: Using Theorem 3 and the definition of {\rho_0}, we can write

\displaystyle  	\frac{d}{dt}D(\mu_t \| \pi) = - {\mathcal E} (f_t, \log f_t) \le - 2\rho_0 H_\pi(f_t) = - 2\rho_0 D(\mu_t \| \pi).

Integrating the resulting inequality, we obtain

\displaystyle  D(\mu_t \| \pi) \le D(\mu_0 \| \pi) e^{-2\rho_0 t}.

Using Pinsker’s inequality, we have

\displaystyle  \| \mu_t - \pi \|^2_{TV} \le 2 D(\mu_t \| \pi) \le 2 D(\mu_0 \| \pi) e^{-2\rho_0 t}.

We can upper-bound {D(\mu_0 \| \pi)} as follows:

\displaystyle  	D(\mu_0 \| \pi) = \sum_{x \in {\mathsf X}} \mu_0(x) \log \frac{\mu_0(x)}{\pi(x)} \le \sum_{x \in {\mathsf X}} \mu_0(x) \log \frac{1}{\pi_*} = \log \frac{1}{\pi_*}.

This finishes the proof. \Box

Among other things, Bobkov and Tetali show that the constant {\rho_0} defined in (1) often gives a much tighter bound on the rate of convergence to stationarity than the usual Poincaré constant, derived using the {L^2(\pi)} spectral methods (cf. Diaconis–Stroock). So the whole affair hinges on being able to derive good lower bounds on {\rho_0}.

One thought on “Divergence in everything: mixing rates for finite Markov chains”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: