This is the first post in a series aobut the versatility and the power of the good old Kullback-Leibler divergence.

(image yoinked from Sergio Verdú‘s 2007 Shannon Lecture slides)

Today I will describe a beautiful application of the divergence to the problem of determining mixing rates of finite Markov chains. This argument is quite recent, and comes from a nice paper by Sergey Bobkov and Prasad Tetali (“Modified logarithmic Sobolev inequalities in discrete settings,” *Journal of Theoretical Probability*, vol. 19, no. 2, pp. 209-336, 2006; preprint). Since my interest here is information-theoretic, I will take for granted certain facts from the theory of finite Markov chains.

Let be a finite state space, and consider a stochastic matrix on , i.e., for all and

Following standard usage, I will represent probability distributions on by row vectors and real-valued functions on by column vectors. With these conventions, acts on probability distributions from the right, , and on functions from the left, .

It is easiest to work with continuous-time Markov chains. Any such chain is an -valued random process with some specified initial distribution of , say, . Then the distribution of is given by

Here is the identity matrix, is the *infinitesimal generator* of the *Markov semigroup* , and denotes the th matrix power of .

We are interested in the asymptotic behavior of the random variables for large values of . In order to be able to make sensible statements about it, we will start with the following basic assumption:

Irreducibility —For any pair , there exists some positive integer , such that .

From this it can be shown that:

- There exists a
*stationary distribution*, such that . In other words, if , then for all . - This is unique, and for all .

Let be the initial distribution, and let denote the distribution of . We will show that, no matter what is, will converge to in total variation, and moreover the convergence is exponentially fast. This property, often referred to as *geometric ergodicity*, is a fundamental statement about *stability*, and so finds applications in many areas, from mathematical statistical mechanics (in connection with the second law of thermodynamics) to stochastic control. Therefore, it is not surprising that there are numerous papers and even books devoted to geometric ergodicity. A standard way of deriving geometric ergodicity is through spectral methods, pertaining to the eigenvalues of viewed as a linear operator on the Hilbert space . There is quite a bit of work along this direction; see, for example, papers by Persi Diaconis and Daniel Stroock (for the reversible case) and by James Allen Fill (for the nonreversible case).

The argument due to Bobkov and Tetali, which I am about to describe, is information-theoretic in nature. Instead of studying the spectral properties of , they look at the optimal rate at which the *divergence* converges to zero.

Since is strictly positive, any probability distribution on has a *density* relative to it, i.e., the function is nonnegative and finite. To study the asymptotic behavior of , we will examine the evolution of the density . To that end, let us define the *time reversal* of (what Fill calls the *reversibilization* of ), which we denote by , by

Note that we did *not* assume that the Markov chain is *reversible*, i.e., . We have the following basic fact:

Lemma 1For every positive integer and any two ,

*Proof:* The proof is by induction. The base case, , holds by the definition of . Assume then that the statement is true for some . Then

where we have used the rule for matrix multiplication, the induction hypothesis, and the definition of . This finishes the proof.

Let us also define the *dual semigroup* by

Armed with this definition, we can proceed to establish the following key result:

Lemma 2For any and any , . Moreover, for any

where is the generator of the dual Markov semigroup.

*Proof:* Starting with , we can write

Dividing both sides by (which is strictly positive by irreducibility), we obtain the first statement of the lemma.

Next, by linearity we have

**Note:** the time derivative of is rigorously defined via

and it is not hard to show, using the infinite series representation of , that .

Before we can finally get to business, we need to introduce one more definition. For any two functions , define the so-called *Dirichlet form*

Here is why this definition comes in handy:

*Proof:* Using the fact that is irreducible and Lemma~1, it is easy to show that for all . Hence, the function is differentiable for all . Now, by definition,

Therefore, by linearity of expectation

Next, we use the fact that is the *adjoint* of w.r.t. the inner product :

(note that when is reversible, is self-adjoint, ). Hence,

and the theorem is proved.

And here is the rub. Let us define, for any strictly positive , the *entropy functional*

Note that if is of the form for some probability distribution , then , and so

With this definition, we finally obtain the desired exponential convergence result:

Theorem 4For any initial distribution ,

where .

*Proof:* Using Theorem 3 and the definition of , we can write

Integrating the resulting inequality, we obtain

Using Pinsker’s inequality, we have

We can upper-bound as follows:

This finishes the proof.

Among other things, Bobkov and Tetali show that the constant defined in (1) often gives a much tighter bound on the rate of convergence to stationarity than the usual Poincaré constant, derived using the spectral methods (cf. Diaconis–Stroock). So the whole affair hinges on being able to derive good lower bounds on .

## One thought on “Divergence in everything: mixing rates for finite Markov chains”