Divergence in everything: mixing rates for finite Markov chains
This is the first post in a series aobut the versatility and the power of the good old Kullback-Leibler divergence.
Today I will describe a beautiful application of the divergence to the problem of determining mixing rates of finite Markov chains. This argument is quite recent, and comes from a nice paper by Sergey Bobkov and Prasad Tetali (“Modified logarithmic Sobolev inequalities in discrete settings,” Journal of Theoretical Probability, vol. 19, no. 2, pp. 209-336, 2006; preprint). Since my interest here is information-theoretic, I will take for granted certain facts from the theory of finite Markov chains.
Let be a finite state space, and consider a stochastic matrix on , i.e., for all and
Following standard usage, I will represent probability distributions on by row vectors and real-valued functions on by column vectors. With these conventions, acts on probability distributions from the right, , and on functions from the left, .
It is easiest to work with continuous-time Markov chains. Any such chain is an -valued random process with some specified initial distribution of , say, . Then the distribution of is given by
Here is the identity matrix, is the infinitesimal generator of the Markov semigroup , and denotes the th matrix power of .
We are interested in the asymptotic behavior of the random variables for large values of . In order to be able to make sensible statements about it, we will start with the following basic assumption:
Irreducibility — For any pair , there exists some positive integer , such that .
From this it can be shown that:
- There exists a stationary distribution , such that . In other words, if , then for all .
- This is unique, and for all .
Let be the initial distribution, and let denote the distribution of . We will show that, no matter what is, will converge to in total variation, and moreover the convergence is exponentially fast. This property, often referred to as geometric ergodicity, is a fundamental statement about stability, and so finds applications in many areas, from mathematical statistical mechanics (in connection with the second law of thermodynamics) to stochastic control. Therefore, it is not surprising that there are numerous papers and even books devoted to geometric ergodicity. A standard way of deriving geometric ergodicity is through spectral methods, pertaining to the eigenvalues of viewed as a linear operator on the Hilbert space . There is quite a bit of work along this direction; see, for example, papers by Persi Diaconis and Daniel Stroock (for the reversible case) and by James Allen Fill (for the nonreversible case).
The argument due to Bobkov and Tetali, which I am about to describe, is information-theoretic in nature. Instead of studying the spectral properties of , they look at the optimal rate at which the divergence converges to zero.
Since is strictly positive, any probability distribution on has a density relative to it, i.e., the function is nonnegative and finite. To study the asymptotic behavior of , we will examine the evolution of the density . To that end, let us define the time reversal of (what Fill calls the reversibilization of ), which we denote by , by
Note that we did not assume that the Markov chain is reversible, i.e., . We have the following basic fact:
Proof: The proof is by induction. The base case, , holds by the definition of . Assume then that the statement is true for some . Then
where we have used the rule for matrix multiplication, the induction hypothesis, and the definition of . This finishes the proof.
Let us also define the dual semigroup by
Armed with this definition, we can proceed to establish the following key result:
Proof: Starting with , we can write
Dividing both sides by (which is strictly positive by irreducibility), we obtain the first statement of the lemma.
Next, by linearity we have
Note: the time derivative of is rigorously defined via
and it is not hard to show, using the infinite series representation of , that .
Before we can finally get to business, we need to introduce one more definition. For any two functions , define the so-called Dirichlet form
Here is why this definition comes in handy:
Proof: Using the fact that is irreducible and Lemma~1, it is easy to show that for all . Hence, the function is differentiable for all . Now, by definition,
Therefore, by linearity of expectation
Next, we use the fact that is the adjoint of w.r.t. the inner product :
(note that when is reversible, is self-adjoint, ). Hence,
and the theorem is proved.
And here is the rub. Let us define, for any strictly positive , the entropy functional
Note that if is of the form for some probability distribution , then , and so
Theorem 4 For any initial distribution ,
Proof: Using Theorem 3 and the definition of , we can write
Integrating the resulting inequality, we obtain
Using Pinsker’s inequality, we have
We can upper-bound as follows:
This finishes the proof.
Among other things, Bobkov and Tetali show that the constant defined in (1) often gives a much tighter bound on the rate of convergence to stationarity than the usual Poincaré constant, derived using the spectral methods (cf. Diaconis–Stroock). So the whole affair hinges on being able to derive good lower bounds on .