# The Information Structuralist

## Typical, just typical

In the spirit of shameless self-promotion, I would like to announce a new preprint (preliminary version was presented in July at ISIT 2010 in Austin, TX):

Maxim Raginsky, “Empirical processes, typical sequences and coordinated actions in standard Borel spaces”, arXiv:1009.0282, submitted to IEEE Transactions on Information Theory

Abstract: This paper proposes a new notion of typical sequences on a wide class of abstract alphabets (so-called standard Borel spaces), which is based on approximations of memoryless sources by empirical distributions uniformly over a class of measurable “test functions.” In the finite-alphabet case, we can take all uniformly bounded functions and recover the usual notion of strong typicality (or typicality under the total variation distance). For a general alphabet, however, this function class turns out to be too large, and must be restricted. With this in mind, we define typicality with respect to any Glivenko–Cantelli function class (i.e., a function class that admits a Uniform Law of Large Numbers) and demonstrate its power by giving simple derivations of the fundamental limits on the achievable rates in several source coding scenarios, in which the relevant operational criteria pertain to reproducing empirical averages of a general-alphabet stationary memoryless source with respect to a suitable function class.

The notion of a typical sequence has been one of the pillars of information theory ever since the original paper of Shannon. There are many equivalent notions of typicality, but I am going to use one that’s often referred to as strong typicality. Namely, consider a finite set ${{\mathsf X}}$ and a probability distribution ${P}$ on it. Given some ${\varepsilon > 0}$, an ${n}$-tuple ${x^n \in {\mathsf X}^n}$ is said to be ${\varepsilon}$typical with respect to ${P}$ if

$\displaystyle \sum_{a \in {\mathsf X}} \left| \frac{|\{ 1 \le i \le n : x_i = a\}|}{n} - P(a) \right| < \varepsilon.$

We can write this more succinctly if we define the empirical distribution ${{\mathsf P}_{x^n}}$ by

$\displaystyle {\mathsf P}_{x^n}(a) = \frac{|\{ i : x_i = a\}|}{n}, \qquad \forall a \in {\mathsf X}$

and recall the definition of the total variation distance between probability distributions ${P}$ and ${Q}$ on ${{\mathsf X}}$,

$\displaystyle \| P - Q \|_{\rm TV} = \sum_{a \in {\mathsf X}} |P(a) - Q(a)|. \ \ \ \ \ (1)$

Then we can say that ${x^n}$ is ${\varepsilon}$-typical w.r.t. ${P}$ if

$\displaystyle \| {\mathsf P}_{x^n} - P \|_{\rm TV} < \varepsilon.$

One of the salient facts about typicality is the following: if ${X_1,X_2,\ldots}$ is an infinite sequence of i.i.d. draws from ${P}$, then it follows from the Law of Large Numbers that

$\displaystyle P^n \Big( X^n \text{ is not }\varepsilon\text{-typical}\Big) \longrightarrow 0 \quad \text{as }n \rightarrow \infty. \ \ \ \ \ (2)$

If you read any textbook on information theory (say, Cover and Thomas), you will find that the proof of almost every coding theorem makes use of this fact (and of its various extensions to joint and conditional distributions).

Recent work by Paul Cuff, Haim Permuter and Tom Cover on coordination capacity uses strong typicality as a basic tool for addressing the following general problem: Consider a network consisting of ${N}$ agents, such that agents ${i}$ and ${j}$ can communicate at some specified rate ${R_{i,j}}$, for each ${i \neq j}$. Suppose that the ${i}$th agent selects an action ${X_i \in {\mathsf X}_i}$ (where ${{\mathsf X}_i}$ is a finite alphabet) based on the information it receives from other agents, as well as on some common randomness shared by all the agents. What is the set of all achievable joint distributions ${P_{X^N}(x^N)}$ of the actions over the network? This problem of coordination via communication arises not only in source coding proper, but also in such contexts as decision-making and control in multiagent systems, or network security.

There is one sense in which this problem is trivial. Suppose that there is no communication among the agents, but they all observe a common random element ${\omega}$ drawn from a probability space ${(\Omega,{\mathcal F},P)}$. Then they can generate absolutely any joint distribution ${P_{X^N}}$ over their respective actions simply by agreeing beforehand on the measurable functions ${X_i : \Omega \rightarrow {\mathsf X}_i}$, ${i = 1,\ldots,N}$, such that

$\displaystyle P_{X^N}(x_1,\ldots,x_N) = P\left( \left\{ \omega \in \Omega: X_1(\omega) = x_1,\ldots, X_N(\omega) = x_N\right\}\right)$

for every possible assignment ${(x_1,\ldots,x_N) \in {\mathsf X}_1 \times \ldots \times {\mathsf X}_N}$. Once they observe ${\omega}$, they do not need to communicate at all, since each agent simply applies her own function.

However, the situation changes dramatically as soon as some constraints are imposed. For instance, what happens when some of the agents are not free to select their actions, but receive them as random assignments? Then the question becomes: given these “boundary conditions”, what are all the possible conditional distributions of the actions for the remaining agents that can be generated given the rate constraints ${\{R_{i,j}\}}$?

Let me illustrate this by means of a simple example. Consider a two-node network shown below:

We have two nodes, ${A}$ and ${B}$. Node ${A}$ is assigned an ${n}$-tuple of actions ${X^n = (X_1,\ldots,X_n)}$ drawn i.i.d. from some distribution ${P_X}$ on a finite set ${{\mathsf X}}$. Node ${B}$ must generate an ${n}$-tuple of actions ${\hat{Y}^n}$ in another finite set ${{\mathsf Y}}$ based on information it receives from Node ${A}$. Now suppose that there is an external entity (the Observer) that can observe ${X^n}$ and ${\hat{Y}^n}$. The Observer has some joint distribution ${P_{XY}}$ in mind (with the given ${X}$-marginal ${P_X}$) over the product space ${{\mathsf X} \times {\mathsf Y}}$, and the nodes know this. The problem then is this: find the smallest rate (in bits per action) at which Node ${A}$ has to communicate with Node ${B}$ in order for both of them to convince the Observer that the relative frequencies of their joint actions ${(X_i,\hat{Y}_i)}$ are consistent with ${P_{XY}}$. In symbols, we wish to determine the smallest rate ${R > 0}$, such that for every ${n}$ large enough we can find two mappings ${e_n : {\mathsf X}^n \rightarrow \{1,\ldots,2^{nR}\}}$ and ${d_n : \{1,\ldots,2^{nR}\} \rightarrow {\mathsf Y}^n}$, such that with ${\hat{Y}^n = d_n(e_n(X^n))}$, we can guarantee

$\displaystyle \mathop{\mathbb E} \left\| {\mathsf P}_{(X^n,\hat{Y}^n)} - P_{XY} \right\|_{\rm TV} \le \varepsilon,$

where ${\varepsilon > 0}$ is some small error level that will satisfy the Observer.

The result is (roughly) as follows:

Theorem 1 (Cuff, Permuter, Cover) Any rate ${R > I(X;Y)}$ suffices, and no rate ${R < I(X;Y)}$ will work. Here, ${I(X;Y)}$ is the mutual information between ${X}$ and ${Y}$ when they have the joint distribution ${P_{XY}}$.

I will only sketch the proof of achievability. A more-or-less standard argument shows that, for any ${\delta > 0}$ and any sufficiently large ${n}$, there exist a finite set ${C_n =\{ y^n(1),\ldots,y^n(J)\} \subset {\mathsf Y}^n}$ with ${J \approx 2^{n I(X;Y)}}$ and two mappings, ${e_n : {\mathsf X}^n \rightarrow \{1,\ldots,J\}}$ and ${d_n : \{1,\ldots,J\} \rightarrow C_n}$ (with ${d_n(j) = y^n(j)}$), such that, if ${X^n}$ is i.i.d. according to ${P_X}$ and ${\hat{Y}^n = d_n(e_n(X^n))}$, then the ${n}$-tuple of pairs ${(X_1,\hat{Y}_1),\ldots,(X_n,\hat{Y}_n) \in {\mathsf X} \times {\mathsf Y}}$ is ${\delta}$-typical w.r.t. ${P_{XY}}$ with high probability. The rest is a matter of choosing ${\delta = \varepsilon/2}$ and ${n = n(\delta)}$ so large that this probability is at least ${1-\varepsilon/4}$. Then it’s not terribly hard to show that

$\displaystyle \mathop{\mathbb E} \left\| {\mathsf P}_{(X^n,\hat{Y}^n)} - P_{XY} \right\|_{\rm TV} < \varepsilon.$

Now, all of this is fine when the sets ${{\mathsf X}}$ and ${{\mathsf Y}}$ are finite. But what if we are interested in generating continuous-valued actions? It then turns out that we cannot even meaningfully define an ${\varepsilon}$-typical sequence! Here’s why. Let us recall an equivalent definition of the total variation distance (1) that works in an arbitrary measurable space ${({\mathsf X},{\mathcal B}({\mathsf X}))}$, namely

$\displaystyle \| P - Q \|_{\rm TV} = 2 \sup_{A \in {\mathcal B}({\mathsf X})} |P(A) - Q(A)|. \ \ \ \ \ (3)$

Now suppose that ${P}$ is an empirical distribution induced by some ${n}$-tuple ${x^n \in {\mathsf X}^n}$, i.e., for any ${A \in {\mathcal B}({\mathsf X})}$ we have

$\displaystyle {\mathsf P}_{x^n}(A) = \frac{1}{n}\sum^n_{i=1} 1_{\{X_i \in A\}},$

while ${Q}$ is a probability measure that puts zero mass on any singleton, ${Q(\{x\}) = 0}$ for every ${x \in {\mathsf X}}$ (this will be the case if ${Q}$ is the distribution of a multivariate Gaussian random variable). Now take ${A}$ to consist of all distinct elements of ${x^n}$. Then

$\displaystyle {\mathsf P}_{x^n}(A) = 1, Q(A) = 0 \quad \Longrightarrow \quad | {\mathsf P}_{x^n}(A) - Q(A) | = 1,$

and as a result ${\| {\mathsf P}_{x^n} - Q \|_{\rm TV} = 2}$ for any choice of ${x^n}$. EPIC FAIL!

Of course, we don’t really have to use typicality arguments when working with continuous alphabets. There are plenty of other useful tools, such as ergodic theory or theory of large deviations. However, there is something to be said for the inherent simplicity and transparency of the typicality-based proofs — when done right, they rely on nothing more than a combination of the Law of Large Numbers and the triangle inequality. So why can’t we have something like that for abstract alphabets? The answer is — yes we can!

Let’s consider the Euclidean case ${{\mathsf X} = {\mathbb R}^d}$, for simplicity. The first thing to notice is that the definition (3) really asks for too much — we demand that the relative frequencies of all measurable sets, no matter how bizarre-looking, converge to their true probabilities uniformly. But, as we have just seen, that just cannot happen. However, if we scale our ambitions back a bit, we may yet have the convergence we want. All we have to do is restrict our attention to “nice” families of sets, such as balls, halfspaces, polyhedra, or the elements of your own favorite Vapnik–Chervonenkis class. So the idea is to pick a suitable infinite collection ${{\mathcal A}}$ of measurable subsets of ${{\mathsf X}}$, such that

$\displaystyle \| {\mathsf P}_{X^n} - P \|_{\mathcal A} := \sup_{A \in {\mathcal A}} \left| {\mathsf P}_{X^n}(A) - P(A) \right| \xrightarrow{n \rightarrow \infty} 0 \qquad P^\infty\text{-almost surely} \ \ \ \ \ (4)$

for any probability measure ${P}$ on ${({\mathsf X},{\mathcal B}({\mathsf X}))}$. Then we can say that an ${n}$-tuple ${x^n \in {\mathsf X}^n}$ is ${\varepsilon}$-typical w.r.t. ${P}$ if

$\displaystyle \| {\mathsf P}_{x^n} - P \|_{\mathcal A} < \varepsilon.$

This approach actually works, provided ${{\mathcal A}}$ is a sufficiently “regular” collection of sets (and I gave several examples above).

More generally, we can start with yet another (!) equivalent definition of the TV distance,

$\displaystyle \| P - Q \|_{\rm TV} = \sup_{\| f \|_\infty \le 1} \left| \mathop{\mathbb E}{}_P f - \mathop{\mathbb E}{}_Q f \right|.$

In other words, the total variation distance between ${P}$ and ${Q}$ is equal to the largest absolute difference between the expectations, under ${P}$ and ${Q}$, of any measurable function bounded by unity. Once again, this class of functions is way too large. Thus, without further ado, we simply consider a smaller class that would permit such approximations. Luckily, it so happens that function classes just of this type are extensively used in modern mathematical statistics, where they are known under the name of Glivenko–Cantelli classes. Leaving out the technical details, I can only say that I can now define typicality w.r.t. any Glivenko–Cantelli function class, so that several cool things happen:

• A convergence statement like (2) continues to hold.
• When the alphabet ${{\mathsf X}}$ is finite, I can just take all functions ${f : {\mathsf X} \rightarrow [-1,1]}$ and get back the usual notion of strong typicality.
• When the alphabet ${{\mathsf X}}$ is a complete separable metric space, I can take as my reference class the set of all functions that are 1-Lipschitz and bounded by 1, in which case my notion of typicality is compatible with the topology of the weak convergence of probability measures.
• I can now have achievability proofs that really do consist of little more than the Law of Large Numbers (a uniform version thereof, as in (4)) and the triangle inequality. In my paper, I prove a general-alphabet variant of the basic two-node result that I have discussed earlier, as well as address a more difficult set-up that involves side information at the decoder, in the spirit of Wyner and Ziv.

Why is this at all useful or interesting, you ask? Well, first of all, we can now have our cake and eat it too, where by “cake” I mean “typical sequences over abstract alphabets” and by “eat” I mean “use them to prove cool new coding theorems.” Secondly, there are numerous problems pertaining to distributed control, learning, and sensing that can benefit from this new viewpoint. For example, we may consider the problem of learning to predict the nodes’ future behavior by observing their actions. In this setting, each ${f}$ in our reference class ${{\mathcal F}}$ may correspond to the loss we incur when we use a particular candidate predictor, and we can now quantify the minimal amount of information that needs to be exchanged among the nodes so that we can accurately estimate the losses of all our candidate predictors simply by looking at their sample averages (as is done, for instance, in a 1996 paper by Kevin Buescher and P. R. Kumar on learning by canonical smooth estimation). Moreover, Glivenko–Cantelli classes of sets and functions play quite a prominent role in the theory of statistical learning, where one of the central questions is one of rates of convergence — i.e., how quickly can a learning agent learn as a function of the available amount of training data? Hopefully, with this new formalism we can now begin to answer questions like “What is the minimal amount of predictively relevant information contained in a sequence of training samples?”