# The Information Structuralist

## Counting bits with Vapnik and Chervonenkis

Posted in Information Theory, Statistical Learning and Inference by mraginsky on April 28, 2015

Machine learning is about enabling computers to improve their performance on a given task as they get more data. Can we express this intuition quantitatively using information-theoretic techniques? In this post, I will discuss a classic paper by David Haussler, Michael Kearns, and Robert Schapire that (to the best of my knowledge) took the first step in this direction. In this post, I will describe some of their results, recast in a more explicitly information-theoretic way.

We will stick to the setting of binary classification: we have a feature space ${{\mathsf X}}$ (which is completely arbitrary) and a binary label space ${{\mathsf Y} = \{0,1\}}$. Let ${{\mathcal F}}$ be a class of functions ${f : {\mathsf X} \rightarrow {\mathsf Y}}$, which we will call the hypothesis space. These functions are our models for the relationship between features and labels, which we wish to learn from data. In order to focus on the essential aspects of the problem, we will adopt the Bayesian approach and assume that the “true” model is a random element of ${{\mathcal F}}$ with a given distribution ${Q}$. We assume that the learning agent knows ${Q}$, so we can think of this distribution as the learner’s prior. Training data consist of input-output pairs ${(x_t,Y_t)}$, where ${t=0,1,2,\ldots}$. Here, each ${x_t}$ is an arbitrary point of ${{\mathsf X}}$, and for now we will treat ${{\bf x} = (x_1,x_2,\ldots)}$ as an individual sequence. The labels ${Y_t}$, though, are random and given by ${Y_t = F(x_t)}$, where ${F \sim Q}$ is the model drawn by Nature from ${Q}$.

The learning agent receives the training data sequentially: at each time ${t=1,2,\ldots}$, it has access to ${(x_1,Y_1),\ldots,(x_t,Y_t)}$. Let us denote the pair ${(x_t,Y_t)}$ by ${Z_t}$. The learning task is: given the next input ${x_{t+1}}$, predict the value ${Y_{t+1}}$. A learning algorithm ${{\mathcal A}}$ is a sequence of possibly randomized mappings ${a_t :(Z^{t-1}, x_{t}) \mapsto \widehat{Y}_t}$: at each time ${t}$, the learning agent uses all available training data to generate a prediction for the label ${Y_{t}}$ of ${x_{t}}$. After generating the prediction, the learning agent observes the true ${Y_{t}}$. We are interested in the probability that ${{\mathcal A}}$ makes an error at time ${t}$:

$\displaystyle e_{{\bf x}}({\mathcal A},t) = Q\big[\widehat{Y}_t \neq Y_t\big] \equiv Q\big[a_t(Z^{t-1},x_t) \neq Y_t\big].$

We will refer to the function ${t \mapsto e_{{\bf x}}({\mathcal A},t)}$ as the learning curve of ${{\mathcal A}}$ [note that it depends on the sequence ${{\bf x}}$ of inputs supplied to the learning agent during learning].

How can we measure the learner’s progress? Let’s write out the causal ordering of all the random variables generated in the process of learning:

$\displaystyle F,\widehat{Y}_1,Z_1,\widehat{Y}_2,Z_2,\ldots,\widehat{Y}_t,Z_t,\ldots$

Here, ${F}$ has distribution ${Q}$, ${\widehat{Y}_t}$ are determined by the learning algorithm ${{\mathcal A}}$, and ${Z_t}$ are determined by ${x_t}$ and ${F}$. Let us denote the joint distribution of all of these variables by ${{\mathbb P}}$. When the learning agent observes ${Z_{t+1}=z_{t}}$ at time ${t}$, we can quantify its information gain relative to time ${t-1}$ by

$\displaystyle {\rm IG}(t) = D({\mathbb P}_{F|Z^{t}=z^{t}} \| {\mathbb P}_F) - D({\mathbb P}_{Z^{t-1}=z^{t-1}} \| {\mathbb P}_F)$

It’s not hard to see that ${{\rm IG}(t)}$ is nonnegative because ${Z^{t}}$ (information at time ${t}$) includes ${Z^{t-1}}$. The expected information gain is given by

$\displaystyle {\mathbb E} [{\rm IG}(t)] = I(F; Z_t | Z^{t-1}),$

the conditional mutual information between ${F}$ and ${Z_t}$ given ${Z^{t-1}}$. For our purposes, it will be convenient to express the information gain in terms of a certain quantity called the volume ratio, which quantifies the rate at which the learning agent can zero in on the true model as more observations keep arriving.

To that end, let us first observe that, at time ${t}$, the agent can localize the unknown model ${F}$ to the set

$\displaystyle {\mathcal V}_t = \left\{ {f \in {\mathcal F}:} f(x_s) = Y_s, 1 \le s \le t \right\}.$

This set is known in learning theory as the version space at time ${t}$. It is easy to see that the version spaces are nested: ${{\mathcal V}_{t} \subseteq {\mathcal V}_{t-1} \subseteq \ldots \subseteq {\mathcal V}_1 \subseteq {\mathcal V}_0 \equiv {\mathcal F}}$. Clearly, ${F \in {\mathcal V}_t}$ for all ${t}$, and for any ${{\mathcal E} \subseteq {\mathcal F}}$

$\displaystyle {\mathbb P}[F \in {\mathcal E}|Z^t=z^t] = Q({\mathcal E} | {\mathcal V}_t) = \frac{Q({\mathcal E} \cap {\mathcal V}_t)}{Q({\mathcal V}_t)}$

This implies, in particular, that

$\displaystyle D({\mathbb P}_{F|Z^{t}=z^{t}} \| {\mathbb P}_F) = \log \frac{1}{Q({\mathcal V}_t)},$

and therefore that

$\displaystyle {\rm IG}(t) = \log \frac{Q({\mathcal V}_{t-1})}{Q({\mathcal V}_{t})},$

where ${Q({\mathcal V}_0) = Q({\mathcal F})=1}$. Since ${F \in \bigcap_t {\mathcal V}_t}$, we expect that each training sample shrinks the version space, and that the learning curve of any good learning algorithm should be controlled by the rate at which ${Q({\mathcal V}_t)}$ decreases. For future convenience, let us define the volume ratio at time ${t}$:

$\displaystyle \xi_t = \frac{Q({\mathcal V}_{t})}{Q({\mathcal V}_{t-1})}.$

Note, by the way, that since ${{\mathcal V}_t \subseteq {\mathcal V}_{t-1}}$, we have

$\displaystyle \xi_t = \frac{Q({\mathcal V}_t \cap {\mathcal V}_{t-1})}{Q({\mathcal V}_{t-1})} = {\mathbb P}[F \in {\mathcal V}_t | F \in {\mathcal V}_{t-1}].$

Consequently,

$\displaystyle {\rm IG}(t) = - \log \xi_t.$

We are going to use binary algorithms, so the units of all information quantities are bits.

Now let’s get down to business. Following Haussler et al., we will consider the following two learning algorithms:

1. The Bayes algorithm ${{\mathcal A}^{\rm Bayes}}$: at time ${t}$, it partitions the version space ${{\mathcal V}_{t-1}}$ into two disjoint subsets

$\displaystyle {\mathcal S}^y_t = \left\{ f \in {\mathcal V}_{t-1} : f(x_t) = y \right\}, \qquad y \in \{0,1\}$

and then predicts

$\displaystyle \widehat{Y}_t = {\bf 1}\left\{Q({\mathcal S}^1_t)-Q({\mathcal S}^0_t) \ge 0\right\}.$

The Bayes algorithm is optimal in the sense that it minimizes the probability of incorrectly predicting ${Y_t}$ given all available information. However, it is an improper learning algorithm in the sense that the function ${x_t \mapsto \widehat{Y}_t}$ will not, in general, be an element of ${{\mathcal F}}$.

2. The Gibbs algorithm ${{\mathcal A}^{\rm Gibbs}}$: at time ${t}$, it draws a random ${\widehat{F}_t \sim Q_{F|F \in {\mathcal V}_{t-1}}}$ and predicts

$\displaystyle \widehat{Y}_t = \widehat{F}_t(x_t).$

The Gibbs algorithm is suboptimal, but, unlike the Bayes algorithm, the function ${x_t \mapsto \widehat{Y}_t}$ does belong to ${{\mathcal F}}$ by construction.

Now let’s analyze their learning curves:

1. The Bayes algorithm will make a mistake whenever ${Q({\mathcal V}_t) < Q({\mathcal V}_{t-1})/2}$, i.e., whenever ${\xi_t < 1/2}$. Indeed, using the fact that, by definition of the version space,

$\displaystyle {\mathcal S}^{Y_t}_t = \left\{ f \in {\mathcal V}_{t-1} : f(x_t) = Y_t \right\} \equiv {\mathcal V}_t,$

we have

$\displaystyle Q({\mathcal V}_t) < Q({\mathcal V}_{t-1})/2 \qquad \Longrightarrow \qquad Q({\mathcal S}^{Y_t}_t) < \frac{Q({\mathcal V}_{t-1})}{2} = \frac{Q({\mathcal S}^{Y_t}_t)+Q({\mathcal S}^{1-Y_t}_t)}{2}.$

Thus, if ${\xi_t < 1/2}$, then ${Q({\mathcal S}^{Y_t}_t) < Q({\mathcal S}^{1-Y_t}_t)}$, and so ${\widehat{Y}_t \neq Y_t}$. On the other hand, if ${\xi_t > 1/2}$, the Bayes algorithm will be correct: ${\widehat{Y_t} = Y_t}$. Finally, if ${\xi_t = 1/2}$, then the Bayes algorithm will make a mistake with probability ${1/2}$. Therefore,

$\displaystyle e_{{\bf x}}({\mathcal A}^{\rm Bayes},t) = {\mathbb E} [\vartheta(1/2-\xi_t)],$

where we have defined the function

$\displaystyle \vartheta(u) = \begin{cases} 0, & u < 0 \\ 1/2, & u = 0 \\ 1, & u > 0 \end{cases}$

2. The Gibbs algorithm will make a mistake whenever ${\widehat{F}_t}$ is not in the version space ${{\mathcal V}_t}$. Since ${\widehat{F}_t}$ is drawn from the posterior distribution of ${F}$ given ${F \in {\mathcal V}_{t-1}}$, we have

$\displaystyle e_{{\bf x}}({\mathcal A}^{\rm Gibbs},t) = {\mathbb P}\left[\widehat{F}_t \not\in {\mathcal V}_t\right] = {\mathbb E} [1-\xi_t].$

Before stating and proving the main result, we need to collect some preliminary facts. First of all, the volume ratio ${\xi_t}$ is always bounded between ${0}$ and ${1}$. Moreover, we have the following useful formula: for any function ${g : [0,1] \rightarrow [0,1]}$,

$\displaystyle {\mathbb E} [g(\xi_t)] = {\mathbb E} [\xi_t g(\xi_t) + (1-\xi_t)g(1-\xi_t)] \ \ \ \ \ (1)$

(the proof is by conditioning on ${Z^{t-1}}$ and using the law of iterated expectation).

Theorem 1 We have the following bounds for the Bayes and the Gibbs algorithms:

$\displaystyle e_{{\bf x}}({\mathcal A}^{\rm Gibbs},t) \ge e_{{\bf x}}({\mathcal A}^{\rm Bayes},t) \ge h^{-1}\left(I(F;Z_t|Z^{t-1})\right) \ \ \ \ \ (2)$

and

$\displaystyle \frac{1}{2}e_{{\bf x}}({\mathcal A}^{\rm Gibbs},t) \le e_{{\bf x}}({\mathcal A}^{\rm Bayes},t) \le e_{{\bf x}}({\mathcal A}^{\rm Gibbs},t) \le \frac{I(F; Z_t | Z^{t-1})}{2}, \ \ \ \ \ (3)$

where ${h^{-1} : [0,1] \rightarrow [0,1/2]}$ is the inverse of the binary entropy function ${h(u) = -u\log u - (1-u)\log(1-u)}$.

Remark 1 Equation (2) shows that the error probability of the Bayes algorithm at time ${t}$ can be bounded below by a function of the expected information gain at time ${t}$. It also shows that the error probability of the Gibbs algorithm is at least as large as that of the Bayes algorithm. On the other hand, Equation (3) gives an upper bound on the error performance of the Bayes algorithm in terms of the expected information gain, and it also shows that the error of the Gibbs algorithm is at most twice as large as that of the Bayes algorithm.

Proof: We start by obtaining the following alternative expressions for the error probabilities of the two algorithms, as well as for the expected information gain:

Using (1) with ${g(u) = - \log u}$, we have

$\displaystyle I(F; Z_t | Z^{t-1}) = {\mathbb E} [-\log \xi_t] = -{\mathbb E} [\xi_t \log \xi_t + (1-\xi_t) \log (1-\xi_t)] = {\mathbb E} [h(\xi_t)].$

This shows, in particular, that the average information gain due to receiving another training sample is no more than one bit.

With ${g(u) = \vartheta(1/2-u)}$, we have

$\displaystyle e_{{\bf x}}({\mathcal A}^{\rm Bayes},t) = {\mathbb E} [\vartheta(1/2-\xi_t)] = {\mathbb E} [\xi_t \vartheta(1/2-\xi_t) + (1-\xi_t) \vartheta(\xi_t - 1/2)] = {\mathbb E}\left[\min\{\xi_t,1-\xi_t\}\right],$

where we have used the fact that, for any ${u \in [0,1]}$,

$\displaystyle u\vartheta(1/2-u) + (1-u)\vartheta(u-1/2) = \min\{u,1-u\}.$

Another way to arrive at this expression is by using the following fact: Suppose we have a biased coin with probability of heads equal to ${p}$. The best prediction of the outcome of a single toss of that coin, with respect to the Hamming loss ${\ell(y,\widehat{y}) = {\bf 1}\{y \neq \widehat{y}\}}$, is to predict heads if ${p \ge 1-p}$ and tails otherwise. The expected loss in that case is equal to ${\min\{p,1-p\}}$. The problem of predicting ${Y_t}$ given ${x_t}$ and ${Z^{t-1}=z^{t-1}}$ is equivalent to the above guessing problem with ${p = \xi_t}$.

Finally, with ${g(u) = 1-u}$, we have

$\displaystyle e_{{\bf x}}({\mathcal A}^{\rm Gibbs},t) = {\mathbb E} [1-\xi_t] = 2{\mathbb E} [\xi_t(1-\xi_t)].$

Again, if we consider the problem of guessing the outcome of tossing a ${{\rm Bernoulli}(p)}$ coin, we can think of the following suboptimal scheme: guess ${H}$ with probability ${p}$ and ${T}$ with probability ${1-p}$ (i.e., just mentally simulate the coin toss). The probability of error is exactly ${2p(1-p)}$, which is also twice the variance of a ${{\rm Bernoulli}(p)}$ random variable. Reasoning conditionally on ${Z^{t-1}=z^{t-1}}$, we see that the Gibbs algorithm is exactly this scheme with ${p = \xi_t}$.

We will also use the following chain of easy-to-prove inequalities: for any ${u \in [0,1]}$,

$\displaystyle u(1-u) \le \min\{u,1-u\} \le 2u(1-u) \le \frac{h(u)}{2}. \ \ \ \ \ (4)$

We start by proving the lower bound, Equation (2). First, note that, for any ${u \in [0,1]}$,

$\displaystyle h^{-1}(h(u)) = \min\{u,1-u\}.$

Moreover, the function ${h^{-1} : [0,1] \rightarrow [0,1]}$ is convex. Therefore,

$\displaystyle e_{{\bf x}}({\mathcal A}^{\rm Bayes},t) = {\mathbb E}\left[\min\{\xi_t,1-\xi_t\}\right] = {\mathbb E} [h^{-1}(h(\xi_t))] \ge h^{-1}\left({\mathbb E} [h(\xi_t)]\right) =h^{-1}\left(I(F;Z_t|Z^{t-1})\right).$

The upper bound, Equation (3), follows by taking ${u=\xi_t}$ in (4) and then taking expectations. $\Box$

Now let’s see what happens when we have to do this over and over again: at each time ${t}$, the learning agent gives a prediction ${\widehat{Y}_t}$, observes ${Y_t}$, updates ${Z_{t-1} \rightarrow Z_t}$, etc. What can we say about the expected fraction of mistakes? Given a learning algorithm ${{\mathcal A}}$, the quantity of interest is

$\displaystyle \bar{e}_{{\bf x}}({\mathcal A},T) = \frac{1}{T}\sum^T_{t=1} e_{{\bf x}}({\mathcal A},t),$

for each ${T=1,2,\ldots}$. In order to get a handle on this quantity, we will need some Vapnik-Chervonenkis combinatorics. Given the sequence ${{\bf x}}$ of input instances, let ${{\mathcal D}_T}$ denote the set of all distinct elements of the tuple ${x^T = (x_1,\ldots,x_T)}$. Say that two hypotheses ${f,f' \in {\mathcal F}}$ are equivalent if ${f(x)=f'(x)}$ for all ${x \in {\mathcal D}_T}$. This defines an equivalence relation on ${{\mathcal F}}$, so we can split it into equivalence classes ${{\mathcal E}_1,\ldots,{\mathcal E}_J}$. Here, the number of equivalence classes ${J}$ depends on ${{\bf x}}$ and on ${T}$. How many such classes can we have? Let us order the elements of ${{\mathcal D}_T}$ arbitrarily, say by ${\{x^{(1)},\ldots,x^{(N)}\}}$, where ${N \le T}$. Each ${f \in {\mathcal F}}$ can be associated with a binary string ${b(f,{\mathcal D}_T) = (f(x^{(1)}),\ldots,f(x^{(N)}))}$. Then ${f}$ and ${f'}$ are in the same equivalence class if and only if they give rise to the same binary strings:

$\displaystyle b(f,{\mathcal D}_T) \equiv b(f',{\mathcal D}_T).$

Consequently,

$\displaystyle J = \left| \left\{ b(f,{\mathcal D}_T) : f \in {\mathcal F} \right\} \right|.$

For any subset ${{\mathcal D} \subseteq {\mathcal D}_T}$, let us define ${b(f,{\mathcal D})}$ in the obvious way: if ${{\mathcal D} = \{x^{(j_1)},\ldots,x^{(j_k)})\}}$, then we let ${b(f,{\mathcal D}) = (f(x^{(j_1)}),\ldots,f(x^{(j_k)}))}$. Let ${d_T({\bf x})}$ be the largest integer ${d \le N}$, such that

$\displaystyle \left| \left\{ b(f,{\mathcal D}) : f \in {\mathcal F} \right\}\right| = 2^{d}$

for all ${{\mathcal D} \subseteq {\mathcal D}_T}$ with ${|{\mathcal D}| = d}$. A deep result, commonly known as the Sauer-Shelah lemma (see here for some fascinating history), then says that

$\displaystyle J \le \sum^{d_T({\bf x})}_{j=0} {N \choose j} \le \left(\frac{Ne}{d_T({\bf x})}\right)^{d_T({\bf x})} \le \left(\frac{Te}{d_T({\bf x})}\right)^{d_T({\bf x})}. \ \ \ \ \ (5)$

Now we are ready to state and prove the following:

Theorem 2

$\displaystyle \bar{e}_{{\bf x}}({\mathcal A}^{\rm Bayes},T) \le \bar{e}_{{\bf x}}({\mathcal A}^{\rm Gibbs},T) \le \frac{d_T({\bf x})}{2T} \log \left(\frac{Te}{d_T({\bf x})} \right). \ \ \ \ \ (6)$

Proof: We have

$\displaystyle \begin{array}{rcl} \bar{e}_{{\bf x}}({\mathcal A}^{\rm Bayes},T) &\le& \bar{e}_{{\bf x}}({\mathcal A}^{\rm Gibbs},T)\\ &\le& \frac{1}{2T}\sum^T_{t=1} {\mathbb E} [-\log \xi_t] \\ &=& \frac{1}{2T}\sum^T_{t=1} {\mathbb E}\left[\log \frac{Q({\mathcal V}_{t-1})}{Q({\mathcal V}_t)}\right] \\ &=& \frac{1}{2T}{\mathbb E}\left[\log \frac{Q({\mathcal V}_0)}{Q({\mathcal V}_T)}\right] \\ &=& \frac{1}{2T}{\mathbb E}\left[\log \frac{1}{Q({\mathcal V}_T)}\right]. \end{array}$

From the definition of the version spaces, it is immediate that ${{\mathcal V}_T}$ is the equivalence class containing the “true” model ${F \sim Q}$. In other words, there exists some random ${j(F) \in \{1,\ldots,J\}}$, such that ${{\mathcal V}_T = {\mathcal E}_{j(F)}}$. Consequently, we can write

$\displaystyle \begin{array}{rcl} {\mathbb E}\left[\log \frac{1}{Q({\mathcal V}_T)}\right] &=& \sum^J_{j=1} {\mathbb E}\left[{\bf 1}\{F \in {\mathcal E}_j\}\log\frac{1}{Q({\mathcal V}_T)}\right] \\ &=& \sum^J_{j=1} {\mathbb E}\left[{\bf 1}\{F \in {\mathcal E}_j\}\log\frac{1}{Q({\mathcal E}_j)}\right] \\ &=& -\sum^J_{j=1} Q({\mathcal E}_j) \log Q({\mathcal E}_j). \end{array}$

We can recognize the quantity in the last line as the Shannon entropy of the index ${j(F)}$: that is,

$\displaystyle \begin{array}{rcl} \bar{e}_{{\bf x}}({\mathcal A}^{\rm Bayes},T) &\le& \bar{e}_{{\bf x}}({\mathcal A}^{\rm Bayes},T) \\ &\le& \frac{H(j(F))}{2T}. \end{array}$

Now, since ${j(F)}$ takes values in ${\{1,\ldots,J\}}$, we can invoke the Sauer-Shelah estimate (5) to get

$\displaystyle H(j(F)) \le \log J \le d_T({\bf x}) \log \left(\frac{T e}{d_T({\bf x})}\right).$

The claimed bound (6) follows. $\Box$

If ${{\mathcal F}}$ is a Vapnik-Chervonenkis class with VC dimension ${d}$, then we immediately obtain:

Corollary 3

$\displaystyle \bar{e}_{{\bf x}}({\mathcal A}^{\rm Bayes},T) \le \bar{e}_{{\bf x}}({\mathcal A}^{\rm Gibbs},T) \le \frac{d}{2T} \log \left(\frac{Te}{d} \right).$