# The Information Structuralist

## Lower bounds for passive and active learning

Sasha Rakhlin and I will be presenting our paper “Lower bounds for passive and active learning” at this year’s NIPS, which will be taking place in Granada, Spain from December 12 to December 15. The proofs of our main results rely heavily on information-theoretic techniques, specifically the data processing inequality for ${f}$-divergences and a certain type of constant-weight binary codes.

The paper deals with the well-known binary classification problem: We have two correlated random variables ${X \in {\mathsf X}}$ and ${Y \in \{0,1\}}$, where traditionally ${X}$ is referred to as an instance or a feature and ${Y}$ as the (binary) label. The joint distribution ${P \equiv P_{XY}}$ of ${X}$ and ${Y}$ is unknown, and the goal is to learn a classifier, i.e., a mapping ${f : {\mathsf X} \rightarrow \{0,1\}}$, whose probability of error is as small as possible. More formally, the accuracy of a classifier ${f}$ is measured by its excess risk w.r.t. ${P}$:

$\displaystyle E_P(f) = P\big\{f(X) \neq Y\big\} - \inf_{g : {\mathsf X} \rightarrow \{0,1\}} P\big\{g(X) \neq Y\big\}.$

It is known that the infimum on the right-hand side is achieved by the Bayes classifier, which has the following form: Given ${P}$, let ${\eta(x) := {\mathbb E}[Y|X=x]}$ be the so-called regression function. Then the Bayes classifier is

$\displaystyle f^*(x) = \begin{cases} 1, & \text{if } \eta(x) \ge 1/2 \\ 0, & \text{otherwise} \end{cases}$

In other words, the optimum strategy is to classify a given ${x \in {\mathsf X}}$ as a ${1}$ if and only if the conditional probability (under ${P}$) that ${Y=1}$ given ${X=x}$ is at least half.

The classical set-up for this problem assumes that the learning agent has access to a large number ${n}$ of i.i.d. samples ${(X_1,Y_1),\ldots,(X_n,Y_n)}$ from ${P}$, and can use these samples to select a candidate classifier ${\widehat{f}_n}$ from some fixed class ${{\mathcal F}}$. The goal of learning is to ensure that, with high probability, the candidate classifier ${\widehat{f}_n}$ is as close to ${f^*}$ as possible. This is the passive learning model, since the learning agent has no control over the process of collecting the data. By contrast, under the active model, at each time step ${t=1,2,\ldots}$ the learner selects the new feature ${X_t \in {\mathsf X}}$ based on all the past data ${(X_1,Y_1),\ldots,(X_{t-1},Y_{t-1})}$, and then requests the corresponding label ${Y_t \sim P_{Y|X}}$ from an “oracle.” When a suitably large number ${n}$ of feature-label pairs has been collected, the learner outputs a candidate classifier ${\widehat{f}_n \in {\mathcal F}}$, whose performance is measured, as before, by the excess risk.

Remark 1 This is not the only active learning model out there. Another widely studied setting, which can be more accurately called “selective sampling,” involves i.i.d. training samples just like in the passive case, but the learner accesses them sequentially and has the freedom to decide, for each of the examples, whether or not he wants the corresponding label to be revealed. In this set-up, unlabeled features are essentially free; only the number of label requests matters.

Clearly, the active learning model is stronger than the passive model. But how much do we gain by allowing active learning? One way to quantify it is to look at sample complexity of the underlying learning problem: given some class ${{\mathcal P}}$ of possible distributions of ${(X,Y)}$ and a class ${{\mathcal F}}$ of candidate classifiers, what is the minimum number of feature-label pairs that the learner needs to see in order to achieve a given level of excess risk with a given probability of success? Here the probability of success is computed w.r.t. the probability measure that governs the data-gathering process, so in the passive case it’s just ${P^n}$, while in the active case it is given by interconnecting the agent’s (possibly stochastic) rules for selecting ${X_t}$ on the basis of ${X^{t-1},Y^{t-1}}$ for each ${t}$ with the conditional probability distribution ${P_{Y|X}}$, while respecting the causal ordering

$\displaystyle X_1,Y_1,X_2,Y_2,\ldots,X_t,Y_t,\ldots$

The sample complexity is an “information-theoretic” limit in the sense that no strategy, no matter how clever or computationally powerful, can make do with fewer training samples.

To date, there has been quite a great deal of work on developing and analyzing algorithms for active learning — see, e.g., the papers by Steve Hanneke (see also the preprint) and Vladimir Koltchinskii and references therein. A theoretical analysis of any given algorithm gives us upper bounds on the sample complexity, so if we find an upper bound for active learning that is much lower than existing tight lower bounds for passive learning, then we can claim that active learning indeed does help. Both Hanneke and Koltchinskii show that the sample complexity of a number of very natural schemes for active learning can be upper-bounded in terms of what Hanneke has called the disagreement coefficient, which is defined as follows. Consider the underlying distribution ${P}$ of ${(X,Y)}$ and a class ${{\mathcal F}}$ of candidate classifiers. For each ${\epsilon \in [0,1]}$ define the ${\epsilon}$-minimal set

$\displaystyle {\mathcal F}_\epsilon(f^*) = \Big\{ f \in {\mathcal F}: P\big\{f(X) \neq f^*(X)\big\} \le \epsilon \Big\},$

where ${f^*}$ is the Bayes classifier corresponding to ${P}$. In words, ${{\mathcal F}_\epsilon}$ consists of all classifiers ${f \in {\mathcal F}}$ that disagree with the Bayes classifier ${f^*}$ with ${P}$-probability at most ${\epsilon}$. Next, define the disagreement set

$\displaystyle D_\epsilon(f^*) = \Big\{ x \in {\mathsf X}: \exists f \in {\mathcal F}_\epsilon(f^*) \text{ such that } f(x) \neq f^*(x)\Big\}.$

In words, the disagreement set consists of all features ${x \in {\mathsf X}}$, for which there exists at least one classifier ${f}$ in the ${\epsilon}$-minimal set that disagrees with ${f^*}$ at ${x}$. Now consider the quantity

$\displaystyle \tau(\epsilon) := \frac{P\big\{D_\epsilon(f^*)\big\}}{\epsilon}, \ \ \ \ \ (1)$

which measures the “size” of the disagreement set (relative to the “radius” ${\epsilon}$ of ${{\mathcal F}_\epsilon(f^*)}$).

Remark 2 The function ${\tau(\epsilon)}$ defined in (1) has been used by Alexander in his work on deviation inequalities for empirical processes, and more recently by Giné and Koltchinskii in their work on passive learning. The latter authors have termed the function ${\tau}$ the Alexander capacity.

The disagreement coefficient for the pair ${(P,{\mathcal F})}$ is then defined as

$\displaystyle \tau_0 := \sup_{\epsilon > 0} \tau(\epsilon).$

Hanneke’s paper contains several examples of how the disagreement coefficient may be computed (or bounded) for specific learning problems. Moreover, if ${\tau_0}$ is finite, then a natural active learning scheme can be based on the observation that only the features in the disagreement set are “informative.” Of course, the disagreement set depends on the unknown distribution, but, as Koltchinskii has shown, it is possible to estimate this set from data. In particular, if the regression function ${\eta}$ corresponding to the distribution ${P_{XY}}$ has margin ${h > 0}$, i.e., if ${|2\eta(x)-1| \ge h}$ for all ${x \in {\mathsf X}}$, then the algorithm proposed by Koltchinskii needs on the order of

$\displaystyle \frac{\tau_0 \log (1/\epsilon)}{h^2} \left[d \log \tau_0 + \log (1/\delta)\right] \ \ \ \ \ (2)$

examples in order to attain an excess risk of at most ${\epsilon}$ with probability at least ${1-\delta}$, where ${d}$ is the VC dimension of the class ${{\mathcal F}}$.

What Sasha and I have proved is a minimax lower bound on the sample complexity of active learning under the margin assumption. Roughly speaking, we have shown that, for any admissible capacity function ${\tau}$ and any sufficiently small ${\epsilon}$, ${\delta}$ and ${h}$, there exists a choice of the pair ${(P,{\mathcal F})}$ that has margin ${h}$, whose Alexander capacity at ${\epsilon}$ is equal to ${\tau(\epsilon)}$ (it may be different for other values ${\epsilon'}$), and any active learning algorithm that attains excess risk of at most ${h \epsilon}$ with ${P}$-probability at least ${1-\delta}$ needs at least order of

$\displaystyle \frac{(1-\delta)d \log \tau(\epsilon)}{h^2} + \frac{\tau(\epsilon)\log (1/\delta)}{h^2} \ \ \ \ \ (3)$

examples. We have also proved the corresponding lower bound for passive learning: ${n}$ must be at least order of

$\displaystyle \frac{(1-\delta)d \log \tau(\epsilon)}{\epsilon h^2} + \frac{\log (1/\delta)}{\epsilon h^2}. \ \ \ \ \ (4)$

Note that, in contrast with Koltchinskii’s upper bound (2) that involves the supremum ${\tau_0}$ of Alexander’s capacity, our lower bound actually depends on the value of the capacity at the given ${\epsilon}$. Thus, we can quantify the relative advantage of active learning over passive learning in terms of the rate of growth of ${\tau}$ as a function of ${\epsilon}$. For instance, if ${\tau(\epsilon) \sim 1/\epsilon}$, then active learning has very little advantage over passive learning, since both will require at least ${\log(1/\delta)/\epsilon h^2}$ examples. On the other hand, if ${\tau(\epsilon)}$ is close to ${\tau_0}$, then the best passive learner will need a factor of ${1/\epsilon}$ more examples than the best active learner.

As I’ve mentioned earlier, our proof of the bounds (3) and (4) makes essential use of the data processing inequality for ${f}$-divergences (in the paper, we use the term “${\phi}$-divergence,” since ${f}$ is reserved for a generic classifier). This technique is much stronger than any method based on Fano’s inequality, since the latter generally cannot give the term proportional to ${\log (1/\delta)}$ and, more importantly, gives very loose bounds for the active case. This is where the freedom of choosing a suitable ${f}$-divergence comes to save the day, since it allows us to decouple the conditional information gain at each time step from the variables that describe the “global” behavior of the learning algorithm (e.g., the total number of times a given feature point has been queried by the learner). It should be pointed out that we were definitely not the first to use the data processing inequality for ${f}$-divergences in a statistical context. It was used implicitly by Adityanand Guntuboyina and explicitly by Alexander Gushchin (in a really obscure paper that you’ve probably never heard of) to improve upon Fano’s method for deriving minimax lower bounds on the risk of passive statistical estimation procedures.