Machine learning is about enabling computers to improve their performance on a given task as they get more data. Can we express this intuition quantitatively using information-theoretic techniques? In this post, I will discuss a classic paper by David Haussler, Michael Kearns, and Robert Schapire that (to the best of my knowledge) took the first step in this direction. In this post, I will describe some of their results, recast in a more explicitly information-theoretic way.
Sasha Rakhlin and I will be presenting our paper “Lower bounds for passive and active learning” at this year’s NIPS, which will be taking place in Granada, Spain from December 12 to December 15. The proofs of our main results rely heavily on information-theoretic techniques, specifically the data processing inequality for -divergences and a certain type of constant-weight binary codes.
Alekh Agarwal and Sasha Rakhlin are organizing a workshop at this year’s NIPS. I’m on the program committee, so it is my duty (and distinct pleasure) to invite you all to peruse the full call for papers here, or at least to check out this key snippet:
We would like to welcome high-quality submissions on topics including but not limited to:
- Fundamental statistical limits with bounded computation
- Trade-offs between statistical accuracy and computational costs
- Computation-preserving reductions between statistical problems
- Algorithms to learn under budget constraints
- Budget constraints on other resources (e.g. bounded memory)
- Computationally aware approaches such as coarse-to-fine learning
Interesting submissions in other relevant topics not listed above are welcome too. Due to the time constraints, most accepted submissions will be presented as poster spotlights.
Oh, and did I mention that the workshop will take place in mid-December in Sierra Nevada, Spain?
Obligatory disclaimer: YMMV, “favorite” does not mean “best,” etc. etc.
- Emmanuel Abbe and Andrew Barron, “Polar coding schemes for the AWGN channel” (pdf)
- Tom Cover, “On the St. Petersburg paradox”
- Paul Cuff, Tom Cover, Gowtham Kumar, Lei Zhao, “A lattice of gambles”
- Ioanna Ioannou, Charalambos Charalambous, Sergey Loyka, “Outage probability under channel distribution uncertainty” (pdf; longer version: arxiv:1102.1103)
- Mohammad Naghshvar, Tara Javidi, “Performance bounds for active sequential hypothesis testing”
- Chris Quinn, Negar Kiyavash, Todd Coleman, “Equivalence between minimal generative model graphs and directed information graphs” (pdf)
- Ofer Shayevitz, “On Rényi measures and hypothesis testing” (long version: arxiv:1012.4401)
The problem of constructing polar codes for channels with continuous input and output alphabets can be reduced, in a certain sense, to the problem of constructing finitely supported approximations to capacity-achieving distributions. This work analyzes several such approximations for the AWGN channel. In particular, one approximation uses quantiles and approaches capacity at a rate that decays exponentially with support size. The proof of this fact uses a neat trick of upper-bounding the Kullback-Leibler divergence by the chi-square distance and then exploiting the law of large numbers.
A fitting topic, since this year’s ISIT took place in St. Petersburg! Tom has presented a reformulation of the problem underlying this (in)famous paradox in terms of finding the best allocation of initial capital so as to optimize various notions of relative wealth. This reformulation obviates the need for various extra assumptions, such as diminishing marginal returns (i.e., concave utilities), and thus provides a means of resolving the paradox from first principles.
There is a well-known correspondence between martingales and “fair” gambling systems. Paul and co-authors explore another correspondence, between fair gambles and Lorenz curves used in econometric modeling, to study certain stochastic orderings and transformations of martingales. There are nice links to the theory of majorization and, through that, to Blackwell’s framework for comparing statistical experiments in terms of their expected risks.
The outage probability of a general channel with stochastic fading is the probability that the conditional input-output mutual information given the fading state falls below the given rate. In this paper, it is assumed that the state distribution is not known exactly, but there is an upper bound on its divergence from some fixed “nominal” distribution (this model of statistical uncertainty has been used previously in the context of robust control). The variational representation of the divergence (as a Legendre-Fenchel transform of the moment-generating function) then allows for a clean asymptotic analysis of the outage probability.
Mohammad and Tara show how dynamic programming techniques can be used to develop tight converse bounds for sequential hypothesis testing problems with feedback, in which it is possible to adaptively control the quality of the observation channel. This viewpoint is a lot cleaner and more conceptually straightforward than “classical” proofs based on martingales (à la Burnashev). This new technique is used to analyze asymptotically optimal strategies for sequential -ary hypothesis testing, variable-length coding with feedback, and noisy dynamic search.
For networks of interacting discrete-time stochastic processes possessing a certain conditional independence structure (motivating example: discrete-time approximations of smooth dynamical systems), Chris, Negar and Todd show the equivalence between two types of graphical models for these networks: (1) generative models that are minimal in a certain “combinatorial” sense and (2) information-theoretic graphs, in which the edges are drawn based on directed information.
Ofer obtained a new variational characterization of Rényi entropy and divergence that considerably simplifies their analysis, in many cases completely replacing delicate arguments based on Taylor expansions with purely information-theoretic proofs. He also develops a new operational characterization of these information measures in terms of distributed composite hypothesis testing.
Gather round for another tale of the mighty divergence and its adventures!
This time I want to show how the well-known Cramér–Rao lower bound (or the information inequality) for parameter estimation can be derived from one of the most fundamental results in information theory: the data processing theorem for divergence. What is particularly nice about this derivation is that, along the way, we get to see how the Fisher information controls local curvature of the divergence between two members of a parametric family of distributions.
I gave the last lecture earlier today, wrapping up the semester. Here are the notes from the last two weeks:
- Regression with quadratic loss, mostly in reproducing kernel Hilbert spaces, with and without regularization.
- Case study: stochastic simulation via Rademacher bootstrap, where I discuss the work of Vladimir Koltchinskii et al. on efficient stopping algorithms for Monte Carlo stochastic simulation. The idea is to keep sampling until the empirical Rademacher average falls below a given threshold. Once that happens, you stop and compute a minimizer of the empirical risk. The work of Koltchinskii et al. was in turn inspired by the ideas of Mathukumalli Vidyasagar on the use of statistical learning theory in randomized algorithms for robust controller synthesis.
Monday’s lecture was on stochastic gradient descent as an alternative to batch empirical risk minimization. I will post the notes soon.
More learning-theoretic goodness:
- Case study: empirical quantizer design, where I discuss beautiful work by Tamás Linder et al. that uses VC theory to bound the performance of empirically designed vector quantizers (which is engineering jargon for consistency of the method of k-means).
- Binary classification: from the classic bounds for linear and generalized linear discriminant rules to modern techniques based on surrogate losses; voting methods; kernel machines; Convex Risk Minimization.
It’s that time again, the time to savor the dreamy delights of divergence!
In this post, we will look at a powerful information-theoretic method for deriving concentration-of-measure inequalities (i.e., tail bounds) for general functions of independent random variables.
Having covered generalization bounds for abstract ERM via Rademacher averages, we now trace the historical development of the field:
- Vapnik-Chervonenkis classes: shatter coefficients; VC dimension; examples of VC classes; Sauer-Shelah lemma; implication for Rademacher averages
This semester, I am teaching a graduate course on Statistical Learning Theory in my department. My aim is to introduce graduate students in electrical engineering to such things as Empirical Risk Minimization, generalization bounds, model selection, complexity regularization, minimax lower bounds &c., with certain examples of applications to information theory, signal processing, and adaptive control.
Unfortunately, I have not come across a textbook that would be suitable for teaching statistical learning theory to graduate students of engineering. Most texts out there are skewed towards either the computer science side of things or the mathematical statistics crowd. The only book that comes somewhat close to what I had in mind is M. Vidyasagar‘s Learning and Generalization, but its $169 price tag has stopped me from officially adopting it. Instead, I have been preparing my own lecture notes. As the class goes on, I will be posting additional notes here, but here is what I have covered so far:
- Introduction: learning from examples in a probabilistic setting; goals of learning; basics of statistical decision theory; estimation and approximation errors
- Concentration inequalities: Markov’s and Chebyshev’s inequalities; the Chernoff bounding trick; Hoeffding’s inequality; McDiarmid’s inequality; examples
- Formulation of the learning problem: concept and function learning in the realizable case; PAC learning; model-free learning; Empirical Risk Minimization and its consistency
- More on Empirical Risk Minimization: error bounds via Rademacher averages
Normally, I write these in frenetic two-hour bursts, so they are more than likely full of misconceptions, omissions, appalling lack of rigor, bugs, typos, and the like. Caveat lector!