Let’s continue with our magical mystery tour through the lands of divergence.

(image yoinked from Sergio Verdú‘s 2007 Shannon Lecture slides)

Today’s stop is in the machine learning domain. The result I am about to describe has been floating around in various forms in many different papers, but it has been nicely distilled by Hari Narayanan and Sasha Rakhlin in their recent paper on a random walk approach to online convex optimization.

The problem we will be looking at is *online optimization*, a very versatile and broad paradigm for modeling and studying sequential decision-making in the presence of uncertainty. If you are interested in learning the fundamentals of this field, I recommend checking out *Prediction, Learning, and Games* by Nicolò Cesa-Bianchi and Gábor Lugosi. There is also a blog called Inherent Uncertainty devoted primarily to online optimization and learning.

Anyway, the problem is as follows. Consider a repeated game between two players, the Agent (A) and Nature (N). The game takes place in discrete time. The moves of A are points in some set ; the moves of N are functions in some set . The game proceeds according to the following protocol:

- At time
- A selects a point
- N selects a function and announces it to A
- A suffers loss

At the end of the game, the total loss suffered by A is given by . The moves of A are constrained by causality — i.e., may depend only on and on . We also assume that N is *oblivious*, i.e., it does not see the moves of A. (This restriction can be removed.) The objective of A is to devise a strategy for selecting so as to keep the *worst-case regret*

as small as possible. Note that the regret is the worst-case gap between the total cost of incremental decisions made on the basis of causally available information and the total cost that could have been incurred in hindsight, with full knowledge of the costs, by a single point in . It is also possible to introduce the regret against *dynamic* strategies, i.e., sequences of points , but then we must restrict the complexity of these comparator sequences in some way (e.g., require that they vary slowly), and the analysis is a lot messier. Nevertheless, even this simple set-up can describe sequential versions of all sorts of problems, including data compression, prediction, classification, regression, investment, task assignment, etc.

We will allow A to use *randomized strategies*. That is, at each time A can choose a probability distribution over and draw his next move at random according to . As before, A is constrained by causality, so the choice of can depend only on currently available information. We shall therefore compare the *expected* cumulative cost over rounds with the smallest cumulative cost that can be incurred with full knowledge of using a single random choice of a point in . In other words, the regret is

where a strategy of A is a sequence of distributions over , so that for every .

Let’s consider the following natural scheme. Let’s suppose that can be equipped with an appropriate -field, and let us choose a prior probability measure on . Let us also choose a positive constant . Then our strategy is as follows:

- At time
- A samples from a probability measure that has the density
where with the initial condition , and

- N chooses and reveals it to
- A suffers loss

- A samples from a probability measure that has the density

This strategy is very intuitive. Essentially, if a point is bad from the viewpoint of time , i.e., its past cumulative cost was high, then A places a smaller weight on it. On the other hand, if is good, i.e., has small , then it is assigned a larger weight. Let’s see how well this strategy does. We start with the following exact expression:

Theorem 1For any -valued random variable with distribution , we have

*Proof:* We begin by computing the divergence :

Hence,

Summing from to and using the fact that , we obtain

Now let be a random variable with distribution . Then

Collecting everything, we get (1). (Note that we only need to consider those that satisfy , because then for every .)

If we assume more about the functions in , we can particularize the above result in various ways. For instance, here is what we get if the functions in are nonnegative and bounded:

Theorem 2If the functions in are bounded between and , then

*Proof:* Let us inspect the divergence :

Now we recall a nifty bound, due to Wassily Hoeffding (see, for example, Lemma 1 in these lecture notes by Gábor Lugosi): If is a real-valued random variable such that almost surely, then for any we have

We can apply this bound to with , , and to get

Then the terms involving the expectation of will cancel, and we will obtain

Summing over and using the fact that divergence is nonnegative, we obtain (2).

If we know the horizon is advance and if we consider only those comparator distributions , for which is bounded by some , then we can optimize the choice of : letting , we will guarantee that the regret will be bounded by . In other words, the *per-round* regret is . There are also various tricks that can be used when is not known in advance. Of course, sampling from these ‘s may be difficult, especially if has large cardinality or is contained in a high-dimensional space. In fact, the paper by Hari and Sasha addresses the question of whether this can be done efficiently. They present an explicit strategy, based on a modified Metropolis–Hastings algorithm, that can be used to *approximately* sample from the ‘s in a computationally efficient manner.