# Divergence in everything: bounding the regret in online optimization

Let’s continue with our magical mystery tour through the lands of divergence.

(image yoinked from Sergio Verdú‘s 2007 Shannon Lecture slides)

Today’s stop is in the machine learning domain. The result I am about to describe has been floating around in various forms in many different papers, but it has been nicely distilled by Hari Narayanan and Sasha Rakhlin in their recent paper on a random walk approach to online convex optimization.

The problem we will be looking at is online optimization, a very versatile and broad paradigm for modeling and studying sequential decision-making in the presence of uncertainty. If you are interested in learning the fundamentals of this field, I recommend checking out Prediction, Learning, and Games by Nicolò Cesa-Bianchi and Gábor Lugosi. There is also a blog called Inherent Uncertainty devoted primarily to online optimization and learning.

Anyway, the problem is as follows. Consider a repeated game between two players, the Agent (A) and Nature (N). The game takes place in discrete time. The moves of A are points in some set ${{\mathsf X}}$; the moves of N are functions in some set ${{\mathcal F} = \{ f : {\mathsf X} \rightarrow {\mathbb R} \}}$. The game proceeds according to the following protocol:

• At time ${t=1,\ldots,T}$
• A selects a point ${x_t \in {\mathsf X}}$
• N selects a function ${f_t \in {\mathcal F}}$ and announces it to A
• A suffers loss ${f_t(x_t)}$

At the end of the game, the total loss suffered by A is given by ${\sum^T_{t=1}f_t(x_t)}$. The moves of A are constrained by causality — i.e., ${x_t}$ may depend only on ${x_1,\ldots,x_{t-1}}$ and on ${f_1,\ldots,f_{t-1}}$. We also assume that N is oblivious, i.e., it does not see the moves of A. (This restriction can be removed.) The objective of A is to devise a strategy for selecting ${\{x_t\}^T_{t=1}}$ so as to keep the worst-case regret

$\displaystyle R_T = \sup_{f_1,\ldots,f_T \in {\mathcal F}}\left\{\sum^T_{t=1} f_t(x_t) - \inf_{u \in {\mathsf X}}\sum^T_{t=1} f_t(u)\right\}.$

as small as possible. Note that the regret is the worst-case gap between the total cost of incremental decisions made on the basis of causally available information and the total cost that could have been incurred in hindsight, with full knowledge of the costs, by a single point in ${{\mathsf X}}$. It is also possible to introduce the regret against dynamic strategies, i.e., sequences of points ${u_1,\ldots,u_T \in {\mathsf X}}$, but then we must restrict the complexity of these comparator sequences in some way (e.g., require that they vary slowly), and the analysis is a lot messier. Nevertheless, even this simple set-up can describe sequential versions of all sorts of problems, including data compression, prediction, classification, regression, investment, task assignment, etc.

We will allow A to use randomized strategies. That is, at each time ${t}$ A can choose a probability distribution ${P_{t-1}}$ over ${{\mathsf X}}$ and draw his next move ${X_t}$ at random according to ${P_{t-1}}$. As before, A is constrained by causality, so the choice of ${P_{t-1}}$ can depend only on currently available information. We shall therefore compare the expected cumulative cost over ${T}$ rounds with the smallest cumulative cost that can be incurred with full knowledge of ${f_1,\ldots,f_T}$ using a single random choice of a point in ${{\mathsf X}}$. In other words, the regret is

$\displaystyle R_T = \sup_{f_1,\ldots,f_T} \sup_{P_U}\mathop{\mathbb E} \left\{\sum^T_{t=1}f_t(X_t) - f_t(U)\right\},$

where a strategy of A is a sequence ${\{P_{t-1}\}^T_{t=1}}$ of distributions over ${{\mathsf X}}$, so that ${X_t \sim P_t}$ for every ${t}$.

Let’s consider the following natural scheme. Let’s suppose that ${{\mathsf X}}$ can be equipped with an appropriate ${\sigma}$-field, and let us choose a prior probability measure ${P_0}$ on ${{\mathsf X}}$. Let us also choose a positive constant ${\eta > 0}$. Then our strategy is as follows:

• At time ${t=1,\ldots,T}$
• A samples ${X_t}$ from a probability measure ${P_{t-1}}$ that has the density

$\displaystyle p_{t-1}(x) = \frac{dP_{t-1}(x)}{dP_0} = \frac{e^{-\eta F_{t-1}(x)}}{Z_{t-1}},$

where ${F_{t-1}(x) = \sum^{t-1}_{s=1}f_s(x)}$ with the initial condition ${F_0(x) \equiv 0}$, and

$\displaystyle Z_{t-1} = \int_{\mathsf X} e^{-\eta F_{t-1}(x)}P_0(dx).$

• N chooses ${f_t}$ and reveals it to ${A}$
• A suffers loss ${f_t(X_t)}$

This strategy is very intuitive. Essentially, if a point ${x \in {\mathsf X}}$ is bad from the viewpoint of time ${t}$, i.e., its past cumulative cost ${F_{t-1}(x)}$ was high, then A places a smaller weight on it. On the other hand, if ${x}$ is good, i.e., has small ${F_{t-1}(x)}$, then it is assigned a larger weight. Let’s see how well this strategy does. We start with the following exact expression:

Theorem 1 For any ${{\mathsf X}}$-valued random variable ${U}$ with distribution ${P_U}$, we have

$\displaystyle \mathop{\mathbb E}\left\{ \sum^T_{t=1} f_t(X_t) - \sum^T_{t=1} f_t(U)\right\} = \eta^{-1} \left( D(P_U \| P_0) - D(P_U \| P_T)\right) + \eta^{-1}\sum^T_{t=1} D(P_{t-1}\| P_t). \ \ \ \ \ (1)$

Proof: We begin by computing the divergence ${D(P_{t-1}\|P_t)}$:

$\displaystyle \begin{array}{rcl} D(P_{t-1}\| P_t) &=& \int_{\mathsf X} \log \frac{dP_{t-1}(x)}{dP_t} P_{t-1}(dx) \\ &=& \int_{\mathsf X} \log \frac{dP_{t-1}(x)}{dP_0} P_{t-1}(dx) - \int_{\mathsf X} \log \frac{dP_t(x)}{dP_0} P_{t-1}(dx) \\ &=& \int_{\mathsf X} \log \frac{e^{-\eta F_{t-1}(x)}}{Z_{t-1}} P_{t-1}(dx) - \int_{\mathsf X} \log \frac{e^{-\eta F_t(x)}}{Z_t} P_{t-1}(dx) \\ &=& \eta\ \mathop{\mathbb E}\left[F_t(X_t) - F_{t-1}(X_t) \right] + \log\frac{Z_t}{Z_{t-1}} \\ &=& \eta\ \mathop{\mathbb E} f_t(X_t) + \log\frac{Z_t}{Z_{t-1}}. \end{array}$

Hence,

$\displaystyle \mathop{\mathbb E} f_t(X_t) = \eta^{-1}\left(D(P_{t-1}\| P_t) - \log \frac{Z_t}{Z_{t-1}}\right).$

Summing from ${t=1}$ to ${t=T}$ and using the fact that ${Z_0 \equiv 1}$, we obtain

$\displaystyle \mathop{\mathbb E}\left\{ \sum^T_{t=1}f_t(X_t)\right\} = \eta^{-1}\left( \sum^T_{t=1} D(P_{t-1}\| P_t) - \log Z_T\right).$

Now let ${U}$ be a random variable with distribution ${P_U}$. Then

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}\left\{\sum^T_{t=1} f_t(U)\right\} &=& \mathop{\mathbb E} F_T(U) \\ &=& -\eta^{-1}\left( \int_{\mathsf X} \log \frac{dP_T(u)}{dP_0} P_U(du) + \log Z_T \right) \\ &=& -\eta^{-1}\left( \int_{\mathsf X} \log \frac{dP_T(u)}{dP_U} P_U(du) + \int_{\mathsf X} \log \frac{dP_U(u)}{dP_0} P_U(du) + \log Z_T \right) \\ &=& -\eta^{-1}\left( D(P_U \| P_0) - D(P_U \| P_T) + \log Z_T\right). \end{array}$

Collecting everything, we get (1). (Note that we only need to consider those ${P_U}$ that satisfy ${D(P_U \| P_0) < \infty}$, because then ${D(P_U \| P_t) < \infty}$ for every ${t}$.) $\Box$

If we assume more about the functions in ${{\mathcal F}}$, we can particularize the above result in various ways. For instance, here is what we get if the functions in ${{\mathcal F}}$ are nonnegative and bounded:

Theorem 2 If the functions in ${{\mathcal F}}$ are bounded between ${0}$ and ${1}$, then

$\displaystyle \mathop{\mathbb E}\left\{ \sum^T_{t=1} f_t(X_t) - \sum^T_{t=1} f_t(U)\right\} \le \eta^{-1} D(P_U \| P_0) + T\eta/8. \ \ \ \ \ (2)$

Proof: Let us inspect the divergence ${D(P_{t-1}\| P_t)}$:

$\displaystyle \begin{array}{rcl} D(P_{t-1}\|P_t) &=& \eta \mathop{\mathbb E} f_t(X_t) + \log\frac{Z_t}{Z_{t-1}} \\ &=& \eta \mathop{\mathbb E} f_t(X_t) + \log \frac{ \int_{\mathsf X} e^{-\eta F_t(x)}P_0(dx)}{Z_{t-1}} \\ &=& \eta \mathop{\mathbb E} f_t(X_t) + \log \frac{ \int_{\mathsf X} e^{-\eta f_t(x)} e^{-\eta F_{t-1}(x)}P_0(dx)}{Z_{t-1}} \\ &=& \eta \mathop{\mathbb E} f_t(X_t) + \log \mathop{\mathbb E} e^{-\eta f_t(X_{t})}. \end{array}$

Now we recall a nifty bound, due to Wassily Hoeffding (see, for example, Lemma 1 in these lecture notes by Gábor Lugosi): If ${Z}$ is a real-valued random variable such that ${a \le Z \le b}$ almost surely, then for any ${s \in {\mathbb R}}$ we have

$\displaystyle \log \mathop{\mathbb E} e^{sZ} \le s \mathop{\mathbb E} Z + \frac{s^2 (b-a)^2}{8}.$

We can apply this bound to ${Z = f_t(X_{t-1})}$ with ${s=-\eta}$, ${a = 0}$, and ${b = 1}$ to get

$\displaystyle \log \mathop{\mathbb E} e^{-\eta f_t(X_{t})} \le -\eta\ \mathop{\mathbb E} f_t(X_t) + \frac{\eta^2}{8}.$

Then the terms involving the expectation of ${f_t(X_t)}$ will cancel, and we will obtain

$\displaystyle D(P_{t-1}\|P_t) \le \frac{\eta^2}{8}.$

Summing over ${t}$ and using the fact that divergence is nonnegative, we obtain (2). $\Box$

If we know the horizon ${T}$ is advance and if we consider only those comparator distributions ${P_U}$, for which ${D(P_U \| P_0)}$ is bounded by some ${D < \infty}$, then we can optimize the choice of ${\eta}$: letting ${\eta = \sqrt{8D/T}}$, we will guarantee that the regret will be bounded by ${\sqrt{DT/2}}$. In other words, the per-round regret is ${O(1/\sqrt{T})}$. There are also various tricks that can be used when ${T}$ is not known in advance. Of course, sampling from these ${P_t}$‘s may be difficult, especially if ${{\mathsf X}}$ has large cardinality or is contained in a high-dimensional space. In fact, the paper by Hari and Sasha addresses the question of whether this can be done efficiently. They present an explicit strategy, based on a modified Metropolis–Hastings algorithm, that can be used to approximately sample from the ${P_t}$‘s in a computationally efficient manner.