# The Information Structuralist

Posted in Control, Feedback, Information Theory, Narcissism, Optimization, Papers and Preprints by mraginsky on October 13, 2010

It’s time to fire up the Shameless Self-Promotion Engine again, for I am about to announce a preprint and a paper to be published. Both deal with more or less the same problem — namely, the fundamental limits of certain sequential procedures — and both rely on the same set of techniques: metric entropy, Fano’s inequality, and upper bounds on mutual information via divergences with respect to auxiliary probability measures.

So, without further ado, I give you:

1. MR and Alexander (Sasha) Rakhlin, “Information-based complexity, feedback, and dynamics in sequential convex programming,” arXiv:1010.2285, submitted to IEEE Transactions on Information Theory

Abstract: We study the intrinsic limitations of sequential convex optimization through the lens of feedback information theory. In the oracle model of optimization, an algorithm queries an oracle for noisy information about the unknown objective function, and the goal is to (approximately) minimize every function in a given class using as few queries as possible. We show that, in order for a function to be optimized, the algorithm must be able to accumulate enough information about the objective. This, in turn, puts limits on the speed of optimization under specific assumptions on the oracle and the type of feedback. Our techniques are akin to the ones used in the statistics literature to obtain minimax lower bounds on the risks of estimation procedures; the notable difference is that, unlike in the case of i.i.d. data, a sequential optimization algorithm can gather observations in a controlled manner, so that the amount of information at each step is allowed to change in time. In particular, we show that optimization algorithms often obey the law of diminishing returns: the signal-to-noise ratio drops as the optimization algorithm approaches the optimum. To underscore the generality of the tools, we use our approach to derive fundamental lower bounds for a certain active learning problem. Overall, the present work connects the intuitive notions of “information” in optimization, experimental design, estimation, and active learning to the quantitative notion of Shannon information.

Disclaimer: This work was done in parallel with, and independently of, Alekh Agarwal et al. While some of the results overlap, the emphasis is different. Ours has more of a dynamical systems flavor, asking questions like “how much information can a sequential estimation scheme extract from a particular oracle?” and “how quickly can an arbitrary algorithm reduce its uncertainty?”, while Alekh et al. are motivated by the trade-off between complexity and accuracy in large-scale statistical learning problems.

2. MR, “Divergence-based characterization of fundamental limitations of adaptive dynamical systems,” arXiv:1010.2286, to appear in Proceedings of the 48th Annual Allerton Conference on Communication, Control and Computing

Abstract: Adaptive dynamical systems arise in a multitude of contexts, e.g., optimization, control, communications, signal processing, and machine learning. A precise characterization of their fundamental limitations is therefore of paramount importance. In this paper, we consider the general problem of adaptively controlling and/or identifying a stochastic dynamical system, where our a priori knowledge allows us to place the system in a subset of a metric space (the uncertainty set). We present an information-theoretic meta-theorem that captures the trade-off between the metric complexity (or richness) of the uncertainty set, the amount of information acquired online in the process of controlling and observing the system, and the residual uncertainty remaining after the observations have been collected. Following the approach of Zames, we quantify a priori information by the Kolmogorov (metric) entropy of the uncertainty set, while the information acquired online is expressed as a sum of information divergences. The general theory is used to derive new minimax lower bounds on the metric identification error, as well as to give a simple derivation of the minimum time needed to stabilize an uncertain stochastic linear system.

Both papers deal with what I call playing “Twenty Questions” with a black box. The basic idea is this. We have some “universe” ${{\mathcal F}}$ of objects. Nature selects an arbitrary object ${f \in {\mathcal F}}$. This choice is not revealed to us, and yet our goal is to make some optimal decision regarding this object. While we do not know what ${f}$ is, we can gather information about it by querying a stochastic black box (an oracle). The querying process, shown in the figure below, takes place in discrete time as follows. At time ${t = 0,1,2,\ldots,T:}$

1. We issue a query ${X_t}$, which is a function of all the past queries ${X^{t-1}_0 = (X_0,\ldots,X_{t-1})}$ and all the past oracle outputs ${Y^t_0 = (Y_0,\ldots,Y_t)}$ [here ${Y_0}$ is a fixed initial state of the oracle].
2. The oracle responds with ${Y_{t+1} = \varphi(Y_t,X_t,\xi_t,f)}$, where ${\{ \xi_t\}}$ is a stochastic noise (disturbance) process and ${\varphi(\cdot)}$ is a known mapping.

At time ${T}$ we compute a decision ${\theta_T}$ in some decision space ${\Theta}$ as a function of all the available information ${(X^{T-1}_0,Y^T_0)}$ and incur a cost ${\rho(f,\theta_T)}$, where ${\rho : {\mathcal F} \times \Theta \rightarrow {\mathbb R}^+}$ is some cost function.

The mappings used to compute the queries and the final decision together constitute a policy. The basic question is this: what is the minimum number of queries ${T}$ that any policy must pose so that the expected cost is small, no matter which ${f}$ was chosen by Nature?
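In code, the interaction protocol above looks something like the following schematic sketch. The policy, the oracle map ${\varphi}$, and the noise distribution here are hypothetical placeholders, not anything from the papers:

```python
import random

def run_protocol(policy, oracle_phi, f, T, y0=0.0,
                 noise=lambda: random.gauss(0.0, 1.0)):
    """Run the query/response loop for T steps and return the full record.

    policy(xs, ys) -> next query X_t, a function of past queries and outputs;
    oracle_phi(y, x, xi, f) -> next oracle output Y_{t+1}.
    """
    xs, ys = [], [y0]            # Y_0 is the oracle's fixed initial state
    for t in range(T):
        x_t = policy(xs, ys)     # query depends on (X_0..X_{t-1}, Y_0..Y_t)
        y_next = oracle_phi(ys[-1], x_t, noise(), f)
        xs.append(x_t)
        ys.append(y_next)
    return xs, ys                # the decision theta_T is computed from (xs, ys)
```

For instance, a memoryless noisy-value oracle is `oracle_phi = lambda y, x, xi, f: f(x) + xi`; a policy is any function of the accumulated record.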

The paper with Sasha Rakhlin uses this set-up to study the intrinsic limitations of sequential algorithms for convex optimization. In this setting, the game of twenty questions can be described as follows. Let ${{\mathsf X}}$ be a compact convex subset of ${{\mathbb R}^n}$ and let ${{\mathcal F}}$ be a class of convex functions ${f : {\mathsf X} \rightarrow {\mathbb R}}$. The queries are points in ${{\mathsf X}}$, and suppose that the oracle provides a noisy version of the value and the gradient of ${f}$ at the query point (if ${f}$ is not differentiable, then an arbitrary subgradient is used instead):

$\displaystyle Y_{t+1} = \Big( f(X_t) + \xi^{(0)}_t, \nabla f(X_t) + \xi^{(1)}_t \Big),$

where ${\{\xi_t\}^\infty_{t=0} = \left\{\left(\xi^{(0)}_t,\xi^{(1)}_t\right)\right\}^\infty_{t=0}}$ is, say, a white Gaussian noise process with variance ${\sigma^2}$. The decision space is the problem domain ${{\mathsf X}}$, and the cost is the error of optimization:

$\displaystyle \rho(f,\theta) = f(\theta) - \inf_{x \in {\mathsf X}} f(x).$

The minimum number of queries needed to optimize every ${f \in {\mathcal F}}$ with an error of at most ${\varepsilon}$ is called the information-based complexity of ${{\mathcal F}}$ relative to the given oracle.
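A minimal one-dimensional sketch of this noisy first-order oracle, with an illustrative objective and noise level of my own choosing (the particular ${f}$, step sizes, and ${\sigma}$ are assumptions, not from the paper):

```python
import random

def noisy_first_order_oracle(f, grad_f, sigma):
    """Return an oracle giving noisy (value, gradient) pairs at the query point."""
    def query(x):
        xi0 = random.gauss(0.0, sigma)    # noise on the function value
        xi1 = random.gauss(0.0, sigma)    # noise on the gradient
        return f(x) + xi0, grad_f(x) + xi1
    return query

# Example: noisy projected gradient descent on f(x) = (x - 1)^2 over X = [0, 2]
oracle = noisy_first_order_oracle(lambda x: (x - 1.0) ** 2,
                                  lambda x: 2.0 * (x - 1.0), sigma=0.1)
x = 0.0
for t in range(1, 201):
    _, g = oracle(x)                       # query the oracle at X_t = x
    x = min(max(x - g / (2.0 * t), 0.0), 2.0)  # decaying step, projected onto X
```

The information-based complexity question is then: over a whole class ${{\mathcal F}}$, how many such queries does any algorithm need before ${f(\theta_T) - \inf_x f(x) \le \varepsilon}$?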

The other paper deals with adaptive control and system identification. In that setting, each ${f \in {\mathcal F}}$ represents a stochastic dynamical system with inputs ${\{X_t\}}$ and outputs ${\{Y_t\}}$. Supposing that ${{\mathcal F}}$ is a metric space with some metric ${\rho}$, we wish to identify the system after applying a sequence of ${T}$ inputs and observing the resulting outputs. Thus, the decision ${\theta_T}$ is our estimate of the system. The goal is, again, to determine the smallest number of queries (inputs) we need to apply in order to locate the unknown system up to a ball of radius ${\varepsilon}$. Alternatively, the objective may pertain to something like adaptive stabilization, in which case the decision space is the space of input-output paths of length ${T}$, and the cost function may be something like

$\displaystyle \rho(f,\theta) = \frac{1}{T}\sum^{T-1}_{t=0} c(X_t,Y_{t}) - J^*(f)$

where ${\theta = (X^{T-1}_0,Y^{T}_0)}$ is the input-output trajectory up to time ${T}$, ${c(x,y)}$ is the cost of applying the control (input) ${x}$ when the system is in state ${y}$, and ${J^*(f)}$ is the optimal cost attainable on ${f}$ by any admissible control policy.
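To make the average-cost criterion concrete, here is a toy scalar example of my own devising (the system, controller, and quadratic cost are illustrative assumptions, not the paper’s setting): for ${Y_{t+1} = aY_t + X_t + \xi_t}$ with ${c(x,y) = y^2}$, the optimal policy given ${a}$ is ${X_t = -aY_t}$, so ${J^*(f) = \sigma^2}$, and a mismatched gain pays an excess average cost.

```python
import random

def average_cost_regret(a, a_hat, T, sigma=1.0, seed=0):
    """(1/T) sum_t Y_t^2 minus J*(f) = sigma^2 for the scalar system
    Y_{t+1} = a*Y_t + X_t + xi_t under the linear controller X_t = -a_hat*Y_t."""
    rng = random.Random(seed)
    y, total = 0.0, 0.0
    for _ in range(T):
        x = -a_hat * y                    # control input (the "query")
        total += y * y                    # per-step cost c(X_t, Y_t) = Y_t^2
        y = a * y + x + rng.gauss(0.0, sigma)
    return total / T - sigma ** 2         # excess over the optimal cost J*(f)
```

With `a_hat == a` the regret concentrates near zero; with a mismatch the closed loop ${Y_{t+1} = (a - \hat{a})Y_t + \xi_t}$ pays roughly ${\sigma^2 (a-\hat{a})^2 / (1 - (a-\hat{a})^2)}$ per step.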

So what these two papers present is a unified information-theoretic machinery for tackling questions like these. In a very sketchy form, the argument goes as follows.

1. Suppose that we can endow the universe of objects ${{\mathcal F}}$ with a “distance” ${d(\cdot,\cdot)}$ with the following property. Choose any ${\varepsilon > 0}$. Consider any pair ${f,f' \in {\mathcal F}}$ of objects that are separated by at least ${2\varepsilon}$, i.e., ${d(f,f') \ge 2\varepsilon}$. Suppose that there exists some ${\theta \in \Theta}$ such that ${\rho(f,\theta) < \varepsilon}$. Then we must necessarily have ${\rho(f',\theta) > \varepsilon}$. In other words, a decision that’s ${\varepsilon}$-good for ${f}$ cannot be ${\varepsilon}$-good for any ${f'}$ that is more than ${2\varepsilon}$ away from ${f}$, as measured by ${d}$.
2. Now consider any finite set ${\{f_1,\ldots,f_N\} \subset {\mathcal F}}$ of objects that are ${2\varepsilon}$-separated, i.e.,

$\displaystyle d(f_i,f_j) \ge 2\varepsilon, \qquad \forall i \neq j.$

Suppose that Nature chooses an object from this set uniformly at random. Thus, let ${W \in \{1,\ldots,N\}}$ be a uniformly distributed random variable. [To get tight bounds, we should try to choose the largest finite subset of ${{\mathcal F}}$ that has the required separation property.]
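Such a ${2\varepsilon}$-separated subset (a ${2\varepsilon}$-packing) can always be extracted greedily from any finite candidate pool; here is a sketch, with the metric ${d}$ passed in as a function:

```python
def greedy_packing(candidates, d, eps):
    """Greedily select points that are pairwise at distance >= 2*eps under d."""
    packing = []
    for f in candidates:
        if all(d(f, g) >= 2 * eps for g in packing):
            packing.append(f)
    return packing
```

Running this over a fine discretization of ${{\mathcal F}}$ gives the ${N}$ that enters Fano’s inequality below; the largest achievable ${\log N}$ is, up to constants, the Kolmogorov metric entropy of ${{\mathcal F}}$ at scale ${2\varepsilon}$.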

3. Now consider any policy that takes ${T}$ time steps and outputs a decision ${\theta_T}$. Define the estimator

$\displaystyle \hat{W} = {\rm arg\,min}_{1\le i \le N} \rho(f_i,\theta_T).$

In other words, ${\hat{W}}$ picks the object in ${\{f_1,\ldots,f_N\}}$ for which the decision ${\theta_T}$ is best. Suppose that the policy is so good that, no matter what ${f}$ is, it achieves a cost of at most ${\varepsilon}$ with probability at least ${1-\delta}$. Then the separation property from Step 1 shows that if ${\hat{W} \neq W}$ (in other words, if the above estimator is in error and does not pick out the actual object chosen by Nature), then ${\rho(f_W,\theta_T) \ge \varepsilon}$: indeed, if we had ${\rho(f_W,\theta_T) < \varepsilon}$, then ${\theta_T}$ would be more than ${\varepsilon}$-bad for every other ${f_i}$, forcing ${\hat{W} = W}$. That is,

$\displaystyle \Pr \left( \hat{W} \neq W \right) \le \sup_{f \in {\mathcal F}} \Pr \left( \rho(f,\theta_T) \ge \varepsilon \right) \le \delta.$

What we have shown so far is that a good policy can be used to construct a good estimator.
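The reduction in Step 3 is just an argmin over the packing; in code (with a hypothetical cost function ${\rho}$ passed in, and 1-based indexing to match the text):

```python
def induced_estimator(theta_T, fs, rho):
    """Map the policy's decision theta_T to the packing element it serves best:
    hat_W = argmin over i of rho(f_i, theta_T)."""
    costs = [rho(f, theta_T) for f in fs]
    return 1 + costs.index(min(costs))
```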

4. What we will show next is that this estimator cannot be too good. For that purpose, we call forth the gods of Fano’s inequality. Recall that ${\hat{W}}$ is a function of ${\theta_T}$, which in turn is a function of ${(X^{T-1}_0,Y^T_0)}$. Since ${W}$ is uniformly distributed on a set of size ${N}$, we can lower-bound the probability of error ${\Pr(\hat{W}\neq W)}$ as follows:

$\displaystyle \Pr\left( \hat{W} \neq W \right) \ge 1 - \frac{I\big(W; X^{T-1}_0,Y^T_0\big) + \log 2}{\log N},$

where ${I\big(W; X^{T-1}_0, Y^T_0\big)}$ is the Shannon mutual information between Nature’s random choice ${W}$ and the input-output path ${(X^{T-1}_0,Y^T_0)}$.
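Numerically, evaluating this Fano bound is a one-liner; the only thing to watch is that the mutual information and the logarithms are measured in the same units (natural log throughout here):

```python
import math

def fano_lower_bound(mutual_info, N):
    """Fano lower bound on Pr(hat_W != W) for W uniform on N hypotheses:
    1 - (I(W; data) + log 2) / log N, clamped at zero.  All logs natural."""
    return max(0.0, 1.0 - (mutual_info + math.log(2)) / math.log(N))
```

For example, with ${N = 1024}$ hypotheses and only 2 nats of accumulated information, the error probability is still at least ${1 - (2 + \log 2)/\log 1024 \approx 0.61}$.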

5. Collecting all these bounds, we end up with a trade-off inequality of the form

$\displaystyle I\big(W; X^{T-1}_0, Y^T_0 \big) \ge (1-\delta) \log N - \log 2.$

To complete the argument, we must bound the mutual information from above by something like ${T \Psi(\varepsilon)}$, where ${\Psi}$ is some positive-valued function whose exact form will depend on the complexity (or richness) of ${{\mathcal F}}$, as well as on the structure and the noisiness of the oracle. Once this is done, we get the bound

$\displaystyle T \ge \frac{(1-\delta)\log N - \log 2}{\Psi(\varepsilon)}.$
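Putting the pieces together, the final bound is mechanical once ${N}$ and ${\Psi(\varepsilon)}$ are in hand (the numerical values in the example below are illustrative assumptions; for a Gaussian first-order oracle one expects ${\Psi(\varepsilon)}$ on the order of ${\varepsilon^2/\sigma^2}$):

```python
import math

def query_lower_bound(N, delta, psi_eps):
    """T >= ((1 - delta) * log N - log 2) / Psi(eps), natural logs."""
    return ((1.0 - delta) * math.log(N) - math.log(2)) / psi_eps
```

For instance, a ${2\varepsilon}$-packing of size ${N = 1024}$, confidence ${\delta = 0.1}$, and a per-query information rate ${\Psi(\varepsilon) = 0.01}$ nats force ${T \gtrsim 554}$ queries.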

Steps 1 through 5 in the list above should be immediately recognizable to anyone familiar with the way statisticians derive minimax lower bounds on the risk of various estimation procedures. The new bit, which I will not describe here in a lot of detail (read the papers and find out!), is a systematic procedure for upper-bounding the mutual information as a sum of conditional divergences. Care must be taken here to account for the dependence of future inputs and outputs on the past inputs and outputs, but what we end up with looks something like this:

$\displaystyle I\big( W; X^{T-1}_0, Y^T_0 \big) \le \sum^T_{t=1} D\Big(P_{Y_t|W,X^{t-1}_0,Y^{t-1}_0} \Big\| Q_{Y_t|X^{t-1}_0,Y^{t-1}_0} \Big| P_{W,X^{t-1}_0,Y^{t-1}_0}\Big),$

where the ${P}$-measures describe the actual interaction of the policy with the black box, while the auxiliary ${Q}$-measures are arbitrary (subject to some mild conditions) and describe some nominal (or desired) behavior. The success of the whole enterprise now hinges on choosing the ${Q}$-measures in just the right way. Being really vague and hand-wavy here, I can say that the term corresponding to each ${t}$ measures how close the policy is at time ${t}$ to its claimed goal. This actually brings into focus a kind of law of diminishing returns: the speed at which the policy approaches the optimum is, roughly speaking, inversely related to the rate at which each subsequent query can reduce the remaining uncertainty about ${W}$.
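In the Gaussian-oracle case, the individual divergence terms become explicit KLs between Gaussians. A sketch (the choice of centering the ${Q}$-measure at the oracle’s response at the optimum is an illustrative assumption in the spirit of the argument, not the papers’ exact construction):

```python
def gauss_kl(mu_p, mu_q, sigma):
    """D( N(mu_p, sigma^2) || N(mu_q, sigma^2) ) = (mu_p - mu_q)^2 / (2 sigma^2)."""
    return (mu_p - mu_q) ** 2 / (2.0 * sigma ** 2)
```

If the ${Q}$-measure is centered at the response one would see at the minimizer, the ${t}$-th term behaves like ${\|\nabla f(X_t)\|^2/2\sigma^2}$, which shrinks as ${X_t}$ approaches the optimum — exactly the diminishing signal-to-noise ratio described above.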

Now that this work is done, what strikes me the most is how powerful a tool Fano’s inequality can be (see Mark Reid’s blog post about this, as well). As an information theorist, I use it all the time, but it still amazes me that it can give such tight lower bounds in settings like optimization or control, which (at least superficially) have nothing to do with coding.