# The Information Structuralist

## Information theory in economics, Part II: Robustness

Posted in Echoes of Cybernetics, Economics, Games and Decisions, Information Theory by mraginsky on July 20, 2012

As we have seen in Part I, the rational inattention framework of Christopher Sims aims to capture the best a rational agent can do when his capacity for processing information is limited. This rationally inattentive agent, however, has no reason to question his statistical model. In this post we will examine the robustness framework of Thomas Sargent, which deals with the issue of model uncertainty, but does not assume any capacity limitations.

Again, for simplicity I will stick to single-stage problems. The setting is exactly the same as before: we have an agent who must make a decision ${U \in {\mathsf U}}$ pertaining to some state of the world ${X \in {\mathsf X}}$ on the basis of some signal ${Z \in {\mathsf Z}}$ correlated with ${X}$. If the quality of a decision procedure ${\gamma : {\mathsf Z} \rightarrow {\mathsf U}}$ is measured in terms of its expected cost ${J(\gamma) = {\mathbb E}[c(X,\gamma(Z))]}$, then the agent faces the optimization problem

$\displaystyle \inf_\gamma J(\gamma),$

and the optimal decision function ${\gamma^* : {\mathsf Z} \rightarrow {\mathsf U}}$ is given by

$\displaystyle \gamma^*(z) = \mathop{\text{arg\,min}}_{u \in {\mathsf U}}{\mathbb E}[c(X,u)|Z=z].$

In order to compute ${\gamma^*}$, the agent must know ${P_X}$ (the distribution of ${X}$) and the observation model ${P_{Z|X}}$ that relates the observed signal ${Z}$ to the unobserved state ${X}$. In other words, the agent must have a probabilistic model, embodied in the joint distribution ${P = P_{X,Z}}$ of ${X}$ and ${Z}$.

All of this is standard fare in the realm of Bayesian rationality. But now let’s suppose that the agent treats the model ${P}$ only as an approximation that was adopted for reasons of computational tractability, limited knowledge, etc. What should the agent do in this situation? Sargent’s proposal, inspired by ideas from robust control and game theory, is this: instead of simply optimizing the decision procedure to the “nominal” model ${P}$ (which may, after all, prove to be inaccurate), the agent should hedge his bets and allow for the fact that the “true” model (which we will denote by ${Q}$) may differ from the nominal model ${P}$, but the difference is bounded. While there are many ways of quantifying this model uncertainty, Sargent has opted for the divergence bound ${D(Q \| P) \le R}$. The parameter ${R > 0}$ is chosen by the agent and reflects his degree of confidence in ${P}$: the smaller this ${R}$, the more the agent will tend to trust his model-building skills. With the problem framed in this way, the natural strategy is the minimax one: if we define, for any distribution ${Q}$ of ${(X,Z)}$ and for any strategy ${\gamma : {\mathsf Z} \rightarrow {\mathsf U}}$, the expected cost

$\displaystyle J(Q,\gamma) = {\mathbb E}_Q[c(X,\gamma(Z))],$

then the agent should solve the optimization problem

$\displaystyle J^*(R) = \inf_\gamma \sup_{Q : D(Q \| P) \le R} J(Q,\gamma). \ \ \ \ \ (1)$

The idea here is that the resulting strategy will be robust against some amount of model uncertainty, as determined by the magnitude of ${R}$. We can also envision a malicious adversary, who tries to thwart the agent’s objective by choosing the worst possible joint distribution ${Q}$ of the state and the signal, subject to the divergence constraint ${D(Q \| P) \le R}$. Thus, we can view the quantity ${J^*(R)}$ defined in (1) as the upper value of a zero-sum game between the agent and the adversary. The agent’s moves are the strategies ${\gamma}$ (so that mixed strategies would involve additional randomization on the agent’s part), while the adversary’s moves are the distributions ${Q}$ in the “divergence ball” of radius ${R}$ around the nominal distribution ${P}$.

Now let’s see what we can say about the optimal strategy in (1). We start by examining the inner maximization for a fixed ${\gamma}$. If we introduce a Lagrange multiplier ${\lambda \ge 0}$ for the constraint ${D(Q \| P) \le R}$, then it can be shown that strong duality holds, so

$\displaystyle \begin{array}{rcl} \sup_{Q : D(Q \| P) \le R} J(Q,\gamma) &=& \sup_Q \inf_{\lambda \ge 0}\left[ J(Q,\gamma) - \lambda(D(Q\|P)-R)\right] \\ &=& \inf_{\lambda \ge 0} \sup_Q \left[ J(Q,\gamma) - \lambda(D(Q\|P)-R)\right], \end{array}$

where now the supremum is over all probability distributions for ${X}$ and ${Z}$. We will now show that, for a fixed ${\lambda}$, we can compute the supremum over ${Q}$ in closed form. To that end, we will need the following result, known as the Donsker-Varadhan formula:

Lemma 1 (Donsker-Varadhan) For any probability measure ${\mu}$ on some space ${{\mathsf W}}$ and any measurable function ${f : {\mathsf W} \rightarrow {\mathbb R}}$ such that ${{\mathbb E}_\mu[e^f] < \infty}$,

$\displaystyle \sup_{\nu}\left\{ {\mathbb E}_\nu[f] - D(\nu \| \mu)\right\} = \log {\mathbb E}_\mu[e^f]. \ \ \ \ \ (2)$

Remark 1 This result is so fundamental and so simple that it has been rediscovered multiple times (e.g., by the machine learning community).

Proof: To keep things simple, I will present the proof for finite ${{\mathsf W}}$. Let us define the tilted distribution

$\displaystyle \nu^{(f)}(w) = \frac{\mu(w)e^{f(w)}}{{\mathbb E}_\mu[e^f]}.$

Then the supremum in (2) is achieved by ${\nu^{(f)}}$. Indeed, for any other ${\nu}$ we will have

$\displaystyle \begin{array}{rcl} {\mathbb E}_\nu[f] - D(\nu \| \mu) &=&\, {\mathbb E}_\nu\left[ f - \log \frac{\nu}{\mu} \right] \\ &=& -{\mathbb E}_\nu \left[ \log \frac{\nu}{\mu e^f}\right] \nonumber\\ &=& -{\mathbb E}_\nu \left[ \log \frac{\nu}{\nu^{(f)}} \right] + \log {\mathbb E}_\mu[e^f] \nonumber\\ &=& - D(\nu \| \nu^{(f)}) + \log {\mathbb E}_\mu[e^f] \nonumber\\ &\le& \log {\mathbb E}_\mu[e^f],\nonumber \end{array}$

where equality holds if and only if ${\nu = \nu^{(f)}}$. $\Box$

Now we can use (2) and write

$\displaystyle \begin{array}{rcl} \sup_Q \left[ J(Q,\gamma) - \lambda(D(Q\|P)-R)\right] &=& \sup_Q \left\{ {\mathbb E}_Q[c(X,\gamma(Z))] - \lambda (D(Q \| P)-R) \right] \\ &=& \lambda \sup_Q \left\{(1/\lambda){\mathbb E}_Q[c(X,\gamma(Z))] - D(Q\|P)\right\} + \lambda R \\ &=& \lambda \log {\mathbb E}_P\left[ \exp \left( \frac{1}{\lambda} c(X,\gamma(Z))\right)\right] + \lambda R \end{array}$

Consequently, we can express the optimal value of the optimization problem (1) as

$\displaystyle J^*(R) = \inf_{\lambda \ge 0} \inf_\gamma \left\{ \lambda \log {\mathbb E}_P\left[ \exp \left( \frac{1}{\lambda} c(X,\gamma(Z))\right)\right] + \lambda R \right\}.$

This is as far as we can get — we have no way of doing the optimization over the choice of the strategy ${\gamma}$ and the Lagrange multiplier ${\lambda \ge 0}$ in closed form. But we can gain some intuition by focusing on some fixed value of ${\lambda}$. To that end, let us define

$\displaystyle J^*(\lambda, R) := \lambda \log \inf_\gamma {\mathbb E}_P\left[ \exp \left( \frac{1}{\lambda} c(X,\gamma(Z))\right)\right] + \lambda R. \ \ \ \ \ (3)$

Now let’s examine the quantity under the logarithm. It is actually a standard Bayesian optimum cost problem for the nominal model ${P}$, except now instead of the original cost ${c(x,u)}$ we have the exponentiated cost ${\exp\left[c(x,u)/\lambda\right]}$. So for each value of ${\lambda}$ the optimal strategy ${\gamma^*_\lambda}$ minimizes the expected cost ${{\mathbb E}_P[e^{c(X,\gamma(Z))/\lambda}]}$:

$\displaystyle \gamma^*_\lambda(z) = \mathop{\text{arg\,min}}_{u \in {\mathsf U}} {\mathbb E}_P\Bigg[ \exp\left(\frac{1}{\lambda}c(X,u) \right)\Bigg|Z=z\Bigg]. \ \ \ \ \ (4)$

So the robust optimization problem (1) is, actually, an ordinary Bayesian optimization problem in disguise, but with a different cost function. This may not seem like a big deal, but now we can actually see how cautious a robust strategy is compared to a non-robust one. Let’s suppose that, instead of optimizing the expected cost ${{\mathbb E}_P[c(X,\gamma(Z))]}$ over all ${\gamma}$, we wanted to minimize the probability that the cost ${c(X,\gamma(Z))}$ exceeds some threshold ${\alpha \ge 0}$. Then the problem is

$\displaystyle \inf_\gamma P \left\{ c(X,\gamma(Z)) \ge \alpha \right\} \equiv \inf_\gamma {\mathbb E}_P \left[ 1_{\{c(X,\gamma(Z)) \ge \alpha \}}\right].$

The solution to this problem will, of course, depend on ${\alpha}$. But now we will see that a robust strategy will work for all ${\alpha \ge 0}$ (even though it will not be optimal for each individual ${\alpha}$). To see this, we will exploit the fact that the graph of the step function with the step at ${\alpha}$, i.e., of the indicator function of the semi-infinite interval ${[\alpha,+\infty)}$, lies below the graph of the shifted exponential function ${t \mapsto e^{(t-\alpha)/\lambda}}$ for any ${\lambda \ge 0}$:

$\displaystyle 1_{\{t \ge \alpha\}} \le e^{(t-\alpha)/\lambda}, \qquad \forall \alpha,\lambda \ge 0. \ \ \ \ \ (5)$

The two graphs touch precisely at the point ${t=\alpha}$, and the bound is tighter for larger values of ${\lambda}$. Therefore, using (5) and the definition of ${J^*(\lambda,R)}$ in (3), we can write

$\displaystyle \begin{array}{rcl} \inf_\gamma P \left\{ c(X,\gamma(Z)) \ge \alpha \right\} &\le& e^{-\alpha/\lambda}\inf_\gamma {\mathbb E}_P\left[ \exp \left( \frac{1}{\lambda} c(X,\gamma(Z))\right)\right] \\ &=& e^{-R}e^{-\alpha/\lambda}\exp\left[ \frac{1}{\lambda}J^*(\lambda,R)\right]. \end{array}$

Moreover, this upper bound is actually achieved by ${\gamma^*_\lambda}$. The main thing to note here is the presence of the exponentially decaying factor ${e^{-\alpha/\lambda}}$. So, if an agent decides to use a robust strategy and if his model ${P}$ actually turns out to be correct, then he ends up ensuring not only a small average cost, but also small probability of large excursions of the cost! For this reason, the strategy (4) is often called risk-sensitive, the problem of minimizing the expected exponential cost ${{\mathbb E}_P[e^{c(X,\gamma(Z))/\lambda}]}$ is called risk-sensitive optimization, and the Lagrange multiplier ${\lambda}$ is called the risk aversion factor.

For a nice treatment of robustness and risk-sensitive optimization in the context of information theory and statistical estimation, check out this paper by Neri Merhav.

### 5 Responses

1. Tara said, on July 21, 2012 at 10:39 am

I would like to use some of your posts for the IT society’s newsletter… This one is quite quite pretty…

• mraginsky said, on July 25, 2012 at 12:52 pm

I am certainly open to that idea, so we should definitely talk about that. (And thank you.)

2. Eric Young said, on May 20, 2013 at 12:28 pm

In the interest of improving communication between disciplines, I invite you and your readers to look at our work on robustness and rational inattention, available at my website.

3. […] As Max points out in the comments, this is really a specialized version of the Donsker-Varadhan formula, also mentioned by Mokshay in a comment here. I think the difficulty with concepts like these is […]