## Information theory in economics, Part II: Robustness

As we have seen in Part I, the rational inattention framework of Christopher Sims aims to capture the best a rational agent can do when his capacity for processing information is limited. This rationally inattentive agent, however, has no reason to question his statistical model. In this post we will examine the *robustness* framework of Thomas Sargent, which deals with the issue of model uncertainty, but does not assume any capacity limitations.

Again, for simplicity I will stick to single-stage problems. The setting is exactly the same as before: we have an agent who must make a decision $U$ pertaining to some state of the world $X$ on the basis of some signal $Y$ correlated with $X$. If the quality of a decision procedure $\gamma$, which maps each signal value $y$ to a decision $u = \gamma(y)$, is measured in terms of its expected cost $\mathbb{E}[c(X,\gamma(Y))]$, then the agent faces the optimization problem

$$\inf_\gamma \mathbb{E}[c(X,\gamma(Y))],$$

and the optimal decision function is given by

$$\gamma^*(y) = \operatorname*{arg\,min}_u \mathbb{E}[c(X,u) \mid Y = y].$$

In order to compute $\gamma^*$, the agent must know $P_X$ (the distribution of $X$) and the observation model $P_{Y|X}$ that relates the observed signal $Y$ to the unobserved state $X$. In other words, the agent must have a *probabilistic model*, embodied in the joint distribution $P = P_X \otimes P_{Y|X}$ of $X$ and $Y$.
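To make the Bayesian baseline concrete, here is a minimal sketch of the optimal decision function for a finite model. The two-state, two-signal joint distribution and the cost table are made-up numbers chosen only for illustration, and the helper names (`bayes_strategy`, `expected_cost`) are hypothetical.

```python
# Toy finite model; every number below is an illustrative assumption.
# P[x][y]: nominal joint probability of state x and signal y.
P = [[0.40, 0.10],
     [0.15, 0.35]]
# cost[x][u]: cost c(x, u) of taking decision u when the state is x.
cost = [[0.0, 1.0],
        [2.0, 0.0]]

def bayes_strategy(P, cost):
    """For each signal y, pick the decision u minimizing E[c(X, u) | Y = y]."""
    n_x, n_y, n_u = len(P), len(P[0]), len(cost[0])
    gamma = []
    for y in range(n_y):
        # sum_x P(x, y) c(x, u) equals P_Y(y) * E[c(X, u) | Y = y]; the factor
        # P_Y(y) does not depend on u, so the minimizer is the same.
        scores = [sum(P[x][y] * cost[x][u] for x in range(n_x))
                  for u in range(n_u)]
        gamma.append(scores.index(min(scores)))
    return gamma

def expected_cost(P, cost, gamma):
    """Expected cost of the strategy gamma under the joint distribution P."""
    return sum(P[x][y] * cost[x][gamma[y]]
               for x in range(len(P)) for y in range(len(P[0])))

gamma_star = bayes_strategy(P, cost)
bayes_cost = expected_cost(P, cost, gamma_star)
print(gamma_star, bayes_cost)
```

Since the model is finite, the result can be sanity-checked by enumerating all four deterministic strategies and confirming that none has a smaller expected cost.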

All of this is standard fare in the realm of Bayesian rationality. But now let’s suppose that the agent treats the model $P$ only as an *approximation* that was adopted for reasons of computational tractability, limited knowledge, etc. What should the agent do in this situation? Sargent’s proposal, inspired by ideas from robust control and game theory, is this: instead of simply optimizing the decision procedure to the “nominal” model $P$ (which may, after all, prove to be inaccurate), the agent should hedge his bets and allow for the fact that the “true” model (which we will denote by $Q$) may differ from the nominal model $P$, but the difference is bounded. While there are many ways of quantifying this *model uncertainty*, Sargent has opted for the divergence bound $D(Q \| P) \le R$, where $D(\cdot \| \cdot)$ is the relative entropy (Kullback-Leibler divergence). The parameter $R \ge 0$ is chosen by the agent and reflects his degree of confidence in $P$: the smaller this $R$, the more the agent trusts his model-building skills. With the problem framed in this way, the natural strategy is the *minimax* one: if we define, for any distribution $Q$ of $(X,Y)$ and for any strategy $\gamma$, the expected cost

$$J(Q,\gamma) = \mathbb{E}_Q[c(X,\gamma(Y))],$$

then the agent should solve the optimization problem

$$\inf_\gamma \sup_{Q \,:\, D(Q \| P) \le R} J(Q,\gamma). \tag{1}$$

The idea here is that the resulting strategy will be *robust* against some amount of model uncertainty, as determined by the magnitude of $R$. We can also envision a malicious adversary, who tries to thwart the agent’s objective by choosing the worst possible joint distribution of the state and the signal, subject to the divergence constraint $D(Q \| P) \le R$. Thus, we can view the quantity defined in (1) as the upper value of a zero-sum game between the agent and the adversary. The agent’s moves are the strategies $\gamma$ (so that mixed strategies would involve additional randomization on the agent’s part), while the adversary’s moves are the distributions $Q$ in the “divergence ball” of radius $R$ around the nominal distribution $P$.

Now let’s see what we can say about the optimal strategy in (1). We start by examining the inner maximization for a fixed $\gamma$. If we introduce a Lagrange multiplier $1/\lambda$, with $\lambda > 0$, for the constraint $D(Q \| P) \le R$, then it can be shown that strong duality holds, so

$$\sup_{Q \,:\, D(Q \| P) \le R} J(Q,\gamma) = \inf_{\lambda > 0} \sup_Q \left\{ J(Q,\gamma) - \frac{1}{\lambda}\big( D(Q \| P) - R \big) \right\},$$

where now the supremum is over *all* probability distributions $Q$ for $X$ and $Y$. We will now show that, for a fixed $\lambda$, we can compute the supremum over $Q$ in closed form. To that end, we will need the following result, known as the *Donsker-Varadhan formula*:

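Lemma 1 (Donsker-Varadhan). For any probability measure $\mu$ on some space $\mathsf{Z}$ and any measurable function $f : \mathsf{Z} \to \mathbb{R}$ such that $\mathbb{E}_\mu[e^{f}] < \infty$,

$$\log \mathbb{E}_\mu[e^{f}] = \sup_\nu \left\{ \mathbb{E}_\nu[f] - D(\nu \| \mu) \right\}, \tag{2}$$

where the supremum is over all probability measures $\nu$ on $\mathsf{Z}$.

Remark 1. This result is so fundamental and so simple that it has been rediscovered multiple times (e.g., by the machine learning community).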

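*Proof:* To keep things simple, I will present the proof for finite $\mathsf{Z}$. Let us define the *tilted distribution*

$$\mu^{f}(z) = \frac{\mu(z)\, e^{f(z)}}{\mathbb{E}_\mu[e^{f}]}, \qquad z \in \mathsf{Z}.$$

Then the supremum in (2) is achieved by $\nu = \mu^{f}$. Indeed, for any $\nu$ we will have

$$\mathbb{E}_\nu[f] - D(\nu \| \mu) = \sum_{z} \nu(z) \log \frac{\mu(z)\, e^{f(z)}}{\nu(z)} = \log \mathbb{E}_\mu[e^{f}] - D(\nu \| \mu^{f}) \le \log \mathbb{E}_\mu[e^{f}],$$

where equality holds if and only if $\nu = \mu^{f}$.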
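The variational formula is easy to check numerically on a finite space. The sketch below uses an arbitrary three-point measure `mu` and function `f` (illustrative values only): it evaluates the objective $\mathbb{E}_\nu[f] - D(\nu\|\mu)$ at the tilted distribution and at randomly drawn distributions, and compares both with $\log \mathbb{E}_\mu[e^f]$.

```python
import math
import random

def kl(nu, mu):
    """Relative entropy D(nu || mu) between finite distributions."""
    return sum(n * math.log(n / m) for n, m in zip(nu, mu) if n > 0)

def dv_objective(nu, mu, f):
    """E_nu[f] - D(nu || mu), the quantity maximized in the formula."""
    return sum(n * fi for n, fi in zip(nu, f)) - kl(nu, mu)

# Illustrative inputs (assumptions, not taken from the post).
mu = [0.5, 0.3, 0.2]
f = [1.0, -0.5, 2.0]

# Left-hand side: log E_mu[e^f].
Z = sum(m * math.exp(fi) for m, fi in zip(mu, f))
log_mgf = math.log(Z)

# The tilted distribution mu^f, proportional to mu * e^f.
tilted = [m * math.exp(fi) / Z for m, fi in zip(mu, f)]
value_at_tilted = dv_objective(tilted, mu, f)

# Random competitors: none should beat the tilted distribution.
rng = random.Random(0)

def random_distribution(k):
    w = [rng.random() + 1e-12 for _ in range(k)]
    s = sum(w)
    return [wi / s for wi in w]

best_random = max(dv_objective(random_distribution(3), mu, f)
                  for _ in range(1000))
print(log_mgf, value_at_tilted, best_random)
```

The first two printed numbers should agree up to floating-point error, and the third should not exceed them.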

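Now we can use (2), applied to $\mu = P$ and $f(x,y) = \lambda c(x,\gamma(y))$, and write

$$\sup_Q \left\{ J(Q,\gamma) - \frac{1}{\lambda} D(Q \| P) \right\} = \frac{1}{\lambda} \sup_Q \left\{ \mathbb{E}_Q[\lambda c(X,\gamma(Y))] - D(Q \| P) \right\} = \frac{1}{\lambda} \log \mathbb{E}_P\big[e^{\lambda c(X,\gamma(Y))}\big].$$

Consequently, we can express the optimal value of the optimization problem (1) as

$$\inf_\gamma \inf_{\lambda > 0} \frac{1}{\lambda} \left\{ \log \mathbb{E}_P\big[e^{\lambda c(X,\gamma(Y))}\big] + R \right\}.$$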
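For a single fixed strategy, the duality between the worst-case cost over the divergence ball and the expression $\frac{1}{\lambda}\{\log \mathbb{E}_P[e^{\lambda c}] + R\}$ can be checked numerically. The sketch below is only an illustration under stated assumptions: the four-outcome cost distribution and the radius `R = 0.1` are made up, and it relies on the standard fact that, when the constraint is active, the worst-case distribution is an exponential tilting of the nominal one whose divergence from it equals `R`.

```python
import math

# Hypothetical example: nominal probabilities p[i] of four outcomes and the
# costs c[i] that a fixed strategy incurs on them; R is the divergence radius.
p = [0.4, 0.3, 0.2, 0.1]
c = [0.0, 1.0, 2.0, 4.0]
R = 0.1

def log_mgf(lam):
    """log E_P[exp(lam * cost)] under the nominal model."""
    return math.log(sum(pi * math.exp(lam * ci) for pi, ci in zip(p, c)))

def tilted(lam):
    """Exponentially tilted distribution, proportional to p * exp(lam * c)."""
    w = [pi * math.exp(lam * ci) for pi, ci in zip(p, c)]
    total = sum(w)
    return [wi / total for wi in w]

def kl(q, ref):
    """Relative entropy D(q || ref)."""
    return sum(qi * math.log(qi / ri) for qi, ri in zip(q, ref) if qi > 0)

def dual(lam):
    """Dual objective (log E_P[exp(lam * cost)] + R) / lam."""
    return (log_mgf(lam) + R) / lam

# Bisection for the lam at which D(tilted(lam) || p) = R; this divergence
# is nondecreasing in lam, so bisection applies.
lo, hi = 1e-9, 20.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if kl(tilted(mid), p) < R:
        lo = mid
    else:
        hi = mid
lam_star = 0.5 * (lo + hi)

q_star = tilted(lam_star)
worst_case_cost = sum(qi * ci for qi, ci in zip(q_star, c))
nominal_cost = sum(pi * ci for pi, ci in zip(p, c))
print(nominal_cost, worst_case_cost, dual(lam_star))
```

The last two printed values should coincide: the cost under the worst-case tilted distribution matches the dual objective at the same `lam_star`, and both exceed the nominal expected cost.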

This is as far as we can get: we have no way of doing the optimization over the choice of the strategy $\gamma$ and the parameter $\lambda$ in closed form. But we can gain some intuition by focusing on some fixed value of $\lambda$. To that end, let us define

$$\Lambda_\lambda(\gamma) = \log \mathbb{E}_P\big[e^{\lambda c(X,\gamma(Y))}\big]. \tag{3}$$

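Now let’s examine the quantity under the logarithm. Minimizing it is actually a standard Bayesian optimum cost problem for the nominal model $P$, except now instead of the original cost $c(x,u)$ we have the *exponentiated* cost $e^{\lambda c(x,u)}$. So for each value of $\lambda$ the optimal strategy minimizes the expected cost $\mathbb{E}_P[e^{\lambda c(X,\gamma(Y))}]$:

$$\gamma^*_\lambda(y) = \operatorname*{arg\,min}_u \mathbb{E}_P\big[e^{\lambda c(X,u)} \,\big|\, Y = y\big]. \tag{4}$$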
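To see how exponentiating the cost changes the decision, here is a small sketch that minimizes the posterior expected exponentiated cost on a toy finite model. The joint distribution, the cost table, and the function name `risk_sensitive_strategy` are illustrative assumptions, as are the two values tried for the exponent parameter `lam`.

```python
import math

# Toy finite model; every number below is an illustrative assumption.
# P[x][y]: nominal joint probability of state x and signal y.
P = [[0.40, 0.10],
     [0.15, 0.35]]
# cost[x][u]: cost c(x, u) of taking decision u when the state is x.
cost = [[0.0, 1.0],
        [2.0, 0.0]]

def risk_sensitive_strategy(P, cost, lam):
    """For each signal y, pick u minimizing E[exp(lam * c(X, u)) | Y = y]."""
    n_x, n_y, n_u = len(P), len(P[0]), len(cost[0])
    gamma = []
    for y in range(n_y):
        # The normalization by P_Y(y) does not affect the minimizer.
        scores = [sum(P[x][y] * math.exp(lam * cost[x][u]) for x in range(n_x))
                  for u in range(n_u)]
        gamma.append(scores.index(min(scores)))
    return gamma

def largest_cost(P, cost, gamma):
    """Largest cost the strategy can incur on outcomes of positive probability."""
    return max(cost[x][gamma[y]]
               for x in range(len(P)) for y in range(len(P[0])) if P[x][y] > 0)

mild = risk_sensitive_strategy(P, cost, 0.1)      # nearly risk-neutral
cautious = risk_sensitive_strategy(P, cost, 2.0)  # strongly risk-sensitive
print(mild, largest_cost(P, cost, mild))
print(cautious, largest_cost(P, cost, cautious))
```

With these particular numbers, the small value of `lam` yields the decisions `[0, 1]`, while the large value yields `[1, 1]`: the second strategy never risks the cost of 2, at the price of a somewhat higher average cost.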

So the robust optimization problem (1) is, actually, an ordinary Bayesian optimization problem in disguise, but with a different cost function. This may not seem like a big deal, but now we can actually see how cautious a robust strategy is compared to a non-robust one. Let’s suppose that, instead of optimizing the expected cost over all $\gamma$, we wanted to minimize the probability that the cost exceeds some threshold $\tau$. Then the problem is

$$\inf_\gamma \mathbb{P}\big( c(X,\gamma(Y)) \ge \tau \big).$$

The solution to this problem will, of course, depend on $\tau$. But now we will see that a robust strategy will work for *all* $\tau$ (even though it will not be optimal for each individual $\tau$). To see this, we will exploit the fact that the graph of the step function with the step at $\tau$, i.e., of the indicator function of the semi-infinite interval $[\tau,\infty)$, lies below the graph of the shifted exponential function $e^{\lambda(t-\tau)}$ for any $\lambda > 0$:

$$1_{\{t \ge \tau\}} \le e^{\lambda(t - \tau)}, \qquad t \in \mathbb{R}. \tag{5}$$

The two graphs touch precisely at the point $t = \tau$, and to the left of that point the bound is tighter for larger values of $\lambda$. Therefore, using (5) and the definition of $\Lambda_\lambda(\gamma)$ in (3), we can write

$$\mathbb{P}\big( c(X,\gamma(Y)) \ge \tau \big) = \mathbb{E}_P\big[ 1_{\{c(X,\gamma(Y)) \ge \tau\}} \big] \le e^{-\lambda\tau}\, \mathbb{E}_P\big[ e^{\lambda c(X,\gamma(Y))} \big] = e^{-\lambda\tau + \Lambda_\lambda(\gamma)}.$$

Moreover, this upper bound is smallest when $\gamma$ is the strategy $\gamma^*_\lambda$ from (4). The main thing to note here is the presence of the exponentially decaying factor $e^{-\lambda\tau}$. So, if an agent decides to use a robust strategy and if his model actually turns out to be correct, then he ends up ensuring not only a small average cost, but also small probability of large excursions of the cost! For this reason, the strategy (4) is often called *risk-sensitive*, the problem of minimizing the expected exponential cost is called *risk-sensitive optimization*, and the parameter $\lambda$ (the reciprocal of the Lagrange multiplier) is called the *risk aversion factor*.
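The exponential tail bound can be verified directly on a finite cost distribution. The following sketch assumes a hypothetical four-outcome distribution of the cost under some fixed strategy and compares the exact tail probability with the bound $e^{-\lambda\tau}\,\mathbb{E}[e^{\lambda c}]$ for a few thresholds `tau` and exponents `lam`.

```python
import math

# Hypothetical distribution of the cost under a fixed strategy:
# outcome i has probability p[i] and cost c[i].
p = [0.4, 0.3, 0.2, 0.1]
c = [0.0, 1.0, 2.0, 4.0]

def tail_probability(tau):
    """Exact probability that the cost is at least tau."""
    return sum(pi for pi, ci in zip(p, c) if ci >= tau)

def exponential_bound(tau, lam):
    """exp(-lam * tau + Lambda), where Lambda = log E[exp(lam * cost)]."""
    Lambda = math.log(sum(pi * math.exp(lam * ci) for pi, ci in zip(p, c)))
    return math.exp(-lam * tau + Lambda)

for tau in (1.0, 2.0, 4.0):
    for lam in (0.1, 0.5, 1.0, 2.0):
        print(tau, lam, tail_probability(tau), exponential_bound(tau, lam))
```

In every row the exact tail probability should be no larger than the bound; how tight the bound is depends on the choice of `lam` for the given `tau`.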

For a nice treatment of robustness and risk-sensitive optimization in the context of information theory and statistical estimation, check out this paper by Neri Merhav.

