## Information theory in economics, Part II: Robustness

As we have seen in Part I, the rational inattention framework of Christopher Sims aims to capture the best a rational agent can do when his capacity for processing information is limited. This rationally inattentive agent, however, has no reason to question his statistical model. In this post we will examine the *robustness* framework of Thomas Sargent, which deals with the issue of model uncertainty, but does not assume any capacity limitations.

## Information theory in economics, Part I: Rational inattention

Economic activity involves making decisions. In order to make decisions, agents need information. Thus, the problem of acquisition, transmission, and uses of information has been occupying the economists’ attention for some time now (there is even a whole subfield of “information economics”). It is not surprising, therefore, that *information* theory, the brainchild of Claude Shannon, would eventually make its way into economics. In this post and the one to follow, I will briefly describe two specific strands of information-theoretic work in economics: the *rational inattention* framework of Christopher Sims and the *robustness* ideas of Thomas Sargent. (As an interesting aside: Sims and Sargent have shared the 2011 Nobel Memorial Prize in Economics, although not directly for their information-theoretic work, but rather for their work related to causality.)

## ISIT 2011: favorite talks

Obligatory disclaimer: YMMV, “favorite” does not mean “best,” etc. etc.

- Emmanuel Abbe and Andrew Barron, “Polar coding schemes for the AWGN channel” (pdf)
- Tom Cover, “On the St. Petersburg paradox”
- Paul Cuff, Tom Cover, Gowtham Kumar, Lei Zhao, “A lattice of gambles”
- Ioanna Ioannou, Charalambos Charalambous, Sergey Loyka, “Outage probability under channel distribution uncertainty” (pdf; longer version: arxiv:1102.1103)
- Mohammad Naghshvar, Tara Javidi, “Performance bounds for active sequential hypothesis testing”
- Chris Quinn, Negar Kiyavash, Todd Coleman, “Equivalence between minimal generative model graphs and directed information graphs” (pdf)
- Ofer Shayevitz, “On Rényi measures and hypothesis testing” (long version: arxiv:1012.4401)

The problem of constructing polar codes for channels with continuous input and output alphabets can be reduced, in a certain sense, to the problem of constructing finitely supported approximations to capacity-achieving distributions. This work analyzes several such approximations for the AWGN channel. In particular, one approximation uses quantiles and approaches capacity at a rate that decays *exponentially* with support size. The proof of this fact uses a neat trick of upper-bounding the Kullback-Leibler divergence by the chi-square distance and then exploiting the law of large numbers.

A fitting topic, since this year’s ISIT took place in St. Petersburg! Tom has presented a reformulation of the problem underlying this (in)famous paradox in terms of finding the best allocation of initial capital so as to optimize various notions of relative wealth. This reformulation obviates the need for various extra assumptions, such as diminishing marginal returns (i.e., concave utilities), and thus provides a means of resolving the paradox from first principles.

There is a well-known correspondence between martingales and “fair” gambling systems. Paul and co-authors explore another correspondence, between fair gambles and Lorenz curves used in econometric modeling, to study certain stochastic orderings and transformations of martingales. There are nice links to the theory of majorization and, through that, to Blackwell’s framework for comparing statistical experiments in terms of their expected risks.

The outage probability of a general channel with stochastic fading is the probability that the conditional input-output mutual information given the fading state falls below the given rate. In this paper, it is assumed that the state distribution is not known exactly, but there is an upper bound on its divergence from some fixed “nominal” distribution (this model of statistical uncertainty has been used previously in the context of robust control). The variational representation of the divergence (as a Legendre-Fenchel transform of the moment-generating function) then allows for a clean asymptotic analysis of the outage probability.

Mohammad and Tara show how dynamic programming techniques can be used to develop tight converse bounds for sequential hypothesis testing problems with feedback, in which it is possible to adaptively control the quality of the observation channel. This viewpoint is a lot cleaner and more conceptually straightforward than “classical” proofs based on martingales (à la Burnashev). This new technique is used to analyze asymptotically optimal strategies for sequential -ary hypothesis testing, variable-length coding with feedback, and noisy dynamic search.

For networks of interacting discrete-time stochastic processes possessing a certain conditional independence structure (motivating example: discrete-time approximations of smooth dynamical systems), Chris, Negar and Todd show the equivalence between two types of graphical models for these networks: (1) generative models that are minimal in a certain “combinatorial” sense and (2) information-theoretic graphs, in which the edges are drawn based on directed information.

Ofer obtained a new variational characterization of Rényi entropy and divergence that considerably simplifies their analysis, in many cases completely replacing delicate arguments based on Taylor expansions with purely information-theoretic proofs. He also develops a new operational characterization of these information measures in terms of distributed composite hypothesis testing.

## Missing all the action

**Update:** I fixed a couple of broken links.

I want to write down some thoughts inspired by Chernoff’s memo on backward induction that may be relevant to feedback information theory and networked control. Some of these points were brought up in discussions with Serdar Yüksel two years ago.

## The lost art of writing

From the opening paragraph of Herman Chernoff‘s unpublished 1963 memo “Backward induction in dynamic programming” (thanks to Armand Makowski for a scanned copy):

The solution of finite sequence dynamic programming problems involve a backward induction argument, the foundations of which are generally understood hazily. The purpose of this memo is to add some clarification which may be slightly redundant and whose urgency may be something less than vital.

Alas, nobody writes like that anymore.

## In Soviet Russia, the sigma-field conditions on you

Quote of the day, from *Asymptotics in Statistics: Some Basic Concepts* by Lucien Le Cam and Grace Yang (emphasis mine):

The idea of developing statistical procedures that minimize an expected loss goes back to Laplace … [and] reappears in papers of Edgeworth. According to Neyman in his Lectures and Conferences: “After Edgeworth, the idea of the loss function was lost from sight for more than two decades …” It was truly revived only by the appearance on the statistical scene of Wald. Wald’s books

Sequential AnalysisandStatistical Decision Functionsare based on that very idea of describing experiments by families of probability measures either on one given -field or on sequence of -fields to be chosen by the statistician. The idea seems logical enough if one is used to it. However, there is a paper by Fisher where he seems to express the opinion thatsuch concepts are misleading and good enough only for Russian or American engineers.

Follow the link to Fisher’s paper for more curmudgeonly remarks about “Russians” and their “five-year plans.”

## Value of information, Bayes risks, and rate-distortion theory

In the previous post we have seen that access to additional information is not always helpful in decision-making. On the other hand, extra information can never hurt, assuming one is precise about the *quantitative* meaning of “extra information.” In this post, I will show how Shannon’s information theory can be used to speak meaningfully about the *value of information* for decision-making. This particular approach was developed in the 1960s and 1970s by Ruslan Stratonovich (of the Stratonovich integral fame, among other things) and described in his book on information theory, which was published in Russian in 1975. As far as I know, it was never translated into English, which is a shame, since Stratonovich was an extremely original thinker, and the book contains a deep treatment of the three fundamental problems of information theory (lossless source coding, noisy channel coding, lossy source coding) from the viewpoint of statistical physics.

## Deadly ninja weapons: Blackwell’s principle of irrelevant information

Having more information when making decisions should always help, it seems. However, there are situations in which this is not the case. Suppose that you observe two pieces of information, and , which you can use to choose an action . Suppose also that, upon choosing , you incur a cost . For simplicity let us assume that , , and take values in finite sets , , and , respectively. Then it is obvious that, no matter which “strategy” for choosing you follow, you cannot do better than . More formally, for any strategy we have

Thus, the extra information is *irrelevant*. Why? Because the cost you incur does not depend on directly, though it may do so through .

Interestingly, as David Blackwell has shown in 1964 in a three-page paper, this seemingly innocuous argument does not go through when , , and are Borel subsets of Euclidean spaces, the cost function is bounded and Borel-measurable, and the strategies are required to be measurable as well. However, if and are *random variables* with a known joint distribution , then is indeed irrelevant for the purpose of minimizing *expected* cost.

**Warning:*** lots of measure-theoretic noodling below the fold; if that is not your cup of tea, you can just assume that all sets are finite and go with the poor man’s version stated in the first paragraph. Then all the results below will hold.*

## Divergence in everything: bounding the regret in online optimization

Let’s continue with our magical mystery tour through the lands of divergence.

(image yoinked from Sergio Verdú‘s 2007 Shannon Lecture slides)

Today’s stop is in the machine learning domain. The result I am about to describe has been floating around in various forms in many different papers, but it has been nicely distilled by Hari Narayanan and Sasha Rakhlin in their recent paper on a random walk approach to online convex optimization.

## Bell Systems Technical Journal: now online

The Bell Systems Technical Journal is now online. Mmmm, seminal articles … . Shannon, Wyner, Slepian, Witsenhausen — they’re all here!

(h/t Anand Sarwate)

5comments