As we have seen in Part I, the rational inattention framework of Christopher Sims aims to capture the best a rational agent can do when his capacity for processing information is limited. This rationally inattentive agent, however, has no reason to question his statistical model. In this post we will examine the robustness framework of Thomas Sargent, which deals with the issue of model uncertainty, but does not assume any capacity limitations.
Economic activity involves making decisions. In order to make decisions, agents need information. Thus, the problem of acquisition, transmission, and uses of information has been occupying the economists’ attention for some time now (there is even a whole subfield of “information economics”). It is not surprising, therefore, that information theory, the brainchild of Claude Shannon, would eventually make its way into economics. In this post and the one to follow, I will briefly describe two specific strands of information-theoretic work in economics: the rational inattention framework of Christopher Sims and the robustness ideas of Thomas Sargent. (As an interesting aside: Sims and Sargent have shared the 2011 Nobel Memorial Prize in Economics, although not directly for their information-theoretic work, but rather for their work related to causality.)
Obligatory disclaimer: YMMV, “favorite” does not mean “best,” etc. etc.
- Emmanuel Abbe and Andrew Barron, “Polar coding schemes for the AWGN channel” (pdf)
- Tom Cover, “On the St. Petersburg paradox”
- Paul Cuff, Tom Cover, Gowtham Kumar, Lei Zhao, “A lattice of gambles”
- Ioanna Ioannou, Charalambos Charalambous, Sergey Loyka, “Outage probability under channel distribution uncertainty” (pdf; longer version: arxiv:1102.1103)
- Mohammad Naghshvar, Tara Javidi, “Performance bounds for active sequential hypothesis testing”
- Chris Quinn, Negar Kiyavash, Todd Coleman, “Equivalence between minimal generative model graphs and directed information graphs” (pdf)
- Ofer Shayevitz, “On Rényi measures and hypothesis testing” (long version: arxiv:1012.4401)
The problem of constructing polar codes for channels with continuous input and output alphabets can be reduced, in a certain sense, to the problem of constructing finitely supported approximations to capacity-achieving distributions. This work analyzes several such approximations for the AWGN channel. In particular, one approximation uses quantiles and approaches capacity at a rate that decays exponentially with support size. The proof of this fact uses a neat trick of upper-bounding the Kullback-Leibler divergence by the chi-square distance and then exploiting the law of large numbers.
A fitting topic, since this year’s ISIT took place in St. Petersburg! Tom has presented a reformulation of the problem underlying this (in)famous paradox in terms of finding the best allocation of initial capital so as to optimize various notions of relative wealth. This reformulation obviates the need for various extra assumptions, such as diminishing marginal returns (i.e., concave utilities), and thus provides a means of resolving the paradox from first principles.
There is a well-known correspondence between martingales and “fair” gambling systems. Paul and co-authors explore another correspondence, between fair gambles and Lorenz curves used in econometric modeling, to study certain stochastic orderings and transformations of martingales. There are nice links to the theory of majorization and, through that, to Blackwell’s framework for comparing statistical experiments in terms of their expected risks.
The outage probability of a general channel with stochastic fading is the probability that the conditional input-output mutual information given the fading state falls below the given rate. In this paper, it is assumed that the state distribution is not known exactly, but there is an upper bound on its divergence from some fixed “nominal” distribution (this model of statistical uncertainty has been used previously in the context of robust control). The variational representation of the divergence (as a Legendre-Fenchel transform of the moment-generating function) then allows for a clean asymptotic analysis of the outage probability.
Mohammad and Tara show how dynamic programming techniques can be used to develop tight converse bounds for sequential hypothesis testing problems with feedback, in which it is possible to adaptively control the quality of the observation channel. This viewpoint is a lot cleaner and more conceptually straightforward than “classical” proofs based on martingales (à la Burnashev). This new technique is used to analyze asymptotically optimal strategies for sequential -ary hypothesis testing, variable-length coding with feedback, and noisy dynamic search.
For networks of interacting discrete-time stochastic processes possessing a certain conditional independence structure (motivating example: discrete-time approximations of smooth dynamical systems), Chris, Negar and Todd show the equivalence between two types of graphical models for these networks: (1) generative models that are minimal in a certain “combinatorial” sense and (2) information-theoretic graphs, in which the edges are drawn based on directed information.
Ofer obtained a new variational characterization of Rényi entropy and divergence that considerably simplifies their analysis, in many cases completely replacing delicate arguments based on Taylor expansions with purely information-theoretic proofs. He also develops a new operational characterization of these information measures in terms of distributed composite hypothesis testing.
The solution of finite sequence dynamic programming problems involve a backward induction argument, the foundations of which are generally understood hazily. The purpose of this memo is to add some clarification which may be slightly redundant and whose urgency may be something less than vital.
Alas, nobody writes like that anymore.
The idea of developing statistical procedures that minimize an expected loss goes back to Laplace … [and] reappears in papers of Edgeworth. According to Neyman in his Lectures and Conferences: “After Edgeworth, the idea of the loss function was lost from sight for more than two decades …” It was truly revived only by the appearance on the statistical scene of Wald. Wald’s books Sequential Analysis and Statistical Decision Functions are based on that very idea of describing experiments by families of probability measures either on one given -field or on sequence of -fields to be chosen by the statistician. The idea seems logical enough if one is used to it. However, there is a paper by Fisher where he seems to express the opinion that such concepts are misleading and good enough only for Russian or American engineers.
Follow the link to Fisher’s paper for more curmudgeonly remarks about “Russians” and their “five-year plans.”
In the previous post we have seen that access to additional information is not always helpful in decision-making. On the other hand, extra information can never hurt, assuming one is precise about the quantitative meaning of “extra information.” In this post, I will show how Shannon’s information theory can be used to speak meaningfully about the value of information for decision-making. This particular approach was developed in the 1960s and 1970s by Ruslan Stratonovich (of the Stratonovich integral fame, among other things) and described in his book on information theory, which was published in Russian in 1975. As far as I know, it was never translated into English, which is a shame, since Stratonovich was an extremely original thinker, and the book contains a deep treatment of the three fundamental problems of information theory (lossless source coding, noisy channel coding, lossy source coding) from the viewpoint of statistical physics.
Having more information when making decisions should always help, it seems. However, there are situations in which this is not the case. Suppose that you observe two pieces of information, and , which you can use to choose an action . Suppose also that, upon choosing , you incur a cost . For simplicity let us assume that , , and take values in finite sets , , and , respectively. Then it is obvious that, no matter which “strategy” for choosing you follow, you cannot do better than . More formally, for any strategy we have
Thus, the extra information is irrelevant. Why? Because the cost you incur does not depend on directly, though it may do so through .
Interestingly, as David Blackwell has shown in 1964 in a three-page paper, this seemingly innocuous argument does not go through when , , and are Borel subsets of Euclidean spaces, the cost function is bounded and Borel-measurable, and the strategies are required to be measurable as well. However, if and are random variables with a known joint distribution , then is indeed irrelevant for the purpose of minimizing expected cost.
Warning: lots of measure-theoretic noodling below the fold; if that is not your cup of tea, you can just assume that all sets are finite and go with the poor man’s version stated in the first paragraph. Then all the results below will hold.
Let’s continue with our magical mystery tour through the lands of divergence.
Today’s stop is in the machine learning domain. The result I am about to describe has been floating around in various forms in many different papers, but it has been nicely distilled by Hari Narayanan and Sasha Rakhlin in their recent paper on a random walk approach to online convex optimization.