In the previous post we have seen that access to additional information is not always helpful in decision-making. On the other hand, extra information can never hurt, assuming one is precise about the quantitative meaning of “extra information.” In this post, I will show how Shannon’s information theory can be used to speak meaningfully about the value of information for decision-making. This particular approach was developed in the 1960s and 1970s by Ruslan Stratonovich (of the Stratonovich integral fame, among other things) and described in his book on information theory, which was published in Russian in 1975. As far as I know, it was never translated into English, which is a shame, since Stratonovich was an extremely original thinker, and the book contains a deep treatment of the three fundamental problems of information theory (lossless source coding, noisy channel coding, lossy source coding) from the viewpoint of statistical physics.
For simplicity, I will assume, for the most part, that all alphabets are finite. Suppose we have some system whose state is described by a random variable with distribution . We also have a decision space and a cost function . Finally, suppose we receive some information about the system, i.e., observe a random variable taking values in some space and connected to the state by a conditional probability law . This set-up can be interpreted in many different ways, depending on your perspective:
- Statistical decision theory: we conduct a statistical experiment on the system by observing a random outcome , whose distribution is when the “true” state is .
- Control theory: the state and the observation are related through an information structure .
- Information theory: we observe the state through a noisy channel .
Regardless of the interpretation, the goal is to find a decision rule to minimize the expected cost
i.e., to find
This is a textbook problem, and the solution is to minimize the a posteriori expected cost:
This will result in the minimum expected cost (also known as the Bayes risk)
Intuitively, can be large or small depending on how “informative” the channel (information structure, experiment) is. The two extreme cases are:
- No information: for some fixed probability law over . In other words, the observation is independent of the state and, therefore, useless. In this case, the optimal decision can only based on the prior :
- Perfect information: and , i.e., for every . In this case, , resulting in
But what happens when is nestled somewhere between these two extremes? Since the prior is fixed, every channel induces a joint distribution on via , and so we can compute the mutual information
The mutual information can vary between (when and are independent, i.e., the no-information case) and , the entropy of . Let us fix a number and consider the set of all channels such that . Now we can ask the following question: given , which channel in this set should be preferred over all others? Well, clearly the one that will result in the largest reduction of the minimal expected cost compared to the no-information case! This motivates the following definition:
Definition 1 Given a state space , a decision space , an observation space , a prior , a cost function , and a number , the value of bits of information is the largest possible reduction in expected cost over all observation channels (experiments, information structures) satisfying the mutual information constraint :
Now let’s suppose that you can actually choose the observation space , not necessarily finite. Then it makes sense to define the overall value of bits of information as the largest value of information attainable with any choice of :
This is where things get interesting. It turns out that the problem of choosing the best “representation space” to maximize the value of information subject to a given mutual information constraint is closely related to the problem of computing the Shannon distortion-rate function. This result is originally due to Stratonovich, and the proof given below follows his. More or less the same result seems to have been rediscovered by Fumio Kanaya and Kenji Nakagawa in 1991.
is the Shannon distortion-rate function of the source with distortion function .
Proof: Note that Eq. (1) is equivalent to
Clearly, . Indeed, for any let attain the infimum in , and let be the corresponding Bayes decision strategy. Then
is a Markov chain, so by the data processing inequality. Hence, the cascade channel
belongs to the feasible set for the problem of computing . Consequently,
For the reverse inequality , let us take , the space of all probability distributions over . Let achieve the infimum in (2). From the variational formulation of Shannon’s rate-distortion problem, we know that has the form
for some and some probability distribution , where is the normalization factor. Moreover, denoting by the induced joint distribution, we can easily see that . Now, from (4) we can write
where is the corresponding posterior distribution. Let be related to through . Defining the Bayes update mapping , where , by
we see that
is a Markov chain. Therefore, by the data processing inequality, which means that the cascade channel
belongs to the feasible set for the problem of computing . This means that
where the posterior distribution is calculated as follows:
For a particular realization , we will have with probability one, so from (5) we will have
Taking the expectation of both sides with respect to , we will have
Therefore, .
Remark 1 The main point of the lengthy proof of the inequality was to be able to claim that the observation space of beliefs concerning the state is essentially optimal for maximizing the value of information (in the sense of maximal reduction of the Bayes risk relative to the no-information case) under a mutual information constraint, i.e.,
A much shorter, but also much less insightful, proof goes as follows:
Remark 2 Note that we can interpret as an auxiliary alphabet and write down the standard rate-distortion function , given by the functional inverse of (2), in a form more reminiscent of the Wyner–Ziv problem:
where the infimum is over all alphabets and all auxiliary -valued random variables jointly distributed with , such that
Thanks Maxim for taking your time to write these notes; I find all of them very interesting. (Since you asked me to post my private message on the blog; here it is:)
I am generally unhappy about mutual information being used as the value of information for such single-shot optimization problems. When one considers Wyner-Ziv or Orlitsky-Roche type infinite instance computation problems; yes, these mean a lot (or when matching applies; such as LQG and Gaussian type). But in one-shot problems perhaps:
– quantization constraints: such as output entropy constraints
– in the decentralized setting: Yao type graphical computational complexity issues
– optimization of information channels viewed as probability spaces.
– comparing them using Blackwell type comparisons (except for some specific cost functions where, one can go beyond ‘degradedness’)
seem to be more relevant. Mutual information really makes many things deceptively easier; one thing is by simply adding an additional variable (increasing cardinality by 1), it makes the problem a convex one..
For example, the entropy constrained quantization problem is more relevant than the distortion-rate function for a single shot problem, unless, as you indicated, the dimension is too large.
For multi-stage problems, the problem is still there, since it is no longer the same instance in the future time stages. One interesting direction here is, as time goes to infinity; to see how the average performance changes as a function of the channel (for example, how an invariant distribution changes and how it affects both the optimal policies and costs). This seems to be a very hard problem which has not been looked at.