Value of information, Bayes risks, and rate-distortion theory

In the previous post, we saw that access to additional information is not always helpful in decision-making. On the other hand, extra information can never hurt, provided one is precise about the quantitative meaning of “extra information.” In this post, I will show how Shannon’s information theory can be used to speak meaningfully about the value of information for decision-making. This particular approach was developed in the 1960s and 1970s by Ruslan Stratonovich (of Stratonovich integral fame, among other things) and described in his book on information theory, published in Russian in 1975. As far as I know, it was never translated into English, which is a shame, since Stratonovich was an extremely original thinker, and the book contains a deep treatment of the three fundamental problems of information theory (lossless source coding, noisy channel coding, lossy source coding) from the viewpoint of statistical physics.

For simplicity, I will assume, for the most part, that all alphabets are finite. Suppose we have some system whose state is described by a random variable {X \in {\mathsf X}} with distribution {P_X}. We also have a decision space {{\mathsf U}} and a cost function {c : {\mathsf X} \times {\mathsf U} \rightarrow {\mathbb R}^+}. Finally, suppose we receive some information about the system, i.e., observe a random variable {Z} taking values in some space {{\mathsf Z}} and connected to the state {X} by a conditional probability law {P_{Z|X}}. This set-up can be interpreted in many different ways, depending on your perspective:

  • Statistical decision theory: we conduct a statistical experiment on the system by observing a random outcome {Z}, whose distribution is {P_{Z|X=x}} when the “true” state is {x}.
  • Control theory: the state {X} and the observation {Z} are related through an information structure {P_{Z|X}}.
  • Information theory: we observe the state {X} through a noisy channel {P_{Z|X}}.

Regardless of the interpretation, the goal is to find a decision rule {\gamma : {\mathsf Z} \rightarrow {\mathsf U}} to minimize the expected cost

\displaystyle  J(\gamma) = \mathop{\mathbb E} c(X,\gamma(Z)) = \sum_{x \in {\mathsf X}} \sum_{z \in {\mathsf Z}} P_X(x)P_{Z|X}(z|x)c(x,\gamma(z)),

i.e., to find

\displaystyle  \gamma^* = \mathop{\text{arg\,min}}_{\gamma : {\mathsf Z} \rightarrow {\mathsf U}} J(\gamma).

This is a textbook problem, and the solution is to minimize the a posteriori expected cost:

\displaystyle  \gamma^*(z) = \mathop{\text{arg\,min}}_{u \in {\mathsf U}} \mathop{\mathbb E}[c(X,u)|Z=z].

This will result in the minimum expected cost (also known as the Bayes risk)

\displaystyle  \begin{array}{rcl}  J^* &=& J(\gamma^*) \\ &=& \sum_{z \in {\mathsf Z}} P_Z(z) \sum_{x \in {\mathsf X}} P_{X|Z}(x|z)c(x,\gamma^*(z)) \\ &=& \sum_{z \in {\mathsf Z}} P_Z(z) \min_{u \in {\mathsf U}} \sum_{x \in {\mathsf X}} P_{X|Z}(x|z)c(x,u) \\ &=& \mathop{\mathbb E}\Big[ \min_{u \in {\mathsf U}} \mathop{\mathbb E}[c(X,u)|Z]\Big]. \end{array}
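For a small finite problem, the Bayes rule and Bayes risk can be computed directly by tabulating the posterior expected cost. Here is a minimal NumPy sketch; the prior, channel, and cost matrix are hypothetical numbers chosen purely for illustration:

```python
import numpy as np

# Hypothetical toy problem: binary state, binary observation, three decisions.
P_X = np.array([0.7, 0.3])                    # prior P_X(x)
P_ZgX = np.array([[0.9, 0.1],                 # channel P_{Z|X}(z|x), rows x
                  [0.2, 0.8]])
c = np.array([[0.0, 1.0, 0.4],                # cost c(x,u), rows x, columns u
              [1.0, 0.0, 0.4]])

P_XZ = P_X[:, None] * P_ZgX                   # joint P_{XZ}(x,z)
P_Z = P_XZ.sum(axis=0)                        # marginal P_Z(z)
P_XgZ = P_XZ / P_Z                            # posterior P_{X|Z}(x|z), columns z

post_cost = P_XgZ.T @ c                       # E[c(X,u)|Z=z], shape (|Z|, |U|)
gamma_star = post_cost.argmin(axis=1)         # Bayes decision rule gamma*(z)
J_star = (P_Z * post_cost.min(axis=1)).sum()  # Bayes risk J*
```

Here `gamma_star` picks the cheapest decision under each posterior, and averaging the resulting conditional costs over `P_Z` gives the Bayes risk, exactly as in the last line of the display above.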

Intuitively, {J^*} can be large or small depending on how “informative” the channel (information structure, experiment) {P_{Z|X}} is. The two extreme cases are:

  1. No information: {P_{Z|X} = Q_Z} for some fixed probability law {Q_Z} over {{\mathsf Z}}. In other words, the observation {Z} is independent of the state {X} and, therefore, useless. In this case, the optimal decision can only be based on the prior {P_X}:

    \displaystyle  	J^* = \min_{u \in {\mathsf U}} \mathop{\mathbb E}[c(X,u)].

  2. Perfect information: {{\mathsf Z} = {\mathsf X}} and {Z=X}, i.e., {P_{Z|X=x} = \delta_x} for every {x \in {\mathsf X}}. In this case, {\gamma^*(x) = \mathop{\text{arg\,min}}_u c(x,u)}, resulting in

    \displaystyle  	J^* = \mathop{\mathbb E}\left[\min_{u \in {\mathsf U}} c(X,u)\right].
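The two extremes can be checked against each other numerically: for any channel, the Bayes risk is sandwiched between the perfect-information and no-information costs. A short sketch with a hypothetical prior and cost matrix (illustrative numbers only):

```python
import numpy as np

# Hypothetical prior and cost matrix (rows x, columns u).
P_X = np.array([0.7, 0.3])
c = np.array([[0.0, 1.0, 0.4],
              [1.0, 0.0, 0.4]])

J_no_info = (P_X @ c).min()               # min_u E[c(X,u)]
J_perfect = (P_X * c.min(axis=1)).sum()   # E[min_u c(X,u)]
assert J_perfect <= J_no_info             # moving min inside E can only help
```

The inequality in the last line holds for any prior and cost matrix: minimizing inside the expectation is at least as good as committing to a single decision up front.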

But what happens when {P_{Z|X}} is nestled somewhere between these two extremes? Since the prior {P_X} is fixed, every channel {P_{Z|X}} induces a joint distribution {P_{XZ}} on {{\mathsf X} \times {\mathsf Z}} via {P_{XZ}(x,z) = P_X(x)P_{Z|X}(z|x)}, and so we can compute the mutual information

\displaystyle  I(X;Z) = \sum_{x \in {\mathsf X}} \sum_{z \in {\mathsf Z}} P_X(x)P_{Z|X}(z|x) \log \frac{P_{Z|X}(z|x)}{\sum_{x' \in {\mathsf X}} P_X(x')P_{Z|X}(z|x')}.
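For finite alphabets, this mutual information can be evaluated directly from the formula. A small sketch (the helper name is mine; rates are in bits, and every output symbol is assumed to have positive probability):

```python
import numpy as np

def mutual_information(P_X, P_ZgX):
    """I(X;Z) in bits for a prior P_X(x) and channel rows P_{Z|X}(.|x).
    Assumes P_Z(z) > 0 for every z."""
    P_XZ = P_X[:, None] * P_ZgX        # joint distribution P_{XZ}(x,z)
    P_Z = P_XZ.sum(axis=0)             # output marginal P_Z(z)
    mask = P_XZ > 0                    # skip zero-probability terms
    return float((P_XZ[mask] * np.log2((P_ZgX / P_Z)[mask])).sum())

# The two extremes, for a uniform binary state:
P = np.array([0.5, 0.5])
I_perfect = mutual_information(P, np.eye(2))            # Z = X: equals H(X) = 1 bit
I_none = mutual_information(P, np.array([[0.3, 0.7],
                                         [0.3, 0.7]]))  # Z independent of X: 0 bits
```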

The mutual information can vary between {0} (when {X} and {Z} are independent, i.e., the no-information case) and {H(X)}, the entropy of {X}. Let us fix a number {0 \le R \le H(X)} and consider the set of all channels {P_{Z|X}} such that {I(X;Z) \le R}. Now we can ask the following question: given {R}, which channel {P_{Z|X}} in this set should be preferred over all others? Well, clearly the one that will result in the largest reduction of the minimal expected cost compared to the no-information case! This motivates the following definition:

Definition 1 Given a state space {{\mathsf X}}, a decision space {{\mathsf U}}, an observation space {{\mathsf Z}}, a prior {P_X}, a cost function {c : {\mathsf X} \times {\mathsf U} \rightarrow {\mathbb R}^+}, and a number {0 \le R \le H(X)}, the value of {R} bits of information is the largest possible reduction in expected cost over all observation channels (experiments, information structures) {P_{Z|X}} satisfying the mutual information constraint {I(X;Z) \le R}:

\displaystyle  \begin{array}{rcl}  		V(R; {\mathsf Z}) &=& \sup_{P_{Z|X}\, : \, I(X;Z) \le R} \left\{\min_{u \in {\mathsf U}} \mathop{\mathbb E}[c(X,u)] - \mathop{\mathbb E}\left[ \min_{u \in {\mathsf U}} \mathop{\mathbb E}[c(X,u)|Z]\right] \right\} \\ 		&=& \min_{u \in {\mathsf U}} \mathop{\mathbb E}[c(X,u)] - \inf_{P_{Z|X}\,:\, I(X;Z) \le R}\mathop{\mathbb E}\left[ \min_{u \in {\mathsf U}} \mathop{\mathbb E}[c(X,u)|Z]\right]. 	\end{array}

Now let’s suppose that you can actually choose the observation space {{\mathsf Z}}, not necessarily finite. Then it makes sense to define the overall value of {R} bits of information as the largest value of information attainable with any choice of {{\mathsf Z}}:

\displaystyle  	V(R) = \sup_{{\mathsf Z}} V(R;{\mathsf Z}).

This is where things get interesting. It turns out that the problem of choosing the best “representation space” to maximize the value of information subject to a given mutual information constraint is closely related to the problem of computing the Shannon distortion-rate function. This result is originally due to Stratonovich, and the proof given below follows his. More or less the same result seems to have been rediscovered by Fumio Kanaya and Kenji Nakagawa in 1991.

Theorem 2

\displaystyle  		V(R) = \min_{u \in {\mathsf U}} \mathop{\mathbb E}[c(X,u)] - D(R), 	\ \ \ \ \ (1)

where

\displaystyle  		D(R) = \inf_{P_{U|X} \, : \, I(X; U) \le R} \mathop{\mathbb E} c(X,U) 	\ \ \ \ \ (2)

is the Shannon distortion-rate function of the source {P_X} with distortion function {c}.

Proof: Note that Eq. (1) is equivalent to

\displaystyle  		D(R) = \inf_{{\mathsf Z}} \underbrace{\inf_{P_{Z|X}\, : \, I(X;Z) \le R} \mathop{\mathbb E} \left[ \min_{u \in {\mathsf U}}\mathop{\mathbb E}[c(X,u)|Z]\right]}_{:= D(R; {\mathsf Z})}. 	\ \ \ \ \ (3)

Clearly, {D(R) \le \inf_{\mathsf Z} D(R;{\mathsf Z})}. Indeed, for any {{\mathsf Z}} let {P^*_{Z|X}} attain the infimum in {D(R; {\mathsf Z})}, and let {\gamma^* : {\mathsf Z} \rightarrow {\mathsf U}} be the corresponding Bayes decision strategy. Then

\displaystyle  X \xrightarrow{P^*_{Z|X}} Z^* \xrightarrow{\gamma^*} U

is a Markov chain, so {I(X; U) \le I(X; Z^*) \le R} by the data processing inequality. Hence, the cascade channel

\displaystyle  P_{U|X}(u|x) = \int_{{\mathsf Z}} 1_{\{ \gamma^*(z) = u\}} P(dz|x)

belongs to the feasible set for the problem of computing {D(R)}. Consequently,

\displaystyle  D(R) \le \mathop{\mathbb E} c(X,\gamma^*(Z^*)) = D(R; {\mathsf Z}).

For the reverse inequality {D(R) \ge \inf_{\mathsf Z} D(R;{\mathsf Z})}, let us take {{\mathsf Z} = {\mathcal P}({\mathsf X})}, the space of all probability distributions over {{\mathsf X}}. Let {P^*_{U|X}} achieve the infimum in (2). From the variational formulation of Shannon’s rate-distortion problem, we know that {P^*_{U|X}} has the form

\displaystyle  		P^*_{U|X}(u|x) = \frac{Q_U(u) e^{-\beta c(x,u)}}{\Lambda_\beta(x)} 	\ \ \ \ \ (4)

for some {\beta > 0} and some probability distribution {Q_U \in {\mathcal P}({\mathsf U})}, where {\Lambda_\beta(x)} is the normalization factor. Moreover, denoting by {P^*_{XU}} the induced joint distribution, we can easily see that {P^*_U = Q_U}. Now, solving (4) for {e^{-\beta c(x,u)}} and using Bayes' rule in the form {P^*_{U|X}(u|x)/Q_U(u) = P^*_{X|U}(x|u)/P_X(x)}, we can write

\displaystyle  c(x,u) = -\frac{1}{\beta}\left( \log \Lambda_\beta(x) + \log \frac{P^*_{X|U}(x|u)}{P_X(x)}\right), \ \ \ \ \ (5)

where {P^*_{X|U}} is the corresponding posterior distribution. Let {U^* \in {\mathsf U}} be related to {X \sim P_X} through {P^*_{U|X}}. Defining the Bayes update mapping {\Phi : {\mathsf Z} \times {\mathsf U} \rightarrow {\mathsf Z}}, where {{\mathsf Z} = {\mathcal P}({\mathsf X})}, by

\displaystyle  [\Phi(P, u)](x) = \frac{P(x) P^*_{U|X}(u|x)}{\sum_{x' \in {\mathsf X}} P(x') P^*_{U|X}(u|x')}, \qquad \forall x \in {\mathsf X}

we see that

\displaystyle  X \xrightarrow{P^*_{U|X}} U^* \xrightarrow{\Phi(P_X,\cdot)} Z^*

is a Markov chain. Therefore, {I(X; Z^*) \le I(X; U^*) \le R} by the data processing inequality, which means that the cascade channel

\displaystyle  P_{Z^*|X}(z^*|x) = \sum_{u \in {\mathsf U}} 1_{\{ \Phi(P_X,u) = z^* \}} P^*_{U|X}(u|x)

belongs to the feasible set for the problem of computing {D(R; {\mathsf Z})}. This means that

\displaystyle  \mathop{\mathbb E} \left[ \min_{u \in {\mathsf U}} \mathop{\mathbb E}[c(X,u)|Z^*] \right] \ge \inf_{P_{Z|X} \, : \, I(X; Z) \le R}\mathop{\mathbb E} \left[ \min_{u \in {\mathsf U}} \mathop{\mathbb E}[c(X,u)|Z]\right] = D(R; {\mathsf Z}),

where the posterior distribution {P_{X|Z^*}} is calculated as follows:

\displaystyle  \begin{array}{rcl}  	P_{X|Z^*}(x|z^*) &=& \frac{P_{Z^*|X}(z^*|x) P_X(x)}{\sum_{x' \in {\mathsf X}} P_{Z^*|X}(z^*|x') P_X(x')} \\ 	&=& \frac{\sum_{u \in {\mathsf U}} 1_{\{ \Phi(P_X,u) = z^* \}} P^*_{U|X}(u|x) P_X(x)}{\sum_{x' \in {\mathsf X}}\sum_{u \in {\mathsf U}} 1_{\{ \Phi(P_X,u) = z^* \}} P^*_{U|X}(u|x')P_X(x')} \\ &=&	\frac{\sum_{u \in {\mathsf U}} 1_{\{ \Phi(P_X,u) = z^* \}} P^*_U(u)z^*(x)}{\sum_{x' \in {\mathsf X}}\sum_{u \in {\mathsf U}} 1_{\{ \Phi(P_X,u) = z^* \}}P^*_U(u) z^*(x')} \\ &=& z^*(x). \end{array}

For a particular realization {U^*=u^*}, we have {Z^* = \Phi(P_X,u^*)} with probability one, so (5) gives

\displaystyle  \begin{array}{rcl}  \min_{u \in {\mathsf U}} \mathop{\mathbb E}[c(X,u)|Z^* = P^*_{X|U^*=u^*}] &=& \min_{u \in {\mathsf U}} \mathop{\mathbb E}[c(X,u) | U^* = u^*]\\ &=& -\frac{1}{\beta}\mathop{\mathbb E}[\log \Lambda_\beta(X)|U^*=u^*] + \frac{1}{\beta}\min_{u \in {\mathsf U}}\mathop{\mathbb E} \Bigg[ - \log \frac{P^*_{X|U}(X|u)}{P_X(X)}\Bigg| U^* = u^*\Bigg] \\ &\le& -\frac{1}{\beta}\mathop{\mathbb E}[\log \Lambda_\beta(X)|U^*=u^*] + \frac{1}{\beta}\mathop{\mathbb E} \Bigg[ - \log \frac{P^*_{X|U}(X|u^*)}{P_X(X)}\Bigg| U^*= u^*\Bigg]. \end{array}

Taking the expectation of both sides with respect to {P^*_U}, we obtain

\displaystyle  \mathop{\mathbb E}\left[\min_{u \in {\mathsf U}} \mathop{\mathbb E}[ c(X,u) | U^*] \right] \le - \frac{1}{\beta}\mathop{\mathbb E}[\log \Lambda_\beta(X)] - \frac{1}{\beta}\mathop{\mathbb E}\left[ \log \frac{P^*_{X|U}(X|U^*)}{P_X(X)}\right] = \mathop{\mathbb E} c(X,U^*) = D(R).

Therefore, {D(R) \ge D(R; {\mathsf Z})} for {{\mathsf Z} = {\mathcal P}({\mathsf X})}, and hence {D(R) \ge \inf_{{\mathsf Z}} D(R; {\mathsf Z})}. \Box

Remark 1 The main point of the lengthy proof of the inequality {D(R) \ge \inf_{\mathsf Z} D(R; {\mathsf Z})} was to be able to claim that the observation space {{\mathsf Z} = {\mathcal P}({\mathsf X})} of beliefs concerning the state {X} is essentially optimal for maximizing the value of information (in the sense of maximal reduction of the Bayes risk relative to the no-information case) under a mutual information constraint, i.e.,

\displaystyle  	V(R; {\mathcal P}({\mathsf X})) = \sup_{\mathsf Z} V(R; {\mathsf Z}).

A much shorter, but also much less insightful, proof goes as follows:

\displaystyle  \begin{array}{rcl}  \inf_{\mathsf Z} D(R;{\mathsf Z}) &\le&	D(R; {\mathsf U}) \\ &=& \inf_{P_{U|X}\, : \, I(X; U) \le R} \,\, \inf_{\gamma: {\mathsf U} \rightarrow {\mathsf U}} \mathop{\mathbb E} c(X,\gamma(U)) \\ &\le& \inf_{P_{U|X}\, : \, I(X;U) \le R} \mathop{\mathbb E} c(X,U) = D(R). \end{array}
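The role of the belief map {\Phi} in the proof can be illustrated numerically: passing from {U^*} to the induced posterior glues together symbols that carry the same belief about {X}, and this gluing loses no mutual information. Below is a sketch with a hypothetical channel whose last two columns are proportional, hence induce the same belief (all names and numbers here are my own):

```python
import numpy as np

def mi_bits(P_X, P_YgX):
    """I(X;Y) in bits for prior P_X and channel matrix P_{Y|X} (rows x)."""
    P_XY = P_X[:, None] * P_YgX
    P_Y = P_XY.sum(axis=0)
    mask = P_XY > 0
    return float((P_XY[mask] * np.log2((P_YgX / P_Y)[mask])).sum())

P_X = np.array([0.7, 0.3])
# Hypothetical channel P*_{U|X}: columns u=1 and u=2 are proportional,
# so they induce the same posterior belief Phi(P_X, u).
P_UgX = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])

def belief(u):
    """Phi(P_X, u): the posterior over X after observing U = u."""
    post = P_X * P_UgX[:, u]
    return tuple(np.round(post / post.sum(), 9))

# Group the u's by the belief they induce; this is the cascade channel X -> Z*.
distinct = sorted(set(belief(u) for u in range(3)))
P_ZgX = np.column_stack(
    [sum(P_UgX[:, u] for u in range(3) if belief(u) == b) for b in distinct]
)

# Z* has fewer symbols than U, yet passing to beliefs loses no information:
assert P_ZgX.shape[1] == 2
assert abs(mi_bits(P_X, P_ZgX) - mi_bits(P_X, P_UgX)) < 1e-9
```

This is the finite-alphabet shadow of the statement in Remark 1: the belief space {{\mathcal P}({\mathsf X})} is rich enough that nothing is lost by taking observations to be posterior beliefs.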

Remark 2 Note that we can interpret {{\mathsf Z}} as an auxiliary alphabet and write down the standard rate-distortion function {R(D)}, given by the functional inverse of (2), in a form more reminiscent of the Wyner–Ziv problem:

\displaystyle  	R(D) = \inf I(X;Z),

where the infimum is over all alphabets {{\mathsf Z}} and all auxiliary {{\mathsf Z}}-valued random variables {Z} jointly distributed with {X \sim P_X}, such that

\displaystyle  	\inf_{\gamma : {\mathsf Z} \rightarrow {\mathsf U}} \mathop{\mathbb E} c(X, \gamma(Z)) \le D.
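Numerically, Theorem 2 reduces the value of information to a distortion-rate computation, and for finite alphabets {D(R)} can be traced with the standard Blahut–Arimoto iteration, whose fixed point has exactly the tilted form (4). Here is a sketch, with rates in bits and an illustrative binary Hamming example for which the parametric solution {D = 1/(1+e^\beta)} is known in closed form:

```python
import numpy as np

def blahut_arimoto(P_X, c, beta, n_iter=2000):
    """One point (R, D) on the rate-distortion curve of source P_X with
    distortion matrix c(x,u), at slope parameter beta (R in bits)."""
    n_u = c.shape[1]
    Q = np.full(n_u, 1.0 / n_u)           # output marginal Q_U, uniform init
    K = np.exp(-beta * c)                 # tilting factors e^{-beta c(x,u)}
    for _ in range(n_iter):
        q = Q * K                         # unnormalized P_{U|X}(u|x)
        q /= q.sum(axis=1, keepdims=True)
        Q = P_X @ q                       # re-estimate the output marginal
    P_XU = P_X[:, None] * q
    D = float((P_XU * c).sum())           # E c(X,U)
    mask = P_XU > 0
    R = float((P_XU[mask] * np.log2((q / Q)[mask])).sum())  # I(X;U) in bits
    return R, D

# Hypothetical binary source with Hamming cost.
P_X = np.array([0.7, 0.3])
c = np.array([[0.0, 1.0],
              [1.0, 0.0]])
R, D = blahut_arimoto(P_X, c, beta=1.0)
V = (P_X @ c).min() - D                   # value of R bits, per Theorem 2
```

Sweeping {\beta} from {0} to {\infty} traces the whole curve, from the no-information point ({R=0}, {D = \min_u \mathop{\mathbb E}[c(X,u)]}) down to the perfect-information distortion {\mathop{\mathbb E}[\min_u c(X,u)]}.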

3 thoughts on “Value of information, Bayes risks, and rate-distortion theory”

  1. Thanks, Maxim, for taking the time to write these notes; I find all of them very interesting. (Since you asked me to post my private message on the blog, here it is.)

    I am generally unhappy about mutual information being used as the value of information in such single-shot optimization problems. When one considers Wyner–Ziv or Orlitsky–Roche type infinite-instance computation problems, yes, it means a lot (or when matching applies, such as in LQG and Gaussian-type settings). But in one-shot problems, perhaps:

    – quantization constraints, such as output entropy constraints;
    – in the decentralized setting, Yao-type graphical computational-complexity issues;
    – optimization of information channels viewed as probability spaces;
    – comparing channels using Blackwell-type comparisons (except for some specific cost functions, where one can go beyond “degradedness”)

    seem to be more relevant. Mutual information really makes many things deceptively easy; for one thing, simply adding an additional variable (increasing the cardinality by 1) makes the problem a convex one.

    For example, the entropy-constrained quantization problem is more relevant than the distortion-rate function for a single-shot problem, unless, as you indicated, the dimension is too large.

    For multi-stage problems, the issue is still there, since the future time stages no longer involve the same instance. One interesting direction here is to see how, as time goes to infinity, the average performance changes as a function of the channel (for example, how an invariant distribution changes and how it affects both the optimal policies and the costs). This seems to be a very hard problem which has not been looked at.
