Deadly ninja weapons: Blackwell’s principle of irrelevant information

Having more information when making decisions should always help, it seems. However, there are situations in which this is not the case. Suppose that you observe two pieces of information, {x} and {y}, which you can use to choose an action {u}. Suppose also that, upon choosing {u}, you incur a cost {c(x,u)}. For simplicity let us assume that {x}, {y}, and {u} take values in finite sets {{\mathsf X}}, {{\mathsf Y}}, and {{\mathsf U}}, respectively. Then it is obvious that, no matter which “strategy” for choosing {u} you follow, you cannot do better than {u^*(x) = \displaystyle{\rm arg\,min}_{u \in {\mathsf U}} c(x,u)}. More formally, for any strategy {\gamma : {\mathsf X} \times {\mathsf Y} \rightarrow {\mathsf U}} we have

\displaystyle  c(x,u^*(x)) = \min_{u \in {\mathsf U}} c(x,u) \le c(x,\gamma(x,y)).

Thus, the extra information {y} is irrelevant. Why? Because the cost you incur does not depend on {y} directly, though it may do so through {u}.
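In the finite setting, this domination can even be checked mechanically. Here is a minimal sketch (the sets, costs, and action names are all invented for illustration) that enumerates every strategy {\gamma(x,y)} and verifies that none of them beats {u^*}:

```python
import itertools

# Invented finite decision problem: x in X, side information y in Y,
# action u in U, cost c(x, u) that does not depend on y.
X, Y, U = [0, 1], [0, 1, 2], ["a", "b"]
c = {(0, "a"): 1.0, (0, "b"): 3.0,
     (1, "a"): 2.0, (1, "b"): 0.5}

# The y-blind strategy u*(x) = argmin_u c(x, u).
u_star = {x: min(U, key=lambda u: c[(x, u)]) for x in X}

# No strategy gamma(x, y) can beat u* pointwise: check every one of them.
for table in itertools.product(U, repeat=len(X) * len(Y)):
    gamma = dict(zip(itertools.product(X, Y), table))
    assert all(c[(x, u_star[x])] <= c[(x, gamma[(x, y)])]
               for x in X for y in Y)
print("u* dominates every strategy that also looks at y")
```
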

Interestingly, as David Blackwell showed in 1964 in a three-page paper, this seemingly innocuous argument does not go through when {{\mathsf X}}, {{\mathsf Y}}, and {{\mathsf U}} are Borel subsets of Euclidean spaces, the cost function {c} is bounded and Borel-measurable, and the strategies {\gamma} are required to be measurable as well. However, if {x} and {y} are random variables with a known joint distribution {P}, then {y} is indeed irrelevant for the purpose of minimizing expected cost.

Warning: lots of measure-theoretic noodling below the fold; if that is not your cup of tea, you can just assume that all sets are finite and go with the poor man’s version stated in the first paragraph. Then all the results below will hold.

Formally, Blackwell has proved the following result (which Peter Whittle, in his two-volume book Optimization Over Time, has dubbed “Blackwell’s principle of irrelevant information”):

Theorem 1 (Blackwell’s principle of irrelevant information) Let {{\mathsf X}}, {{\mathsf Y}}, and {{\mathsf U}} be standard Borel spaces, let {P} be any probability distribution on {{\mathsf X} \times {\mathsf Y}}, and let {c : {\mathsf X} \times {\mathsf U} \rightarrow {\mathbb R}} be a bounded Borel-measurable function. Then for any Borel-measurable function {\gamma : {\mathsf X} \times {\mathsf Y} \rightarrow {\mathsf U}} there exists another Borel-measurable function {\gamma^\circ : {\mathsf X} \rightarrow {\mathsf U}}, such that

\displaystyle  	\int_{{\mathsf X}} c(x,\gamma^\circ(x)) P(dx) \le \int_{{\mathsf X} \times {\mathsf Y}} c(x,\gamma(x,y)) P(dx,dy).

Remark 1 This is only one part of Blackwell’s result; the other part says that strategies based only on {x} are nearly optimal with probability one.

Proof: Given {\gamma}, let {P^\gamma(du|x)} be the regular conditional probability distribution of {U} given {X=x}. That is, {P^\gamma(\cdot|x)} is a probability distribution over {{\mathsf U}} for any fixed {x}, and {P^\gamma(A|\cdot)} is a Borel-measurable function on {{\mathsf X}} for any fixed Borel set {A \subseteq {\mathsf U}}. Such a beast always exists because we are working with standard Borel spaces. Then (by Fubini’s theorem)

\displaystyle  	\int_{{\mathsf X} \times {\mathsf Y}} c(x,\gamma(x,y)) P(dx,dy) = \int_{\mathsf X} \left( \int_{\mathsf U} c(x,u) P^\gamma(du|x)\right) P(dx).

Let us denote the inner integral by {h^\gamma(x)}:

\displaystyle  h^\gamma(x) := \int_{\mathsf U} c(x,u) P^\gamma(du|x).

Consider the set

\displaystyle  D := \{ (x, u) \in {\mathsf X} \times {\mathsf U} : c(x,u) \le h^\gamma(x) \}

and its {{\mathsf X}}-sections {D_x := \{ u \in {\mathsf U} : (x,u) \in D\}} for all {x \in {\mathsf X}}. Then

\displaystyle  P^\gamma(D_x | x) > 0, \qquad \forall x \in {\mathsf X} \ \ \ \ \ (1)

for otherwise we would have

\displaystyle  \int_{\mathsf U} c(x,u) P^\gamma(du|x) = \int_{D^c_x} c(x,u) P^\gamma(du|x) > h^\gamma(x) \int_{D^c_x} P^\gamma(du|x) = h^\gamma(x) \int_{\mathsf U} P^\gamma(du|x) = h^\gamma(x),

a contradiction. By a theorem of Blackwell and Czeslaw Ryll-Nardzewski (yet another three-page paper, this time from 1963), (1) implies that there exists a Borel-measurable function {\gamma^\circ : {\mathsf X} \rightarrow {\mathsf U}} whose graph is contained in {D}, i.e., {\{ (x, \gamma^\circ(x)) : x \in {\mathsf X}\} \subseteq D}. In other words, {c(x,\gamma^\circ(x)) \le h^\gamma(x)} for every {x \in {\mathsf X}}. So,

\displaystyle  \int_{\mathsf X} c(x,\gamma^\circ(x)) P(dx) \le \int_{\mathsf X} h^\gamma(x) P(dx) \equiv \int_{{\mathsf X} \times {\mathsf Y}} c(x,\gamma(x,y))P(dx,dy),

and the theorem is proved. \Box
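When all the sets are finite, the construction in the proof can be traced step by step. The sketch below (the joint distribution, costs, and strategy are all randomly invented) forms the conditional law {P^\gamma(du|x)}, the averaged cost {h^\gamma(x)}, and then selects {\gamma^\circ(x)} from the section {D_x}:

```python
import random

random.seed(0)
X, Y, U = range(3), range(4), range(3)

# Invented joint distribution P on X x Y and cost c(x, u).
P = {(x, y): random.random() for x in X for y in Y}
Z = sum(P.values())
P = {k: v / Z for k, v in P.items()}
c = {(x, u): random.random() for x in X for u in U}

# An arbitrary strategy gamma(x, y).
gamma = {(x, y): random.choice(list(U)) for x in X for y in Y}

# P^gamma(u | x): the conditional law of gamma(X, Y) given X = x.
Px = {x: sum(P[(x, y)] for y in Y) for x in X}
cond = {(x, u): sum(P[(x, y)] for y in Y if gamma[(x, y)] == u) / Px[x]
        for x in X for u in U}
h = {x: sum(c[(x, u)] * cond[(x, u)] for u in U) for x in X}

# Pick gamma0(x) in the section D_x = {u : c(x, u) <= h(x)}; the
# cheapest action in the support of cond(. | x) always lies in D_x.
gamma0 = {x: min((u for u in U if cond[(x, u)] > 0),
                 key=lambda u: c[(x, u)]) for x in X}
assert all(c[(x, gamma0[x])] <= h[x] + 1e-12 for x in X)

cost_gamma = sum(P[(x, y)] * c[(x, gamma[(x, y)])] for x in X for y in Y)
cost_gamma0 = sum(Px[x] * c[(x, gamma0[x])] for x in X)
print("x-only strategy is at least as good:", cost_gamma0, "<=", cost_gamma)
```
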

This principle may seem obvious (measure-theoretic incantations notwithstanding), yet its power is actually immense. Consider a multistage decision problem, where the decision at each stage may potentially depend on all previously recorded observations and decisions. Finding an optimal sequence of decisions (i.e., a strategy or a policy) can be very difficult. However, if the cost at each stage involves only a subset of all current data, Blackwell’s principle can be used to show that the decision at that stage can be based only on this relevant subset. Statements of this type are often referred to as structural results, since they imply that, when searching for an optimal policy, there is no loss of optimality if we restrict our attention to a subclass of policies with a certain structure (specifying what the decision at each stage may depend on).

Application: Markov decision problems

As an illustration, I will use Blackwell’s principle to prove that Markov policies are optimal in finite-horizon Markov decision problems. I learned this proof from Aditya Mahajan (see his own notes on the matter), who in turn was inspired by the techniques in a 1979 paper by Hans Witsenhausen on real-time source codes. As it turns out, this idea dates back to Blackwell, although he spelled it out only for a two-step problem. (Incidentally, I discovered Blackwell’s paper purely by accident, while reading another paper in an economics journal.)

The statement that Markov policies are optimal for MDPs is proved in almost every textbook on stochastic control. The proof, however, uses a dynamic programming argument and is far from intuitive. On the other hand, the proof based on Blackwell’s result proceeds from first principles; in fact, the structural result can be presented first, and then a dynamic programming solution can be discussed. In my opinion, this way is more pedagogical. Another advantage of separating the derivation of the structural result from the derivation of the dynamic program is that we can apply the structural result even in settings where a dynamic program is difficult to obtain, e.g., decentralized systems (thanks to Aditya for this crisp formulation).

Anyway, let’s formulate the problem. To that end, I will use the stochastic kernel formalism, which I have described in an earlier post.

Markov decision problem (MDP). We have a stochastic control problem with finite time horizon {T}. At time {t}, the state of the system being controlled (the plant) is {X_t \in {\mathsf X}}, which we assume to be fully observed at the controller, and the control is {U_t \in {\mathsf U}}. Both {{\mathsf X}} and {{\mathsf U}} are assumed to be standard Borel spaces. Then:

  • The causal ordering is {X_1,U_1,\ldots,X_T,U_T}
  • The system variables are {X_1,\ldots,X_T}, the decision variables are {U_1,\ldots,U_T}
  • The system kernels are {(P_{X_t|X^{t-1},U^{t-1}})^T_{t=1}}, where we assume that

    \displaystyle  	(X^{t-2},U^{t-2}) \rightarrow (X_{t-1},U_{t-1}) \rightarrow X_t

    are Markov chains. In other words, the state at time {t} depends only on the most recent state {X_{t-1}} and the most recently applied control {U_{t-1}}.

  • The control at time {t}, {U_t}, can depend on all past observations and controls.

A control policy {\gamma} is a sequence {\{ \gamma_t \}^T_{t=1}} of (measurable) mappings {\gamma_t : {\mathsf X}^{t} \times {\mathsf U}^{t-1} \rightarrow {\mathsf U}}, such that {U_t = \gamma_t(X^t,U^{t-1})}. With each stage {t=1,\ldots,T} we associate a bounded cost function {c_t : {\mathsf X} \times {\mathsf U} \rightarrow {\mathbb R}}. The total expected cost of {\gamma} is given by

\displaystyle  J(\gamma) := \mathop{\mathbb E}^\gamma \left\{ \sum^T_{t=1} c_t(X_t,U_t) \right\} = \mathop{\mathbb E}^\gamma \left\{ \sum^T_{t=1} c_t(X_t,\gamma_t(X^t,U^{t-1})) \right\},

where the superscript {\gamma} indicates that the expectation is taken w.r.t. the probability measure constructed by interconnecting the system kernels {\{P_{X_t|X_{t-1},U_{t-1}}\}^T_{t=1}} with the deterministic policy {\gamma}. The goal is to minimize {J(\gamma)} over all admissible policies.
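To make the interconnection of the system kernels with a policy concrete, here is a toy finite-horizon computation of {J(\gamma)} that sums over all state trajectories. The dynamics, costs, and the two policies below are invented purely for illustration:

```python
import itertools

# A toy finite MDP showing how the measure E^gamma is built by
# interconnecting the system kernels with a (possibly history-dependent)
# policy U_t = gamma_t(X^t, U^{t-1}).  All numbers are invented.
X, U, T = [0, 1], [0, 1], 3
P1 = {0: 0.6, 1: 0.4}                       # initial state distribution

def K(x_next, x, u):
    """Invented system kernel P(X_t = x_next | X_{t-1} = x, U_{t-1} = u)."""
    return 0.8 if x_next == (x ^ u) else 0.2

def cost(t, x, u):
    """Invented per-stage cost c_t(x, u)."""
    return float(x != u)

def J(policy):
    """Expected total cost, summing over all state trajectories."""
    total = 0.0
    for traj in itertools.product(X, repeat=T):
        prob, hist_x, hist_u, csum = P1[traj[0]], [], [], 0.0
        for t, x in enumerate(traj):
            if t > 0:
                prob *= K(x, hist_x[-1], hist_u[-1])
            hist_x.append(x)
            u = policy(t, tuple(hist_x), tuple(hist_u))  # U_t = gamma_t(X^t, U^{t-1})
            hist_u.append(u)
            csum += cost(t, x, u)
        total += prob * csum
    return total

# A history-dependent policy versus a Markov one.
j_hist = J(lambda t, xs, us: xs[-1] ^ (us[-1] if us else 0))
j_markov = J(lambda t, xs, us: xs[-1])  # matches the state, so zero cost here
print(j_hist, j_markov)
```
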

At first glance, this is a very hard problem. The amount of data available to the controller grows with time, and an optimal control policy may be arbitrarily complicated. However, the structural result for MDPs states that the control at each stage can be chosen only based on the most recent state:

Theorem 2 (Structural result for MDPs) There is no loss of optimality in restricting attention to Markov policies of the form {U_t = \gamma_t(X_t)}, {t=1,\ldots,T}.

Preparation: two lemmas

In preparation for the proof, I will state and prove two lemmas, both due essentially to Witsenhausen. The first lemma (which was also proved by Blackwell in his 1964 paper) states that the structural result holds for {T=2}; the second lemma states that when {T=3} and the last policy mapping, i.e., {\gamma_3}, is Markov (that is, {U_3 = \gamma_3(X_3)}), then there is no loss of optimality if {\gamma_2} is chosen to be Markov as well. Once these two lemmas are proved, the result for arbitrary {T} is easy to prove by aggregating time. So, without further ado:

Lemma 3 (Two-stage lemma) If {T=2}, then for any policy {\gamma = (\gamma_1,\gamma_2)} there exists a mapping {\gamma^\circ_2 : {\mathsf X} \rightarrow {\mathsf U}} with {U_2 = \gamma^\circ_2(X_2)}, such that {J(\gamma_1,\gamma^\circ_2) \le J(\gamma_1,\gamma_2)}.

Remark 2 Note that {\gamma_1} is a fortiori Markov, since it can only depend on {X_1}.

Proof: Let us write out {J(\gamma_1,\gamma_2)}:

\displaystyle  	J(\gamma_1,\gamma_2) = \mathop{\mathbb E} c_1(X_1,\gamma_1(X_1)) + \mathop{\mathbb E} c_2(X_2,\gamma_2(X^2,U_1)).

The first term is not affected by {\gamma_2}. Now apply Blackwell’s principle to the second term, with {x = X_2} and {y = (X_1,U_1)}, to conclude that there exists some {\gamma^\circ_2 : {\mathsf X} \rightarrow {\mathsf U}}, such that {\mathop{\mathbb E} c_2(X_2,\gamma^\circ_2(X_2)) \le \mathop{\mathbb E} c_2(X_2,\gamma_2(X^2,U_1))}. The claimed result follows. \Box

Lemma 4 (Three-stage lemma) Let {T=3} and consider a policy {\gamma = (\gamma_1,\gamma_2,\gamma_3)}, where {\gamma_3} is Markov, i.e., {U_3 = \gamma_3(X_3)}. Then there exists a mapping {\gamma^\circ_2 : {\mathsf X} \rightarrow {\mathsf U}} with {U_2 = \gamma^\circ_2(X_2)}, such that {J(\gamma_1,\gamma^\circ_2,\gamma_3) \le J(\gamma_1,\gamma_2,\gamma_3)}.

Proof: Again, let us write out {J(\gamma_1,\gamma_2,\gamma_3)}:

\displaystyle  	J(\gamma_1,\gamma_2,\gamma_3) = \mathop{\mathbb E} c_1(X_1, \gamma_1(X_1)) + \mathop{\mathbb E} c_2(X_2, \gamma_2(X_1,X_2,U_1)) + \mathop{\mathbb E} c_3(X_3,\gamma_3(X_3)).

The first term is not affected by {\gamma_2} or {\gamma_3}; thus, let us focus on the second and the third terms. Since {X_3} depends on {X_2} and {U_2}, both terms are affected by {\gamma_2}. Let us inspect the third term:

\displaystyle  \begin{array}{rcl}  	\mathop{\mathbb E} c_3(X_3, \gamma_3(X_3)) &=& \mathop{\mathbb E} \left\{ \mathop{\mathbb E} \{ c_3(X_3,\gamma_3(X_3)) | X_1, X_2, U_1, U_2 \}\right\} \qquad \text{(by law of iterated expectation)}\\ 	&=& \mathop{\mathbb E} \left\{ \mathop{\mathbb E} \{ c_3(X_3, \gamma_3(X_3)) | X_2, U_2 \} \right\} \qquad \text{(by Markov property)} \\ 	&\equiv& \mathop{\mathbb E} h(X_2,U_2), \end{array}

where we have denoted by {h(X_2,U_2)} the conditional expectation {\mathop{\mathbb E}\{c_3(X_3,\gamma_3(X_3))|X_2, U_2 \}}. Note that the functional form of {h} does not depend on {\gamma_2}:

\displaystyle  \begin{array}{rcl}  h(x,u) &=& \mathop{\mathbb E} \{c_3(X_3,\gamma_3(X_3)) | X_2 = x, U_2 = u \} \\ &=& \int c_3(x_3,\gamma_3(x_3)) P(dx_3|X_2=x,U_2=u), \end{array}

where the system kernel {P_{X_3|X_2,U_2}(\cdot|X_2=x,U_2=u)} does not depend on the choice of {\gamma_2}, but only on the realized values {x} and {u} of {X_2} and {U_2} (Witsenhausen refers to this as policy independence of conditional expectations). Moreover, the function {h} is measurable and bounded. Hence, defining the function {c(x,u) := c_2(x,u) + h(x,u)}, we can write

\displaystyle  \mathop{\mathbb E}^\gamma \{c_2(X_2,U_2) + c_3(X_3,U_3)\} = \mathop{\mathbb E} c(X_2,\gamma_2(X_1,X_2,U_1)).

Applying Blackwell’s principle, we see that there exists a measurable {\gamma^\circ_2 : {\mathsf X} \rightarrow {\mathsf U}} with {U_2 = \gamma^\circ_2(X_2)}, such that

\displaystyle  \begin{array}{rcl}  \mathop{\mathbb E} c(X_2,\gamma^\circ_2(X_2)) &\le& \mathop{\mathbb E} c(X_2,\gamma_2(X_1,X_2,U_1)). \end{array}


Therefore,

\displaystyle  \begin{array}{rcl}  	J(\gamma_1,\gamma^\circ_2,\gamma_3) &=& \mathop{\mathbb E} c_1(X_1,\gamma_1(X_1)) + \mathop{\mathbb E} c(X_2,\gamma^\circ_2(X_2)) \\ 	&\le& \mathop{\mathbb E} c_1(X_1,\gamma_1(X_1)) + \mathop{\mathbb E} c(X_2, \gamma_2(X_1,X_2,U_1)) \\ 	&=& \mathop{\mathbb E} c_1(X_1,\gamma_1(X_1)) + \mathop{\mathbb E} c_2(X_2,\gamma_2(X_1,X_2,U_1)) + \mathop{\mathbb E} h(X_2,\gamma_2(X_1,X_2,U_1)) \\ 	&=& \mathop{\mathbb E}^\gamma\left\{ c_1(X_1,\gamma_1(X_1)) + c_2(X_2,\gamma_2(X_1,X_2,U_1)) + c_3(X_3,\gamma_3(X_3))\right\} \\ 	&\equiv& J(\gamma_1,\gamma_2,\gamma_3), \end{array}

and the lemma is proved. \Box

Proof of the structural result for MDPs

Now that these preparatory steps are out of the way, the actual proof of the structural result is extremely simple.

Consider an arbitrary policy {\gamma = (\gamma_1,\ldots,\gamma_T)}. First let us lump together the stages {t=1} through {t=T-1} into one “super-stage.” Then we have a two-stage problem, so by the two-stage lemma we can assume that {\gamma_T} is Markov: {U_T = \gamma_T(X_T)}.

Now consider {t=T-1}. Let us lump {t=1} through {t=T-2} into a super-stage, followed by {t=T-1} and then by {t=T}. This is a three-stage problem now, where the policy in the third stage is Markov. By the three-stage lemma, we can take {\gamma_{T-1}} to be Markov: {U_{T-1} = \gamma_{T-1}(X_{T-1})}. We now proceed inductively: for {t=T-k}, lump {t=1} through {t=T-k-1} into one super-stage, followed by {t=T-k}, followed by another super-stage stretching from {t=T-k+1} to {t=T}. Again, we have a three-stage problem, where the third-stage policy is Markov (by the inductive assumption). Hence, {\gamma_{T-k}} can be taken Markov as well. The proof is concluded.
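The structural result can also be sanity-checked by brute force on a tiny two-stage example (all numbers below are invented): enumerate every history-dependent policy, enumerate the Markov ones, and compare the two minima:

```python
import itertools

# Brute-force check of Theorem 2 on a tiny two-stage MDP (invented data).
X, U = [0, 1], [0, 1]
P1 = {0: 0.3, 1: 0.7}
# Kernel P(X_2 = x2 | X_1 = x1, U_1 = u1).
K = {(x2, x1, u1): p for (x1, u1) in itertools.product(X, U)
     for x2, p in zip(X, ([0.9, 0.1] if x1 == u1 else [0.25, 0.75]))}
c1 = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.6, (1, 1): 1.4}
c2 = {(0, 0): 0.5, (0, 1): 2.0, (1, 0): 1.1, (1, 1): 0.3}

def J(g1, g2):
    """g1: x1 -> u1; g2: (x1, x2) -> u2 (U_1 is itself a function of X_1)."""
    total = 0.0
    for x1 in X:
        u1 = g1[x1]
        for x2 in X:
            total += P1[x1] * K[(x2, x1, u1)] * (
                c1[(x1, u1)] + c2[(x2, g2[(x1, x2)])])
    return total

all_g1 = [dict(zip(X, t)) for t in itertools.product(U, repeat=2)]
all_g2 = [dict(zip(itertools.product(X, X), t))
          for t in itertools.product(U, repeat=4)]
markov_g2 = [{(x1, x2): m[x2] for x1 in X for x2 in X}
             for m in (dict(zip(X, t)) for t in itertools.product(U, repeat=2))]

best_all = min(J(g1, g2) for g1 in all_g1 for g2 in all_g2)
best_markov = min(J(g1, g2) for g1 in all_g1 for g2 in markov_g2)
print("optimal cost, all policies vs Markov:", best_all, best_markov)
```

Of course, this only confirms the theorem on one instance; the point of the proof above is that no enumeration is ever needed.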

Blackwell’s principle FTW

This is just the tip of the iceberg. As Aditya and Sekhar Tatikonda show in their work, one can use factor graphs to represent arbitrary interconnections of system kernels, policies, and cost functions, and then systematically apply Blackwell’s principle to eliminate dependencies on irrelevant data.

4 thoughts on “Deadly ninja weapons: Blackwell’s principle of irrelevant information”

  1. I am honored at being mentioned in the same post as Blackwell. Thank you.

The same technique (with some measure-theoretic finesse) can also be used to prove the structural result for POMDPs. Such a derivation does not require one to establish that the belief states form an MDP. The advantage is the same as in this case: a clearer separation between structural results and DP, and easier application to decentralized systems.

    1. Well, I give credit where credit is due.

      As for POMDPs, I understand the argument would go (roughly) something like this: you can define a cost \tilde{c}_t(\pi, u) = \int c_t(x,u) \pi(dx) for any prior \pi on your state space. And since \pi_t(dx_t|y^t,u^{t-1}) can be computed from y^t,u^{t-1} according to a policy-independent mapping, you can always write {\bf E}c_t(X_t,U_t) = {\bf E}\tilde{c}_t(\pi_t,\gamma_t(Y^t,U^{t-1},\pi_t)) and then use Blackwell. Of course, the belief states must live in a standard Borel space, but if state, observation, and action spaces are standard Borel, then you can topologize the space of Borel measures on the state space using the weak topology and get another standard Borel space.

1. Compare this derivation with the derivation in Striebel, “Sufficient statistics in optimum control of stochastic systems”.

An intriguing fact is that no assumption of a classical information structure is required. What’s the catch? We have not asked why such a structural result is useful. After all, the Borel space on $Y^t \times U^{t-1}$ is isomorphic to the Borel space of probability measures. I have my own opinions on this, but I have not seen this point discussed in any book or paper. Have you?

        I just realized that Blackwell shows that for a one step system, the structural result is $\epsilon$-optimal on a set of measure 1 (your Remark 1). I wonder if that can be used to show that Markov policies are sample path optimal on a set of measure one (a bit tricky, as the choice of total measure depends on the policy).
