Having more information when making decisions should always help, it seems. However, there are situations in which this is not the case. Suppose that you observe two pieces of information, and , which you can use to choose an action . Suppose also that, upon choosing , you incur a cost . For simplicity let us assume that , , and take values in finite sets , , and , respectively. Then it is obvious that, no matter which “strategy” for choosing you follow, you cannot do better than . More formally, for any strategy we have
Thus, the extra information is irrelevant. Why? Because the cost you incur does not depend on directly, though it may do so through .
Interestingly, as David Blackwell has shown in 1964 in a three-page paper, this seemingly innocuous argument does not go through when , , and are Borel subsets of Euclidean spaces, the cost function is bounded and Borel-measurable, and the strategies are required to be measurable as well. However, if and are random variables with a known joint distribution , then is indeed irrelevant for the purpose of minimizing expected cost.
Warning: lots of measure-theoretic noodling below the fold; if that is not your cup of tea, you can just assume that all sets are finite and go with the poor man’s version stated in the first paragraph. Then all the results below will hold.
Formally, Blackwell has proved the following result (which Peter Whittle, in his two-volume book Optimization Over Time, has dubbed “Blackwell’s principle of irrelevant information”):
Theorem 1 (Blackwell’s principle of irrelevant information) Let , , and be standard Borel spaces, let be any probability distribution on , and let be a bounded Borel-measurable function. Then for any Borel-measurable function there exists another Borel-measurable function , such that
Remark 1 This is only one part of Blackwell’s result; the other part says that strategies based only on are nearly optimal with probability one.
Proof: Given , let be the regular conditional probability distribution of given . That is, is a probability distribution over for any fixed , and is a Borel-measurable function on for any fixed Borel set . Such a beast always exists because we are working with standard Borel spaces. Then (by Fubini’s theorem)
Let us denote the inner integral by :
Consider the set
for otherwise we would have
a contradiction. By a theorem of Blackwell and Czeslaw Ryll-Nardzewski (yet another three-page paper, this time from 1963), (1) implies that there exists a Borel-measurable function , whose graph is contained in , i.e., . In other words, for every . So,
and the theorem is proved.
This principle may seem obvious (measure-theoretic incantations notwithstanding), yet its power is actually immense. Consider a multistage decision problem, where the decision at each stage may potentially depend on all previously recorded observations and decisions. Finding an optimal sequence of decisions (i.e., a strategy or a policy) can be very difficult. However, if the cost at each stage involves only a subset of all current data, Blackwell’s principle can be used to show that the decision at that stage can be based only on this relevant subset. Statement of this type are often referred to as structural results, since they imply that, when searching for an optimal policy, there is no loss of optimality if we restrict our attention to a subclass of policies with a certain structure (specifying what the decision at each stage may depend on).
Application: Markov decision problems
As an illustration, I will use Blackwell’s principle to prove that Markov policies are optimal in finite-horizon Markov decision problems. I learned this proof from Aditya Mahajan (see his own notes on the matter), who in turn was inspired by the techniques in a 1979 paper by Hans Witsenhausen on real-time source codes. As it turned out, this idea dates back to Blackwell, although he had spelled it out only for a two-step problem. (Incidentally, I discovered Blackwell’s paper purely by accident, while reading another paper in an economics journal.)
The statement that Markov policies are optimal for MDPs is proved in almost every textbook on stochastic control. The proof, however, uses a dynamic programming argument and is far from intuitive. On the other hand, the proof based on Blackwell’s result proceeds from first principles; in fact, the structural result can be presented first, and then a dynamic programming solution can be discussed. In my opinion, this way is more pedagogical. Another advantage of separating the derivation of the structural result from the derivation of the dynamic program is that we can apply the structural result even in settings where a dynamic program is difficult to obtain, i.e., decentralized systems (thanks to Aditya for this crisp formulation).
Anyway, let’s formulate the problem. To that end, I will use the stochastic kernel formalism, which I have described in an earlier post.
Markov decision problem (MDP). We have a stochastic control problem with finite time horizon . At time , the state of the system being controlled (the plant) is , which we assume to be fully observed at the controller, and the control is . Both and are assumed to be standard Borel spaces. Then:
- The causal ordering is
- The system variables are , the decision variables are
- The system kernels are , where we assume that
are Markov chains. In other words, the state at time depends only on the most recent state and the most recently applied control .
- The control at time , , can depend on all past observations and controls.
A control policy is a sequence of (measurable) mappings , such that . With each stage we associate a bounded cost function . The total expected cost of is given by
where the superscript indicates that the expectation is taken w.r.t. the probability measure constructed by interconnecting the system kernels with the deterministic policy . The goal is to minimize over all admissible policies.
At first glance, this is a very hard problem. The amount of data available to the controller grows with time, and an optimal control policy may be arbitrarily complicated. However, the structural result for MDPs states that the control at each stage can be chosen only based on the most recent state:
Theorem 2 (Structural result for MDPs) There is no loss of optimality in restricting attention to Markov policies of the form , .
Preparation: two lemmas
In preparation for the proof, I will state and prove two lemmas, both due essentially to Witsenhausen. The first lemma (which was also proved by Blackwell in his 1964 paper) states that the structural result holds for ; the second lemma states that when and the last policy mapping, i.e., , is Markov (that is, ), then there is no loss of optimality if is chosen to be Markov as well. Once these two lemmas are proved, the result for arbitrary is easy to prove by aggregating time. So, without further ado:
Lemma 3 (Two-stage lemma) If , then for any policy there exists a mapping with , such that .
Remark 2 Note that is a fortiori Markov, since it can only depend on .
Proof: Let us write out :
The first term is not affected by . Now apply Blackwell’s principle to the second term to conclude that there exists some , such that . The claimed result follows.
Lemma 4 (Three-stage lemma) Let and consider a policy , where is Markov, i.e., . Then there exists a mapping with , such that .
Proof: Again, let us write out :
The first term is not affected by or ; thus, let us focus on the second and the third terms. Since depends on and , both terms are affected by . Let us inspect the third term:
where we have denoted by the conditional expectation . Note that the functional form of does not depend on :
where the system kernel does not depend on the choice of , but only on the realized values and of and (Witsenhausen refers to this as policy independence of conditional expectations). Moreover, the function is measurable and bounded. Hence, defining the function , we can write
Applying Blackwell’s principle, we see that there exists a measurable with , such that
and the lemma is proved.
Proof of the structural result for MDPs
Now that these preparatory steps are out of the way, the actual proof of the structural result is extremely simple.
Consider an arbitrary policy . First let us lump together the stages through into one “super-stage.” Then we have a two-stage problem, so by the two-stage lemma we can assume that is Markov: .
Now consider . Let us lump through into a super-stage, followed by and then by . This is a three-stage problem now, where the policy in the third stage is Markov. By the three-stage lemma, we can take to be Markov: . We now proceed inductively: for , lump through into one super-stage, followed by , followed by another super-stage stretching from to . Again, we have a three-stage problem, where the third-stage policy is Markov (by the inductive assumption). Hence, can be taken Markov as well. The proof is concluded.
Blackwell’s principle FTW
This is just the tip of the iceberg. As Aditya and Sekhar Tatikonda show in their work, one can use factor graphs to represent arbitrary interconnections of system kernels, policies, and cost functions, and then systematically apply Blackwell’s principle to eliminate dependencies on irrelevant data.