**Update:** I fixed a couple of broken links.

I want to write down some thoughts inspired by Chernoff’s memo on backward induction that may be relevant to feedback information theory and networked control. Some of these points were brought up in discussions with Serdar Yüksel two years ago.

Imagine an agent who must make a series of decisions (or take a series of actions) in a stochastic environment, where each subsequent action depends on the currently available information, and the information given to the agent depends on the underlying state of the environment and on the history of actions taken so far. Suppose that the number of actions to be taken is finite, and that, once all the actions have been taken, the agent incurs a loss that depends on the state, on all currently available information, and on the action sequence. Then Bellman’s principle of optimality says the following:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

This implies, in turn, that an optimal policy can be found through *backward induction*: Suppose that all the actions but the final one have been taken. Then the final action is the one that minimizes the conditional expectation of the loss given all available information. Once the optimal final action is determined, the number of actions to be taken is reduced by one, and the agent applies the same reasoning to select the next-to-last action, etc.
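As a toy illustration (not from Chernoff's memo), here is what backward induction looks like on a small finite-horizon problem with full state observation; the two-state, two-action model and all the numbers are made up for the sketch:

```python
import numpy as np

# Hypothetical toy setup: 2 states, 2 actions, known transition kernels,
# and a loss incurred only at the end (in the terminal state).
T = 3                                   # number of decisions
P = np.array([[[0.9, 0.1],              # P[u, x, x'] = Prob(next state x' | state x, action u)
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
terminal_loss = np.array([0.0, 1.0])    # loss incurred in each final state

# Backward induction: V[x] is the optimal expected loss-to-go from state x,
# starting at the horizon and working backwards; policy[t][x] is the minimizer.
V = terminal_loss.copy()
policy = []
for t in reversed(range(T)):
    Q = np.array([P[u] @ V for u in range(2)])  # Q[u, x] = E[V(next state) | x, u]
    policy.append(Q.argmin(axis=0))             # best action in each state at stage t
    V = Q.min(axis=0)                           # optimal loss-to-go at stage t
policy.reverse()
print(policy, V)
```

The point of the sketch is only the recursion itself: the last action is chosen to minimize the conditional expected loss, and each earlier stage reuses the already-computed loss-to-go.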

Backward induction is valid whenever the agent has *perfect recall* and can store all past observations and actions. In that case, the policy is irrelevant, assuming the agent has *system knowledge*, i.e., knows how the state and the current action will affect subsequent observations. Backward induction is also valid if, at each moment, the agent knows the part of the policy used up to that moment — i.e., has system knowledge and *policy knowledge*. However, it is not necessarily applicable if the agent is forgetful in such a way that he can no longer compute the conditional distribution of the state given the available information.

Chernoff then goes on to present some sufficient conditions under which backward induction remains applicable even when the agent has faulty memory. In a nutshell, his conditions ensure that the relevant conditional distributions are independent of the policy (cf. Witsenhausen), and that at each point the agent can use the current information to compute these conditional distributions.

I will not discuss these results any further. Instead, I want to focus on one example Chernoff uses to show how system and policy knowledge (or lack thereof) affects the agent’s ability to perform backward induction. My goal is to connect this example to *semantics* of control problems: what is an action? what constitutes system and policy knowledge?

Here is the example. Let $X$ be a random variable taking values in $\mathbb{R}$ according to a known distribution $P$. Let the action $U$ be determined by the output of a one-bit quantizer with threshold $t$ and quantizer levels at $0$ and $1$, i.e.,

$$U = \begin{cases} 0, & X < t \\ 1, & X \ge t \end{cases}$$

Suppose that the agent “forgets” $X$ and only sees $U$. The conditional distribution of $X$ given $U$ depends on the threshold $t$. If the agent somehow forgets $t$, then $U$ is completely useless. (Incidentally, even though Chernoff does not point this out, the same considerations apply even if the quantizer has $M$ levels at, say, $1, 2, \ldots, M$, but the agent forgets where the boundaries of the quantization intervals were. Then the knowledge of $U$ is useless regardless of how large $M$ is!)
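To make the dependence on the forgotten threshold concrete, here is a small numerical sketch (my own, with a uniform source standing in for the known distribution): the posterior of $X$ given $U$ shifts with $t$, so an agent who no longer knows $t$ cannot say which posterior applies.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)   # X ~ Uniform[0,1], a stand-in for the known distribution P

def cond_mean_given_u(t):
    """Empirical E[X | U] for the one-bit quantizer U = 1{X >= t} with threshold t."""
    u = (x >= t).astype(int)
    return x[u == 0].mean(), x[u == 1].mean()

# The conditional distribution of X given U depends on the threshold:
print(cond_mean_given_u(0.3))   # roughly (0.15, 0.65)
print(cond_mean_given_u(0.7))   # roughly (0.35, 0.85)
# An agent who sees only U but has forgotten t cannot tell which of these
# posteriors is in force, so the bit U by itself is useless for inferring X.
```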

There are two ways of thinking about this example. One is to simply blame the lack of policy knowledge (i.e., the agent does not know the threshold $t$). This is essentially what Chernoff does. Another way, however, is to regard the quantizer itself, rather than its output $U$, as the action. This perspective makes sense in coding/decoding problems involving feedback, as demonstrated, for instance, here, here, here, and here. The moral of the story is that it is often more appropriate to say that we take an action whenever we choose a source or channel code; the *output* of the code can then be viewed as part of the resulting state. From this perspective, the trouble arises not because the agent forgot the policy that yielded the output $U$, but rather because he forgot the *action* that resulted in $U$ *and* cannot reconstruct it based on the available information.

What does this change of perspective give us? One clearly beneficial outcome is that it may transform a sequential decision problem with long-term memory into a *Markov decision problem*. This will be the case if we can prove that the choice of an appropriate encoder or decoder can be based on the most recent state, provided the latter is specified correctly. Once we identify this Markov structure, we can discard irrelevant information in a principled manner and then apply backward induction. This, of course, requires the agent to remember the relevant codebooks (or, at the very least, to remember enough to be able to reconstruct them), but that’s another story.
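As a very rough sketch of the "code as action" viewpoint (my own toy example, again with a uniform source): the agent's action is the choice of a one-bit quantizer, the resulting state is the pair (threshold, output bit), and given that state the optimal reconstruction is well defined with no policy ambiguity left.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=100_000)   # hypothetical source distribution

def distortion(t):
    """Expected squared error when the action is 'use the one-bit quantizer
    with threshold t' and the decoder sees the state (t, U)."""
    u = x >= t
    # Since the decoder knows (t, U), it can apply the correct posterior mean:
    xhat = np.where(u, x[u].mean(), x[~u].mean())
    return ((x - xhat) ** 2).mean()

# Backward induction over actions = a search over candidate quantizers:
candidates = np.linspace(0.1, 0.9, 17)
best_t = min(candidates, key=distortion)
print(best_t)   # analytically, the best one-bit threshold for Uniform[0,1] is t = 0.5
```

Here the dependence on the policy has been folded into the state, so the conditional distributions the agent needs are policy-independent, which is exactly the structure that makes backward induction go through.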

Reading this post after a long time, I cannot understand what you mean by “a sequential decision problem with long-term memory” in the last paragraph: long-term memory at the controller (as in perfect recall), long-term memory in the dynamics (as in higher-order controlled Markov processes), or something completely different?

Coming back to this post after a long time, I can’t remember exactly what I meant, but most likely it was both (or either) of these.