## Stochastic kernels of the world, interconnect!

This is an expository post on a particularly nice way of thinking about stochastic systems in information theory, control, statistical learning and inference, experimental design, etc. I will closely follow the exposition in Chapter 2 of Sekhar Tatikonda’s doctoral thesis, which in turn builds on ideas articulated by Hans Witsenhausen and Roland Dobrushin.

A general stochastic system consists of smaller interconnected subsystems that influence one another’s behavior. For example, in a communications setting we have sources, encoders, channels and decoders; in control we have plants, sensors, and actuators. The formalism I am about to describe will allow us to treat all these different components on an equal footing.

We will work in the Bayesian framework, by which I mean the following: if we attach a variable to the relevant parts of the overall system, then the system’s behavior can be described by a probability measure over the variables. Typically, there will be constraints on the form this probability measure may take. Some of these constraints are forced on us by the type of components we have at our disposal. For example, in a communication system the source is described by a stochastic process with a given process distribution that must be communicated over a noisy channel with a fixed transition law. These constraints, then, specify conditional probability measures between some of the variables. The task of the system designer is to *complete* this partial specification to a global probability measure.

This is where additional constraints come in. Depending on the system, the flow of information may be restricted in various ways. For example, in a partially observed Markov decision process (POMDP), the sensors are noisy, so the true system state is hidden from the controller. There may also be memory limitations: for example, the controller may use only a fixed number of bits to store information about past observations and controls. The constraints of this form specify what’s called an *information structure* associated with the system: who knows what and when they know it. Finally, we must specify what they can do with what they know. One way to accomplish this is to impose an arrow of time by prescribing the order in which the different subsystems may act. We will call this the *causal ordering* associated with the system.

**1. The basic definition**

We now formalize the above discussion. First of all, let us fix a measurable space $(\Omega, \mathcal{F})$. Following the time-honored practice, we will use this space as a dump for all the randomness and chance effects. Let’s suppose that there are $N$ variables associated with the system of interest, which we will describe by (measurable) functions $X_i : \Omega \to \mathsf{X}_i$, $i = 1,\dots,N$, where for simplicity we will assume that each $\mathsf{X}_i$ is a finite set. Note that at this point the $X_i$’s are not yet *random variables* since we have not defined a measure on $(\Omega, \mathcal{F})$. We now describe how this is to be done.

First of all, let us fix a time ordering on the variables $X_1,\dots,X_N$: $X_1$ precedes $X_2$, which precedes $X_3$, and so on. With the assumed ordering, any probability measure $P$ can be factored as follows:

$$P(x_1,\dots,x_N) = \prod_{i=1}^N P(x_i \mid x^{i-1}), \qquad (1)$$

where $x^{i-1} = (x_1,\dots,x_{i-1})$ denotes the values of all variables preceding $X_i$.

This factorization is made up of two types of factors (stochastic kernels): those specified by the system description and those chosen by the designer. In order to make use of this distinction, we partition the index set $\{1,\dots,N\}$ into two disjoint subsets:

- The *system set* $\mathcal{S} \subseteq \{1,\dots,N\}$
- The *decision set* $\mathcal{D} = \{1,\dots,N\} \setminus \mathcal{S}$.

The *system specification* then is a collection of stochastic kernels $\{P(x_i \mid x^{i-1}) : i \in \mathcal{S}\}$. These prescribe the behavior of each system variable $X_i$, $i \in \mathcal{S}$, given the values of the preceding system and decision variables. The remaining factors, $\{P(x_i \mid x^{i-1}) : i \in \mathcal{D}\}$, describe the behavior of the decision variables $X_i$, $i \in \mathcal{D}$, given the values of the preceding system and decision variables. These kernels are to be chosen by the system’s designer. We also say that the designer uses these factors to *interconnect* the system stochastic kernels in order to form an overall probability measure $P$.

Now, this is where the information structure comes to the fore. To each index $i \in \mathcal{D}$ we associate a set $\mathcal{I}_i \subseteq \{1,\dots,i-1\}$, and we stipulate that the stochastic kernel $P(x_i \mid x^{i-1})$ is functionally dependent only on $x_{\mathcal{I}_i} = (x_j : j \in \mathcal{I}_i)$. In symbols,

$$P(x_i \mid x^{i-1}) = P(x_i \mid x_{\mathcal{I}_i}), \qquad i \in \mathcal{D}. \qquad (2)$$

The variables $X_{\mathcal{I}_i}$ form what is called the *information pattern* of the $i$th decision variable $X_i$, and the collection $\{\mathcal{I}_i : i \in \mathcal{D}\}$ is the *information structure*. The collection of decision kernels $\{P(x_i \mid x_{\mathcal{I}_i}) : i \in \mathcal{D}\}$ is often called a *policy*.

Putting all of these ingredients together, we may define our *system model* as the collection of all probability measures $P$ that admit factorizations of the form (1) with the given system kernels and all decision kernels obeying the conditions (2) for a given information structure $\{\mathcal{I}_i : i \in \mathcal{D}\}$.
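To make the interconnection concrete, here is a minimal Python sketch that multiplies kernels in the causal ordering to produce a joint distribution over finite alphabets. Everything in it is a hypothetical toy chosen for illustration (the function name `interconnect`, the three binary variables, and the numbers 0.9 and 0.8); it is not taken from Tatikonda’s thesis.

```python
from itertools import product

def interconnect(kernels, alphabets):
    """Chain-rule product of kernels, taken in the causal ordering.

    kernels[i](x_i, past) returns the probability of the value x_i given
    the tuple `past` of all earlier values. The result maps every full
    configuration to its probability under the interconnected model.
    """
    joint = {}
    for xs in product(*alphabets):
        p = 1.0
        for i, kernel in enumerate(kernels):
            p *= kernel(xs[i], xs[:i])
        joint[xs] = p
    return joint

# Hypothetical toy system: three binary variables ordered X1, X2, X3.
# X1 and X3 are system variables (kernels given), X2 is a decision variable.
def source(x1, past):
    return 0.5                               # system kernel: fair coin

def policy(x2, past):
    # Decision kernel with information pattern {X1}: copy X1 w.p. 0.9.
    return 0.9 if x2 == past[0] else 0.1

def channel(x3, past):
    # System kernel: depends on the decision variable X2 only.
    return 0.8 if x3 == past[1] else 0.2

alphabets = [(0, 1)] * 3
joint = interconnect([source, policy, channel], alphabets)
total = sum(joint.values())                  # should be 1 up to rounding
```

Swapping in a different `policy` while keeping `source` and `channel` fixed gives another member of the same system model, which is the sense in which the designer completes the partial specification.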

**2. Examples**

Here are two simple examples, one from information theory and one from control, to illustrate the above abstract set-up.

**Transmission of a message over a communication channel.** We wish to transmit a random message $W$ taking one of $M$ possible values over a noisy communication channel with input alphabet $\mathsf{X}$ and output alphabet $\mathsf{Y}$, where we are allowed to use the channel only $n$ times. Let $X_t$ denote the symbol appearing at the channel input at time $t$, let $Y_t$ denote the symbol that appears at the channel output at time $t$, and let $\hat{W}$ denote the decoder’s estimated message after the transmission ends. We write $x^t = (x_1,\dots,x_t)$, and similarly for $y^t$. We can lay out the ingredients as follows:

- The causal ordering is $W, X_1, Y_1, X_2, Y_2, \dots, X_n, Y_n, \hat{W}$
- The system variables are $W, Y_1, \dots, Y_n$, the decision variables are $X_1, \dots, X_n, \hat{W}$
- The system kernels are $P(w)$ and $P(y_t \mid w, x^t, y^{t-1})$, $t = 1,\dots,n$, where we assume that $W \to (X^t, Y^{t-1}) \to Y_t$ is a Markov chain (in other words, the channel output at time $t$ may depend on all past inputs and outputs, but not on the message being transmitted)
- The information structure depends on whether or not we are allowed to use feedback:
  - No feedback allowed: for each $t$, the kernel $P(x_t \mid w, x^{t-1}, y^{t-1})$ is a function of $w$ (the message) and $x^{t-1}$ (past channel inputs) only
  - Feedback allowed: for each $t$, $P(x_t \mid w, x^{t-1}, y^{t-1})$ is a function of the message, all past channel inputs, and all past channel outputs

  In both cases, $P(\hat{w} \mid w, x^n, y^n)$ is a function of the channel outputs $y^n$ only.

Additional simplifications arise if the channel is *memoryless*, i.e., if, for any $t$, $(W, X^{t-1}, Y^{t-1}) \to X_t \to Y_t$ is a Markov chain.
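As a small illustration of the no-feedback case, the sketch below computes the exact error probability of one particular interconnection: a one-bit uniform message, a repetition encoder, a memoryless binary symmetric channel, and a majority-vote decoder. The specific code, channel, and the name `error_probability` are assumptions made for this example only; the encoder and decoder are deterministic, i.e., degenerate decision kernels.

```python
from itertools import product

def error_probability(n, flip_prob):
    """Exact P(decoded message != message) for a 1-bit uniform message
    sent n times (n odd) over a memoryless binary symmetric channel."""
    p_err = 0.0
    for w in (0, 1):                             # message value
        for ys in product((0, 1), repeat=n):     # all channel output strings
            prob = 0.5                           # system kernel for the message
            for y in ys:
                x = w                            # encoder: input depends on the message only
                # Memoryless channel kernel: output depends on current input only.
                prob *= flip_prob if y != x else 1.0 - flip_prob
            # Decoder: majority vote, a function of the channel outputs only.
            w_hat = 1 if 2 * sum(ys) > n else 0
            if w_hat != w:
                p_err += prob
    return p_err

p3 = error_probability(3, 0.1)   # equals 3 * 0.1**2 * 0.9 + 0.1**3 up to rounding
```

With feedback allowed, the line `x = w` could instead depend on the outputs seen so far, which is exactly the enlargement of the information pattern described above.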

**Partially observed Markov decision processes (POMDPs).** We have a stochastic control problem with finite time horizon $T$. At time $t$, the state of the system being controlled (the plant) is $X_t$, the observation at the sensor is $Y_t$, and the control selected by the actuator is $U_t$. As before, $X^t = (X_1,\dots,X_t)$, and similarly for $Y^t$ and $U^t$. We have:

- The causal ordering is $X_1, Y_1, U_1, X_2, Y_2, U_2, \dots, X_T, Y_T, U_T$
- The system variables are $X^T$ and $Y^T$, the decision variables are $U^T$
- The system kernels are $P(x_t \mid x^{t-1}, y^{t-1}, u^{t-1})$ and $P(y_t \mid x^t, y^{t-1}, u^{t-1})$, where we assume that $(X^{t-2}, Y^{t-1}, U^{t-2}) \to (X_{t-1}, U_{t-1}) \to X_t$ and $(X^{t-1}, Y^{t-1}, U^{t-1}) \to X_t \to Y_t$ are Markov chains. In other words, the state at time $t$ depends only on the most recent state $X_{t-1}$ and the most recently applied control $U_{t-1}$, while the observation at time $t$ depends only on the state at time $t$
- The information structure stipulates that, at each time $t$, $X^t \to (Y^t, U^{t-1}) \to U_t$ is a Markov chain. In other words, the controls can be chosen only on the basis of all past observations and controls, while the states are not observed. We can also subsume open-loop control by requiring that $(X^t, Y^t) \to U^{t-1} \to U_t$ be a Markov chain.
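The POMDP structure above can be sketched as a simulation loop in which the policy is handed only the observation and control histories. The two-state plant, the noise levels 0.9 and 0.8, and the names `simulate`, `closed_loop`, and `open_loop` are all hypothetical choices for this illustration.

```python
import random

def simulate(horizon, policy, rng):
    """Sample one trajectory of a hypothetical two-state POMDP."""
    states, observations, controls = [], [], []
    state = rng.choice((0, 1))                   # initial state law: uniform
    for t in range(horizon):
        if t > 0:
            # State kernel: depends on the previous state and control only.
            # Control 1 asks for a flip; the intended state is reached w.p. 0.9.
            target = state ^ controls[-1]
            state = target if rng.random() < 0.9 else 1 - target
        states.append(state)
        # Observation kernel: depends on the current state only (noisy sensor).
        obs = state if rng.random() < 0.8 else 1 - state
        observations.append(obs)
        # Decision kernel: sees past observations and controls, never the states.
        controls.append(policy(tuple(observations), tuple(controls), rng))
    return states, observations, controls

def closed_loop(observations, controls, rng):
    # Try to drive the state to 0: request a flip whenever we observe 1.
    return observations[-1]

def open_loop(observations, controls, rng):
    # Ignores the observations entirely; alternates 0, 1, 0, 1, ...
    return len(controls) % 2

states, observations, controls = simulate(5, closed_loop, random.Random(0))
```

Because `policy` never receives `states`, any function plugged in here automatically respects the information structure; `open_loop` additionally ignores its first argument.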

**3. Connection to graphical models**

All this talk of prescribing joint probability measures by means of stochastic kernels (conditional probability measures) immediately makes one think of probabilistic graphical models (or Bayes networks). Indeed, by specifying the system kernels and the information patterns, we fix conditional independence relations among the relevant system and decision variables. Hugo Touchette and Seth Lloyd use such graphical model representations in their paper on the information-theoretic properties of simple control systems. In many cases, graphical models can be used to simplify complicated information structures or replace them by equivalent ones that may be easier to deal with. Here is a typical graphical model:

Looking at this particular model, we notice two things:

- The causal ordering
- The conditional independence relation

Note also the obvious symmetry between and . Indeed, this symmetry leads to the following nice result:

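**Lemma 1** Consider a causal ordering $X_1,\dots,X_{i-1},X_i,X_{i+1},X_{i+2},\dots,X_N$, such that

$$P(x_{i+1} \mid x^{i}) = P(x_{i+1} \mid x^{i-1}) \qquad (3)$$

for all $x^{i+1}$, where $x^{i} = (x_1,\dots,x_i)$. Then there exists an equivalent model with causal ordering $X_1,\dots,X_{i-1},X_{i+1},X_i,X_{i+2},\dots,X_N$.

*Proof:* Equation (3) states that the variables $X_i \to X^{i-1} \to X_{i+1}$ form a Markov chain, i.e., $X_i$ and $X_{i+1}$ are conditionally independent given $X^{i-1}$. We now use this fact to show that $P$ can be factorized according to the causal ordering $X_1,\dots,X_{i-1},X_{i+1},X_i,X_{i+2},\dots,X_N$:

$$\begin{aligned} P(x^N) &= \Big[\prod_{j < i} P(x_j \mid x^{j-1})\Big]\, P(x_i \mid x^{i-1})\, P(x_{i+1} \mid x^{i})\, \Big[\prod_{j > i+1} P(x_j \mid x^{j-1})\Big] \\ &= \Big[\prod_{j < i} P(x_j \mid x^{j-1})\Big]\, P(x_i \mid x^{i-1})\, P(x_{i+1} \mid x^{i-1})\, \Big[\prod_{j > i+1} P(x_j \mid x^{j-1})\Big] \\ &= \Big[\prod_{j < i} P(x_j \mid x^{j-1})\Big]\, P(x_{i+1} \mid x^{i-1})\, P(x_i \mid x^{i-1})\, \Big[\prod_{j > i+1} P(x_j \mid x^{j-1})\Big]. \end{aligned}$$

The first step was to write down the factorization according to the original causal ordering. Then we have used the conditional independence of $X_i$ and $X_{i+1}$ given $X^{i-1}$ to simplify the kernel $P(x_{i+1} \mid x^{i})$ to $P(x_{i+1} \mid x^{i-1})$. Finally, we have interchanged $X_i$ and $X_{i+1}$. The resulting expression gives the factorization of $P$ according to the new causal ordering, since by the same conditional independence $P(x_i \mid x^{i-1}) = P(x_i \mid x^{i-1}, x_{i+1})$. Finally, note that we have not changed the functional form of any stochastic kernel in the above expression, so the two models are indeed equivalent.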
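The lemma can be checked numerically on a toy example. The sketch below builds a joint distribution over three binary variables in which the third variable ignores the second given the first, and then verifies that the kernels needed for the swapped ordering (first, third, second) coincide with the original ones. The kernels `k1`, `k2`, `k3`, their numbers, and the helper `conditional` are assumptions made for this illustration.

```python
from itertools import product

# Hypothetical kernels in the original ordering (first, second, third).
def k1(a):
    return 0.3 if a == 1 else 0.7

def k2(b, a):
    return 0.6 if b == a else 0.4

def k3(c, a):
    # Does not look at the second variable: this is the lemma's condition.
    return 0.9 if c == a else 0.1

joint = {(a, b, c): k1(a) * k2(b, a) * k3(c, a)
         for a, b, c in product((0, 1), repeat=3)}

def conditional(joint, i, given, xs):
    """P(x_i | x_given) at the point xs, computed by marginalizing the joint."""
    match = lambda ys: all(ys[j] == xs[j] for j in given)
    den = sum(p for ys, p in joint.items() if match(ys))
    num = sum(p for ys, p in joint.items() if match(ys) and ys[i] == xs[i])
    return num / den

# In the swapped ordering the third variable comes right after the first,
# and the second variable is conditioned on both of the others.
max_gap = 0.0
for xs in product((0, 1), repeat=3):
    a, b, c = xs
    max_gap = max(max_gap, abs(conditional(joint, 2, (0,), xs) - k3(c, a)))
    max_gap = max(max_gap, abs(conditional(joint, 1, (0, 2), xs) - k2(b, a)))
```

A `max_gap` of zero (up to rounding) says that reordering left every kernel’s functional form unchanged, which is the sense of "equivalent model" in the lemma.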

Using this lemma and the principle of induction we may prove, for instance, that in open-loop control we may lump all the control variables together and simply optimize over their joint distribution (given the state transition law and the observation model). We can also use more sophisticated graphical representations (such as factor graphs) to include not only the system and the decision variables, but also cost (or reward) functions, as shown in a nice paper by Aditya Mahajan and Sekhar Tatikonda.

Tara Javidi said, on September 1, 2010 at 12:03 pm:

Hi Max,

This is great. Thank God, your post was not known to our NSF panel or they would have found you to fund instead of us! 😉

Mohammad and I are making some progress. Can’t wait to see you in Allerton and report.

That invitation to UCSD is still open, man. Come our way and let’s crack some of these problems together.

-T

mraginsky said, on September 1, 2010 at 11:32 pm:

Tara, thanks for the kind words! Looking forward to hearing what Mohammad and you have done. I’ve been thinking about a few things along similar lines myself lately, mainly in connection to experimental design and whatnot.

As for visiting UCSD — maybe November or first half of December (before CDC)?

Aditya Mahajan said, on September 1, 2010 at 10:53 pm:

Hey Max,

Very lucid description. And a nice plug for my paper. Thank you.

But, personally, I prefer Witsenhausen’s intrinsic model to the model that you described above.

The differences between the two are subtle. In the intrinsic model, you start with a probability space $(\Omega, F, P)$ and define all random variables on this space. All these random variables constitute the *intrinsic event* or the move of nature. Each control variable is defined on a measurable space $(Z_i, G_i)$. Then, work on the measurable space $(\Omega \times \prod_i Z_i,\ F \times \prod_i G_i)$ and basically all that you have said can be adapted for this space.

The advantage? You do not need to assume a total order on all controllers. If you have a partial order between them, the system is sequential and the equivalence to directed graphs holds. In fact, this equivalence is stronger because the directed graphs also capture the partial order between the controllers. More importantly, if you do not have a partial order between the controllers, that is, you have a non-sequential system, the model still holds. You can explore deadlocks, livelocks, and other new intricacies. It also illustrates the huge simplification that you get by assuming sequentiality.

Aditya

mraginsky said, on September 1, 2010 at 11:40 pm:

Aditya, thanks for the compliments. As for the intrinsic model, you’re preaching to the converted. I’m a big fan of it myself, and one of these days I’ll write something about it for the blog. (Needless to say, there will be plenty of dumb mistakes, but hopefully you will be right there to point them out.)

Incidentally, I was leafing through Judea Pearl’s *Causality* the other day, and I came across a mention of Herbert Simon’s approach to eliciting causality from a functional relation among a set of variables. Basically, you can impose a time ordering if you can solve for each variable in terms of the ones you have solved for in the preceding iterations. I haven’t thought this through, but, presumably, if there are random elements in the picture (i.e., the ubiquitous $(\Omega, F, P)$ beast), then something like the Property C from Witsenhausen’s 1971 paper could be teased out. In other words, the order in which you can perform Simon’s procedure may itself be a random variable, but it will be well-defined for each fixed realization of $\omega$.

Aditya Mahajan said, on September 2, 2010 at 12:02 am:

That is interesting. I never managed to finish reading *Causality*. It was too controversial for my taste 🙂 Reading Simon, on the other hand, is usually a delight. Do you have a reference for his paper?

Testing causality for each $\omega$ is an interesting idea. (Although, off the top of my head, I am not sure whether it is possible to pose Property C in terms of $\omega$ and not the entire sigma-algebra.) Mark Andersland used a similar idea to show that you can obtain a sequential decomposition of a non-sequential system.

mraginsky said, on September 2, 2010 at 1:25 pm:

I have Pearl’s book at home, so I can’t look up the exact reference now, but this paper by Druzdzel and Simon may be related: http://www.pitt.edu/~druzdzel/abstracts/uai93.html

As far as the sigma-algebra issue goes — off the top of my head I would think that you’d probably need it to make sure that you select things in a measurable way at every iteration.

mraginsky said, on September 7, 2010 at 10:55 am:

I believe this is the paper of Simon that Pearl was referring to:

http://cowles.econ.yale.edu/P/cm/m14/m14-03.pdf

This paper dates back to 1953; Witsenhausen’s first paper on the intrinsic model came out in 1971, 18 years later. It would be interesting to read Simon’s paper and see whether (and if so, how) his approach is related to Witsenhausen’s.
