# Stochastic kernels of the world, interconnect!

This is an expository post on a particularly nice way of thinking about stochastic systems in information theory, control, statistical learning and inference, experimental design, etc. I will closely follow the exposition in Chapter 2 of Sekhar Tatikonda‘s doctoral thesis, which in turn builds on ideas articulated by Hans Witsenhausen and Roland Dobrushin.

A general stochastic system consists of smaller interconnected subsystems that influence one another’s behavior. For example, in a communications setting we have sources, encoders, channels and decoders; in control we have plants, sensors, and actuators. The formalism I am about to describe will allow us to treat all these different components on an equal footing.

We will work in the Bayesian framework, by which I mean the following: if we attach a variable to the relevant parts of the overall system, then the system’s behavior can be described by a probability measure over the variables. Typically, there will be constraints on the form this probability measure may take. Some of these constraints are forced on us by the type of components we have at our disposal. For example, in a communication system the source is described a stochastic process with a given process distribution that must be communicated over a noisy channel with a fixed transition law. These constraints, then, specify conditional probability measures between some of the variables. The task of the system designer is to complete this partial specification to a global probability measure. This is where additional constraints come in. Depending on the system, the flow of information may be restricted in various ways. For example, in a partially observed Markov decision process (POMDP), the sensors are noisy, so the true system state is hidden from the controller. There may also be memory limitations — for example, the controller may use only a fixed number of bits to store information about past observations and controls. The constraints of this form specify what’s called an information structure associated with the system: who knows what and when they know it. Finally, we must specify what they can do with what they know. One way to accomplish this is to impose an arrow of time by prescribing the order in which the different subsystems may act. We will call this the causal ordering associated to the system.

1. The basic definition

We now formalize the above discussion. First of all, let us fix a measurable space ${(\Omega,{\mathcal F})}$. Following the time-honored practice, we will use this space as a dump for all the randomness and chance effects. Let’s suppose that there are ${N}$ variables associated with the system of interest, which we will describe by ${N}$ (measurable) functions ${Z_i : \Omega \rightarrow {\mathsf Z}_i}$, where for simplicity we will assume that each ${{\mathsf Z}_i}$ is a finite set. Note that at this point the ${Z_i}$‘s are not yet random variables since we have not defined a measure on ${(\Omega,{\mathcal F})}$. We now describe how this is to be done.

First of all, let us fix a time ordering on the variables ${(Z_i)^N_{i = I}}$: ${Z_1}$ precedes ${Z_2}$, which precedes ${Z_3}$, and so on. With the assumed ordering, any probability measure ${P_{Z^N}(z^N)}$ can be factored as follows:

$\displaystyle P_{Z^N}(z^N) = \prod^N_{i=1} P_{Z_i|Z^{i-1}}(z_i|z^{i-1}). \ \ \ \ \ (1)$

This factorization is made up of two types of factors (stochastic kernels) ${P_{Z_i|Z^{i-1}}}$: those specified by the system description and those chosen by the designer. In order to make use of this distinction, we partition the set ${\{1,\ldots,n\}}$ into two disjoint subsets:

• The system set ${S = \{i_1,\ldots,i_K\}}$, where ${1 \le i_1 < i_2 < \ldots < i_K \le N}$
• The decision set ${D = \{j_1,\ldots,j_L\}}$, where ${1 \le j_1 < j_2 < \ldots < j_L \le N}$.

The system specification then is a collection of stochastic kernels ${( P_{Z_{i_k}|Z^{i_k-1}} )^K_{i=1}}$. These prescribe the behavior of each system variable ${Z_{i_k}}$, ${i_k \in S}$, given the values of the preceding system and decision variables. The remaining factors, ${( P_{Z_{j_\ell}|Z^{j_\ell-1}})^L_{\ell=1}}$, describe the behavior of the decision variables ${Z_{j_\ell}}$, ${j_\ell \in D}$, given the values of the preceding system and decision variables. These kernels are to be chosen by the system’s designer. We also say that the designer uses these factors to interconnect the system stochastic kernels in order to form an overall probability measure ${P_{Z^N}}$.

Now, this is where the information structure comes to the fore. To each index ${j_\ell \in D}$ we associate a set ${S_{\ell} \subseteq \{1,\ldots,j_\ell-1\}}$ and we stipulate that the stochastic kernel ${Q_{Z_{j_\ell}|Z^{j_\ell-1}}(\cdot|z^{j_\ell-1})}$ is functionally dependent only to ${\{ z_i : i \in S_{\ell}\}}$. In symbols,

$\displaystyle P_{Z_{j_\ell}|Z^{j_\ell-1}}(\cdot | z^{j_\ell-1}) = P_{Z_{j_\ell}|Z^{j_\ell-1}}(\cdot| \tilde{z}^{j_\ell-1}) \quad \text{if } z^{j_\ell-1} = \tilde{z}^{j_\ell-1} \text{ on } S_\ell. \ \ \ \ \ (2)$

The variables ${Z^{S_\ell}}$ form what is called the information pattern of the ${\ell}$th decision variable ${Z_{j_\ell}}$, and the collection ${(S_\ell)^L_{\ell=1}}$ is the information structure. The collection ${(P_{Z_{j_\ell}|Z^{j_\ell-1}})^L_{\ell=1}}$ is often called a policy.

Putting all of these ingredients together, we may define our system model as the collection of all probability measures ${P_{Z^N}}$ that admit factorizations of the form (1) with the given system kernels and all decision kernels obeying the conditions (2) for a given information structure ${(S_\ell)^L_{\ell=1}}$.

2. Examples

Here are two simple examples, one from information theory and one from control, to illustrate the above abstract set-up.

Transmission of a message over a communication channel. We wish to transmit a random message ${W}$ taking one of ${M}$ possible values over a noisy communication channel with input alphabet ${{\mathsf X}}$ and output alphabet ${{\mathsf Y}}$, where we are allowed to use the channel only ${T}$ times. Let ${X_t}$ denote the symbol appearing at the channel input at time ${t}$, let ${Y_t}$ denote the symbol that appears at the channel output at time ${t}$, and let ${\hat{W}}$ denote the decoder’s estimated message after the transmission ends. We can lay out the ingredients as follows:

• The causal ordering is ${W, X_1, Y_1, \ldots, X_T, Y_T, \hat{W}}$
• The system variables are ${W, Y_1, \ldots, Y_T}$, the decision variables are ${X_1,\ldots,X_T,\hat{W}}$
• The system kernels are ${P_W}$ and ${(P_{Y_t|X^t,Y^{t-1},W})^{T}_{t=1}}$, where we assume that ${W \rightarrow (X^t,Y^{t-1}) \rightarrow Y_t}$ is a Markov chain (in other words, the channel output at time ${t}$ may depend on all past inputs and outputs, but not on the message being transmitted)
• The information structure depends on whether or not we are allowed to use feedback:
• No feedback allowed — for each ${t}$, the kernel ${P_{X_t|X^{t-1},Y^{t-1},W}(\cdot|x^{t-1},y^{t-1},w)}$ is a function of ${w}$ (the message) and ${x^{t-1}}$ (past channel inputs) only
• Feedback allowed — for each ${t}$, ${P_{X_t|X^{t-1},Y^{t-1},W}(\cdot|x^{t-1},y^{t-1},w)}$ is a function of the message, all past channel inputs, and all past channel outputs

In both cases, ${P_{\hat{W}|X^T,Y^T,W}(\cdot|x^T,y^T,w)}$ is a function of the channel outputs ${y^T}$ only

Additional simplifications arise if the channel is memoryless, i.e., if, for any ${t}$, ${(X^{t-1},Y^{t-1}) \rightarrow X_t \rightarrow Y_t}$ is a Markov chain.

Partially observed Markov decision processes (POMDPs). We have a stochastic control problem with finite time horizon ${T}$. At time ${t}$, the state of the system being controlled (the plant) is ${X_t \in {\mathsf X}}$, the observation at the sensor is ${Y_t \in {\mathsf Y}}$, and the control selected by the actuator is ${U_t \in {\mathsf U}}$. We have:

• The causal ordering is ${X_1,Y_1,U_1,\ldots,X_T,Y_T,U_T}$
• The system variables are ${X_1,\ldots,X_T}$ and ${Y_1,\ldots,Y_T}$, the decision variables are ${U_1,\ldots,U_T}$
• The system kernels are ${(P_{X_t|X^{t-1},Y^{t-1},U^{t-1}})^T_{t=1}}$ and ${(P_{Y_t|X^t,Y^{t-1},U^{t-1}})^T_{t=1}}$, where we assume that ${(X^{t-2},Y^{t-1},U^{t-2}) \rightarrow (X_{t-1},U_{t-1}) \rightarrow X_t}$ and ${(X^{t-1},Y^{t-1},U^{t-1}) \rightarrow X_t \rightarrow Y_t}$ are Markov chains. In other words, the state at time ${t}$ depends only on the most recent state ${X_{t-1}}$ and the most recently applied control ${U_{t-1}}$, while the observation at time ${t}$ depends only on the state at time ${t}$
• The information structure stipulates that, at each time ${t}$, ${X^t \rightarrow (Y^t,U^{t-1}) \rightarrow U_t}$ is a Markov chain. In other words, the controls can be chosen only on the basis of all past observations and controls, while the states are not observed. We can also subsume open-loop control by requiring that ${(X^t,Y^t) \rightarrow U^{t-1} \rightarrow U_t}$ be a Markov chain.

3. Connection to graphical models

All this talk of prescribing joint probability measures by means of stochastic kernels (conditional probability measures) immediately makes one think of probabilistic graphical models (or Bayes networks). Indeed, by specifying the system kernels and the information patterns, we fix conditional independence relations among the relevant system and decision variables. Hugo Touchette and Seth Lloyd use such graphical model representations in their paper on the information-theoretic properties of simple control systems. In many cases, graphical models can be used to simplify complicated information structures or replace them by equivalent ones that may be easier to deal with. Here is a typical graphical model:

Looking at this particular model, we notice two things:

• The causal ordering ${W,X,Y,Z}$
• The conditional independence relation ${X \rightarrow W \rightarrow Y}$

Note also the obvious symmetry between ${X}$ and ${Y}$. Indeed, this symmetry leads to the following nice result:

Lemma 1 Consider a causal ordering ${W, X, Y, Z}$, such that

$\displaystyle P_{Y|XW}(y|x,w) = P_{Y|W}(y|w) \ \ \ \ \ (3)$

for all ${w,x,y}$. Then there exists an equivalent model with causal ordering ${W, Y, X, Z}$.

Proof: Equation (3) states that the variables ${X \rightarrow W \rightarrow Y}$ form a Markov chain, i.e., ${X}$ and ${Y}$ are conditionally independent given ${W}$. We now use this fact to show that ${P_{WXYZ}}$ can be factorized according to the causal ordering ${W, Y, X, Z}$:

$\displaystyle \begin{array}{rcl} P_{WXYZ}(w,x,y,z) &=& P_{W}(w) P_{X|W}(x|w) P_{Y|WX}(y|w,x) P_{Z|WXY}(z|w,x,y) \\ &=& P_{W}(w) P_{X|W}(x|w) P_{Y|W}(y|w) P_{Z|WXY}(z|w,x,y) \\ &=& P_{W}(w) P_{Y|W}(y|w) P_{X|W}(x|w) P_{Z|WXY}(z|w,x,y) \end{array}$

The first step was to write down the factorization according to the original causal ordering. Then we have used the conditional independence of ${X}$ and ${Y}$ given ${W}$ to simplify the kernel ${P_{Y|XW}}$ to ${P_{Y|W}}$. Finally, we have interchanged ${P_{Y|W}}$ and ${P_{X|W}}$. The resulting expression gives the factorization of ${P_{WXYZ}}$ according to the new causal ordering. Finally, note that we have not changed the functional form of any stochastic kernel in the above expression, so the two models are indeed equivalent. $\Box$

Using this lemma and the principle of induction we may prove, for instance, that in open-loop control we may lump all the control variables together and simply optimize over their joint distribution (given the state transition law and the observation model). We can also use more sophisticated graphical representations (such as factor graphs) to include not only the system and the decision variables, but also cost (or reward) functions, as shown in a nice paper by Aditya Mahajan and Sekhar Tatikonda.

## 10 thoughts on “Stochastic kernels of the world, interconnect!”

1. Tara Javidi says:

Hi Max,

This is great. Thank God, your post was not known to our NSF panel or they would have found you to fund instead of us! 😉

Mohammad and I are making some progress. Can’t wait to see you in Allerton and report.

That invitation to UCSD is still open, man. Come our way and let’s crack some of these problems together.

-T

1. Tara, thanks for the kind words! Looking forward to hearing what Mohammad and you have done. I’ve been thinking about a few things along similar lines myself lately, mainly in connection to experimental design and whatnot.

As for visiting UCSD — maybe November or first half of December (before CDC)?

Hey Max,

Very lucid description. And a nice plug for my paper. Thank you.

But, personally, I prefer Witsenhausen’s intrinsic model to the model that you described above.

The differences between the two are subtle. In the intrinsic mode, you start with a probability space (Omega, F, P) and define all random variables on this space. All these random variables constitute the intrinsic event or the move of nature. Each control variable is defined on a measurable space (Z_i,G_i). Then, work on the measurable space (Omega \times \prod_i Z_i, F \times \prod_i G_i) and basically all that you have said can be adapted for this space.

The advantage? You do not need to assume a total order on all controllers. If you have a partial order between them, the system is sequential and the equivalence to directed graphs holds. In fact, this equivalence is more stronger because the directed graphs are also capturing the partial order between the controllers. More importantly, if you do not have a partial order between the controllers, that is, you have a non-sequential system, the model still holds. You can explore deadlocks, livelocks, and other new intricacies. It also illustrates the huge simplification that you get by assuming sequentiality.

thanks for the compliments. As for the intrinsic model, you’re preaching to the converted. I’m a big fan of it myself, and one of these days I’ll write something about it for the blog. (Needless to say, there will be plenty of dumb mistakes, but hopefully you will be right there to point them out.)

Incidentally, I was leafing through Judea Pearl’s Causality the other day, and I’ve come across a mention of Herbert Simon’s approach to eliciting causality from a functional relation among a set of variables. Basically, you can impose a time ordering if you can solve for each variable in terms of the ones you have solved for in the preceding iterations. I haven’t thought this through, but, presumably, if there are random elements in the picture (i.e., the ubiquitous (Omega, F, P) beast), then something like the Property C from Witsenhausen’s 1971 paper could be teased out. In other words, the order in which you can perform Simon’s procedure may itself be a random variable, but it will be well-defined for each fixed realization of omega.

That is interesting. I never managed to finish reading Causality. It was too controversial for my taste 🙂 Reading Simon, on the other hand, is usually a delight. Do you have a reference for his paper.

Testing causality for each omega is an interesting idea.(Although, from what I remember on the top of my head, whether it is possible to pose Property C in terms of omega (and not the entire sigma-algebra)). Mark Andersland used a similar idea to show that you can obtain a sequential decomposition of a non-sequential system.

2. I have Pearl’s book at home, so I can’t look up the exact reference now, but this paper by Druzdzel and Simon may be related: http://www.pitt.edu/~druzdzel/abstracts/uai93.html

As far as the sigma-algebra issue goes — off the top of my head I would think that you’d probably need it to make sure that you select things in a measurable way at every iteration.

3. I believe this is the paper of Simon that Pearl was referring to:

Click to access m14-03.pdf

This paper dates back to 1953; Witsenhausen’s first paper on the intrinsic model came out in 1971, 18 years later. It would be interesting to read Simon’s paper and see whether (and if so, how) his approach is related to Witsenhausen’s.