Stochastic kernels of the world, interconnect!
This is an expository post on a particularly nice way of thinking about stochastic systems in information theory, control, statistical learning and inference, experimental design, etc. I will closely follow the exposition in Chapter 2 of Sekhar Tatikonda‘s doctoral thesis, which in turn builds on ideas articulated by Hans Witsenhausen and Roland Dobrushin.
A general stochastic system consists of smaller interconnected subsystems that influence one another’s behavior. For example, in a communications setting we have sources, encoders, channels and decoders; in control we have plants, sensors, and actuators. The formalism I am about to describe will allow us to treat all these different components on an equal footing.
We will work in the Bayesian framework, by which I mean the following: if we attach a variable to the relevant parts of the overall system, then the system’s behavior can be described by a probability measure over the variables. Typically, there will be constraints on the form this probability measure may take. Some of these constraints are forced on us by the type of components we have at our disposal. For example, in a communication system the source is described a stochastic process with a given process distribution that must be communicated over a noisy channel with a fixed transition law. These constraints, then, specify conditional probability measures between some of the variables. The task of the system designer is to complete this partial specification to a global probability measure. This is where additional constraints come in. Depending on the system, the flow of information may be restricted in various ways. For example, in a partially observed Markov decision process (POMDP), the sensors are noisy, so the true system state is hidden from the controller. There may also be memory limitations — for example, the controller may use only a fixed number of bits to store information about past observations and controls. The constraints of this form specify what’s called an information structure associated with the system: who knows what and when they know it. Finally, we must specify what they can do with what they know. One way to accomplish this is to impose an arrow of time by prescribing the order in which the different subsystems may act. We will call this the causal ordering associated to the system.
1. The basic definition
We now formalize the above discussion. First of all, let us fix a measurable space . Following the time-honored practice, we will use this space as a dump for all the randomness and chance effects. Let’s suppose that there are variables associated with the system of interest, which we will describe by (measurable) functions , where for simplicity we will assume that each is a finite set. Note that at this point the ‘s are not yet random variables since we have not defined a measure on . We now describe how this is to be done.
This factorization is made up of two types of factors (stochastic kernels) : those specified by the system description and those chosen by the designer. In order to make use of this distinction, we partition the set into two disjoint subsets:
- The system set , where
- The decision set , where .
The system specification then is a collection of stochastic kernels . These prescribe the behavior of each system variable , , given the values of the preceding system and decision variables. The remaining factors, , describe the behavior of the decision variables , , given the values of the preceding system and decision variables. These kernels are to be chosen by the system’s designer. We also say that the designer uses these factors to interconnect the system stochastic kernels in order to form an overall probability measure .
The variables form what is called the information pattern of the th decision variable , and the collection is the information structure. The collection is often called a policy.
Putting all of these ingredients together, we may define our system model as the collection of all probability measures that admit factorizations of the form (1) with the given system kernels and all decision kernels obeying the conditions (2) for a given information structure .
Here are two simple examples, one from information theory and one from control, to illustrate the above abstract set-up.
Transmission of a message over a communication channel. We wish to transmit a random message taking one of possible values over a noisy communication channel with input alphabet and output alphabet , where we are allowed to use the channel only times. Let denote the symbol appearing at the channel input at time , let denote the symbol that appears at the channel output at time , and let denote the decoder’s estimated message after the transmission ends. We can lay out the ingredients as follows:
- The causal ordering is
- The system variables are , the decision variables are
- The system kernels are and , where we assume that is a Markov chain (in other words, the channel output at time may depend on all past inputs and outputs, but not on the message being transmitted)
- The information structure depends on whether or not we are allowed to use feedback:
- No feedback allowed — for each , the kernel is a function of (the message) and (past channel inputs) only
- Feedback allowed — for each , is a function of the message, all past channel inputs, and all past channel outputs
In both cases, is a function of the channel outputs only
Additional simplifications arise if the channel is memoryless, i.e., if, for any , is a Markov chain.
Partially observed Markov decision processes (POMDPs). We have a stochastic control problem with finite time horizon . At time , the state of the system being controlled (the plant) is , the observation at the sensor is , and the control selected by the actuator is . We have:
- The causal ordering is
- The system variables are and , the decision variables are
- The system kernels are and , where we assume that and are Markov chains. In other words, the state at time depends only on the most recent state and the most recently applied control , while the observation at time depends only on the state at time
- The information structure stipulates that, at each time , is a Markov chain. In other words, the controls can be chosen only on the basis of all past observations and controls, while the states are not observed. We can also subsume open-loop control by requiring that be a Markov chain.
3. Connection to graphical models
All this talk of prescribing joint probability measures by means of stochastic kernels (conditional probability measures) immediately makes one think of probabilistic graphical models (or Bayes networks). Indeed, by specifying the system kernels and the information patterns, we fix conditional independence relations among the relevant system and decision variables. Hugo Touchette and Seth Lloyd use such graphical model representations in their paper on the information-theoretic properties of simple control systems. In many cases, graphical models can be used to simplify complicated information structures or replace them by equivalent ones that may be easier to deal with. Here is a typical graphical model:
Looking at this particular model, we notice two things:
- The causal ordering
- The conditional independence relation
Note also the obvious symmetry between and . Indeed, this symmetry leads to the following nice result:
for all . Then there exists an equivalent model with causal ordering .
Proof: Equation (3) states that the variables form a Markov chain, i.e., and are conditionally independent given . We now use this fact to show that can be factorized according to the causal ordering :
The first step was to write down the factorization according to the original causal ordering. Then we have used the conditional independence of and given to simplify the kernel to . Finally, we have interchanged and . The resulting expression gives the factorization of according to the new causal ordering. Finally, note that we have not changed the functional form of any stochastic kernel in the above expression, so the two models are indeed equivalent.
Using this lemma and the principle of induction we may prove, for instance, that in open-loop control we may lump all the control variables together and simply optimize over their joint distribution (given the state transition law and the observation model). We can also use more sophisticated graphical representations (such as factor graphs) to include not only the system and the decision variables, but also cost (or reward) functions, as shown in a nice paper by Aditya Mahajan and Sekhar Tatikonda.