## Directed stochastic kernels and causal interventions

As I was thinking more about Massey’s paper on directed information and about the work of Touchette and Lloyd on the information-theoretic study of control systems (which we had started looking at during the last meeting of our reading group), I realized that the directed stochastic kernels that feature so prominently in the general definition of directed information are known in the machine learning and AI communities under another name, due to Judea Pearl: *interventional distributions*.

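Let us recall the definition of a directed stochastic kernel. Consider random variables $X_1, \ldots, X_n$ with the obvious causal ordering. Factorize their joint distribution according to this causal ordering:

$$P(x^n) = \prod_{i=1}^n P(x_i \mid x^{i-1}), \tag{1}$$

where $x^{i-1} = (x_1, \ldots, x_{i-1})$. For any partition of $\{1, \ldots, n\}$ into two disjoint sets $A$ and $B$, the directed stochastic kernel of $X_A$ given $X_B$ is given by

$$P(x_A \| x_B) = \prod_{i \in A} P(x_i \mid x^{i-1}). \tag{2}$$

We have already seen that this object figures prominently in the definition of the *directed information* between $X_A$ and $X_B$, viz.

$$I(X_A \to X_B) = D\big(P_{X_A \mid X_B} \,\big\|\, P_{X_A \| X_B} \,\big|\, P_{X_B}\big) = \mathbb{E}\left[\log \frac{P(X_A \mid X_B)}{P(X_A \| X_B)}\right].$$

As opposed to the usual mutual information $I(X_A; X_B)$, which measures the amount of *statistical* correlation between $X_A$ and $X_B$, the directed information quantifies the amount of the *causal influence* $X_A$ has on $X_B$.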
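To make the definitions concrete, here is a minimal numerical sketch for finite alphabets. It assumes the joint distribution is stored as a strictly positive NumPy array whose axes follow the causal ordering (0-based), and the function names `causal_conditionals`, `directed_kernel` and `directed_information` are made up for this illustration.

```python
import numpy as np

def causal_conditionals(joint):
    """Return [P(x_i | x^{i-1})] for i = 0..n-1, given a strictly positive joint array."""
    n = joint.ndim
    conds = []
    for i in range(n):
        # Marginal of the first i+1 variables: sum out the later axes, if any.
        later = tuple(range(i + 1, n))
        marg = joint.sum(axis=later) if later else joint
        # Dividing by the sum over the last kept axis conditions on x^{i-1}.
        conds.append(marg / marg.sum(axis=-1, keepdims=True))
    return conds

def directed_kernel(joint, A):
    """P(x_A || x_B): product over i in A of P(x_i | x^{i-1}), as an array of joint's shape."""
    n = joint.ndim
    kernel = np.ones(joint.shape)
    for i, cond in enumerate(causal_conditionals(joint)):
        if i in A:
            # Append singleton axes so the factor broadcasts over the later variables.
            kernel = kernel * cond.reshape(cond.shape + (1,) * (n - i - 1))
    return kernel

def directed_information(joint, A):
    """E log[ P(X_A | X_B) / P(X_A || X_B) ] in nats, with B the complement of A."""
    A = tuple(sorted(A))
    p_B = joint.sum(axis=A, keepdims=True)   # P(x_B), broadcastable against joint
    conditional = joint / p_B                # P(x_A | x_B)
    return float(np.sum(joint * np.log(conditional / directed_kernel(joint, A))))

# Arbitrary positive joint distribution of three binary variables.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2)) + 0.1
joint /= joint.sum()

A, B = (0, 2), (1,)
print("directed information:", directed_information(joint, A))
```

Two sanity checks follow directly from (1) and (2): the kernels for $A$ and for $B$ multiply back to the joint distribution, and for each fixed $x_B$ the kernel $P(x_A \| x_B)$ sums to one over $x_A$.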

Now let’s connect directed stochastic kernels to probabilistic graphical models, or Bayes networks. Staying with the same causal ordering, let us suppose that, for each $i$, we can split the set $\{1, \ldots, i-1\}$ into two disjoint subsets, $\Pi_i$ and $\bar{\Pi}_i$, such that

$$X_{\bar{\Pi}_i} \to X_{\Pi_i} \to X_i \tag{3}$$

is a Markov chain. Then for each $i$ the stochastic kernel $P(x_i \mid x^{i-1})$ depends only on those components of $x^{i-1}$ that lie in $\Pi_i$, and we can rewrite the causal factorization (1) as

$$P(x^n) = \prod_{i=1}^n P(x_i \mid x_{\Pi_i}). \tag{4}$$

We can represent the conditional independence relations (3) pictorially by means of a directed acyclic graph (DAG) with $n$ vertices, the $i$th vertex corresponding to $X_i$, where there is an edge connecting vertex $j$ to vertex $i$ if and only if $j \in \Pi_i$. It is not hard to see that the resulting graph will, indeed, be acyclic. This DAG, or *Bayes network*, representation of (4), thanks to the efforts of Judea Pearl and many others, is now one of the main tools in machine learning and AI.

Now, it turns out that, for a given assignment $x_B$ of values to the vertices in $B$, we can represent the operation of forming the directed kernel $P(x_A \| x_B)$ graphically as well. In order to do that, we simply locate the vertices $i \in B$, sever all edges pointing into them, and consider the resulting DAG. After this operation is performed, the only vertices with incoming edges are the ones in $A$, and we end up with (2). Judea Pearl calls the resulting distribution of $X_A$ (given $x_B$) the *interventional distribution*, meaning that we have *intervened* into the causal model by disconnecting the vertices in $B$ from any causal influences upon them and then forcing them to take the values assigned by $x_B$.
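The graph surgery can be sketched in a few lines. The three-vertex binary network below, its probability tables, and all function names are hypothetical, chosen only to contrast intervening on a vertex with conditioning on it.

```python
import itertools
import numpy as np

# Hypothetical network on binary variables 0, 1, 2 (listed in causal order).
parents = {0: (), 1: (0,), 2: (0, 1)}
cpts = {
    0: np.array([0.6, 0.4]),                          # P(x0)
    1: np.array([[0.9, 0.1], [0.2, 0.8]]),            # P(x1 | x0)
    2: np.array([[[0.7, 0.3], [0.4, 0.6]],
                 [[0.5, 0.5], [0.1, 0.9]]]),          # P(x2 | x0, x1)
}

def edges(parents):
    """Edge list (j, i) of the DAG: j is a parent of i."""
    return sorted((j, i) for i, pa in parents.items() for j in pa)

def sever_incoming(parents, B):
    """Graph surgery: drop every edge pointing into a vertex in B."""
    return {i: (() if i in B else pa) for i, pa in parents.items()}

def cpt_products(parents, cpts, fixed, factors):
    """For each assignment of the free variables, multiply the CPT entries of `factors`."""
    free = [i for i in parents if i not in fixed]
    out = {}
    for values in itertools.product(range(2), repeat=len(free)):
        x = {**fixed, **dict(zip(free, values))}
        p = 1.0
        for i in factors:
            p *= cpts[i][tuple(x[j] for j in parents[i]) + (x[i],)]
        out[values] = float(p)
    return out

def interventional(parents, cpts, fixed):
    """P(x_A || x_B): keep only the factors of the free vertices, with x_B clamped."""
    free = [i for i in parents if i not in fixed]
    return cpt_products(parents, cpts, fixed, free)

def observational(parents, cpts, fixed):
    """P(x_A | x_B): keep all factors, then renormalize over x_A."""
    weights = cpt_products(parents, cpts, fixed, list(parents))
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

fixed = {1: 1}                                        # B = {1}, forced to the value 1
print("edges before:", edges(parents))
print("edges after: ", edges(sever_incoming(parents, fixed)))
do = interventional(parents, cpts, fixed)
obs = observational(parents, cpts, fixed)
for key in do:
    print(key, "do:", round(do[key], 4), "observe:", round(obs[key], 4))
```

In this toy network the two distributions over $(x_0, x_2)$ differ, because vertex 1 listens to vertex 0: observing $x_1$ changes our beliefs about $x_0$, while setting $x_1$ by hand does not.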

From that perspective, the directed information

$$I(X_A \to X_B) = \mathbb{E}\left[\log \frac{P(X_A \mid X_B)}{P(X_A \| X_B)}\right]$$

represents the expected log-odds on $X_A$ when we get to observe $X_B$ as *statistical evidence* versus the situation in which we intervene into the system by setting $X_B$ to fixed values. The directed information is zero when there is no causal influence of $X_A$ on $X_B$, i.e., when there is no difference between letting the system evolve passively and then recording our beliefs on $X_A$ given $X_B$ versus orchestrating the evolution of $X_A$ by setting $X_B$ ourselves.
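A short derivation, using only the definitions (1) and (2), may help explain why this log-ratio tracks the influence on the intervened variables. Write $P(x_A \| x_B)$ for the directed kernel of $X_A$ given $X_B$ and $P(x_B \| x_A)$ for the kernel with the roles of the two sets exchanged. By (1) and (2), the joint distribution splits as $P(x^n) = P(x_A \| x_B)\, P(x_B \| x_A)$, and therefore

$$
\frac{P(x_A \mid x_B)}{P(x_A \| x_B)}
= \frac{P(x^n)}{P(x_B)\, P(x_A \| x_B)}
= \frac{P(x_A \| x_B)\, P(x_B \| x_A)}{P(x_B)\, P(x_A \| x_B)}
= \frac{P(x_B \| x_A)}{P(x_B)}.
$$

Taking logarithms and expectations, the quantity above equals $\mathbb{E}\left[\log \frac{P(X_B \| X_A)}{P(X_B)}\right]$. In particular, it vanishes when the kernels $P(x_i \mid x^{i-1})$, $i \in B$, do not depend on the components of $x^{i-1}$ indexed by $A$, since then $P(x_B \| x_A) = P(x_B)$.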

Let me illustrate this point by a simple example of a control system that actually comes from Touchette and Lloyd, even if they do not ever talk about interventions or directed kernels. Consider a system in some initial state $X$, which is driven into some final state $X'$ via application of a control $U$. We have the causal ordering $X, U, X'$. Imagine two modes of controlling the system:

- Open-loop: $U$ is chosen at random independently of $X$, and then the final state $X'$ is a given stochastic function of $X$ and $U$.
- Closed-loop: $U$ is chosen stochastically as a function of $X$ (so the controller observes the initial state), and then, as before, $X'$ is determined from $X$ and $U$.

The causal factorization is

$$P(x, u, x') = P(x)\, P(u \mid x)\, P(x' \mid x, u).$$

Let us assume that in both cases the distribution of the initial state $P(x)$, as well as the *state transition law* $P(x' \mid x, u)$, are the same. Let us inspect the directed kernel $P(x, x' \| u)$. We have

$$P(x, x' \| u) = P(x)\, P(x' \mid x, u)$$

in both cases. But the directed information is

$$I\big((X, X') \to U\big) = \begin{cases} 0, & \text{open-loop} \\ I(X; U), & \text{closed-loop.} \end{cases}$$

In the open-loop case, the control $U$ is chosen independently of the initial state $X$, and so an equivalent causal ordering would have been $U, X, X'$. In that case, the initial and the final state have no *causal* influence on $U$ because the latter might as well have been chosen aeons before, but they are certainly correlated with it. In the closed-loop case, though, the initial and the final state do exert causal influence on the control due to the presence of the feedback link. Thus, the effect of the intervention in the closed-loop case is to simply sever the feedback link, which will make a difference relative to the setting in which the feedback link is present. Moreover, the more information the controller can extract from the initial state, the greater the causal influence.
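The two cases are easy to check numerically. The sketch below assumes a binary state and control, a noisy transition law in which the final state is $x \oplus u$ flipped with probability 0.1, and a closed-loop controller that copies the initial state with probability 0.9. All of these numbers and the function names are illustrative choices, not taken from Touchette and Lloyd.

```python
import numpy as np

def directed_information_to_control(p_x, p_u_given_x, p_next):
    """E log[ P(X,X'|U) / P(X,X'||U) ] in nats, for the causal ordering X, U, X'."""
    # Joint P(x, u, x') from the causal factorization P(x) P(u|x) P(x'|x,u).
    joint = p_x[:, None, None] * p_u_given_x[:, :, None] * p_next
    p_u = joint.sum(axis=(0, 2), keepdims=True)     # P(u)
    conditional = joint / p_u                       # P(x, x' | u)
    kernel = p_x[:, None, None] * p_next            # P(x, x' || u) = P(x) P(x'|x,u)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(conditional[mask] / kernel[mask])))

def mutual_information(p_x, p_u_given_x):
    """I(X; U) in nats."""
    p_xu = p_x[:, None] * p_u_given_x
    p_u = p_xu.sum(axis=0, keepdims=True)
    mask = p_xu > 0
    return float(np.sum(p_xu[mask] * np.log((p_u_given_x / p_u)[mask])))

# Illustrative model: binary X, U, X'.
p_x = np.array([0.5, 0.5])
p_next = np.empty((2, 2, 2))                        # p_next[x, u, x'] = P(x' | x, u)
for x in range(2):
    for u in range(2):
        target = x ^ u
        p_next[x, u, target] = 0.9
        p_next[x, u, 1 - target] = 0.1

open_loop = np.array([[0.5, 0.5], [0.5, 0.5]])      # P(u | x) ignores x
closed_loop = np.array([[0.9, 0.1], [0.1, 0.9]])    # P(u | x) is a noisy copy of x

di_open = directed_information_to_control(p_x, open_loop, p_next)
di_closed = directed_information_to_control(p_x, closed_loop, p_next)
print("open-loop:  ", di_open)
print("closed-loop:", di_closed, "vs I(X;U) =", mutual_information(p_x, closed_loop))
```

The agreement with $I(X; U)$ is not an accident of the chosen numbers: the ratio $P(x, x' \mid u) / P(x, x' \| u)$ simplifies to $P(u \mid x) / P(u)$, so the transition law cancels out of the directed information altogether.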
