# Directed stochastic kernels and causal interventions

As I was thinking more about Massey’s paper on directed information and about the work of Touchette and Lloyd on the information-theoretic study of control systems (which we had started looking at during the last meeting of our reading group), I realized that the directed stochastic kernels that feature so prominently in the general definition of directed information are known in the machine learning and AI communities under another name, due to Judea Pearl: interventional distributions.

Let us recall the definition of a directed stochastic kernel. Consider ${N}$ random variables ${Z_1,\ldots,Z_N}$, causally ordered by their indices. Factorize their joint distribution according to this causal ordering:

$\displaystyle P_{Z^N}(z^N) = \prod^N_{i=1}P_{Z_i|Z^{i-1}}(z_i|z^{i-1}). \ \ \ \ \ (1)$

For any partition of ${\{1,\ldots,N\}}$ into two disjoint sets ${S}$ and ${S^c}$, the directed stochastic kernel of ${Z^S = (Z_i)_{i \in S}}$ given ${Z^{S^c} = (Z_i)_{i \not\in S}}$ is given by

$\displaystyle \vec{P}_{Z^S|Z^{S^c}}(z^S|z^{S^c}) = \prod_{i \in S} P_{Z_i|Z^{i-1}}(z_i|z^{i-1}). \ \ \ \ \ (2)$
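
For finite alphabets, definition (2) can be checked numerically. The following is only a minimal sketch, assuming the joint distribution is stored as a strictly positive NumPy array with one axis per variable; the function names, the example shape, and the choice of ${S}$ are illustrative, and the code uses 0-based indices, so axis `i` corresponds to ${Z_{i+1}}$.

```python
import numpy as np

def causal_conditionals(joint):
    """Return the factors P(z_i | z^{i-1}) of (1), each broadcastable to joint.shape."""
    n = joint.ndim
    conds = []
    for i in range(n):
        later = tuple(range(i + 1, n))
        # Marginal of the first i+1 variables (later axes summed out, kept as size 1).
        prefix = joint.sum(axis=later, keepdims=True) if later else joint
        # Marginal of the first i variables.
        prev = prefix.sum(axis=i, keepdims=True)
        conds.append(prefix / prev)
    return conds

def directed_kernel(joint, S):
    """Product over i in S of P(z_i | z^{i-1}), as an array of shape joint.shape."""
    conds = causal_conditionals(joint)
    out = np.ones(joint.shape)
    for i in S:
        out = out * conds[i]
    return out

# Illustrative joint distribution on four variables (assumed strictly positive).
rng = np.random.default_rng(0)
joint = rng.random((2, 3, 2, 2)) + 0.1
joint /= joint.sum()

S, S_c = (0, 2), (1, 3)          # 0-based: S = {Z_1, Z_3}, S^c = {Z_2, Z_4}
k_S = directed_kernel(joint, S)
k_Sc = directed_kernel(joint, S_c)

# The two directed kernels multiply back to the joint distribution (1).
print(np.allclose(k_S * k_Sc, joint))
# For every fixed z^{S^c}, the kernel is a probability distribution in z^S.
print(np.allclose(k_S.sum(axis=S), 1.0))
```

The second check is what justifies calling (2) a kernel: summing over ${z^S}$ with ${z^{S^c}}$ held fixed gives one, even though (2) is in general not the ordinary conditional distribution ${P_{Z^S|Z^{S^c}}}$.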

We have already seen that this object figures prominently in the definition of the directed information between ${Z^S}$ and ${Z^{S^c}}$, viz.

$\displaystyle I(Z^S \rightarrow Z^{S^c}) = D\Big( P_{Z^S,Z^{S^c}} \Big \| \vec{P}_{Z^S|Z^{S^c}} \times P_{Z^{S^c}}\Big).$

As opposed to the usual mutual information ${I(Z^S; Z^{S^c})}$, which measures the amount of statistical dependence between ${Z^S}$ and ${Z^{S^c}}$, the directed information quantifies the amount of causal influence ${Z^S}$ has on ${Z^{S^c}}$.

Now let’s connect directed stochastic kernels to probabilistic graphical models, or Bayes networks. Staying with the same causal ordering, let us suppose that, for each ${i}$, we can split the set ${\{1,\ldots,i-1\}}$ into two disjoint subsets, ${\Pi_i}$ and ${\Pi^c_i}$, such that

$\displaystyle Z^{\Pi^c_i} \rightarrow Z^{\Pi_i} \rightarrow Z_i \ \ \ \ \ (3)$

is a Markov chain. Then for each ${i}$ the stochastic kernel ${P_{Z_i|Z^{i-1}}(\cdot|z^{i-1})}$ depends only on those components of ${z^{i-1}}$ that lie in ${\Pi_i}$, and we can rewrite the causal factorization (1) as

$\displaystyle P_{Z^N}(z^N) = \prod^N_{i=1} P_{Z_i|Z^{\Pi_i}}(z_i | z^{\Pi_i}). \ \ \ \ \ (4)$

We can represent the conditional independence relations (3) pictorially by means of a directed acyclic graph (DAG) with ${N}$ vertices, the ${i}$th vertex corresponding to ${Z_i}$, where there is an edge connecting vertex ${j \in \{1,\ldots,i-1\}}$ to vertex ${i}$ if and only if ${j \in \Pi_i}$. It is not hard to see that the resulting graph will, indeed, be acyclic. This DAG, or Bayes network, representation of (4), thanks to the efforts of Judea Pearl and many others, is now one of the main tools in machine learning and AI.

Now, it turns out that, for a given assignment ${Z^{S^c} = z^{S^c}}$ of values to the vertices in ${S^c}$, we can represent the operation of forming the directed kernel ${\vec{P}_{Z^S|Z^{S^c}}}$ graphically as well. In order to do that, we simply locate the vertices ${i \in S^c}$, sever all incoming edges (those pointing into them), and consider the resulting DAG. After this operation is performed, the only vertices that can still have incoming edges are the ones in ${S}$, and we end up with (2). Judea Pearl calls the resulting distribution of ${Z^S}$ (given ${Z^{S^c} = z^{S^c}}$) the interventional distribution, meaning that we have intervened into the causal model by disconnecting the vertices in ${S^c}$ from any causal influences upon them and then forcing them to take the values assigned by ${z^{S^c}}$.
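
The graph surgery just described is easy to mimic in code. Here is a small sketch, assuming the DAG is stored as a dictionary mapping each vertex ${i}$ to its parent set ${\Pi_i}$; the four-vertex network and the function names are hypothetical, chosen only for illustration.

```python
def intervene(parents, targets):
    """Cut every edge pointing into a vertex in `targets`; other parent sets are copied."""
    return {v: (set() if v in targets else set(ps)) for v, ps in parents.items()}

def surviving_factors(parents, targets):
    """List (vertex, sorted parents) for the factors that remain in the directed kernel."""
    return [(v, sorted(parents[v])) for v in sorted(parents) if v not in targets]

# Hypothetical network: parents[i] plays the role of Pi_i in (4).
parents = {1: set(), 2: {1}, 3: {1, 2}, 4: {3}}
S_c = {2, 4}                     # vertices whose values we set by hand

mutilated = intervene(parents, S_c)
factors = surviving_factors(parents, S_c)

print(mutilated)
print(factors)
```

Vertices 2 and 4 lose their incoming edges, while vertex 3 keeps both of its parents, including the intervened vertex 2. Only the factors attached to ${S = \{1,3\}}$ survive, which is the product in (2) written with the reduced parent sets of (4).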

From that perspective, the directed information

$\displaystyle I(Z^S \rightarrow Z^{S^c}) = \mathop{\mathbb E} \left[ \log \frac{P_{Z^S|Z^{S^c}}(Z^S|Z^{S^c})}{\vec{P}_{Z^S|Z^{S^c}}(Z^S|Z^{S^c})} \right]$

represents the expected log-likelihood ratio for ${Z^S}$ between two situations: one in which we get to observe ${Z^{S^c}}$ as statistical evidence, and one in which we intervene into the system by setting ${Z^{S^c}}$ to fixed values. The directed information is zero when there is no causal influence of ${Z^S}$ on ${Z^{S^c}}$, i.e., when there is no difference between letting the system evolve passively and then recording our beliefs on ${Z^S}$ given ${Z^{S^c}}$ versus orchestrating the evolution of ${Z^S}$ by setting ${Z^{S^c}}$ ourselves.

Let me illustrate this point with a simple example of a control system that comes from Touchette and Lloyd, although they never talk about interventions or directed kernels. Consider a system in some initial state ${X}$, which is driven into some final state ${X'}$ via application of a control ${U}$. We have the causal ordering ${X, U, X'}$. Imagine two modes of controlling the system:

1. Open-loop — ${U}$ is chosen at random independently of ${X}$, and then the final state ${X'}$ is a given stochastic function of ${X}$ and ${U}$.
2. Closed-loop — ${U}$ is chosen stochastically as a function of ${X}$ (so the controller observes the initial state), and then, as before, ${X'}$ is determined from ${X}$ and ${U}$.

The causal factorization is

$\displaystyle P_{X,U,X'}(x,u,x') = \begin{cases} P_X(x)P_{U}(u)P_{X'|X,U}(x'|x,u), & \text{open-loop}\\ P_X(x)P_{U|X}(u|x)P_{X'|X,U}(x'|x,u), & \text{closed-loop} \end{cases}$

Let us assume that in both cases the distribution of the initial state ${X}$, as well as the state transition law ${P_{X'|X,U}}$, are the same. Let us inspect the directed kernel ${\vec{P}_{X,X'|U}}$. We have

$\displaystyle \vec{P}_{X,X'|U}(x,x'|u) = P_{X}(x) P_{X'|X,U}(x'|x,u)$

in both cases. But the directed information is

$\displaystyle I((X,X') \rightarrow U) = \begin{cases} \mathop{\mathbb E}\left[\log \displaystyle\frac{P_{X}(X)P_{X'|X,U}(X'|X,U)}{P_{X}(X)P_{X'|X,U}(X'|X,U)}\right] = 0, & \text{open-loop}\\ & \\ \mathop{\mathbb E}\left[\log \displaystyle \frac{P_{X|U}(X|U)}{P_X(X)}\right] = I(U;X), & \text{closed-loop} \end{cases}$
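
To spell out the step behind the closed-loop case: by the chain rule, the observational conditional distribution of ${(X,X')}$ given ${U}$ is

$\displaystyle P_{X,X'|U}(x,x'|u) = P_{X|U}(x|u)\,P_{X'|X,U}(x'|x,u),$

and dividing by the directed kernel ${P_X(x)P_{X'|X,U}(x'|x,u)}$ cancels the state transition law:

$\displaystyle \frac{P_{X,X'|U}(x,x'|u)}{\vec{P}_{X,X'|U}(x,x'|u)} = \frac{P_{X|U}(x|u)}{P_X(x)}.$

In the open-loop case ${P_{X|U} = P_X}$, so the ratio is identically ${1}$ and its expected logarithm vanishes; in the closed-loop case the expected logarithm of the ratio is the mutual information ${I(U;X)}$.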

In the open-loop case, the control is chosen independently of the initial state ${X}$, and so an equivalent causal ordering would have been ${U,X,X'}$. In that case, the initial and the final state have no causal influence on ${U}$, because the latter might as well have been chosen aeons before, even though the pair ${(X,X')}$ is, in general, still correlated with it (through the final state ${X'}$). In the closed-loop case, though, the initial and the final state do exert causal influence on the control due to the presence of the feedback link. Thus, the effect of the intervention in the closed-loop case is simply to sever the feedback link, which makes a difference relative to the setting in which the feedback link is present. Moreover, the more information the controller can extract from the initial state, the greater this causal influence: it is exactly ${I(U;X)}$.
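
The two cases can also be checked numerically. The sketch below assumes binary ${X}$, ${U}$ and ${X'}$; the particular numbers (a uniform initial state, a noisy XOR transition law, and the two control policies) are made up for illustration and do not come from Touchette and Lloyd. Logarithms are natural, so the values are in nats.

```python
import numpy as np

def directed_information(p_x, p_u_given_x, p_next):
    """I((X,X') -> U) for arrays indexed [x], [x, u] and [x, u, x']."""
    joint = p_x[:, None, None] * p_u_given_x[:, :, None] * p_next  # P(x, u, x')
    p_u = joint.sum(axis=(0, 2))                                   # P(u)
    observational = joint / p_u[None, :, None]                     # P(x, x' | u)
    interventional = p_x[:, None, None] * p_next                   # directed kernel
    return float(np.sum(joint * np.log(observational / interventional)))

def mutual_information_ux(p_x, p_u_given_x):
    """Ordinary mutual information I(U; X)."""
    p_xu = p_x[:, None] * p_u_given_x
    p_u = p_xu.sum(axis=0)
    return float(np.sum(p_xu * np.log(p_xu / (p_x[:, None] * p_u[None, :]))))

# Assumed model: uniform initial state, X' = X xor U with probability 0.9.
p_x = np.array([0.5, 0.5])
p_next = np.empty((2, 2, 2))
for x in range(2):
    for u in range(2):
        p_next[x, u, x ^ u] = 0.9
        p_next[x, u, 1 - (x ^ u)] = 0.1

open_loop = np.array([[0.3, 0.7], [0.3, 0.7]])    # P(u | x) does not depend on x
closed_loop = np.array([[0.9, 0.1], [0.2, 0.8]])  # controller observes x

di_open = directed_information(p_x, open_loop, p_next)
di_closed = directed_information(p_x, closed_loop, p_next)
mi_closed = mutual_information_ux(p_x, closed_loop)

print(di_open, di_closed, mi_closed)
```

Up to floating-point error, the open-loop value is zero and the closed-loop value coincides with ${I(U;X)}$, whatever transition law is plugged in, since ${P_{X'|X,U}}$ cancels in the ratio.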