Information and Control in Biology, Part 2: On Bergstrom and Rosvall’s “Transmission Sense of Information”

In the first post of this series, I outlined the importance of having a proper operational framework in place before starting any talk of information-theoretic quantities. It is, however, a good idea to pin down, even if provisionally, the kind of information-theoretic quantity one hopes to distill out of the operational formulation. This will be the topic of this post, and the starting point will be the paper by Carl Bergstrom and Martin Rosvall on the “transmission sense of information.”

Let me start by quoting the abstract of the paper, which lays out the common objections against applying information-theoretic ideas to biology:

Biologists think in terms of information at every level of investigation. Signaling pathways transduce information, cells process information, animal signals convey information. Information flows in ecosystems, information is encoded in the DNA, information is carried by nerve impulses. In some domains the utility of the information concept goes unchallenged: when a brain scientist says that nerves transmit information, nobody balks. But when geneticists or evolutionary biologists use information language in their day-to-day work, a few biologists and many philosophers become anxious about whether this language can be justified as anything more than facile metaphor.

The key claim of Bergstrom and Rosvall is that, by focusing on what they refer to as the transmission sense of information, it is possible to avoid the two pitfalls of mutual information, namely the “shallow” notion of correlation and the so-called parity thesis. In doing so, they emphasize the importance of operational notions in information theory “by taking the viewpoint of a communications engineer and focusing on the decision problem of how information is to be packaged for transport.” While this recognition of the importance of operationalization is not common in biology, I still think that it misses an important point: The relevant decision problem is not that of a communications engineer, it is that of a control engineer, whose primary aim is to ensure reliable transmission of the global constraints (formal causes) governing the unfolding of phenotypes (starting with development and continuing on through the organism’s lifetime). In other words, the meaning and use (i.e., the semantics and the pragmatics) of the genome are primary, the symbolic packaging (the syntax) of the genome is secondary. This will be the subject of future posts. Here, my goal is more modest — a critical reconsideration of the Bergstrom and Rosvall paper and an alternative proposal to use directed information, rather than mutual information.

If you are not familiar with directed information, I recommend two old posts of mine (one, two) for the basic background. Briefly, the notion of directed information was proposed by Jim Massey in 1990 in order to analyze systems with feedback by focusing on functional, rather than statistical, dependencies:

… probabilistic dependence is quite distinct from causal dependence. Whether {X} causes {Y} or {Y} causes {X}, the random variables {X} and {Y} will be statistically dependent. Statistical dependence, unlike causality, has no inherent directivity.

This point will be important to us later on.

1. The model

We will consider a very simple model involving the temporal evolution of genotypes and phenotypes in a random environment. We will work in discrete time, using {t = 1,2,\ldots} to index the subsequent generations, so that the triple {(E_t,G_t,X_t)} will consist of the environment, the genotype, and the phenotype, with the obvious temporal ordering:

\displaystyle  E_1,G_1,X_1,E_2,G_2,X_2,\ldots,E_t,G_t,X_t,\ldots .

Let us consider what happens in {T} generations. To that end, we need to prescribe a joint distribution for {E^T,G^T,X^T}:

\displaystyle  	P_{E^T,G^T,X^T}(e^T,g^T,x^T) = \prod^T_{t=1} P_{E_t|E^{t-1}}(e_t|e^{t-1}) P_{G_t|G_{t-1},E_{t-1}}(g_t|g_{t-1},e_{t-1})P_{X_t|G_t,E_t}(x_t|g_t,e_t). \ \ \ \ \ (1)

The factorization of the joint distribution in (1) encodes some important assumptions, which we need to spell out:

  1. The genotype and the environment are modeled by two discrete-time stochastic processes. Notice that the environment is not necessarily Markovian, but the genome process is a Markov process in a random environment (i.e., at each time {t}, the genome {G_t} is conditionally independent of {G^{t-2}} and {E^{t-2}} given {G_{t-1}} and {E_{t-1}}). This is a very simple model of asexual reproduction, and the process distribution of {G^T} models mutations (through stochastic dependence of {G_t} on {G_{t-1}}) and selection (via dependence of {G_t} on {E_{t-1}}).
  2. The phenotype at {t} is determined by the genotype and by the environment at {t}, but there is no causal influence of the phenotype on the genome (the Central Dogma of molecular biology: information cannot flow from proteins to genes).

Again, I need to emphasize that this is a very simple model (see, e.g., the work of Rivoire and Leibler) that does not account for various feedback mechanisms by which the organisms can act on their environments, such as niche construction.
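To make the model concrete, here is a minimal simulation sketch in Python. The factorization (1) is all that the model specifies; everything else below (binary alphabets, an i.i.d. environment, and the particular mutation, selection, and developmental-noise rates) is an illustrative assumption of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s_sel, eps = 0.05, 0.2, 0.1   # mutation, selection, noise rates (all assumed)

def sample_trajectory(T):
    """Sample (E_t, G_t, X_t), t = 1..T, from the factorization in (1)."""
    E, G, X = [], [], []
    prev_e, prev_g = None, None
    for t in range(T):
        e = int(rng.integers(2))              # environment: i.i.d. for simplicity
        if prev_g is None:
            g = int(rng.integers(2))          # uniform initial genome
        elif rng.random() < s_sel:
            g = prev_e                        # selection: pull toward E_{t-1}
        elif rng.random() < mu:
            g = 1 - prev_g                    # mutation: flip G_{t-1}
        else:
            g = prev_g                        # faithful copy of G_{t-1}
        x = g if rng.random() > eps else e    # phenotype from G_t and E_t
        E.append(e); G.append(g); X.append(x)
        prev_e, prev_g = e, g
    return E, G, X

print(sample_trajectory(10))
```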

2. The lack of directivity and the parity thesis

Bergstrom and Rosvall consider what happens at each {t} (they call this the horizontal dimension) as well as the entire trajectory (they call this the vertical dimension); their key claim is that the “transmission sense of information” emerges only when we focus on the latter. Their argument runs as follows. Let us assume for the most part, as they do, that there is no direct feedback from the environment to the genome, so {P_{G_t|G_{t-1},E_{t-1}} = P_{G_t|G_{t-1}}} for each {t}. Consider first the mutual information {I(G_t;X_t)}; note that, since the genome of any organism has finitely many base pairs, this mutual information is always well-defined and finite. They bring up two objections to meaningful use of this quantity in biology:

  1. Mutual information is a “shallow” notion of statistical correlation (in the words of Massey, it has no inherent directivity). Indeed, since the mutual information is symmetric, we have {I(G_t;X_t) = I(X_t;G_t)}, and this symmetry runs counter to the central dogma (no flow of information from proteins to genes). In fact, we can also include the environment and consider {I(G_t; X_t,E_t)}, which runs into the same issues. There is also the related objection that, whatever information there is in the genes, it is semantic in some sense, whereas the semantic aspects are notably absent from Shannon’s theory.
  2. Once we bring the environment into the picture, we also have to deal with the so-called “parity thesis:” In the terminology of Bayes networks, the random triple {(E_t,G_t,X_t)} forms a collider, i.e., {G_t} and {E_t} are mutually independent, and {X_t} is conditionally dependent on both of them — {P_{E_t,G_t,X_t}(e_t,g_t,x_t) = P_{E_t}(e_t)P_{G_t}(g_t)P_{X_t|E_t,G_t}(x_t|e_t,g_t)}. As a DAG, this Bayes network is symmetric with respect to {E_t} and {G_t}; as a result, the mutual information fails to single out the DNA (genome) as the information-bearing entity. If we think of the environment as the source of developmental noise, then the picture that emerges fails to distinguish the “control instructions” (the genome) from the “actuation disturbance” (the developmental noise).

I agree with Bergstrom and Rosvall that these objections provide a strong argument against using mutual information in this context. However, this does not mean that there is no alternative information-theoretic construct that is free of the drawbacks of the mutual information. I claim that one has to use directed information instead. This should not be surprising, since we are in fact trying to capture functional, rather than statistical, dependencies. So, let us then compute the directed informations {I(G_t \rightarrow (E_t,X_t))} and {I((E_t,X_t) \rightarrow G_t)}, where the former will quantify the causal influence of the genome on the environment and on the phenotype, whereas the latter will quantify the causal influence of the environment and the phenotype on the genome. In order to do that, we need to compute the directed stochastic kernels {\vec{P}_{G_t|E_t,X_t}} and {\vec{P}_{E_t,X_t|G_t}}. These can be read off directly from the joint distribution

\displaystyle  P_{E_t,G_t,X_t}(e_t,g_t,x_t) = P_{E_t}(e_t)P_{G_t}(g_t) P_{X_t|G_t,E_t}(x_t|g_t,e_t)

and are given by

\displaystyle  \vec{P}_{G_t|E_t,X_t}(g_t|e_t,x_t) = P_{G_t}(g_t) \qquad \text{and} \qquad \vec{P}_{E_t,X_t|G_t}(e_t,x_t|g_t) = P_{E_t}(e_t)P_{X_t|G_t,E_t}(x_t|g_t,e_t).

With these, we can compute the corresponding directed informations:

\displaystyle  \begin{array}{rcl}  	I(G_t \rightarrow (E_t,X_t)) &=& D(P_{G_t,E_t,X_t}\|\vec{P}_{G_t|E_t,X_t} \times P_{E_t,X_t}) \\ 	&=& {\mathbf E}\left[\log \frac{P_{G_t,E_t,X_t}}{\vec{P}_{G_t|E_t,X_t} P_{E_t,X_t}}\right] \\ 	&=& {\mathbf E}\left[\log \frac{P_{G_t,E_t,X_t}}{P_{G_t} P_{E_t,X_t}} \right] \\ 	&=& I(G_t; E_t,X_t), \end{array}

so this equals just the mutual information between {G_t} and {(E_t,X_t)}, whereas

\displaystyle  \begin{array}{rcl}  	I((E_t,X_t) \rightarrow G_t) &=& D(P_{E_t,X_t,G_t} \| \vec{P}_{E_t,X_t|G_t} \times P_{G_t}) \\ 	&=& {\mathbf E}\left[\log \frac{P_{E_t,X_t,G_t}}{\vec{P}_{E_t,X_t|G_t} P_{G_t}} \right] \\ 	&=& {\mathbf E}\left[\log \frac{P_{E_t,X_t,G_t}}{P_{E_t}P_{X_t|G_t,E_t} P_{G_t}} \right] \\ 	&=& {\mathbf E}\left[\log \frac{P_{E_t,X_t,G_t}}{P_{E_t,X_t,G_t}} \right] \\ 	&\equiv& 0. \end{array}

Since directed information quantifies causal (or functional), rather than statistical, dependencies, this is to be expected — the genes together with the environment exert causal influence on the phenotype, but the phenotype and the environment cannot exert any causal influence on the genes. Here it may also be useful to recall the conservation law that relates mutual information and directed information:

\displaystyle  I(G_t; E_t,X_t) = I(G_t \rightarrow (E_t,X_t)) + I((E_t,X_t) \rightarrow G_t),

and, since the second term on the right-hand side is identically zero, in this particular instance the mutual information between {G_t} and {(E_t,X_t)} captures all of the causal influence of the genes on the phenotype. This switch of perspective from the mutual information to the directed information dispenses with the shallowness objection.
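Here is a quick numeric sanity check of these two identities on binary alphabets. The specific kernel {P_{X_t|G_t,E_t}} below (the phenotype copies the genome with probability {1-\epsilon} and the environment with probability {\epsilon}) is an assumption made purely for illustration:

```python
import numpy as np

eps = 0.1
P_g = np.array([0.5, 0.5])
P_e = np.array([0.7, 0.3])
P_x_ge = np.zeros((2, 2, 2))   # P_x_ge[g, e, x] = P(X = x | G = g, E = e)
for g in range(2):
    for e in range(2):
        P_x_ge[g, e, g] += 1 - eps
        P_x_ge[g, e, e] += eps

# the collider joint P[g, e, x] = P(g) P(e) P(x | g, e)
P = P_g[:, None, None] * P_e[None, :, None] * P_x_ge

def rel_entropy(p, q):
    """D(p || q) in bits, with 0 log 0 = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

P_ex = P.sum(axis=0)   # marginal law of (E, X)

# forward: directed kernel P_vec(g | e, x) = P(g), so the reference
# measure is the product P_G x P_{E,X} and we recover I(G; E, X)
Q_fwd = P_g[:, None, None] * P_ex[None, :, :]
# reverse: directed kernel P_vec(e, x | g) = P(e) P(x | g, e), whose
# product with P(g) rebuilds the joint itself
Q_bwd = P_e[None, :, None] * P_x_ge * P_g[:, None, None]

print("I(G -> (E, X)) =", rel_entropy(P, Q_fwd))   # = I(G; E, X) > 0
print("I((E, X) -> G) =", rel_entropy(P, Q_bwd))   # identically 0
```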

The “parity thesis” is more interesting, and it is indeed a subtle issue. It is easy to see that we can interchange the roles of {G_t} and {E_t} in the above model and obtain the relations

\displaystyle  I(E_t \rightarrow (G_t,X_t)) = I(E_t; G_t,X_t) \qquad \text{and} \qquad I((G_t,X_t) \rightarrow E_t) = 0.

This is, as I mentioned earlier, a simple consequence of the collider structure {G_t \rightarrow X_t \leftarrow E_t}, where {G_t} and {E_t} are two statistically independent objects that jointly influence {X_t}. Bergstrom and Rosvall correctly point out that this is, in fact, a general statement about the interchangeability of the source and the channel noise, as far as matters of correlation go. Indeed, we can consider a triple of random variables {(U,W,X)}, where {U} is a random source that is being transmitted over a channel, whose output {X} is a noise-corrupted version of {U}, i.e., {X = f(U,W)} for some deterministic function {f}, and {W} is the channel noise, statistically independent of the source {U}. Again, we have the collider DAG {U \rightarrow X \leftarrow W}, which is symmetric with respect to the roles of {U} and {W}. And indeed, just like in the case of genes, environment, and proteins, there is nothing in the topology of the DAG that privileges the source {U} over the noise {W}.

There is no way to circumvent the parity thesis other than simply acknowledging, following Howard Pattee, that everything hinges on making the correct epistemic cut between the system (the interconnection of the controller and of the object of control) and the system’s environment. Indeed, if we recall the instructional or control role of the genome, then the resolution of the parity thesis should come from the realization that the control law governing the production of proteins should be (and, in fact, is) a great deal more sensitive to the control instruction itself than to the developmental noise. In other words, it is not enough to look at the mutual information or some such information-theoretic quantity; in order to break the symmetry inherent in the parity thesis, we have to look at the relative sensitivity of the conditional distribution {P_{X_t|G_t = g_t,E_t = e_t}} to modifications in {G_t} versus those in {E_t}. Just as a well-designed control policy should be more sensitive to the control input than to random disturbances, we expect the synthesis of proteins to be more sensitive to the instructions in the DNA than to the disturbances coming from the cellular milieu.
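A toy computation makes this point vivid. Reusing the illustrative kernel from the previous sketch (again, my assumption, not anything in the Bergstrom and Rosvall paper), the phenotype distribution moves an order of magnitude more under a flip of the genome bit than under a flip of the environment bit, even though the DAG is the same in both cases:

```python
import numpy as np

eps = 0.1

def p_x(g, e):
    """P(X = . | G = g, E = e): copy g w.p. 1 - eps, copy e w.p. eps."""
    p = np.zeros(2)
    p[g] += 1 - eps
    p[e] += eps
    return p

def total_variation(p, q):
    return 0.5 * float(np.abs(p - q).sum())

g, e = 0, 0
print("sensitivity to genome:     ", total_variation(p_x(g, e), p_x(1 - g, e)))  # 0.9
print("sensitivity to environment:", total_variation(p_x(g, e), p_x(g, 1 - e)))  # 0.1
```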

Remember, by the way, that in all of the above I have assumed that there is no dependence between the genome and the environment (i.e., genes may undergo mutation, but there is no selection). We can include selection as well, but the resulting mathematics will be a lot more complicated; in particular, we will need to consider the so-called causally conditioned directed information, where we condition on the past realizations of {G} and {E}. The main point here is that {G_t} and {E_t} are conditionally independent given {G^{t-1}} and {E^{t-1}}.

3. The vertical dimension and the transmission sense

Again, let’s follow Bergstrom and Rosvall and simplify things for now by assuming no selection, so {P_{G_t|G_{t-1},E_{t-1}} = P_{G_t|G_{t-1}}} for each {t}. At this point, hopefully I have convinced you that we need to look at directed, rather than mutual, information. Now, however, we will work with directed information between processes, i.e., from {G^T} to {(E^T,X^T)}, as well as the one from {(X^T,E^T)} to {G^T}. The relevant directed stochastic kernels are given by

\displaystyle  \begin{array}{rcl}  	\vec{P}_{E^T|G^T,X^T}(e^T|g^T,x^T) &=& \prod^T_{t=1} P_{E_t|E^{t-1}}(e_t|e^{t-1}) \equiv P_{E^T}(e^T), \\ 	\vec{P}_{G^T|E^T,X^T}(g^T|e^T,x^T) &=& \prod^T_{t=1} P_{G_t|G_{t-1}}(g_t|g_{t-1}) \equiv P_{G^T}(g^T), \\ 	\vec{P}_{X^T|G^T,E^T}(x^T|g^T,e^T) &=& \prod^T_{t=1} P_{X_t|G_t,E_t}(x_t|g_t,e_t) \equiv P_{X^T|G^T,E^T}(x^T|g^T,e^T). \end{array}

Thus, we can compute the directed information

\displaystyle  \begin{array}{rcl}  I(G^T \rightarrow (E^T,X^T)) &=& {\mathbf E}\left[\log \frac{P_{G^T,E^T,X^T}}{\vec{P}_{G^T|E^T,X^T} \times P_{E^T,X^T}}\right] \\ &=& {\mathbf E}\left[\log \frac{P_{G^T,E^T,X^T}}{P_{G^T}P_{E^T,X^T}}\right] \\ &=& I(G^T; E^T,X^T), \end{array}

and, once again, {I((E^T,X^T) \rightarrow G^T) = 0}, so there is no causal influence from {E^T,X^T} to {G^T}. This situation is completely analogous to the one when {t} was fixed. In fact, an entirely analogous argument to the one used for each {t} shows that we can interchange the roles of {G^T} and {E^T}, so that

\displaystyle  I(E^T \rightarrow (G^T,X^T)) = I(E^T; G^T,X^T) \quad \text{and} \quad I((G^T,X^T) \rightarrow E^T) = 0,

so focusing on the temporal unfolding of genes, proteins, and environments does not rid us of the parity thesis. Since both genes and environments affect the phenotype, this is not surprising at all.
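For the skeptical reader, here is a brute-force verification for {T = 2}, with binary alphabets, no selection, and the same illustrative (i.e., assumed) kernels as before: we enumerate the full joint from (1), compute {I(G^T;E^T,X^T)} (which equals {I(G^T \rightarrow (E^T,X^T))} by the kernel identity above), and check that the reverse directed information vanishes:

```python
import numpy as np
from itertools import product

mu, eps = 0.05, 0.1                 # mutation and expression-noise rates (assumed)
P_e = np.array([0.7, 0.3])          # i.i.d. environment (assumed)

def p_g(g, g_prev):
    """P(G_t = g | G_{t-1} = g_prev); uniform initial genome, no selection."""
    if g_prev is None:
        return 0.5
    return 1 - mu if g == g_prev else mu

def p_x(x, g, e):
    """P(X_t = x | G_t = g, E_t = e): copy g w.p. 1 - eps, copy e w.p. eps."""
    return (1 - eps) * (x == g) + eps * (x == e)

# full joint over (e1, g1, x1, e2, g2, x2), built from the factorization (1)
joint = {}
for s in product(range(2), repeat=6):
    e1, g1, x1, e2, g2, x2 = s
    joint[s] = (P_e[e1] * p_g(g1, None) * p_x(x1, g1, e1)
                * P_e[e2] * p_g(g2, g1) * p_x(x2, g2, e2))

def marginal(keep):
    m = {}
    for s, p in joint.items():
        key = tuple(s[i] for i in keep)
        m[key] = m.get(key, 0.0) + p
    return m

P_G = marginal([1, 4])              # law of G^T = (G_1, G_2)
P_EX = marginal([0, 2, 3, 5])       # law of (E^T, X^T)

# I(G^T; E^T, X^T) = I(G^T -> (E^T, X^T)), by the kernel identity
mi = sum(p * np.log2(p / (P_G[s[1], s[4]] * P_EX[s[0], s[2], s[3], s[5]]))
         for s, p in joint.items() if p > 0)

# I((E^T, X^T) -> G^T): divergence of the joint from the directed kernel
# times P_{G^T}; that product rebuilds the joint exactly, so this is zero
rev = sum(p * np.log2(p / (P_G[s[1], s[4]]
                           * P_e[s[0]] * p_x(s[2], s[1], s[0])
                           * P_e[s[3]] * p_x(s[5], s[4], s[3])))
          for s, p in joint.items() if p > 0)

print("I(G^T; E^T, X^T)     =", round(mi, 6))
print("I((E^T, X^T) -> G^T) =", round(rev, 6))   # 0 up to floating point
```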

Bergstrom and Rosvall’s resolution is to cut through all of this by focusing on just the genome process {G^T}. This is what they mean by the transmission sense of information from the viewpoint of a fictitious communications engineer, the role they ascribe to natural selection. Since in their basic model there is no feedback from the phenotype to the environment, we may as well focus on just {G^T} and {E^T}. It is easy to marginalize {X^T} away in (1) to obtain the following factorization of the joint probability law of {G^T} and {E^T}:

\displaystyle  	P_{G^T,E^T}(g^T,e^T) = \prod^T_{t=1} P_{E_t|E^{t-1}}(e_t|e^{t-1}) P_{G_t|G_{t-1},E_{t-1}}(g_t|g_{t-1},e_{t-1}). \ \ \ \ \ (2)

From the viewpoint of a communications engineer, Equation (2) does indeed describe a communication channel: The environment {E^T} is a fixed “noise process” that plays some part in the evolution of {G^T}. At each time {t}, the current “output” genome {G_t} is determined stochastically by the previous “input” genome {G_{t-1}} and by the environmental noise {E_{t-1}}. If we further marginalize out the environment {E^T}, we will obtain the process law of {G^T}. The central claim of Bergstrom and Rosvall is that the transmission sense of biological information is reflected in the structural and the distributional properties of this process law, especially if one were to compare it to a “high-variety” random object like the environmental noise process {E^T}. The main idea is that the combinatorial structure of the genomic alphabet and the information-theoretic characteristics of the probability law of {G^T} point to the genome’s function, which is to reduce uncertainty.
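One way to make this “low-variety versus high-variety” contrast quantitative is to compare entropy rates. Under the kernels assumed in the sketches above (a near-faithful binary copy with mutation rate {\mu} versus an i.i.d. uniform environment), the genome process is dramatically more predictable:

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, for 0 < p < 1."""
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

mu = 0.01   # per-generation flip probability of a genome bit (assumed)
# a symmetric two-state Markov chain has entropy rate H(mu)
print("entropy rate of G:", binary_entropy(mu), "bits/generation")   # ~0.081
print("entropy rate of E:", 1.0, "bits/generation")                  # uniform i.i.d.
```

It is precisely this faithful-copying structure of the process law of {G^T} that makes the genome a plausible carrier of transmitted information.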

Now, uncertainty about what? In order to answer this question, we need to pose an operational interpretation, an optimization problem that our fictitious engineer must solve. And, as I have stressed before, the value of this optimization problem cannot be stated a priori in information-theoretic terms; any such quantity has to be distilled from the nature of the original problem and from the interplay of the source, the channel, and the overall goals. Bergstrom and Rosvall do not give such a formulation, but it is possible to come up with one — see, e.g., the works by Rivoire and Leibler given in the references (I may write a blog post on their work in the future).

I want to close with a critical look at the following quote from the Bergstrom and Rosvall paper:

When information theorists think about coding, they are not thinking about semantic properties. All of the semantic properties are stuffed into the codebook, the interface between source structure and channel structure, which to information theorists is as interesting as a phonebook is to sociologists. When an information theorist says “Tell me how data stream A codes for message set B,” she is not asking you to read her the codebook. She is asking you to tell her about compression, channel capacity, distortion structure, redundancy, and so forth. That is, she wants to know how the structure of the code reflects the statistical properties of the data source and the channel with respect to the decision problem of effectively packaging information for transport.

While the last part is correct (about the structure of the code reflecting the statistics of the source and the channel with respect to the underlying operational criterion), the opening claim about semantics is not entirely accurate. I have touched upon some of these issues in a Twitter thread.

I will probably expand that thread into a blog post soon, but here I just want to emphasize the basic nature of the decision problem faced by a communications engineer. The goal is to convey some information about a source {S} through a given noisy channel. The relevant random objects are then {S} (the source itself), {W} (the discrete message containing a suitably compressed summary of the source), {X} (the input to the noisy channel), {Y} (the output of the noisy channel), and, finally, {A} (the action taken by the receiver on the basis of {Y}). The joint probability law of all these random objects factorizes as

\displaystyle  P_{SWXYA} = P_S P_{W|S} P_{X|W} P_{Y|X} P_{A|Y}

(and here I am simplifying quite a bit and not worrying about time structure, feedback, etc.). Of the stochastic kernels making up this factorization, the two things that are not under the control of the system designer are the source probability law {P_S} and the channel transition law {P_{Y|X}}; everything else is a decision variable — the source code {P_{W|S}}, the channel code {P_{X|W}}, and the receiver’s action policy {P_{A|Y}}. The overall optimization problem would then be to choose these ingredients to minimize, say, the expected cost {{\mathbf E}[c(S,A)]}, possibly subject to additional constraints, such as input costs for the channel. In some sense, the goal of the system designer is to choose the missing stochastic kernels for an incomplete specification given by just {P_S} and {P_{Y|X}}, and some of these stochastic kernels do have a bearing on the semantics and on the pragmatics of the problem: both the source {S} and the action {A} reflect material events, while all the other objects, i.e., {W,X,Y}, can be simply thought of as symbols that can be manipulated purely syntactically, without much thought given to their meaning (i.e., how they relate to {S} and to {A}) or to their use (why one would want to take an action pertaining to the source in the first place). While the problem of designing channel codes can be plausibly thought of as free of semantics, both the source code and the action policy are interfaces between matter and symbols, and are thus necessarily semantic (cf. Howard Pattee’s work on this).
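To make the designer’s problem concrete, here is a minimal sketch that fixes the two given kernels {P_S} and {P_{Y|X}}, picks the simplest possible (deterministic) choices for the three design variables, and evaluates the expected cost by enumerating the factorized joint. All alphabets, kernels, and the Hamming cost below are illustrative assumptions:

```python
import numpy as np
from itertools import product

P_S = np.array([0.8, 0.2])                    # given: source law
P_Y_X = np.array([[0.9, 0.1],                 # given: channel, P_Y_X[x, y]
                  [0.1, 0.9]])

# the three design variables, chosen deterministically for simplicity
P_W_S = np.eye(2)    # source code: W = S
P_X_W = np.eye(2)    # channel code: X = W (a single uncoded channel use)
P_A_Y = np.eye(2)    # action policy: A = Y

def cost(s, a):
    """Hamming cost c(S, A)."""
    return float(s != a)

expected_cost = sum(
    P_S[s] * P_W_S[s, w] * P_X_W[w, x] * P_Y_X[x, y] * P_A_Y[y, a] * cost(s, a)
    for s, w, x, y, a in product(range(2), repeat=5))
print("E[c(S, A)] =", expected_cost)          # 0.1: the raw crossover probability
```

A better design would, of course, use coding; the point is only that the designer’s objective lives at the level of {S} and {A}, while {W}, {X}, and {Y} are internal symbols.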

It seems that the transmission sense of Bergstrom and Rosvall could pertain to the structure of the channel input and output alphabets and to the distributional properties of {X} once all the stochastic kernels have been specified. Indeed, in some settings it is possible to argue that the distribution of {X} should be as close as possible to capacity-achieving for the channel {X \rightarrow Y}, i.e., it should maximize the mutual information {I(X;Y)}, and therefore depend only on the channel transition law {P_{Y|X}} and not on any semantic or pragmatic aspects of the overall problem. This, however, will only be true under very special conditions (in particular, we would need to ensure that the source-channel separation principle is valid), and at any rate it is not clear to me that the relevant setting is communication and not control (orchestration of development and growth via formal causes).
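For completeness, here is a sketch of the standard Blahut–Arimoto iteration, which computes the capacity-achieving input distribution from the channel transition law {P_{Y|X}} alone; this is the precise sense in which, under those special conditions, the law of {X} would carry no semantic fingerprint of the source. The binary symmetric channel below is an assumed example:

```python
import numpy as np

def blahut_arimoto(P_y_x, iters=500):
    """P_y_x[x, y] = P(Y = y | X = x); returns (capacity in bits, input law p*)."""
    nx = P_y_x.shape[0]
    p = np.full(nx, 1.0 / nx)                  # start from the uniform input
    for _ in range(iters):
        q = p @ P_y_x                          # output law induced by p
        ratio = np.divide(P_y_x, q, out=np.ones_like(P_y_x), where=P_y_x > 0)
        d = np.sum(P_y_x * np.log2(ratio), axis=1)   # D(P_{Y|X=x} || q), per x
        p = p * 2.0 ** d                       # multiplicative Blahut-Arimoto update
        p /= p.sum()
    q = p @ P_y_x
    ratio = np.divide(P_y_x, q, out=np.ones_like(P_y_x), where=P_y_x > 0)
    capacity = float(p @ np.sum(P_y_x * np.log2(ratio), axis=1))
    return capacity, p

bsc = np.array([[0.9, 0.1], [0.1, 0.9]])       # binary symmetric channel (assumed)
C, p_star = blahut_arimoto(bsc)
print(f"capacity = {C:.4f} bits, input law = {p_star}")   # ~0.531 bits, uniform
```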

References

  1. Carl Bergstrom and Martin Rosvall, “The transmission sense of information”
  2. James Massey, “Causality, feedback, and directed information”
  3. Robert Cogburn, “Markov chains in random environments: the case of Markovian environments”
  4. Olivier Rivoire and Stanislas Leibler, “The value of information for populations in random environments”
  5. Olivier Rivoire, “Information in models of evolutionary dynamics”
  6. Howard Pattee, “The physics of symbols: bridging the epistemic cut”
  7. Sridhar Vembu, Sergio Verdú, and Yossef Steinberg, “The source-channel separation theorem revisited”
  8. Gérard Battail, “Error-correcting codes and information in biology”
