## Information and Control in Biology, Part 1: Preliminary Considerations

Disclaimer: I am not a biologist, but I have become interested in biology and related matters over the past couple of years. One reason is obviously the pandemic, which has put talk of biology, viruses, mRNA, and the like everywhere. The other, and main, reason is that I think we will not get anywhere interesting in AI unless we understand the concepts of autonomy, self-directedness, integration, and adaptation in even very simple biological systems.

This will be the first in a series of posts that are meant as an extended response to Yohan John’s old post over at 3 Quarks Daily.

Yohan writes:

We are increasingly employing information as an explanation of phenomena outside the world of culture and technology — as the central metaphor with which to talk about the nature of life and mind. Molecular biology, for instance, tells us how genetic information is transferred from one generation to the next, and from one cell to the next. And neuroscience is trying to tell us how information from the external world and the body percolates through the brain, influencing behavior and giving rise to conscious experience.

But do we really know what information is in the first place? And is it really a helpful way to think about biological phenomena? I’d like to argue that explanations of natural phenomena that involve information make inappropriate use of our latent, unexamined intuitions about inter-personal communication, blurring the line between what we understand and what we don’t quite have a grip on yet.

Similar sentiments are quoted by Carl Bergstrom and Martin Rosvall:

Biologists think in terms of information at every level of investigation. Signaling pathways transduce information, cells process information, animal signals convey information. Information flows in ecosystems, information is encoded in the DNA, information is carried by nerve impulses. In some domains the utility of the information concept goes unchallenged: when a brain scientist says that nerves transmit information, nobody balks. But when geneticists or evolutionary biologists use information language in their day-to-day work, a few biologists and many philosophers become anxious about whether this language can be justified as anything more than facile metaphor.

Yohan argues that information theory is, on the whole, not an appropriate framework with which to reason about biological information. Carl and Martin argue otherwise, but propose their own framework, what they refer to as the transmission sense of information, which purportedly resolves the issues that trouble “a few biologists and many philosophers.” My goal in this series of posts is to argue that information theory can indeed be applied to biology, but that its proper application needs to be built up from first principles, starting with a serious engagement with its entire conceptual framework. Moreover, I agree with Yohan that digital communication is not the right conceptual schema; instead, we should be talking about control, programmability, and behaviors.

## Sampling Using Diffusion Processes, from Langevin to Schrödinger

These notes are based on the tutorial I gave at the Geometric Methods in Optimization and Sampling Boot Camp at the Simons Institute in Berkeley.

Suppose we wish to obtain samples from some probability measure ${\mu}$ on ${{\mathbb R}^d}$. If ${\mu}$ has a sufficiently well-behaved density ${f}$ with respect to the Lebesgue measure, i.e., ${\mu(dx) = f(x) dx}$, then we can use the (overdamped) continuous-time Langevin dynamics, governed by the Ito stochastic differential equation (SDE)

$\displaystyle d X_t = \frac{1}{2}\nabla \log f(X_t) dt + dW_t, \qquad t \ge 0 \ \ \ \ \ (1)$

where the initial condition ${X_0}$ is generated according to some probability law ${\mu_0}$, and ${(W_t)_{t \ge 0}}$ is the standard ${d}$-dimensional Brownian motion. Let ${\mu_t}$ denote the probability law of ${X_t}$. Then, under appropriate regularity conditions on ${f}$, one can establish the following:

• ${\mu}$ is the unique invariant distribution of (1), i.e., if ${\mu_0 = \mu}$, then ${\mu_t = \mu}$ for all ${t}$.
• ${\mu_t}$ converges to ${\mu}$ in a suitable sense as ${t \rightarrow \infty}$ — in fact, it is often possible to show that there exists a constant ${c > 0}$ that depends only on ${\mu}$, such that one has the exponential convergence to equilibrium

$\displaystyle {\rm dist}(\mu_t, \mu) \le e^{-t/c}{\rm dist}(\mu_0, \mu)$

for some distance between probability measures on ${{\mathbb R}^d}$.

Since convergence to ${\mu}$ holds only in the limit ${t \rightarrow \infty}$, at any finite time the Langevin process (1) gives only approximate samples from ${\mu}$. I would like to discuss an alternative approach that uses diffusion processes to obtain exact samples in finite time. This approach is based on ideas that appeared in two papers from the 1930s by Erwin Schrödinger in the context of physics, and is now referred to as the Schrödinger bridge problem.
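To make the Langevin dynamics (1) concrete, here is a minimal numerical sketch. It is not part of the tutorial: it assumes an Euler–Maruyama discretization with step size `h` (often called the unadjusted Langevin algorithm), and it picks the one-dimensional standard Gaussian as an illustrative target, so that ${\nabla \log f(x) = -x}$. The names `grad_log_f` and `langevin_samples` are hypothetical.

```python
import math
import random

def grad_log_f(x):
    # Score of the illustrative target f(x) proportional to exp(-x^2 / 2),
    # i.e. the standard Gaussian on the real line (an assumption of this sketch).
    return -x

def langevin_samples(n_chains=2000, n_steps=400, h=0.05, x0=3.0, seed=0):
    # Euler-Maruyama discretization of dX_t = (1/2) grad log f(X_t) dt + dW_t.
    # All chains start from the same point x0, i.e. mu_0 is a point mass.
    rng = random.Random(seed)
    xs = [x0] * n_chains
    for _ in range(n_steps):
        xs = [
            x + 0.5 * h * grad_log_f(x) + math.sqrt(h) * rng.gauss(0.0, 1.0)
            for x in xs
        ]
    return xs

samples = langevin_samples()
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(f"empirical mean = {mean:.3f}, empirical variance = {var:.3f}")
```

For this particular target the discretized chain is a linear recursion, and its stationary variance works out to ${1/(1 - h/4)}$ rather than ${1}$. So the samples are approximate for two reasons: the finite time horizon and the discretization bias.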

## Stochastic kernels vs. conditional probability distributions

Larry Wasserman’s recent post about misinterpretation of p-values is a good reminder about a fundamental distinction anyone working in information theory, control or machine learning should be aware of, namely the distinction between stochastic kernels and conditional probability distributions.
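One way to read this distinction on finite sets is sketched below; it is an illustration with made-up numbers, not taken from the post itself. A stochastic kernel assigns a distribution over outputs to every input, with no probability law on the inputs required. A conditional distribution, by contrast, is derived from a joint law, so the "reverse" conditional changes whenever the input distribution changes. The names `K`, `joint`, and `conditional_x_given_y` are hypothetical.

```python
# A stochastic kernel: for each input x, a probability distribution over outputs y.
# It is defined without putting any probability distribution on x.
K = {
    "h0": {"small": 0.95, "large": 0.05},
    "h1": {"small": 0.20, "large": 0.80},
}

def joint(p_x):
    # Combine an input distribution p_x with the kernel K into a joint law on (x, y).
    return {(x, y): p_x[x] * K[x][y] for x in K for y in K[x]}

def conditional_x_given_y(p_x, y):
    # Conditional distribution of x given y; this needs the joint law, hence p_x.
    P = joint(p_x)
    p_y = sum(P[(x, y)] for x in K)
    return {x: P[(x, y)] / p_y for x in K}

# Same kernel, two different input distributions, two different conditionals.
post_a = conditional_x_given_y({"h0": 0.5, "h1": 0.5}, "large")
post_b = conditional_x_given_y({"h0": 0.99, "h1": 0.01}, "large")
print(K["h0"]["large"], post_a["h0"], post_b["h0"])
```

In this toy example the kernel value `K["h0"]["large"]` stays fixed, while the conditional probability of `"h0"` given `"large"` depends entirely on which input distribution is assumed.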

## Lower bounds for passive and active learning

Sasha Rakhlin and I will be presenting our paper “Lower bounds for passive and active learning” at this year’s NIPS, which will be taking place in Granada, Spain from December 12 to December 15. The proofs of our main results rely heavily on information-theoretic techniques, specifically the data processing inequality for ${f}$-divergences and a certain type of constant-weight binary codes.

Just a couple of short items, while I catch my breath.

1. First of all, starting January 1, 2012 I will find myself amidst the lovely cornfields of Central Illinois, where I will be an assistant professor in the Department of Electrical and Computer Engineering at UIUC. This will be a homecoming of sorts, since I spent three years there as a Beckman Fellow. My new home will be in the Coordinated Science Laboratory, where I will continue doing (and blogging about) the same things I do (and blog about) now.

2. Speaking of Central Illinois, last week I was at the Allerton Conference, where I tried my best to preach Uncle Judea’s gospel to information theorists and their fellow travelers (that is, to anyone willing to listen). My paper, entitled “Directed information and Pearl’s causal calculus,” is now up on arxiv, and here is the abstract:

Probabilistic graphical models are a fundamental tool in statistics, machine learning, signal processing, and control. When such a model is defined on a directed acyclic graph (DAG), one can assign a partial ordering to the events occurring in the corresponding stochastic system. Based on the work of Judea Pearl and others, these DAG-based “causal factorizations” of joint probability measures have been used for characterization and inference of functional dependencies (causal links). This mostly expository paper focuses on several connections between Pearl’s formalism (and in particular his notion of “intervention”) and information-theoretic notions of causality and feedback (such as causal conditioning, directed stochastic kernels, and directed information). As an application, we show how conditional directed information can be used to develop an information-theoretic version of Pearl’s “back-door” criterion for identifiability of causal effects from passive observations. This suggests that the back-door criterion can be thought of as a causal analog of statistical sufficiency.

If you have seen my posts on stochastic kernels, directed information, and causal interventions, you will, more or less, know what to expect.

Incidentally, due to my forthcoming move to UIUC, this will be my last Allerton paper!

## ISIT 2011: favorite talks

Obligatory disclaimer: YMMV, “favorite” does not mean “best,” etc. etc.

• Emmanuel Abbe and Andrew Barron, “Polar coding schemes for the AWGN channel” (pdf)
  • The problem of constructing polar codes for channels with continuous input and output alphabets can be reduced, in a certain sense, to the problem of constructing finitely supported approximations to capacity-achieving distributions. This work analyzes several such approximations for the AWGN channel. In particular, one approximation uses quantiles and approaches capacity at a rate that decays exponentially with support size. The proof of this fact uses a neat trick of upper-bounding the Kullback-Leibler divergence by the chi-square distance and then exploiting the law of large numbers.

• Tom Cover, “On the St. Petersburg paradox”
  • A fitting topic, since this year’s ISIT took place in St. Petersburg! Tom presented a reformulation of the problem underlying this (in)famous paradox in terms of finding the best allocation of initial capital so as to optimize various notions of relative wealth. This reformulation obviates the need for various extra assumptions, such as diminishing marginal returns (i.e., concave utilities), and thus provides a means of resolving the paradox from first principles.

• Paul Cuff, Tom Cover, Gowtham Kumar, Lei Zhao, “A lattice of gambles”
  • There is a well-known correspondence between martingales and “fair” gambling systems. Paul and co-authors explore another correspondence, between fair gambles and Lorenz curves used in econometric modeling, to study certain stochastic orderings and transformations of martingales. There are nice links to the theory of majorization and, through that, to Blackwell’s framework for comparing statistical experiments in terms of their expected risks.

• Ioanna Ioannou, Charalambos Charalambous, Sergey Loyka, “Outage probability under channel distribution uncertainty” (pdf; longer version: arxiv:1102.1103)
  • The outage probability of a general channel with stochastic fading is the probability that the conditional input-output mutual information given the fading state falls below the given rate. In this paper, it is assumed that the state distribution is not known exactly, but there is an upper bound on its divergence from some fixed “nominal” distribution (this model of statistical uncertainty has been used previously in the context of robust control). The variational representation of the divergence (as a Legendre-Fenchel transform of the moment-generating function) then allows for a clean asymptotic analysis of the outage probability.

• Mohammad Naghshvar, Tara Javidi, “Performance bounds for active sequential hypothesis testing”
  • Mohammad and Tara show how dynamic programming techniques can be used to develop tight converse bounds for sequential hypothesis testing problems with feedback, in which it is possible to adaptively control the quality of the observation channel. This viewpoint is a lot cleaner and more conceptually straightforward than “classical” proofs based on martingales (à la Burnashev). This new technique is used to analyze asymptotically optimal strategies for sequential ${M}$-ary hypothesis testing, variable-length coding with feedback, and noisy dynamic search.

• Chris Quinn, Negar Kiyavash, Todd Coleman, “Equivalence between minimal generative model graphs and directed information graphs” (pdf)
  • For networks of interacting discrete-time stochastic processes possessing a certain conditional independence structure (motivating example: discrete-time approximations of smooth dynamical systems), Chris, Negar and Todd show the equivalence between two types of graphical models for these networks: (1) generative models that are minimal in a certain “combinatorial” sense and (2) information-theoretic graphs, in which the edges are drawn based on directed information.

• Ofer Shayevitz, “On Rényi measures and hypothesis testing” (long version: arxiv:1012.4401)
  • Ofer obtains a new variational characterization of Rényi entropy and divergence that considerably simplifies their analysis, in many cases completely replacing delicate arguments based on Taylor expansions with purely information-theoretic proofs. He also develops a new operational characterization of these information measures in terms of distributed composite hypothesis testing.

## Deadly ninja weapons: Blackwell’s principle of irrelevant information

Having more information when making decisions should always help, it seems. However, there are situations in which this is not the case. Suppose that you observe two pieces of information, ${x}$ and ${y}$, which you can use to choose an action ${u}$. Suppose also that, upon choosing ${u}$, you incur a cost ${c(x,u)}$. For simplicity let us assume that ${x}$, ${y}$, and ${u}$ take values in finite sets ${{\mathsf X}}$, ${{\mathsf Y}}$, and ${{\mathsf U}}$, respectively. Then it is obvious that, no matter which “strategy” for choosing ${u}$ you follow, you cannot do better than ${u^*(x) = \displaystyle{\rm arg\,min}_{u \in {\mathsf U}} c(x,u)}$. More formally, for any strategy ${\gamma : {\mathsf X} \times {\mathsf Y} \rightarrow {\mathsf U}}$ we have

$\displaystyle c(x,u^*(x)) = \min_{u \in {\mathsf U}} c(x,u) \le c(x,\gamma(x,y)).$

Thus, the extra information ${y}$ is irrelevant. Why? Because the cost you incur does not depend on ${y}$ directly, though it may do so through ${u}$.
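The finite-set argument above can be checked by brute force. The sketch below is only an illustration: the sets `X`, `Y`, `U` and the cost function are made up, and the names `cost`, `u_star`, and `no_strategy_beats_u_star` are hypothetical. It enumerates every strategy ${\gamma : {\mathsf X} \times {\mathsf Y} \rightarrow {\mathsf U}}$ and confirms that none of them achieves a lower cost than ${u^*(x)}$ at any pair ${(x,y)}$.

```python
import itertools

# Small illustrative finite sets (assumptions of this sketch).
X = [0, 1, 2]
Y = [0, 1]
U = [0, 1, 2]

def cost(x, u):
    # Hypothetical cost c(x, u); note that it does not depend on y.
    return (x - u) ** 2 + 0.5 * u

def u_star(x):
    # The y-independent minimizer u*(x) = argmin_u c(x, u).
    return min(U, key=lambda u: cost(x, u))

pairs = list(itertools.product(X, Y))

def no_strategy_beats_u_star():
    # Enumerate all |U|^(|X||Y|) strategies gamma: X x Y -> U.
    for actions in itertools.product(U, repeat=len(pairs)):
        gamma = dict(zip(pairs, actions))
        for (x, y) in pairs:
            if cost(x, gamma[(x, y)]) < cost(x, u_star(x)):
                return False
    return True

print(no_strategy_beats_u_star())
```

With these sets there are ${3^6 = 729}$ strategies, so exhaustive enumeration is cheap. The check is of course just the inequality above applied pointwise; the point of the post is what happens when the sets are no longer finite.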

Interestingly, as David Blackwell showed in 1964 in a three-page paper, this seemingly innocuous argument does not go through when ${{\mathsf X}}$, ${{\mathsf Y}}$, and ${{\mathsf U}}$ are Borel subsets of Euclidean spaces, the cost function ${c}$ is bounded and Borel-measurable, and the strategies ${\gamma}$ are required to be measurable as well. However, if ${x}$ and ${y}$ are random variables with a known joint distribution ${P}$, then ${y}$ is indeed irrelevant for the purpose of minimizing expected cost.

Warning: lots of measure-theoretic noodling below the fold; if that is not your cup of tea, you can just assume that all sets are finite and go with the poor man’s version stated in the first paragraph. Then all the results below will hold.