Information and Control in Biology, Part 1: Preliminary Considerations

Disclaimer: I am not a biologist, but I have become interested in biology and related matters over the past couple of years. One reason is obviously the pandemic, so the talk of biology, viruses, mRNA, and the like is everywhere. The other, main, reason is that I think we will not get anywhere interesting in AI unless we understand the concepts of autonomy, self-directedness, integration, and adaptation in even very simple biological systems.

This will be the first in a series of posts that are meant as an extended response to Yohan John‘s old post over at 3 Quarks Daily.

Yohan writes:

We are increasingly employing information as an explanation of phenomena outside the world of culture and technology — as the central metaphor with which to talk about the nature of life and mind. Molecular biology, for instance, tells us how genetic information is transferred from one generation to the next, and from one cell to the next. And neuroscience is trying to tell us how information from the external world and the body percolates through the brain, influencing behavior and giving rise to conscious experience.

But do we really know what information is in the first place? And is it really a helpful way to think about biological phenomena? I’d like to argue that explanations of natural phenomena that involve information make inappropriate use of our latent, unexamined intuitions about inter-personal communication, blurring the line between what we understand and what we don’t quite have a grip on yet.

Similar sentiments are quoted by Carl Bergstrom and Martin Rosvall:

Biologists think in terms of information at every level of investigation. Signaling pathways transduce information, cells process information, animal signals convey information. Information flows in ecosystems, information is encoded in the DNA, information is carried by nerve impulses. In some domains the utility of the information concept goes unchallenged: when a brain scientist says that nerves transmit information, nobody balks. But when geneticists or evolutionary biologists use information language in their day-to-day work, a few biologists and many philosophers become anxious about whether this language can be justified as anything more than facile metaphor.

Yohan argues that information theory is, on the whole, not an appropriate framework with which to reason about biological information. Carl and Martin argue otherwise, but propose their own framework, what they refer to as the transmission sense of information, which purportedly resolves the issues that trouble “a few biologists and many philosophers.” My goal in this series of posts is to argue that information theory can indeed be applied to biology, but that its proper application needs to be built up from first principles, starting with a serious engagement with its entire conceptual framework. Moreover, I agree with Yohan that digital communication is not the right conceptual schema; instead, we should be talking about control, programmability, and behaviors.

Continue reading “Information and Control in Biology, Part 1: Preliminary Considerations”

Sampling Using Diffusion Processes, from Langevin to Schrödinger

These notes are based on the tutorial I gave at the Geometric Methods in Optimization and Sampling Boot Camp at the Simons Institute in Berkeley.

Suppose we wish to obtain samples from some probability measure {\mu} on {{\mathbb R}^d}. If {\mu} has a sufficiently well-behaved density {f} with respect to the Lebesgue measure, i.e., {\mu(dx) = f(x) dx}, then we can use the (overdamped) continuous-time Langevin dynamics, governed by the Ito stochastic differential equation (SDE)

\displaystyle  d X_t = \frac{1}{2}\nabla \log f(X_t) dt + dW_t, \qquad t \ge 0 \ \ \ \ \ (1)

where the initial condition {X_0} is generated according to some probability law {\mu_0}, and {(W_t)_{t \ge 0}} is the standard {d}-dimensional Brownian motion. Let {\mu_t} denote the probability law of {X_t}. Then, under appropriate regularity conditions on {f}, one can establish the following:

  • {\mu} is the unique invariant distribution of (1), i.e., if {\mu_0 = \mu}, then {\mu_t = \mu} for all {t}.
  • {\mu_t} converges to {\mu} in a suitable sense as {t \rightarrow \infty} — in fact, it is often possible to show that there exists a constant {c > 0} that depends only on {\mu}, such that one has the exponential convergence to equilibrium

    \displaystyle  		{\rm dist}(\mu_t, \mu) \le e^{-t/c}{\rm dist}(\mu_0, \mu)

    for some distance between probability measures on {{\mathbb R}^d}.

In this sense, the Langevin process (1) gives only approximate samples from {\mu}. I would like to discuss an alternative approach that uses diffusion processes to obtain exact samples in finite time. This approach is based on ideas that appeared in two papers from the 1930s by Erwin Schrödinger in the context of physics, and is now referred to as the Schrödinger bridge problem.

Continue reading “Sampling Using Diffusion Processes, from Langevin to Schrödinger”

Counting bits with Vapnik and Chervonenkis

Machine learning is about enabling computers to improve their performance on a given task as they get more data. Can we express this intuition quantitatively using information-theoretic techniques? In this post, I will discuss a classic paper by David Haussler, Michael Kearns, and Robert Schapire that (to the best of my knowledge) took the first step in this direction. In this post, I will describe some of their results, recast in a more explicitly information-theoretic way.

Continue reading “Counting bits with Vapnik and Chervonenkis”

Information flow on graphs

Models of complex systems built from simple, locally interacting components arise in many fields, including statistical physics, biology, artificial intelligence, communication networks, etc. The quest to understand and to quantify the fundamental limits on the ability of such systems to store and process information has led to a variety of interesting and insightful results that draw upon probability, combinatorics, information theory, discrete and continuous dynamical systems, etc. In this post, I would like to focus on a model of distributed storage that was analyzed in 1975 by Donald Dawson in a very nice paper, which deserves to be more widely known.

Continue reading “Information flow on graphs”

Two public service announcements

1. The 2014 IEEE North American Summer School on Information Theory will take place June 18-21, 2014 at the Fields Institute in Toronto, Canada.

2. For those of you who use Matlab or Octave, there is a new Information Theoretical Estimators (ITE) toolbox, an open-source toolbox “capable of estimating many different variants of entropy, mutual information, divergence, association measures, cross quantities, and kernels on distributions.” Some more details are available in the guest post by the toolbox’s creator, Zoltán Zsabó, at the Princeton Information Theory b-log.

A graph-theoretic derivation of the Gilbert-Varshamov bound

Just a quick note for my reference, but it may be of interest to others.

Let {A_q(n,d)} denote the size of the largest code over a {q}-ary alphabet that has blocklength {n} and minimum distance {d}. The well-known Gilbert-Varshamov bound says that

\displaystyle  A_q(n,d+1) \ge \frac{q^n}{V_q(n,d)},


\displaystyle  V_q(n,d) = \sum^d_{i=0}{n \choose i}(q-1)^i

is the volume of a Hamming ball of radius {d} in {\{0,\ldots,q-1\}^n}. The usual way of arriving at the GV bound is through a greedy construction: pick an arbitrary codeword {x}, then keep adding codewords that are at Hamming distance of at least {d+1} from all codewords that have already been picked. When this procedure terminates, the complement of the union of the Hamming balls of radius {d} around each of the codewords should be empty — otherwise, you will have at least one more codeword at distance of at least {d+1} from the ones already picked, and this would mean that the procedure could not have terminated.

As it turns out, there is another way of deriving the GV bound using graph theory that I have learned from a nice paper by Van Vu and Lei Wu. They use this graph-theoretic interpretation to arrive at an asymptotic improvement of the GV bound. Their result, which I will not go into here, extends an earlier result by Tao Jiang and Alex Vardy for binary codes. As far as I can tell, the graph-theoretic ideas go back to the Jiang-Vardy paper as well.

In order to proceed, we need some definitions and a lemma. Let {G = (V,E)} be an undirected graph. A set {S \subseteq V} of vertices is called independent if no two vertices in {S} are connected by an edge. The independence number of {G}, denoted by {I(G)}, is the cardinality of the largest independent set. The following lemma is folklore in graph theory:

Lemma 1 Suppose that {G} is {D}-regular, i.e., every vertex has exactly {D} neighbors. Then

\displaystyle  	I(G) \ge \frac{|V|}{D+1}. \ \ \ \ \ (1)

Proof: Let {I} be a maximal independent set. Any vertex {v \in V\backslash I} is connected by an edge to at least one {v' \in I}, because otherwise {v} would have to be included in {I}, which would contradict maximality. Therefore, there are at least {|V \backslash I| = |V| - |I|} edges with one vertex in {I} and another vertex in {V \backslash I}. On the other hand, because {G} is {D}-regular, there can be at most {D|I|} such edges. This means that

\displaystyle  	D|I| \ge |V| - |I|.

Rearranging, we get (1). \Box

Now let us construct the following graph (what Jiang and Vardy call the Gilbert graph): associate a vertex to each word in {\{0,\ldots,q-1\}^n}, and connect two vertices by an edge if and only if the Hamming distance between the corresponding words is at most {d}. This graph has {q^n} vertices, and each vertex has degree {D = V_q(n,d)-1}. Moreover, there is a one-to-one correspondence between independent sets of vertices and {q}-ary codes of length {n} and minimum distance at least {d+1}, and the independence number of the Gilbert graph is equal to {A_q(n,d+1)}. The bound (1) is then precisely the GV bound.

Briefly, the Vu-Wu improvement of the GV bound exploits the deep fact that, when the neighborhood of any vertex in a {D}-regular graph is very sparse (in the sense that it has a lot fewer than {{D \choose 2}} edges), the lower bound (1) can be significantly tightened. Apparently, actually counting the number of edges in such a neighborhood of any vertex of the Gilbert graph (by regularity, we may as well look at the neighborhood of the all-zero word) is rather complicated; Vu and Wu instead look at a suitable asymptotic regime when {n} is large and {d = \alpha n} for some {\alpha \le 1-1/q} and replace exact combinatorial bounds by entropy bounds.

Lossless source coding at Western Union

… circa 1935. Here is a quote from Single-Story America (translated as Little Golden America), an American travelogue by Ilya Ilf and Yevgeny Petrov:

There is a whole book of readymade telegrams, long and convincing, lavishly composed telegrams for all occasions. Sending such a telegram costs only twenty-five cents. You see, what gets transmitted over the telegraph is not the text of the telegram, but simply the number under which it is listed in the book, and the signature of the sender. This is quite a funny thing, reminiscent of Drugstore Breakfast #2. Everything is served up in a ready form, and the customer is totally freed from the unpleasant necessity to think, and to spend money on top of it.

Typical set encoding, anyone?

ISIT 2013: two plenaries on concentration of measure

Of the five plenary talks at this year’s ISIT, two were about concentration of measure: Katalin Marton’s Shannon lecture on “Distance-divergence inequalities” and Gabor Lugosi’s talk on “Concentration inequalities and the entropy method” the next morning. Since the topic of measure concentration is dear to my heart, I thought I would write down a few unifying themes.

Continue reading “ISIT 2013: two plenaries on concentration of measure”

Stochastic kernels vs. conditional probability distributions

Larry Wasserman‘s recent post about misinterpretation of p-values is a good reminder about a fundamental distinction anyone working in information theory, control or machine learning should be aware of — namely, the distinction between stochastic kernels and conditional probability distributions.

Continue reading “Stochastic kernels vs. conditional probability distributions”