Stochastic kernels vs. conditional probability distributions
Larry Wasserman‘s recent post about misinterpretation of p-values is a good reminder about a fundamental distinction anyone working in information theory, control or machine learning should be aware of — namely, the distinction between stochastic kernels and conditional probability distributions.
Updates! Get your updates here!
Just a couple of short items, while I catch my breath.
1. First of all, starting January 1, 2012 I will find myself amidst the lovely cornfields of Central Illinois, where I will be an assistant professor in the Department of Electrical and Computer Engineering at UIUC. This will be a homecoming of sorts, since I have spent three years there as a Beckman Fellow. My new home will be in the Coordinated Science Laboratory, where I will continue doing (and blogging about) the same things I do (and blog about).
2. Speaking of Central Illinois, last week I was at the Allerton Conference, where I had tried my best to preach Uncle Judea‘s gospel to anyone willing to listen information theorists and their fellow travelers. The paper, entitled “Directed information and Pearl’s causal calculus,” is now up on arxiv, and here is the abstract:
Probabilistic graphical models are a fundamental tool in statistics, machine learning, signal processing, and control. When such a model is defined on a directed acyclic graph (DAG), one can assign a partial ordering to the events occurring in the corresponding stochastic system. Based on the work of Judea Pearl and others, these DAG-based “causal factorizations” of joint probability measures have been used for characterization and inference of functional dependencies (causal links). This mostly expository paper focuses on several connections between Pearl’s formalism (and in particular his notion of “intervention”) and information-theoretic notions of causality and feedback (such as causal conditioning, directed stochastic kernels, and directed information). As an application, we show how conditional directed information can be used to develop an information-theoretic version of Pearl’s “back-door” criterion for identifiability of causal effects from passive observations. This suggests that the back-door criterion can be thought of as a causal analog of statistical sufficiency.
If you had seen my posts on stochastic kernels, directed information, and causal interventions, you will, more or less, know what to expect.
Incidentally, due to my forthcoming move to UIUC, this will be my last Allerton paper!
Missing all the action
Update: I fixed a couple of broken links.
I want to write down some thoughts inspired by Chernoff’s memo on backward induction that may be relevant to feedback information theory and networked control. Some of these points were brought up in discussions with Serdar Yüksel two years ago.
The lost art of writing
From the opening paragraph of Herman Chernoff‘s unpublished 1963 memo “Backward induction in dynamic programming” (thanks to Armand Makowski for a scanned copy):
The solution of finite sequence dynamic programming problems involve a backward induction argument, the foundations of which are generally understood hazily. The purpose of this memo is to add some clarification which may be slightly redundant and whose urgency may be something less than vital.
Alas, nobody writes like that anymore.
ECE 299: regression with quadratic loss; stochastic simulation via Rademacher bootstrap
I gave the last lecture earlier today, wrapping up the semester. Here are the notes from the last two weeks:
- Regression with quadratic loss, mostly in reproducing kernel Hilbert spaces, with and without regularization.
- Case study: stochastic simulation via Rademacher bootstrap, where I discuss the work of Vladimir Koltchinskii et al. on efficient stopping algorithms for Monte Carlo stochastic simulation. The idea is to keep sampling until the empirical Rademacher average falls below a given threshold. Once that happens, you stop and compute a minimizer of the empirical risk. The work of Koltchinskii et al. was in turn inspired by the ideas of Mathukumalli Vidyasagar on the use of statistical learning theory in randomized algorithms for robust controller synthesis.
Monday’s lecture was on stochastic gradient descent as an alternative to batch empirical risk minimization. I will post the notes soon.
Linkage
In lieu of serious posting, which will resume in the new year, a few links:
- Several videos of David Blackwell, over at The Inherent Uncertainty
- The (in)famous Witsenhausen counterexample in decentralized control theory now has its own Wikipedia entry (and I think I know who is behind it).
- Emmanuel Abbe, guest-blogging at Combinatorics and More, presents his perspective on Erdal Arikan’s polar codes. Much of what he says makes me think of Terence Tao‘s work on structure and randomness in “large” combinatorial objects.
- Markov decision processes make a surprising appearance in a paper on subexponential lower bounds for certain randomized pivot rules for the simplex algorithm.
- Computation and Control: a new blog by Jerome Le Ny.
- And, last but not least, exorcising Laplace’s demon.
Value of information, Bayes risks, and rate-distortion theory
In the previous post we have seen that access to additional information is not always helpful in decision-making. On the other hand, extra information can never hurt, assuming one is precise about the quantitative meaning of “extra information.” In this post, I will show how Shannon’s information theory can be used to speak meaningfully about the value of information for decision-making. This particular approach was developed in the 1960s and 1970s by Ruslan Stratonovich (of the Stratonovich integral fame, among other things) and described in his book on information theory, which was published in Russian in 1975. As far as I know, it was never translated into English, which is a shame, since Stratonovich was an extremely original thinker, and the book contains a deep treatment of the three fundamental problems of information theory (lossless source coding, noisy channel coding, lossy source coding) from the viewpoint of statistical physics.
Deadly ninja weapons: Blackwell’s principle of irrelevant information
Having more information when making decisions should always help, it seems. However, there are situations in which this is not the case. Suppose that you observe two pieces of information, and
, which you can use to choose an action
. Suppose also that, upon choosing
, you incur a cost
. For simplicity let us assume that
,
, and
take values in finite sets
,
, and
, respectively. Then it is obvious that, no matter which “strategy” for choosing
you follow, you cannot do better than
. More formally, for any strategy
we have
Thus, the extra information is irrelevant. Why? Because the cost you incur does not depend on
directly, though it may do so through
.
Interestingly, as David Blackwell has shown in 1964 in a three-page paper, this seemingly innocuous argument does not go through when ,
, and
are Borel subsets of Euclidean spaces, the cost function
is bounded and Borel-measurable, and the strategies
are required to be measurable as well. However, if
and
are random variables with a known joint distribution
, then
is indeed irrelevant for the purpose of minimizing expected cost.
Warning: lots of measure-theoretic noodling below the fold; if that is not your cup of tea, you can just assume that all sets are finite and go with the poor man’s version stated in the first paragraph. Then all the results below will hold.
Bell Systems Technical Journal: now online
The Bell Systems Technical Journal is now online. Mmmm, seminal articles … . Shannon, Wyner, Slepian, Witsenhausen — they’re all here!
(h/t Anand Sarwate)
Sincerely, your biggest Fano
It’s time to fire up the Shameless Self-Promotion Engine again, for I am about to announce a preprint and a paper to be published. Both deal with more or less the same problem — i.e., fundamental limits of certain sequential procedures — and both rely on the same set of techniques: metric entropy, Fano’s inequality, and bounds on the mutual information through divergence with auxiliary probability measures.
So, without further ado, I give you: (more…)
1 comment