The Information Structuralist

Probability-schmobability, just gimme a good model

In the past few days I have come across two short but thought-provoking papers that seem to voice some of the same concerns, yet arrive at very different conclusions.

One is by Jan Willems (among other things, the founder of the behavioral approach to systems and control):

JW, Probability in control? (to appear in European Journal of Control)

Probability is one of the success stories of applied mathematics. It is universally used, from statistical physics to quantum mechanics, from econometrics to financial mathematics, from information theory to control, from psychology and social sciences to medicine. Unfortunately, in many applications of probability, very little attention is paid to the modeling aspect. That is, the interpretation of the probability used in the model is seldom discussed, and it is rarely explained how one comes to the numerical values of the distributions of the random variables used in the model. The aim of this communication is to put forward some remarks related to the use of probability in Systems and Control.

The other is by Andrew Gelman and Christian Robert (two high priests of the Temple of Bayes):

William Feller has a Note on Bayes’ rule in his classic probability book in which he expresses doubts about the Bayesian approach to statistics and decries it as a method of the past. We analyze in this note the motivations for Feller’s attitude, without aiming at a complete historical coverage of the reasons for this dismissal.

What are we to make of this?

Willems begins by acknowledging the unreasonable effectiveness of probabilistic models in a variety of fields, then singles out the frequentist and the subjectivist extremes of the probabilistic spectrum (I won’t bother pointing out such things as objective Bayes or prequential probability), and finally proceeds to state that neither is particularly defensible in the field of systems and control. The reason, he says, is that the frequentist tacitly assumes statistical regularity, while the subjectivist faces an arduous task of justifying his beliefs. Willems suggests jettisoning the probabilistic assumptions (what he calls the descriptive component of modeling) and instead designing systems based on the most plausible input to a physical model (what he calls the prescriptive component of design). The example he gives is that of Kalman filtering: consider the linear model

$\displaystyle \dot{x}(t) = A x(t) + B w(t), \quad y(t) = C x(t) + D w(t), \quad z(t) = H x(t) \ \ \ \ \ (1)$

with the initial condition ${x(0)}$, where ${w(t)}$ is some deterministic disturbance input, ${x(t)}$ is the state, ${y(t)}$ is the noise-corrupted observation, and ${z(t)}$ is the signal of interest. Then, as he shows elsewhere, the celebrated Kalman–Bucy filter can be derived by first finding the disturbance input ${w(t)}$ and the initial condition ${x(0)}$ that minimize

$\displaystyle J(w,x(0),t) = \int_0^t \| w(\tau) \|^2 \, d\tau + x(0)^T Q x(0)$

(where ${Q}$ is a fixed positive definite matrix) and then propagating them through the system model (1). No probabilistic assumptions on the vector process ${(z(t),y(t))}$ are needed; instead we look for the “most believable” explanation of the observations (although then the obvious question to ask is: why should a less noisy disturbance input be more believable than a more noisy one?). At any rate, according to Willems, a deterministic state-space model like (1) is quite palatable (say, on physical grounds), while saying that the initial state ${x(0)}$ and the disturbance process ${w(t)}$ are somehow random (never mind Gaussian) is, more often than not, a leap of faith. On the other hand, the prescription to minimize ${J(w,x(0),t)}$ can be defended on the grounds of parsimony.
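To make this recipe concrete, here is a minimal discrete-time, scalar sketch (the system parameters and numerical values below are my own toy choices, not Willems’). The key observation is that, once an initial condition is fixed, the observation equation pins down the disturbance sequence that explains the data exactly; the cost ${J}$ is then quadratic in ${x(0)}$, so three evaluations locate its minimizer.

```python
# Toy discrete-time analogue of the deterministic recipe:
#   x_{k+1} = a x_k + b w_k,   y_k = c x_k + d w_k
# (scalar, with illustrative parameters -- not Willems' own example).

def explain(x0, ys, a, b, c, d):
    """For a candidate initial condition x0, recover the unique disturbance
    sequence w that explains the observations y exactly; return the
    disturbance cost sum_k w_k^2 and the final state."""
    x, cost = x0, 0.0
    for y in ys:
        w = (y - c * x) / d      # w_k is pinned down by y_k = c x_k + d w_k
        cost += w * w
        x = a * x + b * w        # propagate: x_{k+1} = a x_k + b w_k
    return cost, x

def estimate(ys, a, b, c, d, q, h):
    """Minimize J(w, x0) = q x0^2 + sum_k w_k^2 over x0.  J is exactly
    quadratic in x0, so three evaluations determine its vertex."""
    J = lambda x0: q * x0 * x0 + explain(x0, ys, a, b, c, d)[0]
    J0, Jp, Jm = J(0.0), J(1.0), J(-1.0)
    alpha = (Jp + Jm - 2.0 * J0) / 2.0   # quadratic coefficient
    beta = (Jp - Jm) / 2.0               # linear coefficient
    x0_star = -beta / (2.0 * alpha)      # most plausible initial condition
    _, x_final = explain(x0_star, ys, a, b, c, d)
    return h * x_final                   # estimate of z(t) = h x(t)
```

For noiseless observations generated with ${w \equiv 0}$, the minimizer recovers the true initial condition (up to the tiny pull of the ${q}$-penalty toward zero), and the estimate of ${z}$ tracks the true trajectory.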

The conclusion that Willems draws is that we should go back to the basics and get the physics right, and then maybe, just maybe, we won’t even need that pesky ${(\Omega,{\mathcal F},P)}$. But what exactly would “getting the physics right” entail?

This brings me to the Gelman and Robert paper. Its main goal is to revisit, in light of modern statistical theory and practice, William Feller’s oft-quoted snarky remarks about “Bayesians” (as he and his fellow probabilists understood that term back in the day) and the (to him, at least) obvious superiority of the Neyman–Pearson approach with its Type 1 and Type 2 errors. I will not dwell on one eminently valid point — that the bad rap Bayesian statistics gets is in no small part due to the “intemperate rhetoric” of the more ardent Bayesianistas about such things as optimality, rationality, or logical coherence. What is more interesting is the distinction Gelman and Robert make between Bayesian inference (and that’s where such silly things as degrees of belief and betting come in) and Bayesian data analysis, which involves making strong assumptions (in the form of priors and likelihoods) to get strong predictions, but then — and that’s the crucial point! — checking the model against new and observed data and revising it. (You mean you haven’t read AG’s and Cosma Shalizi’s recent paper on the philosophy and practice of Bayesian statistics? Drop everything at once and start reading!) Trial and error, baby! That’s what the scientific method is really about, after all.

So, both of these papers make the same point (that we should pay more attention to models and learn to build better and better ones), yet seem to arrive at diametrically opposed solutions: chuck probability out the window (quoth Willems) or embrace probability all the way (say Gelman and Robert). Either way, you’d better have damn good models, because they will be checked against reality. But perhaps we should ask ourselves: are these two views really so different? The answer is, I don’t know, but I have good reason to suspect that it is negative.

For one, there is an excellent book by Glenn Shafer and Vladimir Vovk called Probability And Finance: It’s Only A Game, where, among other things, it is shown how one can obtain such results as the Strong Law of Large Numbers, the Central Limit Theorem, and even the Law of the Iterated Logarithm without ever mentioning the word “probability”. In fact, there is not even a single mention of anything like stable relative frequencies, and yet it is possible to define such things as martingales purely in game-theoretic terms.
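Here is a toy rendition of their bounded forecasting protocol (the fixed-fraction betting strategy and all numerical values are illustrative assumptions on my part, not lifted from the book): Forecaster announces ${m_n}$, Reality announces ${x_n}$, both in ${[-1,1]}$, and Skeptic bets a fraction of his current capital on the discrepancy.

```python
def skeptic_capital(forecasts, outcomes, eps=0.1, k0=1.0):
    """Bounded forecasting game: Forecaster announces m_n, Reality announces
    x_n (both in [-1, 1]), and Skeptic stakes the fraction eps of his
    current capital on the discrepancy x_n - m_n.  For eps <= 1/2 the
    capital never goes negative: it is a game-theoretic martingale."""
    k = k0
    for m, x in zip(forecasts, outcomes):
        k *= 1.0 + eps * (x - m)
    return k
```

If Reality’s outcomes consistently exceed the forecasts, Skeptic’s capital grows exponentially; when the discrepancies average out, it does not. In this framework “Skeptic becomes infinitely rich” plays the role that “an event of probability zero occurs” plays in the measure-theoretic one.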

Another route by which probability can sneak in is through the minimax theorem, as shown in a nice paper by Jake Abernethy, Alekh Agarwal, Peter Bartlett, and Sasha Rakhlin, where they quote Vovk: “for some important problems, the adversarial bounds of on-line competitive learning theory are only a tiny amount worse than the average-case bounds for some stochastic strategies of Nature.”

Finally, I can mention the work by Yuval Lomnitz and Meir Feder on communication over individual channels, which dispenses with the usual stochastic assumptions on channel noise that are bread and butter of information theory and instead looks for real-time adaptive strategies that learn to exploit regularities in the actual channel behavior as it unfolds in time.

In all of these, one cannot help but notice the echoes of Willems’ prescription: the search for the “most plausible” noise (or disturbance, or error) process that explains the observed behavior of the system of interest! But where does Bayesian model checking come in, you say? Well, you write down these priors and likelihoods, see, and then you use them to make predictions, and then you see how well these predictions have panned out. And if you’re wrong (and you will be), then you know what to do: lather, rinse, repeat. Or, in the words of Piet Hein,
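In code, one turn of the lather-rinse-repeat loop might look like the following toy posterior predictive check (the Beta-Bernoulli model and the longest-run test statistic are my own illustrative choices):

```python
import random

def posterior_predictive_check(data, n_rep=2000, seed=0):
    """Conjugate Beta(1,1)-Bernoulli model.  Draw theta from the posterior,
    replicate the data, and ask how often a replication looks at least as
    extreme as the observed binary sequence under a chosen test statistic
    (here: the longest run of 1s)."""
    rng = random.Random(seed)
    n, s = len(data), sum(data)

    def longest_run(xs):
        best = cur = 0
        for x in xs:
            cur = cur + 1 if x == 1 else 0
            best = max(best, cur)
        return best

    t_obs = longest_run(data)
    extreme = 0
    for _ in range(n_rep):
        theta = rng.betavariate(1 + s, 1 + n - s)           # posterior draw
        rep = [1 if rng.random() < theta else 0 for _ in range(n)]
        extreme += longest_run(rep) >= t_obs
    return extreme / n_rep          # posterior predictive p-value
```

A p-value near 0 (or near 1) says the fitted model cannot reproduce the relevant feature of the data — time to revise the likelihood (here, perhaps the independence assumption) and go around the loop again.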

The road to wisdom? — Well, it’s plain
and simple to express:
Err
and err
and err again
but less
and less
and less.

I guess I will have to leave such sticky situations as the loss of identifiability in controlled systems for another post.