We will stick to the setting of binary classification: we have a feature space (which is completely arbitrary) and a binary label space . Let be a class of functions , which we will call the hypothesis space. These functions are our models for the relationship between features and labels, which we wish to learn from data. In order to focus on the essential aspects of the problem, we will adopt the Bayesian approach and assume that the “true” model is a random element of with a given distribution . We assume that the learning agent knows , so we can think of this distribution as the learner’s prior. Training data consist of input-output pairs , where . Here, each is an arbitrary point of , and for now we will treat as an individual sequence. The labels , though, are random and given by , where is the model drawn by Nature from .
The learning agent receives the training data sequentially: at each time , it has access to . Let us denote the pair by . The learning task is: given the next input , predict the value . A learning algorithm is a sequence of possibly randomized mappings : at each time , the learning agent uses all available training data to generate a prediction for the label of . After generating the prediction, the learning agent observes the true . We are interested in the probability that makes an error at time :
We will refer to the function as the learning curve of [note that it depends on the sequence of inputs supplied to the learning agent during learning].
How can we measure the learner’s progress? Let’s write out the causal ordering of all the random variables generated in the process of learning:
Here, has distribution , are determined by the learning algorithm , and are determined by and . Let us denote the joint distribution of all of these variables by . When the learning agent observes at time , we can quantify its information gain relative to time by
It’s not hard to see that is nonnegative because (information at time ) includes . The expected information gain is given by
the conditional mutual information between and given . For our purposes, it will be convenient to express the information gain in terms of a certain quantity called the volume ratio, which quantifies the rate at which the learning agent can zero in on the true model as more observations keep arriving.
To that end, let us first observe that, at time , the agent can localize the unknown model to the set
This set is known in learning theory as the version space at time . It is easy to see that the version spaces are nested: . Clearly, for all , and for any
This implies, in particular, that
and therefore that
where . Since , we expect that each training sample shrinks the version space, and that the learning curve of any good learning algorithm should be controlled by the rate at which decreases. For future convenience, let us define the volume ratio at time :
Note, by the way, that since , we have
Consequently,
We are going to use binary algorithms, so the units of all information quantities are bits.
Now let’s get down to business. Following Haussler et al., we will consider the following two learning algorithms:
and then predicts
The Bayes algorithm is optimal in the sense that it minimizes the probability of incorrectly predicting given all available information. However, it is an improper learning algorithm in the sense that the function will not, in general, be an element of .
The Gibbs algorithm is suboptimal, but, unlike the Bayes algorithm, the function does belong to by construction.
Now let’s analyze their learning curves:
we have
Thus, if , then $latex {Q({\mathcal S}^{Y_t}_t) 1/2}&fg=000000$, the Bayes algorithm will be correct: . Finally, if , then the Bayes algorithm will make a mistake with probability . Therefore,
where we have defined the function
Before stating and proving the main result, we need to collect some preliminary facts. First of all, the volume ratio is always bounded between and . Moreover, we have the following useful formula: for any function ,
(the proof is by conditioning on and using the law of iterated expectation).
Theorem 1 We have the following bounds for the Bayes and the Gibbs algorithms:
Remark 1 Equation (2) shows that the error probability of the Bayes algorithm at time can be bounded below by a function of the expected information gain at time . It also shows that the error probability of the Gibbs algorithm is at least as large as that of the Bayes algorithm. On the other hand, Equation (3) gives an upper bound on the error performance of the Bayes algorithm in terms of the expected information gain, and it also shows that the error of the Gibbs algorithm is at most twice as large as that of the Bayes algorithm.
Proof: We start by obtaining the following alternative expressions for the error probabilities of the two algorithms, as well as for the expected information gain:
Using (1) with , we have
This shows, in particular, that the average information gain due to receiving another training sample is no more than one bit.
With , we have
where we have used the fact that, for any ,
Another way to arrive at this expression is by using the following fact: Suppose we have a biased coin with probability of heads equal to . The best prediction of the outcome of a single toss of that coin, with respect to the Hamming loss , is to predict heads if and tails otherwise. The expected loss in that case is equal to . The problem of predicting given and is equivalent to the above guessing problem with .
Finally, with , we have
Again, if we consider the problem of guessing the outcome of tossing a coin, we can think of the following suboptimal scheme: guess with probability and with probability (i.e., just mentally simulate the coin toss). The probability of error is exactly , which is also twice the variance of a random variable. Reasoning conditionally on , we see that the Gibbs algorithm is exactly this scheme with .
We will also use the following chain of easy-to-prove inequalities: for any ,
We start by proving the lower bound, Equation (2). First, note that, for any ,
Moreover, the function is convex. Therefore,
The upper bound, Equation (3), follows by taking in (4) and then taking expectations.
Now let’s see what happens when we have to do this over and over again: at each time , the learning agent gives a prediction , observes , updates , etc. What can we say about the expected fraction of mistakes? Given a learning algorithm , the quantity of interest is
for each . In order to get a handle on this quantity, we will need some Vapnik-Chervonenkis combinatorics. Given the sequence of input instances, let denote the set of all distinct elements of the tuple . Say that two hypotheses are equivalent if for all . This defines an equivalence relation on , so we can split it into equivalence classes . Here, the number of equivalence classes depends on and on . How many such classes can we have? Let us order the elements of arbitrarily, say by , where . Each can be associated with a binary string . Then and are in the same equivalence class if and only if they give rise to the same binary strings:
Consequently,
For any subset , let us define in the obvious way: if , then we let . Let be the largest integer , such that
for all with . A deep result, commonly known as the Sauer-Shelah lemma (see here for some fascinating history), then says that
Now we are ready to state and prove the following:
Proof: We have
From the definition of the version spaces, it is immediate that is the equivalence class containing the “true” model . In other words, there exists some random , such that . Consequently, we can write
We can recognize the quantity in the last line as the Shannon entropy of the index : that is,
Now, since takes values in , we can invoke the Sauer-Shelah estimate (5) to get
The claimed bound (6) follows.
If is a Vapnik-Chervonenkis class with VC dimension , then we immediately obtain:
Corollary 3
]]>
The system Dawson analyzed is an instance of a Probabilistic Cellular Automaton (PCA), i.e., a system consisting of a discrete collection of nodes, each of which has a discrete state. The configuration of the automaton at each time is specified by listing the state of each node. The state of each node at time is determined stochastically by its own state, as well as by the states of its neighbors, at time , independently of everything else. One of the key questions in the theory of PCAs is whether a given automaton is ergodic, i.e., if its configuration eventually “forgets” the initial condition. Thus, Dawson’s result is a sufficient condition for ergodicity of a PCA.
Now let’s move on to precise definitions. Consider an undirected graph . The neighborhood of a vertex , denoted by , is the set of all vertices that are connected to by an edge. We will also denote the set consisting of and all of its neighbors by . Let us fix a finite alphabet ; for each vertex , let denote a copy of . A configuration is an assignment of a symbol to each vertex . Thus, is the space of all configurations. We will denote a generic configuration by ; for any set , we will denote by the natural projection of onto : .
Now let’s consider the following model of a distributed one-bit memory cell. Let be a random variable, which is the bit we wish to store. We also pick two probability measures and on . Given , we draw the initial configuration at random according to . We allow arbitrary correlations between the different ‘s, so our storage is distributed in this sense. Now let’s suppose that the state of each vertex evolves stochastically in discrete time according to a local Markov model: for each , we have a channel (random transformation) with input alphabet and output alphabet . Then the configuration at time evolves according to the law
with the initial conditions , , and , . Thus, we end up with a Markov chain
where at each time the evolution from to is also Markov “in space:” the state of vertex at time is determined only by the states of the neighbors of at time , i.e., for each and , we have a Markov chain
Now the question is: does the configuration at an arbitrary time retain any information about ? One way to answer this is to pick an arbitrary finite set and to examine the evolution of the mutual information , where
If we can show that there exists some such that for all and at least one nonempty , then our memory cell is capable of storing one bit. If, on the other hand, as for every finite , then eventually the memory cell will forget .
1. Dawson’s impossibility result
Roughly speaking, Dawson proved the following: if the local channels are sufficiently noisy and if the graph is sufficiently sparse, then the mutual information will decay to zero exponentially fast. In order to state his result precisely, we need some preliminary definitions. First of all, in order to quantify the noisiness of each channel , Dawson introduces the following quantity, which he calls the transmission coefficient of :
where the supremum is over all Markov chains , where , the channel from to is arbitrary, and the channel from to is given by . Moreover, let . Clearly, for every , by the data processing inequality. Consequently, .
Remark 1 Dawson’s definition of the transmission coefficient is reminiscent of the best constant in the data processing inequality of Erkip and Cover (corrected by Anantharam et al.): to define that constant in our setting, one would fix a distribution and let
where the supremum is over all Markov chains satisfying the constraints and . The difference is that in (2) we constrain the marginal distribution of and the stochastic kernel , whereas in (3) we constrain the marginal of and the kernel . As shown by Anantharam et al.,
where the supremum is over all marginal distributions of distinct from , and denotes the marginal distribution of when . Both and measure the “noisiness” of , but in different ways. In particular, the motivation behind Dawson’s definition is as follows: At any time , the state of a given node at time depends only on and not on anything else. Thus, if boundary condition, i.e., the configuration of all the nodes except for and its neighbors, is “frozen” at some value , then the amount of information about that node will receive at time will be controlled by the quantity
which, by definition, is no more than . We will elaborate on this point later on.
Also, Dawson assumes that has bounded degree, i.e.,
Now we are in a position to state the following:
Theorem 1 (Dawson) For any finite set ,
Thus, if , then as , and the convergence is exponentially fast.
2. The proof
The proof of Theorem 1 consists of several steps. The first step gives an upper bound on the information value of the state at a given node if we already observed the states of some other nodes. This bound is stated in Lemmas 2 and 3 below.
Lemma 2 Fix two arbitrary, possibly empty finite sets . Then, for any ,
Proof: We begin by noting that, since ,
is a Markov chain. Now, for any fixed choices of the configurations and , consider two probability distributions and of :
From (6), if we consider a channel from to with transition probabilities given by , and then let be the output of with input , then
and is a Markov chain satisfying the conditions in the definition (2) of the transmission coefficient . Therefore,
Taking the expectation with respect to and , we obtain (5).
Lemma 3 For any finite set , any , and an arbitrary (possibly empty) finite set ,
Proof: Using the chain rule for mutual information and Lemma 2,
Now let’s examine the mutual information . Let us use the chain rule to write the conditional mutual information in two ways:
Therefore, we can write
Substituting this into our bound on , we obtain the statement of the lemma.
The main message of Lemma 3 can be stated more succinctly if we introduce shorthand notation
for all finite, possibly empty sets and for all , where we also let
for all finite sets . A nice way to think about the above definitions is in terms of prediction and correction steps in Bayesian filtering: Eq. (7) corresponds to prediction since it quantifies the amount of information about that the nodes in will contain at time given what we know about the state of the nodes in some other set at time , while Eq. (8) quantifies the impact of knowing the state of the nodes in at time , and so represents correction. Moreover, we can write the bound of Lemma 3 as
The second step of the proof is to show that the exponential decay of information, according to Eq. (4), is a consequence of (9). To that end, Dawson employs a clever inductive argument. Let denote the collection of all finite subsets of , and for any define a set function by
The quantity defined in Eq. (8) can be computed in terms of : indeed, by the chain rule we have
In addition, if we know , we can also compute :
Moreover, using our knowledge of and the fact that for all , we can do the following: Let us fix an arbitrary node . Then, from (9), we derive
Fixing another node , we have
where the first step uses (9); the second uses our bound on ; in the third, we add and subtract ; and the last step uses the chain rule. Continuing inductively, we see that, for any ,
In particular, using the formula with , we obtain
The final step is to complete the proof using the fact that . Since is binary, . Therefore,
This can be interpreted as follows: if we define the specific information at time by
then . Now, we write
where the first step follows from (10); the second uses (11); the third step is by (over)counting; the fourth step uses the fact that for every ; and the remaining steps are simple combinatorics. What we have shown is that the specific information at can be bounded as . Thus, if , then in one step the specific information will strictly decrease. Iterating, we get the bound
If , then as . This completes the proof of Theorem 1.
]]>
2. For those of you who use Matlab or Octave, there is a new Information Theoretical Estimators (ITE) toolbox, an open-source toolbox “capable of estimating many different variants of entropy, mutual information, divergence, association measures, cross quantities, and kernels on distributions.” Some more details are available in the guest post by the toolbox’s creator, Zoltán Zsabó, at the Princeton Information Theory b-log.
Let denote the size of the largest code over a -ary alphabet that has blocklength and minimum distance . The well-known Gilbert-Varshamov bound says that
where
is the volume of a Hamming ball of radius in . The usual way of arriving at the GV bound is through a greedy construction: pick an arbitrary codeword , then keep adding codewords that are at Hamming distance of at least from all codewords that have already been picked. When this procedure terminates, the complement of the union of the Hamming balls of radius around each of the codewords should be empty — otherwise, you will have at least one more codeword at distance of at least from the ones already picked, and this would mean that the procedure could not have terminated.
As it turns out, there is another way of deriving the GV bound using graph theory that I have learned from a nice paper by Van Vu and Lei Wu. They use this graph-theoretic interpretation to arrive at an asymptotic improvement of the GV bound. Their result, which I will not go into here, extends an earlier result by Tao Jiang and Alex Vardy for binary codes. As far as I can tell, the graph-theoretic ideas go back to the Jiang-Vardy paper as well.
In order to proceed, we need some definitions and a lemma. Let be an undirected graph. A set of vertices is called independent if no two vertices in are connected by an edge. The independence number of , denoted by , is the cardinality of the largest independent set. The following lemma is folklore in graph theory:
Lemma 1 Suppose that is -regular, i.e., every vertex has exactly neighbors. Then
Proof: Let be a maximal independent set. Any vertex is connected by an edge to at least one , because otherwise would have to be included in , which would contradict maximality. Therefore, there are at least edges with one vertex in and another vertex in . On the other hand, because is -regular, there can be at most such edges. This means that
Rearranging, we get (1).
Now let us construct the following graph (what Jiang and Vardy call the Gilbert graph): associate a vertex to each word in , and connect two vertices by an edge if and only if the Hamming distance between the corresponding words is at most . This graph has vertices, and each vertex has degree . Moreover, there is a one-to-one correspondence between independent sets of vertices and -ary codes of length and minimum distance at least , and the independence number of the Gilbert graph is equal to . The bound (1) is then precisely the GV bound.
Briefly, the Vu-Wu improvement of the GV bound exploits the deep fact that, when the neighborhood of any vertex in a -regular graph is very sparse (in the sense that it has a lot fewer than edges), the lower bound (1) can be significantly tightened. Apparently, actually counting the number of edges in such a neighborhood of any vertex of the Gilbert graph (by regularity, we may as well look at the neighborhood of the all-zero word) is rather complicated; Vu and Wu instead look at a suitable asymptotic regime when is large and for some and replace exact combinatorial bounds by entropy bounds.
]]>
There is a whole book of readymade telegrams, long and convincing, lavishly composed telegrams for all occasions. Sending such a telegram costs only twenty-five cents. You see, what gets transmitted over the telegraph is not the text of the telegram, but simply the number under which it is listed in the book, and the signature of the sender. This is quite a funny thing, reminiscent of Drugstore Breakfast #2. Everything is served up in a ready form, and the customer is totally freed from the unpleasant necessity to think, and to spend money on top of it.
Typical set encoding, anyone?
When we speak of concentration of measure, we mean the following. Let be a random variable taking values in some metric space . Then we say that (or, more correctly, the distribution of ) has the concentration property if, for any set such that , we have
Here, is the distance from the point to the set :
Another way to express (1) is as follows: for any set and any , define the -blowup of by
Then has the concentration property if
In other words, has the concentration property if any set containing with not too small a probability can be “blown up” to contain with near-certainty.
Here are two classic examples of concentration:
that counts the fraction of bits in which and disagree. Let have the uniform distribution on , i.e., for all . Then
These two examples suggest that we should aim for “hard” statements in the form of sharp bounds on the concentration function
as opposed to merely “soft” statements of the form as . The $64,000 question is: how do we get such bounds?
There are two ways to accomplish this goal, and the main idea underlying these two ways is to replace sets with some other objects that are hopefully easier to handle. The first way is to replace sets by probability measures, the second is to replace them by functions. Here is what I mean:
Fix a set with . Let denote the distribution of , and let denote the conditional distribution of given . That is, for any (measurable) set we have
I am using the subscript notation instead of the more usual to indicate the fact that is a probability measure in its own right. In this way, we can associate to each non-null set a probability measure . Now, here is a very simple observation that turns out to be very consequential:
This is very easy to prove: for any set we have
so is absolutely continuous with respect to with the Radon-Nikodym derivative . Therefore, by definition of the divergence,
So if we are interested in bounding the probabilities of various sets , we may hope to get some mileage out of the relationship (2).
On the other hand, we may also associate to a set with the function
This function is Lipschitz: for any ,
where the last step is by the triangle inequality. Interchanging the roles of and , we get the Lipschitz property. Moreover, let us consider the random variable , where is our original -valued random variable. Then we immediately notice two things:
(the second inequality is obviously true since is nonnegative with probability ).
These two observations suggest that we may obtain concentration bounds by bounding the probability that a given Lipschitz function of deviates from its median by more than . In fact, it is easy to derive an alternative expression for the concentration function :
where denotes any median of . We already showed, by passing from to , that is bounded from above by the quantity on the right-hand side of (3):
To prove the reverse inequality, fix any -Lipschitz function and consider the set , where is any median of . Then, by definition,
Moreover, if we consider the -blowup
then for any and any we must have
where the last step is by the Lipschitz property of . Consequently, by definition of the concentration function,
By passing to the functional viewpoint, we obtain another equivalent characterization of the concentration property: a random variable taking values in a metric space has the concentration property if real-valued Lipschitz functions are “nearly constant.”
Let’s look at the first, probabilistic viewpoint, which was born out of a 1996 breakthrough paper by Marton. Given a metric space , let us define the Wasserstein distance (or transportation distance) between any two probability measures and on it:
where the infimum is over all jointly distributed random variables , such that and . Now consider a random variable , for which we wish to establish concentration. What Marton showed is the following: Suppose the distribution of satisfies the transportation inequality
for some constant . Then has the concentration property, and moreover
Marton’s proof is breathtakingly beautiful. Consider any two sets with . Recalling our notation for conditional distributions, we can write
where in the first step we have used the triangle inequality, in the second we have used the fact that satisfies the transportation inequality (4), and in the last step we have used the formula (2). Now suppose that and let for some , where denotes set-theoretic complement. Then we can show that . On the other hand,
Combining these facts gives us the bound
that holds for all . If , then we get
so we indeed have concentration and a sharp bound on , at least for large enough values of . The main message here is that, in order to study concentration, it suffices to work on the level of probability measures and to focus one’s effort on showing that the distribution of satisfies a suitable transportation inequality. Since Marton’s original work, there have been many refinements and extensions, which I will not go into here. One such result, due to Sergey Bobkov and Friedrich Götze, says that satisfying a transportation inequality (4) is equivalent to the Gaussian concentration property
Now let’s look at the complementary functional viewpoint. Recall that we seek tight upper bounds on deviation probabilities of the form
It is easier to work with means instead of medians, and indeed it can be shown that concentration around the mean is equivalent to concentration around any median. So let’s focus on the mean. Let , as before, be a random variable over some metric space , and consider a Lipschitz function such that . We can apply the well-known Chernoff trick: for any we have
Now the whole affair hinges on the availability of tight upper bounds on the logarithmic moment-generating function . The entropy method, which was the subject of Gabor Lugosi’s plenary, is the name for a set of techniques for bounding by means of various inequalities involving the relative entropy between various tilted distributions derived from and itself. The entropy method has roots in the work of Michel Ledoux, who in turn distilled it from some very deep results of Michel Talagrand.
The simplest version of the entropy method goes something like this. Let us define, for any , the tilted distribution via
(assuming, of course, that exists and is finite). Then we have
Integrating and using the fact that , we get
Now suppose that we can bound for some . Then from (5) we have
which in turn gives the Gaussian tail bound
This is the so-called Herbst argument. Of course, I have glossed over the most nontrivial part — namely, showing that we can bound the relative entropy by a quadratic function of . Gabor’s talk contained numerous examples of how one could get such bounds in the special case when is an -tuple of independent random variables taking values in some base space , , and the function is not “too sensitive” to local modifications of its arguments. This takes us back to probability on metric spaces.
Here is one classic example. Suppose that our function has the bounded difference property, i.e., there exist some constants , such that changing the th argument of while keeping others constant will change the value of by at most :
We can express this more succinctly as a Lipschitz property of if we define the weighted Hamming metric
(we can assume without loss of generality that all the ‘s are strictly positive, because we can simply ignore those coordinates of that do not affect the value of ). Then the bounded difference property is equivalent to saying that is -Lipschitz with respect to this weighted Hamming metric. Moreover, it is possible to show that any product probability measure on the product space satisfies the transportation inequality
where the Wasserstein distance is computed with respect to the weighted Hamming metric (6), and . By the Bobkov–Götze result quoted above, this is equivalent to the concentration bound
This is the well-known McDiarmid’s inequality, which was originally derived using martingale techniques, but here we have arrived at it through a completely different route that took us back to where we started: concentration of Lipschitz functions around their means and/or medians, which (as we saw) is the same thing as the original, “stochastic-geometric” view of the concentration of measure phenomenon.
]]>
Roughly speaking, stochastic kernels are building blocks, objects that have to be interconnected in order to instantiate stochastic systems. Conditional probability distributions, on the other hand, arise only when we apply Bayes’ theorem to joint probability distributions induced by these interconnections.
At a very high level of abstraction, we may imagine a space of observations or outcomes and a space of states or inputs . Each possible state induces a probability distribution over — let’s denote it by . The interpretation is that if the state is , then the probability that we observe an outcome in some set is . Notice that this stipulation has the flavor of a conditional statement: if A then B. Mathematical statisticians (going back to Abraham Wald, and greatly elaborated by Lucien Le Cam and his followers) like to think of the collection as an experiment that reveals something about the state in through a random observation in . Note that is not a set of probability distributions — two distinct ‘s may carry two identical ‘s (which would indicate that these two ‘s are statistically indistinguishable on the basis of observations); or, in the simplest case of a binary state space , the experiment that has and is different from the one with and , where and are two fixed probability distributions on . So perhaps it is better to think about the experiment as a function from into , the space of all probability distributions on . When we impose a measurable structure on the state space as well and then require this function to be sufficiently well-behaved, so that the mapping is nice (read: measurable) for any , then we have a stochastic kernel.
Larry’s point about p-values is as follows: a binary hypothesis testing problem is a binary experiment , where and can be thought of as two fixed distributions on some observation space that “explain” the observed outcomes given each hypothesis. If we compute a test statistic and let be the realized value of , then the p-value (for a two-sided test) is
It’s not a conditional probability of anything, but rather the probability a certain event would have if the state were . In order to have conditioning, all relevant quantities must be instantiated as random variables. Let’s consider an example any information theorist should relate to: a binary symmetric channel.
The ingredients are: the input space , the output space , and the experiment
where is the channel’s crossover probability. We often write the channel transition probabilities suggestively as etc., but that is, strictly speaking, incorrect. Until we specify something about the input to the channel, all we have is the experiment , together with a list of possible statements like
if the input is , then the probability of observing at the output is
and the like. If we now say that the input is a random variable taking values in according to a given distribution , then we may make conditional probability statements and compute conditional probability distributions and . Of course, in this case it so happens that the conditional probability distribution is already given in terms of the original experiment. But, properly speaking, this conditional object does not exist until we fix . Even more dramatically, the posterior does not exist at all until we specify , interconnect the source of to the kernel corresponding to the experiment , and start doing what Bayesians would call inverse inference. This distinction between kernel specifications and conditional probability distributions may seem purely notational, but it matters a great deal as soon as feedback enters into the picture.
The clearest statement on the implications of this distinction has been made by Hans Witsenhausen in his influential paper on separation between estimation and control:
When the control laws have been selected and instrumented, and only then, the control variables (and the state and output variables, and the cost) become random variables, that is, become functionally related to the given random variables (noise, initial state) and therefore become functionally related to the underlying probability space. But to the designer who is still seeking for good control laws and has not made a selection yet, the realizations of control are not even random variables. They are just “random variables to be” of yet uncertain status.
I also recommend a recent paper by Jan Willems on what he terms “open stochastic systems.” It is best to illustrate the main idea of this paper through another example any information theorist should be familiar with: additive white Gaussian noise (AWGN) channel. We are all used to writing it down as
where is the input, is the additive noise independent of , and is the output. Of course, this expression tacitly assumes that we have already specified a distribution of the input . One may fix things by writing
with the understanding that is some input. But even this is not quite right in view of the above quote from Witsenhausen: is just a placeholder, something waiting to be assigned. Properly speaking, the only “legitimate” random variable here is the Gaussian noise . So Willems suggests thinking instead of the set of all pairs such that
If is now realized as a random input from some distribution , we get back our usual model from (1). However, this new viewpoint is a lot more interesting because it has room for things like causality. Consider, for example, a more complicated arrangement, in which the input taking values in some space is first “modulated” by some nonlinear transformation , and then the noise is added to the output of this nonlinear transformation. Then we can represent the overall open system as the set of all pairs , such that
Thus, to each specific value of we can associate a random variable
But if we specify , then things get a lot more interesting: if is not invertible, then the most we can say about is that
or that is somewhere in the preimage, under , of a Gaussian random variable with mean and variance . Unless we are in an exceptional situation (e.g., if is one-to-one), it is reasonable to say that (3) expresses a “more complex” statement compared to (2). From this viewpoint, it is more reasonable to say that is the input that causes , and not the other way around. (This way of thinking about causality has, in fact, already found its way into machine learning literature.) The paper of Willems contains a lot more insightful examples and thought-provoking discussion.
So, to summarize: the misunderstanding about conditioning that Larry Wasserman has sought to clear up persists not only in statistics, but also in all other fields involving stochastic systems, such as communications, control, machine learning, etc., and we should always be careful not to fall into this trap.
]]>
Consider three jointly distributed random variables . If the conditional mutual information , then and are conditionally independent given (i.e., is a Markov chain). In fact, if and only if . Now, what if is nonzero, but very small? Intuitively, one would expect that the joint distribution is somehow “close” to being Markov. It turns out that this intuition is, indeed, correct — in fact, the conditional mutual information is the divergence between and the nearest distribution under which is a Markov chain.
Theorem 1 Let be three jointly distributed random variables. Then
Moreover, the minimum is achieved by
Proof: To keep things simple, I will assume that all random variables take values in finite sets. Consider any distribution for which . Then
The first two terms on the right-hand side are equal, respectively, to and . For the third term, we can write
Putting everything together, we have
This concludes the proof.
Here is one interesting consequence. Consider three random variables , and , where . Let
be the minimum mean-square error (MMSE) achievable by any (measurable) estimator of based on , and do the same for . Then , i.e., using more information for estimation never hurts. However, how much do we gain by using both and ? The following inequality, due to Yihong Wu and Sergio Verdú, gives an upper bound on the excess MMSE in terms of conditional mutual information:
where the mutual information is measured in nats. This inequality implies (a generalization of) Tao’s inequality,
— indeed, the square root of the left-hand side of (2) is greater than or equal to the left-hand side of (3):
where the third line is due to the orthogonality principle, whereby
and the last line uses Jensen’s inequality. Combining (2) and (3) with Theorem 1, we see that the divergence between and any other distribution under which is a Markov chain can be lower-bounded as
Therefore, if the addition of improves the MMSE significantly, then the joint distribution is far from the set of all distributions under which is a Markov chain.
]]>
Welcome to the Princeton-Stanford Information Theory b-log! All researchers working on information theory are invited to participate by posting items to the blog. Both original material and pointers to the web are welcome.
Enjoy!