# The Information Structuralist

## A unified converse for noiseless and noisy channel codes

Posted in Information Theory by mraginsky on June 7, 2011

While reading the book Search Problems by Rudolf Ahlswede and Ingo Wegener (sadly, both deceased), I have come across a neat result by Ahlswede and Péter Gács, which can be thought of as a unified information-theoretic converse for noiseless and noisy channel codes. In particular, it contains as special cases the well-known Kraft’s inequality for prefix codes and the strong converse for discrete memoryless channels (DMC’s). In fact, Ahlswede and Gács establish achievability as well using a Feinstein-type maximal code construction, but I will focus on the converse.

Now, as shown recently by Yury Polyanskiy in his doctoral dissertation (see also the paper he wrote with Vince Poor and Sergio Verdú), most existing converse results for channel codes are consequences of a deceptively simple “meta-converse” involving the performance of binary hypothesis tests in the Neyman–Pearson framework. In this post, I will show that this is also the case for the Ahlswede–Gács converse.

Definitions

Let ${{\mathsf X}}$ and ${{\mathsf Y}}$ be two finite alphabets.

Definition 1 Let ${\ell^M = (\ell_1,\ldots,\ell_M)}$ be an ${M}$-tuple of positive integers. An ${\ell^M}$-prefix code is a collection ${{\mathcal C} = \{ (u_j, D_j)\}^M_{j=1}}$, where:

1. For each ${j}$, ${u_j \in {\mathsf X}^{\ell_j}}$ and ${D_j \subset {\mathsf Y}^{\ell_j}}$.
2. No element of the set ${D = \bigcup^M_{j=1}D_j}$ is a prefix of any other element.

Two special cases of importance:

1. If ${{\mathsf X} = {\mathsf Y}}$ and, for every ${j}$, ${D_j = \{u_j\}}$, then we get the standard definition of a prefix code.
2. If ${\ell_1 = \ldots = \ell_M = n}$, then the sets ${\{D_j\}}$ must be disjoint (by the prefix property), and we get a length-${n}$ block code with ${M}$ codewords ${u_1,\ldots,u_M \in {\mathsf X}^n}$ and the corresponding decoding sets ${D_1,\ldots,D_M \subset {\mathsf Y}^n}$. In case ${{\mathsf Y}^n \backslash D}$ is nonempty, we can think of it as a rejection region.

Now consider a DMC with transition matrix ${W = \{W(y|x) : (x,y) \in {\mathsf X} \times {\mathsf Y}\}}$. We will abuse notation and write

$\displaystyle W^n(y^n|x^n) = \prod^n_{i=1} W(y_i|x_i)$

for each ${n \in {\mathbb N}}$ and all ${n}$-tuples ${x^n \in {\mathsf X}^n}$, ${y^n \in {\mathsf Y}^n}$.

Definition 2 Given ${\varepsilon \in (0,1)}$, we say that ${{\mathcal C}}$ is an ${(\ell^M,\varepsilon)}$-prefix code for ${W}$ if it is an ${\ell^M}$-prefix code and

$\displaystyle W(D_j|u_j) \ge 1 -\varepsilon, \qquad \forall j = 1,\ldots,M.$

For noiseless codes, we can just take ${\varepsilon = 0}$ and consider the one-to-one mapping ${j \mapsto u_j}$, ${j = 1,\ldots,M}$, where ${\ell_j}$ is the length of the ${j}$th codeword; for channel block codes with ${\ell_1 = \ldots = \ell_M = n}$, we recover the usual maximal probability of error criterion. It is also possible to adapt this framework to channel coding with noiseless feedback.

The Ahlswede–Gács converse and some consequences

Now, the theorem:

Theorem 3 If ${{\mathcal C}}$ is an ${(\ell^M,\varepsilon)}$-prefix code for the DMC ${W}$, then for any ${\gamma \in (0,1)}$ we have

$\displaystyle (1-\varepsilon)(1-\gamma) \cdot \sum^M_{j=1} \exp\left( - C\ell_j - \sqrt{\frac{V\ell_j}{(1-\varepsilon)\gamma}}\right) \le 1, \ \ \ \ \ (1)$

where ${C}$ is the capacity of ${W}$ and ${V = V(W)}$ is nonnegative and equal to zero when ${W}$ is noiseless.

As claimed, Theorem 3 is a unified converse for noiseless and noisy channel codes. Indeed, if ${{\mathsf X} ={\mathsf Y} = \{0,1\}}$, ${W}$ is noiseless, and each ${D_j = \{u_j\}}$, then ${\varepsilon = V = 0}$, ${C = \log 2}$, and

$\displaystyle (1-\gamma)\sum^M_{j=1} 2^{-\ell_j} \le 1, \qquad \forall \gamma \in (0,1).$

In the limit ${\gamma \rightarrow 0}$, we obtain the usual Kraft’s inequality. On the other hand, if ${\ell_1 = \ldots = \ell_M = n}$, then Eq. (1) says that any block code of length ${n}$ with ${M}$ codewords and maximal probability of error ${\varepsilon}$ must satisfy

$\displaystyle M \le \frac{\exp\left( Cn + \sqrt{\frac{Vn}{(1-\varepsilon)\gamma}}\right)}{(1-\varepsilon)(1-\gamma)}.$

Taking logs of both sides and dividing by the block length, we get

$\displaystyle \frac{\log M}{n} \le C + \sqrt{\frac{V}{(1-\varepsilon)\gamma n}} - \frac{1}{n}\log \left[(1-\varepsilon)(1-\gamma) \right], \qquad \forall \gamma \in (0,1)$

which is, essentially, the strong converse of Wolfowitz.

The proof using Neyman–Pearson functions

Consider two probability distributions ${P}$ and ${Q}$ on a finite alphabet ${{\mathsf A}}$. A (randomized) hypothesis test is a conditional probability distribution ${P_{Z|A}}$, where ${Z \in \{0,1\}}$, and we say that the test decides in favor of ${Q}$ if ${Z = 0}$. For each ${\alpha \in (0,1)}$, define the Neyman–Pearson function

$\displaystyle \beta_\alpha(P,Q) = \min \left\{ \sum_{a \in {\mathsf A}}Q(a)P_{Z|A}(1|a) : \sum_{a \in {\mathsf A}}P(a)P_{Z|A}(1|a) \ge \alpha\right\}, \ \ \ \ \ (2)$

i.e., the minimum probability of erroneously deciding in favor of ${P}$ over all tests that correctly choose ${P}$ with probability at least ${\alpha}$. An important fact that we will need later is the following: for any ${\lambda > 0}$,

$\displaystyle \beta_\alpha(P,Q) \ge \frac{1}{\lambda}\left[ \alpha - P \left(\left\{ a \in {\mathsf A}: \frac{P(a)}{Q(a)} \ge \lambda \right\}\right)\right]. \ \ \ \ \ (3)$

The next lemma is very much in the spirit of Yury’s “meta-converse:”

Lemma 4 Let ${W}$ and ${W'}$ be two transition matrices with input alphabet ${{\mathsf X}}$ and output alphabet ${{\mathsf Y}}$. Fix an ${\varepsilon \in (0,1)}$. Then for any ${n \in {\mathbb N}}$, ${x^n \in {\mathsf X}^n}$, and ${B \subseteq {\mathsf Y}^n}$ such that ${W(B|x^n) \ge 1-\varepsilon}$,

$\displaystyle W'(B|x^n) \ge \beta_{1-\varepsilon}\left( W(\cdot|x^n), W'(\cdot|x^n) \right). \ \ \ \ \ (4)$

Proof: Consider a deterministic test between ${P = W(\cdot|x^n)}$ and ${Q = W'(\cdot|x^n)}$ that decides in favor of ${P}$ only when ${Y^n \in B}$. Then (4) is immediate from the definition (2). $\Box$

Now consider ${W}$ and, for any probability distribution ${P}$ over ${{\mathsf X}}$, define the mutual information

$\displaystyle I(P,W) = \sum_{x \in {\mathsf X}}\sum_{y \in {\mathsf Y}} P(x)W(y|x)\log \frac{W(y|x)}{\sum_{x' \in {\mathsf X}}P(x')W(y|x')}.$

Then the capacity is given by ${C = \max_{P}I(P,W)}$. Let ${\bar{P}}$ be the (unique) capacity-achieving input distribution and let ${\bar{Q}}$ be the corresponding output distribution, i.e.,

$\displaystyle \bar{Q}(y) = \sum_{x \in {\mathsf X}}W(y|x)P^*(x).$

We will need the following facts:

1. For any ${x \in {\mathsf X}}$,

$\displaystyle D(W(\cdot|x)\|\bar{Q}) \equiv \sum_{y \in {\mathsf Y}}W(y|x)\log \frac{W(y|x)}{\bar{Q}(y)}\le C. \ \ \ \ \ (5)$

2. For any ${n \in {\mathbb N}}$ and any ${x^n \in {\mathsf X}^n}$,

$\displaystyle \mathop{\mathbb E}_{W(\cdot|x^n)} \left[ \log \frac{W(Y^n|x^n)}{\bar{Q}^n(Y^n)}\right] \le Cn \ \ \ \ \ (6)$

and

$\displaystyle \text{Var}_{W(\cdot|x^n)} \left[ \log \frac{W(Y^n|x^n)}{\bar{Q}^n(Y^n)}\right] \le Vn, \ \ \ \ \ (7)$

where

$\displaystyle V = \max_{x \in {\mathsf X}} \text{Var}_{W(\cdot|x)}\left[ \log \frac{W(Y|x)}{\bar{Q}^n(Y^n)}\right].$

Note that ${V = 0}$ if ${W}$ is noiseless.

Here, (5) follows from convexity properties of the mutual information, while (6) and (7) follow from the fact that, for a fixed ${x^n \in {\mathsf X}^n}$, ${\log \frac{W(Y^n|x^n)}{\bar{Q}^n(Y^n)}}$ is a sum of ${n}$ independent random variables ${\log \frac{W(Y_i|x_i)}{\bar{Q}(Y_i)}}$, ${1 \le i \le n}$, whose expectations (resp., variances) are bounded by ${C}$ (resp., ${V}$).

Next, consider the DMC with transition probabilities ${W'(y|x) = \bar{Q}(y)}$ for all ${(x,y) \in {\mathsf X} \times {\mathsf Y}}$. In other words, ${W'}$ ignores the input and just generates a random output letter drawn from ${\bar{Q}}$. If ${{\mathcal C} = \{(u_j,D_j)\}^M_{j=1}}$ is an ${(\ell^M,\varepsilon)}$-prefix code for ${W}$, then we have

$\displaystyle W(D_j|u_j) \ge 1 -\varepsilon, \qquad \forall j = 1,\ldots,M.$

Hence, we can apply Lemma 4 and then Eq. (3) separately for each ${j}$ to obtain

$\displaystyle \begin{array}{rcl} \bar{Q}^{\ell_j}(D_j) &\ge& \beta_{1-\varepsilon}\left( W(\cdot|u_j), \bar{Q}^{\ell_j}\right) \\ &\ge& \frac{1}{\lambda_j}\left[ 1-\varepsilon - W\left( \left\{ y^{\ell_j} \in {\mathsf Y}^{\ell_j} : \frac{W(y^{\ell_j}|u_j)}{\bar{Q}^{\ell_j}(y^{\ell_j})} \ge \lambda_j \right\}\Big| u_j\right)\right] \end{array}$

for an arbitrary choice of ${\lambda_j > 0}$. In particular, if we fix some ${\gamma \in (0,1)}$ and choose

$\displaystyle \lambda_j = \exp\left(C \ell_j + \sqrt{\frac{V\ell_j}{(1-\varepsilon)\gamma}}\right), \qquad j = 1,\ldots,M$

then from Chebyshev’s inequality together with Eqs. (6) and (7) we have

$\displaystyle W\left( \left\{ y^{\ell_j} \in {\mathsf Y}^{\ell_j} : \frac{W(y^{\ell_j}|u_j)}{\bar{Q}^{\ell_j}(y^{\ell_j})} \ge \lambda_j \right\}\Big| u_j\right) \le (1-\varepsilon)\gamma.$

Using this in our bound and summing over all ${j}$, we get

$\displaystyle \sum^M_{j=1}\bar{Q}^{\ell_j}(D_j) \ge (1-\varepsilon)(1-\gamma)\sum^M_{j=1}\exp \left( - C\ell_j - \sqrt{\frac{V\ell_j}{(1-\varepsilon)\gamma}}\right).$

It remains to show that the sum on the left-hand side is bounded from above by ${1}$. This is where the prefix property comes in. Let ${\ell^* = \max_{1 \le j \le M}\ell_j}$ and define the sets

$\displaystyle D^*_j = D_j \times {\mathsf Y}^{\ell^* - \ell_j}, \qquad j = 1,\ldots,M.$

Owing to the prefix property, these sets must be disjoint. Indeed, if there exist some ${j}$ and ${j'}$ such that ${D^*_j \cap D^*_{j'}}$ is nonempty, then (assuming, say, ${\ell_j \ge \ell_{j'}}$) there must be an ${\ell_j}$-tuple in ${D_j}$ which is a prefix of some ${\ell_{j'}}$-tuple in ${D_{j'}}$, a contradiction. Moreover, for every ${j}$

$\displaystyle \bar{Q}^{\ell^*}(D^*_j) = \bar{Q}^{\ell_j}(D_j),$

so that

$\displaystyle \sum^M_{j=1} \bar{Q}^{\ell_j}(D_{j}) = \sum^M_{j=1}\bar{Q}^{\ell^*}(D^*_j) \le 1.$

Thus, we obtain (1).

Remark 1 The change-of-measure trick with the optimal output distribution ${\bar{Q}}$ is also used by Ahlswede and Gács, who in turn give credit to Kemperman. As recently discussed by Anand, this same trick was also used by Ahlswede in his proof of the strong converse for the MAC. In fact, now that I think of it, the Packing Lemma that appears in that work can also be proved using the Neyman–Pearson machinery.