Lecture Notes in Information Theory
Part I
by
Fady Alajaji and Po-Ning Chen
Department of Mathematics & Statistics,
Queen’s University, Kingston, ON K7L 3N6, Canada
Email: fady@mast.queensu.ca
Department of Electrical Engineering
Institute of Communication Engineering
National Chiao Tung University
1001, Ta Hsueh Road
Hsin Chu, Taiwan 30056
Republic of China
Email: poning@faculty.nctu.edu.tw
September 23, 2010
© Copyright by
Fady Alajaji and Po-Ning Chen
September 23, 2010
Preface
This is a work in progress. Comments are welcome; please send them to fady@mast.queensu.ca.
Acknowledgements
Many thanks are due to our families for their endless support.
Table of Contents

List of Tables
List of Figures

1 Introduction
  1.1 Overview
  1.2 Communication system model

2 Information Measures for Discrete Systems
  2.1 Entropy, joint entropy and conditional entropy
    2.1.1 Self-information
    2.1.2 Entropy
    2.1.3 Properties of entropy
    2.1.4 Joint entropy and conditional entropy
    2.1.5 Properties of joint entropy and conditional entropy
  2.2 Mutual information
    2.2.1 Properties of mutual information
    2.2.2 Conditional mutual information
  2.3 Properties of entropy and mutual information for multiple random variables
  2.4 Data processing inequality
  2.5 Fano's inequality
  2.6 Divergence and variational distance
  2.7 Convexity/concavity of information measures
  2.8 Fundamentals of hypothesis testing

3 Lossless Data Compression
  3.1 Principles of data compression
  3.2 Block codes for asymptotically lossless compression
    3.2.1 Block codes for discrete memoryless sources
    3.2.2 Block codes for stationary ergodic sources
    3.2.3 Redundancy for lossless block data compression
  3.3 Variable-length codes for lossless data compression
    3.3.1 Non-singular codes and uniquely decodable codes
    3.3.2 Prefix or instantaneous codes
    3.3.3 Examples of binary prefix codes
      A) Huffman codes: optimal variable-length codes
      B) Shannon-Fano-Elias code
    3.3.4 Examples of universal lossless variable-length codes
      A) Adaptive Huffman code
      B) Lempel-Ziv codes

4 Data Transmission and Channel Capacity
  4.1 Principles of data transmission
  4.2 Discrete memoryless channels
  4.3 Block codes for data transmission over DMCs
  4.4 Calculating channel capacity
    4.4.1 Symmetric, weakly-symmetric and quasi-symmetric channels
    4.4.2 Channel capacity Karush-Kuhn-Tucker condition

5 Differential Entropy and Gaussian Channels
  5.1 Differential entropy
  5.2 Joint and conditional differential entropies, divergence and mutual information
  5.3 AEP for continuous memoryless sources
  5.4 Capacity of the discrete-time memoryless Gaussian channel

A Overview on Suprema and Limits
  A.1 Supremum and maximum
  A.2 Infimum and minimum
  A.3 Boundedness and suprema operations
  A.4 Sequences and their limits
  A.5 Equivalence

B Overview in Probability and Random Processes
  B.1 Probability space
  B.2 Random variable and random process
  B.3 Central limit theorem
  B.4 Convexity, concavity and Jensen's inequality
List of Tables

3.1 An example of the δ-typical set with n = 2 and δ = 0.4, where F₂(0.4) = {AB, AC, BA, BB, BC, CA, CB}. The codeword set is {001(AB), 010(AC), 011(BA), 100(BB), 101(BC), 110(CA), 111(CB), 000(AA, AD, BD, CC, CD, DA, DB, DC, DD)}, where the parenthesis following each binary codeword indicates those sourcewords that are encoded to this codeword. The source distribution is P_X(A) = 0.4, P_X(B) = 0.3, P_X(C) = 0.2 and P_X(D) = 0.1.

5.1 Quantized random variable q_n(X) under an n-bit accuracy: H(q_n(X)) and H(q_n(X)) − n versus n.
List of Figures

1.1 Block diagram of a general communication system.
2.1 Binary entropy function h_b(p).
2.2 Relation between entropy and mutual information.
2.3 Communication context of the data processing lemma.
2.4 Permissible (P_e, H(X|Y)) region due to Fano's inequality.
3.1 Block diagram of a data compression system.
3.2 Possible codebook C_n and its corresponding S_n. The solid box indicates the decoding mapping from C_n back to S_n.
3.3 (Ultimate) Compression rate R versus source entropy H_D(X) and behavior of the probability of block decoding error as block length n goes to infinity for a discrete memoryless source.
3.4 Classification of variable-length codes.
3.5 Tree structure of a binary prefix code. The codewords are those residing on the leaves, which in this case are 00, 01, 10, 110, 1110 and 1111.
3.6 Example of the Huffman encoding.
3.7 Example of the sibling property based on the code tree from P^(16)_X̂. The arguments inside the parenthesis following a_j respectively indicate the codeword and the probability associated with a_j. b is used to denote the internal nodes of the tree with the assigned (partial) code as its subscript. The number in the parenthesis following b is the probability sum of all its children.
3.8 (Continuation of Figure 3.7) Example of violation of the sibling property after observing a new symbol a_3 at n = 17. Note that node a_1 is not adjacent to its sibling a_2.
3.9 (Continuation of Figure 3.8) Updated Huffman code. The sibling property holds now for the new code.
4.1 A data transmission system, where W represents the message for transmission, X^n denotes the codeword corresponding to message W, Y^n represents the received word due to channel input X^n, and Ŵ denotes the reconstructed message from Y^n.
4.2 Binary symmetric channel.
4.3 Binary erasure channel.
4.4 Binary symmetric erasure channel.
4.5 Ultimate channel coding rate R versus channel capacity C and behavior of the probability of error as blocklength n goes to infinity for a discrete memoryless channel.
A.1 Illustration of Lemma A.17.
B.1 The support line y = ax + b of the convex function f(x).
Chapter 1
Introduction
1.1 Overview
Since its inception, the main role of Information Theory has been to provide the
engineering and scientific communities with a mathematical framework for the
theory of communication by establishing the fundamental limits on the perfor-
mance of various communication systems. The birth of Information Theory was
initiated with the publication of the groundbreaking works [38, 40] of Claude El-
wood Shannon (1916-2001) who asserted that it is possible to send information-
bearing signals at a fixed positive rate through a noisy communication channel
with an arbitrarily small probability of error as long as the transmission rate
is below a certain fixed quantity that depends on the channel statistical char-
acteristics; he “baptized” this quantity with the name of channel capacity. He
further proclaimed that random (stochastic) sources, representing data, speech
or image signals, can be compressed distortion-free at a minimal rate given by
the source’s intrinsic amount of information, which he called source entropy and
defined in terms of the source statistics. He went on to prove that if a source has
an entropy that is less than the capacity of a communication channel, then the
source can be reliably transmitted (with asymptotically vanishing probability of
error) over the channel. He further generalized these “coding theorems” from
the lossless (distortionless) to the lossy context where the source can be com-
pressed and reproduced (possibly after channel transmission) within a tolerable
distortion threshold [39].
Inspired and guided by the pioneering ideas of Shannon,1information theo-
rists gradually expanded their interests beyond communication theory, and in-
vestigated fundamental questions in several other related fields. Among them
we cite:
1See [42] for accessing most of Shannon’s works, including his yet untapped doctoral dis-
sertation on an algebraic framework for population genetics.
• statistical physics (thermodynamics, quantum information theory);
• computer science (algorithmic complexity, resolvability);
• probability theory (large deviations, limit theorems);
• statistics (hypothesis testing, multi-user detection, Fisher information, estimation);
• economics (gambling theory, investment theory);
• biology (biological information theory);
• cryptography (data security, watermarking);
• data networks (self-similarity, traffic regulation theory).
In this textbook, we focus our attention on the study of the basic theory of
communication for single-user (point-to-point) systems for which Information
Theory was originally conceived.
1.2 Communication system model
A simple block diagram of a general communication system is depicted in Fig. 1.1.
[Figure 1.1: Block diagram of a general communication system: source → source encoder → channel encoder → modulator → physical channel → demodulator → channel decoder → source decoder → destination. The modulator, physical channel and demodulator together form a discrete channel; this discrete-channel view of the system is the focus of this text.]
Let us briefly describe the role of each block in the figure.
• Source: The source, which usually represents data or multimedia signals, is modelled as a random process (the necessary background regarding random processes is introduced in Appendix B). It can be discrete (finite or countable alphabet) or continuous (uncountable alphabet) in value and in time.

• Source Encoder: Its role is to represent the source in a compact fashion by removing its unnecessary or redundant content (i.e., by compressing it).

• Channel Encoder: Its role is to enable the reliable reproduction of the source encoder output after its transmission through a noisy communication channel. This is achieved by adding redundancy (usually via an algebraic structure) to the source encoder output.

• Modulator: It transforms the channel encoder output into a waveform suitable for transmission over the physical channel. This is typically accomplished by varying the parameters of a sinusoidal signal in proportion with the data provided by the channel encoder output.

• Physical Channel: It consists of the noisy (or unreliable) medium that the transmitted waveform traverses. It is usually modelled via a sequence of conditional (or transition) probability distributions of receiving an output given that a specific input was sent.

• Receiver Part: It consists of the demodulator, the channel decoder and the source decoder, where the reverse operations are performed. The destination represents the sink where the source estimate provided by the source decoder is reproduced.
In this text, we will model the concatenation of the modulator, physical
channel and demodulator via a discrete-time2channel with a given sequence of
conditional probability distributions. Given a source and a discrete channel, our
objectives will include determining the fundamental limits of how well we can
construct a (source/channel) coding scheme so that:
• the smallest number of source encoder symbols can represent each source symbol distortion-free or within a prescribed distortion level D, where D > 0 and the channel is noiseless;

• the largest rate of information can be transmitted over a noisy channel between the channel encoder input and the channel decoder output with an arbitrarily small probability of decoding error;

• we can guarantee that the source is transmitted over a noisy channel and reproduced at the destination within distortion D, where D > 0.

²Except for a brief interlude with the continuous-time (waveform) Gaussian channel in Chapter 5, we will consider discrete-time communication systems throughout the text.
Chapter 2
Information Measures for Discrete
Systems
In this chapter, we define information measures for discrete-time discrete-alphabet1
systems from a probabilistic standpoint and develop their properties. Elucidat-
ing the operational significance of probabilistically defined information measures
vis-a-vis the fundamental limits of coding constitutes a main objective of this
book; this will be seen in the subsequent chapters.
2.1 Entropy, joint entropy and conditional entropy
2.1.1 Self-information
Let E be an event belonging to a given event space and having probability Pr(E) ≜ p_E, where 0 ≤ p_E ≤ 1. Let I(E) (called the self-information of E) represent the amount of information one gains when learning that E has occurred (or equivalently, the amount of uncertainty one had about E prior to learning that it has happened). A natural question to ask is "what properties should I(E) have?" Although the answer to this question may vary from person to person, here are some common properties that I(E) is reasonably expected to have.

1. I(E) should be a decreasing function of p_E.
In other words, this property first states that I(E) = I(p_E), where I(·) is a real-valued function defined over [0, 1]. Furthermore, one would expect that the less likely event E is, the more information is gained when one learns it has occurred. In other words, I(p_E) is a decreasing function of p_E.

2. I(p_E) should be continuous in p_E.
Intuitively, one should expect that a small change in p_E corresponds to a small change in the amount of information carried by E.

3. If E_1 and E_2 are independent events, then I(E_1 ∩ E_2) = I(E_1) + I(E_2), or equivalently, I(p_{E_1} × p_{E_2}) = I(p_{E_1}) + I(p_{E_2}).
This property declares that when events E_1 and E_2 are independent from each other (i.e., when they do not affect each other probabilistically), the amount of information one gains by learning that both events have jointly occurred should be equal to the sum of the amounts of information of each individual event.

¹By discrete alphabets, one usually means finite or countably infinite alphabets. We however mostly focus on finite alphabet systems, although the presented information measures allow for countable alphabets (when they exist).
Next, we show that the only function that satisfies properties 1-3 above is
the logarithmic function.
Theorem 2.1 The only function defined over p ∈ [0, 1] and satisfying

1. I(p) is monotonically decreasing in p;
2. I(p) is a continuous function of p for 0 ≤ p ≤ 1;
3. I(p_1 × p_2) = I(p_1) + I(p_2);

is I(p) = −c·log_b(p), where c is a positive constant and the base b of the logarithm is any number larger than one.
Proof:

Step 1: Claim. For n = 1, 2, 3, ...,
\[ I\!\left(\frac{1}{n}\right) = c\cdot\log_b n, \]
where c > 0 is a constant.

Proof: First note that for n = 1, condition 3 directly shows the claim, since it yields that I(1) = I(1) + I(1). Thus I(1) = 0 = c·log_b(1).

Now let n be a fixed positive integer greater than 1. Conditions 1 and 3 respectively imply
\[ n < m \;\Longrightarrow\; I\!\left(\frac{1}{n}\right) < I\!\left(\frac{1}{m}\right) \tag{2.1.1} \]
and
\[ I\!\left(\frac{1}{mn}\right) = I\!\left(\frac{1}{m}\right) + I\!\left(\frac{1}{n}\right) \tag{2.1.2} \]
where n, m = 1, 2, 3, .... Now using (2.1.2), we can show by induction (on k) that
\[ I\!\left(\frac{1}{n^k}\right) = k\cdot I\!\left(\frac{1}{n}\right) \tag{2.1.3} \]
for all non-negative integers k.

Now for any positive integer r, there exists a non-negative integer k such that
\[ n^k \le 2^r < n^{k+1}. \]
By (2.1.1), we obtain
\[ I\!\left(\frac{1}{n^k}\right) \le I\!\left(\frac{1}{2^r}\right) < I\!\left(\frac{1}{n^{k+1}}\right), \]
which together with (2.1.3), yields
\[ k\cdot I\!\left(\frac{1}{n}\right) \le r\cdot I\!\left(\frac{1}{2}\right) < (k+1)\cdot I\!\left(\frac{1}{n}\right). \]
Hence, since I(1/n) > I(1) = 0,
\[ \frac{k}{r} \le \frac{I(1/2)}{I(1/n)} \le \frac{k+1}{r}. \]
On the other hand, by the monotonicity of the logarithm, we obtain
\[ \log_b n^k \le \log_b 2^r \le \log_b n^{k+1} \;\Longrightarrow\; \frac{k}{r} \le \frac{\log_b(2)}{\log_b(n)} \le \frac{k+1}{r}. \]
Therefore,
\[ \left|\frac{\log_b(2)}{\log_b(n)} - \frac{I(1/2)}{I(1/n)}\right| < \frac{1}{r}. \]
Since n is fixed, and r can be made arbitrarily large, we can let r → ∞ to get:
\[ I\!\left(\frac{1}{n}\right) = c\cdot\log_b(n), \]
where c = I(1/2)/log_b(2) > 0. This completes the proof of the claim.

Step 2: Claim. I(p) = −c·log_b(p) for any positive rational number p, where c > 0 is a constant.

Proof: A positive rational number p can be represented by a ratio of two integers, i.e., p = r/s, where r and s are both positive integers. Then condition 3 yields that
\[ I\!\left(\frac{1}{s}\right) = I\!\left(\frac{r}{s}\cdot\frac{1}{r}\right) = I\!\left(\frac{r}{s}\right) + I\!\left(\frac{1}{r}\right), \]
which, from Step 1, implies that
\[ I(p) = I\!\left(\frac{r}{s}\right) = I\!\left(\frac{1}{s}\right) - I\!\left(\frac{1}{r}\right) = c\cdot\log_b s - c\cdot\log_b r = -c\cdot\log_b p. \]

Step 3: For any p ∈ [0, 1], it follows by continuity and the density of the rationals in the reals that
\[ I(p) = \lim_{a\uparrow p,\ a\ \text{rational}} I(a) = \lim_{b\downarrow p,\ b\ \text{rational}} I(b) = -c\cdot\log_b(p). \qquad\Box \]
The constant c above is by convention normalized to c = 1. Furthermore, the base b of the logarithm determines the type of units used in measuring information. When b = 2, the amount of information is expressed in bits (i.e., binary digits). When b = e, i.e., when the natural logarithm (ln) is used, information is measured in nats (i.e., natural units or digits). For example, if the event E concerns a Heads outcome from the toss of a fair coin, then its self-information is I(E) = −log_2(1/2) = 1 bit, or −ln(1/2) = 0.693 nats.

More generally, under base b > 1, information is in b-ary units or digits. For the sake of simplicity, we will throughout use the base-2 logarithm unless otherwise specified. Note that one can easily convert information units from bits to b-ary units by dividing the former by log_2(b).
2.1.2 Entropy
Let X be a discrete random variable taking values in a finite alphabet 𝒳 under a probability distribution or probability mass function (pmf) P_X(x) ≜ P[X = x] for all x ∈ 𝒳. Note that X generically represents a memoryless source, which is a random process {X_n}_{n=1}^∞ with independent and identically distributed (i.i.d.) random variables (cf. Appendix B).
Definition 2.2 (Entropy) The entropy of a discrete random variable X with pmf P_X(·) is denoted by H(X) or H(P_X) and defined by
\[ H(X) \triangleq -\sum_{x\in\mathcal{X}} P_X(x)\cdot\log_2 P_X(x) \quad \text{(bits)}. \]
Thus H(X) represents the statistical average (mean) amount of information one gains when learning that one of its |𝒳| outcomes has occurred, where |𝒳| denotes the size of alphabet 𝒳. Indeed, we directly note from the definition that
\[ H(X) = E[-\log_2 P_X(X)] = E[I(X)], \]
where I(x) ≜ −log_2 P_X(x) is the self-information of the elementary event [X = x].
When computing the entropy, we adopt the convention
\[ 0\cdot\log_2 0 = 0, \]
which can be justified by a continuity argument since x log_2 x → 0 as x → 0. Also note that H(X) only depends on the probability distribution of X and is not affected by the symbols that represent the outcomes. For example, when tossing a fair coin, we can denote Heads by 2 (instead of 1) and Tails by 100 (instead of 0), and the entropy of the random variable representing the outcome would remain equal to log_2(2) = 1 bit.
Example 2.3 Let X be a binary (valued) random variable with alphabet 𝒳 = {0, 1} and pmf given by P_X(1) = p and P_X(0) = 1 − p, where 0 ≤ p ≤ 1 is fixed. Then H(X) = −p·log_2 p − (1−p)·log_2(1−p). This entropy is conveniently called the binary entropy function and is usually denoted by h_b(p); it is illustrated in Fig. 2.1. As shown in the figure, h_b(p) is maximized for a uniform distribution (i.e., p = 1/2).
The units for H(X) above are in bits, as a base-2 logarithm is used. Setting
\[ H_D(X) \triangleq -\sum_{x\in\mathcal{X}} P_X(x)\cdot\log_D P_X(x) \]
yields the entropy in D-ary units, where D > 1. Note that we abbreviate H_2(X) as H(X) throughout the book since bits are common measure units for a coding system, and hence
\[ H_D(X) = \frac{H(X)}{\log_2 D}. \]
Thus
\[ H_e(X) = \frac{H(X)}{\log_2(e)} = (\ln 2)\cdot H(X) \]
gives the entropy in nats, where e is the base of the natural logarithm.
[Figure 2.1: Binary entropy function h_b(p), plotted for 0 ≤ p ≤ 1; it attains its maximum value 1 at p = 0.5.]
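As a concrete numerical sketch of Definition 2.2 and Example 2.3, the following short Python snippet (not part of the original notes; the helper names are illustrative) computes H(X) for an arbitrary pmf and evaluates the binary entropy function h_b(p):

```python
import math

def entropy(pmf, base=2.0):
    """Entropy of a pmf given as a list of probabilities, with the convention 0*log 0 = 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

def binary_entropy(p):
    """Binary entropy function h_b(p) in bits."""
    return entropy([p, 1.0 - p])

print(binary_entropy(0.5))              # 1.0 bit: maximized at p = 1/2
print(binary_entropy(0.11))             # ~0.4999 bits
print(entropy([0.25] * 4))              # 2.0 bits = log2(4): uniform pmf on 4 symbols
print(entropy([0.5, 0.5], base=math.e)) # ~0.6931 nats = ln 2, illustrating the unit conversion
```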
2.1.3 Properties of entropy
When developing or proving the basic properties of entropy (and other informa-
tion measures), we will often use the following fundamental inequality on the
logarithm (its proof is left as an exercise).
Lemma 2.4 (Fundamental inequality (FI)) For any x > 0 and D > 1, we have that
\[ \log_D x \le \log_D(e)\,(x-1) \]
with equality if and only if (iff) x = 1.

Setting y = 1/x and using FI above directly yields that for any y > 0, we also have that
\[ \log_D y \ge \log_D(e)\left(1-\frac{1}{y}\right), \]
also with equality iff y = 1. In the above the base-D logarithm was used. Specifically, for a logarithm with base 2, the above inequalities become
\[ \log_2(e)\left(1-\frac{1}{x}\right) \le \log_2 x \le \log_2(e)\,(x-1) \]
with equality iff x = 1.
Lemma 2.5 (Non-negativity) H(X) ≥ 0. Equality holds iff X is deterministic (when X is deterministic, the uncertainty of X is obviously zero).

Proof: 0 ≤ P_X(x) ≤ 1 implies that log_2[1/P_X(x)] ≥ 0 for every x ∈ 𝒳. Hence,
\[ H(X) = \sum_{x\in\mathcal{X}} P_X(x)\log_2\frac{1}{P_X(x)} \ge 0, \]
with equality holding iff P_X(x) = 1 for some x ∈ 𝒳. □
Lemma 2.6 (Upper bound on entropy) If a random variable X takes values from a finite set 𝒳, then
\[ H(X) \le \log_2|\mathcal{X}|, \]
where |𝒳| denotes the size of the set 𝒳. Equality holds iff X is equiprobable or uniformly distributed over 𝒳 (i.e., P_X(x) = 1/|𝒳| for all x ∈ 𝒳).

Proof:
\begin{align*}
\log_2|\mathcal{X}| - H(X) &= \log_2|\mathcal{X}|\times\Big[\sum_{x\in\mathcal{X}} P_X(x)\Big] + \sum_{x\in\mathcal{X}} P_X(x)\log_2 P_X(x)\\
&= \sum_{x\in\mathcal{X}} P_X(x)\log_2\big[|\mathcal{X}|\times P_X(x)\big]\\
&\ge \sum_{x\in\mathcal{X}} P_X(x)\cdot\log_2(e)\left(1-\frac{1}{|\mathcal{X}|\times P_X(x)}\right)\\
&= \log_2(e)\sum_{x\in\mathcal{X}}\left(P_X(x)-\frac{1}{|\mathcal{X}|}\right) = \log_2(e)\cdot(1-1) = 0,
\end{align*}
where the inequality follows from the FI Lemma, with equality iff (∀x ∈ 𝒳) |𝒳| × P_X(x) = 1, which means P_X(·) is a uniform distribution on 𝒳. □

Intuitively, H(X) tells us how random X is. Indeed, X is deterministic (not random at all) iff H(X) = 0. If X is uniform (equiprobable), H(X) is maximized, and is equal to log_2|𝒳|.
Lemma 2.7 (Log-sum inequality) For non-negative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,
\[ \sum_{i=1}^{n} a_i\log_D\frac{a_i}{b_i} \;\ge\; \left(\sum_{i=1}^{n} a_i\right)\log_D\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} \tag{2.1.4} \]
with equality holding iff (∀ 1 ≤ i ≤ n) a_i/b_i = a_1/b_1, a constant independent of i. (By convention, 0·log_D(0) = 0, 0·log_D(0/0) = 0 and a·log_D(a/0) = ∞ if a > 0. Again, this can be justified by "continuity.")
Proof: Let a ≜ Σ_{i=1}^n a_i and b ≜ Σ_{i=1}^n b_i. Then
\begin{align*}
\sum_{i=1}^{n} a_i\log_D\frac{a_i}{b_i} - a\log_D\frac{a}{b}
&= a\left[\sum_{i=1}^{n}\frac{a_i}{a}\log_D\frac{a_i}{b_i} - \underbrace{\left(\sum_{i=1}^{n}\frac{a_i}{a}\right)}_{=1}\log_D\frac{a}{b}\right]\\
&= a\sum_{i=1}^{n}\frac{a_i}{a}\log_D\left(\frac{a_i}{b_i}\cdot\frac{b}{a}\right)\\
&\ge a\,\log_D(e)\sum_{i=1}^{n}\frac{a_i}{a}\left(1-\frac{b_i}{a_i}\cdot\frac{a}{b}\right)\\
&= a\,\log_D(e)\left(\sum_{i=1}^{n}\frac{a_i}{a}-\sum_{i=1}^{n}\frac{b_i}{b}\right) = a\,\log_D(e)\,(1-1) = 0,
\end{align*}
where the inequality follows from the FI Lemma, with equality holding iff (a_i/b_i)(b/a) = 1 for all i, i.e., a_i/b_i = a/b for all i.

We also provide another proof using Jensen's inequality (cf. Theorem B.6 in Appendix B). Without loss of generality, assume that a_i > 0 and b_i > 0 for every i. Jensen's inequality states that
\[ \sum_{i=1}^{n}\alpha_i f(t_i) \ge f\!\left(\sum_{i=1}^{n}\alpha_i t_i\right) \]
for any strictly convex function f(·), α_i ≥ 0, and Σ_{i=1}^n α_i = 1; equality holds iff t_i is a constant for all i. Hence by setting α_i = b_i/Σ_{j=1}^n b_j, t_i = a_i/b_i, and f(t) = t·log_D(t), we obtain the desired result. □
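A quick numerical sanity check of (2.1.4) can be done as follows; this is an illustrative Python sketch that is not part of the original notes:

```python
import math
import random

def log_sum_lhs(a, b):
    # Left-hand side of (2.1.4): sum_i a_i log2(a_i / b_i), with 0 log 0 = 0.
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def log_sum_rhs(a, b):
    # Right-hand side of (2.1.4): (sum a_i) log2(sum a_i / sum b_i).
    A, B = sum(a), sum(b)
    return A * math.log2(A / B) if A > 0 else 0.0

rng = random.Random(1)
for _ in range(1000):
    a = [rng.uniform(0, 2) for _ in range(4)]
    b = [rng.uniform(0.01, 2) for _ in range(4)]
    assert log_sum_lhs(a, b) >= log_sum_rhs(a, b) - 1e-9

# Equality holds when a_i / b_i is constant, e.g. a = 3 * b:
b = [0.2, 0.5, 1.0, 0.3]
a = [3 * bi for bi in b]
assert abs(log_sum_lhs(a, b) - log_sum_rhs(a, b)) < 1e-9
```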
2.1.4 Joint entropy and conditional entropy
Given a pair of random variables (X, Y) with a joint pmf P_{X,Y}(·,·) defined on 𝒳 × 𝒴, the self-information of the (two-dimensional) elementary event [X = x, Y = y] is defined by
\[ I(x,y) \triangleq -\log_2 P_{X,Y}(x,y). \]
This leads us to the definition of joint entropy.

Definition 2.8 (Joint entropy) The joint entropy H(X, Y) of random variables (X, Y) is defined by
\[ H(X,Y) \triangleq -\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)\cdot\log_2 P_{X,Y}(x,y) = E\big[-\log_2 P_{X,Y}(X,Y)\big]. \]
The conditional entropy can also be similarly defined as follows.

Definition 2.9 (Conditional entropy) Given two jointly distributed random variables X and Y, the conditional entropy H(Y|X) of Y given X is defined by
\[ H(Y|X) \triangleq \sum_{x\in\mathcal{X}} P_X(x)\left(-\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\cdot\log_2 P_{Y|X}(y|x)\right) \tag{2.1.5} \]
where P_{Y|X}(·|·) is the conditional pmf of Y given X.

Equation (2.1.5) can be written in three different but equivalent forms:
\begin{align*}
H(Y|X) &= -\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)\cdot\log_2 P_{Y|X}(y|x)\\
&= E\big[-\log_2 P_{Y|X}(Y|X)\big]\\
&= \sum_{x\in\mathcal{X}} P_X(x)\cdot H(Y|X=x),
\end{align*}
where H(Y|X = x) ≜ −Σ_{y∈𝒴} P_{Y|X}(y|x) log_2 P_{Y|X}(y|x).
The relationship between joint entropy and conditional entropy is exhibited
by the fact that the entropy of a pair of random variables is the entropy of one
plus the conditional entropy of the other.
Theorem 2.10 (Chain rule for entropy)
\[ H(X,Y) = H(X) + H(Y|X). \tag{2.1.6} \]
Proof: Since
\[ P_{X,Y}(x,y) = P_X(x)\,P_{Y|X}(y|x), \]
we directly obtain that
\begin{align*}
H(X,Y) &= E\big[-\log_2 P_{X,Y}(X,Y)\big]\\
&= E\big[-\log_2 P_X(X)\big] + E\big[-\log_2 P_{Y|X}(Y|X)\big]\\
&= H(X) + H(Y|X). \qquad\Box
\end{align*}
By its definition, joint entropy is commutative; i.e., H(X, Y) = H(Y, X). Hence,
\[ H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) = H(Y,X), \]
which implies that
\[ H(X) - H(X|Y) = H(Y) - H(Y|X). \tag{2.1.7} \]
The above quantity is exactly equal to the mutual information, which will be introduced in the next section.
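The chain rule (2.1.6) and identity (2.1.7) are easy to verify numerically. The following Python sketch (illustrative only, not from the notes; the joint pmf is chosen arbitrarily) computes H(X, Y), H(X) and H(Y|X) from a joint pmf given as a dictionary:

```python
import math
from collections import defaultdict

def H(probs):
    """Entropy in bits of a collection of probabilities (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint pmf P_{X,Y} over {0,1} x {0,1}, for illustration.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

# H(Y|X) via the third form in (2.1.5): sum_x P_X(x) * H(Y | X = x).
H_Y_given_X = sum(px[x] * H([joint[(x, y)] / px[x] for y in py]) for x in px)

H_XY, H_X, H_Y = H(joint.values()), H(px.values()), H(py.values())

# Chain rule (2.1.6): H(X,Y) = H(X) + H(Y|X).
assert abs(H_XY - (H_X + H_Y_given_X)) < 1e-12
# Identity (2.1.7): H(X) - H(X|Y) = H(Y) - H(Y|X).
H_X_given_Y = H_XY - H_Y
assert abs((H_X - H_X_given_Y) - (H_Y - H_Y_given_X)) < 1e-12
print(H_XY, H_X, H_Y_given_X)
```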
The conditional entropy can be thought of in terms of a channel whose input is the random variable X and whose output is the random variable Y. H(X|Y) is then called the equivocation² and corresponds to the uncertainty in the channel input from the receiver's point of view. For example, suppose that the set of possible outcomes of the random vector (X, Y) is {(0,0), (0,1), (1,0), (1,1)}, where none of the elements has zero probability mass. When the receiver Y receives 1, he still cannot determine exactly what the sender X observes (it could be either 1 or 0); therefore, the uncertainty, from the receiver's point of view, depends on the probabilities P_{X|Y}(0|1) and P_{X|Y}(1|1).

Similarly, H(Y|X), which is called prevarication,³ is the uncertainty in the channel output from the transmitter's point of view. In other words, the sender knows exactly what he sends, but is uncertain about what the receiver will finally obtain.

A case that is of specific interest is when H(X|Y) = 0. By its definition, H(X|Y) = 0 if X becomes deterministic after observing Y. In such a case, the uncertainty about X after observing Y is completely zero.

The next corollary can be proved similarly to Theorem 2.10.

Corollary 2.11 (Chain rule for conditional entropy)
\[ H(X,Y|Z) = H(X|Z) + H(Y|X,Z). \]

2.1.5 Properties of joint entropy and conditional entropy

Lemma 2.12 (Conditioning never increases entropy) Side information Y decreases the uncertainty about X:
\[ H(X|Y) \le H(X) \]
with equality holding iff X and Y are independent. In other words, "conditioning" reduces entropy.

²Equivocation is an ambiguous statement one uses deliberately in order to deceive or avoid speaking the truth.
³Prevarication is the deliberate act of deviating from the truth (it is a synonym of "equivocation").
Proof:
\begin{align*}
H(X) - H(X|Y) &= \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)\cdot\log_2\frac{P_{X|Y}(x|y)}{P_X(x)}\\
&= \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)\cdot\log_2\frac{P_{X|Y}(x|y)P_Y(y)}{P_X(x)P_Y(y)}\\
&= \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)\cdot\log_2\frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)}\\
&\ge \left(\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)\right)\log_2\frac{\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)}{\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_X(x)P_Y(y)} = 0,
\end{align*}
where the inequality follows from the log-sum inequality, with equality holding iff
\[ \frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)} = \text{constant} \quad \forall (x,y)\in\mathcal{X}\times\mathcal{Y}. \]
Since probability must sum to 1, the above constant equals 1, which is exactly the case of X being independent of Y. □
Lemma 2.13 Entropy is additive for independent random variables; i.e.,
\[ H(X,Y) = H(X) + H(Y) \quad\text{for independent } X \text{ and } Y. \]
Proof: By the previous lemma, independence of X and Y implies H(Y|X) = H(Y). Hence
\[ H(X,Y) = H(X) + H(Y|X) = H(X) + H(Y). \qquad\Box \]

Since conditioning never increases entropy, it follows that
\[ H(X,Y) = H(X) + H(Y|X) \le H(X) + H(Y). \tag{2.1.8} \]
The above lemma tells us that equality holds in (2.1.8) only when X is independent of Y.

A result similar to (2.1.8) also applies to conditional entropy.
Lemma 2.14 Conditional entropy is lower additive; i.e.,
\[ H(X_1,X_2|Y_1,Y_2) \le H(X_1|Y_1) + H(X_2|Y_2). \]
Equality holds iff
\[ P_{X_1,X_2|Y_1,Y_2}(x_1,x_2|y_1,y_2) = P_{X_1|Y_1}(x_1|y_1)\,P_{X_2|Y_2}(x_2|y_2) \]
for all x_1, x_2, y_1 and y_2.

Proof: Using the chain rule for conditional entropy and the fact that conditioning reduces entropy, we can write
\begin{align}
H(X_1,X_2|Y_1,Y_2) &= H(X_1|Y_1,Y_2) + H(X_2|X_1,Y_1,Y_2) \nonumber\\
&\le H(X_1|Y_1,Y_2) + H(X_2|Y_1,Y_2) \tag{2.1.9}\\
&\le H(X_1|Y_1) + H(X_2|Y_2). \tag{2.1.10}
\end{align}
For (2.1.9), equality holds iff X_1 and X_2 are conditionally independent given (Y_1, Y_2): P_{X_1,X_2|Y_1,Y_2}(x_1,x_2|y_1,y_2) = P_{X_1|Y_1,Y_2}(x_1|y_1,y_2)\,P_{X_2|Y_1,Y_2}(x_2|y_1,y_2). For (2.1.10), equality holds iff X_1 is conditionally independent of Y_2 given Y_1 (i.e., P_{X_1|Y_1,Y_2}(x_1|y_1,y_2) = P_{X_1|Y_1}(x_1|y_1)), and X_2 is conditionally independent of Y_1 given Y_2 (i.e., P_{X_2|Y_1,Y_2}(x_2|y_1,y_2) = P_{X_2|Y_2}(x_2|y_2)). Hence, the desired equality condition of the lemma is obtained. □
2.2 Mutual information
For two random variables X and Y, the mutual information between X and Y is the reduction in the uncertainty of Y due to the knowledge of X (or vice versa). A dual definition of mutual information states that it is the average amount of information that Y has (or contains) about X, or that X has (or contains) about Y.

We can think of the mutual information between X and Y in terms of a channel whose input is X and whose output is Y. Thereby the reduction of the uncertainty is by definition the total uncertainty of X (i.e., H(X)) minus the uncertainty of X after observing Y (i.e., H(X|Y)). Mathematically, it is
\[ \text{mutual information} = I(X;Y) \triangleq H(X) - H(X|Y). \tag{2.2.1} \]
It can be easily verified from (2.1.7) that mutual information is symmetric; i.e., I(X;Y) = I(Y;X).
[Figure 2.2: Relation between entropy and mutual information (Venn-diagram view of H(X), H(Y), H(X|Y), H(Y|X), I(X;Y) and H(X,Y)).]
2.2.1 Properties of mutual information
Lemma 2.15

1. \( I(X;Y) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_{X,Y}(x,y)\log_2\dfrac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)}. \)
2. I(X;Y) = I(Y;X).
3. I(X;Y) = H(X) + H(Y) − H(X,Y).
4. I(X;Y) ≤ H(X), with equality holding iff X is a function of Y (i.e., X = f(Y) for some function f(·)).
5. I(X;Y) ≥ 0, with equality holding iff X and Y are independent.
6. I(X;Y) ≤ min{log_2|𝒳|, log_2|𝒴|}.

Proof: Properties 1, 2, 3, and 4 follow immediately from the definition. Property 5 is a direct consequence of Lemma 2.12. Property 6 holds iff I(X;Y) ≤ log_2|𝒳| and I(X;Y) ≤ log_2|𝒴|. To show the first inequality, we write I(X;Y) = H(X) − H(X|Y), use the fact that H(X|Y) is non-negative and apply Lemma 2.6. A similar proof can be used to show that I(X;Y) ≤ log_2|𝒴|. □
The relationships between H(X), H(Y), H(X, Y ), H(X|Y), H(Y|X) and
I(X;Y) can be illustrated by the Venn diagram in Figure 2.2.
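As a small numerical illustration of Lemma 2.15 (not part of the original notes; the joint pmf below is arbitrary), the following Python sketch computes I(X;Y) directly from a joint pmf and checks properties 3 and 5:

```python
import math
from collections import defaultdict

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} P(x,y) log2[ P(x,y) / (P(x)P(y)) ] for a dict {(x, y): prob}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px, py = {0: 0.5, 1: 0.5}, {0: 0.6, 1: 0.4}

I = mutual_information(joint)
# Property 3: I(X;Y) = H(X) + H(Y) - H(X,Y).
assert abs(I - (entropy(px.values()) + entropy(py.values()) - entropy(joint.values()))) < 1e-12
# Property 5: I(X;Y) = 0 for an independent pair (product distribution).
indep = {(x, y): px[x] * py[y] for x in px for y in py}
assert abs(mutual_information(indep)) < 1e-12
print(I)  # ~0.1245 bits for this joint pmf
```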
2.2.2 Conditional mutual information
The conditional mutual information, denoted by I(X;Y|Z), is defined as the common uncertainty between X and Y under the knowledge of Z. It is mathematically defined by
\[ I(X;Y|Z) \triangleq H(X|Z) - H(X|Y,Z). \tag{2.2.2} \]

Lemma 2.16 (Chain rule for mutual information)
\[ I(X;Y,Z) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z). \]
Proof: Without loss of generality, we only prove the first equality:
\begin{align*}
I(X;Y,Z) &= H(X) - H(X|Y,Z)\\
&= H(X) - H(X|Y) + H(X|Y) - H(X|Y,Z)\\
&= I(X;Y) + I(X;Z|Y). \qquad\Box
\end{align*}
The above lemma can be read as: the information that (Y, Z ) has about X
is equal to the information that Yhas about Xplus the information that Zhas
about Xwhen Yis already known.
2.3 Properties of entropy and mutual information for
multiple random variables
Theorem 2.17 (Chain rule for entropy) Let X_1, X_2, ..., X_n be drawn according to P_{X^n}(x^n) ≜ P_{X_1,...,X_n}(x_1,...,x_n), where we use the common superscript notation to denote an n-tuple: X^n ≜ (X_1, ..., X_n) and x^n ≜ (x_1, ..., x_n). Then
\[ H(X_1,X_2,\ldots,X_n) = \sum_{i=1}^{n} H(X_i|X_{i-1},\ldots,X_1), \]
where H(X_i|X_{i-1},...,X_1) ≜ H(X_1) for i = 1. (The above chain rule can also be written as
\[ H(X^n) = \sum_{i=1}^{n} H(X_i|X^{i-1}), \]
where X^i ≜ (X_1, ..., X_i).)
Proof: From (2.1.6),
\[ H(X_1,X_2,\ldots,X_n) = H(X_1,X_2,\ldots,X_{n-1}) + H(X_n|X_{n-1},\ldots,X_1). \tag{2.3.1} \]
Once again, applying (2.1.6) to the first term of the right-hand side of (2.3.1), we have
\[ H(X_1,X_2,\ldots,X_{n-1}) = H(X_1,X_2,\ldots,X_{n-2}) + H(X_{n-1}|X_{n-2},\ldots,X_1). \]
The desired result can then be obtained by repeatedly applying (2.1.6). □
Theorem 2.18 (Chain rule for conditional entropy)
\[ H(X_1,X_2,\ldots,X_n|Y) = \sum_{i=1}^{n} H(X_i|X_{i-1},\ldots,X_1,Y). \]
Proof: The theorem can be proved similarly to Theorem 2.17. □
Theorem 2.19 (Chain rule for mutual information)
\[ I(X_1,X_2,\ldots,X_n;Y) = \sum_{i=1}^{n} I(X_i;Y|X_{i-1},\ldots,X_1), \]
where I(X_i;Y|X_{i-1},...,X_1) ≜ I(X_1;Y) for i = 1.

Proof: This can be proved by first expressing mutual information in terms of entropy and conditional entropy, and then applying the chain rules for entropy and conditional entropy. □
Theorem 2.20 (Independence bound on entropy)
\[ H(X_1,X_2,\ldots,X_n) \le \sum_{i=1}^{n} H(X_i). \]
Equality holds iff all the X_i's are independent from each other.⁴

⁴This condition is equivalent to X_i being independent of (X_{i-1},...,X_1) for all i. Their equivalence can be easily proved by the chain rule for probabilities, i.e., P_{X^n}(x^n) = ∏_{i=1}^{n} P(x_i|x_1^{i-1}), which is left to the readers as an exercise.
Proof: By applying the chain rule for entropy,
\[ H(X_1,X_2,\ldots,X_n) = \sum_{i=1}^{n} H(X_i|X_{i-1},\ldots,X_1) \le \sum_{i=1}^{n} H(X_i). \]
Equality holds iff each conditional entropy is equal to its associated entropy, that is, iff X_i is independent of (X_{i-1},...,X_1) for all i. □
Theorem 2.21 (Bound on mutual information) If {(X_i, Y_i)}_{i=1}^{n} is a process satisfying the conditional independence assumption P_{Y^n|X^n} = ∏_{i=1}^{n} P_{Y_i|X_i}, then
\[ I(X_1,\ldots,X_n;Y_1,\ldots,Y_n) \le \sum_{i=1}^{n} I(X_i;Y_i) \]
with equality holding iff {X_i}_{i=1}^{n} are independent.

Proof: From the independence bound on entropy, we have
\[ H(Y_1,\ldots,Y_n) \le \sum_{i=1}^{n} H(Y_i). \]
By the conditional independence assumption, we have
\begin{align*}
H(Y_1,\ldots,Y_n|X_1,\ldots,X_n) &= E\big[-\log_2 P_{Y^n|X^n}(Y^n|X^n)\big]\\
&= E\left[-\sum_{i=1}^{n}\log_2 P_{Y_i|X_i}(Y_i|X_i)\right]\\
&= \sum_{i=1}^{n} H(Y_i|X_i).
\end{align*}
Hence
\[ I(X^n;Y^n) = H(Y^n) - H(Y^n|X^n) \le \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i|X_i) = \sum_{i=1}^{n} I(X_i;Y_i) \]
with equality holding iff {Y_i}_{i=1}^{n} are independent, which holds iff {X_i}_{i=1}^{n} are independent. □
[Figure 2.3: Communication context of the data processing lemma: Source → U → Encoder → X → Channel → Y → Decoder → V, with I(U;V) ≤ I(X;Y). "By processing, we can only reduce (mutual) information, but the processed information may be in a more useful form!"]
2.4 Data processing inequality
Lemma 2.22 (Data processing inequality) (This is also called the data pro-
cessing lemma.) If XYZ, then I(X;Y)I(X;Z).
Proof: The Markov chain relationship XYZmeans that Xand Z
are conditional independent given Y(cf. Appendix B); we directly have that
I(X;Z|Y) = 0. By the chain rule for mutual information,
I(X;Z) + I(X;Y|Z) = I(X;Y, Z ) (2.4.1)
=I(X;Y) + I(X;Z|Y)
=I(X;Y).(2.4.2)
Since I(X;Y|Z)0, we obtain that I(X;Y)I(X;Z) with equality holding
iff I(X;Y|Z) = 0. 2
The data processing inequality means that the mutual information will not increase after processing. This result is somewhat counter-intuitive since, given two random variables X and Y, we might believe that applying a well-designed processing scheme to Y, which can be generally represented by a mapping g(Y), could possibly increase the mutual information. However, for any g(·), X → Y → g(Y) forms a Markov chain, which implies that data processing cannot increase mutual information. A communication context for the data processing lemma is depicted in Figure 2.3, and summarized in the next corollary.

Corollary 2.23 For jointly distributed random variables X and Y and any function g(·), we have X → Y → g(Y) and
\[ I(X;Y) \ge I(X;g(Y)). \]
We also note that if Z obtains all the information about X through Y, then knowing Z will not help increase the mutual information between X and Y; this is formalized in the following.

Corollary 2.24 If X → Y → Z, then
\[ I(X;Y|Z) \le I(X;Y). \]
Proof: The proof directly follows from (2.4.1) and (2.4.2). □
It is worth pointing out that it is possible that I(X;Y|Z) > I(X;Y) when X, Y and Z do not form a Markov chain. For example, let X and Y be independent equiprobable binary random variables, and let Z = X + Y. Then,
\begin{align*}
I(X;Y|Z) &= H(X|Z) - H(X|Y,Z)\\
&= H(X|Z)\\
&= P_Z(0)H(X|Z=0) + P_Z(1)H(X|Z=1) + P_Z(2)H(X|Z=2)\\
&= 0 + 0.5 + 0 = 0.5 \text{ bits},
\end{align*}
which is clearly larger than I(X;Y) = 0.
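This example can also be checked numerically. The sketch below (illustrative Python, not from the notes) builds the joint pmf of (X, Y, Z) with Z = X + Y and evaluates both quantities:

```python
import math
from collections import defaultdict
from itertools import product

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# X, Y independent equiprobable bits; Z = X + Y takes values in {0, 1, 2}.
joint_xyz = defaultdict(float)
for x, y in product((0, 1), repeat=2):
    joint_xyz[(x, y, x + y)] += 0.25

def marginal(keep):
    """Marginal pmf over the coordinates named in `keep` (a subset of 'xyz')."""
    m = defaultdict(float)
    for (x, y, z), p in joint_xyz.items():
        m[tuple(v for v, k in zip((x, y, z), "xyz") if k in keep)] += p
    return m

H = lambda keep: entropy(marginal(keep).values())

# I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = [H(X,Z) - H(Z)] - [H(X,Y,Z) - H(Y,Z)].
I_XY_given_Z = (H("xz") - H("z")) - (H("xyz") - H("yz"))
I_XY = H("x") + H("y") - H("xy")
print(I_XY_given_Z, I_XY)  # 0.5 bits versus 0.0 bits
```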
Finally, we observe that we can extend the data processing inequality to a sequence of random variables forming a Markov chain:

Corollary 2.25 If X_1 → X_2 → ··· → X_n, then for any i, j, k, l such that 1 ≤ i ≤ j ≤ k ≤ l ≤ n, we have that
\[ I(X_i;X_l) \le I(X_j;X_k). \]
2.5 Fano’s inequality
Fano's inequality is quite a useful tool widely employed in Information Theory to prove converse results for coding theorems (as we will see in the following chapters).
Lemma 2.26 (Fano's inequality) Let X and Y be two random variables, correlated in general, with alphabets 𝒳 and 𝒴, respectively, where 𝒳 is finite but 𝒴 can be countably infinite. Let X̂ ≜ g(Y) be an estimate of X from observing Y, where g: 𝒴 → 𝒳 is a given estimation function. Define the probability of error as
\[ P_e \triangleq \Pr[\hat{X} \ne X]. \]
Then the following inequality holds:
\[ H(X|Y) \le h_b(P_e) + P_e\cdot\log_2(|\mathcal{X}|-1), \tag{2.5.1} \]
where h_b(x) ≜ −x log_2 x − (1−x) log_2(1−x) for 0 ≤ x ≤ 1 is the binary entropy function.
Observation 2.27

• Note that when P_e = 0, we obtain that H(X|Y) = 0 (see (2.5.1)) as intuition suggests, since if P_e = 0, then X̂ = g(Y) = X (with probability 1) and thus H(X|Y) = H(g(Y)|Y) = 0.

• Fano's inequality yields upper and lower bounds on P_e in terms of H(X|Y). This is illustrated in Figure 2.4, where we plot the region for the pairs (P_e, H(X|Y)) that are permissible under Fano's inequality. In the figure, the boundary of the permissible (dashed) region is given by the function
\[ f(P_e) \triangleq h_b(P_e) + P_e\cdot\log_2(|\mathcal{X}|-1), \]
the right-hand side of (2.5.1). We obtain that when
\[ \log_2(|\mathcal{X}|-1) < H(X|Y) \le \log_2(|\mathcal{X}|), \]
P_e can be upper and lower bounded as follows:
\[ 0 < \inf\{a: f(a)\ge H(X|Y)\} \le P_e \le \sup\{a: f(a)\ge H(X|Y)\} < 1. \]
Furthermore, when
\[ 0 < H(X|Y) \le \log_2(|\mathcal{X}|-1), \]
only the lower bound holds:
\[ P_e \ge \inf\{a: f(a)\ge H(X|Y)\} > 0. \]
Thus for all non-zero values of H(X|Y), we obtain a lower bound (of the same form above) on P_e; the bound implies that if H(X|Y) is bounded away from zero, P_e is also bounded away from zero.

• A weaker but simpler version of Fano's inequality can be directly obtained from (2.5.1) by noting that h_b(P_e) ≤ 1:
\[ H(X|Y) \le 1 + P_e\log_2(|\mathcal{X}|-1), \tag{2.5.2} \]
which in turn yields that
\[ P_e \ge \frac{H(X|Y)-1}{\log_2(|\mathcal{X}|-1)} \quad (\text{for } |\mathcal{X}| > 2), \]
which is weaker than the above lower bound on P_e.
[Figure 2.4: Permissible (P_e, H(X|Y)) region due to Fano's inequality; H(X|Y) ranges over (0, log_2|𝒳|] and P_e over [0, 1], with the boundary curve f(P_e) reaching the value log_2|𝒳| at P_e = (|𝒳|−1)/|𝒳| and the value log_2(|𝒳|−1) at P_e = 1.]
Proof of Lemma 2.26: Define a new random variable
\[ E \triangleq \begin{cases} 1, & \text{if } g(Y) \ne X\\ 0, & \text{if } g(Y) = X.\end{cases} \]
Then using the chain rule for conditional entropy, we obtain
\[ H(E,X|Y) = H(X|Y) + H(E|X,Y) = H(E|Y) + H(X|E,Y). \]
Observe that E is a function of X and Y; hence, H(E|X,Y) = 0. Since conditioning never increases entropy, H(E|Y) ≤ H(E) = h_b(P_e). The remaining term, H(X|E,Y), can be bounded as follows:
\begin{align*}
H(X|E,Y) &= \Pr[E=0]\,H(X|Y,E=0) + \Pr[E=1]\,H(X|Y,E=1)\\
&\le (1-P_e)\cdot 0 + P_e\cdot\log_2(|\mathcal{X}|-1),
\end{align*}
since X = g(Y) for E = 0, and given E = 1, we can upper bound the conditional entropy by the log of the number of remaining outcomes, i.e., (|𝒳| − 1). Combining these results completes the proof. □
Fano's inequality cannot be improved in the sense that the lower bound, H(X|Y), can be achieved for some specific cases. Any bound that can be achieved in some cases is often referred to as sharp.⁵ From the proof of the above lemma, we can observe that equality holds in Fano's inequality if H(E|Y) = H(E) and H(X|Y, E=1) = log_2(|𝒳|−1). The former is equivalent to E being independent of Y, and the latter holds iff P_{X|Y}(·|y) is uniformly distributed over the set 𝒳 \ {g(y)}. We can therefore create an example in which equality holds in Fano's inequality.
Example 2.28 Suppose that X and Y are two independent random variables which are both uniformly distributed on the alphabet {0, 1, 2}. Let the estimating function be given by g(y) = y. Then
\[ P_e = \Pr[g(Y)\ne X] = \Pr[Y\ne X] = 1 - \sum_{x=0}^{2} P_X(x)P_Y(x) = \frac{2}{3}. \]
In this case, equality is achieved in Fano's inequality, i.e.,
\[ h_b\!\left(\frac{2}{3}\right) + \frac{2}{3}\cdot\log_2(3-1) = H(X|Y) = H(X) = \log_2 3. \]
To conclude this section, we present an alternative proof for Fano’s inequality
to illustrate the use of the data processing inequality and the FI Lemma.
Alternative Proof of Fano's inequality: Noting that X → Y → X̂ form a Markov chain, we directly obtain via the data processing inequality that
\[ I(X;Y) \ge I(X;\hat{X}), \]
which implies that
\[ H(X|Y) \le H(X|\hat{X}). \]
Thus, if we show that H(X|X̂) is no larger than the right-hand side of (2.5.1), the proof of (2.5.1) is complete.

Noting that
\[ P_e = \sum_{x\in\mathcal{X}}\ \sum_{\hat{x}\in\mathcal{X}:\,\hat{x}\ne x} P_{X,\hat{X}}(x,\hat{x}) \]
and
\[ 1-P_e = \sum_{x\in\mathcal{X}}\ \sum_{\hat{x}\in\mathcal{X}:\,\hat{x}=x} P_{X,\hat{X}}(x,\hat{x}) = \sum_{x\in\mathcal{X}} P_{X,\hat{X}}(x,x), \]
we obtain that
\begin{align}
&H(X|\hat{X}) - h_b(P_e) - P_e\log_2(|\mathcal{X}|-1) \nonumber\\
&= \sum_{x\in\mathcal{X}}\sum_{\hat{x}\ne x} P_{X,\hat{X}}(x,\hat{x})\log_2\frac{1}{P_{X|\hat{X}}(x|\hat{x})} + \sum_{x\in\mathcal{X}} P_{X,\hat{X}}(x,x)\log_2\frac{1}{P_{X|\hat{X}}(x|x)} \nonumber\\
&\quad - \Bigg[\sum_{x\in\mathcal{X}}\sum_{\hat{x}\ne x} P_{X,\hat{X}}(x,\hat{x})\Bigg]\log_2\frac{|\mathcal{X}|-1}{P_e} + \Bigg[\sum_{x\in\mathcal{X}} P_{X,\hat{X}}(x,x)\Bigg]\log_2(1-P_e) \nonumber\\
&= \sum_{x\in\mathcal{X}}\sum_{\hat{x}\ne x} P_{X,\hat{X}}(x,\hat{x})\log_2\frac{P_e}{P_{X|\hat{X}}(x|\hat{x})(|\mathcal{X}|-1)} + \sum_{x\in\mathcal{X}} P_{X,\hat{X}}(x,x)\log_2\frac{1-P_e}{P_{X|\hat{X}}(x|x)} \tag{2.5.3}\\
&\le \log_2(e)\sum_{x\in\mathcal{X}}\sum_{\hat{x}\ne x} P_{X,\hat{X}}(x,\hat{x})\Bigg[\frac{P_e}{P_{X|\hat{X}}(x|\hat{x})(|\mathcal{X}|-1)}-1\Bigg] + \log_2(e)\sum_{x\in\mathcal{X}} P_{X,\hat{X}}(x,x)\Bigg[\frac{1-P_e}{P_{X|\hat{X}}(x|x)}-1\Bigg] \nonumber\\
&= \log_2(e)\Bigg[\frac{P_e}{|\mathcal{X}|-1}\sum_{x\in\mathcal{X}}\sum_{\hat{x}\ne x} P_{\hat{X}}(\hat{x}) - \sum_{x\in\mathcal{X}}\sum_{\hat{x}\ne x} P_{X,\hat{X}}(x,\hat{x})\Bigg] + \log_2(e)\Bigg[(1-P_e)\sum_{x\in\mathcal{X}} P_{\hat{X}}(x) - \sum_{x\in\mathcal{X}} P_{X,\hat{X}}(x,x)\Bigg] \nonumber\\
&= \log_2(e)\Bigg[\frac{P_e}{|\mathcal{X}|-1}(|\mathcal{X}|-1) - P_e\Bigg] + \log_2(e)\big[(1-P_e)-(1-P_e)\big] = 0, \nonumber
\end{align}
where the inequality follows by applying the FI Lemma to each logarithm term in (2.5.3). □

⁵Definition. A bound is said to be sharp if the bound is achievable for some specific cases. A bound is said to be tight if the bound is achievable for all cases.
2.6 Divergence and variational distance
In addition to the probabilistically defined entropy and mutual information, an-
other measure that is frequently considered in information theory is divergence or
relative entropy. In this section, we define this measure and study its statistical
properties.
Definition 2.29 (Divergence) Given two discrete random variables X and X̂ defined over a common alphabet 𝒳, the divergence (other names are Kullback-Leibler divergence or distance, relative entropy and discrimination) is denoted by D(X‖X̂) or D(P_X‖P_X̂) and defined by⁶
\[ D(X\|\hat{X}) = D(P_X\|P_{\hat{X}}) \triangleq E_X\!\left[\log_2\frac{P_X(X)}{P_{\hat{X}}(X)}\right] = \sum_{x\in\mathcal{X}} P_X(x)\log_2\frac{P_X(x)}{P_{\hat{X}}(x)}. \]
In other words, the divergence D(P_X‖P_X̂) is the expectation (with respect to P_X) of the log-likelihood ratio log_2[P_X/P_X̂] of distribution P_X against distribution P_X̂. D(X‖X̂) can be viewed as a measure of "distance" or "dissimilarity" between distributions P_X and P_X̂. D(X‖X̂) is also called relative entropy since it can be regarded as a measure of the inefficiency of mistakenly assuming that the distribution of a source is P_X̂ when the true distribution is P_X. For example, if we know the true distribution P_X of a source, then we can construct a lossless data compression code with average codeword length achieving entropy H(X) (this will be studied in the next chapter). If, however, we mistakenly thought that the "true" distribution is P_X̂ and employ the "best" code corresponding to P_X̂, then the resultant average codeword length becomes
\[ \sum_{x\in\mathcal{X}} \big[-P_X(x)\cdot\log_2 P_{\hat{X}}(x)\big]. \]
As a result, the relative difference between the resultant average codeword length and H(X) is the relative entropy D(X‖X̂). Hence, divergence is a measure of the system cost (e.g., storage consumed) paid due to mis-classifying the system statistics.

Note that when computing divergence, we follow the convention that
\[ 0\cdot\log_2\frac{0}{p} = 0 \quad\text{and}\quad p\cdot\log_2\frac{p}{0} = \infty \quad\text{for } p > 0. \]
We next present some properties of the divergence and discuss its relation with entropy and mutual information.

⁶In order to be consistent with the units (in bits) adopted for entropy and mutual information, we will also use the base-2 logarithm for divergence unless otherwise specified.
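A direct numerical sketch of this definition (illustrative Python, not part of the notes) computes D(P‖Q) in bits and verifies the coding-penalty interpretation mentioned above:

```python
import math

def kl_divergence(P, Q):
    """D(P || Q) in bits for pmfs given as dicts over a common alphabet."""
    d = 0.0
    for x, p in P.items():
        if p > 0:
            if Q.get(x, 0.0) == 0.0:
                return math.inf          # convention: p * log(p/0) = infinity
            d += p * math.log2(p / Q[x])
    return d

P = {"a": 0.5, "b": 0.25, "c": 0.25}
Q = {"a": 0.7, "b": 0.2, "c": 0.1}

entropy_P = -sum(p * math.log2(p) for p in P.values())
mismatched_length = -sum(p * math.log2(Q[x]) for x, p in P.items())

# The penalty of coding P with the "best" code for Q is exactly D(P||Q).
assert abs((mismatched_length - entropy_P) - kl_divergence(P, Q)) < 1e-12
print(kl_divergence(P, Q), kl_divergence(Q, P))  # ~0.168 vs ~0.143: not symmetric
```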
Lemma 2.30 (Non-negativity of divergence)
\[ D(X\|\hat{X}) \ge 0 \]
with equality iff P_X(x) = P_X̂(x) for all x ∈ 𝒳 (i.e., the two distributions are equal).

Proof:
\begin{align*}
D(X\|\hat{X}) &= \sum_{x\in\mathcal{X}} P_X(x)\log_2\frac{P_X(x)}{P_{\hat{X}}(x)}\\
&\ge \left(\sum_{x\in\mathcal{X}} P_X(x)\right)\log_2\frac{\sum_{x\in\mathcal{X}} P_X(x)}{\sum_{x\in\mathcal{X}} P_{\hat{X}}(x)} = 0,
\end{align*}
where the second step follows from the log-sum inequality, with equality holding iff for every x ∈ 𝒳,
\[ \frac{P_X(x)}{P_{\hat{X}}(x)} = \frac{\sum_{a\in\mathcal{X}} P_X(a)}{\sum_{b\in\mathcal{X}} P_{\hat{X}}(b)}, \]
or equivalently P_X(x) = P_X̂(x) for all x ∈ 𝒳. □
Lemma 2.31 (Mutual information and divergence)
\[ I(X;Y) = D(P_{X,Y}\|P_X\times P_Y), \]
where P_{X,Y}(·,·) is the joint distribution of the random variables X and Y, and P_X(·) and P_Y(·) are the respective marginals.

Proof: The observation follows directly from the definitions of divergence and mutual information. □
Definition 2.32 (Refinement of distribution) Given distribution P_X on 𝒳, divide 𝒳 into k mutually disjoint sets U_1, U_2, ..., U_k, satisfying
\[ \mathcal{X} = \bigcup_{i=1}^{k}\,\mathcal{U}_i. \]
Define a new distribution P_U on U = {1, 2, ..., k} as
\[ P_U(i) = \sum_{x\in\mathcal{U}_i} P_X(x). \]
Then P_X is called a refinement (or more specifically, a k-refinement) of P_U.
Let us briefly discuss the relation between the processing of information and
its refinement. Processing of information can be modeled as a (many-to-one)
mapping, and refinement is actually the reverse operation. Recall that the
data processing lemma shows that mutual information can never increase due
to processing. Hence, if one wishes to increase mutual information, he should
simultaneously “anti-process” (or refine) the involved statistics.
From Lemma 2.31, the mutual information can be viewed as the divergence
of a joint distribution against the product distribution of the marginals. It is
therefore reasonable to expect that a similar effect due to processing (or a reverse
effect due to refinement) should also apply to divergence. This is shown in the
next lemma.
Lemma 2.33 (Refinement cannot decrease divergence) Let P_X and P_X̂ be the refinements (k-refinements) of P_U and P_Û, respectively. Then
\[ D(P_X\|P_{\hat{X}}) \ge D(P_U\|P_{\hat{U}}). \]
Proof: By the log-sum inequality, we obtain that for any i ∈ {1, 2, ..., k},
\[ \sum_{x\in\mathcal{U}_i} P_X(x)\log_2\frac{P_X(x)}{P_{\hat{X}}(x)} \ge \left(\sum_{x\in\mathcal{U}_i} P_X(x)\right)\log_2\frac{\sum_{x\in\mathcal{U}_i} P_X(x)}{\sum_{x\in\mathcal{U}_i} P_{\hat{X}}(x)} = P_U(i)\log_2\frac{P_U(i)}{P_{\hat{U}}(i)}, \tag{2.6.1} \]
with equality iff
\[ \frac{P_X(x)}{P_{\hat{X}}(x)} = \frac{P_U(i)}{P_{\hat{U}}(i)} \quad\text{for all } x\in\mathcal{U}_i. \]
Hence,
\[ D(P_X\|P_{\hat{X}}) = \sum_{i=1}^{k}\sum_{x\in\mathcal{U}_i} P_X(x)\log_2\frac{P_X(x)}{P_{\hat{X}}(x)} \ge \sum_{i=1}^{k} P_U(i)\log_2\frac{P_U(i)}{P_{\hat{U}}(i)} = D(P_U\|P_{\hat{U}}), \]
with equality iff
\[ (\forall i)(\forall x\in\mathcal{U}_i)\quad \frac{P_X(x)}{P_{\hat{X}}(x)} = \frac{P_U(i)}{P_{\hat{U}}(i)}. \qquad\Box \]
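To see Lemma 2.33 in action, here is a small sketch (illustrative Python, not from the notes) that merges alphabet symbols into groups, i.e., goes from the refinement to the coarser distribution, and checks that the divergence can only drop:

```python
import math

def D(P, Q):
    """D(P || Q) in bits; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(p * math.log2(p / Q[x]) for x, p in P.items() if p > 0)

def coarsen(P, groups):
    """Map a pmf on X to the induced pmf P_U on group indices (the reverse of refinement)."""
    return {i: sum(P[x] for x in g) for i, g in enumerate(groups)}

P  = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
Ph = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
groups = [("a", "b"), ("c", "d")]     # U_1 = {a, b}, U_2 = {c, d}

PU, PhU = coarsen(P, groups), coarsen(Ph, groups)
print(D(P, Ph), D(PU, PhU))           # the refined divergence is the larger one
assert D(P, Ph) >= D(PU, PhU) - 1e-12
```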
Observation 2.34 One drawback of adopting the divergence as a measure between two distributions is that it does not meet the symmetry requirement of a true distance,⁷ since interchanging its two arguments may yield different quantities. In other words, D(P_X‖P_X̂) ≠ D(P_X̂‖P_X) in general. (It also does not satisfy the triangular inequality.) Thus divergence is not a true distance or metric. Another measure which is a true distance, called variational distance, is sometimes used instead.
Definition 2.35 (Variational distance) The variational distance (or L_1-distance) between two distributions P_X and P_X̂ with common alphabet 𝒳 is defined by
\[ \|P_X - P_{\hat{X}}\| \triangleq \sum_{x\in\mathcal{X}} |P_X(x) - P_{\hat{X}}(x)|. \]
Lemma 2.36 The variational distance satisfies
\[ \|P_X - P_{\hat{X}}\| = 2\cdot\sup_{E\subset\mathcal{X}} |P_X(E) - P_{\hat{X}}(E)| = 2\cdot\!\!\sum_{x\in\mathcal{X}:\,P_X(x)>P_{\hat{X}}(x)}\!\! \big[P_X(x) - P_{\hat{X}}(x)\big]. \]
Proof: We first show that ‖P_X − P_X̂‖ = 2·Σ_{x∈𝒳: P_X(x)>P_X̂(x)} [P_X(x) − P_X̂(x)]. Setting A ≜ {x ∈ 𝒳: P_X(x) > P_X̂(x)}, we have
\begin{align*}
\|P_X - P_{\hat{X}}\| &= \sum_{x\in\mathcal{X}} |P_X(x)-P_{\hat{X}}(x)|\\
&= \sum_{x\in\mathcal{A}} \big[P_X(x)-P_{\hat{X}}(x)\big] + \sum_{x\in\mathcal{A}^c} \big[P_{\hat{X}}(x)-P_X(x)\big]\\
&= \sum_{x\in\mathcal{A}} \big[P_X(x)-P_{\hat{X}}(x)\big] + P_{\hat{X}}(\mathcal{A}^c) - P_X(\mathcal{A}^c)\\
&= \sum_{x\in\mathcal{A}} \big[P_X(x)-P_{\hat{X}}(x)\big] + P_X(\mathcal{A}) - P_{\hat{X}}(\mathcal{A})\\
&= 2\cdot\sum_{x\in\mathcal{A}} \big[P_X(x)-P_{\hat{X}}(x)\big],
\end{align*}
where A^c denotes the complement set of A.

We next prove that ‖P_X − P_X̂‖ = 2·sup_{E⊂𝒳} |P_X(E) − P_X̂(E)| by showing that each quantity is greater than or equal to the other. For any set E ⊂ 𝒳, we can write
\begin{align*}
\|P_X - P_{\hat{X}}\| &= \sum_{x\in E} |P_X(x)-P_{\hat{X}}(x)| + \sum_{x\in E^c} |P_X(x)-P_{\hat{X}}(x)|\\
&\ge \left|\sum_{x\in E} \big[P_X(x)-P_{\hat{X}}(x)\big]\right| + \left|\sum_{x\in E^c} \big[P_X(x)-P_{\hat{X}}(x)\big]\right|\\
&= |P_X(E)-P_{\hat{X}}(E)| + |P_X(E^c)-P_{\hat{X}}(E^c)|\\
&= |P_X(E)-P_{\hat{X}}(E)| + |P_{\hat{X}}(E)-P_X(E)| = 2\cdot|P_X(E)-P_{\hat{X}}(E)|.
\end{align*}
Thus ‖P_X − P_X̂‖ ≥ 2·sup_{E⊂𝒳} |P_X(E) − P_X̂(E)|. Conversely, we have that
\begin{align*}
2\cdot\sup_{E\subset\mathcal{X}} |P_X(E)-P_{\hat{X}}(E)| &\ge 2\cdot|P_X(\mathcal{A})-P_{\hat{X}}(\mathcal{A})|\\
&= |P_X(\mathcal{A})-P_{\hat{X}}(\mathcal{A})| + |P_{\hat{X}}(\mathcal{A}^c)-P_X(\mathcal{A}^c)|\\
&= \sum_{x\in\mathcal{A}} \big[P_X(x)-P_{\hat{X}}(x)\big] + \sum_{x\in\mathcal{A}^c} \big[P_{\hat{X}}(x)-P_X(x)\big]\\
&= \sum_{x\in\mathcal{A}} |P_X(x)-P_{\hat{X}}(x)| + \sum_{x\in\mathcal{A}^c} |P_X(x)-P_{\hat{X}}(x)| = \|P_X - P_{\hat{X}}\|.
\end{align*}
Therefore, ‖P_X − P_X̂‖ = 2·sup_{E⊂𝒳} |P_X(E) − P_X̂(E)|. □

⁷Given a non-empty set A, the function d: A × A → [0, ∞) is called a distance or metric if it satisfies the following properties.
1. Non-negativity: d(a, b) ≥ 0 for every a, b ∈ A, with equality holding iff a = b.
2. Symmetry: d(a, b) = d(b, a) for every a, b ∈ A.
3. Triangular inequality: d(a, b) + d(b, c) ≥ d(a, c) for every a, b, c ∈ A.
Lemma 2.37 (Variational distance vs divergence: Pinsker's inequality)
\[ D(X\|\hat{X}) \ge \frac{\log_2(e)}{2}\cdot\|P_X - P_{\hat{X}}\|^2. \]
This result is referred to as Pinsker's inequality.
Proof:

1. With A ≜ {x ∈ 𝒳: P_X(x) > P_X̂(x)}, we have from the previous lemma that
\[ \|P_X - P_{\hat{X}}\| = 2\big[P_X(\mathcal{A}) - P_{\hat{X}}(\mathcal{A})\big]. \]

2. Define two random variables U and Û as:
\[ U = \begin{cases} 1, & \text{if } X\in\mathcal{A};\\ 0, & \text{if } X\in\mathcal{A}^c,\end{cases} \qquad \hat{U} = \begin{cases} 1, & \text{if } \hat{X}\in\mathcal{A};\\ 0, & \text{if } \hat{X}\in\mathcal{A}^c.\end{cases} \]
Then P_X and P_X̂ are refinements (2-refinements) of P_U and P_Û, respectively. From Lemma 2.33, we obtain that
\[ D(P_X\|P_{\hat{X}}) \ge D(P_U\|P_{\hat{U}}). \]

3. The proof is complete if we show that
\[ D(P_U\|P_{\hat{U}}) \ge 2\log_2(e)\big[P_X(\mathcal{A}) - P_{\hat{X}}(\mathcal{A})\big]^2 = 2\log_2(e)\big[P_U(1) - P_{\hat{U}}(1)\big]^2. \]
For ease of notation, let p = P_U(1) and q = P_Û(1). Then proving the above inequality is equivalent to showing that
\[ p\cdot\ln\frac{p}{q} + (1-p)\cdot\ln\frac{1-p}{1-q} \ge 2(p-q)^2. \]
Define
\[ f(p,q) \triangleq p\cdot\ln\frac{p}{q} + (1-p)\cdot\ln\frac{1-p}{1-q} - 2(p-q)^2, \]
and observe that
\[ \frac{\partial f(p,q)}{\partial q} = (p-q)\left[4 - \frac{1}{q(1-q)}\right] \le 0 \quad\text{for } q \le p, \]
since q(1−q) ≤ 1/4. Thus, f(p, q) is non-increasing in q for q ≤ p. Also note that f(p, q) = 0 for q = p. Therefore,
\[ f(p,q) \ge 0 \quad\text{for } q \le p. \]
The proof is completed by noting that
\[ f(p,q) \ge 0 \quad\text{for } q \ge p, \]
since f(1−p, 1−q) = f(p, q). □
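The following sketch (illustrative Python, not part of the notes) computes the variational distance of Definition 2.35 and spot-checks Pinsker's inequality on randomly generated pmfs:

```python
import math
import random

def kl_bits(P, Q):
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

def variational_distance(P, Q):
    # ||P - Q|| = sum_x |P(x) - Q(x)| = 2 * sum_{x: P(x) > Q(x)} [P(x) - Q(x)]  (Lemma 2.36).
    return sum(abs(p - q) for p, q in zip(P, Q))

def random_pmf(k, rng):
    w = [rng.random() + 1e-9 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    P, Q = random_pmf(5, rng), random_pmf(5, rng)
    d, v = kl_bits(P, Q), variational_distance(P, Q)
    # Pinsker's inequality (Lemma 2.37): D(P||Q) >= (log2 e / 2) * ||P - Q||^2.
    assert d >= (math.log2(math.e) / 2) * v * v - 1e-12
```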
Observation 2.38 The above lemma tells us that for a sequence of distributions {(P_{X_n}, P_{X̂_n})}_{n≥1}, when D(P_{X_n}‖P_{X̂_n}) goes to zero as n goes to infinity, ‖P_{X_n} − P_{X̂_n}‖ goes to zero as well. But the converse does not necessarily hold. For a quick counterexample, let
\[ P_{X_n}(0) = 1 - P_{X_n}(1) = 1/n > 0 \]
and
\[ P_{\hat{X}_n}(0) = 1 - P_{\hat{X}_n}(1) = 0. \]
In this case,
\[ D(P_{X_n}\|P_{\hat{X}_n}) = \infty, \]
since by convention, (1/n)·log_2((1/n)/0) = ∞. However,
\[ \|P_{X_n} - P_{\hat{X}_n}\| = 2\Big[P_{X_n}\{x: P_{X_n}(x) > P_{\hat{X}_n}(x)\} - P_{\hat{X}_n}\{x: P_{X_n}(x) > P_{\hat{X}_n}(x)\}\Big] = \frac{2}{n} \to 0. \]
We however can upper bound D(P_X‖P_X̂) by the variational distance between P_X and P_X̂ when D(P_X‖P_X̂) < ∞.

Lemma 2.39 If D(P_X‖P_X̂) < ∞, then
\[ D(P_X\|P_{\hat{X}}) \le \frac{\log_2(e)}{\displaystyle\min_{\{x:\,P_X(x)>0\}}\min\{P_X(x),P_{\hat{X}}(x)\}}\cdot\|P_X - P_{\hat{X}}\|. \]
Proof: Without loss of generality, we assume that P_X(x) > 0 for all x ∈ 𝒳. Since D(P_X‖P_X̂) < ∞, we have that for any x ∈ 𝒳, P_X(x) > 0 implies that P_X̂(x) > 0. Let
\[ t \triangleq \min_{\{x\in\mathcal{X}:\,P_X(x)>0\}}\min\{P_X(x),P_{\hat{X}}(x)\}. \]
Then for all x ∈ 𝒳,
\begin{align*}
\ln\frac{P_X(x)}{P_{\hat{X}}(x)} \le \left|\ln\frac{P_X(x)}{P_{\hat{X}}(x)}\right|
&\le \left(\max_{\min\{P_X(x),P_{\hat{X}}(x)\}\le s\le\max\{P_X(x),P_{\hat{X}}(x)\}} \frac{d\ln(s)}{ds}\right)\cdot|P_X(x)-P_{\hat{X}}(x)|\\
&= \frac{1}{\min\{P_X(x),P_{\hat{X}}(x)\}}\cdot|P_X(x)-P_{\hat{X}}(x)|\\
&\le \frac{1}{t}\cdot|P_X(x)-P_{\hat{X}}(x)|.
\end{align*}
Hence,
\begin{align*}
D(P_X\|P_{\hat{X}}) &= \log_2(e)\sum_{x\in\mathcal{X}} P_X(x)\cdot\ln\frac{P_X(x)}{P_{\hat{X}}(x)}\\
&\le \frac{\log_2(e)}{t}\sum_{x\in\mathcal{X}} P_X(x)\cdot|P_X(x)-P_{\hat{X}}(x)|\\
&\le \frac{\log_2(e)}{t}\sum_{x\in\mathcal{X}} |P_X(x)-P_{\hat{X}}(x)| = \frac{\log_2(e)}{t}\cdot\|P_X - P_{\hat{X}}\|. \qquad\Box
\end{align*}
The next lemma discusses the effect of side information on divergence. As stated in Lemma 2.12, side information usually reduces entropy; it, however, increases divergence. One interpretation of these results is that side information is useful. Regarding entropy, side information provides us with more information, so uncertainty decreases. As for divergence, it is a measure or index of how easily one can differentiate the source between two candidate distributions. The larger the divergence, the easier one can tell apart the two distributions and make the right guess. In the extreme case when divergence is zero, one can never tell which distribution is the right one, since both produce the same source. So, when we obtain more information (side information), we should be able to make a better decision on the source statistics, which implies that the divergence should be larger.
Definition 2.40 (Conditional divergence) Given three discrete random variables X, X̂ and Z, where X and X̂ have a common alphabet 𝒳, we define the conditional divergence between X and X̂ given Z by
\[ D(X\|\hat{X}|Z) = D(P_{X|Z}\|P_{\hat{X}|Z}) \triangleq \sum_{z\in\mathcal{Z}}\sum_{x\in\mathcal{X}} P_{X,Z}(x,z)\log_2\frac{P_{X|Z}(x|z)}{P_{\hat{X}|Z}(x|z)}. \]
In other words, it is the expected value with respect to P_{X,Z} of the log-likelihood ratio log_2[P_{X|Z}/P_{X̂|Z}].
Lemma 2.41 (Conditional mutual information and conditional divergence) Given three discrete random variables X, Y and Z with alphabets 𝒳, 𝒴 and 𝒵, respectively, and joint distribution P_{X,Y,Z}, then
\[ I(X;Y|Z) = D(P_{X,Y|Z}\|P_{X|Z}P_{Y|Z}) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}\sum_{z\in\mathcal{Z}} P_{X,Y,Z}(x,y,z)\log_2\frac{P_{X,Y|Z}(x,y|z)}{P_{X|Z}(x|z)P_{Y|Z}(y|z)}, \]
where P_{X,Y|Z} is the conditional joint distribution of X and Y given Z, and P_{X|Z} and P_{Y|Z} are the conditional distributions of X and Y, respectively, given Z.

Proof: The proof follows directly from the definition of conditional mutual information (2.2.2) and the above definition of conditional divergence. □
Lemma 2.42 (Chain rule for divergence) For three discrete random variables X, X̂ and Z, where X and X̂ have a common alphabet 𝒳, we have that
\[ D(P_{X,Z}\|P_{\hat{X},Z}) = D(P_X\|P_{\hat{X}}) + D(P_{X|Z}\|P_{\hat{X}|Z}). \]
Proof: The proof readily follows from the divergence definitions. □
Lemma 2.43 (Conditioning never decreases divergence) For three discrete random variables X, X̂ and Z, where X and X̂ have a common alphabet 𝒳, we have that
\[ D(P_{X|Z}\|P_{\hat{X}|Z}) \ge D(P_X\|P_{\hat{X}}). \]
Proof:
\begin{align*}
&D(P_{X|Z}\|P_{\hat{X}|Z}) - D(P_X\|P_{\hat{X}})\\
&= \sum_{z\in\mathcal{Z}}\sum_{x\in\mathcal{X}} P_{X,Z}(x,z)\cdot\log_2\frac{P_{X|Z}(x|z)}{P_{\hat{X}|Z}(x|z)} - \sum_{x\in\mathcal{X}} P_X(x)\cdot\log_2\frac{P_X(x)}{P_{\hat{X}}(x)}\\
&= \sum_{z\in\mathcal{Z}}\sum_{x\in\mathcal{X}} P_{X,Z}(x,z)\cdot\log_2\frac{P_{X|Z}(x|z)}{P_{\hat{X}|Z}(x|z)} - \sum_{x\in\mathcal{X}}\left(\sum_{z\in\mathcal{Z}} P_{X,Z}(x,z)\right)\cdot\log_2\frac{P_X(x)}{P_{\hat{X}}(x)}\\
&= \sum_{z\in\mathcal{Z}}\sum_{x\in\mathcal{X}} P_{X,Z}(x,z)\cdot\log_2\frac{P_{X|Z}(x|z)P_{\hat{X}}(x)}{P_{\hat{X}|Z}(x|z)P_X(x)}\\
&\ge \sum_{z\in\mathcal{Z}}\sum_{x\in\mathcal{X}} P_{X,Z}(x,z)\cdot\log_2(e)\left(1 - \frac{P_{\hat{X}|Z}(x|z)P_X(x)}{P_{X|Z}(x|z)P_{\hat{X}}(x)}\right) \quad\text{(by the FI Lemma)}\\
&= \log_2(e)\left(1 - \sum_{x\in\mathcal{X}}\frac{P_X(x)}{P_{\hat{X}}(x)}\sum_{z\in\mathcal{Z}} P_Z(z)P_{\hat{X}|Z}(x|z)\right)\\
&= \log_2(e)\left(1 - \sum_{x\in\mathcal{X}}\frac{P_X(x)}{P_{\hat{X}}(x)}P_{\hat{X}}(x)\right) = \log_2(e)\left(1 - \sum_{x\in\mathcal{X}} P_X(x)\right) = 0,
\end{align*}
with equality holding iff for all x and z,
\[ \frac{P_X(x)}{P_{\hat{X}}(x)} = \frac{P_{X|Z}(x|z)}{P_{\hat{X}|Z}(x|z)}. \qquad\Box \]
Note that it is not necessary that
\[ D(P_{X|Z}\|P_{\hat{X}|\hat{Z}}) \ge D(P_X\|P_{\hat{X}}). \]
In other words, the side information is helpful for divergence only when it provides information on the similarity or difference of the two distributions. For the above case, Z only provides information about X, and Ẑ provides information about X̂; so the divergence certainly cannot be expected to increase. The next lemma shows that if (Z, Ẑ) is independent of (X, X̂), then the side information of (Z, Ẑ) does not help in improving the divergence of X against X̂.
Lemma 2.44 (Independent side information does not change divergence) If (X, X̂) is independent of (Z, Ẑ), then
\[ D(P_{X|Z}\|P_{\hat{X}|\hat{Z}}) = D(P_X\|P_{\hat{X}}), \]
where
\[ D(P_{X|Z}\|P_{\hat{X}|\hat{Z}}) \triangleq \sum_{x\in\mathcal{X}}\sum_{\hat{x}\in\mathcal{X}}\sum_{z\in\mathcal{Z}}\sum_{\hat{z}\in\mathcal{Z}} P_{X,\hat{X},Z,\hat{Z}}(x,\hat{x},z,\hat{z})\log_2\frac{P_{X|Z}(x|z)}{P_{\hat{X}|\hat{Z}}(\hat{x}|\hat{z})}. \]
Proof: This can be easily justified by the definition of divergence. □
Lemma 2.45 (Additivity of divergence under independence)
\[ D(P_{X,Z}\|P_{\hat{X},\hat{Z}}) = D(P_X\|P_{\hat{X}}) + D(P_Z\|P_{\hat{Z}}), \]
provided that (X, X̂) is independent of (Z, Ẑ).

Proof: This can be easily proved from the definition. □
2.7 Convexity/concavity of information measures
We next address the convexity/concavity properties of information measures
with respect to the distributions on which they are defined. Such properties will
be useful when optimizing the information measures over distribution spaces.
Lemma 2.46

1. H(P_X) is a concave function of P_X, namely
\[ H(\lambda P_X + (1-\lambda)P_{\widetilde{X}}) \ge \lambda H(P_X) + (1-\lambda)H(P_{\widetilde{X}}). \]

2. Noting that I(X;Y) can be re-written as I(P_X, P_{Y|X}), where
\[ I(P_X,P_{Y|X}) \triangleq \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)P_X(x)\log_2\frac{P_{Y|X}(y|x)}{\sum_{a\in\mathcal{X}} P_{Y|X}(y|a)P_X(a)}, \]
then I(X;Y) is a concave function of P_X (for fixed P_{Y|X}), and a convex function of P_{Y|X} (for fixed P_X).

3. D(P_X‖P_X̂) is convex with respect to both the first argument P_X and the second argument P_X̂. It is also convex in the pair (P_X, P_X̂); i.e., if (P_X, P_X̂) and (Q_X, Q_X̂) are two pairs of probability mass functions, then
\[ D(\lambda P_X + (1-\lambda)Q_X\,\|\,\lambda P_{\hat{X}} + (1-\lambda)Q_{\hat{X}}) \le \lambda\cdot D(P_X\|P_{\hat{X}}) + (1-\lambda)\cdot D(Q_X\|Q_{\hat{X}}), \tag{2.7.1} \]
for all λ ∈ [0, 1].
Proof:

1. The proof uses the log-sum inequality:
\begin{align*}
&H(\lambda P_X + (1-\lambda)P_{\widetilde{X}}) - \lambda H(P_X) - (1-\lambda)H(P_{\widetilde{X}})\\
&= \lambda\sum_{x\in\mathcal{X}} P_X(x)\log_2\frac{P_X(x)}{\lambda P_X(x)+(1-\lambda)P_{\widetilde{X}}(x)} + (1-\lambda)\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)\log_2\frac{P_{\widetilde{X}}(x)}{\lambda P_X(x)+(1-\lambda)P_{\widetilde{X}}(x)}\\
&\ge \lambda\left(\sum_{x\in\mathcal{X}} P_X(x)\right)\log_2\frac{\sum_{x\in\mathcal{X}} P_X(x)}{\sum_{x\in\mathcal{X}}[\lambda P_X(x)+(1-\lambda)P_{\widetilde{X}}(x)]} + (1-\lambda)\left(\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)\right)\log_2\frac{\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)}{\sum_{x\in\mathcal{X}}[\lambda P_X(x)+(1-\lambda)P_{\widetilde{X}}(x)]}\\
&= 0,
\end{align*}
with equality holding iff P_X(x) = P_{\widetilde{X}}(x) for all x.
2. We first show the concavity of I(P_X, P_{Y|X}) with respect to P_X. Let λ̄ = 1 − λ. Then
\begin{align*}
&I(\lambda P_X + \bar{\lambda}P_{\widetilde{X}},\,P_{Y|X}) - \lambda I(P_X,P_{Y|X}) - \bar{\lambda}I(P_{\widetilde{X}},P_{Y|X})\\
&= \lambda\sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x)\log_2\frac{\sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x)}{\sum_{x\in\mathcal{X}}[\lambda P_X(x)+\bar{\lambda}P_{\widetilde{X}}(x)]P_{Y|X}(y|x)}\\
&\quad + \bar{\lambda}\sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)P_{Y|X}(y|x)\log_2\frac{\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)P_{Y|X}(y|x)}{\sum_{x\in\mathcal{X}}[\lambda P_X(x)+\bar{\lambda}P_{\widetilde{X}}(x)]P_{Y|X}(y|x)}\\
&\ge 0 \quad\text{(by the log-sum inequality)},
\end{align*}
with equality holding iff
\[ \sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x) = \sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)P_{Y|X}(y|x) \]
for all y ∈ 𝒴. We now turn to the convexity of I(P_X, P_{Y|X}) with respect to P_{Y|X}. For ease of notation, let P_{Y_λ}(y) ≜ λP_Y(y) + λ̄P_{Ỹ}(y) and P_{Y_λ|X}(y|x) ≜ λP_{Y|X}(y|x) + λ̄P_{Ỹ|X}(y|x). Then
\begin{align*}
&\lambda I(P_X,P_{Y|X}) + \bar{\lambda}I(P_X,P_{\widetilde{Y}|X}) - I(P_X,\lambda P_{Y|X}+\bar{\lambda}P_{\widetilde{Y}|X})\\
&= \lambda\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)}{P_Y(y)} + \bar{\lambda}\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{\widetilde{Y}|X}(y|x)\log_2\frac{P_{\widetilde{Y}|X}(y|x)}{P_{\widetilde{Y}}(y)}\\
&\quad - \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y_\lambda|X}(y|x)\log_2\frac{P_{Y_\lambda|X}(y|x)}{P_{Y_\lambda}(y)}\\
&= \lambda\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)P_{Y_\lambda}(y)}{P_Y(y)P_{Y_\lambda|X}(y|x)} + \bar{\lambda}\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{\widetilde{Y}|X}(y|x)\log_2\frac{P_{\widetilde{Y}|X}(y|x)P_{Y_\lambda}(y)}{P_{\widetilde{Y}}(y)P_{Y_\lambda|X}(y|x)}\\
&\ge \lambda\log_2(e)\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\left(1-\frac{P_Y(y)P_{Y_\lambda|X}(y|x)}{P_{Y|X}(y|x)P_{Y_\lambda}(y)}\right)\\
&\quad + \bar{\lambda}\log_2(e)\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{\widetilde{Y}|X}(y|x)\left(1-\frac{P_{\widetilde{Y}}(y)P_{Y_\lambda|X}(y|x)}{P_{\widetilde{Y}|X}(y|x)P_{Y_\lambda}(y)}\right)\\
&= 0,
\end{align*}
where the inequality follows from the FI Lemma, with equality holding iff
\[ (\forall x\in\mathcal{X},\,y\in\mathcal{Y})\quad \frac{P_Y(y)}{P_{Y|X}(y|x)} = \frac{P_{\widetilde{Y}}(y)}{P_{\widetilde{Y}|X}(y|x)}. \]
3. For ease of notation, let $P_{X_\lambda}(x) \triangleq \lambda P_X(x) + (1-\lambda)P_{\tilde X}(x)$. Then
$$\lambda D(P_X\|P_{\hat X}) + (1-\lambda)D(P_{\tilde X}\|P_{\hat X}) - D(P_{X_\lambda}\|P_{\hat X})$$
$$= \lambda\sum_{x\in\mathcal X}P_X(x)\log_2\frac{P_X(x)}{P_{X_\lambda}(x)} + (1-\lambda)\sum_{x\in\mathcal X}P_{\tilde X}(x)\log_2\frac{P_{\tilde X}(x)}{P_{X_\lambda}(x)}$$
$$= \lambda D(P_X\|P_{X_\lambda}) + (1-\lambda)D(P_{\tilde X}\|P_{X_\lambda}) \ge 0$$
by the non-negativity of the divergence, with equality holding iff $P_X(x) = P_{\tilde X}(x)$ for all $x$.

Similarly, by letting $P_{\hat X_\lambda}(x) \triangleq \lambda P_{\hat X}(x) + (1-\lambda)P_{\tilde X}(x)$, we obtain:
$$\lambda D(P_X\|P_{\hat X}) + (1-\lambda)D(P_X\|P_{\tilde X}) - D(P_X\|P_{\hat X_\lambda})$$
$$= \lambda\sum_{x\in\mathcal X}P_X(x)\log_2\frac{P_{\hat X_\lambda}(x)}{P_{\hat X}(x)} + (1-\lambda)\sum_{x\in\mathcal X}P_X(x)\log_2\frac{P_{\hat X_\lambda}(x)}{P_{\tilde X}(x)}$$
$$\ge \frac{\lambda}{\ln 2}\sum_{x\in\mathcal X}P_X(x)\left(1-\frac{P_{\hat X}(x)}{P_{\hat X_\lambda}(x)}\right) + \frac{1-\lambda}{\ln 2}\sum_{x\in\mathcal X}P_X(x)\left(1-\frac{P_{\tilde X}(x)}{P_{\hat X_\lambda}(x)}\right)$$
$$= \log_2(e)\left(1-\sum_{x\in\mathcal X}P_X(x)\,\frac{\lambda P_{\hat X}(x)+(1-\lambda)P_{\tilde X}(x)}{P_{\hat X_\lambda}(x)}\right) = 0,$$
where the inequality follows from the FI Lemma, with equality holding iff $P_{\tilde X}(x) = P_{\hat X}(x)$ for all $x$.
Finally, by the log-sum inequality, for each $x\in\mathcal X$ we have
$$\big(\lambda P_X(x)+(1-\lambda)P_{\hat X}(x)\big)\log_2\frac{\lambda P_X(x)+(1-\lambda)P_{\hat X}(x)}{\lambda Q_X(x)+(1-\lambda)Q_{\hat X}(x)} \le \lambda P_X(x)\log_2\frac{\lambda P_X(x)}{\lambda Q_X(x)} + (1-\lambda)P_{\hat X}(x)\log_2\frac{(1-\lambda)P_{\hat X}(x)}{(1-\lambda)Q_{\hat X}(x)}.$$
Summing over $x$ yields (2.7.1).
Note that the last result (convexity of $D(P_X\|P_{\hat X})$ in the pair $(P_X,P_{\hat X})$) actually implies the first two claims of this part: just set $P_{\hat X}=Q_{\hat X}$ to obtain convexity in the first argument $P_X$, and set $P_X=Q_X$ to obtain convexity in the second argument $P_{\hat X}$. $\Box$
2.8 Fundamentals of hypothesis testing
One of the fundamental problems in statistics is to decide between two alternative explanations for the observed data. For example, when gambling, one may wish to test whether a game is fair or not. Similarly, a sequence of observations of the market may reveal whether a new product is successful or not. These are instances of the simplest form of the hypothesis testing problem, which is usually called simple hypothesis testing.
It has quite a few applications in information theory. One of the frequently
cited examples is the alternative interpretation of the law of large numbers.
Another example is the computation of the true coding error (for universal codes)
by testing the empirical distribution against the true distribution. All of these
cases will be discussed subsequently.
The simple hypothesis testing problem can be formulated as follows:

Problem: Let $X_1,\ldots,X_n$ be a sequence of observations which is possibly drawn according to either a "null hypothesis" distribution $P_{X^n}$ or an "alternative hypothesis" distribution $P_{\hat X^n}$. The hypotheses are usually denoted by:
$$H_0:\ P_{X^n} \qquad\qquad H_1:\ P_{\hat X^n}$$
Based on one sequence of observations $x^n$, one has to decide which of the hypotheses is true. This is denoted by a decision mapping $\phi(\cdot)$, where
$$\phi(x^n) = \begin{cases}0, & \text{if the distribution of } X^n \text{ is classified to be } P_{X^n};\\ 1, & \text{if the distribution of } X^n \text{ is classified to be } P_{\hat X^n}.\end{cases}$$
Accordingly, the possible observed sequences are divided into two groups:
$$\text{Acceptance region for } H_0:\ \{x^n\in\mathcal X^n: \phi(x^n)=0\}$$
$$\text{Acceptance region for } H_1:\ \{x^n\in\mathcal X^n: \phi(x^n)=1\}.$$
Hence, depending on the true distribution, there are two possible types of error probabilities:
$$\text{Type I error}:\ \alpha_n = \alpha_n(\phi) \triangleq P_{X^n}(\{x^n\in\mathcal X^n:\phi(x^n)=1\})$$
$$\text{Type II error}:\ \beta_n = \beta_n(\phi) \triangleq P_{\hat X^n}(\{x^n\in\mathcal X^n:\phi(x^n)=0\}).$$
The choice of the decision mapping is dependent on the optimization criterion.
Two of the most frequently used ones in information theory are:
1. Bayesian hypothesis testing.

Here, $\phi(\cdot)$ is chosen so that the Bayesian cost $\pi_0\alpha_n + \pi_1\beta_n$ is minimized, where $\pi_0$ and $\pi_1$ are the prior probabilities for the null and alternative hypotheses, respectively. The mathematical expression for Bayesian testing is:
$$\min_{\{\phi\}}\,[\pi_0\alpha_n(\phi) + \pi_1\beta_n(\phi)].$$
2. Neyman-Pearson hypothesis testing subject to a fixed test level.

Here, $\phi(\cdot)$ is chosen so that the type II error $\beta_n$ is minimized subject to a constant bound on the type I error, i.e., $\alpha_n \le \varepsilon$, where $\varepsilon>0$ is fixed. The mathematical expression for Neyman-Pearson testing is:
$$\min_{\{\phi:\,\alpha_n(\phi)\le\varepsilon\}}\beta_n(\phi).$$
The set $\{\phi\}$ considered in the minimization can range over two different classes: deterministic rules or randomized rules. The main difference between a randomized rule and a deterministic rule is that the former allows the mapping $\phi(x^n)$ to be random on $\{0,1\}$ for some $x^n$, while the latter only accepts deterministic assignments to $\{0,1\}$ for all $x^n$. For example, a randomized rule for a specific observation $\tilde x^n$ can be
$$\phi(\tilde x^n) = \begin{cases}0, & \text{with probability } 0.2;\\ 1, & \text{with probability } 0.8.\end{cases}$$
The Neyman-Pearson lemma shows the well-known fact that the likelihood
ratio test is always the optimal test.
Lemma 2.47 (Neyman-Pearson Lemma) For a simple hypothesis testing problem, define an acceptance region for the null hypothesis through the likelihood ratio as
$$\mathcal A_n(\tau) \triangleq \left\{x^n\in\mathcal X^n: \frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)} > \tau\right\},$$
and let
$$\alpha_n^* \triangleq P_{X^n}\{\mathcal A_n^c(\tau)\} \quad\text{and}\quad \beta_n^* \triangleq P_{\hat X^n}\{\mathcal A_n(\tau)\}.$$
Then for the type I error $\alpha_n$ and type II error $\beta_n$ associated with any other choice of acceptance region for the null hypothesis, we have
$$\alpha_n \le \alpha_n^* \ \Longrightarrow\ \beta_n \ge \beta_n^*.$$
Proof: Let $\mathcal B$ be a choice of acceptance region for the null hypothesis. Then
$$\alpha_n + \tau\beta_n = \sum_{x^n\in\mathcal B^c}P_{X^n}(x^n) + \tau\sum_{x^n\in\mathcal B}P_{\hat X^n}(x^n)
= \sum_{x^n\in\mathcal B^c}P_{X^n}(x^n) + \tau\left[1 - \sum_{x^n\in\mathcal B^c}P_{\hat X^n}(x^n)\right]
= \tau + \sum_{x^n\in\mathcal B^c}\left[P_{X^n}(x^n) - \tau P_{\hat X^n}(x^n)\right]. \qquad (2.8.1)$$
Observe that (2.8.1) is minimized by choosing $\mathcal B = \mathcal A_n(\tau)$. Hence,
$$\alpha_n + \tau\beta_n \ge \alpha_n^* + \tau\beta_n^*,$$
which immediately implies the desired result. $\Box$
The Neyman-Pearson lemma indicates that no other choice of acceptance region can simultaneously improve both the type I and type II errors of the likelihood ratio test. Indeed, from (2.8.1), it is clear that for any achievable pair $(\alpha_n,\beta_n)$, one can always find a likelihood ratio test that performs at least as well. Therefore, the likelihood ratio test is an optimal test. The statistical properties of the likelihood ratio thus become essential in hypothesis testing. Note that, when the observations are i.i.d. under both hypotheses, the divergence, which is the statistical expectation of the log-likelihood ratio, plays an important role in hypothesis testing (for non-memoryless observations, one is then concerned with the divergence rate, an extended notion of divergence for systems with memory which will be defined in a following chapter).
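To make the likelihood ratio test concrete, the following Python sketch (a minimal illustration, not part of the original notes; the i.i.d. Bernoulli hypotheses and the threshold value are assumptions chosen for the example) computes the type I and type II error probabilities of the acceptance region $\mathcal A_n(\tau)$ by exhaustive enumeration over all length-$n$ observation sequences.

```python
from itertools import product

def seq_prob(xs, p):
    """Probability of a binary sequence under an i.i.d. Bernoulli(p) source."""
    out = 1.0
    for x in xs:
        out *= p if x == 1 else (1.0 - p)
    return out

def lrt_errors(p0, p1, n, tau):
    """Type I and type II errors of A_n(tau) = {x^n : P_{H0}(x^n)/P_{H1}(x^n) > tau}."""
    alpha = 0.0   # probability under H0 of rejecting H0
    beta = 0.0    # probability under H1 of accepting H0
    for xs in product([0, 1], repeat=n):
        p_h0, p_h1 = seq_prob(xs, p0), seq_prob(xs, p1)
        if p_h0 / p_h1 > tau:      # x^n lies in the acceptance region for H0
            beta += p_h1
        else:                      # x^n is rejected under H0
            alpha += p_h0
    return alpha, beta

# Example: H0 is Bernoulli(0.5), H1 is Bernoulli(0.8), blocklength 10.
print(lrt_errors(0.5, 0.8, 10, tau=1.0))
```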
Chapter 3
Lossless Data Compression
3.1 Principles of data compression
As mentioned in Chapter 1, data compression describes methods of representing
a source by a code whose average codeword length (or code rate) is acceptably
small. The representation can be: lossless (or asymptotically lossless) where
the reconstructed source is identical (or asymptotically identical) to the original
source; or lossy where the reconstructed source is allowed to deviate from the
original source, usually within an acceptable threshold. We herein focus on
lossless data compression.
Since a memoryless source is modelled as a random variable, the average codeword length of a codebook is calculated based on the probability distribution of that random variable. For example, consider a ternary memoryless source $X$ with three possible outcomes and
$$P_X(x=\text{outcome}_A)=0.5,\quad P_X(x=\text{outcome}_B)=0.25,\quad P_X(x=\text{outcome}_C)=0.25.$$
Suppose that a binary codebook is designed for this source, in which outcome$_A$, outcome$_B$ and outcome$_C$ are respectively encoded as 0, 10, and 11. Then the average codeword length (in bits/source outcome) is
$$\mathrm{length}(0)\times P_X(\text{outcome}_A) + \mathrm{length}(10)\times P_X(\text{outcome}_B) + \mathrm{length}(11)\times P_X(\text{outcome}_C) = 1\times 0.5 + 2\times 0.25 + 2\times 0.25 = 1.5 \text{ bits}.$$
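As a quick numerical check of the computation above, here is a small Python snippet (an illustrative sketch, not part of the notes) that evaluates the average codeword length of an arbitrary codebook from the source pmf.

```python
def average_codeword_length(pmf, codebook):
    """Average codeword length: sum over x of P_X(x) * length(c_x)."""
    return sum(pmf[x] * len(codebook[x]) for x in pmf)

pmf = {"A": 0.5, "B": 0.25, "C": 0.25}
codebook = {"A": "0", "B": "10", "C": "11"}
print(average_codeword_length(pmf, codebook))  # 1.5 bits/source outcome
```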
There are usually no constraints on the basic structure of a code. In the
case where the codeword length for each source outcome can be different, the
code is called a variable-length code. When the codeword lengths of all source outcomes are equal, the code is referred to as a fixed-length code. It is obvious that the minimum average codeword length among all variable-length codes is no greater than that among all fixed-length codes, since the latter class is a subclass of the former. We will see in this chapter that the smallest achievable average code rates for variable-length and fixed-length codes coincide for sources with good probabilistic characteristics, such as stationarity and ergodicity. But for more general sources with memory, the two quantities are different (cf. Part II of the book).
For fixed-length codes, the sequence of adjacent codewords are concate-
nated together for storage or transmission purposes, and some punctuation
mechanism—such as marking the beginning of each codeword or delineating
internal sub-blocks for synchronization between encoder and decoder—is nor-
mally considered an implicit part of the codewords. Due to constraints on space
or processing capability, the sequence of source symbols may be too long for the
encoder to deal with all at once; therefore, segmentation before encoding is often
necessary. For example, suppose that we need to encode, using a binary code, the grades of a class with 100 students. There are three grade levels: $A$, $B$ and $C$. By observing that there are $3^{100}$ possible grade combinations for 100 students, a straightforward code design requires $\lceil\log_2(3^{100})\rceil = 159$ bits to encode these combinations. Now suppose that the encoder facility can only process 16 bits at a time. Then the above code design becomes infeasible and segmentation is unavoidable. Under such a constraint, we may encode the grades of 10 students at a time, which requires $\lceil\log_2(3^{10})\rceil = 16$ bits. As a consequence, for a class of 100 students, the code requires 160 bits in total.
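The bit counts in the grades example can be verified directly; the sketch below (illustrative only, assuming 3 grade levels, 100 students and blocks of 10 students as in the text) compares the unsegmented and segmented designs.

```python
import math

levels, students, per_block = 3, 100, 10

unsegmented = math.ceil(students * math.log2(levels))   # ceil(log2(3^100)) = 159 bits
per_segment = math.ceil(per_block * math.log2(levels))  # ceil(log2(3^10))  = 16 bits
segmented = (students // per_block) * per_segment       # 10 segments * 16 = 160 bits
print(unsegmented, per_segment, segmented)
```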
In the above example, the letters in the grade set {A, B, C }and the letters
from the code alphabet {0,1}are often called source symbols and code symbols,
respectively. When the code alphabet is binary (as in the previous two examples),
the code symbols are referred to as code bits or simply bits (as already used).
A tuple (or grouped sequence) of source symbols is called a sourceword and the
resulting encoded tuple consisting of code symbols is called a codeword. (In the
above example, each sourceword consists of 10 source symbols (students) and
each codeword consists of 16 bits.)
Note that, during the encoding process, the sourceword lengths do not have to
be equal. In this text, we however only consider the case where the sourcewords
have a fixed length throughout the encoding process (except for the Lempel-Ziv
code briefly discussed at the end of this chapter), but we will allow the codewords
to have fixed or variable lengths as defined earlier.¹ The block diagram of a source coding system is depicted in Figure 3.1.

¹ In other words, our fixed-length codes are actually "fixed-to-fixed length codes" and our variable-length codes are "fixed-to-variable length codes" since, in both cases, a fixed number of source symbols is mapped onto codewords with fixed and variable lengths, respectively.

Figure 3.1: Block diagram of a data compression system (source → source encoder → codewords → source decoder → reconstructed sourcewords).
When adding segmentation mechanisms to fixed-length codes, the codes can
be loosely divided into two groups. The first consists of block codes in which the
encoding (or decoding) of the next segment of source symbols is independent of
the previous segments. If the encoding/decoding of the next segment, somehow,
retains and uses some knowledge of earlier segments, the code is called a fixed-
length tree code. As will not investigate such codes in this text, we can use
“block codes” and “fixed-length codes” as synonyms.
In this chapter, we first consider data compression for block codes in Sec-
tion 3.2. Data compression for variable-length codes is then addressed in Sec-
tion 3.3.
3.2 Block codes for asymptotically lossless compression
3.2.1 Block codes for discrete memoryless sources
We first focus on the study of asymptotically lossless data compression of discrete
memoryless sources via block (fixed-length) codes. Such sources were already
defined in Appendix B and the previous chapter; but we nevertheless recall their
definition.
Definition 3.1 (Discrete memoryless source) A discrete memoryless source (DMS) $\{X_n\}_{n=1}^{\infty}$ consists of a sequence of independent and identically distributed (i.i.d.) random variables $X_1, X_2, X_3, \ldots$, all taking values in a common finite alphabet $\mathcal X$. In particular, if $P_X(\cdot)$ is the common distribution or probability mass function (pmf) of the $X_i$'s, then
$$P_{X^n}(x_1,x_2,\ldots,x_n) = \prod_{i=1}^{n} P_X(x_i).$$
Definition 3.2 An $(n, M)$ block code of blocklength $n$ and size $M$ (which can in general be a function of $n$,² i.e., $M = M_n$) for a discrete source $\{X_n\}_{n=1}^{\infty}$ is a set $\{c_1, c_2, \ldots, c_M\} \subseteq \mathcal X^n$ consisting of $M$ reproduction (or reconstruction) words, where each reproduction word is a sourceword (an $n$-tuple of source symbols).³ The block code's operation can be symbolically represented as⁴
$$(x_1, x_2, \ldots, x_n) \to c_m \in \{c_1, c_2, \ldots, c_M\}.$$
This procedure is repeated for each consecutive block of length $n$, i.e.,
$$\cdots (x_{3n},\ldots,x_{31})(x_{2n},\ldots,x_{21})(x_{1n},\ldots,x_{11}) \to \cdots\,|\,c_{m_3}\,|\,c_{m_2}\,|\,c_{m_1},$$
where $|$ reflects the necessity of a "punctuation" or "synchronization" mechanism between consecutive source blocks.
The next theorem provides a key tool for proving Shannon’s source coding
theorem.
² In the literature, both $(n, M)$ and $(M, n)$ have been used to denote a block code with blocklength $n$ and size $M$. For example, [45, p. 149] adopts the former, while [12, p. 193] uses the latter. We use the $(n, M)$ notation since $M = M_n$ is a function of $n$ in general.

³ One can binary-index the reproduction words in $\{c_1, c_2, \ldots, c_M\}$ using $k \triangleq \lceil\log_2 M\rceil$ bits. As such $k$-bit words in $\{0,1\}^k$ are usually stored for retrieval at a later date, the $(n, M)$ block code can be represented by an encoder-decoder pair of functions $(f, g)$, where the encoding function $f:\mathcal X^n \to \{0,1\}^k$ maps each sourceword $x^n$ to a $k$-bit word $f(x^n)$ which we call a codeword. The decoding function $g:\{0,1\}^k \to \{c_1, c_2, \ldots, c_M\}$ is then a retrieving operation that produces the reproduction words. Since the codewords are binary-valued, such a block code is called a binary code. More generally, a $D$-ary block code (where $D > 1$ is an integer) would use an encoding function $f:\mathcal X^n \to \{0,1,\ldots,D-1\}^k$ where each codeword $f(x^n)$ contains $k$ $D$-ary code symbols.

Furthermore, since the behavior of block codes is investigated for sufficiently large $n$ and $M$ (tending to infinity), it is legitimate to replace $\lceil\log_2 M\rceil$ by $\log_2 M$ for the case of binary codes. With this convention, the data compression rate or code rate is
$$\text{bits required per source symbol} = \frac{k}{n} = \frac{1}{n}\log_2 M.$$
Similarly, for $D$-ary codes, the rate is
$$\text{$D$-ary code symbols required per source symbol} = \frac{k}{n} = \frac{1}{n}\log_D M.$$
For computational convenience, nats (under the natural logarithm) can be used instead of bits or $D$-ary code symbols; in this case, the code rate becomes
$$\text{nats required per source symbol} = \frac{1}{n}\log M.$$

⁴ When one uses an encoder-decoder pair $(f, g)$ to describe the block code, the code's operation can be expressed as $c_m = g(f(x^n))$.
Theorem 3.3 (Shannon-McMillan; asymptotic equipartition property or AEP⁵) If $\{X_n\}_{n=1}^{\infty}$ is a DMS with entropy $H(X)$, then
$$-\frac{1}{n}\log_2 P_{X^n}(X_1,\ldots,X_n) \to H(X) \quad\text{in probability}.$$
In other words, for any $\delta > 0$,
$$\lim_{n\to\infty}\Pr\left\{\left|-\frac{1}{n}\log_2 P_{X^n}(X_1,\ldots,X_n) - H(X)\right| > \delta\right\} = 0.$$

Proof: This theorem follows by first observing that for an i.i.d. sequence $\{X_n\}_{n=1}^{\infty}$,
$$-\frac{1}{n}\log_2 P_{X^n}(X_1,\ldots,X_n) = -\frac{1}{n}\sum_{i=1}^{n}\log_2 P_X(X_i),$$
noting that the sequence $\{-\log_2 P_X(X_i)\}_{i=1}^{\infty}$ is i.i.d., and then applying the weak law of large numbers (WLLN) to the latter sequence. $\Box$

⁵ The AEP is also called the entropy stability property.
The AEP indeed constitutes an "information theoretic" analog of the WLLN, as it states that if $\{-\log_2 P_X(X_i)\}_{i=1}^{\infty}$ is an i.i.d. sequence, then for any $\delta>0$,
$$\Pr\left\{\left|-\frac{1}{n}\sum_{i=1}^{n}\log_2 P_X(X_i) - H(X)\right| \le \delta\right\} \to 1 \quad\text{as } n\to\infty.$$
As a consequence of the AEP, all the probability mass will ultimately be placed on the weakly $\delta$-typical set, which is defined as
$$\mathcal F_n(\delta) \triangleq \left\{x^n\in\mathcal X^n: \left|-\frac{1}{n}\log_2 P_{X^n}(x^n) - H(X)\right| \le \delta\right\} = \left\{x^n\in\mathcal X^n: \left|-\frac{1}{n}\sum_{i=1}^{n}\log_2 P_X(x_i) - H(X)\right| \le \delta\right\}.$$
Note that since the source is memoryless, for any $x^n\in\mathcal F_n(\delta)$, $-(1/n)\log_2 P_{X^n}(x^n)$, the normalized self-information of $x^n$, is equal to $-(1/n)\sum_{i=1}^{n}\log_2 P_X(x_i)$, which is the empirical (arithmetic) average self-information or "apparent" entropy of the source. Thus, a sourceword $x^n$ is $\delta$-typical if it yields an apparent source entropy within $\delta$ of the "true" source entropy $H(X)$. Note also that the sourcewords in $\mathcal F_n(\delta)$ are nearly equiprobable or equally surprising (cf. Property 1 of Theorem 3.4); this justifies referring to Theorem 3.3 as the AEP.
Theorem 3.4 (Consequence of the AEP) Given a DMS $\{X_n\}_{n=1}^{\infty}$ with entropy $H(X)$ and any $\delta$ greater than zero, the weakly $\delta$-typical set $\mathcal F_n(\delta)$ satisfies the following.

1. If $x^n\in\mathcal F_n(\delta)$, then
$$2^{-n(H(X)+\delta)} \le P_{X^n}(x^n) \le 2^{-n(H(X)-\delta)}.$$
2. $P_{X^n}(\mathcal F_n^c(\delta)) < \delta$ for sufficiently large $n$, where the superscript $c$ denotes the complementary set operation.
3. $|\mathcal F_n(\delta)| > (1-\delta)2^{n(H(X)-\delta)}$ for sufficiently large $n$, and $|\mathcal F_n(\delta)| \le 2^{n(H(X)+\delta)}$ for every $n$, where $|\mathcal F_n(\delta)|$ denotes the number of elements in $\mathcal F_n(\delta)$.

Note: The above theorem also holds if we define the typical set using the base-$D$ logarithm $\log_D$ for any $D>1$ instead of the base-2 logarithm; in this case, one just needs to appropriately change the base of the exponential terms in the above theorem (by replacing $2^x$ terms with $D^x$ terms) and also substitute $H(X)$ with $H_D(X)$.
Proof: Property 1 is an immediate consequence of the definition of $\mathcal F_n(\delta)$.

Property 2 is a direct consequence of the AEP, since the AEP states that for a fixed $\delta>0$, $\lim_{n\to\infty} P_{X^n}(\mathcal F_n(\delta)) = 1$; i.e., for every $\varepsilon>0$, there exists $n_0 = n_0(\varepsilon)$ such that for all $n\ge n_0$,
$$P_{X^n}(\mathcal F_n(\delta)) > 1-\varepsilon.$$
In particular, setting $\varepsilon=\delta$ yields the result. We nevertheless provide a direct proof of Property 2, as it gives an explicit expression for $n_0$: observe that by Chebyshev's inequality,
$$P_{X^n}(\mathcal F_n^c(\delta)) = P_{X^n}\left\{x^n\in\mathcal X^n: \left|-\frac{1}{n}\log_2 P_{X^n}(x^n) - H(X)\right| > \delta\right\} \le \frac{\sigma_X^2}{n\delta^2} < \delta,$$
for $n > \sigma_X^2/\delta^3$, where the variance
$$\sigma_X^2 \triangleq \mathrm{Var}[-\log_2 P_X(X)] = \sum_{x\in\mathcal X} P_X(x)\left[\log_2 P_X(x)\right]^2 - (H(X))^2$$
is a constant⁶ independent of $n$.

⁶ In the proof, we assume that the variance $\sigma_X^2 = \mathrm{Var}[-\log_2 P_X(X)] < \infty$. This holds since the source alphabet is finite:
$$\mathrm{Var}[-\log_2 P_X(X)] \le E[(\log_2 P_X(X))^2] = \sum_{x\in\mathcal X} P_X(x)(\log_2 P_X(x))^2 \le \sum_{x\in\mathcal X}\frac{4}{e^2}[\log_2(e)]^2 = \frac{4}{e^2}[\log_2(e)]^2 \times|\mathcal X| < \infty.$$
To prove Property 3, we have from Property 1 that
$$1 \ge \sum_{x^n\in\mathcal F_n(\delta)} P_{X^n}(x^n) \ge \sum_{x^n\in\mathcal F_n(\delta)} 2^{-n(H(X)+\delta)} = |\mathcal F_n(\delta)|\,2^{-n(H(X)+\delta)},$$
and, using Properties 2 and 1, we have that
$$1-\delta < 1-\frac{\sigma_X^2}{n\delta^2} \le \sum_{x^n\in\mathcal F_n(\delta)} P_{X^n}(x^n) \le \sum_{x^n\in\mathcal F_n(\delta)} 2^{-n(H(X)-\delta)} = |\mathcal F_n(\delta)|\,2^{-n(H(X)-\delta)},$$
for $n \ge \sigma_X^2/\delta^3$. $\Box$
Note that for any $n>0$, a block code $\mathcal C_n = (n, M)$ is said to be uniquely decodable or completely lossless if its set of reproduction words is trivially equal to the set of all source $n$-tuples: $\{c_1, c_2, \ldots, c_M\} = \mathcal X^n$. In this case, if we binary-index the reproduction words using an encoder-decoder pair $(f, g)$, every sourceword $x^n$ is assigned a distinct binary codeword $f(x^n)$ of length $k = \log_2 M$, and every binary $k$-tuple is the image under $f$ of some sourceword. In other words, $f$ is a bijective (injective and surjective) map and hence invertible, with decoding map $g = f^{-1}$ and $M = |\mathcal X|^n = 2^k$. Thus the code rate is $(1/n)\log_2 M = \log_2|\mathcal X|$ bits/source symbol.

Now the question becomes: can we achieve a better (i.e., smaller) compression rate? The answer is affirmative: we can achieve a compression rate equal to the source entropy $H(X)$ (in bits), which can be significantly smaller than $\log_2|\mathcal X|$ when the source is strongly non-uniformly distributed, if we give up unique decodability (for every $n$) and allow $n$ to be sufficiently large to asymptotically achieve lossless reconstruction by having an arbitrarily small (but positive) probability of decoding error $P_e(\mathcal C_n) \triangleq P_{X^n}\{x^n\in\mathcal X^n: g(f(x^n)) \ne x^n\}$.

Thus, block codes herein can perform data compression that is asymptotically lossless with respect to the blocklength; this contrasts with variable-length codes, which can be completely lossless (uniquely decodable) for every finite blocklength.
We can now formally state and prove Shannon's asymptotically lossless source coding theorem for block codes. The theorem is stated for general $D$-ary block codes, with the source entropy $H_D(X)$ (in $D$-ary code symbols/source symbol) being the smallest (infimum) possible compression rate for asymptotically lossless $D$-ary block codes. Without loss of generality, the theorem is proved for the case $D=2$. The idea behind the proof of the forward (achievability) part is to binary-index each sourceword in the weakly $\delta$-typical set $\mathcal F_n(\delta)$ with a distinct binary codeword (starting from index one, with corresponding $k$-tuple codeword $0\cdots01$), and to encode all sourcewords outside $\mathcal F_n(\delta)$ to a default all-zero binary codeword, which certainly cannot be reproduced without distortion due to the many-to-one nature of this mapping.
 x²   |-(1/2)Σ_{i=1}^{2} log₂ P_X(x_i) − H(X)|   in F₂(0.4)?   codeword   reconstructed sequence
 AA            0.525 bits                         no            000        ambiguous
 AB            0.317 bits                         yes           001        AB
 AC            0.025 bits                         yes           010        AC
 AD            0.475 bits                         no            000        ambiguous
 BA            0.317 bits                         yes           011        BA
 BB            0.109 bits                         yes           100        BB
 BC            0.183 bits                         yes           101        BC
 BD            0.683 bits                         no            000        ambiguous
 CA            0.025 bits                         yes           110        CA
 CB            0.183 bits                         yes           111        CB
 CC            0.475 bits                         no            000        ambiguous
 CD            0.975 bits                         no            000        ambiguous
 DA            0.475 bits                         no            000        ambiguous
 DB            0.683 bits                         no            000        ambiguous
 DC            0.975 bits                         no            000        ambiguous
 DD            1.475 bits                         no            000        ambiguous

Table 3.1: An example of the δ-typical set with n = 2 and δ = 0.4, where F₂(0.4) = {AB, AC, BA, BB, BC, CA, CB}. The codeword set is {001 (AB), 010 (AC), 011 (BA), 100 (BB), 101 (BC), 110 (CA), 111 (CB), 000 (AA, AD, BD, CC, CD, DA, DB, DC, DD)}, where the parenthesis following each binary codeword indicates those sourcewords that are encoded to this codeword. The source distribution is P_X(A) = 0.4, P_X(B) = 0.3, P_X(C) = 0.2 and P_X(D) = 0.1.
The resultant code rate is $(1/n)\lceil\log_2(|\mathcal F_n(\delta)|+1)\rceil$ bits per source symbol. As revealed in the Shannon-McMillan AEP theorem and its consequence, almost all the probability mass will be on $\mathcal F_n(\delta)$ as $n$ becomes sufficiently large, and hence the probability of non-reconstructable source sequences can be made arbitrarily small. A simple example of the above coding scheme is illustrated in Table 3.1. The converse part of the proof will establish (by expressing the probability of correct decoding in terms of the $\delta$-typical set and also using the Consequence of the AEP) that for any sequence of $D$-ary codes with rate strictly below the source entropy, the probability of error cannot asymptotically vanish (it is bounded away from zero). Actually, a stronger result is proven: it is shown that the probability of error not only does not asymptotically vanish, it actually ultimately grows to 1 (this is why we call this part a "strong" converse).
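The construction in Table 3.1 can be reproduced with a short script. The following Python sketch (illustrative only; the pmf, n = 2 and δ = 0.4 are taken from the table's caption) enumerates the weakly δ-typical set and assigns distinct non-zero binary indices to the typical sourcewords, with the all-zero word reserved for everything else.

```python
import math
from itertools import product

pmf = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
n, delta = 2, 0.4
H = -sum(p * math.log2(p) for p in pmf.values())   # source entropy in bits

typical = []
for xn in product(pmf, repeat=n):
    apparent = -sum(math.log2(pmf[x]) for x in xn) / n   # apparent entropy of x^n
    if abs(apparent - H) <= delta:
        typical.append("".join(xn))

k = math.ceil(math.log2(len(typical) + 1))   # codeword length; all-zero word reserved
codebook = {xn: format(i + 1, "0{}b".format(k)) for i, xn in enumerate(typical)}
print(sorted(typical))   # ['AB', 'AC', 'BA', 'BB', 'BC', 'CA', 'CB']
print(codebook)          # distinct non-zero 3-bit indices, as in Table 3.1
```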
Theorem 3.5 (Shannon's source coding theorem) Given integer $D>1$, consider a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$ with entropy $H_D(X)$. Then the following hold.

Forward part (achievability): For any $0<\varepsilon<1$, there exist $0<\delta<\varepsilon$ and a sequence of $D$-ary block codes $\{\mathcal C_n=(n,M_n)\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log_D M_n \le H_D(X)+\delta \qquad(3.2.1)$$
satisfying
$$P_e(\mathcal C_n) < \varepsilon \qquad(3.2.2)$$
for all sufficiently large $n$, where $P_e(\mathcal C_n)$ denotes the probability of decoding error for block code $\mathcal C_n$.⁷

Strong converse part: For any $0<\varepsilon<1$, any sequence of $D$-ary block codes $\{\mathcal C_n=(n,M_n)\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log_D M_n < H_D(X) \qquad(3.2.3)$$
satisfies
$$P_e(\mathcal C_n) > 1-\varepsilon$$
for all $n$ sufficiently large.
Proof:

Forward part: Without loss of generality, we prove the result for the case of binary codes (i.e., $D=2$). Recall that the subscript $D$ in $H_D(X)$ is dropped (i.e., omitted) when $D=2$.

Given $0<\varepsilon<1$, fix $\delta$ such that $0<\delta<\varepsilon$ and choose $n>2/\delta$. Now construct a binary block code $\mathcal C_n$ by simply mapping the $\delta/2$-typical sourcewords $x^n$ onto distinct non-all-zero binary codewords of length $k\triangleq\lceil\log_2 M_n\rceil$ bits. In other words, binary-index (cf. the footnote in Definition 3.2) the sourcewords in $\mathcal F_n(\delta/2)$ with the following encoding map:
$$x^n \mapsto \begin{cases}\text{binary index of } x^n, & \text{if } x^n\in\mathcal F_n(\delta/2);\\ \text{all-zero codeword}, & \text{if } x^n\notin\mathcal F_n(\delta/2).\end{cases}$$
⁷ (3.2.2) is equivalent to $\limsup_{n\to\infty}P_e(\mathcal C_n)\le\varepsilon$. Since $\varepsilon$ can be made arbitrarily small, the forward part actually indicates the existence of a sequence of $D$-ary block codes $\{\mathcal C_n\}_{n=1}^{\infty}$ satisfying (3.2.1) such that $\limsup_{n\to\infty}P_e(\mathcal C_n)=0$.

Based on this, the converse should be that any sequence of $D$-ary block codes satisfying (3.2.3) satisfies $\limsup_{n\to\infty}P_e(\mathcal C_n)>0$. However, the so-called strong converse actually gives a stronger consequence: $\limsup_{n\to\infty}P_e(\mathcal C_n)=1$ (as $\varepsilon$ can be made arbitrarily small).
Then by the Shannon-McMillan AEP theorem, we obtain that
$$M_n = |\mathcal F_n(\delta/2)| + 1 \le 2^{n(H(X)+\delta/2)} + 1 < 2\cdot 2^{n(H(X)+\delta/2)} < 2^{n(H(X)+\delta)},$$
for $n > 2/\delta$. Hence, a sequence of $\mathcal C_n=(n,M_n)$ block codes satisfying (3.2.1) is established. It remains to show that the error probability for this sequence of $(n,M_n)$ block codes can be made smaller than $\varepsilon$ for all sufficiently large $n$.

By the Shannon-McMillan AEP theorem,
$$P_{X^n}(\mathcal F_n^c(\delta/2)) < \frac{\delta}{2} \quad\text{for all sufficiently large } n.$$
Consequently, for those $n$ satisfying the above inequality and larger than $2/\delta$,
$$P_e(\mathcal C_n) \le P_{X^n}(\mathcal F_n^c(\delta/2)) < \delta < \varepsilon.$$
(For the last step, the reader can refer to Table 3.1 to confirm that only the "ambiguous" sequences outside the typical set contribute to the probability of error.)
Strong converse part: Fix any sequence of block codes $\{\mathcal C_n\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log_2|\mathcal C_n| < H(X).$$
Let $\mathcal S_n$ be the set of sourcewords that can be correctly decoded through the $\mathcal C_n$-coding system. (A quick example is depicted in Figure 3.2.) Then $|\mathcal S_n| = |\mathcal C_n|$. By choosing $\delta$ small enough with $\varepsilon/2 > \delta > 0$, and by definition of the limsup operation, we have
$$(\exists N_0)(\forall n > N_0)\quad \frac{1}{n}\log_2|\mathcal S_n| = \frac{1}{n}\log_2|\mathcal C_n| < H(X) - 2\delta,$$
which implies
$$|\mathcal S_n| < 2^{n(H(X)-2\delta)}.$$
Furthermore, from Property 2 of the Consequence of the AEP, we obtain that
$$(\exists N_1)(\forall n > N_1)\quad P_{X^n}(\mathcal F_n^c(\delta)) < \delta.$$
Figure 3.2: Possible codebook $\mathcal C_n$ and its corresponding $\mathcal S_n$. The solid box indicates the decoding mapping from $\mathcal C_n$ back to $\mathcal S_n$.
Consequently, for $n > N \triangleq \max\{N_0,\, N_1,\, \frac{1}{\delta}\log_2\frac{2}{\varepsilon}\}$, the probability of correct block decoding satisfies
$$1 - P_e(\mathcal C_n) = \sum_{x^n\in\mathcal S_n} P_{X^n}(x^n) = \sum_{x^n\in\mathcal S_n\cap\mathcal F_n^c(\delta)} P_{X^n}(x^n) + \sum_{x^n\in\mathcal S_n\cap\mathcal F_n(\delta)} P_{X^n}(x^n)$$
$$\le P_{X^n}(\mathcal F_n^c(\delta)) + |\mathcal S_n\cap\mathcal F_n(\delta)|\cdot\max_{x^n\in\mathcal F_n(\delta)} P_{X^n}(x^n)$$
$$< \delta + |\mathcal S_n|\cdot\max_{x^n\in\mathcal F_n(\delta)} P_{X^n}(x^n)$$
$$< \frac{\varepsilon}{2} + 2^{n(H(X)-2\delta)}\cdot 2^{-n(H(X)-\delta)} = \frac{\varepsilon}{2} + 2^{-n\delta} < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon,$$
which is equivalent to $P_e(\mathcal C_n) > 1-\varepsilon$ for $n > N$. $\Box$
Observation 3.6 The results of the above theorem are illustrated in Figure 3.3, where $R = \limsup_{n\to\infty}(1/n)\log_D M_n$ is usually called the ultimate (or asymptotic) code rate of block codes for compressing the source. It is clear from the figure that the (ultimate) rate of any block code with arbitrarily small decoding error probability must be greater than the source entropy. Conversely, the probability of decoding error for any block code of rate smaller than the entropy ultimately approaches 1 (and hence is bounded away from zero). Thus, for a DMS, the source entropy $H_D(X)$ is the infimum of all "achievable" source (block) coding rates; i.e., it is the infimum of all rates for which there exists a sequence of $D$-ary block codes with asymptotically vanishing (as the blocklength goes to infinity) probability of decoding error.
Figure 3.3: (Ultimate) compression rate $R$ versus source entropy $H_D(X)$ and behavior of the probability of block decoding error as blocklength $n$ goes to infinity for a discrete memoryless source: $P_e\to 1$ for all block codes with $R < H_D(X)$, while $P_e\to 0$ for the best data compression block codes with $R > H_D(X)$.
For a source with (statistical) memory, the Shannon-McMillan theorem cannot be directly applied in its original form, and thereby Shannon's source coding theorem appears restricted to memoryless sources only. However, by examining the concept behind these theorems, one finds that the key to the validity of Shannon's source coding theorem is actually the existence of a set $\mathcal A_n = \{x^n_1, x^n_2,\ldots,x^n_M\}$ with $M \approx D^{nH_D(X)}$ and $P_{X^n}(\mathcal A_n^c)\to 0$, namely, the existence of a "typical-like" set $\mathcal A_n$ whose size is comparatively small (exponentially smaller than $|\mathcal X^n|$ when $H_D(X) < \log_D|\mathcal X|$) and whose probability mass is asymptotically large. Thus, if we can find such a typical-like set for a source with memory, the source coding theorem for block codes can be extended to this source. Indeed, with appropriate modifications, the Shannon-McMillan theorem can be generalized to the class of stationary ergodic sources, and hence a block source coding theorem for this class can be established; this is considered in the next subsection. The block source coding theorem for general (e.g., non-stationary non-ergodic) sources in terms of a "generalized entropy" measure (see the end of the next subsection for a brief description) will be studied in detail in Part II of the book.
3.2.2 Block codes for stationary ergodic sources

In practice, a stochastic source used to model data often exhibits memory or statistical dependence among its random variables; its joint distribution is hence not a product of its marginal distributions. In this subsection, we consider the asymptotically lossless data compression theorem for the class of stationary ergodic sources.

Before proceeding to generalize the block source coding theorem, we need to first generalize the "entropy" measure for a sequence of dependent random variables $X^n$ (which certainly should be backward compatible with the discrete memoryless case). A straightforward generalization is to examine the limit of the normalized block entropy of a source sequence, resulting in the concept of entropy rate.

Definition 3.7 (Entropy rate) The entropy rate of a source $\{X_n\}_{n=1}^{\infty}$ is denoted by $H(\mathbf X)$ and defined by
$$H(\mathbf X) \triangleq \lim_{n\to\infty}\frac{1}{n}H(X^n),$$
provided the limit exists, where $X^n = (X_1,\ldots,X_n)$.
Next we show that the entropy rate exists for stationary sources (here, we do not need ergodicity for the existence of the entropy rate).

Lemma 3.8 For a stationary source $\{X_n\}_{n=1}^{\infty}$, the conditional entropy $H(X_n|X_{n-1},\ldots,X_1)$ is non-increasing in $n$ and also bounded from below by zero. Hence by Lemma A.20, the limit
$$\lim_{n\to\infty} H(X_n|X_{n-1},\ldots,X_1)$$
exists.

Proof: We have
$$H(X_n|X_{n-1},\ldots,X_1) \le H(X_n|X_{n-1},\ldots,X_2) \qquad(3.2.4)$$
$$= H(X_n,\ldots,X_2) - H(X_{n-1},\ldots,X_2)$$
$$= H(X_{n-1},\ldots,X_1) - H(X_{n-2},\ldots,X_1) \qquad(3.2.5)$$
$$= H(X_{n-1}|X_{n-2},\ldots,X_1),$$
where (3.2.4) follows since conditioning never increases entropy, and (3.2.5) holds because of the stationarity assumption. Finally, recall that each conditional entropy $H(X_n|X_{n-1},\ldots,X_1)$ is non-negative. $\Box$
Lemma 3.9 (Cesàro-mean theorem) If $a_n\to a$ as $n\to\infty$ and $b_n = (1/n)\sum_{i=1}^{n} a_i$, then $b_n\to a$ as $n\to\infty$.

Proof: $a_n\to a$ implies that for any $\varepsilon>0$, there exists $N$ such that for all $n>N$, $|a_n - a|<\varepsilon$. Then
$$|b_n - a| = \left|\frac{1}{n}\sum_{i=1}^{n}(a_i - a)\right| \le \frac{1}{n}\sum_{i=1}^{n}|a_i - a| = \frac{1}{n}\sum_{i=1}^{N}|a_i - a| + \frac{1}{n}\sum_{i=N+1}^{n}|a_i - a| \le \frac{1}{n}\sum_{i=1}^{N}|a_i - a| + \frac{n-N}{n}\varepsilon.$$
Hence, $\limsup_{n\to\infty}|b_n - a| \le \varepsilon$. Since $\varepsilon$ can be made arbitrarily small, the lemma holds. $\Box$
Theorem 3.10 For a stationary source $\{X_n\}_{n=1}^{\infty}$, the entropy rate always exists and is equal to
$$H(\mathbf X) = \lim_{n\to\infty} H(X_n|X_{n-1},\ldots,X_1).$$

Proof: The result follows directly by writing
$$\frac{1}{n}H(X^n) = \frac{1}{n}\sum_{i=1}^{n}H(X_i|X_{i-1},\ldots,X_1) \qquad\text{(chain rule for entropy)}$$
and applying the Cesàro-mean theorem. $\Box$

Observation 3.11 It can also be shown that for a stationary source, $(1/n)H(X^n)$ is non-increasing in $n$ and $(1/n)H(X^n) \ge H(X_n|X_{n-1},\ldots,X_1)$ for all $n\ge 1$. (The proof is left as an exercise.)

It is obvious that when $\{X_n\}_{n=1}^{\infty}$ is a discrete memoryless source, $H(X^n) = n\,H(X)$ for every $n$. Hence,
$$H(\mathbf X) = \lim_{n\to\infty}\frac{1}{n}H(X^n) = H(X).$$
For a first-order stationary Markov source,
$$H(\mathbf X) = \lim_{n\to\infty}\frac{1}{n}H(X^n) = \lim_{n\to\infty}H(X_n|X_{n-1},\ldots,X_1) = H(X_2|X_1),$$
where
$$H(X_2|X_1) \triangleq -\sum_{x_1\in\mathcal X}\sum_{x_2\in\mathcal X}\pi(x_1)P_{X_2|X_1}(x_2|x_1)\,\log P_{X_2|X_1}(x_2|x_1),$$
and $\pi(\cdot)$ is the stationary distribution of the Markov source. Furthermore, if the Markov source is binary with $P_{X_2|X_1}(0|1) = \alpha$ and $P_{X_2|X_1}(1|0) = \beta$, then
$$H(\mathbf X) = \frac{\beta}{\alpha+\beta}h_b(\alpha) + \frac{\alpha}{\alpha+\beta}h_b(\beta),$$
where $h_b(\alpha) \triangleq -\alpha\log\alpha - (1-\alpha)\log(1-\alpha)$ is the binary entropy function.
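As a numerical illustration of the last formula (a minimal sketch, not part of the notes; the transition probabilities α and β are arbitrary assumptions), the entropy rate of the binary Markov source can be computed both from the closed-form expression and directly as H(X₂|X₁) under the stationary distribution:

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

alpha, beta = 0.1, 0.3          # P(X2=0 | X1=1) = alpha, P(X2=1 | X1=0) = beta
pi0, pi1 = alpha / (alpha + beta), beta / (alpha + beta)   # stationary distribution

closed_form = (beta / (alpha + beta)) * hb(alpha) + (alpha / (alpha + beta)) * hb(beta)
direct = pi0 * hb(beta) + pi1 * hb(alpha)    # H(X2 | X1) averaged over pi
print(closed_form, direct)                   # both ≈ 0.572 bits/source symbol
```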
Theorem 3.12 (Generalized AEP or Shannon-McMillan-Breiman Theorem [12]) If $\{X_n\}_{n=1}^{\infty}$ is a stationary ergodic source, then
$$-\frac{1}{n}\log_2 P_{X^n}(X_1,\ldots,X_n) \xrightarrow{\ a.s.\ } H(\mathbf X).$$

Since the AEP theorem (law of large numbers) is valid for stationary ergodic sources, all consequences of the AEP follow, including Shannon's lossless source coding theorem.
Theorem 3.13 (Shannon's source coding theorem for stationary ergodic sources) Given integer $D>1$, let $\{X_n\}_{n=1}^{\infty}$ be a stationary ergodic source with entropy rate (in base $D$)
$$H_D(\mathbf X) \triangleq \lim_{n\to\infty}\frac{1}{n}H_D(X^n).$$
Then the following hold.

Forward part (achievability): For any $0<\varepsilon<1$, there exist $\delta$ with $0<\delta<\varepsilon$ and a sequence of $D$-ary block codes $\{\mathcal C_n=(n,M_n)\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log_D M_n < H_D(\mathbf X) + \delta,$$
and probability of decoding error satisfying
$$P_e(\mathcal C_n) < \varepsilon$$
for all sufficiently large $n$.

Strong converse part: For any $0<\varepsilon<1$, any sequence of $D$-ary block codes $\{\mathcal C_n=(n,M_n)\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log_D M_n < H_D(\mathbf X)$$
satisfies
$$P_e(\mathcal C_n) > 1-\varepsilon$$
for all $n$ sufficiently large.
A discrete memoryless (i.i.d.) source is stationary and ergodic (so Theorem
3.5 is clearly a special case of Theorem 3.13). In general, it is hard to check
whether a stationary process is ergodic or not. It is known though that if a
stationary process is a mixture of two or more stationary ergodic processes,
i.e., its n-fold distribution can be written as the mean (with respect to some
distribution) of the n-fold distributions of stationary ergodic processes, then it
is not ergodic.8
8The converse is also true; i.e., if a stationary process cannot be represented as a mixture
of stationary ergodic processes, then it is ergodic.
For example, let $P$ and $Q$ be two distributions on a finite alphabet $\mathcal X$ such that the process $\{X_n\}_{n=1}^{\infty}$ is i.i.d. with distribution $P$ and the process $\{Y_n\}_{n=1}^{\infty}$ is i.i.d. with distribution $Q$. Flip a biased coin (with Heads probability equal to $\theta$, $0<\theta<1$) once and let
$$Z_i = \begin{cases}X_i & \text{if Heads}\\ Y_i & \text{if Tails}\end{cases}$$
for $i=1,2,\ldots$. Then the resulting process $\{Z_i\}_{i=1}^{\infty}$ has its $n$-fold distribution given by a mixture of the $n$-fold distributions of $\{X_n\}_{n=1}^{\infty}$ and $\{Y_n\}_{n=1}^{\infty}$:
$$P_{Z^n}(a^n) = \theta P_{X^n}(a^n) + (1-\theta)P_{Y^n}(a^n)$$
for all $a^n\in\mathcal X^n$, $n=1,2,\ldots$. Hence the process $\{Z_i\}_{i=1}^{\infty}$ is stationary but not ergodic.
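A short simulation makes the non-ergodicity of {Z_i} tangible: the time average of a single realization settles near a statistic of either P or Q (depending on the single coin flip), not near the ensemble average. The snippet below is an illustrative sketch; the binary alphabet, θ and the particular P, Q are assumptions chosen for the example.

```python
import random

def z_process_sample_mean(theta, p, q, n, rng):
    """Empirical frequency of symbol 1 in one realization of {Z_i}:
    a single biased coin flip selects the i.i.d. law (p or q) for the whole path."""
    chosen = p if rng.random() < theta else q
    return sum(rng.random() < chosen for _ in range(n)) / n

rng = random.Random(0)
theta, p, q, n = 0.5, 0.9, 0.1, 10_000
ensemble_mean = theta * p + (1 - theta) * q       # = 0.5, the mixture average
time_averages = [z_process_sample_mean(theta, p, q, n, rng) for _ in range(5)]
print(ensemble_mean, time_averages)   # time averages cluster near 0.9 or 0.1, never 0.5
```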
A specific case for which ergodicity can be easily verified (other than the
case of i.i.d. sources) is the case of stationary Markov sources. Specifically, if a
(finite-alphabet) stationary Markov source is irreducible, then it is ergodic and
hence the Generalized AEP holds for this source. Note that irreducibility can
be verified in terms of the source’s transition probability matrix.
In more complicated situations, such as when the source is non-stationary (with time-varying statistics) and/or non-ergodic, the source entropy rate $H(\mathbf X)$ (if the limit exists; otherwise one can look at the $\liminf/\limsup$ of $(1/n)H(X^n)$) no longer has an operational meaning as the smallest possible compression rate. This creates the need to establish new entropy measures which appropriately characterize the operational limits of an arbitrary stochastic system with memory. This is achieved in [21], where Han and Verdú introduce the notions of inf/sup-entropy rates and illustrate the key role these entropy measures play in proving a general lossless block source coding theorem. More specifically, they demonstrate that for an arbitrary finite-alphabet source $\mathbf X = \{X^n = (X_1,X_2,\ldots,X_n)\}_{n=1}^{\infty}$ (not necessarily stationary and ergodic), the expression for the minimum achievable (block) source coding rate is given by the sup-entropy rate $\bar H(\mathbf X)$, defined by
$$\bar H(\mathbf X) \triangleq \inf\left\{\beta\in\mathbb R: \limsup_{n\to\infty}\Pr\left[-\frac{1}{n}\log P_{X^n}(X^n) > \beta\right] = 0\right\}.$$
More details will be provided in Part II of the book.
3.2.3 Redundancy for lossless block data compression
Shannon's block source coding theorem establishes that the smallest data compression rate for achieving an arbitrarily small error probability for stationary ergodic sources is given by the entropy rate. Thus one can define the source redundancy as the reduction in coding rate one can achieve via asymptotically lossless block source coding versus just using uniquely decodable (completely lossless for any value of the sourceword blocklength $n$) block source coding. In light of the fact that the former approach yields a source coding rate equal to the entropy rate while the latter approach yields a rate of $\log_2|\mathcal X|$, we therefore define the total block source-coding redundancy $\rho_t$ (in bits/source symbol) for a stationary ergodic source $\{X_n\}_{n=1}^{\infty}$ as
$$\rho_t \triangleq \log_2|\mathcal X| - H(\mathbf X).$$
Hence $\rho_t$ represents the amount of "useless" (or superfluous) statistical source information one can eliminate via binary⁹ block source coding.
If the source is i.i.d. and uniformly distributed, then its entropy rate is equal to $\log_2|\mathcal X|$ and, as a result, its redundancy is $\rho_t = 0$. This means that the source is incompressible, as expected, since in this case every sourceword $x^n$ belongs to the $\delta$-typical set $\mathcal F_n(\delta)$ for every $n>0$ and $\delta>0$ (i.e., $\mathcal F_n(\delta) = \mathcal X^n$), and hence there are no superfluous sourcewords that can be dispensed with via source coding. If the source has memory or has a non-uniform marginal distribution, then its redundancy is strictly positive and can be classified into two parts:

Source redundancy due to the non-uniformity of the source marginal distribution:
$$\rho_d \triangleq \log_2|\mathcal X| - H(X_1).$$
Source redundancy due to the source memory:
$$\rho_m \triangleq H(X_1) - H(\mathbf X).$$

As a result, the total source redundancy $\rho_t$ can be decomposed into two parts: $\rho_t = \rho_d + \rho_m$. We can summarize the redundancy of some typical stationary ergodic sources in the following table.
⁹ Since we are measuring $\rho_t$ in code bits/source symbol, all logarithms in its expression are in base 2, and hence this redundancy can be eliminated via asymptotically lossless binary block codes (one can also change the units to $D$-ary code symbols/source symbol by using base-$D$ logarithms for the case of $D$-ary block codes).
Source                            ρ_d                   ρ_m                   ρ_t
i.i.d. uniform                    0                     0                     0
i.i.d. non-uniform                log₂|X| − H(X₁)       0                     ρ_d
1st-order symmetric Markov¹⁰      0                     H(X₁) − H(X₂|X₁)      ρ_m
1st-order non-symmetric Markov    log₂|X| − H(X₁)       H(X₁) − H(X₂|X₁)      ρ_d + ρ_m

¹⁰ A first-order Markov process is symmetric if for any $x_1$ and $\hat x_1$,
$$\{a: a = P_{X_2|X_1}(y|x_1)\ \text{for some } y\} = \{a: a = P_{X_2|X_1}(y|\hat x_1)\ \text{for some } y\}.$$
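As a worked instance of the table (an illustrative sketch; the binary non-symmetric Markov source and its transition probabilities are assumptions chosen for the example), the following computes ρ_d, ρ_m and ρ_t from the formulas above, with the entropy rate taken as H(X₂|X₁) for a first-order stationary Markov source.

```python
import math

def hb(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Binary first-order Markov source with P(1|0) = beta, P(0|1) = alpha.
alpha, beta = 0.1, 0.3
pi0, pi1 = alpha / (alpha + beta), beta / (alpha + beta)   # stationary marginal

H1 = hb(pi1)                               # H(X_1), marginal entropy
Hrate = pi0 * hb(beta) + pi1 * hb(alpha)   # entropy rate = H(X_2 | X_1)

rho_d = 1.0 - H1        # log2|X| - H(X_1), with |X| = 2
rho_m = H1 - Hrate      # redundancy due to memory
rho_t = rho_d + rho_m   # total redundancy = log2|X| - entropy rate
print(rho_d, rho_m, rho_t)
```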
3.3 Variable-length codes for lossless data compression
3.3.1 Non-singular codes and uniquely decodable codes
We next study variable-length (completely) lossless data compression codes.
Definition 3.14 Consider a discrete source $\{X_n\}_{n=1}^{\infty}$ with finite alphabet $\mathcal X$ along with a $D$-ary code alphabet $\mathcal B = \{0,1,\ldots,D-1\}$, where $D>1$ is an integer. Fix an integer $n\ge 1$; then a $D$-ary $n$-th order variable-length code (VLC) is a map
$$f:\mathcal X^n\to\mathcal B^*$$
mapping (fixed-length) sourcewords of length $n$ to $D$-ary codewords in $\mathcal B^*$ of variable lengths, where $\mathcal B^*$ denotes the set of all finite-length strings from $\mathcal B$ (i.e., $c\in\mathcal B^*$ iff there exists an integer $l\ge 1$ such that $c\in\mathcal B^l$).

The codebook $\mathcal C$ of a VLC is the set of all codewords:
$$\mathcal C = f(\mathcal X^n) = \{f(x^n)\in\mathcal B^*: x^n\in\mathcal X^n\}.$$
A variable-length lossless data compression code is a code in which the
source symbols can be completely reconstructed without distortion. In order
to achieve this goal, the source symbols have to be encoded unambiguously in
the sense that any two different source symbols (with positive probabilities) are
represented by different codewords. Codes satisfying this property are called
non-singular codes. In practice however, the encoder often needs to encode a
sequence of source symbols, which results in a concatenated sequence of code-
words. If any concatenation of codewords can also be unambiguously recon-
structed without punctuation, then the code is said to be uniquely decodable. In
other words, a VLC is uniquely decodable if all finite sequences of sourcewords (each in $\mathcal X^n$) are mapped onto distinct strings of codewords; i.e., for any $m$ and $m'$, $(x^n_1, x^n_2,\ldots,x^n_m) \ne (y^n_1, y^n_2,\ldots,y^n_{m'})$ implies that
$$(f(x^n_1), f(x^n_2),\ldots,f(x^n_m)) \ne (f(y^n_1), f(y^n_2),\ldots,f(y^n_{m'})).$$
Note that a non-singular VLC is not necessarily uniquely decodable. For example, consider a binary (first-order) code for the source with alphabet $\mathcal X = \{A,B,C,D,E,F\}$ given by
$$f(A)=0,\quad f(B)=1,\quad f(C)=00,\quad f(D)=01,\quad f(E)=10,\quad f(F)=11.$$
The above code is clearly non-singular; it is however not uniquely decodable because the codeword sequence 010 can be reconstructed as $ABA$, $DA$ or $AE$ (i.e., $(f(A),f(B),f(A)) = (f(D),f(A)) = (f(A),f(E))$ even though $(A,B,A)$, $(D,A)$ and $(A,E)$ are all distinct).
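The ambiguity of the code above can be checked mechanically. The following Python sketch (illustrative only) enumerates all ways a given code-symbol string can be parsed into codewords of the non-singular code f.

```python
def parses(bits, codebook, prefix=()):
    """Return all sequences of source symbols whose concatenated codewords equal `bits`."""
    if not bits:
        return [prefix]
    out = []
    for symbol, word in codebook.items():
        if bits.startswith(word):
            out.extend(parses(bits[len(word):], codebook, prefix + (symbol,)))
    return out

f = {"A": "0", "B": "1", "C": "00", "D": "01", "E": "10", "F": "11"}
print(parses("010", f))
# [('A', 'B', 'A'), ('A', 'E'), ('D', 'A')] -- three distinct parses, so f is not uniquely decodable
```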
One important objective is to find out how "efficiently" we can represent a given discrete source via a uniquely decodable $n$-th order VLC, and to provide a construction technique that (at least asymptotically, as $n\to\infty$) attains the optimal "efficiency." In other words, we want to determine the smallest possible average code rate (or equivalently, average codeword length) that an $n$-th order uniquely decodable VLC can have when (losslessly) representing a given source, and we want to give an explicit code construction that can attain this smallest possible rate (at least asymptotically in the sourceword length $n$).
Definition 3.15 Let $\mathcal C$ be a $D$-ary $n$-th order VLC
$$f:\mathcal X^n\to\{0,1,\ldots,D-1\}^*$$
for a discrete source $\{X_n\}_{n=1}^{\infty}$ with alphabet $\mathcal X$ and distribution $P_{X^n}(x^n)$, $x^n\in\mathcal X^n$. Letting $\ell(c_{x^n})$ denote the length of the codeword $c_{x^n} = f(x^n)$ associated with sourceword $x^n$, the average codeword length for $\mathcal C$ is given by
$$\bar\ell \triangleq \sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\,\ell(c_{x^n}),$$
and its average code rate (in $D$-ary code symbols/source symbol) is given by
$$R_n \triangleq \frac{\bar\ell}{n} = \frac{1}{n}\sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\,\ell(c_{x^n}).$$
The following theorem provides a strong condition which a uniquely decodable code must satisfy.

Theorem 3.16 (Kraft inequality for uniquely decodable codes) Let $\mathcal C$ be a uniquely decodable $D$-ary $n$-th order VLC for a discrete source $\{X_n\}_{n=1}^{\infty}$ with alphabet $\mathcal X$. Let the $M = |\mathcal X|^n$ codewords of $\mathcal C$ have lengths $\ell_1,\ell_2,\ldots,\ell_M$, respectively. Then the following inequality must hold:
$$\sum_{m=1}^{M} D^{-\ell_m} \le 1.$$
Proof: Suppose that we use the codebook $\mathcal C$ to encode $N$ sourcewords ($x^n_i\in\mathcal X^n$, $i=1,\ldots,N$) arriving in a sequence; this yields a concatenated codeword sequence
$$c_1 c_2 c_3\cdots c_N.$$
Let the lengths of the codewords be respectively denoted by $\ell(c_1),\ell(c_2),\ldots,\ell(c_N)$. Consider the quantity
$$\sum_{c_1\in\mathcal C}\sum_{c_2\in\mathcal C}\cdots\sum_{c_N\in\mathcal C} D^{-[\ell(c_1)+\ell(c_2)+\cdots+\ell(c_N)]}.$$
It is obvious that the above expression is equal to
$$\left(\sum_{c\in\mathcal C} D^{-\ell(c)}\right)^{N} = \left(\sum_{m=1}^{M} D^{-\ell_m}\right)^{N}.$$
(Note that $|\mathcal C| = M$.) On the other hand, all the codeword sequences with total length
$$i = \ell(c_1)+\ell(c_2)+\cdots+\ell(c_N)$$
contribute equally to the sum, namely $D^{-i}$ each. Letting $A_i$ denote the number of $N$-codeword sequences that have total length $i$, the above identity can be re-written as
$$\left(\sum_{m=1}^{M} D^{-\ell_m}\right)^{N} = \sum_{i=1}^{L N} A_i D^{-i},$$
where
$$L \triangleq \max_{c\in\mathcal C}\ell(c).$$
Since $\mathcal C$ is by assumption uniquely decodable, distinct $N$-codeword sequences must correspond to distinct code-symbol strings; as there are at most $D^i$ distinct strings of length $i$, we have $A_i\le D^i$, and
$$\left(\sum_{m=1}^{M} D^{-\ell_m}\right)^{N} = \sum_{i=1}^{LN} A_i D^{-i} \le \sum_{i=1}^{LN} D^{i} D^{-i} = LN,$$
which implies that
$$\sum_{m=1}^{M} D^{-\ell_m} \le (LN)^{1/N}.$$
The proof is completed by noting that the above inequality holds for every $N$, and the upper bound $(LN)^{1/N}$ goes to 1 as $N$ goes to infinity. $\Box$
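In practice the Kraft inequality gives a quick necessary test for unique decodability from the codeword lengths alone. Below is a small illustrative sketch; the two example codes are the ones discussed in this chapter (the prefix code of Figure 3.5 and the non-singular code of the previous subsection).

```python
def kraft_sum(lengths, D=2):
    """Compute sum over m of D^{-l_m}; unique decodability requires this to be <= 1."""
    return sum(D ** (-l) for l in lengths)

prefix_code = ["00", "01", "10", "110", "1110", "1111"]   # from Figure 3.5
nonsingular = ["0", "1", "00", "01", "10", "11"]          # not uniquely decodable

print(kraft_sum(map(len, prefix_code)))   # 1.0 -> consistent with unique decodability
print(kraft_sum(map(len, nonsingular)))   # 2.0 -> violates the Kraft inequality
```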
The Kraft inequality is a very useful tool, especially for showing that the
fundamental lower bound of the average rate of uniquely decodable VLCs for
discrete memoryless sources is given by the source entropy.
Theorem 3.17 The average code rate of every uniquely decodable $D$-ary $n$-th order VLC for a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$ is lower-bounded by the source entropy $H_D(X)$ (measured in $D$-ary code symbols/source symbol).

Proof: Consider a uniquely decodable $D$-ary $n$-th order VLC
$$f:\mathcal X^n\to\{0,1,\ldots,D-1\}^*$$
for the source $\{X_n\}_{n=1}^{\infty}$ and let $\ell(c_{x^n})$ denote the length of the codeword $c_{x^n} = f(x^n)$ for sourceword $x^n$. Then, using $H_D(X^n) = nH_D(X)$ for a memoryless source,
$$R_n - H_D(X) = \frac{1}{n}\sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\ell(c_{x^n}) - \frac{1}{n}H_D(X^n)$$
$$= \frac{1}{n}\left[\sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\ell(c_{x^n}) - \sum_{x^n\in\mathcal X^n}\left(-P_{X^n}(x^n)\log_D P_{X^n}(x^n)\right)\right]$$
$$= \frac{1}{n}\sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\log_D\frac{P_{X^n}(x^n)}{D^{-\ell(c_{x^n})}}$$
$$\ge \frac{1}{n}\left[\sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\right]\log_D\frac{\sum_{x^n\in\mathcal X^n}P_{X^n}(x^n)}{\sum_{x^n\in\mathcal X^n}D^{-\ell(c_{x^n})}} \quad\text{(log-sum inequality)}$$
$$= -\frac{1}{n}\log_D\left[\sum_{x^n\in\mathcal X^n}D^{-\ell(c_{x^n})}\right] \ge 0,$$
where the last inequality follows from the Kraft inequality for uniquely decodable codes and the fact that the logarithm is a strictly increasing function. $\Box$
From the above theorem, we know that the average code rate is no smaller than the source entropy. Indeed, a lossless data compression code whose average code rate achieves the entropy is optimal (since if its average code rate were below the entropy, the Kraft inequality would be violated and the code would no longer be uniquely decodable). We summarize:

1. Unique decodability ⟹ the Kraft inequality holds.
2. Unique decodability ⟹ the average code rate of VLCs for memoryless sources is lower bounded by the source entropy.
Exercise 3.18
1. Find a non-singular and also non-uniquely decodable code that violates the
Kraft inequality. (Hint: The answer is already provided in this subsection.)
2. Find a non-singular and also non-uniquely decodable code that beats the
entropy lower bound.
Figure 3.4: Classification of variable-length codes: prefix codes ⊂ uniquely decodable codes ⊂ non-singular codes.
3.3.2 Prefix or instantaneous codes
A prefix code is a VLC which is self-punctuating in the sense that there is no need to append extra symbols for differentiating adjacent codewords. A more precise definition follows:
Definition 3.19 (Prefix code) A VLC is called a prefix code or an instanta-
neous code if no codeword is a prefix of any other codeword.
A prefix code is also named an instantaneous code because the codeword se-
quence can be decoded instantaneously (it is immediately recognizable) without
the reference to future codewords in the same sequence. Note that a uniquely
decodable code is not necessarily prefix-free and may not be decoded instanta-
neously. The relationship between different codes encountered thus far is de-
picted in Figure 3.4.
A $D$-ary prefix code can be represented graphically as an initial segment of a $D$-ary tree. An example of a tree representation of a binary ($D=2$) prefix code is shown in Figure 3.5.
Theorem 3.20 (Kraft inequality for prefix codes) There exists a $D$-ary $n$-th order prefix code for a discrete source $\{X_n\}_{n=1}^{\infty}$ with alphabet $\mathcal X$ iff the codeword lengths $\ell_m$, $m=1,\ldots,M$, satisfy the Kraft inequality, where $M=|\mathcal X|^n$.

Proof: Without loss of generality, we provide the proof for the case of $D=2$ (binary codes).

1. [The forward part] Prefix codes satisfy the Kraft inequality.
Figure 3.5: Tree structure of a binary prefix code. The codewords are those residing on the leaves, which in this case are 00, 01, 10, 110, 1110 and 1111.
The codewords of a prefix code can always be placed on a tree. Pick the length
$$\ell_{\max} \triangleq \max_{1\le m\le M}\ell_m.$$
A binary tree has $2^{\ell_{\max}}$ nodes on level $\ell_{\max}$. Each codeword of length $\ell_m$ obstructs $2^{\ell_{\max}-\ell_m}$ nodes on level $\ell_{\max}$. In other words, when any node is chosen as a codeword, all its descendants are excluded from being codewords (since for a prefix code, no codeword can be a prefix of any other codeword); there are exactly $2^{\ell_{\max}-\ell_m}$ such excluded nodes on level $\ell_{\max}$ of the tree. We therefore say that each codeword of length $\ell_m$ obstructs $2^{\ell_{\max}-\ell_m}$ nodes on level $\ell_{\max}$. Note that no two codewords obstruct the same nodes on level $\ell_{\max}$. Hence the total number of obstructed nodes on level $\ell_{\max}$ cannot exceed $2^{\ell_{\max}}$, i.e.,
$$\sum_{m=1}^{M} 2^{\ell_{\max}-\ell_m} \le 2^{\ell_{\max}},$$
which immediately implies the Kraft inequality:
$$\sum_{m=1}^{M} 2^{-\ell_m} \le 1.$$
(This part can also be proven by noting that a prefix code is a uniquely decodable code; the objective of the above argument is to illustrate the tree structure of a prefix code.)
2. [The converse part] The Kraft inequality implies the existence of a prefix code.

Suppose that $\ell_1,\ell_2,\ldots,\ell_M$ satisfy the Kraft inequality. We will show that there exists a binary tree with $M$ selected nodes, where the $i$-th node resides on level $\ell_i$.

Let $n_i$ be the number of nodes (among the $M$ nodes) residing on level $i$ (namely, $n_i$ is the number of codewords with length $i$, or $n_i = |\{m: \ell_m = i\}|$), and let
$$\ell_{\max} \triangleq \max_{1\le m\le M}\ell_m.$$
Then from the Kraft inequality, we have
$$n_1 2^{-1} + n_2 2^{-2} + \cdots + n_{\ell_{\max}} 2^{-\ell_{\max}} \le 1.$$
The above inequality can be re-written in a form that is more suitable for this proof:
$$n_1 2^{-1} \le 1,\qquad n_1 2^{-1} + n_2 2^{-2} \le 1,\qquad \ldots,\qquad n_1 2^{-1} + n_2 2^{-2} + \cdots + n_{\ell_{\max}} 2^{-\ell_{\max}} \le 1.$$
Hence,
$$n_1 \le 2,\qquad n_2 \le 2^2 - n_1 2,\qquad \ldots,\qquad n_{\ell_{\max}} \le 2^{\ell_{\max}} - n_1 2^{\ell_{\max}-1} - \cdots - n_{\ell_{\max}-1}2^{1},$$
which can be interpreted in terms of a tree model as follows: the first inequality says that the number of codewords of length 1 is no greater than the number of available nodes on the first level, which is 2. The second inequality says that the number of codewords of length 2 is no greater than the total number of nodes on the second level, which is $2^2$, minus the number of nodes obstructed by the first-level nodes already occupied by codewords. The succeeding inequalities demonstrate the availability of a sufficient number of nodes at each level after the nodes blocked by shorter-length codewords have been removed. Because this is true at every codeword length up to the maximum codeword length, the assertion of the theorem is proved. $\Box$
Theorems 3.16 and 3.20 unveil the following relation between a variable-
length uniquely decodable code and a prefix code.
Corollary 3.21 A uniquely decodable D-ary n-th order code can always be
replaced by a D-ary n-th order prefix code with the same average codeword
length (and hence the same average code rate).
The following theorem interprets the relationship between the average code rate of a prefix code and the source entropy.

Theorem 3.22 Consider a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$.

1. For any $D$-ary $n$-th order prefix code for the source, the average code rate is no less than the source entropy $H_D(X)$.
2. There must exist a $D$-ary $n$-th order prefix code for the source whose average code rate is no greater than $H_D(X) + \frac{1}{n}$, namely,
$$R_n \triangleq \frac{1}{n}\sum_{x^n\in\mathcal X^n}P_{X^n}(x^n)\,\ell(c_{x^n}) \le H_D(X) + \frac{1}{n},\qquad(3.3.1)$$
where $c_{x^n}$ is the codeword for sourceword $x^n$, and $\ell(c_{x^n})$ is the length of codeword $c_{x^n}$.
Proof: A prefix code is uniquely decodable, and hence it follows directly from Theorem 3.17 that its average code rate is no less than the source entropy.

To prove the second part, we design a prefix code satisfying both (3.3.1) and the Kraft inequality, which immediately implies the existence of the desired code by Theorem 3.20. Choose the codeword length for sourceword $x^n$ as
$$\ell(c_{x^n}) = \lfloor -\log_D P_{X^n}(x^n)\rfloor + 1. \qquad(3.3.2)$$
Then
$$D^{-\ell(c_{x^n})} \le P_{X^n}(x^n).$$
Summing both sides over all sourcewords, we obtain
$$\sum_{x^n\in\mathcal X^n} D^{-\ell(c_{x^n})} \le 1,$$
which is exactly the Kraft inequality. On the other hand, (3.3.2) implies
$$\ell(c_{x^n}) \le -\log_D P_{X^n}(x^n) + 1,$$
which in turn implies
$$\sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\ell(c_{x^n}) \le \sum_{x^n\in\mathcal X^n}\left(-P_{X^n}(x^n)\log_D P_{X^n}(x^n)\right) + \sum_{x^n\in\mathcal X^n}P_{X^n}(x^n) = H_D(X^n) + 1 = nH_D(X) + 1,$$
where the last equality holds since the source is memoryless. $\Box$
We note that $n$-th order prefix codes (which encode sourcewords of length $n$) for memoryless sources can yield an average code rate arbitrarily close to the source entropy when $n$ is allowed to grow without bound. For example, a memoryless source with alphabet $\{A, B, C\}$ and probability distribution
$$P_X(A) = 0.8,\qquad P_X(B) = P_X(C) = 0.1$$
has entropy equal to
$$-0.8\log_2 0.8 - 0.1\log_2 0.1 - 0.1\log_2 0.1 = 0.92 \text{ bits}.$$
One of the best binary first-order or single-letter (with $n=1$) prefix codes for this source is given by $c(A)=0$, $c(B)=10$ and $c(C)=11$, where $c(\cdot)$ is the encoding function. The resultant average code rate for this code is
$$0.8\times 1 + 0.2\times 2 = 1.2 \text{ bits} > 0.92 \text{ bits}.$$
Now if we consider a second-order ($n=2$) prefix code by encoding two consecutive source symbols at a time, the new source alphabet becomes
$$\{AA, AB, AC, BA, BB, BC, CA, CB, CC\},$$
and the resultant probability distribution is calculated, as the source is memoryless, by
$$P_{X^2}(x_1,x_2) = P_X(x_1)P_X(x_2) \qquad \forall\, x_1,x_2\in\{A,B,C\}.$$
Then one of the best binary prefix codes for this source is given by
source is given by
c(AA) = 0
c(AB) = 100
c(AC) = 101
c(BA) = 110
c(BB) = 111100
c(BC) = 111101
c(CA) = 1110
c(CB) = 111110
c(CC) = 111111.
The average code rate of this code now becomes
$$\frac{0.64(1\times 1) + 0.08(3\times 3 + 4\times 1) + 0.01(6\times 4)}{2} = 0.96 \text{ bits},$$
which is closer to the source entropy of 0.92 bits. As $n$ increases, the average code rate is brought closer to the source entropy.
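The 0.96-bit figure can be checked numerically. Below is a small sketch (illustrative only) that recomputes the average code rate of the second-order prefix code from the product distribution P_{X²}(x₁,x₂) = P_X(x₁)P_X(x₂).

```python
from itertools import product

pX = {"A": 0.8, "B": 0.1, "C": 0.1}
code = {"AA": "0", "AB": "100", "AC": "101", "BA": "110", "BB": "111100",
        "BC": "111101", "CA": "1110", "CB": "111110", "CC": "111111"}

# Average codeword length over sourceword pairs, divided by n = 2 source symbols.
avg_len = sum(pX[a] * pX[b] * len(code[a + b]) for a, b in product(pX, repeat=2))
print(avg_len / 2)   # average code rate ≈ 0.96 bits/source symbol
```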
From Theorems 3.17 and 3.22, we obtain the lossless variable-length source coding theorem for discrete memoryless sources.

Theorem 3.23 (Lossless variable-length source coding theorem) Fix integer $D>1$ and consider a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$ with distribution $P_X$ and entropy $H_D(X)$ (measured in $D$-ary units). Then the following hold.

Forward part (achievability): For any $\varepsilon>0$, there exists a $D$-ary $n$-th order prefix (hence uniquely decodable) code
$$f:\mathcal X^n\to\{0,1,\ldots,D-1\}^*$$
for the source with an average code rate $R_n$ satisfying
$$R_n \le H_D(X) + \varepsilon$$
for $n$ sufficiently large.

Converse part: Every uniquely decodable code
$$f:\mathcal X^n\to\{0,1,\ldots,D-1\}^*$$
for the source has an average code rate $R_n \ge H_D(X)$.

Thus, for a discrete memoryless source, its entropy $H_D(X)$ (measured in $D$-ary units) represents the smallest variable-length lossless compression rate for $n$ sufficiently large.

Proof: The forward part follows directly from Theorem 3.22 by choosing $n$ large enough such that $1/n < \varepsilon$, and the converse part is already given by Theorem 3.17. $\Box$
Observation 3.24 Theorem 3.23 actually also holds for the class of stationary sources upon replacing the source entropy $H_D(X)$ with the source entropy rate
$$H_D(\mathbf X) \triangleq \lim_{n\to\infty}\frac{1}{n}H_D(X^n),$$
measured in $D$-ary units. The proof is very similar to the proofs of Theorems 3.17 and 3.22 with slight modifications (such as using the fact that $\frac{1}{n}H_D(X^n)$ is non-increasing in $n$ for stationary sources).
3.3.3 Examples of binary prefix codes
A) Huffman codes: optimal variable-length codes
Given a discrete source with alphabet $\mathcal X$, we next construct an optimal binary first-order (single-letter) uniquely decodable variable-length code
$$f:\mathcal X\to\{0,1\}^*,$$
where optimality is in the sense that the code's average codeword length (or equivalently, its average code rate) is minimized over the class of all binary uniquely decodable codes for the source. Note that finding optimal $n$-th order codes with $n>1$ follows directly by considering $\mathcal X^n$ as a new source with expanded alphabet (i.e., by mapping $n$ source symbols at a time).

By Corollary 3.21, we remark that in our search for optimal uniquely decodable codes, we can restrict our attention to the (smaller) class of optimal prefix codes. We thus proceed by observing the following necessary conditions of optimality for binary prefix codes.
Lemma 3.25 Let $\mathcal C$ be an optimal binary prefix code with codeword lengths $\ell_i$, $i=1,\ldots,M$, for a source with alphabet $\mathcal X = \{a_1,\ldots,a_M\}$ and symbol probabilities $p_1,\ldots,p_M$. We assume, without loss of generality, that
$$p_1\ge p_2\ge p_3\ge\cdots\ge p_M,$$
and that any group of source symbols with identical probability is listed in order of increasing codeword length (i.e., if $p_i = p_{i+1} = \cdots = p_{i+s}$, then $\ell_i\le\ell_{i+1}\le\cdots\le\ell_{i+s}$). Then the following properties hold.

1. Higher-probability source symbols have shorter codewords: $p_i > p_j$ implies $\ell_i\le\ell_j$, for $i,j=1,\ldots,M$.
2. The two least probable source symbols have codewords of equal length: $\ell_{M-1} = \ell_M$.
3. Among the codewords of length $\ell_M$, two of the codewords are identical except in the last digit.

Proof:

1) If $p_i > p_j$ and $\ell_i > \ell_j$, then it is possible to construct a better code $\mathcal C'$ by interchanging ("swapping") codewords $i$ and $j$ of $\mathcal C$, since
$$\bar\ell(\mathcal C') - \bar\ell(\mathcal C) = p_i\ell_j + p_j\ell_i - (p_i\ell_i + p_j\ell_j) = (p_i - p_j)(\ell_j - \ell_i) < 0.$$
Hence code $\mathcal C'$ is better than code $\mathcal C$, contradicting the fact that $\mathcal C$ is optimal.
2) We first establish that $\ell_{M-1}\le\ell_M$, since: if $p_{M-1} > p_M$, then $\ell_{M-1}\le\ell_M$ by result 1) above; and if $p_{M-1} = p_M$, then $\ell_{M-1}\le\ell_M$ by our assumption about the ordering of codewords for source symbols with identical probability. Now, if $\ell_{M-1} < \ell_M$, we may delete the last digit of codeword $M$, and the deletion cannot result in another codeword since $\mathcal C$ is a prefix code. Thus the deletion forms a new prefix code with a smaller average codeword length than $\mathcal C$, contradicting the fact that $\mathcal C$ is optimal. Hence, we must have $\ell_{M-1} = \ell_M$.

3) Among the codewords of length $\ell_M$, if no two codewords agree in all digits except the last, then we may delete the last digit in all such codewords to obtain a better (still prefix) code, again contradicting optimality. $\Box$
The above observations suggest that if we can construct an optimal code for the reduced source obtained by combining the two least likely symbols, then we can construct an optimal overall code. Indeed, the following lemma, due to Huffman, follows from Lemma 3.25.

Lemma 3.26 (Huffman) Consider a source with alphabet $\mathcal X=\{a_1,\ldots,a_M\}$ and symbol probabilities $p_1,\ldots,p_M$ such that $p_1\ge p_2\ge\cdots\ge p_M$. Consider the reduced source alphabet $\mathcal Y$ obtained from $\mathcal X$ by combining the two least likely source symbols $a_{M-1}$ and $a_M$ into an equivalent symbol $a_{M-1,M}$ with probability $p_{M-1}+p_M$. Suppose that $\mathcal C'$, given by $f':\mathcal Y\to\{0,1\}^*$, is an optimal code for the reduced source $\mathcal Y$. We now construct a code $\mathcal C$, $f:\mathcal X\to\{0,1\}^*$, for the original source $\mathcal X$ as follows:

The codewords for symbols $a_1,a_2,\ldots,a_{M-2}$ are exactly the same as the corresponding codewords in $\mathcal C'$:
$$f(a_1)=f'(a_1),\quad f(a_2)=f'(a_2),\quad\ldots,\quad f(a_{M-2})=f'(a_{M-2}).$$
The codewords associated with symbols $a_{M-1}$ and $a_M$ are formed by appending a "0" and a "1", respectively, to the codeword $f'(a_{M-1,M})$ associated with the letter $a_{M-1,M}$ in $\mathcal C'$:
$$f(a_{M-1}) = [f'(a_{M-1,M})\,0] \quad\text{and}\quad f(a_M) = [f'(a_{M-1,M})\,1].$$

Then code $\mathcal C$ is optimal for the original source $\mathcal X$.
Figure 3.6: Example of the Huffman encoding. (Successive merging of the two least likely symbols: {0.25, 0.25, 0.25, 0.1, 0.1, 0.05} → {0.25, 0.25, 0.25, 0.15, 0.1} → {0.25, 0.25, 0.25, 0.25} → {0.5, 0.25, 0.25} → {0.5, 0.5} → {1.0}, yielding the codewords 00, 01, 10, 110, 1110 and 1111.)
Hence the problem of finding the optimal code for a source of alphabet size $M$ is reduced to the problem of finding an optimal code for the reduced source of alphabet size $M-1$. In turn, we can reduce the problem to one of size $M-2$, and so on. Indeed, the above lemma yields a recursive algorithm for constructing optimal binary prefix codes.
Huffman encoding algorithm: Repeatedly apply the above lemma until one is left
with a reduced source with two symbols. An optimal binary prefix code for this
source consists of the codewords 0 and 1. Then proceed backwards, constructing
(as outlined in the above lemma) optimal codes for each reduced source until
one arrives at the original source.
Example 3.27 Consider a source with alphabet {1,2,3,4,5,6}with symbol
probabilities 0.25,0.25,0.25,0.1,0.1 and 0.05, respectively. By following the
Huffman encoding procedure as shown in Figure 3.6, we obtain the Huffman
code as
00,01,10,110,1110,1111.
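To make the recursive construction concrete, here is a minimal Python sketch (an illustration, not part of the original notes) that builds a binary Huffman code by repeatedly merging the two least likely symbols, as in Lemma 3.26 and Figure 3.6. The specific bit assignments depend on how ties are resolved, so the output may differ from the codewords above while being equally optimal.

```python
import heapq

def huffman_code(probs):
    """Binary Huffman code via repeated merging of the two least likely
    symbols (Lemma 3.26).  probs: dict symbol -> probability.
    Returns dict symbol -> codeword string."""
    # Each heap entry: (probability, tie-breaker, list of symbols in the subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    code = {s: "" for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, grp1 = heapq.heappop(heap)   # least likely group
        p2, _, grp2 = heapq.heappop(heap)   # second least likely group
        # Prepend a bit: symbols merged earlier end up with longer codewords.
        for s in grp1:
            code[s] = "1" + code[s]
        for s in grp2:
            code[s] = "0" + code[s]
        heapq.heappush(heap, (p1 + p2, counter, grp1 + grp2))
        counter += 1
    return code

# Source of Example 3.27
probs = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.1, 5: 0.1, 6: 0.05}
print(huffman_code(probs))
```

Running this on the source of Example 3.27 yields an optimal code with the same multiset of codeword lengths, $\{2,2,2,3,4,4\}$, as the code above (the two probability-0.1 symbols may swap lengths depending on tie-breaking).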
Observation 3.28

• Huffman codes are not unique for a given source distribution; e.g., by inverting all the code bits of a Huffman code, one gets another Huffman code, or by resolving ties in different ways in the Huffman algorithm, one also obtains different Huffman codes (but all of these codes have the same minimal $\bar{R}_n$).

• One can obtain optimal codes that are not Huffman codes; e.g., by interchanging two codewords of the same length of a Huffman code, one can get another non-Huffman (but optimal) code. Furthermore, one can construct an optimal suffix code (i.e., a code in which no codeword can be a suffix of another codeword) from a Huffman code (which is a prefix code) by reversing the Huffman codewords.

• Binary Huffman codes always satisfy the Kraft inequality with equality (their code tree is "saturated"); e.g., see [13, p. 72].
• Any $n$-th order binary Huffman code $f:\mathcal{X}^n\to\{0,1\}^*$ for a stationary source $\{X_n\}_{n=1}^{\infty}$ with finite alphabet $\mathcal{X}$ satisfies
$$H(\mathcal{X}) \le \frac{1}{n}H(X^n) \le \bar{R}_n < \frac{1}{n}H(X^n) + \frac{1}{n},$$
where $H(\mathcal{X})$ denotes the source's entropy rate. Thus, as $n$ increases to infinity, $\bar{R}_n \to H(\mathcal{X})$, but the complexity as well as the encoding-decoding delay grows exponentially with $n$.
• Finally, note that non-binary (i.e., $D>2$) Huffman codes can also be constructed in a similar way as in the binary case, by designing a $D$-ary tree and iteratively applying Lemma 3.26, where now the $D$ least likely source symbols are combined at each stage. The only difference from the binary case is that we have to ensure that we are ultimately left with $D$ symbols at the last stage of the algorithm to guarantee the code's optimality. This is remedied by expanding the original source alphabet $\mathcal{X}$ by adding "dummy" symbols (each with zero probability) so that the alphabet size $|\mathcal{X}'|$ of the expanded source $\mathcal{X}'$ is the smallest integer greater than or equal to $|\mathcal{X}|$ with
$$|\mathcal{X}'| \equiv 1 \pmod{D-1}.$$
For example, if $|\mathcal{X}|=6$ and $D=3$ (ternary codes), we obtain $|\mathcal{X}'|=7$, meaning that we need to enlarge the original source alphabet $\mathcal{X}$ by adding one dummy (zero-probability) source symbol. We thus obtain that the necessary conditions for optimality of Lemma 3.25 also hold for $D$-ary prefix codes when replacing $\mathcal{X}$ with the expanded source $\mathcal{X}'$ and replacing "two" with "$D$" in the statement of the lemma. The resulting $D$-ary Huffman code is an optimal code for the original source $\mathcal{X}$ (e.g., see [18, Chap. 3] and [33, Chap. 11]).
B) Shannon-Fano-Elias code
Assume $\mathcal{X}=\{1,\ldots,M\}$ and $P_X(x)>0$ for all $x\in\mathcal{X}$. Define
$$F(x) \triangleq \sum_{a \le x} P_X(a), \qquad \bar{F}(x) \triangleq \sum_{a < x} P_X(a) + \frac{1}{2}P_X(x).$$

Encoder: For any $x\in\mathcal{X}$, express $\bar{F}(x)$ in decimal binary form, say
$$\bar{F}(x) = .c_1 c_2 \ldots c_k \ldots,$$
and take the first $k$ (fractional) bits as the codeword of source symbol $x$, i.e.,
$$(c_1, c_2, \ldots, c_k),$$
where $k \triangleq \lceil \log_2(1/P_X(x)) \rceil + 1$.

Decoder: Given codeword $(c_1,\ldots,c_k)$, compute the cumulative sum of $F(\cdot)$ starting from the smallest element in $\{1,2,\ldots,M\}$ until the first $x$ satisfying
$$F(x) \ge .c_1 \ldots c_k.$$
Then $x$ should be the original source symbol.
Proof of decodability: For any number $a\in[0,1]$, let $[a]_k$ denote the operation that chops the binary representation of $a$ after $k$ bits (i.e., removing the $(k+1)$th bit, the $(k+2)$th bit, etc.). Then
$$\bar{F}(x) - [\bar{F}(x)]_k < \frac{1}{2^k}.$$
Since $k = \lceil \log_2(1/P_X(x)) \rceil + 1$,
$$\frac{1}{2^k} \le \frac{1}{2}P_X(x) = \left[\sum_{a<x} P_X(a) + \frac{P_X(x)}{2}\right] - \sum_{a \le x-1} P_X(a) = \bar{F}(x) - F(x-1).$$
Hence,
$$F(x-1) = F(x-1) + \frac{1}{2^k} - \frac{1}{2^k} \le \bar{F}(x) - \frac{1}{2^k} < [\bar{F}(x)]_k.$$
In addition,
$$F(x) > \bar{F}(x) \ge [\bar{F}(x)]_k.$$
Consequently, $x$ is the first element satisfying
$$F(x) \ge .c_1 c_2 \ldots c_k.$$
Average codeword length:
$$\bar{\ell} = \sum_{x\in\mathcal{X}} P_X(x)\left(\left\lceil \log_2\frac{1}{P_X(x)} \right\rceil + 1\right) < \sum_{x\in\mathcal{X}} P_X(x)\left(\log_2\frac{1}{P_X(x)} + 2\right) = (H(X)+2) \text{ bits.}$$
Observation 3.29 The Shannon-Fano-Elias code is a prefix code.
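The following Python sketch (an illustration, not taken from the notes) implements the Shannon-Fano-Elias encoder and decoder exactly as described above for a small hypothetical pmf, and checks decodability.

```python
import math

def sfe_encode(x, pmf):
    """Shannon-Fano-Elias codeword of symbol x for pmf = [P_X(1),...,P_X(M)]
    over the alphabet {1,...,M}."""
    F_bar = sum(pmf[a - 1] for a in range(1, x)) + 0.5 * pmf[x - 1]
    k = math.ceil(math.log2(1.0 / pmf[x - 1])) + 1
    bits = []
    for _ in range(k):              # binary expansion of F_bar, truncated to k bits
        F_bar *= 2
        bit = int(F_bar)
        bits.append(bit)
        F_bar -= bit
    return bits

def sfe_decode(bits, pmf):
    """Return the first x with F(x) >= .c1...ck."""
    value = sum(b * 2.0 ** (-(i + 1)) for i, b in enumerate(bits))
    F = 0.0
    for x in range(1, len(pmf) + 1):
        F += pmf[x - 1]
        if F >= value:
            return x

pmf = [0.25, 0.5, 0.125, 0.125]          # hypothetical source on {1,2,3,4}
for x in range(1, 5):
    cw = sfe_encode(x, pmf)
    assert sfe_decode(cw, pmf) == x      # decodability check
    print(x, cw)
```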
3.3.4 Examples of universal lossless variable-length codes
In Section 3.3.3, we assumed that the source distribution is known, so that we can use either Huffman codes or Shannon-Fano-Elias codes to compress the source. What if the source distribution is not known a priori? Is it still possible to construct a completely lossless data compression code which is universally good (or asymptotically optimal) for all sources of interest? The answer is affirmative. Two such examples are the adaptive Huffman codes and the Lempel-Ziv codes (which, unlike Huffman and Shannon-Fano-Elias codes, map variable-length sourcewords onto codewords).
A) Adaptive Huffman code
A straightforward universal coding scheme is to use the empirical distribution
(or relative frequencies) as the true distribution, and then apply the optimal
Huffman code according to the empirical distribution. If the source is i.i.d.,
the relative frequencies will converge to its true marginal probability. Therefore,
such universal codes should be good for all i.i.d. sources. However, in order to get
an accurate estimation of the true distribution, one must observe a sufficiently
long sourceword sequence under which the coder will suffer a long delay. This
can be improved by using the adaptive universal Huffman code [19].
The working procedure of the adaptive Huffman code is as follows. Start
with an initial guess of the source distribution (based on the assumption that
76
the source is DMS). As a new source symbol arrives, encode the data in terms of
the Huffman coding scheme according to the current estimated distribution, and
then update the estimated distribution and the Huffman codebook according to
the newly arrived source symbol.
To be specific, let the source alphabet be $\mathcal{X} \triangleq \{a_1,\ldots,a_M\}$. Define
$$N(a_i|x^n) \triangleq \text{number of occurrences of } a_i \text{ in } x_1, x_2, \ldots, x_n.$$
Then the (current) relative frequency of $a_i$ is $N(a_i|x^n)/n$. Let $c_n(a_i)$ denote the Huffman codeword of source symbol $a_i$ with respect to the distribution
$$\left(\frac{N(a_1|x^n)}{n}, \frac{N(a_2|x^n)}{n}, \cdots, \frac{N(a_M|x^n)}{n}\right).$$
Now suppose that $x_{n+1} = a_j$. The codeword $c_n(a_j)$ is output, and the relative frequency of each source outcome becomes
$$\frac{N(a_j|x^{n+1})}{n+1} = \frac{n \times (N(a_j|x^n)/n) + 1}{n+1} \quad\text{and}\quad \frac{N(a_i|x^{n+1})}{n+1} = \frac{n \times (N(a_i|x^n)/n)}{n+1} \ \text{ for } i \neq j.$$
This observation results in the following distribution update policy:
$$P^{(n+1)}_{\hat{X}}(a_j) = \frac{n\,P^{(n)}_{\hat{X}}(a_j) + 1}{n+1} \quad\text{and}\quad P^{(n+1)}_{\hat{X}}(a_i) = \frac{n}{n+1}\,P^{(n)}_{\hat{X}}(a_i) \ \text{ for } i \neq j,$$
where $P^{(n+1)}_{\hat{X}}$ represents the estimate of the true distribution $P_X$ at time $n+1$.
Note that in the adaptive Huffman coding scheme, the encoder and decoder need not be re-designed after every symbol, but only when the estimated distribution changes sufficiently that the so-called sibling property is violated.
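A minimal, self-contained Python sketch of the update policy above (an illustration, not from the notes): the initial guess is taken here to be uniform, implemented via one pseudo-count per symbol, which is an assumption made only to avoid zero probabilities.

```python
def update_estimate(counts, symbol):
    """One step of the adaptive update policy: with n = sum of counts,
    P_hat^(n+1)(a_j) = (n*P_hat^(n)(a_j) + 1)/(n+1) for the observed symbol a_j,
    and P_hat^(n+1)(a_i) = n/(n+1) * P_hat^(n)(a_i) for i != j."""
    counts[symbol] += 1
    n = sum(counts.values())
    return {a: c / n for a, c in counts.items()}

# Initial guess (assumed): uniform, via one pseudo-count per symbol.
counts = {"a": 1, "b": 1, "c": 1}
for s in "abacab":
    # In the adaptive Huffman scheme, s is first encoded with the Huffman code
    # built for the *previous* estimate; the codebook is rebuilt only when the
    # sibling property of the code tree is violated.
    estimate = update_estimate(counts, s)
    print(s, estimate)
```

The decoder applies exactly the same count updates after decoding each symbol, so encoder and decoder stay synchronized without any side information.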
Definition 3.30 (Sibling property) A prefix code is said to have the sibling
property if its codetree satisfies:
1. every node in the code-tree (except for the root node) has a sibling (i.e.,
the code-tree is saturated), and
2. the nodes can be listed in non-decreasing order of probability with each node being adjacent to its sibling.
[Figure 3.7: Example of the sibling property based on the code tree from $P^{(16)}_{\hat{X}}$. The arguments in parentheses following $a_j$ indicate the codeword and the probability associated with $a_j$; $b$ denotes an internal node of the tree, with the assigned (partial) code as its subscript and the probability sum of all its children in parentheses. Leaves: $a_1(00, 3/8)$, $a_2(01, 1/4)$, $a_3(100, 1/8)$, $a_4(101, 1/8)$, $a_5(110, 1/16)$, $a_6(111, 1/16)$; internal nodes: $b_0(5/8)$, $b_1(3/8)$, $b_{10}(1/4)$, $b_{11}(1/8)$. The nodes can be listed as the adjacent sibling pairs $(b_0\,5/8,\,b_1\,3/8)$, $(a_1\,3/8,\,a_2\,1/4)$, $(b_{10}\,1/4,\,b_{11}\,1/8)$, $(a_3\,1/8,\,a_4\,1/8)$, $(a_5\,1/16,\,a_6\,1/16)$.]
The next observation indicates the fact that the Huffman code is the only
prefix code satisfying the sibling property.
Observation 3.31 A prefix code is a Huffman code iff it satisfies the sibling
property.
An example for a code tree satisfying the sibling property is shown in Fig-
ure 3.7. The first requirement is satisfied since the tree is saturated. The second
requirement can be checked by the node list in Figure 3.7.
If the next observation (say, at time $n=17$) is $a_3$, then its codeword 100 is output (using the Huffman code corresponding to $P^{(16)}_{\hat{X}}$). The estimated
[Figure 3.8: (Continuation of Figure 3.7.) Example of violation of the sibling property after observing a new symbol $a_3$ at $n=17$. Leaves: $a_1(00, 6/17)$, $a_2(01, 4/17)$, $a_3(100, 3/17)$, $a_4(101, 2/17)$, $a_5(110, 1/17)$, $a_6(111, 1/17)$; internal nodes: $b_0(10/17)$, $b_1(7/17)$, $b_{10}(5/17)$, $b_{11}(2/17)$. Note that node $a_1$ is not adjacent to its sibling $a_2$ in the ordered node list.]
distribution is updated as
$$P^{(17)}_{\hat{X}}(a_1) = \frac{16 \times (3/8)}{17} = \frac{6}{17}, \qquad P^{(17)}_{\hat{X}}(a_2) = \frac{16 \times (1/4)}{17} = \frac{4}{17},$$
$$P^{(17)}_{\hat{X}}(a_3) = \frac{16 \times (1/8) + 1}{17} = \frac{3}{17}, \qquad P^{(17)}_{\hat{X}}(a_4) = \frac{16 \times (1/8)}{17} = \frac{2}{17},$$
$$P^{(17)}_{\hat{X}}(a_5) = \frac{16 \times (1/16)}{17} = \frac{1}{17}, \qquad P^{(17)}_{\hat{X}}(a_6) = \frac{16 \times (1/16)}{17} = \frac{1}{17}.$$
The sibling property is then violated (cf. Figure 3.8). Hence, the codebook needs to be updated according to the new estimated distribution, and the observation at $n=18$ will be encoded using the new codebook shown in Figure 3.9. Details about adaptive Huffman codes can be found in [19].
[Figure 3.9: (Continuation of Figure 3.8.) Updated Huffman code: $a_1(10, 6/17)$, $a_2(00, 4/17)$, $a_3(01, 3/17)$, $a_4(110, 2/17)$, $a_5(1110, 1/17)$, $a_6(1111, 1/17)$, with internal nodes $b_0(7/17)$, $b_1(10/17)$, $b_{11}(4/17)$, $b_{111}(2/17)$. The sibling property now holds for the new code.]
B) Lempel-Ziv codes
We now introduce a well-known and feasible universal coding scheme, which is
named after its inventors, Lempel and Ziv (e.g., cf. [12]). These codes, unlike
Huffman and Shannon-Fano-Elias codes, map variable-length sourcewords (rather than fixed-length sourcewords) onto codewords.
Suppose the source alphabet is binary. Then the Lempel-Ziv encoder can be
described as follows.
Encoder:
1. Parse the input sequence into strings that have never appeared before. For example, if the input sequence is $1011010100010\ldots$, the algorithm first takes the first letter 1 and finds that it has never appeared before, so 1 is the first string. It then takes the second letter 0, determines that it has not appeared before, and makes it the next string. The algorithm moves on to the next letter 1 and finds that this string has already appeared; hence, it appends the following letter to form the new string 11, and so on. Under this procedure, the source sequence is parsed into the strings
1, 0, 11, 01, 010, 00, 10.
2. Let $L$ be the number of distinct strings of the parsed source. Then we need $\lceil \log_2 L \rceil$ bits to index these strings (starting from one). In the above example, the indices are:

parsed source : 1   0   11   01   010   00   10
index         : 001 010 011  100  101   110  111

The codeword of each string is then the index of its prefix concatenated with the last bit of its source string. For example, the codeword of source string 010 will be the index of 01, i.e., 100, concatenated with the last bit of the source string, i.e., 0. Through this procedure, encoding the above parsed strings with $\lceil \log_2 L \rceil = 3$ yields the codeword sequence
(000,1)(000,0)(001,1)(010,1)(100,0)(010,0)(001,0)
or equivalently,
0001000000110101100001000010.
Note that the conventional Lempel-Ziv encoder requires two passes: the first pass to determine $L$, and the second pass to generate the codewords. The algorithm, however, can be modified so that it requires only one pass over the entire source string. Also note that the above algorithm uses an equal number of bits ($\lceil \log_2 L \rceil$) for all the location indices, which can also be relaxed by proper modification.
Decoder: The decoding is straightforward from the encoding procedure.
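The parsing and encoding steps can be summarized in the following Python sketch (an illustration under the stated conventions, not the notes' own implementation). Index 0 is reserved for the empty prefix, so the sketch uses $\lceil \log_2(L+1)\rceil$ bits per index, which equals 3 for the example above.

```python
import math

def lz_parse(source):
    """Parse the source into strings never seen before (incremental parsing)."""
    phrases, current, seen = [], "", set()
    for bit in source:
        current += bit
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    return phrases          # any incomplete final phrase is dropped in this sketch

def lz_encode(source):
    phrases = lz_parse(source)
    width = math.ceil(math.log2(len(phrases) + 1))   # bits per index (0 = empty prefix)
    index = {p: i + 1 for i, p in enumerate(phrases)}
    out = []
    for p in phrases:
        prefix_idx = index.get(p[:-1], 0)            # 0 encodes the empty prefix
        out.append(format(prefix_idx, "0{}b".format(width)) + p[-1])
    return "".join(out)

print(lz_parse("1011010100010"))   # ['1', '0', '11', '01', '010', '00', '10']
print(lz_encode("1011010100010"))  # 0001000000110101100001000010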
Theorem 3.32 The above algorithm asymptotically achieves the entropy rate
of any (unknown statistics) stationary ergodic source.
Proof: Please refer to [12, Sec. 13.5]. 2
Chapter 4
Data Transmission and Channel
Capacity
4.1 Principles of data transmission
A noisy communication channel is an input-output medium in which the output
is not completely or deterministically specified by the input. The channel is
indeed stochastically modeled, where given channel input x, the channel output
yis governed by a transition (conditional) probability distribution denoted by
PY|X(y|x). Since two different inputs may give rise to the same output, the
receiver, upon receipt of an output, needs to guess the most probable sent in-
put. In general, words of length nare sent and received over the channel; in
this case, the channel is characterized by a sequence of n-dimensional transition
distributions PYn|Xn(yn|xn), for n= 1,2,···. A block diagram depicting a data
transmission or channel coding system (with no feedback1) is given in Figure 4.1.
[Figure 4.1: A data transmission system, where $W$ represents the message for transmission, $X^n$ denotes the codeword corresponding to message $W$, $Y^n$ represents the received word due to channel input $X^n$, and $\hat{W}$ denotes the reconstructed message from $Y^n$. The system consists of the cascade: channel encoder $\to$ channel $P_{Y^n|X^n}(\cdot|\cdot)$ $\to$ channel decoder.]
1The capacity of channels with (output) feedback will be studied in Part II of the book.
The designer of a data transmission (or channel) code needs to carefully
select codewords from the set of channel input words (of a given length) so
that a minimal ambiguity is obtained at the channel receiver. For example,
suppose that a channel has binary input and output alphabets and that its
transition probability distribution induces the following conditional probability
on its output symbols given that input words of length 2 are sent:
$$P_{Y|X^2}(y=0 \mid x^2=00) = P_{Y|X^2}(y=0 \mid x^2=01) = 1,$$
$$P_{Y|X^2}(y=1 \mid x^2=10) = P_{Y|X^2}(y=1 \mid x^2=11) = 1$$
(graphically, the transition diagram maps the input words 00 and 01 to the output 0, and the input words 10 and 11 to the output 1, each with probability 1),
and a binary message (either event Aor event B) is required to be transmitted
from the sender to the receiver. Then the data transmission code with (codeword
00 for event A, codeword 10 for event B) obviously induces less ambiguity at
the receiver than the code with (codeword 00 for event A, codeword 01 for event
B).
In short, the objective in designing a data transmission (or channel) code
is to transform a noisy channel into a reliable medium for sending messages
and recovering them at the receiver with minimal loss. To achieve this goal, the
designer of a data transmission code needs to take advantage of the common parts
between the sender and the receiver sites that are least affected by the channel
noise. We will see that these common parts are probabilistically captured by the
mutual information between the channel input and the channel output.
As illustrated in the previous example, if a “least-noise-affected” subset of
the channel input words is appropriately selected as the set of codewords, the
messages intended to be transmitted can be reliably sent to the receiver with
arbitrarily small error. One then raises the question:
What is the maximum amount of information (per channel use) that
can be reliably transmitted over a given noisy channel ?
In the above example, we can transmit a binary message error-free, and hence
the amount of information that can be reliably transmitted is at least 1 bit
per channel use (or channel symbol). It can be expected that the amount of
information that can be reliably transmitted for a highly noisy channel should
be less than that for a less noisy channel. But such a comparison requires a good
measure of the noisiness of channels.
From an information theoretic viewpoint, channel capacity provides a good
measure of the noisiness of a channel; it is defined as the maximal amount of information (per channel use) that can be transmitted via a data
transmission code over the channel and recovered with arbitrarily small proba-
bility of error at the receiver. In addition to its dependence on the channel
transition distribution, channel capacity also depends on the coding constraint
imposed on the channel input, such as “only block (fixed-length) codes are al-
lowed.” When no coding constraints are applied on the channel input (so that
variable-length codes can be employed), the derivation of the channel capacity
is usually viewed as a hard problem, and is only partially solved so far. In
this chapter, we will introduce the channel capacity for block codes (namely,
only block transmission code can be used). Throughout the chapter, the noisy
channel is assumed to be memoryless (as defined in the next section).
4.2 Discrete memoryless channels
Definition 4.1 (Discrete channel) A discrete communication channel is char-
acterized by
• A finite input alphabet $\mathcal{X}$.

• A finite output alphabet $\mathcal{Y}$.

• A sequence of $n$-dimensional transition distributions $\{P_{Y^n|X^n}(y^n|x^n)\}_{n=1}^{\infty}$ such that $\sum_{y^n \in \mathcal{Y}^n} P_{Y^n|X^n}(y^n|x^n) = 1$ for every $x^n \in \mathcal{X}^n$, where $x^n = (x_1,\cdots,x_n) \in \mathcal{X}^n$ and $y^n = (y_1,\cdots,y_n) \in \mathcal{Y}^n$. We assume that the above sequence of $n$-dimensional distributions is consistent, i.e.,
$$P_{Y^i|X^i}(y^i|x^i) = \frac{\sum_{x_{i+1}\in\mathcal{X}} \sum_{y_{i+1}\in\mathcal{Y}} P_{X^{i+1}}(x^{i+1})\,P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1})}{\sum_{x_{i+1}\in\mathcal{X}} P_{X^{i+1}}(x^{i+1})} = \sum_{x_{i+1}\in\mathcal{X}} \sum_{y_{i+1}\in\mathcal{Y}} P_{X_{i+1}|X^i}(x_{i+1}|x^i)\,P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1})$$
for every $x^i$, $y^i$, $P_{X_{i+1}|X^i}$ and $i = 1,2,\cdots$.
In general, real-world communications channels exhibit statistical memory
in the sense that current channel outputs statistically depend on past outputs
as well as past, current and (possibly) future inputs. However, for the sake of
simplicity, we restrict our attention in this chapter to the class of memoryless
channels (channels with memory will later be treated in Volume II).
Definition 4.2 (Discrete memoryless channel) A discrete memoryless chan-
nel (DMC) is a channel whose sequence of transition distributions $P_{Y^n|X^n}$ satisfies
$$P_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} P_{Y|X}(y_i|x_i) \qquad (4.2.1)$$
for every $n = 1,2,\cdots$, $x^n \in \mathcal{X}^n$ and $y^n \in \mathcal{Y}^n$. In other words, a DMC is fully described by the channel's transition distribution matrix $\mathbf{Q} \triangleq [p_{x,y}]$ of size $|\mathcal{X}| \times |\mathcal{Y}|$, where
$$p_{x,y} \triangleq P_{Y|X}(y|x)$$
for $x \in \mathcal{X}$, $y \in \mathcal{Y}$. Furthermore, the matrix $\mathbf{Q}$ is stochastic; i.e., the sum of the entries in each of its rows is equal to 1, since $\sum_{y\in\mathcal{Y}} p_{x,y} = 1$ for all $x \in \mathcal{X}$.
Observation 4.3 We note that the DMC’s condition (4.2.1) is actually equiv-
alent to the following two sets of conditions:
$$P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1}) = P_{Y|X}(y_n|x_n) \quad \forall\, n = 1,2,\cdots,\ x^n,\ y^n; \qquad (4.2.2\text{a})$$
$$P_{Y^{n-1}|X^n}(y^{n-1}|x^n) = P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1}) \quad \forall\, n = 2,3,\cdots,\ x^n,\ y^{n-1}; \qquad (4.2.2\text{b})$$
and
$$P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1}) = P_{Y|X}(y_n|x_n) \quad \forall\, n = 1,2,\cdots,\ x^n,\ y^n; \qquad (4.2.3\text{a})$$
$$P_{X_n|X^{n-1},Y^{n-1}}(x_n|x^{n-1}, y^{n-1}) = P_{X_n|X^{n-1}}(x_n|x^{n-1}) \quad \forall\, n = 1,2,\cdots,\ x^n,\ y^{n-1}. \qquad (4.2.3\text{b})$$
Condition (4.2.2a) (equivalently, (4.2.3a)) implies that the current output $Y_n$ depends only on the current input $X_n$ and not on past inputs $X^{n-1}$ and outputs $Y^{n-1}$. Condition (4.2.2b) indicates that the past outputs $Y^{n-1}$ do not depend on the current input $X_n$. These two conditions together give
$$P_{Y^n|X^n}(y^n|x^n) = P_{Y^{n-1}|X^n}(y^{n-1}|x^n)\,P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1}) = P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})\,P_{Y|X}(y_n|x_n);$$
hence, (4.2.1) holds recursively in $n = 1,2,\cdots$. The converse (i.e., (4.2.1) implies both (4.2.2a) and (4.2.2b)) is a direct consequence of
$$P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1}) = \frac{P_{Y^n|X^n}(y^n|x^n)}{\sum_{y_n\in\mathcal{Y}} P_{Y^n|X^n}(y^n|x^n)} \quad\text{and}\quad P_{Y^{n-1}|X^n}(y^{n-1}|x^n) = \sum_{y_n\in\mathcal{Y}} P_{Y^n|X^n}(y^n|x^n).$$
Similarly, (4.2.3b) states that the current input $X_n$ is independent of past outputs $Y^{n-1}$, which together with (4.2.3a) implies again
$$P_{Y^n|X^n}(y^n|x^n) = \frac{P_{X^n,Y^n}(x^n, y^n)}{P_{X^n}(x^n)} = \frac{P_{X^{n-1},Y^{n-1}}(x^{n-1}, y^{n-1})\,P_{X_n|X^{n-1},Y^{n-1}}(x_n|x^{n-1}, y^{n-1})\,P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1})}{P_{X^{n-1}}(x^{n-1})\,P_{X_n|X^{n-1}}(x_n|x^{n-1})} = P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})\,P_{Y|X}(y_n|x_n),$$
hence recursively yielding (4.2.1). The converse for (4.2.3b)—i.e., (4.2.1) implying (4.2.3b)—can be analogously proved by noting that
$$P_{X_n|X^{n-1},Y^{n-1}}(x_n|x^{n-1}, y^{n-1}) = \frac{P_{X^n}(x^n) \sum_{y_n\in\mathcal{Y}} P_{Y^n|X^n}(y^n|x^n)}{P_{X^{n-1}}(x^{n-1})\,P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})}.$$
Note that the above definition of a DMC in (4.2.1) rules out the use of channel feedback, in which the current channel input $x_n$ can depend on past channel outputs $y^{n-1}$ in addition to the message (such a dependence violates conditions (4.2.2b) and (4.2.3b)). Therefore, condition (4.2.2a) will instead be used to define a DMC with feedback (feedback will be considered in Part II of this book).
Examples of DMCs:
1. Identity (noiseless) channels: An identity channel has equal-size input and
output alphabets (|X| =|Y|) and channel transition probability satisfying
$$P_{Y|X}(y|x) = \begin{cases} 1 & \text{if } y = x, \\ 0 & \text{if } y \neq x. \end{cases}$$
This is a noiseless or perfect channel as the channel input is received error-
free at the channel output.
2. Binary symmetric channels: A binary symmetric channel (BSC) is a chan-
nel with binary input and output alphabets such that each input has a
(conditional) probability given by εfor being received inverted at the out-
put, where ε[0,1] is called the channel’s crossover probability or bit error
rate. The channel’s transition distribution matrix is given by
$$\mathbf{Q} = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,1} \\ p_{1,0} & p_{1,1} \end{bmatrix} = \begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(1|0) \\ P_{Y|X}(0|1) & P_{Y|X}(1|1) \end{bmatrix} = \begin{bmatrix} 1-\varepsilon & \varepsilon \\ \varepsilon & 1-\varepsilon \end{bmatrix} \qquad (4.2.4)$$
and can be graphically represented via a transition diagram as shown in Figure 4.2.

[Figure 4.2: Binary symmetric channel (each input is received correctly with probability $1-\varepsilon$ and flipped with probability $\varepsilon$).]
If we set ε= 0, then the BSC reduces to the binary identity (noiseless)
channel. The channel is called “symmetric” since PY|X(1|0) = PY|X(0|1);
i.e., it has the same probability for flipping an input bit into a 0 or a 1.
A detailed discussion of DMCs with various symmetry properties is given at the end of this chapter.
Despite its simplicity, the BSC is rich enough to capture most of the
complexity of coding problems over more general channels. For exam-
ple, it can exactly model the behavior of practical channels with additive
memoryless Gaussian noise used in conjunction with binary symmetric modulation and hard-decision demodulation (e.g., see [44, p. 240]). It is also worth pointing out that the BSC can be equivalently represented via a binary modulo-2 additive noise channel whose output at time $i$ is the modulo-2 sum of its input and noise variables:
$$Y_i = X_i \oplus Z_i \quad\text{for } i = 1,2,\cdots,$$
where $\oplus$ denotes addition modulo 2; $Y_i$, $X_i$ and $Z_i$ are the channel output, input and noise, respectively, at time $i$; the alphabets $\mathcal{X}=\mathcal{Y}=\mathcal{Z}=\{0,1\}$ are all binary; it is assumed that $X_i$ and $Z_j$ are independent of each other for all $i,j = 1,2,\cdots$; and the noise process is a Bernoulli($\varepsilon$) process, i.e., a binary i.i.d. process with $\Pr[Z=1] = \varepsilon$.
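As a quick illustration of this additive-noise representation (a sketch, not part of the notes), the following Python snippet simulates a BSC by adding Bernoulli($\varepsilon$) noise modulo 2 and checks that the empirical bit error rate is close to $\varepsilon$.

```python
import random

def bsc(x_bits, eps, seed=None):
    """Simulate a BSC with crossover probability eps as the modulo-2
    additive-noise channel Y_i = X_i XOR Z_i, with Z_i i.i.d. Bernoulli(eps)."""
    rng = random.Random(seed)
    return [x ^ (1 if rng.random() < eps else 0) for x in x_bits]

x = [0, 1, 1, 0, 1] * 2000
y = bsc(x, eps=0.1, seed=0)
empirical_ber = sum(xi != yi for xi, yi in zip(x, y)) / len(x)
print(empirical_ber)   # close to 0.1
```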
3. Binary erasure channels: In the BSC, some input bits are received perfectly
and others are received corrupted (flipped) at the channel output. In some
channels however, some input bits are lost during transmission instead of
being received corrupted (for example, packets in data networks may get
dropped or blocked due to congestion or bandwidth constraints). In this
case, the receiver knows the exact location of these bits in the received
bitstream or codeword, but not their actual value. Such bits are then
declared as “erased” during transmission and are called “erasures.” This
gives rise to the so-called binary erasure channel (BEC) as illustrated in
Figure 4.3, with input alphabet X={0,1}and output alphabet Y=
$\{0, E, 1\}$, where $E$ represents an erasure, and channel transition matrix given by
$$\mathbf{Q} = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,E} & p_{0,1} \\ p_{1,0} & p_{1,E} & p_{1,1} \end{bmatrix} = \begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(E|0) & P_{Y|X}(1|0) \\ P_{Y|X}(0|1) & P_{Y|X}(E|1) & P_{Y|X}(1|1) \end{bmatrix} = \begin{bmatrix} 1-\alpha & \alpha & 0 \\ 0 & \alpha & 1-\alpha \end{bmatrix} \qquad (4.2.5)$$
where $0 \le \alpha \le 1$ is called the channel's erasure probability.
4. Binary channels with errors and erasures: One can combine the BSC with
the BEC to obtain a binary channel with both errors and erasures, as
shown in Figure 4.4. We will call such channel the binary symmetric
erasure channel (BSEC). In this case, the channel’s transition matrix is
given by
Q= [px,y] = p0,0p0,E p0,1
p1,0p1,E p1,1=1εα α ε
ε α 1εα(4.2.6)
where ε, α [0,1] are the channel’s crossover and erasure probabilities,
respectively. Clearly, setting α= 0 reduces the BSEC to the BSC, and
setting ε= 0 reduces the BSEC to the BEC.
More generally, the channel need not have a symmetric property in the
sense of having identical transition distributions when inputs bits 0 or 1
[Figure 4.3: Binary erasure channel (each input is received correctly with probability $1-\alpha$ and erased, i.e., mapped to $E$, with probability $\alpha$).]
More generally, the channel need not have a symmetric property in the sense of having identical transition distributions when input bits 0 or 1 are sent. For example, the channel's transition matrix can be given by
$$\mathbf{Q} = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,E} & p_{0,1} \\ p_{1,0} & p_{1,E} & p_{1,1} \end{bmatrix} = \begin{bmatrix} 1-\varepsilon-\alpha & \alpha & \varepsilon \\ \varepsilon' & \alpha' & 1-\varepsilon'-\alpha' \end{bmatrix} \qquad (4.2.7)$$
where in general $\varepsilon' \neq \varepsilon$ and $\alpha' \neq \alpha$. We call such a channel an asymmetric channel with errors and erasures (this model might be useful to represent practical channels using asymmetric or non-uniform modulation constellations).
[Figure 4.4: Binary symmetric erasure channel (crossover probability $\varepsilon$, erasure probability $\alpha$).]
5. $q$-ary symmetric channels: Given an integer $q \ge 2$, the $q$-ary symmetric channel is a non-binary extension of the BSC; it has alphabets $\mathcal{X}=\mathcal{Y}=\{0,1,\cdots,q-1\}$ of size $q$ and channel transition matrix given by
$$\mathbf{Q} = [p_{x,y}] = \begin{bmatrix} 1-\varepsilon & \frac{\varepsilon}{q-1} & \cdots & \frac{\varepsilon}{q-1} \\ \frac{\varepsilon}{q-1} & 1-\varepsilon & \cdots & \frac{\varepsilon}{q-1} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\varepsilon}{q-1} & \frac{\varepsilon}{q-1} & \cdots & 1-\varepsilon \end{bmatrix} \qquad (4.2.8)$$
where $0 \le \varepsilon \le 1$ is the channel's symbol error rate (or probability). When $q=2$, the channel reduces to the BSC with bit error rate $\varepsilon$, as expected.

As with the BSC, the $q$-ary symmetric channel can be expressed as a modulo-$q$ additive noise channel with common input, output and noise alphabets $\mathcal{X}=\mathcal{Y}=\mathcal{Z}=\{0,1,\cdots,q-1\}$ and whose output $Y_i$ at time $i$ is given by $Y_i = X_i \oplus_q Z_i$, for $i=1,2,\cdots$, where $\oplus_q$ denotes addition modulo $q$, and $X_i$ and $Z_i$ are the channel's input and noise variables, respectively, at time $i$. Here, the noise process $\{Z_n\}_{n=1}^{\infty}$ is assumed to be an i.i.d. process with distribution
$$\Pr[Z=0] = 1-\varepsilon \quad\text{and}\quad \Pr[Z=a] = \frac{\varepsilon}{q-1} \ \ \forall\, a \in \{1,\cdots,q-1\}.$$
It is also assumed that the input and noise processes are independent of each other.
6. $q$-ary erasure channels: Given an integer $q \ge 2$, one can also consider a non-binary extension of the BEC, yielding the so-called $q$-ary erasure channel. Specifically, this channel has input and output alphabets given by $\mathcal{X}=\{0,1,\cdots,q-1\}$ and $\mathcal{Y}=\{0,1,\cdots,q-1,E\}$, respectively, where $E$ denotes an erasure, and channel transition distribution given by
$$P_{Y|X}(y|x) = \begin{cases} 1-\alpha & \text{if } y = x,\ x \in \mathcal{X}, \\ \alpha & \text{if } y = E,\ x \in \mathcal{X}, \\ 0 & \text{otherwise}, \end{cases} \qquad (4.2.9)$$
where $0 \le \alpha \le 1$ is the erasure probability. As expected, setting $q=2$ reduces the channel to the BEC.
4.3 Block codes for data transmission over DMCs
Definition 4.4 (Fixed-length data transmission code) Given positive integers $n$ and $M$, and a discrete channel with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$, a fixed-length data transmission code (or block code) for this channel with blocklength $n$ and rate $\frac{1}{n}\log_2 M$ message bits per channel symbol (or channel use) is denoted by $\mathcal{C}_n = (n, M)$ and consists of:

1. $M$ information messages intended for transmission.

2. An encoding function
$$f: \{1,2,\ldots,M\} \to \mathcal{X}^n$$
yielding codewords $f(1), f(2), \cdots, f(M) \in \mathcal{X}^n$, each of length $n$. The set of these $M$ codewords is called the codebook and we also usually write $\mathcal{C}_n = \{f(1), f(2), \cdots, f(M)\}$ to list the codewords.

3. A decoding function $g: \mathcal{Y}^n \to \{1,2,\ldots,M\}$.

The set $\{1,2,\ldots,M\}$ is called the message set and we assume that a message $W$ follows a uniform distribution over the set of messages: $\Pr[W=w] = \frac{1}{M}$ for all $w \in \{1,2,\ldots,M\}$. A block diagram for the channel code is given at the beginning of this chapter; see Figure 4.1. As depicted in the diagram, to convey message $W$ over the channel, the encoder sends its corresponding codeword $X^n = f(W)$ at the channel input. Finally, $Y^n$ is received at the channel output (according to the memoryless channel distribution $P_{Y^n|X^n}$) and the decoder yields $\hat{W} = g(Y^n)$ as the message estimate.
Definition 4.5 (Average probability of error) The average probability of error for a channel block code $\mathcal{C}_n = (n, M)$ with encoder $f(\cdot)$ and decoder $g(\cdot)$ used over a channel with transition distribution $P_{Y^n|X^n}$ is defined as
$$P_e(\mathcal{C}_n) \triangleq \frac{1}{M} \sum_{w=1}^{M} \lambda_w(\mathcal{C}_n),$$
where
$$\lambda_w(\mathcal{C}_n) \triangleq \Pr[\hat{W} \neq W \mid W = w] = \Pr[g(Y^n) \neq w \mid X^n = f(w)] = \sum_{y^n \in \mathcal{Y}^n:\, g(y^n) \neq w} P_{Y^n|X^n}(y^n|f(w))$$
is the code's conditional probability of decoding error given that message $w$ is sent over the channel.

Note that, since we have assumed that the message $W$ is drawn uniformly from the set of messages, we have that
$$P_e(\mathcal{C}_n) = \Pr[\hat{W} \neq W].$$
Observation 4.6 Another, more conservative, error criterion is the so-called maximal probability of error
$$\lambda(\mathcal{C}_n) \triangleq \max_{w \in \{1,2,\cdots,M\}} \lambda_w(\mathcal{C}_n).$$
Clearly, $P_e(\mathcal{C}_n) \le \lambda(\mathcal{C}_n)$; so one might expect that $P_e(\mathcal{C}_n)$ behaves differently than $\lambda(\mathcal{C}_n)$. However, it can be shown that from a code $\mathcal{C}_n = (n, M)$ with arbitrarily small $P_e(\mathcal{C}_n)$, one can construct (by throwing away from $\mathcal{C}_n$ the half of its codewords with largest conditional probability of error) a code $\mathcal{C}'_n = (n, M/2)$ with arbitrarily small $\lambda(\mathcal{C}'_n)$ at essentially the same code rate as $n$ grows to infinity (e.g., see [12, p. 204], [45, p. 163]).² Hence, we will only use $P_e(\mathcal{C}_n)$ as our criterion when evaluating the "goodness" or reliability³ of channel block codes.
Our target is to find a good channel block code (or to show the existence of a
good channel block code). From the perspective of the (weak) law of large num-
bers, a good choice is to draw the code’s codewords based on the jointly typical
set between the input and the output of the channel, since all the probability
mass is ultimately placed on the jointly typical set. The decoding failure then
occurs only when the channel input-output pair does not lie in the jointly typical
set, which implies that the probability of decoding error is ultimately small. We
next define the jointly typical set.
Definition 4.7 (Jointly typical set) The set $F_n(\delta)$ of jointly $\delta$-typical $n$-tuple pairs $(x^n, y^n)$ with respect to the memoryless distribution $P_{X^n,Y^n}(x^n, y^n) = \prod_{i=1}^{n} P_{X,Y}(x_i, y_i)$ is defined by
$$F_n(\delta) \triangleq \Big\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \Big| -\tfrac{1}{n}\log_2 P_{X^n}(x^n) - H(X) \Big| < \delta,\ \Big| -\tfrac{1}{n}\log_2 P_{Y^n}(y^n) - H(Y) \Big| < \delta,$$
$$\text{and } \Big| -\tfrac{1}{n}\log_2 P_{X^n,Y^n}(x^n, y^n) - H(X,Y) \Big| < \delta \Big\}.$$
2Note that this fact holds for single-user channels with known transition distributions (as
given in Definition 4.1) that remain constant throughout the transmission of a codeword. It
does not however hold for single-user channels whose statistical descriptions may vary in an
unknown manner from symbol to symbol during a codeword transmission; such channels, which
include the class of “arbitrarily varying channels” (see [13, Chapter 2, Section 6]), will not be
considered in this textbook.
3We interchangeably use the terms “goodness” or “reliability” for a block code to mean
that its (average) probability of error asymptotically vanishes with increasing blocklength.
In short, a pair (xn, yn) generated by independently drawing ntimes under PX,Y
is jointly δ-typical if its joint and marginal empirical entropies are respectively
δ-close to the true joint and marginal entropies.
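The following Python sketch (an illustration, not from the notes) checks membership in $F_n(\delta)$ by comparing the empirical per-symbol log-likelihoods of a pair of sequences with the true entropies, here for a BSC(0.1) driven by a uniform input.

```python
import math, random

def is_jointly_typical(xs, ys, pxy, delta):
    """Check membership in F_n(delta) for sequences drawn i.i.d. from the
    joint distribution pxy[(x, y)] (Definition 4.7)."""
    n = len(xs)
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    Hx = -sum(p * math.log2(p) for p in px.values() if p > 0)
    Hy = -sum(p * math.log2(p) for p in py.values() if p > 0)
    Hxy = -sum(p * math.log2(p) for p in pxy.values() if p > 0)
    # empirical per-symbol log-likelihoods
    lx = -sum(math.log2(px[x]) for x in xs) / n
    ly = -sum(math.log2(py[y]) for y in ys) / n
    lxy = -sum(math.log2(pxy[(x, y)]) for x, y in zip(xs, ys)) / n
    return abs(lx - Hx) < delta and abs(ly - Hy) < delta and abs(lxy - Hxy) < delta

# BSC(0.1) with uniform input: joint distribution P_{X,Y}
pxy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
rng = random.Random(1)
pairs = rng.choices(list(pxy), weights=list(pxy.values()), k=2000)
xs, ys = zip(*pairs)
print(is_jointly_typical(xs, ys, pxy, delta=0.05))   # True with high probability
```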
With the above definition, we directly obtain the joint AEP theorem.
Theorem 4.8 (Joint AEP) If $(X_1,Y_1), (X_2,Y_2), \ldots, (X_n,Y_n), \ldots$ are i.i.d., i.e., $\{(X_i,Y_i)\}_{i=1}^{\infty}$ is a dependent pair of DMSs, then
$$-\frac{1}{n}\log_2 P_{X^n}(X_1, X_2, \ldots, X_n) \to H(X) \ \text{ in probability},$$
$$-\frac{1}{n}\log_2 P_{Y^n}(Y_1, Y_2, \ldots, Y_n) \to H(Y) \ \text{ in probability},$$
and
$$-\frac{1}{n}\log_2 P_{X^n,Y^n}((X_1,Y_1), \ldots, (X_n,Y_n)) \to H(X,Y) \ \text{ in probability}$$
as $n \to \infty$.

Proof: By the weak law of large numbers, we have the desired result. □
Theorem 4.9 (Shannon-McMillan theorem for pairs) Given a dependent
pair of DMSs with joint entropy H(X, Y ) and any δgreater than zero, we can
choose nbig enough so that the jointly δ-typical set satisfies:
1. $P_{X^n,Y^n}(F_n^c(\delta)) < \delta$ for sufficiently large $n$.

2. The number of elements in $F_n(\delta)$ is at least $(1-\delta)\,2^{n(H(X,Y)-\delta)}$ for sufficiently large $n$, and at most $2^{n(H(X,Y)+\delta)}$ for every $n$.

3. If $(x^n, y^n) \in F_n(\delta)$, its probability of occurrence satisfies
$$2^{-n(H(X,Y)+\delta)} < P_{X^n,Y^n}(x^n, y^n) < 2^{-n(H(X,Y)-\delta)}.$$
Proof: The proof is quite similar to that of the Shannon-McMillan theorem for
a single memoryless source presented in the previous chapter; we hence leave it
as an exercise. 2
We herein arrive at the main result of this chapter, Shannon’s channel coding
theorem for DMCs. It basically states that a quantity C, termed as channel
capacity and defined as the maximum of the channel’s mutual information over
the set of its input distributions (see below), is the supremum of all “achievable”
channel block code rates; i.e., it is the supremum of all rates for which there
exists a sequence of block codes for the channel with asymptotically decaying
(as the blocklength grows to infinity) probability of decoding error. In other
words, for a given DMC, its capacity C, which can be calculated by solely using
the channel’s transition matrix Q, constitutes the largest rate at which one can
reliably transmit information via a block code over this channel. Thus, it is
possible to communicate reliably over an inherently noisy DMC at a fixed rate
(without decreasing it) as long as this rate is below Cand the code’s blocklength
is allowed to be large.
Theorem 4.10 (Shannon's channel coding theorem) Consider a DMC with finite input alphabet $\mathcal{X}$, finite output alphabet $\mathcal{Y}$ and transition probability distribution $P_{Y|X}(y|x)$, $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Define the channel capacity⁴
$$C \triangleq \max_{P_X} I(X;Y) = \max_{P_X} I(P_X, P_{Y|X}),$$
where the maximum is taken over all input distributions $P_X$. Then the following hold.

• Forward part (achievability): For any $0 < \varepsilon < 1$, there exist $\gamma > 0$ and a sequence of data transmission block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with
$$\liminf_{n\to\infty} \frac{1}{n}\log_2 M_n \ge C - \gamma$$
and
$$P_e(\mathcal{C}_n) < \varepsilon \ \text{ for sufficiently large } n,$$
where $P_e(\mathcal{C}_n)$ denotes the (average) probability of error for block code $\mathcal{C}_n$.

• Converse part: Any sequence of data transmission block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with
$$\liminf_{n\to\infty} \frac{1}{n}\log_2 M_n > C$$
satisfies
$$P_e(\mathcal{C}_n) > 0 \ \text{ for sufficiently large } n;$$
i.e., the codes' probability of error is bounded away from zero for all $n$ sufficiently large.

⁴First note that the mutual information $I(X;Y)$ is actually a function of the input statistics $P_X$ and the channel statistics $P_{Y|X}$. Hence, we may write it as
$$I(P_X, P_{Y|X}) = \sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}} P_X(x) P_{Y|X}(y|x) \log_2 \frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x') P_{Y|X}(y|x')}.$$
Such an expression is more suitable for calculating the channel capacity. Note also that the channel capacity $C$ is well-defined since, for a fixed $P_{Y|X}$, $I(P_X, P_{Y|X})$ is concave and continuous in $P_X$ (with respect to both the variational distance and the Euclidean distance (i.e., $L_2$-distance) [45, Chapter 2]), and since the set of all input distributions $P_X$ is a compact (closed and bounded) subset of $\mathbb{R}^{|\mathcal{X}|}$ due to the finiteness of $\mathcal{X}$. Hence there exists a $P_X$ that achieves the supremum of the mutual information, and the maximum is attainable.
Proof of the forward part: It suffices to prove the existence of a good block code sequence (satisfying the rate condition, i.e., $\liminf_{n\to\infty}(1/n)\log_2 M_n \ge C - \gamma$ for some $\gamma > 0$) whose average error probability is ultimately less than $\varepsilon$.

We will use Shannon's original random coding proof technique, in which the good block code sequence is not deterministically constructed; instead, its existence is implicitly proven by showing that, for a class (ensemble) of block code sequences $\{\mathcal{C}_n\}_{n=1}^{\infty}$ and a code-selecting distribution $\Pr[\mathcal{C}_n]$ over these block code sequences, the expected value of the average error probability, evaluated under the code-selecting distribution, can be made smaller than $\varepsilon$ for $n$ sufficiently large:
$$E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] = \sum_{\mathcal{C}_n} \Pr[\mathcal{C}_n]\,P_e(\mathcal{C}_n) \to 0 \quad\text{as } n \to \infty.$$
Hence, there must exist at least one desired good code sequence $\{\mathcal{C}_n^*\}_{n=1}^{\infty}$ among them (with $P_e(\mathcal{C}_n^*) \to 0$ as $n \to \infty$).

Fix $\varepsilon \in (0,1)$ and some $\gamma \in (0, 4\varepsilon)$. Observe that there exists $N_0$ such that for $n > N_0$, we can choose an integer $M_n$ with
$$C - \frac{\gamma}{2} \ge \frac{1}{n}\log_2 M_n > C - \gamma.$$
(Since we are only concerned with the case of "sufficiently large $n$," it suffices to consider only those $n$ satisfying $n > N_0$, and to ignore those $n \le N_0$.)

Define $\delta \triangleq \gamma/8$. Let $P_{\hat{X}}$ be the probability distribution achieving the channel capacity:
$$C \triangleq \max_{P_X} I(P_X, P_{Y|X}) = I(P_{\hat{X}}, P_{Y|X}).$$
Denote by $P_{\hat{Y}^n}$ the channel output distribution due to the channel input product distribution $P_{\hat{X}^n}$ (with $P_{\hat{X}^n}(x^n) = \prod_{i=1}^{n} P_{\hat{X}}(x_i)$), i.e.,
$$P_{\hat{Y}^n}(y^n) = \sum_{x^n\in\mathcal{X}^n} P_{\hat{X}^n,\hat{Y}^n}(x^n, y^n), \quad\text{where}\quad P_{\hat{X}^n,\hat{Y}^n}(x^n, y^n) \triangleq P_{\hat{X}^n}(x^n)\,P_{Y^n|X^n}(y^n|x^n)$$
for all $x^n \in \mathcal{X}^n$ and $y^n \in \mathcal{Y}^n$. Note that since $P_{\hat{X}^n}(x^n) = \prod_{i=1}^{n} P_{\hat{X}}(x_i)$ and the channel is memoryless, the resulting joint input-output process $\{(\hat{X}_i, \hat{Y}_i)\}_{i=1}^{\infty}$ is also memoryless with
$$P_{\hat{X}^n,\hat{Y}^n}(x^n, y^n) = \prod_{i=1}^{n} P_{\hat{X},\hat{Y}}(x_i, y_i) \quad\text{and}\quad P_{\hat{X},\hat{Y}}(x,y) = P_{\hat{X}}(x)\,P_{Y|X}(y|x) \ \text{ for } x \in \mathcal{X},\ y \in \mathcal{Y}.$$
We next present the proof in three steps.
Step 1: Code construction.

For any blocklength $n$, independently select $M_n$ channel inputs with replacement⁵ from $\mathcal{X}^n$ according to the distribution $P_{\hat{X}^n}(x^n)$. For the selected $M_n$ channel inputs yielding codebook $\mathcal{C}_n \triangleq \{c_1, c_2, \ldots, c_{M_n}\}$, define the encoder $f_n(\cdot)$ and decoder $g_n(\cdot)$, respectively, as follows:
$$f_n(m) = c_m \quad\text{for } 1 \le m \le M_n,$$
and
$$g_n(y^n) = \begin{cases} m, & \text{if } c_m \text{ is the only codeword in } \mathcal{C}_n \text{ satisfying } (c_m, y^n) \in F_n(\delta); \\ \text{any one in } \{1,2,\ldots,M_n\}, & \text{otherwise}, \end{cases}$$
where $F_n(\delta)$ is defined in Definition 4.7 with respect to the distribution $P_{\hat{X}^n,\hat{Y}^n}$. (We evidently assume that the codebook $\mathcal{C}_n$ and the channel distribution $P_{Y|X}$ are known at both the encoder and the decoder.) Hence, the code $\mathcal{C}_n$ operates as follows. A message $W$ is chosen according to the uniform distribution from the set of messages. The encoder $f_n$ then transmits the $W$th codeword $c_W$ in $\mathcal{C}_n$ over the channel. Then $Y^n$ is received at the channel output and the decoder guesses the sent message via $\hat{W} = g_n(Y^n)$.

Note that there is a total of $|\mathcal{X}|^{n M_n}$ possible randomly generated codebooks $\mathcal{C}_n$, and the probability of selecting each codebook is given by
$$\Pr[\mathcal{C}_n] = \prod_{m=1}^{M_n} P_{\hat{X}^n}(c_m).$$

⁵Here, the channel inputs are selected with replacement; i.e., it is possible and acceptable that all the selected $M_n$ channel inputs are identical.
Step 2: Conditional error probability.

For each (randomly generated) data transmission code $\mathcal{C}_n$, the conditional probability of error given that message $m$ was sent, $\lambda_m(\mathcal{C}_n)$, can be upper bounded by
$$\lambda_m(\mathcal{C}_n) \le \sum_{y^n\in\mathcal{Y}^n:\,(c_m,y^n)\notin F_n(\delta)} P_{Y^n|X^n}(y^n|c_m) + \sum_{\substack{m'=1 \\ m'\neq m}}^{M_n} \ \sum_{y^n\in\mathcal{Y}^n:\,(c_{m'},y^n)\in F_n(\delta)} P_{Y^n|X^n}(y^n|c_m), \qquad (4.3.1)$$
where the first term in (4.3.1) considers the case where the received channel output $y^n$ is not jointly $\delta$-typical with $c_m$ (and hence the decoding rule $g_n(\cdot)$ would possibly result in a wrong guess), and the second term in (4.3.1) reflects the situation where $y^n$ is jointly $\delta$-typical not only with the transmitted codeword $c_m$, but also with another codeword $c_{m'}$ (which may cause a decoding error).

By taking the expectation in (4.3.1) with respect to the $m$th codeword-selecting distribution $P_{\hat{X}^n}(c_m)$, we obtain
$$\sum_{c_m\in\mathcal{X}^n} P_{\hat{X}^n}(c_m)\,\lambda_m(\mathcal{C}_n) \le \sum_{c_m\in\mathcal{X}^n} \sum_{y^n\notin F_n(\delta|c_m)} P_{\hat{X}^n}(c_m)\,P_{Y^n|X^n}(y^n|c_m) + \sum_{c_m\in\mathcal{X}^n} \sum_{\substack{m'=1 \\ m'\neq m}}^{M_n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n}(c_m)\,P_{Y^n|X^n}(y^n|c_m)$$
$$= P_{\hat{X}^n,\hat{Y}^n}(F_n^c(\delta)) + \sum_{\substack{m'=1 \\ m'\neq m}}^{M_n} \sum_{c_m\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n), \qquad (4.3.2)$$
where
$$F_n(\delta|x^n) \triangleq \{ y^n \in \mathcal{Y}^n : (x^n, y^n) \in F_n(\delta) \}.$$
Step 3: Average error probability.

We can now analyze the expectation of the average error probability $E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)]$ over the ensemble of all codebooks $\mathcal{C}_n$ generated at random according to $\Pr[\mathcal{C}_n]$, and show that it asymptotically vanishes as $n$ grows without bound. We obtain the following series of (in)equalities:
$$E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] = \sum_{\mathcal{C}_n} \Pr[\mathcal{C}_n]\,P_e(\mathcal{C}_n) = \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{M_n}) \left( \frac{1}{M_n}\sum_{m=1}^{M_n}\lambda_m(\mathcal{C}_n) \right)$$
$$= \frac{1}{M_n}\sum_{m=1}^{M_n} \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n}) \left( \sum_{c_m\in\mathcal{X}^n} P_{\hat{X}^n}(c_m)\,\lambda_m(\mathcal{C}_n) \right)$$
$$\le \frac{1}{M_n}\sum_{m=1}^{M_n} \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n}) \times P_{\hat{X}^n,\hat{Y}^n}(F_n^c(\delta))$$
$$\quad + \frac{1}{M_n}\sum_{m=1}^{M_n} \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n}) \times \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{c_m\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n) \qquad (4.3.3)$$
$$= P_{\hat{X}^n,\hat{Y}^n}(F_n^c(\delta)) + \frac{1}{M_n}\sum_{m=1}^{M_n}\sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n}) \left( \sum_{c_m\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n) \right),$$
where (4.3.3) follows from (4.3.2), and the last step holds since $P_{\hat{X}^n,\hat{Y}^n}(F_n^c(\delta))$ is a constant independent of $c_1, \ldots, c_{M_n}$ and $m$. Observe that for $n > N_0$,
$$\sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n}) \sum_{c_m\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n)$$
$$= \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{c_{m'}\in\mathcal{X}^n} \sum_{c_m\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n}(c_{m'})\,P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n)$$
$$= \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{c_{m'}\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n}(c_{m'}) \left( \sum_{c_m\in\mathcal{X}^n} P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n) \right)$$
$$= \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{c_{m'}\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n}(c_{m'})\,P_{\hat{Y}^n}(y^n) = \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{(c_{m'},y^n)\in F_n(\delta)} P_{\hat{X}^n}(c_{m'})\,P_{\hat{Y}^n}(y^n)$$
$$\le \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} |F_n(\delta)|\,2^{-n(H(\hat{X})-\delta)}\,2^{-n(H(\hat{Y})-\delta)} \le \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} 2^{n(H(\hat{X},\hat{Y})+\delta)}\,2^{-n(H(\hat{X})-\delta)}\,2^{-n(H(\hat{Y})-\delta)}$$
$$= (M_n-1)\,2^{n(H(\hat{X},\hat{Y})+\delta)}\,2^{-n(H(\hat{X})-\delta)}\,2^{-n(H(\hat{Y})-\delta)} \le M_n \cdot 2^{-n(I(\hat{X};\hat{Y})-3\delta)} \le 2^{n(C-4\delta)} \cdot 2^{-n(I(\hat{X};\hat{Y})-3\delta)} = 2^{-n\delta},$$
where the first inequality follows from the definition of the jointly typical set $F_n(\delta)$, the second inequality holds by the Shannon-McMillan theorem for pairs (Theorem 4.9), and the last inequality follows since $C = I(\hat{X};\hat{Y})$ by definition of $\hat{X}$ and $\hat{Y}$, and since $(1/n)\log_2 M_n \le C - (\gamma/2) = C - 4\delta$.

Consequently,
$$E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] \le P_{\hat{X}^n,\hat{Y}^n}(F_n^c(\delta)) + 2^{-n\delta},$$
which for sufficiently large $n$ (and $n > N_0$) can be made smaller than $2\delta = \gamma/4 < \varepsilon$ by the Shannon-McMillan theorem for pairs. □
Before proving the converse part of the channel coding theorem, let us recall
Fano’s inequality in a channel coding context. Consider an (n, Mn) channel
block code Cnwith encoding and decoding functions given by
$$f_n: \{1,2,\cdots,M_n\} \to \mathcal{X}^n \quad\text{and}\quad g_n: \mathcal{Y}^n \to \{1,2,\cdots,M_n\},$$
respectively. Let message $W$, which is uniformly distributed over the set of messages $\{1,2,\cdots,M_n\}$, be sent via codeword $X^n(W) = f_n(W)$ over the DMC, and let $Y^n$ be received at the channel output. At the receiver, the decoder estimates the sent message via $\hat{W} = g_n(Y^n)$, and the probability of estimation error is given by the code's average error probability:
$$\Pr[W \neq \hat{W}] = P_e(\mathcal{C}_n)$$
since $W$ is uniformly distributed. Then Fano's inequality (2.5.2) yields
$$H(W|Y^n) \le 1 + P_e(\mathcal{C}_n)\log_2(M_n - 1) \le 1 + P_e(\mathcal{C}_n)\log_2 M_n. \qquad (4.3.4)$$
We next proceed with the proof of the converse part.
Proof of the converse part: For any (n, Mn) block channel code Cnas de-
scribed above, we have that WXnYnform a Markov chain; we thus
obtain by the data processing inequality that
I(W;Yn)I(Xn;Yn).(4.3.5)
We can also upper bound I(Xn;Yn) in terms of the channel capacity Cas follows
I(Xn;Yn)max
PXn
I(Xn;Yn)
max
PXn
n
X
i=1
I(Xi;Yi) (by Theorem 2.21)
n
X
i=1
max
PXn
I(Xi;Yi)
=
n
X
i=1
max
PXi
I(Xi;Yi)
=nC. (4.3.6)
100
Consequently, code Cnsatisfies the following:
log2Mn=H(W) (since Wis uniformly distributed)
=H(W|Yn) + I(W;Yn)
H(W|Yn) + I(Xn;Yn) (by 4.3.5)
H(W|Yn) + nC (by 4.3.6)
1 + Pe(Cn)·log2Mn+nC. (by 4.3.4)
This implies that
Pe(Cn)1C
(1/n) log2Mn1
log2Mn
.
So if lim infn→∞(1/n) log2Mn> C, then for any δ > 0, there exists an integer
Nsuch that for nN,1
nlog2Mn> C +δ.
Hence, for nN0,max{N, 2},
Pe(Cn)>1C
C+δ1
n(C+δ)>δ
2(C+δ)>0;
i.e., Pe(Cn) is bounded away from zero for nsufficiently large. 2
The results of the above channel coding theorem are illustrated in Figure 4.5, where $R = \liminf_{n\to\infty}(1/n)\log_2 M_n$ (measured in message bits/channel use) is usually called the ultimate (or asymptotic) coding rate of channel block codes. As indicated in the figure, the ultimate rate of any good block code for the DMC must be smaller than its capacity $C$. Conversely, any block code with (ultimate) rate greater than $C$ will have its probability of error bounded away from zero. Thus for a DMC, its capacity $C$ is the supremum of all "achievable" channel block coding rates; i.e., it is the supremum of all rates for which there exists a sequence of channel block codes with asymptotically vanishing (as the blocklength goes to infinity) probability of error.

[Figure 4.5: Ultimate channel coding rate $R$ versus channel capacity $C$ and behavior of the probability of error as blocklength $n$ goes to infinity for a discrete memoryless channel: for $R < C$, $\lim_{n\to\infty} P_e = 0$ for the best channel block code, while for $R > C$, $\limsup_{n\to\infty} P_e > 0$ for all channel block codes.]

Shannon's channel coding theorem, established in 1948 [38], provides the ultimate limit for reliable communication over a noisy channel. However, it does not provide an explicit efficient construction for good codes, since searching for a good code from the ensemble of randomly generated codes is prohibitively complex, as its size grows double-exponentially with blocklength (see Step 1 of the proof of the forward part). It thus spurred the entire area of coding theory, which flourished over the last 60 years with the aim of constructing powerful error-correcting codes operating close to the capacity limit. Particular advances were made for the class of linear codes (also known as group codes) whose rich⁶ yet elegantly simple algebraic structures made them amenable to efficient
practically-implementable encoding and decoding. Examples of such codes in-
clude Hamming codes, Golay codes, BCH and Reed-Solomon codes and convo-
lutional codes. In 1993, the so-called Turbo codes were introduced by Berrou
et al. [3, 4] and shown experimentally to perform close to the channel capacity
limit for the class of memoryless channels. Similar near-capacity achieving lin-
ear codes were later established with the re-discovery of Gallager’s low-density
parity-check codes [16, 17, 29, 30]. Many of these codes are used with increased
sophistication in today’s ubiquitous communication, information and multime-
dia technologies. For detailed studies on coding theory, see the following texts
[8, 10, 23, 28, 31, 35, 44].
4.4 Calculating channel capacity
Given a DMC with finite input alphabet $\mathcal{X}$, finite output alphabet $\mathcal{Y}$ and channel transition matrix $\mathbf{Q} = [p_{x,y}]$ of size $|\mathcal{X}| \times |\mathcal{Y}|$, where $p_{x,y} \triangleq P_{Y|X}(y|x)$ for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, we would like to calculate
$$C \triangleq \max_{P_X} I(X;Y),$$
where the maximization (which is well-defined) is carried out over the set of input distributions $P_X$, and $I(X;Y)$ is the mutual information between the channel's input and output.
Note that Ccan be determined numerically via non-linear optimization tech-
niques such as the iterative algorithms developed by Arimoto [1] and Blahut
[7, 9], see also [14] and [45, Chap. 9]. In general, there are no closed-form (single-
letter) analytical expressions for C. However, for many “simplified” channels,
6Indeed, there exist linear codes that can achieve the capacity of memoryless channels with
additive noise (e.g., see [13, p. 114]). Such channels include the BSC and the q-ary symmetric
channel.
it is possible to analytically determine Cunder some “symmetry” properties of
their channel transition matrix.
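For concreteness, here is a short Python sketch of the Blahut-Arimoto iteration mentioned above (an illustrative implementation, not the notes' own); it returns a numerical estimate of $C$ for an arbitrary transition matrix and recovers $1 - h_b(0.1) \approx 0.531$ bits for the BSC with crossover probability 0.1.

```python
import math

def blahut_arimoto(Q, iterations=200):
    """Sketch of the Blahut-Arimoto iteration for C = max_PX I(PX, Q),
    where Q[x][y] = P_{Y|X}(y|x).  Returns (capacity in bits, input distribution)."""
    nx, ny = len(Q), len(Q[0])
    p = [1.0 / nx] * nx                      # start from the uniform input
    for _ in range(iterations):
        q = [sum(p[x] * Q[x][y] for x in range(nx)) for y in range(ny)]  # output dist.
        # D(x) = 2 ** ( sum_y Q(y|x) log2( Q(y|x) / q(y) ) )
        D = [2 ** sum(Q[x][y] * math.log2(Q[x][y] / q[y])
                      for y in range(ny) if Q[x][y] > 0) for x in range(nx)]
        Z = sum(p[x] * D[x] for x in range(nx))
        p = [p[x] * D[x] / Z for x in range(nx)]
    return math.log2(Z), p                   # log2(Z) converges to the capacity

# BSC with crossover 0.1: capacity should be 1 - h_b(0.1) ~ 0.531 bits
print(blahut_arimoto([[0.9, 0.1], [0.1, 0.9]]))
```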
4.4.1 Symmetric, weakly-symmetric and quasi-symmetric
channels
Definition 4.11 A DMC with finite input alphabet X, finite output alphabet
Yand channel transition matrix Q= [px,y] of size |X | × |Y| is said to be sym-
metric if the rows of Qare permutations of each other and the columns of Q
are permutations of each other. The channel is said to be weakly-symmetric if
the rows of Qare permutations of each other and all the column sums in Qare
equal.
It directly follows from the definition that symmetry implies weak-symmetry.
Examples of symmetric DMCs include the BSC, the q-ary symmetric channel and
the following ternary channel with X=Y={0,1,2}and transition matrix
$$\mathbf{Q} = \begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(1|0) & P_{Y|X}(2|0) \\ P_{Y|X}(0|1) & P_{Y|X}(1|1) & P_{Y|X}(2|1) \\ P_{Y|X}(0|2) & P_{Y|X}(1|2) & P_{Y|X}(2|2) \end{bmatrix} = \begin{bmatrix} 0.4 & 0.1 & 0.5 \\ 0.5 & 0.4 & 0.1 \\ 0.1 & 0.5 & 0.4 \end{bmatrix}.$$
The following DMC with $|\mathcal{X}| = |\mathcal{Y}| = 4$ and
$$\mathbf{Q} = \begin{bmatrix} 0.5 & 0.25 & 0.25 & 0 \\ 0.5 & 0.25 & 0.25 & 0 \\ 0 & 0.25 & 0.25 & 0.5 \\ 0 & 0.25 & 0.25 & 0.5 \end{bmatrix} \qquad (4.4.1)$$
is weakly-symmetric (but not symmetric). Noting that all the above channels involve square transition matrices, we emphasize that $\mathbf{Q}$ can be rectangular while satisfying the symmetry or weak-symmetry properties. For example, the DMC with $|\mathcal{X}| = 2$, $|\mathcal{Y}| = 4$ and
$$\mathbf{Q} = \begin{bmatrix} \frac{1-\varepsilon}{2} & \frac{\varepsilon}{2} & \frac{1-\varepsilon}{2} & \frac{\varepsilon}{2} \\ \frac{\varepsilon}{2} & \frac{1-\varepsilon}{2} & \frac{\varepsilon}{2} & \frac{1-\varepsilon}{2} \end{bmatrix} \qquad (4.4.2)$$
is symmetric (where $\varepsilon \in [0,1]$), while the DMC with $|\mathcal{X}| = 2$, $|\mathcal{Y}| = 3$ and
$$\mathbf{Q} = \begin{bmatrix} \frac{1}{3} & \frac{1}{6} & \frac{1}{2} \\ \frac{1}{3} & \frac{1}{2} & \frac{1}{6} \end{bmatrix}$$
is weakly-symmetric.
Lemma 4.12 The capacity of a weakly-symmetric channel $\mathbf{Q}$ is achieved by a uniform input distribution and is given by
$$C = \log_2 |\mathcal{Y}| - H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) \qquad (4.4.3)$$
where $(q_1, q_2, \cdots, q_{|\mathcal{Y}|})$ denotes any row of $\mathbf{Q}$ and
$$H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) \triangleq -\sum_{i=1}^{|\mathcal{Y}|} q_i \log_2 q_i$$
is the row entropy.

Proof: The mutual information between the channel's input and output is given by
$$I(X;Y) = H(Y) - H(Y|X) = H(Y) - \sum_{x\in\mathcal{X}} P_X(x)\,H(Y|X=x),$$
where $H(Y|X=x) = -\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\log_2 P_{Y|X}(y|x) = -\sum_{y\in\mathcal{Y}} p_{x,y}\log_2 p_{x,y}$. Noting that every row of $\mathbf{Q}$ is a permutation of every other row, we obtain that $H(Y|X=x)$ is independent of $x$ and can be written as
$$H(Y|X=x) = H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}),$$
where $(q_1, q_2, \cdots, q_{|\mathcal{Y}|})$ is any row of $\mathbf{Q}$. Thus
$$H(Y|X) = \sum_{x\in\mathcal{X}} P_X(x)\,H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) = H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) \left( \sum_{x\in\mathcal{X}} P_X(x) \right) = H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}).$$
Thus
$$I(X;Y) = H(Y) - H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) \le \log_2|\mathcal{Y}| - H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}),$$
with equality achieved iff $Y$ is uniformly distributed over $\mathcal{Y}$. We next show that choosing a uniform input distribution, $P_X(x) = \frac{1}{|\mathcal{X}|}$ for all $x\in\mathcal{X}$, yields a uniform output distribution, hence maximizing mutual information. Indeed, under a uniform input distribution, we obtain that for any $y\in\mathcal{Y}$,
$$P_Y(y) = \sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x) = \frac{1}{|\mathcal{X}|}\sum_{x\in\mathcal{X}} p_{x,y} = \frac{A}{|\mathcal{X}|},$$
where $A \triangleq \sum_{x\in\mathcal{X}} p_{x,y}$ is a constant given by the sum of the entries in any column of $\mathbf{Q}$, since by the weak-symmetry property all column sums in $\mathbf{Q}$ are identical. Noting that $\sum_{y\in\mathcal{Y}} P_Y(y) = 1$ yields
$$\sum_{y\in\mathcal{Y}} \frac{A}{|\mathcal{X}|} = 1 \quad\text{and thus}\quad A = \frac{|\mathcal{X}|}{|\mathcal{Y}|}. \qquad (4.4.4)$$
Thus
$$P_Y(y) = \frac{A}{|\mathcal{X}|} = \frac{|\mathcal{X}|}{|\mathcal{Y}|}\cdot\frac{1}{|\mathcal{X}|} = \frac{1}{|\mathcal{Y}|}$$
for any $y\in\mathcal{Y}$; thus the uniform input distribution induces a uniform output distribution and achieves channel capacity as given by (4.4.3). □
Observation 4.13 Note that if the weakly-symmetric channel has a square (i.e.,
with |X| =|Y|) transition matrix Q, then Qis a doubly-stochastic matrix; i.e.,
both its row sums and its column sums are equal to 1. Note however that having
a square transition matrix does not necessarily make a weakly-symmetric channel
symmetric; e.g., see (4.4.1).
Example 4.14 (Capacity of the BSC) Since the BSC with crossover probability (or bit error rate) $\varepsilon$ is symmetric, we directly obtain from Lemma 4.12 that its capacity is achieved by a uniform input distribution and is given by
$$C = \log_2(2) - H(1-\varepsilon, \varepsilon) = 1 - h_b(\varepsilon) \qquad (4.4.5)$$
where $h_b(\cdot)$ is the binary entropy function.
Example 4.15 (Capacity of the q-ary symmetric channel) Similarly, the $q$-ary symmetric channel with symbol error rate $\varepsilon$ described in (4.2.8) is symmetric; hence, by Lemma 4.12, its capacity is given by
$$C = \log_2 q - H\!\left(1-\varepsilon, \frac{\varepsilon}{q-1}, \cdots, \frac{\varepsilon}{q-1}\right) = \log_2 q + \varepsilon\log_2\frac{\varepsilon}{q-1} + (1-\varepsilon)\log_2(1-\varepsilon).$$
Note that when $q = 2$, the channel capacity is equal to that of the BSC, as expected. Furthermore, when $\varepsilon = 0$, the channel reduces to the identity (noiseless) $q$-ary channel and its capacity is given by $C = \log_2 q$.
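A small Python sketch (illustrative, not from the notes) evaluating the weakly-symmetric capacity formula (4.4.3); the two calls below reproduce the BSC and $q$-ary symmetric capacities of Examples 4.14 and 4.15 for $\varepsilon = 0.1$ and $(q, \varepsilon) = (3, 0.2)$, respectively.

```python
import math

def row_entropy(row):
    return -sum(q * math.log2(q) for q in row if q > 0)

def weakly_symmetric_capacity(Q):
    """C = log2|Y| - H(any row of Q), valid when Q is weakly-symmetric
    (Lemma 4.12); capacity is then achieved by the uniform input."""
    return math.log2(len(Q[0])) - row_entropy(Q[0])

# BSC(0.1):  1 - h_b(0.1) ~ 0.531 bits
print(weakly_symmetric_capacity([[0.9, 0.1], [0.1, 0.9]]))
# 3-ary symmetric channel with symbol error rate 0.2:
# log2(3) + 0.2*log2(0.1) + 0.8*log2(0.8) ~ 0.663 bits
print(weakly_symmetric_capacity([[0.8, 0.1, 0.1],
                                 [0.1, 0.8, 0.1],
                                 [0.1, 0.1, 0.8]]))
```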
We next note that one can further weaken the weak-symmetry property and
define a class of “quasi-symmetric” channels for which the uniform input distri-
bution still achieves capacity and yields a simple closed-form formula for capacity.
Definition 4.16 A DMC with finite input alphabet $\mathcal{X}$, finite output alphabet $\mathcal{Y}$ and channel transition matrix $\mathbf{Q} = [p_{x,y}]$ of size $|\mathcal{X}| \times |\mathcal{Y}|$ is said to be quasi-symmetric⁷ if $\mathbf{Q}$ can be partitioned along its columns into $m$ weakly-symmetric sub-matrices $\mathbf{Q}_1, \mathbf{Q}_2, \cdots, \mathbf{Q}_m$ for some integer $m \ge 1$, where each sub-matrix $\mathbf{Q}_i$ has size $|\mathcal{X}| \times |\mathcal{Y}_i|$ for $i = 1,2,\cdots,m$, with $\mathcal{Y}_1 \cup \cdots \cup \mathcal{Y}_m = \mathcal{Y}$ and $\mathcal{Y}_i \cap \mathcal{Y}_j = \emptyset$ for $i \neq j$, $i,j = 1,2,\cdots,m$.

⁷This notion of "quasi-symmetry" is slightly more general than Gallager's notion [18, p. 94], as we herein allow each sub-matrix to be weakly-symmetric (instead of symmetric as in [18]).
Lemma 4.17 The capacity of a quasi-symmetric channel $\mathbf{Q}$ as defined above is achieved by a uniform input distribution and is given by
$$C = \sum_{i=1}^{m} a_i C_i \qquad (4.4.6)$$
where
$$a_i \triangleq \sum_{y\in\mathcal{Y}_i} p_{x,y} = \text{sum of any row in } \mathbf{Q}_i, \quad i = 1,\cdots,m,$$
and
$$C_i = \log_2|\mathcal{Y}_i| - H\!\left(\text{any row in the matrix } \tfrac{1}{a_i}\mathbf{Q}_i\right), \quad i = 1,\cdots,m,$$
is the capacity of the $i$th weakly-symmetric "sub-channel" whose transition matrix is obtained by multiplying each entry of $\mathbf{Q}_i$ by $\frac{1}{a_i}$ (this normalization renders sub-matrix $\mathbf{Q}_i$ into a stochastic matrix and hence a channel transition matrix).

Proof: We first observe that for each $i = 1,\cdots,m$, $a_i$ is independent of the input value $x$, since sub-matrix $\mathbf{Q}_i$ is weakly-symmetric (so any row in $\mathbf{Q}_i$ is a permutation of any other row); hence $a_i$ is the sum of any row in $\mathbf{Q}_i$.

For each $i = 1,\cdots,m$, define
$$P_{Y_i|X}(y|x) \triangleq \begin{cases} \dfrac{p_{x,y}}{a_i} & \text{if } y \in \mathcal{Y}_i \text{ and } x \in \mathcal{X}; \\ 0 & \text{otherwise}, \end{cases}$$
where $Y_i$ is a random variable taking values in $\mathcal{Y}_i$. It can be easily verified that $P_{Y_i|X}(y|x)$ is a legitimate conditional distribution. Thus $[P_{Y_i|X}(y|x)] = \frac{1}{a_i}\mathbf{Q}_i$ is the transition matrix of the weakly-symmetric "sub-channel" $i$ with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}_i$. Let $I(X;Y_i)$ denote its mutual information. Since each such sub-channel $i$ is weakly-symmetric, we know that its capacity $C_i$ is given by
$$C_i = \max_{P_X} I(X;Y_i) = \log_2|\mathcal{Y}_i| - H\!\left(\text{any row in the matrix } \tfrac{1}{a_i}\mathbf{Q}_i\right),$$
where the maximum is achieved by a uniform input distribution.

Now, the mutual information between the input and the output of our original quasi-symmetric channel $\mathbf{Q}$ can be written as
$$I(X;Y) = \sum_{y\in\mathcal{Y}} \sum_{x\in\mathcal{X}} P_X(x)\,p_{x,y}\log_2\frac{p_{x,y}}{\sum_{x'\in\mathcal{X}} P_X(x')\,p_{x',y}} = \sum_{i=1}^{m} \sum_{y\in\mathcal{Y}_i} \sum_{x\in\mathcal{X}} a_i\,P_X(x)\,\frac{p_{x,y}}{a_i}\log_2\frac{\frac{p_{x,y}}{a_i}}{\sum_{x'\in\mathcal{X}} P_X(x')\,\frac{p_{x',y}}{a_i}}$$
$$= \sum_{i=1}^{m} a_i \sum_{y\in\mathcal{Y}_i} \sum_{x\in\mathcal{X}} P_X(x)\,P_{Y_i|X}(y|x)\log_2\frac{P_{Y_i|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x')\,P_{Y_i|X}(y|x')} = \sum_{i=1}^{m} a_i\,I(X;Y_i).$$
Therefore, the capacity of channel $\mathbf{Q}$ is
$$C = \max_{P_X} I(X;Y) = \max_{P_X} \sum_{i=1}^{m} a_i\,I(X;Y_i) = \sum_{i=1}^{m} a_i \max_{P_X} I(X;Y_i) = \sum_{i=1}^{m} a_i C_i,$$
where the third equality holds since the same uniform $P_X$ maximizes each $I(X;Y_i)$. □
Example 4.18 (Capacity of the BEC) The BEC with erasure probability $\alpha$ as given in (4.2.5) is quasi-symmetric (but neither weakly-symmetric nor symmetric). Indeed, its transition matrix $\mathbf{Q}$ can be partitioned along its columns into two symmetric (hence weakly-symmetric) sub-matrices
$$\mathbf{Q}_1 = \begin{bmatrix} 1-\alpha & 0 \\ 0 & 1-\alpha \end{bmatrix} \quad\text{and}\quad \mathbf{Q}_2 = \begin{bmatrix} \alpha \\ \alpha \end{bmatrix}.$$
Thus applying the capacity formula for quasi-symmetric channels of Lemma 4.17 yields that the capacity of the BEC is given by $C = a_1 C_1 + a_2 C_2$, where $a_1 = 1-\alpha$, $a_2 = \alpha$,
$$C_1 = \log_2(2) - H\!\left(\frac{1-\alpha}{1-\alpha}, \frac{0}{1-\alpha}\right) = 1 - H(1,0) = 1 - 0 = 1, \quad\text{and}\quad C_2 = \log_2(1) - H\!\left(\frac{\alpha}{\alpha}\right) = 0 - 0 = 0.$$
Therefore, the BEC capacity is given by
$$C = (1-\alpha)(1) + (\alpha)(0) = 1 - \alpha. \qquad (4.4.7)$$
Example 4.19 (Capacity of the BSEC) Similarly, the BSEC with crossover probability $\varepsilon$ and erasure probability $\alpha$ as described in (4.2.6) is quasi-symmetric; its transition matrix can be partitioned along its columns into two symmetric sub-matrices
$$\mathbf{Q}_1 = \begin{bmatrix} 1-\varepsilon-\alpha & \varepsilon \\ \varepsilon & 1-\varepsilon-\alpha \end{bmatrix} \quad\text{and}\quad \mathbf{Q}_2 = \begin{bmatrix} \alpha \\ \alpha \end{bmatrix}.$$
Hence by Lemma 4.17, the channel capacity is given by $C = a_1 C_1 + a_2 C_2$, where $a_1 = 1-\alpha$, $a_2 = \alpha$,
$$C_1 = \log_2(2) - H\!\left(\frac{1-\varepsilon-\alpha}{1-\alpha}, \frac{\varepsilon}{1-\alpha}\right) = 1 - h_b\!\left(\frac{1-\varepsilon-\alpha}{1-\alpha}\right), \quad\text{and}\quad C_2 = \log_2(1) - H\!\left(\frac{\alpha}{\alpha}\right) = 0.$$
We thus obtain that
$$C = (1-\alpha)\left[1 - h_b\!\left(\frac{1-\varepsilon-\alpha}{1-\alpha}\right)\right] + (\alpha)(0) = (1-\alpha)\left[1 - h_b\!\left(\frac{1-\varepsilon-\alpha}{1-\alpha}\right)\right]. \qquad (4.4.8)$$
As already noted, the BSEC is a combination of the BSC with bit error rate $\varepsilon$ and the BEC with erasure probability $\alpha$. Indeed, setting $\alpha = 0$ in (4.4.8) yields $C = 1 - h_b(1-\varepsilon) = 1 - h_b(\varepsilon)$, which is the BSC capacity. Furthermore, setting $\varepsilon = 0$ results in $C = 1 - \alpha$, the BEC capacity.
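The decomposition of Lemma 4.17 translates directly into code; the following Python sketch (illustrative, with an assumed column ordering $(y=0, E, 1)$ for the BSEC) computes $\sum_i a_i C_i$ from a transition matrix and a column partition, and reproduces formula (4.4.8).

```python
import math

def quasi_symmetric_capacity(Q, partition):
    """C = sum_i a_i * C_i for a quasi-symmetric channel (Lemma 4.17).
    Q[x][y] is the transition matrix; partition is a list of lists of output
    columns, each forming a weakly-symmetric sub-matrix."""
    C = 0.0
    for cols in partition:
        a = sum(Q[0][y] for y in cols)                 # sum of any row of Q_i
        row = [Q[0][y] / a for y in cols]              # any row of (1/a_i) Q_i
        Ci = math.log2(len(cols)) + sum(q * math.log2(q) for q in row if q > 0)
        C += a * Ci
    return C

eps, alpha = 0.05, 0.2
# BSEC columns ordered as (y=0, y=E, y=1); partition into {0,1} and {E}
Q = [[1 - eps - alpha, alpha, eps],
     [eps, alpha, 1 - eps - alpha]]
print(quasi_symmetric_capacity(Q, partition=[[0, 2], [1]]))
# equals (1 - alpha) * (1 - h_b((1 - eps - alpha)/(1 - alpha)))
```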
4.4.2 Channel capacity and the Karush-Kuhn-Tucker condition

When the channel does not satisfy any symmetry property, the following necessary and sufficient Karush-Kuhn-Tucker (KKT) condition (e.g., cf. [18, pp. 87-91], [5, 11]) for calculating channel capacity can be quite useful.
Definition 4.20 (Mutual information for a specific input symbol) The
mutual information for a specific input symbol is defined as:
$$I(x;Y) \triangleq \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)}{P_Y(y)}.$$
From the above definition, the mutual information becomes
$$I(X;Y) = \sum_{x\in\mathcal{X}} P_X(x) \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)}{P_Y(y)} = \sum_{x\in\mathcal{X}} P_X(x)\,I(x;Y).$$
Lemma 4.21 (KKT condition for channel capacity) For a given DMC, an input distribution $P_X$ achieves its channel capacity iff there exists a constant $C$ such that
$$\begin{cases} I(x;Y) = C & \forall\, x \in \mathcal{X} \text{ with } P_X(x) > 0; \\ I(x;Y) \le C & \forall\, x \in \mathcal{X} \text{ with } P_X(x) = 0. \end{cases} \qquad (4.4.9)$$
Furthermore, the constant $C$ is the channel capacity (justifying the choice of notation).
Proof: The forward (if) part holds directly; hence, we only prove the converse (only-if) part.

Without loss of generality, we assume that $P_X(x) < 1$ for all $x\in\mathcal{X}$, since $P_X(x) = 1$ for some $x$ implies that $I(X;Y) = 0$. The problem of calculating the channel capacity is to maximize
$$I(X;Y) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x')P_{Y|X}(y|x')} \qquad (4.4.10)$$
subject to the condition
$$\sum_{x\in\mathcal{X}} P_X(x) = 1 \qquad (4.4.11)$$
for a given channel distribution $P_{Y|X}$. By using the Lagrange multiplier method (e.g., see [5]), maximizing (4.4.10) subject to (4.4.11) is equivalent to maximizing
$$f(P_X) \triangleq \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x')P_{Y|X}(y|x')} + \lambda\left(\sum_{x\in\mathcal{X}} P_X(x) - 1\right).$$
We then take the derivative of the above quantity with respect to $P_X(x'')$ and obtain⁸
$$\frac{\partial f(P_X)}{\partial P_X(x'')} = I(x'';Y) - \log_2(e) + \lambda.$$
By Property 2 of Lemma 2.46, $I(X;Y) = I(P_X, P_{Y|X})$ is a concave function in $P_X$ (for a fixed $P_{Y|X}$). Therefore, the maximum of $I(P_X, P_{Y|X})$ occurs at a zero derivative when $P_X(x)$ does not lie on the boundary, namely when $1 > P_X(x) > 0$. For those $P_X(x)$ lying on the boundary, i.e., $P_X(x) = 0$, the maximum occurs iff a displacement from the boundary into the interior decreases the quantity, which implies a non-positive derivative, namely
$$I(x;Y) \le -\lambda + \log_2(e) \quad\text{for those } x \text{ with } P_X(x) = 0.$$
To summarize, if an input distribution $P_X$ achieves the channel capacity, then for some $\lambda$,
$$I(x'';Y) = -\lambda + \log_2(e) \ \text{ for } P_X(x'') > 0, \quad\text{and}\quad I(x'';Y) \le -\lambda + \log_2(e) \ \text{ for } P_X(x'') = 0.$$
Setting $C = -\lambda + \log_2(e)$ yields (4.4.9). Finally, multiplying both sides of each relation in (4.4.9) by $P_X(x)$ and summing over $x$ yields $\max_{P_X} I(X;Y)$ on the left and the constant $C$ on the right, thus proving that the constant $C$ is indeed the channel capacity. □

⁸The details of the derivative computation are as follows:
$$\frac{\partial}{\partial P_X(x'')}\left[\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2 P_{Y|X}(y|x) - \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2\!\Big(\sum_{x'\in\mathcal{X}} P_X(x')P_{Y|X}(y|x')\Big) + \lambda\Big(\sum_{x\in\mathcal{X}} P_X(x) - 1\Big)\right]$$
$$= \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x'')\log_2 P_{Y|X}(y|x'') - \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x'')\log_2\!\Big(\sum_{x'\in\mathcal{X}} P_X(x')P_{Y|X}(y|x')\Big) - \log_2(e)\sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x)\frac{P_{Y|X}(y|x'')}{\sum_{x'\in\mathcal{X}} P_X(x')P_{Y|X}(y|x')} + \lambda$$
$$= I(x'';Y) - \log_2(e)\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x'') + \lambda = I(x'';Y) - \log_2(e) + \lambda.$$
Example 4.22 (Quasi-symmetric channels.) For a quasi-symmetric chan-
nel, one can directly verify that the uniform input distribution satisfies the
KKT condition of Lemma 4.21 and yields that the channel capacity is given
by (4.4.6); this is left as an exercise. As we already saw, the BSC, the q-ary
symmetric channel, the BEC and the BSEC are all quasi-symmetric.
Example 4.23 Consider a DMC with a ternary input alphabet $\mathcal{X} = \{0,1,2\}$, binary output alphabet $\mathcal{Y} = \{0,1\}$ and the following transition matrix
$$Q = \begin{bmatrix} 1 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 1 \end{bmatrix}.$$
This channel is not quasi-symmetric. However, one may guess that the capacity of this channel is achieved by the input distribution $(P_X(0), P_X(1), P_X(2)) = \left(\tfrac{1}{2}, 0, \tfrac{1}{2}\right)$ since the input $x=1$ has an equal conditional probability of being received as 0 or 1 at the output. Under this input distribution, we obtain that $I(x=0;Y) = I(x=2;Y) = 1$ and that $I(x=1;Y) = 0$. Thus the KKT condition of (4.4.9) is satisfied; hence confirming that the above input distribution achieves channel capacity and that channel capacity is equal to 1 bit.
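As a quick numerical sanity check of this example (a minimal sketch added here, not part of the original notes; the variable names are illustrative only), the following Python snippet evaluates $I(x;Y)$ for each input symbol under the guessed input distribution and verifies the KKT condition of Lemma 4.21.

```python
import numpy as np

# Channel of Example 4.23: rows are P_{Y|X}(.|x) for x = 0, 1, 2.
Q = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
p_x = np.array([0.5, 0.0, 0.5])   # candidate capacity-achieving input distribution
p_y = p_x @ Q                      # induced output distribution P_Y

def I_sym(x):
    """Mutual information I(x;Y) (in bits) for a specific input symbol x."""
    mask = Q[x] > 0                # skip terms with P_{Y|X}(y|x) = 0
    return float(np.sum(Q[x, mask] * np.log2(Q[x, mask] / p_y[mask])))

info = [I_sym(x) for x in range(3)]
print(info)                        # [1.0, 0.0, 1.0]
# KKT check: I(x;Y) = 1 for the symbols with p_x > 0, and I(1;Y) = 0 <= 1,
# so C = 1 bit, in agreement with the example.
```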
Observation 4.24 (Capacity achieved by a uniform input distribution.) We close this chapter by noting that there is a class of DMC's that is larger than that of quasi-symmetric channels for which the uniform input distribution achieves capacity. It concerns the class of so-called "T-symmetric" channels [36, Section V, Definition 1] for which
$$T(x) \triangleq I(x;Y) - \log_2|\mathcal{X}| = \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x) \log_2 \frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_{Y|X}(y|x')}$$
is a constant function of $x$ (i.e., independent of $x$), where $I(x;Y)$ is the mutual information for input $x$ under a uniform input distribution. Indeed the T-symmetry condition is equivalent to the property of having the uniform input distribution achieve capacity. This directly follows from the KKT condition of Lemma 4.21. An example of a T-symmetric channel that is not quasi-symmetric is the binary-input ternary-output channel with the following transition matrix
$$Q = \begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{2}{3} \end{bmatrix}.$$
Hence its capacity is achieved by the uniform input distribution. See [36, Fig. 2]
for (infinitely-many) other examples of T-symmetric channels. However, unlike
quasi-symmetric channels, T-symmetric channels do not admit in general a sim-
ple closed-form expression for their capacity (such as the one given in (4.4.6)).
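To make the T-symmetry condition concrete, the following Python sketch (an added illustration, relying on the statement above that the uniform input achieves capacity for this channel) evaluates $T(x)$ for the matrix above and recovers the capacity as $T(x) + \log_2|\mathcal{X}|$.

```python
import numpy as np

# Binary-input, ternary-output channel from Observation 4.24.
Q = np.array([[1/3, 1/3, 1/3],
              [1/6, 1/6, 2/3]])
col = Q.sum(axis=0)                # sum over x' of P_{Y|X}(y|x') for each output y

def T(x):
    """T(x) = I(x;Y) - log2|X| evaluated under the uniform input distribution."""
    mask = Q[x] > 0
    return float(np.sum(Q[x, mask] * np.log2(Q[x, mask] / col[mask])))

print(T(0), T(1))                  # both approximately -0.9183: the channel is T-symmetric
print(T(0) + np.log2(Q.shape[0]))  # capacity of roughly 0.0817 bits under the uniform input
```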
Chapter 5
Differential Entropy and Gaussian
Channels
We have so far examined information measures and their operational characterization for discrete-time discrete-alphabet systems. In this chapter, we turn our
focus to continuous-alphabet (real-valued) systems. Except for a brief interlude
with the continuous-time (waveform) Gaussian channel, we consider discrete-
time systems, as treated throughout the book.
We first recall that a real-valued (continuous) random variable $X$ is described by its cumulative distribution function (cdf)
$$F_X(x) \triangleq \Pr[X \le x]$$
for $x \in \mathbb{R}$, the set of real numbers. The distribution of $X$ is called absolutely continuous (with respect to the Lebesgue measure) if a probability density function (pdf) $f_X(\cdot)$ exists such that
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt,$$
where $f_X(t) \ge 0$ for all $t$ and $\int_{-\infty}^{+\infty} f_X(t)\,dt = 1$. If $F_X(\cdot)$ is differentiable everywhere, then the pdf $f_X(\cdot)$ exists and is given by the derivative of $F_X(\cdot)$: $f_X(t) = \frac{dF_X(t)}{dt}$.
The support of a random variable $X$ with pdf $f_X(\cdot)$ is denoted by $S_X$ and defined as
$$S_X = \{x \in \mathbb{R} : f_X(x) > 0\}.$$
We will deal with random variables that admit a pdf.$^1$
1A rigorous (measure-theoretic) study for general continuous systems, initiated by Kol-
mogorov [25], can be found in [34, 22].
5.1 Differential entropy
Recall that the definition of entropy for a discrete random variable $X$ representing a DMS is
$$H(X) \triangleq -\sum_{x\in\mathcal{X}} P_X(x)\log_2 P_X(x) \quad \text{(in bits)}.$$
As already seen in Shannon’s source coding theorem, this quantity is the mini-
mum average code rate achievable for the lossless compression of the DMS. But if
the random variable takes on values in a continuum, the minimum number of bits
per symbol needed to losslessly describe it must be infinite. This is illustrated
in the following example, where we take a discrete approximation (quantization)
of a random variable uniformly distributed on the unit interval and study the
entropy of the quantized random variable as the quantization becomes finer and
finer.
Example 5.1 Consider a real-valued random variable $X$ that is uniformly distributed on the unit interval, i.e., with pdf given by
$$f_X(x) = \begin{cases} 1 & \text{if } x \in [0,1);\\ 0 & \text{otherwise.} \end{cases}$$
Given a positive integer $m$, we can discretize $X$ by uniformly quantizing it into $m$ levels by partitioning the support of $X$ into equal-length segments of size $\Delta = \frac{1}{m}$ ($\Delta$ is called the quantization step-size) such that:
$$q_m(X) = \frac{i}{m}, \quad \text{if } \frac{i-1}{m} \le X < \frac{i}{m},$$
for $1 \le i \le m$. Then the entropy of the quantized random variable $q_m(X)$ is given by
$$H(q_m(X)) = -\sum_{i=1}^{m} \frac{1}{m}\log_2\frac{1}{m} = \log_2 m \quad \text{(in bits)}.$$
Since the entropy $H(q_m(X))$ of the quantized version of $X$ is a lower bound to the entropy of $X$ (as $q_m(X)$ is a function of $X$) and satisfies in the limit
$$\lim_{m\to\infty} H(q_m(X)) = \lim_{m\to\infty}\log_2 m = \infty,$$
we obtain that the entropy of $X$ is infinite.
The above example indicates that to compress a continuous source without incurring any loss or distortion indeed requires an infinite number of bits.$^2$ Thus, when studying continuous sources, the entropy measure is limited in its effectiveness and the introduction of a new measure is necessary. Such a new measure is indeed obtained upon close examination of the entropy of a uniformly quantized real-valued random variable minus the quantization accuracy, as the accuracy increases without bound.
Lemma 5.2 Consider a real-valued random variable $X$ with pdf $f_X$ such that $f_X\log_2 f_X$ is integrable.$^3$ Then a uniform quantization of $X$ with an $n$-bit accuracy (i.e., with a quantization step-size of $\Delta = 2^{-n}$) yields an entropy approximately equal to $-\int f_X(x)\log_2 f_X(x)\,dx + n$ bits for $n$ sufficiently large. In other words,
$$\lim_{n\to\infty}\left[H(q_n(X)) - n\right] = -\int f_X(x)\log_2 f_X(x)\,dx,$$
where $q_n(X)$ is the uniformly quantized version of $X$ with quantization step-size $\Delta = 2^{-n}$.
Proof: We assume without loss of generality that the support of $X$ is given by the entire real line.
Step 1: Mean-value theorem. Let $\Delta = 2^{-n}$ be the quantization step-size and let $t_i = i\Delta$ for integer $i \in (-\infty,\infty)$. From the mean-value theorem (e.g., cf. [32]), we can choose $x_i \in [t_{i-1}, t_i]$ such that
$$p_i \triangleq \int_{t_{i-1}}^{t_i} f_X(x)\,dx = f_X(x_i)(t_i - t_{i-1}) = \Delta\cdot f_X(x_i).$$
$^2$In fact, all continuous random variables (including those not admitting a pdf) have infinite entropy. We sketch the proof as follows. For any continuous random variable $X$, there must exist a non-empty open interval for which the cdf $F_X(\cdot)$ is strictly increasing. Now quantize the source into $m+1$ levels as follows:
• Assign one level to the complement of this open interval.
• Assign $m$ levels to this open interval such that the total probability mass on this interval, denoted by $a$, is equally distributed among these $m$ levels.
Then the entropy of $X$ is lower bounded by
$$H(q(X)) = -(1-a)\cdot\log(1-a) - a\cdot\log\frac{a}{m},$$
where $q(X)$ is the quantized version of $X$. The lower bound $H(q(X)) \to \infty$ as $m \to \infty$.
$^3$By integrability, we mean the usual Riemann integrability (e.g., see [37]).
Step 2: Definition of $h^{(n)}(X)$. Let
$$h^{(n)}(X) \triangleq \sum_{i=-\infty}^{\infty}\left[-f_X(x_i)\log_2 f_X(x_i)\right]2^{-n}.$$
Since $f_X(x)\log_2 f_X(x)$ is integrable,
$$h^{(n)}(X) \to -\int f_X(x)\log_2 f_X(x)\,dx \quad \text{as } n \to \infty.$$
Therefore, given any $\varepsilon > 0$, there exists $N$ such that for all $n > N$,
$$\left|-\int f_X(x)\log_2 f_X(x)\,dx - h^{(n)}(X)\right| < \varepsilon.$$
Step 3: Computation of $H(q_n(X))$. The entropy of the (uniformly) quantized version of $X$, $q_n(X)$, is given by
$$H(q_n(X)) = -\sum_{i=-\infty}^{\infty} p_i\log_2 p_i = -\sum_{i=-\infty}^{\infty}(f_X(x_i)\Delta)\log_2(f_X(x_i)\Delta) = -\sum_{i=-\infty}^{\infty}(f_X(x_i)2^{-n})\log_2(f_X(x_i)2^{-n}).$$
Step 4: $H(q_n(X)) = h^{(n)}(X) + n$. From Steps 2 and 3,
$$H(q_n(X)) - h^{(n)}(X) = -\sum_{i=-\infty}^{\infty}\left[f_X(x_i)2^{-n}\right]\log_2(2^{-n}) = n\sum_{i=-\infty}^{\infty}\int_{t_{i-1}}^{t_i} f_X(x)\,dx = n\int_{-\infty}^{\infty} f_X(x)\,dx = n.$$
Hence, we have that for $n > N$,
$$-\int f_X(x)\log_2 f_X(x)\,dx + n - \varepsilon < H(q_n(X)) = h^{(n)}(X) + n < -\int f_X(x)\log_2 f_X(x)\,dx + n + \varepsilon,$$
yielding that
$$\lim_{n\to\infty}\left[H(q_n(X)) - n\right] = -\int f_X(x)\log_2 f_X(x)\,dx. \qquad \Box$$
In light of the above result, we can define the following information measure.
Definition 5.3 (Differential entropy) The differential entropy (in bits) of a continuous random variable $X$ with pdf $f_X$ and support $S_X$ is defined as
$$h(X) \triangleq -\int_{S_X} f_X(x)\cdot\log_2 f_X(x)\,dx = E[-\log_2 f_X(X)],$$
when the integral exists.
Thus the differential entropy $h(X)$ of a real-valued random variable $X$ has an operational meaning in the following sense. Since $H(q_n(X))$ is the minimum average number of bits needed to losslessly describe $q_n(X)$, we thus obtain that approximately $h(X) + n$ bits are needed to describe $X$ when uniformly quantizing it with an $n$-bit accuracy. Therefore, we may conclude that the larger $h(X)$ is, the
larger is the average number of bits required to describe a uniformly quantized
Xwithin a fixed accuracy.
Example 5.4 A continuous random variable $X$ with support $S_X = [0,1)$ and pdf $f_X(x) = 2x$ for $x \in S_X$ has differential entropy equal to
$$-\int_0^1 2x\cdot\log_2(2x)\,dx = \left.\frac{x^2\left(\log_2 e - 2\log_2(2x)\right)}{2}\right|_0^1 = \frac{1}{2\ln 2} - \log_2(2) = -0.278652 \text{ bits}.$$
We herein illustrate Lemma 5.2 by uniformly quantizing $X$ to an $n$-bit accuracy and computing the entropy $H(q_n(X))$ and $H(q_n(X)) - n$ for increasing values of $n$, where $q_n(X)$ is the quantized version of $X$.
We have that $q_n(X)$ is given by
$$q_n(X) = \frac{i}{2^n}, \quad \text{if } \frac{i-1}{2^n} \le X < \frac{i}{2^n},$$
for $1 \le i \le 2^n$. Hence,
$$\Pr\left[q_n(X) = \frac{i}{2^n}\right] = \frac{(2i-1)}{2^{2n}},$$
  n    H(q_n(X))        H(q_n(X)) - n
  1    0.811278 bits    -0.188722 bits
  2    1.748999 bits    -0.251000 bits
  3    2.729560 bits    -0.270440 bits
  4    3.723726 bits    -0.276275 bits
  5    4.722023 bits    -0.277977 bits
  6    5.721537 bits    -0.278463 bits
  7    6.721399 bits    -0.278600 bits
  8    7.721361 bits    -0.278638 bits
  9    8.721351 bits    -0.278648 bits
Table 5.1: Quantized random variable q_n(X) under an n-bit accuracy: H(q_n(X)) and H(q_n(X)) - n versus n.
which yields
$$H(q_n(X)) = -\sum_{i=1}^{2^n}\frac{2i-1}{2^{2n}}\log_2\frac{2i-1}{2^{2n}} = -\left[\frac{1}{2^{2n}}\sum_{i=1}^{2^n}(2i-1)\log_2(2i-1)\right] + 2\log_2(2^n).$$
As shown in Table 5.1, we indeed observe that as $n$ increases, $H(q_n(X))$ tends to infinity while $H(q_n(X)) - n$ converges to $h(X) = -0.278652$ bits.
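The entries of Table 5.1 can be reproduced directly from the formula above. The short Python sketch below (added for illustration; it is not part of the original notes) computes $H(q_n(X))$ and $H(q_n(X)) - n$ for this example and shows the convergence to $h(X) \approx -0.2787$ bits.

```python
import numpy as np

def H_qn(n):
    """Entropy (in bits) of the n-bit uniform quantization of X with pdf f(x) = 2x on [0,1)."""
    i = np.arange(1, 2**n + 1)
    p = (2 * i - 1) / 2.0**(2 * n)      # Pr[q_n(X) = i / 2^n]
    return float(-np.sum(p * np.log2(p)))

for n in range(1, 10):
    H = H_qn(n)
    print(n, round(H, 6), round(H - n, 6))
# The last column approaches h(X) = 1/(2*ln 2) - 1, i.e. about -0.278652 bits.
```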
Thus a continuous random variable $X$ contains an infinite amount of information; but we can measure the information contained in its $n$-bit quantized version $q_n(X)$ as: $H(q_n(X)) \approx h(X) + n$ (for $n$ large enough).
Example 5.5 Let us determine the minimum average number of bits required to describe the uniform quantization with 3-digit accuracy of the decay time (in years) of a radium atom, assuming that the half-life of the radium (i.e., the median of the decay time) is 80 years and that its pdf is given by $f_X(x) = \lambda e^{-\lambda x}$, where $x > 0$.
Since the median of the decay time is 80, we obtain:
$$\int_0^{80}\lambda e^{-\lambda x}\,dx = 0.5,$$
which implies that $\lambda = 0.00866$. Also, 3-digit accuracy is approximately equivalent to $\log_2 999 = 9.96 \approx 10$ bits of accuracy. Therefore, by Lemma 5.2, the number of bits required to describe the quantized decay time is approximately
$$h(X) + 10 = \log_2\frac{e}{\lambda} + 10 = 18.29 \text{ bits}.$$
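The arithmetic of this example is easy to verify numerically; the following minimal Python sketch (an added illustration, not from the original notes) recovers $\lambda$ from the half-life and evaluates $h(X) + 10$.

```python
import math

half_life = 80.0
lam = math.log(2) / half_life             # solves 1 - exp(-80*lam) = 0.5, giving lam of about 0.00866
h_X = math.log2(math.e / lam)             # differential entropy of an exponential pdf, in bits
print(round(lam, 5), round(h_X + 10, 2))  # -> 0.00866 and about 18.29 bits
```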
We close this section by computing the differential entropy for two common
real-valued random variables: the uniformly distributed random variable and
the Gaussian distributed random variable.
Example 5.6 (Differential entropy of a uniformly distributed random
variable) Let Xbe a continuous random variable that is uniformly distributed
over the interval (a, b), where b > a; i.e., its pdf is given by
$$f_X(x) = \begin{cases} \frac{1}{b-a} & \text{if } x \in (a,b);\\ 0 & \text{otherwise.} \end{cases}$$
So its differential entropy is given by
$$h(X) = -\int_a^b \frac{1}{b-a}\log_2\frac{1}{b-a}\,dx = \log_2(b-a) \text{ bits}.$$
Note that if $(b-a) < 1$ in the above example, then $h(X)$ is negative, unlike
entropy. The above example indicates that although differential entropy has a
form analogous to entropy (in the sense that summation and pmf for entropy are
replaced by integration and pdf, respectively, for differential entropy), differen-
tial entropy does not retain all the properties of entropy (one such operational
difference was already highlighted in the previous lemma).
Example 5.7 (Differential entropy of a Gaussian random variable) Let $X \sim \mathcal{N}(\mu,\sigma^2)$; i.e., $X$ is a Gaussian (or normal) random variable with finite mean $\mu$, variance $\mathrm{Var}(X) = \sigma^2 > 0$ and pdf
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
for $x \in \mathbb{R}$. Then its differential entropy is given by
$$h(X) = -\int f_X(x)\left[-\frac{1}{2}\log_2(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}\log_2 e\right]dx = \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\log_2 e}{2\sigma^2}\,E[(X-\mu)^2] = \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{1}{2}\log_2 e = \frac{1}{2}\log_2(2\pi e\sigma^2) \text{ bits}. \qquad (5.1.1)$$
Note that for a Gaussian random variable, its differential entropy is only a function of its variance $\sigma^2$ (it is independent of its mean $\mu$).
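As a numerical illustration of (5.1.1) (a sketch added here, not part of the original notes; the sample size and parameters are arbitrary), one can estimate $h(X) = E[-\log_2 f_X(X)]$ by Monte Carlo sampling and compare it with the closed form $\frac{1}{2}\log_2(2\pi e\sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

# Monte Carlo estimate of h(X) = E[-log2 f_X(X)] for the Gaussian pdf.
log2_pdf = -0.5 * np.log2(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2) * np.log2(np.e)
h_mc = float(np.mean(-log2_pdf))

h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(round(h_mc, 3), round(h_closed, 3))   # the two values agree to a few decimals
```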
5.2 Joint and conditional differential entropies, divergence
and mutual information
Definition 5.8 (Joint differential entropy) If $X^n = (X_1, X_2, \cdots, X_n)$ is a continuous random vector of size $n$ (i.e., a vector of $n$ continuous random variables) with joint pdf $f_{X^n}$ and support $S_{X^n} \subseteq \mathbb{R}^n$, then its joint differential entropy is defined as
$$h(X^n) \triangleq -\int_{S_{X^n}} f_{X^n}(x_1, x_2, \cdots, x_n)\log_2 f_{X^n}(x_1, x_2, \cdots, x_n)\,dx_1\,dx_2\cdots dx_n = E[-\log_2 f_{X^n}(X^n)]$$
when the $n$-dimensional integral exists.
Definition 5.9 (Conditional differential entropy) Let $X$ and $Y$ be two jointly distributed continuous random variables with joint pdf $f_{X,Y}$ and support $S_{X,Y} \subseteq \mathbb{R}^2$ such that the conditional pdf of $Y$ given $X$, given by $f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}$, is well defined for all $(x,y) \in S_{X,Y}$, where $f_X$ is the marginal pdf of $X$. Then the conditional differential entropy of $Y$ given $X$ is defined as
$$h(Y|X) \triangleq -\int_{S_{X,Y}} f_{X,Y}(x,y)\log_2 f_{Y|X}(y|x)\,dx\,dy = E[-\log_2 f_{Y|X}(Y|X)],$$
when the integral exists.
Note that, as in the case of (discrete) entropy, the chain rule holds for differential entropy:
$$h(X,Y) = h(X) + h(Y|X) = h(Y) + h(X|Y).$$
Definition 5.10 (Divergence or relative entropy) Let $X$ and $Y$ be two continuous random variables with marginal pdfs $f_X$ and $f_Y$, respectively, such that their supports satisfy $S_X \subseteq S_Y \subseteq \mathbb{R}$. Then the divergence (or relative entropy or Kullback-Leibler distance) between $X$ and $Y$ is written as $D(X\|Y)$ or $D(f_X\|f_Y)$ and defined by
$$D(X\|Y) \triangleq \int_{S_X} f_X(x)\log_2\frac{f_X(x)}{f_Y(x)}\,dx = E\left[\log_2\frac{f_X(X)}{f_Y(X)}\right]$$
when the integral exists. The definition carries over similarly in the multivariate case: for $X^n = (X_1, X_2, \cdots, X_n)$ and $Y^n = (Y_1, Y_2, \cdots, Y_n)$ two random vectors with joint pdfs $f_{X^n}$ and $f_{Y^n}$, respectively, and supports satisfying $S_{X^n} \subseteq S_{Y^n} \subseteq \mathbb{R}^n$, the divergence between $X^n$ and $Y^n$ is defined as
$$D(X^n\|Y^n) \triangleq \int_{S_{X^n}} f_{X^n}(x_1, x_2, \cdots, x_n)\log_2\frac{f_{X^n}(x_1, x_2, \cdots, x_n)}{f_{Y^n}(x_1, x_2, \cdots, x_n)}\,dx_1\,dx_2\cdots dx_n$$
when the integral exists.
Definition 5.11 (Mutual information) Let $X$ and $Y$ be two jointly distributed continuous random variables with joint pdf $f_{X,Y}$ and support $S_{X,Y} \subseteq \mathbb{R}^2$; then the mutual information between $X$ and $Y$ is defined by
$$I(X;Y) \triangleq D(f_{X,Y}\|f_X f_Y) = \int_{S_{X,Y}} f_{X,Y}(x,y)\log_2\frac{f_{X,Y}(x,y)}{f_X(x)f_Y(y)}\,dx\,dy,$$
assuming the integral exists, where $f_X$ and $f_Y$ are the marginal pdfs of $X$ and $Y$, respectively.
Observation 5.12 For two jointly distributed continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$, support $S_{X,Y} \subseteq \mathbb{R}^2$ and joint differential entropy
$$h(X,Y) = -\int_{S_{X,Y}} f_{X,Y}(x,y)\log_2 f_{X,Y}(x,y)\,dx\,dy,$$
then as in Lemma 5.2 and the ensuing discussion, one can write
$$H(q_n(X), q_m(Y)) \approx h(X,Y) + n + m$$
for $n$ and $m$ sufficiently large, where $q_s(Z)$ denotes the (uniformly) quantized version of random variable $Z$ with an $s$-bit accuracy.
On the other hand, for the above continuous $X$ and $Y$,
$$I(q_n(X); q_m(Y)) = H(q_n(X)) + H(q_m(Y)) - H(q_n(X), q_m(Y)) \approx [h(X)+n] + [h(Y)+m] - [h(X,Y)+n+m] = h(X) + h(Y) - h(X,Y) = \int_{S_{X,Y}} f_{X,Y}(x,y)\log_2\frac{f_{X,Y}(x,y)}{f_X(x)f_Y(y)}\,dx\,dy$$
for $n$ and $m$ sufficiently large; in other words,
$$\lim_{n,m\to\infty} I(q_n(X); q_m(Y)) = h(X) + h(Y) - h(X,Y).$$
Furthermore, it can be shown that
$$\lim_{n\to\infty} D(q_n(X)\|q_n(Y)) = \int_{S_X} f_X(x)\log_2\frac{f_X(x)}{f_Y(x)}\,dx.$$
Thus mutual information and divergence can be considered as the true tools of Information Theory, as they retain the same operational characteristics and properties for both discrete and continuous probability spaces (as well as general spaces, where they can be defined in terms of Radon-Nikodym derivatives; e.g., cf. [22]).$^4$
$^4$This justifies using identical notations for both $I(\cdot;\cdot)$ and $D(\cdot\|\cdot)$ as opposed to the discerning notations of $H(\cdot)$ for entropy and $h(\cdot)$ for differential entropy.
The following lemma illustrates that for continuous systems, $I(\cdot;\cdot)$ and $D(\cdot\|\cdot)$ keep the same properties already encountered for discrete systems, while differential entropy (as already seen with its possibility of being negative) satisfies some different properties than entropy. The proof is left as an exercise.
Lemma 5.13 The following properties hold for the information measures of
continuous systems.
1. Non-negativity of divergence: Let $X$ and $Y$ be two continuous random variables with marginal pdfs $f_X$ and $f_Y$, respectively, such that their supports satisfy $S_X \subseteq S_Y \subseteq \mathbb{R}$. Then
$$D(f_X\|f_Y) \ge 0$$
with equality iff $f_X(x) = f_Y(x)$ for all $x \in S_X$ (i.e., $X = Y$ almost surely).
2. Non-negativity of mutual information: For any two continuous random variables $X$ and $Y$,
$$I(X;Y) \ge 0$$
with equality iff $X$ and $Y$ are independent.
3. Conditioning never increases differential entropy: For any two continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$ and well-defined conditional pdf $f_{X|Y}$,
$$h(X|Y) \le h(X)$$
with equality iff $X$ and $Y$ are independent.
4. Chain rule for differential entropy: For a continuous random vector $X^n = (X_1, X_2, \cdots, X_n)$,
$$h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i|X_1, X_2, \ldots, X_{i-1}),$$
where $h(X_i|X_1, X_2, \ldots, X_{i-1}) \triangleq h(X_1)$ for $i = 1$.
5. Chain rule for mutual information: For a continuous random vector $X^n = (X_1, X_2, \cdots, X_n)$ and random variable $Y$ with joint pdf $f_{X^n,Y}$ and well-defined conditional pdfs $f_{X_i,Y|X^{i-1}}$, $f_{X_i|X^{i-1}}$ and $f_{Y|X^{i-1}}$ for $i = 1, \cdots, n$, we have that
$$I(X_1, X_2, \cdots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y|X_{i-1}, \cdots, X_1),$$
where $I(X_i; Y|X_{i-1}, \cdots, X_1) \triangleq I(X_1;Y)$ for $i = 1$.
6. Data processing inequality: For continuous random variables $X$, $Y$ and $Z$ such that $X \to Y \to Z$,
$$I(X;Y) \ge I(X;Z).$$
7. Independence bound for differential entropy: For a continuous random vector $X^n = (X_1, X_2, \cdots, X_n)$,
$$h(X^n) \le \sum_{i=1}^{n} h(X_i)$$
with equality iff all the $X_i$'s are independent from each other.
8. Invariance of differential entropy under translation: For continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$ and well-defined conditional pdf $f_{X|Y}$,
$$h(X+c) = h(X) \quad \text{for any constant } c \in \mathbb{R},$$
and
$$h(X+Y|Y) = h(X|Y).$$
The results also generalize in the multivariate case: for two continuous random vectors $X^n = (X_1, X_2, \cdots, X_n)$ and $Y^n = (Y_1, Y_2, \cdots, Y_n)$ with joint pdf $f_{X^n,Y^n}$ and well-defined conditional pdf $f_{X^n|Y^n}$,
$$h(X^n + c^n) = h(X^n)$$
for any constant $n$-tuple $c^n = (c_1, c_2, \cdots, c_n) \in \mathbb{R}^n$, and
$$h(X^n + Y^n|Y^n) = h(X^n|Y^n),$$
where the addition of two $n$-tuples is performed component-wise.
9. Differential entropy under scaling: For any continuous random vari-
able Xand any non-zero real constant a,
h(aX) = h(X) + log2|a|.
10. Joint differential entropy under linear mapping: Consider the ran-
dom (column) vector X= (X1, X2,···, Xn)Twith joint pdf fXn, where T
denotes transposition, and let Y= (Y1, Y2,···, Yn)Tbe a random (column)
vector obtained from the linear transformation Y=AX, where Ais an
invertible (non-singular) n×nreal-valued matrix. Then
h(Y) = h(Y1, Y2,···, Yn) = h(X1, X2,···, Xn) + log2|det(A)|,
where det(A) is the determinant of the square matrix A.
11. Joint differential entropy under nonlinear mapping: Consider the random (column) vector $X = (X_1, X_2, \cdots, X_n)^T$ with joint pdf $f_{X^n}$, and let $Y = (Y_1, Y_2, \cdots, Y_n)^T$ be a random (column) vector obtained from the nonlinear transformation
$$Y = g(X) \triangleq (g_1(X), g_2(X), \cdots, g_n(X))^T,$$
where each $g_i: \mathbb{R}^n \to \mathbb{R}$ is a differentiable function, $i = 1, 2, \cdots, n$. Then
$$h(Y) = h(Y_1, Y_2, \cdots, Y_n) = h(X_1, \cdots, X_n) + \int_{\mathbb{R}^n} f_{X^n}(x_1, \cdots, x_n)\log_2|\det(J)|\,dx_1\cdots dx_n,$$
where $J$ is the $n \times n$ Jacobian matrix given by
$$J \triangleq \begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \frac{\partial g_1}{\partial x_2} & \cdots & \frac{\partial g_1}{\partial x_n}\\ \frac{\partial g_2}{\partial x_1} & \frac{\partial g_2}{\partial x_2} & \cdots & \frac{\partial g_2}{\partial x_n}\\ \vdots & \vdots & & \vdots\\ \frac{\partial g_n}{\partial x_1} & \frac{\partial g_n}{\partial x_2} & \cdots & \frac{\partial g_n}{\partial x_n} \end{bmatrix}.$$
Observation 5.14 Property 9 of the above Lemma indicates that for a continuous random variable $X$, $h(X) \neq h(aX)$ (except for the trivial case of $a = 1$) and hence differential entropy is not in general invariant under invertible maps. This is in contrast to entropy, which is always invariant under invertible maps: given a discrete random variable $X$ with alphabet $\mathcal{X}$,
$$H(f(X)) = H(X)$$
for all invertible maps $f: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y}$ is a discrete set; in particular,
$$H(aX) = H(X) \quad \text{for all non-zero reals } a.$$
On the other hand, for both discrete and continuous systems, mutual infor-
mation and divergence are invariant under invertible maps:
I(X;Y) = I(g(X); Y) = I(g(X); h(Y))
and
D(XkY) = D(g(X)kg(Y))
for all invertible maps gand hproperly defined on the alphabet/support of the
concerned random variables. This reinforces the notion that mutual information
and divergence constitute the true tools of Information Theory.
Definition 5.15 (Multivariate Gaussian) A continuous random vector $X = (X_1, X_2, \cdots, X_n)^T$ is called a size-$n$ (multivariate) Gaussian random vector with a finite mean vector $\mu \triangleq (\mu_1, \mu_2, \cdots, \mu_n)^T$, where $\mu_i \triangleq E[X_i] < \infty$ for $i = 1, 2, \cdots, n$, and an $n \times n$ invertible (real-valued) covariance matrix
$$K_X = [K_{i,j}] \triangleq E[(X-\mu)(X-\mu)^T] = \begin{bmatrix} \mathrm{Cov}(X_1,X_1) & \mathrm{Cov}(X_1,X_2) & \cdots & \mathrm{Cov}(X_1,X_n)\\ \mathrm{Cov}(X_2,X_1) & \mathrm{Cov}(X_2,X_2) & \cdots & \mathrm{Cov}(X_2,X_n)\\ \vdots & \vdots & & \vdots\\ \mathrm{Cov}(X_n,X_1) & \mathrm{Cov}(X_n,X_2) & \cdots & \mathrm{Cov}(X_n,X_n) \end{bmatrix},$$
where $K_{i,j} = \mathrm{Cov}(X_i,X_j) \triangleq E[(X_i-\mu_i)(X_j-\mu_j)]$ is the covariance$^5$ between $X_i$ and $X_j$ for $i,j = 1,2,\cdots,n$, if its joint pdf is given by the multivariate Gaussian pdf
$$f_{X^n}(x_1, x_2, \cdots, x_n) = \frac{1}{\sqrt{(2\pi)^n\det(K_X)}}\,e^{-\frac{1}{2}(x-\mu)^T K_X^{-1}(x-\mu)}$$
for any $(x_1, x_2, \cdots, x_n) \in \mathbb{R}^n$, where $x = (x_1, x_2, \cdots, x_n)^T$. As in the scalar case (i.e., for $n = 1$), we write $X \sim \mathcal{N}_n(\mu, K_X)$ to denote that $X$ is a size-$n$ Gaussian random vector with mean vector $\mu$ and covariance matrix $K_X$.
Observation 5.16 In light of the above definition, we make the following re-
marks.
1. Note that a covariance matrix Kis always symmetric (i.e., KT=K)
and positive-semidefinite.6But as we require KXto be invertible in the
definition of the multivariate Gaussian distribution above, we will hereafter
assume that the covariance matrix of Gaussian random vectors is positive-
definite (which is equivalent to having all the eigenvalues of KXpositive),
thus rendering the matrix invertible.
$^5$Note that the diagonal components of $K_X$ yield the variances of the different random variables: $K_{i,i} = \mathrm{Cov}(X_i,X_i) = \mathrm{Var}(X_i) = \sigma_{X_i}^2$, $i = 1, \cdots, n$.
$^6$An $n \times n$ real-valued symmetric matrix $K$ is positive-semidefinite (e.g., cf. [15]) if for every real-valued vector $x = (x_1, x_2, \cdots, x_n)^T$,
$$x^T K x = (x_1, \cdots, x_n)\,K\begin{bmatrix} x_1\\ \vdots\\ x_n \end{bmatrix} \ge 0.$$
Furthermore, the matrix is positive-definite if $x^T K x > 0$ for all real-valued vectors $x \neq 0$, where $0$ is the all-zero vector of size $n$.
2. If a random vector $X = (X_1, X_2, \cdots, X_n)^T$ has a diagonal covariance matrix $K_X$ (i.e., all the off-diagonal components of $K_X$ are zero: $K_{i,j} = 0$ for all $i \neq j$, $i,j = 1,\cdots,n$), then all its component random variables are uncorrelated but not necessarily independent. However, if $X$ is Gaussian and has a diagonal covariance matrix, then all its component random variables are independent from each other.
3. Any linear transformation of a Gaussian random vector yields another Gaussian random vector. Specifically, if $X \sim \mathcal{N}_n(\mu, K_X)$ is a size-$n$ Gaussian random vector with mean vector $\mu$ and covariance matrix $K_X$, and if $Y = A_{mn}X$, where $A_{mn}$ is a given $m \times n$ real-valued matrix, then
$$Y \sim \mathcal{N}_m(A_{mn}\mu,\, A_{mn}K_X A_{mn}^T)$$
is a size-$m$ Gaussian random vector with mean vector $A_{mn}\mu$ and covariance matrix $A_{mn}K_X A_{mn}^T$.
More generally, any affine transformation of a Gaussian random vector yields another Gaussian random vector: if $X \sim \mathcal{N}_n(\mu, K_X)$ and $Y = A_{mn}X + b_m$, where $A_{mn}$ is an $m \times n$ real-valued matrix and $b_m$ is a size-$m$ real-valued vector, then
$$Y \sim \mathcal{N}_m(A_{mn}\mu + b_m,\, A_{mn}K_X A_{mn}^T).$$
Theorem 5.17 (Joint differential entropy of the multivariate Gaussian) If $X \sim \mathcal{N}_n(\mu, K_X)$ is a Gaussian random vector with mean vector $\mu$ and (positive-definite) covariance matrix $K_X$, then its joint differential entropy is given by
$$h(X) = h(X_1, X_2, \cdots, X_n) = \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right]. \qquad (5.2.1)$$
In particular, in the univariate case of $n = 1$, (5.2.1) reduces to (5.1.1).
Proof: Without loss of generality, we assume that $X$ has a zero mean vector since its differential entropy is invariant under translation by Property 8 of Lemma 5.13:
$$h(X) = h(X - \mu);$$
so we assume that $\mu = 0$.
Since the covariance matrix $K_X$ is a real-valued symmetric matrix, it is orthogonally diagonalizable; i.e., there exists a square ($n \times n$) orthogonal matrix $A$ (i.e., satisfying $A^T = A^{-1}$) such that $AK_XA^T$ is a diagonal matrix whose entries are given by the eigenvalues of $K_X$ ($A$ is constructed using the eigenvectors of $K_X$; e.g., see [15]). As a result, the linear transformation
$Y = AX \sim \mathcal{N}_n(0, AK_XA^T)$ is a Gaussian vector with the diagonal covariance matrix $K_Y = AK_XA^T$ and has therefore independent components (as noted in Observation 5.16). Thus
$$h(Y) = h(Y_1, Y_2, \cdots, Y_n)$$
$$= h(Y_1) + h(Y_2) + \cdots + h(Y_n) \qquad (5.2.2)$$
$$= \sum_{i=1}^{n}\frac{1}{2}\log_2\left[2\pi e\,\mathrm{Var}(Y_i)\right] \qquad (5.2.3)$$
$$= \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\prod_{i=1}^{n}\mathrm{Var}(Y_i)\right]$$
$$= \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\det(K_Y)\right] \qquad (5.2.4)$$
$$= \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_X)\right] \qquad (5.2.5)$$
$$= \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right], \qquad (5.2.6)$$
where (5.2.2) follows by the independence of the random variables $Y_1, \ldots, Y_n$ (e.g., see Property 7 of Lemma 5.13), (5.2.3) follows from (5.1.1), (5.2.4) holds since the matrix $K_Y$ is diagonal and hence its determinant is given by the product of its diagonal entries, and (5.2.5) holds since
$$\det(K_Y) = \det(AK_XA^T) = \det(A)\det(K_X)\det(A^T) = \det(A)^2\det(K_X) = \det(K_X),$$
where the last equality holds since $\det(A)^2 = 1$, as the matrix $A$ is orthogonal ($A^T = A^{-1} \Rightarrow \det(A) = \det(A^T) = 1/\det(A)$; thus, $\det(A)^2 = 1$).
Now invoking Property 10 of Lemma 5.13 and noting that $|\det(A)| = 1$ yield that
$$h(Y_1, Y_2, \cdots, Y_n) = h(X_1, X_2, \cdots, X_n) + \underbrace{\log_2|\det(A)|}_{=0} = h(X_1, X_2, \cdots, X_n).$$
We therefore obtain using (5.2.6) that
$$h(X_1, X_2, \cdots, X_n) = \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right],$$
hence completing the proof.
An alternate (but rather mechanical) proof to the one presented above consists of directly evaluating the joint differential entropy of $X$ by integrating $-f_{X^n}(x^n)\log_2 f_{X^n}(x^n)$ over $\mathbb{R}^n$; it is left as an exercise. $\Box$
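The whitening argument used in this proof is easy to reproduce numerically. The Python sketch below (an added illustration; the particular covariance matrix is an arbitrary choice) sums the scalar Gaussian entropies of the decorrelated components, obtained from the eigenvalues of $K_X$, and checks that the result matches $\frac{1}{2}\log_2[(2\pi e)^n\det(K_X)]$.

```python
import numpy as np

K = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.5, 0.5],
              [0.3, 0.5, 1.0]])          # an example positive-definite covariance matrix

n = K.shape[0]
closed_form = 0.5 * np.log2((2 * np.pi * np.e)**n * np.linalg.det(K))

# Eigenvalues of K are the variances of the decorrelated (whitened) components.
eigvals = np.linalg.eigvalsh(K)
via_whitening = float(np.sum(0.5 * np.log2(2 * np.pi * np.e * eigvals)))

print(round(closed_form, 6), round(via_whitening, 6))   # the two values coincide
```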
Corollary 5.18 (Hadamard's inequality) For any real-valued $n \times n$ positive-definite matrix $K = [K_{i,j}]_{i,j=1,\cdots,n}$,
$$\det(K) \le \prod_{i=1}^{n} K_{i,i}$$
with equality iff $K$ is a diagonal matrix, where $K_{i,i}$ are the diagonal entries of $K$.
Proof: Since every positive-definite matrix is a covariance matrix (e.g., see [20]), let $X = (X_1, X_2, \cdots, X_n)^T \sim \mathcal{N}_n(0, K)$ be a jointly Gaussian random vector with zero mean vector and covariance matrix $K$. Then
$$\frac{1}{2}\log_2\left[(2\pi e)^n\det(K)\right] = h(X_1, X_2, \cdots, X_n) \qquad (5.2.7)$$
$$\le \sum_{i=1}^{n} h(X_i) \qquad (5.2.8)$$
$$= \sum_{i=1}^{n}\frac{1}{2}\log_2\left[2\pi e\,\mathrm{Var}(X_i)\right] \qquad (5.2.9)$$
$$= \frac{1}{2}\log_2\left[(2\pi e)^n\prod_{i=1}^{n} K_{i,i}\right], \qquad (5.2.10)$$
where (5.2.7) follows from Theorem 5.17, (5.2.8) follows from Property 7 of Lemma 5.13 and (5.2.9)-(5.2.10) hold using (5.1.1) along with the fact that each random variable $X_i \sim \mathcal{N}(0, K_{i,i})$ is Gaussian with zero mean and variance $\mathrm{Var}(X_i) = K_{i,i}$ for $i = 1, 2, \cdots, n$ (as the marginals of a multivariate Gaussian are also Gaussian (e.g., cf. [20])).
Finally, from (5.2.10), we directly obtain that
$$\det(K) \le \prod_{i=1}^{n} K_{i,i},$$
with equality iff the jointly Gaussian random variables $X_1, X_2, \ldots, X_n$ are independent from each other, or equivalently iff the covariance matrix $K$ is diagonal. $\Box$
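Hadamard's inequality is also easy to test empirically; the following short Python sketch (added for illustration, using randomly generated matrices) checks it on positive-definite matrices built as $K = BB^T + I$.

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    B = rng.standard_normal((4, 4))
    K = B @ B.T + np.eye(4)                 # positive-definite by construction
    lhs = np.linalg.det(K)
    rhs = np.prod(np.diag(K))
    print(round(lhs, 4), "<=", round(rhs, 4), bool(lhs <= rhs + 1e-9))
```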
The next theorem states that among all real-valued size-nrandom vectors (of
support Rn) with identical mean vector and covariance matrix, the Gaussian
random vector has the largest differential entropy.
Theorem 5.19 (Maximal differential entropy for real-valued random vectors) Let $X = (X_1, X_2, \cdots, X_n)^T$ be a real-valued random vector with support $S_{X^n} = \mathbb{R}^n$, mean vector $\mu$ and covariance matrix $K_X$. Then
$$h(X_1, X_2, \cdots, X_n) \le \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right], \qquad (5.2.11)$$
with equality iff $X$ is Gaussian; i.e., $X \sim \mathcal{N}_n(\mu, K_X)$.
Proof: We will present the proof in two parts: the scalar or univariate case, and the multivariate case.
(i) Scalar case ($n = 1$): For a real-valued random variable with support $S_X = \mathbb{R}$, mean $\mu$ and variance $\sigma^2$, let us show that
$$h(X) \le \frac{1}{2}\log_2\left(2\pi e\sigma^2\right), \qquad (5.2.12)$$
with equality iff $X \sim \mathcal{N}(\mu, \sigma^2)$.
For a Gaussian random variable $Y \sim \mathcal{N}(\mu, \sigma^2)$, using the non-negativity of divergence, we can write
$$0 \le D(X\|Y) = \int_{\mathbb{R}} f_X(x)\log_2\frac{f_X(x)}{\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}}\,dx$$
$$= -h(X) + \int_{\mathbb{R}} f_X(x)\left[\frac{1}{2}\log_2(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\log_2 e\right]dx$$
$$= -h(X) + \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\log_2 e}{2\sigma^2}\underbrace{\int_{\mathbb{R}}(x-\mu)^2 f_X(x)\,dx}_{=\sigma^2}$$
$$= -h(X) + \frac{1}{2}\log_2\left(2\pi e\sigma^2\right).$$
Thus
$$h(X) \le \frac{1}{2}\log_2\left(2\pi e\sigma^2\right),$$
with equality iff $X = Y$ (almost surely); i.e., $X \sim \mathcal{N}(\mu, \sigma^2)$.
(ii) Multivariate case ($n > 1$): As in the proof of Theorem 5.17, we can use an orthogonal square matrix $A$ (i.e., satisfying $A^T = A^{-1}$ and hence $|\det(A)| = 1$) such that $AK_XA^T$ is diagonal. Therefore, the random vector generated by the linear map
$$Z = AX$$
will have a covariance matrix given by $K_Z = AK_XA^T$ and hence have uncorrelated (but not necessarily independent) components. Thus
$$h(X) = h(Z) - \underbrace{\log_2|\det(A)|}_{=0} \qquad (5.2.13)$$
$$= h(Z_1, Z_2, \cdots, Z_n)$$
$$\le \sum_{i=1}^{n} h(Z_i) \qquad (5.2.14)$$
$$\le \sum_{i=1}^{n}\frac{1}{2}\log_2\left[2\pi e\,\mathrm{Var}(Z_i)\right] \qquad (5.2.15)$$
$$= \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\prod_{i=1}^{n}\mathrm{Var}(Z_i)\right]$$
$$= \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_Z)\right] \qquad (5.2.16)$$
$$= \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_X)\right] \qquad (5.2.17)$$
$$= \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right],$$
where (5.2.13) holds by Property 10 of Lemma 5.13 and since $|\det(A)| = 1$, (5.2.14) follows from Property 7 of Lemma 5.13, (5.2.15) follows from (5.2.12) (the scalar case above), (5.2.16) holds since $K_Z$ is diagonal, and (5.2.17) follows from the fact that $\det(K_Z) = \det(K_X)$ (as $A$ is orthogonal). Finally, equality is achieved in both (5.2.14) and (5.2.15) iff the random variables $Z_1, Z_2, \ldots, Z_n$ are Gaussian and independent from each other, or equivalently iff $X \sim \mathcal{N}_n(\mu, K_X)$. $\Box$
Observation 5.20 The following two results can also be shown (the proof is
left as an exercise):
1. Among all continuous random variables admitting a pdf with support the
interval (a, b), where b > a are real numbers, the uniformly distributed
random variable maximizes differential entropy.
2. Among all continuous random variables admitting a pdf with support the interval $[0,\infty)$ and finite mean $\mu$, the exponential distribution with parameter (or rate parameter) $\lambda = 1/\mu$ maximizes differential entropy.
A systematic approach to finding distributions that maximize differential entropy
subject to various support and moments constraints can be found in [12, 45].
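To illustrate the maximum-entropy property of the Gaussian stated in Theorem 5.19 (a small added numerical sketch, not from the original notes), one can compare the differential entropy of a uniform random variable with the Gaussian bound $\frac{1}{2}\log_2(2\pi e\sigma^2)$ evaluated at the same variance.

```python
import math

a, b = 0.0, 1.0                       # uniform distribution on (a, b)
h_uniform = math.log2(b - a)          # = 0 bits, from Example 5.6
var = (b - a)**2 / 12.0               # variance of the uniform distribution
gaussian_bound = 0.5 * math.log2(2 * math.pi * math.e * var)

print(h_uniform, round(gaussian_bound, 4))   # 0.0 <= 0.2546: the Gaussian bound dominates
```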
5.3 AEP for continuous memoryless sources
The AEP theorem and its consequence for discrete memoryless (i.i.d.) sources
reveal to us that the number of elements in the typical set is approximately
$2^{nH(X)}$, where H(X) is the source entropy, and that the typical set carries al-
most all the probability mass asymptotically (see Theorems 3.3 and 3.4). An
extension of this result from discrete to continuous memoryless sources by just
counting the number of elements in a continuous (typical) set defined via a law-
of-large-numbers argument is not possible, since the total number of elements
in a continuous set is infinite. However, when considering the volume of that
continuous typical set (which is a natural analog to the size of a discrete set),
such an extension, with differential entropy playing a similar role as entropy,
becomes straightforward.
Theorem 5.21 (AEP for continuous memoryless sources) Let $\{X_i\}_{i=1}^{\infty}$ be a continuous memoryless source (i.e., an infinite sequence of continuous i.i.d. random variables) with pdf $f_X(\cdot)$ and differential entropy $h(X)$. Then
$$-\frac{1}{n}\log_2 f_{X^n}(X_1, \ldots, X_n) \to E[-\log_2 f_X(X)] = h(X) \quad \text{in probability}.$$
Proof: The proof is an immediate result of the law of large numbers (e.g., see Theorem 3.3). $\Box$
Definition 5.22 (Typical set) For $\delta > 0$ and any $n$ given, define the typical set for the above continuous source as
$$\mathcal{F}_n(\delta) \triangleq \left\{x^n \in \mathbb{R}^n : \left|-\frac{1}{n}\log_2 f_{X^n}(x_1, \ldots, x_n) - h(X)\right| < \delta\right\}.$$
Definition 5.23 (Volume) The volume of a set $\mathcal{A} \subseteq \mathbb{R}^n$ is defined as
$$\mathrm{Vol}(\mathcal{A}) \triangleq \int_{\mathcal{A}} dx_1\cdots dx_n.$$
Theorem 5.24 (Consequence of the AEP for continuous memoryless sources) For a continuous memoryless source $\{X_i\}_{i=1}^{\infty}$ with differential entropy $h(X)$, the following hold.
1. For $n$ sufficiently large, $P_{X^n}\{\mathcal{F}_n(\delta)\} > 1 - \delta$.
2. $\mathrm{Vol}(\mathcal{F}_n(\delta)) \le 2^{n(h(X)+\delta)}$ for all $n$.
3. $\mathrm{Vol}(\mathcal{F}_n(\delta)) \ge (1-\delta)\,2^{n(h(X)-\delta)}$ for $n$ sufficiently large.
Proof: The proof is quite analogous to the corresponding theorem for discrete memoryless sources (Theorem 3.4) and is left as an exercise. $\Box$
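A quick simulation makes the AEP statement tangible. The Python sketch below (an added illustration; the Gaussian source and block lengths are arbitrary choices) draws i.i.d. blocks and shows that $-\frac{1}{n}\log_2 f_{X^n}(X^n)$ concentrates around $h(X) = \frac{1}{2}\log_2(2\pi e\sigma^2)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 1.5
h_X = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # differential entropy of the source

for n in (10, 100, 1000, 10000):
    x = rng.normal(0.0, sigma, size=n)
    # -(1/n) log2 of the joint pdf of the i.i.d. block (average of the marginal log-pdfs)
    log2_pdf = -0.5 * np.log2(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2) * np.log2(np.e)
    emp = float(-np.mean(log2_pdf))
    print(n, round(emp, 4), "vs h(X) =", round(h_X, 4))
```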
5.4 Capacity of the discrete-time memoryless Gaussian
channel
Appendix A
Overview on Suprema and Limits
We herein review basic results on suprema and limits which are useful for the
development of information theoretic coding theorems; they can be found in
standard real analysis texts (e.g., see [32, 43]).
A.1 Supremum and maximum
Throughout, we work on subsets of R, the set of real numbers.
Definition A.1 (Upper bound of a set) A real number $u$ is called an upper bound of a non-empty subset $\mathcal{A}$ of $\mathbb{R}$ if every element of $\mathcal{A}$ is less than or equal to $u$; we say that $\mathcal{A}$ is bounded above. Symbolically, the definition becomes:
$$\mathcal{A} \subseteq \mathbb{R} \text{ is bounded above} \iff (\exists\, u \in \mathbb{R}) \text{ such that } (\forall\, a \in \mathcal{A}),\ a \le u.$$
Definition A.2 (Least upper bound or supremum) If $\mathcal{A}$ is a non-empty subset of $\mathbb{R}$, then we say that a real number $s$ is a least upper bound or supremum of $\mathcal{A}$ if $s$ is an upper bound of the set $\mathcal{A}$ and if $s \le s'$ for each upper bound $s'$ of $\mathcal{A}$. In this case, we write $s = \sup\mathcal{A}$; other notations are $s = \sup_{x\in\mathcal{A}} x$ and $s = \sup\{x : x \in \mathcal{A}\}$.
Completeness Axiom: (Least upper bound property) Let $\mathcal{A}$ be a non-empty subset of $\mathbb{R}$ that is bounded above. Then $\mathcal{A}$ has a least upper bound.
It follows directly that if a non-empty set in $\mathbb{R}$ has a supremum, then this supremum is unique. Furthermore, note that the empty set ($\emptyset$) and any set not bounded above do not admit a supremum in $\mathbb{R}$. However, when working in the set of extended real numbers given by $\mathbb{R} \cup \{-\infty, \infty\}$, we can define the supremum of the empty set as $-\infty$ and that of a set not bounded above as $\infty$. These extended definitions will be adopted in the text.
We now distinguish between two situations: (i) the supremum of a set A
belongs to A, and (ii) the supremum of a set Adoes not belong to A. It is quite
easy to create examples for both situations. A quick example for (i) involves
the set (0,1], while the set (0,1) can be used for (ii). In both examples, the
supremum is equal to 1; however, in the former case, the supremum belongs to
the set, while in the latter case it does not. When a set contains its supremum,
we call the supremum the maximum of the set.
Definition A.3 (Maximum) If $\sup\mathcal{A} \in \mathcal{A}$, then $\sup\mathcal{A}$ is also called the maximum of $\mathcal{A}$, and is denoted by $\max\mathcal{A}$. However, if $\sup\mathcal{A} \notin \mathcal{A}$, then we say that the maximum of $\mathcal{A}$ does not exist.
Property A.4 (Properties of the supremum)
1. The supremum of any set in $\mathbb{R} \cup \{-\infty, \infty\}$ always exists.
2. $(\forall\, a \in \mathcal{A})\ a \le \sup\mathcal{A}$.
3. If $-\infty < \sup\mathcal{A} < \infty$, then $(\forall\, \varepsilon > 0)(\exists\, a_0 \in \mathcal{A})\ a_0 > \sup\mathcal{A} - \varepsilon$.
(The existence of $a_0 \in (\sup\mathcal{A} - \varepsilon, \sup\mathcal{A}]$ for any $\varepsilon > 0$ under the condition of $|\sup\mathcal{A}| < \infty$ is called the approximation property for the supremum.)
4. If $\sup\mathcal{A} = \infty$, then $(\forall\, L \in \mathbb{R})(\exists\, B_0 \in \mathcal{A})\ B_0 > L$.
5. If $\sup\mathcal{A} = -\infty$, then $\mathcal{A}$ is empty.
Observation A.5 In Information Theory, a typical channel coding theorem establishes that a (finite) real number $\alpha$ is the supremum of a set $\mathcal{A}$. Thus, to prove such a theorem, one must show that $\alpha$ satisfies both properties 3 and 2 above, i.e.,
$$(\forall\, \varepsilon > 0)(\exists\, a_0 \in \mathcal{A})\ a_0 > \alpha - \varepsilon \qquad (A.1.1)$$
and
$$(\forall\, a \in \mathcal{A})\ a \le \alpha, \qquad (A.1.2)$$
where (A.1.1) and (A.1.2) are called the achievability (or forward) part and the converse part, respectively, of the theorem. Specifically, (A.1.2) states that $\alpha$ is an upper bound of $\mathcal{A}$, and (A.1.1) states that no number less than $\alpha$ can be an upper bound for $\mathcal{A}$.
Property A.6 (Properties of the maximum)
1. $(\forall\, a \in \mathcal{A})\ a \le \max\mathcal{A}$, if $\max\mathcal{A}$ exists in $\mathbb{R} \cup \{-\infty, \infty\}$.
2. $\max\mathcal{A} \in \mathcal{A}$.
From the above property, in order to obtain $\alpha = \max\mathcal{A}$, one needs to show that $\alpha$ satisfies both
$$(\forall\, a \in \mathcal{A})\ a \le \alpha \quad \text{and} \quad \alpha \in \mathcal{A}.$$
A.2 Infimum and minimum
The concepts of infimum and minimum are dual to those of supremum and
maximum.
Definition A.7 (Lower bound of a set) A real number $\ell$ is called a lower bound of a non-empty subset $\mathcal{A}$ of $\mathbb{R}$ if every element of $\mathcal{A}$ is greater than or equal to $\ell$; we say that $\mathcal{A}$ is bounded below. Symbolically, the definition becomes:
$$\mathcal{A} \subseteq \mathbb{R} \text{ is bounded below} \iff (\exists\, \ell \in \mathbb{R}) \text{ such that } (\forall\, a \in \mathcal{A})\ a \ge \ell.$$
Definition A.8 (Greatest lower bound or infimum) If $\mathcal{A}$ is a non-empty subset of $\mathbb{R}$, then we say that a real number $\ell$ is a greatest lower bound or infimum of $\mathcal{A}$ if $\ell$ is a lower bound of $\mathcal{A}$ and if $\ell \ge \ell'$ for each lower bound $\ell'$ of $\mathcal{A}$. In this case, we write $\ell = \inf\mathcal{A}$; other notations are $\ell = \inf_{x\in\mathcal{A}} x$ and $\ell = \inf\{x : x \in \mathcal{A}\}$.
Completeness Axiom: (Greatest lower bound property) Let $\mathcal{A}$ be a non-empty subset of $\mathbb{R}$ that is bounded below. Then $\mathcal{A}$ has a greatest lower bound.
As for the case of the supremum, it directly follows that if a non-empty set in $\mathbb{R}$ has an infimum, then this infimum is unique. Furthermore, working in the set of extended real numbers, the infimum of the empty set is defined as $\infty$ and that of a set not bounded below as $-\infty$.
Definition A.9 (Minimum) If $\inf\mathcal{A} \in \mathcal{A}$, then $\inf\mathcal{A}$ is also called the minimum of $\mathcal{A}$, and is denoted by $\min\mathcal{A}$. However, if $\inf\mathcal{A} \notin \mathcal{A}$, we say that the minimum of $\mathcal{A}$ does not exist.
Property A.10 (Properties of the infimum)
1. The infimum of any set in $\mathbb{R} \cup \{-\infty, \infty\}$ always exists.
2. $(\forall\, a \in \mathcal{A})\ a \ge \inf\mathcal{A}$.
3. If $\infty > \inf\mathcal{A} > -\infty$, then $(\forall\, \varepsilon > 0)(\exists\, a_0 \in \mathcal{A})\ a_0 < \inf\mathcal{A} + \varepsilon$.
(The existence of $a_0 \in [\inf\mathcal{A}, \inf\mathcal{A} + \varepsilon)$ for any $\varepsilon > 0$ under the assumption of $|\inf\mathcal{A}| < \infty$ is called the approximation property for the infimum.)
4. If $\inf\mathcal{A} = -\infty$, then $(\forall\, L \in \mathbb{R})(\exists\, B_0 \in \mathcal{A})\ B_0 < L$.
5. If $\inf\mathcal{A} = \infty$, then $\mathcal{A}$ is empty.
Observation A.11 Analogously to Observation A.5, a typical source coding theorem in Information Theory establishes that a (finite) real number $\alpha$ is the infimum of a set $\mathcal{A}$. Thus, to prove such a theorem, one must show that $\alpha$ satisfies both properties 3 and 2 above, i.e.,
$$(\forall\, \varepsilon > 0)(\exists\, a_0 \in \mathcal{A})\ a_0 < \alpha + \varepsilon \qquad (A.2.1)$$
and
$$(\forall\, a \in \mathcal{A})\ a \ge \alpha. \qquad (A.2.2)$$
Here, (A.2.1) is called the achievability or forward part of the coding theorem; it specifies that no number greater than $\alpha$ can be a lower bound for $\mathcal{A}$. Also, (A.2.2) is called the converse part of the theorem; it states that $\alpha$ is a lower bound of $\mathcal{A}$.
Property A.12 (Properties of the minimum)
1. $(\forall\, a \in \mathcal{A})\ a \ge \min\mathcal{A}$, if $\min\mathcal{A}$ exists in $\mathbb{R} \cup \{-\infty, \infty\}$.
2. $\min\mathcal{A} \in \mathcal{A}$.
A.3 Boundedness and suprema operations
Definition A.13 (Boundedness) A subset Aof Ris said to be bounded if it
is both bounded above and bounded below; otherwise it is called unbounded.
Lemma A.14 (Condition for boundedness) A subset $\mathcal{A}$ of $\mathbb{R}$ is bounded iff $(\exists\, k \in \mathbb{R})$ such that $(\forall\, a \in \mathcal{A})\ |a| \le k$.
Lemma A.15 (Monotone property) Suppose that $\mathcal{A}$ and $\mathcal{B}$ are non-empty subsets of $\mathbb{R}$ such that $\mathcal{A} \subseteq \mathcal{B}$. Then
1. $\sup\mathcal{A} \le \sup\mathcal{B}$.
2. $\inf\mathcal{A} \ge \inf\mathcal{B}$.
Lemma A.16 (Supremum for set operations) Define the "addition" of two sets $\mathcal{A}$ and $\mathcal{B}$ as
$$\mathcal{A} + \mathcal{B} \triangleq \{c \in \mathbb{R} : c = a + b \text{ for some } a \in \mathcal{A} \text{ and } b \in \mathcal{B}\}.$$
Define the "scalar multiplication" of a set $\mathcal{A}$ by a scalar $k \in \mathbb{R}$ as
$$k\cdot\mathcal{A} \triangleq \{c \in \mathbb{R} : c = k\cdot a \text{ for some } a \in \mathcal{A}\}.$$
Finally, define the "negation" of a set $\mathcal{A}$ as
$$-\mathcal{A} \triangleq \{c \in \mathbb{R} : c = -a \text{ for some } a \in \mathcal{A}\}.$$
Then the following hold.
1. If $\mathcal{A}$ and $\mathcal{B}$ are both bounded above, then $\mathcal{A} + \mathcal{B}$ is also bounded above and $\sup(\mathcal{A} + \mathcal{B}) = \sup\mathcal{A} + \sup\mathcal{B}$.
2. If $0 < k < \infty$ and $\mathcal{A}$ is bounded above, then $k\cdot\mathcal{A}$ is also bounded above and $\sup(k\cdot\mathcal{A}) = k\cdot\sup\mathcal{A}$.
3. $\sup\mathcal{A} = -\inf(-\mathcal{A})$ and $\inf\mathcal{A} = -\sup(-\mathcal{A})$.
Property 1 does not hold for the "product" of two sets, where the "product" of sets $\mathcal{A}$ and $\mathcal{B}$ is defined as
$$\mathcal{A}\cdot\mathcal{B} \triangleq \{c \in \mathbb{R} : c = ab \text{ for some } a \in \mathcal{A} \text{ and } b \in \mathcal{B}\}.$$
In this case, both of these two situations can occur:
$$\sup(\mathcal{A}\cdot\mathcal{B}) > (\sup\mathcal{A})\cdot(\sup\mathcal{B})$$
$$\sup(\mathcal{A}\cdot\mathcal{B}) = (\sup\mathcal{A})\cdot(\sup\mathcal{B}).$$
Lemma A.17 (Supremum/infimum for monotone functions)
1. If $f: \mathbb{R} \to \mathbb{R}$ is a non-decreasing function, then
$$\sup\{x \in \mathbb{R} : f(x) < \varepsilon\} = \inf\{x \in \mathbb{R} : f(x) \ge \varepsilon\}$$
and
$$\sup\{x \in \mathbb{R} : f(x) \le \varepsilon\} = \inf\{x \in \mathbb{R} : f(x) > \varepsilon\}.$$
2. If $f: \mathbb{R} \to \mathbb{R}$ is a non-increasing function, then
$$\sup\{x \in \mathbb{R} : f(x) > \varepsilon\} = \inf\{x \in \mathbb{R} : f(x) \le \varepsilon\}$$
and
$$\sup\{x \in \mathbb{R} : f(x) \ge \varepsilon\} = \inf\{x \in \mathbb{R} : f(x) < \varepsilon\}.$$
The above lemma is illustrated in Figure A.1.
A.4 Sequences and their limits
Let Ndenote the set of “natural numbers” (positive integers) 1,2,3,···. A
sequence drawn from a real-valued function is denoted by
f:NR.
In other words, f(n) is a real number for each n= 1,2,3,···. It is usual to write
f(n) = an, and we often indicate the sequence by any one of these notations
{a1, a2, a3,···, an,··· } or {an}
n=1.
One important question that arises with a sequence is what happens when n
gets large. To be precise, we want to know that when nis large enough, whether
or not every anis close to some fixed number L(which is the limit of an).
Definition A.18 (Limit) The limit of $\{a_n\}_{n=1}^{\infty}$ is the real number $L$ satisfying: $(\forall\, \varepsilon > 0)(\exists\, N)$ such that $(\forall\, n > N)$
$$|a_n - L| < \varepsilon.$$
In this case, we write $L = \lim_{n\to\infty} a_n$. If no such $L$ satisfies the above statement, we say that the limit of $\{a_n\}_{n=1}^{\infty}$ does not exist.
Figure A.1: Illustration of Lemma A.17 (plots of a non-decreasing and a non-increasing function $f(x)$ against $x$, with threshold $\varepsilon$ and the coinciding suprema/infima of the corresponding level sets).
Property A.19 If $\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$ both have a limit in $\mathbb{R}$, then the following hold.
1. $\lim_{n\to\infty}(a_n + b_n) = \lim_{n\to\infty} a_n + \lim_{n\to\infty} b_n$.
2. $\lim_{n\to\infty}(\alpha\cdot a_n) = \alpha\cdot\lim_{n\to\infty} a_n$.
3. $\lim_{n\to\infty}(a_n b_n) = (\lim_{n\to\infty} a_n)(\lim_{n\to\infty} b_n)$.
Note that in the above definition, $-\infty$ and $\infty$ cannot be a legitimate limit for any sequence. In fact, if $(\forall\, L)(\exists\, N)$ such that $(\forall\, n > N)\ a_n > L$, then we say that $a_n$ diverges to $\infty$ and write $a_n \to \infty$. A similar argument applies to $a_n$ diverging to $-\infty$. For convenience, we will work in the set of extended real numbers and thus state that a sequence $\{a_n\}_{n=1}^{\infty}$ that diverges to either $\infty$ or $-\infty$ has a limit in $\mathbb{R} \cup \{-\infty, \infty\}$.
Lemma A.20 (Convergence of monotone sequences) If $\{a_n\}_{n=1}^{\infty}$ is non-decreasing in $n$, then $\lim_{n\to\infty} a_n$ exists in $\mathbb{R} \cup \{-\infty, \infty\}$. If $\{a_n\}_{n=1}^{\infty}$ is also bounded from above (i.e., $a_n \le L$ for all $n$ for some $L$ in $\mathbb{R}$), then $\lim_{n\to\infty} a_n$ exists in $\mathbb{R}$.
Likewise, if $\{a_n\}_{n=1}^{\infty}$ is non-increasing in $n$, then $\lim_{n\to\infty} a_n$ exists in $\mathbb{R} \cup \{-\infty, \infty\}$. If $\{a_n\}_{n=1}^{\infty}$ is also bounded from below (i.e., $a_n \ge L$ for all $n$ for some $L$ in $\mathbb{R}$), then $\lim_{n\to\infty} a_n$ exists in $\mathbb{R}$.
As stated above, the limit of a sequence may not exist. For example, take $a_n = (-1)^n$. Then $a_n$ will be close to either $-1$ or $1$ for $n$ large. Hence, more generalized definitions that can describe the general limiting behavior of a sequence are required.
Definition A.21 (limsup and liminf) The limit supremum of $\{a_n\}_{n=1}^{\infty}$ is the extended real number in $\mathbb{R} \cup \{-\infty, \infty\}$ defined by
$$\limsup_{n\to\infty} a_n \triangleq \lim_{n\to\infty}\left(\sup_{k\ge n} a_k\right),$$
and the limit infimum of $\{a_n\}_{n=1}^{\infty}$ is the extended real number defined by
$$\liminf_{n\to\infty} a_n \triangleq \lim_{n\to\infty}\left(\inf_{k\ge n} a_k\right).$$
Some also use the notations $\overline{\lim}$ and $\underline{\lim}$ to denote limsup and liminf, respectively.
Note that the limit supremum and the limit infimum of a sequence are always defined in $\mathbb{R} \cup \{-\infty, \infty\}$, since the sequences $\sup_{k\ge n} a_k = \sup\{a_k : k \ge n\}$ and $\inf_{k\ge n} a_k = \inf\{a_k : k \ge n\}$ are monotone in $n$ (cf. Lemma A.20). An immediate result follows from the definitions of limsup and liminf.
Lemma A.22 (Limit) For a sequence $\{a_n\}_{n=1}^{\infty}$,
$$\lim_{n\to\infty} a_n = L \iff \limsup_{n\to\infty} a_n = \liminf_{n\to\infty} a_n = L.$$
Some properties regarding the limsup and liminf of sequences (which are parallel to Properties A.4 and A.10) are listed below.
Some properties regarding the limsup and liminf of sequences (which are
parallel to Properties A.4 and A.10) are listed below.
Property A.23 (Properties of the limit supremum)
1. The limit supremum always exists in $\mathbb{R} \cup \{-\infty, \infty\}$.
2. If $|\limsup_{m\to\infty} a_m| < \infty$, then $(\forall\, \varepsilon > 0)(\exists\, N)$ such that $(\forall\, n > N)$ $a_n < \limsup_{m\to\infty} a_m + \varepsilon$. (Note that this holds for every $n > N$.)
3. If $|\limsup_{m\to\infty} a_m| < \infty$, then $(\forall\, \varepsilon > 0$ and integer $K)(\exists\, N > K)$ such that $a_N > \limsup_{m\to\infty} a_m - \varepsilon$. (Note that this holds only for one $N$, which is larger than $K$.)
Property A.24 (Properties of the limit infimum)
1. The limit infimum always exists in $\mathbb{R} \cup \{-\infty, \infty\}$.
2. If $|\liminf_{m\to\infty} a_m| < \infty$, then $(\forall\, \varepsilon > 0$ and $K)(\exists\, N > K)$ such that $a_N < \liminf_{m\to\infty} a_m + \varepsilon$. (Note that this holds only for one $N$, which is larger than $K$.)
3. If $|\liminf_{m\to\infty} a_m| < \infty$, then $(\forall\, \varepsilon > 0)(\exists\, N)$ such that $(\forall\, n > N)$ $a_n > \liminf_{m\to\infty} a_m - \varepsilon$. (Note that this holds for every $n > N$.)
The last two items in Properties A.23 and A.24 can be stated using the
terminology of sufficiently large and infinitely often, which is often adopted in
Information Theory.
Definition A.25 (Sufficiently large) We say that a property holds for a se-
quence {an}
n=1 almost always or for all sufficiently large nif the property holds
for every n > N for some N.
Definition A.26 (Infinitely often) We say that a property holds for a se-
quence {an}
n=1 infinitely often or for infinitely many nif for every K, the prop-
erty holds for one (specific) Nwith N > K.
Then properties 2 and 3 of Property A.23 can be respectively re-phrased as: if $|\limsup_{m\to\infty} a_m| < \infty$, then $(\forall\, \varepsilon > 0)$
$$a_n < \limsup_{m\to\infty} a_m + \varepsilon \quad \text{for all sufficiently large } n$$
and
$$a_n > \limsup_{m\to\infty} a_m - \varepsilon \quad \text{for infinitely many } n.$$
Similarly, properties 2 and 3 of Property A.24 become: if $|\liminf_{m\to\infty} a_m| < \infty$, then $(\forall\, \varepsilon > 0)$
$$a_n < \liminf_{m\to\infty} a_m + \varepsilon \quad \text{for infinitely many } n$$
and
$$a_n > \liminf_{m\to\infty} a_m - \varepsilon \quad \text{for all sufficiently large } n.$$
Lemma A.27
1. $\liminf_{n\to\infty} a_n \le \limsup_{n\to\infty} a_n$.
2. If $a_n \le b_n$ for all sufficiently large $n$, then
$$\liminf_{n\to\infty} a_n \le \liminf_{n\to\infty} b_n \quad \text{and} \quad \limsup_{n\to\infty} a_n \le \limsup_{n\to\infty} b_n.$$
3. $\limsup_{n\to\infty} a_n < r \ \Rightarrow\ a_n < r$ for all sufficiently large $n$.
4. $\limsup_{n\to\infty} a_n > r \ \Rightarrow\ a_n > r$ for infinitely many $n$.
5.
$$\liminf_{n\to\infty} a_n + \liminf_{n\to\infty} b_n \le \liminf_{n\to\infty}(a_n + b_n) \le \limsup_{n\to\infty} a_n + \liminf_{n\to\infty} b_n \le \limsup_{n\to\infty}(a_n + b_n) \le \limsup_{n\to\infty} a_n + \limsup_{n\to\infty} b_n.$$
6. If $\lim_{n\to\infty} a_n$ exists, then
$$\liminf_{n\to\infty}(a_n + b_n) = \lim_{n\to\infty} a_n + \liminf_{n\to\infty} b_n$$
and
$$\limsup_{n\to\infty}(a_n + b_n) = \lim_{n\to\infty} a_n + \limsup_{n\to\infty} b_n.$$
Finally, one can also interpret the limit supremum and limit infimum in terms of the concept of clustering points. A clustering point is a point that a sequence $\{a_n\}_{n=1}^{\infty}$ approaches (i.e., belonging to a ball with arbitrarily small radius and that point as center) infinitely many times. For example, if $a_n = \sin(n\pi/2)$, then $\{a_n\}_{n=1}^{\infty} = \{1, 0, -1, 0, 1, 0, -1, 0, \ldots\}$. Hence, there are three clustering points in this sequence, which are $-1$, $0$ and $1$. Then the limit supremum of the sequence is nothing but its largest clustering point, and its limit infimum is exactly its smallest clustering point. Specifically, $\limsup_{n\to\infty} a_n = 1$ and $\liminf_{n\to\infty} a_n = -1$. This approach can sometimes be useful to determine the limsup and liminf quantities.
A.5 Equivalence
We close this appendix by providing some equivalent statements that are often used to simplify proofs. For example, instead of directly showing that quantity $x$ is less than or equal to quantity $y$, one can take an arbitrary constant $\varepsilon > 0$ and prove that $x < y + \varepsilon$. Since $y + \varepsilon$ is a larger quantity than $y$, in some cases it might be easier to show $x < y + \varepsilon$ than to prove $x \le y$. By the next theorem, any proof that concludes that "$x < y + \varepsilon$ for all $\varepsilon > 0$" immediately gives the desired result of $x \le y$.
Theorem A.28 For any $x$, $y$ and $a$ in $\mathbb{R}$,
1. $x < y + \varepsilon$ for all $\varepsilon > 0$ iff $x \le y$;
2. $x < y - \varepsilon$ for some $\varepsilon > 0$ iff $x < y$;
3. $x > y - \varepsilon$ for all $\varepsilon > 0$ iff $x \ge y$;
4. $x > y + \varepsilon$ for some $\varepsilon > 0$ iff $x > y$;
5. $|a| < \varepsilon$ for all $\varepsilon > 0$ iff $a = 0$.
Appendix B
Overview in Probability and Random
Processes
This appendix presents a quick overview of basic concepts from probability the-
ory and the theory of random processes. The reader can consult comprehensive
texts on these subjects for a thorough study (e.g., cf. [2, 6, 20]). We close the ap-
pendix with a brief discussion of Jensen’s inequality and the Lagrange multipliers
technique for the optimization of convex functions [5, 11].
B.1 Probability space
Definition B.1 (σ-Fields) Let $\mathcal{F}$ be a collection of subsets of a non-empty set $\Omega$. Then $\mathcal{F}$ is called a σ-field (or σ-algebra) if the following hold:
1. $\Omega \in \mathcal{F}$.
2. Closure of $\mathcal{F}$ under complementation: If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$, where $A^c$ is the complement set of $A$ (relative to $\Omega$).
3. Closure of $\mathcal{F}$ under countable union: If $A_i \in \mathcal{F}$ for $i = 1, 2, 3, \ldots$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
It directly follows that the empty set is also an element of $\mathcal{F}$ (as $\Omega^c = \emptyset$) and that $\mathcal{F}$ is closed under countable intersection, since
$$\bigcap_{i=1}^{\infty} A_i = \left(\bigcup_{i=1}^{\infty} A_i^c\right)^c.$$
The largest σ-field of subsets of a given set $\Omega$ is the collection of all subsets of $\Omega$ (i.e., its powerset), while the smallest σ-field is given by $\{\Omega, \emptyset\}$. Also, if $A$ is a proper (strict) non-empty subset of $\Omega$, then the smallest σ-field containing $A$ is given by $\{\Omega, \emptyset, A, A^c\}$.
Definition B.2 (Probability space) A probability space is a triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a given set called the sample space containing all possible outcomes (usually observed from an experiment), $\mathcal{F}$ is the σ-field of subsets of $\Omega$, and $P$ is a probability measure $P: \mathcal{F} \to [0,1]$ on the σ-field satisfying the following:
1. $0 \le P(A) \le 1$ for all $A \in \mathcal{F}$.
2. $P(\Omega) = 1$.
3. Countable additivity: If $A_1, A_2, \ldots$ is a sequence of disjoint sets (i.e., $A_i \cap A_j = \emptyset$ for all $i \neq j$) in $\mathcal{F}$, then
$$P\left(\bigcup_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} P(A_k).$$
It directly follows from properties 1-3 of the above definition that $P(\emptyset) = 0$. Usually, the σ-field $\mathcal{F}$ is called the event space and its elements (which are subsets of $\Omega$ satisfying the properties of Definition B.1) are called events.
B.2 Random variable and random process
B.3 Central limit theorem
Theorem B.3 (Central limit theorem) If $\{X_n\}_{n=1}^{\infty}$ is a sequence of i.i.d. random variables with finite common marginal mean $\mu$ and variance $\sigma^2$, then
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(X_i - \mu) \ \xrightarrow{d}\ Z \sim \mathcal{N}(0, \sigma^2),$$
where the convergence is in distribution (as $n \to \infty$) and $Z \sim \mathcal{N}(0, \sigma^2)$ is a Gaussian distributed random variable with mean 0 and variance $\sigma^2$.
B.4 Convexity, concavity and Jensen’s inequality
Jensen’s inequality provides a useful bound for the expectation of convex (or
concave) functions.
Definition B.4 (Convexity) Consider a convex set$^1$ $\mathcal{O} \subseteq \mathbb{R}^m$, where $m$ is a fixed positive integer. Then a function $f: \mathcal{O} \to \mathbb{R}$ is said to be convex over $\mathcal{O}$ if for every $x$, $y$ in $\mathcal{O}$ and $0 \le \lambda \le 1$,
$$f\big(\lambda x + (1-\lambda)y\big) \le \lambda f(x) + (1-\lambda)f(y).$$
Furthermore, a function $f$ is said to be strictly convex if equality holds only when $\lambda = 0$ or $\lambda = 1$.
Definition B.5 (Concavity) A function $f$ is concave if $-f$ is convex.
Note that when $\mathcal{O} = (a,b)$ is an interval in $\mathbb{R}$ and the function $f: \mathcal{O} \to \mathbb{R}$ has a non-negative (respectively, positive) second derivative over $\mathcal{O}$, then the function is convex (resp. strictly convex). This can be easily shown via the Taylor series expansion of the function.
Theorem B.6 (Jensen's inequality) If $f: \mathcal{O} \to \mathbb{R}$ is convex over a convex set $\mathcal{O} \subseteq \mathbb{R}^m$, and $X = (X_1, X_2, \cdots, X_m)^T$ is an $m$-dimensional random vector with alphabet $\mathcal{X} \subseteq \mathcal{O}$, then
$$E[f(X)] \ge f(E[X]).$$
Moreover, if $f$ is strictly convex, then equality in the above inequality immediately implies $X = E[X]$ with probability 1.
Note: $\mathcal{O}$ is a convex set; hence, $\mathcal{X} \subseteq \mathcal{O}$ implies $E[X] \in \mathcal{O}$. This guarantees that $f(E[X])$ is defined. Similarly, if $f$ is concave, then
$$E[f(X)] \le f(E[X]).$$
Furthermore, if $f$ is strictly concave, then equality in the above inequality immediately implies that $X = E[X]$ with probability 1.
Proof: Let $y = a^T x + b$ be a "support hyperplane" for $f$ with "slope" vector $a^T$ and affine parameter $b$ that passes through the point $(E[X], f(E[X]))$, where a support hyperplane$^2$ for a function $f$ at a point $x$ is by definition a hyperplane passing through the point $(x, f(x))$ and lying entirely below the graph of $f$ (see Figure B.1 for an illustration of a support line for a convex function over $\mathbb{R}$).
$^1$A set $\mathcal{O} \subseteq \mathbb{R}^m$ is said to be convex if for every $x = (x_1, x_2, \cdots, x_m)^T$ and $y = (y_1, y_2, \cdots, y_m)^T$ in $\mathcal{O}$ (where $T$ denotes transposition), and every $0 \le \lambda \le 1$, $\lambda x + (1-\lambda)y \in \mathcal{O}$; in other words, the "convex combination" of any two "points" $x$ and $y$ in $\mathcal{O}$ also belongs to $\mathcal{O}$.
Figure B.1: The support line $y = ax + b$ of the convex function $f(x)$.
Thus,
$$(\forall\, x \in \mathcal{X})\quad a^T x + b \le f(x).$$
By taking the expectation of both sides, we obtain
$$a^T E[X] + b \le E[f(X)],$$
but we know that $a^T E[X] + b = f(E[X])$. Consequently,
$$f(E[X]) \le E[f(X)]. \qquad \Box$$
$^2$A hyperplane $y = a^T x + b$ is said to be a support hyperplane for a function $f$ with "slope" vector $a^T \in \mathbb{R}^m$ and affine parameter $b \in \mathbb{R}$ if, among all hyperplanes of the same slope vector $a$, it is the largest one satisfying $a^T x + b \le f(x)$ for every $x \in \mathcal{O}$. Hence, a support hyperplane may not necessarily pass through the point $(x_0, f(x_0))$ for every $x_0 \in \mathcal{O}$. Here, since we only consider convex functions, the validity of a support hyperplane at $x_0$ passing through $(x_0, f(x_0))$ is therefore guaranteed. Note that when $x$ is one-dimensional (i.e., $m = 1$), a support hyperplane is simply referred to as a support line.
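As a small numerical illustration of Jensen's inequality (added here as a sketch; the convex function and distribution are arbitrary choices), the following Python snippet compares $E[f(X)]$ with $f(E[X])$ for the convex function $f(x) = x^2$.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=100_000)   # any non-degenerate distribution will do

f = lambda t: t**2                              # a convex function
lhs = float(np.mean(f(x)))                      # E[f(X)]
rhs = float(f(np.mean(x)))                      # f(E[X])
print(round(lhs, 3), ">=", round(rhs, 3))       # Jensen: E[f(X)] >= f(E[X])
```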
Bibliography
[1] S. Arimoto, “An algorithm for computing the capacity of arbitrary discrete
memoryless channel,” IEEE Trans. Inform. Theory, vol. 18, no. 1, pp. 14-20,
Jan. 1972.
[2] R. B. Ash and C. A. Dol´eans-Dade, Probability and Measure Theory, Aca-
demic Press, MA, 2000.
[3] C. Berrou, A. Glavieux and P. Thitimajshima, “Near Shannon limit error-
correcting coding and decoding: Turbo-codes(1),” Proc. IEEE Int. Conf.
Commun., pp. 1064-1070, Geneva, Switzerland, May 1993.
[4] C. Berrou and A. Glavieux, “Near optimum error correcting coding and
decoding: Turbo-codes,” IEEE Trans. Commun., vol. 44, no. 10, pp. 1261-
1271, Oct. 1996.
[5] D. P. Bertsekas, with A. Nedi´c and A. E. Ozdagler, Convex Analysis and
Optimization, Athena Scientific, Belmont, MA, 2003.
[6] P. Billingsley. Probability and Measure, 2nd. Ed., John Wiley and Sons, NY,
1995.
[7] R. E. Blahut, “Computation of channel capacity and rate-distortion func-
tions,” IEEE Trans. Inform. Theory, vol. 18, no. 4, pp. 460-473, Jul. 1972.
[8] R. E. Blahut, Theory and Practice of Error Control Codes, Addison-Wesley,
MA, 1983.
[9] R. E. Blahut. Principles and Practice of Information Theory. Addison Wes-
ley, MA, 1988.
[10] R. E. Blahut, Algebraic Codes for Data Transmission, Cambridge Univ.
Press, 2003.
[11] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University
Press, Cambridge, UK, 2003.
[12] T. M. Cover and J.A. Thomas, Elements of Information Theory, 2nd Ed.,
Wiley, NY, 2006.
[13] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic, NY, 1981.
[14] I. Csisz´ar and G. Tusnady, “Information geometry and alternating min-
imization procedures,” Statistics and Decision, Supplement Issue, vol. 1,
pp. 205-237, 1984.
[15] S. H. Friedberg, A.J. Insel and L. E. Spence, Linear Algebra, 4th Ed., Pren-
tice Hall, 2002.
[16] R. G. Gallager, “Low-density parity-check codes,” IRE Trans. Inform. The-
ory, vol. 28, no. 1, pp. 8-21, Jan. 1962.
[17] R. G. Gallager, Low-Density Parity-Check Codes, MIT Press, 1963.
[18] R. G. Gallager, Information Theory and Reliable Communication, Wiley,
1968.
[19] R. Gallager, “Variations on theme by Huffman,” IEEE Trans. Inform. The-
ory, vol. 24, no. 6, pp. 668-674, Nov. 1978.
[20] G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes,
Third Edition, Oxford University Press, NY, 2001.
[21] T. S. Han and S. Verd´u, “Approximation theory of output statistics,” IEEE
Trans. Inform. Theory, vol. 39, no. 3, pp. 752-772, May 1993.
[22] S. Ihara, Information Theory for Continuous Systems, World-Scientific, Sin-
gapore, 1993.
[23] R. Johanesson and K. Zigangirov, Fundamentals of Convolutional Coding,
IEEE, 1999.
[24] W. Karush, Minima of Functions of Several Variables with Inequalities as
Side Constraints, M.Sc. Dissertation, Dept. Mathematics, Univ. Chicago,
Chicago, Illinois, 1939.
[25] A. N. Kolmogorov, “On the Shannon theory of information transmission in
the case of continuous signals,” IEEE Trans. Inform. Theory, vol. 2, no. 4,
pp. 102-108, Dec. 1956.
[26] A. N. Kolmogorov and S. V. Fomin, Introductory Real Analysis, Dover Pub-
lications, NY, 1970.
[27] H. W. Kuhn and A. W. Tucker, “Nonlinear programming,” Proc. 2nd Berke-
ley Symposium, Berkeley, University of California Press, pp. 481-492, 1951.
[28] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Appli-
cations, 2nd Edition, Prentice Hall, NJ, 2004.
[29] D. J. C. MacKay and R. M. Neal, “Near Shannon limit performance of low
density parity check codes,” Electronics Letters, vol. 33, no. 6, Mar. 1997.
[30] D. J. C. MacKay, “Good error correcting codes based on very sparse matri-
ces,” IEEE Trans. Inform. Theory, vol. 45, no. 2, pp. 399-431, Mar. 1999.
[31] F. J. MacWilliams and N. J. A. Sloane, The Theory of Error Correcting
Codes, North-Holland Pub. Co., 1978.
[32] J. E. Marsden and M. J.Hoffman, Elementary Classical Analysis, W.H.
Freeman & Company, 1993.
[33] R. J. McEliece, The Theory of Information and Coding, 2nd. Ed., Cam-
bridge University Press, 2002.
[34] M. S. Pinsker, Information and Information Stability of Random Variables
and Processes, Holden-Day, San Francisco, 1964.
[35] T. J. Richardson and R. L. Urbanke, Modern Coding Theory, Cambridge
University Press, 2008.
[36] M. Rezaeian and A. Grant, “Computation of total capacity for discrete
memoryless multiple-access channels,” IEEE Trans. Inform. Theory, vol. 50,
no. 11, pp. 2779-2784, Nov. 2004.
[37] H. L. Royden. Real Analysis, Macmillan Publishing Company, 3rd. Ed., NY,
1988.
[38] C. E. Shannon, “A mathematical theory of communications,” Bell Syst.
Tech. Journal, vol. 27, pp. 379-423, 1948.
[39] C. E. Shannon, “Coding theorems for a discrete source with a fidelity cri-
terion,” IRE Nat. Conv. Rec., Pt. 4, pp. 142-163, 1959.
[40] C. E. Shannon and W. W. Weaver, The Mathematical Theory of Commu-
nication, Univ. of Illinois Press, Urbana, IL, 1949.
[41] P. C. Shields, The Ergodic Theory of Discrete Sample Paths, American
Mathematical Society, 1991.
[42] N. J. A. Sloane and A. D. Wyner, Ed., Claude Elwood Shannon: Collected
Papers, IEEE Press, NY, 1993.
[43] W. R. Wade, An Introduction to Analysis, Prentice Hall, NJ, 1995.
[44] S. Wicker, Error Control Systems for Digital Communication and Storage,
Prentice Hall, NJ, 1995.
[45] R. W. Yeung, Information Theory and Network Coding, Springer, NY, 2008.
Article
We establish a phase transition known as the “all-or-nothing” phenomenon for noiseless discrete channels. This class of models includes the Bernoulli group testing model and the planted Gaussian perceptron model. Previously, the existence of the all-or-nothing phenomenon for such models was only known in a limited range of parameters. Our work extends the results to all signals with arbitrary sublinear sparsity. Over the past several years, the all-or-nothing phenomenon has been established in various models as an outcome of two seemingly disjoint results: one positive result establishing the “all” half of all-or-nothing, and one impossibility result establishing the “nothing” half. Our main technique in the present work is to show that for noiseless discrete channels, the “all” half implies the “nothing” half, that is, a proof of “all” can be turned into a proof of “nothing.” Since the “all” half can often be proven by straightforward means—for instance, by the first-moment method—our equivalence gives a powerful and general approach towards establishing the existence of this phenomenon in other contexts.
Article
Determining the presence of an anomaly or whether a system is safe or not is a problem with wide applicability. The model adopted for this problem is that of verifying whether a multi-component system has anomalies or not. Components can be probed over time individually or as groups in a data-driven manner. The collected observations are noisy and contain information on whether the selected group contains an anomaly or not. The aim is to minimize the probability of incorrectly declaring the system to be free of anomalies while ensuring that the probability of correctly declaring it to be safe is sufficiently large. This problem is modeled as an active hypothesis testing problem in the Neyman-Pearson setting. Asymptotically optimal rates and strategies are characterized. The general strategies are data driven and outperform previously proposed asymptotically optimal methods in the finite sample regime. Furthermore, novel component-selection are designed and analyzed in the non-asymptotic regime. For a specific class of problems admitting a key form of symmetry, strong non-asymptotic converse and achievability bounds are provided which are tighter than previously proposed bounds.
Article
We study hypothesis testing problems with fixed compression mappings and with user-dependent compression mappings to decide whether or not an observation sequence is related to one of the users in a database, which contains compressed versions of previously enrolled users’ data. We first provide the optimal characterization of the exponent of the probability of the second type of error for the fixed compression mappings scenario when the number of users in the database grows exponentially. We then establish operational equivalence relations between the Wyner-Ahlswede-Körner network, the single-user hypothesis testing problem, the multi-user hypothesis testing problem with user-dependent compression mappings and the identification systems with user-dependent compression mappings. These equivalence relations imply the strong converse and exponentially strong converse for the multi-user hypothesis testing and the identification systems both with user-dependent compression mappings. Finally they also show how an identification scheme can be turned into a multi-user hypothesis testing scheme with an explicit transfer of rate and error probability conditions and vice versa.
Book
Information Theory and Network Coding consists of two parts: Components of Information Theory, and Fundamentals of Network Coding Theory. Part I is a rigorous treatment of information theory for discrete and continuous systems. In addition to the classical topics, there are such modern topics as the I-Measure, Shannon-type and non-Shannon-type information inequalities, and a fundamental relation between entropy and group theory. With information theory as the foundation, Part II is a comprehensive treatment of network coding theory with detailed discussions on linear network codes, convolutional network codes, and multi-source network coding. Other important features include: • Derivations that are from the first principle • A large number of examples throughout the book • Many original exercise problems • Easy-to-use chapter summaries • Two parts that can be used separately or together for a comprehensive course Information Theory and Network Coding is for senior undergraduate and graduate students in electrical engineering, computer science, and applied mathematics. This work can also be used as a reference for professional engineers in the area of communications.
Article
Csiszár and Körner’s book is widely regarded as a classic in the field of information theory, providing deep insights and expert treatment of the key theoretical issues. It includes in-depth coverage of the mathematics of reliable information transmission, both in two-terminal and multi-terminal network scenarios. Updated and considerably expanded, this new edition presents unique discussions of information theoretic secrecy and of zero-error information theory, including the deep connections of the latter with extremal combinatorics. The presentations of all core subjects are self contained, even the advanced topics, which helps readers to understand the important connections between seemingly different problems. Finally, 320 end-of-chapter problems, together with helpful solving hints, allow readers to develop a full command of the mathematical techniques. It is an ideal resource for graduate students and researchers in electrical and electronic engineering, computer science and applied mathematics. © Akadémiai Kiadó, Budapest 1981 and Cambridge University Press 2011.
Article
The authors report the empirical performance of Gallager's low density parity check codes on Gaussian channels. They show that performance substantially better than that of standard convolutional and concatenated codes can be achieved; indeed the performance is almost as close to the Shannon limit as that of turbo codes