Lecture Notes in Information Theory
Part I
by
Fady Alajaji and Po-Ning Chen
Department of Mathematics & Statistics,
Queen’s University, Kingston, ON K7L 3N6, Canada
Email: fady@mast.queensu.ca
Department of Electrical Engineering
Institute of Communication Engineering
National Chiao Tung University
1001, Ta Hsueh Road
Hsin Chu, Taiwan 30056
Republic of China
Email: poning@faculty.nctu.edu.tw
September 23, 2010
© Copyright by
Fady Alajaji and Po-Ning Chen
September 23, 2010
Preface
This is a work in progress. Comments are welcome; please send them to fady@mast.queensu.ca.
Acknowledgements
Many thanks are due to our families for their endless support.
Table of Contents

List of Tables
List of Figures

1 Introduction
  1.1 Overview
  1.2 Communication system model

2 Information Measures for Discrete Systems
  2.1 Entropy, joint entropy and conditional entropy
    2.1.1 Self-information
    2.1.2 Entropy
    2.1.3 Properties of entropy
    2.1.4 Joint entropy and conditional entropy
    2.1.5 Properties of joint entropy and conditional entropy
  2.2 Mutual information
    2.2.1 Properties of mutual information
    2.2.2 Conditional mutual information
  2.3 Properties of entropy and mutual information for multiple random variables
  2.4 Data processing inequality
  2.5 Fano's inequality
  2.6 Divergence and variational distance
  2.7 Convexity/concavity of information measures
  2.8 Fundamentals of hypothesis testing

3 Lossless Data Compression
  3.1 Principles of data compression
  3.2 Block codes for asymptotically lossless compression
    3.2.1 Block codes for discrete memoryless sources
    3.2.2 Block codes for stationary ergodic sources
    3.2.3 Redundancy for lossless block data compression
  3.3 Variable-length codes for lossless data compression
    3.3.1 Non-singular codes and uniquely decodable codes
    3.3.2 Prefix or instantaneous codes
    3.3.3 Examples of binary prefix codes
      A) Huffman codes: optimal variable-length codes
      B) Shannon-Fano-Elias code
    3.3.4 Examples of universal lossless variable-length codes
      A) Adaptive Huffman code
      B) Lempel-Ziv codes

4 Data Transmission and Channel Capacity
  4.1 Principles of data transmission
  4.2 Discrete memoryless channels
  4.3 Block codes for data transmission over DMCs
  4.4 Calculating channel capacity
    4.4.1 Symmetric, weakly-symmetric and quasi-symmetric channels
    4.4.2 Channel capacity Karush-Kuhn-Tucker condition

5 Differential Entropy and Gaussian Channels
  5.1 Differential entropy
  5.2 Joint and conditional differential entropies, divergence and mutual information
  5.3 AEP for continuous memoryless sources
  5.4 Capacity of the discrete-time memoryless Gaussian channel

A Overview on Suprema and Limits
  A.1 Supremum and maximum
  A.2 Infimum and minimum
  A.3 Boundedness and suprema operations
  A.4 Sequences and their limits
  A.5 Equivalence

B Overview in Probability and Random Processes
  B.1 Probability space
  B.2 Random variable and random process
  B.3 Central limit theorem
  B.4 Convexity, concavity and Jensen's inequality
List of Tables

3.1 An example of the δ-typical set with n = 2 and δ = 0.4, where F₂(0.4) = {AB, AC, BA, BB, BC, CA, CB}. The codeword set is {001(AB), 010(AC), 011(BA), 100(BB), 101(BC), 110(CA), 111(CB), 000(AA, AD, BD, CC, CD, DA, DB, DC, DD)}, where the parenthesis following each binary codeword indicates those sourcewords that are encoded to this codeword. The source distribution is P_X(A) = 0.4, P_X(B) = 0.3, P_X(C) = 0.2 and P_X(D) = 0.1.

5.1 Quantized random variable q_n(X) under an n-bit accuracy: H(q_n(X)) and H(q_n(X)) − n versus n.
List of Figures

1.1 Block diagram of a general communication system.
2.1 Binary entropy function h_b(p).
2.2 Relation between entropy and mutual information.
2.3 Communication context of the data processing lemma.
2.4 Permissible (P_e, H(X|Y)) region due to Fano's inequality.
3.1 Block diagram of a data compression system.
3.2 Possible codebook C_n and its corresponding S_n. The solid box indicates the decoding mapping from C_n back to S_n.
3.3 (Ultimate) Compression rate R versus source entropy H_D(X) and behavior of the probability of block decoding error as block length n goes to infinity for a discrete memoryless source.
3.4 Classification of variable-length codes.
3.5 Tree structure of a binary prefix code. The codewords are those residing on the leaves, which in this case are 00, 01, 10, 110, 1110 and 1111.
3.6 Example of the Huffman encoding.
3.7 Example of the sibling property based on the code tree from P^(16)_X̂. The arguments inside the parenthesis following a_j respectively indicate the codeword and the probability associated with a_j. b is used to denote the internal nodes of the tree with the assigned (partial) code as its subscript. The number in the parenthesis following b is the probability sum of all its children.
3.8 (Continuation of Figure 3.7) Example of violation of the sibling property after observing a new symbol a_3 at n = 17. Note that node a_1 is not adjacent to its sibling a_2.
3.9 (Continuation of Figure 3.8) Updated Huffman code. The sibling property holds now for the new code.
4.1 A data transmission system, where W represents the message for transmission, X^n denotes the codeword corresponding to message W, Y^n represents the received word due to channel input X^n, and Ŵ denotes the reconstructed message from Y^n.
4.2 Binary symmetric channel.
4.3 Binary erasure channel.
4.4 Binary symmetric erasure channel.
4.5 Ultimate channel coding rate R versus channel capacity C and behavior of the probability of error as blocklength n goes to infinity for a discrete memoryless channel.
A.1 Illustration of Lemma A.17.
B.1 The support line y = ax + b of the convex function f(x).
Chapter 1
Introduction
1.1 Overview
Since its inception, the main role of Information Theory has been to provide the
engineering and scientific communities with a mathematical framework for the
theory of communication by establishing the fundamental limits on the perfor-
mance of various communication systems. The birth of Information Theory was
initiated with the publication of the groundbreaking works [38, 40] of Claude El-
wood Shannon (1916-2001) who asserted that it is possible to send information-
bearing signals at a fixed positive rate through a noisy communication channel
with an arbitrarily small probability of error as long as the transmission rate
is below a certain fixed quantity that depends on the channel statistical char-
acteristics; he “baptized” this quantity with the name of channel capacity. He
further proclaimed that random (stochastic) sources, representing data, speech
or image signals, can be compressed distortion-free at a minimal rate given by
the source’s intrinsic amount of information, which he called source entropy and
defined in terms of the source statistics. He went on to prove that if a source has
an entropy that is less than the capacity of a communication channel, then the
source can be reliably transmitted (with asymptotically vanishing probability of
error) over the channel. He further generalized these “coding theorems” from
the lossless (distortionless) to the lossy context where the source can be com-
pressed and reproduced (possibly after channel transmission) within a tolerable
distortion threshold [39].
Inspired and guided by the pioneering ideas of Shannon,1information theo-
rists gradually expanded their interests beyond communication theory, and in-
vestigated fundamental questions in several other related fields. Among them
we cite:
1See [42] for accessing most of Shannon’s works, including his yet untapped doctoral dis-
sertation on an algebraic framework for population genetics.
• statistical physics (thermodynamics, quantum information theory);
• computer science (algorithmic complexity, resolvability);
• probability theory (large deviations, limit theorems);
• statistics (hypothesis testing, multi-user detection, Fisher information, estimation);
• economics (gambling theory, investment theory);
• biology (biological information theory);
• cryptography (data security, watermarking);
• data networks (self-similarity, traffic regulation theory).
In this textbook, we focus our attention on the study of the basic theory of
communication for single-user (point-to-point) systems for which Information
Theory was originally conceived.
1.2 Communication system model
A simple block diagram of a general communication system is depicted in Fig. 1.1.
[Figure 1.1: Block diagram of a general communication system: source → source encoder → channel encoder → modulator → physical channel → demodulator → channel decoder → source decoder → destination. The modulator, physical channel and demodulator together form a discrete channel; this discrete-channel view of the system is the focus of this text.]
Let us briefly describe the role of each block in the figure.
• Source: The source, which usually represents data or multimedia signals, is modelled as a random process (the necessary background regarding random processes is introduced in Appendix B). It can be discrete (finite or countable alphabet) or continuous (uncountable alphabet) in value and in time.

• Source Encoder: Its role is to represent the source in a compact fashion by removing its unnecessary or redundant content (i.e., by compressing it).

• Channel Encoder: Its role is to enable the reliable reproduction of the source encoder output after its transmission through a noisy communication channel. This is achieved by adding redundancy (usually via an algebraic structure) to the source encoder output.

• Modulator: It transforms the channel encoder output into a waveform suitable for transmission over the physical channel. This is typically accomplished by varying the parameters of a sinusoidal signal in proportion with the data provided by the channel encoder output.

• Physical Channel: It consists of the noisy (or unreliable) medium that the transmitted waveform traverses. It is usually modelled via a sequence of conditional (or transition) probability distributions of receiving an output given that a specific input was sent.

• Receiver Part: It consists of the demodulator, the channel decoder and the source decoder, where the reverse operations are performed. The destination represents the sink where the source estimate provided by the source decoder is reproduced.
In this text, we will model the concatenation of the modulator, physical
channel and demodulator via a discrete-time2channel with a given sequence of
conditional probability distributions. Given a source and a discrete channel, our
objectives will include determining the fundamental limits of how well we can
construct a (source/channel) coding scheme so that:
• the smallest number of source encoder symbols can represent each source symbol distortion-free or within a prescribed distortion level D, where D > 0 and the channel is noiseless;

• the largest rate of information can be transmitted over a noisy channel between the channel encoder input and the channel decoder output with an arbitrarily small probability of decoding error;

• we can guarantee that the source is transmitted over a noisy channel and reproduced at the destination within distortion D, where D > 0.

²Except for a brief interlude with the continuous-time (waveform) Gaussian channel in Chapter 5, we will consider discrete-time communication systems throughout the text.
Chapter 2
Information Measures for Discrete
Systems
In this chapter, we define information measures for discrete-time discrete-alphabet1
systems from a probabilistic standpoint and develop their properties. Elucidat-
ing the operational significance of probabilistically defined information measures
vis-a-vis the fundamental limits of coding constitutes a main objective of this
book; this will be seen in the subsequent chapters.
2.1 Entropy, joint entropy and conditional entropy
2.1.1 Self-information
Let E be an event belonging to a given event space and having probability Pr(E) ≜ p_E, where 0 ≤ p_E ≤ 1. Let I(E) (called the self-information of E) represent the amount of information one gains when learning that E has occurred (or equivalently, the amount of uncertainty one had about E prior to learning that it has happened). A natural question to ask is "what properties should I(E) have?" Although the answer to this question may vary from person to person, here are some common properties that I(E) is reasonably expected to have.

1. I(E) should be a decreasing function of p_E.
In other words, this property first states that I(E) = I(p_E), where I(·) is a real-valued function defined over [0, 1]. Furthermore, one would expect that the less likely event E is, the more information is gained when one learns it has occurred. In other words, I(p_E) is a decreasing function of p_E.

2. I(p_E) should be continuous in p_E.
Intuitively, one should expect that a small change in p_E corresponds to a small change in the amount of information carried by E.

3. If E_1 and E_2 are independent events, then I(E_1 ∩ E_2) = I(E_1) + I(E_2), or equivalently, I(p_{E_1} × p_{E_2}) = I(p_{E_1}) + I(p_{E_2}).
This property declares that when events E_1 and E_2 are independent from each other (i.e., when they do not affect each other probabilistically), the amount of information one gains by learning that both events have jointly occurred should be equal to the sum of the amounts of information of each individual event.

¹By discrete alphabets, one usually means finite or countably infinite alphabets. We however mostly focus on finite alphabet systems, although the presented information measures allow for countable alphabets (when they exist).
Next, we show that the only function that satisfies properties 1-3 above is
the logarithmic function.
Theorem 2.1 The only function defined over p ∈ [0, 1] and satisfying

1. I(p) is monotonically decreasing in p;
2. I(p) is a continuous function of p for 0 ≤ p ≤ 1;
3. I(p_1 × p_2) = I(p_1) + I(p_2);

is I(p) = −c·log_b(p), where c is a positive constant and the base b of the logarithm is any number larger than one.
Proof:

Step 1: Claim. For n = 1, 2, 3, ...,
\[ I\!\left(\frac{1}{n}\right) = c\cdot\log_b n, \]
where c > 0 is a constant.

Proof: First note that for n = 1, condition 3 directly shows the claim, since it yields that I(1) = I(1) + I(1). Thus I(1) = 0 = c·log_b(1).

Now let n be a fixed positive integer greater than 1. Conditions 1 and 3 respectively imply
\[ n < m \;\Longrightarrow\; I\!\left(\frac{1}{n}\right) < I\!\left(\frac{1}{m}\right) \tag{2.1.1} \]
and
\[ I\!\left(\frac{1}{mn}\right) = I\!\left(\frac{1}{m}\right) + I\!\left(\frac{1}{n}\right) \tag{2.1.2} \]
where n, m = 1, 2, 3, .... Now using (2.1.2), we can show by induction (on k) that
\[ I\!\left(\frac{1}{n^k}\right) = k\cdot I\!\left(\frac{1}{n}\right) \tag{2.1.3} \]
for all non-negative integers k.

Now for any positive integer r, there exists a non-negative integer k such that
\[ n^k \le 2^r < n^{k+1}. \]
By (2.1.1), we obtain
\[ I\!\left(\frac{1}{n^k}\right) \le I\!\left(\frac{1}{2^r}\right) < I\!\left(\frac{1}{n^{k+1}}\right), \]
which together with (2.1.3), yields
\[ k\cdot I\!\left(\frac{1}{n}\right) \le r\cdot I\!\left(\frac{1}{2}\right) < (k+1)\cdot I\!\left(\frac{1}{n}\right). \]
Hence, since I(1/n) > I(1) = 0,
\[ \frac{k}{r} \le \frac{I(1/2)}{I(1/n)} \le \frac{k+1}{r}. \]
On the other hand, by the monotonicity of the logarithm, we obtain
\[ \log_b n^k \le \log_b 2^r \le \log_b n^{k+1} \;\Longrightarrow\; \frac{k}{r} \le \frac{\log_b(2)}{\log_b(n)} \le \frac{k+1}{r}. \]
Therefore,
\[ \left|\frac{\log_b(2)}{\log_b(n)} - \frac{I(1/2)}{I(1/n)}\right| < \frac{1}{r}. \]
Since n is fixed, and r can be made arbitrarily large, we can let r → ∞ to get:
\[ I\!\left(\frac{1}{n}\right) = c\cdot\log_b(n), \]
where c = I(1/2)/log_b(2) > 0. This completes the proof of the claim.

Step 2: Claim. I(p) = −c·log_b(p) for any positive rational number p, where c > 0 is a constant.

Proof: A positive rational number p can be represented by a ratio of two integers, i.e., p = r/s, where r and s are both positive integers. Then condition 3 yields that
\[ I\!\left(\frac{1}{s}\right) = I\!\left(\frac{r}{s}\cdot\frac{1}{r}\right) = I\!\left(\frac{r}{s}\right) + I\!\left(\frac{1}{r}\right), \]
which, from Step 1, implies that
\[ I(p) = I\!\left(\frac{r}{s}\right) = I\!\left(\frac{1}{s}\right) - I\!\left(\frac{1}{r}\right) = c\cdot\log_b s - c\cdot\log_b r = -c\cdot\log_b p. \]

Step 3: For any p ∈ [0, 1], it follows by continuity and the density of the rationals in the reals that
\[ I(p) = \lim_{a\uparrow p,\ a\ \text{rational}} I(a) = \lim_{b\downarrow p,\ b\ \text{rational}} I(b) = -c\cdot\log_b(p). \qquad\Box \]
The constant c above is by convention normalized to c = 1. Furthermore, the base b of the logarithm determines the type of units used in measuring information. When b = 2, the amount of information is expressed in bits (i.e., binary digits). When b = e, i.e., when the natural logarithm (ln) is used, information is measured in nats (i.e., natural units or digits). For example, if the event E concerns a Heads outcome from the toss of a fair coin, then its self-information is I(E) = −log_2(1/2) = 1 bit, or −ln(1/2) = 0.693 nats.

More generally, under base b > 1, information is in b-ary units or digits. For the sake of simplicity, we will throughout use the base-2 logarithm unless otherwise specified. Note that one can easily convert information units from bits to b-ary units by dividing the former by log_2(b).
2.1.2 Entropy
Let X be a discrete random variable taking values in a finite alphabet 𝒳 under a probability distribution or probability mass function (pmf) P_X(x) ≜ P[X = x] for all x ∈ 𝒳. Note that X generically represents a memoryless source, which is a random process {X_n}_{n=1}^∞ with independent and identically distributed (i.i.d.) random variables (cf. Appendix B).
Definition 2.2 (Entropy) The entropy of a discrete random variable X with pmf P_X(·) is denoted by H(X) or H(P_X) and defined by
\[ H(X) \triangleq -\sum_{x\in\mathcal{X}} P_X(x)\cdot\log_2 P_X(x) \quad \text{(bits)}. \]
Thus H(X) represents the statistical average (mean) amount of information one gains when learning that one of its |𝒳| outcomes has occurred, where |𝒳| denotes the size of alphabet 𝒳. Indeed, we directly note from the definition that
\[ H(X) = E[-\log_2 P_X(X)] = E[I(X)], \]
where I(x) ≜ −log_2 P_X(x) is the self-information of the elementary event [X = x].
When computing the entropy, we adopt the convention
\[ 0\cdot\log_2 0 = 0, \]
which can be justified by a continuity argument since x log_2 x → 0 as x → 0. Also note that H(X) only depends on the probability distribution of X and is not affected by the symbols that represent the outcomes. For example, when tossing a fair coin, we can denote Heads by 2 (instead of 1) and Tails by 100 (instead of 0), and the entropy of the random variable representing the outcome would remain equal to log_2(2) = 1 bit.
Example 2.3 Let X be a binary (valued) random variable with alphabet 𝒳 = {0, 1} and pmf given by P_X(1) = p and P_X(0) = 1 − p, where 0 ≤ p ≤ 1 is fixed. Then H(X) = −p·log_2 p − (1−p)·log_2(1−p). This entropy is conveniently called the binary entropy function and is usually denoted by h_b(p); it is illustrated in Fig. 2.1. As shown in the figure, h_b(p) is maximized for a uniform distribution (i.e., p = 1/2).
The units for H(X) above are in bits, as a base-2 logarithm is used. Setting
\[ H_D(X) \triangleq -\sum_{x\in\mathcal{X}} P_X(x)\cdot\log_D P_X(x) \]
yields the entropy in D-ary units, where D > 1. Note that we abbreviate H_2(X) as H(X) throughout the book since bits are common measure units for a coding system, and hence
\[ H_D(X) = \frac{H(X)}{\log_2 D}. \]
Thus
\[ H_e(X) = \frac{H(X)}{\log_2(e)} = (\ln 2)\cdot H(X) \]
gives the entropy in nats, where e is the base of the natural logarithm.
[Figure 2.1: Binary entropy function h_b(p), plotted for 0 ≤ p ≤ 1; it attains its maximum value 1 at p = 0.5.]
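As a concrete numerical sketch of Definition 2.2 and Example 2.3, the following short Python snippet (not part of the original notes; the helper names are illustrative) computes H(X) for an arbitrary pmf and evaluates the binary entropy function h_b(p):

```python
import math

def entropy(pmf, base=2.0):
    """Entropy of a pmf given as a list of probabilities, with the convention 0*log 0 = 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

def binary_entropy(p):
    """Binary entropy function h_b(p) in bits."""
    return entropy([p, 1.0 - p])

print(binary_entropy(0.5))              # 1.0 bit: maximized at p = 1/2
print(binary_entropy(0.11))             # ~0.4999 bits
print(entropy([0.25] * 4))              # 2.0 bits = log2(4): uniform pmf on 4 symbols
print(entropy([0.5, 0.5], base=math.e)) # ~0.6931 nats = ln 2, illustrating the unit conversion
```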
2.1.3 Properties of entropy
When developing or proving the basic properties of entropy (and other informa-
tion measures), we will often use the following fundamental inequality on the
logarithm (its proof is left as an exercise).
Lemma 2.4 (Fundamental inequality (FI)) For any x > 0 and D > 1, we have that
\[ \log_D x \le \log_D(e)\,(x-1) \]
with equality if and only if (iff) x = 1.

Setting y = 1/x and using FI above directly yields that for any y > 0, we also have that
\[ \log_D y \ge \log_D(e)\left(1-\frac{1}{y}\right), \]
also with equality iff y = 1. In the above the base-D logarithm was used. Specifically, for a logarithm with base 2, the above inequalities become
\[ \log_2(e)\left(1-\frac{1}{x}\right) \le \log_2 x \le \log_2(e)\,(x-1) \]
with equality iff x = 1.
Lemma 2.5 (Non-negativity) H(X) ≥ 0. Equality holds iff X is deterministic (when X is deterministic, the uncertainty of X is obviously zero).

Proof: 0 ≤ P_X(x) ≤ 1 implies that log_2[1/P_X(x)] ≥ 0 for every x ∈ 𝒳. Hence,
\[ H(X) = \sum_{x\in\mathcal{X}} P_X(x)\log_2\frac{1}{P_X(x)} \ge 0, \]
with equality holding iff P_X(x) = 1 for some x ∈ 𝒳. □
Lemma 2.6 (Upper bound on entropy) If a random variable X takes values from a finite set 𝒳, then
\[ H(X) \le \log_2|\mathcal{X}|, \]
where |𝒳| denotes the size of the set 𝒳. Equality holds iff X is equiprobable or uniformly distributed over 𝒳 (i.e., P_X(x) = 1/|𝒳| for all x ∈ 𝒳).

Proof:
\begin{align*}
\log_2|\mathcal{X}| - H(X) &= \log_2|\mathcal{X}|\times\Big[\sum_{x\in\mathcal{X}} P_X(x)\Big] + \sum_{x\in\mathcal{X}} P_X(x)\log_2 P_X(x)\\
&= \sum_{x\in\mathcal{X}} P_X(x)\log_2\big[|\mathcal{X}|\times P_X(x)\big]\\
&\ge \sum_{x\in\mathcal{X}} P_X(x)\cdot\log_2(e)\left(1-\frac{1}{|\mathcal{X}|\times P_X(x)}\right)\\
&= \log_2(e)\sum_{x\in\mathcal{X}}\left(P_X(x)-\frac{1}{|\mathcal{X}|}\right) = \log_2(e)\cdot(1-1) = 0,
\end{align*}
where the inequality follows from the FI Lemma, with equality iff (∀x ∈ 𝒳) |𝒳| × P_X(x) = 1, which means P_X(·) is a uniform distribution on 𝒳. □

Intuitively, H(X) tells us how random X is. Indeed, X is deterministic (not random at all) iff H(X) = 0. If X is uniform (equiprobable), H(X) is maximized, and is equal to log_2|𝒳|.
Lemma 2.7 (Log-sum inequality) For non-negative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,
\[ \sum_{i=1}^{n} a_i\log_D\frac{a_i}{b_i} \;\ge\; \left(\sum_{i=1}^{n} a_i\right)\log_D\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} \tag{2.1.4} \]
with equality holding iff (∀ 1 ≤ i ≤ n) a_i/b_i = a_1/b_1, a constant independent of i. (By convention, 0·log_D(0) = 0, 0·log_D(0/0) = 0 and a·log_D(a/0) = ∞ if a > 0. Again, this can be justified by "continuity.")
Proof: Let a ≜ Σ_{i=1}^n a_i and b ≜ Σ_{i=1}^n b_i. Then
\begin{align*}
\sum_{i=1}^{n} a_i\log_D\frac{a_i}{b_i} - a\log_D\frac{a}{b}
&= a\left[\sum_{i=1}^{n}\frac{a_i}{a}\log_D\frac{a_i}{b_i} - \underbrace{\left(\sum_{i=1}^{n}\frac{a_i}{a}\right)}_{=1}\log_D\frac{a}{b}\right]\\
&= a\sum_{i=1}^{n}\frac{a_i}{a}\log_D\left(\frac{a_i}{b_i}\cdot\frac{b}{a}\right)\\
&\ge a\,\log_D(e)\sum_{i=1}^{n}\frac{a_i}{a}\left(1-\frac{b_i}{a_i}\cdot\frac{a}{b}\right)\\
&= a\,\log_D(e)\left(\sum_{i=1}^{n}\frac{a_i}{a}-\sum_{i=1}^{n}\frac{b_i}{b}\right) = a\,\log_D(e)\,(1-1) = 0,
\end{align*}
where the inequality follows from the FI Lemma, with equality holding iff (a_i/b_i)(b/a) = 1 for all i, i.e., a_i/b_i = a/b for all i.

We also provide another proof using Jensen's inequality (cf. Theorem B.6 in Appendix B). Without loss of generality, assume that a_i > 0 and b_i > 0 for every i. Jensen's inequality states that
\[ \sum_{i=1}^{n}\alpha_i f(t_i) \ge f\!\left(\sum_{i=1}^{n}\alpha_i t_i\right) \]
for any strictly convex function f(·), α_i ≥ 0, and Σ_{i=1}^n α_i = 1; equality holds iff t_i is a constant for all i. Hence by setting α_i = b_i/Σ_{j=1}^n b_j, t_i = a_i/b_i, and f(t) = t·log_D(t), we obtain the desired result. □
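A quick numerical sanity check of (2.1.4) can be done as follows; this is an illustrative Python sketch that is not part of the original notes:

```python
import math
import random

def log_sum_lhs(a, b):
    # Left-hand side of (2.1.4): sum_i a_i log2(a_i / b_i), with 0 log 0 = 0.
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def log_sum_rhs(a, b):
    # Right-hand side of (2.1.4): (sum a_i) log2(sum a_i / sum b_i).
    A, B = sum(a), sum(b)
    return A * math.log2(A / B) if A > 0 else 0.0

rng = random.Random(1)
for _ in range(1000):
    a = [rng.uniform(0, 2) for _ in range(4)]
    b = [rng.uniform(0.01, 2) for _ in range(4)]
    assert log_sum_lhs(a, b) >= log_sum_rhs(a, b) - 1e-9

# Equality holds when a_i / b_i is constant, e.g. a = 3 * b:
b = [0.2, 0.5, 1.0, 0.3]
a = [3 * bi for bi in b]
assert abs(log_sum_lhs(a, b) - log_sum_rhs(a, b)) < 1e-9
```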
2.1.4 Joint entropy and conditional entropy
Given a pair of random variables (X, Y) with a joint pmf P_{X,Y}(·,·) defined on 𝒳 × 𝒴, the self-information of the (two-dimensional) elementary event [X = x, Y = y] is defined by
\[ I(x,y) \triangleq -\log_2 P_{X,Y}(x,y). \]
This leads us to the definition of joint entropy.

Definition 2.8 (Joint entropy) The joint entropy H(X, Y) of random variables (X, Y) is defined by
\[ H(X,Y) \triangleq -\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)\cdot\log_2 P_{X,Y}(x,y) = E\big[-\log_2 P_{X,Y}(X,Y)\big]. \]
The conditional entropy can also be similarly defined as follows.

Definition 2.9 (Conditional entropy) Given two jointly distributed random variables X and Y, the conditional entropy H(Y|X) of Y given X is defined by
\[ H(Y|X) \triangleq \sum_{x\in\mathcal{X}} P_X(x)\left(-\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\cdot\log_2 P_{Y|X}(y|x)\right) \tag{2.1.5} \]
where P_{Y|X}(·|·) is the conditional pmf of Y given X.

Equation (2.1.5) can be written in three different but equivalent forms:
\begin{align*}
H(Y|X) &= -\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)\cdot\log_2 P_{Y|X}(y|x)\\
&= E\big[-\log_2 P_{Y|X}(Y|X)\big]\\
&= \sum_{x\in\mathcal{X}} P_X(x)\cdot H(Y|X=x),
\end{align*}
where H(Y|X = x) ≜ −Σ_{y∈𝒴} P_{Y|X}(y|x) log_2 P_{Y|X}(y|x).
The relationship between joint entropy and conditional entropy is exhibited
by the fact that the entropy of a pair of random variables is the entropy of one
plus the conditional entropy of the other.
Theorem 2.10 (Chain rule for entropy)
\[ H(X,Y) = H(X) + H(Y|X). \tag{2.1.6} \]
Proof: Since
\[ P_{X,Y}(x,y) = P_X(x)\,P_{Y|X}(y|x), \]
we directly obtain that
\begin{align*}
H(X,Y) &= E\big[-\log_2 P_{X,Y}(X,Y)\big]\\
&= E\big[-\log_2 P_X(X)\big] + E\big[-\log_2 P_{Y|X}(Y|X)\big]\\
&= H(X) + H(Y|X). \qquad\Box
\end{align*}
By its definition, joint entropy is commutative; i.e., H(X, Y) = H(Y, X). Hence,
\[ H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) = H(Y,X), \]
which implies that
\[ H(X) - H(X|Y) = H(Y) - H(Y|X). \tag{2.1.7} \]
The above quantity is exactly equal to the mutual information, which will be introduced in the next section.
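The chain rule (2.1.6) and identity (2.1.7) are easy to verify numerically. The following Python sketch (illustrative only, not from the notes; the joint pmf is chosen arbitrarily) computes H(X, Y), H(X) and H(Y|X) from a joint pmf given as a dictionary:

```python
import math
from collections import defaultdict

def H(probs):
    """Entropy in bits of a collection of probabilities (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint pmf P_{X,Y} over {0,1} x {0,1}, for illustration.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

# H(Y|X) via the third form in (2.1.5): sum_x P_X(x) * H(Y | X = x).
H_Y_given_X = sum(px[x] * H([joint[(x, y)] / px[x] for y in py]) for x in px)

H_XY, H_X, H_Y = H(joint.values()), H(px.values()), H(py.values())

# Chain rule (2.1.6): H(X,Y) = H(X) + H(Y|X).
assert abs(H_XY - (H_X + H_Y_given_X)) < 1e-12
# Identity (2.1.7): H(X) - H(X|Y) = H(Y) - H(Y|X).
H_X_given_Y = H_XY - H_Y
assert abs((H_X - H_X_given_Y) - (H_Y - H_Y_given_X)) < 1e-12
print(H_XY, H_X, H_Y_given_X)
```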
The conditional entropy can be thought of in terms of a channel whose input is the random variable X and whose output is the random variable Y. H(X|Y) is then called the equivocation² and corresponds to the uncertainty in the channel input from the receiver's point of view. For example, suppose that the set of possible outcomes of the random vector (X, Y) is {(0,0), (0,1), (1,0), (1,1)}, where none of the elements has zero probability mass. When the receiver Y receives 1, he still cannot determine exactly what the sender X observes (it could be either 1 or 0); therefore, the uncertainty, from the receiver's point of view, depends on the probabilities P_{X|Y}(0|1) and P_{X|Y}(1|1).

Similarly, H(Y|X), which is called prevarication,³ is the uncertainty in the channel output from the transmitter's point of view. In other words, the sender knows exactly what he sends, but is uncertain about what the receiver will finally obtain.

A case that is of specific interest is when H(X|Y) = 0. By its definition, H(X|Y) = 0 if X becomes deterministic after observing Y. In such a case, the uncertainty about X after observing Y is completely zero.

The next corollary can be proved similarly to Theorem 2.10.

Corollary 2.11 (Chain rule for conditional entropy)
\[ H(X,Y|Z) = H(X|Z) + H(Y|X,Z). \]

2.1.5 Properties of joint entropy and conditional entropy

Lemma 2.12 (Conditioning never increases entropy) Side information Y decreases the uncertainty about X:
\[ H(X|Y) \le H(X) \]
with equality holding iff X and Y are independent. In other words, "conditioning" reduces entropy.

²Equivocation is an ambiguous statement one uses deliberately in order to deceive or avoid speaking the truth.
³Prevarication is the deliberate act of deviating from the truth (it is a synonym of "equivocation").
Proof:
\begin{align*}
H(X) - H(X|Y) &= \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)\cdot\log_2\frac{P_{X|Y}(x|y)}{P_X(x)}\\
&= \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)\cdot\log_2\frac{P_{X|Y}(x|y)P_Y(y)}{P_X(x)P_Y(y)}\\
&= \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)\cdot\log_2\frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)}\\
&\ge \left(\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)\right)\log_2\frac{\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_{X,Y}(x,y)}{\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P_X(x)P_Y(y)} = 0,
\end{align*}
where the inequality follows from the log-sum inequality, with equality holding iff
\[ \frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)} = \text{constant} \quad \forall (x,y)\in\mathcal{X}\times\mathcal{Y}. \]
Since probability must sum to 1, the above constant equals 1, which is exactly the case of X being independent of Y. □
Lemma 2.13 Entropy is additive for independent random variables; i.e.,
\[ H(X,Y) = H(X) + H(Y) \quad\text{for independent } X \text{ and } Y. \]
Proof: By the previous lemma, independence of X and Y implies H(Y|X) = H(Y). Hence
\[ H(X,Y) = H(X) + H(Y|X) = H(X) + H(Y). \qquad\Box \]

Since conditioning never increases entropy, it follows that
\[ H(X,Y) = H(X) + H(Y|X) \le H(X) + H(Y). \tag{2.1.8} \]
The above lemma tells us that equality holds in (2.1.8) only when X is independent of Y.

A result similar to (2.1.8) also applies to conditional entropy.
Lemma 2.14 Conditional entropy is lower additive; i.e.,
\[ H(X_1,X_2|Y_1,Y_2) \le H(X_1|Y_1) + H(X_2|Y_2). \]
Equality holds iff
\[ P_{X_1,X_2|Y_1,Y_2}(x_1,x_2|y_1,y_2) = P_{X_1|Y_1}(x_1|y_1)\,P_{X_2|Y_2}(x_2|y_2) \]
for all x_1, x_2, y_1 and y_2.

Proof: Using the chain rule for conditional entropy and the fact that conditioning reduces entropy, we can write
\begin{align}
H(X_1,X_2|Y_1,Y_2) &= H(X_1|Y_1,Y_2) + H(X_2|X_1,Y_1,Y_2) \nonumber\\
&\le H(X_1|Y_1,Y_2) + H(X_2|Y_1,Y_2) \tag{2.1.9}\\
&\le H(X_1|Y_1) + H(X_2|Y_2). \tag{2.1.10}
\end{align}
For (2.1.9), equality holds iff X_1 and X_2 are conditionally independent given (Y_1, Y_2): P_{X_1,X_2|Y_1,Y_2}(x_1,x_2|y_1,y_2) = P_{X_1|Y_1,Y_2}(x_1|y_1,y_2)\,P_{X_2|Y_1,Y_2}(x_2|y_1,y_2). For (2.1.10), equality holds iff X_1 is conditionally independent of Y_2 given Y_1 (i.e., P_{X_1|Y_1,Y_2}(x_1|y_1,y_2) = P_{X_1|Y_1}(x_1|y_1)), and X_2 is conditionally independent of Y_1 given Y_2 (i.e., P_{X_2|Y_1,Y_2}(x_2|y_1,y_2) = P_{X_2|Y_2}(x_2|y_2)). Hence, the desired equality condition of the lemma is obtained. □
2.2 Mutual information
For two random variables X and Y, the mutual information between X and Y is the reduction in the uncertainty of Y due to the knowledge of X (or vice versa). A dual definition of mutual information states that it is the average amount of information that Y has (or contains) about X, or that X has (or contains) about Y.

We can think of the mutual information between X and Y in terms of a channel whose input is X and whose output is Y. Thereby the reduction of the uncertainty is by definition the total uncertainty of X (i.e., H(X)) minus the uncertainty of X after observing Y (i.e., H(X|Y)). Mathematically, it is
\[ \text{mutual information} = I(X;Y) \triangleq H(X) - H(X|Y). \tag{2.2.1} \]
It can be easily verified from (2.1.7) that mutual information is symmetric; i.e., I(X;Y) = I(Y;X).
[Figure 2.2: Relation between entropy and mutual information (Venn-diagram view of H(X), H(Y), H(X|Y), H(Y|X), I(X;Y) and H(X,Y)).]
2.2.1 Properties of mutual information
Lemma 2.15

1. \( I(X;Y) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_{X,Y}(x,y)\log_2\dfrac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)}. \)
2. I(X;Y) = I(Y;X).
3. I(X;Y) = H(X) + H(Y) − H(X,Y).
4. I(X;Y) ≤ H(X), with equality holding iff X is a function of Y (i.e., X = f(Y) for some function f(·)).
5. I(X;Y) ≥ 0, with equality holding iff X and Y are independent.
6. I(X;Y) ≤ min{log_2|𝒳|, log_2|𝒴|}.

Proof: Properties 1, 2, 3, and 4 follow immediately from the definition. Property 5 is a direct consequence of Lemma 2.12. Property 6 holds iff I(X;Y) ≤ log_2|𝒳| and I(X;Y) ≤ log_2|𝒴|. To show the first inequality, we write I(X;Y) = H(X) − H(X|Y), use the fact that H(X|Y) is non-negative and apply Lemma 2.6. A similar proof can be used to show that I(X;Y) ≤ log_2|𝒴|. □
The relationships between H(X), H(Y), H(X, Y ), H(X|Y), H(Y|X) and
I(X;Y) can be illustrated by the Venn diagram in Figure 2.2.
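As a small numerical illustration of Lemma 2.15 (not part of the original notes; the joint pmf below is arbitrary), the following Python sketch computes I(X;Y) directly from a joint pmf and checks properties 3 and 5:

```python
import math
from collections import defaultdict

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} P(x,y) log2[ P(x,y) / (P(x)P(y)) ] for a dict {(x, y): prob}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px, py = {0: 0.5, 1: 0.5}, {0: 0.6, 1: 0.4}

I = mutual_information(joint)
# Property 3: I(X;Y) = H(X) + H(Y) - H(X,Y).
assert abs(I - (entropy(px.values()) + entropy(py.values()) - entropy(joint.values()))) < 1e-12
# Property 5: I(X;Y) = 0 for an independent pair (product distribution).
indep = {(x, y): px[x] * py[y] for x in px for y in py}
assert abs(mutual_information(indep)) < 1e-12
print(I)  # ~0.1245 bits for this joint pmf
```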
2.2.2 Conditional mutual information
The conditional mutual information, denoted by I(X;Y|Z), is defined as the common uncertainty between X and Y under the knowledge of Z. It is mathematically defined by
\[ I(X;Y|Z) \triangleq H(X|Z) - H(X|Y,Z). \tag{2.2.2} \]

Lemma 2.16 (Chain rule for mutual information)
\[ I(X;Y,Z) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z). \]
Proof: Without loss of generality, we only prove the first equality:
\begin{align*}
I(X;Y,Z) &= H(X) - H(X|Y,Z)\\
&= H(X) - H(X|Y) + H(X|Y) - H(X|Y,Z)\\
&= I(X;Y) + I(X;Z|Y). \qquad\Box
\end{align*}
The above lemma can be read as: the information that (Y, Z ) has about X
is equal to the information that Yhas about Xplus the information that Zhas
about Xwhen Yis already known.
2.3 Properties of entropy and mutual information for
multiple random variables
Theorem 2.17 (Chain rule for entropy) Let X_1, X_2, ..., X_n be drawn according to P_{X^n}(x^n) ≜ P_{X_1,...,X_n}(x_1,...,x_n), where we use the common superscript notation to denote an n-tuple: X^n ≜ (X_1, ..., X_n) and x^n ≜ (x_1, ..., x_n). Then
\[ H(X_1,X_2,\ldots,X_n) = \sum_{i=1}^{n} H(X_i|X_{i-1},\ldots,X_1), \]
where H(X_i|X_{i-1},...,X_1) ≜ H(X_1) for i = 1. (The above chain rule can also be written as
\[ H(X^n) = \sum_{i=1}^{n} H(X_i|X^{i-1}), \]
where X^i ≜ (X_1, ..., X_i).)
Proof: From (2.1.6),
\[ H(X_1,X_2,\ldots,X_n) = H(X_1,X_2,\ldots,X_{n-1}) + H(X_n|X_{n-1},\ldots,X_1). \tag{2.3.1} \]
Once again, applying (2.1.6) to the first term of the right-hand side of (2.3.1), we have
\[ H(X_1,X_2,\ldots,X_{n-1}) = H(X_1,X_2,\ldots,X_{n-2}) + H(X_{n-1}|X_{n-2},\ldots,X_1). \]
The desired result can then be obtained by repeatedly applying (2.1.6). □
Theorem 2.18 (Chain rule for conditional entropy)
\[ H(X_1,X_2,\ldots,X_n|Y) = \sum_{i=1}^{n} H(X_i|X_{i-1},\ldots,X_1,Y). \]
Proof: The theorem can be proved similarly to Theorem 2.17. □
Theorem 2.19 (Chain rule for mutual information)
\[ I(X_1,X_2,\ldots,X_n;Y) = \sum_{i=1}^{n} I(X_i;Y|X_{i-1},\ldots,X_1), \]
where I(X_i;Y|X_{i-1},...,X_1) ≜ I(X_1;Y) for i = 1.

Proof: This can be proved by first expressing mutual information in terms of entropy and conditional entropy, and then applying the chain rules for entropy and conditional entropy. □
Theorem 2.20 (Independence bound on entropy)
\[ H(X_1,X_2,\ldots,X_n) \le \sum_{i=1}^{n} H(X_i). \]
Equality holds iff all the X_i's are independent from each other.⁴

⁴This condition is equivalent to X_i being independent of (X_{i-1},...,X_1) for all i. Their equivalence can be easily proved by the chain rule for probabilities, i.e., P_{X^n}(x^n) = ∏_{i=1}^{n} P(x_i|x_1^{i-1}), which is left to the readers as an exercise.
Proof: By applying the chain rule for entropy,
\[ H(X_1,X_2,\ldots,X_n) = \sum_{i=1}^{n} H(X_i|X_{i-1},\ldots,X_1) \le \sum_{i=1}^{n} H(X_i). \]
Equality holds iff each conditional entropy is equal to its associated entropy, that is, iff X_i is independent of (X_{i-1},...,X_1) for all i. □
Theorem 2.21 (Bound on mutual information) If {(X_i, Y_i)}_{i=1}^{n} is a process satisfying the conditional independence assumption P_{Y^n|X^n} = ∏_{i=1}^{n} P_{Y_i|X_i}, then
\[ I(X_1,\ldots,X_n;Y_1,\ldots,Y_n) \le \sum_{i=1}^{n} I(X_i;Y_i) \]
with equality holding iff {X_i}_{i=1}^{n} are independent.

Proof: From the independence bound on entropy, we have
\[ H(Y_1,\ldots,Y_n) \le \sum_{i=1}^{n} H(Y_i). \]
By the conditional independence assumption, we have
\begin{align*}
H(Y_1,\ldots,Y_n|X_1,\ldots,X_n) &= E\big[-\log_2 P_{Y^n|X^n}(Y^n|X^n)\big]\\
&= E\left[-\sum_{i=1}^{n}\log_2 P_{Y_i|X_i}(Y_i|X_i)\right]\\
&= \sum_{i=1}^{n} H(Y_i|X_i).
\end{align*}
Hence
\[ I(X^n;Y^n) = H(Y^n) - H(Y^n|X^n) \le \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i|X_i) = \sum_{i=1}^{n} I(X_i;Y_i) \]
with equality holding iff {Y_i}_{i=1}^{n} are independent, which holds iff {X_i}_{i=1}^{n} are independent. □
[Figure 2.3: Communication context of the data processing lemma: Source → U → Encoder → X → Channel → Y → Decoder → V, with I(U;V) ≤ I(X;Y). "By processing, we can only reduce (mutual) information, but the processed information may be in a more useful form!"]
2.4 Data processing inequality
Lemma 2.22 (Data processing inequality) (This is also called the data pro-
cessing lemma.) If XYZ, then I(X;Y)I(X;Z).
Proof: The Markov chain relationship XYZmeans that Xand Z
are conditional independent given Y(cf. Appendix B); we directly have that
I(X;Z|Y) = 0. By the chain rule for mutual information,
I(X;Z) + I(X;Y|Z) = I(X;Y, Z ) (2.4.1)
=I(X;Y) + I(X;Z|Y)
=I(X;Y).(2.4.2)
Since I(X;Y|Z)0, we obtain that I(X;Y)I(X;Z) with equality holding
iff I(X;Y|Z) = 0. 2
The data processing inequality means that the mutual information will not increase after processing. This result is somewhat counter-intuitive since, given two random variables X and Y, we might believe that applying a well-designed processing scheme to Y, which can be generally represented by a mapping g(Y), could possibly increase the mutual information. However, for any g(·), X → Y → g(Y) forms a Markov chain, which implies that data processing cannot increase mutual information. A communication context for the data processing lemma is depicted in Figure 2.3, and summarized in the next corollary.

Corollary 2.23 For jointly distributed random variables X and Y and any function g(·), we have X → Y → g(Y) and
\[ I(X;Y) \ge I(X;g(Y)). \]
We also note that if Z obtains all the information about X through Y, then knowing Z will not help increase the mutual information between X and Y; this is formalized in the following.

Corollary 2.24 If X → Y → Z, then
\[ I(X;Y|Z) \le I(X;Y). \]
Proof: The proof directly follows from (2.4.1) and (2.4.2). □
It is worth pointing out that it is possible that I(X;Y|Z) > I(X;Y) when X, Y and Z do not form a Markov chain. For example, let X and Y be independent equiprobable binary random variables, and let Z = X + Y. Then,
\begin{align*}
I(X;Y|Z) &= H(X|Z) - H(X|Y,Z)\\
&= H(X|Z)\\
&= P_Z(0)H(X|Z=0) + P_Z(1)H(X|Z=1) + P_Z(2)H(X|Z=2)\\
&= 0 + 0.5 + 0 = 0.5 \text{ bits},
\end{align*}
which is clearly larger than I(X;Y) = 0.
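This example can also be checked numerically. The sketch below (illustrative Python, not from the notes) builds the joint pmf of (X, Y, Z) with Z = X + Y and evaluates both quantities:

```python
import math
from collections import defaultdict
from itertools import product

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# X, Y independent equiprobable bits; Z = X + Y takes values in {0, 1, 2}.
joint_xyz = defaultdict(float)
for x, y in product((0, 1), repeat=2):
    joint_xyz[(x, y, x + y)] += 0.25

def marginal(keep):
    """Marginal pmf over the coordinates named in `keep` (a subset of 'xyz')."""
    m = defaultdict(float)
    for (x, y, z), p in joint_xyz.items():
        m[tuple(v for v, k in zip((x, y, z), "xyz") if k in keep)] += p
    return m

H = lambda keep: entropy(marginal(keep).values())

# I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = [H(X,Z) - H(Z)] - [H(X,Y,Z) - H(Y,Z)].
I_XY_given_Z = (H("xz") - H("z")) - (H("xyz") - H("yz"))
I_XY = H("x") + H("y") - H("xy")
print(I_XY_given_Z, I_XY)  # 0.5 bits versus 0.0 bits
```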
Finally, we observe that we can extend the data processing inequality to a sequence of random variables forming a Markov chain:

Corollary 2.25 If X_1 → X_2 → ··· → X_n, then for any i, j, k, l such that 1 ≤ i ≤ j ≤ k ≤ l ≤ n, we have that
\[ I(X_i;X_l) \le I(X_j;X_k). \]
2.5 Fano’s inequality
Fano's inequality is quite a useful tool widely employed in Information Theory to prove converse results for coding theorems (as we will see in the following chapters).
Lemma 2.26 (Fano's inequality) Let X and Y be two random variables, correlated in general, with alphabets 𝒳 and 𝒴, respectively, where 𝒳 is finite but 𝒴 can be countably infinite. Let X̂ ≜ g(Y) be an estimate of X from observing Y, where g: 𝒴 → 𝒳 is a given estimation function. Define the probability of error as
\[ P_e \triangleq \Pr[\hat{X} \ne X]. \]
Then the following inequality holds:
\[ H(X|Y) \le h_b(P_e) + P_e\cdot\log_2(|\mathcal{X}|-1), \tag{2.5.1} \]
where h_b(x) ≜ −x log_2 x − (1−x) log_2(1−x) for 0 ≤ x ≤ 1 is the binary entropy function.
Observation 2.27

• Note that when P_e = 0, we obtain that H(X|Y) = 0 (see (2.5.1)) as intuition suggests, since if P_e = 0, then X̂ = g(Y) = X (with probability 1) and thus H(X|Y) = H(g(Y)|Y) = 0.

• Fano's inequality yields upper and lower bounds on P_e in terms of H(X|Y). This is illustrated in Figure 2.4, where we plot the region for the pairs (P_e, H(X|Y)) that are permissible under Fano's inequality. In the figure, the boundary of the permissible (dashed) region is given by the function
\[ f(P_e) \triangleq h_b(P_e) + P_e\cdot\log_2(|\mathcal{X}|-1), \]
the right-hand side of (2.5.1). We obtain that when
\[ \log_2(|\mathcal{X}|-1) < H(X|Y) \le \log_2(|\mathcal{X}|), \]
P_e can be upper and lower bounded as follows:
\[ 0 < \inf\{a: f(a)\ge H(X|Y)\} \le P_e \le \sup\{a: f(a)\ge H(X|Y)\} < 1. \]
Furthermore, when
\[ 0 < H(X|Y) \le \log_2(|\mathcal{X}|-1), \]
only the lower bound holds:
\[ P_e \ge \inf\{a: f(a)\ge H(X|Y)\} > 0. \]
Thus for all non-zero values of H(X|Y), we obtain a lower bound (of the same form above) on P_e; the bound implies that if H(X|Y) is bounded away from zero, P_e is also bounded away from zero.

• A weaker but simpler version of Fano's inequality can be directly obtained from (2.5.1) by noting that h_b(P_e) ≤ 1:
\[ H(X|Y) \le 1 + P_e\log_2(|\mathcal{X}|-1), \tag{2.5.2} \]
which in turn yields that
\[ P_e \ge \frac{H(X|Y)-1}{\log_2(|\mathcal{X}|-1)} \quad (\text{for } |\mathcal{X}| > 2), \]
which is weaker than the above lower bound on P_e.
[Figure 2.4: Permissible (P_e, H(X|Y)) region due to Fano's inequality; H(X|Y) ranges over (0, log_2|𝒳|] and P_e over [0, 1], with the boundary curve f(P_e) reaching the value log_2|𝒳| at P_e = (|𝒳|−1)/|𝒳| and the value log_2(|𝒳|−1) at P_e = 1.]
Proof of Lemma 2.26: Define a new random variable
\[ E \triangleq \begin{cases} 1, & \text{if } g(Y) \ne X\\ 0, & \text{if } g(Y) = X.\end{cases} \]
Then using the chain rule for conditional entropy, we obtain
\[ H(E,X|Y) = H(X|Y) + H(E|X,Y) = H(E|Y) + H(X|E,Y). \]
Observe that E is a function of X and Y; hence, H(E|X,Y) = 0. Since conditioning never increases entropy, H(E|Y) ≤ H(E) = h_b(P_e). The remaining term, H(X|E,Y), can be bounded as follows:
\begin{align*}
H(X|E,Y) &= \Pr[E=0]\,H(X|Y,E=0) + \Pr[E=1]\,H(X|Y,E=1)\\
&\le (1-P_e)\cdot 0 + P_e\cdot\log_2(|\mathcal{X}|-1),
\end{align*}
since X = g(Y) for E = 0, and given E = 1, we can upper bound the conditional entropy by the log of the number of remaining outcomes, i.e., (|𝒳| − 1). Combining these results completes the proof. □
Fano's inequality cannot be improved in the sense that the lower bound, H(X|Y), can be achieved for some specific cases. Any bound that can be achieved in some cases is often referred to as sharp.⁵ From the proof of the above lemma, we can observe that equality holds in Fano's inequality if H(E|Y) = H(E) and H(X|Y, E=1) = log_2(|𝒳|−1). The former is equivalent to E being independent of Y, and the latter holds iff P_{X|Y}(·|y) is uniformly distributed over the set 𝒳 \ {g(y)}. We can therefore create an example in which equality holds in Fano's inequality.
Example 2.28 Suppose that X and Y are two independent random variables which are both uniformly distributed on the alphabet {0, 1, 2}. Let the estimating function be given by g(y) = y. Then
\[ P_e = \Pr[g(Y)\ne X] = \Pr[Y\ne X] = 1 - \sum_{x=0}^{2} P_X(x)P_Y(x) = \frac{2}{3}. \]
In this case, equality is achieved in Fano's inequality, i.e.,
\[ h_b\!\left(\frac{2}{3}\right) + \frac{2}{3}\cdot\log_2(3-1) = H(X|Y) = H(X) = \log_2 3. \]
To conclude this section, we present an alternative proof for Fano’s inequality
to illustrate the use of the data processing inequality and the FI Lemma.
Alternative Proof of Fano's inequality: Noting that X → Y → X̂ form a Markov chain, we directly obtain via the data processing inequality that
\[ I(X;Y) \ge I(X;\hat{X}), \]
which implies that
\[ H(X|Y) \le H(X|\hat{X}). \]
Thus, if we show that H(X|X̂) is no larger than the right-hand side of (2.5.1), the proof of (2.5.1) is complete.

Noting that
\[ P_e = \sum_{x\in\mathcal{X}}\ \sum_{\hat{x}\in\mathcal{X}:\,\hat{x}\ne x} P_{X,\hat{X}}(x,\hat{x}) \]
and
\[ 1-P_e = \sum_{x\in\mathcal{X}}\ \sum_{\hat{x}\in\mathcal{X}:\,\hat{x}=x} P_{X,\hat{X}}(x,\hat{x}) = \sum_{x\in\mathcal{X}} P_{X,\hat{X}}(x,x), \]
we obtain that
\begin{align}
&H(X|\hat{X}) - h_b(P_e) - P_e\log_2(|\mathcal{X}|-1) \nonumber\\
&= \sum_{x\in\mathcal{X}}\sum_{\hat{x}\ne x} P_{X,\hat{X}}(x,\hat{x})\log_2\frac{1}{P_{X|\hat{X}}(x|\hat{x})} + \sum_{x\in\mathcal{X}} P_{X,\hat{X}}(x,x)\log_2\frac{1}{P_{X|\hat{X}}(x|x)} \nonumber\\
&\quad - \Bigg[\sum_{x\in\mathcal{X}}\sum_{\hat{x}\ne x} P_{X,\hat{X}}(x,\hat{x})\Bigg]\log_2\frac{|\mathcal{X}|-1}{P_e} + \Bigg[\sum_{x\in\mathcal{X}} P_{X,\hat{X}}(x,x)\Bigg]\log_2(1-P_e) \nonumber\\
&= \sum_{x\in\mathcal{X}}\sum_{\hat{x}\ne x} P_{X,\hat{X}}(x,\hat{x})\log_2\frac{P_e}{P_{X|\hat{X}}(x|\hat{x})(|\mathcal{X}|-1)} + \sum_{x\in\mathcal{X}} P_{X,\hat{X}}(x,x)\log_2\frac{1-P_e}{P_{X|\hat{X}}(x|x)} \tag{2.5.3}\\
&\le \log_2(e)\sum_{x\in\mathcal{X}}\sum_{\hat{x}\ne x} P_{X,\hat{X}}(x,\hat{x})\Bigg[\frac{P_e}{P_{X|\hat{X}}(x|\hat{x})(|\mathcal{X}|-1)}-1\Bigg] + \log_2(e)\sum_{x\in\mathcal{X}} P_{X,\hat{X}}(x,x)\Bigg[\frac{1-P_e}{P_{X|\hat{X}}(x|x)}-1\Bigg] \nonumber\\
&= \log_2(e)\Bigg[\frac{P_e}{|\mathcal{X}|-1}\sum_{x\in\mathcal{X}}\sum_{\hat{x}\ne x} P_{\hat{X}}(\hat{x}) - \sum_{x\in\mathcal{X}}\sum_{\hat{x}\ne x} P_{X,\hat{X}}(x,\hat{x})\Bigg] + \log_2(e)\Bigg[(1-P_e)\sum_{x\in\mathcal{X}} P_{\hat{X}}(x) - \sum_{x\in\mathcal{X}} P_{X,\hat{X}}(x,x)\Bigg] \nonumber\\
&= \log_2(e)\Bigg[\frac{P_e}{|\mathcal{X}|-1}(|\mathcal{X}|-1) - P_e\Bigg] + \log_2(e)\big[(1-P_e)-(1-P_e)\big] = 0, \nonumber
\end{align}
where the inequality follows by applying the FI Lemma to each logarithm term in (2.5.3). □

⁵Definition. A bound is said to be sharp if the bound is achievable for some specific cases. A bound is said to be tight if the bound is achievable for all cases.
2.6 Divergence and variational distance
In addition to the probabilistically defined entropy and mutual information, an-
other measure that is frequently considered in information theory is divergence or
relative entropy. In this section, we define this measure and study its statistical
properties.
Definition 2.29 (Divergence) Given two discrete random variables X and X̂ defined over a common alphabet 𝒳, the divergence (other names are Kullback-Leibler divergence or distance, relative entropy and discrimination) is denoted by D(X‖X̂) or D(P_X‖P_X̂) and defined by⁶
\[ D(X\|\hat{X}) = D(P_X\|P_{\hat{X}}) \triangleq E_X\!\left[\log_2\frac{P_X(X)}{P_{\hat{X}}(X)}\right] = \sum_{x\in\mathcal{X}} P_X(x)\log_2\frac{P_X(x)}{P_{\hat{X}}(x)}. \]
In other words, the divergence D(P_X‖P_X̂) is the expectation (with respect to P_X) of the log-likelihood ratio log_2[P_X/P_X̂] of distribution P_X against distribution P_X̂. D(X‖X̂) can be viewed as a measure of "distance" or "dissimilarity" between distributions P_X and P_X̂. D(X‖X̂) is also called relative entropy since it can be regarded as a measure of the inefficiency of mistakenly assuming that the distribution of a source is P_X̂ when the true distribution is P_X. For example, if we know the true distribution P_X of a source, then we can construct a lossless data compression code with average codeword length achieving entropy H(X) (this will be studied in the next chapter). If, however, we mistakenly thought that the "true" distribution is P_X̂ and employ the "best" code corresponding to P_X̂, then the resultant average codeword length becomes
\[ \sum_{x\in\mathcal{X}} \big[-P_X(x)\cdot\log_2 P_{\hat{X}}(x)\big]. \]
As a result, the relative difference between the resultant average codeword length and H(X) is the relative entropy D(X‖X̂). Hence, divergence is a measure of the system cost (e.g., storage consumed) paid due to mis-classifying the system statistics.

Note that when computing divergence, we follow the convention that
\[ 0\cdot\log_2\frac{0}{p} = 0 \quad\text{and}\quad p\cdot\log_2\frac{p}{0} = \infty \quad\text{for } p > 0. \]
We next present some properties of the divergence and discuss its relation with entropy and mutual information.

⁶In order to be consistent with the units (in bits) adopted for entropy and mutual information, we will also use the base-2 logarithm for divergence unless otherwise specified.
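A direct numerical sketch of this definition (illustrative Python, not part of the notes) computes D(P‖Q) in bits and verifies the coding-penalty interpretation mentioned above:

```python
import math

def kl_divergence(P, Q):
    """D(P || Q) in bits for pmfs given as dicts over a common alphabet."""
    d = 0.0
    for x, p in P.items():
        if p > 0:
            if Q.get(x, 0.0) == 0.0:
                return math.inf          # convention: p * log(p/0) = infinity
            d += p * math.log2(p / Q[x])
    return d

P = {"a": 0.5, "b": 0.25, "c": 0.25}
Q = {"a": 0.7, "b": 0.2, "c": 0.1}

entropy_P = -sum(p * math.log2(p) for p in P.values())
mismatched_length = -sum(p * math.log2(Q[x]) for x, p in P.items())

# The penalty of coding P with the "best" code for Q is exactly D(P||Q).
assert abs((mismatched_length - entropy_P) - kl_divergence(P, Q)) < 1e-12
print(kl_divergence(P, Q), kl_divergence(Q, P))  # ~0.168 vs ~0.143: not symmetric
```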
Lemma 2.30 (Non-negativity of divergence)
\[ D(X\|\hat{X}) \ge 0 \]
with equality iff P_X(x) = P_X̂(x) for all x ∈ 𝒳 (i.e., the two distributions are equal).

Proof:
\begin{align*}
D(X\|\hat{X}) &= \sum_{x\in\mathcal{X}} P_X(x)\log_2\frac{P_X(x)}{P_{\hat{X}}(x)}\\
&\ge \left(\sum_{x\in\mathcal{X}} P_X(x)\right)\log_2\frac{\sum_{x\in\mathcal{X}} P_X(x)}{\sum_{x\in\mathcal{X}} P_{\hat{X}}(x)} = 0,
\end{align*}
where the second step follows from the log-sum inequality, with equality holding iff for every x ∈ 𝒳,
\[ \frac{P_X(x)}{P_{\hat{X}}(x)} = \frac{\sum_{a\in\mathcal{X}} P_X(a)}{\sum_{b\in\mathcal{X}} P_{\hat{X}}(b)}, \]
or equivalently P_X(x) = P_X̂(x) for all x ∈ 𝒳. □
Lemma 2.31 (Mutual information and divergence)
\[ I(X;Y) = D(P_{X,Y}\|P_X\times P_Y), \]
where P_{X,Y}(·,·) is the joint distribution of the random variables X and Y, and P_X(·) and P_Y(·) are the respective marginals.

Proof: The observation follows directly from the definitions of divergence and mutual information. □
Definition 2.32 (Refinement of distribution) Given distribution P_X on 𝒳, divide 𝒳 into k mutually disjoint sets U_1, U_2, ..., U_k, satisfying
\[ \mathcal{X} = \bigcup_{i=1}^{k}\,\mathcal{U}_i. \]
Define a new distribution P_U on U = {1, 2, ..., k} as
\[ P_U(i) = \sum_{x\in\mathcal{U}_i} P_X(x). \]
Then P_X is called a refinement (or more specifically, a k-refinement) of P_U.
Let us briefly discuss the relation between the processing of information and
its refinement. Processing of information can be modeled as a (many-to-one)
mapping, and refinement is actually the reverse operation. Recall that the
data processing lemma shows that mutual information can never increase due
to processing. Hence, if one wishes to increase mutual information, he should
simultaneously “anti-process” (or refine) the involved statistics.
From Lemma 2.31, the mutual information can be viewed as the divergence
of a joint distribution against the product distribution of the marginals. It is
therefore reasonable to expect that a similar effect due to processing (or a reverse
effect due to refinement) should also apply to divergence. This is shown in the
next lemma.
Lemma 2.33 (Refinement cannot decrease divergence) Let P_X and P_X̂ be the refinements (k-refinements) of P_U and P_Û, respectively. Then
\[ D(P_X\|P_{\hat{X}}) \ge D(P_U\|P_{\hat{U}}). \]
Proof: By the log-sum inequality, we obtain that for any i ∈ {1, 2, ..., k},
\[ \sum_{x\in\mathcal{U}_i} P_X(x)\log_2\frac{P_X(x)}{P_{\hat{X}}(x)} \ge \left(\sum_{x\in\mathcal{U}_i} P_X(x)\right)\log_2\frac{\sum_{x\in\mathcal{U}_i} P_X(x)}{\sum_{x\in\mathcal{U}_i} P_{\hat{X}}(x)} = P_U(i)\log_2\frac{P_U(i)}{P_{\hat{U}}(i)}, \tag{2.6.1} \]
with equality iff
\[ \frac{P_X(x)}{P_{\hat{X}}(x)} = \frac{P_U(i)}{P_{\hat{U}}(i)} \quad\text{for all } x\in\mathcal{U}_i. \]
Hence,
\[ D(P_X\|P_{\hat{X}}) = \sum_{i=1}^{k}\sum_{x\in\mathcal{U}_i} P_X(x)\log_2\frac{P_X(x)}{P_{\hat{X}}(x)} \ge \sum_{i=1}^{k} P_U(i)\log_2\frac{P_U(i)}{P_{\hat{U}}(i)} = D(P_U\|P_{\hat{U}}), \]
with equality iff
\[ (\forall i)(\forall x\in\mathcal{U}_i)\quad \frac{P_X(x)}{P_{\hat{X}}(x)} = \frac{P_U(i)}{P_{\hat{U}}(i)}. \qquad\Box \]
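To see Lemma 2.33 in action, here is a small sketch (illustrative Python, not from the notes) that merges alphabet symbols into groups, i.e., goes from the refinement to the coarser distribution, and checks that the divergence can only drop:

```python
import math

def D(P, Q):
    """D(P || Q) in bits; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(p * math.log2(p / Q[x]) for x, p in P.items() if p > 0)

def coarsen(P, groups):
    """Map a pmf on X to the induced pmf P_U on group indices (the reverse of refinement)."""
    return {i: sum(P[x] for x in g) for i, g in enumerate(groups)}

P  = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
Ph = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
groups = [("a", "b"), ("c", "d")]     # U_1 = {a, b}, U_2 = {c, d}

PU, PhU = coarsen(P, groups), coarsen(Ph, groups)
print(D(P, Ph), D(PU, PhU))           # the refined divergence is the larger one
assert D(P, Ph) >= D(PU, PhU) - 1e-12
```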
Observation 2.34 One drawback of adopting the divergence as a measure between two distributions is that it does not meet the symmetry requirement of a true distance,⁷ since interchanging its two arguments may yield different quantities. In other words, D(P_X‖P_X̂) ≠ D(P_X̂‖P_X) in general. (It also does not satisfy the triangular inequality.) Thus divergence is not a true distance or metric. Another measure which is a true distance, called variational distance, is sometimes used instead.
Definition 2.35 (Variational distance) The variational distance (or L_1-distance) between two distributions P_X and P_X̂ with common alphabet 𝒳 is defined by
\[ \|P_X - P_{\hat{X}}\| \triangleq \sum_{x\in\mathcal{X}} |P_X(x) - P_{\hat{X}}(x)|. \]
Lemma 2.36 The variational distance satisfies
\[ \|P_X - P_{\hat{X}}\| = 2\cdot\sup_{E\subset\mathcal{X}} |P_X(E) - P_{\hat{X}}(E)| = 2\cdot\!\!\sum_{x\in\mathcal{X}:\,P_X(x)>P_{\hat{X}}(x)}\!\! \big[P_X(x) - P_{\hat{X}}(x)\big]. \]
Proof: We first show that ‖P_X − P_X̂‖ = 2·Σ_{x∈𝒳: P_X(x)>P_X̂(x)} [P_X(x) − P_X̂(x)]. Setting A ≜ {x ∈ 𝒳: P_X(x) > P_X̂(x)}, we have
\begin{align*}
\|P_X - P_{\hat{X}}\| &= \sum_{x\in\mathcal{X}} |P_X(x)-P_{\hat{X}}(x)|\\
&= \sum_{x\in\mathcal{A}} \big[P_X(x)-P_{\hat{X}}(x)\big] + \sum_{x\in\mathcal{A}^c} \big[P_{\hat{X}}(x)-P_X(x)\big]\\
&= \sum_{x\in\mathcal{A}} \big[P_X(x)-P_{\hat{X}}(x)\big] + P_{\hat{X}}(\mathcal{A}^c) - P_X(\mathcal{A}^c)\\
&= \sum_{x\in\mathcal{A}} \big[P_X(x)-P_{\hat{X}}(x)\big] + P_X(\mathcal{A}) - P_{\hat{X}}(\mathcal{A})\\
&= 2\cdot\sum_{x\in\mathcal{A}} \big[P_X(x)-P_{\hat{X}}(x)\big],
\end{align*}
where A^c denotes the complement set of A.

We next prove that ‖P_X − P_X̂‖ = 2·sup_{E⊂𝒳} |P_X(E) − P_X̂(E)| by showing that each quantity is greater than or equal to the other. For any set E ⊂ 𝒳, we can write
\begin{align*}
\|P_X - P_{\hat{X}}\| &= \sum_{x\in E} |P_X(x)-P_{\hat{X}}(x)| + \sum_{x\in E^c} |P_X(x)-P_{\hat{X}}(x)|\\
&\ge \left|\sum_{x\in E} \big[P_X(x)-P_{\hat{X}}(x)\big]\right| + \left|\sum_{x\in E^c} \big[P_X(x)-P_{\hat{X}}(x)\big]\right|\\
&= |P_X(E)-P_{\hat{X}}(E)| + |P_X(E^c)-P_{\hat{X}}(E^c)|\\
&= |P_X(E)-P_{\hat{X}}(E)| + |P_{\hat{X}}(E)-P_X(E)| = 2\cdot|P_X(E)-P_{\hat{X}}(E)|.
\end{align*}
Thus ‖P_X − P_X̂‖ ≥ 2·sup_{E⊂𝒳} |P_X(E) − P_X̂(E)|. Conversely, we have that
\begin{align*}
2\cdot\sup_{E\subset\mathcal{X}} |P_X(E)-P_{\hat{X}}(E)| &\ge 2\cdot|P_X(\mathcal{A})-P_{\hat{X}}(\mathcal{A})|\\
&= |P_X(\mathcal{A})-P_{\hat{X}}(\mathcal{A})| + |P_{\hat{X}}(\mathcal{A}^c)-P_X(\mathcal{A}^c)|\\
&= \sum_{x\in\mathcal{A}} \big[P_X(x)-P_{\hat{X}}(x)\big] + \sum_{x\in\mathcal{A}^c} \big[P_{\hat{X}}(x)-P_X(x)\big]\\
&= \sum_{x\in\mathcal{A}} |P_X(x)-P_{\hat{X}}(x)| + \sum_{x\in\mathcal{A}^c} |P_X(x)-P_{\hat{X}}(x)| = \|P_X - P_{\hat{X}}\|.
\end{align*}
Therefore, ‖P_X − P_X̂‖ = 2·sup_{E⊂𝒳} |P_X(E) − P_X̂(E)|. □

⁷Given a non-empty set A, the function d: A × A → [0, ∞) is called a distance or metric if it satisfies the following properties.
1. Non-negativity: d(a, b) ≥ 0 for every a, b ∈ A, with equality holding iff a = b.
2. Symmetry: d(a, b) = d(b, a) for every a, b ∈ A.
3. Triangular inequality: d(a, b) + d(b, c) ≥ d(a, c) for every a, b, c ∈ A.
Lemma 2.37 (Variational distance vs divergence: Pinsker's inequality)
\[ D(X\|\hat{X}) \ge \frac{\log_2(e)}{2}\cdot\|P_X - P_{\hat{X}}\|^2. \]
This result is referred to as Pinsker's inequality.
Proof:

1. With A ≜ {x ∈ 𝒳: P_X(x) > P_X̂(x)}, we have from the previous lemma that
\[ \|P_X - P_{\hat{X}}\| = 2\big[P_X(\mathcal{A}) - P_{\hat{X}}(\mathcal{A})\big]. \]

2. Define two random variables U and Û as:
\[ U = \begin{cases} 1, & \text{if } X\in\mathcal{A};\\ 0, & \text{if } X\in\mathcal{A}^c,\end{cases} \qquad \hat{U} = \begin{cases} 1, & \text{if } \hat{X}\in\mathcal{A};\\ 0, & \text{if } \hat{X}\in\mathcal{A}^c.\end{cases} \]
Then P_X and P_X̂ are refinements (2-refinements) of P_U and P_Û, respectively. From Lemma 2.33, we obtain that
\[ D(P_X\|P_{\hat{X}}) \ge D(P_U\|P_{\hat{U}}). \]

3. The proof is complete if we show that
\[ D(P_U\|P_{\hat{U}}) \ge 2\log_2(e)\big[P_X(\mathcal{A}) - P_{\hat{X}}(\mathcal{A})\big]^2 = 2\log_2(e)\big[P_U(1) - P_{\hat{U}}(1)\big]^2. \]
For ease of notation, let p = P_U(1) and q = P_Û(1). Then proving the above inequality is equivalent to showing that
\[ p\cdot\ln\frac{p}{q} + (1-p)\cdot\ln\frac{1-p}{1-q} \ge 2(p-q)^2. \]
Define
\[ f(p,q) \triangleq p\cdot\ln\frac{p}{q} + (1-p)\cdot\ln\frac{1-p}{1-q} - 2(p-q)^2, \]
and observe that
\[ \frac{\partial f(p,q)}{\partial q} = (p-q)\left[4 - \frac{1}{q(1-q)}\right] \le 0 \quad\text{for } q \le p, \]
since q(1−q) ≤ 1/4. Thus, f(p, q) is non-increasing in q for q ≤ p. Also note that f(p, q) = 0 for q = p. Therefore,
\[ f(p,q) \ge 0 \quad\text{for } q \le p. \]
The proof is completed by noting that
\[ f(p,q) \ge 0 \quad\text{for } q \ge p, \]
since f(1−p, 1−q) = f(p, q). □
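The following sketch (illustrative Python, not part of the notes) computes the variational distance of Definition 2.35 and spot-checks Pinsker's inequality on randomly generated pmfs:

```python
import math
import random

def kl_bits(P, Q):
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

def variational_distance(P, Q):
    # ||P - Q|| = sum_x |P(x) - Q(x)| = 2 * sum_{x: P(x) > Q(x)} [P(x) - Q(x)]  (Lemma 2.36).
    return sum(abs(p - q) for p, q in zip(P, Q))

def random_pmf(k, rng):
    w = [rng.random() + 1e-9 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    P, Q = random_pmf(5, rng), random_pmf(5, rng)
    d, v = kl_bits(P, Q), variational_distance(P, Q)
    # Pinsker's inequality (Lemma 2.37): D(P||Q) >= (log2 e / 2) * ||P - Q||^2.
    assert d >= (math.log2(math.e) / 2) * v * v - 1e-12
```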
Observation 2.38 The above lemma tells us that for a sequence of distributions {(P_{X_n}, P_{X̂_n})}_{n≥1}, when D(P_{X_n}‖P_{X̂_n}) goes to zero as n goes to infinity, ‖P_{X_n} − P_{X̂_n}‖ goes to zero as well. But the converse does not necessarily hold. For a quick counterexample, let
\[ P_{X_n}(0) = 1 - P_{X_n}(1) = 1/n > 0 \]
and
\[ P_{\hat{X}_n}(0) = 1 - P_{\hat{X}_n}(1) = 0. \]
In this case,
\[ D(P_{X_n}\|P_{\hat{X}_n}) = \infty, \]
since by convention, (1/n)·log_2((1/n)/0) = ∞. However,
\[ \|P_{X_n} - P_{\hat{X}_n}\| = 2\Big[P_{X_n}\{x: P_{X_n}(x) > P_{\hat{X}_n}(x)\} - P_{\hat{X}_n}\{x: P_{X_n}(x) > P_{\hat{X}_n}(x)\}\Big] = \frac{2}{n} \to 0. \]
We however can upper bound D(P_X‖P_X̂) by the variational distance between P_X and P_X̂ when D(P_X‖P_X̂) < ∞.

Lemma 2.39 If D(P_X‖P_X̂) < ∞, then
\[ D(P_X\|P_{\hat{X}}) \le \frac{\log_2(e)}{\displaystyle\min_{\{x:\,P_X(x)>0\}}\min\{P_X(x),P_{\hat{X}}(x)\}}\cdot\|P_X - P_{\hat{X}}\|. \]
Proof: Without loss of generality, we assume that P_X(x) > 0 for all x ∈ 𝒳. Since D(P_X‖P_X̂) < ∞, we have that for any x ∈ 𝒳, P_X(x) > 0 implies that P_X̂(x) > 0. Let
\[ t \triangleq \min_{\{x\in\mathcal{X}:\,P_X(x)>0\}}\min\{P_X(x),P_{\hat{X}}(x)\}. \]
Then for all x ∈ 𝒳,
\begin{align*}
\ln\frac{P_X(x)}{P_{\hat{X}}(x)} \le \left|\ln\frac{P_X(x)}{P_{\hat{X}}(x)}\right|
&\le \left(\max_{\min\{P_X(x),P_{\hat{X}}(x)\}\le s\le\max\{P_X(x),P_{\hat{X}}(x)\}} \frac{d\ln(s)}{ds}\right)\cdot|P_X(x)-P_{\hat{X}}(x)|\\
&= \frac{1}{\min\{P_X(x),P_{\hat{X}}(x)\}}\cdot|P_X(x)-P_{\hat{X}}(x)|\\
&\le \frac{1}{t}\cdot|P_X(x)-P_{\hat{X}}(x)|.
\end{align*}
Hence,
\begin{align*}
D(P_X\|P_{\hat{X}}) &= \log_2(e)\sum_{x\in\mathcal{X}} P_X(x)\cdot\ln\frac{P_X(x)}{P_{\hat{X}}(x)}\\
&\le \frac{\log_2(e)}{t}\sum_{x\in\mathcal{X}} P_X(x)\cdot|P_X(x)-P_{\hat{X}}(x)|\\
&\le \frac{\log_2(e)}{t}\sum_{x\in\mathcal{X}} |P_X(x)-P_{\hat{X}}(x)| = \frac{\log_2(e)}{t}\cdot\|P_X - P_{\hat{X}}\|. \qquad\Box
\end{align*}
The next lemma discusses the effect of side information on divergence. As stated in Lemma 2.12, side information usually reduces entropy; it, however, increases divergence. One interpretation of these results is that side information is useful. Regarding entropy, side information provides us with more information, so uncertainty decreases. As for divergence, it is a measure or index of how easily one can differentiate the source between two candidate distributions. The larger the divergence, the easier one can tell apart the two distributions and make the right guess. In the extreme case when divergence is zero, one can never tell which distribution is the right one, since both produce the same source. So, when we obtain more information (side information), we should be able to make a better decision on the source statistics, which implies that the divergence should be larger.
Definition 2.40 (Conditional divergence) Given three discrete random variables X, X̂ and Z, where X and X̂ have a common alphabet 𝒳, we define the conditional divergence between X and X̂ given Z by
\[ D(X\|\hat{X}|Z) = D(P_{X|Z}\|P_{\hat{X}|Z}) \triangleq \sum_{z\in\mathcal{Z}}\sum_{x\in\mathcal{X}} P_{X,Z}(x,z)\log_2\frac{P_{X|Z}(x|z)}{P_{\hat{X}|Z}(x|z)}. \]
In other words, it is the expected value with respect to P_{X,Z} of the log-likelihood ratio log_2[P_{X|Z}/P_{X̂|Z}].
Lemma 2.41 (Conditional mutual information and conditional divergence) Given three discrete random variables X, Y and Z with alphabets 𝒳, 𝒴 and 𝒵, respectively, and joint distribution P_{X,Y,Z}, then
\[ I(X;Y|Z) = D(P_{X,Y|Z}\|P_{X|Z}P_{Y|Z}) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}\sum_{z\in\mathcal{Z}} P_{X,Y,Z}(x,y,z)\log_2\frac{P_{X,Y|Z}(x,y|z)}{P_{X|Z}(x|z)P_{Y|Z}(y|z)}, \]
where P_{X,Y|Z} is the conditional joint distribution of X and Y given Z, and P_{X|Z} and P_{Y|Z} are the conditional distributions of X and Y, respectively, given Z.

Proof: The proof follows directly from the definition of conditional mutual information (2.2.2) and the above definition of conditional divergence. □
Lemma 2.42 (Chain rule for divergence) For three discrete random variables X, X̂ and Z, where X and X̂ have a common alphabet 𝒳, we have that
\[ D(P_{X,Z}\|P_{\hat{X},Z}) = D(P_X\|P_{\hat{X}}) + D(P_{X|Z}\|P_{\hat{X}|Z}). \]
Proof: The proof readily follows from the divergence definitions. □
Lemma 2.43 (Conditioning never decreases divergence) For three discrete random variables X, X̂ and Z, where X and X̂ have a common alphabet 𝒳, we have that
\[ D(P_{X|Z}\|P_{\hat{X}|Z}) \ge D(P_X\|P_{\hat{X}}). \]
Proof:
\begin{align*}
&D(P_{X|Z}\|P_{\hat{X}|Z}) - D(P_X\|P_{\hat{X}})\\
&= \sum_{z\in\mathcal{Z}}\sum_{x\in\mathcal{X}} P_{X,Z}(x,z)\cdot\log_2\frac{P_{X|Z}(x|z)}{P_{\hat{X}|Z}(x|z)} - \sum_{x\in\mathcal{X}} P_X(x)\cdot\log_2\frac{P_X(x)}{P_{\hat{X}}(x)}\\
&= \sum_{z\in\mathcal{Z}}\sum_{x\in\mathcal{X}} P_{X,Z}(x,z)\cdot\log_2\frac{P_{X|Z}(x|z)}{P_{\hat{X}|Z}(x|z)} - \sum_{x\in\mathcal{X}}\left(\sum_{z\in\mathcal{Z}} P_{X,Z}(x,z)\right)\cdot\log_2\frac{P_X(x)}{P_{\hat{X}}(x)}\\
&= \sum_{z\in\mathcal{Z}}\sum_{x\in\mathcal{X}} P_{X,Z}(x,z)\cdot\log_2\frac{P_{X|Z}(x|z)P_{\hat{X}}(x)}{P_{\hat{X}|Z}(x|z)P_X(x)}\\
&\ge \sum_{z\in\mathcal{Z}}\sum_{x\in\mathcal{X}} P_{X,Z}(x,z)\cdot\log_2(e)\left(1 - \frac{P_{\hat{X}|Z}(x|z)P_X(x)}{P_{X|Z}(x|z)P_{\hat{X}}(x)}\right) \quad\text{(by the FI Lemma)}\\
&= \log_2(e)\left(1 - \sum_{x\in\mathcal{X}}\frac{P_X(x)}{P_{\hat{X}}(x)}\sum_{z\in\mathcal{Z}} P_Z(z)P_{\hat{X}|Z}(x|z)\right)\\
&= \log_2(e)\left(1 - \sum_{x\in\mathcal{X}}\frac{P_X(x)}{P_{\hat{X}}(x)}P_{\hat{X}}(x)\right) = \log_2(e)\left(1 - \sum_{x\in\mathcal{X}} P_X(x)\right) = 0,
\end{align*}
with equality holding iff for all x and z,
\[ \frac{P_X(x)}{P_{\hat{X}}(x)} = \frac{P_{X|Z}(x|z)}{P_{\hat{X}|Z}(x|z)}. \qquad\Box \]
Note that it is not necessary that
\[ D(P_{X|Z}\|P_{\hat{X}|\hat{Z}}) \ge D(P_X\|P_{\hat{X}}). \]
In other words, the side information is helpful for divergence only when it provides information on the similarity or difference of the two distributions. For the above case, Z only provides information about X, and Ẑ provides information about X̂; so the divergence certainly cannot be expected to increase. The next lemma shows that if (Z, Ẑ) is independent of (X, X̂), then the side information of (Z, Ẑ) does not help in improving the divergence of X against X̂.
Lemma 2.44 (Independent side information does not change divergence) If (X, X̂) is independent of (Z, Ẑ), then
\[ D(P_{X|Z}\|P_{\hat{X}|\hat{Z}}) = D(P_X\|P_{\hat{X}}), \]
where
\[ D(P_{X|Z}\|P_{\hat{X}|\hat{Z}}) \triangleq \sum_{x\in\mathcal{X}}\sum_{\hat{x}\in\mathcal{X}}\sum_{z\in\mathcal{Z}}\sum_{\hat{z}\in\mathcal{Z}} P_{X,\hat{X},Z,\hat{Z}}(x,\hat{x},z,\hat{z})\log_2\frac{P_{X|Z}(x|z)}{P_{\hat{X}|\hat{Z}}(\hat{x}|\hat{z})}. \]
Proof: This can be easily justified by the definition of divergence. □
Lemma 2.45 (Additivity of divergence under independence)
\[ D(P_{X,Z}\|P_{\hat{X},\hat{Z}}) = D(P_X\|P_{\hat{X}}) + D(P_Z\|P_{\hat{Z}}), \]
provided that (X, X̂) is independent of (Z, Ẑ).

Proof: This can be easily proved from the definition. □
2.7 Convexity/concavity of information measures
We next address the convexity/concavity properties of information measures
with respect to the distributions on which they are defined. Such properties will
be useful when optimizing the information measures over distribution spaces.
Lemma 2.46

1. H(P_X) is a concave function of P_X, namely
\[ H(\lambda P_X + (1-\lambda)P_{\widetilde{X}}) \ge \lambda H(P_X) + (1-\lambda)H(P_{\widetilde{X}}). \]

2. Noting that I(X;Y) can be re-written as I(P_X, P_{Y|X}), where
\[ I(P_X,P_{Y|X}) \triangleq \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)P_X(x)\log_2\frac{P_{Y|X}(y|x)}{\sum_{a\in\mathcal{X}} P_{Y|X}(y|a)P_X(a)}, \]
then I(X;Y) is a concave function of P_X (for fixed P_{Y|X}), and a convex function of P_{Y|X} (for fixed P_X).

3. D(P_X‖P_X̂) is convex with respect to both the first argument P_X and the second argument P_X̂. It is also convex in the pair (P_X, P_X̂); i.e., if (P_X, P_X̂) and (Q_X, Q_X̂) are two pairs of probability mass functions, then
\[ D(\lambda P_X + (1-\lambda)Q_X\,\|\,\lambda P_{\hat{X}} + (1-\lambda)Q_{\hat{X}}) \le \lambda\cdot D(P_X\|P_{\hat{X}}) + (1-\lambda)\cdot D(Q_X\|Q_{\hat{X}}), \tag{2.7.1} \]
for all λ ∈ [0, 1].
Proof:

1. The proof uses the log-sum inequality:
\begin{align*}
&H(\lambda P_X + (1-\lambda)P_{\widetilde{X}}) - \lambda H(P_X) - (1-\lambda)H(P_{\widetilde{X}})\\
&= \lambda\sum_{x\in\mathcal{X}} P_X(x)\log_2\frac{P_X(x)}{\lambda P_X(x)+(1-\lambda)P_{\widetilde{X}}(x)} + (1-\lambda)\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)\log_2\frac{P_{\widetilde{X}}(x)}{\lambda P_X(x)+(1-\lambda)P_{\widetilde{X}}(x)}\\
&\ge \lambda\left(\sum_{x\in\mathcal{X}} P_X(x)\right)\log_2\frac{\sum_{x\in\mathcal{X}} P_X(x)}{\sum_{x\in\mathcal{X}}[\lambda P_X(x)+(1-\lambda)P_{\widetilde{X}}(x)]} + (1-\lambda)\left(\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)\right)\log_2\frac{\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)}{\sum_{x\in\mathcal{X}}[\lambda P_X(x)+(1-\lambda)P_{\widetilde{X}}(x)]}\\
&= 0,
\end{align*}
with equality holding iff P_X(x) = P_{\widetilde{X}}(x) for all x.
2. We first show the concavity of I(P_X, P_{Y|X}) with respect to P_X. Let λ̄ = 1 − λ. Then
\begin{align*}
&I(\lambda P_X + \bar{\lambda}P_{\widetilde{X}},\,P_{Y|X}) - \lambda I(P_X,P_{Y|X}) - \bar{\lambda}I(P_{\widetilde{X}},P_{Y|X})\\
&= \lambda\sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x)\log_2\frac{\sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x)}{\sum_{x\in\mathcal{X}}[\lambda P_X(x)+\bar{\lambda}P_{\widetilde{X}}(x)]P_{Y|X}(y|x)}\\
&\quad + \bar{\lambda}\sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)P_{Y|X}(y|x)\log_2\frac{\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)P_{Y|X}(y|x)}{\sum_{x\in\mathcal{X}}[\lambda P_X(x)+\bar{\lambda}P_{\widetilde{X}}(x)]P_{Y|X}(y|x)}\\
&\ge 0 \quad\text{(by the log-sum inequality)},
\end{align*}
with equality holding iff
\[ \sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x) = \sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)P_{Y|X}(y|x) \]
for all y ∈ 𝒴. We now turn to the convexity of I(P_X, P_{Y|X}) with respect to P_{Y|X}. For ease of notation, let P_{Y_λ}(y) ≜ λP_Y(y) + λ̄P_{Ỹ}(y) and P_{Y_λ|X}(y|x) ≜ λP_{Y|X}(y|x) + λ̄P_{Ỹ|X}(y|x). Then
\begin{align*}
&\lambda I(P_X,P_{Y|X}) + \bar{\lambda}I(P_X,P_{\widetilde{Y}|X}) - I(P_X,\lambda P_{Y|X}+\bar{\lambda}P_{\widetilde{Y}|X})\\
&= \lambda\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)}{P_Y(y)} + \bar{\lambda}\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{\widetilde{Y}|X}(y|x)\log_2\frac{P_{\widetilde{Y}|X}(y|x)}{P_{\widetilde{Y}}(y)}\\
&\quad - \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y_\lambda|X}(y|x)\log_2\frac{P_{Y_\lambda|X}(y|x)}{P_{Y_\lambda}(y)}\\
&= \lambda\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)P_{Y_\lambda}(y)}{P_Y(y)P_{Y_\lambda|X}(y|x)} + \bar{\lambda}\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{\widetilde{Y}|X}(y|x)\log_2\frac{P_{\widetilde{Y}|X}(y|x)P_{Y_\lambda}(y)}{P_{\widetilde{Y}}(y)P_{Y_\lambda|X}(y|x)}\\
&\ge \lambda\log_2(e)\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\left(1-\frac{P_Y(y)P_{Y_\lambda|X}(y|x)}{P_{Y|X}(y|x)P_{Y_\lambda}(y)}\right)\\
&\quad + \bar{\lambda}\log_2(e)\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{\widetilde{Y}|X}(y|x)\left(1-\frac{P_{\widetilde{Y}}(y)P_{Y_\lambda|X}(y|x)}{P_{\widetilde{Y}|X}(y|x)P_{Y_\lambda}(y)}\right)\\
&= 0,
\end{align*}
where the inequality follows from the FI Lemma, with equality holding iff
\[ (\forall x\in\mathcal{X},\,y\in\mathcal{Y})\quad \frac{P_Y(y)}{P_{Y|X}(y|x)} = \frac{P_{\widetilde{Y}}(y)}{P_{\widetilde{Y}|X}(y|x)}. \]
3. For ease of notation, let $P_{X_\lambda}(x) \triangleq \lambda P_X(x) + (1-\lambda)P_{\tilde X}(x)$. Then
$$\lambda D(P_X\|P_{\hat X}) + (1-\lambda)D(P_{\tilde X}\|P_{\hat X}) - D(P_{X_\lambda}\|P_{\hat X})$$
$$= \lambda\sum_{x\in\mathcal X}P_X(x)\log_2\frac{P_X(x)}{P_{X_\lambda}(x)} + (1-\lambda)\sum_{x\in\mathcal X}P_{\tilde X}(x)\log_2\frac{P_{\tilde X}(x)}{P_{X_\lambda}(x)}$$
$$= \lambda D(P_X\|P_{X_\lambda}) + (1-\lambda)D(P_{\tilde X}\|P_{X_\lambda}) \ge 0$$
by the non-negativity of the divergence, with equality holding iff $P_X(x) = P_{\tilde X}(x)$ for all $x$.

Similarly, by letting $P_{\hat X_\lambda}(x) \triangleq \lambda P_{\hat X}(x) + (1-\lambda)P_{\tilde X}(x)$, we obtain:
$$\lambda D(P_X\|P_{\hat X}) + (1-\lambda)D(P_X\|P_{\tilde X}) - D(P_X\|P_{\hat X_\lambda})$$
$$= \lambda\sum_{x\in\mathcal X}P_X(x)\log_2\frac{P_{\hat X_\lambda}(x)}{P_{\hat X}(x)} + (1-\lambda)\sum_{x\in\mathcal X}P_X(x)\log_2\frac{P_{\hat X_\lambda}(x)}{P_{\tilde X}(x)}$$
$$\ge \frac{\lambda}{\ln 2}\sum_{x\in\mathcal X}P_X(x)\left(1-\frac{P_{\hat X}(x)}{P_{\hat X_\lambda}(x)}\right) + \frac{1-\lambda}{\ln 2}\sum_{x\in\mathcal X}P_X(x)\left(1-\frac{P_{\tilde X}(x)}{P_{\hat X_\lambda}(x)}\right)$$
$$= \log_2(e)\left(1-\sum_{x\in\mathcal X}P_X(x)\,\frac{\lambda P_{\hat X}(x)+(1-\lambda)P_{\tilde X}(x)}{P_{\hat X_\lambda}(x)}\right) = 0,$$
where the inequality follows from the FI Lemma, with equality holding iff $P_{\tilde X}(x) = P_{\hat X}(x)$ for all $x$.
Finally, by the log-sum inequality, for each $x\in\mathcal X$ we have
$$\big(\lambda P_X(x)+(1-\lambda)P_{\hat X}(x)\big)\log_2\frac{\lambda P_X(x)+(1-\lambda)P_{\hat X}(x)}{\lambda Q_X(x)+(1-\lambda)Q_{\hat X}(x)} \le \lambda P_X(x)\log_2\frac{\lambda P_X(x)}{\lambda Q_X(x)} + (1-\lambda)P_{\hat X}(x)\log_2\frac{(1-\lambda)P_{\hat X}(x)}{(1-\lambda)Q_{\hat X}(x)}.$$
Summing over $x$ yields (2.7.1).
Note that the last result (convexity of $D(P_X\|P_{\hat X})$ in the pair $(P_X,P_{\hat X})$) actually implies the first two claims of this part: just set $P_{\hat X}=Q_{\hat X}$ to obtain convexity in the first argument $P_X$, and set $P_X=Q_X$ to obtain convexity in the second argument $P_{\hat X}$. $\Box$
2.8 Fundamentals of hypothesis testing
One of the fundamental problems in statistics is to decide between two alternative explanations for the observed data. For example, when gambling, one may wish to test whether a game is fair or not. Similarly, a sequence of observations of the market may reveal whether a new product is successful or not. These are instances of the simplest form of the hypothesis testing problem, which is usually called simple hypothesis testing.
It has quite a few applications in information theory. One of the frequently
cited examples is the alternative interpretation of the law of large numbers.
Another example is the computation of the true coding error (for universal codes)
by testing the empirical distribution against the true distribution. All of these
cases will be discussed subsequently.
The simple hypothesis testing problem can be formulated as follows:

Problem: Let $X_1,\ldots,X_n$ be a sequence of observations which is possibly drawn according to either a "null hypothesis" distribution $P_{X^n}$ or an "alternative hypothesis" distribution $P_{\hat X^n}$. The hypotheses are usually denoted by:
$$H_0:\ P_{X^n} \qquad\qquad H_1:\ P_{\hat X^n}$$
Based on one sequence of observations $x^n$, one has to decide which of the hypotheses is true. This is denoted by a decision mapping $\phi(\cdot)$, where
$$\phi(x^n) = \begin{cases}0, & \text{if the distribution of } X^n \text{ is classified to be } P_{X^n};\\ 1, & \text{if the distribution of } X^n \text{ is classified to be } P_{\hat X^n}.\end{cases}$$
Accordingly, the possible observed sequences are divided into two groups:
$$\text{Acceptance region for } H_0:\ \{x^n\in\mathcal X^n: \phi(x^n)=0\}$$
$$\text{Acceptance region for } H_1:\ \{x^n\in\mathcal X^n: \phi(x^n)=1\}.$$
Hence, depending on the true distribution, there are two possible types of error probabilities:
$$\text{Type I error}:\ \alpha_n = \alpha_n(\phi) \triangleq P_{X^n}(\{x^n\in\mathcal X^n:\phi(x^n)=1\})$$
$$\text{Type II error}:\ \beta_n = \beta_n(\phi) \triangleq P_{\hat X^n}(\{x^n\in\mathcal X^n:\phi(x^n)=0\}).$$
The choice of the decision mapping is dependent on the optimization criterion.
Two of the most frequently used ones in information theory are:
1. Bayesian hypothesis testing.

Here, $\phi(\cdot)$ is chosen so that the Bayesian cost $\pi_0\alpha_n + \pi_1\beta_n$ is minimized, where $\pi_0$ and $\pi_1$ are the prior probabilities for the null and alternative hypotheses, respectively. The mathematical expression for Bayesian testing is:
$$\min_{\{\phi\}}\,[\pi_0\alpha_n(\phi) + \pi_1\beta_n(\phi)].$$
2. Neyman-Pearson hypothesis testing subject to a fixed test level.

Here, $\phi(\cdot)$ is chosen so that the type II error $\beta_n$ is minimized subject to a constant bound on the type I error, i.e., $\alpha_n \le \varepsilon$, where $\varepsilon>0$ is fixed. The mathematical expression for Neyman-Pearson testing is:
$$\min_{\{\phi:\,\alpha_n(\phi)\le\varepsilon\}}\beta_n(\phi).$$
The set $\{\phi\}$ considered in the minimization can range over two different classes: deterministic rules or randomized rules. The main difference between a randomized rule and a deterministic rule is that the former allows the mapping $\phi(x^n)$ to be random on $\{0,1\}$ for some $x^n$, while the latter only accepts deterministic assignments to $\{0,1\}$ for all $x^n$. For example, a randomized rule for a specific observation $\tilde x^n$ can be
$$\phi(\tilde x^n) = \begin{cases}0, & \text{with probability } 0.2;\\ 1, & \text{with probability } 0.8.\end{cases}$$
The Neyman-Pearson lemma shows the well-known fact that the likelihood
ratio test is always the optimal test.
Lemma 2.47 (Neyman-Pearson Lemma) For a simple hypothesis testing problem, define an acceptance region for the null hypothesis through the likelihood ratio as
$$\mathcal A_n(\tau) \triangleq \left\{x^n\in\mathcal X^n: \frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)} > \tau\right\},$$
and let
$$\alpha_n^* \triangleq P_{X^n}\{\mathcal A_n^c(\tau)\} \quad\text{and}\quad \beta_n^* \triangleq P_{\hat X^n}\{\mathcal A_n(\tau)\}.$$
Then for the type I error $\alpha_n$ and type II error $\beta_n$ associated with any other choice of acceptance region for the null hypothesis, we have
$$\alpha_n \le \alpha_n^* \ \Longrightarrow\ \beta_n \ge \beta_n^*.$$
Proof: Let $\mathcal B$ be a choice of acceptance region for the null hypothesis. Then
$$\alpha_n + \tau\beta_n = \sum_{x^n\in\mathcal B^c}P_{X^n}(x^n) + \tau\sum_{x^n\in\mathcal B}P_{\hat X^n}(x^n)
= \sum_{x^n\in\mathcal B^c}P_{X^n}(x^n) + \tau\left[1 - \sum_{x^n\in\mathcal B^c}P_{\hat X^n}(x^n)\right]
= \tau + \sum_{x^n\in\mathcal B^c}\left[P_{X^n}(x^n) - \tau P_{\hat X^n}(x^n)\right]. \qquad (2.8.1)$$
Observe that (2.8.1) is minimized by choosing $\mathcal B = \mathcal A_n(\tau)$. Hence,
$$\alpha_n + \tau\beta_n \ge \alpha_n^* + \tau\beta_n^*,$$
which immediately implies the desired result. $\Box$
The Neyman-Pearson lemma indicates that no other choice of acceptance region can simultaneously improve both the type I and type II errors of the likelihood ratio test. Indeed, from (2.8.1), it is clear that for any achievable pair $(\alpha_n,\beta_n)$, one can always find a likelihood ratio test that performs at least as well. Therefore, the likelihood ratio test is an optimal test. The statistical properties of the likelihood ratio thus become essential in hypothesis testing. Note that, when the observations are i.i.d. under both hypotheses, the divergence, which is the statistical expectation of the log-likelihood ratio, plays an important role in hypothesis testing (for non-memoryless observations, one is then concerned with the divergence rate, an extended notion of divergence for systems with memory which will be defined in a following chapter).
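To make the likelihood ratio test concrete, the following Python sketch (a minimal illustration, not part of the original notes; the i.i.d. Bernoulli hypotheses and the threshold value are assumptions chosen for the example) computes the type I and type II error probabilities of the acceptance region $\mathcal A_n(\tau)$ by exhaustive enumeration over all length-$n$ observation sequences.

```python
from itertools import product

def seq_prob(xs, p):
    """Probability of a binary sequence under an i.i.d. Bernoulli(p) source."""
    out = 1.0
    for x in xs:
        out *= p if x == 1 else (1.0 - p)
    return out

def lrt_errors(p0, p1, n, tau):
    """Type I and type II errors of A_n(tau) = {x^n : P_{H0}(x^n)/P_{H1}(x^n) > tau}."""
    alpha = 0.0   # probability under H0 of rejecting H0
    beta = 0.0    # probability under H1 of accepting H0
    for xs in product([0, 1], repeat=n):
        p_h0, p_h1 = seq_prob(xs, p0), seq_prob(xs, p1)
        if p_h0 / p_h1 > tau:      # x^n lies in the acceptance region for H0
            beta += p_h1
        else:                      # x^n is rejected under H0
            alpha += p_h0
    return alpha, beta

# Example: H0 is Bernoulli(0.5), H1 is Bernoulli(0.8), blocklength 10.
print(lrt_errors(0.5, 0.8, 10, tau=1.0))
```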
Chapter 3
Lossless Data Compression
3.1 Principles of data compression
As mentioned in Chapter 1, data compression describes methods of representing
a source by a code whose average codeword length (or code rate) is acceptably
small. The representation can be: lossless (or asymptotically lossless) where
the reconstructed source is identical (or asymptotically identical) to the original
source; or lossy where the reconstructed source is allowed to deviate from the
original source, usually within an acceptable threshold. We herein focus on
lossless data compression.
Since a memoryless source is modelled as a random variable, the average codeword length of a codebook is calculated based on the probability distribution of that random variable. For example, consider a ternary memoryless source $X$ with three possible outcomes and
$$P_X(x=\text{outcome}_A)=0.5,\quad P_X(x=\text{outcome}_B)=0.25,\quad P_X(x=\text{outcome}_C)=0.25.$$
Suppose that a binary codebook is designed for this source, in which outcome$_A$, outcome$_B$ and outcome$_C$ are respectively encoded as 0, 10, and 11. Then the average codeword length (in bits/source outcome) is
$$\mathrm{length}(0)\times P_X(\text{outcome}_A) + \mathrm{length}(10)\times P_X(\text{outcome}_B) + \mathrm{length}(11)\times P_X(\text{outcome}_C) = 1\times 0.5 + 2\times 0.25 + 2\times 0.25 = 1.5 \text{ bits}.$$
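As a quick numerical check of the computation above, here is a small Python snippet (an illustrative sketch, not part of the notes) that evaluates the average codeword length of an arbitrary codebook from the source pmf.

```python
def average_codeword_length(pmf, codebook):
    """Average codeword length: sum over x of P_X(x) * length(c_x)."""
    return sum(pmf[x] * len(codebook[x]) for x in pmf)

pmf = {"A": 0.5, "B": 0.25, "C": 0.25}
codebook = {"A": "0", "B": "10", "C": "11"}
print(average_codeword_length(pmf, codebook))  # 1.5 bits/source outcome
```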
There are usually no constraints on the basic structure of a code. In the
case where the codeword length for each source outcome can be different, the
code is called a variable-length code. When the codeword lengths of all source outcomes are equal, the code is referred to as a fixed-length code. It is obvious that the minimum average codeword length among all variable-length codes is no greater than that among all fixed-length codes, since the latter class is a subclass of the former. We will see in this chapter that the smallest achievable average code rates for variable-length and fixed-length codes coincide for sources with good probabilistic characteristics, such as stationarity and ergodicity. But for more general sources with memory, the two quantities are different (cf. Part II of the book).
For fixed-length codes, the sequence of adjacent codewords are concate-
nated together for storage or transmission purposes, and some punctuation
mechanism—such as marking the beginning of each codeword or delineating
internal sub-blocks for synchronization between encoder and decoder—is nor-
mally considered an implicit part of the codewords. Due to constraints on space
or processing capability, the sequence of source symbols may be too long for the
encoder to deal with all at once; therefore, segmentation before encoding is often
necessary. For example, suppose that we need to encode, using a binary code, the grades of a class with 100 students. There are three grade levels: $A$, $B$ and $C$. By observing that there are $3^{100}$ possible grade combinations for 100 students, a straightforward code design requires $\lceil\log_2(3^{100})\rceil = 159$ bits to encode these combinations. Now suppose that the encoder facility can only process 16 bits at a time. Then the above code design becomes infeasible and segmentation is unavoidable. Under such a constraint, we may encode the grades of 10 students at a time, which requires $\lceil\log_2(3^{10})\rceil = 16$ bits. As a consequence, for a class of 100 students, the code requires 160 bits in total.
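The bit counts in the grades example can be verified directly; the sketch below (illustrative only, assuming 3 grade levels, 100 students and blocks of 10 students as in the text) compares the unsegmented and segmented designs.

```python
import math

levels, students, per_block = 3, 100, 10

unsegmented = math.ceil(students * math.log2(levels))   # ceil(log2(3^100)) = 159 bits
per_segment = math.ceil(per_block * math.log2(levels))  # ceil(log2(3^10))  = 16 bits
segmented = (students // per_block) * per_segment       # 10 segments * 16 = 160 bits
print(unsegmented, per_segment, segmented)
```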
In the above example, the letters in the grade set {A, B, C }and the letters
from the code alphabet {0,1}are often called source symbols and code symbols,
respectively. When the code alphabet is binary (as in the previous two examples),
the code symbols are referred to as code bits or simply bits (as already used).
A tuple (or grouped sequence) of source symbols is called a sourceword and the
resulting encoded tuple consisting of code symbols is called a codeword. (In the
above example, each sourceword consists of 10 source symbols (students) and
each codeword consists of 16 bits.)
Note that, during the encoding process, the sourceword lengths do not have to
be equal. In this text, we however only consider the case where the sourcewords
have a fixed length throughout the encoding process (except for the Lempel-Ziv
code briefly discussed at the end of this chapter), but we will allow the codewords
to have fixed or variable lengths as defined earlier.¹ The block diagram of a source coding system is depicted in Figure 3.1.

¹ In other words, our fixed-length codes are actually "fixed-to-fixed length codes" and our variable-length codes are "fixed-to-variable length codes" since, in both cases, a fixed number of source symbols is mapped onto codewords with fixed and variable lengths, respectively.

Figure 3.1: Block diagram of a data compression system (source → source encoder → codewords → source decoder → reconstructed sourcewords).
When adding segmentation mechanisms to fixed-length codes, the codes can
be loosely divided into two groups. The first consists of block codes in which the
encoding (or decoding) of the next segment of source symbols is independent of
the previous segments. If the encoding/decoding of the next segment, somehow,
retains and uses some knowledge of earlier segments, the code is called a fixed-
length tree code. As will not investigate such codes in this text, we can use
“block codes” and “fixed-length codes” as synonyms.
In this chapter, we first consider data compression for block codes in Sec-
tion 3.2. Data compression for variable-length codes is then addressed in Sec-
tion 3.3.
3.2 Block codes for asymptotically lossless compression
3.2.1 Block codes for discrete memoryless sources
We first focus on the study of asymptotically lossless data compression of discrete
memoryless sources via block (fixed-length) codes. Such sources were already
defined in Appendix B and the previous chapter; but we nevertheless recall their
definition.
Definition 3.1 (Discrete memoryless source) A discrete memoryless source (DMS) $\{X_n\}_{n=1}^{\infty}$ consists of a sequence of independent and identically distributed (i.i.d.) random variables $X_1, X_2, X_3, \ldots$, all taking values in a common finite alphabet $\mathcal X$. In particular, if $P_X(\cdot)$ is the common distribution or probability mass function (pmf) of the $X_i$'s, then
$$P_{X^n}(x_1,x_2,\ldots,x_n) = \prod_{i=1}^{n} P_X(x_i).$$
Definition 3.2 An $(n, M)$ block code of blocklength $n$ and size $M$ (which can in general be a function of $n$,² i.e., $M = M_n$) for a discrete source $\{X_n\}_{n=1}^{\infty}$ is a set $\{c_1, c_2, \ldots, c_M\} \subseteq \mathcal X^n$ consisting of $M$ reproduction (or reconstruction) words, where each reproduction word is a sourceword (an $n$-tuple of source symbols).³ The block code's operation can be symbolically represented as⁴
$$(x_1, x_2, \ldots, x_n) \to c_m \in \{c_1, c_2, \ldots, c_M\}.$$
This procedure is repeated for each consecutive block of length $n$, i.e.,
$$\cdots (x_{3n},\ldots,x_{31})(x_{2n},\ldots,x_{21})(x_{1n},\ldots,x_{11}) \to \cdots\,|\,c_{m_3}\,|\,c_{m_2}\,|\,c_{m_1},$$
where $|$ reflects the necessity of a "punctuation" or "synchronization" mechanism between consecutive source blocks.
The next theorem provides a key tool for proving Shannon’s source coding
theorem.
² In the literature, both $(n, M)$ and $(M, n)$ have been used to denote a block code with blocklength $n$ and size $M$. For example, [45, p. 149] adopts the former, while [12, p. 193] uses the latter. We use the $(n, M)$ notation since $M = M_n$ is a function of $n$ in general.

³ One can binary-index the reproduction words in $\{c_1, c_2, \ldots, c_M\}$ using $k \triangleq \lceil\log_2 M\rceil$ bits. As such $k$-bit words in $\{0,1\}^k$ are usually stored for retrieval at a later date, the $(n, M)$ block code can be represented by an encoder-decoder pair of functions $(f, g)$, where the encoding function $f:\mathcal X^n \to \{0,1\}^k$ maps each sourceword $x^n$ to a $k$-bit word $f(x^n)$ which we call a codeword. The decoding function $g:\{0,1\}^k \to \{c_1, c_2, \ldots, c_M\}$ is then a retrieving operation that produces the reproduction words. Since the codewords are binary-valued, such a block code is called a binary code. More generally, a $D$-ary block code (where $D > 1$ is an integer) would use an encoding function $f:\mathcal X^n \to \{0,1,\ldots,D-1\}^k$ where each codeword $f(x^n)$ contains $k$ $D$-ary code symbols.

Furthermore, since the behavior of block codes is investigated for sufficiently large $n$ and $M$ (tending to infinity), it is legitimate to replace $\lceil\log_2 M\rceil$ by $\log_2 M$ for the case of binary codes. With this convention, the data compression rate or code rate is
$$\text{bits required per source symbol} = \frac{k}{n} = \frac{1}{n}\log_2 M.$$
Similarly, for $D$-ary codes, the rate is
$$\text{$D$-ary code symbols required per source symbol} = \frac{k}{n} = \frac{1}{n}\log_D M.$$
For computational convenience, nats (under the natural logarithm) can be used instead of bits or $D$-ary code symbols; in this case, the code rate becomes
$$\text{nats required per source symbol} = \frac{1}{n}\log M.$$

⁴ When one uses an encoder-decoder pair $(f, g)$ to describe the block code, the code's operation can be expressed as $c_m = g(f(x^n))$.
Theorem 3.3 (Shannon-McMillan; asymptotic equipartition property or AEP⁵) If $\{X_n\}_{n=1}^{\infty}$ is a DMS with entropy $H(X)$, then
$$-\frac{1}{n}\log_2 P_{X^n}(X_1,\ldots,X_n) \to H(X) \quad\text{in probability}.$$
In other words, for any $\delta > 0$,
$$\lim_{n\to\infty}\Pr\left\{\left|-\frac{1}{n}\log_2 P_{X^n}(X_1,\ldots,X_n) - H(X)\right| > \delta\right\} = 0.$$

Proof: This theorem follows by first observing that for an i.i.d. sequence $\{X_n\}_{n=1}^{\infty}$,
$$-\frac{1}{n}\log_2 P_{X^n}(X_1,\ldots,X_n) = -\frac{1}{n}\sum_{i=1}^{n}\log_2 P_X(X_i),$$
noting that the sequence $\{-\log_2 P_X(X_i)\}_{i=1}^{\infty}$ is i.i.d., and then applying the weak law of large numbers (WLLN) to the latter sequence. $\Box$

⁵ The AEP is also called the entropy stability property.
The AEP indeed constitutes an "information theoretic" analog of the WLLN, as it states that if $\{-\log_2 P_X(X_i)\}_{i=1}^{\infty}$ is an i.i.d. sequence, then for any $\delta>0$,
$$\Pr\left\{\left|-\frac{1}{n}\sum_{i=1}^{n}\log_2 P_X(X_i) - H(X)\right| \le \delta\right\} \to 1 \quad\text{as } n\to\infty.$$
As a consequence of the AEP, all the probability mass will ultimately be placed on the weakly $\delta$-typical set, which is defined as
$$\mathcal F_n(\delta) \triangleq \left\{x^n\in\mathcal X^n: \left|-\frac{1}{n}\log_2 P_{X^n}(x^n) - H(X)\right| \le \delta\right\} = \left\{x^n\in\mathcal X^n: \left|-\frac{1}{n}\sum_{i=1}^{n}\log_2 P_X(x_i) - H(X)\right| \le \delta\right\}.$$
Note that since the source is memoryless, for any $x^n\in\mathcal F_n(\delta)$, $-(1/n)\log_2 P_{X^n}(x^n)$, the normalized self-information of $x^n$, is equal to $-(1/n)\sum_{i=1}^{n}\log_2 P_X(x_i)$, which is the empirical (arithmetic) average self-information or "apparent" entropy of the source. Thus, a sourceword $x^n$ is $\delta$-typical if it yields an apparent source entropy within $\delta$ of the "true" source entropy $H(X)$. Note also that the sourcewords in $\mathcal F_n(\delta)$ are nearly equiprobable or equally surprising (cf. Property 1 of Theorem 3.4); this justifies referring to Theorem 3.3 as the AEP.
Theorem 3.4 (Consequence of the AEP) Given a DMS $\{X_n\}_{n=1}^{\infty}$ with entropy $H(X)$ and any $\delta$ greater than zero, the weakly $\delta$-typical set $\mathcal F_n(\delta)$ satisfies the following.

1. If $x^n\in\mathcal F_n(\delta)$, then
$$2^{-n(H(X)+\delta)} \le P_{X^n}(x^n) \le 2^{-n(H(X)-\delta)}.$$
2. $P_{X^n}(\mathcal F_n^c(\delta)) < \delta$ for sufficiently large $n$, where the superscript $c$ denotes the complementary set operation.
3. $|\mathcal F_n(\delta)| > (1-\delta)2^{n(H(X)-\delta)}$ for sufficiently large $n$, and $|\mathcal F_n(\delta)| \le 2^{n(H(X)+\delta)}$ for every $n$, where $|\mathcal F_n(\delta)|$ denotes the number of elements in $\mathcal F_n(\delta)$.

Note: The above theorem also holds if we define the typical set using the base-$D$ logarithm $\log_D$ for any $D>1$ instead of the base-2 logarithm; in this case, one just needs to appropriately change the base of the exponential terms in the above theorem (by replacing $2^x$ terms with $D^x$ terms) and also substitute $H(X)$ with $H_D(X)$.
Proof: Property 1 is an immediate consequence of the definition of $\mathcal F_n(\delta)$.

Property 2 is a direct consequence of the AEP, since the AEP states that for a fixed $\delta>0$, $\lim_{n\to\infty} P_{X^n}(\mathcal F_n(\delta)) = 1$; i.e., for every $\varepsilon>0$, there exists $n_0 = n_0(\varepsilon)$ such that for all $n\ge n_0$,
$$P_{X^n}(\mathcal F_n(\delta)) > 1-\varepsilon.$$
In particular, setting $\varepsilon=\delta$ yields the result. We nevertheless provide a direct proof of Property 2, as it gives an explicit expression for $n_0$: observe that by Chebyshev's inequality,
$$P_{X^n}(\mathcal F_n^c(\delta)) = P_{X^n}\left\{x^n\in\mathcal X^n: \left|-\frac{1}{n}\log_2 P_{X^n}(x^n) - H(X)\right| > \delta\right\} \le \frac{\sigma_X^2}{n\delta^2} < \delta,$$
for $n > \sigma_X^2/\delta^3$, where the variance
$$\sigma_X^2 \triangleq \mathrm{Var}[-\log_2 P_X(X)] = \sum_{x\in\mathcal X} P_X(x)\left[\log_2 P_X(x)\right]^2 - (H(X))^2$$
is a constant⁶ independent of $n$.

⁶ In the proof, we assume that the variance $\sigma_X^2 = \mathrm{Var}[-\log_2 P_X(X)] < \infty$. This holds since the source alphabet is finite:
$$\mathrm{Var}[-\log_2 P_X(X)] \le E[(\log_2 P_X(X))^2] = \sum_{x\in\mathcal X} P_X(x)(\log_2 P_X(x))^2 \le \sum_{x\in\mathcal X}\frac{4}{e^2}[\log_2(e)]^2 = \frac{4}{e^2}[\log_2(e)]^2 \times|\mathcal X| < \infty.$$
To prove Property 3, we have from Property 1 that
$$1 \ge \sum_{x^n\in\mathcal F_n(\delta)} P_{X^n}(x^n) \ge \sum_{x^n\in\mathcal F_n(\delta)} 2^{-n(H(X)+\delta)} = |\mathcal F_n(\delta)|\,2^{-n(H(X)+\delta)},$$
and, using Properties 2 and 1, we have that
$$1-\delta < 1-\frac{\sigma_X^2}{n\delta^2} \le \sum_{x^n\in\mathcal F_n(\delta)} P_{X^n}(x^n) \le \sum_{x^n\in\mathcal F_n(\delta)} 2^{-n(H(X)-\delta)} = |\mathcal F_n(\delta)|\,2^{-n(H(X)-\delta)},$$
for $n \ge \sigma_X^2/\delta^3$. $\Box$
Note that for any $n>0$, a block code $\mathcal C_n = (n, M)$ is said to be uniquely decodable or completely lossless if its set of reproduction words is trivially equal to the set of all source $n$-tuples: $\{c_1, c_2, \ldots, c_M\} = \mathcal X^n$. In this case, if we binary-index the reproduction words using an encoder-decoder pair $(f, g)$, every sourceword $x^n$ is assigned a distinct binary codeword $f(x^n)$ of length $k = \log_2 M$, and every binary $k$-tuple is the image under $f$ of some sourceword. In other words, $f$ is a bijective (injective and surjective) map and hence invertible, with decoding map $g = f^{-1}$ and $M = |\mathcal X|^n = 2^k$. Thus the code rate is $(1/n)\log_2 M = \log_2|\mathcal X|$ bits/source symbol.

Now the question becomes: can we achieve a better (i.e., smaller) compression rate? The answer is affirmative: we can achieve a compression rate equal to the source entropy $H(X)$ (in bits), which can be significantly smaller than $\log_2|\mathcal X|$ when the source is strongly non-uniformly distributed, if we give up unique decodability (for every $n$) and allow $n$ to be sufficiently large to asymptotically achieve lossless reconstruction by having an arbitrarily small (but positive) probability of decoding error $P_e(\mathcal C_n) \triangleq P_{X^n}\{x^n\in\mathcal X^n: g(f(x^n)) \ne x^n\}$.

Thus, block codes herein can perform data compression that is asymptotically lossless with respect to the blocklength; this contrasts with variable-length codes, which can be completely lossless (uniquely decodable) for every finite blocklength.
We can now formally state and prove Shannon's asymptotically lossless source coding theorem for block codes. The theorem is stated for general $D$-ary block codes, with the source entropy $H_D(X)$ (in $D$-ary code symbols/source symbol) being the smallest (infimum) possible compression rate for asymptotically lossless $D$-ary block codes. Without loss of generality, the theorem is proved for the case $D=2$. The idea behind the proof of the forward (achievability) part is to binary-index each sourceword in the weakly $\delta$-typical set $\mathcal F_n(\delta)$ with a distinct binary codeword (starting from index one, with corresponding $k$-tuple codeword $0\cdots01$), and to encode all sourcewords outside $\mathcal F_n(\delta)$ to a default all-zero binary codeword, which certainly cannot be reproduced without distortion due to the many-to-one nature of this mapping.
 x²   |-(1/2)Σ_{i=1}^{2} log₂ P_X(x_i) − H(X)|   in F₂(0.4)?   codeword   reconstructed sequence
 AA            0.525 bits                         no            000        ambiguous
 AB            0.317 bits                         yes           001        AB
 AC            0.025 bits                         yes           010        AC
 AD            0.475 bits                         no            000        ambiguous
 BA            0.317 bits                         yes           011        BA
 BB            0.109 bits                         yes           100        BB
 BC            0.183 bits                         yes           101        BC
 BD            0.683 bits                         no            000        ambiguous
 CA            0.025 bits                         yes           110        CA
 CB            0.183 bits                         yes           111        CB
 CC            0.475 bits                         no            000        ambiguous
 CD            0.975 bits                         no            000        ambiguous
 DA            0.475 bits                         no            000        ambiguous
 DB            0.683 bits                         no            000        ambiguous
 DC            0.975 bits                         no            000        ambiguous
 DD            1.475 bits                         no            000        ambiguous

Table 3.1: An example of the δ-typical set with n = 2 and δ = 0.4, where F₂(0.4) = {AB, AC, BA, BB, BC, CA, CB}. The codeword set is {001 (AB), 010 (AC), 011 (BA), 100 (BB), 101 (BC), 110 (CA), 111 (CB), 000 (AA, AD, BD, CC, CD, DA, DB, DC, DD)}, where the parenthesis following each binary codeword indicates those sourcewords that are encoded to this codeword. The source distribution is P_X(A) = 0.4, P_X(B) = 0.3, P_X(C) = 0.2 and P_X(D) = 0.1.
The resultant code rate is $(1/n)\lceil\log_2(|\mathcal F_n(\delta)|+1)\rceil$ bits per source symbol. As revealed in the Shannon-McMillan AEP theorem and its consequence, almost all the probability mass will be on $\mathcal F_n(\delta)$ as $n$ becomes sufficiently large, and hence the probability of non-reconstructable source sequences can be made arbitrarily small. A simple example of the above coding scheme is illustrated in Table 3.1. The converse part of the proof will establish (by expressing the probability of correct decoding in terms of the $\delta$-typical set and also using the Consequence of the AEP) that for any sequence of $D$-ary codes with rate strictly below the source entropy, the probability of error cannot asymptotically vanish (it is bounded away from zero). Actually, a stronger result is proven: it is shown that the probability of error not only does not asymptotically vanish, it actually ultimately grows to 1 (this is why we call this part a "strong" converse).
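The construction in Table 3.1 can be reproduced with a short script. The following Python sketch (illustrative only; the pmf, n = 2 and δ = 0.4 are taken from the table's caption) enumerates the weakly δ-typical set and assigns distinct non-zero binary indices to the typical sourcewords, with the all-zero word reserved for everything else.

```python
import math
from itertools import product

pmf = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
n, delta = 2, 0.4
H = -sum(p * math.log2(p) for p in pmf.values())   # source entropy in bits

typical = []
for xn in product(pmf, repeat=n):
    apparent = -sum(math.log2(pmf[x]) for x in xn) / n   # apparent entropy of x^n
    if abs(apparent - H) <= delta:
        typical.append("".join(xn))

k = math.ceil(math.log2(len(typical) + 1))   # codeword length; all-zero word reserved
codebook = {xn: format(i + 1, "0{}b".format(k)) for i, xn in enumerate(typical)}
print(sorted(typical))   # ['AB', 'AC', 'BA', 'BB', 'BC', 'CA', 'CB']
print(codebook)          # distinct non-zero 3-bit indices, as in Table 3.1
```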
Theorem 3.5 (Shannon's source coding theorem) Given integer $D>1$, consider a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$ with entropy $H_D(X)$. Then the following hold.

Forward part (achievability): For any $0<\varepsilon<1$, there exist $0<\delta<\varepsilon$ and a sequence of $D$-ary block codes $\{\mathcal C_n=(n,M_n)\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log_D M_n \le H_D(X)+\delta \qquad(3.2.1)$$
satisfying
$$P_e(\mathcal C_n) < \varepsilon \qquad(3.2.2)$$
for all sufficiently large $n$, where $P_e(\mathcal C_n)$ denotes the probability of decoding error for block code $\mathcal C_n$.⁷

Strong converse part: For any $0<\varepsilon<1$, any sequence of $D$-ary block codes $\{\mathcal C_n=(n,M_n)\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log_D M_n < H_D(X) \qquad(3.2.3)$$
satisfies
$$P_e(\mathcal C_n) > 1-\varepsilon$$
for all $n$ sufficiently large.
Proof:

Forward part: Without loss of generality, we prove the result for the case of binary codes (i.e., $D=2$). Recall that the subscript $D$ in $H_D(X)$ is dropped (i.e., omitted) when $D=2$.

Given $0<\varepsilon<1$, fix $\delta$ such that $0<\delta<\varepsilon$ and choose $n>2/\delta$. Now construct a binary block code $\mathcal C_n$ by simply mapping the $\delta/2$-typical sourcewords $x^n$ onto distinct non-all-zero binary codewords of length $k\triangleq\lceil\log_2 M_n\rceil$ bits. In other words, binary-index (cf. the footnote in Definition 3.2) the sourcewords in $\mathcal F_n(\delta/2)$ with the following encoding map:
$$x^n \mapsto \begin{cases}\text{binary index of } x^n, & \text{if } x^n\in\mathcal F_n(\delta/2);\\ \text{all-zero codeword}, & \text{if } x^n\notin\mathcal F_n(\delta/2).\end{cases}$$
⁷ (3.2.2) is equivalent to $\limsup_{n\to\infty}P_e(\mathcal C_n)\le\varepsilon$. Since $\varepsilon$ can be made arbitrarily small, the forward part actually indicates the existence of a sequence of $D$-ary block codes $\{\mathcal C_n\}_{n=1}^{\infty}$ satisfying (3.2.1) such that $\limsup_{n\to\infty}P_e(\mathcal C_n)=0$.

Based on this, the converse should be that any sequence of $D$-ary block codes satisfying (3.2.3) satisfies $\limsup_{n\to\infty}P_e(\mathcal C_n)>0$. However, the so-called strong converse actually gives a stronger consequence: $\limsup_{n\to\infty}P_e(\mathcal C_n)=1$ (as $\varepsilon$ can be made arbitrarily small).
Then by the Shannon-McMillan AEP theorem, we obtain that
$$M_n = |\mathcal F_n(\delta/2)| + 1 \le 2^{n(H(X)+\delta/2)} + 1 < 2\cdot 2^{n(H(X)+\delta/2)} < 2^{n(H(X)+\delta)},$$
for $n > 2/\delta$. Hence, a sequence of $\mathcal C_n=(n,M_n)$ block codes satisfying (3.2.1) is established. It remains to show that the error probability for this sequence of $(n,M_n)$ block codes can be made smaller than $\varepsilon$ for all sufficiently large $n$.

By the Shannon-McMillan AEP theorem,
$$P_{X^n}(\mathcal F_n^c(\delta/2)) < \frac{\delta}{2} \quad\text{for all sufficiently large } n.$$
Consequently, for those $n$ satisfying the above inequality and larger than $2/\delta$,
$$P_e(\mathcal C_n) \le P_{X^n}(\mathcal F_n^c(\delta/2)) < \delta < \varepsilon.$$
(For the last step, the reader can refer to Table 3.1 to confirm that only the "ambiguous" sequences outside the typical set contribute to the probability of error.)
Strong converse part: Fix any sequence of block codes $\{\mathcal C_n\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log_2|\mathcal C_n| < H(X).$$
Let $\mathcal S_n$ be the set of sourcewords that can be correctly decoded through the $\mathcal C_n$-coding system. (A quick example is depicted in Figure 3.2.) Then $|\mathcal S_n| = |\mathcal C_n|$. By choosing $\delta$ small enough with $\varepsilon/2 > \delta > 0$, and by definition of the limsup operation, we have
$$(\exists N_0)(\forall n > N_0)\quad \frac{1}{n}\log_2|\mathcal S_n| = \frac{1}{n}\log_2|\mathcal C_n| < H(X) - 2\delta,$$
which implies
$$|\mathcal S_n| < 2^{n(H(X)-2\delta)}.$$
Furthermore, from Property 2 of the Consequence of the AEP, we obtain that
$$(\exists N_1)(\forall n > N_1)\quad P_{X^n}(\mathcal F_n^c(\delta)) < \delta.$$
Figure 3.2: Possible codebook $\mathcal C_n$ and its corresponding $\mathcal S_n$. The solid box indicates the decoding mapping from $\mathcal C_n$ back to $\mathcal S_n$.
Consequently, for $n > N \triangleq \max\{N_0,\, N_1,\, \frac{1}{\delta}\log_2\frac{2}{\varepsilon}\}$, the probability of correct block decoding satisfies
$$1 - P_e(\mathcal C_n) = \sum_{x^n\in\mathcal S_n} P_{X^n}(x^n) = \sum_{x^n\in\mathcal S_n\cap\mathcal F_n^c(\delta)} P_{X^n}(x^n) + \sum_{x^n\in\mathcal S_n\cap\mathcal F_n(\delta)} P_{X^n}(x^n)$$
$$\le P_{X^n}(\mathcal F_n^c(\delta)) + |\mathcal S_n\cap\mathcal F_n(\delta)|\cdot\max_{x^n\in\mathcal F_n(\delta)} P_{X^n}(x^n)$$
$$< \delta + |\mathcal S_n|\cdot\max_{x^n\in\mathcal F_n(\delta)} P_{X^n}(x^n)$$
$$< \frac{\varepsilon}{2} + 2^{n(H(X)-2\delta)}\cdot 2^{-n(H(X)-\delta)} = \frac{\varepsilon}{2} + 2^{-n\delta} < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon,$$
which is equivalent to $P_e(\mathcal C_n) > 1-\varepsilon$ for $n > N$. $\Box$
Observation 3.6 The results of the above theorem are illustrated in Figure 3.3, where $R = \limsup_{n\to\infty}(1/n)\log_D M_n$ is usually called the ultimate (or asymptotic) code rate of block codes for compressing the source. It is clear from the figure that the (ultimate) rate of any block code with arbitrarily small decoding error probability must be greater than the source entropy. Conversely, the probability of decoding error for any block code of rate smaller than the entropy ultimately approaches 1 (and hence is bounded away from zero). Thus, for a DMS, the source entropy $H_D(X)$ is the infimum of all "achievable" source (block) coding rates; i.e., it is the infimum of all rates for which there exists a sequence of $D$-ary block codes with asymptotically vanishing (as the blocklength goes to infinity) probability of decoding error.
Figure 3.3: (Ultimate) compression rate $R$ versus source entropy $H_D(X)$ and behavior of the probability of block decoding error as blocklength $n$ goes to infinity for a discrete memoryless source: $P_e\to 1$ for all block codes with $R < H_D(X)$, while $P_e\to 0$ for the best data compression block codes with $R > H_D(X)$.
For a source with (statistical) memory, the Shannon-McMillan theorem cannot be directly applied in its original form, and thereby Shannon's source coding theorem appears restricted to memoryless sources only. However, by examining the concept behind these theorems, one finds that the key to the validity of Shannon's source coding theorem is actually the existence of a set $\mathcal A_n = \{x^n_1, x^n_2,\ldots,x^n_M\}$ with $M \approx D^{nH_D(X)}$ and $P_{X^n}(\mathcal A_n^c)\to 0$, namely, the existence of a "typical-like" set $\mathcal A_n$ whose size is comparatively small (exponentially smaller than $|\mathcal X^n|$ when $H_D(X) < \log_D|\mathcal X|$) and whose probability mass is asymptotically large. Thus, if we can find such a typical-like set for a source with memory, the source coding theorem for block codes can be extended to this source. Indeed, with appropriate modifications, the Shannon-McMillan theorem can be generalized to the class of stationary ergodic sources, and hence a block source coding theorem for this class can be established; this is considered in the next subsection. The block source coding theorem for general (e.g., non-stationary non-ergodic) sources in terms of a "generalized entropy" measure (see the end of the next subsection for a brief description) will be studied in detail in Part II of the book.
3.2.2 Block codes for stationary ergodic sources

In practice, a stochastic source used to model data often exhibits memory or statistical dependence among its random variables; its joint distribution is hence not a product of its marginal distributions. In this subsection, we consider the asymptotically lossless data compression theorem for the class of stationary ergodic sources.

Before proceeding to generalize the block source coding theorem, we need to first generalize the "entropy" measure for a sequence of dependent random variables $X^n$ (which certainly should be backward compatible with the discrete memoryless case). A straightforward generalization is to examine the limit of the normalized block entropy of a source sequence, resulting in the concept of entropy rate.

Definition 3.7 (Entropy rate) The entropy rate of a source $\{X_n\}_{n=1}^{\infty}$ is denoted by $H(\mathbf X)$ and defined by
$$H(\mathbf X) \triangleq \lim_{n\to\infty}\frac{1}{n}H(X^n),$$
provided the limit exists, where $X^n = (X_1,\ldots,X_n)$.
Next we show that the entropy rate exists for stationary sources (here, we do not need ergodicity for the existence of the entropy rate).

Lemma 3.8 For a stationary source $\{X_n\}_{n=1}^{\infty}$, the conditional entropy $H(X_n|X_{n-1},\ldots,X_1)$ is non-increasing in $n$ and also bounded from below by zero. Hence by Lemma A.20, the limit
$$\lim_{n\to\infty} H(X_n|X_{n-1},\ldots,X_1)$$
exists.

Proof: We have
$$H(X_n|X_{n-1},\ldots,X_1) \le H(X_n|X_{n-1},\ldots,X_2) \qquad(3.2.4)$$
$$= H(X_n,\ldots,X_2) - H(X_{n-1},\ldots,X_2)$$
$$= H(X_{n-1},\ldots,X_1) - H(X_{n-2},\ldots,X_1) \qquad(3.2.5)$$
$$= H(X_{n-1}|X_{n-2},\ldots,X_1),$$
where (3.2.4) follows since conditioning never increases entropy, and (3.2.5) holds because of the stationarity assumption. Finally, recall that each conditional entropy $H(X_n|X_{n-1},\ldots,X_1)$ is non-negative. $\Box$
Lemma 3.9 (Cesàro-mean theorem) If $a_n\to a$ as $n\to\infty$ and $b_n = (1/n)\sum_{i=1}^{n} a_i$, then $b_n\to a$ as $n\to\infty$.

Proof: $a_n\to a$ implies that for any $\varepsilon>0$, there exists $N$ such that for all $n>N$, $|a_n - a|<\varepsilon$. Then
$$|b_n - a| = \left|\frac{1}{n}\sum_{i=1}^{n}(a_i - a)\right| \le \frac{1}{n}\sum_{i=1}^{n}|a_i - a| = \frac{1}{n}\sum_{i=1}^{N}|a_i - a| + \frac{1}{n}\sum_{i=N+1}^{n}|a_i - a| \le \frac{1}{n}\sum_{i=1}^{N}|a_i - a| + \frac{n-N}{n}\varepsilon.$$
Hence, $\limsup_{n\to\infty}|b_n - a| \le \varepsilon$. Since $\varepsilon$ can be made arbitrarily small, the lemma holds. $\Box$
Theorem 3.10 For a stationary source $\{X_n\}_{n=1}^{\infty}$, the entropy rate always exists and is equal to
$$H(\mathbf X) = \lim_{n\to\infty} H(X_n|X_{n-1},\ldots,X_1).$$

Proof: The result follows directly by writing
$$\frac{1}{n}H(X^n) = \frac{1}{n}\sum_{i=1}^{n}H(X_i|X_{i-1},\ldots,X_1) \qquad\text{(chain rule for entropy)}$$
and applying the Cesàro-mean theorem. $\Box$

Observation 3.11 It can also be shown that for a stationary source, $(1/n)H(X^n)$ is non-increasing in $n$ and $(1/n)H(X^n) \ge H(X_n|X_{n-1},\ldots,X_1)$ for all $n\ge 1$. (The proof is left as an exercise.)

It is obvious that when $\{X_n\}_{n=1}^{\infty}$ is a discrete memoryless source, $H(X^n) = n\,H(X)$ for every $n$. Hence,
$$H(\mathbf X) = \lim_{n\to\infty}\frac{1}{n}H(X^n) = H(X).$$
For a first-order stationary Markov source,
$$H(\mathbf X) = \lim_{n\to\infty}\frac{1}{n}H(X^n) = \lim_{n\to\infty}H(X_n|X_{n-1},\ldots,X_1) = H(X_2|X_1),$$
where
$$H(X_2|X_1) \triangleq -\sum_{x_1\in\mathcal X}\sum_{x_2\in\mathcal X}\pi(x_1)P_{X_2|X_1}(x_2|x_1)\,\log P_{X_2|X_1}(x_2|x_1),$$
and $\pi(\cdot)$ is the stationary distribution of the Markov source. Furthermore, if the Markov source is binary with $P_{X_2|X_1}(0|1) = \alpha$ and $P_{X_2|X_1}(1|0) = \beta$, then
$$H(\mathbf X) = \frac{\beta}{\alpha+\beta}h_b(\alpha) + \frac{\alpha}{\alpha+\beta}h_b(\beta),$$
where $h_b(\alpha) \triangleq -\alpha\log\alpha - (1-\alpha)\log(1-\alpha)$ is the binary entropy function.
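As a numerical illustration of the last formula (a minimal sketch, not part of the notes; the transition probabilities α and β are arbitrary assumptions), the entropy rate of the binary Markov source can be computed both from the closed-form expression and directly as H(X₂|X₁) under the stationary distribution:

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

alpha, beta = 0.1, 0.3          # P(X2=0 | X1=1) = alpha, P(X2=1 | X1=0) = beta
pi0, pi1 = alpha / (alpha + beta), beta / (alpha + beta)   # stationary distribution

closed_form = (beta / (alpha + beta)) * hb(alpha) + (alpha / (alpha + beta)) * hb(beta)
direct = pi0 * hb(beta) + pi1 * hb(alpha)    # H(X2 | X1) averaged over pi
print(closed_form, direct)                   # both ≈ 0.572 bits/source symbol
```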
Theorem 3.12 (Generalized AEP or Shannon-McMillan-Breiman Theorem [12]) If $\{X_n\}_{n=1}^{\infty}$ is a stationary ergodic source, then
$$-\frac{1}{n}\log_2 P_{X^n}(X_1,\ldots,X_n) \xrightarrow{\ a.s.\ } H(\mathbf X).$$

Since the AEP theorem (law of large numbers) is valid for stationary ergodic sources, all consequences of the AEP follow, including Shannon's lossless source coding theorem.
Theorem 3.13 (Shannon's source coding theorem for stationary ergodic sources) Given integer $D>1$, let $\{X_n\}_{n=1}^{\infty}$ be a stationary ergodic source with entropy rate (in base $D$)
$$H_D(\mathbf X) \triangleq \lim_{n\to\infty}\frac{1}{n}H_D(X^n).$$
Then the following hold.

Forward part (achievability): For any $0<\varepsilon<1$, there exist $\delta$ with $0<\delta<\varepsilon$ and a sequence of $D$-ary block codes $\{\mathcal C_n=(n,M_n)\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log_D M_n < H_D(\mathbf X) + \delta,$$
and probability of decoding error satisfying
$$P_e(\mathcal C_n) < \varepsilon$$
for all sufficiently large $n$.

Strong converse part: For any $0<\varepsilon<1$, any sequence of $D$-ary block codes $\{\mathcal C_n=(n,M_n)\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log_D M_n < H_D(\mathbf X)$$
satisfies
$$P_e(\mathcal C_n) > 1-\varepsilon$$
for all $n$ sufficiently large.
A discrete memoryless (i.i.d.) source is stationary and ergodic (so Theorem
3.5 is clearly a special case of Theorem 3.13). In general, it is hard to check
whether a stationary process is ergodic or not. It is known though that if a
stationary process is a mixture of two or more stationary ergodic processes,
i.e., its n-fold distribution can be written as the mean (with respect to some
distribution) of the n-fold distributions of stationary ergodic processes, then it
is not ergodic.8
8The converse is also true; i.e., if a stationary process cannot be represented as a mixture
of stationary ergodic processes, then it is ergodic.
For example, let $P$ and $Q$ be two distributions on a finite alphabet $\mathcal X$ such that the process $\{X_n\}_{n=1}^{\infty}$ is i.i.d. with distribution $P$ and the process $\{Y_n\}_{n=1}^{\infty}$ is i.i.d. with distribution $Q$. Flip a biased coin (with Heads probability equal to $\theta$, $0<\theta<1$) once and let
$$Z_i = \begin{cases}X_i & \text{if Heads}\\ Y_i & \text{if Tails}\end{cases}$$
for $i=1,2,\ldots$. Then the resulting process $\{Z_i\}_{i=1}^{\infty}$ has its $n$-fold distribution given by a mixture of the $n$-fold distributions of $\{X_n\}_{n=1}^{\infty}$ and $\{Y_n\}_{n=1}^{\infty}$:
$$P_{Z^n}(a^n) = \theta P_{X^n}(a^n) + (1-\theta)P_{Y^n}(a^n)$$
for all $a^n\in\mathcal X^n$, $n=1,2,\ldots$. Hence the process $\{Z_i\}_{i=1}^{\infty}$ is stationary but not ergodic.
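A short simulation makes the non-ergodicity of {Z_i} tangible: the time average of a single realization settles near a statistic of either P or Q (depending on the single coin flip), not near the ensemble average. The snippet below is an illustrative sketch; the binary alphabet, θ and the particular P, Q are assumptions chosen for the example.

```python
import random

def z_process_sample_mean(theta, p, q, n, rng):
    """Empirical frequency of symbol 1 in one realization of {Z_i}:
    a single biased coin flip selects the i.i.d. law (p or q) for the whole path."""
    chosen = p if rng.random() < theta else q
    return sum(rng.random() < chosen for _ in range(n)) / n

rng = random.Random(0)
theta, p, q, n = 0.5, 0.9, 0.1, 10_000
ensemble_mean = theta * p + (1 - theta) * q       # = 0.5, the mixture average
time_averages = [z_process_sample_mean(theta, p, q, n, rng) for _ in range(5)]
print(ensemble_mean, time_averages)   # time averages cluster near 0.9 or 0.1, never 0.5
```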
A specific case for which ergodicity can be easily verified (other than the
case of i.i.d. sources) is the case of stationary Markov sources. Specifically, if a
(finite-alphabet) stationary Markov source is irreducible, then it is ergodic and
hence the Generalized AEP holds for this source. Note that irreducibility can
be verified in terms of the source’s transition probability matrix.
In more complicated situations, such as when the source is non-stationary (with time-varying statistics) and/or non-ergodic, the source entropy rate $H(\mathbf X)$ (if the limit exists; otherwise one can look at the $\liminf/\limsup$ of $(1/n)H(X^n)$) no longer has an operational meaning as the smallest possible compression rate. This creates the need to establish new entropy measures which appropriately characterize the operational limits of an arbitrary stochastic system with memory. This is achieved in [21], where Han and Verdú introduce the notions of inf/sup-entropy rates and illustrate the key role these entropy measures play in proving a general lossless block source coding theorem. More specifically, they demonstrate that for an arbitrary finite-alphabet source $\mathbf X = \{X^n = (X_1,X_2,\ldots,X_n)\}_{n=1}^{\infty}$ (not necessarily stationary and ergodic), the expression for the minimum achievable (block) source coding rate is given by the sup-entropy rate $\bar H(\mathbf X)$, defined by
$$\bar H(\mathbf X) \triangleq \inf\left\{\beta\in\mathbb R: \limsup_{n\to\infty}\Pr\left[-\frac{1}{n}\log P_{X^n}(X^n) > \beta\right] = 0\right\}.$$
More details will be provided in Part II of the book.
3.2.3 Redundancy for lossless block data compression
Shannon's block source coding theorem establishes that the smallest data compression rate for achieving an arbitrarily small error probability for stationary ergodic sources is given by the entropy rate. Thus one can define the source redundancy as the reduction in coding rate one can achieve via asymptotically lossless block source coding versus just using uniquely decodable (completely lossless for any value of the sourceword blocklength $n$) block source coding. In light of the fact that the former approach yields a source coding rate equal to the entropy rate while the latter approach yields a rate of $\log_2|\mathcal X|$, we therefore define the total block source-coding redundancy $\rho_t$ (in bits/source symbol) for a stationary ergodic source $\{X_n\}_{n=1}^{\infty}$ as
$$\rho_t \triangleq \log_2|\mathcal X| - H(\mathbf X).$$
Hence $\rho_t$ represents the amount of "useless" (or superfluous) statistical source information one can eliminate via binary⁹ block source coding.
If the source is i.i.d. and uniformly distributed, then its entropy rate is equal to $\log_2|\mathcal X|$ and, as a result, its redundancy is $\rho_t = 0$. This means that the source is incompressible, as expected, since in this case every sourceword $x^n$ belongs to the $\delta$-typical set $\mathcal F_n(\delta)$ for every $n>0$ and $\delta>0$ (i.e., $\mathcal F_n(\delta) = \mathcal X^n$), and hence there are no superfluous sourcewords that can be dispensed with via source coding. If the source has memory or has a non-uniform marginal distribution, then its redundancy is strictly positive and can be classified into two parts:

Source redundancy due to the non-uniformity of the source marginal distribution:
$$\rho_d \triangleq \log_2|\mathcal X| - H(X_1).$$
Source redundancy due to the source memory:
$$\rho_m \triangleq H(X_1) - H(\mathbf X).$$

As a result, the total source redundancy $\rho_t$ can be decomposed into two parts: $\rho_t = \rho_d + \rho_m$. We can summarize the redundancy of some typical stationary ergodic sources in the following table.
⁹ Since we are measuring $\rho_t$ in code bits/source symbol, all logarithms in its expression are in base 2, and hence this redundancy can be eliminated via asymptotically lossless binary block codes (one can also change the units to $D$-ary code symbols/source symbol by using base-$D$ logarithms for the case of $D$-ary block codes).
Source                            ρ_d                   ρ_m                   ρ_t
i.i.d. uniform                    0                     0                     0
i.i.d. non-uniform                log₂|X| − H(X₁)       0                     ρ_d
1st-order symmetric Markov¹⁰      0                     H(X₁) − H(X₂|X₁)      ρ_m
1st-order non-symmetric Markov    log₂|X| − H(X₁)       H(X₁) − H(X₂|X₁)      ρ_d + ρ_m

¹⁰ A first-order Markov process is symmetric if for any $x_1$ and $\hat x_1$,
$$\{a: a = P_{X_2|X_1}(y|x_1)\ \text{for some } y\} = \{a: a = P_{X_2|X_1}(y|\hat x_1)\ \text{for some } y\}.$$
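As a worked instance of the table (an illustrative sketch; the binary non-symmetric Markov source and its transition probabilities are assumptions chosen for the example), the following computes ρ_d, ρ_m and ρ_t from the formulas above, with the entropy rate taken as H(X₂|X₁) for a first-order stationary Markov source.

```python
import math

def hb(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Binary first-order Markov source with P(1|0) = beta, P(0|1) = alpha.
alpha, beta = 0.1, 0.3
pi0, pi1 = alpha / (alpha + beta), beta / (alpha + beta)   # stationary marginal

H1 = hb(pi1)                               # H(X_1), marginal entropy
Hrate = pi0 * hb(beta) + pi1 * hb(alpha)   # entropy rate = H(X_2 | X_1)

rho_d = 1.0 - H1        # log2|X| - H(X_1), with |X| = 2
rho_m = H1 - Hrate      # redundancy due to memory
rho_t = rho_d + rho_m   # total redundancy = log2|X| - entropy rate
print(rho_d, rho_m, rho_t)
```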
3.3 Variable-length codes for lossless data compression
3.3.1 Non-singular codes and uniquely decodable codes
We next study variable-length (completely) lossless data compression codes.
Definition 3.14 Consider a discrete source $\{X_n\}_{n=1}^{\infty}$ with finite alphabet $\mathcal X$ along with a $D$-ary code alphabet $\mathcal B = \{0,1,\ldots,D-1\}$, where $D>1$ is an integer. Fix an integer $n\ge 1$; then a $D$-ary $n$-th order variable-length code (VLC) is a map
$$f:\mathcal X^n\to\mathcal B^*$$
mapping (fixed-length) sourcewords of length $n$ to $D$-ary codewords in $\mathcal B^*$ of variable lengths, where $\mathcal B^*$ denotes the set of all finite-length strings from $\mathcal B$ (i.e., $c\in\mathcal B^*$ iff there exists an integer $l\ge 1$ such that $c\in\mathcal B^l$).

The codebook $\mathcal C$ of a VLC is the set of all codewords:
$$\mathcal C = f(\mathcal X^n) = \{f(x^n)\in\mathcal B^*: x^n\in\mathcal X^n\}.$$
A variable-length lossless data compression code is a code in which the
source symbols can be completely reconstructed without distortion. In order
to achieve this goal, the source symbols have to be encoded unambiguously in
the sense that any two different source symbols (with positive probabilities) are
represented by different codewords. Codes satisfying this property are called
non-singular codes. In practice however, the encoder often needs to encode a
sequence of source symbols, which results in a concatenated sequence of code-
words. If any concatenation of codewords can also be unambiguously recon-
structed without punctuation, then the code is said to be uniquely decodable. In
other words, a VLC is uniquely decodable if all finite sequences of sourcewords (each in $\mathcal X^n$) are mapped onto distinct strings of codewords; i.e., for any $m$ and $m'$, $(x^n_1, x^n_2,\ldots,x^n_m) \ne (y^n_1, y^n_2,\ldots,y^n_{m'})$ implies that
$$(f(x^n_1), f(x^n_2),\ldots,f(x^n_m)) \ne (f(y^n_1), f(y^n_2),\ldots,f(y^n_{m'})).$$
Note that a non-singular VLC is not necessarily uniquely decodable. For example, consider a binary (first-order) code for the source with alphabet $\mathcal X = \{A,B,C,D,E,F\}$ given by
$$f(A)=0,\quad f(B)=1,\quad f(C)=00,\quad f(D)=01,\quad f(E)=10,\quad f(F)=11.$$
The above code is clearly non-singular; it is however not uniquely decodable because the codeword sequence 010 can be reconstructed as $ABA$, $DA$ or $AE$ (i.e., $(f(A),f(B),f(A)) = (f(D),f(A)) = (f(A),f(E))$ even though $(A,B,A)$, $(D,A)$ and $(A,E)$ are all distinct).
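The ambiguity of the code above can be checked mechanically. The following Python sketch (illustrative only) enumerates all ways a given code-symbol string can be parsed into codewords of the non-singular code f.

```python
def parses(bits, codebook, prefix=()):
    """Return all sequences of source symbols whose concatenated codewords equal `bits`."""
    if not bits:
        return [prefix]
    out = []
    for symbol, word in codebook.items():
        if bits.startswith(word):
            out.extend(parses(bits[len(word):], codebook, prefix + (symbol,)))
    return out

f = {"A": "0", "B": "1", "C": "00", "D": "01", "E": "10", "F": "11"}
print(parses("010", f))
# [('A', 'B', 'A'), ('A', 'E'), ('D', 'A')] -- three distinct parses, so f is not uniquely decodable
```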
One important objective is to find out how "efficiently" we can represent a given discrete source via a uniquely decodable $n$-th order VLC, and to provide a construction technique that (at least asymptotically, as $n\to\infty$) attains the optimal "efficiency." In other words, we want to determine the smallest possible average code rate (or equivalently, average codeword length) that an $n$-th order uniquely decodable VLC can have when (losslessly) representing a given source, and we want to give an explicit code construction that can attain this smallest possible rate (at least asymptotically in the sourceword length $n$).
Definition 3.15 Let $\mathcal C$ be a $D$-ary $n$-th order VLC
$$f:\mathcal X^n\to\{0,1,\ldots,D-1\}^*$$
for a discrete source $\{X_n\}_{n=1}^{\infty}$ with alphabet $\mathcal X$ and distribution $P_{X^n}(x^n)$, $x^n\in\mathcal X^n$. Letting $\ell(c_{x^n})$ denote the length of the codeword $c_{x^n} = f(x^n)$ associated with sourceword $x^n$, the average codeword length for $\mathcal C$ is given by
$$\bar\ell \triangleq \sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\,\ell(c_{x^n}),$$
and its average code rate (in $D$-ary code symbols/source symbol) is given by
$$R_n \triangleq \frac{\bar\ell}{n} = \frac{1}{n}\sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\,\ell(c_{x^n}).$$
The following theorem provides a strong condition which a uniquely decodable code must satisfy.

Theorem 3.16 (Kraft inequality for uniquely decodable codes) Let $\mathcal C$ be a uniquely decodable $D$-ary $n$-th order VLC for a discrete source $\{X_n\}_{n=1}^{\infty}$ with alphabet $\mathcal X$. Let the $M = |\mathcal X|^n$ codewords of $\mathcal C$ have lengths $\ell_1,\ell_2,\ldots,\ell_M$, respectively. Then the following inequality must hold:
$$\sum_{m=1}^{M} D^{-\ell_m} \le 1.$$
Proof: Suppose that we use the codebook $\mathcal C$ to encode $N$ sourcewords ($x^n_i\in\mathcal X^n$, $i=1,\ldots,N$) arriving in a sequence; this yields a concatenated codeword sequence
$$c_1 c_2 c_3\cdots c_N.$$
Let the lengths of the codewords be respectively denoted by $\ell(c_1),\ell(c_2),\ldots,\ell(c_N)$. Consider the quantity
$$\sum_{c_1\in\mathcal C}\sum_{c_2\in\mathcal C}\cdots\sum_{c_N\in\mathcal C} D^{-[\ell(c_1)+\ell(c_2)+\cdots+\ell(c_N)]}.$$
It is obvious that the above expression is equal to
$$\left(\sum_{c\in\mathcal C} D^{-\ell(c)}\right)^{N} = \left(\sum_{m=1}^{M} D^{-\ell_m}\right)^{N}.$$
(Note that $|\mathcal C| = M$.) On the other hand, all the codeword sequences with total length
$$i = \ell(c_1)+\ell(c_2)+\cdots+\ell(c_N)$$
contribute equally to the sum, namely $D^{-i}$ each. Letting $A_i$ denote the number of $N$-codeword sequences that have total length $i$, the above identity can be re-written as
$$\left(\sum_{m=1}^{M} D^{-\ell_m}\right)^{N} = \sum_{i=1}^{L N} A_i D^{-i},$$
where
$$L \triangleq \max_{c\in\mathcal C}\ell(c).$$
Since $\mathcal C$ is by assumption uniquely decodable, distinct $N$-codeword sequences must correspond to distinct code-symbol strings; as there are at most $D^i$ distinct strings of length $i$, we have $A_i\le D^i$, and
$$\left(\sum_{m=1}^{M} D^{-\ell_m}\right)^{N} = \sum_{i=1}^{LN} A_i D^{-i} \le \sum_{i=1}^{LN} D^{i} D^{-i} = LN,$$
which implies that
$$\sum_{m=1}^{M} D^{-\ell_m} \le (LN)^{1/N}.$$
The proof is completed by noting that the above inequality holds for every $N$, and the upper bound $(LN)^{1/N}$ goes to 1 as $N$ goes to infinity. $\Box$
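In practice the Kraft inequality gives a quick necessary test for unique decodability from the codeword lengths alone. Below is a small illustrative sketch; the two example codes are the ones discussed in this chapter (the prefix code of Figure 3.5 and the non-singular code of the previous subsection).

```python
def kraft_sum(lengths, D=2):
    """Compute sum over m of D^{-l_m}; unique decodability requires this to be <= 1."""
    return sum(D ** (-l) for l in lengths)

prefix_code = ["00", "01", "10", "110", "1110", "1111"]   # from Figure 3.5
nonsingular = ["0", "1", "00", "01", "10", "11"]          # not uniquely decodable

print(kraft_sum(map(len, prefix_code)))   # 1.0 -> consistent with unique decodability
print(kraft_sum(map(len, nonsingular)))   # 2.0 -> violates the Kraft inequality
```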
The Kraft inequality is a very useful tool, especially for showing that the
fundamental lower bound of the average rate of uniquely decodable VLCs for
discrete memoryless sources is given by the source entropy.
Theorem 3.17 The average code rate of every uniquely decodable $D$-ary $n$-th order VLC for a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$ is lower-bounded by the source entropy $H_D(X)$ (measured in $D$-ary code symbols/source symbol).

Proof: Consider a uniquely decodable $D$-ary $n$-th order VLC
$$f:\mathcal X^n\to\{0,1,\ldots,D-1\}^*$$
for the source $\{X_n\}_{n=1}^{\infty}$ and let $\ell(c_{x^n})$ denote the length of the codeword $c_{x^n} = f(x^n)$ for sourceword $x^n$. Then, using $H_D(X^n) = nH_D(X)$ for a memoryless source,
$$R_n - H_D(X) = \frac{1}{n}\sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\ell(c_{x^n}) - \frac{1}{n}H_D(X^n)$$
$$= \frac{1}{n}\left[\sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\ell(c_{x^n}) - \sum_{x^n\in\mathcal X^n}\left(-P_{X^n}(x^n)\log_D P_{X^n}(x^n)\right)\right]$$
$$= \frac{1}{n}\sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\log_D\frac{P_{X^n}(x^n)}{D^{-\ell(c_{x^n})}}$$
$$\ge \frac{1}{n}\left[\sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\right]\log_D\frac{\sum_{x^n\in\mathcal X^n}P_{X^n}(x^n)}{\sum_{x^n\in\mathcal X^n}D^{-\ell(c_{x^n})}} \quad\text{(log-sum inequality)}$$
$$= -\frac{1}{n}\log_D\left[\sum_{x^n\in\mathcal X^n}D^{-\ell(c_{x^n})}\right] \ge 0,$$
where the last inequality follows from the Kraft inequality for uniquely decodable codes and the fact that the logarithm is a strictly increasing function. $\Box$
From the above theorem, we know that the average code rate is no smaller than the source entropy. Indeed, a lossless data compression code whose average code rate achieves the entropy is optimal (since if its average code rate were below the entropy, the Kraft inequality would be violated and the code would no longer be uniquely decodable). We summarize:

1. Unique decodability ⟹ the Kraft inequality holds.
2. Unique decodability ⟹ the average code rate of VLCs for memoryless sources is lower bounded by the source entropy.
Exercise 3.18
1. Find a non-singular and also non-uniquely decodable code that violates the
Kraft inequality. (Hint: The answer is already provided in this subsection.)
2. Find a non-singular and also non-uniquely decodable code that beats the
entropy lower bound.
Figure 3.4: Classification of variable-length codes: prefix codes ⊂ uniquely decodable codes ⊂ non-singular codes.
3.3.2 Prefix or instantaneous codes
A prefix code is a VLC which is self-punctuating in the sense that there is no need to append extra symbols for differentiating adjacent codewords. A more precise definition follows:
Definition 3.19 (Prefix code) A VLC is called a prefix code or an instanta-
neous code if no codeword is a prefix of any other codeword.
A prefix code is also named an instantaneous code because the codeword se-
quence can be decoded instantaneously (it is immediately recognizable) without
the reference to future codewords in the same sequence. Note that a uniquely
decodable code is not necessarily prefix-free and may not be decoded instanta-
neously. The relationship between different codes encountered thus far is de-
picted in Figure 3.4.
A $D$-ary prefix code can be represented graphically as an initial segment of a $D$-ary tree. An example of a tree representation of a binary ($D=2$) prefix code is shown in Figure 3.5.
Theorem 3.20 (Kraft inequality for prefix codes) There exists a $D$-ary $n$-th order prefix code for a discrete source $\{X_n\}_{n=1}^{\infty}$ with alphabet $\mathcal X$ iff the codeword lengths $\ell_m$, $m=1,\ldots,M$, satisfy the Kraft inequality, where $M=|\mathcal X|^n$.

Proof: Without loss of generality, we provide the proof for the case of $D=2$ (binary codes).

1. [The forward part] Prefix codes satisfy the Kraft inequality.
Figure 3.5: Tree structure of a binary prefix code. The codewords are those residing on the leaves, which in this case are 00, 01, 10, 110, 1110 and 1111.
The codewords of a prefix code can always be placed on a tree. Pick the length
$$\ell_{\max} \triangleq \max_{1\le m\le M}\ell_m.$$
A binary tree has $2^{\ell_{\max}}$ nodes on level $\ell_{\max}$. Each codeword of length $\ell_m$ obstructs $2^{\ell_{\max}-\ell_m}$ nodes on level $\ell_{\max}$. In other words, when any node is chosen as a codeword, all its descendants are excluded from being codewords (since for a prefix code, no codeword can be a prefix of any other codeword); there are exactly $2^{\ell_{\max}-\ell_m}$ such excluded nodes on level $\ell_{\max}$ of the tree. We therefore say that each codeword of length $\ell_m$ obstructs $2^{\ell_{\max}-\ell_m}$ nodes on level $\ell_{\max}$. Note that no two codewords obstruct the same nodes on level $\ell_{\max}$. Hence the total number of obstructed nodes on level $\ell_{\max}$ cannot exceed $2^{\ell_{\max}}$, i.e.,
$$\sum_{m=1}^{M} 2^{\ell_{\max}-\ell_m} \le 2^{\ell_{\max}},$$
which immediately implies the Kraft inequality:
$$\sum_{m=1}^{M} 2^{-\ell_m} \le 1.$$
(This part can also be proven by noting that a prefix code is a uniquely decodable code; the objective of the above argument is to illustrate the tree structure of a prefix code.)
2. [The converse part] The Kraft inequality implies the existence of a prefix code.

Suppose that $\ell_1,\ell_2,\ldots,\ell_M$ satisfy the Kraft inequality. We will show that there exists a binary tree with $M$ selected nodes, where the $i$-th node resides on level $\ell_i$.

Let $n_i$ be the number of nodes (among the $M$ nodes) residing on level $i$ (namely, $n_i$ is the number of codewords with length $i$, or $n_i = |\{m: \ell_m = i\}|$), and let
$$\ell_{\max} \triangleq \max_{1\le m\le M}\ell_m.$$
Then from the Kraft inequality, we have
$$n_1 2^{-1} + n_2 2^{-2} + \cdots + n_{\ell_{\max}} 2^{-\ell_{\max}} \le 1.$$
The above inequality can be re-written in a form that is more suitable for this proof:
$$n_1 2^{-1} \le 1,\qquad n_1 2^{-1} + n_2 2^{-2} \le 1,\qquad \ldots,\qquad n_1 2^{-1} + n_2 2^{-2} + \cdots + n_{\ell_{\max}} 2^{-\ell_{\max}} \le 1.$$
Hence,
$$n_1 \le 2,\qquad n_2 \le 2^2 - n_1 2,\qquad \ldots,\qquad n_{\ell_{\max}} \le 2^{\ell_{\max}} - n_1 2^{\ell_{\max}-1} - \cdots - n_{\ell_{\max}-1}2^{1},$$
which can be interpreted in terms of a tree model as follows: the first inequality says that the number of codewords of length 1 is no greater than the number of available nodes on the first level, which is 2. The second inequality says that the number of codewords of length 2 is no greater than the total number of nodes on the second level, which is $2^2$, minus the number of nodes obstructed by the first-level nodes already occupied by codewords. The succeeding inequalities demonstrate the availability of a sufficient number of nodes at each level after the nodes blocked by shorter-length codewords have been removed. Because this is true at every codeword length up to the maximum codeword length, the assertion of the theorem is proved. $\Box$
Theorems 3.16 and 3.20 unveil the following relation between a variable-
length uniquely decodable code and a prefix code.
Corollary 3.21 A uniquely decodable D-ary n-th order code can always be
replaced by a D-ary n-th order prefix code with the same average codeword
length (and hence the same average code rate).
The following theorem interprets the relationship between the average code rate of a prefix code and the source entropy.

Theorem 3.22 Consider a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$.

1. For any $D$-ary $n$-th order prefix code for the source, the average code rate is no less than the source entropy $H_D(X)$.
2. There must exist a $D$-ary $n$-th order prefix code for the source whose average code rate is no greater than $H_D(X) + \frac{1}{n}$, namely,
$$R_n \triangleq \frac{1}{n}\sum_{x^n\in\mathcal X^n}P_{X^n}(x^n)\,\ell(c_{x^n}) \le H_D(X) + \frac{1}{n},\qquad(3.3.1)$$
where $c_{x^n}$ is the codeword for sourceword $x^n$, and $\ell(c_{x^n})$ is the length of codeword $c_{x^n}$.
Proof: A prefix code is uniquely decodable, and hence it follows directly from Theorem 3.17 that its average code rate is no less than the source entropy.

To prove the second part, we design a prefix code satisfying both (3.3.1) and the Kraft inequality, which immediately implies the existence of the desired code by Theorem 3.20. Choose the codeword length for sourceword $x^n$ as
$$\ell(c_{x^n}) = \lfloor -\log_D P_{X^n}(x^n)\rfloor + 1. \qquad(3.3.2)$$
Then
$$D^{-\ell(c_{x^n})} \le P_{X^n}(x^n).$$
Summing both sides over all sourcewords, we obtain
$$\sum_{x^n\in\mathcal X^n} D^{-\ell(c_{x^n})} \le 1,$$
which is exactly the Kraft inequality. On the other hand, (3.3.2) implies
$$\ell(c_{x^n}) \le -\log_D P_{X^n}(x^n) + 1,$$
which in turn implies
$$\sum_{x^n\in\mathcal X^n} P_{X^n}(x^n)\ell(c_{x^n}) \le \sum_{x^n\in\mathcal X^n}\left(-P_{X^n}(x^n)\log_D P_{X^n}(x^n)\right) + \sum_{x^n\in\mathcal X^n}P_{X^n}(x^n) = H_D(X^n) + 1 = nH_D(X) + 1,$$
where the last equality holds since the source is memoryless. $\Box$
We note that $n$-th order prefix codes (which encode sourcewords of length $n$) for memoryless sources can yield an average code rate arbitrarily close to the source entropy when $n$ is allowed to grow without bound. For example, a memoryless source with alphabet $\{A, B, C\}$ and probability distribution
$$P_X(A) = 0.8,\qquad P_X(B) = P_X(C) = 0.1$$
has entropy equal to
$$-0.8\log_2 0.8 - 0.1\log_2 0.1 - 0.1\log_2 0.1 = 0.92 \text{ bits}.$$
One of the best binary first-order or single-letter (with $n=1$) prefix codes for this source is given by $c(A)=0$, $c(B)=10$ and $c(C)=11$, where $c(\cdot)$ is the encoding function. The resultant average code rate for this code is
$$0.8\times 1 + 0.2\times 2 = 1.2 \text{ bits} > 0.92 \text{ bits}.$$
Now if we consider a second-order ($n=2$) prefix code by encoding two consecutive source symbols at a time, the new source alphabet becomes
$$\{AA, AB, AC, BA, BB, BC, CA, CB, CC\},$$
and the resultant probability distribution is calculated, as the source is memoryless, by
$$P_{X^2}(x_1,x_2) = P_X(x_1)P_X(x_2) \qquad \forall\, x_1,x_2\in\{A,B,C\}.$$
Then one of the best binary prefix codes for this source is given by
source is given by
c(AA) = 0
c(AB) = 100
c(AC) = 101
c(BA) = 110
c(BB) = 111100
c(BC) = 111101
c(CA) = 1110
c(CB) = 111110
c(CC) = 111111.
The average code rate of this code now becomes
$$\frac{0.64(1\times 1) + 0.08(3\times 3 + 4\times 1) + 0.01(6\times 4)}{2} = 0.96 \text{ bits},$$
which is closer to the source entropy of 0.92 bits. As $n$ increases, the average code rate is brought closer to the source entropy.
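The 0.96-bit figure can be checked numerically. Below is a small sketch (illustrative only) that recomputes the average code rate of the second-order prefix code from the product distribution P_{X²}(x₁,x₂) = P_X(x₁)P_X(x₂).

```python
from itertools import product

pX = {"A": 0.8, "B": 0.1, "C": 0.1}
code = {"AA": "0", "AB": "100", "AC": "101", "BA": "110", "BB": "111100",
        "BC": "111101", "CA": "1110", "CB": "111110", "CC": "111111"}

# Average codeword length over sourceword pairs, divided by n = 2 source symbols.
avg_len = sum(pX[a] * pX[b] * len(code[a + b]) for a, b in product(pX, repeat=2))
print(avg_len / 2)   # average code rate ≈ 0.96 bits/source symbol
```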
From Theorems 3.17 and 3.22, we obtain the lossless variable-length source coding theorem for discrete memoryless sources.

Theorem 3.23 (Lossless variable-length source coding theorem) Fix integer $D>1$ and consider a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$ with distribution $P_X$ and entropy $H_D(X)$ (measured in $D$-ary units). Then the following hold.

Forward part (achievability): For any $\varepsilon>0$, there exists a $D$-ary $n$-th order prefix (hence uniquely decodable) code
$$f:\mathcal X^n\to\{0,1,\ldots,D-1\}^*$$
for the source with an average code rate $R_n$ satisfying
$$R_n \le H_D(X) + \varepsilon$$
for $n$ sufficiently large.

Converse part: Every uniquely decodable code
$$f:\mathcal X^n\to\{0,1,\ldots,D-1\}^*$$
for the source has an average code rate $R_n \ge H_D(X)$.

Thus, for a discrete memoryless source, its entropy $H_D(X)$ (measured in $D$-ary units) represents the smallest variable-length lossless compression rate for $n$ sufficiently large.

Proof: The forward part follows directly from Theorem 3.22 by choosing $n$ large enough such that $1/n < \varepsilon$, and the converse part is already given by Theorem 3.17. $\Box$
Observation 3.24 Theorem 3.23 actually also holds for the class of stationary sources upon replacing the source entropy $H_D(X)$ with the source entropy rate
$$H_D(\mathbf X) \triangleq \lim_{n\to\infty}\frac{1}{n}H_D(X^n),$$
measured in $D$-ary units. The proof is very similar to the proofs of Theorems 3.17 and 3.22 with slight modifications (such as using the fact that $\frac{1}{n}H_D(X^n)$ is non-increasing in $n$ for stationary sources).
3.3.3 Examples of binary prefix codes
A) Huffman codes: optimal variable-length codes
Given a discrete source with alphabet $\mathcal X$, we next construct an optimal binary first-order (single-letter) uniquely decodable variable-length code
$$f:\mathcal X\to\{0,1\}^*,$$
where optimality is in the sense that the code's average codeword length (or equivalently, its average code rate) is minimized over the class of all binary uniquely decodable codes for the source. Note that finding optimal $n$-th order codes with $n>1$ follows directly by considering $\mathcal X^n$ as a new source with expanded alphabet (i.e., by mapping $n$ source symbols at a time).

By Corollary 3.21, we remark that in our search for optimal uniquely decodable codes, we can restrict our attention to the (smaller) class of optimal prefix codes. We thus proceed by observing the following necessary conditions of optimality for binary prefix codes.
Lemma 3.25 Let $\mathcal C$ be an optimal binary prefix code with codeword lengths $\ell_i$, $i=1,\ldots,M$, for a source with alphabet $\mathcal X = \{a_1,\ldots,a_M\}$ and symbol probabilities $p_1,\ldots,p_M$. We assume, without loss of generality, that
$$p_1\ge p_2\ge p_3\ge\cdots\ge p_M,$$
and that any group of source symbols with identical probability is listed in order of increasing codeword length (i.e., if $p_i = p_{i+1} = \cdots = p_{i+s}$, then $\ell_i\le\ell_{i+1}\le\cdots\le\ell_{i+s}$). Then the following properties hold.

1. Higher-probability source symbols have shorter codewords: $p_i > p_j$ implies $\ell_i\le\ell_j$, for $i,j=1,\ldots,M$.
2. The two least probable source symbols have codewords of equal length: $\ell_{M-1} = \ell_M$.
3. Among the codewords of length $\ell_M$, two of the codewords are identical except in the last digit.

Proof:

1) If $p_i > p_j$ and $\ell_i > \ell_j$, then it is possible to construct a better code $\mathcal C'$ by interchanging ("swapping") codewords $i$ and $j$ of $\mathcal C$, since
$$\bar\ell(\mathcal C') - \bar\ell(\mathcal C) = p_i\ell_j + p_j\ell_i - (p_i\ell_i + p_j\ell_j) = (p_i - p_j)(\ell_j - \ell_i) < 0.$$
Hence code $\mathcal C'$ is better than code $\mathcal C$, contradicting the fact that $\mathcal C$ is optimal.
2) We first establish that $\ell_{M-1}\le\ell_M$, since: if $p_{M-1} > p_M$, then $\ell_{M-1}\le\ell_M$ by result 1) above; and if $p_{M-1} = p_M$, then $\ell_{M-1}\le\ell_M$ by our assumption about the ordering of codewords for source symbols with identical probability. Now, if $\ell_{M-1} < \ell_M$, we may delete the last digit of codeword $M$, and the deletion cannot result in another codeword since $\mathcal C$ is a prefix code. Thus the deletion forms a new prefix code with a smaller average codeword length than $\mathcal C$, contradicting the fact that $\mathcal C$ is optimal. Hence, we must have $\ell_{M-1} = \ell_M$.

3) Among the codewords of length $\ell_M$, if no two codewords agree in all digits except the last, then we may delete the last digit in all such codewords to obtain a better (still prefix) code, again contradicting optimality. $\Box$
The above observations suggest that if we can construct an optimal code for the reduced source obtained by combining the two least likely symbols, then we can construct an optimal overall code. Indeed, the following lemma, due to Huffman, follows from Lemma 3.25.

Lemma 3.26 (Huffman) Consider a source with alphabet $\mathcal X=\{a_1,\ldots,a_M\}$ and symbol probabilities $p_1,\ldots,p_M$ such that $p_1\ge p_2\ge\cdots\ge p_M$. Consider the reduced source alphabet $\mathcal Y$ obtained from $\mathcal X$ by combining the two least likely source symbols $a_{M-1}$ and $a_M$ into an equivalent symbol $a_{M-1,M}$ with probability $p_{M-1}+p_M$. Suppose that $\mathcal C'$, given by $f':\mathcal Y\to\{0,1\}^*$, is an optimal code for the reduced source $\mathcal Y$. We now construct a code $\mathcal C$, $f:\mathcal X\to\{0,1\}^*$, for the original source $\mathcal X$ as follows:

The codewords for symbols $a_1,a_2,\ldots,a_{M-2}$ are exactly the same as the corresponding codewords in $\mathcal C'$:
$$f(a_1)=f'(a_1),\quad f(a_2)=f'(a_2),\quad\ldots,\quad f(a_{M-2})=f'(a_{M-2}).$$
The codewords associated with symbols $a_{M-1}$ and $a_M$ are formed by appending a "0" and a "1", respectively, to the codeword $f'(a_{M-1,M})$ associated with the letter $a_{M-1,M}$ in $\mathcal C'$:
$$f(a_{M-1}) = [f'(a_{M-1,M})\,0] \quad\text{and}\quad f(a_M) = [f'(a_{M-1,M})\,1].$$

Then code $\mathcal C$ is optimal for the original source $\mathcal X$.
Figure 3.6: Example of the Huffman encoding. (Successive merging of the two least likely symbols: {0.25, 0.25, 0.25, 0.1, 0.1, 0.05} → {0.25, 0.25, 0.25, 0.15, 0.1} → {0.25, 0.25, 0.25, 0.25} → {0.5, 0.25, 0.25} → {0.5, 0.5} → {1.0}, yielding the codewords 00, 01, 10, 110, 1110 and 1111.)
Hence the problem of finding the optimal code for a source of alphabet size $M$ is reduced to the problem of finding an optimal code for the reduced source of alphabet size $M-1$. In turn, we can reduce the problem to one of size $M-2$, and so on. Indeed, the above lemma yields a recursive algorithm for constructing optimal binary prefix codes.
Huffman encoding algorithm: Repeatedly apply the above lemma until one is left
with a reduced source with two symbols. An optimal binary prefix code for this
source consists of the codewords 0 and 1. Then proceed backwards, constructing
(as outlined in the above lemma) optimal codes for each reduced source until
one arrives at the original source.
Example 3.27 Consider a source with alphabet {1,2,3,4,5,6}with symbol
probabilities 0.25,0.25,0.25,0.1,0.1 and 0.05, respectively. By following the
Huffman encoding procedure as shown in Figure 3.6, we obtain the Huffman
code as
00,01,10,110,1110,1111.
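To make the recursive construction concrete, here is a minimal Python sketch (an illustration, not part of the original notes) that builds a binary Huffman code by repeatedly merging the two least likely symbols, as in Lemma 3.26 and Figure 3.6. The specific bit assignments depend on how ties are resolved, so the output may differ from the codewords above while being equally optimal.

```python
import heapq

def huffman_code(probs):
    """Binary Huffman code via repeated merging of the two least likely
    symbols (Lemma 3.26).  probs: dict symbol -> probability.
    Returns dict symbol -> codeword string."""
    # Each heap entry: (probability, tie-breaker, list of symbols in the subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    code = {s: "" for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, grp1 = heapq.heappop(heap)   # least likely group
        p2, _, grp2 = heapq.heappop(heap)   # second least likely group
        # Prepend a bit: symbols merged earlier end up with longer codewords.
        for s in grp1:
            code[s] = "1" + code[s]
        for s in grp2:
            code[s] = "0" + code[s]
        heapq.heappush(heap, (p1 + p2, counter, grp1 + grp2))
        counter += 1
    return code

# Source of Example 3.27
probs = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.1, 5: 0.1, 6: 0.05}
print(huffman_code(probs))
```

Running this on the source of Example 3.27 yields an optimal code with the same multiset of codeword lengths, $\{2,2,2,3,4,4\}$, as the code above (the two probability-0.1 symbols may swap lengths depending on tie-breaking).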
Observation 3.28

• Huffman codes are not unique for a given source distribution; e.g., by inverting all the code bits of a Huffman code, one gets another Huffman code, or by resolving ties in different ways in the Huffman algorithm, one also obtains different Huffman codes (but all of these codes have the same minimal $\bar{R}_n$).

• One can obtain optimal codes that are not Huffman codes; e.g., by interchanging two codewords of the same length of a Huffman code, one can get another non-Huffman (but optimal) code. Furthermore, one can construct an optimal suffix code (i.e., a code in which no codeword can be a suffix of another codeword) from a Huffman code (which is a prefix code) by reversing the Huffman codewords.

• Binary Huffman codes always satisfy the Kraft inequality with equality (their code tree is "saturated"); e.g., see [13, p. 72].
• Any $n$-th order binary Huffman code $f:\mathcal{X}^n\to\{0,1\}^*$ for a stationary source $\{X_n\}_{n=1}^{\infty}$ with finite alphabet $\mathcal{X}$ satisfies
$$H(\mathcal{X}) \le \frac{1}{n}H(X^n) \le \bar{R}_n < \frac{1}{n}H(X^n) + \frac{1}{n},$$
where $H(\mathcal{X})$ denotes the source's entropy rate. Thus, as $n$ increases to infinity, $\bar{R}_n \to H(\mathcal{X})$, but the complexity as well as the encoding-decoding delay grows exponentially with $n$.
• Finally, note that non-binary (i.e., $D>2$) Huffman codes can also be constructed in a similar way as in the binary case, by designing a $D$-ary tree and iteratively applying Lemma 3.26, where now the $D$ least likely source symbols are combined at each stage. The only difference from the binary case is that we have to ensure that we are ultimately left with $D$ symbols at the last stage of the algorithm to guarantee the code's optimality. This is remedied by expanding the original source alphabet $\mathcal{X}$ by adding "dummy" symbols (each with zero probability) so that the alphabet size $|\mathcal{X}'|$ of the expanded source $\mathcal{X}'$ is the smallest integer greater than or equal to $|\mathcal{X}|$ with
$$|\mathcal{X}'| \equiv 1 \pmod{D-1}.$$
For example, if $|\mathcal{X}|=6$ and $D=3$ (ternary codes), we obtain $|\mathcal{X}'|=7$, meaning that we need to enlarge the original source alphabet $\mathcal{X}$ by adding one dummy (zero-probability) source symbol. We thus obtain that the necessary conditions for optimality of Lemma 3.25 also hold for $D$-ary prefix codes when replacing $\mathcal{X}$ with the expanded source $\mathcal{X}'$ and replacing "two" with "$D$" in the statement of the lemma. The resulting $D$-ary Huffman code is an optimal code for the original source $\mathcal{X}$ (e.g., see [18, Chap. 3] and [33, Chap. 11]).
B) Shannon-Fano-Elias code
Assume $\mathcal{X}=\{1,\ldots,M\}$ and $P_X(x)>0$ for all $x\in\mathcal{X}$. Define
$$F(x) \triangleq \sum_{a \le x} P_X(a), \qquad \bar{F}(x) \triangleq \sum_{a < x} P_X(a) + \frac{1}{2}P_X(x).$$

Encoder: For any $x\in\mathcal{X}$, express $\bar{F}(x)$ in decimal binary form, say
$$\bar{F}(x) = .c_1 c_2 \ldots c_k \ldots,$$
and take the first $k$ (fractional) bits as the codeword of source symbol $x$, i.e.,
$$(c_1, c_2, \ldots, c_k),$$
where $k \triangleq \lceil \log_2(1/P_X(x)) \rceil + 1$.

Decoder: Given codeword $(c_1,\ldots,c_k)$, compute the cumulative sum of $F(\cdot)$ starting from the smallest element in $\{1,2,\ldots,M\}$ until the first $x$ satisfying
$$F(x) \ge .c_1 \ldots c_k.$$
Then $x$ should be the original source symbol.
Proof of decodability: For any number $a\in[0,1]$, let $[a]_k$ denote the operation that chops the binary representation of $a$ after $k$ bits (i.e., removing the $(k+1)$th bit, the $(k+2)$th bit, etc.). Then
$$\bar{F}(x) - [\bar{F}(x)]_k < \frac{1}{2^k}.$$
Since $k = \lceil \log_2(1/P_X(x)) \rceil + 1$,
$$\frac{1}{2^k} \le \frac{1}{2}P_X(x) = \left[\sum_{a<x} P_X(a) + \frac{P_X(x)}{2}\right] - \sum_{a \le x-1} P_X(a) = \bar{F}(x) - F(x-1).$$
Hence,
$$F(x-1) = F(x-1) + \frac{1}{2^k} - \frac{1}{2^k} \le \bar{F}(x) - \frac{1}{2^k} < [\bar{F}(x)]_k.$$
In addition,
$$F(x) > \bar{F}(x) \ge [\bar{F}(x)]_k.$$
Consequently, $x$ is the first element satisfying
$$F(x) \ge .c_1 c_2 \ldots c_k.$$
Average codeword length:
$$\bar{\ell} = \sum_{x\in\mathcal{X}} P_X(x)\left(\left\lceil \log_2\frac{1}{P_X(x)} \right\rceil + 1\right) < \sum_{x\in\mathcal{X}} P_X(x)\left(\log_2\frac{1}{P_X(x)} + 2\right) = (H(X)+2) \text{ bits.}$$
Observation 3.29 The Shannon-Fano-Elias code is a prefix code.
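The following Python sketch (an illustration, not taken from the notes) implements the Shannon-Fano-Elias encoder and decoder exactly as described above for a small hypothetical pmf, and checks decodability.

```python
import math

def sfe_encode(x, pmf):
    """Shannon-Fano-Elias codeword of symbol x for pmf = [P_X(1),...,P_X(M)]
    over the alphabet {1,...,M}."""
    F_bar = sum(pmf[a - 1] for a in range(1, x)) + 0.5 * pmf[x - 1]
    k = math.ceil(math.log2(1.0 / pmf[x - 1])) + 1
    bits = []
    for _ in range(k):              # binary expansion of F_bar, truncated to k bits
        F_bar *= 2
        bit = int(F_bar)
        bits.append(bit)
        F_bar -= bit
    return bits

def sfe_decode(bits, pmf):
    """Return the first x with F(x) >= .c1...ck."""
    value = sum(b * 2.0 ** (-(i + 1)) for i, b in enumerate(bits))
    F = 0.0
    for x in range(1, len(pmf) + 1):
        F += pmf[x - 1]
        if F >= value:
            return x

pmf = [0.25, 0.5, 0.125, 0.125]          # hypothetical source on {1,2,3,4}
for x in range(1, 5):
    cw = sfe_encode(x, pmf)
    assert sfe_decode(cw, pmf) == x      # decodability check
    print(x, cw)
```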
3.3.4 Examples of universal lossless variable-length codes
In Section 3.3.3, we assumed that the source distribution is known, so that we can use either Huffman codes or Shannon-Fano-Elias codes to compress the source. What if the source distribution is not known a priori? Is it still possible to construct a completely lossless data compression code which is universally good (or asymptotically optimal) for all sources of interest? The answer is affirmative. Two such examples are the adaptive Huffman codes and the Lempel-Ziv codes (which, unlike Huffman and Shannon-Fano-Elias codes, map variable-length sourcewords onto codewords).
A) Adaptive Huffman code
A straightforward universal coding scheme is to use the empirical distribution
(or relative frequencies) as the true distribution, and then apply the optimal
Huffman code according to the empirical distribution. If the source is i.i.d.,
the relative frequencies will converge to its true marginal probability. Therefore,
such universal codes should be good for all i.i.d. sources. However, in order to get
an accurate estimation of the true distribution, one must observe a sufficiently
long sourceword sequence under which the coder will suffer a long delay. This
can be improved by using the adaptive universal Huffman code [19].
The working procedure of the adaptive Huffman code is as follows. Start
with an initial guess of the source distribution (based on the assumption that
76
the source is DMS). As a new source symbol arrives, encode the data in terms of
the Huffman coding scheme according to the current estimated distribution, and
then update the estimated distribution and the Huffman codebook according to
the newly arrived source symbol.
To be specific, let the source alphabet be $\mathcal{X} \triangleq \{a_1,\ldots,a_M\}$. Define
$$N(a_i|x^n) \triangleq \text{number of occurrences of } a_i \text{ in } x_1, x_2, \ldots, x_n.$$
Then the (current) relative frequency of $a_i$ is $N(a_i|x^n)/n$. Let $c_n(a_i)$ denote the Huffman codeword of source symbol $a_i$ with respect to the distribution
$$\left(\frac{N(a_1|x^n)}{n}, \frac{N(a_2|x^n)}{n}, \cdots, \frac{N(a_M|x^n)}{n}\right).$$
Now suppose that $x_{n+1} = a_j$. The codeword $c_n(a_j)$ is output, and the relative frequency of each source outcome becomes
$$\frac{N(a_j|x^{n+1})}{n+1} = \frac{n \times (N(a_j|x^n)/n) + 1}{n+1} \quad\text{and}\quad \frac{N(a_i|x^{n+1})}{n+1} = \frac{n \times (N(a_i|x^n)/n)}{n+1} \ \text{ for } i \neq j.$$
This observation results in the following distribution update policy:
$$P^{(n+1)}_{\hat{X}}(a_j) = \frac{n\,P^{(n)}_{\hat{X}}(a_j) + 1}{n+1} \quad\text{and}\quad P^{(n+1)}_{\hat{X}}(a_i) = \frac{n}{n+1}\,P^{(n)}_{\hat{X}}(a_i) \ \text{ for } i \neq j,$$
where $P^{(n+1)}_{\hat{X}}$ represents the estimate of the true distribution $P_X$ at time $n+1$.
Note that in the adaptive Huffman coding scheme, the encoder and decoder need not be re-designed after every symbol, but only when the estimated distribution changes sufficiently that the so-called sibling property is violated.
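A minimal, self-contained Python sketch of the update policy above (an illustration, not from the notes): the initial guess is taken here to be uniform, implemented via one pseudo-count per symbol, which is an assumption made only to avoid zero probabilities.

```python
def update_estimate(counts, symbol):
    """One step of the adaptive update policy: with n = sum of counts,
    P_hat^(n+1)(a_j) = (n*P_hat^(n)(a_j) + 1)/(n+1) for the observed symbol a_j,
    and P_hat^(n+1)(a_i) = n/(n+1) * P_hat^(n)(a_i) for i != j."""
    counts[symbol] += 1
    n = sum(counts.values())
    return {a: c / n for a, c in counts.items()}

# Initial guess (assumed): uniform, via one pseudo-count per symbol.
counts = {"a": 1, "b": 1, "c": 1}
for s in "abacab":
    # In the adaptive Huffman scheme, s is first encoded with the Huffman code
    # built for the *previous* estimate; the codebook is rebuilt only when the
    # sibling property of the code tree is violated.
    estimate = update_estimate(counts, s)
    print(s, estimate)
```

The decoder applies exactly the same count updates after decoding each symbol, so encoder and decoder stay synchronized without any side information.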
Definition 3.30 (Sibling property) A prefix code is said to have the sibling
property if its codetree satisfies:
1. every node in the code-tree (except for the root node) has a sibling (i.e.,
the code-tree is saturated), and
2. the nodes can be listed in non-decreasing order of probability with each node being adjacent to its sibling.
[Figure 3.7: Example of the sibling property based on the code tree from $P^{(16)}_{\hat{X}}$. The arguments in parentheses following $a_j$ indicate the codeword and the probability associated with $a_j$; $b$ denotes an internal node of the tree, with the assigned (partial) code as its subscript and the probability sum of all its children in parentheses. Leaves: $a_1(00, 3/8)$, $a_2(01, 1/4)$, $a_3(100, 1/8)$, $a_4(101, 1/8)$, $a_5(110, 1/16)$, $a_6(111, 1/16)$; internal nodes: $b_0(5/8)$, $b_1(3/8)$, $b_{10}(1/4)$, $b_{11}(1/8)$. The nodes can be listed as the adjacent sibling pairs $(b_0\,5/8,\,b_1\,3/8)$, $(a_1\,3/8,\,a_2\,1/4)$, $(b_{10}\,1/4,\,b_{11}\,1/8)$, $(a_3\,1/8,\,a_4\,1/8)$, $(a_5\,1/16,\,a_6\,1/16)$.]
The next observation indicates the fact that the Huffman code is the only
prefix code satisfying the sibling property.
Observation 3.31 A prefix code is a Huffman code iff it satisfies the sibling
property.
An example for a code tree satisfying the sibling property is shown in Fig-
ure 3.7. The first requirement is satisfied since the tree is saturated. The second
requirement can be checked by the node list in Figure 3.7.
If the next observation (say, at time $n=17$) is $a_3$, then its codeword 100 is output (using the Huffman code corresponding to $P^{(16)}_{\hat{X}}$). The estimated
[Figure 3.8: (Continuation of Figure 3.7.) Example of violation of the sibling property after observing a new symbol $a_3$ at $n=17$. Leaves: $a_1(00, 6/17)$, $a_2(01, 4/17)$, $a_3(100, 3/17)$, $a_4(101, 2/17)$, $a_5(110, 1/17)$, $a_6(111, 1/17)$; internal nodes: $b_0(10/17)$, $b_1(7/17)$, $b_{10}(5/17)$, $b_{11}(2/17)$. Note that node $a_1$ is not adjacent to its sibling $a_2$ in the ordered node list.]
distribution is updated as
$$P^{(17)}_{\hat{X}}(a_1) = \frac{16 \times (3/8)}{17} = \frac{6}{17}, \qquad P^{(17)}_{\hat{X}}(a_2) = \frac{16 \times (1/4)}{17} = \frac{4}{17},$$
$$P^{(17)}_{\hat{X}}(a_3) = \frac{16 \times (1/8) + 1}{17} = \frac{3}{17}, \qquad P^{(17)}_{\hat{X}}(a_4) = \frac{16 \times (1/8)}{17} = \frac{2}{17},$$
$$P^{(17)}_{\hat{X}}(a_5) = \frac{16 \times (1/16)}{17} = \frac{1}{17}, \qquad P^{(17)}_{\hat{X}}(a_6) = \frac{16 \times (1/16)}{17} = \frac{1}{17}.$$
The sibling property is then violated (cf. Figure 3.8). Hence, the codebook needs to be updated according to the new estimated distribution, and the observation at $n=18$ will be encoded using the new codebook shown in Figure 3.9. Details about adaptive Huffman codes can be found in [19].
[Figure 3.9: (Continuation of Figure 3.8.) Updated Huffman code: $a_1(10, 6/17)$, $a_2(00, 4/17)$, $a_3(01, 3/17)$, $a_4(110, 2/17)$, $a_5(1110, 1/17)$, $a_6(1111, 1/17)$, with internal nodes $b_0(7/17)$, $b_1(10/17)$, $b_{11}(4/17)$, $b_{111}(2/17)$. The sibling property now holds for the new code.]
B) Lempel-Ziv codes
We now introduce a well-known and feasible universal coding scheme, which is
named after its inventors, Lempel and Ziv (e.g., cf. [12]). These codes, unlike
Huffman and Shannon-Fano-Elias codes, map variable-length sourcewords (rather than fixed-length sourcewords) onto codewords.
Suppose the source alphabet is binary. Then the Lempel-Ziv encoder can be
described as follows.
Encoder:
1. Parse the input sequence into strings that have never appeared before. For example, if the input sequence is $1011010100010\ldots$, the algorithm first takes the first letter 1 and finds that it has never appeared before, so 1 is the first string. It then takes the second letter 0, determines that it has not appeared before, and makes it the next string. The algorithm moves on to the next letter 1 and finds that this string has already appeared; hence, it appends the following letter to form the new string 11, and so on. Under this procedure, the source sequence is parsed into the strings
1, 0, 11, 01, 010, 00, 10.
2. Let $L$ be the number of distinct strings of the parsed source. Then we need $\lceil \log_2 L \rceil$ bits to index these strings (starting from one). In the above example, the indices are:

parsed source : 1   0   11   01   010   00   10
index         : 001 010 011  100  101   110  111

The codeword of each string is then the index of its prefix concatenated with the last bit of its source string. For example, the codeword of source string 010 will be the index of 01, i.e., 100, concatenated with the last bit of the source string, i.e., 0. Through this procedure, encoding the above parsed strings with $\lceil \log_2 L \rceil = 3$ yields the codeword sequence
(000,1)(000,0)(001,1)(010,1)(100,0)(010,0)(001,0)
or equivalently,
0001000000110101100001000010.
Note that the conventional Lempel-Ziv encoder requires two passes: the first pass to determine $L$, and the second pass to generate the codewords. The algorithm, however, can be modified so that it requires only one pass over the entire source string. Also note that the above algorithm uses an equal number of bits ($\lceil \log_2 L \rceil$) for all the location indices, which can also be relaxed by proper modification.
Decoder: The decoding is straightforward from the encoding procedure.
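The parsing and encoding steps can be summarized in the following Python sketch (an illustration under the stated conventions, not the notes' own implementation). Index 0 is reserved for the empty prefix, so the sketch uses $\lceil \log_2(L+1)\rceil$ bits per index, which equals 3 for the example above.

```python
import math

def lz_parse(source):
    """Parse the source into strings never seen before (incremental parsing)."""
    phrases, current, seen = [], "", set()
    for bit in source:
        current += bit
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    return phrases          # any incomplete final phrase is dropped in this sketch

def lz_encode(source):
    phrases = lz_parse(source)
    width = math.ceil(math.log2(len(phrases) + 1))   # bits per index (0 = empty prefix)
    index = {p: i + 1 for i, p in enumerate(phrases)}
    out = []
    for p in phrases:
        prefix_idx = index.get(p[:-1], 0)            # 0 encodes the empty prefix
        out.append(format(prefix_idx, "0{}b".format(width)) + p[-1])
    return "".join(out)

print(lz_parse("1011010100010"))   # ['1', '0', '11', '01', '010', '00', '10']
print(lz_encode("1011010100010"))  # 0001000000110101100001000010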
Theorem 3.32 The above algorithm asymptotically achieves the entropy rate
of any (unknown statistics) stationary ergodic source.
Proof: Please refer to [12, Sec. 13.5]. 2
Chapter 4
Data Transmission and Channel
Capacity
4.1 Principles of data transmission
A noisy communication channel is an input-output medium in which the output
is not completely or deterministically specified by the input. The channel is
indeed stochastically modeled, where given channel input x, the channel output
yis governed by a transition (conditional) probability distribution denoted by
PY|X(y|x). Since two different inputs may give rise to the same output, the
receiver, upon receipt of an output, needs to guess the most probable sent in-
put. In general, words of length nare sent and received over the channel; in
this case, the channel is characterized by a sequence of n-dimensional transition
distributions PYn|Xn(yn|xn), for n= 1,2,···. A block diagram depicting a data
transmission or channel coding system (with no feedback1) is given in Figure 4.1.
[Figure 4.1: A data transmission system, where $W$ represents the message for transmission, $X^n$ denotes the codeword corresponding to message $W$, $Y^n$ represents the received word due to channel input $X^n$, and $\hat{W}$ denotes the reconstructed message from $Y^n$. The system consists of the cascade: channel encoder $\to$ channel $P_{Y^n|X^n}(\cdot|\cdot)$ $\to$ channel decoder.]
1The capacity of channels with (output) feedback will be studied in Part II of the book.
The designer of a data transmission (or channel) code needs to carefully
select codewords from the set of channel input words (of a given length) so
that a minimal ambiguity is obtained at the channel receiver. For example,
suppose that a channel has binary input and output alphabets and that its
transition probability distribution induces the following conditional probability
on its output symbols given that input words of length 2 are sent:
$$P_{Y|X^2}(y=0 \mid x^2=00) = P_{Y|X^2}(y=0 \mid x^2=01) = 1,$$
$$P_{Y|X^2}(y=1 \mid x^2=10) = P_{Y|X^2}(y=1 \mid x^2=11) = 1$$
(graphically, the transition diagram maps the input words 00 and 01 to the output 0, and the input words 10 and 11 to the output 1, each with probability 1),
and a binary message (either event Aor event B) is required to be transmitted
from the sender to the receiver. Then the data transmission code with (codeword
00 for event A, codeword 10 for event B) obviously induces less ambiguity at
the receiver than the code with (codeword 00 for event A, codeword 01 for event
B).
In short, the objective in designing a data transmission (or channel) code
is to transform a noisy channel into a reliable medium for sending messages
and recovering them at the receiver with minimal loss. To achieve this goal, the
designer of a data transmission code needs to take advantage of the common parts
between the sender and the receiver sites that are least affected by the channel
noise. We will see that these common parts are probabilistically captured by the
mutual information between the channel input and the channel output.
As illustrated in the previous example, if a “least-noise-affected” subset of
the channel input words is appropriately selected as the set of codewords, the
messages intended to be transmitted can be reliably sent to the receiver with
arbitrarily small error. One then raises the question:
What is the maximum amount of information (per channel use) that
can be reliably transmitted over a given noisy channel ?
In the above example, we can transmit a binary message error-free, and hence
the amount of information that can be reliably transmitted is at least 1 bit
per channel use (or channel symbol). It can be expected that the amount of
information that can be reliably transmitted for a highly noisy channel should
be less than that for a less noisy channel. But such a comparison requires a good
measure of the noisiness of channels.
From an information theoretic viewpoint, channel capacity provides a good
measure of the noisiness of a channel; it is defined as the maximal amount of information (per channel use) that can be transmitted via a data
transmission code over the channel and recovered with arbitrarily small proba-
bility of error at the receiver. In addition to its dependence on the channel
transition distribution, channel capacity also depends on the coding constraint
imposed on the channel input, such as “only block (fixed-length) codes are al-
lowed.” When no coding constraints are applied on the channel input (so that
variable-length codes can be employed), the derivation of the channel capacity
is usually viewed as a hard problem, and is only partially solved so far. In
this chapter, we will introduce the channel capacity for block codes (namely,
only block transmission code can be used). Throughout the chapter, the noisy
channel is assumed to be memoryless (as defined in the next section).
4.2 Discrete memoryless channels
Definition 4.1 (Discrete channel) A discrete communication channel is char-
acterized by
• A finite input alphabet $\mathcal{X}$.

• A finite output alphabet $\mathcal{Y}$.

• A sequence of $n$-dimensional transition distributions $\{P_{Y^n|X^n}(y^n|x^n)\}_{n=1}^{\infty}$ such that $\sum_{y^n \in \mathcal{Y}^n} P_{Y^n|X^n}(y^n|x^n) = 1$ for every $x^n \in \mathcal{X}^n$, where $x^n = (x_1,\cdots,x_n) \in \mathcal{X}^n$ and $y^n = (y_1,\cdots,y_n) \in \mathcal{Y}^n$. We assume that the above sequence of $n$-dimensional distributions is consistent, i.e.,
$$P_{Y^i|X^i}(y^i|x^i) = \frac{\sum_{x_{i+1}\in\mathcal{X}} \sum_{y_{i+1}\in\mathcal{Y}} P_{X^{i+1}}(x^{i+1})\,P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1})}{\sum_{x_{i+1}\in\mathcal{X}} P_{X^{i+1}}(x^{i+1})} = \sum_{x_{i+1}\in\mathcal{X}} \sum_{y_{i+1}\in\mathcal{Y}} P_{X_{i+1}|X^i}(x_{i+1}|x^i)\,P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1})$$
for every $x^i$, $y^i$, $P_{X_{i+1}|X^i}$ and $i = 1,2,\cdots$.
In general, real-world communications channels exhibit statistical memory
in the sense that current channel outputs statistically depend on past outputs
as well as past, current and (possibly) future inputs. However, for the sake of
simplicity, we restrict our attention in this chapter to the class of memoryless
channels (channels with memory will later be treated in Volume II).
Definition 4.2 (Discrete memoryless channel) A discrete memoryless chan-
nel (DMC) is a channel whose sequence of transition distributions $P_{Y^n|X^n}$ satisfies
$$P_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} P_{Y|X}(y_i|x_i) \qquad (4.2.1)$$
for every $n = 1,2,\cdots$, $x^n \in \mathcal{X}^n$ and $y^n \in \mathcal{Y}^n$. In other words, a DMC is fully described by the channel's transition distribution matrix $\mathbf{Q} \triangleq [p_{x,y}]$ of size $|\mathcal{X}| \times |\mathcal{Y}|$, where
$$p_{x,y} \triangleq P_{Y|X}(y|x)$$
for $x \in \mathcal{X}$, $y \in \mathcal{Y}$. Furthermore, the matrix $\mathbf{Q}$ is stochastic; i.e., the sum of the entries in each of its rows is equal to 1, since $\sum_{y\in\mathcal{Y}} p_{x,y} = 1$ for all $x \in \mathcal{X}$.
Observation 4.3 We note that the DMC’s condition (4.2.1) is actually equiv-
alent to the following two sets of conditions:
$$P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1}) = P_{Y|X}(y_n|x_n) \quad \forall\, n = 1,2,\cdots,\ x^n,\ y^n; \qquad (4.2.2\text{a})$$
$$P_{Y^{n-1}|X^n}(y^{n-1}|x^n) = P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1}) \quad \forall\, n = 2,3,\cdots,\ x^n,\ y^{n-1}; \qquad (4.2.2\text{b})$$
and
$$P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1}) = P_{Y|X}(y_n|x_n) \quad \forall\, n = 1,2,\cdots,\ x^n,\ y^n; \qquad (4.2.3\text{a})$$
$$P_{X_n|X^{n-1},Y^{n-1}}(x_n|x^{n-1}, y^{n-1}) = P_{X_n|X^{n-1}}(x_n|x^{n-1}) \quad \forall\, n = 1,2,\cdots,\ x^n,\ y^{n-1}. \qquad (4.2.3\text{b})$$
Condition (4.2.2a) (equivalently, (4.2.3a)) implies that the current output $Y_n$ depends only on the current input $X_n$ and not on past inputs $X^{n-1}$ and outputs $Y^{n-1}$. Condition (4.2.2b) indicates that the past outputs $Y^{n-1}$ do not depend on the current input $X_n$. These two conditions together give
$$P_{Y^n|X^n}(y^n|x^n) = P_{Y^{n-1}|X^n}(y^{n-1}|x^n)\,P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1}) = P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})\,P_{Y|X}(y_n|x_n);$$
hence, (4.2.1) holds recursively in $n = 1,2,\cdots$. The converse (i.e., (4.2.1) implies both (4.2.2a) and (4.2.2b)) is a direct consequence of
$$P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1}) = \frac{P_{Y^n|X^n}(y^n|x^n)}{\sum_{y_n\in\mathcal{Y}} P_{Y^n|X^n}(y^n|x^n)} \quad\text{and}\quad P_{Y^{n-1}|X^n}(y^{n-1}|x^n) = \sum_{y_n\in\mathcal{Y}} P_{Y^n|X^n}(y^n|x^n).$$
Similarly, (4.2.3b) states that the current input $X_n$ is independent of past outputs $Y^{n-1}$, which together with (4.2.3a) implies again
$$P_{Y^n|X^n}(y^n|x^n) = \frac{P_{X^n,Y^n}(x^n, y^n)}{P_{X^n}(x^n)} = \frac{P_{X^{n-1},Y^{n-1}}(x^{n-1}, y^{n-1})\,P_{X_n|X^{n-1},Y^{n-1}}(x_n|x^{n-1}, y^{n-1})\,P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1})}{P_{X^{n-1}}(x^{n-1})\,P_{X_n|X^{n-1}}(x_n|x^{n-1})} = P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})\,P_{Y|X}(y_n|x_n),$$
hence recursively yielding (4.2.1). The converse for (4.2.3b)—i.e., (4.2.1) implying (4.2.3b)—can be analogously proved by noting that
$$P_{X_n|X^{n-1},Y^{n-1}}(x_n|x^{n-1}, y^{n-1}) = \frac{P_{X^n}(x^n) \sum_{y_n\in\mathcal{Y}} P_{Y^n|X^n}(y^n|x^n)}{P_{X^{n-1}}(x^{n-1})\,P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})}.$$
Note that the above definition of a DMC in (4.2.1) rules out the use of channel feedback, in which the current channel input $x_n$ can depend on past channel outputs $y^{n-1}$ in addition to the message (such a dependence violates conditions (4.2.2b) and (4.2.3b)). Therefore, condition (4.2.2a) will instead be used to define a DMC with feedback (feedback will be considered in Part II of this book).
Examples of DMCs:
1. Identity (noiseless) channels: An identity channel has equal-size input and
output alphabets (|X| =|Y|) and channel transition probability satisfying
$$P_{Y|X}(y|x) = \begin{cases} 1 & \text{if } y = x, \\ 0 & \text{if } y \neq x. \end{cases}$$
This is a noiseless or perfect channel as the channel input is received error-
free at the channel output.
2. Binary symmetric channels: A binary symmetric channel (BSC) is a chan-
nel with binary input and output alphabets such that each input has a
(conditional) probability given by εfor being received inverted at the out-
put, where ε[0,1] is called the channel’s crossover probability or bit error
rate. The channel’s transition distribution matrix is given by
$$\mathbf{Q} = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,1} \\ p_{1,0} & p_{1,1} \end{bmatrix} = \begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(1|0) \\ P_{Y|X}(0|1) & P_{Y|X}(1|1) \end{bmatrix} = \begin{bmatrix} 1-\varepsilon & \varepsilon \\ \varepsilon & 1-\varepsilon \end{bmatrix} \qquad (4.2.4)$$
and can be graphically represented via a transition diagram as shown in Figure 4.2.

[Figure 4.2: Binary symmetric channel (each input is received correctly with probability $1-\varepsilon$ and flipped with probability $\varepsilon$).]
If we set ε= 0, then the BSC reduces to the binary identity (noiseless)
channel. The channel is called “symmetric” since PY|X(1|0) = PY|X(0|1);
i.e., it has the same probability for flipping an input bit into a 0 or a 1.
A detailed discussion of DMCs with various symmetry properties is given at the end of this chapter.
Despite its simplicity, the BSC is rich enough to capture most of the
complexity of coding problems over more general channels. For exam-
ple, it can exactly model the behavior of practical channels with additive
memoryless Gaussian noise used in conjunction with binary symmetric modulation and hard-decision demodulation (e.g., see [44, p. 240]). It is also worth pointing out that the BSC can be equivalently represented via a binary modulo-2 additive noise channel whose output at time $i$ is the modulo-2 sum of its input and noise variables:
$$Y_i = X_i \oplus Z_i \quad\text{for } i = 1,2,\cdots,$$
where $\oplus$ denotes addition modulo 2; $Y_i$, $X_i$ and $Z_i$ are the channel output, input and noise, respectively, at time $i$; the alphabets $\mathcal{X}=\mathcal{Y}=\mathcal{Z}=\{0,1\}$ are all binary; it is assumed that $X_i$ and $Z_j$ are independent of each other for all $i,j = 1,2,\cdots$; and the noise process is a Bernoulli($\varepsilon$) process, i.e., a binary i.i.d. process with $\Pr[Z=1] = \varepsilon$.
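As a quick illustration of this additive-noise representation (a sketch, not part of the notes), the following Python snippet simulates a BSC by adding Bernoulli($\varepsilon$) noise modulo 2 and checks that the empirical bit error rate is close to $\varepsilon$.

```python
import random

def bsc(x_bits, eps, seed=None):
    """Simulate a BSC with crossover probability eps as the modulo-2
    additive-noise channel Y_i = X_i XOR Z_i, with Z_i i.i.d. Bernoulli(eps)."""
    rng = random.Random(seed)
    return [x ^ (1 if rng.random() < eps else 0) for x in x_bits]

x = [0, 1, 1, 0, 1] * 2000
y = bsc(x, eps=0.1, seed=0)
empirical_ber = sum(xi != yi for xi, yi in zip(x, y)) / len(x)
print(empirical_ber)   # close to 0.1
```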
3. Binary erasure channels: In the BSC, some input bits are received perfectly
and others are received corrupted (flipped) at the channel output. In some
channels however, some input bits are lost during transmission instead of
being received corrupted (for example, packets in data networks may get
dropped or blocked due to congestion or bandwidth constraints). In this
case, the receiver knows the exact location of these bits in the received
bitstream or codeword, but not their actual value. Such bits are then
declared as “erased” during transmission and are called “erasures.” This
gives rise to the so-called binary erasure channel (BEC) as illustrated in
Figure 4.3, with input alphabet X={0,1}and output alphabet Y=
$\{0, E, 1\}$, where $E$ represents an erasure, and channel transition matrix given by
$$\mathbf{Q} = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,E} & p_{0,1} \\ p_{1,0} & p_{1,E} & p_{1,1} \end{bmatrix} = \begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(E|0) & P_{Y|X}(1|0) \\ P_{Y|X}(0|1) & P_{Y|X}(E|1) & P_{Y|X}(1|1) \end{bmatrix} = \begin{bmatrix} 1-\alpha & \alpha & 0 \\ 0 & \alpha & 1-\alpha \end{bmatrix} \qquad (4.2.5)$$
where $0 \le \alpha \le 1$ is called the channel's erasure probability.
4. Binary channels with errors and erasures: One can combine the BSC with
the BEC to obtain a binary channel with both errors and erasures, as
shown in Figure 4.4. We will call such channel the binary symmetric
erasure channel (BSEC). In this case, the channel’s transition matrix is
given by
Q= [px,y] = p0,0p0,E p0,1
p1,0p1,E p1,1=1εα α ε
ε α 1εα(4.2.6)
where ε, α [0,1] are the channel’s crossover and erasure probabilities,
respectively. Clearly, setting α= 0 reduces the BSEC to the BSC, and
setting ε= 0 reduces the BSEC to the BEC.
More generally, the channel need not have a symmetric property in the
sense of having identical transition distributions when inputs bits 0 or 1
[Figure 4.3: Binary erasure channel (each input is received correctly with probability $1-\alpha$ and erased, i.e., mapped to $E$, with probability $\alpha$).]
More generally, the channel need not have a symmetric property in the sense of having identical transition distributions when input bits 0 or 1 are sent. For example, the channel's transition matrix can be given by
$$\mathbf{Q} = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,E} & p_{0,1} \\ p_{1,0} & p_{1,E} & p_{1,1} \end{bmatrix} = \begin{bmatrix} 1-\varepsilon-\alpha & \alpha & \varepsilon \\ \varepsilon' & \alpha' & 1-\varepsilon'-\alpha' \end{bmatrix} \qquad (4.2.7)$$
where in general $\varepsilon' \neq \varepsilon$ and $\alpha' \neq \alpha$. We call such a channel an asymmetric channel with errors and erasures (this model might be useful to represent practical channels using asymmetric or non-uniform modulation constellations).
[Figure 4.4: Binary symmetric erasure channel (crossover probability $\varepsilon$, erasure probability $\alpha$).]
5. $q$-ary symmetric channels: Given an integer $q \ge 2$, the $q$-ary symmetric channel is a non-binary extension of the BSC; it has alphabets $\mathcal{X}=\mathcal{Y}=\{0,1,\cdots,q-1\}$ of size $q$ and channel transition matrix given by
$$\mathbf{Q} = [p_{x,y}] = \begin{bmatrix} 1-\varepsilon & \frac{\varepsilon}{q-1} & \cdots & \frac{\varepsilon}{q-1} \\ \frac{\varepsilon}{q-1} & 1-\varepsilon & \cdots & \frac{\varepsilon}{q-1} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\varepsilon}{q-1} & \frac{\varepsilon}{q-1} & \cdots & 1-\varepsilon \end{bmatrix} \qquad (4.2.8)$$
where $0 \le \varepsilon \le 1$ is the channel's symbol error rate (or probability). When $q=2$, the channel reduces to the BSC with bit error rate $\varepsilon$, as expected.

As with the BSC, the $q$-ary symmetric channel can be expressed as a modulo-$q$ additive noise channel with common input, output and noise alphabets $\mathcal{X}=\mathcal{Y}=\mathcal{Z}=\{0,1,\cdots,q-1\}$ and whose output $Y_i$ at time $i$ is given by $Y_i = X_i \oplus_q Z_i$, for $i=1,2,\cdots$, where $\oplus_q$ denotes addition modulo $q$, and $X_i$ and $Z_i$ are the channel's input and noise variables, respectively, at time $i$. Here, the noise process $\{Z_n\}_{n=1}^{\infty}$ is assumed to be an i.i.d. process with distribution
$$\Pr[Z=0] = 1-\varepsilon \quad\text{and}\quad \Pr[Z=a] = \frac{\varepsilon}{q-1} \ \ \forall\, a \in \{1,\cdots,q-1\}.$$
It is also assumed that the input and noise processes are independent of each other.
6. $q$-ary erasure channels: Given an integer $q \ge 2$, one can also consider a non-binary extension of the BEC, yielding the so-called $q$-ary erasure channel. Specifically, this channel has input and output alphabets given by $\mathcal{X}=\{0,1,\cdots,q-1\}$ and $\mathcal{Y}=\{0,1,\cdots,q-1,E\}$, respectively, where $E$ denotes an erasure, and channel transition distribution given by
$$P_{Y|X}(y|x) = \begin{cases} 1-\alpha & \text{if } y = x,\ x \in \mathcal{X}, \\ \alpha & \text{if } y = E,\ x \in \mathcal{X}, \\ 0 & \text{otherwise}, \end{cases} \qquad (4.2.9)$$
where $0 \le \alpha \le 1$ is the erasure probability. As expected, setting $q=2$ reduces the channel to the BEC.
4.3 Block codes for data transmission over DMCs
Definition 4.4 (Fixed-length data transmission code) Given positive integers $n$ and $M$, and a discrete channel with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$, a fixed-length data transmission code (or block code) for this channel with blocklength $n$ and rate $\frac{1}{n}\log_2 M$ message bits per channel symbol (or channel use) is denoted by $\mathcal{C}_n = (n, M)$ and consists of:

1. $M$ information messages intended for transmission.

2. An encoding function
$$f: \{1,2,\ldots,M\} \to \mathcal{X}^n$$
yielding codewords $f(1), f(2), \cdots, f(M) \in \mathcal{X}^n$, each of length $n$. The set of these $M$ codewords is called the codebook and we also usually write $\mathcal{C}_n = \{f(1), f(2), \cdots, f(M)\}$ to list the codewords.

3. A decoding function $g: \mathcal{Y}^n \to \{1,2,\ldots,M\}$.

The set $\{1,2,\ldots,M\}$ is called the message set and we assume that a message $W$ follows a uniform distribution over the set of messages: $\Pr[W=w] = \frac{1}{M}$ for all $w \in \{1,2,\ldots,M\}$. A block diagram for the channel code is given at the beginning of this chapter; see Figure 4.1. As depicted in the diagram, to convey message $W$ over the channel, the encoder sends its corresponding codeword $X^n = f(W)$ at the channel input. Finally, $Y^n$ is received at the channel output (according to the memoryless channel distribution $P_{Y^n|X^n}$) and the decoder yields $\hat{W} = g(Y^n)$ as the message estimate.
Definition 4.5 (Average probability of error) The average probability of error for a channel block code $\mathcal{C}_n = (n, M)$ with encoder $f(\cdot)$ and decoder $g(\cdot)$ used over a channel with transition distribution $P_{Y^n|X^n}$ is defined as
$$P_e(\mathcal{C}_n) \triangleq \frac{1}{M} \sum_{w=1}^{M} \lambda_w(\mathcal{C}_n),$$
where
$$\lambda_w(\mathcal{C}_n) \triangleq \Pr[\hat{W} \neq W \mid W = w] = \Pr[g(Y^n) \neq w \mid X^n = f(w)] = \sum_{y^n \in \mathcal{Y}^n:\, g(y^n) \neq w} P_{Y^n|X^n}(y^n|f(w))$$
is the code's conditional probability of decoding error given that message $w$ is sent over the channel.

Note that, since we have assumed that the message $W$ is drawn uniformly from the set of messages, we have that
$$P_e(\mathcal{C}_n) = \Pr[\hat{W} \neq W].$$
Observation 4.6 Another, more conservative, error criterion is the so-called maximal probability of error
$$\lambda(\mathcal{C}_n) \triangleq \max_{w \in \{1,2,\cdots,M\}} \lambda_w(\mathcal{C}_n).$$
Clearly, $P_e(\mathcal{C}_n) \le \lambda(\mathcal{C}_n)$; so one might expect that $P_e(\mathcal{C}_n)$ behaves differently than $\lambda(\mathcal{C}_n)$. However, it can be shown that from a code $\mathcal{C}_n = (n, M)$ with arbitrarily small $P_e(\mathcal{C}_n)$, one can construct (by throwing away from $\mathcal{C}_n$ the half of its codewords with largest conditional probability of error) a code $\mathcal{C}'_n = (n, M/2)$ with arbitrarily small $\lambda(\mathcal{C}'_n)$ at essentially the same code rate as $n$ grows to infinity (e.g., see [12, p. 204], [45, p. 163]).² Hence, we will only use $P_e(\mathcal{C}_n)$ as our criterion when evaluating the "goodness" or reliability³ of channel block codes.
Our target is to find a good channel block code (or to show the existence of a
good channel block code). From the perspective of the (weak) law of large num-
bers, a good choice is to draw the code’s codewords based on the jointly typical
set between the input and the output of the channel, since all the probability
mass is ultimately placed on the jointly typical set. The decoding failure then
occurs only when the channel input-output pair does not lie in the jointly typical
set, which implies that the probability of decoding error is ultimately small. We
next define the jointly typical set.
Definition 4.7 (Jointly typical set) The set $F_n(\delta)$ of jointly $\delta$-typical $n$-tuple pairs $(x^n, y^n)$ with respect to the memoryless distribution $P_{X^n,Y^n}(x^n, y^n) = \prod_{i=1}^{n} P_{X,Y}(x_i, y_i)$ is defined by
$$F_n(\delta) \triangleq \Big\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \Big| -\tfrac{1}{n}\log_2 P_{X^n}(x^n) - H(X) \Big| < \delta,\ \Big| -\tfrac{1}{n}\log_2 P_{Y^n}(y^n) - H(Y) \Big| < \delta,$$
$$\text{and } \Big| -\tfrac{1}{n}\log_2 P_{X^n,Y^n}(x^n, y^n) - H(X,Y) \Big| < \delta \Big\}.$$
2Note that this fact holds for single-user channels with known transition distributions (as
given in Definition 4.1) that remain constant throughout the transmission of a codeword. It
does not however hold for single-user channels whose statistical descriptions may vary in an
unknown manner from symbol to symbol during a codeword transmission; such channels, which
include the class of “arbitrarily varying channels” (see [13, Chapter 2, Section 6]), will not be
considered in this textbook.
3We interchangeably use the terms “goodness” or “reliability” for a block code to mean
that its (average) probability of error asymptotically vanishes with increasing blocklength.
In short, a pair (xn, yn) generated by independently drawing ntimes under PX,Y
is jointly δ-typical if its joint and marginal empirical entropies are respectively
δ-close to the true joint and marginal entropies.
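The following Python sketch (an illustration, not from the notes) checks membership in $F_n(\delta)$ by comparing the empirical per-symbol log-likelihoods of a pair of sequences with the true entropies, here for a BSC(0.1) driven by a uniform input.

```python
import math, random

def is_jointly_typical(xs, ys, pxy, delta):
    """Check membership in F_n(delta) for sequences drawn i.i.d. from the
    joint distribution pxy[(x, y)] (Definition 4.7)."""
    n = len(xs)
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    Hx = -sum(p * math.log2(p) for p in px.values() if p > 0)
    Hy = -sum(p * math.log2(p) for p in py.values() if p > 0)
    Hxy = -sum(p * math.log2(p) for p in pxy.values() if p > 0)
    # empirical per-symbol log-likelihoods
    lx = -sum(math.log2(px[x]) for x in xs) / n
    ly = -sum(math.log2(py[y]) for y in ys) / n
    lxy = -sum(math.log2(pxy[(x, y)]) for x, y in zip(xs, ys)) / n
    return abs(lx - Hx) < delta and abs(ly - Hy) < delta and abs(lxy - Hxy) < delta

# BSC(0.1) with uniform input: joint distribution P_{X,Y}
pxy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
rng = random.Random(1)
pairs = rng.choices(list(pxy), weights=list(pxy.values()), k=2000)
xs, ys = zip(*pairs)
print(is_jointly_typical(xs, ys, pxy, delta=0.05))   # True with high probability
```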
With the above definition, we directly obtain the joint AEP theorem.
Theorem 4.8 (Joint AEP) If $(X_1,Y_1), (X_2,Y_2), \ldots, (X_n,Y_n), \ldots$ are i.i.d., i.e., $\{(X_i,Y_i)\}_{i=1}^{\infty}$ is a dependent pair of DMSs, then
$$-\frac{1}{n}\log_2 P_{X^n}(X_1, X_2, \ldots, X_n) \to H(X) \ \text{ in probability},$$
$$-\frac{1}{n}\log_2 P_{Y^n}(Y_1, Y_2, \ldots, Y_n) \to H(Y) \ \text{ in probability},$$
and
$$-\frac{1}{n}\log_2 P_{X^n,Y^n}((X_1,Y_1), \ldots, (X_n,Y_n)) \to H(X,Y) \ \text{ in probability}$$
as $n \to \infty$.

Proof: By the weak law of large numbers, we have the desired result. □
Theorem 4.9 (Shannon-McMillan theorem for pairs) Given a dependent
pair of DMSs with joint entropy H(X, Y ) and any δgreater than zero, we can
choose nbig enough so that the jointly δ-typical set satisfies:
1. $P_{X^n,Y^n}(F_n^c(\delta)) < \delta$ for sufficiently large $n$.

2. The number of elements in $F_n(\delta)$ is at least $(1-\delta)\,2^{n(H(X,Y)-\delta)}$ for sufficiently large $n$, and at most $2^{n(H(X,Y)+\delta)}$ for every $n$.

3. If $(x^n, y^n) \in F_n(\delta)$, its probability of occurrence satisfies
$$2^{-n(H(X,Y)+\delta)} < P_{X^n,Y^n}(x^n, y^n) < 2^{-n(H(X,Y)-\delta)}.$$
Proof: The proof is quite similar to that of the Shannon-McMillan theorem for
a single memoryless source presented in the previous chapter; we hence leave it
as an exercise. 2
We herein arrive at the main result of this chapter, Shannon’s channel coding
theorem for DMCs. It basically states that a quantity C, termed as channel
capacity and defined as the maximum of the channel’s mutual information over
the set of its input distributions (see below), is the supremum of all “achievable”
channel block code rates; i.e., it is the supremum of all rates for which there
exists a sequence of block codes for the channel with asymptotically decaying
(as the blocklength grows to infinity) probability of decoding error. In other
words, for a given DMC, its capacity C, which can be calculated by solely using
the channel’s transition matrix Q, constitutes the largest rate at which one can
reliably transmit information via a block code over this channel. Thus, it is
possible to communicate reliably over an inherently noisy DMC at a fixed rate
(without decreasing it) as long as this rate is below Cand the code’s blocklength
is allowed to be large.
Theorem 4.10 (Shannon's channel coding theorem) Consider a DMC with finite input alphabet $\mathcal{X}$, finite output alphabet $\mathcal{Y}$ and transition probability distribution $P_{Y|X}(y|x)$, $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Define the channel capacity⁴
$$C \triangleq \max_{P_X} I(X;Y) = \max_{P_X} I(P_X, P_{Y|X}),$$
where the maximum is taken over all input distributions $P_X$. Then the following hold.

• Forward part (achievability): For any $0 < \varepsilon < 1$, there exist $\gamma > 0$ and a sequence of data transmission block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with
$$\liminf_{n\to\infty} \frac{1}{n}\log_2 M_n \ge C - \gamma$$
and
$$P_e(\mathcal{C}_n) < \varepsilon \ \text{ for sufficiently large } n,$$
where $P_e(\mathcal{C}_n)$ denotes the (average) probability of error for block code $\mathcal{C}_n$.

• Converse part: Any sequence of data transmission block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with
$$\liminf_{n\to\infty} \frac{1}{n}\log_2 M_n > C$$
satisfies
$$P_e(\mathcal{C}_n) > 0 \ \text{ for sufficiently large } n;$$
i.e., the codes' probability of error is bounded away from zero for all $n$ sufficiently large.

⁴First note that the mutual information $I(X;Y)$ is actually a function of the input statistics $P_X$ and the channel statistics $P_{Y|X}$. Hence, we may write it as
$$I(P_X, P_{Y|X}) = \sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}} P_X(x) P_{Y|X}(y|x) \log_2 \frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x') P_{Y|X}(y|x')}.$$
Such an expression is more suitable for calculating the channel capacity. Note also that the channel capacity $C$ is well-defined since, for a fixed $P_{Y|X}$, $I(P_X, P_{Y|X})$ is concave and continuous in $P_X$ (with respect to both the variational distance and the Euclidean distance (i.e., $L_2$-distance) [45, Chapter 2]), and since the set of all input distributions $P_X$ is a compact (closed and bounded) subset of $\mathbb{R}^{|\mathcal{X}|}$ due to the finiteness of $\mathcal{X}$. Hence there exists a $P_X$ that achieves the supremum of the mutual information, and the maximum is attainable.
Proof of the forward part: It suffices to prove the existence of a good block code sequence (satisfying the rate condition, i.e., $\liminf_{n\to\infty}(1/n)\log_2 M_n \ge C - \gamma$ for some $\gamma > 0$) whose average error probability is ultimately less than $\varepsilon$.

We will use Shannon's original random coding proof technique, in which the good block code sequence is not deterministically constructed; instead, its existence is implicitly proven by showing that, for a class (ensemble) of block code sequences $\{\mathcal{C}_n\}_{n=1}^{\infty}$ and a code-selecting distribution $\Pr[\mathcal{C}_n]$ over these block code sequences, the expected value of the average error probability, evaluated under the code-selecting distribution, can be made smaller than $\varepsilon$ for $n$ sufficiently large:
$$E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] = \sum_{\mathcal{C}_n} \Pr[\mathcal{C}_n]\,P_e(\mathcal{C}_n) \to 0 \quad\text{as } n \to \infty.$$
Hence, there must exist at least one desired good code sequence $\{\mathcal{C}_n^*\}_{n=1}^{\infty}$ among them (with $P_e(\mathcal{C}_n^*) \to 0$ as $n \to \infty$).

Fix $\varepsilon \in (0,1)$ and some $\gamma \in (0, 4\varepsilon)$. Observe that there exists $N_0$ such that for $n > N_0$, we can choose an integer $M_n$ with
$$C - \frac{\gamma}{2} \ge \frac{1}{n}\log_2 M_n > C - \gamma.$$
(Since we are only concerned with the case of "sufficiently large $n$," it suffices to consider only those $n$ satisfying $n > N_0$, and to ignore those $n \le N_0$.)

Define $\delta \triangleq \gamma/8$. Let $P_{\hat{X}}$ be the probability distribution achieving the channel capacity:
$$C \triangleq \max_{P_X} I(P_X, P_{Y|X}) = I(P_{\hat{X}}, P_{Y|X}).$$
Denote by $P_{\hat{Y}^n}$ the channel output distribution due to the channel input product distribution $P_{\hat{X}^n}$ (with $P_{\hat{X}^n}(x^n) = \prod_{i=1}^{n} P_{\hat{X}}(x_i)$), i.e.,
$$P_{\hat{Y}^n}(y^n) = \sum_{x^n\in\mathcal{X}^n} P_{\hat{X}^n,\hat{Y}^n}(x^n, y^n), \quad\text{where}\quad P_{\hat{X}^n,\hat{Y}^n}(x^n, y^n) \triangleq P_{\hat{X}^n}(x^n)\,P_{Y^n|X^n}(y^n|x^n)$$
for all $x^n \in \mathcal{X}^n$ and $y^n \in \mathcal{Y}^n$. Note that since $P_{\hat{X}^n}(x^n) = \prod_{i=1}^{n} P_{\hat{X}}(x_i)$ and the channel is memoryless, the resulting joint input-output process $\{(\hat{X}_i, \hat{Y}_i)\}_{i=1}^{\infty}$ is also memoryless with
$$P_{\hat{X}^n,\hat{Y}^n}(x^n, y^n) = \prod_{i=1}^{n} P_{\hat{X},\hat{Y}}(x_i, y_i) \quad\text{and}\quad P_{\hat{X},\hat{Y}}(x,y) = P_{\hat{X}}(x)\,P_{Y|X}(y|x) \ \text{ for } x \in \mathcal{X},\ y \in \mathcal{Y}.$$
We next present the proof in three steps.
Step 1: Code construction.

For any blocklength $n$, independently select $M_n$ channel inputs with replacement⁵ from $\mathcal{X}^n$ according to the distribution $P_{\hat{X}^n}(x^n)$. For the selected $M_n$ channel inputs yielding codebook $\mathcal{C}_n \triangleq \{c_1, c_2, \ldots, c_{M_n}\}$, define the encoder $f_n(\cdot)$ and decoder $g_n(\cdot)$, respectively, as follows:
$$f_n(m) = c_m \quad\text{for } 1 \le m \le M_n,$$
and
$$g_n(y^n) = \begin{cases} m, & \text{if } c_m \text{ is the only codeword in } \mathcal{C}_n \text{ satisfying } (c_m, y^n) \in F_n(\delta); \\ \text{any one in } \{1,2,\ldots,M_n\}, & \text{otherwise}, \end{cases}$$
where $F_n(\delta)$ is defined in Definition 4.7 with respect to the distribution $P_{\hat{X}^n,\hat{Y}^n}$. (We evidently assume that the codebook $\mathcal{C}_n$ and the channel distribution $P_{Y|X}$ are known at both the encoder and the decoder.) Hence, the code $\mathcal{C}_n$ operates as follows. A message $W$ is chosen according to the uniform distribution from the set of messages. The encoder $f_n$ then transmits the $W$th codeword $c_W$ in $\mathcal{C}_n$ over the channel. Then $Y^n$ is received at the channel output and the decoder guesses the sent message via $\hat{W} = g_n(Y^n)$.

Note that there is a total of $|\mathcal{X}|^{n M_n}$ possible randomly generated codebooks $\mathcal{C}_n$, and the probability of selecting each codebook is given by
$$\Pr[\mathcal{C}_n] = \prod_{m=1}^{M_n} P_{\hat{X}^n}(c_m).$$

⁵Here, the channel inputs are selected with replacement; i.e., it is possible and acceptable that all the selected $M_n$ channel inputs are identical.
Step 2: Conditional error probability.

For each (randomly generated) data transmission code $\mathcal{C}_n$, the conditional probability of error given that message $m$ was sent, $\lambda_m(\mathcal{C}_n)$, can be upper bounded by
$$\lambda_m(\mathcal{C}_n) \le \sum_{y^n\in\mathcal{Y}^n:\,(c_m,y^n)\notin F_n(\delta)} P_{Y^n|X^n}(y^n|c_m) + \sum_{\substack{m'=1 \\ m'\neq m}}^{M_n} \ \sum_{y^n\in\mathcal{Y}^n:\,(c_{m'},y^n)\in F_n(\delta)} P_{Y^n|X^n}(y^n|c_m), \qquad (4.3.1)$$
where the first term in (4.3.1) considers the case where the received channel output $y^n$ is not jointly $\delta$-typical with $c_m$ (and hence the decoding rule $g_n(\cdot)$ would possibly result in a wrong guess), and the second term in (4.3.1) reflects the situation where $y^n$ is jointly $\delta$-typical not only with the transmitted codeword $c_m$, but also with another codeword $c_{m'}$ (which may cause a decoding error).

By taking the expectation in (4.3.1) with respect to the $m$th codeword-selecting distribution $P_{\hat{X}^n}(c_m)$, we obtain
$$\sum_{c_m\in\mathcal{X}^n} P_{\hat{X}^n}(c_m)\,\lambda_m(\mathcal{C}_n) \le \sum_{c_m\in\mathcal{X}^n} \sum_{y^n\notin F_n(\delta|c_m)} P_{\hat{X}^n}(c_m)\,P_{Y^n|X^n}(y^n|c_m) + \sum_{c_m\in\mathcal{X}^n} \sum_{\substack{m'=1 \\ m'\neq m}}^{M_n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n}(c_m)\,P_{Y^n|X^n}(y^n|c_m)$$
$$= P_{\hat{X}^n,\hat{Y}^n}(F_n^c(\delta)) + \sum_{\substack{m'=1 \\ m'\neq m}}^{M_n} \sum_{c_m\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n), \qquad (4.3.2)$$
where
$$F_n(\delta|x^n) \triangleq \{ y^n \in \mathcal{Y}^n : (x^n, y^n) \in F_n(\delta) \}.$$
Step 3: Average error probability.

We can now analyze the expectation of the average error probability $E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)]$ over the ensemble of all codebooks $\mathcal{C}_n$ generated at random according to $\Pr[\mathcal{C}_n]$, and show that it asymptotically vanishes as $n$ grows without bound. We obtain the following series of (in)equalities:
$$E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] = \sum_{\mathcal{C}_n} \Pr[\mathcal{C}_n]\,P_e(\mathcal{C}_n) = \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{M_n}) \left( \frac{1}{M_n}\sum_{m=1}^{M_n}\lambda_m(\mathcal{C}_n) \right)$$
$$= \frac{1}{M_n}\sum_{m=1}^{M_n} \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n}) \left( \sum_{c_m\in\mathcal{X}^n} P_{\hat{X}^n}(c_m)\,\lambda_m(\mathcal{C}_n) \right)$$
$$\le \frac{1}{M_n}\sum_{m=1}^{M_n} \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n}) \times P_{\hat{X}^n,\hat{Y}^n}(F_n^c(\delta))$$
$$\quad + \frac{1}{M_n}\sum_{m=1}^{M_n} \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n}) \times \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{c_m\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n) \qquad (4.3.3)$$
$$= P_{\hat{X}^n,\hat{Y}^n}(F_n^c(\delta)) + \frac{1}{M_n}\sum_{m=1}^{M_n}\sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n}) \left( \sum_{c_m\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n) \right),$$
where (4.3.3) follows from (4.3.2), and the last step holds since $P_{\hat{X}^n,\hat{Y}^n}(F_n^c(\delta))$ is a constant independent of $c_1, \ldots, c_{M_n}$ and $m$. Observe that for $n > N_0$,
$$\sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n}) \sum_{c_m\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n)$$
$$= \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{c_{m'}\in\mathcal{X}^n} \sum_{c_m\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n}(c_{m'})\,P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n)$$
$$= \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{c_{m'}\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n}(c_{m'}) \left( \sum_{c_m\in\mathcal{X}^n} P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n) \right)$$
$$= \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{c_{m'}\in\mathcal{X}^n} \sum_{y^n\in F_n(\delta|c_{m'})} P_{\hat{X}^n}(c_{m'})\,P_{\hat{Y}^n}(y^n) = \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} \sum_{(c_{m'},y^n)\in F_n(\delta)} P_{\hat{X}^n}(c_{m'})\,P_{\hat{Y}^n}(y^n)$$
$$\le \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} |F_n(\delta)|\,2^{-n(H(\hat{X})-\delta)}\,2^{-n(H(\hat{Y})-\delta)} \le \sum_{\substack{m'=1\\ m'\neq m}}^{M_n} 2^{n(H(\hat{X},\hat{Y})+\delta)}\,2^{-n(H(\hat{X})-\delta)}\,2^{-n(H(\hat{Y})-\delta)}$$
$$= (M_n-1)\,2^{n(H(\hat{X},\hat{Y})+\delta)}\,2^{-n(H(\hat{X})-\delta)}\,2^{-n(H(\hat{Y})-\delta)} \le M_n \cdot 2^{-n(I(\hat{X};\hat{Y})-3\delta)} \le 2^{n(C-4\delta)} \cdot 2^{-n(I(\hat{X};\hat{Y})-3\delta)} = 2^{-n\delta},$$
where the first inequality follows from the definition of the jointly typical set $F_n(\delta)$, the second inequality holds by the Shannon-McMillan theorem for pairs (Theorem 4.9), and the last inequality follows since $C = I(\hat{X};\hat{Y})$ by definition of $\hat{X}$ and $\hat{Y}$, and since $(1/n)\log_2 M_n \le C - (\gamma/2) = C - 4\delta$.

Consequently,
$$E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] \le P_{\hat{X}^n,\hat{Y}^n}(F_n^c(\delta)) + 2^{-n\delta},$$
which for sufficiently large $n$ (and $n > N_0$) can be made smaller than $2\delta = \gamma/4 < \varepsilon$ by the Shannon-McMillan theorem for pairs. □
Before proving the converse part of the channel coding theorem, let us recall
Fano’s inequality in a channel coding context. Consider an (n, Mn) channel
block code Cnwith encoding and decoding functions given by
$$f_n: \{1,2,\cdots,M_n\} \to \mathcal{X}^n \quad\text{and}\quad g_n: \mathcal{Y}^n \to \{1,2,\cdots,M_n\},$$
respectively. Let message $W$, which is uniformly distributed over the set of messages $\{1,2,\cdots,M_n\}$, be sent via codeword $X^n(W) = f_n(W)$ over the DMC, and let $Y^n$ be received at the channel output. At the receiver, the decoder estimates the sent message via $\hat{W} = g_n(Y^n)$, and the probability of estimation error is given by the code's average error probability:
$$\Pr[W \neq \hat{W}] = P_e(\mathcal{C}_n)$$
since $W$ is uniformly distributed. Then Fano's inequality (2.5.2) yields
$$H(W|Y^n) \le 1 + P_e(\mathcal{C}_n)\log_2(M_n - 1) \le 1 + P_e(\mathcal{C}_n)\log_2 M_n. \qquad (4.3.4)$$
We next proceed with the proof of the converse part.
Proof of the converse part: For any (n, Mn) block channel code Cnas de-
scribed above, we have that WXnYnform a Markov chain; we thus
obtain by the data processing inequality that
I(W;Yn)I(Xn;Yn).(4.3.5)
We can also upper bound I(Xn;Yn) in terms of the channel capacity Cas follows
I(Xn;Yn)max
PXn
I(Xn;Yn)
max
PXn
n
X
i=1
I(Xi;Yi) (by Theorem 2.21)
n
X
i=1
max
PXn
I(Xi;Yi)
=
n
X
i=1
max
PXi
I(Xi;Yi)
=nC. (4.3.6)
100
Consequently, code Cnsatisfies the following:
log2Mn=H(W) (since Wis uniformly distributed)
=H(W|Yn) + I(W;Yn)
H(W|Yn) + I(Xn;Yn) (by 4.3.5)
H(W|Yn) + nC (by 4.3.6)
1 + Pe(Cn)·log2Mn+nC. (by 4.3.4)
This implies that
Pe(Cn)1C
(1/n) log2Mn1
log2Mn
.
So if lim infn→∞(1/n) log2Mn> C, then for any δ > 0, there exists an integer
Nsuch that for nN,1
nlog2Mn> C +δ.
Hence, for nN0,max{N, 2},
Pe(Cn)>1C
C+δ1
n(C+δ)>δ
2(C+δ)>0;
i.e., Pe(Cn) is bounded away from zero for nsufficiently large. 2
The results of the above channel coding theorem are illustrated in Figure 4.5, where $R = \liminf_{n\to\infty}(1/n)\log_2 M_n$ (measured in message bits/channel use) is usually called the ultimate (or asymptotic) coding rate of channel block codes. As indicated in the figure, the ultimate rate of any good block code for the DMC must be smaller than its capacity $C$. Conversely, any block code with (ultimate) rate greater than $C$ will have its probability of error bounded away from zero. Thus for a DMC, its capacity $C$ is the supremum of all "achievable" channel block coding rates; i.e., it is the supremum of all rates for which there exists a sequence of channel block codes with asymptotically vanishing (as the blocklength goes to infinity) probability of error.

[Figure 4.5: Ultimate channel coding rate $R$ versus channel capacity $C$ and behavior of the probability of error as blocklength $n$ goes to infinity for a discrete memoryless channel: for $R < C$, $\lim_{n\to\infty} P_e = 0$ for the best channel block code, while for $R > C$, $\limsup_{n\to\infty} P_e > 0$ for all channel block codes.]

Shannon's channel coding theorem, established in 1948 [38], provides the ultimate limit for reliable communication over a noisy channel. However, it does not provide an explicit efficient construction for good codes, since searching for a good code from the ensemble of randomly generated codes is prohibitively complex, as its size grows double-exponentially with blocklength (see Step 1 of the proof of the forward part). It thus spurred the entire area of coding theory, which flourished over the last 60 years with the aim of constructing powerful error-correcting codes operating close to the capacity limit. Particular advances were made for the class of linear codes (also known as group codes) whose rich⁶ yet elegantly simple algebraic structures made them amenable to efficient
practically-implementable encoding and decoding. Examples of such codes in-
clude Hamming codes, Golay codes, BCH and Reed-Solomon codes and convo-
lutional codes. In 1993, the so-called Turbo codes were introduced by Berrou
et al. [3, 4] and shown experimentally to perform close to the channel capacity
limit for the class of memoryless channels. Similar near-capacity achieving lin-
ear codes were later established with the re-discovery of Gallager’s low-density
parity-check codes [16, 17, 29, 30]. Many of these codes are used with increased
sophistication in today’s ubiquitous communication, information and multime-
dia technologies. For detailed studies on coding theory, see the following texts
[8, 10, 23, 28, 31, 35, 44].
4.4 Calculating channel capacity
Given a DMC with finite input alphabet $\mathcal{X}$, finite output alphabet $\mathcal{Y}$ and channel transition matrix $\mathbf{Q} = [p_{x,y}]$ of size $|\mathcal{X}| \times |\mathcal{Y}|$, where $p_{x,y} \triangleq P_{Y|X}(y|x)$ for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, we would like to calculate
$$C \triangleq \max_{P_X} I(X;Y),$$
where the maximization (which is well-defined) is carried out over the set of input distributions $P_X$, and $I(X;Y)$ is the mutual information between the channel's input and output.
Note that Ccan be determined numerically via non-linear optimization tech-
niques such as the iterative algorithms developed by Arimoto [1] and Blahut
[7, 9], see also [14] and [45, Chap. 9]. In general, there are no closed-form (single-
letter) analytical expressions for C. However, for many “simplified” channels,
6Indeed, there exist linear codes that can achieve the capacity of memoryless channels with
additive noise (e.g., see [13, p. 114]). Such channels include the BSC and the q-ary symmetric
channel.
it is possible to analytically determine Cunder some “symmetry” properties of
their channel transition matrix.
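For concreteness, here is a short Python sketch of the Blahut-Arimoto iteration mentioned above (an illustrative implementation, not the notes' own); it returns a numerical estimate of $C$ for an arbitrary transition matrix and recovers $1 - h_b(0.1) \approx 0.531$ bits for the BSC with crossover probability 0.1.

```python
import math

def blahut_arimoto(Q, iterations=200):
    """Sketch of the Blahut-Arimoto iteration for C = max_PX I(PX, Q),
    where Q[x][y] = P_{Y|X}(y|x).  Returns (capacity in bits, input distribution)."""
    nx, ny = len(Q), len(Q[0])
    p = [1.0 / nx] * nx                      # start from the uniform input
    for _ in range(iterations):
        q = [sum(p[x] * Q[x][y] for x in range(nx)) for y in range(ny)]  # output dist.
        # D(x) = 2 ** ( sum_y Q(y|x) log2( Q(y|x) / q(y) ) )
        D = [2 ** sum(Q[x][y] * math.log2(Q[x][y] / q[y])
                      for y in range(ny) if Q[x][y] > 0) for x in range(nx)]
        Z = sum(p[x] * D[x] for x in range(nx))
        p = [p[x] * D[x] / Z for x in range(nx)]
    return math.log2(Z), p                   # log2(Z) converges to the capacity

# BSC with crossover 0.1: capacity should be 1 - h_b(0.1) ~ 0.531 bits
print(blahut_arimoto([[0.9, 0.1], [0.1, 0.9]]))
```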
4.4.1 Symmetric, weakly-symmetric and quasi-symmetric
channels
Definition 4.11 A DMC with finite input alphabet X, finite output alphabet
Yand channel transition matrix Q= [px,y] of size |X | × |Y| is said to be sym-
metric if the rows of Qare permutations of each other and the columns of Q
are permutations of each other. The channel is said to be weakly-symmetric if
the rows of Qare permutations of each other and all the column sums in Qare
equal.
It directly follows from the definition that symmetry implies weak-symmetry.
Examples of symmetric DMCs include the BSC, the q-ary symmetric channel and
the following ternary channel with X=Y={0,1,2}and transition matrix
$$\mathbf{Q} = \begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(1|0) & P_{Y|X}(2|0) \\ P_{Y|X}(0|1) & P_{Y|X}(1|1) & P_{Y|X}(2|1) \\ P_{Y|X}(0|2) & P_{Y|X}(1|2) & P_{Y|X}(2|2) \end{bmatrix} = \begin{bmatrix} 0.4 & 0.1 & 0.5 \\ 0.5 & 0.4 & 0.1 \\ 0.1 & 0.5 & 0.4 \end{bmatrix}.$$
The following DMC with $|\mathcal{X}| = |\mathcal{Y}| = 4$ and
$$\mathbf{Q} = \begin{bmatrix} 0.5 & 0.25 & 0.25 & 0 \\ 0.5 & 0.25 & 0.25 & 0 \\ 0 & 0.25 & 0.25 & 0.5 \\ 0 & 0.25 & 0.25 & 0.5 \end{bmatrix} \qquad (4.4.1)$$
is weakly-symmetric (but not symmetric). Noting that all the above channels involve square transition matrices, we emphasize that $\mathbf{Q}$ can be rectangular while satisfying the symmetry or weak-symmetry properties. For example, the DMC with $|\mathcal{X}| = 2$, $|\mathcal{Y}| = 4$ and
$$\mathbf{Q} = \begin{bmatrix} \frac{1-\varepsilon}{2} & \frac{\varepsilon}{2} & \frac{1-\varepsilon}{2} & \frac{\varepsilon}{2} \\ \frac{\varepsilon}{2} & \frac{1-\varepsilon}{2} & \frac{\varepsilon}{2} & \frac{1-\varepsilon}{2} \end{bmatrix} \qquad (4.4.2)$$
is symmetric (where $\varepsilon \in [0,1]$), while the DMC with $|\mathcal{X}| = 2$, $|\mathcal{Y}| = 3$ and
$$\mathbf{Q} = \begin{bmatrix} \frac{1}{3} & \frac{1}{6} & \frac{1}{2} \\ \frac{1}{3} & \frac{1}{2} & \frac{1}{6} \end{bmatrix}$$
is weakly-symmetric.
Lemma 4.12 The capacity of a weakly-symmetric channel $\mathbf{Q}$ is achieved by a uniform input distribution and is given by
$$C = \log_2 |\mathcal{Y}| - H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) \qquad (4.4.3)$$
where $(q_1, q_2, \cdots, q_{|\mathcal{Y}|})$ denotes any row of $\mathbf{Q}$ and
$$H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) \triangleq -\sum_{i=1}^{|\mathcal{Y}|} q_i \log_2 q_i$$
is the row entropy.

Proof: The mutual information between the channel's input and output is given by
$$I(X;Y) = H(Y) - H(Y|X) = H(Y) - \sum_{x\in\mathcal{X}} P_X(x)\,H(Y|X=x),$$
where $H(Y|X=x) = -\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\log_2 P_{Y|X}(y|x) = -\sum_{y\in\mathcal{Y}} p_{x,y}\log_2 p_{x,y}$. Noting that every row of $\mathbf{Q}$ is a permutation of every other row, we obtain that $H(Y|X=x)$ is independent of $x$ and can be written as
$$H(Y|X=x) = H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}),$$
where $(q_1, q_2, \cdots, q_{|\mathcal{Y}|})$ is any row of $\mathbf{Q}$. Thus
$$H(Y|X) = \sum_{x\in\mathcal{X}} P_X(x)\,H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) = H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) \left( \sum_{x\in\mathcal{X}} P_X(x) \right) = H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}).$$
Thus
$$I(X;Y) = H(Y) - H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) \le \log_2|\mathcal{Y}| - H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}),$$
with equality achieved iff $Y$ is uniformly distributed over $\mathcal{Y}$. We next show that choosing a uniform input distribution, $P_X(x) = \frac{1}{|\mathcal{X}|}$ for all $x\in\mathcal{X}$, yields a uniform output distribution, hence maximizing mutual information. Indeed, under a uniform input distribution, we obtain that for any $y\in\mathcal{Y}$,
$$P_Y(y) = \sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x) = \frac{1}{|\mathcal{X}|}\sum_{x\in\mathcal{X}} p_{x,y} = \frac{A}{|\mathcal{X}|},$$
where $A \triangleq \sum_{x\in\mathcal{X}} p_{x,y}$ is a constant given by the sum of the entries in any column of $\mathbf{Q}$, since by the weak-symmetry property all column sums in $\mathbf{Q}$ are identical. Noting that $\sum_{y\in\mathcal{Y}} P_Y(y) = 1$ yields
$$\sum_{y\in\mathcal{Y}} \frac{A}{|\mathcal{X}|} = 1 \quad\text{and thus}\quad A = \frac{|\mathcal{X}|}{|\mathcal{Y}|}. \qquad (4.4.4)$$
Thus
$$P_Y(y) = \frac{A}{|\mathcal{X}|} = \frac{|\mathcal{X}|}{|\mathcal{Y}|}\cdot\frac{1}{|\mathcal{X}|} = \frac{1}{|\mathcal{Y}|}$$
for any $y\in\mathcal{Y}$; thus the uniform input distribution induces a uniform output distribution and achieves channel capacity as given by (4.4.3). □
Observation 4.13 Note that if the weakly-symmetric channel has a square (i.e.,
with |X| =|Y|) transition matrix Q, then Qis a doubly-stochastic matrix; i.e.,
both its row sums and its column sums are equal to 1. Note however that having
a square transition matrix does not necessarily make a weakly-symmetric channel
symmetric; e.g., see (4.4.1).
Example 4.14 (Capacity of the BSC) Since the BSC with crossover probability (or bit error rate) $\varepsilon$ is symmetric, we directly obtain from Lemma 4.12 that its capacity is achieved by a uniform input distribution and is given by
$$C = \log_2(2) - H(1-\varepsilon, \varepsilon) = 1 - h_b(\varepsilon) \qquad (4.4.5)$$
where $h_b(\cdot)$ is the binary entropy function.
Example 4.15 (Capacity of the q-ary symmetric channel) Similarly, the $q$-ary symmetric channel with symbol error rate $\varepsilon$ described in (4.2.8) is symmetric; hence, by Lemma 4.12, its capacity is given by
$$C = \log_2 q - H\!\left(1-\varepsilon, \frac{\varepsilon}{q-1}, \cdots, \frac{\varepsilon}{q-1}\right) = \log_2 q + \varepsilon\log_2\frac{\varepsilon}{q-1} + (1-\varepsilon)\log_2(1-\varepsilon).$$
Note that when $q = 2$, the channel capacity is equal to that of the BSC, as expected. Furthermore, when $\varepsilon = 0$, the channel reduces to the identity (noiseless) $q$-ary channel and its capacity is given by $C = \log_2 q$.
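A small Python sketch (illustrative, not from the notes) evaluating the weakly-symmetric capacity formula (4.4.3); the two calls below reproduce the BSC and $q$-ary symmetric capacities of Examples 4.14 and 4.15 for $\varepsilon = 0.1$ and $(q, \varepsilon) = (3, 0.2)$, respectively.

```python
import math

def row_entropy(row):
    return -sum(q * math.log2(q) for q in row if q > 0)

def weakly_symmetric_capacity(Q):
    """C = log2|Y| - H(any row of Q), valid when Q is weakly-symmetric
    (Lemma 4.12); capacity is then achieved by the uniform input."""
    return math.log2(len(Q[0])) - row_entropy(Q[0])

# BSC(0.1):  1 - h_b(0.1) ~ 0.531 bits
print(weakly_symmetric_capacity([[0.9, 0.1], [0.1, 0.9]]))
# 3-ary symmetric channel with symbol error rate 0.2:
# log2(3) + 0.2*log2(0.1) + 0.8*log2(0.8) ~ 0.663 bits
print(weakly_symmetric_capacity([[0.8, 0.1, 0.1],
                                 [0.1, 0.8, 0.1],
                                 [0.1, 0.1, 0.8]]))
```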
We next note that one can further weaken the weak-symmetry property and
define a class of “quasi-symmetric” channels for which the uniform input distri-
bution still achieves capacity and yields a simple closed-form formula for capacity.
Definition 4.16 A DMC with finite input alphabet $\mathcal{X}$, finite output alphabet $\mathcal{Y}$ and channel transition matrix $\mathbf{Q} = [p_{x,y}]$ of size $|\mathcal{X}| \times |\mathcal{Y}|$ is said to be quasi-symmetric⁷ if $\mathbf{Q}$ can be partitioned along its columns into $m$ weakly-symmetric sub-matrices $\mathbf{Q}_1, \mathbf{Q}_2, \cdots, \mathbf{Q}_m$ for some integer $m \ge 1$, where each sub-matrix $\mathbf{Q}_i$ has size $|\mathcal{X}| \times |\mathcal{Y}_i|$ for $i = 1,2,\cdots,m$, with $\mathcal{Y}_1 \cup \cdots \cup \mathcal{Y}_m = \mathcal{Y}$ and $\mathcal{Y}_i \cap \mathcal{Y}_j = \emptyset$ for $i \neq j$, $i,j = 1,2,\cdots,m$.

⁷This notion of "quasi-symmetry" is slightly more general than Gallager's notion [18, p. 94], as we herein allow each sub-matrix to be weakly-symmetric (instead of symmetric as in [18]).
Lemma 4.17 The capacity of a quasi-symmetric channel $\mathbf{Q}$ as defined above is achieved by a uniform input distribution and is given by
$$C = \sum_{i=1}^{m} a_i C_i \qquad (4.4.6)$$
where
$$a_i \triangleq \sum_{y\in\mathcal{Y}_i} p_{x,y} = \text{sum of any row in } \mathbf{Q}_i, \quad i = 1,\cdots,m,$$
and
$$C_i = \log_2|\mathcal{Y}_i| - H\!\left(\text{any row in the matrix } \tfrac{1}{a_i}\mathbf{Q}_i\right), \quad i = 1,\cdots,m,$$
is the capacity of the $i$th weakly-symmetric "sub-channel" whose transition matrix is obtained by multiplying each entry of $\mathbf{Q}_i$ by $\frac{1}{a_i}$ (this normalization renders sub-matrix $\mathbf{Q}_i$ into a stochastic matrix and hence a channel transition matrix).

Proof: We first observe that for each $i = 1,\cdots,m$, $a_i$ is independent of the input value $x$, since sub-matrix $\mathbf{Q}_i$ is weakly-symmetric (so any row in $\mathbf{Q}_i$ is a permutation of any other row); hence $a_i$ is the sum of any row in $\mathbf{Q}_i$.

For each $i = 1,\cdots,m$, define
$$P_{Y_i|X}(y|x) \triangleq \begin{cases} \dfrac{p_{x,y}}{a_i} & \text{if } y \in \mathcal{Y}_i \text{ and } x \in \mathcal{X}; \\ 0 & \text{otherwise}, \end{cases}$$
where $Y_i$ is a random variable taking values in $\mathcal{Y}_i$. It can be easily verified that $P_{Y_i|X}(y|x)$ is a legitimate conditional distribution. Thus $[P_{Y_i|X}(y|x)] = \frac{1}{a_i}\mathbf{Q}_i$ is the transition matrix of the weakly-symmetric "sub-channel" $i$ with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}_i$. Let $I(X;Y_i)$ denote its mutual information. Since each such sub-channel $i$ is weakly-symmetric, we know that its capacity $C_i$ is given by
$$C_i = \max_{P_X} I(X;Y_i) = \log_2|\mathcal{Y}_i| - H\!\left(\text{any row in the matrix } \tfrac{1}{a_i}\mathbf{Q}_i\right),$$
where the maximum is achieved by a uniform input distribution.

Now, the mutual information between the input and the output of our original quasi-symmetric channel $\mathbf{Q}$ can be written as
$$I(X;Y) = \sum_{y\in\mathcal{Y}} \sum_{x\in\mathcal{X}} P_X(x)\,p_{x,y}\log_2\frac{p_{x,y}}{\sum_{x'\in\mathcal{X}} P_X(x')\,p_{x',y}} = \sum_{i=1}^{m} \sum_{y\in\mathcal{Y}_i} \sum_{x\in\mathcal{X}} a_i\,P_X(x)\,\frac{p_{x,y}}{a_i}\log_2\frac{\frac{p_{x,y}}{a_i}}{\sum_{x'\in\mathcal{X}} P_X(x')\,\frac{p_{x',y}}{a_i}}$$
$$= \sum_{i=1}^{m} a_i \sum_{y\in\mathcal{Y}_i} \sum_{x\in\mathcal{X}} P_X(x)\,P_{Y_i|X}(y|x)\log_2\frac{P_{Y_i|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x')\,P_{Y_i|X}(y|x')} = \sum_{i=1}^{m} a_i\,I(X;Y_i).$$
Therefore, the capacity of channel $\mathbf{Q}$ is
$$C = \max_{P_X} I(X;Y) = \max_{P_X} \sum_{i=1}^{m} a_i\,I(X;Y_i) = \sum_{i=1}^{m} a_i \max_{P_X} I(X;Y_i) = \sum_{i=1}^{m} a_i C_i,$$
where the third equality holds since the same uniform $P_X$ maximizes each $I(X;Y_i)$. □
Example 4.18 (Capacity of the BEC) The BEC with erasure probability $\alpha$ as given in (4.2.5) is quasi-symmetric (but neither weakly-symmetric nor symmetric). Indeed, its transition matrix $\mathbf{Q}$ can be partitioned along its columns into two symmetric (hence weakly-symmetric) sub-matrices
$$\mathbf{Q}_1 = \begin{bmatrix} 1-\alpha & 0 \\ 0 & 1-\alpha \end{bmatrix} \quad\text{and}\quad \mathbf{Q}_2 = \begin{bmatrix} \alpha \\ \alpha \end{bmatrix}.$$
Thus applying the capacity formula for quasi-symmetric channels of Lemma 4.17 yields that the capacity of the BEC is given by $C = a_1 C_1 + a_2 C_2$, where $a_1 = 1-\alpha$, $a_2 = \alpha$,
$$C_1 = \log_2(2) - H\!\left(\frac{1-\alpha}{1-\alpha}, \frac{0}{1-\alpha}\right) = 1 - H(1,0) = 1 - 0 = 1, \quad\text{and}\quad C_2 = \log_2(1) - H\!\left(\frac{\alpha}{\alpha}\right) = 0 - 0 = 0.$$
Therefore, the BEC capacity is given by
$$C = (1-\alpha)(1) + (\alpha)(0) = 1 - \alpha. \qquad (4.4.7)$$
Example 4.19 (Capacity of the BSEC) Similarly, the BSEC with crossover probability $\varepsilon$ and erasure probability $\alpha$ as described in (4.2.6) is quasi-symmetric; its transition matrix can be partitioned along its columns into two symmetric sub-matrices
$$\mathbf{Q}_1 = \begin{bmatrix} 1-\varepsilon-\alpha & \varepsilon \\ \varepsilon & 1-\varepsilon-\alpha \end{bmatrix} \quad\text{and}\quad \mathbf{Q}_2 = \begin{bmatrix} \alpha \\ \alpha \end{bmatrix}.$$
Hence by Lemma 4.17, the channel capacity is given by $C = a_1 C_1 + a_2 C_2$, where $a_1 = 1-\alpha$, $a_2 = \alpha$,
$$C_1 = \log_2(2) - H\!\left(\frac{1-\varepsilon-\alpha}{1-\alpha}, \frac{\varepsilon}{1-\alpha}\right) = 1 - h_b\!\left(\frac{1-\varepsilon-\alpha}{1-\alpha}\right), \quad\text{and}\quad C_2 = \log_2(1) - H\!\left(\frac{\alpha}{\alpha}\right) = 0.$$
We thus obtain that
$$C = (1-\alpha)\left[1 - h_b\!\left(\frac{1-\varepsilon-\alpha}{1-\alpha}\right)\right] + (\alpha)(0) = (1-\alpha)\left[1 - h_b\!\left(\frac{1-\varepsilon-\alpha}{1-\alpha}\right)\right]. \qquad (4.4.8)$$
As already noted, the BSEC is a combination of the BSC with bit error rate $\varepsilon$ and the BEC with erasure probability $\alpha$. Indeed, setting $\alpha = 0$ in (4.4.8) yields $C = 1 - h_b(1-\varepsilon) = 1 - h_b(\varepsilon)$, which is the BSC capacity. Furthermore, setting $\varepsilon = 0$ results in $C = 1 - \alpha$, the BEC capacity.
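The decomposition of Lemma 4.17 translates directly into code; the following Python sketch (illustrative, with an assumed column ordering $(y=0, E, 1)$ for the BSEC) computes $\sum_i a_i C_i$ from a transition matrix and a column partition, and reproduces formula (4.4.8).

```python
import math

def quasi_symmetric_capacity(Q, partition):
    """C = sum_i a_i * C_i for a quasi-symmetric channel (Lemma 4.17).
    Q[x][y] is the transition matrix; partition is a list of lists of output
    columns, each forming a weakly-symmetric sub-matrix."""
    C = 0.0
    for cols in partition:
        a = sum(Q[0][y] for y in cols)                 # sum of any row of Q_i
        row = [Q[0][y] / a for y in cols]              # any row of (1/a_i) Q_i
        Ci = math.log2(len(cols)) + sum(q * math.log2(q) for q in row if q > 0)
        C += a * Ci
    return C

eps, alpha = 0.05, 0.2
# BSEC columns ordered as (y=0, y=E, y=1); partition into {0,1} and {E}
Q = [[1 - eps - alpha, alpha, eps],
     [eps, alpha, 1 - eps - alpha]]
print(quasi_symmetric_capacity(Q, partition=[[0, 2], [1]]))
# equals (1 - alpha) * (1 - h_b((1 - eps - alpha)/(1 - alpha)))
```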
4.4.2 Channel capacity and the Karush-Kuhn-Tucker condition

When the channel does not satisfy any symmetry property, the following necessary and sufficient Karush-Kuhn-Tucker (KKT) condition (e.g., cf. [18, pp. 87-91], [5, 11]) for calculating channel capacity can be quite useful.
Definition 4.20 (Mutual information for a specific input symbol) The
mutual information for a specific input symbol is defined as:
$$I(x;Y) \triangleq \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)}{P_Y(y)}.$$
From the above definition, the mutual information becomes
$$I(X;Y) = \sum_{x\in\mathcal{X}} P_X(x) \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)}{P_Y(y)} = \sum_{x\in\mathcal{X}} P_X(x)\,I(x;Y).$$
Lemma 4.21 (KKT condition for channel capacity) For a given DMC, an input distribution $P_X$ achieves its channel capacity iff there exists a constant $C$ such that
$$\begin{cases} I(x;Y) = C & \forall\, x \in \mathcal{X} \text{ with } P_X(x) > 0; \\ I(x;Y) \le C & \forall\, x \in \mathcal{X} \text{ with } P_X(x) = 0. \end{cases} \qquad (4.4.9)$$
Furthermore, the constant $C$ is the channel capacity (justifying the choice of notation).
Proof: The forward (if) part holds directly; hence, we only prove the converse (only-if) part.

Without loss of generality, we assume that $P_X(x) < 1$ for all $x\in\mathcal{X}$, since $P_X(x) = 1$ for some $x$ implies that $I(X;Y) = 0$. The problem of calculating the channel capacity is to maximize
$$I(X;Y) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x')P_{Y|X}(y|x')} \qquad (4.4.10)$$
subject to the condition
$$\sum_{x\in\mathcal{X}} P_X(x) = 1 \qquad (4.4.11)$$
for a given channel distribution $P_{Y|X}$. By using the Lagrange multiplier method (e.g., see [5]), maximizing (4.4.10) subject to (4.4.11) is equivalent to maximizing
$$f(P_X) \triangleq \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x')P_{Y|X}(y|x')} + \lambda\left(\sum_{x\in\mathcal{X}} P_X(x) - 1\right).$$
We then take the derivative of the above quantity with respect to $P_X(x'')$ and obtain⁸
$$\frac{\partial f(P_X)}{\partial P_X(x'')} = I(x'';Y) - \log_2(e) + \lambda.$$
By Property 2 of Lemma 2.46, $I(X;Y) = I(P_X, P_{Y|X})$ is a concave function in $P_X$ (for a fixed $P_{Y|X}$). Therefore, the maximum of $I(P_X, P_{Y|X})$ occurs at a zero derivative when $P_X(x)$ does not lie on the boundary, namely when $1 > P_X(x) > 0$. For those $P_X(x)$ lying on the boundary, i.e., $P_X(x) = 0$, the maximum occurs iff a displacement from the boundary into the interior decreases the quantity, which implies a non-positive derivative, namely
$$I(x;Y) \le -\lambda + \log_2(e) \quad\text{for those } x \text{ with } P_X(x) = 0.$$
To summarize, if an input distribution $P_X$ achieves the channel capacity, then for some $\lambda$,
$$I(x'';Y) = -\lambda + \log_2(e) \ \text{ for } P_X(x'') > 0, \quad\text{and}\quad I(x'';Y) \le -\lambda + \log_2(e) \ \text{ for } P_X(x'') = 0.$$
Setting $C = -\lambda + \log_2(e)$ yields (4.4.9). Finally, multiplying both sides of each relation in (4.4.9) by $P_X(x)$ and summing over $x$ yields $\max_{P_X} I(X;Y)$ on the left and the constant $C$ on the right, thus proving that the constant $C$ is indeed the channel capacity. □

⁸The details of the derivative computation are as follows:
$$\frac{\partial}{\partial P_X(x'')}\left[\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2 P_{Y|X}(y|x) - \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2\!\Big(\sum_{x'\in\mathcal{X}} P_X(x')P_{Y|X}(y|x')\Big) + \lambda\Big(\sum_{x\in\mathcal{X}} P_X(x) - 1\Big)\right]$$
$$= \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x'')\log_2 P_{Y|X}(y|x'') - \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x'')\log_2\!\Big(\sum_{x'\in\mathcal{X}} P_X(x')P_{Y|X}(y|x')\Big) - \log_2(e)\sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x)\frac{P_{Y|X}(y|x'')}{\sum_{x'\in\mathcal{X}} P_X(x')P_{Y|X}(y|x')} + \lambda$$
$$= I(x'';Y) - \log_2(e)\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x'') + \lambda = I(x'';Y) - \log_2(e) + \lambda.$$
Example 4.22 (Quasi-symmetric channels.) For a quasi-symmetric chan-
nel, one can directly verify that the uniform input distribution satisfies the
KKT condition of Lemma 4.21 and yields that the channel capacity is given
by (4.4.6); this is left as an exercise. As we already saw, the BSC, the q-ary
symmetric channel, the BEC and the BSEC are all quasi-symmetric.
Example 4.23 Consider a DMC with a ternary input alphabet $\mathcal{X} = \{0,1,2\}$, binary output alphabet $\mathcal{Y} = \{0,1\}$ and the following transition matrix
$$Q = \begin{bmatrix} 1 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 1 \end{bmatrix}.$$
This channel is not quasi-symmetric. However, one may guess that the capacity of this channel is achieved by the input distribution $(P_X(0), P_X(1), P_X(2)) = \left(\tfrac{1}{2}, 0, \tfrac{1}{2}\right)$ since the input $x=1$ has an equal conditional probability of being received as 0 or 1 at the output. Under this input distribution, we obtain that $I(x=0;Y) = I(x=2;Y) = 1$ and that $I(x=1;Y) = 0$. Thus the KKT condition of (4.4.9) is satisfied; hence confirming that the above input distribution achieves channel capacity and that channel capacity is equal to 1 bit.
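As a quick numerical sanity check of this example (a minimal sketch added here, not part of the original notes; the variable names are illustrative only), the following Python snippet evaluates $I(x;Y)$ for each input symbol under the guessed input distribution and verifies the KKT condition of Lemma 4.21.

```python
import numpy as np

# Channel of Example 4.23: rows are P_{Y|X}(.|x) for x = 0, 1, 2.
Q = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
p_x = np.array([0.5, 0.0, 0.5])   # candidate capacity-achieving input distribution
p_y = p_x @ Q                      # induced output distribution P_Y

def I_sym(x):
    """Mutual information I(x;Y) (in bits) for a specific input symbol x."""
    mask = Q[x] > 0                # skip terms with P_{Y|X}(y|x) = 0
    return float(np.sum(Q[x, mask] * np.log2(Q[x, mask] / p_y[mask])))

info = [I_sym(x) for x in range(3)]
print(info)                        # [1.0, 0.0, 1.0]
# KKT check: I(x;Y) = 1 for the symbols with p_x > 0, and I(1;Y) = 0 <= 1,
# so C = 1 bit, in agreement with the example.
```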
Observation 4.24 (Capacity achieved by a uniform input distribution.) We close this chapter by noting that there is a class of DMC's that is larger than that of quasi-symmetric channels for which the uniform input distribution achieves capacity. It concerns the class of so-called "T-symmetric" channels [36, Section V, Definition 1] for which
$$T(x) \triangleq I(x;Y) - \log_2|\mathcal{X}| = \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x) \log_2 \frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_{Y|X}(y|x')}$$
is a constant function of $x$ (i.e., independent of $x$), where $I(x;Y)$ is the mutual information for input $x$ under a uniform input distribution. Indeed the T-symmetry condition is equivalent to the property of having the uniform input distribution achieve capacity. This directly follows from the KKT condition of Lemma 4.21. An example of a T-symmetric channel that is not quasi-symmetric is the binary-input ternary-output channel with the following transition matrix
$$Q = \begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{2}{3} \end{bmatrix}.$$
Hence its capacity is achieved by the uniform input distribution. See [36, Fig. 2]
for (infinitely-many) other examples of T-symmetric channels. However, unlike
quasi-symmetric channels, T-symmetric channels do not admit in general a sim-
ple closed-form expression for their capacity (such as the one given in (4.4.6)).
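To make the T-symmetry condition concrete, the following Python sketch (an added illustration, relying on the statement above that the uniform input achieves capacity for this channel) evaluates $T(x)$ for the matrix above and recovers the capacity as $T(x) + \log_2|\mathcal{X}|$.

```python
import numpy as np

# Binary-input, ternary-output channel from Observation 4.24.
Q = np.array([[1/3, 1/3, 1/3],
              [1/6, 1/6, 2/3]])
col = Q.sum(axis=0)                # sum over x' of P_{Y|X}(y|x') for each output y

def T(x):
    """T(x) = I(x;Y) - log2|X| evaluated under the uniform input distribution."""
    mask = Q[x] > 0
    return float(np.sum(Q[x, mask] * np.log2(Q[x, mask] / col[mask])))

print(T(0), T(1))                  # both approximately -0.9183: the channel is T-symmetric
print(T(0) + np.log2(Q.shape[0]))  # capacity of roughly 0.0817 bits under the uniform input
```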
Chapter 5
Differential Entropy and Gaussian
Channels
We have so far examined information measures and their operational characterization for discrete-time discrete-alphabet systems. In this chapter, we turn our
focus to continuous-alphabet (real-valued) systems. Except for a brief interlude
with the continuous-time (waveform) Gaussian channel, we consider discrete-
time systems, as treated throughout the book.
We first recall that a real-valued (continuous) random variable $X$ is described by its cumulative distribution function (cdf)
$$F_X(x) \triangleq \Pr[X \le x]$$
for $x \in \mathbb{R}$, the set of real numbers. The distribution of $X$ is called absolutely continuous (with respect to the Lebesgue measure) if a probability density function (pdf) $f_X(\cdot)$ exists such that
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt,$$
where $f_X(t) \ge 0$ for all $t$ and $\int_{-\infty}^{+\infty} f_X(t)\,dt = 1$. If $F_X(\cdot)$ is differentiable everywhere, then the pdf $f_X(\cdot)$ exists and is given by the derivative of $F_X(\cdot)$: $f_X(t) = \frac{dF_X(t)}{dt}$.
The support of a random variable $X$ with pdf $f_X(\cdot)$ is denoted by $S_X$ and defined as
$$S_X = \{x \in \mathbb{R} : f_X(x) > 0\}.$$
We will deal with random variables that admit a pdf.$^1$
1A rigorous (measure-theoretic) study for general continuous systems, initiated by Kol-
mogorov [25], can be found in [34, 22].
5.1 Differential entropy
Recall that the definition of entropy for a discrete random variable $X$ representing a DMS is
$$H(X) \triangleq -\sum_{x\in\mathcal{X}} P_X(x)\log_2 P_X(x) \quad \text{(in bits)}.$$
As already seen in Shannon’s source coding theorem, this quantity is the mini-
mum average code rate achievable for the lossless compression of the DMS. But if
the random variable takes on values in a continuum, the minimum number of bits
per symbol needed to losslessly describe it must be infinite. This is illustrated
in the following example, where we take a discrete approximation (quantization)
of a random variable uniformly distributed on the unit interval and study the
entropy of the quantized random variable as the quantization becomes finer and
finer.
Example 5.1 Consider a real-valued random variable $X$ that is uniformly distributed on the unit interval, i.e., with pdf given by
$$f_X(x) = \begin{cases} 1 & \text{if } x \in [0,1);\\ 0 & \text{otherwise.} \end{cases}$$
Given a positive integer $m$, we can discretize $X$ by uniformly quantizing it into $m$ levels by partitioning the support of $X$ into equal-length segments of size $\Delta = \frac{1}{m}$ ($\Delta$ is called the quantization step-size) such that:
$$q_m(X) = \frac{i}{m}, \quad \text{if } \frac{i-1}{m} \le X < \frac{i}{m},$$
for $1 \le i \le m$. Then the entropy of the quantized random variable $q_m(X)$ is given by
$$H(q_m(X)) = -\sum_{i=1}^{m} \frac{1}{m}\log_2\frac{1}{m} = \log_2 m \quad \text{(in bits)}.$$
Since the entropy $H(q_m(X))$ of the quantized version of $X$ is a lower bound to the entropy of $X$ (as $q_m(X)$ is a function of $X$) and satisfies in the limit
$$\lim_{m\to\infty} H(q_m(X)) = \lim_{m\to\infty}\log_2 m = \infty,$$
we obtain that the entropy of $X$ is infinite.
The above example indicates that to compress a continuous source without incurring any loss or distortion indeed requires an infinite number of bits.$^2$ Thus, when studying continuous sources, the entropy measure is limited in its effectiveness and the introduction of a new measure is necessary. Such a new measure is indeed obtained upon close examination of the entropy of a uniformly quantized real-valued random variable minus the quantization accuracy, as the accuracy increases without bound.
Lemma 5.2 Consider a real-valued random variable $X$ with pdf $f_X$ such that $f_X\log_2 f_X$ is integrable.$^3$ Then a uniform quantization of $X$ with an $n$-bit accuracy (i.e., with a quantization step-size of $\Delta = 2^{-n}$) yields an entropy approximately equal to $-\int f_X(x)\log_2 f_X(x)\,dx + n$ bits for $n$ sufficiently large. In other words,
$$\lim_{n\to\infty}\left[H(q_n(X)) - n\right] = -\int f_X(x)\log_2 f_X(x)\,dx,$$
where $q_n(X)$ is the uniformly quantized version of $X$ with quantization step-size $\Delta = 2^{-n}$.
Proof: We assume without loss of generality that the support of $X$ is given by the entire real line.
Step 1: Mean-value theorem. Let $\Delta = 2^{-n}$ be the quantization step-size and let $t_i = i\Delta$ for integer $i \in (-\infty,\infty)$. From the mean-value theorem (e.g., cf. [32]), we can choose $x_i \in [t_{i-1}, t_i]$ such that
$$p_i \triangleq \int_{t_{i-1}}^{t_i} f_X(x)\,dx = f_X(x_i)(t_i - t_{i-1}) = \Delta\cdot f_X(x_i).$$
$^2$In fact, all continuous random variables (including those not admitting a pdf) have infinite entropy. We sketch the proof as follows. For any continuous random variable $X$, there must exist a non-empty open interval for which the cdf $F_X(\cdot)$ is strictly increasing. Now quantize the source into $m+1$ levels as follows:
• Assign one level to the complement of this open interval.
• Assign $m$ levels to this open interval such that the total probability mass on this interval, denoted by $a$, is equally distributed among these $m$ levels.
Then the entropy of $X$ is lower bounded by
$$H(q(X)) = -(1-a)\cdot\log(1-a) - a\cdot\log\frac{a}{m},$$
where $q(X)$ is the quantized version of $X$. The lower bound $H(q(X)) \to \infty$ as $m \to \infty$.
$^3$By integrability, we mean the usual Riemann integrability (e.g., see [37]).
Step 2: Definition of $h^{(n)}(X)$. Let
$$h^{(n)}(X) \triangleq \sum_{i=-\infty}^{\infty}\left[-f_X(x_i)\log_2 f_X(x_i)\right]2^{-n}.$$
Since $f_X(x)\log_2 f_X(x)$ is integrable,
$$h^{(n)}(X) \to -\int f_X(x)\log_2 f_X(x)\,dx \quad \text{as } n \to \infty.$$
Therefore, given any $\varepsilon > 0$, there exists $N$ such that for all $n > N$,
$$\left|-\int f_X(x)\log_2 f_X(x)\,dx - h^{(n)}(X)\right| < \varepsilon.$$
Step 3: Computation of $H(q_n(X))$. The entropy of the (uniformly) quantized version of $X$, $q_n(X)$, is given by
$$H(q_n(X)) = -\sum_{i=-\infty}^{\infty} p_i\log_2 p_i = -\sum_{i=-\infty}^{\infty}(f_X(x_i)\Delta)\log_2(f_X(x_i)\Delta) = -\sum_{i=-\infty}^{\infty}(f_X(x_i)2^{-n})\log_2(f_X(x_i)2^{-n}).$$
Step 4: $H(q_n(X)) = h^{(n)}(X) + n$. From Steps 2 and 3,
$$H(q_n(X)) - h^{(n)}(X) = -\sum_{i=-\infty}^{\infty}\left[f_X(x_i)2^{-n}\right]\log_2(2^{-n}) = n\sum_{i=-\infty}^{\infty}\int_{t_{i-1}}^{t_i} f_X(x)\,dx = n\int_{-\infty}^{\infty} f_X(x)\,dx = n.$$
Hence, we have that for $n > N$,
$$-\int f_X(x)\log_2 f_X(x)\,dx + n - \varepsilon < H(q_n(X)) = h^{(n)}(X) + n < -\int f_X(x)\log_2 f_X(x)\,dx + n + \varepsilon,$$
yielding that
$$\lim_{n\to\infty}\left[H(q_n(X)) - n\right] = -\int f_X(x)\log_2 f_X(x)\,dx. \qquad \Box$$
In light of the above result, we can define the following information measure.
Definition 5.3 (Differential entropy) The differential entropy (in bits) of a continuous random variable $X$ with pdf $f_X$ and support $S_X$ is defined as
$$h(X) \triangleq -\int_{S_X} f_X(x)\cdot\log_2 f_X(x)\,dx = E[-\log_2 f_X(X)],$$
when the integral exists.
Thus the differential entropy $h(X)$ of a real-valued random variable $X$ has an operational meaning in the following sense. Since $H(q_n(X))$ is the minimum average number of bits needed to losslessly describe $q_n(X)$, we thus obtain that approximately $h(X) + n$ bits are needed to describe $X$ when uniformly quantizing it with an $n$-bit accuracy. Therefore, we may conclude that the larger $h(X)$ is, the
larger is the average number of bits required to describe a uniformly quantized
Xwithin a fixed accuracy.
Example 5.4 A continuous random variable $X$ with support $S_X = [0,1)$ and pdf $f_X(x) = 2x$ for $x \in S_X$ has differential entropy equal to
$$-\int_0^1 2x\cdot\log_2(2x)\,dx = \left.\frac{x^2\left(\log_2 e - 2\log_2(2x)\right)}{2}\right|_0^1 = \frac{1}{2\ln 2} - \log_2(2) = -0.278652 \text{ bits}.$$
We herein illustrate Lemma 5.2 by uniformly quantizing $X$ to an $n$-bit accuracy and computing the entropy $H(q_n(X))$ and $H(q_n(X)) - n$ for increasing values of $n$, where $q_n(X)$ is the quantized version of $X$.
We have that $q_n(X)$ is given by
$$q_n(X) = \frac{i}{2^n}, \quad \text{if } \frac{i-1}{2^n} \le X < \frac{i}{2^n},$$
for $1 \le i \le 2^n$. Hence,
$$\Pr\left[q_n(X) = \frac{i}{2^n}\right] = \frac{(2i-1)}{2^{2n}},$$
  n    H(q_n(X))        H(q_n(X)) - n
  1    0.811278 bits    -0.188722 bits
  2    1.748999 bits    -0.251000 bits
  3    2.729560 bits    -0.270440 bits
  4    3.723726 bits    -0.276275 bits
  5    4.722023 bits    -0.277977 bits
  6    5.721537 bits    -0.278463 bits
  7    6.721399 bits    -0.278600 bits
  8    7.721361 bits    -0.278638 bits
  9    8.721351 bits    -0.278648 bits
Table 5.1: Quantized random variable q_n(X) under an n-bit accuracy: H(q_n(X)) and H(q_n(X)) - n versus n.
which yields
$$H(q_n(X)) = -\sum_{i=1}^{2^n}\frac{2i-1}{2^{2n}}\log_2\frac{2i-1}{2^{2n}} = -\left[\frac{1}{2^{2n}}\sum_{i=1}^{2^n}(2i-1)\log_2(2i-1)\right] + 2\log_2(2^n).$$
As shown in Table 5.1, we indeed observe that as $n$ increases, $H(q_n(X))$ tends to infinity while $H(q_n(X)) - n$ converges to $h(X) = -0.278652$ bits.
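The entries of Table 5.1 can be reproduced directly from the formula above. The short Python sketch below (added for illustration; it is not part of the original notes) computes $H(q_n(X))$ and $H(q_n(X)) - n$ for this example and shows the convergence to $h(X) \approx -0.2787$ bits.

```python
import numpy as np

def H_qn(n):
    """Entropy (in bits) of the n-bit uniform quantization of X with pdf f(x) = 2x on [0,1)."""
    i = np.arange(1, 2**n + 1)
    p = (2 * i - 1) / 2.0**(2 * n)      # Pr[q_n(X) = i / 2^n]
    return float(-np.sum(p * np.log2(p)))

for n in range(1, 10):
    H = H_qn(n)
    print(n, round(H, 6), round(H - n, 6))
# The last column approaches h(X) = 1/(2*ln 2) - 1, i.e. about -0.278652 bits.
```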
Thus a continuous random variable $X$ contains an infinite amount of information; but we can measure the information contained in its $n$-bit quantized version $q_n(X)$ as: $H(q_n(X)) \approx h(X) + n$ (for $n$ large enough).
Example 5.5 Let us determine the minimum average number of bits required to describe the uniform quantization with 3-digit accuracy of the decay time (in years) of a radium atom, assuming that the half-life of the radium (i.e., the median of the decay time) is 80 years and that its pdf is given by $f_X(x) = \lambda e^{-\lambda x}$, where $x > 0$.
Since the median of the decay time is 80, we obtain:
$$\int_0^{80}\lambda e^{-\lambda x}\,dx = 0.5,$$
which implies that $\lambda = 0.00866$. Also, 3-digit accuracy is approximately equivalent to $\log_2 999 = 9.96 \approx 10$ bits of accuracy. Therefore, by Lemma 5.2, the number of bits required to describe the quantized decay time is approximately
$$h(X) + 10 = \log_2\frac{e}{\lambda} + 10 = 18.29 \text{ bits}.$$
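The arithmetic of this example is easy to verify numerically; the following minimal Python sketch (an added illustration, not from the original notes) recovers $\lambda$ from the half-life and evaluates $h(X) + 10$.

```python
import math

half_life = 80.0
lam = math.log(2) / half_life             # solves 1 - exp(-80*lam) = 0.5, giving lam of about 0.00866
h_X = math.log2(math.e / lam)             # differential entropy of an exponential pdf, in bits
print(round(lam, 5), round(h_X + 10, 2))  # -> 0.00866 and about 18.29 bits
```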
We close this section by computing the differential entropy for two common
real-valued random variables: the uniformly distributed random variable and
the Gaussian distributed random variable.
Example 5.6 (Differential entropy of a uniformly distributed random
variable) Let Xbe a continuous random variable that is uniformly distributed
over the interval (a, b), where b > a; i.e., its pdf is given by
$$f_X(x) = \begin{cases} \frac{1}{b-a} & \text{if } x \in (a,b);\\ 0 & \text{otherwise.} \end{cases}$$
So its differential entropy is given by
$$h(X) = -\int_a^b \frac{1}{b-a}\log_2\frac{1}{b-a}\,dx = \log_2(b-a) \text{ bits}.$$
Note that if $(b-a) < 1$ in the above example, then $h(X)$ is negative, unlike
entropy. The above example indicates that although differential entropy has a
form analogous to entropy (in the sense that summation and pmf for entropy are
replaced by integration and pdf, respectively, for differential entropy), differen-
tial entropy does not retain all the properties of entropy (one such operational
difference was already highlighted in the previous lemma).
Example 5.7 (Differential entropy of a Gaussian random variable) Let $X \sim \mathcal{N}(\mu,\sigma^2)$; i.e., $X$ is a Gaussian (or normal) random variable with finite mean $\mu$, variance $\mathrm{Var}(X) = \sigma^2 > 0$ and pdf
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
for $x \in \mathbb{R}$. Then its differential entropy is given by
$$h(X) = -\int f_X(x)\left[-\frac{1}{2}\log_2(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}\log_2 e\right]dx = \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\log_2 e}{2\sigma^2}\,E[(X-\mu)^2] = \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{1}{2}\log_2 e = \frac{1}{2}\log_2(2\pi e\sigma^2) \text{ bits}. \qquad (5.1.1)$$
Note that for a Gaussian random variable, its differential entropy is only a function of its variance $\sigma^2$ (it is independent of its mean $\mu$).
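As a numerical illustration of (5.1.1) (a sketch added here, not part of the original notes; the sample size and parameters are arbitrary), one can estimate $h(X) = E[-\log_2 f_X(X)]$ by Monte Carlo sampling and compare it with the closed form $\frac{1}{2}\log_2(2\pi e\sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

# Monte Carlo estimate of h(X) = E[-log2 f_X(X)] for the Gaussian pdf.
log2_pdf = -0.5 * np.log2(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2) * np.log2(np.e)
h_mc = float(np.mean(-log2_pdf))

h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(round(h_mc, 3), round(h_closed, 3))   # the two values agree to a few decimals
```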
5.2 Joint and conditional differential entropies, divergence
and mutual information
Definition 5.8 (Joint differential entropy) If $X^n = (X_1, X_2, \cdots, X_n)$ is a continuous random vector of size $n$ (i.e., a vector of $n$ continuous random variables) with joint pdf $f_{X^n}$ and support $S_{X^n} \subseteq \mathbb{R}^n$, then its joint differential entropy is defined as
$$h(X^n) \triangleq -\int_{S_{X^n}} f_{X^n}(x_1, x_2, \cdots, x_n)\log_2 f_{X^n}(x_1, x_2, \cdots, x_n)\,dx_1\,dx_2\cdots dx_n = E[-\log_2 f_{X^n}(X^n)]$$
when the $n$-dimensional integral exists.
Definition 5.9 (Conditional differential entropy) Let $X$ and $Y$ be two jointly distributed continuous random variables with joint pdf $f_{X,Y}$ and support $S_{X,Y} \subseteq \mathbb{R}^2$ such that the conditional pdf of $Y$ given $X$, given by $f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}$, is well defined for all $(x,y) \in S_{X,Y}$, where $f_X$ is the marginal pdf of $X$. Then the conditional differential entropy of $Y$ given $X$ is defined as
$$h(Y|X) \triangleq -\int_{S_{X,Y}} f_{X,Y}(x,y)\log_2 f_{Y|X}(y|x)\,dx\,dy = E[-\log_2 f_{Y|X}(Y|X)],$$
when the integral exists.
Note that, as in the case of (discrete) entropy, the chain rule holds for differential entropy:
$$h(X,Y) = h(X) + h(Y|X) = h(Y) + h(X|Y).$$
Definition 5.10 (Divergence or relative entropy) Let $X$ and $Y$ be two continuous random variables with marginal pdfs $f_X$ and $f_Y$, respectively, such that their supports satisfy $S_X \subseteq S_Y \subseteq \mathbb{R}$. Then the divergence (or relative entropy or Kullback-Leibler distance) between $X$ and $Y$ is written as $D(X\|Y)$ or $D(f_X\|f_Y)$ and defined by
$$D(X\|Y) \triangleq \int_{S_X} f_X(x)\log_2\frac{f_X(x)}{f_Y(x)}\,dx = E\left[\log_2\frac{f_X(X)}{f_Y(X)}\right]$$
when the integral exists. The definition carries over similarly in the multivariate case: for $X^n = (X_1, X_2, \cdots, X_n)$ and $Y^n = (Y_1, Y_2, \cdots, Y_n)$ two random vectors with joint pdfs $f_{X^n}$ and $f_{Y^n}$, respectively, and supports satisfying $S_{X^n} \subseteq S_{Y^n} \subseteq \mathbb{R}^n$, the divergence between $X^n$ and $Y^n$ is defined as
$$D(X^n\|Y^n) \triangleq \int_{S_{X^n}} f_{X^n}(x_1, x_2, \cdots, x_n)\log_2\frac{f_{X^n}(x_1, x_2, \cdots, x_n)}{f_{Y^n}(x_1, x_2, \cdots, x_n)}\,dx_1\,dx_2\cdots dx_n$$
when the integral exists.
Definition 5.11 (Mutual information) Let $X$ and $Y$ be two jointly distributed continuous random variables with joint pdf $f_{X,Y}$ and support $S_{X,Y} \subseteq \mathbb{R}^2$; then the mutual information between $X$ and $Y$ is defined by
$$I(X;Y) \triangleq D(f_{X,Y}\|f_X f_Y) = \int_{S_{X,Y}} f_{X,Y}(x,y)\log_2\frac{f_{X,Y}(x,y)}{f_X(x)f_Y(y)}\,dx\,dy,$$
assuming the integral exists, where $f_X$ and $f_Y$ are the marginal pdfs of $X$ and $Y$, respectively.
Observation 5.12 For two jointly distributed continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$, support $S_{X,Y} \subseteq \mathbb{R}^2$ and joint differential entropy
$$h(X,Y) = -\int_{S_{X,Y}} f_{X,Y}(x,y)\log_2 f_{X,Y}(x,y)\,dx\,dy,$$
then as in Lemma 5.2 and the ensuing discussion, one can write
$$H(q_n(X), q_m(Y)) \approx h(X,Y) + n + m$$
for $n$ and $m$ sufficiently large, where $q_s(Z)$ denotes the (uniformly) quantized version of random variable $Z$ with an $s$-bit accuracy.
On the other hand, for the above continuous $X$ and $Y$,
$$I(q_n(X); q_m(Y)) = H(q_n(X)) + H(q_m(Y)) - H(q_n(X), q_m(Y)) \approx [h(X)+n] + [h(Y)+m] - [h(X,Y)+n+m] = h(X) + h(Y) - h(X,Y) = \int_{S_{X,Y}} f_{X,Y}(x,y)\log_2\frac{f_{X,Y}(x,y)}{f_X(x)f_Y(y)}\,dx\,dy$$
for $n$ and $m$ sufficiently large; in other words,
$$\lim_{n,m\to\infty} I(q_n(X); q_m(Y)) = h(X) + h(Y) - h(X,Y).$$
Furthermore, it can be shown that
$$\lim_{n\to\infty} D(q_n(X)\|q_n(Y)) = \int_{S_X} f_X(x)\log_2\frac{f_X(x)}{f_Y(x)}\,dx.$$
Thus mutual information and divergence can be considered as the true tools of Information Theory, as they retain the same operational characteristics and properties for both discrete and continuous probability spaces (as well as general spaces, where they can be defined in terms of Radon-Nikodym derivatives; e.g., cf. [22]).$^4$
$^4$This justifies using identical notations for both $I(\cdot;\cdot)$ and $D(\cdot\|\cdot)$ as opposed to the discerning notations of $H(\cdot)$ for entropy and $h(\cdot)$ for differential entropy.
The following lemma illustrates that for continuous systems, $I(\cdot;\cdot)$ and $D(\cdot\|\cdot)$ keep the same properties already encountered for discrete systems, while differential entropy (as already seen with its possibility of being negative) satisfies some different properties than entropy. The proof is left as an exercise.
Lemma 5.13 The following properties hold for the information measures of
continuous systems.
1. Non-negativity of divergence: Let $X$ and $Y$ be two continuous random variables with marginal pdfs $f_X$ and $f_Y$, respectively, such that their supports satisfy $S_X \subseteq S_Y \subseteq \mathbb{R}$. Then
$$D(f_X\|f_Y) \ge 0$$
with equality iff $f_X(x) = f_Y(x)$ for all $x \in S_X$ (i.e., $X = Y$ almost surely).
2. Non-negativity of mutual information: For any two continuous random variables $X$ and $Y$,
$$I(X;Y) \ge 0$$
with equality iff $X$ and $Y$ are independent.
3. Conditioning never increases differential entropy: For any two continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$ and well-defined conditional pdf $f_{X|Y}$,
$$h(X|Y) \le h(X)$$
with equality iff $X$ and $Y$ are independent.
4. Chain rule for differential entropy: For a continuous random vector $X^n = (X_1, X_2, \cdots, X_n)$,
$$h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i|X_1, X_2, \ldots, X_{i-1}),$$
where $h(X_i|X_1, X_2, \ldots, X_{i-1}) \triangleq h(X_1)$ for $i = 1$.
5. Chain rule for mutual information: For a continuous random vector $X^n = (X_1, X_2, \cdots, X_n)$ and random variable $Y$ with joint pdf $f_{X^n,Y}$ and well-defined conditional pdfs $f_{X_i,Y|X^{i-1}}$, $f_{X_i|X^{i-1}}$ and $f_{Y|X^{i-1}}$ for $i = 1, \cdots, n$, we have that
$$I(X_1, X_2, \cdots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y|X_{i-1}, \cdots, X_1),$$
where $I(X_i; Y|X_{i-1}, \cdots, X_1) \triangleq I(X_1;Y)$ for $i = 1$.
6. Data processing inequality: For continuous random variables $X$, $Y$ and $Z$ such that $X \to Y \to Z$,
$$I(X;Y) \ge I(X;Z).$$
7. Independence bound for differential entropy: For a continuous random vector $X^n = (X_1, X_2, \cdots, X_n)$,
$$h(X^n) \le \sum_{i=1}^{n} h(X_i)$$
with equality iff all the $X_i$'s are independent from each other.
8. Invariance of differential entropy under translation: For continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$ and well-defined conditional pdf $f_{X|Y}$,
$$h(X+c) = h(X) \quad \text{for any constant } c \in \mathbb{R},$$
and
$$h(X+Y|Y) = h(X|Y).$$
The results also generalize in the multivariate case: for two continuous random vectors $X^n = (X_1, X_2, \cdots, X_n)$ and $Y^n = (Y_1, Y_2, \cdots, Y_n)$ with joint pdf $f_{X^n,Y^n}$ and well-defined conditional pdf $f_{X^n|Y^n}$,
$$h(X^n + c^n) = h(X^n)$$
for any constant $n$-tuple $c^n = (c_1, c_2, \cdots, c_n) \in \mathbb{R}^n$, and
$$h(X^n + Y^n|Y^n) = h(X^n|Y^n),$$
where the addition of two $n$-tuples is performed component-wise.
9. Differential entropy under scaling: For any continuous random vari-
able Xand any non-zero real constant a,
h(aX) = h(X) + log2|a|.
10. Joint differential entropy under linear mapping: Consider the ran-
dom (column) vector X= (X1, X2,···, Xn)Twith joint pdf fXn, where T
denotes transposition, and let Y= (Y1, Y2,···, Yn)Tbe a random (column)
vector obtained from the linear transformation Y=AX, where Ais an
invertible (non-singular) n×nreal-valued matrix. Then
h(Y) = h(Y1, Y2,···, Yn) = h(X1, X2,···, Xn) + log2|det(A)|,
where det(A) is the determinant of the square matrix A.
11. Joint differential entropy under nonlinear mapping: Consider the random (column) vector $X = (X_1, X_2, \cdots, X_n)^T$ with joint pdf $f_{X^n}$, and let $Y = (Y_1, Y_2, \cdots, Y_n)^T$ be a random (column) vector obtained from the nonlinear transformation
$$Y = g(X) \triangleq (g_1(X), g_2(X), \cdots, g_n(X))^T,$$
where each $g_i: \mathbb{R}^n \to \mathbb{R}$ is a differentiable function, $i = 1, 2, \cdots, n$. Then
$$h(Y) = h(Y_1, Y_2, \cdots, Y_n) = h(X_1, \cdots, X_n) + \int_{\mathbb{R}^n} f_{X^n}(x_1, \cdots, x_n)\log_2|\det(J)|\,dx_1\cdots dx_n,$$
where $J$ is the $n \times n$ Jacobian matrix given by
$$J \triangleq \begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \frac{\partial g_1}{\partial x_2} & \cdots & \frac{\partial g_1}{\partial x_n}\\ \frac{\partial g_2}{\partial x_1} & \frac{\partial g_2}{\partial x_2} & \cdots & \frac{\partial g_2}{\partial x_n}\\ \vdots & \vdots & & \vdots\\ \frac{\partial g_n}{\partial x_1} & \frac{\partial g_n}{\partial x_2} & \cdots & \frac{\partial g_n}{\partial x_n} \end{bmatrix}.$$
Observation 5.14 Property 9 of the above Lemma indicates that for a continuous random variable $X$, $h(X) \neq h(aX)$ (except for the trivial case of $a = 1$) and hence differential entropy is not in general invariant under invertible maps. This is in contrast to entropy, which is always invariant under invertible maps: given a discrete random variable $X$ with alphabet $\mathcal{X}$,
$$H(f(X)) = H(X)$$
for all invertible maps $f: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y}$ is a discrete set; in particular,
$$H(aX) = H(X) \quad \text{for all non-zero reals } a.$$
On the other hand, for both discrete and continuous systems, mutual infor-
mation and divergence are invariant under invertible maps:
I(X;Y) = I(g(X); Y) = I(g(X); h(Y))
and
D(XkY) = D(g(X)kg(Y))
for all invertible maps gand hproperly defined on the alphabet/support of the
concerned random variables. This reinforces the notion that mutual information
and divergence constitute the true tools of Information Theory.
Definition 5.15 (Multivariate Gaussian) A continuous random vector $X = (X_1, X_2, \cdots, X_n)^T$ is called a size-$n$ (multivariate) Gaussian random vector with a finite mean vector $\mu \triangleq (\mu_1, \mu_2, \cdots, \mu_n)^T$, where $\mu_i \triangleq E[X_i] < \infty$ for $i = 1, 2, \cdots, n$, and an $n \times n$ invertible (real-valued) covariance matrix
$$K_X = [K_{i,j}] \triangleq E[(X-\mu)(X-\mu)^T] = \begin{bmatrix} \mathrm{Cov}(X_1,X_1) & \mathrm{Cov}(X_1,X_2) & \cdots & \mathrm{Cov}(X_1,X_n)\\ \mathrm{Cov}(X_2,X_1) & \mathrm{Cov}(X_2,X_2) & \cdots & \mathrm{Cov}(X_2,X_n)\\ \vdots & \vdots & & \vdots\\ \mathrm{Cov}(X_n,X_1) & \mathrm{Cov}(X_n,X_2) & \cdots & \mathrm{Cov}(X_n,X_n) \end{bmatrix},$$
where $K_{i,j} = \mathrm{Cov}(X_i,X_j) \triangleq E[(X_i-\mu_i)(X_j-\mu_j)]$ is the covariance$^5$ between $X_i$ and $X_j$ for $i,j = 1,2,\cdots,n$, if its joint pdf is given by the multivariate Gaussian pdf
$$f_{X^n}(x_1, x_2, \cdots, x_n) = \frac{1}{\sqrt{(2\pi)^n\det(K_X)}}\,e^{-\frac{1}{2}(x-\mu)^T K_X^{-1}(x-\mu)}$$
for any $(x_1, x_2, \cdots, x_n) \in \mathbb{R}^n$, where $x = (x_1, x_2, \cdots, x_n)^T$. As in the scalar case (i.e., for $n = 1$), we write $X \sim \mathcal{N}_n(\mu, K_X)$ to denote that $X$ is a size-$n$ Gaussian random vector with mean vector $\mu$ and covariance matrix $K_X$.
Observation 5.16 In light of the above definition, we make the following re-
marks.
1. Note that a covariance matrix Kis always symmetric (i.e., KT=K)
and positive-semidefinite.6But as we require KXto be invertible in the
definition of the multivariate Gaussian distribution above, we will hereafter
assume that the covariance matrix of Gaussian random vectors is positive-
definite (which is equivalent to having all the eigenvalues of KXpositive),
thus rendering the matrix invertible.
$^5$Note that the diagonal components of $K_X$ yield the variances of the different random variables: $K_{i,i} = \mathrm{Cov}(X_i,X_i) = \mathrm{Var}(X_i) = \sigma_{X_i}^2$, $i = 1, \cdots, n$.
$^6$An $n \times n$ real-valued symmetric matrix $K$ is positive-semidefinite (e.g., cf. [15]) if for every real-valued vector $x = (x_1, x_2, \cdots, x_n)^T$,
$$x^T K x = (x_1, \cdots, x_n)\,K\begin{bmatrix} x_1\\ \vdots\\ x_n \end{bmatrix} \ge 0.$$
Furthermore, the matrix is positive-definite if $x^T K x > 0$ for all real-valued vectors $x \neq 0$, where $0$ is the all-zero vector of size $n$.
2. If a random vector $X = (X_1, X_2, \cdots, X_n)^T$ has a diagonal covariance matrix $K_X$ (i.e., all the off-diagonal components of $K_X$ are zero: $K_{i,j} = 0$ for all $i \neq j$, $i,j = 1,\cdots,n$), then all its component random variables are uncorrelated but not necessarily independent. However, if $X$ is Gaussian and has a diagonal covariance matrix, then all its component random variables are independent from each other.
3. Any linear transformation of a Gaussian random vector yields another Gaussian random vector. Specifically, if $X \sim \mathcal{N}_n(\mu, K_X)$ is a size-$n$ Gaussian random vector with mean vector $\mu$ and covariance matrix $K_X$, and if $Y = A_{mn}X$, where $A_{mn}$ is a given $m \times n$ real-valued matrix, then
$$Y \sim \mathcal{N}_m(A_{mn}\mu,\, A_{mn}K_X A_{mn}^T)$$
is a size-$m$ Gaussian random vector with mean vector $A_{mn}\mu$ and covariance matrix $A_{mn}K_X A_{mn}^T$.
More generally, any affine transformation of a Gaussian random vector yields another Gaussian random vector: if $X \sim \mathcal{N}_n(\mu, K_X)$ and $Y = A_{mn}X + b_m$, where $A_{mn}$ is an $m \times n$ real-valued matrix and $b_m$ is a size-$m$ real-valued vector, then
$$Y \sim \mathcal{N}_m(A_{mn}\mu + b_m,\, A_{mn}K_X A_{mn}^T).$$
Theorem 5.17 (Joint differential entropy of the multivariate Gaussian) If $X \sim \mathcal{N}_n(\mu, K_X)$ is a Gaussian random vector with mean vector $\mu$ and (positive-definite) covariance matrix $K_X$, then its joint differential entropy is given by
$$h(X) = h(X_1, X_2, \cdots, X_n) = \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right]. \qquad (5.2.1)$$
In particular, in the univariate case of $n = 1$, (5.2.1) reduces to (5.1.1).
Proof: Without loss of generality, we assume that $X$ has a zero mean vector since its differential entropy is invariant under translation by Property 8 of Lemma 5.13:
$$h(X) = h(X - \mu);$$
so we assume that $\mu = 0$.
Since the covariance matrix $K_X$ is a real-valued symmetric matrix, it is orthogonally diagonalizable; i.e., there exists a square ($n \times n$) orthogonal matrix $A$ (i.e., satisfying $A^T = A^{-1}$) such that $AK_XA^T$ is a diagonal matrix whose entries are given by the eigenvalues of $K_X$ ($A$ is constructed using the eigenvectors of $K_X$; e.g., see [15]). As a result, the linear transformation
$Y = AX \sim \mathcal{N}_n(0, AK_XA^T)$ is a Gaussian vector with the diagonal covariance matrix $K_Y = AK_XA^T$ and has therefore independent components (as noted in Observation 5.16). Thus
$$h(Y) = h(Y_1, Y_2, \cdots, Y_n)$$
$$= h(Y_1) + h(Y_2) + \cdots + h(Y_n) \qquad (5.2.2)$$
$$= \sum_{i=1}^{n}\frac{1}{2}\log_2\left[2\pi e\,\mathrm{Var}(Y_i)\right] \qquad (5.2.3)$$
$$= \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\prod_{i=1}^{n}\mathrm{Var}(Y_i)\right]$$
$$= \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\det(K_Y)\right] \qquad (5.2.4)$$
$$= \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_X)\right] \qquad (5.2.5)$$
$$= \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right], \qquad (5.2.6)$$
where (5.2.2) follows by the independence of the random variables $Y_1, \ldots, Y_n$ (e.g., see Property 7 of Lemma 5.13), (5.2.3) follows from (5.1.1), (5.2.4) holds since the matrix $K_Y$ is diagonal and hence its determinant is given by the product of its diagonal entries, and (5.2.5) holds since
$$\det(K_Y) = \det(AK_XA^T) = \det(A)\det(K_X)\det(A^T) = \det(A)^2\det(K_X) = \det(K_X),$$
where the last equality holds since $\det(A)^2 = 1$, as the matrix $A$ is orthogonal ($A^T = A^{-1} \Rightarrow \det(A) = \det(A^T) = 1/\det(A)$; thus, $\det(A)^2 = 1$).
Now invoking Property 10 of Lemma 5.13 and noting that $|\det(A)| = 1$ yield that
$$h(Y_1, Y_2, \cdots, Y_n) = h(X_1, X_2, \cdots, X_n) + \underbrace{\log_2|\det(A)|}_{=0} = h(X_1, X_2, \cdots, X_n).$$
We therefore obtain using (5.2.6) that
$$h(X_1, X_2, \cdots, X_n) = \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right],$$
hence completing the proof.
An alternate (but rather mechanical) proof to the one presented above consists of directly evaluating the joint differential entropy of $X$ by integrating $-f_{X^n}(x^n)\log_2 f_{X^n}(x^n)$ over $\mathbb{R}^n$; it is left as an exercise. $\Box$
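The whitening argument used in this proof is easy to reproduce numerically. The Python sketch below (an added illustration; the particular covariance matrix is an arbitrary choice) sums the scalar Gaussian entropies of the decorrelated components, obtained from the eigenvalues of $K_X$, and checks that the result matches $\frac{1}{2}\log_2[(2\pi e)^n\det(K_X)]$.

```python
import numpy as np

K = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.5, 0.5],
              [0.3, 0.5, 1.0]])          # an example positive-definite covariance matrix

n = K.shape[0]
closed_form = 0.5 * np.log2((2 * np.pi * np.e)**n * np.linalg.det(K))

# Eigenvalues of K are the variances of the decorrelated (whitened) components.
eigvals = np.linalg.eigvalsh(K)
via_whitening = float(np.sum(0.5 * np.log2(2 * np.pi * np.e * eigvals)))

print(round(closed_form, 6), round(via_whitening, 6))   # the two values coincide
```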
Corollary 5.18 (Hadamard's inequality) For any real-valued $n \times n$ positive-definite matrix $K = [K_{i,j}]_{i,j=1,\cdots,n}$,
$$\det(K) \le \prod_{i=1}^{n} K_{i,i}$$
with equality iff $K$ is a diagonal matrix, where $K_{i,i}$ are the diagonal entries of $K$.
Proof: Since every positive-definite matrix is a covariance matrix (e.g., see [20]), let $X = (X_1, X_2, \cdots, X_n)^T \sim \mathcal{N}_n(0, K)$ be a jointly Gaussian random vector with zero mean vector and covariance matrix $K$. Then
$$\frac{1}{2}\log_2\left[(2\pi e)^n\det(K)\right] = h(X_1, X_2, \cdots, X_n) \qquad (5.2.7)$$
$$\le \sum_{i=1}^{n} h(X_i) \qquad (5.2.8)$$
$$= \sum_{i=1}^{n}\frac{1}{2}\log_2\left[2\pi e\,\mathrm{Var}(X_i)\right] \qquad (5.2.9)$$
$$= \frac{1}{2}\log_2\left[(2\pi e)^n\prod_{i=1}^{n} K_{i,i}\right], \qquad (5.2.10)$$
where (5.2.7) follows from Theorem 5.17, (5.2.8) follows from Property 7 of Lemma 5.13 and (5.2.9)-(5.2.10) hold using (5.1.1) along with the fact that each random variable $X_i \sim \mathcal{N}(0, K_{i,i})$ is Gaussian with zero mean and variance $\mathrm{Var}(X_i) = K_{i,i}$ for $i = 1, 2, \cdots, n$ (as the marginals of a multivariate Gaussian are also Gaussian (e.g., cf. [20])).
Finally, from (5.2.10), we directly obtain that
$$\det(K) \le \prod_{i=1}^{n} K_{i,i},$$
with equality iff the jointly Gaussian random variables $X_1, X_2, \ldots, X_n$ are independent from each other, or equivalently iff the covariance matrix $K$ is diagonal. $\Box$
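Hadamard's inequality is also easy to test empirically; the following short Python sketch (added for illustration, using randomly generated matrices) checks it on positive-definite matrices built as $K = BB^T + I$.

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    B = rng.standard_normal((4, 4))
    K = B @ B.T + np.eye(4)                 # positive-definite by construction
    lhs = np.linalg.det(K)
    rhs = np.prod(np.diag(K))
    print(round(lhs, 4), "<=", round(rhs, 4), bool(lhs <= rhs + 1e-9))
```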
The next theorem states that among all real-valued size-nrandom vectors (of
support Rn) with identical mean vector and covariance matrix, the Gaussian
random vector has the largest differential entropy.
Theorem 5.19 (Maximal differential entropy for real-valued random vectors) Let $X = (X_1, X_2, \cdots, X_n)^T$ be a real-valued random vector with support $S_{X^n} = \mathbb{R}^n$, mean vector $\mu$ and covariance matrix $K_X$. Then
$$h(X_1, X_2, \cdots, X_n) \le \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right], \qquad (5.2.11)$$
with equality iff $X$ is Gaussian; i.e., $X \sim \mathcal{N}_n(\mu, K_X)$.
Proof: We will present the proof in two parts: the scalar or univariate case, and the multivariate case.
(i) Scalar case ($n = 1$): For a real-valued random variable with support $S_X = \mathbb{R}$, mean $\mu$ and variance $\sigma^2$, let us show that
$$h(X) \le \frac{1}{2}\log_2\left(2\pi e\sigma^2\right), \qquad (5.2.12)$$
with equality iff $X \sim \mathcal{N}(\mu, \sigma^2)$.
For a Gaussian random variable $Y \sim \mathcal{N}(\mu, \sigma^2)$, using the non-negativity of divergence, we can write
$$0 \le D(X\|Y) = \int_{\mathbb{R}} f_X(x)\log_2\frac{f_X(x)}{\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}}\,dx$$
$$= -h(X) + \int_{\mathbb{R}} f_X(x)\left[\frac{1}{2}\log_2(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\log_2 e\right]dx$$
$$= -h(X) + \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\log_2 e}{2\sigma^2}\underbrace{\int_{\mathbb{R}}(x-\mu)^2 f_X(x)\,dx}_{=\sigma^2}$$
$$= -h(X) + \frac{1}{2}\log_2\left(2\pi e\sigma^2\right).$$
Thus
$$h(X) \le \frac{1}{2}\log_2\left(2\pi e\sigma^2\right),$$
with equality iff $X = Y$ (almost surely); i.e., $X \sim \mathcal{N}(\mu, \sigma^2)$.
(ii) Multivariate case ($n > 1$): As in the proof of Theorem 5.17, we can use an orthogonal square matrix $A$ (i.e., satisfying $A^T = A^{-1}$ and hence $|\det(A)| = 1$) such that $AK_XA^T$ is diagonal. Therefore, the random vector generated by the linear map
$$Z = AX$$
will have a covariance matrix given by $K_Z = AK_XA^T$ and hence have uncorrelated (but not necessarily independent) components. Thus
$$h(X) = h(Z) - \underbrace{\log_2|\det(A)|}_{=0} \qquad (5.2.13)$$
$$= h(Z_1, Z_2, \cdots, Z_n)$$
$$\le \sum_{i=1}^{n} h(Z_i) \qquad (5.2.14)$$
$$\le \sum_{i=1}^{n}\frac{1}{2}\log_2\left[2\pi e\,\mathrm{Var}(Z_i)\right] \qquad (5.2.15)$$
$$= \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\prod_{i=1}^{n}\mathrm{Var}(Z_i)\right]$$
$$= \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_Z)\right] \qquad (5.2.16)$$
$$= \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_X)\right] \qquad (5.2.17)$$
$$= \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right],$$
where (5.2.13) holds by Property 10 of Lemma 5.13 and since $|\det(A)| = 1$, (5.2.14) follows from Property 7 of Lemma 5.13, (5.2.15) follows from (5.2.12) (the scalar case above), (5.2.16) holds since $K_Z$ is diagonal, and (5.2.17) follows from the fact that $\det(K_Z) = \det(K_X)$ (as $A$ is orthogonal). Finally, equality is achieved in both (5.2.14) and (5.2.15) iff the random variables $Z_1, Z_2, \ldots, Z_n$ are Gaussian and independent from each other, or equivalently iff $X \sim \mathcal{N}_n(\mu, K_X)$. $\Box$
Observation 5.20 The following two results can also be shown (the proof is
left as an exercise):
1. Among all continuous random variables admitting a pdf with support the
interval (a, b), where b > a are real numbers, the uniformly distributed
random variable maximizes differential entropy.
2. Among all continuous random variables admitting a pdf with support the interval $[0,\infty)$ and finite mean $\mu$, the exponential distribution with parameter (or rate parameter) $\lambda = 1/\mu$ maximizes differential entropy.
A systematic approach to finding distributions that maximize differential entropy
subject to various support and moments constraints can be found in [12, 45].
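To illustrate the maximum-entropy property of the Gaussian stated in Theorem 5.19 (a small added numerical sketch, not from the original notes), one can compare the differential entropy of a uniform random variable with the Gaussian bound $\frac{1}{2}\log_2(2\pi e\sigma^2)$ evaluated at the same variance.

```python
import math

a, b = 0.0, 1.0                       # uniform distribution on (a, b)
h_uniform = math.log2(b - a)          # = 0 bits, from Example 5.6
var = (b - a)**2 / 12.0               # variance of the uniform distribution
gaussian_bound = 0.5 * math.log2(2 * math.pi * math.e * var)

print(h_uniform, round(gaussian_bound, 4))   # 0.0 <= 0.2546: the Gaussian bound dominates
```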
5.3 AEP for continuous memoryless sources
The AEP theorem and its consequence for discrete memoryless (i.i.d.) sources
reveal to us that the number of elements in the typical set is approximately
$2^{nH(X)}$, where H(X) is the source entropy, and that the typical set carries al-
most all the probability mass asymptotically (see Theorems 3.3 and 3.4). An
extension of this result from discrete to continuous memoryless sources by just
counting the number of elements in a continuous (typical) set defined via a law-
of-large-numbers argument is not possible, since the total number of elements
in a continuous set is infinite. However, when considering the volume of that
continuous typical set (which is a natural analog to the size of a discrete set),
such an extension, with differential entropy playing a similar role as entropy,
becomes straightforward.
Theorem 5.21 (AEP for continuous memoryless sources) Let $\{X_i\}_{i=1}^{\infty}$ be a continuous memoryless source (i.e., an infinite sequence of continuous i.i.d. random variables) with pdf $f_X(\cdot)$ and differential entropy $h(X)$. Then
$$-\frac{1}{n}\log_2 f_{X^n}(X_1, \ldots, X_n) \to E[-\log_2 f_X(X)] = h(X) \quad \text{in probability}.$$
Proof: The proof is an immediate result of the law of large numbers (e.g., see Theorem 3.3). $\Box$
Definition 5.22 (Typical set) For $\delta > 0$ and any $n$ given, define the typical set for the above continuous source as
$$\mathcal{F}_n(\delta) \triangleq \left\{x^n \in \mathbb{R}^n : \left|-\frac{1}{n}\log_2 f_{X^n}(x_1, \ldots, x_n) - h(X)\right| < \delta\right\}.$$
Definition 5.23 (Volume) The volume of a set $\mathcal{A} \subseteq \mathbb{R}^n$ is defined as
$$\mathrm{Vol}(\mathcal{A}) \triangleq \int_{\mathcal{A}} dx_1\cdots dx_n.$$
Theorem 5.24 (Consequence of the AEP for continuous memoryless sources) For a continuous memoryless source $\{X_i\}_{i=1}^{\infty}$ with differential entropy $h(X)$, the following hold.
1. For $n$ sufficiently large, $P_{X^n}\{\mathcal{F}_n(\delta)\} > 1 - \delta$.
2. $\mathrm{Vol}(\mathcal{F}_n(\delta)) \le 2^{n(h(X)+\delta)}$ for all $n$.
3. $\mathrm{Vol}(\mathcal{F}_n(\delta)) \ge (1-\delta)\,2^{n(h(X)-\delta)}$ for $n$ sufficiently large.
Proof: The proof is quite analogous to the corresponding theorem for discrete memoryless sources (Theorem 3.4) and is left as an exercise. $\Box$
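A quick simulation makes the AEP statement tangible. The Python sketch below (an added illustration; the Gaussian source and block lengths are arbitrary choices) draws i.i.d. blocks and shows that $-\frac{1}{n}\log_2 f_{X^n}(X^n)$ concentrates around $h(X) = \frac{1}{2}\log_2(2\pi e\sigma^2)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 1.5
h_X = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # differential entropy of the source

for n in (10, 100, 1000, 10000):
    x = rng.normal(0.0, sigma, size=n)
    # -(1/n) log2 of the joint pdf of the i.i.d. block (average of the marginal log-pdfs)
    log2_pdf = -0.5 * np.log2(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2) * np.log2(np.e)
    emp = float(-np.mean(log2_pdf))
    print(n, round(emp, 4), "vs h(X) =", round(h_X, 4))
```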
5.4 Capacity of the discrete-time memoryless Gaussian
channel
Appendix A
Overview on Suprema and Limits
We herein review basic results on suprema and limits which are useful for the
development of information theoretic coding theorems; they can be found in
standard real analysis texts (e.g., see [32, 43]).
A.1 Supremum and maximum
Throughout, we work on subsets of R, the set of real numbers.
Definition A.1 (Upper bound of a set) A real number $u$ is called an upper bound of a non-empty subset $\mathcal{A}$ of $\mathbb{R}$ if every element of $\mathcal{A}$ is less than or equal to $u$; we say that $\mathcal{A}$ is bounded above. Symbolically, the definition becomes:
$$\mathcal{A} \subseteq \mathbb{R} \text{ is bounded above} \iff (\exists\, u \in \mathbb{R}) \text{ such that } (\forall\, a \in \mathcal{A}),\ a \le u.$$
Definition A.2 (Least upper bound or supremum) If $\mathcal{A}$ is a non-empty subset of $\mathbb{R}$, then we say that a real number $s$ is a least upper bound or supremum of $\mathcal{A}$ if $s$ is an upper bound of the set $\mathcal{A}$ and if $s \le s'$ for each upper bound $s'$ of $\mathcal{A}$. In this case, we write $s = \sup\mathcal{A}$; other notations are $s = \sup_{x\in\mathcal{A}} x$ and $s = \sup\{x : x \in \mathcal{A}\}$.
Completeness Axiom: (Least upper bound property) Let $\mathcal{A}$ be a non-empty subset of $\mathbb{R}$ that is bounded above. Then $\mathcal{A}$ has a least upper bound.
It follows directly that if a non-empty set in $\mathbb{R}$ has a supremum, then this supremum is unique. Furthermore, note that the empty set ($\emptyset$) and any set not bounded above do not admit a supremum in $\mathbb{R}$. However, when working in the set of extended real numbers given by $\mathbb{R} \cup \{-\infty, \infty\}$, we can define the supremum of the empty set as $-\infty$ and that of a set not bounded above as $\infty$. These extended definitions will be adopted in the text.
We now distinguish between two situations: (i) the supremum of a set A
belongs to A, and (ii) the supremum of a set Adoes not belong to A. It is quite
easy to create examples for both situations. A quick example for (i) involves
the set (0,1], while the set (0,1) can be used for (ii). In both examples, the
supremum is equal to 1; however, in the former case, the supremum belongs to
the set, while in the latter case it does not. When a set contains its supremum,
we call the supremum the maximum of the set.
Definition A.3 (Maximum) If $\sup\mathcal{A} \in \mathcal{A}$, then $\sup\mathcal{A}$ is also called the maximum of $\mathcal{A}$, and is denoted by $\max\mathcal{A}$. However, if $\sup\mathcal{A} \notin \mathcal{A}$, then we say that the maximum of $\mathcal{A}$ does not exist.
Property A.4 (Properties of the supremum)
1. The supremum of any set in $\mathbb{R} \cup \{-\infty, \infty\}$ always exists.
2. $(\forall\, a \in \mathcal{A})\ a \le \sup\mathcal{A}$.
3. If $-\infty < \sup\mathcal{A} < \infty$, then $(\forall\, \varepsilon > 0)(\exists\, a_0 \in \mathcal{A})\ a_0 > \sup\mathcal{A} - \varepsilon$.
(The existence of $a_0 \in (\sup\mathcal{A} - \varepsilon, \sup\mathcal{A}]$ for any $\varepsilon > 0$ under the condition of $|\sup\mathcal{A}| < \infty$ is called the approximation property for the supremum.)
4. If $\sup\mathcal{A} = \infty$, then $(\forall\, L \in \mathbb{R})(\exists\, B_0 \in \mathcal{A})\ B_0 > L$.
5. If $\sup\mathcal{A} = -\infty$, then $\mathcal{A}$ is empty.
Observation A.5 In Information Theory, a typical channel coding theorem establishes that a (finite) real number $\alpha$ is the supremum of a set $\mathcal{A}$. Thus, to prove such a theorem, one must show that $\alpha$ satisfies both properties 3 and 2 above, i.e.,
$$(\forall\, \varepsilon > 0)(\exists\, a_0 \in \mathcal{A})\ a_0 > \alpha - \varepsilon \qquad (A.1.1)$$
and
$$(\forall\, a \in \mathcal{A})\ a \le \alpha, \qquad (A.1.2)$$
where (A.1.1) and (A.1.2) are called the achievability (or forward) part and the converse part, respectively, of the theorem. Specifically, (A.1.2) states that $\alpha$ is an upper bound of $\mathcal{A}$, and (A.1.1) states that no number less than $\alpha$ can be an upper bound for $\mathcal{A}$.
Property A.6 (Properties of the maximum)
1. $(\forall\, a \in \mathcal{A})\ a \le \max\mathcal{A}$, if $\max\mathcal{A}$ exists in $\mathbb{R} \cup \{-\infty, \infty\}$.
2. $\max\mathcal{A} \in \mathcal{A}$.
From the above property, in order to obtain $\alpha = \max\mathcal{A}$, one needs to show that $\alpha$ satisfies both
$$(\forall\, a \in \mathcal{A})\ a \le \alpha \quad \text{and} \quad \alpha \in \mathcal{A}.$$
A.2 Infimum and minimum
The concepts of infimum and minimum are dual to those of supremum and
maximum.
Definition A.7 (Lower bound of a set) A real number $\ell$ is called a lower bound of a non-empty subset $\mathcal{A}$ of $\mathbb{R}$ if every element of $\mathcal{A}$ is greater than or equal to $\ell$; we say that $\mathcal{A}$ is bounded below. Symbolically, the definition becomes:
$$\mathcal{A} \subseteq \mathbb{R} \text{ is bounded below} \iff (\exists\, \ell \in \mathbb{R}) \text{ such that } (\forall\, a \in \mathcal{A})\ a \ge \ell.$$
Definition A.8 (Greatest lower bound or infimum) If $\mathcal{A}$ is a non-empty subset of $\mathbb{R}$, then we say that a real number $\ell$ is a greatest lower bound or infimum of $\mathcal{A}$ if $\ell$ is a lower bound of $\mathcal{A}$ and if $\ell \ge \ell'$ for each lower bound $\ell'$ of $\mathcal{A}$. In this case, we write $\ell = \inf\mathcal{A}$; other notations are $\ell = \inf_{x\in\mathcal{A}} x$ and $\ell = \inf\{x : x \in \mathcal{A}\}$.
Completeness Axiom: (Greatest lower bound property) Let $\mathcal{A}$ be a non-empty subset of $\mathbb{R}$ that is bounded below. Then $\mathcal{A}$ has a greatest lower bound.
As for the case of the supremum, it directly follows that if a non-empty set in $\mathbb{R}$ has an infimum, then this infimum is unique. Furthermore, working in the set of extended real numbers, the infimum of the empty set is defined as $\infty$ and that of a set not bounded below as $-\infty$.
Definition A.9 (Minimum) If $\inf\mathcal{A} \in \mathcal{A}$, then $\inf\mathcal{A}$ is also called the minimum of $\mathcal{A}$, and is denoted by $\min\mathcal{A}$. However, if $\inf\mathcal{A} \notin \mathcal{A}$, we say that the minimum of $\mathcal{A}$ does not exist.
Property A.10 (Properties of the infimum)
1. The infimum of any set in $\mathbb{R} \cup \{-\infty, \infty\}$ always exists.
2. $(\forall\, a \in \mathcal{A})\ a \ge \inf\mathcal{A}$.
3. If $\infty > \inf\mathcal{A} > -\infty$, then $(\forall\, \varepsilon > 0)(\exists\, a_0 \in \mathcal{A})\ a_0 < \inf\mathcal{A} + \varepsilon$.
(The existence of $a_0 \in [\inf\mathcal{A}, \inf\mathcal{A} + \varepsilon)$ for any $\varepsilon > 0$ under the assumption of $|\inf\mathcal{A}| < \infty$ is called the approximation property for the infimum.)
4. If $\inf\mathcal{A} = -\infty$, then $(\forall\, L \in \mathbb{R})(\exists\, B_0 \in \mathcal{A})\ B_0 < L$.
5. If $\inf\mathcal{A} = \infty$, then $\mathcal{A}$ is empty.
Observation A.11 Analogously to Observation A.5, a typical source coding theorem in Information Theory establishes that a (finite) real number $\alpha$ is the infimum of a set $\mathcal{A}$. Thus, to prove such a theorem, one must show that $\alpha$ satisfies both properties 3 and 2 above, i.e.,
$$(\forall\, \varepsilon > 0)(\exists\, a_0 \in \mathcal{A})\ a_0 < \alpha + \varepsilon \qquad (A.2.1)$$
and
$$(\forall\, a \in \mathcal{A})\ a \ge \alpha. \qquad (A.2.2)$$
Here, (A.2.1) is called the achievability or forward part of the coding theorem; it specifies that no number greater than $\alpha$ can be a lower bound for $\mathcal{A}$. Also, (A.2.2) is called the converse part of the theorem; it states that $\alpha$ is a lower bound of $\mathcal{A}$.
Property A.12 (Properties of the minimum)
1. $(\forall\, a \in \mathcal{A})\ a \ge \min\mathcal{A}$, if $\min\mathcal{A}$ exists in $\mathbb{R} \cup \{-\infty, \infty\}$.
2. $\min\mathcal{A} \in \mathcal{A}$.
A.3 Boundedness and suprema operations
Definition A.13 (Boundedness) A subset Aof Ris said to be bounded if it
is both bounded above and bounded below; otherwise it is called unbounded.
Lemma A.14 (Condition for boundedness) A subset $\mathcal{A}$ of $\mathbb{R}$ is bounded iff $(\exists\, k \in \mathbb{R})$ such that $(\forall\, a \in \mathcal{A})\ |a| \le k$.
Lemma A.15 (Monotone property) Suppose that $\mathcal{A}$ and $\mathcal{B}$ are non-empty subsets of $\mathbb{R}$ such that $\mathcal{A} \subseteq \mathcal{B}$. Then
1. $\sup\mathcal{A} \le \sup\mathcal{B}$.
2. $\inf\mathcal{A} \ge \inf\mathcal{B}$.
Lemma A.16 (Supremum for set operations) Define the "addition" of two sets $\mathcal{A}$ and $\mathcal{B}$ as
$$\mathcal{A} + \mathcal{B} \triangleq \{c \in \mathbb{R} : c = a + b \text{ for some } a \in \mathcal{A} \text{ and } b \in \mathcal{B}\}.$$
Define the "scalar multiplication" of a set $\mathcal{A}$ by a scalar $k \in \mathbb{R}$ as
$$k\cdot\mathcal{A} \triangleq \{c \in \mathbb{R} : c = k\cdot a \text{ for some } a \in \mathcal{A}\}.$$
Finally, define the "negation" of a set $\mathcal{A}$ as
$$-\mathcal{A} \triangleq \{c \in \mathbb{R} : c = -a \text{ for some } a \in \mathcal{A}\}.$$
Then the following hold.
1. If $\mathcal{A}$ and $\mathcal{B}$ are both bounded above, then $\mathcal{A} + \mathcal{B}$ is also bounded above and $\sup(\mathcal{A} + \mathcal{B}) = \sup\mathcal{A} + \sup\mathcal{B}$.
2. If $0 < k < \infty$ and $\mathcal{A}$ is bounded above, then $k\cdot\mathcal{A}$ is also bounded above and $\sup(k\cdot\mathcal{A}) = k\cdot\sup\mathcal{A}$.
3. $\sup\mathcal{A} = -\inf(-\mathcal{A})$ and $\inf\mathcal{A} = -\sup(-\mathcal{A})$.
Property 1 does not hold for the "product" of two sets, where the "product" of sets $\mathcal{A}$ and $\mathcal{B}$ is defined as
$$\mathcal{A}\cdot\mathcal{B} \triangleq \{c \in \mathbb{R} : c = ab \text{ for some } a \in \mathcal{A} \text{ and } b \in \mathcal{B}\}.$$
In this case, both of these two situations can occur:
$$\sup(\mathcal{A}\cdot\mathcal{B}) > (\sup\mathcal{A})\cdot(\sup\mathcal{B})$$
$$\sup(\mathcal{A}\cdot\mathcal{B}) = (\sup\mathcal{A})\cdot(\sup\mathcal{B}).$$
Lemma A.17 (Supremum/infimum for monotone functions)
1. If $f: \mathbb{R} \to \mathbb{R}$ is a non-decreasing function, then
$$\sup\{x \in \mathbb{R} : f(x) < \varepsilon\} = \inf\{x \in \mathbb{R} : f(x) \ge \varepsilon\}$$
and
$$\sup\{x \in \mathbb{R} : f(x) \le \varepsilon\} = \inf\{x \in \mathbb{R} : f(x) > \varepsilon\}.$$
2. If $f: \mathbb{R} \to \mathbb{R}$ is a non-increasing function, then
$$\sup\{x \in \mathbb{R} : f(x) > \varepsilon\} = \inf\{x \in \mathbb{R} : f(x) \le \varepsilon\}$$
and
$$\sup\{x \in \mathbb{R} : f(x) \ge \varepsilon\} = \inf\{x \in \mathbb{R} : f(x) < \varepsilon\}.$$
The above lemma is illustrated in Figure A.1.
A.4 Sequences and their limits
Let Ndenote the set of “natural numbers” (positive integers) 1,2,3,···. A
sequence drawn from a real-valued function is denoted by
f:NR.
In other words, f(n) is a real number for each n= 1,2,3,···. It is usual to write
f(n) = an, and we often indicate the sequence by any one of these notations
{a1, a2, a3,···, an,··· } or {an}
n=1.
One important question that arises with a sequence is what happens when n
gets large. To be precise, we want to know that when nis large enough, whether
or not every anis close to some fixed number L(which is the limit of an).
Definition A.18 (Limit) The limit of $\{a_n\}_{n=1}^{\infty}$ is the real number $L$ satisfying: $(\forall\, \varepsilon > 0)(\exists\, N)$ such that $(\forall\, n > N)$
$$|a_n - L| < \varepsilon.$$
In this case, we write $L = \lim_{n\to\infty} a_n$. If no such $L$ satisfies the above statement, we say that the limit of $\{a_n\}_{n=1}^{\infty}$ does not exist.
Figure A.1: Illustration of Lemma A.17 (plots of a non-decreasing and a non-increasing function $f(x)$ against $x$, with threshold $\varepsilon$ and the coinciding suprema/infima of the corresponding level sets).
Property A.19 If $\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$ both have a limit in $\mathbb{R}$, then the following hold.
1. $\lim_{n\to\infty}(a_n + b_n) = \lim_{n\to\infty} a_n + \lim_{n\to\infty} b_n$.
2. $\lim_{n\to\infty}(\alpha\cdot a_n) = \alpha\cdot\lim_{n\to\infty} a_n$.
3. $\lim_{n\to\infty}(a_n b_n) = (\lim_{n\to\infty} a_n)(\lim_{n\to\infty} b_n)$.
Note that in the above definition, $-\infty$ and $\infty$ cannot be a legitimate limit for any sequence. In fact, if $(\forall\, L)(\exists\, N)$ such that $(\forall\, n > N)\ a_n > L$, then we say that $a_n$ diverges to $\infty$ and write $a_n \to \infty$. A similar argument applies to $a_n$ diverging to $-\infty$. For convenience, we will work in the set of extended real numbers and thus state that a sequence $\{a_n\}_{n=1}^{\infty}$ that diverges to either $\infty$ or $-\infty$ has a limit in $\mathbb{R} \cup \{-\infty, \infty\}$.
Lemma A.20 (Convergence of monotone sequences) If $\{a_n\}_{n=1}^{\infty}$ is non-decreasing in $n$, then $\lim_{n\to\infty} a_n$ exists in $\mathbb{R} \cup \{-\infty, \infty\}$. If $\{a_n\}_{n=1}^{\infty}$ is also bounded from above (i.e., $a_n \le L$ for all $n$ for some $L$ in $\mathbb{R}$), then $\lim_{n\to\infty} a_n$ exists in $\mathbb{R}$.
Likewise, if $\{a_n\}_{n=1}^{\infty}$ is non-increasing in $n$, then $\lim_{n\to\infty} a_n$ exists in $\mathbb{R} \cup \{-\infty, \infty\}$. If $\{a_n\}_{n=1}^{\infty}$ is also bounded from below (i.e., $a_n \ge L$ for all $n$ for some $L$ in $\mathbb{R}$), then $\lim_{n\to\infty} a_n$ exists in $\mathbb{R}$.
As stated above, the limit of a sequence may not exist. For example, take $a_n = (-1)^n$. Then $a_n$ will be close to either $-1$ or $1$ for $n$ large. Hence, more generalized definitions that can describe the general limiting behavior of a sequence are required.
Definition A.21 (limsup and liminf) The limit supremum of $\{a_n\}_{n=1}^{\infty}$ is the extended real number in $\mathbb{R} \cup \{-\infty, \infty\}$ defined by
$$\limsup_{n\to\infty} a_n \triangleq \lim_{n\to\infty}\left(\sup_{k\ge n} a_k\right),$$
and the limit infimum of $\{a_n\}_{n=1}^{\infty}$ is the extended real number defined by
$$\liminf_{n\to\infty} a_n \triangleq \lim_{n\to\infty}\left(\inf_{k\ge n} a_k\right).$$
Some also use the notations $\overline{\lim}$ and $\underline{\lim}$ to denote limsup and liminf, respectively.
Note that the limit supremum and the limit infimum of a sequence are always defined in $\mathbb{R} \cup \{-\infty, \infty\}$, since the sequences $\sup_{k\ge n} a_k = \sup\{a_k : k \ge n\}$ and $\inf_{k\ge n} a_k = \inf\{a_k : k \ge n\}$ are monotone in $n$ (cf. Lemma A.20). An immediate result follows from the definitions of limsup and liminf.
Lemma A.22 (Limit) For a sequence $\{a_n\}_{n=1}^{\infty}$,
$$\lim_{n\to\infty} a_n = L \iff \limsup_{n\to\infty} a_n = \liminf_{n\to\infty} a_n = L.$$
Some properties regarding the limsup and liminf of sequences (which are parallel to Properties A.4 and A.10) are listed below.
Some properties regarding the limsup and liminf of sequences (which are
parallel to Properties A.4 and A.10) are listed below.
Property A.23 (Properties of the limit supremum)
1. The limit supremum always exists in $\mathbb{R} \cup \{-\infty, \infty\}$.
2. If $|\limsup_{m\to\infty} a_m| < \infty$, then $(\forall\, \varepsilon > 0)(\exists\, N)$ such that $(\forall\, n > N)$ $a_n < \limsup_{m\to\infty} a_m + \varepsilon$. (Note that this holds for every $n > N$.)
3. If $|\limsup_{m\to\infty} a_m| < \infty$, then $(\forall\, \varepsilon > 0$ and integer $K)(\exists\, N > K)$ such that $a_N > \limsup_{m\to\infty} a_m - \varepsilon$. (Note that this holds only for one $N$, which is larger than $K$.)
Property A.24 (Properties of the limit infimum)
1. The limit infimum always exists in $\mathbb{R} \cup \{-\infty, \infty\}$.
2. If $|\liminf_{m\to\infty} a_m| < \infty$, then $(\forall\, \varepsilon > 0$ and $K)(\exists\, N > K)$ such that $a_N < \liminf_{m\to\infty} a_m + \varepsilon$. (Note that this holds only for one $N$, which is larger than $K$.)
3. If $|\liminf_{m\to\infty} a_m| < \infty$, then $(\forall\, \varepsilon > 0)(\exists\, N)$ such that $(\forall\, n > N)$ $a_n > \liminf_{m\to\infty} a_m - \varepsilon$. (Note that this holds for every $n > N$.)
The last two items in Properties A.23 and A.24 can be stated using the
terminology of sufficiently large and infinitely often, which is often adopted in
Information Theory.
Definition A.25 (Sufficiently large) We say that a property holds for a se-
quence {an}
n=1 almost always or for all sufficiently large nif the property holds
for every n > N for some N.
Definition A.26 (Infinitely often) We say that a property holds for a se-
quence {an}
n=1 infinitely often or for infinitely many nif for every K, the prop-
erty holds for one (specific) Nwith N > K.
Then properties 2 and 3 of Property A.23 can be respectively re-phrased as: if $|\limsup_{m\to\infty} a_m| < \infty$, then $(\forall\, \varepsilon > 0)$
$$a_n < \limsup_{m\to\infty} a_m + \varepsilon \quad \text{for all sufficiently large } n$$
and
$$a_n > \limsup_{m\to\infty} a_m - \varepsilon \quad \text{for infinitely many } n.$$
Similarly, properties 2 and 3 of Property A.24 become: if $|\liminf_{m\to\infty} a_m| < \infty$, then $(\forall\, \varepsilon > 0)$
$$a_n < \liminf_{m\to\infty} a_m + \varepsilon \quad \text{for infinitely many } n$$
and
$$a_n > \liminf_{m\to\infty} a_m - \varepsilon \quad \text{for all sufficiently large } n.$$
Lemma A.27
1. $\liminf_{n\to\infty} a_n \le \limsup_{n\to\infty} a_n$.
2. If $a_n \le b_n$ for all sufficiently large $n$, then
$$\liminf_{n\to\infty} a_n \le \liminf_{n\to\infty} b_n \quad \text{and} \quad \limsup_{n\to\infty} a_n \le \limsup_{n\to\infty} b_n.$$
3. $\limsup_{n\to\infty} a_n < r \ \Rightarrow\ a_n < r$ for all sufficiently large $n$.
4. $\limsup_{n\to\infty} a_n > r \ \Rightarrow\ a_n > r$ for infinitely many $n$.
5.
$$\liminf_{n\to\infty} a_n + \liminf_{n\to\infty} b_n \le \liminf_{n\to\infty}(a_n + b_n) \le \limsup_{n\to\infty} a_n + \liminf_{n\to\infty} b_n \le \limsup_{n\to\infty}(a_n + b_n) \le \limsup_{n\to\infty} a_n + \limsup_{n\to\infty} b_n.$$
6. If $\lim_{n\to\infty} a_n$ exists, then
$$\liminf_{n\to\infty}(a_n + b_n) = \lim_{n\to\infty} a_n + \liminf_{n\to\infty} b_n$$
and
$$\limsup_{n\to\infty}(a_n + b_n) = \lim_{n\to\infty} a_n + \limsup_{n\to\infty} b_n.$$
Finally, one can also interpret the limit supremum and limit infimum in terms of the concept of clustering points. A clustering point is a point that a sequence $\{a_n\}_{n=1}^{\infty}$ approaches (i.e., belonging to a ball with arbitrarily small radius and that point as center) infinitely many times. For example, if $a_n = \sin(n\pi/2)$, then $\{a_n\}_{n=1}^{\infty} = \{1, 0, -1, 0, 1, 0, -1, 0, \ldots\}$. Hence, there are three clustering points in this sequence, which are $-1$, $0$ and $1$. Then the limit supremum of the sequence is nothing but its largest clustering point, and its limit infimum is exactly its smallest clustering point. Specifically, $\limsup_{n\to\infty} a_n = 1$ and $\liminf_{n\to\infty} a_n = -1$. This approach can sometimes be useful to determine the limsup and liminf quantities.
A.5 Equivalence
We close this appendix by providing some equivalent statements that are often used to simplify proofs. For example, instead of directly showing that quantity $x$ is less than or equal to quantity $y$, one can take an arbitrary constant $\varepsilon > 0$ and prove that $x < y + \varepsilon$. Since $y + \varepsilon$ is a larger quantity than $y$, in some cases it might be easier to show $x < y + \varepsilon$ than to prove $x \le y$. By the next theorem, any proof that concludes that "$x < y + \varepsilon$ for all $\varepsilon > 0$" immediately gives the desired result of $x \le y$.
Theorem A.28 For any $x$, $y$ and $a$ in $\mathbb{R}$,
1. $x < y + \varepsilon$ for all $\varepsilon > 0$ iff $x \le y$;
2. $x < y - \varepsilon$ for some $\varepsilon > 0$ iff $x < y$;
3. $x > y - \varepsilon$ for all $\varepsilon > 0$ iff $x \ge y$;
4. $x > y + \varepsilon$ for some $\varepsilon > 0$ iff $x > y$;
5. $|a| < \varepsilon$ for all $\varepsilon > 0$ iff $a = 0$.
Appendix B
Overview in Probability and Random
Processes
This appendix presents a quick overview of basic concepts from probability the-
ory and the theory of random processes. The reader can consult comprehensive
texts on these subjects for a thorough study (e.g., cf. [2, 6, 20]). We close the ap-
pendix with a brief discussion of Jensen’s inequality and the Lagrange multipliers
technique for the optimization of convex functions [5, 11].
B.1 Probability space
Definition B.1 (σ-Fields) Let $\mathcal{F}$ be a collection of subsets of a non-empty set $\Omega$. Then $\mathcal{F}$ is called a σ-field (or σ-algebra) if the following hold:
1. $\Omega \in \mathcal{F}$.
2. Closure of $\mathcal{F}$ under complementation: If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$, where $A^c$ is the complement set of $A$ (relative to $\Omega$).
3. Closure of $\mathcal{F}$ under countable union: If $A_i \in \mathcal{F}$ for $i = 1, 2, 3, \ldots$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
It directly follows that the empty set is also an element of $\mathcal{F}$ (as $\Omega^c = \emptyset$) and that $\mathcal{F}$ is closed under countable intersection, since
$$\bigcap_{i=1}^{\infty} A_i = \left(\bigcup_{i=1}^{\infty} A_i^c\right)^c.$$
The largest σ-field of subsets of a given set $\Omega$ is the collection of all subsets of $\Omega$ (i.e., its powerset), while the smallest σ-field is given by $\{\Omega, \emptyset\}$. Also, if $A$ is a proper (strict) non-empty subset of $\Omega$, then the smallest σ-field containing $A$ is given by $\{\Omega, \emptyset, A, A^c\}$.
Definition B.2 (Probability space) A probability space is a triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a given set called the sample space containing all possible outcomes (usually observed from an experiment), $\mathcal{F}$ is the σ-field of subsets of $\Omega$, and $P$ is a probability measure $P: \mathcal{F} \to [0,1]$ on the σ-field satisfying the following:
1. $0 \le P(A) \le 1$ for all $A \in \mathcal{F}$.
2. $P(\Omega) = 1$.
3. Countable additivity: If $A_1, A_2, \ldots$ is a sequence of disjoint sets (i.e., $A_i \cap A_j = \emptyset$ for all $i \neq j$) in $\mathcal{F}$, then
$$P\left(\bigcup_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} P(A_k).$$
It directly follows from properties 1-3 of the above definition that $P(\emptyset) = 0$. Usually, the σ-field $\mathcal{F}$ is called the event space and its elements (which are subsets of $\Omega$ satisfying the properties of Definition B.1) are called events.
B.2 Random variable and random process
B.3 Central limit theorem
Theorem B.3 (Central limit theorem) If $\{X_n\}_{n=1}^{\infty}$ is a sequence of i.i.d. random variables with finite common marginal mean $\mu$ and variance $\sigma^2$, then
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(X_i - \mu) \ \xrightarrow{d}\ Z \sim \mathcal{N}(0, \sigma^2),$$
where the convergence is in distribution (as $n \to \infty$) and $Z \sim \mathcal{N}(0, \sigma^2)$ is a Gaussian distributed random variable with mean 0 and variance $\sigma^2$.
B.4 Convexity, concavity and Jensen’s inequality
Jensen’s inequality provides a useful bound for the expectation of convex (or
concave) functions.
Definition B.4 (Convexity) Consider a convex set$^1$ $\mathcal{O} \subseteq \mathbb{R}^m$, where $m$ is a fixed positive integer. Then a function $f: \mathcal{O} \to \mathbb{R}$ is said to be convex over $\mathcal{O}$ if for every $x$, $y$ in $\mathcal{O}$ and $0 \le \lambda \le 1$,
$$f\big(\lambda x + (1-\lambda)y\big) \le \lambda f(x) + (1-\lambda)f(y).$$
Furthermore, a function $f$ is said to be strictly convex if equality holds only when $\lambda = 0$ or $\lambda = 1$.
Definition B.5 (Concavity) A function $f$ is concave if $-f$ is convex.
Note that when $\mathcal{O} = (a,b)$ is an interval in $\mathbb{R}$ and the function $f: \mathcal{O} \to \mathbb{R}$ has a non-negative (respectively, positive) second derivative over $\mathcal{O}$, then the function is convex (resp. strictly convex). This can be easily shown via the Taylor series expansion of the function.
Theorem B.6 (Jensen's inequality) If $f: \mathcal{O} \to \mathbb{R}$ is convex over a convex set $\mathcal{O} \subseteq \mathbb{R}^m$, and $X = (X_1, X_2, \cdots, X_m)^T$ is an $m$-dimensional random vector with alphabet $\mathcal{X} \subseteq \mathcal{O}$, then
$$E[f(X)] \ge f(E[X]).$$
Moreover, if $f$ is strictly convex, then equality in the above inequality immediately implies $X = E[X]$ with probability 1.
Note: $\mathcal{O}$ is a convex set; hence, $\mathcal{X} \subseteq \mathcal{O}$ implies $E[X] \in \mathcal{O}$. This guarantees that $f(E[X])$ is defined. Similarly, if $f$ is concave, then
$$E[f(X)] \le f(E[X]).$$
Furthermore, if $f$ is strictly concave, then equality in the above inequality immediately implies that $X = E[X]$ with probability 1.
Proof: Let $y = a^T x + b$ be a "support hyperplane" for $f$ with "slope" vector $a^T$ and affine parameter $b$ that passes through the point $(E[X], f(E[X]))$, where a support hyperplane$^2$ for a function $f$ at a point $x$ is by definition a hyperplane passing through the point $(x, f(x))$ and lying entirely below the graph of $f$ (see Figure B.1 for an illustration of a support line for a convex function over $\mathbb{R}$).
$^1$A set $\mathcal{O} \subseteq \mathbb{R}^m$ is said to be convex if for every $x = (x_1, x_2, \cdots, x_m)^T$ and $y = (y_1, y_2, \cdots, y_m)^T$ in $\mathcal{O}$ (where $T$ denotes transposition), and every $0 \le \lambda \le 1$, $\lambda x + (1-\lambda)y \in \mathcal{O}$; in other words, the "convex combination" of any two "points" $x$ and $y$ in $\mathcal{O}$ also belongs to $\mathcal{O}$.
Figure B.1: The support line $y = ax + b$ of the convex function $f(x)$.
Thus,
$$(\forall\, x \in \mathcal{X})\quad a^T x + b \le f(x).$$
By taking the expectation of both sides, we obtain
$$a^T E[X] + b \le E[f(X)],$$
but we know that $a^T E[X] + b = f(E[X])$. Consequently,
$$f(E[X]) \le E[f(X)]. \qquad \Box$$
$^2$A hyperplane $y = a^T x + b$ is said to be a support hyperplane for a function $f$ with "slope" vector $a^T \in \mathbb{R}^m$ and affine parameter $b \in \mathbb{R}$ if, among all hyperplanes of the same slope vector $a$, it is the largest one satisfying $a^T x + b \le f(x)$ for every $x \in \mathcal{O}$. Hence, a support hyperplane may not necessarily pass through the point $(x_0, f(x_0))$ for every $x_0 \in \mathcal{O}$. Here, since we only consider convex functions, the validity of a support hyperplane at $x_0$ passing through $(x_0, f(x_0))$ is therefore guaranteed. Note that when $x$ is one-dimensional (i.e., $m = 1$), a support hyperplane is simply referred to as a support line.
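As a small numerical illustration of Jensen's inequality (added here as a sketch; the convex function and distribution are arbitrary choices), the following Python snippet compares $E[f(X)]$ with $f(E[X])$ for the convex function $f(x) = x^2$.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=100_000)   # any non-degenerate distribution will do

f = lambda t: t**2                              # a convex function
lhs = float(np.mean(f(x)))                      # E[f(X)]
rhs = float(f(np.mean(x)))                      # f(E[X])
print(round(lhs, 3), ">=", round(rhs, 3))       # Jensen: E[f(X)] >= f(E[X])
```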
Bibliography
[1] S. Arimoto, “An algorithm for computing the capacity of arbitrary discrete
memoryless channel,” IEEE Trans. Inform. Theory, vol. 18, no. 1, pp. 14-20,
Jan. 1972.
[2] R. B. Ash and C. A. Dol´eans-Dade, Probability and Measure Theory, Aca-
demic Press, MA, 2000.
[3] C. Berrou, A. Glavieux and P. Thitimajshima, “Near Shannon limit error-
correcting coding and decoding: Turbo-codes(1),” Proc. IEEE Int. Conf.
Commun., pp. 1064-1070, Geneva, Switzerland, May 1993.
[4] C. Berrou and A. Glavieux, “Near optimum error correcting coding and
decoding: Turbo-codes,” IEEE Trans. Commun., vol. 44, no. 10, pp. 1261-
1271, Oct. 1996.
[5] D. P. Bertsekas, with A. Nedi´c and A. E. Ozdagler, Convex Analysis and
Optimization, Athena Scientific, Belmont, MA, 2003.
[6] P. Billingsley. Probability and Measure, 2nd. Ed., John Wiley and Sons, NY,
1995.
[7] R. E. Blahut, “Computation of channel capacity and rate-distortion func-
tions,” IEEE Trans. Inform. Theory, vol. 18, no. 4, pp. 460-473, Jul. 1972.
[8] R. E. Blahut, Theory and Practice of Error Control Codes, Addison-Wesley,
MA, 1983.
[9] R. E. Blahut. Principles and Practice of Information Theory. Addison Wes-
ley, MA, 1988.
[10] R. E. Blahut, Algebraic Codes for Data Transmission, Cambridge Univ.
Press, 2003.
[11] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University
Press, Cambridge, UK, 2003.
[12] T. M. Cover and J.A. Thomas, Elements of Information Theory, 2nd Ed.,
Wiley, NY, 2006.
[13] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic, NY, 1981.
[14] I. Csisz´ar and G. Tusnady, “Information geometry and alternating min-
imization procedures,” Statistics and Decision, Supplement Issue, vol. 1,
pp. 205-237, 1984.
[15] S. H. Friedberg, A.J. Insel and L. E. Spence, Linear Algebra, 4th Ed., Pren-
tice Hall, 2002.
[16] R. G. Gallager, “Low-density parity-check codes,” IRE Trans. Inform. The-
ory, vol. 28, no. 1, pp. 8-21, Jan. 1962.
[17] R. G. Gallager, Low-Density Parity-Check Codes, MIT Press, 1963.
[18] R. G. Gallager, Information Theory and Reliable Communication, Wiley,
1968.
[19] R. Gallager, “Variations on theme by Huffman,” IEEE Trans. Inform. The-
ory, vol. 24, no. 6, pp. 668-674, Nov. 1978.
[20] G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes,
Third Edition, Oxford University Press, NY, 2001.
[21] T. S. Han and S. Verd´u, “Approximation theory of output statistics,” IEEE
Trans. Inform. Theory, vol. 39, no. 3, pp. 752-772, May 1993.
[22] S. Ihara, Information Theory for Continuous Systems, World-Scientific, Sin-
gapore, 1993.
[23] R. Johanesson and K. Zigangirov, Fundamentals of Convolutional Coding,
IEEE, 1999.
[24] W. Karush, Minima of Functions of Several Variables with Inequalities as
Side Constraints, M.Sc. Dissertation, Dept. Mathematics, Univ. Chicago,
Chicago, Illinois, 1939.
[25] A. N. Kolmogorov, “On the Shannon theory of information transmission in
the case of continuous signals,” IEEE Trans. Inform. Theory, vol. 2, no. 4,
pp. 102-108, Dec. 1956.
[26] A. N. Kolmogorov and S. V. Fomin, Introductory Real Analysis, Dover Pub-
lications, NY, 1970.
[27] H. W. Kuhn and A. W. Tucker, “Nonlinear programming,” Proc. 2nd Berke-
ley Symposium, Berkeley, University of California Press, pp. 481-492, 1951.
[28] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Appli-
cations, 2nd Edition, Prentice Hall, NJ, 2004.
[29] D. J. C. MacKay and R. M. Neal, “Near Shannon limit performance of low
density parity check codes,” Electronics Letters, vol. 33, no. 6, Mar. 1997.
[30] D. J. C. MacKay, “Good error correcting codes based on very sparse matri-
ces,” IEEE Trans. Inform. Theory, vol. 45, no. 2, pp. 399-431, Mar. 1999.
[31] F. J. MacWilliams and N. J. A. Sloane, The Theory of Error Correcting
Codes, North-Holland Pub. Co., 1978.
[32] J. E. Marsden and M. J.Hoffman, Elementary Classical Analysis, W.H.
Freeman & Company, 1993.
[33] R. J. McEliece, The Theory of Information and Coding, 2nd. Ed., Cam-
bridge University Press, 2002.
[34] M. S. Pinsker, Information and Information Stability of Random Variables
and Processes, Holden-Day, San Francisco, 1964.
[35] T. J. Richardson and R. L. Urbanke, Modern Coding Theory, Cambridge
University Press, 2008.
[36] M. Rezaeian and A. Grant, “Computation of total capacity for discrete
memoryless multiple-access channels,” IEEE Trans. Inform. Theory, vol. 50,
no. 11, pp. 2779-2784, Nov. 2004.
[37] H. L. Royden. Real Analysis, Macmillan Publishing Company, 3rd. Ed., NY,
1988.
[38] C. E. Shannon, “A mathematical theory of communications,” Bell Syst.
Tech. Journal, vol. 27, pp. 379-423, 1948.
[39] C. E. Shannon, “Coding theorems for a discrete source with a fidelity cri-
terion,” IRE Nat. Conv. Rec., Pt. 4, pp. 142-163, 1959.
[40] C. E. Shannon and W. W. Weaver, The Mathematical Theory of Commu-
nication, Univ. of Illinois Press, Urbana, IL, 1949.
[41] P. C. Shields, The Ergodic Theory of Discrete Sample Paths, American
Mathematical Society, 1991.
[42] N. J. A. Sloane and A. D. Wyner, Ed., Claude Elwood Shannon: Collected
Papers, IEEE Press, NY, 1993.
[43] W. R. Wade, An Introduction to Analysis, Prentice Hall, NJ, 1995.
[44] S. Wicker, Error Control Systems for Digital Communication and Storage,
Prentice Hall, NJ, 1995.
[45] R. W. Yeung, Information Theory and Network Coding, Springer, NY, 2008.
Article
We establish a phase transition known as the “all-or-nothing” phenomenon for noiseless discrete channels. This class of models includes the Bernoulli group testing model and the planted Gaussian perceptron model. Previously, the existence of the all-or-nothing phenomenon for such models was only known in a limited range of parameters. Our work extends the results to all signals with arbitrary sublinear sparsity. Over the past several years, the all-or-nothing phenomenon has been established in various models as an outcome of two seemingly disjoint results: one positive result establishing the “all” half of all-or-nothing, and one impossibility result establishing the “nothing” half. Our main technique in the present work is to show that for noiseless discrete channels, the “all” half implies the “nothing” half, that is, a proof of “all” can be turned into a proof of “nothing.” Since the “all” half can often be proven by straightforward means—for instance, by the first-moment method—our equivalence gives a powerful and general approach towards establishing the existence of this phenomenon in other contexts.
Article
Determining the presence of an anomaly or whether a system is safe or not is a problem with wide applicability. The model adopted for this problem is that of verifying whether a multi-component system has anomalies or not. Components can be probed over time individually or as groups in a data-driven manner. The collected observations are noisy and contain information on whether the selected group contains an anomaly or not. The aim is to minimize the probability of incorrectly declaring the system to be free of anomalies while ensuring that the probability of correctly declaring it to be safe is sufficiently large. This problem is modeled as an active hypothesis testing problem in the Neyman-Pearson setting. Asymptotically optimal rates and strategies are characterized. The general strategies are data driven and outperform previously proposed asymptotically optimal methods in the finite sample regime. Furthermore, novel component-selection are designed and analyzed in the non-asymptotic regime. For a specific class of problems admitting a key form of symmetry, strong non-asymptotic converse and achievability bounds are provided which are tighter than previously proposed bounds.
Article
We study hypothesis testing problems with fixed compression mappings and with user-dependent compression mappings to decide whether or not an observation sequence is related to one of the users in a database, which contains compressed versions of previously enrolled users’ data. We first provide the optimal characterization of the exponent of the probability of the second type of error for the fixed compression mappings scenario when the number of users in the database grows exponentially. We then establish operational equivalence relations between the Wyner-Ahlswede-Körner network, the single-user hypothesis testing problem, the multi-user hypothesis testing problem with user-dependent compression mappings and the identification systems with user-dependent compression mappings. These equivalence relations imply the strong converse and exponentially strong converse for the multi-user hypothesis testing and the identification systems both with user-dependent compression mappings. Finally they also show how an identification scheme can be turned into a multi-user hypothesis testing scheme with an explicit transfer of rate and error probability conditions and vice versa.
Book
Information Theory and Network Coding consists of two parts: Components of Information Theory, and Fundamentals of Network Coding Theory. Part I is a rigorous treatment of information theory for discrete and continuous systems. In addition to the classical topics, there are such modern topics as the I-Measure, Shannon-type and non-Shannon-type information inequalities, and a fundamental relation between entropy and group theory. With information theory as the foundation, Part II is a comprehensive treatment of network coding theory with detailed discussions on linear network codes, convolutional network codes, and multi-source network coding. Other important features include: • Derivations that are from the first principle • A large number of examples throughout the book • Many original exercise problems • Easy-to-use chapter summaries • Two parts that can be used separately or together for a comprehensive course Information Theory and Network Coding is for senior undergraduate and graduate students in electrical engineering, computer science, and applied mathematics. This work can also be used as a reference for professional engineers in the area of communications.
Article
Csiszár and Körner’s book is widely regarded as a classic in the field of information theory, providing deep insights and expert treatment of the key theoretical issues. It includes in-depth coverage of the mathematics of reliable information transmission, both in two-terminal and multi-terminal network scenarios. Updated and considerably expanded, this new edition presents unique discussions of information theoretic secrecy and of zero-error information theory, including the deep connections of the latter with extremal combinatorics. The presentations of all core subjects are self contained, even the advanced topics, which helps readers to understand the important connections between seemingly different problems. Finally, 320 end-of-chapter problems, together with helpful solving hints, allow readers to develop a full command of the mathematical techniques. It is an ideal resource for graduate students and researchers in electrical and electronic engineering, computer science and applied mathematics. © Akadémiai Kiadó, Budapest 1981 and Cambridge University Press 2011.
Article
The authors report the empirical performance of Gallager's low density parity check codes on Gaussian channels. They show that performance substantially better than that of standard convolutional and concatenated codes can be achieved; indeed the performance is almost as close to the Shannon limit as that of turbo codes