Bayesian Nonparametrics

Springer Series in Statistics
Advisors:
P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg,
I. Olkin, N. Wermuth, S. Zeger
J.K. Ghosh R.V. Ramamoorthi
Bayesian Nonparametrics
With 49 Illustrations
J.K. Ghosh
Statistics-Mathematics Division
Indian Statistical Institute
203 Barrackpore Trunk Road
Kolkata 70035
India

R.V. Ramamoorthi
Statistics and Probability
Michigan State University
A431 Wells Hall
East Lansing, MI 48824
USA
Library of Congress Cataloging-in-Publication Data
Ghosh, J.K.
Bayesian nonparametrics / J.K. Ghosh, R.V. Ramamoorthi.
p. cm. — (Springer series in statistics)
Includes bibliographical references and index.
ISBN 0-387-95537-2 (alk. paper)
1. Bayesian statistical decision theory. 2. Nonparametric statistics. I. Ramamoorthi, R.V.
II. Title. III. Series.
QA279.5 .G48 2002
519.542—dc21 2002026665
ISBN 0-387-95537-2 Printed on acid-free paper.
©2003 Springer-Verlag New York, Inc.
All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York,
NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use
in connection with any form of information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or not
they are subject to proprietary rights.
Printed in the United States of America.
987654321 SPIN 10884896
Typesetting: Pages created by the authors using a Springer TEX macro package.
www.springer-ny.com
Springer-Verlag New York Berlin Heidelberg
A member of BertelsmannSpringer Science+Business Media GmbH
to our wives
Ira and Deepa
Preface
This book has grown out of several courses that we have given over the years at
Purdue University, Michigan State University and the Indian Statistical Institute on
Bayesian nonparametrics and Bayesian asymptotics. These topics seemed sufficiently
rich and useful that a book-length treatment seemed desirable.
Through the writing of this book we have received support from many people
and we would like to gratefully acknowledge them. Our early interest in the topic
came from discussions with Jim Berger, Persi Diaconis and Larry Wasserman. We
have received encouragement in our effort from Mike Lavine, Steve McEachern, Susie
Bayarri, Mary Ellen Bock, J. Sethuraman and Shanti Gupta, who alas is no longer
with us.
We have enjoyed many years of collaboration with Subhashis Ghosal, and much of
our joint work finds a place in this book. He also looked over an earlier version
of the manuscript and gave very useful comments. The book also includes joint work
with Jyotirmoy Dey, Roy Erickson, Liliana Dragichi, Charles Messan, Tapas Samanta
and K.R. Srikanth. They have helped us with the proofs, as have others. In particular,
Tapas Samanta played an invaluable role in helping us communicate electronically
and Charles Messan with computations.
Brendan Murphy, then a graduate student at Yale, gave us very useful feedback
on an earlier version of Chapter 1. We also benefited from many suggestions and
criticisms from Jim Hannan on the same chapter. We would like to thank Nils Hjort
both for his interest in the book and for his comments.
Dipak Dey made Sethuraman’s unpublished notes available to us and these notes
helped us considerably with Chapter 3.
When we first thought of writing a book, it seemed that we would be able to cover
most, if not all, of what was known in Bayesian nonparametrics. However the last few
years have seen an explosion of new work and our goals have turned more modest.
We view this book as an introduction to the theoretical aspects of the topic at the
graduate level. There is no coverage of the important aspect of computations but
given the interest in this area we expect that a book on computations will emerge
before long.
Our appreciation goes to Vince Melfi for his advice in matters related to LaTeX. Despite
his help, our limitations with LaTeX and our typing skills will be apparent, and we seek
the readers' indulgence.
Contents
Introduction: Why Bayesian Nonparametrics—An Overview and Summary

1 Preliminaries and the Finite Dimensional Case
1.1 Introduction
1.2 Metric Spaces
1.2.1 Preliminaries
1.2.2 Weak Convergence
1.3 Posterior Distribution and Consistency
1.3.1 Preliminaries
1.3.2 Posterior Consistency and Posterior Robustness
1.3.3 Doob's Theorem
1.3.4 Wald-Type Conditions
1.4 Asymptotic Normality of MLE and Bernstein–von Mises Theorem
1.5 Ibragimov and Hasminskii Conditions
1.6 Nonsubjective Priors
1.6.1 Fully Specified
1.6.2 Discussion
1.7 Conjugate and Hierarchical Priors
1.8 Exchangeability, De Finetti's Theorem, Exponential Families

2 M(X) and Priors on M(X)
2.1 Introduction
2.2 The Space M(X)
2.3 (Prior) Probability Measures on M(X)
2.3.1 X Finite
2.3.2 X = R
2.3.3 Tail Free Priors
2.4 Tail Free Priors and 0-1 Laws
2.5 Space of Probability Measures on M(R)
2.6 De Finetti's Theorem

3 Dirichlet and Polya Tree Process
3.1 Dirichlet and Polya Tree Process
3.1.1 Finite Dimensional Dirichlet Distribution
3.1.2 Dirichlet Distribution via Polya Urn Scheme
3.2 Dirichlet Process on M(R)
3.2.1 Construction and Properties
3.2.2 The Sethuraman Construction
3.2.3 Support of D_α
3.2.4 Convergence Properties of D_α
3.2.5 Elicitation and Some Applications
3.2.6 Mutual Singularity of Dirichlet Priors
3.2.7 Mixtures of Dirichlet Process
3.3 Polya Tree Process
3.3.1 The Finite Case
3.3.2 X = R

4 Consistency Theorems
4.1 Introduction
4.2 Preliminaries
4.3 Finite and Tail Free Case
4.4 Posterior Consistency on Densities
4.4.1 Schwartz Theorem
4.4.2 L1-Consistency
4.5 Consistency via LeCam's Inequality

5 Density Estimation
5.1 Introduction
5.2 Polya Tree Priors
5.3 Mixtures of Kernels
5.4 Hierarchical Mixtures
5.5 Random Histograms
5.5.1 Weak Consistency
5.5.2 L1-Consistency
5.6 Mixtures of Normal Kernel
5.6.1 Dirichlet Mixtures: Weak Consistency
5.6.2 Dirichlet Mixtures: L1-Consistency
5.6.3 Extensions
5.7 Gaussian Process Priors

6 Inference for Location Parameter
6.1 Introduction
6.2 The Diaconis-Freedman Example
6.3 Consistency of the Posterior
6.4 Polya Tree Priors

7 Regression Problems
7.1 Introduction
7.2 Schwartz Theorem
7.3 Exponentially Consistent Tests
7.4 Prior Positivity of Neighborhoods
7.5 Polya Tree Priors
7.6 Dirichlet Mixture of Normals
7.7 Binary Response Regression with Unknown Link
7.8 Stochastic Regressor
7.9 Simulations

8 Uniform Distribution on Infinite-Dimensional Spaces
8.1 Introduction
8.2 Towards a Uniform Distribution
8.2.1 The Jeffreys Prior
8.2.2 Uniform Distribution via Sieves and Packing Numbers
8.3 Technical Preliminaries
8.4 The Jeffreys Prior Revisited
8.5 Posterior Consistency for Noninformative Priors for Infinite-Dimensional Problems
8.6 Convergence of Posterior at Optimal Rate

9 Survival Analysis—Dirichlet Priors
9.1 Introduction
9.2 Dirichlet Prior
9.3 Cumulative Hazard Function, Identifiability
9.4 Priors via Distributions of (Z, δ)
9.5 Interval Censored Data

10 Neutral to the Right Priors
10.1 Introduction
10.2 Neutral to the Right Priors
10.3 Independent Increment Processes
10.4 Basic Properties
10.5 Beta Processes
10.5.1 Definition and Construction
10.5.2 Properties
10.6 Posterior Consistency

11 Exercises

References

Index
Introduction: Why Bayesian Nonparametrics—An Overview and Summary
Bayesians believe that all inference and more is Bayesian territory. So it is natural that
a Bayesian should explore nonparametrics and other infinite-dimensional problems.
However, putting a prior, which is always a delicate and difficult exercise in Bayesian
analysis, poses special conceptual, mathematical, and practical difficulties in infinite-
dimensional problems. Can one really have a subjective prior based on knowledge and
belief, in an infinite-dimensional space? Even if one settles for a largely non-subjective
prior, it is mathematically difficult to construct prior distributions on such sets as the
space of all distribution functions or the space of all probability density functions
and ensure that they have large support, which is a minimum requirement because
a largely nonsubjective prior should not put too much mass on a small set. Finally,
there are formidable practical difficulties in the calculation of the posterior, which is
the single most important object in the output of any Bayesian analysis.
Nonetheless, a major breakthrough came with Ferguson’s [61] paper on Dirichlet
process priors. The hyperparameters α(R) and α(·) of these priors are easy to elicit, it
is easy to ensure a large support, and the posterior is analytically tractable. More flex-
ibility was added by forming mixtures of Dirichlet processes, introduced by Antoniak
[4].
Mixtures of Dirichlet have been very popular in Bayesian nonparametrics, espe-
cially in analyzing right censored survival data. In these problems one can combine
analytical work with Markov Chain Monte Carlo (MCMC) to calculate and display
various posterior quantities in real time. By choosing α(·) equal to the exponential
distribution and by tuning the parameter α(R), one can make the analysis close to
classical analysis based on a parametric exponential or close to classical nonparamet-
rics. However, the whole range of α(R) offers a whole continuum of options that are
not available in classical statistics, where typically one either does a model-based
parametric analysis or uses fully nonparametric methods. An interesting example in
survival analysis is presented by Doss [53, 54]. Huber’s pioneering work in classical
statistics on a robust via media between these two extremes has been too technically
demanding to yield a flexible set of methods that pass continuously from one extreme
to the other. These ideas are discussed further in Chapter 3 on Dirichlet priors.
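To make the role of α(R) concrete, here is a minimal numerical sketch (not part of the original text) that draws samples from the marginal of a Dirichlet process prior via the Polya urn scheme, with the base measure α(·) taken, as above, proportional to an Exp(1) distribution; the function names and the particular values of α(R) are illustrative assumptions. Small α(R) produces heavy ties, while large α(R) makes the draws behave almost like i.i.d. exponential observations, which is the continuum referred to above.

```python
import numpy as np

rng = np.random.default_rng(0)

def polya_urn_sample(n, alpha_mass, base_sampler):
    """Draw X_1, ..., X_n from the marginal of a Dirichlet process prior
    via the Polya urn scheme: a fresh value from the normalized base
    measure with probability alpha_mass / (alpha_mass + i), otherwise
    a uniformly chosen previous observation."""
    x = np.empty(n)
    for i in range(n):
        if rng.random() < alpha_mass / (alpha_mass + i):
            x[i] = base_sampler()          # fresh draw from alpha(.)/alpha(R)
        else:
            x[i] = x[rng.integers(i)]      # repeat an earlier value
    return x

base = lambda: rng.exponential(1.0)        # base measure proportional to Exp(1)

for mass in (0.1, 10.0, 1000.0):
    x = polya_urn_sample(200, mass, base)
    print(f"alpha(R)={mass:7.1f}: {len(np.unique(x)):3d} distinct values, mean={x.mean():.2f}")
```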
Similarly one can analyze generalized linear models with a nonparametric Bayesian
choice of link functions. Bayesian nonparametrics is known to be a powerful, robust
alternative to regression analysis based on probit or logit models. References are
available in Chapter 7. There is some evidence of gaining an advantage in using
Bayesian nonparametrics to model random effects in linear models for longitudinal
data.
Sometimes things can go wrong if one uses a Dirichlet process prior inappropriately.
Such a prior cannot be used for density estimation without some smoothing, but
smoothing leads to formidable difficulties in calculating the posterior or the Bayes
estimate of the density function. Solution of this computational problem by MCMC
is fairly recent; see Chapter 5 for references and discussion. A major advantage of the
Bayesian method is that choice of the smoothing parameter h, which is still a hard
problem in classical density estimation, is relatively automatic. The Bayesian version
of varying the smoothing parameter over different parts of the data is also relatively
easy to implement. These are some of the major advances in Bayesian nonparametrics
in recent years.
A major theoretical advance has occurred recently in Bayesian semiparametrics.
One has the same advantages of flexibility here as discussed earlier, but unfortu-
nately this is also an area where the Dirichlet process is inappropriate without some
smoothing. Instead one can use Polya tree priors that sit on densities and satisfy some
extra conditions. For details and references see Chapter 6.
A difficulty in Bayesian nonparametrics is that not much was known until recently
about the asymptotic behavior of the posterior and various forms of frequentist vali-
dation. One method of frequentist validation of Bayesian analysis is to see if one can
learn about the unknown true P_0 with vanishingly small error by examining where the
posterior puts most of its mass. This idea and the first result of this sort are due to
Laplace. A precise statement of this property leads to the notion of consistency of the
posterior at P0, due to Freedman [69]. In the case of finite-dimensional parameters,
the posterior is usually consistent, and the data wash away the prior. For an infinite-
dimensional parameter, this is an exception rather than the rule; see, for instance,
examples of Freedman [69] and his theorem: For a multinomial with infinitely many
classes, the set of (P_0, Π) for which the posterior for the prior Π is consistent at P_0 is
topologically small, i.e., of the first category. Freedman had also introduced the notion
of tail free priors for which there is posterior consistency at P_0. A striking example of
inconsistency was shown by Diaconis and Freedman [46] when a Dirichlet process is
used for estimating a location parameter. In his discussion of [46], Barron points out
that the use of a Dirichlet process prior in a location problem leads to a pathological
behavior of the posterior for the location parameter. It is clear that inconsistency is a
consequence of this pathology. Diaconis and Freedman [46] also suggested that such
examples would occur even if one uses a prior on densities, e.g., a Polya tree prior
sitting on densities.
Chapter 4 is devoted to general questions of consistency of the posterior and positive
results. Applications appear in many other chapters and in fact run through the whole
book. These results, as well as somewhat stronger results, like rates of convergence,
are fairly recent and due to many authors, including ourselves.
To sum up, Bayesian nonparametrics is sufficiently well developed to take care
of many problems. Computation of the posterior is numerically feasible for several
classes of priors. We now know a fair amount about the asymptotic behavior of posteriors
for different priors, enough to ensure consistency at plausible P_0's. Most important, Bayesian
nonparametrics provides more flexibility than classical nonparametrics and a more
robust analysis than both classical and Bayesian parametric inference. It deserves to
be an important part of the Bayesian paradigm.
This monograph provides a systematic, theoretical development of the subject. A
chapterwise summary follows:
1. After introducing some preliminaries, Chapter 1 discusses some fundamental
aspects of Bayesian analysis in the relatively simple context of finite dimensional
parameter space with dimension fixed for all sample sizes. Because this subject is
treated well in many standard textbooks, the focus is on aspects such as nonsubjective
priors, also called objective priors, posterior consistency and exchangeability. These
are topics that usually do not receive much coverage in textbooks but are important
for our monograph.
Because elicitation of subjective priors or quantification of expert knowledge is still
not easy, most priors used in practice, especially in nonparametrics, are nonsubjective.
We discuss the standard ways of generating such priors and how to modify them
when some subjective or expert judgment is available (Sections 1.6 and 1.7). We also briefly
discuss common criticisms of nonsubjective Bayesian analysis and answers to them (Section 1.6.2).
Posterior consistency is introduced, and the classical theorem of Doob is proved
with all details. Then, in the spirit of classical maximum likelihood theory, posterior
consistency is established under regularity conditions using the uniform strong law
of large numbers. Posterior consistency provides a frequentist validation that is es-
pecially important for inference on infinite- or high-dimensional parameters because
even with a massive amount of data, any inadequacy in the prior can still influence
the posterior a lot. Posterior normality (Section 1.4) is a sharpening of posterior
consistency that is related to Laplace approximation and plays an important role in
the construction of reference and probability matching priors. Convergence of poste-
rior distributions is usually studied under regularity conditions. A general approach
that also works for nonregular problems is presented in Section 1.5. Exchangeability
appears in the last section of Chapter 1.
In Chapter 2 we examine basic measure-theoretic questions that arise when we try
to check measurability of a set or function or put a prior on such a large space as
the set P of all probability measures on R. The Kolmogorov construction based on
consistent finite-dimensional distributions does not meet this requirement because the
Kolmogorov sigma-field is too small to ensure measurability of important subsets like
the set of all discrete distributions on R or the set of all P with a density with respect
to the Lebesgue measure. Questions of measurability and convergence are discussed
in Section 2.2.
An interesting fact proved here is that the set of discrete measures and the set of ab-
solutely continuous probability measures are measurable. The main results in the
chapter are the basic construction theorems 2.3.2 through 2.3.4. Tail free priors, in-
cluding the Dirichlet process prior, may be constructed this way. The most important
type of convergence, namely weak convergence, is discussed in detail in Section 2.5.
The main result is a characterization of tightness in the spirit of Sethuraman and
Tiwari (1982). Section 2.4 contains 0-1 laws for tail free priors as well as a theorem
due to Kraft that can be used to construct a tail free prior for densities.
De Finetti’s theorem appears in the last section.
The reader not interested in measure-theoretic issues may read this chapter quickly
to understand the main results and get a flavor of some of the proofs. A reader with
more measure-theoretic interest will gain a solid theoretical framework for handling
priors for nonparametric problems and will also be rewarded with several measure-
theoretic subtleties that are interesting.
The most important prior in Bayesian nonparametrics is the Dirichlet process prior,
which plays a central role here, much as the normal does in finite-dimensional problems. Most of
Chapter 3 is devoted to this prior. The last section is on Polya tree priors.
We introduce a Dirichlet prior (3.1) first in the case of a finite sample space X and
then for X = R to help develop intuition for the main results regarding the latter. The
Dirichlet prior D for X = R is usually called the Dirichlet process prior. Section 3.2
contains calculation and justification of a formula for the posterior and special properties.
It also contains Sethuraman’s clever and elegant construction, which applies to all X
and suggests how one can simulate from this prior. Other results of interest include a
characterization of support and convergence properties (Section 3.2) and the question
of singularity of two Dirichlet process priors with respect to each other. Part of the
reason why Dirichlet process priors have been so popular is the multitude of interesting
properties mentioned earlier, of which the most important are the ease of calculating the
posterior and the fact that the support is as rich as it should be for a prior for
nonparametric problems.
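The construction referred to can be sketched numerically. The following hedged example (an illustration under assumed choices, not the book's own code) draws a truncated random measure from D_α by Sethuraman's stick-breaking representation, with a standard normal base measure and a truncation level chosen only for convenience.

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking_draw(alpha_mass, base_sampler, trunc=500):
    """One (truncated) draw from D_alpha via Sethuraman's construction:
    weights w_k = V_k * prod_{j<k} (1 - V_j) with V_k ~ Beta(1, alpha(R)),
    atoms Y_k i.i.d. from the normalized base measure."""
    v = rng.beta(1.0, alpha_mass, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    atoms = np.array([base_sampler() for _ in range(trunc)])
    return atoms, w

atoms, w = stick_breaking_draw(5.0, lambda: rng.normal(0.0, 1.0))
print("total mass kept by truncation:", w.sum())             # close to 1
print("P((-inf, 0]) for this random P:", w[atoms <= 0].sum())
```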
A second and equally important reason for popularity is the flexibility, at least for
mixtures of Dirichlet, and the relative ease with which one can elicit the hyperparam-
eters. These issues are discussed in Section 3.2.7.
The last section extends most of this discussion to Polya tree priors which form
a much richer class. Though not as mathematically tractable as D, they are still
relatively easy to handle and one can use convenient, partly elicited hyperparameters.
As we have argued before, posterior consistency is a useful validation for a par-
ticular prior, especially in nonparametric problems. Chapter 4 deals with essentially
three approaches to posterior consistency for three kinds of problems, namely, purely
nonparametric problems of estimating a distribution function or its weakly contin-
uous functionals, semiparametrics, and density estimation. The Dirichlet and, more
generally, tail free priors have good consistency properties for the first class of prob-
lems. Posterior consistency for tail free priors is discussed in the first few pages of the
chapter.
In Bayesian semiparametrics, for example estimation of a location parameter (Chap-
ter 6) or the regression coefficient (Chapter 7), addition of Euclidean parameters de-
stroys the tail free property of common priors like Dirichlet process and Polya tree.
Indeed, the use of Dirichlet leads to a pathological posterior. Posterior consistency in
this case is based on a theorem of Schwartz for a prior on densities. The two crucial
conditions are that the true probability measure lie in the Kullback-Leibler support of
the prior and that there exist uniformly exponentially consistent tests for H_0 : f = f_0
versus H_1 : f ∈ V^c, where V is a neighborhood whose posterior probability is being
claimed to converge to one. This is presented in Section 4.2.
The Schwartz theorem is well suited for semiparametrics but not for density es-
timation because the second condition in the theorem does not hold for V equal
to an L_1-neighborhood of f_0. Barron (unpublished) has suggested a weakening of
one of these conditions, suitably compensated by a condition on the prior. His con-
ditions are necessary and sufficient for a certain form of exponential convergence of
the posterior probability of Vto one. Ghosal, Ghosh and Ramamoorthi (1999) make
use of this theorem and some ideas of Barron, Schervish and Wasserman (1999) to
modify Schwartz’s result to make it suitable for showing posterior consistency with
L1-neighborhoods for a prior sitting on densities. All these results appear in Section
4.2.
Finally, Section 4.3 is devoted to another approach based on an inequality of
LeCam, which bypasses the verification of the first condition of Schwartz.
Applications of these results are made in Chapters 5 through 8. Somewhat different
but direct calculations leading to posterior consistency appear in Chapters 9 and 10.
Chapter 5 focuses on three kinds of priors for density estimation: Dirichlet mix-
tures of uniform, Dirichlet mixtures of normal, and Gaussian process priors. Dirichlet
mixtures of normal are the most popular and the most studied. The Gaussian pro-
cess priors seem very promising but have not been studied well. Dirichlet mixtures of
uniform are essentially Bayesian histograms and have a relatively simple theory.
The chapter begins with fairly general construction of priors on densities in sections
5.2 and 5.3 and then specializes to Bayesian histograms and their consistency in
Sections 5.4, 5.4.1, and 5.4.2. Dirichlet mixtures of normals are studied in Sections
5.6 and 5.7. The L1-consistency of the posterior applies to the prior of Escobar and
West in [168]. The final section contains an introduction to what is known about
Gaussian process priors.
An interesting issue that emerges from this rather technical chapter is that checking the
Kullback-Leibler support condition is especially hard for densities with R as support,
whereas densities with bounded support are much easier to handle. A second source of
technical difficulty is the need for efficient calculation of packing or covering numbers,
also called Kolmogorov’s metric entropy. These numbers play a basic role in Chapters
4, 5, and 8.
Chapter 6 begins with the famous Diaconis-Freedman (1986) example where a
Dirichlet process prior and a Euclidean location parameter lead to posterior inconsis-
tency. Barron (1986) has pointed out that there is a pathology in this case which is
even worse than inconsistency. We argue, as suggested in Chapter 4, that the main
problem leading to posterior inconsistency is that the tail free property does not hold.
It is good to have a density, but that does not seem to be enough. However, no
counterexample is produced.
The main contribution of the chapter is to suggest in Section 6.3 a strategy for
proving posterior consistency for the location parameter in a semiparametric setting
and to provide in Section 6.4 a class of Polya tree priors which satisfy the conditions
of Section 6.3 for a rich class of true densities. A major assumption needed in Section
6.3 holds only for densities with R as support. Later in the section we show how to
extend these results to densities with bounded support. Whereas in density estimation
bounded support helps, the converse seems to be true when one has to estimate a
location parameter.
The discussion of Bayesian semiparametrics is continued in Chapter 7. We assume
a standard regression model
Y = α + βx + ε
with the error ε having a nonparametric density f. The main object is to estimate
the regression coefficient β, but one may also wish to estimate the intercept α as well as
the true density of ε. The classical counterpart of this is Bickel [19].
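As a small illustration of the setup (an assumed example, not taken from the text), the following sketch generates data from Y = α + βx + ε with a bimodal, mean-zero error density standing in for the unknown f; the particular distributions and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_regression(n, alpha=1.0, beta=2.0):
    """Data from Y = alpha + beta * x + eps, where eps has an unknown
    (here: bimodal mixture) density f, chosen to have mean zero so the
    intercept stays identifiable."""
    x = rng.uniform(-1.0, 1.0, size=n)
    eps = np.where(rng.random(n) < 0.5,
                   rng.normal(-1.0, 0.3, size=n),
                   rng.normal(+1.0, 0.3, size=n))    # mean-zero, non-normal error
    return x, alpha + beta * x + eps

x, y = simulate_regression(500)
# least squares is still consistent for beta here, but a Bayesian
# semiparametric analysis also recovers the error density f
beta_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
print("least-squares slope:", round(beta_hat, 3))
```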
Because Y’s are no longer i.i.d, the Schwartz theorem of Chapter 6 does not apply.
In Section 7.2 - we prove a generalization that is valid for n independent but not
necessarily identically distributed random variables.
The theorem needs two conditions which are exact analogues of the two conditions
in Schwartz’s theorem and one additional condition on the second moment of a log
likelihood ratio. Verification of these conditions is discussed in Section 7.4.
In Section 7.3 we discuss sufficient conditions for the existence of uniformly consis-
tent tests for β alone or for (α, β) or (α, β, f).
Finally, in Sections 7.5 and 7.6 we verify the remaining two conditions for Polya tree priors
and Dirichlet mixtures of normals. Verification of these conditions requires methods that are
substantially different from those in Chapter 5.
Chapter 8 deals with three different but related topics, namely, three methods of
construction of nonsubjective priors in infinite-dimensional problems involving densi-
ties, consistency proofs for such priors using LeCam's inequality, and rates of convergence
for such and other priors. They are discussed in Sections 8.2, 8.5, and 8.6, respectively.
In several examples it is shown that the rates of convergence are the best possible.
However, for most commonly used priors getting rates of convergence is still a very
hard open problem.
Chapters 9 and 10 deal with right censored data. Here, the object of interest is the
distribution of a positive random variable X, viewed as a survival time. What we have
are observations of Z = X ∧ Y, ∆ = I(X ≤ Y), where Y is a censoring random
variable, independent of X.
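The observation scheme is easy to simulate; the following sketch (with arbitrary exponential survival and censoring distributions, chosen only for illustration) produces right-censored pairs (Z, ∆).

```python
import numpy as np

rng = np.random.default_rng(3)

def right_censor(n, survival_sampler, censor_sampler):
    """Return (Z, Delta) with Z = min(X, Y) and Delta = 1{X <= Y},
    where X is the survival time and Y an independent censoring time."""
    x = survival_sampler(n)
    y = censor_sampler(n)
    return np.minimum(x, y), (x <= y).astype(int)

z, delta = right_censor(1000,
                        lambda n: rng.exponential(2.0, n),   # survival times
                        lambda n: rng.exponential(3.0, n))   # censoring times
print("observed events:", delta.sum(), "of", len(delta))
```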
Chapter 9 begins with a model studied by Susarla and Van Ryzin [155] where the
distribution of X is given a Dirichlet process prior. We give a representation of the
posterior and establish its consistency. Section 2 is a quick review of the notion of the
cumulative hazard function and identifiability of the distribution of X from that of
(Z, ∆). This is then used in the next section, where we start with a Dirichlet prior
for the distribution of (Z, ∆) and use the identifiability result to transfer it to a prior
for the distribution of X. We expect that this method will be useful in constructing
priors for other kinds of censored data. Section 9.4 is a preliminary study of Dirichlet
priors for interval censored data. We show that, unlike the right censored case, letting
α(R) → 0 does not give the nonparametric maximum likelihood estimate.
Chapter 10 deals with neutral to the right priors. These priors were introduced by
Doksum in 1974 [48] and, after some initial work by Ferguson and Phadia [64], remained
dormant. There has been renewed interest in these priors since the introduction of
Beta processes by Hjort [100]. Neutral to the right priors, via the cumulative hazard
function, give rise to independent increment processes, which in turn are described
by their Lévy representations. In Section 10.1, after giving the definition and basic
properties of neutral to the right priors, we move on to Section 10.2, where we briefly review
the connection to independent increment processes and Lévy representations. Section
10.3 describes some properties of the prior in terms of the Lévy measure, and Section
10.4 is devoted to Beta processes. The remaining part of the chapter is devoted to
posterior consistency and is partly driven by a surprising example of inconsistency
due to Kim and Lee [114].
Chapter 11 contains some exercises. These were not systematically developed. How-
ever, we have included them in the hope that going through them will give the reader some
additional insight into the material.
Most work on Bayesian nonparametrics concentrates on estimation. This mono-
graph is no exception. However, there is interesting new work on Bayes factors and
their consistency [13], [37]. Even in the context of estimation with censored data,
not much has been done beyond the independent right censored model.
There certainly is a lot more to be done.
1 Preliminaries and the Finite Dimensional Case
1.1 Introduction
The basic Bayesian model consists of a parameter θ and a prior distribution Π for θ
that reflects the investigator's belief regarding θ. This prior is updated by observing
X_1, X_2, ..., X_n, which are modeled as i.i.d. P_θ given θ. The updating mechanism is
Bayes theorem, which results in changing Π to the posterior Π(·|X_1, X_2, ..., X_n).
The posterior reflects the investigator's belief as revised in the light of the data
X_1, X_2, ..., X_n. One may also report the predictive distribution of the future ob-
servations or summary measures like the posterior mean or variance. If there is a
decision problem with a specified loss function, one can choose the decision that min-
imizes the expected loss, with the associated loss calculated under the posterior. This
decision is the Bayes solution, or the Bayes rule. Ideally, a prior should be chosen
subjectively to express personal or expert knowledge and belief. Such evaluations and
quantifications are not easy, especially in high- or infinite-dimensional problems. In
practice, mathematically tractable priors, for example, conjugate priors, are often
used as convenient and partly nonsubjective models of knowledge and belief. Certain
aspects of these priors are chosen subjectively.
Finally, there are completely nonsubjective priors, the choice of which also leads to
useful posteriors. For the finite-dimensional case a brief account appears in Section
1.6. For a moderate amount of data, i.e., for a moderate n, the effect of the prior on the
posterior is often negligible. In such cases the posterior arising from a nonsubjective
prior may be considered a good approximation for the posterior that one would have
gotten from a subjective prior.
The posterior, like the prior, is a probability measure on the parameter space Θ,
except that it depends on X_1, X_2, ..., X_n, and the study of the posterior as n → ∞ is
naturally connected to the theory of convergence of probability measures. In Section
1.2.1, we present a brief survey of weak convergence of probability measures as well
as relations between various metrics and divergence measures.
A recurring theme throughout this monograph is posterior consistency, which helps
validate Bayesian analysis. Section 1.3 contains a formalization and brief discussion of
posterior consistency for a separable metric space Θ. In Sections 1.3 and 1.4 we study
in some detail the case when Θ is finite-dimensional and θ ↦ P_θ is smooth. This is the
framework of conventional parametric theory. Most of the results and asymptotics are
classical, but some are relatively new. While the main emphasis of this monograph is
on the nonparametric, and hence infinite-dimensional, situation, we hope that Sections
1.3 and 1.4 will serve to clarify the points of contact and points of difference with the
finite-dimensional case.
1.2 Metric Spaces
1.2.1 Preliminaries
Let (S, ρ) be a metric space, so that ρ satisfies (i) ρ(s_1, s_2) = ρ(s_2, s_1); (ii) ρ(s_1, s_2) ≥ 0
and ρ(s_1, s_2) = 0 iff s_1 = s_2; and (iii) ρ(s_1, s_3) ≤ ρ(s_1, s_2) + ρ(s_2, s_3).
Some basic properties of metric spaces are summarized here.
A sequence s_n in S converges to s iff ρ(s_n, s) → 0. The ball with center s_0 and
radius δ is the set B(s_0, δ) = {s : ρ(s_0, s) < δ}. A set U is open if every s in U has a
ball B(s, δ) contained in U. A set V is closed if its complement V^c is open. A useful
characterization of a closed set is: V is closed iff s_n ∈ V and s_n → s implies s ∈ V.
The intersection of closed sets is a closed set. For any set A ⊂ S, the smallest closed
set containing A, which is the intersection of all closed sets containing A, is called
the closure of A and will be denoted by Ā. Similarly A°, the union of all open sets
contained in A, is called the interior of A. The boundary ∂A of the set A is defined as
∂A = Ā ∩ (A^c)¯, the intersection of Ā with the closure of A^c.
A subset A of S is compact if every open cover of A has a finite subcover, i.e.,
if {U_α : α ∈ Λ} are open sets and A ⊂ ∪_{α∈Λ} U_α, then there exist α_1, α_2, ..., α_n
such that A ⊂ ∪_{i=1}^n U_{α_i}. A set A is compact iff every sequence in A has a convergent
subsequence with limit in A.
The metric space S is separable if it has a countable dense subset, i.e., if there is
a countable set S_0 whose closure is S. Most of the sets that we consider are separable. In
particular, if S is compact metric it is separable. Let S be separable and let S_0 be
a countable dense set. Consider the countable collection {B(s_i, 1/n) : s_i ∈ S_0; n =
1, 2, ...}. If U is an open set and if s ∈ U, then for some n > 1 there is a ball
B(s, 1/n) ⊂ U. Let s_i ∈ S_0 with ρ(s_i, s) < 1/2n. Then s is in B(s_i, 1/2n) and
B(s_i, 1/2n) ⊂ B(s, 1/n) ⊂ U. This shows that in a separable space every open set is
a countable union of balls. This fact fails to hold when S is not separable.
The Borel σ-algebra on S is the σ-algebra generated by all open sets and will
be denoted by B(S). The remarks in the last paragraph show that if S is separable
then B(S) is the same as the σ-algebra generated by open balls. In the absence of
separability these two σ-algebras will be different.
It will sometimes be necessary to check that a given class of sets C is the Borel
σ-algebra. A useful device for doing this is the π-λ theorem given below. See Pollard
[[140], Section 2.10] for a proof and some discussion.
Theorem 1.2.1. [π-λ theorem] A class D of subsets of S is a π-system if it is
closed under finite intersection, i.e., if A, B are in D then A ∩ B ∈ D. A class C of
subsets of S is a λ-system if
(i) S is in C;
(ii) A_n ∈ C and A_n ↑ A, then A ∈ C;
(iii) A, B ∈ C and A ⊂ B, then B − A ∈ C.
If C is a λ-system that contains a π-system D, then C contains the σ-algebra generated
by D.
Remark 1.2.1. An easy application of the π-λ theorem shows that if two probability
measures on S agree on all closed sets then they agree on B(S).
Remark 1.2.2. If two probability measures on R^k agree on all sets of the form
(a_1, b_1] × (a_2, b_2] × ... × (a_k, b_k], then they agree on all Borel sets in R^k.
Definition 1.2.1. Let P be a probability measure on (S, B(S)). The smallest closed
set of P-measure 1 is called the support, or more precisely the topological support, of
P.
When S is separable the support of P always exists. To see this, let 𝒰_0 = {U :
U open, P(U) = 0}, and let U_0 = ∪_{U∈𝒰_0} U; then U_0 is open. Because U_0 is a countable union of
balls in 𝒰_0, P(U_0) = 0. It follows easily that F = U_0^c is the support of P. The support
can be equivalently defined as a closed set F with P(F) = 1 and such that if s ∈ F
then P(U) > 0 for every neighborhood U of s. If S is not separable then the support
of P may not exist.
1.2.2 Weak Convergence
We need elements of the theory of weak convergence of probability measures. The
details of the material discussed below can be found, for instance, in Billingsley [[21],
Chapter 1].
Let S be a metric space and B(S) be the Borel σ-algebra on S. Denote by C(S)
the set of all bounded continuous functions on S. Note that every function in C(S) is
B(S) measurable.
Definition 1.2.2. A sequence {P_n} of probability measures on S is said to converge
weakly to a probability measure P, written as {P_n} → P weakly, if
∫ f dP_n → ∫ f dP for all f ∈ C(S)
The following "Portmanteau" theorem gives most of what we need.
Theorem 1.2.2. The following are equivalent:
1. {P_n} → P weakly;
2. ∫ f dP_n → ∫ f dP for all f bounded and uniformly continuous;
3. lim sup P_n(F) ≤ P(F) for all F closed;
4. lim inf P_n(U) ≥ P(U) for all U open;
5. lim P_n(B) = P(B) for all B ∈ B(S) with P(∂B) = 0.
In applications, the P_n's are often distributions on S induced by random variables X_n
taking values in S. If S is not separable, then P_n is defined on a σ-algebra much smaller
than B(S). In this case, to avoid measurability problems, inner and outer probabilities
have to be used. For a version of Theorem 1.2.2 in this more general setting see van
der Vaart and Wellner [[161], 1.3.4]. The other useful result is Prohorov's theorem.
Theorem 1.2.3. [Prohorov] If S is a complete separable metric space, then every
subsequence of P_n has a weakly convergent subsequence iff {P_n} is tight, i.e., for every
ε > 0, there exists a compact set K_ε with P_n(K_ε) > 1 − ε for all n.
When S is a complete separable metric space, the space M(S), the space of probability
measures on S, is also metrizable, complete, and separable under weak convergence.
In this case if ∫ f dP_n → ∫ f dP for f in a countable dense set in C(S), then P_n → P
weakly. We note that sets in M(S) of the form
{Q : |∫ f_i dP − ∫ f_i dQ| < ε, i = 1, 2, ..., k; f_i ∈ C(S)}
constitute a base for the neighborhoods at P, i.e., any open set is a union of a family
of sets of the form displayed above. The space M(S) and the space of probability
measures on M(S) are of considerable interest to us. We will return to a detailed
analysis of these spaces later; here are a few preliminary facts used later in this
chapter.
The space M(S) has many natural metrics.
Weak convergence. As discussed earlier, M(S) is metrizable, i.e., there is a metric ρ
on M(S) such that ρ(P_n, P) → 0 iff P_n → P weakly [see Section 6 in Billingsley
[21]]. The exact form of this metric is not of interest to us.
Total variation or L_1. The total variation distance between P and Q is given by
‖P − Q‖_1 = 2 sup_B |P(B) − Q(B)|. If p and q are densities of P and Q with
respect to some measure µ, then ‖P − Q‖_1 is the L_1-distance ∫ |p − q| dµ between
p and q. Sometimes, when there can be no confusion with other metrics, we will
omit the subscript 1 and denote the L_1 distance by just ‖P − Q‖ or in terms of
densities as ‖p − q‖.
Hellinger metric. If p and q are densities of P and Q with respect to some σ-finite
measure µ, the Hellinger distance between P and Q is defined by H(P, Q) =
[∫ (√p − √q)² dµ]^{1/2}. This distance is convenient in the i.i.d. context because
A(P^n, Q^n) = A^n(P, Q), where A(P, Q) = ∫ √(pq) dµ is called the affinity
between P and Q, and
H²(P^n, Q^n) = 2(1 − (A(P, Q))^n)
The Hellinger metric is equivalent to the L_1-metric. The next proposition shows
this.
Proposition 1.2.1.
‖P − Q‖_1² ≤ H²(P, Q) · 2(1 + A(P, Q)) ≤ ‖P − Q‖_1 · 2(1 + A(P, Q))
Proof. Let µ dominate P and Q and let p, q be densities of P and Q with respect to
µ. Then
(∫ |p − q| dµ)² = (∫ |√p − √q| |√p + √q| dµ)² ≤ ∫ (√p − √q)² dµ · ∫ (√p + √q)² dµ,
which is the first inequality, since ∫ (√p + √q)² dµ = 2(1 + A(P, Q)). Also H²(P, Q) ≤ ‖P − Q‖_1 because
(√p − √q)² ≤ p + q − 2 min(p, q) = |p − q|
As a corollary to the above proposition, we have the following.
Corollary 1.2.1. Replacing A(P, Q) by its upper bound 1 gives
‖P − Q‖_1² ≤ 4H²(P, Q) ≤ 4‖P − Q‖_1
Writing H²(P, Q) = 2(1 − A(P, Q)) in the first inequality, a bit of algebra gives
A(P, Q) ≤ (1 − ‖P − Q‖_1²/4)^{1/2}
Note that none of the three quantities discussed, the L_1 metric, the Hellinger metric,
or the affinity A(P, Q), depends on the dominating measure µ. The same holds for the
Kullback-Leibler divergence (K-L divergence), which is considered next.
Kullback-Leibler divergence. The Kullback-Leibler divergence between two prob-
ability measures, though not a metric, has played a central role in the classical
theory of testing and estimation and will play an important role in the later
chapters of this text. Let P and Q be two probability measures and let p, q be
their densities with respect to some measure µ. Then
K(P, Q) = ∫ p log(p/q) dµ ≥ ∫ (1 − q/p) dP = 0
and K(P, Q) = 0 iff P = Q. Here is a useful refinement due to Hannan [92].
Proposition 1.2.2.
K(P, Q) ≥ ‖P − Q‖_1² / 4
Proof.
∫ p log(p/q) dµ = 2 ∫ (−log √(q/p)) p dµ ≥ 2 ∫ (1 − √(q/p)) p dµ = 2(1 − A(P, Q)) = H²(P, Q)
The corollary to the previous proposition yields the conclusion.
Kemperman [112] has shown that K(P, Q) ≥ ‖P − Q‖_1²/2 and that this inequality
is sharp.
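The metrics and divergences above, and the inequalities relating them, can be checked numerically on a finite sample space. The following sketch (an illustration with an arbitrary pair of discrete distributions, not from the text) computes ‖P − Q‖_1, H²(P, Q), A(P, Q), and K(P, Q) and verifies the bounds of Proposition 1.2.1, Proposition 1.2.2, and Kemperman's inequality in this example.

```python
import numpy as np

def distances(p, q):
    """L1 distance, squared Hellinger distance, affinity, and Kullback-Leibler
    divergence for two probability vectors p, q on a common finite space."""
    l1 = np.abs(p - q).sum()
    aff = np.sqrt(p * q).sum()
    hell2 = ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()      # equals 2 * (1 - aff)
    kl = (p * np.log(p / q)).sum()                       # assumes p, q > 0 everywhere
    return l1, hell2, aff, kl

# two discrete distributions with common support
p = np.array([0.5, 0.3, 0.15, 0.05])
q = np.array([0.25, 0.25, 0.25, 0.25])
l1, hell2, aff, kl = distances(p, q)

print("||P-Q||_1^2 <= 2(1+A) H^2 :", l1**2 <= 2 * (1 + aff) * hell2)
print("H^2 <= ||P-Q||_1          :", hell2 <= l1)
print("K(P,Q) >= H^2             :", kl >= hell2)
print("K(P,Q) >= ||P-Q||_1^2 / 2 :", kl >= l1**2 / 2)    # Kemperman's bound
```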
Much of our study involves the convergence of sequences of functions of the form
T_n(X_1, X_2, ..., X_n) : Ω → M(Θ), where Ω = (X^∞, A^∞) with a measure P_0^∞. The
different metrics on M(Θ) provide ways of formalizing the convergence of T_n to T.
Thus
(i) T_n → T weakly, almost surely P_0, if
P_0^∞{ω : T_n(ω) → T(ω) weakly} = 1;
(ii) T_n → T weakly, in P_0-probability, if
P_0^∞{ω : ρ(T_n(ω), T(ω)) > ε} → 0,
where ρ is a metric that generates weak convergence.
T_n → T in L_1, almost surely P_0 or in P_0-probability, can be defined similarly.
1.3 Posterior Distribution and Consistency
1.3.1 Preliminaries
We begin by formalizing the setup. Let Θ be the parameter space. We assume that
Θ is a complete separable metric space endowed with its Borel σ-algebra B(Θ). For
each θ ∈ Θ, P_θ is a probability measure on a measurable space (X, A) such that, for
each A ∈ A, θ ↦ P_θ(A) is B(Θ) measurable.
X_1, X_2, ... is a sequence of X-valued random variables that are, for each θ ∈ Θ,
independent and identically distributed as P_θ. It is convenient to think of X_1, X_2, ...
as the coordinate random variables defined on Ω = (X^∞, A^∞) and P_θ^∞ as the i.i.d.
product measure defined on Ω. We will denote by Ω_n the space (X^n, A^n) and by P_θ^n
the n-fold product of P_θ. When convenient we will also abbreviate X_1, X_2, ..., X_n by
X^n.
Suppose that Π is a prior, i.e., a probability measure on (Θ, B(Θ)). For each n, Π
and the P_θ's together define a joint distribution of θ and X^n, namely, the probability
measure λ_{n,Π} on Θ × Ω_n given by
λ_{n,Π}(B × A) = ∫_B P_θ^n(A) dΠ(θ)
The marginal distribution λ_n of X_1, X_2, ..., X_n is
λ_n(A) = λ_{n,Π}(Θ × A)
These notions also extend to the infinite sequence X_1, X_2, .... We denote by λ_Π
the joint distribution of θ, X_1, X_2, ... and by λ the marginal distribution on Ω.
Any version of the conditional distribution of θ given X_1, X_2, ..., X_n is called a
posterior distribution given X_1, X_2, ..., X_n. Formally, a function Π(·|·) : B(Θ) × Ω_n →
[0, 1] is called a posterior given X_1, X_2, ..., X_n if
(a) for each ω ∈ Ω_n, Π(·|ω) is a probability measure on B(Θ);
(b) for each B ∈ B(Θ), Π(B|·) is A^n measurable; and
(c) for each B ∈ B(Θ) and A ∈ A^n,
λ_{n,Π}(B × A) = ∫_A Π(B|ω) dλ_n(ω)
In the case that we consider, namely, when the underlying spaces are complete and
separable, a version of the posterior always exists [Dudley [58], 10.2]. By condition
(b), Π(·|ω) is a function of X_1, X_2, ..., X_n and hence we will write the posterior
conveniently as Π(·|X_1, X_2, ..., X_n) or as Π(·|X^n).
Typically, a candidate for the posterior can be guessed or computed heuristically
from the context. What is then required is to verify that it satisfies the three conditions
listed earlier. When the P_θ's are all dominated by a σ-finite measure µ, it is easy to
see that, if p_θ = dP_θ/dµ, then
Π(A|X^n) = ∫_A ∏_{i=1}^n p_θ(X_i) dΠ(θ) / ∫_Θ ∏_{i=1}^n p_θ(X_i) dΠ(θ)
Thus in the dominated case, ∏_{i=1}^n p_θ(X_i) / ∫ ∏_{i=1}^n p_θ(X_i) dΠ(θ) is a version of the den-
sity with respect to Π of Π(·|X^n).
In the last expression the posterior given X_1, X_2, ..., X_n is the same as that given a
permutation X_{π(1)}, X_{π(2)}, ..., X_{π(n)}. Said differently, the posterior depends only on the
empirical measure (1/n) Σ_{i=1}^n δ_{X_i}, where for any x, δ_x denotes the measure degenerate
at x. This property holds also in the undominated case. A simple sufficiency argument
shows that there is a version of the posterior given X_1, X_2, ..., X_n that is a function
of the empirical measure.
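In the dominated case the displayed formula can be evaluated directly when the prior is supported on a finite grid. The following sketch (assuming, for illustration only, a N(θ, 1) model and a uniform prior on a grid, neither of which comes from the text) computes the posterior by normalizing the product of the likelihood and the prior.

```python
import numpy as np

rng = np.random.default_rng(4)

def grid_posterior(x, theta_grid, prior_weights, loglik_fn):
    """Posterior Pi(. | X^n) on a finite grid of theta values, computed by
    normalizing prod_i p_theta(X_i) * Pi(theta); log-sums avoid underflow."""
    loglik = np.array([loglik_fn(x, t) for t in theta_grid])
    logpost = loglik + np.log(prior_weights)
    logpost -= logpost.max()
    post = np.exp(logpost)
    return post / post.sum()

# assumed example: X_i ~ N(theta, 1), uniform prior on a grid
normal_loglik = lambda x, t: -0.5 * np.sum((x - t) ** 2)   # up to a constant in theta
theta_grid = np.linspace(-3, 3, 121)
prior = np.full(theta_grid.size, 1.0 / theta_grid.size)

x = rng.normal(0.7, 1.0, size=100)                          # true theta_0 = 0.7
post = grid_posterior(x, theta_grid, prior, normal_loglik)
print("posterior mass within 0.2 of theta_0:",
      round(post[np.abs(theta_grid - 0.7) < 0.2].sum(), 3))
```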
Definition 1.3.1. For each n, let Π(·|X^n) be a posterior given X_1, X_2, ..., X_n.
The sequence {Π(·|X^n)} is said to be consistent at θ_0 if there is an Ω_0 ⊂ Ω with
P_{θ_0}^∞(Ω_0) = 1 such that if ω is in Ω_0, then for every neighborhood U of θ_0,
Π(U|X^n(ω)) → 1
Remark 1.3.1. When Θ is a metric space, the collection {θ : ρ(θ, θ_0) < 1/n}, n ≥ 1, forms a base
for the neighborhoods of θ_0, and hence one can allow the set of measure 1 to depend
on U. In other words, it is enough to show that for each neighborhood U of θ_0,
Π(U|X^n(ω)) → 1 a.e. P_{θ_0}^∞
Further, when Θ is a separable metric space it follows from the Portmanteau theo-
rem that consistency of the sequence {Π(·|X^n)} at θ_0 is equivalent to requiring that
{Π(·|X^n)} → δ_{θ_0} weakly, a.e. P_{θ_0}.
Thus the posterior is consistent at θ_0 if, with P_{θ_0} probability 1, as n gets large, the
posterior concentrates around θ_0.
Why should one require consistency at a particular θ_0? A Bayesian may think of
θ_0 as a plausible value and question what would happen if θ_0 were indeed the true
value and the sample size n increases. Ideally the posterior would learn from the data
and put more and more mass near θ_0. The definition of consistency captures this
requirement.
The idea goes back to Laplace, who had shown the following. If X_1, X_2, ..., X_n are
i.i.d. Bernoulli with P_θ(X = 1) = θ and π(θ) is a prior density that is continuous and
positive on (0, 1), then the posterior is consistent at all θ_0 in (0, 1). Von Mises [162]
calls this the second fundamental law of large numbers, the first being Bernoulli's
weak law of large numbers.
An elementary proof of Laplace's result for a beta prior may be of some interest.
Let the prior density with respect to Lebesgue measure on (0, 1) be
π(θ) = [Γ(α+β) / (Γ(α)Γ(β))] θ^{α−1} (1−θ)^{β−1}
Then the posterior density given X_1, X_2, ..., X_n is
[Γ(α+β+n) / (Γ(α+r)Γ(β+(n−r)))] θ^{α+r−1} (1−θ)^{β+(n−r)−1}
where r is the number of X_i's equal to 1. An easy calculation shows that the posterior
mean is
E(θ|X_1, X_2, ..., X_n) = [(α+β)/(α+β+n)] · [α/(α+β)] + [n/(α+β+n)] · (r/n),
which is a weighted combination of the consistent estimate r/n of the true value θ_0
and the prior mean α/(α+β). Because the weight of r/n goes to 1,
E(θ|X_1, X_2, ..., X_n) → θ_0 a.e. P_{θ_0}
A similar easy calculation shows that the posterior variance
Var(θ|X_1, X_2, ..., X_n) = (α+r)(β+(n−r)) / [(α+β+n)²(α+β+n+1)]
goes to 0 with probability 1 under θ_0. An application of Chebyshev's inequality com-
pletes the proof.
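Laplace's calculation is easy to reproduce numerically. The sketch below (with an arbitrary Beta(2, 3) prior and θ_0 = 0.6, chosen only for illustration) evaluates the posterior mean and variance formulas above and shows the posterior concentrating at θ_0 as n grows.

```python
import numpy as np

rng = np.random.default_rng(5)

alpha, beta, theta0 = 2.0, 3.0, 0.6          # Beta(alpha, beta) prior, true theta_0
x = rng.random(100000) < theta0              # Bernoulli(theta_0) data

for n in (10, 100, 1000, 100000):
    r = x[:n].sum()
    post_mean = (alpha + r) / (alpha + beta + n)
    post_var = ((alpha + r) * (beta + n - r)
                / ((alpha + beta + n) ** 2 * (alpha + beta + n + 1)))
    print(f"n={n:6d}  E(theta|data)={post_mean:.4f}  Var(theta|data)={post_var:.2e}")
```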
1.3.2 Posterior Consistency and Posterior Robustness
Posterior consistency is also connected with posterior robustness. A simple result is
presented next [84].
Theorem 1.3.1. Assume that the family {P_θ : θ ∈ Θ} is dominated by a σ-finite
measure µ and let p_θ denote the density of P_θ. Let θ_0 be an interior point of Θ
and π_1, π_2 be two prior densities with respect to a measure ν, which are positive and
continuous at θ_0. Let π_i(θ|X^n), i = 1, 2, denote the posterior densities of θ given X^n.
If Π_i(·|X^n), i = 1, 2, are both consistent at θ_0, then
lim_{n→∞} ∫ |π_1(θ|X^n) − π_2(θ|X^n)| dν(θ) = 0 a.s. P_{θ_0}
Proof. We will show that with P_{θ_0}^∞ probability 1,
∫_Θ π_2(θ|X^n) |1 − π_1(θ|X^n)/π_2(θ|X^n)| dν(θ) → 0
Fix δ > 0, η > 0, and ε > 0 and use the continuity at θ_0 to obtain a neighborhood
U of θ_0 such that for all θ ∈ U
|π_1(θ)/π_2(θ) − π_1(θ_0)/π_2(θ_0)| < δ
and |π_j(θ_0) − π_j(θ)| < δ for j = 1, 2.
By consistency there exists Ω_0, P_{θ_0}^∞(Ω_0) = 1, such that for ω ∈ Ω_0,
Π_j(U|X^n(ω)) = ∫_U ∏_{i=1}^n p_θ(X_i(ω)) π_j(θ) dν(θ) / ∫_Θ ∏_{i=1}^n p_θ(X_i(ω)) π_j(θ) dν(θ) → 1
Fix ω ∈ Ω_0 and choose n_0 such that, for n > n_0,
Π_j(U|X^n(ω)) ≥ 1 − η for j = 1, 2
Note that
π_1(θ|X^n)/π_2(θ|X^n) = [π_1(θ)/π_2(θ)] · [∫_Θ ∏_{i=1}^n p_θ(X_i) π_2(θ) dν(θ) / ∫_Θ ∏_{i=1}^n p_θ(X_i) π_1(θ) dν(θ)]
Hence for n > n_0 and θ ∈ U, after some easy manipulation, we have
(π_1(θ_0)/π_2(θ_0) − δ)(1 − η) [∫_U ∏_{i=1}^n p_θ(X_i(ω)) π_2(θ) dν(θ) / ∫_U ∏_{i=1}^n p_θ(X_i(ω)) π_1(θ) dν(θ)]
≤ π_1(θ|X^n(ω)) / π_2(θ|X^n(ω))
≤ (π_1(θ_0)/π_2(θ_0) + δ)(1 − η)^{-1} [∫_U ∏_{i=1}^n p_θ(X_i(ω)) π_2(θ) dν(θ) / ∫_U ∏_{i=1}^n p_θ(X_i(ω)) π_1(θ) dν(θ)]
and by the choice of U,
(π_j(θ_0) − δ) ∫_U ∏_{i=1}^n p_θ(X_i(ω)) dν(θ) ≤ ∫_U ∏_{i=1}^n p_θ(X_i(ω)) π_j(θ) dν(θ)
≤ (π_j(θ_0) + δ) ∫_U ∏_{i=1}^n p_θ(X_i(ω)) dν(θ)   (1.1)
Using (1.1) we have, again for θ ∈ U,
(π_1(θ_0)/π_2(θ_0) − δ)(1 − η) (π_2(θ_0) − δ)/(π_1(θ_0) + δ)
≤ π_1(θ|X^n(ω)) / π_2(θ|X^n(ω))
≤ (π_1(θ_0)/π_2(θ_0) + δ)(1 − η)^{-1} (π_2(θ_0) + δ)/(π_1(θ_0) − δ),
so that for δ, η small
|π_1(θ|X^n(ω)) / π_2(θ|X^n(ω)) − 1| < ε
Hence, for n > n_0,
∫ |π_1(θ|X^n(ω)) − π_2(θ|X^n(ω))| dν(θ) ≤ ∫_U π_2(θ|X^n(ω)) |1 − π_1(θ|X^n(ω))/π_2(θ|X^n(ω))| dν(θ) + 2η
≤ ε + 2η
This completes the proof.
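Theorem 1.3.1 can be illustrated in the Bernoulli case. In the sketch below (an assumed example, not from the text), two quite different prior densities, both positive and continuous at θ_0, are updated with the same data, and the L_1 distance between the two posterior densities, computed by numerical integration on a grid, shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(6)

theta0 = 0.6
x = rng.random(20000) < theta0
grid = np.linspace(1e-4, 1 - 1e-4, 4000)
dtheta = grid[1] - grid[0]

def posterior_density(prior_logpdf, r, n):
    """Posterior density on the grid for r successes in n Bernoulli trials,
    obtained by normalizing theta^r (1-theta)^(n-r) * prior(theta) numerically."""
    logpost = r * np.log(grid) + (n - r) * np.log(1 - grid) + prior_logpdf(grid)
    logpost -= logpost.max()
    dens = np.exp(logpost)
    return dens / (dens.sum() * dtheta)

flat_prior = lambda t: np.zeros_like(t)                      # Beta(1, 1)
skew_prior = lambda t: 4.0 * np.log(t) + np.log(1.0 - t)     # Beta(5, 2), up to a constant

for n in (10, 100, 1000, 20000):
    r = int(x[:n].sum())
    pi1 = posterior_density(flat_prior, r, n)
    pi2 = posterior_density(skew_prior, r, n)
    print(f"n={n:6d}  integral |pi_1 - pi_2| d(theta) ~= "
          f"{np.abs(pi1 - pi2).sum() * dtheta:.4f}")
```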
Another notion related to Theorem 1.3.1 is that of merging where, instead of the
posterior, one looks at the predictive distribution of X_{n+1}, X_{n+2}, ... given X_1, ..., X_n.
Here the attempt is to formalize the idea that two Bayesians starting with different
priors Π_1 and Π_2 would eventually agree in their prediction of the distribution of
future observations.
For a prior Π, if we define, for any measurable subset C of Ω,
λ_Π(C|X^n) = ∫_Θ P_θ^∞(C) dΠ(θ|X^n)
then λ_Π(·|X^n) is a version of the predictive distribution of X_{n+1}, X_{n+2}, ... given
X_1, X_2, ..., X_n. Note that given X^n, the predictive distribution is a probability mea-
sure on Ω = R^∞.
Let λ_{Π_1}(·|X^n) and λ_{Π_2}(·|X^n) be two predictive distributions, corresponding to
priors Π_1 and Π_2.
An early result in merging is due to Blackwell and Dubins [24]. They showed that
if Π_2 is absolutely continuous with respect to Π_1, then for θ in a set of Π_2 probability
1, the total variation distance between λ_{Π_1}(·|X^n) and λ_{Π_2}(·|X^n) goes to 0 almost
surely P_θ^∞.
The connection with consistency was observed by Diaconis and Freedman [46].
Towards this, say that the predictive distributions merge weakly with respect to P_{θ_0} if
there exists Ω_0 ⊂ Ω with P_{θ_0}^∞(Ω_0) = 1, such that for each ω ∈ Ω_0,
∫ φ dλ_{Π_1}(·|X^n(ω)) − ∫ φ dλ_{Π_2}(·|X^n(ω)) → 0
for all bounded continuous functions φ on Ω.
Proposition 1.3.1. Assume that θ ↦ P_θ is 1-1 and continuous with respect to
weak convergence. Also assume that there is a compact set K such that P_θ(K) = 1
for all θ.
If Π_1 and Π_2 are two priors such that the posteriors Π_1(·|X^n) and Π_2(·|X^n) are
consistent at θ_0, then the predictive distributions λ_{Π_1}(·|X^n) and λ_{Π_2}(·|X^n) merge
weakly with respect to P_{θ_0}.
Proof. Let G be the class of all functions on Ω that are finite linear combinations of
functions of the form
φ(ω) = ∏_{i=1}^k f_i(ω_i)
where f_1, f_2, ..., f_k are continuous functions on K. It is easy to see that if φ ∈ G then
θ ↦ ∫ φ(ω) dP_θ^∞(ω) is continuous. Further, by the Stone-Weierstrass theorem, G is
dense in the space of all continuous functions on K^∞.
From the definition of λ_{Π_1}(·|X^n) and λ_{Π_2}(·|X^n), if Ω_0 is the set where the posterior
converges to δ_{θ_0}, then for ω ∈ Ω_0 and φ ∈ G,
∫ φ dλ_{Π_i}(·|X^n(ω)) = ∫_Θ [∫ φ(ω′) dP_θ^∞(ω′)] dΠ_i(θ|X^n(ω))
The inside integral gives rise to a bounded continuous function of θ. Hence by weak
consistency at θ_0, for both i = 1, 2 the right-hand side converges to ∫ φ(ω′) dP_{θ_0}^∞(ω′).
This yields the conclusion.
Further connections between merging and posterior consistency are explored in Di-
aconis and Freedman [46].
Note a few technical remarks: According to the definition, posterior consistency is
a property that is specific to the fixed version Π(·|X^n). Measure theoretically, the
posterior is unique only up to λ_n null sets. So the posterior is uniquely defined up to
P_{θ_0}^n null sets if P_{θ_0}^n is dominated by λ_n. Without this condition it is easy to construct examples
of two versions {Π_1(·|X^n)} and {Π_2(·|X^n)} such that one is consistent and the other
is not. It is easy to show that if {P_θ : θ ∈ Θ} are all mutually absolutely continuous and
{Π_1(·|X^n)} and {Π_2(·|X^n)} are two versions of the posterior, then {Π_1(·|X^n)} is
consistent iff {Π_2(·|X^n)} is.
1.3.3 Doob’s Theorem
An early result on consistency is the following theorem of Doob [49].
Theorem 1.3.2. Suppose that Θ and X are both complete separable metric spaces
endowed with their respective Borel σ-algebras B(Θ) and A, and let θ ↦ P_θ be 1-1.
Let Π be a prior and {Π(·|X^n)} be a posterior. Then there exists a Θ_0 ⊂ Θ, with
Π(Θ_0) = 1, such that {Π(·|X^n)}_{n≥1} is consistent at every θ ∈ Θ_0.
Proof. The basic idea of the proof is simple. On the one hand, because for each θ
the empirical distribution converges a.s. P_θ^∞ to P_θ, given any sequence of x_i's we can
pinpoint the true θ. On the other hand, any version of the posterior distributions
Π(·|X^n), via the martingale convergence theorem, converges a.s. with respect to the
marginal λ_Π, to the posterior given the entire sequence. One then equates these two
versions to get the result. A formal proof of these observations needs subtle measure
theory.
As before, let Ω = X^N, let B be the product σ-algebra on Ω, and let λ_Π denote both the joint
distribution of θ and X_1, X_2, ... and the marginal distribution of X_1, X_2, .... Let C
be a subset of Θ; then by the martingale convergence theorem, as n → ∞,
Π(C|X_1, X_2, ..., X_n) → E(I_C|X_1, X_2, ...) = f a.e. λ_Π
We point out that the functions considered above are, formally, functions of two
variables (θ, ω). I_C is to be interpreted as I_{C×Ω} and f is to be thought of as f(θ, ω) =
f(ω), and so on.
We shall show that there exists a set Θ_0 with Π(Θ_0) = 1 such that
for θ ∈ Θ_0 ∩ C, f = 1 a.e. P_θ^∞   (1.2)
This would establish the theorem. To see this, take U = {U_1, U_2, ...}, a base for the
open sets of Θ. Take C = U_i in the above step and obtain the corresponding Θ_{0i} ⊂ Θ
satisfying (1.2). If we set Θ_0 = ∩_i Θ_{0i}, then (1.2) translates into "the posterior is
consistent at all θ ∈ Θ_0".
To establish (1.2), let A_0 be a countable algebra generating A. Let
E = {(θ, ω) : lim_{n→∞} (1/n) Σ_{i=1}^n δ_{X_i(ω)}(A) = P_θ(A) for all A ∈ A_0}
The set E, since it arises from the limit of a sequence of measurable functions, is
a measurable set, and further, by the law of large numbers, for each θ the sections E_θ
satisfy
(i) for all θ, P_θ^∞(E_θ) = 1;
(ii) if θ ≠ θ′, E_θ ∩ E_{θ′} = ∅.
Define
f*(ω) = 1 if ω ∈ ∪_{θ∈C} E_θ, and 0 otherwise.
It is a consequence of a deep result in set theory that ∪_{θ∈C} E_θ is measurable, from
which it follows that f* is measurable.
From its definition, f* satisfies:
1. for all θ ∈ C, f* = 1 a.e. P_θ^∞;
2. for all θ not in C, f* = 0 a.e. P_θ^∞.
In other words, for all θ, f* = I_C(θ) a.e. P_θ^∞.
We claim that f* is a version of E(I_C|X_1, X_2, ...). For any measurable set B ∈ B,
∫ I_B f* dλ_Π = ∫ [∫ I_B I_C(θ) dP_θ^∞] dΠ(θ) = ∫ I_C(θ) P_θ^∞(B) dΠ(θ) = λ_Π(C × B)
Since f and f* are both versions of E(I_C|X_1, X_2, ...), we have
f = f* a.e. λ_Π
By Fubini's theorem, there exists a set Θ_0 with Π(Θ_0) = 1, such that for θ in Θ_0,
f = f* a.e. P_θ^∞
(1.2) follows easily from properties 1 and 2 of f* mentioned earlier.
This completes the proof.
Remark 1.3.2. A well known result in set theory, the Borel Isomorphism theorem,
states that any two uncountable Borel sets of complete separable metric spaces are
isomorphic [[153], Theorem 3.3.13]. The result that we used from set theory is a
version of this theorem which states that if S and T are Borel subsets of complete
metric spaces and if φ is a 1-1 measurable function from S into T, then the range of
φ is a measurable set and φ^{-1} is also measurable. To get the result that we used, just
set S = E, T = Ω and φ(θ, ω) = ω.
Remark 1.3.3. Another consequence of the Borel Isomorphism theorem is that
Doob's theorem holds even when Θ and X are just Borel subsets of a complete
separable metric space.
Many Bayesians are satisfied with Doob’s theorem, which provides a sort of internal
consistency but fails to answer the question of consistency at a specific θ0of interest
to a Bayesian. Moreover in the infinite-dimensional case, the set of θ0values where
consistency holds may be a very small set topologically [70] and may exclude infinitely
many θ0s of interest. Disturbing examples and general results of this kind appear in
Freedman [69] in the context of an infinite-cell multinomial.
If θ0is not in the support of the prior Π then there exists an open set Usuch
that Π(U) = 0. This implies that Π(U|Xn)=0a.sλn. Hence,it is not reasonable to
expect consistency outside the support of Π. Ideally, one might hope for consistency
at all θ0in the support of Π. This is often true for a finite-dimensional Θ. However,
for an infinite-dimensional Θ this turns out to be too strong a requirement. We will
often prove consistency for a large set of θ0s . A Bayesian can then decide whether it
includes all or most of the θ0sofinterest.
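The finite-dimensional picture can be illustrated numerically. The sketch below is ours, not from the text: it simulates Bernoulli($\theta_0$) data with a Beta prior, whose support is all of $[0,1]$, and tracks the posterior probability of a small neighborhood of $\theta_0$; consistency shows up as this probability tending to 1. The prior parameters and the neighborhood radius are illustrative choices.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
theta0, a0, b0, eps = 0.3, 2.0, 2.0, 0.05   # true value, Beta(a0,b0) prior, neighborhood radius

x = rng.binomial(1, theta0, size=10_000)
for n in [10, 100, 1000, 10_000]:
    s = x[:n].sum()
    # posterior is Beta(a0 + s, b0 + n - s); mass of the eps-neighborhood of theta0
    post = beta(a0 + s, b0 + n - s)
    mass = post.cdf(theta0 + eps) - post.cdf(theta0 - eps)
    print(f"n = {n:6d}   posterior prob of neighborhood = {mass:.4f}")
```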
1.3.4 Wald-Type Conditions
We begin with a uniform strong law.
Theorem 1.3.3. Suppose that $K$ is a compact subset of a separable metric space. Let $T(\cdot,\cdot)$ be a real-valued function on $\Theta\times\mathbb{R}$ such that

(i) for each $x$, $T(\cdot, x)$ is continuous in $\theta$, and

(ii) for each $\theta$, $T(\theta, \cdot)$ is measurable.

Let $X_1, X_2, \ldots$ be i.i.d. random variables defined on $(\Omega, \mathcal{A}, P)$ with $E(T(\theta, X_1)) = \mu(\theta)$, and assume further that
$$E\Big(\sup_{\theta\in K}|T(\theta, X_i)|\Big) < \infty$$
Then, as $n \to \infty$,
$$\sup_{\theta\in K}\Big|\frac{1}{n}\sum_1^n T(\theta, X_i) - \mu(\theta)\Big| \to 0 \quad \text{a.s. } P$$
Proof. Continuity of $T(\cdot, x)$ and separability ensure that $\sup_{\theta\in K}|T(\theta, X_i)|$ is measurable. It follows from the dominated convergence theorem that $\theta\mapsto\mu(\theta)$ is continuous. Another application of the dominated convergence theorem shows that for any $\theta_0 \in K$,
$$\lim_{\delta\to 0} E\Big(\sup_{|\theta - \theta_0|<\delta}|T(\theta, X_1) - \mu(\theta) - T(\theta_0, X_1) + \mu(\theta_0)|\Big) = 0$$
Let $Z_{j,i} = \sup_{\rho(\theta, \theta_i)<\delta_i}|T(\theta, X_j) - \mu(\theta) - T(\theta_i, X_j) + \mu(\theta_i)|$. By compactness of $K$, there exist $\theta_1, \theta_2, \ldots, \theta_k$ and $\delta_1, \delta_2, \ldots, \delta_k$ such that $K = \cup_1^k\{\theta: \rho(\theta, \theta_i) < \delta_i\}$ and $EZ_{1,i} < \epsilon$ for $i = 1, 2, \ldots, k$.

By the strong law of large numbers, since $E(Z_{1,i}) < \infty$ for $i = 1, 2, \ldots, k$, there is a set $\Omega_0$ with $P(\Omega_0) = 1$ such that for $\omega \in \Omega_0$ and $n > n(\omega)$, for $i = 1, 2, \ldots, k$,
$$\frac{1}{n}\sum_1^n Z_{j,i} < 2\epsilon$$
and
$$\Big|\frac{1}{n}\sum_{j=1}^n T(\theta_i, X_j) - \mu(\theta_i)\Big| < \epsilon$$
Now if $\theta \in \{\theta: \rho(\theta, \theta_i) < \delta_i\}$,
$$\Big|\frac{1}{n}\sum T(\theta, X_j(\omega)) - \mu(\theta)\Big| \le \frac{1}{n}\sum Z_{j,i}(\omega) + \Big|\frac{1}{n}\sum T(\theta_i, X_j(\omega)) - \mu(\theta_i)\Big| \le 3\epsilon$$
Hence $\sup_{\theta\in K}|\frac{1}{n}\sum T(\theta, X_j(\omega)) - \mu(\theta)| < 3\epsilon$.
Remark 1.3.4. A very powerful approach to uniform strong laws is through empirical processes. One considers a sequence of i.i.d. random variables $X_i$ and studies uniformity over a family of functions $\mathcal{F}$ with an integrable envelope function $\phi$, i.e., $E(\phi) < \infty$ and $|f(x)| \le |\phi(x)|$, $f \in \mathcal{F}$. Good references are Pollard [[139], II.2] and van der Vaart and Wellner [[161], 2.4].
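As a numerical illustration of Theorem 1.3.3 (our sketch, not part of the text), take $T(\theta, x) = \log\{p_\theta(x)/p_{\theta_0}(x)\}$ for a $N(\theta, 1)$ model on the compact set $K = [-2, 2]$, so that $\mu(\theta) = -(\theta - \theta_0)^2/2$; the supremum over a fine grid of $K$ of $|\frac1n\sum T(\theta, X_i) - \mu(\theta)|$ should shrink as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 0.0
grid = np.linspace(-2.0, 2.0, 401)           # fine grid approximating the compact set K

def T(theta, x):
    # log likelihood ratio for the N(theta,1) model: log p_theta(x) - log p_theta0(x)
    return -(x - theta) ** 2 / 2 + (x - theta0) ** 2 / 2

mu = -(grid - theta0) ** 2 / 2               # E_{theta0} T(theta, X) for this model
x = rng.normal(theta0, 1.0, size=100_000)
for n in [100, 1000, 10_000, 100_000]:
    avg = np.array([T(t, x[:n]).mean() for t in grid])
    print(f"n = {n:6d}   sup over K of |mean - mu| = {np.abs(avg - mu).max():.4f}")
```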
Here is an easy consequence of the last theorem. First a definition: Let $\Theta$ be a space endowed with a $\sigma$-algebra and $\theta\mapsto P_\theta$ be 1-1. For each $\theta$ in $\Theta$, let $X_1, X_2, \ldots$ be i.i.d. $P_\theta$. Assume that the $P_\theta$s are dominated by a $\sigma$-finite measure $\mu$ and $p_\theta = dP_\theta/d\mu$.

Definition 1.3.2. A measurable function $\hat\theta_n(X_1, X_2, \ldots, X_n)$ taking values in $\Theta$ is called a maximum likelihood estimate (MLE) if the likelihood function at $X_1, X_2, \ldots, X_n$ attains its maximum at $\hat\theta_n(X_1, X_2, \ldots, X_n)$, or formally,
$$\prod_1^n p_{\hat\theta_n(X_1, X_2, \ldots, X_n)}(X_i) = \sup_\theta\prod_1^n p_\theta(X_i)$$

Theorem 1.3.4. Let $\Theta$ be compact metric. For a fixed $\theta_0$, let
$$T(\theta, x) = \log\big(p_\theta(x)/p_{\theta_0}(x)\big)$$
If the $T(\theta, X_i)$ satisfy the assumptions of Theorem 1.3.3 with $P = P_{\theta_0}$, then

1. any MLE $\hat\theta_n$ is consistent at $\theta_0$;

2. if $\Pi$ is a prior on $\Theta$ and if $\theta_0$ is in the support of $\Pi$, then the posterior defined by the density (with respect to $\Pi$) $\prod_1^n p_\theta(X_i)\big/\int\prod_1^n p_\theta(X_i)\, d\Pi(\theta)$ is consistent at $\theta_0$.
Proof. (i) Take any open neighborhood $U$ of $\theta_0$ and let $K = U^c$. Note that $\mu(\theta) = E_{\theta_0}(T(\theta, X_i)) = -K(\theta_0, \theta) < 0$ for all $\theta \ne \theta_0$, where $K(\theta_0,\theta)$ is the Kullback-Leibler divergence, and hence, by the continuity of $\mu(\cdot)$, $\sup_{\theta\in K}\mu(\theta) < 0$.

On the one hand, by Theorem 1.3.3, given $0 < \epsilon < |\sup_{\theta\in K}\mu(\theta)|$, there exists $n(\omega)$ such that for $n > n(\omega)$,
$$\sup_{\theta\in K}\Big|\frac{1}{n}\sum T(\theta, X_i) - \mu(\theta)\Big| < \epsilon$$
On the other hand, $(1/n)\sum T(\hat\theta_n, X_i) \ge 0$. So $\hat\theta_n \notin K$ and hence $\hat\theta_n \in U$.

As a curiosity, we note that we have not used the measurability assumption on $\hat\theta_n$. We have shown that the samples where the MLE is consistent contain a measurable set of $P_{\theta_0}^\infty$-measure 1.
(ii) Let $U$ be a neighborhood of $\theta_0$. We shall show that $\Pi(U|X_1, X_2, \ldots, X_n) \to 1$ a.s. $P_{\theta_0}$. As before, let $K = U^c$, $T(\theta, X_i) = \log(p_\theta(X_i)/p_{\theta_0}(X_i))$ and $U_\delta = \{\theta: \rho(\theta, \theta_0) < \delta\}$. Let
$$A_1 = \inf_{\theta\in\bar U_\delta}\mu(\theta) \quad\text{and}\quad A_2 = \sup_{\theta\in K}\mu(\theta)$$
Clearly $A_1 < 0$, $A_2 < 0$. Choose $\delta$ small enough so that $U_\delta \subset U$ and $|A_1| < |A_2|$. This can be done because $\mu(\theta)$ is continuous and, as $\delta \to 0$, $\inf_{\theta\in\bar U_\delta}\mu(\theta) \to 0$.

Choose $\epsilon > 0$ such that $A_1 - \epsilon > A_2 + \epsilon$. By applying the uniform strong law of large numbers to $K$ and $\bar U_\delta$, for $\omega$ in a set of $P_{\theta_0}$-measure 1, there exists $n(\omega)$ such that for $n > n(\omega)$,
$$\Big|\frac{1}{n}\sum T(\theta, X_i) - \mu(\theta)\Big| < \epsilon \quad\text{for all } \theta \in K\cup\bar U_\delta$$
Now
$$\Pi(U|X_1, X_2, \ldots, X_n) = \frac{\int_U e^{\sum_1^n T(\theta, X_i)}\, d\Pi(\theta)}{\int_U e^{\sum_1^n T(\theta, X_i)}\, d\Pi(\theta) + \int_{U^c} e^{\sum_1^n T(\theta, X_i)}\, d\Pi(\theta)}$$
$$\ge 1\bigg/\left(1 + \frac{\int_K e^{\sum_1^n T(\theta, X_i)}\, d\Pi(\theta)}{\int_{U_\delta} e^{\sum_1^n T(\theta, X_i)}\, d\Pi(\theta)}\right) \ge 1\bigg/\left(1 + \frac{\Pi(K)\, e^{n(A_2+\epsilon)}}{\Pi(U_\delta)\, e^{n(A_1-\epsilon)}}\right)$$
Since $A_2 - A_1 + 2\epsilon < 0$ and $\Pi(U_\delta) > 0$, the last term converges to 1 as $n \to \infty$.
Remark 1.3.5. Theorem 1.3.4 is related to Wald's paper [163]. His conditions and proofs are similar, but he handles the noncompact case by assumptions of the kind given next, which ensure that the MLE $\hat\theta_n$ is inside a compact set eventually, almost surely. Here are two assumptions; we will refer to them as Wald's conditions:

1. Let $\Theta = \cup K_i$ where the $K_i$s are compact and $K_1 \subset K_2 \subset \cdots$. For any sequence $\theta_i \in K_{i-1}^c\cap K_i$, $\lim_i p(x, \theta_i) = 0$.

2. Let $\phi_i(x) = \sup_{\theta\in K_{i-1}^c}\big(\log p(x, \theta)/p(x, \theta_0)\big)$. Then $E_{\theta_0}\phi_i^+(X_1) < \infty$ for some $i$.
Assumption (1) implies that $\lim_{i\to\infty}\phi_i(x) = -\infty$. Using Assumption (2), the monotone convergence theorem and the dominated convergence theorem, one can show
$$\lim_{i\to\infty}E_{\theta_0}\phi_i(X_1) = -\infty$$
Thus, given any $A_3 < 0$, we can choose a compact set $K_j$ such that
$$E_{\theta_0}\phi_j = E_{\theta_0}\Big(\sup_{\theta\in K_{j-1}^c}\log\frac{p(X_i, \theta)}{p(X_i, \theta_0)}\Big) < A_3$$
Using
$$\frac{1}{n}\sup_{\theta\in K_j^c}\sum_1^n\log\frac{p(X_i, \theta)}{p(X_i, \theta_0)} \le \frac{1}{n}\sum_1^n\sup_{\theta\in K_j^c}\log\frac{p(X_i, \theta)}{p(X_i, \theta_0)}$$
and applying the usual SLLN to $(1/n)\sum_{i=1}^n\phi_j(X_i)$, it can be concluded that the left-hand side is eventually $< 0$ a.s. $P_{\theta_0}$. This implies that eventually $\hat\theta_n \in K_j$ a.s. $P_{\theta_0}$. This result for the compact case can now be used to establish consistency of $\hat\theta_n$.
Remark 1.3.6. Suppose $\Theta$ is a convex open subset of $\mathbb{R}^p$ and, for $\theta \in \Theta$,
$$\log f_\theta(x) = A(\theta) + \sum_1^p\theta_j x_j + \psi(x)$$
and $\partial\log f_\theta/\partial\theta$, $\partial^2\log f_\theta/\partial\theta^2$ exist. Then by Lehmann [123]
$$I(\theta) = E_\theta\Big(\frac{\partial\log f_\theta}{\partial\theta}\Big)^2 = -E_\theta\frac{\partial^2\log f_\theta}{\partial\theta^2} = -\frac{d^2A(\theta)}{d\theta^2} > 0$$
Thus the likelihood is log concave. In this case also the MLE is consistent without compactness, by a simple direct argument using Theorem 1.3.4. Start with a bounded open rectangle around $\theta_0$ and let $K$ be its closure. Because $K$ is compact, the MLE $\hat\theta_K$, with $K$ as the parameter space, exists, and given any open neighborhood $V \subset K$ of $\theta_0$, $\hat\theta_K$ lies in $V$ with probability tending to 1. If $\hat\theta_K \in V$ it must be a local maximum and hence a global maximum because of log concavity. This completes the proof. In the log concave situation more detailed and general results are available in Hjort and Pollard [101].

Remark 1.3.7. Under the assumptions of either of the last two remarks it can be verified that the posterior is consistent.
The next two examples show that even in the finite-dimensional case consistency of the MLE and the posterior do not always occur together.

Example. This example is due to Bahadur. Our presentation follows Lehmann [124]. Here $\Theta = \{1, 2, \ldots\}$. For each $\theta$, we define a density $f_\theta$ on $[0,1]$ as follows:

Let $h(x) = e^{1/x^2}$. Define $a_0 = 1$ and $a_n$ by $\int_{a_n}^{a_{n-1}}(h(x) - C)\, dx = 1 - C$, where $0 < C < 1$. Because $\int_0^1 e^{1/x^2}\, dx = \infty$, it is easy to show that the $a_n$s are unique and tend to 0 as $n \to \infty$.

Define $f_k(x)$ on $[0,1]$ by
$$f_k(x) = \begin{cases}h(x) & \text{if } a_k < x < a_{k-1}\\ C & \text{otherwise}\end{cases}$$
For each $k$, let $X_1, X_2, \ldots, X_n$ be i.i.d. $f_k$. Denoting $\min(X_1, X_2, \ldots, X_n)$ by $X^{(n)}_{(1)}$, we can write the likelihood function as
$$L_{X_1, X_2, \ldots, X_n}(k) = \begin{cases}C^n & \text{if } a_{k-1} < X^{(n)}_{(1)}\\ \prod_i d_i & \text{if } a_{k-1} > X^{(n)}_{(1)}\end{cases}$$
where $d_i = I_{A_k}(X_i)\,h(X_i) + I_{A_k^c}(X_i)\,C$ and $A_k = (a_k, a_{k-1}]$.

Because $h(x) > 1$, the likelihood function attains its maximum in the finite set $\{k: a_{k-1} > X^{(n)}_{(1)}\}$, and hence an MLE exists.
Fix $j \in \Theta$. We shall show that any MLE $\hat\theta_n$ fails to be consistent at $j$ by showing
$$P_j\Big(\sum_1^n\log\frac{f_{\hat\theta_n}(X_i)}{f_j(X_i)} > 1\Big) \to 1$$
Actually, we show more, namely, for each $j$, $\hat\theta_n$ converges in $P_j$-probability to $\infty$. Fix $m$ and consider the set $\Theta_1 = \{1, 2, \ldots, m\}\subset\Theta$. It is enough to show that, as $n \to \infty$,
$$P_j\{\hat\theta_n \notin \Theta_1\} \to 1$$
Define $k^*_n = k^*(X_1, X_2, \ldots, X_n)$ to be $k$ if $X^{(n)}_{(1)} \in (a_k, a_{k-1})$. Because the likelihood function at $\hat\theta_n$ is larger than that at $k^*_n$, it suffices to show that
$$\sum_1^n\log\frac{f_{k^*_n}(X_i)}{f_j(X_i)} \to \infty \quad\text{in } P_j\text{-probability}$$
Toward this, first note that for any $k$ and $j$,
$$\sum_1^n\log\frac{f_k(X_i)}{f_j(X_i)} = \sum^{(k)}\log\frac{h(X_i)}{C} - \sum^{(j)}\log\frac{h(X_i)}{C}$$
where $\sum^{(k)}$ is the sum over all $i$ such that $X_i \in (a_k, a_{k-1})$. With $k^*_n$ in place of $k$, we have
$$\sum_1^n\log\frac{f_{k^*_n}(X_i)}{f_j(X_i)} = \sum^{(*)}\log\frac{h(X_i)}{C} - \sum^{(j)}\log\frac{h(X_i)}{C}$$
where $\sum^{(*)}$ is the sum over all $i$ such that $X_i \in (a_{k^*_n}, a_{k^*_n-1})$.

Because for each $x$, $h(x)/C > 1$, the first sum on the right-hand side is larger than $\log(h(X^{(n)}_{(1)})/C)$, one of the terms appearing in the sum. Formally,
$$\sum^{(*)}\log\frac{h(X_i)}{C} \ge \log\frac{h(X^{(n)}_{(1)})}{C}$$
On the other hand, because $h$ is decreasing,
$$\sum^{(j)}\log\frac{h(X_i)}{C} \le \nu_{k,n}\log\frac{h(a_j)}{C}$$
where $\nu_{k,n}$ is the number of $X_i$s in $(a_j, a_{j-1})$.

Thus
$$\frac{1}{n}\sum_1^n\log\frac{f_{k^*_n}(X_i)}{f_j(X_i)} \ge \frac{1}{n}\log\frac{h(X^{(n)}_{(1)})}{C} - \frac{1}{n}\nu_{k,n}\log\frac{h(a_j)}{C}$$
Because $(1/n)\nu_{k,n} \to P_j(a_j, a_{j-1})$, the second term converges to a finite constant. We complete the proof by showing
$$\frac{1}{n}\log h(X^{(n)}_{(1)}) = \frac{1}{n}\,\frac{1}{(X^{(n)}_{(1)})^2} \to \infty$$
in $P_j$-probability.

Toward this, consider $X \sim P_j$ and $Y \sim U(0, 1/C)$. Then for all $x$,
$$P(X > x) \le P(Y > x)$$
To see this, $P(Y > x) = 1 - Cx$, and for $P(X > x)$ note that

if $x > a_{j-1}$ then $P(X > x) = C(1 - x) < 1 - Cx$;

if $x \in (a_j, a_{j-1})$ then, since $h > C$ on $(a_j, x)$, $P(X > x) = 1 - Ca_j - \int_{a_j}^x h(t)\, dt \le 1 - Ca_j - C(x - a_j) = 1 - Cx$;

and if $x < a_j$, then $P(X > x) = 1 - Cx$.

Consequently $X^{(n)}_{(1)}$ is stochastically smaller than $Y^{(n)}_{(1)}$ and, because $h$ is decreasing, $P\{h(X^{(n)}_{(1)}) > x\} \ge P\{h(Y^{(n)}_{(1)}) > x\}$.

Therefore, to show that $(1/n)\log h(X^{(n)}_{(1)}) \to \infty$ in $P_j$-probability, it is enough to show that $(1/n)\log h(Y^{(n)}_{(1)}) \to \infty$ in $U(0, 1/C)$-probability. This follows because
$$\frac{1}{n}\log h(Y^{(n)}_{(1)}) = \frac{1}{n}\,\frac{1}{(Y^{(n)}_{(1)})^2}$$
and an easy computation shows that $nY^{(n)}_{(1)}$ has a limiting distribution and is hence bounded in probability, while $Y^{(n)}_{(1)} \to 0$ a.s.
On the other hand, $\Theta$ being countable, Doob's theorem assures consistency of the posterior at all $j \in \Theta$. This result also follows from Schwartz's theorem, which provides more insight into the behavior of the posterior.

Intuitively, a Bayesian with a proper prior is better off in such situations because a proper prior assigns small probability to large values of $k$, which cause problems for $\hat\theta_n$. For an illuminating discussion of integrating rather than maximizing the likelihood, see the discussion of a counterexample due to Stein in [9].
Example. This is an example where the posterior fails to be consistent at a $\theta_0$ in the support of $\Pi$. It is modeled after an example of Schwartz [145], but is much simpler. Here $\Theta$ is finite-dimensional. In the infinite-dimensional case there are many such examples due to Freedman [69] and Diaconis and Freedman [46], [45].

Let $\Theta = (0,1)\cup(2,3)$ and let $X_1, X_2, \ldots, X_n$ be i.i.d. $U(0, \theta)$. Let $\theta_0 = 1$. $\Pi$ is a prior with density $\pi$, which is positive and continuous on $\Theta$ with $\pi(\theta) = e^{-1/(\theta-\theta_0)^2}$ on $(0,1)$. Because $\int_0^1\pi(\theta)\, d\theta < 1$, there exists such a prior density $\pi$, which is also positive on $(2,3)$.
We will argue that the posterior fails to be consistent at $\theta_0$ by showing that the posterior probability of $(2,3)$ goes to one in $P_{\theta_0}$-probability. The proof rests on the following facts, both of which are easy to verify:

Let $X_{(n)}$ denote the maximum of $X_1, X_2, \ldots, X_n$. Then under $P_{\theta_0}$, i.e., under $U(0,1)$, $n(X_{(n)} - \theta_0) = O_P(1)$. In fact, $n(X_{(n)} - \theta_0)$ converges to an exponential distribution.

The second fact is that $(1/n)\log(1 - X_{(n)}^{n-1}) \to 0$ in $P_{\theta_0}$-probability, because by direct calculation the distribution of $(1 - X_{(n)}^{n-1})$ converges weakly to $U(0,1)$.

Now the posterior probability of $(2,3)$ is given by
$$\frac{\int_2^3\frac{1}{\theta^n}I_{(0,\theta)}(X_{(n)})\,\pi(\theta)\, d\theta}{\int_0^1\frac{1}{\theta^n}I_{(0,\theta)}(X_{(n)})\,\pi(\theta)\, d\theta + \int_2^3\frac{1}{\theta^n}I_{(0,\theta)}(X_{(n)})\,\pi(\theta)\, d\theta}$$
Because $0 \le X_{(n)} \le 1$ a.e. $P_{\theta_0}$, the numerator is equal to $\int_2^3\theta^{-n}\pi(\theta)\, d\theta$ and the first integral in the denominator is $\int_{X_{(n)}}^1\theta^{-n}\pi(\theta)\, d\theta$. So the posterior probability of $(2,3)$ reduces to
$$\frac{1}{1 + \dfrac{\int_{X_{(n)}}^1\theta^{-n}\pi(\theta)\, d\theta}{\int_2^3\theta^{-n}\pi(\theta)\, d\theta}} = \frac{1}{1 + I_1/I_2}$$
Now
$$I_1 \le \pi(X_{(n)})\int_{X_{(n)}}^1\theta^{-n}\, d\theta = \frac{\pi(X_{(n)})}{n-1}\cdot\frac{1 - X_{(n)}^{n-1}}{X_{(n)}^{n-1}}$$
and $(1/n)\log I_1$ is less than
$$-\frac{n-1}{n}\log X_{(n)} - \frac{\log(n-1)}{n} + \frac{1}{n}\log(1 - X_{(n)}^{n-1}) + \frac{1}{n}\log\pi(X_{(n)})$$
As $n \to \infty$ the first two terms on the right side go to 0. The third goes to 0 by the second fact. The last term, using the explicit form of $\pi$ on $(0,1)$, goes to $-\infty$ in $P_{\theta_0}$-probability. Thus $(1/n)\log I_1 \to -\infty$ in $P_{\theta_0}$-probability.

On the other hand,
$$\frac{1}{3^n}\Pi(2,3) < \int_2^3\frac{1}{\theta^n}\pi(\theta)\, d\theta < \frac{1}{2^n}\Pi(2,3)$$
Hence $(1/n)\log I_2$ is bounded between $-\log 3 + \frac{1}{n}\log\Pi(2,3)$ and $-\log 2 + \frac{1}{n}\log\Pi(2,3)$,
and thus $\log(I_1/I_2) \to -\infty$ in $P_{\theta_0}$-probability. Equivalently, $I_1/I_2 \to 0$ in $P_{\theta_0}$-probability.

In this example, the MLE is consistent. We could have taken the parameter space to be $[\epsilon, 1]\cup[2,3]$ and ensured compactness. What goes wrong here, as we shall later recognize, is the lack of continuity of the Kullback-Leibler information and, of course, the behavior of $\Pi$ in the neighborhood of $\theta_0$. If a prior $\Pi$ satisfies $\Pi(\theta_0, \theta_0 + h) > 0$ for all $h > 0$, then similar calculations, or an application of the Schwartz theorem to be proved later, show that the posterior is consistent.
Remark 1.3.8. We have seen that consistency of the MLE neither implies nor is implied by consistency of the posterior. The following condition implies both. Let $V$ be any open set containing $\theta_0$. Then the condition is
$$\sup_{\theta\in V^c}\prod_1^n f_\theta(X_i)/f_{\theta_0}(X_i) \to 0 \quad\text{a.s. } P_{\theta_0}$$
The hypotheses of Theorem 1.3.4 imply this stronger condition.
1.4 Asymptotic Normality of MLE and
Bernstein–von Mises Theorem
A standard result in the asymptotic theory of maximum likelihood estimates is its asymptotic normality. In this section we briefly review this and its Bayesian parallel, the Bernstein–von Mises theorem, on the asymptotic normality of the posterior distribution. A word about the asymptotic normality of the MLE: this is really a result about the consistent roots of the likelihood equation $\partial\log f_\theta/\partial\theta = 0$. If a global MLE $\hat\theta_n$ exists and is consistent, then under a differentiability assumption it is easy to see that for each $P_{\theta_0}$, $\hat\theta_n$ is a consistent solution of the likelihood equation almost surely $P_{\theta_0}$. On the other hand, if $f_\theta$ is differentiable in $\theta$, then for each $\theta_0$ it is possible to construct [Serfling [147] 33.3; Cramér [35]] a sequence $T_n$ that is a solution of the likelihood equation and that converges to $\theta_0$. The problem, of course, is that $T_n$ depends on $\theta_0$ and so will not qualify as an estimator. If there exists a consistent estimate $\theta_n^*$, then a consistent sequence that is also a solution of the likelihood equation can be constructed by picking $\hat\theta_n$ to be the solution closest to $\theta_n^*$. For a sketch of this argument, see Ghosh [89].

As before, let $X_1, X_2, \ldots, X_n$ be i.i.d. $f_\theta$, where $f_\theta$ is a density with respect to some dominating measure $\mu$ and $\theta \in \Theta$, and $\Theta$ is an open subset of $\mathbb{R}$. We make the following regularity assumptions on $f_\theta$:
(i) $\{x: f_\theta(x) > 0\}$ is the same for all $\theta \in \Theta$;

(ii) $L(\theta, x) = \log f_\theta(x)$ is thrice differentiable with respect to $\theta$ in a neighborhood $(\theta_0 - \delta, \theta_0 + \delta)$. If $\dot L$, $\ddot L$ and $\dddot L$ stand for the first, second, and third derivatives, then $E_{\theta_0}\dot L(\theta_0)$ and $E_{\theta_0}\ddot L(\theta_0)$ are both finite and
$$\sup_{\theta\in(\theta_0-\delta,\theta_0+\delta)}|\dddot L(\theta, x)| < M(x) \quad\text{and}\quad E_{\theta_0}M < \infty;$$

(iii) interchange of the order of expectation with respect to $\theta_0$ and differentiation at $\theta_0$ is justified, so that
$$E_{\theta_0}\dot L(\theta_0) = 0, \qquad E_{\theta_0}\ddot L(\theta_0) = -E_{\theta_0}\big(\dot L(\theta_0)\big)^2;$$

(iv) $I(\theta_0) \doteq E_{\theta_0}\big(\dot L(\theta_0)\big)^2 > 0$.

Theorem 1.4.1. If $\{f_\theta: \theta\in\Theta\}$ satisfies conditions (i)–(iv) and if $\hat\theta_n$ is a consistent solution of the likelihood equation, then $\sqrt n(\hat\theta_n - \theta_0)\xrightarrow{D}N(0, 1/I(\theta_0))$.
Proof. Let $L_n(\theta) = \sum_1^n L(\theta, X_i)$. By Taylor expansion,
$$0 = \dot L_n(\hat\theta_n) = \dot L_n(\theta_0) + (\hat\theta_n - \theta_0)\ddot L_n(\theta_0) + \frac{(\hat\theta_n - \theta_0)^2}{2}\dddot L_n(\theta')$$
where $\theta'$ lies between $\theta_0$ and $\hat\theta_n$. Thus,
$$\sqrt n(\hat\theta_n - \theta_0) = \frac{\frac{1}{\sqrt n}\dot L_n(\theta_0)}{-\frac{1}{n}\ddot L_n(\theta_0) - \frac{1}{2}(\hat\theta_n - \theta_0)\frac{1}{n}\dddot L_n(\theta')}$$
By the central limit theorem, the numerator converges in distribution to $N(0, I(\theta_0))$; the first term in the denominator goes to $I(\theta_0)$ by the SLLN; the second term is $o_P(1)$ by the assumptions on $\hat\theta_n$ and $\dddot L$.
We next turn to asymptotic normality of the posterior. We wish to prove that if $\hat\theta_n$ is a consistent solution of the likelihood equation, then the posterior distribution of $\sqrt n(\theta - \hat\theta_n)$ is approximately $N(0, 1/I(\theta_0))$. Early forms of this theorem go back to Laplace, Bernstein, and von Mises [see [46] for references]. A version of this theorem appears in Lehmann [124]; condition (v) in Theorem 1.4.2 is taken from there. Other related references are Bickel and Yahav [20], Walker [164], LeCam [121], [120] and Borwanker et al. [27]. Ghosal [75, 76, 77] has developed posterior normality results in cases where the dimension of the parameter space is increasing. Further refinements developing asymptotic expansions appear in Johnson [107], [108], Kadane and Tierney [158] and Woodroofe [173]. Lindley [129], Johnson [108] and Ghosh et al. [82] provide expansions of the posterior that refine posterior normality. See the next section for an alternative unified treatment of regular and nonregular cases.
Theorem 1.4.2. Suppose $\{f_\theta: \theta\in\Theta\}$ satisfies assumptions (i)–(iv) of Theorem 1.4.1 and $\hat\theta_n$ is a consistent solution of the likelihood equation. Further, suppose

(v) for any $\delta > 0$, there exists an $\epsilon > 0$ such that
$$P_{\theta_0}\Big(\sup_{|\theta-\theta_0|>\delta}\frac{1}{n}\big(L_n(\theta) - L_n(\theta_0)\big) \le -\epsilon\Big) \to 1;$$

(vi) the prior has a density $\pi(\theta)$ with respect to Lebesgue measure, which is continuous and positive at $\theta_0$.

Let $X^n$ stand for $X_1, X_2, \ldots, X_n$ and $f_\theta(X^n)$ for its joint density. Denote by $\pi^*(s|X^n)$ the posterior density of $s = \sqrt n(\theta - \hat\theta_n(X^n))$. Then as $n \to \infty$,
$$\int_{\mathbb{R}}\Big|\pi^*(s|X^n) - \sqrt{\tfrac{I(\theta_0)}{2\pi}}\,e^{-s^2I(\theta_0)/2}\Big|\, ds \xrightarrow{P_{\theta_0}} 0 \tag{1.3}$$

Proof. Because $s = \sqrt n(\theta - \hat\theta_n)$,
$$\pi^*(s|X^n) = \frac{\pi(\hat\theta_n + s/\sqrt n)\,f_{\hat\theta_n+s/\sqrt n}(X^n)}{\int_{\mathbb{R}}\pi(\hat\theta_n + t/\sqrt n)\,f_{\hat\theta_n+t/\sqrt n}(X^n)\, dt}$$
To avoid notational mess, we suppress the $X^n$ and rewrite the last expression as
$$\frac{\pi(\hat\theta_n + s/\sqrt n)\,e^{L_n(\hat\theta_n+s/\sqrt n)-L_n(\hat\theta_n)}}{\int_{\mathbb{R}}\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)}\, dt}$$
Thus we need to show
$$\int_{\mathbb{R}}\Bigg|\frac{\pi(\hat\theta_n + s/\sqrt n)\,e^{L_n(\hat\theta_n+s/\sqrt n)-L_n(\hat\theta_n)}}{\int_{\mathbb{R}}\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)}\, dt} - \sqrt{\tfrac{I(\theta_0)}{2\pi}}\,e^{-s^2I(\theta_0)/2}\Bigg|\, ds \xrightarrow{P_{\theta_0}} 0 \tag{1.4}$$
It is enough to show that
$$\int_{\mathbb{R}}\Big|\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)} - \pi(\theta_0)\,e^{-t^2I(\theta_0)/2}\Big|\, dt \xrightarrow{P_{\theta_0}} 0 \tag{1.5}$$
To see this, note that, writing $C_n$ for $\int_{\mathbb{R}}\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)}\, dt$, (1.4) is
$$C_n^{-1}\int_{\mathbb{R}}\Big|\pi(\hat\theta_n + s/\sqrt n)\,e^{L_n(\hat\theta_n+s/\sqrt n)-L_n(\hat\theta_n)} - C_n\sqrt{\tfrac{I(\theta_0)}{2\pi}}\,e^{-s^2I(\theta_0)/2}\Big|\, ds \xrightarrow{P_{\theta_0}} 0$$
Because (1.5) implies that $C_n \to \pi(\theta_0)\sqrt{2\pi/I(\theta_0)}$, it is enough to show that the integral inside the brackets goes to 0 in probability, and this term is less than $I_1 + I_2$, where
$$I_1 = \int_{\mathbb{R}}\Big|\pi(\hat\theta_n + s/\sqrt n)\,e^{L_n(\hat\theta_n+s/\sqrt n)-L_n(\hat\theta_n)} - \pi(\theta_0)\,e^{-s^2I(\theta_0)/2}\Big|\, ds$$
and
$$I_2 = \int_{\mathbb{R}}\Big|\pi(\theta_0)\,e^{-s^2I(\theta_0)/2} - C_n\sqrt{\tfrac{I(\theta_0)}{2\pi}}\,e^{-s^2I(\theta_0)/2}\Big|\, ds$$
Now $I_1$ goes to 0 by (1.5), and $I_2$ is equal to
$$\Big|\pi(\theta_0) - C_n\sqrt{\tfrac{I(\theta_0)}{2\pi}}\Big|\int_{\mathbb{R}}e^{-s^2I(\theta_0)/2}\, ds$$
which goes to 0 because $C_n \to \pi(\theta_0)\sqrt{2\pi/I(\theta_0)}$.

To achieve a further reduction, set
$$h_n = -\frac{1}{n}\sum_1^n\ddot L(\hat\theta_n, X_i) = -\frac{1}{n}\ddot L_n(\hat\theta_n)$$
Because, as $n \to \infty$, $h_n \to I(\theta_0)$ a.s. $P_{\theta_0}$, to verify (1.5) it is enough if we show that
$$\int_{\mathbb{R}}\Big|\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)} - \pi(\hat\theta_n)\,e^{-t^2h_n/2}\Big|\, dt \xrightarrow{P_{\theta_0}} 0 \tag{1.6}$$
To show (1.6), given any $\delta, c > 0$, we break $\mathbb{R}$ into three regions:
$$A_1 = \{t: |t| < c\sqrt{\log n}\},\qquad A_2 = \{t: c\sqrt{\log n} < |t| < \delta\sqrt n\},\qquad A_3 = \{t: |t| > \delta\sqrt n\}.$$
We begin with $A_3$:
$$\int_{A_3}\Big|\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)} - \pi(\hat\theta_n)\,e^{-t^2h_n/2}\Big|\, dt$$
$$\le \int_{A_3}\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)}\, dt + \int_{A_3}\pi(\hat\theta_n)\,e^{-t^2h_n/2}\, dt$$
The first integral goes to 0 by assumption (v). The second is seen to go to 0 by the usual tail estimates for a normal.

Because $\hat\theta_n \to \theta_0$, by Taylor expansion, for large $n$,
$$L_n(\hat\theta_n + t/\sqrt n) - L_n(\hat\theta_n) = \frac{t^2}{2n}\ddot L_n(\hat\theta_n) + \frac{1}{6}\Big(\frac{t}{\sqrt n}\Big)^3\dddot L_n(\theta') = -\frac{t^2h_n}{2} + R_n$$
for some $\theta'$ between $\hat\theta_n$ and $\hat\theta_n + t/\sqrt n$. Now consider
$$\int_{A_1}\Big|\pi(\hat\theta_n + t/\sqrt n)\,e^{-t^2h_n/2+R_n} - \pi(\hat\theta_n)\,e^{-t^2h_n/2}\Big|\, dt$$
$$\le \int_{A_1}\pi(\hat\theta_n + t/\sqrt n)\Big|e^{-t^2h_n/2+R_n} - e^{-t^2h_n/2}\Big|\, dt + \int_{A_1}\Big|\pi(\hat\theta_n + t/\sqrt n) - \pi(\hat\theta_n)\Big|e^{-t^2h_n/2}\, dt$$
Because $\pi$ is continuous at $\theta_0$, the second integral goes to 0 in $P_{\theta_0}$-probability. The first integral equals
$$\int_{A_1}\pi(\hat\theta_n + t/\sqrt n)\,e^{-t^2h_n/2}\big|e^{R_n}-1\big|\, dt \le \int_{A_1}\pi(\hat\theta_n + t/\sqrt n)\,e^{-t^2h_n/2}\,e^{|R_n|}|R_n|\, dt \tag{1.7}$$
Now,
$$\sup_{t\in A_1}|R_n| = \sup_{t\in A_1}\Big|\frac{1}{6}\Big(\frac{t}{\sqrt n}\Big)^3\dddot L_n(\theta')\Big| \le \frac{c^3(\log n)^{3/2}}{\sqrt n}\,O_P(1) = o_P(1)$$
and hence (1.7) is
$$\le \sup_{t\in A_1}\pi(\hat\theta_n + t/\sqrt n)\int_{A_1}e^{-t^2h_n/2}\,e^{|R_n|}|R_n|\, dt = o_P(1)$$
Next consider
$$\int_{A_2}\Big|\pi(\hat\theta_n + t/\sqrt n)\,e^{-t^2h_n/2+R_n} - \pi(\hat\theta_n)\,e^{-t^2h_n/2}\Big|\, dt$$
$$\le \int_{A_2}\pi(\hat\theta_n + t/\sqrt n)\,e^{-t^2h_n/2+R_n}\, dt + \int_{A_2}\pi(\hat\theta_n)\,e^{-t^2h_n/2}\, dt$$
The second integral is
$$\le 2\,\pi(\hat\theta_n)\,e^{-h_nc^2\log n/2}\,\big[\delta\sqrt n - c\sqrt{\log n}\,\big] \le 2\delta\,\pi(\hat\theta_n)\,\sqrt n\;n^{-c^2h_n/2}$$
so that by choosing $c$ large, the integral goes to 0 in $P_{\theta_0}$-probability.

For the first integral, because $t \in A_2$ and $c\sqrt{\log n} < |t| < \delta\sqrt n$, we have $|t|/\sqrt n < \delta$. Thus
$$|R_n| = \Big(\frac{|t|}{\sqrt n}\Big)^3\frac{1}{6}\big|\dddot L_n(\theta')\big| \le \frac{\delta t^2}{6}\,\frac{1}{n}\big|\dddot L_n(\theta')\big|$$
Because $\sup_{\theta\in(\theta_0-\delta,\theta_0+\delta)}(1/n)\big|\dddot L_n(\theta)\big|$ is $O_P(1)$, by choosing $\delta$ small we can ensure that
$$P_{\theta_0}\Big(|R_n| < \frac{t^2h_n}{4}\ \text{for all } t\in A_2\Big) > 1 - \epsilon \quad\text{for } n > n_0 \tag{1.8}$$
or
$$P_{\theta_0}\Big(-\frac{t^2h_n}{2} + R_n < -\frac{t^2h_n}{4}\ \text{for all } t\in A_2\Big) > 1 - \epsilon \tag{1.9}$$
Hence, with probability greater than $1 - \epsilon$,
$$\int_{A_2}\pi(\hat\theta_n + t/\sqrt n)\,e^{-t^2h_n/2+R_n}\, dt \le \sup_{t\in A_2}\pi(\hat\theta_n + t/\sqrt n)\int_{A_2}e^{-t^2h_n/4}\, dt \to 0 \quad\text{as } n\to\infty$$
Finally, the three steps can be put together, first by choosing a $\delta$ to ensure (1.8) and then by working with this $\delta$ in steps 1 and 3.
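Here is a small numerical check of Theorem 1.4.2 (our sketch with an assumed Bernoulli model and Beta prior, not from the text): the exact posterior is Beta, so the $L_1$-distance in (1.3) between the posterior density of $s = \sqrt n(\theta - \hat\theta_n)$ and the $N(0, 1/I(\theta_0))$ density can be computed by quadrature and watched shrink.

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(4)
theta0, a0, b0 = 0.3, 2.0, 3.0               # true value and Beta(a0, b0) prior (illustrative)
I0 = 1.0 / (theta0 * (1.0 - theta0))         # Fisher information for Bernoulli at theta0

x = rng.binomial(1, theta0, size=20_000)
s_grid = np.linspace(-8.0, 8.0, 4001)
for n in [50, 500, 5000, 20_000]:
    k = x[:n].sum()
    mle = k / n                               # consistent root of the likelihood equation
    post = beta(a0 + k, b0 + n - k)
    # posterior density of s = sqrt(n)(theta - mle), by change of variables
    dens_s = post.pdf(mle + s_grid / np.sqrt(n)) / np.sqrt(n)
    l1 = np.trapz(np.abs(dens_s - norm.pdf(s_grid, scale=1 / np.sqrt(I0))), s_grid)
    print(f"n = {n:6d}   L1 distance = {l1:.4f}")
```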
An asymptotic normality result also holds for Bayes estimates.

Theorem 1.4.3. In addition to the assumptions of Theorem 1.4.2, assume that $\int|\theta|\pi(\theta)\, d\theta < \infty$. Let $\theta_n^* = \int_{\mathbb{R}}\theta\,\Pi(d\theta|X_1, X_2, \ldots, X_n)$ be the Bayes estimate with respect to squared error loss. Then

(i) $\sqrt n(\hat\theta_n - \theta_n^*) \to 0$ in $P_{\theta_0}$-probability;

(ii) $\sqrt n(\theta_n^* - \theta_0)$ converges in distribution to $N(0, 1/I(\theta_0))$.
Proof. The assumption of a finite moment for $\pi$ and a slight refinement of detail in the proof of Theorem 1.4.2 strengthen the assertion to
$$\int_{\mathbb{R}}(1+|t|)\Big|\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)} - \pi(\hat\theta_n)\,e^{-t^2h_n/2}\Big|\, dt \xrightarrow{P_{\theta_0}} 0 \tag{1.10}$$
Consequently
$$\int_{\mathbb{R}}(1+|t|)\Big|\pi^*(t|X^n) - \sqrt{\tfrac{I(\theta_0)}{2\pi}}\,e^{-t^2I(\theta_0)/2}\Big|\, dt \xrightarrow{P_{\theta_0}} 0$$
and hence $\int_{\mathbb{R}}t\Big(\pi^*(t|X^n) - \sqrt{I(\theta_0)/2\pi}\,e^{-t^2I(\theta_0)/2}\Big)\, dt \xrightarrow{P_{\theta_0}} 0$. Note that because
$$\sqrt{\tfrac{I(\theta_0)}{2\pi}}\int_{\mathbb{R}}t\,e^{-t^2I(\theta_0)/2}\, dt = 0$$
we have $\int_{\mathbb{R}}t\,\pi^*(dt|X^n) \to 0$.

To relate these observations to the theorem, note that
$$\theta_n^* = \int_{\mathbb{R}}\theta\,\Pi(d\theta|X_1, X_2, \ldots, X_n) = \int_{\mathbb{R}}\Big(\hat\theta_n + \frac{t}{\sqrt n}\Big)\pi^*(dt|X^n)$$
and hence $\sqrt n(\theta_n^* - \hat\theta_n) = \int_{\mathbb{R}}t\,\pi^*(dt|X^n)$.

Assertion (ii) follows from (i) and the asymptotic normality of $\hat\theta_n$ discussed earlier.
Remark 1.4.1. This theorem shows that the posterior mean of $\theta$ can be approximated by $\hat\theta_n$ up to an error of $o_P(n^{-1/2})$. Actually, under stronger assumptions one can show [82] that the error is of the order of $n^{-1}$. A result of this type also holds for the posterior variance.

Remark 1.4.2. With a stronger version of assumption (v), namely, for any $\delta$,
$$\sup_{|\theta-\theta_0|>\delta}\frac{1}{n}\big[L_n(\theta) - L_n(\theta_0)\big] \le -\epsilon \quad\text{eventually a.e. } P_{\theta_0}$$
and $\hat\theta_n \to \theta_0$ a.s., we can have the $L_1$-distance in (1.3) go to 0 a.s. $P_{\theta_0}$.
Remark 1.4.3. If we have almost sure convergence at each $\theta_0$, then by Fubini, the $L_1$-distance evaluated with respect to the joint distribution of $\theta, X_1, X_2, \ldots, X_n$ goes to 0. For refinements of such results see [82].

Remark 1.4.4. Multiparameter extensions follow in a similar way.

Remark 1.4.5. It follows immediately from (1.5) that
$$\log\int_{\mathbb{R}}\prod_1^n f_\theta(X_i)\,\pi(\theta)\, d\theta = L_n(\hat\theta_n) + \log C_n - \frac{1}{2}\log n$$
$$= L_n(\hat\theta_n) - \frac{1}{2}\log n + \frac{1}{2}\log 2\pi - \frac{1}{2}\log I(\theta_0) + \log\pi(\theta_0) + o_P(1)$$
In the multiparameter case with a $p$-dimensional parameter, this would become
$$\log\int_{\mathbb{R}^p}\prod_1^n f_\theta(X_i)\,\pi(\theta)\, d\theta = L_n(\hat\theta_n) - \frac{p}{2}\log n + \frac{p}{2}\log 2\pi - \frac{1}{2}\log\|I(\theta_0)\| + \log\pi(\theta_0) + o_P(1)$$
where $\|I(\theta_0)\|$ stands for the determinant of the Fisher information matrix.

This is identical to the approximation of Schwarz [146] needed for developing his BIC (Bayes information criterion) for selecting from $K$ given models. Schwarz recommends the use of the penalized likelihood under model $j$ with a $p_j$-dimensional parameter, namely,
$$L_n(\hat\theta_n) - \frac{p_j}{2}\log n$$
to evaluate the $j$th model. One chooses the model with the highest value of this criterion. The proof suggested here does not assume exponential families as in Schwarz [146] but assumes that the true density $f_0$ is in the model being considered. To have a similar approximation when $f_0$ is not in the model, one assumes that
$$\inf_\theta\int f_0\log\frac{f_0}{f_\theta}$$
is attained at $\theta_0$. We use this $\theta_0$ in the assumptions of this section.
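The Schwarz approximation is simple to use in practice. A minimal illustration (ours; the two models and the data-generating choices are arbitrary) compares a one-parameter $N(\mu, 1)$ model against a two-parameter $N(\mu, \sigma^2)$ model by $L_n(\hat\theta_n) - (p_j/2)\log n$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(1.0, 1.0, size=500)           # data actually from N(1, 1)
n = len(x)

# Model 1: N(mu, 1), one free parameter
mu1 = x.mean()
ll1 = norm.logpdf(x, mu1, 1.0).sum()
# Model 2: N(mu, sigma^2), two free parameters (MLEs: sample mean and sd with ddof=0)
mu2, sd2 = x.mean(), x.std()
ll2 = norm.logpdf(x, mu2, sd2).sum()

crit1 = ll1 - 0.5 * 1 * np.log(n)
crit2 = ll2 - 0.5 * 2 * np.log(n)
print("penalized log likelihoods:", round(crit1, 2), round(crit2, 2))
print("Schwarz criterion picks model", 1 if crit1 > crit2 else 2)
```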
Remark 1.4.6. The main theorem in this section remains true if we replace the normal distribution $N(0, 1/I(\theta_0))$ by $N(0, 1/\hat a)$, where $\hat a = -(1/n)(d^2\log L/d\theta^2)\big|_{\hat\theta_n}$ is the observed Fisher information per unit observation. To a Bayesian, this form of the theorem is more appealing because it does not involve a true (but unknown) value $\theta_0$. The proof requires very little change.
1.5 Ibragimov and Hasminskiĭ Conditions
Ibragimov and Hasminskiĭ, henceforth referred to as IH, in their text [102] used a very general framework for parametric models that includes both the regular model treated in the last section and nonregular problems like $U(0, \theta)$. In fact, IH verify their conditions for various classes of nonregular problems and some stochastic processes. Within their framework we will provide a necessary and sufficient condition for a suitably normed posterior to have a limit in probability. This theorem includes Theorem 1.4.2 on posterior normality under slightly different conditions and covers results on nonregular cases. It also answers some questions on nonregular problems raised by Smith [152].

We begin with notations and conditions appropriate for this section. Let $\Theta$ be an open set in $\mathbb{R}^k$. For simplicity we take $k$ to be 1.

The joint probability distribution of $X_1, X_2, \ldots, X_n$ is denoted by $P_\theta^n$ and its density with respect to Lebesgue measure (or any other $\sigma$-finite measure) by $p(X^n, \theta)$. Let $\phi_n$ be a sequence of positive constants converging to 0. If $k > 1$ then $\phi_n$ would be a $k$-dimensional vector of such constants. In the so-called regular case treated in the last section, $\phi_n = 1/\sqrt n$. In the nonregular cases, typically $\phi_n \to 0$ at a faster rate.

Consider the map $U$ defined by $U(\theta) = \phi_n^{-1}(\theta - \theta_0)$, where $\theta_0$ is the true value. Let $U_n$ be the range of this map, i.e., $U_n = \{U(\theta): \theta\in\Theta\}$. The variable $u$ is a suitably scaled deviation of $\theta$ from $\theta_0$. The likelihood ratio process is defined as
$$Z_n(u, X^n) = \frac{p(X^n, \theta_0 + \phi_n u)}{p(X^n, \theta_0)}$$
The IH conditions can be thought of as two conditions on the Hellinger distance and one on weak convergence of finite-dimensional distributions of $Z_n$.

IH conditions

1. For some $M > 0$, $m_1 \ge 0$, $\alpha > 0$, $n_0 \ge 1$,
$$E_{\theta_0}\Big(Z_n^{1/2}(u_1) - Z_n^{1/2}(u_2)\Big)^2 \le M(1 + A^{m_1})\,|u_1 - u_2|^\alpha$$
for all $u_1, u_2 \in U_n$ with $|u_1| \le A$, $|u_2| \le A$, and for all $n \ge n_0$.

Note that the left-hand side is the square of the Hellinger distance between $p(X^n, \theta_0 + \phi_n u_1)$ and $p(X^n, \theta_0 + \phi_n u_2)$. The condition is like a Lipschitz condition in the rescaled parameter space, but uniformly in $n$.
2. For all $u \in U_n$ and $n \ge n_0$,
$$E_{\theta_0}Z_n^{1/2}(u) \le e^{-g_n(|u|)}$$
where $g_n$ is a sequence of real-valued functions satisfying the following conditions:

(a) for each $n \ge 1$, $g_n(y)\uparrow\infty$ as $y\to\infty$;

(b) for any $N > 0$,
$$\lim_{y\to\infty,\,n\to\infty} y^N e^{-g_n(y)} = 0$$

3. The finite-dimensional distributions of $\{Z_n(u): u\in U_n\}$ converge to those of a stochastic process $\{Z(u): u\in\mathbb{R}\}$.

For i.i.d. $X_1, X_2, \ldots, X_n$ with compact $\Theta$, condition 2 will hold if $\phi_n^{-1}$ is bounded by a power of $n$, as is usually the case. This may be seen as follows: Note that
$$E_{\theta_0}Z_n^{1/2}(u) = \big[A(\theta_0, \theta_0 + \phi_n u)\big]^n$$
where $A(\theta_0, \theta_0 + \phi_n u)$ is the affinity between $p_{\theta_0}$ and $p_{\theta_0+\phi_n u}$, given by $\int\sqrt{p_{\theta_0}\,p_{\theta_0+\phi_n u}}\, dx$. Define
$$g_n(y) = \begin{cases}-n\log A(\theta_0, \theta_0 + \phi_n y) & \text{if } y\in U_n\\ \infty & \text{otherwise}\end{cases}$$
Conditions 2(a) and 2(b) follow trivially. For noncompact cases the condition is similar to the Wald conditions. The following result appears in IH (Theorem I.10.2).
Theorem 1.5.1. Let $\Pi$ be a prior with continuous positive density at $\theta_0$ with respect to the Lebesgue measure. Under the IH conditions and with squared error loss, the normalized Bayes estimate $\phi_n^{-1}(\tilde\theta_n - \theta_0)$ converges in distribution to $\int uZ(u)\, du\big/\int Z(u)\, du$.

A similar result holds for other loss functions. This result of IH is similar to the result that was derived as a corollary to the Bernstein–von Mises theorem on posterior normality. So it is natural to ask if such a limit, not necessarily normal, exists for the posterior under the conditions of IH.

We begin with a fact that immediately follows from the Hewitt-Savage 0-1 law.
Proposition 1.5.1. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. and $\Pi$ is a prior. Let $\hat\theta(X_1, X_2, \ldots, X_n)$ be a symmetric function of $X_1, X_2, \ldots, X_n$. Let
$$t = \phi_n^{-1}\big(\theta - \hat\theta(X_1, X_2, \ldots, X_n)\big)$$
and let $A$ be a Borel set. Suppose
$$\Pi(t\in A|X_1, X_2, \ldots, X_n) \xrightarrow{P_{\theta_0}} Y_A$$
Then $Y_A$ is constant a.e. $P_{\theta_0}$.

In view of this, the following definition of convergence of the posterior seems appropriate, at least in the i.i.d. case.

Definition 1.5.1. For some symmetric function $\hat\theta(X_1, X_2, \ldots, X_n)$, the posterior distribution of $t = \phi_n^{-1}\big(\theta - \hat\theta(X_1, X_2, \ldots, X_n)\big)$ has a limit $Q$ if
$$\sup_A\big\{|\Pi(t\in A|X_1, X_2, \ldots, X_n) - Q(A)|\big\} \xrightarrow{P_{\theta_0}} 0$$
In this case, $\hat\theta(X_1, X_2, \ldots, X_n)$ is called a proper centering.
We now state our main result.

Theorem 1.5.2. Suppose the IH conditions hold and $\Pi$ is a prior with continuous positive density at $\theta_0$ with respect to the Lebesgue measure. If a proper centering $\hat\theta(X_1, X_2, \ldots, X_n)$ exists, then there exists a random variable $W$ such that

(a) $\phi_n^{-1}\big(\theta_0 - \hat\theta(X_1, X_2, \ldots, X_n)\big)$ converges in distribution to $W$;

(b) for almost all $\eta\in\mathbb{R}$ with respect to the Lebesgue measure, $\xi(\eta - W) = q(\eta)$ is nonrandom, where $\xi(u) = Z(u)\big/\int_{\mathbb{R}}Z(u)\, du$, $u\in\mathbb{R}$.

Conversely, if (b) holds for some random variable $W$, then the posterior mean given $X_1, X_2, \ldots, X_n$ is a proper centering with $Q(A) = \int_A q(t)\, dt$.

Remark 1.5.1. Under the IH conditions it can be shown that the posterior mean given $X_1, X_2, \ldots, X_n$ exists. (See the proof of IH Theorem 10.2.)

Remark 1.5.2. It is proved in Ghosal et al. [79] that under the IH conditions the posterior with centering at $\theta_0$ converges weakly to $\xi(\cdot)$ a.s. $P_{\theta_0}$. Theorem 1.5.2 shows that if weak convergence is to be strengthened to convergence in probability by centering at a suitable $\hat\theta(X_1, X_2, \ldots, X_n)$, then conditions (a) and (b) are needed.
Example 1.5.1. We sketch how the current theorem leads to (a version of) the Bernstein–von Mises theorem. Assume that the $X_i$s are i.i.d., that conditions 1 and 2 of IH hold, and that the following stochastic expansion, used earlier in this chapter, is valid:
$$\log Z_n(u) = \frac{u}{\sqrt n}\sum_1^n\frac{\partial\log p(X_i, \theta)}{\partial\theta}\Big|_{\theta_0} - \frac{u^2}{2}I(\theta_0) + o_P(1)$$
Then
$$\log Z_n(u) \xrightarrow{D} uV - \frac{u^2}{2}I(\theta_0), \quad\text{where } V \text{ is a } N(0, I(\theta_0)) \text{ random variable.}$$
Let $\log Z(u) = uV - (u^2/2)I(\theta_0)$. This implies that
$$\big(\log Z_n(u_1), \log Z_n(u_2), \ldots, \log Z_n(u_m)\big)$$
converges in distribution to
$$\big(\log Z(u_1), \log Z(u_2), \ldots, \log Z(u_m)\big)$$
i.e., condition 3 of IH holds. An elementary calculation now shows that $W = V/I(\theta_0)$ and $q(\eta)$ is the normal density at $\eta$ with mean 0 and variance $I^{-1}(\theta_0)$.

Some feeling about condition 1 in the regular case may be obtained as follows: an easy calculation shows
$$E_{\theta_0}\big(Z_n^{1/2}(u_1)\,Z_n^{1/2}(u_2)\big) = \big[A(u_1, u_2)\big]^n$$
If we expand $\big(p_{\theta_0+u/\sqrt n}\big)^{1/2}$ up to the quadratic term and integrate, we get the approximation
$$\Big\{1 - \frac{C(u_1 - u_2)^2}{n} + R_n\Big\}$$
Because
$$E_{\theta_0}\big(Z_n^{1/2}(u_1) - Z_n^{1/2}(u_2)\big)^2 = 2 - 2\big[A(u_1, u_2)\big]^n$$
it can be bounded as required in condition 1 under appropriate conditions on the negligibility of the remainder term $R_n$. A useful sufficient condition is provided in Lemma 1.1 of IH.
Example. The following is a nonregular case where the posterior converges to a limit:
$$p(x, \theta) = \begin{cases}e^{-(x-\theta)} & x > \theta\\ 0 & \text{otherwise}\end{cases}$$
The norming constant $\phi_n$ is $n^{-1}$ and a convenient centering is $\hat\theta(X_1, X_2, \ldots, X_n) = \min(X_1, X_2, \ldots, X_n)$. Conditions 1 and 2 of IH are verified in Chapter 5 of IH under very general assumptions that cover the current example. We shall verify the easy condition 3 and the necessary and sufficient condition of Theorem 1.5.2. Let $V_n = n\big(\theta - \hat\theta(X_1, X_2, \ldots, X_n)\big)$ and let $W$ be a random variable exponentially distributed on $(-\infty, 0)$ with mean $-1$. Then $V_n$ and $W$ have the same distribution for all $n$. Also
$$Z_n(u) = \begin{cases}e^u & \text{if } u + V_n < 0\\ 0 & \text{otherwise}\end{cases}$$
Define $Z(u)$ similarly with $W$ replacing $V_n$. Because $W$ and $V_n$ have the same distribution, the finite-dimensional distributions of $Z_n$ and $Z$ are the same. Moreover,
$$\xi(u) = \begin{cases}e^{u+W} & \text{if } u + W < 0\\ 0 & \text{otherwise}\end{cases}$$
and so $q(\eta) = e^\eta$ if $\eta < 0$ and 0 otherwise. The case when $P_\theta$ is uniform can be reduced to this case by a suitable transformation of $X$ and $\theta$.
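A quick simulation of this nonregular example (our sketch, not from the text): with the shifted exponential model and $\hat\theta = \min_i X_i$, the scaled quantity $n(\hat\theta - \theta_0)$ is exactly Exponential(1) for every $n$, which matches the exponential limit process described above.

```python
import numpy as np

rng = np.random.default_rng(6)
theta0, n, reps = 1.0, 1000, 20_000

v = np.empty(reps)
for r in range(reps):
    x = theta0 + rng.exponential(1.0, size=n)    # shifted exponential, shift theta0
    v[r] = n * (x.min() - theta0)                # phi_n^{-1}(thetahat - theta0) with phi_n = 1/n

# n(min - theta0) is Exponential(1) for every n, matching the limit in this example
print("mean     :", round(v.mean(), 3), " (Exp(1) mean = 1)")
print("P(V > 1) :", round((v > 1).mean(), 3), " (exact e^{-1} =", round(np.exp(-1), 3), ")")
```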
Example. This example deals with the hazard rate change point problem. Consider $X_1, X_2, \ldots, X_n$ i.i.d. with hazard rate
$$\frac{f_\theta(x)}{1 - F_\theta(x)} = \begin{cases}a & \text{if } 0 < x < \theta\\ b & \text{if } x > \theta\end{cases}$$
Typically $a$ is much bigger than $b$. This density has been used to model electronic components with an initial high hazard rate and cancer relapse times. For details see Ghosh et al. [85].

Let $\hat\theta(X_1, X_2, \ldots, X_n)$ be the MLE of $\theta$. It can be shown that $\phi_n = n^{-1}$ is the right norming constant and that the IH conditions hold. But the necessary condition that $\xi(\eta - W)$ is nonrandom fails. On the other hand, if $a, b$ are also unknown, it can be shown that the posterior distribution of $\big(\sqrt n(a - \hat a), \sqrt n(b - \hat b)\big)$ has a limit in the sense of Theorem 1.5.2. For details see [85] and [79].
Remark 1.5.3. Ghosal et al. [84] show that typically in nonregular examples the necessary condition of Theorem 1.5.2 fails.

Remark 1.5.4. Theorems 2.2 and 2.3 of [84] imply consistency of the posterior under the conditions of IH.

Remark 1.5.5. If $\sum\phi_n^s < \infty$ for some $s > 0$, then posterior consistency holds in the a.s. sense.
1.6 Nonsubjective Priors
This section contains a brief discussion of nonsubjective priors. This term has been
generally used in the literature for the so-called noninformative priors. In this section
we use it as a generic description of all priors that are not elicited in a fully subjective
manner.
1.6.1 Fully Specified
Fully specified nonsubjective priors try to quantify low information in one sense or
another. Because there is no completely satisfactory definition of information, many
choices are available. Only the most common are discussed. A comprehensive survey is
by Kass and Wasserman [111]. A quick overview is available in Ghosh and Mukherjee
[86] and Ghosh [83]. In particular, we use this term to describe conjugate priors and
their mixtures.
For convenience we take $\Theta = \mathbb{R}^p$. The use of the uniform distribution, namely the Lebesgue measure, as a prior goes back to Bayes and Laplace. It has been criticized as being improper (i.e., the total measure is not finite), a property that applies to all the priors considered in this section and is a consequence of $\Theta$ being unbounded. An improper prior may be used only if it leads to a proper posterior for all samples. This posterior may then be used to calculate Bayes estimates and so on. However, even then there arise problems with testing hypotheses and model selection. Because we will not consider testing for infinite-dimensional $\Theta$ we will not pursue this. For finite-dimensional $\Theta$, attractive possibilities are available. See, for example, Berger and Pericchi [16] and Ghosh and Samanta [88].

As pointed out by Fisher, the choice of a uniform distribution is not invariant in the following sense. Take a smooth 1-1 function $\eta(\theta)$ of $\theta$. Argue that if one has no information about $\theta$ then the same is true of $\eta(\theta)$, and hence one can quantify this belief by a uniform distribution for $\eta$. Going back to $\theta$ one gets a nonuniform prior $\pi$ for $\theta$ satisfying
$$\pi(\theta) = \Big|\frac{d\eta}{d\theta}\Big|$$
where $|d\eta/d\theta|$ is the Jacobian, i.e., the determinant of the $p\times p$ matrix $[\partial\eta_i/\partial\theta_j]$.

It appears that Fisher's criticism led to the decline of Bayesian methods based on uniform priors. This also helped the growth of methods based on maximizing the likelihood. However, Basu [9] makes a strong case for a uniform distribution after a suitable finite discrete approximation to $\Theta$. This idea will be taken up in Chapter 8.
A natural Bayesian answer to Fisher's criticism is to look for a method that produces priors $\pi_1(\theta)$, $\pi_2(\eta)$ for $\theta$ and $\eta$ such that one can pass from one to the other by the usual Jacobian formula
$$\pi_1(\theta) = \pi_2(\eta(\theta))\Big|\frac{d\eta}{d\theta}\Big| \tag{1.11}$$
Suppose the likelihood satisfies regularity conditions and the $p\times p$ Fisher information matrix
$$I(\theta) = E_\theta\Big(\frac{\partial\log f_\theta}{\partial\theta_i}\cdot\frac{\partial\log f_\theta}{\partial\theta_j}\Big)$$
is positive definite. Then Jeffreys suggested the use of
$$\pi_1(\theta) = \{\det I(\theta)\}^{1/2}$$
This is known as the Jeffreys prior. It is easily verified that (1.11) is satisfied if we set
$$\pi_2(\eta) = \Big\{\det E_\theta\Big(\frac{\partial\log f_\theta}{\partial\eta_i}\cdot\frac{\partial\log f_\theta}{\partial\eta_j}\Big)\Big\}^{1/2}$$
using the Fisher information matrix in the $\eta$-space. One apparently unpleasant aspect is the dependence of the prior on the experiment. This is examined in the next subsection.
The Jeffreys prior was the most popular nonsubjective prior until the introduction of reference priors by Bernardo [18]. The algorithm described next is due to Berger and Bernardo [14], [15]. We follow the treatment given in Ghosh [83].

For a discrete random variable or vector $W$ with probability function $p(w)$, the Shannon entropy is
$$S(p) = S(W) = -E_p(\log p(W))$$
It can be axiomatically developed and is a basic quantity in information and communication theory. Maximization of entropy, which is equivalent to minimizing information, leads to a discrete uniform distribution, provided $W$ assumes only finitely many values.

Unfortunately, no such universally accepted measure exists if $W$ is not discrete. In the general case we may still define
$$S(p) = S(W) = -E_p(\log p(W))$$
where $p$ is the density with respect to some $\sigma$-finite measure $\mu$. Unfortunately, this $S(p)$ depends on $\mu$ and is rarely used directly in information or communication theory. Further, if one maximizes $S(p)$ one gets $p = $ constant, i.e., one gets essentially $\mu$.
A different measure, also due to Shannon, was used by Lindley [128] and Bernardo [18]. Consider two random vectors $V, W$ with joint density $p$. Then
$$S(p) \equiv S(V, W) = S(V) + S_V(W)$$
where
$$S_V(W) = E(I(W|V)), \qquad I(W|V) = -E\{\log p(W|V)\,|\,V\}$$
Here $S_V(W)$ is the part of the entropy of $W$ that can be explained by its dependence on $V$. The residual entropy is
$$S(W) - S_V(W) = E\Big(E\Big(\log\frac{p(W|V)}{p(W)}\,\Big|\,V\Big)\Big) \ge 0$$
Because
$$\int p(w|v)\log\frac{p(w|v)}{p(w)}\,\mu(dw) \ge 0,$$
this quantity is taken as a measure of entropy in the construction of reference priors.

Let $X = (X_1, X_2, \ldots, X_n)$ have density $p(x|\theta)$, and let the prior be $p(\theta)$ and the posterior density be $p(\theta|x)$. Lindley's measure of information in $X$ is
$$S(X, p(\theta)) = E\Big(E\Big(\log\frac{p(\theta|X)}{p(\theta)}\,\Big|\,X\Big)\Big) \tag{1.12}$$
So it is a measure of how close the prior is to the posterior. If the prior is most informative, i.e., degenerate at a point, then the quantity is 0. Maximizing the quantity should therefore make the prior as noninformative as possible, provided $S(X, p(\theta))$ is the correct measure of entropy.

Bernardo [18] recommended taking a limit first as $n \to \infty$ and then maximizing. Taking a limit seems to introduce some stability and removes dependence on $n$. Subsequent research has shown that maximizing for a fixed $n$ may lead to discrete priors, which are unacceptable as noninformative.

To ensure that a limit exists, one assumes i.i.d. observations with enough regularity conditions for posterior normality in a sufficiently strong sense. Details are available in Clarke and Barron [33].

Suppose $K_i$ is an increasing sequence of compact sets whose union is the whole parameter space $\Theta$. To avoid confusion with the density $p$, the dimension of $\theta$ is taken as $d$. Then, using posterior normality,
$$S(X, p) = -E(\log p(\theta)) + E(\log p(\theta|X)) = -E(\log p(\theta)) + E\log N(\theta) + o(1)$$
where $N$ is the normal density with mean $\hat\theta$ and dispersion matrix $I^{-1}(\hat\theta)/n$.

The second term on the right equals
$$-E\Big(\frac{n\sum(\theta_i - \hat\theta_i)(\theta_j - \hat\theta_j)I_{ij}(\hat\theta)}{2}\Big) + E\log\{\det I(\hat\theta)\}^{1/2} + \frac{d}{2}\log\frac{n}{2\pi}$$
If we approximate $I(\hat\theta)$ by $I(\theta)$ and $E(\theta_i - \hat\theta_i)(\theta_j - \hat\theta_j)$ by the $(i,j)$th element of $I^{-1}(\theta)/n$, then $S(X, p)$ simplifies to
$$\frac{d}{2}\log\frac{n}{2\pi e} + \int_{K_i}p(\theta)\log\{\det I(\theta)\}^{1/2}\, d\theta - \int_{K_i}p(\theta)\log p(\theta)\, d\theta + o(1) \tag{1.13}$$
Thus, as $n \to \infty$, $S(X, p)$ is decomposed into a term that does not depend on $p(\theta)$ and
$$J(p, K_i) = \int_{K_i}p(\theta)\log\frac{\{\det I(\theta)\}^{1/2}}{p(\theta)}\, d\theta$$
which is maximized at
$$p_i(\theta) = \begin{cases}\text{const.}\,\{\det I(\theta)\}^{1/2} & \text{if } \theta\in K_i\\ 0 & \text{otherwise}\end{cases}$$
If one lets $i \to \infty$, the $p_i$s may be regarded as converging to the Jeffreys prior. This is a rederivation of the Jeffreys prior from an information-theoretic point of view by Bernardo [18]. To get a reference prior, one writes $\theta = (\theta_1, \theta_2)$, where $\theta_1$ is the parameter of interest and $\theta_2$ is a nuisance parameter. Let $d_i$ be the dimension of $\theta_i$, and for convenience take $\Theta = \Theta_1\times\Theta_2$.

For a fixed $\theta_1$, let $p(\theta_2|\theta_1)$ be a conditional prior for $\theta_2$ given $\theta_1$. By integrating out $\theta_2$, one is left with $\theta_1$ and $X$. Then one finds the marginal prior $p(\theta_1)$ as described earlier. This depends on the choice of $p(\theta_2|\theta_1)$. Bernardo [18] recommended use of the Jeffreys prior const$\,\cdot\,\det\{I_{22}(\theta)\}^{1/2}$, treating $\theta_2$ as variable with $\theta_1$ held constant. Here $I_{22}(\theta) = [I_{ij}(\theta),\ i, j = d_1+1, \ldots, d_1+d_2]$.

Fix compact sets $K_{i1}, K_{i2}$ of $\Theta_1$ and $\Theta_2$. Consider priors concentrating on $K_{i1}\times K_{i2}$. Let $p_i(\theta_2|\theta_1)$ be a given conditional prior. Our first object is to maximize the entropy in $\theta_1$ and find the marginal $p(\theta_1)$.

Let
$$S(X, p_i(\theta_1)) = E\Big(\log\frac{p_i(\theta_1|X)}{p_i(\theta_1)}\Big) = S(X, p_i(\theta_1, \theta_2)) - \int_{K_{i1}}p_i(\theta_1)\,S(X, p_i(\theta_2|\theta_1))\, d\theta_1 \tag{1.14}$$
Assuming that one can interchange integration with respect to $\theta_1$, and using the asymptotic form (1.13) of $S(X, p(\theta_1, \theta_2))$,
$$S(X, p_i(\theta_1)) = \frac{d_1}{2}\log\frac{n}{2\pi e} + \int_{K_{i1}}p_i(\theta_1)\log\frac{\psi_i(\theta_1)}{p_i(\theta_1)}\, d\theta_1 + o(1)$$
where
$$\psi_i(\theta_1) = \exp\Big\{\int_{K_{i2}}p_i(\theta_2|\theta_1)\log\Big(\frac{\det I(\theta)}{\det I_{22}(\theta)}\Big)^{1/2}\, d\theta_2\Big\}$$
Maximizing $S(X, p_i(\theta_1))$ asymptotically,
$$p_i(\theta_1) = \text{const}\;\psi_i(\theta_1)\quad\text{on } K_{i1}$$
where the constant is for normalization. Then
$$p_i(\theta_1, \theta_2) = \begin{cases}\text{const}\;\psi_i(\theta_1)\,p_i(\theta_2|\theta_1) & \text{on } K_{i1}\times K_{i2}\\ 0 & \text{elsewhere}\end{cases}$$
Finally take
$$p_i(\theta_2|\theta_1) = \begin{cases}c_i(\theta_1)\{\det I_{22}(\theta)\}^{1/2} & \text{on } K_{i2}\\ 0 & \text{otherwise}\end{cases}$$
To choose a limit in some sense, fix $\theta^0 = (\theta_{10}, \theta_{20})$ and assume
$$\lim_i p_i(\theta_1, \theta_2)/p_i(\theta_{10}, \theta_{20}) = p(\theta_1, \theta_2)$$
exists for all $\theta\in\Theta$. Then $p(\theta_1, \theta_2)$ is the reference prior when $\theta_1$ is more important than $\theta_2$. If the convergence to $p(\theta_1, \theta_2)$ is uniform on compacts, then for any pair of sets $B_1, B_2$ contained in a fixed $K_{i_01}\times K_{i_02}$,
$$\lim_i\frac{\int_{B_1}p_i(\theta_1, \theta_2)}{\int_{B_2}p_i(\theta_1, \theta_2)} = \frac{\int_{B_1}p(\theta_1, \theta_2)}{\int_{B_2}p(\theta_1, \theta_2)}$$
Berger and Bernardo [15] recommend a $d$-dimensional breakup of $\theta$ as $(\theta_1, \theta_2, \ldots, \theta_d)$ and a $d$-step algorithm starting with
$$p(\theta_d|\theta_1, \ldots, \theta_{d-1}) = c(\theta_1, \theta_2, \ldots, \theta_{d-1})\,\{I_{dd}(\theta)\}^{1/2}\quad\text{on } K_{id}$$
Some justification for this is provided in Datta and Ghosh [38].
There is still another class of nonsubjective priors, obtained by matching what a frequentist might do (because, presumably, that is how a Bayesian without prior information would act). Technically, this amounts to matching posterior and frequentist probabilities up to a certain order of approximation. This leads to a differential equation involving the prior. For $d = 1$ the Jeffreys prior is the unique solution. For $d > 1$, reference priors are often a solution of the matching equation. More details are given in Ghosh [83].

Finally, there is one class of problems in which there is some sort of consensus on what nonsubjective prior to use. These are problems where a nice group $G$ of transformations leaves the problem invariant and either acts transitively on $\Theta$, i.e., $\{g(\theta_0): g\in G\} = \Theta$, or reduces $\Theta$ to a one-dimensional maximal invariant parameter. See, for example, Berger [13]. In the next example $G$ acts transitively. In such problems the right invariant Haar measure is a common choice and is a reference prior. The Jeffreys prior is a left invariant Haar measure, which causes problems [see, e.g., Dawid, Stone, and Zidek [39]]. For examples involving one-dimensional maximal invariants, see Datta and Ghosh [38]. Here also reference priors do well.
Example 1.6.1. The $X_i$s are i.i.d. normal with mean $\theta_2$ and variance $\theta_1$; $\theta_1$ is the parameter of importance. The information matrix is
$$I(\theta) = \begin{pmatrix}\dfrac{1}{2\theta_1^2} & 0\\[4pt] 0 & \dfrac{1}{\theta_1}\end{pmatrix}$$
and so the reference prior may be obtained through the following steps:
$$p_i(\theta_2|\theta_1) = d_i \quad\text{on } K_{i2}$$
$$\psi_i(\theta_1) = \exp\Big\{\int_{K_{i2}}d_i\log\frac{1}{\sqrt 2\,\theta_1}\, d\theta_2\Big\} \propto \frac{1}{\theta_1}$$
$$p_i(\theta_1, \theta_2) = c_i\,\frac{1}{\theta_1}\quad\text{on } K_{i1}\times K_{i2}$$
$$p(\theta_1, \theta_2) = \theta_1^{-1}$$
which is also known to arise from the right invariant Haar measure for $(\mu, \sigma)$. The Jeffreys prior is proportional to $\theta_1^{-3/2}$, which corresponds to the left invariant Haar measure.

If the mean is taken to be $\theta_1$ and the variance $\theta_2$, then the reference prior is proportional to $\theta_2^{-1}$. But, in general, a reference prior depends on how the components are ordered.
1.6.2 Discussion
Nonsubjective priors are best thought of as providing a tool for calculating posteriors. Theorems like posterior normality indicate that the effect of the prior washes away as the sample size increases. Hence a posterior obtained from a nonsubjective prior may be thought of as an approximation to a posterior obtained from a subjective prior. Though there is no unique choice for a nonsubjective prior, the posteriors obtained from different nonsubjective priors will usually be close to each other, even for moderate values of $n$. Thus the lack of uniqueness may not matter very much.

It is true that a nonsubjective prior usually depends on the experiment, e.g., through the information matrix $I(\theta)$. This would not seem paradoxical if one remembers that nonsubjective priors have low information, and it seems that information cannot be defined except in the context of an experiment. The measure of information used by Bernardo [18] clarifies this.

Nonsubjective priors are typically improper, but some justification comes from the work of Heath and Sudderth [97], [96]. They show that, at least for amenable groups, the posterior obtained from a right invariant measure can be obtained from a proper, finitely additive prior.

For improper priors one has to verify that the posteriors are proper. In many cases this is not easy. Some Bayesians use an improper prior and restrict it to a large compact set. In general, this is not advisable. It is a remarkable fact that for the Jeffreys or reference priors the posteriors are often proper, but there exist simple counterexamples; see, for example, [38]. If the likelihood shows marked inhomogeneities asymptotically, as in the so-called nonergodic cases, one must take these into account through suitable conditioning.
1.7 Conjugate and Hierarchical Priors
Let the $X_i$s be i.i.d. Consider exponential densities with a special parametrization
$$f_\theta(x) = \exp\Big\{A(\theta) + \sum_1^p\theta_j T_j(x) + \psi(x)\Big\}$$
Given $X_1, X_2, \ldots, X_n$, the sufficient statistic is $\big(\sum_1^n T_1(x_i), \ldots, \sum_1^n T_p(x_i)\big)$. Assume $\Theta$ is an open $p$-dimensional rectangle. Because
$$E_\theta\Big(\frac{\partial\log f_\theta}{\partial\theta_j}\Big) = 0$$
one has
$$-\frac{\partial A(\theta)}{\partial\theta_j} = E_\theta(T_j) = \eta_j(\theta)$$
$\eta = (\eta_1, \eta_2, \ldots, \eta_p)$ provides another natural parametrization. Note that the MLE is $\hat\eta = \bar T$, the vector of sample means of the sufficient statistics.

A class of priors $\mathcal{C}$ is said to be a conjugate family if, given a prior in $\mathcal{C}$, the posterior for all $n$ belongs to $\mathcal{C}$. One can generate such families by choosing a $\sigma$-finite measure $\nu$ on $\Theta$ and defining elements of $\mathcal{C}$ by
$$p(\theta|m, t_1, t_2, \ldots, t_p) = \text{const.}\exp\Big\{mA(\theta) + \sum_1^p\theta_j t_j\Big\} \tag{1.15}$$
where $m$ is a positive integer and $t_1, t_2, \ldots, t_p$ are elements in the sample space of $T_1, T_2, \ldots, T_p$. The constants $m, t_1, t_2, \ldots, t_p$ are parameters of the prior distribution, chosen such that the prior is proper.

Usually, $\nu$ is a nonsubjective prior. Then the prior displayed in (1.15) can be interpreted as a posterior when the prior is $\nu$ and one has a conceptual sample of size $m$ yielding values of the sufficient statistics $T = (t_1, t_2, \ldots, t_p)$; i.e., compared with $\nu$ it represents prior information equivalent to a sample of size $m$.

The case when $\nu$ is the Lebesgue measure deserves special attention. Under certain conditions, one can prove the following by an argument involving integration by parts:
$$E(\eta|X_1, X_2, \ldots, X_n) = \frac{mE(\eta) + n\hat\eta}{m + n} \tag{1.16}$$
which shows that the posterior mean is a convex combination of the prior mean and a suitable frequentist estimate. The relation strengthens the interpretation of $m$ as a measure of information in the prior. The elements of $\mathcal{C}$ corresponding to the Lebesgue measure are usually called conjugate priors. Diaconis and Ylvisaker [47] have shown that these are the only priors that satisfy (1.16). One can elicit the values of $t_1, t_2, \ldots, t_p$ by eliciting the prior mean, and $m$ by comparing prior information with information from a sample. This makes these priors relatively easy to elicit, but because one is only eliciting some aspects of the prior, a conjugate prior is a nonsubjective prior with some parameters reflecting prior belief.
Example. $f_\theta$ is the normal density with mean $\mu$ and standard deviation $\sigma$. Here $\theta_1 = \mu/\sigma^2$, $\theta_2 = -1/2\sigma^2$, $A(\theta) = -(\mu^2/2\sigma^2) - \log\sigma$, and $T_1(x) = x$, $T_2(x) = x^2$. A conjugate prior is of the form
$$p(\theta) = \text{const.}\;e^{mA(\theta) + t_1\theta_1 + t_2\theta_2}$$
which can be displayed as the product of a normal and an inverse gamma.

Example. $f_\theta$ is Bernoulli with parameter $\theta$. Conjugate priors are beta distributions.

Example. $f_\theta$ is multinomial with parameters $\theta_1, \theta_2, \ldots, \theta_p$, where $\theta_i \ge 0$, $\sum\theta_i = 1$. Conjugate priors are Dirichlet distributions, discussed in the next chapter.
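The convex-combination identity (1.16) is easy to see numerically in the Bernoulli/beta case (our sketch, not from the text): with a Beta($\alpha, \beta$) prior, $m = \alpha + \beta$ plays the role of the prior sample size in (1.15)–(1.16), and the posterior mean of $\eta = E_\theta X = \theta$ interpolates between the prior mean and $\hat\eta = \bar X$.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta_, theta_true = 3.0, 7.0, 0.6
m = alpha + beta_                      # "prior sample size" in (1.15)/(1.16)
prior_mean = alpha / m

n = 50
x = rng.binomial(1, theta_true, size=n)
eta_hat = x.mean()                     # frequentist estimate of eta = E_theta(X) = theta

posterior_mean = (alpha + x.sum()) / (m + n)                     # mean of Beta(alpha + s, beta + n - s)
convex_combination = (m * prior_mean + n * eta_hat) / (m + n)    # right-hand side of (1.16)
print(posterior_mean, convex_combination)                        # identical, illustrating (1.16)
```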
Conjugate priors have been criticized on two grounds. The relation (1.16) may not be reasonable if there is conflict between the prior and the data. For example, if $p = 1$, the prior mean is 0 and $\hat\eta$ is 20, should one believe the data or the prior? A convex combination of two incompatible estimates is unreasonable.

For $N(\mu, \sigma^2)$, a $t$-prior for $\mu$ and a nonsubjective prior for $\sigma$ ensure that in cases like this the posterior mean shifts more toward the data, i.e., a choice of such a prior means that, in cases of conflict, one trusts the data. The $t$-prior is a scale mixture of normals. In general, it seems that mixtures of conjugate priors will possess this kind of property, but we have not seen any general investigation in the literature.

The other criticism of conjugate priors is that only one parameter $m$ is left to model the prior belief on uncertainty. Once again, a mixture of conjugate priors offers more flexibility.

These mixtures may be thought of as modeling prior belief in a hierarchy of stages, called hierarchical priors. The reason for their current popularity in Bayesian analysis is that they are flexible and posterior quantities can be calculated by Markov chain Monte Carlo. A good source is Schervish [144].
1.8 Exchangeability, De Finetti’s Theorem,
Exponential Families
Subjective priors can be elicited in special simple cases; a relatively recent treatment is Kadane et al. [109]. However, there is one class of problems where subjective judgments can be made relatively easily and can lead to both a model and a prior.

Suppose $\{X_i\}$ is a sequence of random variables. This sequence is said to be exchangeable if for any $n$ distinct $i_1, i_2, \ldots, i_n$,
$$P\{X_{i_1}\in B_1, X_{i_2}\in B_2, \ldots, X_{i_n}\in B_n\} = P\{X_1\in B_1, X_2\in B_2, \ldots, X_n\in B_n\} \tag{1.17}$$
Suppose the $\{X_i\}$ take values in $\{0,1\}$. One may be able to judge whether the $\{X_i\}$s are exchangeable. In some sense, such judgments are fundamental to science, where one makes inductions about the future based on past experience. The next theorem of De Finetti shows that this subjective judgment leads to a model and affirms the existence of a prior.
Theorem 1.8.1. If a sequence of random variables $\{X_i\}$ is exchangeable and if each $X_i$ takes values in $\{0,1\}$, then there exists a distribution $\Pi$ such that
$$P\{X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\} = \int_0^1\theta^r(1-\theta)^{n-r}\, d\Pi(\theta)$$
with $r = \sum_1^n x_i$.

The theorem implies that one has a Bernoulli model and a prior $\Pi$. To specify a prior, one needs additional subjective judgments. For example, if given $X_1, X_2, \ldots, X_n$ one predicts $X_{n+1}$ by $P(X_{n+1}=1|X_1, \ldots, X_n) = (\alpha + \sum x_i)/(\alpha + \beta + n)$, then $\Pi$ must be a beta prior.
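The predictive rule just mentioned is exactly the Pólya urn. A short simulation (ours, not from the text) generates an exchangeable binary sequence by that rule and recovers, across replications, the Beta($\alpha, \beta$) mixing distribution of the limiting relative frequency promised by De Finetti's theorem.

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, beta_, n, reps = 2.0, 5.0, 500, 2000

freqs = np.empty(reps)
for r in range(reps):
    s = 0.0
    for i in range(n):
        p = (alpha + s) / (alpha + beta_ + i)   # predictive P(X_{i+1} = 1 | past)
        s += rng.random() < p
    freqs[r] = s / n                            # relative frequency, approximately Beta(alpha, beta)

print("mean of frequencies:", round(freqs.mean(), 3), " Beta mean =", round(alpha / (alpha + beta_), 3))
print("var  of frequencies:", round(freqs.var(), 4),
      " Beta var  =", round(alpha * beta_ / ((alpha + beta_) ** 2 * (alpha + beta_ + 1)), 4))
```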
Regazzini [67] has shown that judgments of exchangeability, along with certain judgments on the predictive distributions of $X_{n+1}$ given $X_1, X_2, \ldots, X_n$, lead to a similar representation theorem, which yields an exponential model along with a mixing distribution that may be interpreted as a prior. Earlier Bayesian derivations of exponential families are due to Lauritzen [117] and Diaconis and Freedman [44]. A good treatment is in Schervish [144], where partial exchangeability and its modeling through hierarchical priors are also discussed.
2
M(X) and Priors on M(X)
2.1 Introduction
As mentioned in Chapter 1, in the nonparametric case the parameter space $\Theta$ is typically the set of all probability measures on $\mathcal{X}$. We denote the set of all probability measures on $\mathcal{X}$ by $M(\mathcal{X})$. The cases of interest to us are when $\mathcal{X}$ is a finite set and when $\mathcal{X} = \mathbb{R}$. The Bayesian aspect requires prior distributions on $M(\mathcal{X})$, in other words, probabilities on the space of probabilities. In this chapter we develop some measure-theoretic and topological features of the space $M(\mathcal{X})$ and discuss various notions of convergence on the space of prior distributions.

The results in this chapter, except for the last section, are mainly used to assert the existence of the priors discussed later. Thus, for a reader who is prepared to accept the existence theorems mentioned later, a cursory reading of this chapter would be adequate. On the other hand, for those who are interested in measure-theoretic aspects, a careful reading of this chapter will provide a working familiarity with the measure-theoretic subtleties involved. The last section, where formal definitions of consistency are discussed, can be read independently. While we generally consider the case $\mathcal{X} = \mathbb{R}$, most of the arguments would go through when $\mathcal{X}$ is a complete separable metric space.
2.2 The Space M(X)
As before, let $\mathcal{X}$ be a complete separable metric space with $\mathcal{B}$ the corresponding Borel $\sigma$-algebra on $\mathcal{X}$. Denote by $M(\mathcal{X})$ the space of all probability measures on $(\mathcal{X}, \mathcal{B})$.

As seen in Chapter 1, there are many reasonable notions of convergence on the space $M(\mathcal{X})$, but they are not all equally convenient for our purpose. We begin with a brief discussion of these.

Total Variation Metric. Recall that the total variation metric is defined by
$$\|P - Q\| = 2\sup_B|P(B) - Q(B)|$$
If $p$ and $q$ are densities of $P$ and $Q$ with respect to some $\sigma$-finite measure $\mu$, then $\|P - Q\|$ is just the $L_1$-distance $\int|p - q|\, d\mu$ between $p$ and $q$. The total variation metric is a strong metric. If $x\in\mathcal{X}$ and $\delta_x$ is the probability degenerate at $x$, then $U_x = \{P: \|P - \delta_x\| < \epsilon\} = \{P: P\{x\} > 1 - \epsilon/2\}$ is a neighborhood of $\delta_x$. Further, if $x\ne x'$ then $U_x\cap U_{x'} = \emptyset$ (for $\epsilon < 1$). Thus, when $\mathcal{X}$ is uncountable, $\{U_x: x\in\mathcal{X}\}$ is an uncountable collection of disjoint open sets, the existence of which renders $M(\mathcal{X})$ nonseparable. Further, no sequence of discrete measures can converge to a continuous measure and vice versa. These properties make the total variation metric uninteresting when considered on all of $M(\mathcal{X})$.

The total variation metric when restricted to sets of the form $L_\mu$ (all probability measures dominated by a $\sigma$-finite measure $\mu$) is extremely useful and interesting. In this context we will refer to the total variation as the $L_1$-metric. It is a standard result that $L_\mu$ with the $L_1$-metric is complete and separable.

Hellinger Metric. This metric was also discussed in Chapter 1. Briefly, the Hellinger distance between $P$ and $Q$ is defined by
$$H(P, Q) = \Big(\int(\sqrt p - \sqrt q)^2\, d\mu\Big)^{1/2}$$
where $p$ and $q$ are densities with respect to $\mu$. The Hellinger metric is equivalent to the $L_1$-metric. Associated with the Hellinger metric is a useful quantity $A(P, Q)$ called affinity, defined as $A(P, Q) = \int\sqrt{pq}\, d\mu$. The relation $H^2(P^n, Q^n) = 2 - 2(A(P, Q))^n$, where $P^n, Q^n$ are $n$-fold product measures, makes the Hellinger metric convenient in the i.i.d. context.
Setwise Convergence. The metrics defined above provide corresponding notions of convergence. Another natural way of saying that $P_n$ converges to $P$ is to require that $P_n(B)\to P(B)$ for all Borel sets $B$. A way of formalizing this topology is as follows. Let $F$ be the class of functions $\{P\mapsto P(B): B\in\mathcal{B}\}$. On $M(\mathcal{X})$ give the smallest topology that makes the functions in $F$ continuous. It is easy to see that under this topology, if $f$ is a bounded measurable function, then $P\mapsto\int f\, dP$ is continuous. Sets of the form $\{P: |P(B_i) - P_0(B_i)| < \epsilon,\ i = 1, \ldots, k;\ B_1, B_2, \ldots, B_k\in\mathcal{B}\}$ give a neighborhood base at $P_0$.

Setwise convergence is an intuitively appealing notion, but it has awkward topological properties that stem from the fact that convergence of $P_n(B)$ to $P(B)$ for sets in an algebra does not ensure the convergence for all Borel sets. We summarize some additional facts as a proposition.

Proposition 2.2.1. Under setwise convergence:

(i) $M(\mathcal{X})$ is not separable;

(ii) if $P_0$ is a continuous measure then $P_0$ does not have a countable neighborhood base, and hence the topology of setwise convergence is not metrizable.

Proof. (i) $U_x = \{P: P\{x\} > 1 - \epsilon\}$ is a neighborhood of $\delta_x$, and these, as $x$ varies, form an uncountable collection of disjoint open sets.

(ii) Suppose that there is a countable base for the neighborhoods at $P_0$. Let $\mathcal{B}_0$ be a countable family of sets such that sets of the type
$$U = \{P: |P(B_i) - P_0(B_i)| < \epsilon,\ i = 1, \ldots, k;\ B_1, B_2, \ldots, B_k\in\mathcal{B}_0\}$$
form a neighborhood base at $P_0$. It then follows that $P_n(B)\to P_0(B)$ for all Borel sets $B$ iff $P_n(B)\to P_0(B)$ for all sets in $\mathcal{B}_0$.

Let $\mathcal{B}_n = \sigma(B_1, B_2, \ldots, B_n)$, where $B_1, B_2, \ldots$ is an enumeration of $\mathcal{B}_0$. Denote by $B_{n1}, B_{n2}, \ldots, B_{nk(n)}$ the atoms of $\mathcal{B}_n$. Define $P_n$ to be the discrete measure that gives mass $P_0(B_{ni})$ to $x_{ni}$, where $x_{ni}$ is a point in $B_{ni}$. Clearly $P_n(B_{mj})\to P_0(B_{mj})$ for all $m, j$. On the other hand, $P_n(\cup_{i,m}\{x_{mi}\}) = 1$ for all $n$, but $P_0(\cup_{i,m}\{x_{mi}\}) = 0$.

These shortcomings persist even when we restrict attention to subsets of $M(\mathcal{X})$ of the form $L_\mu$.
Supremum Metric. When $\mathcal{X}$ is $\mathbb{R}$, the Glivenko-Cantelli theorem on convergence of the empirical distribution suggests another useful metric, which we call the supremum metric. This metric is defined by
$$d_K(P, Q) = \sup_t\big|P(-\infty, t] - Q(-\infty, t]\big|$$
Under this metric $M(\mathcal{X})$ is complete but not separable.

Weak Convergence. In many ways weak convergence is the most natural and useful topology on $M(\mathcal{X})$. Say that $P_n\to P$ weakly, or $P_n\xrightarrow{\text{weakly}}P$, if
$$\int f\, dP_n\to\int f\, dP$$
for all bounded continuous functions $f$ on $\mathcal{X}$. For any $P_0$ a neighborhood base consists of sets of the form $\cap_1^k\{P: |\int f_i\, dP_0 - \int f_i\, dP| < \epsilon\}$, where $f_i$, $i = 1, 2, \ldots, k$, are bounded continuous functions on $\mathcal{X}$. One of the things that makes the weak topology so convenient is that under weak convergence $M(\mathcal{X})$ is a complete separable metric space.

The main results that we need with regard to weak convergence are the Portmanteau theorem and Prohorov's theorem given in Chapter 1.

Because $M(\mathcal{X})$ is a complete separable metric space under weak convergence, we define the Borel $\sigma$-algebra $\mathcal{B}_M$ on $M(\mathcal{X})$ to be the smallest $\sigma$-algebra generated by all weakly open sets, equivalently all weakly closed sets. This $\sigma$-algebra has a more convenient description as the smallest $\sigma$-algebra that makes the functions $\{P\mapsto P(B): B\in\mathcal{B}\}$ measurable. Let $\mathcal{B}_0$ be the $\sigma$-algebra generated by all weakly open sets. Consider all $B$ such that $P\mapsto P(B)$ is $\mathcal{B}_0$-measurable. This class contains all closed sets, and from the $\pi$-$\lambda$ theorem (Theorem 1.2.1) it follows easily that the two descriptions of $\mathcal{B}_M$ coincide.

We have discussed two other modes of convergence on $M(\mathcal{X})$: total variation and setwise convergence. It is instructive to pause and investigate the $\sigma$-algebras corresponding to these and their relationship with $\mathcal{B}_M$.

Because these are nonseparable spaces, there is no good acceptable notion of a Borel $\sigma$-algebra. In the case of the total variation metric, the two common $\sigma$-algebras considered are

(i) $\mathcal{B}^o$, the $\sigma$-algebra generated by open sets, and

(ii) $\mathcal{B}^b$, the $\sigma$-algebra generated by open balls.
The σ-algebra B_o generated by open sets is much larger than B_M. To see this, restrict
the σ-algebra to the space of degenerate measures D_X = {δ_x : x ∈ X}. Then each δ_x
is relatively open, and this forces the restriction of B_o to D_X to be the power set.
On the other hand, B_M restricted to D_X is just the inverse image of the Borel σ-algebra on
X under the map δ_x ↦ x.
Because every open ball is in B_M, so is every set in the σ-algebra generated by
these balls. It can be shown that B_b is properly contained in B_M.
Similar statements hold when we consider the σ-algebras for setwise convergence.
The corresponding σ-algebras here would be those generated by open sets and those
generated by basic neighborhoods at a point. A discussion of these different σ-algebras
can be found in [71].
We next discuss measurability issues on M(X). Following are a few elementary
propositions.
Proposition 2.2.2. (i) If B_0 is an algebra generating B, then
σ{P ↦ P(B) : B ∈ B_0} = B_M
(ii) σ{P ↦ ∫ f dP : f bounded measurable} = B_M
Proof. (i) Let B̃ = {B : P ↦ P(B) is B_M measurable}. Then B̃ is a σ-algebra and
contains B_0. The result now follows from Theorem 1.2.1.
(ii) It is enough to show that P ↦ ∫ f dP is B_M measurable. This is immediate for
f simple, and any bounded measurable f is a limit of simple functions.
Proposition 2.2.3. Let f_P(x) be a bounded jointly measurable function of (P, x).
Then P ↦ ∫ f_P(x) dP(x) is B_M measurable.
Proof. Consider
G = {F ⊂ M(X) × X : P ↦ P(F_P) is B_M measurable}
Here F_P is the P-section {x : (P, x) ∈ F} of F. G is a λ-system that contains the
π-class of all sets of the form C × B, C ∈ B_M, B ∈ B, and by Theorem 1.2.1 G is the
product σ-algebra on M(X) × X. This proves the proposition when f_P(x) = I_F(P, x).
The proof is completed by verifying the claim for f_P(x) simple and then passing to limits.
Proposition 2.2.3 can be used to prove the measurability of the set of discrete
probabilities.
Proposition 2.2.4. The set of discrete probabilities is a measurable subset of
M(X).
Proof. If E = {(P, x) : P{x} > 0} is a measurable set, then setting f_P(x) = I_E(P, x),
the set of discrete measures is just {P : ∫ f_P(x) dP = 1} and would be measurable by
Proposition 2.2.3. To see that E = {(P, x) : P{x} > 0} is measurable, we show that
(P, x) ↦ P{x} is jointly measurable in (P, x). Consider the set of all measurable
subsets F of X × X such that (P, x) ↦ P(F_x) is measurable in (P, x). As before,
F_x = {y : (x, y) ∈ F}. This class contains all sets of the form B_1 × B_2, is
a λ-system, and by Theorem 1.2.1 is the Borel σ-algebra on X × X. In particular,
(P, x) ↦ P(F_x) is measurable when F = {(x, x) : x ∈ X} is the diagonal, and
E = {(P, x) : P(F_x) > 0}.
Consider the f_P(x) used in Proposition 2.2.4. Then P is continuous iff ∫ f_P(x) dP = 0.
It follows that the set of continuous measures is a measurable set.
If µ is a σ-finite measure on R, then L_µ is a measurable subset of M(X). To see
this, assume without loss of generality that µ is a probability measure. Let B_n be an
increasing sequence of algebras, each with finitely many atoms, whose union generates B.
Denote the atoms of B_n by B_{n1}, B_{n2}, ..., B_{nk(n)}, and for any probability measure P,
set f_P(x) = lim_n Σ_{i=1}^{k(n)} [P(B_{ni})/µ(B_{ni})] I_{B_{ni}}(x) when the limit exists and 0 otherwise. To complete the
argument note that L_µ = {P : ∫ f_P(x) dµ = 1}; a small numerical illustration of this approximation is given below.
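The following short Python sketch is our own illustration (not part of the original text) of the approximating densities f_P^n just described, under simple assumptions: X = [0, 1), µ = Lebesgue measure, B_n the dyadic algebras, and a P with density 2x.

```python
# A numerical sketch of the approximation f_P^n(x) = sum_i [P(B_ni)/mu(B_ni)] 1_{B_ni}(x)
# on [0,1) with mu = Lebesgue measure and dyadic algebras B_n.  When P << mu,
# f_P^n(x) approaches the density dP/dmu(x) as n grows.
import numpy as np

def f_n(p_cdf, n, x):
    """Value of f_P^n at x for the dyadic partition of [0,1) into 2**n cells."""
    i = min(int(x * 2**n), 2**n - 1)             # index of the cell containing x
    lo, hi = i / 2**n, (i + 1) / 2**n
    return (p_cdf(hi) - p_cdf(lo)) / (hi - lo)   # P(B_ni) / mu(B_ni)

p_cdf = lambda t: t**2                           # a P with density 2x on [0,1)
for n in (2, 5, 10, 15):
    print(n, f_n(p_cdf, n, x=0.3))               # approaches 2 * 0.3 = 0.6
```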
2.3 (Prior) Probability Measures on M(X)
2.3.1 X Finite
Suppose X = {1, 2, ..., k}. In this case M(X) can be identified with the (k−1)-
dimensional probability simplex S_k = {(p_1, p_2, ..., p_k) : 0 ≤ p_i ≤ 1, Σ p_i = 1}. One
way of defining a prior on M(X) is by defining a measure on S_k. Any such measure
defines the joint distribution of {P(A) : A ⊂ X}, because for any A, P(A) = Σ_{i∈A} p_i,
where p_k = 1 − Σ_1^{k−1} p_i.
An example of a prior distribution on S_k is the uniform distribution—the normal-
ized Lebesgue measure on {(p_1, p_2, ..., p_{k−1}) : 0 ≤ p_i ≤ 1, Σ p_i ≤ 1}. Another example
is the Dirichlet density, which is given by
Π(p_1, p_2, ..., p_{k−1}) = [Γ(Σ_1^k α_i) / (Γ(α_1) ⋯ Γ(α_k))] p_1^{α_1−1} p_2^{α_2−1} ⋯ p_{k−1}^{α_{k−1}−1} (1 − Σ_1^{k−1} p_i)^{α_k−1}
where α_1, α_2, ..., α_k are positive real numbers. This density will be studied in greater
detail later.
A different parametrization of M(X) yields another method of constructing a prior
on M(X). Assume for ease of exposition that X contains 2^k elements {x_1, x_2, ..., x_{2^k}}.
Let
B_0 = {x_1, x_2, ..., x_{2^{k−1}}} and B_1 = {x_{2^{k−1}+1}, x_{2^{k−1}+2}, ..., x_{2^k}}
be a partition of X into two sets. Let B_00, B_01 be a partition of B_0 into two halves
and B_10, B_11 be a similar partition of B_1. Proceeding this way we obtain partitions
B_{ε_1ε_2...ε_i0}, B_{ε_1ε_2...ε_i1} of B_{ε_1ε_2...ε_i}, where each ε_j is 0 or 1 and i < k. Clearly, this partitioning
stops at i = k.
We next note that the partitions can be used to identify X with E_k = {0,1}^k.
Any x ∈ X corresponds to a sequence ε_1(x)ε_2(x)...ε_k(x), where ε_i(x) = 0 if x is in
B_{ε_1(x)ε_2(x)...ε_{i−1}(x)0} and 1 if x is in B_{ε_1(x)ε_2(x)...ε_{i−1}(x)1}. Conversely, any sequence ε_1ε_2...ε_k
corresponds to the point ∩_{i=1}^k B_{ε_1ε_2...ε_i}. Thus there is a correspondence—depending on
the partition—between the set M(X) of probability measures on X and the set
M(E_k) of probability measures on E_k.
Any probability measure on E_k is determined by quantities of the form
y_{ε_1ε_2...ε_i} = P(ε_{i+1} = 0 | ε_1, ε_2, ..., ε_i)
Specifically, let E*_k be the set of all sequences of 0s and 1s of length less than k, including
the empty sequence ∅. If 0 ≤ y_ε ≤ 1 is given for all ε ∈ E*_k, then a probability
on E_k is defined by
P(ε_1ε_2...ε_k) = ∏_{i: ε_i=0} y_{ε_1ε_2...ε_{i−1}} ∏_{i: ε_i=1} (1 − y_{ε_1ε_2...ε_{i−1}})
where i = 1 corresponds to the empty sequence ∅. Hence construction of a prior on
E_k amounts to a specification of the joint distribution of {y_ε : ε ∈ E*_k}.
A little reflection will show that all we have done is to reparametrize a probability
P on X by
P(B_0), P(B_00|B_0), P(B_10|B_1), ..., P(B_{ε_1ε_2...ε_{k−1}0}|B_{ε_1ε_2...ε_{k−1}})
Of interest to us is the case where the Y_εs, equivalently the P(B_{ε0}|B_ε)s, are all indepen-
dent. The case when these are independent beta random variables—the Polya tree
processes—will be studied in Chapter 3. A small sampling sketch of this reparametrization is given below.
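The following Python sketch is an illustration of ours rather than a construction from the book: it samples a random probability on a finite set with 2^k points by drawing the conditional split probabilities P(B_{ε0}|B_ε) as independent Beta variables; the Beta(a, a) parameters are an arbitrary illustrative choice.

```python
# Sampling a random pmf on 2**k points via independent Beta splits along the
# binary partition tree described above (illustrative Beta(a, a) splits).
import numpy as np

rng = np.random.default_rng(0)

def sample_random_pmf(k, a=1.0):
    """Return a random pmf on 2**k points built from independent Beta splits."""
    masses = np.array([1.0])                    # mass of the single cell at level 0
    for _ in range(k):
        y = rng.beta(a, a, size=masses.size)    # Y_e = P(B_{e0} | B_e), one per current cell
        left, right = masses * y, masses * (1.0 - y)
        masses = np.column_stack([left, right]).ravel()   # children in binary order
    return masses

p = sample_random_pmf(k=3)
print(p, p.sum())    # a random point of the simplex S_8; the coordinates sum to 1
```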
Yet another method of obtaining a prior distribution on M(X) is via De Finetti's
theorem. De Finetti's theorem plays a fundamental role in Bayesian inference, and
we refer the reader to [144] for an extensive discussion.
Let X_1, X_2, ..., X_n be X-valued random variables. X_1, X_2, ..., X_n are said to be ex-
changeable if X_1, X_2, ..., X_n and X_{π(1)}, X_{π(2)}, ..., X_{π(n)} have the same distribution for
every permutation π of {1, 2, ..., n}. A sequence X_1, X_2, ... is said to be exchangeable
if X_1, X_2, ..., X_n is exchangeable for every n.
Theorem 2.3.1. [De Finetti] A sequence of X-valued random variables is ex-
changeable iff there is a unique measure Π on M(X) such that for all n,
∫_{M(X)} ∏_{i=1}^n p(x_i) dΠ(p) = Pr{X_1 = x_1, X_2 = x_2, ..., X_n = x_n}
In general it is not easy to construct Π from the distribution of the X_is. Typically,
we will have a natural candidate for Π. By uniqueness, it is enough to verify the
preceding equation. On the other hand, given Π, the behavior of X_1, X_2, ... often
gives insight into the structure of Π.
As an example, let X = {x_1, x_2, ..., x_k}. Let α_1, α_2, ..., α_k be positive integers and let
ᾱ(i) = α_i/Σ_j α_j. Consider the following urn scheme: suppose a box contains balls of
k colors, with α_i balls of color i. Choose a ball at random, so that P(X_1 = i) = ᾱ(i).
Replace the ball and add one more of the same color. Clearly, P(X_2 = j | X_1 = i) =
(α_j + δ_i(j))/(Σ_l α_l + 1), where δ_i(j) = 1 if i = j and 0 otherwise. Repeat this process
to obtain X_3, X_4, .... Then, as illustrated by the simulation sketch below,
(i) X_1, X_2, ... are exchangeable; and
(ii) the prior Π for this case is the Dirichlet density on S_k.
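A small simulation sketch of this urn scheme (our own illustration, with arbitrary initial counts): the draws are exchangeable, and the limiting color frequencies of a single run are one realization from the Dirichlet mixing measure.

```python
# Polya urn simulation: draw balls with replacement, adding one ball of the
# drawn color each time; the long-run color frequencies are Dirichlet distributed.
import numpy as np

rng = np.random.default_rng(1)

def polya_urn(alpha, n):
    """Generate X_1, ..., X_n from the urn with initial counts alpha."""
    counts = np.asarray(alpha, dtype=float).copy()
    draws = []
    for _ in range(n):
        x = rng.choice(len(counts), p=counts / counts.sum())
        draws.append(x)
        counts[x] += 1.0          # replace and add one more ball of the same color
    return np.array(draws)

alpha = [2.0, 1.0, 1.0]
x = polya_urn(alpha, n=5000)
print(np.bincount(x, minlength=3) / len(x))   # one realization of the limiting frequencies
```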
2.3.2 X=R
We next turn to construction of measures on M(X). Because the elements of M(X)
are functions on B, M(X) can be viewed as a subset of [0,1]^B, where the product
space [0,1]^B is equipped with the canonical product σ-algebra, which makes all the
coordinate functions measurable. Note that the restriction of the product σ-algebra
to M(X) is just B_M. A natural attempt to construct measures on M(X) would be to
use Kolmogorov's consistency theorem to construct a probability measure on [0,1]^B,
which could then be restricted to M(X). However M(X) is not measurable as a
subset of [0,1]^B, and that makes this approach somewhat inconvenient. To see that
M(X) is not measurable, note that singletons are measurable subsets of M(X) but
not so in the product space.
When X=R, distribution functions turn out to be a useful crutch to construct
priors on M(R). To elaborate:
(i) Let Q be a dense subset of R and let F_Q be the set of all real-valued functions F on Q such
that
(a) F is right-continuous on Q,
(b) F is nondecreasing, and
(c) lim_{t→∞} F(t) = 1, lim_{t→−∞} F(t) = 0.
(ii) Let F be the set of all real-valued functions F on R such that
(a) F is right-continuous on R,
(b) F is nondecreasing, and
(c) lim_{t→∞} F(t) = 1, lim_{t→−∞} F(t) = 0.
(iii) M(R) = {P : P is a probability measure on R}.
There is a natural 1-1 correspondence between these three sets: let φ_1 : M(R) →
F be the function that takes a probability measure P to its distribution function
F_P(t) = P(−∞, t], and let φ_2 : F → F_Q be the function that maps a distribution
function to its restriction to Q. These maps are 1-1, onto, and bi-measurable. Thus
any probability measure on F_Q can be transferred to a probability on F and then
to M(R). A prior on F_Q only involves the distributions of
(F(t_1), F(t_2) − F(t_1), ..., F(t_k) − F(t_{k−1}))
for t_is in Q. However, because any F(t) is a limit of F(t_n), t_n ∈ Q, the distributions of
quantities like (F(t_1), F(t_2) − F(t_1), ..., F(t_k) − F(t_{k−1})) for t_i real can be recovered,
at least as limits. On the other hand, since a general Borel set B has no simple
description in terms of intervals, one can assert the existence of a distribution for
P(B) that is compatible with the prior on F_Q, but it may not be possible to arrive
at anything resembling an explicit description of this distribution.
It is convenient to use the notation L(·|Π) to stand for the distribution or law of a
quantity under the distribution Π.
Theorem 2.3.2. Let Q be a countable dense subset of R. Suppose for every k and
every collection t_1 < t_2 < ... < t_k with {t_1, t_2, ..., t_k} ⊂ Q, Π_{t_1,t_2,...,t_k} is a probability
measure on [0,1]^k which is a specification of a distribution of (F(t_1), F(t_2), ..., F(t_k))
such that
(i) if {t_1, t_2, ..., t_k} ⊂ {s_1, s_2, ..., s_l} then the marginal distribution on (t_1, t_2, ..., t_k)
obtained from Π_{s_1,s_2,...,s_l} is Π_{t_1,t_2,...,t_k};
(ii) if t_1 < t_2 then Π_{t_1,t_2}{F(t_1) ≤ F(t_2)} = 1;
(iii) if (t_{1n}, t_{2n}, ..., t_{kn}) ↓ (t_1, t_2, ..., t_k) then Π_{t_{1n},t_{2n},...,t_{kn}} converges in distribution
to Π_{t_1,t_2,...,t_k}; and
(iv) if t_n ↓ −∞ then Π_{t_n} converges in distribution to the point mass at 0, and if t_n ↑ ∞ then Π_{t_n}
converges in distribution to the point mass at 1;
then there exists a probability measure Π on M(R) such that for every t_1 < t_2 < ... <
t_k with {t_1, t_2, ..., t_k} ⊂ Q,
L((F(t_1), F(t_2), ..., F(t_k)) | Π) = Π_{t_1,t_2,...,t_k}.
Proof. By the Kolmogorov consistency theorem, (i) ensures the existence of a proba-
bility measure Π on [0,1]^Q with the Π_{t_1,t_2,...,t_k} as marginals. We will argue that Π(F_Q) = 1.
Let F_1 = ∩_{t_i<t_j} {F ∈ [0,1]^Q : F(t_i) ≤ F(t_j)}. Because Q is countable, by (ii),
Π(F_1) = 1.
Next, fix t ∈ Q and a sequence t_n in Q decreasing to t. On F_1, F(t_n) as a function of F
is decreasing in n and hence has a limit. If F^+(t) = lim_n F(t_n) then F^+(t) ≥ F(t), and
by assumption (iii) E_Π F^+(t) = E_Π F(t), so that F^+(t) = F(t) a.e. Consequently
Π{F ∈ F_1 : F is right-continuous at t} = 1
and the countability of Q yields
Π{F : F is monotone and F is right-continuous at all t ∈ Q} = 1
A similar argument shows that with Π probability 1, for F in F_1, lim_{t→∞} F(t) =
1 and lim_{t→−∞} F(t) = 0. This shows that Π(F_Q) = 1.
Thus we have established the existence of a probability measure on F_Q. Using the
discussion preceding the theorem this prior can be lifted to all of M(R).
The assumptions of Theorem 2.3.2 require specification of finite-dimensional dis-
tributions only for t_is in Q, and the conclusion also involves only the finite-dimensional
distributions for t_is in Q. It is easy to see that if one starts with Π_{t_1,t_2,...,t_k} with t_is
real and satisfying the conditions of Theorem 2.3.2, then one would get a Π for which
the marginals are Π_{t_1,t_2,...,t_k} for t_is real.
A convenient way of specifying the distribution of (F(t_1), F(t_2), ..., F(t_k)) for t_1 <
t_2 < ... < t_k is by specifying the distribution, say Π'_{t_1,t_2,...,t_k}, of
(F(t_1), F(t_2) − F(t_1), ..., F(t_k) − F(t_{k−1}))
The convenience arises from the fact that (−∞, t_1], (t_1, t_2], ..., (t_k, ∞) can be thought
of as k + 1 cells and (p_1, p_2, ..., p_{k+1}) as the corresponding multinomial probabili-
ties. Note that Π'_{t_1,t_2,...,t_k} is a probability measure on S_k = {(p_1, p_2, ..., p_k) : p_i ≥
0, Σ_1^k p_i ≤ 1}. If the specifications of the collection Π'_{t_1,t_2,...,t_k} satisfy assumptions
(ii), (iii), and (iv) of Theorem 2.3.2, then so would the collection Π_{t_1,t_2,...,t_k} = L((p_1, p_1 +
p_2, ..., Σ_1^k p_i) | Π'_{t_1,t_2,...,t_k}). These observations give the following easy variant of Theo-
rem 2.3.2.
Theorem 2.3.3. Suppose that for every k and every collection t_1 < t_2 < ... < t_k
with {t_1, t_2, ..., t_k} ⊂ R, Π'_{t_1,t_2,...,t_k} is a probability measure on S_k = {(p_1, p_2, ..., p_k) :
p_i ≥ 0, Σ_1^k p_i ≤ 1} such that
(i) if {t_1, t_2, ..., t_k} ⊂ {s_1, s_2, ..., s_l} then the marginal distribution on (t_1, t_2, ..., t_k)
obtained from Π'_{s_1,s_2,...,s_l} is Π'_{t_1,t_2,...,t_k};
(ii) if (t_{1n}, t_{2n}, ..., t_{kn}) ↓ (t_1, t_2, ..., t_k) then Π'_{t_{1n},t_{2n},...,t_{kn}} converges in distribution
to Π'_{t_1,t_2,...,t_k}; and
(iii) if t_n ↓ −∞ then Π'_{t_n} converges in distribution to the point mass at 0, and if t_n ↑ ∞ then Π'_{t_n}
converges in distribution to the point mass at 1;
then there exists a probability measure Π on F (equivalently on M(R)) such that for
every t_1 < t_2 < ... < t_k with {t_1, t_2, ..., t_k} ⊂ R,
L((F(t_1), F(t_2) − F(t_1), ..., F(t_k) − F(t_{k−1})) | Π) = Π'_{t_1,t_2,...,t_k}
Suppose (B_1, B_2, ..., B_k) is a collection of disjoint subsets of R; the next theorem
shows that if the distributions of (P(B_1), P(B_2), ..., P(B_k)) are themselves prescribed
consistently then the prior Π will have the prescribed marginal distribution for
(P(B_1), P(B_2), ..., P(B_k)).
Theorem 2.3.4. Suppose for each collection of disjoint Borel sets (B_1, B_2, ..., B_k)
a distribution Π_{B_1,B_2,...,B_k} is assigned for (P(B_1), P(B_2), ..., P(B_k)) such that
(i) Π_{B_1,B_2,...,B_k} is a probability measure on the k-dimensional probability simplex S_k, and
if A_1, A_2, ..., A_l is another collection of disjoint Borel sets whose elements are
unions of sets from (B_1, B_2, ..., B_k), then
Π_{A_1,A_2,...,A_l} = distribution of (Σ_{B_i⊂A_1} P(B_i), Σ_{B_i⊂A_2} P(B_i), ..., Σ_{B_i⊂A_l} P(B_i));
(ii) if B_n ↓ ∅ then Π_{B_n} → 0 in distribution; and
(iii) P(R) = 1.
Then there exists a probability measure Π on M(R) such that for any collection of
disjoint Borel sets (B_1, B_2, ..., B_k), the marginal distribution of (P(B_1), ..., P(B_k))
under Π is Π_{B_1,B_2,...,B_k}.
Remark 2.3.1. Given Π_{B_1,B_2,...,B_k} as earlier, we can extend the definition to obtain
Π_{A_1,A_2,...,A_m} for any collection (not necessarily disjoint) of Borel sets A_1, A_2, ..., A_m.
Toward this, let B_1 = A_1, B_i = A_i − ∪_{j<i} A_j, and define Π_{A_1,A_2,...,A_m} as the distribution
of (P(B_1), P(B_1) + P(B_2), ..., Σ_1^m P(B_j)) under Π_{B_1,B_2,...,B_m}. The following proof
shows that the marginal distribution under Π of (P(A_1), P(A_2), ..., P(A_k)) for any
collection of Borel sets is Π_{A_1,A_2,...,A_k}.
Proof. As in Theorem 2.3.3, start with partitions of the form B_i = (t_{i−1}, t_i] for
i = 1, 2, ..., k, and let Π be the measure obtained on F. Let φ_2 be the map from F to
M(R) defined by φ_2(F) = P_F, where P_F is the probability measure corresponding to
F. It is easy to see that this map is 1-1 and measurable. We will continue to denote
by Π the induced measure on M(R).
Π by construction sits on M(R). What we then need to show is that the marginal
distribution of (P(B_1), P(B_2), ..., P(B_k)) under Π is Π_{B_1,B_2,...,B_k}.
Step 1. Assumption (ii) implies that if (B_{1n}, B_{2n}, ..., B_{kn}) ↓ (B_1, B_2, ..., B_k) then
(P(B_{1n}), P(B_{2n}), ..., P(B_{kn})) → (P(B_1), P(B_2), ..., P(B_k)) in distribution.
To see this, write
(P(B_{1n}), P(B_{2n}), ..., P(B_{kn}))
= (P(B_1) + (P(B_{1n}) − P(B_1)), P(B_2) + (P(B_{2n}) − P(B_2)), ...,
P(B_k) + (P(B_{kn}) − P(B_k)))
and for each i, (B_{in} − B_i) ↓ ∅, so (P(B_{in}) − P(B_i)) goes to 0 in distribution
and hence in probability. As a result, the whole vector
((P(B_{1n}) − P(B_1)), (P(B_{2n}) − P(B_2)), ..., (P(B_{kn}) − P(B_k))) → 0 in probability.
Step 2. Denote by B_0 the algebra generated by intervals of the form (a, b]. For
any B_1, B_2, ..., B_k, let L(P(B_1), P(B_2), ..., P(B_k) | Π) denote the distribution of the
vector (P(B_1), P(B_2), ..., P(B_k)) under Π. Fix k. Let C_i = (a_i, b_i], i = 2, ..., k.
Consider
B̂ = {B_1 : L(P(B_1), P(C_2), ..., P(C_k) | Π) = Π_{B_1,C_2,...,C_k}}
Then B̂ contains all sets of the form (a, b], is closed under disjoint unions of such
sets, and hence contains B_0. In addition, by Step 1 it is a monotone class. So B̂ is
B.
Step 3. Now consider
{B_2 : L(P(B_1), P(B_2), P(C_3), ..., P(C_k) | Π) = Π_{B_1,B_2,C_3,...,C_k}}
From Step 2, this class contains all sets of the form (a, b] and their finite disjoint
unions, and hence contains B_0. Further, it is a monotone class and so is B. Continuing
similarly, it follows that for any Borel sets B_1, B_2, ..., B_k,
L(P(B_1), P(B_2), ..., P(B_k) | Π) = Π_{B_1,B_2,...,B_k}.
Example 2.3.1. Let α be a finite measure on R. For any partition (B_1, B_2, ..., B_k),
let Π_{B_1,B_2,...,B_k} on S_k be the Dirichlet distribution with parameter (α(B_1), α(B_2), ..., α(B_k)). We will show in Chap-
ter 3 that this assignment satisfies the conditions of Theorem 2.3.4; a small numerical sketch appears below.
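The following Python sketch is our own illustration of Example 2.3.1 under an assumed base measure α = c·N(0,1): the marginal of P(B) under the Dirichlet assignment is beta(α(B), α(R) − α(B)), so as a set B_n shrinks to the empty set its mass concentrates at 0, in line with continuity condition (ii) of Theorem 2.3.4.

```python
# Marginal law of P(B_n) for shrinking sets B_n = (0, eps] under the Dirichlet
# assignment of Example 2.3.1 with base measure alpha = c * N(0,1) (illustrative).
import math
import numpy as np

rng = np.random.default_rng(2)
c = 5.0
Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))   # standard normal cdf

for eps in (1.0, 0.1, 0.01):
    a_B = c * (Phi(eps) - Phi(0.0))              # alpha(B_n) for B_n = (0, eps]
    draws = rng.beta(a_B, c - a_B, size=5000)    # marginal of P(B_n): beta(alpha(B_n), c - alpha(B_n))
    print(eps, draws.mean(), np.quantile(draws, 0.95))   # concentrates near 0 as eps shrinks
```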
Remark 2.3.2. Theorem 2.3.4 on constructing a measure Π on F through finite-
dimensional distributions can be viewed from a different angle. Toward this, for each
n divide the interval [−2^n, 2^n] into intervals of length 2^{−n} and let −2^n = t_{n1} <
t_{n2} < ... < t_{nk(n)} = 2^n denote the endpoints of the intervals. These define a partition
of R into k(n) + 1 cells in an obvious way. Any probability (p_1, p_2, ..., p_{k(n)+1}) on
these k(n) + 1 cells corresponds to a distribution function on R which is constant on
each interval, and thus any probability Π_{t_{n1},t_{n2},...,t_{nk(n)}} on S_{k(n)+1} defines a probability
measure µ_n on F_n, the set of all distribution functions that are constant on the intervals
(t_{ni}, t_{n,i+1}]. The consistency assumption on Π_{t_{n1},t_{n2},...,t_{nk(n)}} shows that the marginal
distribution on F_n obtained from µ_{n+1} is just µ_n. Now it can be shown that
1. {µ_n}_{n≥1} is tight as a sequence of probability measures on F. To see this, let
ε_i ↓ 0 and let K_i be a sequence of compact subsets of R. Then
{P : P(K_i) ≥ 1 − ε_i for all i}
is a compact subset of M(R). What is needed to show tightness is that given δ,
there is a set of the form given earlier with µ_n measure greater than 1 − δ for all n.
Use assumptions (i) and (iii) of Theorem 2.3.4 and show that for each i one can
get an n_i such that for all n, µ_n{F : F(−2^{n_i}) > ε_i or 1 − F(2^{n_i}) > ε_i} ≤ δ/2^i;
2. {µ_n} converges to a measure Π; and
3. Π satisfies the conclusions of Theorem 2.3.4.
2.3.3 Tail Free Priors
When X is finite, we have seen that by partitioning X into
{B_0, B_1}, {B_00, B_01, B_10, B_11}, ...
and reparametrizing a probability by P(B_0), P(B_00 | B_0), ..., we can identify probability measures
on X with probability measures on E_k—the set of sequences of 0s and 1s of length k. Tail free priors arise
when these conditional probabilities are independent. In this section we extend this
method to the case X = R.
Let E be the set of all infinite sequences of 0s and 1s, i.e., E = {0,1}^N. Denote by E_k all
sequences ε_1, ε_2, ..., ε_k of 0s and 1s of length k, and let E* = ∪_k E_k be the set of all sequences
of 0s and 1s of finite length. We will denote elements of E* by ε.
Start with a partition
T_0 = {B_0, B_1}
of X into two sets. Let
T_1 = {B_00, B_01, B_10, B_11}
where B_00, B_01 is a partition of B_0 and B_10, B_11 is a partition of B_1. Proceeding this
way, let T_n be a partition consisting of sets of the form B_ε, where ε ∈ E_n, and further
B_{ε0}, B_{ε1} is a partition of B_ε.
We assume that we are given a sequence of partitions T = {T_n}_{n≥1} constructed as
in the last paragraph such that the sets {B_ε : ε ∈ E*} generate the Borel σ-algebra.
Definition 2.3.1. A prior Π on M(R) is said to be tail free with respect to
T = {T_n}_{n≥1} if the rows in
{P(B_0)}
{P(B_00|B_0), P(B_10|B_1)}
{P(B_000|B_00), P(B_010|B_01), P(B_100|B_10), P(B_110|B_11)}
·········
are independent.
To turn to the construction of tail free priors on M(R), start with a dense set of
numbers Q, like the binary rationals in (0,1), write it as Q = {a_ε : ε ∈ E*},
and construct the following sequence of partitions of R:
T_0 = {B_0, B_1} is a partition of R into two intervals, say
B_0 = (−∞, a_0], B_1 = (a_0, ∞)
Let T_1 = {B_00, B_01, B_10, B_11}, where
B_00 = (−∞, a_{00}], B_01 = (a_{00}, a_0]
and
B_10 = (a_0, a_{10}], B_11 = (a_{10}, ∞)
Proceeding this way, let T_n be a partition consisting of sets of the form B_{ε_1ε_2...ε_n},
where ε_1, ε_2, ..., ε_n are 0 or 1, and further B_{ε_1ε_2...ε_n0}, B_{ε_1ε_2...ε_n1} is a partition of
B_{ε_1ε_2...ε_n}.
The assumption that Q is dense is equivalent to the statement that the sequence
of partitions T = {T_n}_{n≥1} constructed as in the last paragraph is such that the sets
{B_ε : ε ∈ E*} generate the Borel σ-algebra.
For each ε ∈ E*, let Y_ε be a random variable taking values in [0,1]. If we set
Y_ε = P(B_{ε0}|B_ε), then for each k, {Y_ε : ε ∈ ∪_{i≤k} E_i} define a joint distribution for
{P(B_ε) : ε ∈ E_k}. By construction, these are consistent. In order for these to define a
prior on M(R) we need to ensure that the continuity condition (iii) of Theorem 2.3.2
holds.
Theorem 2.3.5. If Y_ε = P(B_{ε0}|B_ε), where {Y_ε : ε ∈ E*} is a family of [0,1]-valued
random variables such that
(i)
Y_∅ ⊥ {Y_0, Y_1} ⊥ {Y_00, Y_01, Y_10, Y_11} ⊥ ...
(ii) for each ε ∈ E*,
Y_{ε0} Y_{ε00} Y_{ε000} ⋯ = 0 and Y_{ε1} Y_{ε11} ⋯ = 0    (2.1)
then there exists a tail free prior Π on M(R) (with respect to the partition under
consideration) such that Y_ε = P(B_{ε0}|B_ε).
Proof. As noted earlier we need to verify the continuity condition (iii) of Theorem 2.3.2. In the
current situation it amounts to showing that if ε_0 = ε^0_1 ε^0_2 ... ε^0_k and, as n → ∞, a_n
decreases to a_{ε_0}, then the distribution of F(a_n) converges to that of F(a_{ε_0}). Because any
sequence of a_εs decreasing to a_{ε_0} is a subsequence of a_{ε_01}, a_{ε_010}, a_{ε_0100}, ···,
F(a_{ε_010...0}) = F(a_{ε_0}) + P(B_{ε_010...0})
and
P(B_{ε_010...0}) = P(B_{ε_0})(1 − Y_{ε_0}) Y_{ε_01} Y_{ε_010} ⋯
the result follows from (ii).
These discussions can be usefully and elegantly viewed by identifying R with the
space of sequences of 0s and 1s.
As before, let E be {0,1}^N. Any probability on E gives rise to the collection of
numbers {y_ε : ε ∈ E*}, where y_{ε_1ε_2...ε_n} = P(ε_{n+1} = 0 | ε_1ε_2...ε_n). Conversely, setting
y_{ε_1ε_2...ε_n} = P(ε_{n+1} = 0 | ε_1ε_2...ε_n), any set of numbers {y_ε : ε ∈ E*} with 0 ≤ y_ε ≤ 1
determines a probability on E. In other words,
P(ε_1ε_2...ε_k) = ∏_{i: ε_i=0} y_{ε_1ε_2...ε_{i−1}} ∏_{i: ε_i=1} (1 − y_{ε_1ε_2...ε_{i−1}})    (2.2)
Hence, to define a prior on M(E), we need to specify a joint distribution for {Y_ε :
ε ∈ E*}, where each Y_ε is between 0 and 1.
As in the finite case, we want to use the partitions T = {T_n}_{n≥1} to identify R
with sequences of 0s and 1s. Let x ∈ R. φ(x) is the function that sends x to the
sequence ε_1(x)ε_2(x)... in E, where
ε_1(x) = 0 if x ∈ B_0, ε_1(x) = 1 if x ∈ B_1;
ε_i(x) = 0 if x ∈ B_{ε_1ε_2...ε_{i−1}0}, ε_i(x) = 1 if x ∈ B_{ε_1ε_2...ε_{i−1}1}
Because each T_n is a partition of R, φ defines a function from R into E. φ is 1-
1 and measurable but not onto E. The range of φ will not contain sequences that are
eventually 0. This is another way of saying that with binary expansions we consider
the expansion with 1s in the tails rather than 0s. If D = {ε ∈ E : ε_i = 0 for all i ≥
n for some n} ∪ {ε : ε_i = 1 for all i}, then φ is 1-1 and measurable from R onto D^c ⊂ E.
Further, φ^{−1} is measurable on D^c ⊂ E. Thus, as before, the set of probability measures
M(R) can be identified with M_0(E)—the set of probability measures on E that give
mass 0 to D. This reduces the task of defining a prior on M(R) to one of defining a
prior on M_0(E).
The condition P(D) = 0 translates to
y_{ε0} y_{ε00} ⋯ = 0 for all ε ∈ E*, and y_1 y_{11} ⋯ = 0    (2.3)
As before, defining a prior on M(R), equivalently on M_0(E), amounts to defining
{Y_ε : ε ∈ E*} such that (2.3) is satisfied almost surely. Satisfying (2.3) almost surely
corresponds to condition (ii) in Theorem 2.3.5.
A useful way to specify a prior on M(E) is by having the Y_εs for ε of different lengths
be mutually independent, which yields tail free priors. In Chapter 3 we return to this
construction to develop Polya tree priors.
Tail free priors are conjugate in the sense that if the prior is tail free, then so is the
posterior. To avoid getting lost in a notational mess we first state an easy lemma.
Lemma 2.3.1. Let ξ_1, ξ_2, ..., ξ_k be independent random vectors (not necessarily
of the same dimension) with joint distribution µ = ⊗_1^k µ_i. Let J be a subset of
{1, 2, ..., k} and let µ* be the probability with
dµ* = C ∏_{j∈J} ξ_j dµ
Then ξ_1, ξ_2, ..., ξ_k are independent under µ*.
Proof. Clearly C = ∏_{j∈J} [∫ ξ_j dµ_j]^{−1}. Further,
µ*(ξ_i ∈ B_i : 1 ≤ i ≤ k) = ∫_{(ξ_i∈B_i: 1≤i≤k)} C ∏_{j∈J} ξ_j dµ
= ∏_{i∉J} µ_i(B_i) · ∏_{j∈J} [∫_{B_j} ξ_j dµ_j / ∫ ξ_j dµ_j]
Theorem 2.3.6. Suppose Π is a tail free prior on M(R) with respect to the sequence
of partitions {T_k}_{k≥1}. Given P, let X_1, X_2, ..., X_n be i.i.d. P; then the posterior is
also tail free with respect to {T_k}_{k≥1}.
Proof. We will prove the result for n = 1; the general case follows by iteration.
Consider the posterior distribution given T_k. Because {B_ε : ε ∈ E_k} are the atoms of
T_k, it is enough to find the posterior distribution given X ∈ B_ε for each ε ∈ E_k.
Let ε = ε_1ε_2...ε_k. Then the likelihood of P(B_ε) is
∏_{j=1}^k P(B_{ε_1ε_2...ε_j} | B_{ε_1ε_2...ε_{j−1}})
so that the posterior density of {P(B_{ε'0} | B_{ε'})} with respect to Π is
C ∏_{i: ε_i=0} P(B_{ε_1...ε_{i−1}0} | B_{ε_1...ε_{i−1}}) ∏_{i: ε_i=1} (1 − P(B_{ε_1...ε_{i−1}0} | B_{ε_1...ε_{i−1}}))
From Lemma 2.3.1,
{P(B_{ε1}|B_ε) : ε ∈ E_1}, {P(B_{ε1}|B_ε) : ε ∈ E_2}, ..., {P(B_{ε1}|B_ε) : ε ∈ E_{k−1}}
are independent under the posterior.
In particular, if m < k, independence holds for
{P(B_{ε1}|B_ε) : ε ∈ E_1}, {P(B_{ε1}|B_ε) : ε ∈ E_2}, ..., {P(B_{ε1}|B_ε) : ε ∈ E_{m−1}}.
Letting k → ∞, an application of the martingale convergence theorem gives the
conclusion for the posterior given X_1.
In this section we have discussed two general methods of constructing priors on
M(R). There are several other techniques for obtaining nonparametric priors. There
are priors that arise from stochastic processes: if f is the sample path of a stochastic
process then f̂ = k(f)^{−1} e^f yields a random density when k(f) = ∫ e^{f(x)} dx is finite. We
will study a method of this kind in the context of density estimation. Or one can
look at expansions of a density using some orthogonal basis and put a prior on the
coefficients. A class of priors called neutral to the right priors, somewhat like tail free
priors, will be studied in Chapter 10 on survival analysis.
2.4 Tail Free Priors and 0-1 Laws
Suppose Π is a prior on M(R) and {B_ε : ε ∈ E*} is a set of partitions as described
in the last section. To repeat, for each n, T_n = {B_ε : ε ∈ E_n} is a partition of R and
B_{ε0}, B_{ε1} is a partition of B_ε. Further, B = σ{B_ε : ε ∈ E*}. Unlike the last section, it
is not required that the B_ε be intervals. The choice of intervals as sets in the partition
played a crucial role in the construction of a probability measure on M(R). Given a
probability measure on M(R), the following notions are meaningful even if the B_ε
are not intervals.
For notational convenience, as before, denote by Y_ε = P(B_{ε0}|B_ε). Formally, Y_ε is a
random variable defined on M(R) with Y_ε(P) = P(B_{ε0}|B_ε). Recall that Π is said to
be tail free with respect to the partitions T = {T_n}_{n≥1} if
Y_∅ ⊥ {Y_0, Y_1} ⊥ {Y_00, Y_01, Y_10, Y_11} ⊥ ...
Theorem 2.4.1. Let λ be any finite measure on R, with λ(B_ε) > 0 for all ε. If
0 < Y_ε < 1 for all ε then
Π{P : P << λ} = 0 or 1
Proof. Assume without loss of generality that λ is a probability measure.
Let Z_0 = Y_∅, Z_1 = {Y_0, Y_1}, Z_2 = {Y_00, Y_01, Y_10, Y_11}, .... By assumption, Z_1, Z_2, ...
are independent random vectors. The basic idea of the proof is to show that L(λ) =
{P : P << λ} is a tail set with respect to the Z_is. The Kolmogorov 0-1 law
then yields the conclusion. In the next two lemmas it is shown that for each n, L(λ)
depends only on Z_n, Z_{n+1}, ... and is hence a tail set.
Lemma 2.4.1. When P(B_ε) > 0, define P(·|B_ε) to be the probability P(A|B_ε) =
P(A ∩ B_ε)/P(B_ε). Define λ(·|B_ε) similarly. Fix n; then
L(λ) = {P : P(·|B_ε) << λ(·|B_ε) for all ε ∈ E_n such that P(B_ε) > 0}
Proof. Because
P(A) = Σ_{ε∈E_n} P(A|B_ε) P(B_ε) and λ(A) = Σ_{ε∈E_n} λ(A|B_ε) λ(B_ε)
the result follows immediately.
Lemma 2.4.2. Let Y = {Y_ε(P) : ε ∈ E*, P ∈ M(R)}. The elements y of Y are
thus a collection of conditional probabilities arising from a probability. Conversely,
any element y of Y gives rise to a probability, which we denote by P_y. Then for each
ε ∈ E_n, for all A ∈ B, and for every y in Y,
P_y(A|B_ε) depends only on Z_n, Z_{n+1}, ...
Proof. Let
B_0 = {A : for all y, P_y(A|B_ε) depends only on Z_n, Z_{n+1}, ...}
Because 0 < Y_ε < 1 for all ε ∈ E*, P_y(B_ε) > 0 for all ε ∈ E*. Hence B_0 contains the
algebra of finite disjoint unions of elements in {B_ε : ε ∈ ∪_{m>n} E_m} and is a monotone
class. Hence B_0 = B.
Remark 2.4.1. Let Π be tail free with respect to T = {T_n}_{n≥1} such that 0 < Y_ε <
1 for all ε ∈ E*. Argue that P is discrete iff P(·|B_ε) is discrete for all ε ∈ E_n. Now
use the Kolmogorov 0-1 law to conclude that Π{P : P is discrete} = 0 or 1.
The next theorem, due to Kraft, is useful in constructing priors concentrated on
sets like L(λ).
Let Π, {B_ε : ε ∈ E*}, {Y_ε : ε ∈ E*} be as in Theorem 2.4.1, and, as before,
given any realization y = {y_ε : ε ∈ E*}, let P_y denote the corresponding probability
measure on R.
Theorem 2.4.2. Let λ be a probability measure on R such that λ(B_ε) > 0 for all
ε ∈ E*. Suppose
f^n_y(x) = Σ_{ε∈E_n} [P_y(B_ε)/λ(B_ε)] I_{B_ε}(x)
= Σ_{ε∈E_n} [∏_{i:ε_i=0} y_{ε_1...ε_{i−1}} ∏_{i:ε_i=1} (1 − y_{ε_1...ε_{i−1}}) / λ(B_ε)] I_{B_ε}(x)
If sup_n E_Π [f^n_y(x)]^2 ≤ K for all x, then Π{P : P << λ} = 1.
Proof. For each y ∈ Y, by the martingale convergence theorem f^n_y converges almost
surely [λ] to a function f_y. Consider the measure Π × λ, which is the joint distribution
of y and x, on Y × R.
Because for each y, f^n_y → f_y a.s. [λ], we have f^n_y → f_y a.s. [Π × λ]. Further, under
our assumption {f^n_y(x) : n ≥ 1} is uniformly integrable with respect to Π × λ and
hence E_{Π×λ} |f^n_y(x) − f_y(x)| → 0. Now for each y, by Fatou's lemma, E_λ f_y ≤ 1.
On the other hand, E_{Π×λ} f^n_y(x) = 1 for all n, and by the L_1-convergence mentioned
earlier, E_{Π×λ} f_y(x) = 1. Thus E_λ f_y = 1 a.e. [Π] and this shows Π{L(λ)} = 1.
The next theorem is an application of the last theorem. It shows how, given a
probability measure λ, by suitably choosing both the partitions and the parameters
of the Y_εs, we can obtain a prior that concentrates on L(λ). A numerical sketch follows the proof.
Theorem 2.4.3. Let λ be a continuous probability distribution on R. Denote by
F the distribution function of λ and construct a partition as follows:
B_0 = F^{−1}(0, 1/2], B_1 = F^{−1}(1/2, 1]
B_00 = F^{−1}(0, 1/4], B_01 = F^{−1}(1/4, 1/2], B_10 = F^{−1}(1/2, 3/4], B_11 = F^{−1}(3/4, 1]
and in general
B_{ε_1ε_2...ε_n} = F^{−1}( (Σ_1^n ε_i/2^i, Σ_1^n ε_i/2^i + 1/2^n] )
Suppose E(Y_ε) = 1/2 for all ε ∈ E* and sup_{ε∈E_n} V(Y_ε) ≤ b_n, with Σ b_n < ∞. Then
the resulting prior satisfies Π(L(λ)) = 1.
Proof. λ(B_ε) > 0, because λ(B_{ε0}|B_ε) = 1/2 for all B_ε. Fix x. If x ∈ B_{ε_1ε_2...ε_n}, then
f^n_Y(x) = ∏_{i=1}^n [Y_{ε_1ε_2...ε_{i−1}}^{1−ε_i} (1 − Y_{ε_1ε_2...ε_{i−1}})^{ε_i} / (1/2)]
and
E[f^n_Y(x)]^2 = ∏_{i=1}^n 4 E[ (Y^2_{ε_1ε_2...ε_{i−1}})^{1−ε_i} ((1 − Y_{ε_1ε_2...ε_{i−1}})^2)^{ε_i} ] ≤ ∏_{i=1}^n 4 a_i
where a_i = max{ E Y^2_{ε_1ε_2...ε_{i−1}}, E(1 − Y_{ε_1ε_2...ε_{i−1}})^2 }. Now
E Y^2_{ε_1ε_2...ε_{i−1}} = V(Y_{ε_1ε_2...ε_{i−1}}) + (1/2)^2 ≤ b_i + 1/4
and
E(1 − Y_{ε_1ε_2...ε_{i−1}})^2 ≤ b_i + 1/4
Thus ∏_1^n 4 a_i ≤ ∏_1^n (1 + 4 b_i) converges, because Σ b_n < ∞.
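The following Python sketch is our own illustration of the construction in Theorem 2.4.3, under the assumed choices λ = N(0,1) and Y_ε ~ Beta(a_m, a_m) at level m with a_m = m^2, so that E(Y_ε) = 1/2 and V(Y_ε) = 1/(4(2m^2 + 1)) =: b_m with Σ b_m < ∞. It tracks the martingale f^n_Y(x) along one realization and shows it settling down, as the theorem predicts.

```python
# One realization of the martingale f^n_Y(x) = P_Y(B_e(x)) / lambda(B_e(x)) for
# the quantile partitions of lambda = N(0,1) and independent Beta(m**2, m**2) splits.
import math
import numpy as np

rng = np.random.default_rng(4)

def f_path(x, n_max):
    """Running values f^1_Y(x), ..., f^{n_max}_Y(x) along one realization Y."""
    u = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))   # F(x) locates the dyadic path
    vals, val = [], 1.0
    for m in range(1, n_max + 1):
        y = rng.beta(m**2, m**2)     # Y at level m: mean 1/2, variance 1/(4(2 m**2 + 1))
        u *= 2.0
        if u < 1.0:                  # x lies in the "0" child: factor Y / (1/2)
            val *= 2.0 * y
        else:                        # "1" child: factor (1 - Y) / (1/2)
            val *= 2.0 * (1.0 - y)
            u -= 1.0
        vals.append(val)
    return vals

print([round(v, 3) for v in f_path(0.5, 20)])   # stabilizes as n grows
```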
2.5 Space of Probability Measures on M(R)
We next turn to a discussion of probability measures on M(R). To get a feeling for
what goes on we begin by asking when are two probability measures Π1and Π2
equal?
Clearly Π_1 = Π_2 if for any finite collection B_1, B_2, ..., B_k of Borel sets,
(P(B_1), P(B_2), ..., P(B_k))
has the same distribution under both Π_1 and Π_2. This is an immediate consequence
of the definition of B_M.
Next suppose that (C_1, C_2, ..., C_k) are Borel sets. Consider all intersections of the
form
C_1^{ε_1} ∩ C_2^{ε_2} ∩ ··· ∩ C_k^{ε_k}
where ε_i = 0, 1, C_i^1 = C_i and C_i^0 = C_i^c. These intersections give rise to a
partition of X, and since every C_i can be written as a union of elements of this
partition, the distribution of (P(C_1), P(C_2), ..., P(C_k)) is determined by the joint
distribution of the probabilities of the elements of this partition. In other words, if the
distributions of (P(B_1), P(B_2), ..., P(B_k)) under Π_1 and Π_2 are the same for every
finite disjoint collection of Borel sets, then Π_1 = Π_2. Following is another useful
proposition.
Proposition 2.5.1. Let B_0 = {B_i : i ∈ I} be a family of sets closed under finite
intersection that generates the Borel σ-algebra B on X. If for every B_1, B_2, ..., B_k
in B_0, (P(B_1), P(B_2), ..., P(B_k)) has the same distribution under Π_1 and Π_2, then
Π_1 = Π_2.
Proof. Let B^0_M = {E ∈ B_M : Π_1(E) = Π_2(E)}. Then B^0_M is a λ-system. For any finite
subset J of I, by our assumption Π_1 and Π_2 coincide on the σ-algebra B^J_M—the
σ-algebra generated by {P(B_j) : j ∈ J}—and hence B^J_M ⊂ B^0_M. Further, the union of the
B^J_M over all finite subsets of I forms a π-system. Because these also generate B_M,
B^0_M = B_M.
Remark 2.5.1. A convenient choice of B_0 is the collection of all open balls, or all closed
balls, etc. When X = R a very useful choice is the collection {(−∞, a] : a ∈ Q}, where
Q is a dense set in R.
As noted earlier M(R) when equipped with weak convergence becomes a complete
separable metric space with BMas the Borel σ-algebra. Thus a natural topology
on M(R) is the weak topology arising from this metric space structure of M(R).
Formally, we have the following definitions.
Definition 2.5.1. A sequence of probability measures {Π_n} on M(R) is said to
converge weakly to a probability measure Π if
∫ φ(P) dΠ_n → ∫ φ(P) dΠ
for all bounded continuous functions φ on M(R).
Note that continuity of φ is with respect to the weak topology on M(R). If f is a
bounded continuous function on R then φ(P) = ∫ f dP is bounded and continuous on
M(R). However, in general there is no clear description of all the bounded continuous
functions on M(R). If X is compact metric, then the following description is available.
If X is compact metric then, by Prohorov's theorem, so is M(X) under weak
convergence. It follows from the Stone-Weierstrass theorem that finite linear combinations of functions
of the form
∏_{j=1}^{k_i} φ_{f_{i,j}}
where φ_{f_{i,j}}(P) = ∫ f_{i,j}(x) dP(x) with f_{i,j}(x) continuous on X, are dense in the space of
all continuous functions on M(X).
The following result is an extension of a similar result in Sethuraman and Tiwari
[149].
Theorem 2.5.1. A family of probability measures {Π_t : t ∈ T} on M(R) is tight
with respect to weak convergence on M(R) iff the family of expectations {E_{Π_t} : t ∈ T},
where E_{Π_t}(B) = ∫ P(B) dΠ_t(P), is a tight family of probability measures on R.
Proof. Let µ_t = E_{Π_t}. Fix δ > 0. By the tightness of {µ_t : t ∈ T}, for every positive
integer d there exists a compact set K_d in R such that sup_t µ_t(K^c_d) ≤
6δ/(d^3 π^2).
For d = 1, 2, ..., let M_d = {P ∈ M(R) : P(K^c_d) ≤ 1/d}, and let M = ∩_d M_d. Then,
by the portmanteau and Prohorov theorems, M is a compact subset of M(R) in the
weak topology. Further, by Markov's inequality,
Π_t(M^c_d) ≤ d E_{Π_t}(P(K^c_d)) = d µ_t(K^c_d) ≤ 6δ/(d^2 π^2)
Hence, for any t ∈ T, Π_t(M^c) ≤ Σ_d 6δ/(d^2 π^2) = δ. This proves that {Π_t}_{t∈T} is
tight. The converse is easy.
Theorem 2.5.2. Suppose Π, Π_n, n ≥ 1 are probability measures on M(R). If any of
the following holds then Π_n converges weakly to Π.
(i) For any collection (B_1, B_2, ..., B_k) of Borel sets,
L_{Π_n}(P(B_1), P(B_2), ..., P(B_k)) → L_Π(P(B_1), P(B_2), ..., P(B_k))
(ii) For any disjoint collection (B_1, B_2, ..., B_k) of Borel sets,
L_{Π_n}(P(B_1), P(B_2), ..., P(B_k)) → L_Π(P(B_1), P(B_2), ..., P(B_k))
(iii) For any (B_1, B_2, ..., B_k) where, for i = 1, 2, ..., k, B_i = (a_i, b_i],
L_{Π_n}(P(B_1), P(B_2), ..., P(B_k)) → L_Π(P(B_1), P(B_2), ..., P(B_k))
(iv) For any (B_1, B_2, ..., B_k) where, for i = 1, 2, ..., k, B_i = (a_i, b_i] with a_i, b_i rational,
L_{Π_n}(P(B_1), P(B_2), ..., P(B_k)) → L_Π(P(B_1), P(B_2), ..., P(B_k))
(v) For any (B_1, B_2, ..., B_k) where, for i = 1, 2, ..., k, B_i = (−∞, t_i],
L_{Π_n}(P(B_1), P(B_2), ..., P(B_k)) → L_Π(P(B_1), P(B_2), ..., P(B_k))
(vi) For any (B_1, B_2, ..., B_k) where, for i = 1, 2, ..., k, B_i = (−∞, t_i] with t_i rational,
L_{Π_n}(P(B_1), P(B_2), ..., P(B_k)) → L_Π(P(B_1), P(B_2), ..., P(B_k))
Proof. Because (vi) is the weakest, we will show that (vi) implies Π_n → Π weakly. Note
that for all rationals t, E_{Π_n}(P(−∞, t]) → E_Π(P(−∞, t]) and hence E_{Π_n} converges
weakly to E_Π. By Theorem 2.5.1 this shows that {Π_n} is tight. If Π* is the limit
of any subsequence of {Π_n}, then it follows, using Proposition 2.5.1, that Π* = Π.
Remark 2.5.2. Note that Π_n → Π weakly does not imply any of the preceding. The
modifications are easy, however. For example, (i) would be changed to "For any
(B_1, B_2, ..., B_k) of Borel sets such that P ↦ (P(B_1), P(B_2), ..., P(B_k)) is continuous a.e.
Π."
We have considered other topologies on M(R) namely, total variation, setwise con-
vergence and the supremum metric. It is tempting to consider the weak topologies on
probabilities on M(R) induced by these topologies. But as we have observed, these
topologies possess properties that make the notion of weak convergence awkward to
define and work with. Besides, the σ-algebras generated by these topologies, via either
open sets or open balls do not coincide with BM[57]. Our interests do not demand
such a general theory. Our chief interest is when the limit measure Π is degenerate
at P_0, and in this case we can formalize convergence via weak neighborhoods of P_0.
When Π = δ_{P_0}, Π_n → δ_{P_0} weakly iff Π_n(U) → Π(U) for every open neighborhood U of δ_{P_0}.
Because weak neighborhoods of P_0 are of the form U = {P : |∫ f_i dP − ∫ f_i dP_0| < ε, i = 1, ..., k},
weak convergence to a degenerate measure δ_{P_0} can be described in terms of continuous
functions on R rather than those on M(R) and can be verified more easily. The next
proposition is often useful when we work with weak neighborhoods of a probability
P_0 on R.
Proposition 2.5.2. Let Q be a countable dense subset of R. Given any weak neigh-
borhood U of P_0 there exist a_1 < a_2 < ... < a_n in Q and δ > 0 such that
{P : |P[a_i, a_{i+1}) − P_0[a_i, a_{i+1})| < δ for 1 ≤ i ≤ n−1} ⊂ U
Proof. Suppose U = {P : |∫ f dP − ∫ f dP_0| < ε}, where f is continuous with compact
support. Because Q is dense in R, given δ there exist a_1 < a_2 < ... < a_n in Q such that
f(x) = 0 for x ≤ a_1 and x ≥ a_n, and |f(x) − f(y)| < δ for x, y ∈ [a_i, a_{i+1}], 1 ≤ i ≤ n−1.
Then the function f* defined by
f*(x) = f(a_i) for x ∈ [a_i, a_{i+1}), i = 1, 2, ..., n−1
satisfies sup_x |f(x) − f*(x)| < δ.
For any P, ∫ f* dP = Σ_i f(a_i) P[a_i, a_{i+1}), and if |P[a_i, a_{i+1}) − P_0[a_i, a_{i+1})| < δ for all i, then
|∫ f* dP − ∫ f* dP_0| < c k δ, where c = sup_x |f(x)|
and consequently
|∫ f dP − ∫ f dP_0| < 2δ + c k δ
Thus, with B_i = [a_i, a_{i+1}), for small enough δ, {P : |P(B_i) − P_0(B_i)| < δ} is contained
in U. The preceding argument is easily extended if U is of the form
{P : |∫ f_i dP − ∫ f_i dP_0| ≤ ε_i, 1 ≤ i ≤ k, f_i continuous with compact support}
Following is another useful proposition.
Proposition 2.5.3. Let U = {F : sup_{−∞<x<∞} |F_0(x) − F(x)| < ε} be a supre-
mum neighborhood of a continuous distribution function F_0. Then U contains a weak
neighborhood of F_0.
Proof. Choose −∞ = x_0 < x_1 < x_2 < ... < x_k = ∞ such that F_0(x_{i+1}) − F_0(x_i) < ε/4
for i = 0, 1, ..., k−1. Consider
W = {F : |F(x_i) − F_0(x_i)| < ε/4, i = 1, 2, ..., k}
If x ∈ (x_{i−1}, x_i),
|F(x) − F_0(x)| ≤ |F(x_{i−1}) − F_0(x_i)| ∨ |F(x_i) − F_0(x_{i−1})|
≤ |F(x_{i−1}) − F_0(x_{i−1})| + |F_0(x_{i−1}) − F_0(x_i)|
+ |F(x_i) − F_0(x_i)| + |F_0(x_{i−1}) − F_0(x_i)|
which is less than ε if F ∈ W.
2.6 De Finetti’s Theorem
Much of classical statistics has centered around the conceptually simplest setting of
independent and identically distributed observations. In this case, X1,X
2,... are
a sequence of i.i.d. random variables with an unknown common distribution P.In
the parametric case, Pwould be constrained to lie in a parametric family, and in
the general nonparametric situation Pcould be any element of M(R). The Bayesian
framework in this case consists of a prior Π on the parameter set M(R); given P
the X1,X
2,... is modeled as i.i.d. P. In a remarkable theorem, De Finetti showed
that a minimal judgment of exchangeability of the observation sequence leads to the
Bayesian formulation discussed earlier.
In this section we briefly discuss De Finetti’s theorem. A detailed exposition of
the theorem and related topics can be found in Schervish [144] in the section on De
Finetti’s theorem and the section on Extreme models.
As before, let X_1, X_2, ... be a sequence of X-valued random variables defined on
Ω = R^∞.
Definition 2.6.1. Let µ be a probability measure on Ω. The sequence X_1, X_2, ...
is said to be exchangeable if, for each n and for every permutation g of {1, ..., n}, the
distribution of X_1, X_2, ..., X_n is the same as that of X_{g(1)}, X_{g(2)}, ..., X_{g(n)}.
Theorem 2.6.1 (De Finetti). Let µ be a probability measure on Ω. Then
X_1, X_2, ... is exchangeable iff there is a unique probability measure Π on M(R) such
that for all n and for any Borel sets B_1, B_2, ..., B_n,
µ{X_1 ∈ B_1, X_2 ∈ B_2, ..., X_n ∈ B_n} = ∫_{M(R)} ∏_{i=1}^n P(B_i) dΠ(P)    (2.4)
Proof. We begin by proving the theorem when all the X_is take values in a finite set
X = {1, 2, ..., k}. This proof follows Heath and Sudderth [95].
So let X = {1, 2, ..., k} and µ be a probability measure such that X_1, X_2, ...
is exchangeable. For each n, let T_n(X_1, X_2, ..., X_n) = (r_1, r_2, ..., r_k), where r_j =
Σ_{i=1}^n I_{{j}}(X_i) is the number of occurrences of j in X_1, X_2, ..., X_n. Let µ_n denote the
distribution of T_n/n = (r_1/n, r_2/n, ..., r_k/n) under µ. µ_n is then a discrete probability
measure on M(X) supported by points of the form (r_1/n, r_2/n, ..., r_k/n), where for
j = 1, 2, ..., k, r_j ≥ 0 is an integer and Σ r_j = n. Because M(X) is compact, there
is a subsequence {n_i} along which µ_{n_i} converges to a probability measure Π on M(X). We will
argue that Π satisfies (2.4).
Because X_1, X_2, ..., X_n is exchangeable, it is easy to see that the conditional distri-
bution of X_1, X_2, ..., X_n given T_n is also exchangeable. In particular, the conditional
probability given T_n(X_1, X_2, ..., X_n) = (r_1, r_2, ..., r_k) is just the uniform distribution
on T_n^{−1}(r_1, r_2, ..., r_k). In other words, the conditional distribution of X_1, X_2, ..., X_n
given T_n = (r_1, r_2, ..., r_k) is the same as the distribution of n successive draws, without replacement, from
an urn containing n balls with r_i of color i, for i = 1, 2, ..., k.
Fix m and n > m. Then, given T_n(X_1, X_2, ..., X_n) = (r_1, r_2, ..., r_k), the conditional
probability that
X_1 = 1, ..., X_{s_1} = 1, X_{s_1+1} = 2, ..., X_{s_1+s_2} = 2, ..., X_{m−s_k+1} = k, ..., X_m = k
is (r_1)_{s_1}(r_2)_{s_2}...(r_k)_{s_k}/(n)_m, where for any real a and integer b, (a)_b = ∏_{i=0}^{b−1}(a − i).
Because
Σ_{(r_1,...,r_k): Σ r_j = n} [(r_1)_{s_1}(r_2)_{s_2}...(r_k)_{s_k}/(n)_m] µ{T_n/n = (r_1/n, r_2/n, ..., r_k/n)}
= ∫_{M(X)} [(p_1 n)_{s_1}(p_2 n)_{s_2}...(p_k n)_{s_k}/(n)_m] dµ_n(p_1, p_2, ..., p_k)
and as n → ∞ the sequence of functions
(p_1 n)_{s_1}(p_2 n)_{s_2}...(p_k n)_{s_k}/(n)_m
converges uniformly on M(X) to ∏_j p_j^{s_j}, taking the limit through the sub-
sequence {n_i} shows that the probability of
(X_i = 1, 1 ≤ i ≤ s_1; X_i = 2, s_1 + 1 ≤ i ≤ s_1 + s_2; ...; X_i = k, m − s_k + 1 ≤ i ≤ m)
is
∫_{M(X)} ∏_j p_j^{s_j} dΠ(p_1, p_2, ..., p_k)    (2.5)
Uniqueness is immediate because if Π_1, Π_2 are two probability measures on M(X)
satisfying (2.5) then they have the same moments.
To move on to the general case X = R, let B_1, B_2, ..., B_k be any collection of
disjoint Borel sets in R. Set B_0 = (∪_1^k B_i)^c.
To move on to the general case X=R,letB1,B
2,...,B
kbe any collection of
disjoint Borel sets in R. Set B0=k
1Bic.
2.6. DE FINETTI’S THEOREM 85
Define Y_1, Y_2, ... by Y_i = j if X_i ∈ B_j. Because X_1, X_2, ... is exchangeable, so are
Y_1, Y_2, .... Since each Y_i takes only finitely many values, we can use what we have just
proved; writing X_i ∈ B_j for Y_i = j, there is a probability measure Π_{B_1,B_2,...,B_k} on
{(p_1, p_2, ..., p_k) : p_j ≥ 0, Σ p_j ≤ 1} such that for any m,
µ(X_1 ∈ B_{i_1}, X_2 ∈ B_{i_2}, ..., X_m ∈ B_{i_m}) = ∫ ∏_{j=1}^m P(B_{i_j}) dΠ_{B_1,B_2,...,B_k}(P)    (2.6)
where i_1, i_2, ..., i_m are elements of {0, 1, 2, ..., k} and P(B_0) = 1 − Σ_1^k P(B_i).
We will argue that these Π_{B_1,B_2,...,B_k}s satisfy the conditions of Theorem 2.3.4.
If A_1, A_2, ..., A_l is a collection of disjoint Borel sets such that the B_i are unions of sets
from A_1, A_2, ..., A_l, then the distribution of (P(B_1), P(B_2), ..., P(B_k)) obtained from
(P(A_1), P(A_2), ..., P(A_l)) under Π_{A_1,A_2,...,A_l} and the distribution Π_{B_1,B_2,...,B_k} both satisfy (2.5). Uniqueness then
shows that both distributions are the same.
If (B_{1n}, B_{2n}, ..., B_{kn}) ↓ (B_1, B_2, ..., B_k), then (2.6) again shows that the moments of
Π_{B_{1n},B_{2n},...,B_{kn}} converge to the corresponding moments of Π_{B_1,B_2,...,B_k}.
It is easy to verify the other conditions of Theorem 2.3.4. Hence there exists a Π
with the Π_{B_1,B_2,...,B_k}s as marginals. It is easy to verify that Π satisfies (2.4).
De Finetti’s theorem can be viewed from a somewhat general perspective. Let Gn
be the group of permutations on {1,2,...,n}and let G=∪Gn.Everyg∈Ginduces
in a natural way a transformation on Ω = Xthrough the map, if, say gin Gn, then
(x1,...,x
n,...)→ (xg(1),...,x
g(n),...). It is easy to see that the set of exchangeable
probability measures is the same as the set of probability measures on Ω that are
invariant under G. This set is a convex set, and De Finetti’s theorem asserts that the
set of extreme points of this convex set is {P:PM(X)}and that every invariant
measure is representable as an average over the set of extreme points. This view of
exchangeable measures suggests that by suitably enlarging Git would be possible
to obtain priors that are supported by interesting subsets of M(X) . Following is a
simple, trivial example.
Example 2.6.1. Let H={h, e},whereh(x)=xand e(x)=x. Set H=
Hn.If(h1,h
2,...,h
n)) Hn, then the action on Ω is defined by (x1,x
2,...,x
n)→
(h(x1),h(x2),...,h(xn). Then an exchangeable probability measure µis Hinvariant
iff it is a mixture of symmetric i.i.d. probability measures. To see this by De Finetti’s
theorem
µ(A)=P(A)dΠ(P)
86 2. M(X)AND PRIORS ON M(X)
Because by Hinvariance µ(X1A, X2∈−A)=µ(X1A, X2A), it is not hard
to see that EΠ(P(A)P(A))2= 0. Letting Arun through a countable algebra
generating the σ-algebra on X, we have the result.
More nontrivial examples are in Freedman [68].
Sufficiency provides another frame through which De Finetti's theorem can be use-
fully viewed. The ideas leading to such a view and the proofs involve many measure-
theoretic details. Most of the interesting examples involve invariance and sufficiency
in some form. We do not discuss these aspects here but refer the reader to the excel-
lent survey in Schervish [144], the paper by Diaconis and Freedman [44], and Fortini,
Ladelli, and Regazzini [67].
To use De Finetti's theorem to construct a specific prior on M(R), we need to know
what to expect from the prior in terms of the observables X_1, X_2, ..., X_n. Although
this method of assigning a prior is attractive from a philosophical point of view, it
is not easy either to describe explicitly an exchangeable sequence or to identify a prior
given such a sequence. We will not pursue this aspect here.
3
Dirichlet and Polya tree process
3.1 Dirichlet and Polya tree process
In this chapter we develop and study a very useful family of prior distributions on
M(R) introduced by Ferguson [61]. Ferguson introduced the Dirichlet processes, un-
covered many of their basic properties, and applied them to a variety of nonparametric
estimation problems, thus providing for the first time a Bayesian interpretation for
some of the commonly used nonparametric procedures. These priors are relatively
easy to elicit. They can be chosen to have large support and thus capture the non-
parametric aspect. In addition they have tractable posterior and nice consistency
properties. These processes are not an answer to all Bayesian nonparametric or semi-
parametric problems but they are important as both a large class of interpretable
priors and a point of departure for more complex prior distributions.
The Dirichlet process arises naturally as an infinite-dimensional analogue of the
finite-dimensional Dirichlet prior, which in turn has its roots in the one-dimensional
beta distribution . We will begin with a review of the finite-dimensional case.
3.1.1 Finite Dimensional Dirichlet Distribution
In this section we summarize some basic properties of the Dirichlet distribution,
especially those that arise when the Dirichlet is viewed as a prior on M(X), the set of
probability measures on X. Details are available in many standard texts, for example
Berger [13].
First consider the simple case when X = {1, 2}. Then
M(X) = {p = (p_1, p_2) : p_1 ≥ 0, p_2 ≥ 0, p_1 + p_2 = 1}
Because p_2 = 1 − p_1 and 0 ≤ p_1 ≤ 1, any probability measure on [0,1] defines
a prior distribution on M(X). In particular, say that p has a beta(α_1, α_2) prior if
α_1 > 0, α_2 > 0 and if the prior has the density
Π(p_1) = [Γ(α_1 + α_2)/(Γ(α_1)Γ(α_2))] p_1^{α_1−1} (1 − p_1)^{α_2−1}, 0 ≤ p_1 ≤ 1
It is easy to see that
E(p_1) = α_1/(α_1 + α_2)
V(p_1) = α_1(α_1+1)/[(α_1+α_2)(α_1+α_2+1)] − [α_1/(α_1+α_2)]^2 = α_1 α_2 / [(α_1+α_2)^2 (α_1+α_2+1)]
We adopt the convention of setting the beta prior to be degenerate at p_1 = 0 if
α_1 = 0 and degenerate at p_2 = 0 if α_2 = 0. Note that the convention goes well with
the expression for E(p_1). In fact, the following proposition provides more justification
for this convention.
Proposition 3.1.1. If α_{1n} → 0 and α_{2n} → c, 0 < c < ∞, then beta(α_{1n}, α_{2n})
converges weakly to δ_0.
Proof. If p_n is distributed as beta(α_{1n}, α_{2n}), then E p_n → 0, V(p_n) → 0 and hence
p_n → 0 in probability.
The following representation of the beta is useful and well known. Let Z_1, Z_2 be
independent gamma random variables with parameters α_1, α_2 > 0, i.e., with density
f(z_i) = [1/Γ(α_i)] e^{−z_i} z_i^{α_i−1}, z_i > 0
Then Z_1/(Z_1 + Z_2) is independent of Z_1 + Z_2 and is distributed as beta(α_1, α_2).
If we define a gamma distribution with α = 0 to be the measure degenerate at 0,
then the representation of beta random variables remains valid for all α_1 ≥ 0, α_2 ≥ 0,
as long as one of them is strictly positive.
Suppose X_1, X_2, ..., X_n are X-valued i.i.d. random variables distributed as p; then
beta priors are conjugate in the sense that if p has a beta(α_1, α_2) prior distribution
then the posterior distribution is also a beta, with parameters α_1 + Σ δ_{X_i}(1) and
α_2 + Σ δ_{X_i}(2), where δ_x stands for the degenerate measure with δ_x{x} = 1. Moreover, the
marginal distribution of X_1, X_2, ..., X_n is exchangeable with marginal probability
λ(X_1 = i) = α_i/(α_1 + α_2).
Next we move on to the case where X = {1, 2, ..., k}. The set M(X) of probability
measures on X is now in 1-1 correspondence with the simplex
S_k = {p = (p_1, p_2, ..., p_{k−1}) : p_i ≥ 0 for i = 1, 2, ..., k−1, Σ p_i ≤ 1}
and as before we set p_k = 1 − Σ_1^{k−1} p_i. A prior is specified by specifying a probability
distribution for (p_1, p_2, ..., p_{k−1}). This distribution determines the joint distribution of
the 2^k variables {P(A) : A ⊂ X} through P(A) = Σ_{i∈A} p_i. The k-dimensional Dirichlet
distribution is a natural extension of the beta distribution.
Definition 3.1.1. Let α = (α_1, α_2, ..., α_k) with α_i > 0 for i = 1, 2, ..., k. p =
(p_1, p_2, ..., p_k) is said to have the Dirichlet distribution with parameter (α_1, α_2, ..., α_k)
if the density of (p_1, p_2, ..., p_{k−1}) is
Π(p_1, p_2, ..., p_{k−1}) = [Γ(Σ_1^k α_i) / (Γ(α_1)Γ(α_2) ⋯ Γ(α_k))] p_1^{α_1−1} p_2^{α_2−1} ⋯ p_{k−1}^{α_{k−1}−1} (1 − Σ_1^{k−1} p_i)^{α_k−1}
for (p_1, p_2, ..., p_{k−1}) in S_k.    (3.1)
Convention. If any α_i = 0, we still define a Dirichlet by setting the corresponding
p_i = 0 and interpreting the density (3.1) as a density on a lower-dimensional set.
The Dirichlet distribution with the vector (α_1, α_2, ..., α_k) as parameter will be
denoted by D(α_1, α_2, ..., α_k). So we have a Dirichlet distribution defined for all
(α_1, α_2, ..., α_k) as long as Σ_i α_i > 0. Following are some properties of the Dirichlet
distribution.
Properties.
1. Like the beta distribution, Dirichlet distributions admit a useful representation
in terms of gamma variables. If Z_1, Z_2, ..., Z_k are independent gamma random
variables with parameters α_i ≥ 0, then
(a) (Z_1/Σ_1^k Z_i, Z_2/Σ_1^k Z_i, ..., Z_k/Σ_1^k Z_i)    (3.2)
is distributed as D(α_1, α_2, ..., α_k);
(b) (Z_1/Σ_1^k Z_i, Z_2/Σ_1^k Z_i, ..., Z_k/Σ_1^k Z_i) is independent of Σ_1^k Z_i;    (3.3)
and
(c) if p = (p_1, p_2, ..., p_k) is distributed as D(α_1, α_2, ..., α_k), then for any
partition A_1, A_2, ..., A_m of X, the vector
(P(A_1), P(A_2), ..., P(A_m)) = (Σ_{i∈A_1} p_i, Σ_{i∈A_2} p_i, ..., Σ_{i∈A_m} p_i)
is distributed as D(α'_1, α'_2, ..., α'_m), where α'_j = Σ_{i∈A_j} α_i. In particular, the marginal distribution of p_i is beta with
parameters (α_i, Σ_{j≠i} α_j).
This property suggests that it is convenient to view the parameter
(α_1, α_2, ..., α_k) as a measure α(A) = Σ_{i∈A} α_i. Thus every nonzero measure α on
X defines a Dirichlet distribution, and the last property takes the form
(P(A_1), P(A_2), ..., P(A_m)) is D(α(A_1), α(A_2), ..., α(A_m))
A small simulation sketch of this representation and the aggregation property is given below.
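The following Python sketch is our own illustration (with arbitrary parameter values) of the gamma representation (3.2) and the aggregation property in 1(c): normalized independent gammas are Dirichlet, and summing coordinates over a partition gives a Dirichlet with the summed parameters.

```python
# Gamma representation and aggregation property of the Dirichlet (Monte Carlo check).
import numpy as np

rng = np.random.default_rng(5)
alpha = np.array([1.0, 2.0, 0.5, 3.0])

z = rng.gamma(shape=alpha, size=(10000, 4))    # rows of independent Gamma(alpha_i) variables
p = z / z.sum(axis=1, keepdims=True)           # each row is a Dirichlet(alpha) draw

# aggregate over the partition A1 = {1, 2}, A2 = {3, 4}
q = np.column_stack([p[:, :2].sum(axis=1), p[:, 2:].sum(axis=1)])
print(q.mean(axis=0))      # approx (alpha(A1), alpha(A2)) / alpha(X) = (3/6.5, 3.5/6.5)
```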
2. (Tail free property) Let M_1, M_2, ..., M_k be a partition of X. For i = 1, 2, ..., k
with α(M_i) > 0, let P(·|M_i) be the conditional probability given M_i defined by
P(j|M_i) = P(j)/P(M_i) for j ∈ M_i
If α(M_i) = 0 then take P(·|M_i) to be an arbitrary fixed probability, the same for all P.
If P, the random probability on X, is D(α), then
(i) (P(M_1), P(M_2), ..., P(M_k)), P(·|M_1), P(·|M_2), ..., P(·|M_k) are indepen-
dent;
(ii) if α(M_i) > 0 then P(·|M_i) is D(α_{M_i}), where α_{M_i} is the restriction of α to
M_i; and
(iii) (P(M_1), P(M_2), ..., P(M_k)) is Dirichlet with parameter
(α(M_1), α(M_2), ..., α(M_k)).
To see this, let X = {1, 2, ..., n} and let {Y_i : 1 ≤ i ≤ n} be independent
gamma random variables with parameters α({i}). The gamma representation of
the Dirichlet immediately shows that
P(·|M_1), P(·|M_2), ..., P(·|M_k)    (3.4)
are independent. Further, if Z_j = Σ_{i∈M_j} Y_i, then
Z_1, Z_2, ..., Z_k
are independent, and using (3.4) it is easy to see that (Z_1, Z_2, ..., Z_k), and hence
Σ_j Z_j, is independent of
P(·|M_1), P(·|M_2), ..., P(·|M_k)
Because P(M_j) = Z_j/Σ_j Z_j, the result follows.
3. (Neutral to the right property) Let B_1 ⊃ B_2 ⊃ ... ⊃ B_k. Then we have the
independence relations
P(B_1) ⊥ P(B_2|B_1) ⊥ ... ⊥ P(B_k|B_{k−1})
This follows from the tail free property by successively considering the partitions
{B_1, B_1^c}; {B_1^c, B_2, B_1 ∩ B_2^c}; ...
4. Let α_1, α_2 be two measures on X and P_1, P_2 be two independent k-dimensional
Dirichlet random vectors with parameters α_1, α_2. If Y, independent of (P_1, P_2), is
distributed as beta(α_1(X), α_2(X)), then Y P_1 + (1 − Y) P_2 is D(α_1 + α_2).
To see this, let Z_1, Z_2, ..., Z_k be independent random variables with Z_i ~
gamma(α_1{i}). Similarly, for i = 1, 2, ..., k let Z_{k+i} ~ gamma(α_2{i}) be inde-
pendent gamma random variables. Then
(Σ_1^k Z_i / Σ_1^{2k} Z_i)(Z_1/Σ_1^k Z_i, ..., Z_k/Σ_1^k Z_i) + (Σ_{k+1}^{2k} Z_i / Σ_1^{2k} Z_i)(Z_{k+1}/Σ_{k+1}^{2k} Z_i, ..., Z_{2k}/Σ_{k+1}^{2k} Z_i)
has the same distribution as Y P_1 + (1 − Y) P_2. But then the last expression is
equal to
((Z_1 + Z_{k+1})/Σ_1^{2k} Z_i, ..., (Z_k + Z_{2k})/Σ_1^{2k} Z_i)
which is distributed as D(α_1 + α_2). Note that the assertion remains valid even
if some of the α_1{i}, α_2{j} are zero. An interesting consequence is: if P is D(α)
and Y is independent of P and distributed as beta(c, α(X)), then
Y(1, 0, ..., 0) + (1 − Y) P ~ D(α{1} + c, α{2}, ..., α{k})
This follows if we think of δ_{(1,0,...,0)} as Dirichlet with parameter (c, 0, ..., 0). A
corresponding statement holds if (1, 0, ..., 0) is replaced by any vector with a 1
at one coordinate and 0 at the other coordinates.
5. For each p in M(X), let X_1, X_2, ..., X_n be i.i.d. P and let P itself be D(α).
Then the posterior density is proportional to
∏_1^k p_i^{α_i−1+n_i}
where n_i = #{j : X_j = i}. Hence the posterior distribution of P given
X_1, X_2, ..., X_n can be conveniently written as D(α + Σ δ_{X_i}).
6. The marginal distribution of each X_i is ᾱ, where ᾱ(i) = α(i)/α(X), and also
E(P) = ᾱ. To see this, note that for each A ⊂ X, P(A) is beta(α(A), α(A^c))
and hence E(P(A)) = α(A)/(α(A) + α(A^c)).
7.
D(α)(PC)=
k
1
α(i)
α(X)D(α+δi)(C)
3.1 DIRICHLET DISTRIBUTION 93
This follows from D(α)(PC)=E(E(PC|X1)); E(PC|X1)isbyprop-
erty 5, D(α+δX1)(C), and the marginal of X1is ¯α.
8. Let Pbe distributed as D(α)andXindependent of Pbe distributed as ¯α.
Let Ybe independent of Xand Pbe a beta(1(X)) random variable. Then
X+(1Y)Pis again a D(α) random probability.
This follows from properties 4 and 7 by conditioning on x=i, interpreting δi
as a D(δi) distribution, and then using properties 4 and 7.
9. The predictive distribution of Xn+1 given X1,X
2,...,X
nis
α+n
1δXi
α(X)+n
10. α1=α2implies D(α1)=D(α2), except when α1
2are degenerate and put all
their masses at the same point.
This can be verified by choosing an isuch that α1(i)=α2(i). Then P(i)hasa
nondegenerate beta distribution under at least one of α1
2. Next use the fact
that a beta distribution is determined by its first two moments.
11. It is often convenient to write a finite measure αon Xas α=c¯α,where ¯αis
a probability measure. Let αn=cn¯αnbe a sequence of measures on X. Then
D(cn¯αn) is a sequence of probability measures on the compact set Skand hence
has limit points. The following convergence results are useful.
(a) If ¯αn¯αand cnc, 0<c<, then D(cn¯αn)D(c¯α)weakly.
If ¯α{i}>0 for all i, then the density of D(cn¯αn) converges to that of
D(c¯α). If ¯α{i}=0forsomeoftheis, then the result can be verified by
showing that the moments of D(cn¯αn) converge to the moments of D(c¯α).
(b) Suppose that ¯αn¯αand cn0. Then D(cn¯αn) converges weakly to the
discrete measure µwhich gives mass ¯αito the probability degenerate at i.
To see this note that ED(cn¯αn)piαn{i}→¯α{i}, and it follows from
simple calculations that ED(cn¯αn)p2
ialso converges to ¯α{i}.Thuseachpiis
0 or 1 almost surely with respect to any limit point of D(cn¯αn). In other
words, any limit point of D(cn¯αn) is a measure concentrated on the set of
degenerate probabilities on X. It is easy to see that any two limit points
have the same expected value and this together with the fact that they are
both concentrated on degenerate measures shows that D(cn¯αn)converges.
94 3. DIRICHLET AND POLYA TREE PROCESS
(c) ¯αn¯αand cn→∞. In this case also, ED(cn¯αn)piconverges to ¯α{i}.
However Var
D(cn¯αn)pi0, and hence D(cn¯αn) converges to the measure
degenerate at ¯α.
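The following Python sketch is a quick empirical look of ours at property 11 (with ᾱ fixed and illustrative values of c): for small c the Dirichlet spreads toward the vertices of the simplex, and for large c it concentrates at ᾱ.

```python
# Behaviour of D(c * abar) as c varies: mean stays near abar, the spread does not.
import numpy as np

rng = np.random.default_rng(6)
abar = np.array([0.5, 0.3, 0.2])

for c in (0.5, 5.0, 500.0):
    draws = rng.dirichlet(c * abar, size=5000)
    print(c, draws.mean(axis=0), draws.var(axis=0))
    # means remain close to abar; variances are large for small c, tiny for large c
```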
3.1.2 Dirichlet Distribution via Polya Urn Scheme
The following alternative view of the Dirichlet process is both interesting and a pow-
erful tool. For a recent use of this approach, see Mauldin et al.[133].
Consider a Polya urn with α(X) balls of which α(i) are of color i, i = 1, 2, ..., k. [For
the moment assume that the α(i) are whole numbers or 0.] Draw balls at random from
the urn, replacing each ball drawn by two balls of the same color. Let X_i = j if the
ith ball is of color j. Then
P(X_1 = j) = α(j)/α(X)    (3.5)
P(X_2 = j | X_1) = (α(j) + δ_{X_1}(j)) / (α(X) + 1)    (3.6)
and in general
P(X_{n+1} = j | X_1, X_2, ..., X_n) = (α(j) + Σ_1^n δ_{X_i}(j)) / (α(X) + n)    (3.7)
Thus we are reproducing the joint distribution of X_1, X_2, ... that would be ob-
tained from property 9 in the last section. The joint distribution of X_1, X_2, ... is
exchangeable. In fact, if λ_α denotes the joint distribution, then
λ_α(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = [α(x_1)/α(X)] ∏_{i=1}^{n−1} [(α + Σ_1^i δ_{x_j})(x_{i+1}) / (α(X) + i)]
which, setting n_i = #{j : X_j = i}, equals
{α(1)(α(1)+1)...(α(1)+n_1−1)} {α(2)(α(2)+1)...(α(2)+n_2−1)} ... {α(k)(α(k)+1)...(α(k)+n_k−1)}
/ [α(X)(α(X)+1)...(α(X)+n−1)]
= [α(1)]^{[n_1]} ... [α(k)]^{[n_k]} / [α(X)]^{[n]}    (3.9)
where m^{[n]} is the ascending factorial given by m^{[n]} = m(m+1)...(m+n−1).
It is clear that (3.5) defines successive conditional distributions even when α{i} is
not an integer but only ≥ 0. The scheme (3.5) thus leads to a sequence of exchangeable
random variables, and the corresponding mixing measure Π coming out of De Finetti's
theorem is precisely D_α. What we need to show is that if D_α is the prior on M(X) and
if, given P, X_1, X_2, ... are i.i.d. P, then the sequence X_1, X_2, ... has the distribution
given in (3.9). In fact, (3.9) is equal to
∫_{M(X)} [P(1)]^{n_1} ... [P(k)]^{n_k} D_α(dP)
which is equal to
∫_{M(X)} [P(1)]^{n_1} ... [P(k)]^{n_k} Π(dP)
Since the finite-dimensional Dirichlet is determined by its moments, this shows Π =
D_α. A small numerical check of (3.9) is given below.
The posterior given X_1, X_2, ..., X_n can also be recovered from this approach. For
a given X_1, (3.5) defines a scheme of conditional distributions with α replaced by
α + δ_{X_1}. Once again De Finetti's theorem leads to the prior D(α + δ_{X_1}); this is also
the posterior given X_1.
We end this section with the question of interpretation and elicitation of α. From
property 6, ¯α=α(·)(X)=E(P). So ¯αis the prior guess about the expected P.
If we rewrite property 10 in terms of the Bayes estimate E(pi|X1,X
2,...,X
n)ofpi
given X1,X
2,...,X
n
E(pi|X1,X
2,...,X
n)= α(X)
α(X)+n¯α(i)+ n
α(X)+n(ni
n)
which shows the Bayes estimate can be viewed as a convex combination of the “prior
guess” and the empirical proportion. Because the weight of the “prior guess” is de-
termined by α(X), this suggests interpreting α(X) as a measure of strength of the
prior belief. This ease in interpretation and elicitation is a consequence of the fact
that Dirichlet is a conjugate prior for i.i.d. sampling from X. We will show that all
these properties hold when X=R. The fact that variability of Pis determined by a
single parameter α(X) can be a problem when k>2.
96 3. DIRICHLET AND POLYA TREE PROCESS
3.2 Dirichlet Process on M(R)
3.2.1 Construction and Properties
Dirichlet process priors are a natural generalization to M(R) of the finite-dimensional
distributions considered in the last section. Let (R,B) be the real line with the Borel
σ-algebra Band let M(R) be the set of probability measures on R, equipped with
the σ-algebra BM.
The next theorem asserts the existence of a Dirichlet process and also serves as a
definition of the process Dα.
Theorem 3.2.1. Let αbe a finite measure on (R,B). Then there exists a unique
probability measure Dαon M(R)called the Dirichlet process with parameter αsat-
isfying
For every partition B1,B
2,...,B
kof Rby Borel sets
(P(B1),P(B2)...,P(Bk)) is D(α(B1)(B2)...,α(Bk))
Proof. The consistency requirement in Theorem 2.3.4 follows from property 2 in the
last section. Continuity requirement 3 follows from the fact that if BnBthen
α(Bn)α(B) and from property 11 of the last section.
Note that finite additivity of αis enough to ensure the consistency requirements.
The countable additivity is required for the continuity condition.
Assured of the existence of the Dirichlet process, we next turn to its properties.
These properties motivate other constructions of Dαvia De Finetti’s theorem and an
elegant construction due to Sethuraman. These constructions are not natural unless
one knows what to expect from a Dirichlet process prior.
If PD(α), then it follows easily that E(P(A)) = ¯α(A)=α(A)(R). Thus one
might write E(P)=¯αas the prior expectation of P.
Theorem 3.2.2. For each Pin M(R), let X1,X
2,...,X
nbe i.i.d. Pand let P
itself be distributed as Dα, where αis finite measure. (A version of) the posterior
distribution of Pgiven X1,X
2,...,X
nis Dα+n
1δXi.
Proof. We prove the assertion when n= 1; the general case follows by repeated
application. A similar proof appears in Schervish[144].
To sho w that Dα+δXis a version of the posterior given X, we need to verify that
for each B∈Band Ca measurable subset of M(R),
B
Dα+δx(Cα(dx)=C
P(B)Dα(dP )
3.2. DIRICHLET PROCESS ON M(R)97
As Cvaries each side of this expression defines a measure on M(R), and we shall argue
that these two measures are the same. It is enough to verify the equality on σ-algebras
generated by functions P→ (P(B1),P(B2)...,P(Bk)), where B1,B
2,...,B
kis a
measurable partition of R. We do this by showing that the moments of the vector
(P(B1),P(B2)...,P(Bk)) are same under both measures.
First suppose that α(Bi)>0fori=1,2,...,k. For any nonnegative r1,r
2,...,r
n,
look at
Bk
1
[P(Bi)]riDα+δx(dP )¯α(dx) (3.10)
If we denote by Dα+δiand Dαthe k-variate Dirichlet distributions with parameters
(α(B1),...,α(Bi)+1,...,α(Bk)) and (α(B1),...,α(Bi),...,α(Bk)),then (3.10) is
equal to
k
1
α(BBi)
α(B)yr1
1...y
ri
i...y
rn
kDα+δi(dy1...dy
k1).
whichinturnisequalto
=
k
1
α(BBi)
α(B)yr1
1...y
ri+1
i...y
rn
kDα(dy1...dy
k1).
On the other hand because P(B)=P(BBi),
k
1
[P(Bi)]riP(B)Dα(dP )
=
k
1k
1
[P(Bi)]riP(BBi)Dα(dP )
=
k
1P(B1)r1...P(Bi)ri+1 ...P(Bk)rk...P(BBi)
P(Bi)Dα(dP )
Since P(BBi)
P(Bi)is a Beta random variable and independent of (P(B1),P(B2)...,P(Bk)) ,
the preceding equals
k
1
α(Bi)B
α(B)P(B1)r1...P(Bi)ri+1 ...P(Bk)rk...D
α(dP )
98 3. DIRICHLET AND POLYA TREE PROCESS
which is equal to the expression obtained earlier. To take care of the case when some
of the α(Bi) may be 0, consider the simple case when, say α(B1)=0,r
1>0andthe
rest of the α(Bi) are positive. In this case
Bk
1
[P(Bi)]riDα+δx(dP )¯α(dx)=0
Because in k
1(α(BBi)(B)) yr1
1...y
ri
i...y
rn
kDα+δi(dy1...dy
k1), α(BB1)=
0 and for i=1,y1=0a.e.Dα+δi,
yr1
1...y
ri
i...y
rn
kDα+δi(dy1...dy
k1)=0
A Similar argument applies when α(Bi)is0formorethanonei.
Remark 3.2.1 (Tail Free Property). Fix a partition B1,B
2,...,B
kof X. Consider a
sequence {T}n:n1of nested partitions with T1={B1,B
2,...,B
k}and σ{{T}n:n1}=
B. Then Dαis tail free with respect to this partition. And we leave it to the reader
to verify that with Dirichlet as the prior and with given P,XP,
(P(B1),P(B2)...,P(Bk)) and X
are conditionally independent given {IBi(X); 1 ik}. Consequently, the condi-
tional distribution of the vector (P(B1),P(B2)...,P(Bk)) given Tnisthesamefor
all nand is equal to the marginal distribution of
(P(B1),P(B2)...,P(Bk))
under the measure Dα+δX.
The last remark provides an alternative and more natural approach to demonstrate
that Dα+δXis indeed the posterior given X. For, by the martingale convergence
theorem, the conditional distribution of (P(B1),P(B2)...,P(Bk)) given Tnconverges
to the conditional distribution of (P(B1),P(B2)...,P(Bk)) given X, and this limit
is the marginal distribution of the vector (P(B1),P(B2)...,P(Bk)) arising out of
Dα+δX. This is true for any partition B1,B
2,...,B
kand since a measure on M(R)
is determined by the distribution of finite partitions, we can conclude that Dα+δXis
indeed the posterior.
3.2. DIRICHLET PROCESS ON M(R)99
Remark 3.2.2 (Neutral to the Right property). Another useful independence prop-
erty follows immediately from Property 4 of the last section. If t1<t
2,... < t
k,
then
(1 F(t1)),1F(t2)
1F(t1),..., 1F(tk)
1F(tk1)
are independent.
Many of the properties of the Dirichlet process on M(R) either easily follow from,
or are suggested by the corresponding property for the finite-dimensional Dirichlet
distribution. One major difference is that in the case of M(R) the measure αcan be
continuous. This leads to some interesting consequences, some of which are explored
next.
Denote by λαthe joint distribution of P, X1,X
2,.... Suppose PD(α) and given
P,X1,X
2,... are i.i.d. P. From Theorem 3.2.2 it immediately follows that the
predictive distribution of Xn+1 given X1,X
2,...,X
nis
α+n
1δXi
α(R)+n
and hence that
X1is distributed as ¯α
Conditional distribution of X2given X1is α+δX1
α(R)+1
Conditional distribution of X3given X1,X
2is α+δX1+δX2
α(R)+2
Conditional distribution of Xn+1 given X1,X
2,...,X
nisα+n
1δXi
α(R)+n,etc.
Suppose that αis a discrete measure and let X0be the countable subset of Rsuch
that α(X0)=α(R)andα{x}>0 for all x∈X
0.Dαcan then be viewed as a prior
on M(X0). Further the joint distribution of X1,X
2,...,X
ncan be written explicitly.
For ea ch (x1,x
2,...,x
n) and for each x∈X
0,letn(x) be the number of issuch
that xi=x.Notethatn(x) is nonzero for at most nmany xs. If αndenotes the joint
distribution of X1,X
2,...,X
n, then
αn(x1,x
2,...,x
n)=
x∈X0
α(x)[n(x)] (3.11)
where a[b]=a(a+1)...(a+b1).
The case when αis continuous is a bit more involved. Even if αhas density with
respect to Lebesgue measure, for n2, because P{X1=X2} =0,α2is no longer
100 3. DIRICHLET AND POLYA TREE PROCESS
absolutely continuous with respect to the two-dimensional Lebesgue measure. To see
this formally, note that
α2{X1=X2}=(α+δx1)
α(R)+1{x1}d¯α(x1)= 1
α(R)+1
On the other hand the Lebesgue measure of {(x, x):xR}is 0.
While αnis not dominated by the n-dimensional Lebesgue measure, it is dominated
by a measure λ
ncomposed of Lebesgue measure in lower-dimensional spaces, and with
respect to this measure, it is possible to obtain a fairly explicit form of the density
of αn. We will look at the case n= 3 in some detail and then extend these ideas to
general n.
We will begin by calculating αn(A×B×C) when αis a continuous measure. Let
R1,2,3={(x1,x
2,x
3):x1,x
2,x
3are all distinct }
Then
α3((A×B×C)R1,2,3)
=α3{X1A, X2B−{X1},X
3C−{X1,X
2}}
=α(A)
α(R)
α(B)
(α(R)+1)
α(C)
(α(R)+2)
where the last equality follows from the fact that for each x1,bycontinuityofα,
α(B−{x1})=α(B)andδx1(B−{x1}) = 0. Consequently
Pr{X2B−{x1}=[α+δx1]
α(R)+1
(B−{x1})= α(B)
α(R)+1
Similarly for Pr{X3C−{x1,x
2}.
Next, let
R12,3={(x, x, x3):x=x3}
Then
α3((A×B×C)R12,3)
=α3{X1A, X2={X1},X
3C−{X1}}
3.2. DIRICHLET PROCESS ON M(R) 101
Because Pr{X2=x|X1=x}=[α+δx]/(α(R)+1)({x})=1/(α(R) + 1), again by
continuity of α, we have the preceding is equal to
α(AB)
α(R)
1
(α(R)+1)
α(C)
(α(R)+2)
Similarly, if
R13,2={(x, x2,x):x=x2}
then by exchangeability
αn(A×B×CR13,2)=αn(A×C×BR12,3)
=α(AC)
α(R)
1
(α(R)+1)
α(B)
(α(R)+2)
A similar expression holds for R1,23.
Let R123 ={(x, x, x)}. Then A×B×CR123 ={(x, x, x)xABC}.We
then have
αn(A×B×CR123)=2α(ABC)
α(R)
1
(α(R)+1)
α(B)
(α(R)+2)
where the factor 2 in the numerator arises from P(X3=x|X1=X2=x)=(δx+
δx)α(B)/(α(R)(α(R)+1)(α(R) + 2))(x).
Suppose that αhas a density ˜αwith respect to Lebesgue measure. Define a measure
λ
3as follows:
λ
3restricted to R1,2,3is the three-dimensional Lebesgue measure
λ
3restricted to R12,3is the two-dimensional Lebesgue measure obtained from R2via
the map (x, y)→ (x, x, y).
Define the restriction on R1,23 and R13,2similarly.
λ
3restricted to R12,3is the one-dimensional Lebesgue measure obtained from x→
(x, x, x).
Note that the function on R1,2,3defined by
˜α3(x1,x
2,x
3)= ˜α(x1α(x2α(x3)
α(R)(α(R)+1)(α(R)+2)
when viewed as a density with respect to λ
3restricted to R1,2,3gives, for any (A×
B×C), αn(A×B×CR1,2,3). Similarly the function on R12,3defined by
˜α3(x1,x
1,x
3)= ˜α(x1α(x3)
α(R)(α(R)+1)(α(R)+2)
102 3. DIRICHLET AND POLYA TREE PROCESS
corresponds to the density of α3with respect to λ
3restricted to R12,3and
˜α3(x1,x
1,x
1)= α(x1)
α(R)(α(R)+1)(α(R)+2)
corresponds to the density of α3with respect to λ
3restricted to R123.
The general case is similar but notationally cumbersome. For a partition {C1,...,C
k}
of {1,2,...,n},let
RC1,C2,...,Ck={(x1,x
2,...,x
n):xi=xjiff i, j Cmfor some m, 1mk}
The measure λ
nis defined by setting its restriction on RC1,C2,...,Ckto be the k-
dimensional Lebesgue measure. As before if we set I1=1and
Ij=1 if,x
j∈{x1,x
2,...,x
n}
0 otherwise.
the density of αnwith respect to λ
non RC1,C2,...,Ckis given by
˜αn(x1,x
2,...,x
n)=j˜α(xj)Ij(ej1)!
(α(R))[n](3.12)
where ej=#cj.
The verification follows essentially the same ideas, for example
αn(A1×A2×...×AnRC1,C2,...,Ck)= α(B1)α(B2)...α(Bk)
(α(R))[n]
where Bj=iCjAi.
Theorem 3.2.3. Dα{P:Pis discrete }=1.
Proof. Let ˜
E={(P, x):P{x}>0}. Note that Pis a discrete probability measure if
{x:(P,x)˜
E}P(x) = 1. We saw in the last chapter that ˜
Eis a measurable set. Let
˜
Ex={P:P{x}>0}˜
EP={x:P{x}>0}
Then
λα(˜
E)=Eλαλα(˜
E|X1
=Eλαλα(˜
EX1|X1=EλαDα+δX1(˜
EX1
=1
3.2. DIRICHLET PROCESS ON M(R) 103
Because P{x1}is beta with positive parameter α{x1}+1, P{x1}>0 with probability
1. Now
λα(˜
E)=Eλαλα(˜
E|P=EλαP(˜
EP)=1
so P(˜
EP) = 1 almost everywhere Dα.
The preceding proof is based on a presentation in Basu and Tiwari[10] . A variety of
proof for this interesting fact is available. See Blackwell & Mcqueen [25], and Blackwell
[23], Berk and Savage [17]. Another nice proof is due to Hjort [99]
3.2.2 The Sethuraman Construction
Sethuraman [148] introduced and elaborated on a useful and clever construction of
Dα, which provides insight into these processes and helps in simulation of the process.
As before let αbe a finite measure and ¯α=α/α(R).Let Ω be a probability space
with a probability µsuch that
θ1
2,... defined on Ω are i.i.d. beta(1(R))
Y1,Y
2,... are also defined on Ω such that they are i.i.d. ¯αand independent of the θis
Set p1=θ1and for n2, let pn=θnn1
1(1 θi). Easy computation shows that
1
pn= 1 almost surely. Now define an M(R) valued random variable on Ω by
P(ω, A)=
1
pn(ω)δYn(ω)(A) (3.13)
Because
1
pn= 1, the function ω→ P(ω, ·) takes values in M(X). It is not
hard to see that this map is also measurable. This random measure is a discrete
measure that puts weight pion Yi. Sethuraman showed that this random measure is
distributed as Dα. Formally, if Π is the distribution of ω→ P(ω,·) then Π = Dα.We
will establish this by showing that for every partition B1,B
2,...,B
kof Rby Borel
sets (P(ω, B1),P(ω,B2),...,P(ω, Bk)) is distributed as D(α(B1),...,α(Bk)).
Denote by δk
Yithe element of Skgiven by (IB1(Yi),I
B2(Yi),...,I
Bk(Yi)). Then for
each ω,(P(ω, B1),P(ω, B2),...,P(ω, Bk)) can be written as
1pi(ω)δk
Yi(ω).
Let Pbe an Skvalued random variable, independent of the Ysandθs, and dis-
tributed as D(α(B1),...,α(Bk)).
104 3. DIRICHLET AND POLYA TREE PROCESS
Consider the Skvalued random variable
P1=p1δk
Y1+(1p1)P
where Y1Bi,δk
Yiis the vector with a 1 in the ith coordinate and 0 elsewhere. Hence
by property 4 from Section 3.1, given Y1Bi,P1is distributed as a Dirichlet with
parameter (α(B1),...,α(Bi)+1,...,α(Bk)). Since µ(Y1Bi)=α(Bi), by property
8inSection3.1,P1is distributed as D(α(B1),...,α(Bk)).
It follows by easy induction that for all n,1n
1pi=n
1(1 θi). Using this fact,
a bit of algebra gives
n
1
piδk
Yi+(1
n
1
pi)P
=
n1
1
piδk
Yi+(1
n1
1
pi)(θnδk
Yn+(1θn)P)
Because our earlier argument showed that θnδk
Yn+(1θn)Phas the same distribution
as P, a simple induction argument shows that, for all n,
n
1
piδk
Yi+(1
n
1
pi)P
is distributed as D(α(B1),...,α(Bk)).Letting n→∞and observing that (1n
1pi)
goes to 0, we get the result.
Note that we have not assumed the existence of a Dαprior. Because P(ω, ·)is
M(X) valued, the argument also shows the existence of the Dirichlet prior.
3.2.3 Support of Dα
We begin by recalling that M(R) under the weak topology is a complete separable
metric space, and hence for any probability measure Π on M(R) the support—the
smallest closed set of measure 1— exists. Note that support is not meaningful if we
consider the total variation metric or setwise convergence.
Theorem 3.2.4. Let αbe a finite measure on Rand let Ebe the support of α.Then
Mα={P:support of PE}
is the weak support of Dα
3.2. DIRICHLET PROCESS ON M(R) 105
Proof. Mαis a closed set by the Portmanteau theorem, since Eis closed and if PnP
then P(E)lim supnPn(E). Further, because P(E)isbeta(α(R),0), Dα(Mα)=1.
Let P0belong to Mαand let Ube a neighborhood of P0. Our theorem will be
proved if we show that Dα(U)>0.
Choose points a0<a
1< ... < a
T1<a
Tand let Wj=(aj,a
j+1]Eand J=
{j:α(Wj)>0}.Then depending on whether α(jJWj)=α(R)orα(jJWj)<
α(R), (P(Wj):jJ)orP(Wj):jJ, 1jJP(Wj)has a finite-dimensional
Dirichlet distribution with all parameters positive. And in either case, for any η>0,
Dα{PM(R):|P(Wj)P0(Wj)|:jJ}>0
By Propositon 2.5.2 for small enough δ,Ucontains a set of the above form. Hence
Dα(U)>0.
3.2.4 Convergence Properties of Dα
Many of the theorems in this section are adapted from Sethuraman and Tiwari [149].
Because under Dα,E(P)=¯α, Theorem 2.5.1 in Chapter 2 immediately yields the
following.
Theorem 3.2.5. Let {αt:tT}be a family of finite measures on R. Then the
family {Dαt:tT}is tight iff {¯αt:tT}is tight.
Theorem 3.2.6. Suppose {αm}are finite measures on Rsuch that ¯αm¯α
weakly.
(i) If αm(R)α(R)where 0(R)<, then DαmDαweakly.
(ii) If αm(R)0. Then Dαmconverges weakly to D, where
D{P:Pis degenerate}=1
(iii) If α(R)→∞then Dαmconverges weakly to δα.
Proof. By Theorem 3.2.5, {Dαm}is tight and hence any subsequence has a further
subsequence that converges to, say, D.
(i) We will argue that the limit Dis Dαand is the same for all subsequences. By
(iii) of Theorem 2.5.2 and (a) of property 11 of the finite-dimensional Dirichlet
(see Section 3.1) it follows that D=Dα.
106 3. DIRICHLET AND POLYA TREE PROCESS
(ii) From property 11 for any ¯αcontinuity set A,D{P:P(A)=0,or 1}=1.
By using a countable collection of ¯αcontinuity sets that generate the Borel
σ-algebra, the result follows.
(iii) (iii) Recall that E(P(A)) = ¯α(A). Because αn(R)→∞,Var(P(A)) 0for
all A. Hence P(A) converges in probability to ¯α(A). This holds for any finite
collection of sets. The result now follows as in the preceding case.
As a consequence of the theorem we have the following results.
Theorem 3.2.7. (i) Let αbe a finite measure. Then for each P0the posterior
Dα+n
1δXiδP0weakly, almost surely P0.
(ii) As α(R)goes to 0, the posterior converges weakly to Dn
1δXi.
Proof. Because a.e. P0,α+n
1δXi=αnsatisfies ¯αnP0and αn→∞, (iii) of
Theorem 3.2.6 yields the result.
Remark 3.2.3.Note that posterior consistency holds for all P0, not necessarily in
theweaksupportofDα. This is possible because the version of the posterior chosen
behaves very nicely. This version is not unique even for P0in the weak support of Dα.
One sufficient condition for uniqueness up to P0null sets is that P0be dominated by
α.
Remark 3.2.4.Assertion (ii) has been taken as a justification of the use of Dn
1δXi
as a noninformative (completely nonsubjective in the terminology of Chapter 1) pos-
terior. Note that Theorem 3.2.6 shows that the corresponding prior is far from a
noninformative prior.
The posterior Dn
1δXihas been considered as a sort of Bayesian bootstrap by Rubin
[142]. For an interesting discussion of the Bayesian bootstrap and Efron’s bootstrap,
see Schervish [144].
We would like to remark that all the theorems in this section go through if Ris
replaced by any complete separable metric space. The existence aspect of the Dirichlet
process can be handled via the famous Borel isomorphism theorem, which says that
there is a 1-1, bimeasurable function form Ronto X. The proofs of other results
require only trivial modifications.
3.2. DIRICHLET PROCESS ON M(R) 107
3.2.5 Elicitation and Some Applications
We have seen that with a Dαprior the posterior given X1,X
2,...,X
nis Dα+δXi.
As α(R)goesto0,(α+δXi)/(α(R)+n) converges to δXi/n, the empirical
distribution, further α(R)+nconverges to n. Hence as observed in the last section
Dα+δXiconverges weakly to DδXi. In particular if the X1,X
2,...,X
nare distinct
then DδXiis just the uniform distribution on the n-dimensional probability simplex
S
n. This phenomenon suggests an interpretation of α(R) goes to 0, as leading to
a “noninformative”prior. In this section we investigate a few examples, all taken
from Ferguson [61], where as α(R) goes to 0, the Bayes procedure converges to the
corresponding frequentist nonparametric method.
While these examples corroborate the feeling that α(R) goes to 0 leads to a non-
informative prior, (ii) of Theorem 3.2.6 points out the need to be careful with such
an interpretation. As α(R) goes to 0 the posterior leads to an intuitive noninforma-
tive limit. However the corresponding prior cannot be considered noninformative. We
believe these applications are justified in the completely non-parametric context of
making inference about Pbecause the Dirichlet is conjugate in that setting. Similar
assessments of conjugate prior in finite-dimensional problems is well known.
However, the Dirichlet is often used in problems where it is not a conjugate prior.
In such problems the interpretation of α(R) as a sort of sample size or a measure of
prior variability is of doubtful validity. See Newton et al. [136] in this connection.
Estimation of F . Suppose that we want to estimate the unknown distribution
function under the loss L(F, G)=(F(t)G(t))2dt. If Π is a prior on M(R),
equivalently on the space of distribution functions Fon R,itiswellknownthat
the no-sample Bayes estimate is given by ˆ
FΠ(t)=F(t)dΠ(F). If Π is Dαthen
because the posterior is Dα+δXi, the Bayes estimate of Fgiven X1,X
2,...,X
nis
(α+δXi)(−∞,t]/(α(R)+n).Setting Fnas the empirical distribution, we rewrite
this as
α(R)
α(R)+n¯α(−∞,t]+ n
α(R)+nFn
which is a convex combination of the prior guess and a frequentist nonparametric
estimate.
This property makes it clear how αis to be chosen. If the prior guess of the distri-
bution of Xis, say, N(0,1) then that is ¯α. The value of α(R) determines how certain
one feels about the prior guess. This interpretation of α(R) as a measure of one’s faith
in a prior guess is endorsed by the fact that if α(R)→∞then the prior goes to δ¯α.
108 3. DIRICHLET AND POLYA TREE PROCESS
If α(R)0 the Bayes estimate of Pconverges to the empirical distribution and
the posterior converges weakly to DnFn. Since the prior has no role any more, DnFn
is called a noninformative posterior and Fnthe corresponding noninformative Bayes
estimate. These intuitive ideas are helpful in calibrating α(R) as a cost of sample size
and α(R) = 1 is sometimes taken as a prior with low information.
Estimation of mean of F. The problem here is to estimate the mean µFof the
unknown distribution function F, the loss function being the usual squared error
loss, i.e., L(F, a)=(µFa)2. If Π is a prior on Fsuch that ˆ
FΠhas finite mean, then
the Bayes estimate ˆµis µFdΠ(F) and with probability 1 this is the same as the
mean of ˆ
FΠ. This follows because
xdF Π(dF )
= lim xI[0,n]dF Π(dF )
=xd ˆ
FΠ(x)=xd ˆ
FΠ(x)<
Thus if αhas finite mean then
Dα{F:Fhas finite mean}=1
and given X1,X
2,...,X
n, the Bayes estimate of µFis the mean of α+δXi. This
is easily seen to be a convex combination of the mean of ¯αand ¯
Xand goes to ¯
Xas
α(R)0.
Estimation of median of F. We next turn to the estimation of the median of the
unknown distribution F. For any F∈F,tis a median if
F(t)1
2F(t)
If αhas support [K1,K
2],−∞ ≤ K1<K
2≤∞then with Dαprobability 1, F
has unique median. If t1<t
2are both medians of F, then for any rational a, b;t1<
a<b<t
2we have F(a)=F(b).On the other hand Dα{F:F(a)=F(b)}=0.By
considering all rationals a, b in the interval (K1,K
2) we have the result.
In the context of estimating the median the absolute deviation loss is more natural
and convenient than the squared error loss. Formally, L(F, m)=|mFm|. If Π is a
prior on Fthen the “no-sample” Bayes estimate is just the median of the distribution
of mF.
3.2. DIRICHLET PROCESS ON M(R) 109
If the prior is Dαthen any median of mFis also a median of ¯α. This may be seen
as follows: tis a median of mFiff
Dα{mF<t}≤1
2Dα{mFt}
Now mFtiff F(t)1/2. Because F(t)isbeta(α(−∞,t](t, ), Dα{F(t)
1/2}≥1/2iα(t, )(R)1/2 (see exercise 11.0.2 ). On the other hand mF<t
iff F(t)>1/2 . This yields α(−∞,t)(R)1/2andsuchatis a median of ¯α.
Consequently, the Bayes estimate of the median given X1,X
2,...,X
nis a median of
(α+δXi)/(α(R)+n)).If ¯αis continuous then the median of (α+δXi)/(α(R)+n))
is unique. As α(R) goes to 0 the limit points of the Bayes estimates of mFare medians
of the empirical distribution.
Testing for median of F. Consider the problem of testing the hypotheses that the
median of Fis less than or equal to 0 against the alternative that the median is
greater than 0. If we view this as a decision problem with 0-1 loss, for a Dαprior on
Fthe Bayes rule is
decide median is 0ifDα{F(0) >1
2}>1
2
Because Dα{F(0) >1/2}=1/2 iff the two parameters are equal this reduces to
“accept the hypotheses that the median is 0 iff
α(−∞,0]
α(R)>1
2

Given X1,X
2,...,X
nthis condition becomes “accept the hypotheses that the me-
dian is 0 iff
Wn>1
2n+α(R)1
2¯α(−∞,0)
where Wnis the number Xi0.
Estimation of P(XY). Suppose that X1,...,X
nare i.i.d. Fand Y1,...,Y
m
are independent of the Xis and are i.i.d G. We want to estimate P(X1Y1)=
F(t)dG(t) under squared error loss. Suppose that the prior for (F, G)isofthe
form Π1×Π2. The Bayes estimate is then ˆ
FΠ1(t)dˆ
FΠ2(dt), where for i=1,2,
ˆ
FΠi(t) is the distribution function F(t)dΠi(t).
If the prior is Dαthen the Bayes estimate given X1,X
2,...,X
nbecomes
(α1+δXi)
α1(R)+n(−∞,t]dα2+δYi
α2(R)+n(dt)
110 3. DIRICHLET AND POLYA TREE PROCESS
This can be written as
p1,np2,m ¯α1(−∞,t)d¯α2(t)+p1,n(1 p2,m)1
n
m
1
¯α1(−∞,Y
j]
+(1p1,n)p2,m )1
m
n
1
(1 ¯α2(−∞,X
i)) + (1 p1,n)(1 p2,m )1
mnU
where p1,n =α1(R)/(α1(R)+n), p2,m =α2(R)/(α2(R)+m)andU, is the number of
pairs for which XiYj, i.e.,
U=
n
1
m
1
I(,Yj](Xi).
As α1(R)andα2(R) go to 0, the nonparametric estimate converges to (mn)1U,
which is the familiar Mann-Whitney statistic.
3.2.6 Mutual Singularity of Dirichlet Priors
Asbefore,wehaveaDαprior on M(R), given P,X1,X
2,...,X
nis i.i.d. P,andλα
is the joint distribution of Pand X1,X
2,... . The main result in this section is ‘ If
α1and α2are two nonatomic measures on R, then λα1and λα2are mutually singular
and hence so are Dα1and Dα2’. Mutual singularity of all priors in a family being used
is undesirable. It shows that the family is too small to be flexible enough to represent
prior opinion, which is based on information and judgment and is independent of
the data. To clarify, consider a simple example of this sort. Let X1,X
2,...,X
nbe
i.i.d. N(θ, 1) and suppose we are allowed only N(µ, 1) priors and the only values of µ
allowed are finite and widely separated as 0 and 10. Then for a large nif we get ¯
X,
it is clear that with high probability the data can be reconciled with only one prior
in the family. The result proved next is of this kind but stronger. It follows from a
curious result of Korwar and Hollander [116], who show that the prior Dαcan be
estimated consistently from X1,X
2,... . We begin with their result.
Lemma 3.2.1. Define τ1
2,... and Y1,Y
2,... by τ1=1and τn=kif the number
of distinct elements in {X1,X
2,...,X
k}is n and the number of distinct elements in
{X1,X
2,...,X
k1}is n1. In other words, τnis the number of observations needed
to get ndistinct elements.
3.2. DIRICHLET PROCESS ON M(R) 111
Set Yn=Xτnand set
Dn=1if Xn∈{X1,X
2,...,X
n1}
0otherwise
Note that n
1Diis the number of distinct units in the first nobservations. If αis
nonatomic then
(i) for any Borel set U, 1/n n
1δYn(U)¯α(U)a.e. λα;
(ii) 1/log nn
1(DiE(Di)) 0a.e. λα; and
(iii) 1/log nn
1E(Di)α(X).
Proof. Note that τi<a.e.
To prove (i) it is enough to show that Y1,Y
2,... are i.i.d. ¯α.
We start with a finer conditioning than Y1,...,Y
n1. Consider for t1<t
2,... <
tn1,t
n,,
PrYnA|X1,X
2,...X
tn1
n1=tn1
n=tn
=PrYn∈|X1,...X
tn1
n1=tn1
ntn
Prτn=tn|X1,...X
tn1
n1=tn1
ntn(3.14)
After cancelling out α(X)+tn1 from the numerator and denominator this becomes
α+tn1
1δXi(A−{Y1,...,Y
n})
α+tn1
1δXi(X−{Y1,...,Y
n})
and by nonatomicity this reduces to ¯α.ThusY1,Y
2,... are i.i.d and (i) follows.
For the second assertion, it is easy to see that the Dnare independent with λα(Dn=
1) = α(R)/(α(R)+n1).
By Kolomogorov’s SLLN for independent random variables
1
log n
n
1
(DiE(Di)) 0 a.s. λαif
1
V(Di)
(log i)2<
Here V(Di)=α(R)(i1)/((α(R)+i1)2) and the preceding condition holds.
112 3. DIRICHLET AND POLYA TREE PROCESS
Moreover
1
log n
n
1
E(Di)= 1
log n
n
1
α(R)
α(R)+i1α(R)
because n
2
α(R)
α(R)+i1=
n
2
α(R)
i1α(R)
n
2
α(R)
α(R)+i1
1
i1
and as n→∞, the second term on the right converges, so that
n
2
α(R)
α(R)+i1=α(R) [log n+O(1)]
Theorem 3.2.8. If α1and α2are two nonatomic measures on R,α1=α2, then
λα1and λα2are mutually singular and hence so are Dα1and Dα2.
Proof. Let Ube a Borel set such that α1(U)=α2(U), and set
E=ω:1
n
n
1
δYi(U)¯α1(U)and 1
log n
n
1
Diα1(R)
By Lemma 3.2.1, λα1(E)=1andλα2(E)=0.
Further, because ER, we also have
λα1(E)=P(E)Dα1(dP )=1
so that, Dα1{P:P(E)=1}= 1. Similarly Dα2{P:P(E)=1}=0.
Remark 3.2.5.To handle the general case, consider the decomposition of α1
2into
αi=αi1+αi2,whereαi1is the nonatomic part of αiand αi2is the discrete part.
Let M1,M
2be the support of α12 and α22 . Then if α11 =α21 but M1=M2, then
also λα1and λα2are singular.
If α11 =α21 and M1=M2;λα1and λα2may not be orthogonal. Sethuraman gives
necessary and sufficient condition for the orthogonality using Kakutani’s well-known
criteria based on Hellinger distance.
3.2. DIRICHLET PROCESS ON M(R) 113
Remark 3.2.6.The Theorem 3.2.8 shows that Dirichlet process used as priors dis-
play a curious behavior. Suppose αis a continuous measure, then for every sample
sequence X1,X
2,... the continuous part of the successive posterior base measures
changes from α/α(X)+nto α/(α(X)+n+1) and hence the sequence of the posteriors
are mutually singular.
3.2.7 Mixtures of Dirichlet Process
Dirichlet process requires specification of the base measure α, which itself can be
viewed as consisting of the prior expectation ¯αand the strength of the prior belief
α(R). In order to achieve greater flexibility Antoniak [4] proposed mixtures of Dirich-
let process which arise by considering a family αθof base measures indexed by a
hyperparameter θand a prior for the parameter θ.
Because the Dirichlet processes sit on discrete measures, so does any mixture of
these and hence they are unsuitable as priors for densities. For the same reason, it is
also inappropriate for the parametric part of a semiparametric problem. For example,
Diaconis and Freedman [46] show that the Dirichlet prior in a location parameter
problem can lead to pathologies as well as inconsistency of the posterior for even
reasonable “true” densities.
Usually one will not have as a prior a completely specified ¯αbut an αθ—like
N(η, σ2)—with θ=(η, σ2) unknown but having a prior µso that the distribution
of Pgiven θis Dαθ. Suppose that X1,X
2,...,X
nare—given P—i.i.d P. Because
given θ,X1,X
2,...,X
n;Pis distributed as Dαθ+δXi, the distribution of Pgiven
X1,X
2,...,X
nis obtained by integrating Dαθ+δXiwith the conditional distribution
of θgiven X1,X
2,...,X
n.
For simplicity let Θ= Rlet µbe the prior on Θ with density ˜µ;foreachθ,αθ
is a finite measure on Rwith density ˜αθwith respect to Lebesgue measure. Using
equation (3.12) the joint density of θand X1,X
2,...,X
nis
˜µ(θ)j˜αθ(xj)Ij(ej1)!
(αθ(R))[n](3.15)
The conditional density of θgiven X1,X
2,...,X
nis thus
C(x1,x
2,...,x
nµ(θ)j˜αθ(xj)Ij(ej1)!
(αθ(R))[n](3.16)
114 3. DIRICHLET AND POLYA TREE PROCESS
If the “true” distribution P0is continuous then with probability 1, the Ijsareall
equal to 1 and the conditional density becomes
C(x1,x
2,...,x
nµ(θ)n
1˜α(θ)(xj)
(α(θ)(R))[n](3.17)
Newton et al. [137] provides an interesting heuristic approximation to Bayes esti-
mates in this context.
3.3 Polya Tree Process
Polya tree process are a large class of priors that include Dirichlet processes and
provide a flexible framework for Bayesian analysis of nonparametric problems. Like
the Dirichlet, Polya tree priors form a conjugate class with a tractable expression for
the posterior. However they differ from Dirichlet process in two important aspects.
The Polya tree process are determined by a large collection of parameters and thus
provide means to incorporate a wide range of beliefs. Further, by suitably choosing the
parameters, the Polya tree priors can be made to sit on continuous, even absolutely
continuous, distributions.
Polya tree priors were explicitly constructed by Ferguson [62] as a special case of tail
free processes discussed in the Chapter 2. A formal mathematical development using
De Finetti’s theorem is given in Mauldin et al. [133], Lavine [118, 119], indicates
the construction and discusses the choice of various components that go into the
construction of Polya tree priors. Here we briefly explore the properties of Polya tree
priors. The basic references for these are Ferguson [62], Mauldin et al. [133] and
Lavine[118, 119].
3.3.1 The Finite Case
The construction in this section is a special case of the discussion in Section 2.3.1.
To briefly recall, let X={x1,x
2,...,x
2k}.LetB0={x1,x
2,...,x
2k1}and B1=
{x2k1,...,x
2k}be a partition of X. For any jlet Ejstand for all sequences of 0s and
1s of length jand E
j=ijEi.Foreachjk, we consider a partition {B:Ej}
of Xsuch that B0,B1is a partition of B.IfEk, clearly Bis a singleton.
Definition 3.3.1. ApriorΠonM(X)issaidtobeaPolya tree prior with pa-
rameter α={α:E
k}if α0and
(i) P(B0|B):E
k1are all independent and
3.3. POLYA TREE PROCESS 115
(ii) P(B0|B)isabeta(α0
1) random variable
when =take, P(B0|B)tobeP(B0). (i) and (ii) uniquely determine a Π, for
if x=B1,2,...,k, then
P(x)=P(B12...k)=
i:i=0
PB12...i10|B12...i1
i:i=1 PB12...i11|B12,...i1
(3.18)
Because
PB1,2,...,i11|B1,2,...,i1=1PB1,2,...,i10|B1,2,...,i1
P(B0|B):E
k1determines the distribution of P(x).
Suppose Π is a Polya tree prior on M(X) and given P,Xis distributed as P.For
any xlet 1(x)=0ifxB0and 1 otherwise, and let k(x)=0ifxB1(x)...k1(x)0
and 1 otherwise. The joint density of Pand Xis given, up to a constant, by
E
k
[P(B0|B)]α01[1 P(B0|B)]α11
i:i(x)=0
P(B0|B)
i:i(x)=1
(1 P(B0|B))
=
E
k
[P(B0|B)]α
01[1 P(B0|B)]α
11
where
α
=1+αif xB
αotherwise
We summarize this discussion as the following theorem
Theorem 3.3.1. If the prior on M(X)is PT(α)where α={α:E
k}and if
given P,X1,X
2,...,X
nare i.i.d. P, then
(i) the posterior distribution on M(X)given X1,X
2,...,X
nis a Polya tree with
parameters {α,X1,X2,...,Xn:E
k}where
α,X1,X2,...,Xn=α+
n
1
IB(Xi)
(ii) the marginal distribution of X1is given by
Pr{X=x}=
k
1
α1(x)2(x)...i(x)
α1(x)2(x)...i1(x)0 +α1(x)2(x)...i1(x)1
116 3. DIRICHLET AND POLYA TREE PROCESS
and the predictive distribution of Xn+1 given X1,X
2,...,X
nis of the same form
with αreplaced by α,X1,X2,...,Xn.
To prove (ii), Pr(X=x)=P(x)dΠ(P) and is the integral of the terms in (3.18).
The components in the product are independent beta random variables and a direct
computation yields the result.
The distribution of X1,X
2,...,X
ndefined via Theorem 3.3.1 can be thought of as
a Polya urn scheme, though not as easy to describe as that for a Dirichlet. This is
done in Mauldin et al. (92) and we refer the interested reader to their paper.
Remark 3.3.1.The assumption that Xcontains 2kelements and that partitions
are into two halves is not really necessary. All we need is Ti={B:Ei}for
i=1,2,...,k be a nested sequence of partitions. The equal halves can be relaxed by
allowing empty sets to be in the partition and setting the corresponding parameter
to be 0.
Remark 3.3.2.The form of the posterior distribution shows that Xand the vector
{P(B0|B):E
i}are conditionally independent given {IB:Ei}.
3.3.2 X=R
Motivated by the Xis finite case, we define a Polya tree prior on M(R) as follows:
Recall that Ejis the set of all sequences of 0s and 1s of length jand E=jEjis
all sequences of 0s and 1s of finite length. Also Eis the set of all infinite sequences
of 0s and 1s.
Definition 3.3.2. Fo r each n,letTn={B:En}be a partition of Rsuch that
for all in E,B0,B1is a partition of B.
Let α={α:E}be a set of nonnegative real numbers.
ApriorΠonM(R) is said to be a Polya tree (with respect to the partition T=
{Tn}n1) with parameter α, denoted by PT(α), if under Π
1. {P(B0|B):E}are a set of independent random variables
2. for all E,P(B0|B)isbeta(α0
1).
The first question, of course, is do such priors exist? We have already discussed this
in Chapter 2.
Theorem 3.3.2. A Polya tree with parameter α={α:E}exists if for all
Eα0
α0+α1 α00
α00 +α01  α000
α000 +α001 ...= 0 (3.19)
3.3. POLYA TREE PROCESS 117
and α10
α10 +α11  α110
α110 +α111 ...=0
Proof. This is an immediate consequence of Theorem 2.3.5 . We noted there that if
we set Y=P(B0),Y
=P(B0|B) then {Y:E}induces a measure on M(R)- if
it satisfies the continuity condition
YY0Y00 ...= 0 almost surely
Because n
1Y0...0is decreasing in nand bounded by 0 and 1, this happens iff
E(n
1Y0...0)0. The Yare independent beta random variables and the condition
translates precisely to (3.19).
Marginal distribution of X Let PPT(α) and given P,Xbe distributed as Pand
let mbe the marginal distribution of X. Because the finite union of sets in
nTnis an algebra it is enough to calculate m(B) for all in E.
If =12...
k,
m(XB12...,k)=E
ik1
PB12...i1i|B12...i1
=
{i:i=0,ik1}
Y12...i1
{i:i=1,ik1}
(1 Y12...i1) (3.20)
The factors inside the expectation are independent beta random variables, and
hence we have
=
k
1
α12...i
α12...i0+α12...i1
Theorem 3.3.3. Suppose that Xis distributed as Pand Pitself has a PT(α)prior.
Then the posterior distribution of Pgiven Xis PT(αX), where αX=α+IB(X).
Based on the corresponding result for the finite case, it is reasonable to expect the
posterior to be PT(αX). In fact the posterior distribution of {P(B):En}given X
is same as the posterior of {P(B):En}given {IB(X):En}. The calculation
in the finite case done in the last section shows that this posterior distribution is a
Polya tree with parameters {α,X =α+IB(X):∈∪
n
1Ei}. The proof is completed
118 3. DIRICHLET AND POLYA TREE PROCESS
by recognizing that posterior distributions of {P(B):En:n1}determine the
posterior distribution Π(·|X).
Repeatedly applying the last theorem we get the following.
Theorem 3.3.4. If PT(α)is the prior on M(R)and given P;ifX1,X
2,...,X
n
are i.i.d. P, then the posterior distribution of Pgiven X1,X
2,...,X
nis a polya tree
with parameter αX1,X2,...,Xnwhere
α,X1,X2,...,Xn=α+
n
1
IB(Xi)
Predictive distribution and Bayes estimate
It is immediate from the last two properties that if X1,X
2,...,X
nare i.i.d. Pgiven P,
and Pis has PT(α) prior, then the predictive distribution of Xn+1 given X1,X
2,...,X
n
is
P{Xn+1 B12...k}
α1+n
1IB1(Xi)
α0+α1+n
α12+n
1IB12(Xi)
α10+α11+n1...
α1...k+n
1IB1...k(Xi)
α1...k10+α1...k11+n1...k1
where nis the number of Xis falling in B.
In view of the calculations done so far, ˆ
P=E(P|X1,X
2,...,X
n) is the measure
satisfying
ˆ
P(B12...k)=
k
1
α1...j+n
1IB1...j(Xi)
α1...j0+α1...j1+n1...j1
Like the Dirichlet, here also the posterior is consistent. Formally, we have the
following theorem.
Theorem 3.3.5. Let Pbe distributed as PT(α)and given P, let X1,X
2,...,X
n
be i.i.d. P. Then for any P0,asn→∞, the posterior
PT(αX1,X2,...,Xn)δP0weakly a.s P0
.
3.3. POLYA TREE PROCESS 119
The result would follow as a particular case of a more general theorem proved later
for tail free priors. However one can give proof along the same lines as that for the
Dirichlet process and follows from the following lemmas.
Lemma 3.3.1. Let ¯αm=Eαm(P), where E¯αmis the expectation taken under PT(αm).
If {¯αm:mM}is tight, then so is {PT(αm):mM}.
Proof. Easily follows from corollary to Theorem 2.5.1
Lemma 3.3.2. If ¯αmP0and if for all E,
L(P(B)|PT(αm)) →L(P(B)|δP0))
then PT(αm)converges weakly to δP0.
Proof. The tightness of αmensures that PT(αm) has a limit point. This limit point
can be identified as δP0using calculations similar to Theorem 3.2.6.
To prove the theorem, let Ω = {ω:1/n n
1IB(Xi)P0(B) for all E}.
P0(Ω) = 1, and further for each ωΩitiseasilyveriedthat¯αm=α,X1,X2,...,Xn(ω)
satisfies the assumptions of the Lemma 3.3.2.
Support of PT(α)
Our next theorem is on the topological support of PT(α). Recall that the support is
the smallest closed set of PT(α) measure 1. Here we assume that {a:E}is a
dense set of numbers and induce a nested sequence of partitions.
Theorem 3.3.6. PT(α)has all of M(R)as support iff α>0for all E.
Proof. The proof follows along the same lines as for the Dirichlet (see Theorem 3.2.4).
Mauldin et al. [133] show that, unlike the Dirichlet, we can find αwhich will ensure
that PT(α) sits on the space of continuous measures. Because Polya tree priors are tail
free, we can use Theorem 2.4.3 to show that by suitably choosing the partitions and
parameters the Polya tree can be made to sit on, not just continuous distributions but
even absolutely continuous distributions. The theorem is an application of Theorem
2.4.3 to Polya tree processes. The proof is just a verification of the conditions of
Theorem 2.4.3.
120 3. DIRICHLET AND POLYA TREE PROCESS
Theorem 3.3.7. Let λbe a continuous probability measure on Rwith distribution
function λ. Define B12...i=λ1(i
2i,i
2i+1
2i).Ifα12...i=ai, and a1
i<
then PT(α)(L(λ)) = 1.
In particular when α12...i=i2, the polya tree gives mass 1 to probabilities that are
absolutely continuous with respect to λ.
A few concluding remarks about Polya tree priors: The Polya tree prior depends on
the underlying partition T={Tn}n1and is tail free with respect to this partition.
In fact a prior which is tail-free with respect to every sequence of partitions is, except
for trivial cases, a Dirichlet process [ Doksum [48]].
We have seen that Polya tree priors, unlike the Dirichlet, can be made to sit on
densities. One unpleasant feature of this construction is that absolute continuity of P
is ensured by controlling the variability of Paround the chosen absolutely continuous
λ. We have seen that for the Dirichlet the prior and posterior are mutually singular.
Dragichi and Ramamoorthi [56] have shown that if the parameters are as in the
Theorem 3.3.7, then the posterior given distinct observations is absolutely continuous
with respect to the prior.
Lavine suggests that, if the prior expectation is F, then the partitions of the form
F1(i/2i,i/2i+1/2i) would be appropriate. For then the ratios
α12...i10/(α12...i10+α12...i11)=1/2
and this would ensure that the marginal of Xis F, which may then be treated as
a “prior guess” of the “mean” of the random P. As to the magnitude of the αs
(as distinct from their ratios), their role is somewhat similar to that of α(R)forthe
Dirichlet, except that the availability of more parameters introduces more flexibility.
It is expected that for moderate ka choice of the magnitude would be on the basis
of prior belief and for higher k, a conventional choice would be made. A conventional
choice might be to ensure that the prior sits on densities. For example, one may take
α1,...,k=1/k2. Lavine [118] has expressed well what the main considerations are; we
refer the reader to his paper.
4
Consistency Theorems
4.1 Introduction
We briefly discussed consistency of the posterior in Chapters 1, 2 and 3. To recall,
our setup consists of:
a (unknown) parameter θthat lies in a parameter space Θ;
a prior distribution Π for θ, equivalently, a probability measure on Θ; and
X1,X
2,...,X
n, which are given θ, i.i.d. with common distribution Pθ.
Our interest centers on the consistency of the posterior distribution, and as dis-
cussed in Chapter 1, this is a requirement that if indeed θ0is the “true ” distribution
of X1,X
2,...,X
nthen the posterior should converge to δθ0almost surely. In other
words, as n→∞, the posterior probability of every neighborhood of θ0should go to
1withPθ0probability 1.
We noted that posterior consistency can be viewed as
a sort of frequentist validation of the Bayesian method;
merging of posteriors arising from two different priors; and
as an expression of “data eventually swamps the prior”.
In Chapter 1 we saw that when Θ is a subset of a finite-dimensional Euclidean space
and if θ→ Pθis smooth, then for smooth priors the posterior is consistent in the
122 4. CONSISTENCY THEOREMS
support of the prior. In Chapter 1 we also saw an example showing that inconsistency
cannot be ruled even when Θ = R.
The example in Chapter 1 may be dismissed as a technical pathology, but in the
nonparametric case inconsistency can scarcely be called pathological. This has led
some to question the role of consistency in Bayesian inference. The argument is that it
is well known that the prior and the posterior given by Bayes theorem are imperatives
arising out of axioms of rational behavior–and since we are already rational why
worry about one more criteria? In other words inconsistency does not warrant the
abandonment of a prior. We would argue that in the nonparametric context typically
one would have many priors that would be consistent with one’s prior beliefs, and it
does make sense to choose among these priors that are consistent at a large number
of parameter values, among which we expect the true parameter to lie.
In the nonparametric context Θ is M(R) or large subset of it. M(R) has various
kinds of convergence, namely, total variation, setwise , weak, etc. Each of these leads to
a corresponding notion of consistency. The issue of consistency has been approached
from different point of view by [143]. We begin with a formal definition of these.
4.2 Preliminaries
Definition 4.2.1. {Π(·|X1,X
2,...,X
n)}is said to be strongly or L1-consistent at
P0ifthereisaΩ
0Ω such that P
0(Ω0) = 1 and for ω0
Π(U|X1,X
2,...,X
n)1
for all total variation neighborhoods of P0.
Definition 4.2.2. {Π(·|X1,X
2,...,X
n)}is said to be weakly consistent at P0if
thereisaΩ
0Ω such that P
0(Ω0) = 1 and for ω0
Π(U|X1,X
2,...,X
n)1
for all weak neighborhoods of P0.
Before we proceed to the study of consistency, we note that Bayes estimates inherit
the convergence property of the posterior. Recall that we denote X1,X
2,...,X
nby
Xn.
Proposition 4.2.1. Define the Bayes estimate ˆ
Pn(·|Xn)to be the probability mea-
sure ˆ
Pn(A|Xn)=P(A(dP |X1,X
2,...,X
n)=E(P(A)|Xn).Then
4.2. PRELIMINARIES 123
1. if {Π(·|X1,X
2,...,X
n)}is strongly consistent at P0, then || ˆ
PnP0|| → 0, almost
surely P0.
2. If {Π(·|X1,X
2,...,X
n)}is weakly consistent at P0, then ˆ
PnP0weakly, almost
surely P0.
Proof. By Jensen’s inequality
|| ˆ
PnP0|| ≤ ||PP0|| Π(dP |Xn)
=U||PP0|| Π(dP |Xn)+Uc||PP0|| Π(dP |Xn)
and if U={P:||PP0|| <}then
Π(U|Xn)+Π(Uc|Xn)+(1)
as n→∞.
A similar argument works for assertion (ii) by considering |fd
ˆ
PnfdP
0|,f
bounded continuous.
It is worth pointing out that the conventional Bayes estimate considered earlier is
a Bayes estimate only for the squared error loss for P(A). The Bayes estimate ˜
P(A)
for, say, the absolute deviation loss, will be the posterior median. Unfortunately, the
˜
P(·) so obtained will not be a probability measure.
As far as the prior is on M(R), weak consistency is intimately related to the con-
sistency of the Bayes estimates of F(t).
Theorem 4.2.1. Suppose Πis a prior on F, the space of distribution functions on
R, and X1,X
2,...,X
nbe given F; i.i.d. F. Then the posterior is weakly consistent
at F0, iff there is a dense subset Qof Rsuch that for tin Q
(i) lim
n→∞ E(F(t)|Xn)=F0(t); and
(ii) lim
n→∞ V(F(t)|Xn)=0.
Proof. If (i) and (ii) hold, it follows from a simple use of Chebychev’s inequality that
for every tin Q,
Π((F(t)F0(t)|)|Xn)1a.sF0
124 4. CONSISTENCY THEOREMS
and hence it follows that
Π((F(ti)F0(ti)||Xn)for1ik)1a.sF0
By Proposition 2.5.2, any weak neighborhood Uof F0contains a set of the above
form for a suitable δ. Hence Π(U|Xn)1 a.e. F0.
On the other hand, if the posterior is weakly consistent, then it is easy to see that
(i) and (ii) hold for any tthat is a continuity point of F0.
Since strong consistency is desirable, it is natural to seek a prior Π for which the
posterior would be strongly consistent at all Pin M(R). Such a prior can be thought
of as more diffuse than priors that do not have this property. However such a prior
does not exist. If it did, the corresponding Bayes estimates, by the last Proposition
4.2.1, would give a sequence of estimates of Pthat is consistent in the total variation
metric and such estimates do not exist [41]. On the other hand the Dirichlet priors
considered earlier provide an example of a prior that is weakly consistent at all P.
We note that Doob’s theorem is applicable also to strong consistency.
If Uis a neighborhood of P0with prior probability 0, then any reasonable version
of the posterior will assign mass 0 to Uand consequently it is unreasonable to expect
consistency at such a P0. Thus it is appropriate to confine the search for points of
consistency to the (topological) support of the prior.
4.3 Finite and Tail free case
When Xis a finite set, M(X) is a subset of the Euclidean space, and all the topologies
coincide on M(X) , and we have the following pleasing theorem. This theorem can
also be proved from Theorem 1.3.4. Here is a direct proof that in a way is related to
the Schwartz theorem discussed later in this chapter.
Theorem 4.3.1. Let Πbe a prior on M(X), where X={1,2,...,k}. Then the
posterior is consistent at all points in the support of Π.
Proof. Let
V={P:PP0}
be a neighborhood of P0.
Π(Vc|Xn)=Vcenk
1(ni/n)log(P0(i)/P (i)) dΠ(p)
Xenk
1(ni/n)log(P0(i)/P (i)) dΠ(p)
4.3. FINITE AND TAIL FREE CASE 125
where niis the number of Xnequal to i. Writing it as
I1(Xn)
I2(Xn)
we will show that
(i) for all β>0, lim inf n→∞ eI2(Xn)=a.s P0;and
(ii) there exists a β0>0 such that e0I1(Xn)0a.sP0.
condition (i) follows from the strong law of large numbers.
As for (ii)
k
1
ni
nlog P0(i)
P(i)=
k
1
ni
nlog ni/n
P(i)+
k
1
ni
nlog P0(i)
ni/n
which gives
lim
n→∞
k
1
ni
nlog P0(i)
P(i)= lim
n→∞
k
1
ni
nlog ni/n
P(i)
If Fnstands for the empirical distribution
k
i
ni
nlog ni/n
P(i)=K(Fn,P)
and by Proposition 1.2.2
K(Fn,P)||FnP||2
4=(||PP0||−||FnP0||)2
4
If PVcand nis large so that ||FnP0|| /2, we have
K(Fn,P)(δδ/2)2
4=δ0
In other words,
inf
PVcK(Fn,P)
0a.s P0
Consequently
lim
n→∞ eI1(Xn)lim
n→∞ eVc
enK(Fn,P )dΠ(p)en(βδ0)
which goes to 0 if β<δ
0. The proof of the theorem is easily completed by taking
β0
0.
126 4. CONSISTENCY THEOREMS
When Xis infinite, even weak consistency can fail to occur in the weak support
of Π. Freedman [69] provided dramatic examples when X={1,2,3,...,}.Another
elegant example, due to Ferguson, is described in [65].
Theorem 4.3.2. For k=1,2,..., let Tk={B:Ek}be a partition of Rinto
intervals. Further assume that {T k:k1}are nested. If Πis a prior on M(R),
tail free with respect to {T k:k1}and with support all of M(R)then (there exits
a version of) the posterior which is weakly consistent at every P0.
Proof. By Theorem 2.5.2, enough to show that for each nthe posterior distribution
of {P(B):En}given Xnconverges a.e. P0to {P0(B):En}. Proposition
2.3.6 ensures that the posterior distribution of {P(B):En}given X1,X
2,...,X
n
isthesameasthatgiven{n:En},wherenis the number of X1,X
2,...,X
n
in B. A little reflection will show that we are now in the same situation as Theorem
4.3.1.
4.4 Posterior Consistency on Densities
4.4.1 Schwartz Theorem
In the last section we looked at priors on M(R). An important special case is when the
prior is concentrated on Lµ, the space of densities with respect to a σ-finite measure
µon R. This case is important because of its practical relevance. In addition this
is a situation when one has a natural posterior given by the Bayes theorem. Our
(conventional) Bayes estimate is the expectation of fwith respect to the posterior.
We begin the discussion with a theorem of Schwartz [145]. Our later applications
will show that Schwartz’s theorem is a powerful tool in establishing posterior consis-
tency. Barron [8] provides insight into the role of Schwartz’s theorem in consistency.
Our setup, then, is Lµ=f:fis measurable,f 0,fdµ=1
.We tacitly iden-
tify the µequivalence classes in Lµand equip Lµwith the total variation or L1-metric
||fg|| =|fg|.Everyfin Lµcorresponds to a probability measure Pf, and it
is easy to see that the Borel σ-algebra generated by the L1-metric and the σ-algebra
BMLµare the same.
Let Π be a prior on Lµ. Recall that K(f,g) stands for the Kullback-Leibler diver-
gence flog(f/g).K(f) will stand for the neighborhood {g:K(f,g)<}.
Definition 4.4.1. Let f0be in Lµ.f0is said to be in the K-L support of the prior
Π, if for all >0, Π(K(f0)) >0.
4.4. POSTERIOR CONSISTENCY ON DENSITIES 127
As before, X1,X
2,... are given f, i.i.d. Pf.Pn
fwill stand for the joint distribution
of X1,X
2,...,X
nand P
ffor the joint distribution of the entire sequence X1,X
2,...
. We will, when needed, view P
fas a measure on Ω = R.
Let Ube a set containing f0. In order for the posterior probability of Ugiven Xn
to go to 1, it is necessary that f0and Uccan be separated. This idea of separation
is conveniently formalized through the existence of appropriate tests for testing H0:
f=f0versus H1:fUc. Recall that a test function is a nonnegative measurable
function bounded by 1.
Let {φn(Xn):n1}be a sequence of test functions.
Definition 4.4.2. {φn(Xn):n1}is uniformly consistent for testing H0:f=f0
versus H1:fUc,ifasn→∞,
Ef0(φn(Xn)) 0
inf
fUcEf(φn(Xn)) 1
Definition 4.4.3. A test φ(Xn)isstrictly unbiased for H0:f=f0versus H1:
fUc,if
Ef0(φn(Xn)) <inf
fUcEf(φn(Xn))
Definition 4.4.4. {φn(Xn):n1}is uniformly exponentially consistent for test-
ing H0:f=f0versus H1:fUc, if there exist C, β positive such that for all
n,
Ef0(φn(Xn)) Ce
and
inf
fUcEf(φn(Xn)) 1Ce
The next proposition relates these three definitions. The proposition is itself inter-
esting, and the ideas involved in the proof surface again in later arguments.
Proposition 4.4.1. The following are equivalent
(i) There exists a uniformly consistent sequence of tests for testing H0:f=f0
versus H1:fUc.
(ii) for some n1, there exists a strictly unbiased test φ(Xn)for H0:f=f0
versus H1:fUc.
(iii) There exists a uniformly exponentially consistent sequence of test functions for
testing H0:f=f0versus H1:fUc.
128 4. CONSISTENCY THEOREMS
Proof. Clearly, (i) implies (ii) and (iii) implies(i). So all that needs to be established
is that (ii) implies (iii).
Consider first the simple case when m= 1, i.e., there exists φ(X) such that Ef0φ=
α< inf
fUcEfφ=γ.
Let
Ak=(x1,x
2,...,x
k):1
kφ(Xi)>(α+γ)
2
Then Pk
f0(Ak)=Pk
f0(φ(Xi)kEf0φ>k(γα)/2), and by Hoeffeding’s inequal-
ity,
Pk
f0φ(Xi)kEf0φ>k(γα)
2ek2(γα)2
4k=ek(γα)2
4
On the other hand, for fUc
Pk
f(Ak)Pk
fφ(Xi)kEfφ>k(αγ)
2
Because αγ<0, by applying Hoeffeding’s inequality to φ,weget
Pf(Ak)1ek(γα)2
4
and thus φk=IAkprovides the required sequence of tests.
To move on to the general case, suppose
Ef0φm(X1,X
2,...,X
m)=α< inf
fUcEfφm(X1,X
2,...,X
m)=γ
From what we have just seen, if n=km, then there is a set Akwith Pn
f0(Ak)
en(γα)2/4m.Ifkm < n (k+1)m, then
Pn
f0(Ak)enkm(γα)2
n4m
enk(γα)2
(k+1)4men(γα)2
8m
Thus, setting β=(γα)2/8m, we have the exponential bound for φn=IAkwith
respect to Pf0. A similar argument yields the corresponding inequality for inf
fUcPf(Ak).
Corollary 4.4.1. Let νbe any probability measure on Uc. When there is a φn(Xn)
such that Efn
0φn(Xn)Ceand inffUcEfφn(Xn)1Ce , we have ||f0
fnν(df )|| ≥ 2(1 2Ce), where fnis the n-fold product density n
1f(xi).
4.4. POSTERIOR CONSISTENCY ON DENSITIES 129
Theorem 4.4.1 (Schwartz). Let Πbe a prior on Lµ.Iff0Lµ,and Usatisfy
(i) f0is in the K-L support of Πand
(ii) there exists a uniformly consistent sequence of tests for testing H0:f=f0
versus H1:fUc,
then Π(U|X1,X
2,...,X
n)1a.s P
f0
Proof. Because
Π(Uc|X1,X
2,...,X
n)=Ucn
1f(Xi(df )
Lµn
1f(Xi(df )=Ucn
1
f(Xi)
f0(XiΠ(df )
Lµn
1
f(Xi)
f0(Xi)Π(df )
it is enough to show that the last term in this expression goes to 0 a.s. P
f0.
We will show in Lemma 4.4.1 that condition (i) implies
for every β>0,lim inf
n→∞ eLµ
n
1
f(Xi)
f0(Xi)Π(df )=a.e.P
fo(4.1)
By Proposition 4.4.1, there exist exponentially consistent tests for testing f0against
Uc. Using these we invoke Lemma 4.4.2, by taking Vn=Ucfor all nto show that
for some β0>0,lim
n→∞ e0Uc
n
1
f(Xi)
f0(Xi)Π(df ) = 0 a.e.P
fo(4.2)
By taking β=β0in (4.1) it easily follows that the ratio in (4.4.1) goes to 0 a.e.
Lemma 4.4.1. If f0is in the Kullback-Leibler support of Πthen
for every β>0,lim inf
n→∞ eLµ
n
1
f(Xi)
f0(Xi)Π(df )=a.e.P
fo
Proof.
Lµ
n
1
f(Xi)
f0(Xi)Π(df )K(f0)
en
1log f0
f(Xi)
For ea ch fin K(f0), by the law of large numbers
1
nlog f0
f(Xi)→−K(f0,f)>a.s P
f0
130 4. CONSISTENCY THEOREMS
Equivalently, for each fin K(f0),
en(21
nlog f0
f(Xi))→∞a.s P
f0(4.3)
Hence by Fubini, there is a Ω0ΩofP
f0measure 1 such that, for each ω0,for
all fin K(f0), outside a set of Π measure 0, (4.3) holds. Using Fatou’s lemma,
lim inf en2Lµ
n
1
f(Xi)
f0(Xi)Π(df )lim inf en2K(f0)
n
1
f(Xi)
f0(Xi)Π(df )
K(f0)
en(21
nlog f0
f(Xi)(ω))Π(df )→∞
We will state the next lemma in a form slightly stronger than what we need.
Lemma 4.4.2. If there exist tests φn(Xn)and sets Vnwith lim infnΠ(Vn)>0,
such that for some β>0,
Ef0φn(Xn)Ce
and
inf
fVn
Efφn(Xn)1Ce
then
for some β0>0,lim
n→∞ e0Vn
n
1
f(Xi)
f0(Xi)Π(df )=0a.e. P
fo
Proof. Set qn(x1,x
2,...,x
n)=(1/Π(Vn)Vnn
1f(Xi(df ). Denoting by A(fn
0,q
n)=
f0(xi)qn(xi), by Corollaries 4.4.1 and 1.2.1 , there is 0 <r<1 such that
A(fn
0,q
n)(1 ||PQ||2
4)2Cenr
Thus
Pn
f0qn(Xn)
f0(Xi)enr=Pn
f0(qn(Xn)
f0(Xi)enr
22Cenr
2enr
An application of Borel-Cantelli yields
qn(Xn)
f0(Xi)enr a.s P
f0
4.4. POSTERIOR CONSISTENCY ON DENSITIES 131
and we have 1
Π(Vn)enr
2Vnn
1f(Xi)
n
1f0(Xi)Π(df )0a.sP
f0
Since lim inf Π(Vn)>0, we have the conclusion.
Remark 4.4.1.The role of the assumption that f0is in the Kullback-Leibler support
is to ensure that (4.1) holds. Sometimes it might be possible to verify it by direct
calculation without invoking the K-L support assumption. We will see an example of
this kind in the next chapter.
Let f0be in the K-L support of Π. In order to apply the Schwartz theorem, we
need to identify neighborhoods of f0for which there exists a uniformly consistent test
for H0:f=f0vs H1:fUc.
Let Ube a weak neighborhood of the form
U=fdP fdP0<,f bounded continuous (4.4)
Because fis bounded, by adding a constant we make it nonnegative and multiplying
by a positive constant we can make 0 f1. Then Uhas the same expression in
terms of this transformed f, with perhaps a different .Nowfis a test function and
which separates P0and Uc. Thus for neighborhoods of the form displayed we have an
unbiased test and consequently a uniformly consistent sequence of tests for
H0:P=P0H1:PUc
For any test function f,|fdP fdP0|<iff fdP fdP0<and
(1 f)dP (1 f)dP0<.InotherwordsU={P:|fdP fdP0|<}can
be expressed as intersections of sets of the type in (4.4).
Theorem 4.4.2. Let Πbe a prior on Lµ.Iff0is in the K-L support of Π, then
the posterior is weakly consistent at f0.
Proof. If U={P:|fidP fidP0|<
i:1ik}then
U=k
1{P:|fidP fidP0|<
i}
Hence it is enough to show that the posterior probability of each of the sets in the
intersection goes to 1 a.s f0. By the discussion preceding the theorem, {P:|fidP
132 4. CONSISTENCY THEOREMS
fidP0|<
i}is an intersection of two sets of the type displayed in (4.4). Since the
Schwartz condition is satisfied for these sets
Π(U|X1,X
2,...,X
n)1a.sP
f0.
Further, using a countable base for weak neighborhoods, we can ensure that almost
surely P
f0, for all U(U|X1,X
2,...,X
n)1.
If we have a tail free prior on densities, like a suitable Polya tree prior, then we do
not need a condition like “f0is in the K-L support of Π” to prove weak consistency of
the posterior. On the other hand, consistency is proved for a tail free prior by using
a Schwartz like argument for finite-dimensional multinomials, which tacitly uses the
condition of f0being in the K-L support. See also the result in the next section that
establishes posterior consistency without invoking Schwartz’s condition.
Applications of Schwartz’s theorem appear in Chapters 5, 6 and 7.
4.4.2 L1-Consistency
What if Uis a total variation neighborhood of f0? LeCam [122] and Barron [7] show
that in this case, if f0is nonatomic, then a uniformly consistent test for H0:f=f0
versus H1:fUcwill not exist.
Barron investigated the connection between posterior consistency and existence of
uniformly consistent tests. The next two results are adapted from an unpublished
technical report of Barron. Some of these appear in [8].
Proposition 4.4.2. Suppose for some β0>0,Π(Wn)<Ce
0.Iff0is in the
K-L support of Πthen
Π(Wn|Xn)0a.s.P
f0
Proof. By the Markov inequality
Pf0Wn
n
1
f
f0
(Xi(df )>e
eRnWn
n
1
f
f0
(Xi(df )
n
1
f0(Xi)µn(dx1,dx
2,...,dx
n)
=eWn
Π(df )
eCe0
4.4. POSTERIOR CONSISTENCY ON DENSITIES 133
and if β<β
0
P
f0Wn
n
1
f
f0
(Xi(df )>e
i.o =0
By Lemma 4.4.1, for all β>0,
eLµ
n
1
f(Xi)
f0(Xi)Π(df )→∞a.s P
f0.
The argument is now easily completed.
Theorem 4.4.3 (Barron). Let Πbe a prior on Lµ,f0in Lµand Ube a neigh-
borhood of f0. Assume that Π(K(f0)) >0for all >0. Then the following are
equivalent.
(i) There exists a β0such that
Pf0{Π(Uc|X1,X
2,...,X
n)>e
0infinitely often}=0
(ii) There exist subsets Vn,W
nof Lµ, positive numbers c1,c
2
1
2and a sequence
of tests {φn(Xn)}such that
(a) UcVnWn,
(b) Π(Wn)C1e1, and
(c) Pf0{φn(Xn)>0infinitely often}=0and
inffVnEfφn1c2e2.
Proof. (i)=(ii): Set Sn=(x1,x
2,...,x
n):Π(Uc|x1,x
2,...,x
n)>e
0and
φn=ISn.Letβ<β
0
Vn=f:Pf(Sn)>1e
Wn=f:Pf(Sc
n)eUc
By assumption P
f0{φn= 1 infinitely often }= 0 and by construction
inf
fVn
Efφn>1e
134 4. CONSISTENCY THEOREMS
Now,
Π(Wn)=Πf:Pf(Sc
n)>e
Uc
eUc
Pf(Sc
n(df )
and by Fubini
=eSc
n
π(Uc|xn)n(xn)
ee0=en(β0β)
where λnis the marginal distribution of Xn.
(ii)=(i):
Π(Uc|Xn)=Π(UcVn|Xn)+Π(UcWn|Xn)
Since Wnhas exponentially small prior probability, by Proposition 4.4.2
Π(Wn|Xn)0a.sP
f0
The proof actually shows that for some β0>0, writing i.o. for ”infinitely often”
P
f0Π(Wn|Xn)>e
0i.o =0
Because Π(UcVn|Xn)Π(Vn|Xn), it is enough to show that, for some β>0,
P
f0Π(Vn|Xn)>e
i.o =0
Now,
Π(Vn|Xn)
=φn(Xn)Π(Vn|Xn)+(1φn(Xn))Π(Vn|Xn)
Since P
f0{φn>0 i.o. }= 0, for any β>0, P
f0{φnΠ(Vn|Xn)>0 i.o. }=0.
4.4. POSTERIOR CONSISTENCY ON DENSITIES 135
For any βan application of Markov’s inequality and Borel-Cantelli lemma shows
that
Pf0Vn
n
1
f
f0
(xi(df )(1 φn(xn)) >e
eRnVn
n
1
f
f0
(xi)(1 φn(xn)) Π(df )
n
1
f0(xi)µn(dxn)
=eVn
Ef(1 φn(df )
eC2e2
and if β<β
2
Pf0Vn
n
1
f
f0
(xi(df )(1 φn(xn)) >e
i.o =0.
As before by Lemma 4.4.1 for any β,
eLµ
n
1
f(Xi)
f0(Xi)Π(df )→∞a.s P
f0.
The argument is now easily completed.
This last theorem can be used to develop sufficient conditions for posterior con-
sistency on L1-neighborhoods. Barron, Schervish and Wasserman [5] provide such a
condition using bracketing metric entropy. Motivated by their result, we prove the
following.
Definition 4.4.5. Let G⊂Lµ.Forδ>0, the L1-metric entropy J(δ, G) is defined
as the logarithm of the minimum of all nsuch that there exist f1,f
2,...,f
nin Lµ
with the property G⊂∪
n
1{f:ffi}.
Theorem 4.4.4. Let Πbe a prior on Lµ. Suppose f0Lµand Π(K(f0)) >0for
all >0. If for each >0, there is a δ<,c1,c
2>0,β<
2/2, and FnLµsuch
that, for all nlarge,
1. Π(Fc
n)<C
1e1,
2. J(δ, Fn)<nβ,
136 4. CONSISTENCY THEOREMS
then the posterior is strongly consistent at f0.
Proof. Let U={f:ff0<},Vn=FnUc,and Wn=Fc
n. We will argue that the
pair (Vn,W
n) satisfy (ii) of Theorem 4.4.3. Here UcVnWnand Π(Wn)<c
1e1.
Let g1,g
2,...,g
kin Lµbe such that Vn⊂∪
k
1Giwhere Gi={f:fgi}.
Let fiVnGi.Thenforeachi=1,2,...,k,f0fi>and if fGi, then
f0f>δ. Consequently for each i=1,2,...,k, there exists a set Aisuch that
Pf0(Ai)=αand Pfi(Ai)=γ>α+
Hence if fGi, then Pf(Ai)δ>α+δ.
Let
Bi=(x1,x
2,...,x
n): 1
n
n
j=1
IAi(xj)(γ+α)/2
A straightforward application of Hoeffeding’s inequality shows that
Pf0(Bi)exp[n2/2]
On the other hand, if fGi,
Pf(Bi)Pf1
n
n
j=1
IAi(xj)Pf(Ai)(αγ)
2+δ
Pfn1
n
j=1
IAi(xj)Pf(Ai)
2+δ(4.5)
Applying Hoeffeding’s inequality to n1n
j=1 IAi(xj), the preceding probability
is greater than or equal to
1exp[(n/2)(/2δ)2]
If we set
φn(X1,X
2,...,X
n)= max
1ikIBi(X1,X
2,...,X
n)
then
Ef0φnkexp[n2/2]
and
inf
fVn
Efφn1exp[(n/2)(/2δ)2]
4.5. CONSISTENCY VIA LECAM’S INEQUALITY 137
By choosing log kJ(δ, Fn)<nβ,wehaveEf0φnexp[n(2/2β)]. Since
β<
2/2, all that is left to show is
Pf0{φn>0 infinitely often}=0
This follows easily from an application of the Borel Cantelli lemma and from the fact
that φntakes only values 0 or 1.
This last theorem is very much in the spirit of Barron et al. [5]. Their theorem is in
terms of bracketing entropy. If G⊂Lµ,forδ>0, the L1-bracketing entropy J1(δ, G)
is defined as (here we use a weaker notion that suffices for our purpose) the logarithm
of the minimum of all nsuch that there exist g1,g
2,...,g
nsatisfying
1. gi1+δ,
2. for every g∈Gthereexistsanisuch that ggi.
We feel that in many examples the L1entropy is easier to apply than bracketing
entropy.
4.5 Consistency via LeCam’s inequality
It is of technical interest that one can prove posterior consistency without assuming
that the prior is tail free or satisfies the condition of f0being in the K-L support. An
inequality of LeCam [121] is useful to do this.
Let Π be a prior on M(X). For any measurable subset Uof M(X), let λUbe the
probability measure on Xgiven by
λU(B)= 1
Π(U)U
P(B)dΠ(P)
We will let λstand for the marginal on X.
If given P,XP, and Π(U|Xn) is the posterior probability of U, then
Π(U)=Π(U)U
(·)= Π(U)U
Π(U)U(Uc)Uc
(·)
Π(U)
Π(V)
U
V
(·)ifVUc
138 4. CONSISTENCY THEOREMS
Also recall that the L1-distance satisfies
PQ=2sup
B|P(B)Q(B)|=2 sup
0f1fdP fdQ
where of course Bsandfs are measurable.
Lemma 4.5.1 (LeCam). Let U, V be disjoint subsets of X. For any P0and any
test function φ
Π(V|x)dP0(x)≤P0λU+φdP0+Π(V)
Π(U)(1 φ)V(4.6)
Proof.
Π(V|x)dP0(x)=φ(x)Π(V|x)dP0(x)+(1 φ(x))Π(V|x)dP0(x)
adding and subtracting (1 φ(x))Π(V|x)U(x)
φ(x)dP0(x)+(1 φ(x))Π(V|x)dP0(x)(1 φ(x))Π(V|x)U(x)
+(1 φ(x))Π(V|x)U(x)
φ(x)Π(V|x)dP0(x)+P0λU+Π(V)
Π(U)(1 φ)V
where the first term comes from observing
0Π(V|x)1
and the second from
0(1 φ)(x)Π(V|x)1
The third term follows by noting that
Π(V|x)(Π(V)/Π(U))(V/dλU)
4.5. CONSISTENCY VIA LECAM’S INEQUALITY 139
Our interest is when Vis the complement of a neighborhood of P0and we have
X1,X
2,...,X
nwhich are given P, i.i.d. P.IfUnV=and φnare test functions,
then we can write LeCam’s inequality as
Π(V|Xn)≤Pn
0λn
Un+φndP n
0+Π(V)
Π(Un)(1 φn)V
where of course Pnis the n-fold product of Pand λn
U=(
UPndΠ(P))/Π(U).
Theorem 4.5.1. Let Uδ
n={P:P0P/n}. If for every δ,{Π(Uδ
n):n1}
is not exponentially small, i.e.,
for all β>0,e
Π(Uδ
n)→∞ (4.7)
then the posterior is weakly consistent at P0
Proof. It is not hard to see that
P0P/n⇒Pn
0Pn
Consequently the first term goes to δ. Since for any weak neighborhood we can choose
an exponentially consistent test φnfor testing H0:f=f0against H1:fVc
n,and
by assumption for all β>0,e
Π(Uδ
n)→∞, it is not hard to see that the third term
goes to 0. Because δis arbitrary, the result follows.
Remark 4.5.1.By Proposition 1.2.1, PQ≤2H(P, Q). Hence Theorem 4.5.1
holds if we take Uδ
n={P:H(P0,P)/n}
Suppose (4.7) holds and Vnare sets such that for some β0>0,Π(Vn)e00; then
choosing φn0 it follows easily that Π(Vn|X1,X
2,...,X
n)0. In other words, we
have an analog of Proposition 4.4.2. Consequently, we also have an analog of Theorem
4.4.4.
Theorem 4.5.2. Let Πbe a prior on Lµ. If for each >0, there is a δ<,
c1,c
2>0,β<
2/2, and FnLµsuch that for all nlarge,
1. Π(Fc
n)<C
1e1and
2. J(δ, Fn)<nβ
Further if with Uδ
n={P:P0P/n},
for every δ, for all β>0,e
Π(Uδ
n)→∞
then the posterior is strongly consistent at f0.
5
Density Estimation
5.1 Introduction
As the name suggests, density estimation is the problem of estimating the density of a
random variable Xusing observations of X. In this chapter we discuss some Bayesian
approaches to density estimation.
Density estimation has been extensively studied from the non-Bayesian point of
view. These include many methods of estimation starting from simple histogram
estimates to more sophisticated kernel estimates, estimates through Fourier series
expansions, and more recently wavelet-based methods. In addition, the asymptotics
of many of these methods, including minimax rates of convergence are available. There
are many good references; Silverman [151] and Van der Vaart [160] provide a good
starting point.
Consider the simple case when the density is to be estimated through a histogram.
Important features of the histogram are number of bins, their location and their width.
In order to reflect the true density, these features of the histogram estimate need to be
dependent not just on the number of observations but on the observations themselves.
The need for such a dynamic choice has been recognized and there have been many
reasonable, ad hoc, prescriptions. This issue persists in one form or another with the
other methods of estimation such as kernel estimates. The Bayesian approach, via
the posterior provides a rational method for choosing these features.
142 5. DENSITY ESTIMATION
In this chapter we discuss histogram priors of Gasperini and mixtures of nor-
mal densities which were introduced by Lo [130] and further developed by Escobar,
Mueller and West [ [168],[59] and [170]]. Gaussian process priors developed by Leonard
[[126],[127]] and studied by Lenk [125] are some what different in sprit and are also
discussed. See also Hjort [98] and Hartigan [94].
Consistency is dealt with at some length for the histogram and the mixture of nor-
mal kernel priors. These partly demonstrate different techniques to show consistency.
For the priors on histograms direct calculation is easier than invoking the Schwartz
theorem whereas for the mixture of normal kernels Schwartz’s theorem is a conve-
nient tool. This chapter is beset with long computations. To an extent they are both
natural and necessary.
5.2 Polya Tree Priors
A prerequisite for Bayesian density estimation is, of course, a prior on densities.
Since the Dirichlet process and their mixtures sit on discrete measures, these are
clearly unsuitable. On the other hand we have saw in Chapter 3 that by choosing
the parameters appropriately we can get Polya tree priors that are supported by
densities. Since the posterior for these priors involves simple updating rules, it is
natural to consider Polya trees as a candidate in density estimation.
Recall that if we have a Polya tree with partitions {B:Ej:j1}and pa-
rameters {α:E
k}:k1}, the predictive density at xis given by
α(x) = lim
k→∞
k
1
1
λ(B1(x)2(x)...i(x))
α1(x)2(x)...i(x)
α1(x)2(x)...i(x)0 +α1(x)2(x)...i(x)1
where i(x)=1ifxB1(x)2(x)...i(x)and 0 otherwise.
If X1=x1is observed and x1B
1,
2,...
kfor a sequence (
1,
2,...)of0sand1s,
and if and differ for the first time at the (j+ 1)th coordinate, then the predictive
density α(x|X1=x1)is
α(x|X1=x1)=
j
1
1
λ(B1(x)2(x)...i(x))
α1(x)2(x)...i(x)+1
α1(x)2(x)...i(x)0 +α1(x)2(x)...i(x)1
j+1
1
λ(B1(x)2(x)...i(x))
α1(x)2(x)...i(x)
α1(x)2(x)...i(x)0 +α1(x)2(x)...i(x)1
5.3. MIXTURES OF KERNELS 143
As is to be expected the predictive density depends on the partition. While a
general expression for the predictive density given X1,X
2,...,X
nis cumbersome to
write down, it is clear that sequential updating is possible.
The density estimates from Polya tree priors have no obvious relation with classical
density estimates. Further, the priors lead to estimates that lack smoothness at the
endpoints of the defining partition. Lavine [118] observes that this disadvantage can
be overcome by considering a mixture of {PT(θ)(θ))}processes, where the par-
titions themselves depend on the hyperparameter θ. One advantage of the Polya tree
priors is the relative ease with which one can conduct robustness studies; see Lavine
[119].
If we have a prior on densities, as discussed in Chapter 4 the consistency of interest
is L1-consistency. It is shown in Barron et al. [5] that if αn=8
n, the posterior is L1-
consistent. Such a high value of αnimplies that the random Ps are highly concentrated
around the prior guess E(P), so that posterior consistency will be an extremely slow
process. Hjort and Walker [165] have used a some what curious argument and show
that with αn=n2+δthe Bayes estimate is L1-consistent.
5.3 Mixtures of Kernels
While Polya tree priors can be made to sit on densities, it is not possible to constrain
the support to have smoothness properties. Much before Polya tree priors became
popular, Lo [131] had developed a useful construction of priors on densities. Much of
this section is based on Lo [131] and Ferguson [63].
Let Θ be a parameter set, typically Ror R2.LetK(x, τ ) be a kernel, i.e.,for each
τ,K(·) is a probability density on Xwith respect to some σ-finite measure. For any
probability Pon Θ, let
K(x, P )=K(x, τ )dP (τ)
For ea ch P,K(·,P) is a density on Xand Lo’s method consists of choosing a
mixture K(·,P) at random by choosing Paccording to a Dirichlet process. These
would be referred to as Dirichlet mixtures of K(·,P).
Formally the model consists of PDα, given P;X1,X
2,...,X
nare i.i.d. K(·,P).
If α=M¯α,wherαis a probability measure, then the prior expected density is
f0=K(·,P)Dα(dP )=K(·α( )
144 5. DENSITY ESTIMATION
It is convenient to view the X1,X
2,...,X
nas arising in the following way: P
Dαgiven P;τ1
2,...,τ
nare i.i.d Pand given P,τ1
2,...,τ
n;X1,X
2,...,X
nare
independent with XiK(·
i).
The latent variables τ1
2,...,τ
nalthough unobservable, provide insight into the
structure of the posterior and are useful in describing and simulating the posterior.
A simple kernel would be to take τ=(i, h):h>0
K(x, (i, h)) = I(ih,(i+1)h]
h(x)
With this kernel one gets random histograms.
Another very useful kernel is the normal kernel. Here τ=(θ, σ)andK(x, θ, σ)=
(1)φ((xθ))whereφis the standard normal density. In this case the prior picks
a random density that is a mixture of normal densities. The weak closure of such
mixtures is all of M(R).
The prior is a probability measure on the space of densities {K(·,P):PM(R)}
and so is the posterior given X1,X
2,...,X
n. For the normal kernel Pis in general
not identifiable. It is known from [156] that if P1and P2are discrete measures with
finite support, then K(·,P
1)=K(·,P
2)iP1=P2.ItiseasytoseethatifP1=
N(0,1) ×δ(00)and P2=δ(0,1+σ2
0), then K(·,P
1)=K(·,P
2)=N(0,1+σ2
0).
Thus in general, Pis not identifiable. Identifiability of Pwhen restricted to discrete
measures is still unresolved [63].
If we denote by Π(·|X1,X
2,...,X
n) the posterior distribution of Pgiven X1,...,X
n
and by H(·|X1,X
2,...,X
n) the posterior distribution of τ1,...,τ
ngiven X1,...,X
n
then
Π(·|X1,X
2,...,X
n)=Π(·|(τ1,X
1),...,(τn,X
n))H(|X1,X
2,...,X
n)
Since Pand X1,X
2,...,X
nare conditionally independent given τ1
2,...,τ
n,
Π(·|(τ1,X
1),...,(τn,X
n)) = Π(·|(τ1
2,...,τ
n)) = Dα+δτi
and
Π(·|X1,X
2,...,X
n)=Dα+δτiH(|X1,X
2,...,X
n)
The evaluation of these quantities depend on H(·|X1,X
2,...,X
n). If αhas a den-
sity, the joint density ˜α(τ1
2,...,τ
n) is discussed in Chapter 3 (see equation 3.15).
5.3. MIXTURES OF KERNELS 145
Recall that if C1,C
2,...,C
N(P)is a partition of {1,2,...,n}then the density (with
respect to the Lebesgue measure on Rk)at
τ=(τ1
2,...,τ
n):τi=τi,i,i
Cj,j =1,2,...N(P)
is N(P)
1
α(τj)(ej1)!
n
1(M+i)(5.1)
where ej=#Cjand hence the joint density of the xsandτsat
τ=(τ1
2,...,τ
n):τi=τi,i,i
Cj,j =1,2,...N(P)
is N(P)
1
α(τj)(ej1)! lCjK(xl
j)
n
1(M+i)
Consequently, the posterior density of τis
N(P)
1α(τj)(ej1)! lCjK(xl
j)
PN(P)
1α(τj)(ej1)! lCjK(xl
j)d(τj)
Thus
1
nK(x, τi)H(|X1,X
2,...,X
n) (5.2)
=1
n
PN(P)
1(ej1)! K(x, τj)lCiK(xl
j)α(τj)j
PN(P)
1(ej1)! lCiK(xl
j)α(τj)j
(5.3)
Since the Bayes estimate ˆ
fof fis, by 5.2, this reduces to
M
M+nf0(x)+ n
M+nK(x, τi)H(|X1,X
2,...,X
n)
Hence, we have that the Bayes estimate of fis
M
M+nK(x, τ α()
+n
M+n
P
W(P)ei
nK(x, τ )lCiK(xl)α(τ)
lCiK(xl)α(τ)(5.4)
146 5. DENSITY ESTIMATION
where P={C1,C
2,...,C
N(P)}is a partition of {1,2,...,n},eiis the number of
elements in Ci,and
W(P)= Φ(P)
Φ(P),Φ(P)=
N(P)
1{(ei1)!
lCi
K(xl)α(τ)}
The Bayes estimate is thus composed of a part attributable to the prior and a
part attributable to the observations. Since for the Dirichlet, M0 corresponds to
removing the influence of the prior, it is tempting to consider the estimate
1
nK(x, τi)H(|X1,X
2,...,X
n)
as a partially Bayesian estimate with the influence of the prior removed. Unfortu-
nately, this interpretation is quite misleading. As M0 the Bayes estimate (5.4)
goes to K(x, τ1α(τ1)n
1K(xi
1)1
˜α(τ1)n
1K(xi
1)1
(5.5)
corresponding to a partition in which all τiare equal to τ1. All other terms have a
power of Mand tend to 0. The term (5.5) corresponds to assuming that all the Xis
came from a single parametrized population with density K(x, τ ) and so is highly
parametrized.
The apparent paradox is resolved by the fact that role of the hyperparameters
depends on the context. Here Mdecides the likelihood of different clusters and in fact
relatively large values of Mhelp bring the Bayes estimate close to a data-dependent
kernel density estimate. For a penetrating discussion of the role of M, see discussion
by Escobar [66] and West et al. [170].
Clearly to calculate quantities like K(x, τ)α( ) it would be convenient if αis
conjugate to K(., .). Thus if Kis the normal kernel a convenient choice for ¯αis a
prior conjugate to N(τ, σ). Hence an appropriate choice for ¯αis the inverse normal-
gamma prior, i.e., the precision ρ=12has a gamma distribution and given ρ,τis
N(µ, 1). Ferguson [63] has interesting guidelines for choosing the parameters of ¯α
and M.
The expression for the Bayes estimate, even though it has an explicit expression, in-
volves enormous computation. The posterior for Dirichlet mixtures of normal densities
is amenable to MCMC methods. Gibbs methods are based on successive simulations
from one-dimensional conditional distributions of τigiven τj,j =i, X1,X
2,...,X
n.
5.4. HIERARCHICAL MIXTURES 147
For a good exposition see Schervish [144] and Chen et al. [32]. The MCMC methods
were developed in the present context by Escobar, Mueller and West ([59], [169],[170]).
A good survey of the issues underlying MCMC issues is given by Escobar and West
in [60].
To implement MCMC one essentially works with the conditional distributions of
τigiven τj,j =i, X1,X
2,...,X
n, which may be written explicitly from the posterior
distribution of the τs given earlier or directly [32]. In practice, αhas a location and
scale parameter (µ, σ), which leads to some complications. In the joint distribution
of τs one replaces ˜αby αµ,σ and multiplies by the prior Π(µ, σ). Starting from this,
one can calculate all the relevant posterior distributions needed in MCMC. See also
Neal [135].
Since no explicit expressions are available for the Bayes estimate of f(x), it would
be worth exploring whether approximations like Newton [137] can be developed.
The next issue would be to do the asymptotics. In Section 5.4 we do this for a
slightly modified version of the mixture model. While formal asymptotics is yet to be
done for the priors discussed in this section, we expect that the results and techniques
of the next section will go through with minor modifications.
5.4 Hierarchical Mixtures
This method is a slight variation of the method discussed in the last section.
Let K(x) be a density on R.Foreachh>0 consider the kernel Kh(x, θ)=
(1/h)K((xθ)/h). For any PM(R), let
Kh,P =Kh(x, θ)dP (θ)
Note that Kh,P is just the convolution KhP.IfPDα,thenwegetaprioron
Fh={Kh,P :PM(R)}
We now view has the smoothing “window” and think of has a hyperparameter
and put a prior µfor h. The calculations are very similar to those of the last section
except that we need to incorporate the hyperparameter h.
As before, the observations can be thought of as arising from: hµ,given
h;PDα;givenh, P ;θ1
2,...,θ
nare i.i.d. Pand given h, P ,and θ1
2,...,θ
n;
X1,X
2,...,X
nare independent with XiKh(·
i).
148 5. DENSITY ESTIMATION
The posterior distribution of Pgiven X1,X
2,...,X
nis
Π(·|X1,X
2,...,X
n)
=Π(·|(h, θ1
2,...,θ
n,X
1,...,X
n))H(d(h, θ)|X1,X
2,...,X
n) (5.6)
Because Pand X1,X
2,...,X
nare conditionally independent given h, θ1
2,...,θ
n,
Π(·|(h, θ1
2,...,θ
n,X
1,...,X
n)) = Dα+δθi
and
Π(·|X1,X
2,...,X
n)=Dα+δθiH(d(h, θ)|X1,X
2,...,X
n)
As before, if µand αhare densities with respect to Lebesgue measure then the
posterior density of (h, θ1
2,...,θ
n)isgivenby
µ(hα(θ1
2,...,θ
n)n
1Kh(Xiθi)
µ(hα(θ1
2,...,θ
n)n
1Kh(Xiθi)dhdθ
where ˜αis given by 3.15.
An expression analogous to (5.4) for the Bayes estimate can be written. In the
next two sections we look at consistency problems in the case when Kgivesriseto
histograms and when Kis the standard normal density.
Ishwaran [103] has used a general polya urn scheme to model θis and used these to
construct measures analogous to a prior and established consistency of the posterior.
These are then applied to a variety of interesting problems.
5.5 Random Histograms
In this section we consider priors that choose at random first a bin of width hand
then a histogram with bins (ih, (i+1)h:h∈N)whereN={0±1±2...}. Formally,
in the hierarchical model we take Θ = Nand the kernel K(x)=I(0,1](x).
Thus the model consists of, hµ;givenh;choosePon integers with PDαh
and X1,X
2,...,X
nare, given h, P , i.i.d. fh,P where
fh,P (x)=
i=−∞
P{i}
hI(ih,(i+1)h](x)
5.5. RANDOM HISTOGRAMS 149
One could introduce intermediate latent variables θ1
2,...,θ
nwhich are given h, P ;
i.i.d. P. However, they are not of much use here because Xicompletely determines
θi,namely,θi=jiff Xi(jh,(j+1)h].
For ea ch h,letnjh be the number of Xis in the bin (jh,(j+1)h]andJh={j:
njh >0}.
A bit of reflection shows that the posterior distribution of Pgiven h, X1,X
2,...,X
n
is Dαh+njhδj,whereδjis the point mass at j.
If µis a density on (0,) then the joint density of hand X1,X
2,...,X
nis
µ(h)
1[αh(i)][nhi1]hn
M[n]
h
where Mh=αh(N) for any positive real xand positive integer k,x[k]=x(x+
1) ...(x+k1). Hence the posterior density Π(h|X1,X
2,...,X
n)is
µ(h)
1[αh(i)][nhi1]hn
0µ(h)
1[αh(i)][nhi1]hndh (5.7)
Thus the posterior is of the same form as the prior, with µupdated to (5.7) and
αhupdated to αh+nhj δj.
Since each Dαhleads to the expected density
f¯αh(x)=¯αh(j)
hI(jh,(j+1)h](x)
the prior expectation is given by
f0(x)=f¯αh(x)µ(h)dh
Using the conjugacy of the prior, an expression for the Bayes estimate given the
sample can be written.
A choice of µwhich is positive in a neighborhood of 0 will allow for wide variability
in the choice of histograms and will ensure that the prior has all densities as its
support. If the prior belief leads to the density f0then an appropriate choice of ¯αh
would be
¯αh(j)=(j+1)h
jh
f0(x)dx
Of course, this choice would lead to a prior expected density, which may not be
equal to f0, but it can be viewed as an approximation to f0.
150 5. DENSITY ESTIMATION
5.5.1 Weak Consistency
Gasperini introduced these priors in his thesis and under some assumptions on αh
showed that if the true f0is not constant on any interval then under the posterior
distribution given X1,X
2,...,X
n,hgoesto0,asn→∞. Thus the posterior stays
away from densities that are far from f0. Under additional assumptions on f0, he also
showed that the Bayes estimate of fconverges in L1to f0. In the spirit of Chapter
4 we investigate the consistency properties of the posterior. We confine ourselves to
the case when the random histograms all have support on (0,], that is, the case
when Pis a probability on N+={0,1,2,...}. This restriction is not required but
simplifies the proof of Lemma 5.5.2. Some of the following calculations are taken from
Gasperini’s thesis, but the main ideas of the proof and the main results are different.
The consistency results in this chapter typically describe a large class of densities
where consistency obtains. We saw in Chapter 4 that when we have a prior Π on
densities, the Schwartz condition Π(Kf0()) >0 for all >0 (recall Kf0()isthe
Kullback-Leibler neighborhood of f0) ensures weak consistency at f0. Thus it seems
appropriate, in the context of histogram priors, that we should attempt to describe f0s
which would satisfy Schwartz’s condition. This would entail relating the tail behavior
of f0to the tail behavior of αhs. This is to be expected but leads to somewhat
cumbrous and restrictive conditions. It turns out that histogram priors are amenable
to direct calculations that lead to consistency results.
To be more specific, recall that Schwartz’s condition (Lemma 4.4.1) was used to
show that for all β>0,
eFf(xi)
f0(xi)dΠ(f)→∞a.s. P
f0
Under some assumptions we will establish this result directly. The following propo-
sition indicates the steps involved.
Proposition 5.5.1. Let Fbe a family of densities. For each hH,Πhisaprior
on F;µis a prior on H, i.e., hµ; given h;fΠhand given h, f ;X1,X
2,...,X
n
are i.i.d. f. If for a density f0,
for every β>0
µh:eFf(xi)
f0(xi)dΠh(f)→∞a.s. P
f0>0 (5.8)
then the posterior is weakly consistent at f0.
5.5. RANDOM HISTOGRAMS 151
Proof. Let Ube a weak neighborhood of f0and let Π be the prior on the space of
densities induced by µ, Πh. Since we have exponentially consistent tests for testing f0
against Uc, it follows from Lemma 4.4.2 that for some β0
e0Uc
n
1
f(xi)
f0(xi)dΠ(f)0 a.s. P
f0
To establish consistency it is enough to show that
lim inf
n→∞ e0F
n
1
f(xi)
f0(xi)dΠ(f) = lim inf
n→∞ e0F
n
1
f(xi)
f0(xi)dΠh(f)(h)
→∞ a.s. P
f0
Consider
(h, x):xR,hH:e0F
n
1
f(xi)
f0(xi)dΠh(f)→∞
By assumption for hin a set of positive µmeasure, the h– section of Ehas measure
1 under P
f0. By Fubini there is a FR,P
f0(F) = 1 and for xF,thexsection
of Ehas positive µmeasure and for each xFby Fatou
lim inf
n→∞ He0F
n
1
f(xi)
f0(xi)dΠh(f)(h)=
Assumptions on the Prior (Gasperini)
(i) µis a prior for hwith support (0,).
(ii) For each h,αhis a probability measure on N+, and for all h,αh(1) >0.
(iii) For each h, there is a constant Kh>0 such that
αh(j)
αh(j+1) <K
hfor j=0,1,2...
152 5. DENSITY ESTIMATION
Theorem 5.5.1. Suppose that the prior satisfies the assumptions just listed. If f0
is a density such that
(a) x2f0(x)dx < and
(b) limh0f0log(f0,h/f0)=0,
then the posterior is weakly consistent at f0.
Proof. Let Inh =Fhn
1(f(xi)/f0(xi))Dαh(df )
To apply the last proposition it is enough to show that for any β>0 there exists
h0such that for each hin (0,h
0),
exp[n(β+log Inh
n)] →∞a.s. P
f0(5.9)
and this follows if for any >0,there exists h0such that for h(0,h
0),
lim
n
log Inh
n>a.s. P
f0
Then by taking =β/2, (5.9) would be achieved.
log Inh
n=1
nlog Fh
n
1
f(xi)
f0h(xi)Dαh(df )+ 1
n
n
1
log f0h(xi)
f0(xi)
where f0h(x)=(1/h)ih (i+1)hf0(y)dy for x(ih, (i+1)h].
By assumption b and SLLN for some h0, whenever h<h
0,
lim
n
1
n
n
1
log f0h(xi)
f0(xi)>
2a.s. P
f0
Note that whenever f∈F
h,fis a constant on (ih, (i+1)h]:i0. Consequently
for f∈F
h,n
1
f(xi)
f0h(xi)=
iJh
(f
h(i))nih
(f
0h(i))nih
where nih =#{xi(ih, (i+1)h]},Jh={i:nih >0}, and for any density f,f
h
denotes the probability on Ngiven by f
h(j)=(j+1)h
jh f(x)dx. Also let fhdenote the
histogram fh(x)=f(i)/h for x(ih, (i+1)h].
5.5. RANDOM HISTOGRAMS 153
Since Dαhis Dirichlet and αh(N)=1,
1
nFh
iJh
(f
h(i))nih
hnDαh(df )= 1
n
1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i))
Therefore
1
nlog Fh
n
1
f(xi)
f0h(xi)Dαh(df )= 1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i)) log
iJh
f
0h(i)
hn
It is shown in Lemma 5.5.2 that
1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i))
iJh
nih log nih
n0 a.s.P
f0(5.10)
Using (5.10) we have
lim
n→∞
1
nFh
iJh
(f
h(i))nih
hnDαh(df )
= lim
n→∞ 1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i)) log h
iJh
log f
0h(i) + log h
= lim
n→∞
iJh
nih
nlog nih
nlog h1
nlog
iJh
(f
0h(i))nih
hn
=
iJh
nih
nlog nih
nlog h
iJh
nih
nlog f
0h(i) + log h
0 a.s. P
f0(5.11)
Lemma 5.5.1. Under the assumptions of the theorem,
max
iJh
i
n0a.s P
f0
Consequently
#Jh
n
max
iJh
i
n0a.s P
f0
154 5. DENSITY ESTIMATION
Proof.
max
iJh
i≤{max(X1,X
2,...,X
n)
h}+1
Now max(X1,X
2,...,X
n)/n0. This follows from: If Y1,Y
2,...,Y
nare i.i.d.
(X2
i=Yiin our case) then max(Y1,Y
2,...,Y
n)/n 0iEY1<. Recall as-
sumption (a) of Theorem 5.5.1.
Lemma 5.5.2. Under the assumptions of the theorem
1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i))
iJh
nih log nih
n0a.s. P
f0(5.12)
Proof. Let ln(h) stand for the first term on the left-hand side. Then
ln(h)= 1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i))
ln(h)=
1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i))
1
n
iJh
log Γ(αh(i)+nih)
1
n
iJh
log Γ(αh(i)) log h1
nlog Γ(n+1)
We first show that
1
n
iJh
log Γ(αh(i)) 0 a.s. P
f0
Since Γ(x)1/x for 0 x1, for h<,
01
n
iJh
log Γ(αh(i)) 1
n
n
1
log 1
αh(i)
5.5. RANDOM HISTOGRAMS 155
By using a telescoping argument, the right-hand side of the expression becomes
1
n
N
i=2
k
j=2 log 1
αh(i)log 1
αh(i1)+N
nlog 1αh(1)
=1
n
N
2
(Nj+1)logαh(j1)
αh(j)+N
nlog 1αh(1)
(N+1)(N+2)
2nKh+N
nlog 1
αh(1) 0 a.s. P
f0(5.13)
By Stirling’s approximation for all x1,
log Γ(x)=(x1
2)logxx+log2π+R(x)0<R(x)<1
and we can write
1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i))
1
n
iJh
log Γ(αh(i)+nih)1
n
iJh
log Γ(αh(i))
=1
n
iJh{(αh(i)+nih 1
2)log(αh(i)+nih )}
1
n
iJh αh(i)nih log 2π+R(αh(i)+nih)!log h
1
n{(n+1
2)log(n+1)(n+ 1) + log 2π+R(n)}(5.14)
Since iJhnih =nand
1
n
iJh αh(i) + log 2π+R(αh(i)) + nih log h!
(maxiJhi)(2+log2π)
n0 a.s. P
f0(5.15)
156 5. DENSITY ESTIMATION
we get
lim
n→∞ |ln(h)
iJh
nih
nh log nih
nh |
1
n
iJh{(αh(i)+nih 1
2)log(αh(i)+nih )}
iJh
nih
nh log nih +logn+logh(5.16)
By adding and subtracting 1/n iJhnih 1/2lognih we have
|1
n
iJh{(αh(i)+nih 1
2)log(αh(i)+nih )
iJh
nih
nh log nih|
≤|1
n
iJh
αh(i)log(αh(i)+nih )|
+1
n
iJh
(nih 1
2)log(1 + αh(i)
nih |+1
n
iJh
1
2log nih (5.17)
Using log(1 + x)x
log(n+1)
n+1
n+log n
2n#Jh
The last term in this expression goes to 0 by Lemma 5.5.2.
The condition α(j1)(j)<Kessentially requires that the prior does not vanish
too rapidly in the tails. If our prior expectation f0is unimodal then it is easy to see
that the condition holds with K=m+h
mhf0(x)ds,wheremisthemodeoff0.
5.5.2 L1-Consistency
We next turn to L1-consistency. We will use Theorem 4.4.4. Recall that Theorem 4.4.4
required two sets of conditions—one being the Schwartz condition and the other was
construction of a sieve Fnwith metric entropy andsuchtha(Fc
n) is exponen-
tially small. A look at the proof of Theorem 4.4.4 shows that the Schwartz condition
can be replaced by
for all β>0,lim inf
n→∞ en
1
f(Xi)
f0(Xi)Π(df )=a.s P
f0
5.5. RANDOM HISTOGRAMS 157
Since we have already discussed this aspect in the last section, here we shall con-
centrate on the construction of a sieve.
To look ahead our sieve will be Fn=h>hnFan,h where Fan,h is the set of histograms
with support [an,a
n]. We will compute the metric entropy of Fnand show that for
a suitable choice of hn,a
nit is of the order . What is then left is to ensure that the
prior gives exponentially small mass to Fc
n
Proposition 5.5.2.
Let Pδ
k={(P1,P
2,...,P
k):Pi0,
k
1
Pi1δ}
Then
J(Pδ
k,2δ)(k
δ+1
2)log(1+δ)+klog(1 + δ)1
2log K+1
Proof. Let Kbe the largest integer less than or equal to k/δ and consider
P={PPδ
k:Pi=jδ
kfor some integer j}
We will show that given any PPδ
kthere is PPwith PP<2δ. The
logarithm of the cardinality of Pthen gives an upper bound for J(Pδ
k,2δ).
Let PPδ
k. Then since
|Pi
PjPi|=Pi
Pj
(1 Pj)Pi
Pj
δ,
we have (Pi/Pj)Pi.
Given PPδ
kwith Pi=1,letPbe such that
P
i=jδ
kfor some integer jand PiP
i<δ
k
Then P=(P
1,P
2,...,P
k)Pand also PP.Thuswehaveshownthat
Pis a 2δnet in Pδ
k.
To compute the number of elements in P, consider kpoints a1,a
2,...,a
k,each
endowed with a weight of δ/k.Ifweplace(k1) sticks among these points, then these
divide a1,a
2,...,a
kinto kparts, those to the left of the first stick, those between
the first and second, and so on, the last part being all those a
isto the right of the
last stick. Adding the weight of each of these parts gives a (P
1,P
2,...,P
k)Pand
158 5. DENSITY ESTIMATION
any element of Pcorresponds to a kpartition of a1,a
2,...,a
k. The number of ways
of partitioning kelements into kparts (some may be empty) is k+k1
k1.
Recall Stirling’s approximation
x!=2πxx+1
2ex+θ
12x0<θ<1
so that
k+k1
k1=(k+k1)!
(k1)!k!
(k+k)!
k!k!
2π(k+k)!(k+k)!+ 1
2e(k+k)!+ θ
12(k+k)!
2πkk+1
2ek+θ
12k2π(k)k+1
2ek+θ
12k
and therefore
log k+k1
k1log (k+k)k+1
2
(k)k+1
2
+log(k+k)k
kk+1
2
+
where
=log 1
2π+θ
12(k+k)θ
kθ
k<1
so that,
J(Pδ
k,2δ)(k+1
2)log(1+ k
k)+klog(1 + k
k)
1
2log k+1
substituting kk/δ we get the proposition.
Lemma 5.5.3. Suppose
PPδ
k={(P1,P
2,...,P
k):1Pi0,
k
1
Pi1δ}
δ<1,h
0<h<1and hh0=<δh
0/2(K+1).Iffhis the histogram fh(x)=
(Pi/h)I(ih,(i+1)h](x)and fh0is the histogram fh0(x)=(Pi)h)I(ih0,(i+1)h0](x), then
fhfh0<3δ.
5.5. RANDOM HISTOGRAMS 159
Proof. Let
I1=(0,h],I
2=(h, 2h],...I
k=((k1)h, kh]
and
J1=(0,h
0],J
2=(h0,2h0],...J
k=((k1)h0,kh
0]
Because k < h,fori<k,
Ii=(IiJi)(IiJi+1
Further,
IiJi+1 =((i+1)h0,(i+1)h)
Since fh=fh0on IiJi,wehave
Ii|fhfh0|dx =|Pi
hP(i+1)
h|(i+1)(hh0)
and because Pi1andh<h
0,
kh
0|fhfh0|dx =
k1
1|PiP(i+1)|(i+1)(hh0)
h+Pk
(i+1)(hh0)
h
k
1
Pi
(i+1)(hh0)
h
2(k+1)
h0δ
(5.18)
A bit of notational clarification: For every h,an/h will not be an integer and hence
when we write Fan,h what we mean is the set of all histograms from 0 to [an/h]where
[an/h]is the largest integer less than or equal to an/h. In our calculations, to avoid
notational mess, we pretend that an/h is an integer.
Lemma 5.5.4. For a>0, let Fa,h be all histograms from [0,a]with bin width h.
Then
h>h0Fa,h =2h0>h>h0Fa,h
160 5. DENSITY ESTIMATION
Proof. For any h>h
0, for some integer m,(h/m)(h0,2h0).The conclusion follows
because any histogram with bin width hcan also be viewed as a histogram with bin
width h/m.
We put all the previous steps together in the next proposition Let Fδ
a,h be all
histograms fhin Fa,h such that Pfh[0,a]>1δ.
Proposition 5.5.3.
Jh>hFδ
a,h,5δlog(2a
h+1)+( a
log(1 + δ)+ a
nlog(1 + 1
δ)+1
Proof. By Lemma 5.5.4
h>hFa,h=2h>h>h Fa,h
Set k=2a/h and =δh2/(2a+1)
Let N=[h]+1 where for any a,[a] is the largest integer less than or equal to a,and
hi=h+i, i =1,2,···,N. Then by Proposition 5.5.2, given any f∈∪
2h>h>hFa,h,
thereissomehisuch that ffhi<3δ. Use of Proposition 5.5.1 at each of Fa,hi,
and a bit of algebra gives the result.
Theorem 5.5.2. Let µbe a probability measure on (0,)such that 0 is in the
support of µ.αis a probability measure on R. Our setup is hµ, the prior on Fhis
Dαhwhere αh(i)=α(ih, (i+1)h].Letan→∞,h
n0such that (an/nhn)0.
If
(i) for some β0
1,C
1,C
2>0,
α(an,a
n]>1C1e0
(ii) µ(0,h
n)<C
2e1
then the posterior is strongly consistent at any f0satisfying (5.8).
Proof. If an
nhn0, it follows from Proposition 5.5.3 that J(Fn)<nβfor large
enough n. An easy application of Markov inequality with condition (i), and using (ii)
gives Π(Fc
n)<Ce
for some Cand γ. Theorem 4.4.4 gives the conclusion.
Thus if an=naand hn=nbthen what we need is a+b<1. For example if
αis normal then one can take an=n1/2. The condition would then be satisfied if
hn=nbwith b<1/2.
5.6. MIXTURES OF NORMAL KERNEL 161
5.6 Mixtures of Normal Kernel
Another case of special interest is when Kis the normal These priors were introduced
by Lo [131], (see also Ghorai and Rubin[72] and West [168] who obtained expressions
for the resulting posterior and predictive distributions. These can be further general-
ized by eliciting the base measure α=0of the Dirichlet up to some parameters
and then considering hierarchical priors for these hyperparameters.
5.6.1 Dirichlet Mixtures: Weak Consistency
Returning to the mixture model, let φand φhdenote, respectively the standard normal
density and the normal density with mean 0 and standard deviation h.LetΘ=R
and Mbe the set of probability measures on Θ. If Pis in M, then fh,P will stand
for the density
fh,P (x)=φh(xθ)dP (θ)
Note that fh,P is just the convolution φhP.
To get a feeling for the developments, we first look at the case where h=h0is
fixed and our model is PΠandgivenP,X1,X
2,...,X
nare i.i.d. fp.Inthiscase,
the induced prior is supported by Fh0={fh0,P :P∈M}, and the following facts are
easy to establish from Scheffe’s theorem:
(i) The map P→ fh0,P is one-to-one, onto Fh0. Further PnP0weakly if and
only if fh0,Pnfh0,P →0.
(ii) Fh0is a closed subset of F.
Fact (ii) shows that Fh0is the support of Π, and hence consistency is to be sought
only for densities of the form fh0,P . Theorem 5.6.1 implies consistency for such densi-
ties. Fact (i) shows that if the interest is in the posterior distribution of P, then weak
consistency at P0is equivalent to strong consistency of the posterior of the density
at fh0,P .
In order to establish weak consistency of the posterior distribution of fwe need to
verify the Schwartz condition. Following is a proposition that though not useful when
ΠisDαis useful in other contexts.
Proposition 5.6.1.
K(fP,f
Q)K(P, Q)
162 5. DENSITY ESTIMATION
Proof. A bit of change of variables and order of integration would show that
K(fP,f
Q)=K(Pxφ(x)dx, Qxφ(x)dx)
where Pxis the measure Pshifted by x. Using the convexity of the K-L divergence
and observing K(Px,Q
x)=K(P, Q) for all x,wehave
K(fP,fQ)=K(Pxφ(x)dx, Qxφ(x)dx)K(Px,Q
x)φ(x)dx =K(P, Q)
Thus if we have a prior Π such that every Pis in K-L support then the posterior
is weakly consistent at fP. In fact the earlier remark shows that we have weak con-
sistency at Pand hence strong consistency at fP. The Dirichlet does not have this
property. However, we will show in Chapter 6 that for a suitable choice of parameters
the Polya tree satisfies this property. Fixing hseverely restricts the class of densities
and is thus not of much interest.
We turn next to the model with a prior for h. Our model consists of a prior µfor h
and a prior Π on M. The prior µ×Π through the map (h, P )→ fh,P induces a prior
on F. We continue to denote this prior also by Π. Thus (h, P )µ×Πandgiven
(h, P ), X1,X
2,...,X
nare i.i.d. fh,P . This section describes a class of densities in the
K-L support of Π. By Schwartz’s theorem the posterior will be weakly consistent at
these densities. The results in this section are largely from [74]. The next two results
look at two simple cases and hold for general priors, but Theorem 5.6.3 makes use of
special properties of the Dirichlet.
Theorem 5.6.1. Let the true density f0be of the form f0(x)=fh0,P0(x)=
φh0(xθ)dP0(θ).IfP0is compactly supported and belongs to the support of Π,
and h0is in the support of µ, then Π(K(f0)) >0for all >0.
Proof. Suppose P0[k, k] = 1. Since P0is in the weak support of Π, it follows that
Π{P:P[k, k]>1/2}>0. It is easy to see that f0has moments of all orders.
For η>0, choose ksuch that |x|>kmax(1,|x|)f0(x)dx < η.Forh>0, we write
−∞ f0log (fh,P0/fh,P )asthesum
k
−∞
f0log fh,P0
fh,P
+k
k
f0log fh,P0
fh,P
+
k
f0log fh,P0
fh,P
(5.19)
5.6. MIXTURES OF NORMAL KERNEL 163
Now
k
−∞
f0(x)logfh,P0(x)
fh,P (x)dx
k
−∞
f0(x)logk
kφh(xθ)dP0(θ)
k
kφh(xθ)dP (θ)dx
k
−∞
f0(x)logφh(x+k)
φh(xk)P[k, k]dx
=k
−∞
f0(x)2k|x|
h2dx log(P[k, k]) k
−∞
f0(x)dx
<2k
h2+log2
η
provided P[k, k]>1/2. Similarly, we get a bound for the third term in (5.19).
Clearly,
c:= inf
|x|≤kinf
|θ|≤kφh(xθ)>0
The family of functions {φh(xθ):x[k,k
]}, viewed as a set of functions of θ
in [k, k], is uniformly equicontinuous. By the Arzela-Ascoli theorem, given δ>0,
there exist finitely many points x1,x
2,...,x
msuch that for any x[k,k
],there
exists an iwith
sup
θ[k,k]|φh(xθ)φh(xiθ)|<cδ (5.20)
Let
E=P:φh(xiθ)dP0(θ)φh(xiθ)dP (θ)
<cδ;i=1,2,...,m
Since Eis a weak neighborhood of P0(E)>0. Let PE.Thenforany
x[k,k
], choosing the appropriate xifrom (5.20), using a simple triangulation
argument we get
φh(xθ)dP (θ)
φh(xθ)dP0(θ)1
<3δ
and so φh(xθ)dP0(θ)
φh(xθ)dP (θ)1
<3δ
13δ
164 5. DENSITY ESTIMATION
(provided δ<1/3).
Thus for any fixed h>0, for Pin a set of positive Π-probability, we have
f0log (fh,P0/fh,P )<22k
h2+log2
η+3δ
13δ(5.21)
Now for any h,
f0log (f0/fh,P )=f0log (f0/fh,P0)+f0log (fh,P0/fh,P ) (5.22)
The first term on the right-hand side of (5.22) converges to 0 as hh0. To see this,
observe that φh0(xθ)dP0(θ)
φh(xθ)dP0(θ)sup
|θ|≤k
φh0(xθ)
φh(xθ)
The rest follows by an application of the dominated convergence theorem.
Given any >0, choose a neighborhood Nof h0(not containing 0) such that if
hN, the first term on the right-hand side of (5.22) is less than /2. Next choose η
and δso that for any hN, the right-hand side of (5.21) is less than /2. Because
h0is in the support of µ, the result follows.
Remark 5.6.1.In Theorem 5.6.1, the true density is a compact location mixture of
normals with a fixed scale. It is also possible to obtain consistency at true densities
which are (compact) location-scale mixtures of the normal, provided we use a mixture
prior for has well. More precisely, if we modify the prior so that (θ, h)P(a
probability on R×(0,)) and PΠ, then consistency holds at f0=φh(x
θ)P0(dθ, dh) provided P0has compact support and belongs to the support of Π. The
proof is similar to that of Theorem 3.
Theorem 5.6.1 covers the case when the true density is normal or a mixture of
normal over a compact set of locations. This theorem, however, does not cover the
case when the true density itself has compact support, like, say, the uniform. The
next theorem takes care of such densities.
Theorem 5.6.2. Let 0be in the support of µand f0be a density in the support of
Π.Letf0,h=φhf0.If
1. lim
h0f0log(f0/f0,h)=0,
2. f0has compact support,
5.6. MIXTURES OF NORMAL KERNEL 165
then Π(K(f0)) >0for all >0.
Proof. Note that, for each h,
f0log(f0/fh,P )=f0log(f0/f0,h)+f0log(f0,h /fh,P )
Choose h0such that for h<h
0,f0log(f0/f0,h)</2 so all that is required is to
show that for all h>0,
ΠP:f0log (f0,h/fh,P )</2>0
If f0has support in [k, k]. Then
f0log(f0,h/fh,P )k
k
f0(x)logk
kφh(xθ)f0(θ)
k
kφh(xθ)dP (θ)dx
The rest of the argument proceeds in the same lines as in Theorem 5.6.1.
While the last two theorems are valid for general priors on M, the next theorem
makes strong use of the properties of the Dirichlet process. For any Pin M,set
P(x)=P(x, )andP(x)=P(−∞,x).
Theorem 5.6.3. Let Dαbe a Dirichlet process on M.Letl1,l
2,u
1,u
2be functions
such that for some k>0for all Pin a set of Dα-probability 1, there exists x0
(depending on P) such that
P(x)l1(x),¯
P(x+klog x)u1(x)x>x
0
and
P(x)l2(x),P(xklog |x|)u2(x)x<x0
(5.23)
For any h>0, define
Lh(x)=φh(klog x)(l1(x)u1(x)),if x>0
φh(klog |x|)(l2(x)u2(x)),if x<0
and assume that Lh(x)is positive for sufficiently large |x|.Letf0be the “true” density
and f0,h =φhf0. Assume that 0is in the support of the prior on h.Iff0is in the
support of Dα(equivalently, supp(f0)supp(α)) and satisfies
166 5. DENSITY ESTIMATION
1. lim
h0f0log(f0/f0,h)=0;,
2. for all h,lim
a↑∞
−∞
f0(x)logf0,h(x)
a
aφh(xθ)f0(θ)dx =0; and
3. for all h,lim
M→∞ |x|>M
f0(x)logf0,h(x)
Lh(x)dx =0,
then Π(K(f0)) >0for all >0.
Remark 5.6.2.It follows from Doss and Sellke [55] that if α=0,whereα0is a
probability measure, then
l1(x)=exp[2log|log α0(x)|0(x)]
l2(x)=exp[2log|log α0(x)|0(x)]
u1(x) = exp 1
α0(x+klog x)|log α0(xklog x)|2
u2(x) = exp 1
α0(xklog |x|)|log α0(xklog |x|)|2
satisfy the requirements of (5.23). For example, when α0is double exponential, we
may choose any k>2 and the requirements of the theorem are satisfied if f0has
finite moment-generating function in an open interval containing [1,1].
Remark 5.6.3.The following argument provides a method for the verification of
Condition 1 of Theorems 5.6.1 and 5.6.2 for many densities. Suppose that f0is con-
tinuous a.e., f0log f0<, and further assume that, as for unimodal densities,
there exists an interval [a, b] such that inf{f(x):x[a, b]}=c>0andf0is increas-
ing in (−∞,a) and is decreasing in (b, ). Note that {x:f0(x)c}is an interval
containing [a, b]. Replacing the original [a, b] by this new interval, we may assume
that f0(x)coutside [a, b]. Choose h0such that N(0,h
0) gives probability 1/3to
(0,ba). Let h<h
0. Let Φ denote the cumulative distribution function of N(0,1).
If x[a, b] then
f0,h(θ)b
a
f0(θ)φh(xθ)c(Φ((bx)/h)+Φ((xa)/h)c/3
If x>bthen
f0,h(θ)x
a
f0(θ)φh(xθ)f0(x)1
2((ba)/h)1f0(x)/3
5.6. MIXTURES OF NORMAL KERNEL 167
Using a similar argument when x<a, we have that the function
g(x)=log (3f0(x)/c),if x[a, b]
log 3,otherwise
dominates log(f0/f0,h)forh<h
0and is Pf0-integrable. Since f0(x)/f0,h(x)1as
h0 whenever xis a continuity point of f0and f0log(f0/f0,h)0, an application
of (a version of) Fatou’s lemma shows that f0log(f0/f0,h)0ash0.
Proof. Let >0 be given and δ>0, to be chosen later. First find h0so that
f0log(f0/f0,h)</2 for all h<h
0. Fix h<h
0. Choose k1such that
−∞
f0(x)logf0,h(x)
k1
k1φh(xθ)f0(θ)dx < δ
Let p=P[k1,k
1] and let p0denote the corresponding value under P0.Wemay
assume that p0>0. Let Pdenote the conditional probability under Pgiven [k1,k
1],
i.e., P(A)=P(A[k1,k
1])/p (if p>0) and P
0denoting the corresponding objects
for P0.LetEbe the event {P:|p/p01|}. Because P0is in the support of Dα,
Dα(E)>0. Now choose x0>k
1such that
(i) |x|>x0
f0(x)log(f0,h(x)/Lh(x)) dx < δ
(ii) Dα(EF)>0, where
F=
P:
P(x)l1(x), P (x+klog x)u1(x)x>x
0
and
P(x)l2(x),P(xklog |x|)u2(x)x<x0
By Egoroff’s theorem, it is indeed possible to meet condition (ii).
Consider the event
G=P:sup
x0<x<x0
log k1
k1φh(xθ)dP
0(θ)
k1
k1φh(xθ)dP (θ)<2δ.
We shall argue that Dα(EFG)>0 and if P(EFG) then f0log(f0/fh,P )<
for a suitable choice of δ.
168 5. DENSITY ESTIMATION
The events EFand Gare independent under Dα, and hence, to prove the first
statement, it is enough to show that Dα(G)>0. By intersecting Gwith Eand
using the fact that {φh(xθ):x0xx0}is uniformly equicontinuous when
θ[k1,k
1], we can conclude that Dα(G)Dα(GE)>0 (see the proof of
Theorem 5.6.1).
Now,
f0log(f0/fh,P )
−∞
f0(x)log(f0(x)/f0,h (x))dx
+|x|≤x0
f0(x)logf0,h(x)
k1
k1φh(xθ)f0(θ)dx
+|x|≤x0
f0(x)logk1
k1φh(xθ)f0(θ)
k1
k1φh(xθ)dP (θ)dx
+|x|>x0
f0(x)logf0,h(x)
φh(xθ)dP (θ)dx
If PEFG,thenforx>x
0,
−∞
φh(xθ)dP (θ)x+klog x
x
φh(xθ)dP (θ)
φh(klog x)[P(x)P(x+klog x)]
and because PF, the expression is further greater than or equal to
φh(klog x)[l1(x)u1(x)] = Lh(x)
Using a similar argument for x<x0,weget
|x|>x0
f0(x)logf0,h(x)
fh,P (x)dx |x|>x0
f0(x)logf0,h(x)
Lh(x)dx < δ
Since PEG,foreachxin [x0,x
0],
log k1
k1φh(xθ)f0(θ)
k1
k1φh(xθ)dP (θ)=logp0
pk1
k1φh(xθ)dP
0(θ)
k1
k1φh(xθ)dP (θ)<3δ
All these imply that if δis sufficiently small, then PEFGimplies that
f0log(f0,h/fh,P )<.
5.6. MIXTURES OF NORMAL KERNEL 169
5.6.2 Dirichlet Mixtures: L1-Consistency
As before, we consider the prior which picks a random density φhP,wherehis
distributed according to µand Pis chosen independently of haccording to Dα. Since
we view has corresponding to window length, it is only the small values of hthat are
relevant, and hence we assume that the support of µis [0,M] for some finite M.
In this model the prior is concentrated on
F=0<h<M Fh
where Fh={φhP:PM}.
In order to apply Theorem 4.4.4, given U={f:ff0<},forsomeδ</4,
we need to construct sieves {Fn:n1}such that J(δ, Fn)and Fc
nhas
exponentially small prior probability. Because, as an→∞,Dα{P:P[an,a
n]>
1δ}→1, a natural candidate for Fnis
Fn=hn<h<M Fan
h
where hn0,anincreases, and Fan
h={φhP:P[an,a
n]>1δ}. What is then
needed is an estimate of J(δ, Fn). The next theorem provides such an estimate.
The next lemma shows that the restriction h<M simplifies things a bit.
Lemma 5.6.1. Let M>0and let FM
h,a,δ =h<h<M Fh,a,δ.Ifa>M/
δ, then
FM
h,a,δ ⊂F
h,2a,2δ.
Proof. By Chebyshev’s inequality, if h<M then the probability of (a, a] under
N(0,h
) is greater than 1 δ.Iff=φhP, then since φh=φhφh,whereh<M,
f=φhφhPand (φhP)(a, a]>12δ.
Theorem 5.6.4. Let FM
h,a,δ =h<h<M {fh,P :P[a, a]1δ}. Then
J(δ, FM
h,a,δ)Ka
h,
where Kis a constant that does depend on δand M, but not on aor h.
We prove Theorem 5.6.4 through a sequence of lemmas. Let Fh,a ={fh,P :P(a, a]=
1}. Without loss of generality, we shall assume that a1
Lemma 5.6.2. J(2δ, Fh,a)8
π
a
+1
1+log1+δ
δ.
170 5. DENSITY ESTIMATION
Proof. For any θ1
2,
φθ1,h φθ2,h
=1
2πh x>(θ1+θ2)/2
exp[(xθ2)2/(2h2)]dx
1
2πh x>(θ1+θ2)/2
exp[(xθ1)2)/(2h2)]dx
+1
2πh x<(θ1+θ2)/2
exp[(xθ1)2/(2h2)]dx
1
2πh x<(θ1+θ2)/2
exp[(xθ2)2/(2h2)]dx
=4 1
2π(θ2θ1)/(2h)
0
exp[x2/2]dx
2
π
(θ2θ1)
h
Given δ,letNbe the smallest integer greater than 8a/(πhδ). Divide (a, a]
into Nintervals. Let
Ei=a+2a(i1)
N,a+2ai
N:i=1,2,...,N
and let θibe the midpoint of Ei. Note that if θ, θEi, then |θθ|<2a/N,and
consequently φθ,h φθ,h.
Let PN={(P1,P
2,...,P
N):Pi0,N
i=1 Pi=1}be the N-dimensional prob-
ability simplex and let P
Nbe a δ-net in PN, i.e., given P∈P
N, there is P=
(P
1,P
2,...,P
N)∈P
Nsuch that N
i=1 |PiP
i|.
Let F={N
i=1 P
iφθi,h :P∈P
N}. We shall show that Fis a 2δnet in Fh,a.If
fh,P =φhP∈F
h,a,setPi=P(Ei) and let P∈P
Nbe such that N
i=1 |PiP
i|.
5.6. MIXTURES OF NORMAL KERNEL 171
Then
/
/
/
/
/φθ,hdP (θ)
N
i=1
P
iφθi,h/
/
/
/
/
/
/
/
/
/φθ,hdP (θ)
N
i=1 IEi(θ)φθi,hdP (θ)/
/
/
/
/
+/
/
/
/
/
N
i=1
Piφθi,h
N
i=1
P
iφθi,h/
/
/
/
/
N
i=1
IEi(θ)φθ,h φθi,hdP (θ)+
N
i=1 |PiP
i|
2δ
This shows that J(2δ, Fh,a)J(δ, PN), and we calculate J(δ, PN) along the lines
of Barron, Schervish and Wasserman as follows: Since |PiP
i|/N for all i
implies that N
i=1 |PiP
i|, an upper bound for the cardinality of the minimal
δ-net of PNis given by
number of cubes of length δ/N covering [0,1]N
×volume of (P1,P
2,...,P
N):Pi0,
N
i=1
Pi1+δ
=(N/δ)N(1 + δ)N1
N!
So,
J(δ, PN)Nlog NNlog δ+Nlog(1 + δ)log N!
Nlog NNlog δ+Nlog(1 + δ)Nlog N+N
=N1+log1+δ
δ
8
π
a
+1
1+log1+δ
δ
Lemma 5.6.3. Let Fh,a,δ ={fh,P :P(a, a]1δ}. Then J(3δ, Fh,a,δ)
J(δ, Fh,a).
172 5. DENSITY ESTIMATION
Proof. Let f=φhP∈F
h,a,δ. Consider the probability measure Pdefined by
P(A)=P(A(a, a])/P (a, a]. Then the density f=φhPclearly belongs to
Fh,a and further satisfies ff<2δ.
Proof. Putting Lemmas 5.6.2 , 5.6.3 and 5.6.1 together, we have Theorem 5.6.4.
The next theorem formulates the result in terms of strong consistency for Dirichlet-
normal mixtures.
Theorem 5.6.5. Suppose that the prior µhas support in [0,M]. If for each δ>0,
β>0, there exist sequences an,hn0and constants β0
1(all depending on δ,βand
M) such that
1. for some β0,Dα{P:P[an,a
n]<1δ}<e
0,
2. µ{h<h
n}≤e1, and
3. an/hn<nβ
then f0is in the K-L support of the prior implies that the posterior is strongly con-
sistent at f0.
Remark 5.6.4.What was involved in the preceding is a balance between anand hn.
Since δand Mare fixed, the constant Kobtained in Theorem 5.6.4 does not play
any role. If αhas compact support, say [a, a], then we may trivially choose an=a
and so hnmay be allowed to take values of the order of n1or larger. If αis chosen as
a normal distribution and h2is given a (right truncated) inverse gamma prior, then
the conditions of the theorem are satisfied if anis of the order nand hn=C/n
for a suitable (large) C(depending on δand β).
5.6.3 Extensions
The methods developed in this chapter toward the simple mixture models can be used
to study many of the variations used in practice. Some of these are discussed in this
section.
1. It is often sensible to let the prior depend on the sample size; see for instance
Roeder and Wasserman [141]. A case in point, in our context would be when
the precision parameter M=α(R) is allowed to depend on the sample size.
If Πnis the prior at stage n, then the results goes through if the assumption
Π(K(f0)) >0 is replaced by lim infn→∞ Πn(K(f0)) >0. This follows from the
5.6. MIXTURES OF NORMAL KERNEL 173
fact the Barron’s Theorem (see Chapter 4) goes through with a similar change.
The only stage that needs some care is an argument which involves Fubini, but
it can be handled easily.
2. Another way the Dirichlet mixtures can be extended is by including a further
mixing. Formally, Let X1,X
2,... be observations from a density fwhere f=
φhP,PDατ,hπ,τis a finite-dimensional mixing parameter, which is
also endowed with some prior ρ.Letf0be the true density. We are interested
in verifying the Schwartz condition at f0and conditions for strong consistency.
By Fubini’s theorem, Schwartz’s condition is satisfied for the mixture if
ρ{τ: Schwartz condition is satisfied with ατ}>0 (5.24)
(a) In particular, if f0has compact support, then (5.24) reduces to
ρ{τ: supp(f0)supp(ατ)}>0 (5.25)
(b) Suppose f0is not of compact support and τ=(µ, σ) gives a location-scale
mixture. So we have to seek the condition so that the Schwartz condition
holds with the base measure α((·−µ)). We report results only for α0=
α/α(R) double exponential or normal.
When α0is double exponential, a sufficient condition is that f0(µ+σx)has
finite moment-generating function on an open interval containing [1,1].
When αis normal, we need the integrability of xlog |x|exp[x2/2] with re-
spect to the density f0(µ+σx). For example, if the true density is N(µ0
0),
then the required condition will be σ<σ
0, so we need
ρ{(µ, σ):σ<σ
0}>0
We omit proof of these statements. Simulation shows inclusion of location,
and scale parameters in the base measure improves convergence of the the
Bayes estimates to f0.
(c) For strong consistency, we further assume that the support of the prior ρ
(for (µ, σ)) is compact. For each (µ, τ ), find the corresponding an(µ, σ)of
Theorem 5.6.5, i.e., satisfying
Dα(µ,τ){P:P[an(µ, τ),a
n(µ, τ )] <1δ}<e
0
for some β0>0. Now choose an=sup
µ,σ an(µ, σ). The order of anwill
then be the same as the individual an(µ, σ)s.
174 5. DENSITY ESTIMATION
(d) In some special cases, it is also possible to allow unbounded location mix-
tures. For example, when the base measure is normal, a normal prior for
the location parameter is both natural and convenient. Strong consistency
continues to hold in this case as long as σhas a compactly supported
prior. To see this, observe that ρ{|µ|>n}is exponentially small and
sup|µ|≤n,σ an(µ, σ) is again of the order of n.
(e) West et al. put a random prior Pon h, independent of Pand a Dirichlet
prior for P. This allows different amounts of smoothing near different
sets of Xis. Our methods should apply here also. Such techniques, i.e.,
dependence of hon Xisoronxin the range of Xis have been introduced in
the frequentist literature recently and are also known to improve estimates.
5.7 Gaussian Process Priors
Consider the probabilities p1,p
2,...p
kassociated with a multinomial with k- cells.
Often, for example, when the cells correspond to the bins of a histogram, it would
be evident that a priori that the probabilities of adjacent cells would be highly pos-
itively correlated and the correlation would drop off for cells are farther apart. The
Dirichlet prior for p1,p
2,...p
kresults in negative covariance whereas we want pos-
itive covariance. It is thus necessary to model other covariance structures. The dif-
ficulty is one of specifying covariances which would ensure that the prior sits on
Sk={(p1,p
2,...p
k),p
i0pi=1}. Leonard([126],[127]) suggested choosing real
variables Y1,Y
2,...Y
kand setting pi=exp(Yi)/exp(Yi). This ensures that pi0
and pi= 1. Further if the distribution of Y1,Y
2,...Y
kis tractable, say N(µ, Σ),
then Leonard shows that one can obtain tractable approximations to the posterior.
The situation is even more striking in the case of smooth random densities where
smoothness already implies that the value of the density at two points x, y would be
close if xand yare close. If we use the method of Section 5.5 calculations indicate
that one gets positive covariance (for fixed h) only for very small values of h.Inthe
spirit of Leonard one could choose a stochastic process {Y(x):xR}with smooth
sample paths and for any sample path define f=exp(y)/((exp y(t))dt). Leonard
[127] suggested using a Gaussian process {Y(x):xR}. In this section we present
these Gaussian process priors along the lines of Lenk [125]. Lenk considers a larger
class of priors which gives a unified appearance to the results. An alternative method
is to consider f=expYconditioned on exp Y(t)dt = 1. Thorburn[157] has taken
5.7. GAUSSIAN PROCESS PRIORS 175
this approach. While this method is not discussed here, it would be interesting to see
how this method relates to those developed by Leonard and Lenk.
Let µ:R→ Rand σ:R×R→ R+be a symmetric function. σis said to be
positive definite if for any x1,x
2,...,x
k,thek×kmatrix with σ(xi,x
j) as its entries
is positive definite.
Definition 5.7.1. Let µ:R→ Rand σbe a positive definite function on R×R.A
process {Y(x):xR}is said to be a Gaussian process with mean µand covariance
kernel σif for any x1,x
2,...,x
k,Y(x1),Y(x2),...,Y(xk)hasak-dimensional normal
distribution with mean µ(x1)(x2),...,µ(xk) and covariance matrix whose (i, j)th
entry is σ(xi,x
j).
The smoothness of the sample paths of a stochastic process is governed by moment
conditions. Extensive results of this kind can be found in [36]. Following are a few
that we use.
Theorem 5.7.1. Let {ξ(x):xR}be a stochastic process. Suppose that for
positive constants pr,
E|ξ(t+h)ξ(t)|pK|h|1+rfor all t, h
Let 0<a<r/p. Then there is a process {η(x):xR}equivalent to {ξ(x):xR}
(i.e. a process with the same finite-dimensional distributions as {ξ(x):xR}) such
that
|η(t+h)η(t)|≤A|h|awhenever |h|
As an example consider the standard Brownian motion. A Gaussian process with
µ=0andσ(x, y )=xy.Leth>0 then
E|ξ(t+h)ξ(t)|4=3{Var(ξ(t+h)ξ(t))}2=3h2
So we can take p=4,r = 1 to conclude that the sample paths are Lipschitz of order
at least a,where0<a<1/4.
More generally, since ξ(t+h)ξ(t)
his N(0,1),
E|ξ(t+h)ξ(t)|2k=Akhk
and we can choose p=2k, r =k1,0<a<(k1)/2k. Letting k→∞,weseethat
the sample functions are Lipshitz of order afor any 0 <a<1/2.
176 5. DENSITY ESTIMATION
Theorem 5.7.2. If for positive constants p<rand K,
E|ξ(t+h)ξ(t)|pK|h|
|log |h||1+r
and
E|ξ(t+h)+ξ(th)2ξ(t)|pK|h|1+p
|log |h||1+r
Then there is a process η(t)equivalent to ξ(t)such that η(t)exists and is continuous
almost surely.
To return to Lenk, we consider a Gaussian process Y(x) with mean µand covariance
kernel σ. Lenk appears to assume that
(i) µis continuous;
(ii) σis continuous on R×Rand positive definite; and
(iii) there exist positive constants c, β,  and nonnegative integer rsuch that
E|Y(x)Y(y)|β=C|xy|1+r+
Condition (iii) guarantees that if r1 then with probability 1, the sample paths
are rtimes continuously differentiable. A useful case is when σis of the form σ(x, y)=
ρ(|xy|) for some function ρon R. In this case, the process is stationary, and easier
sufficient conditions are available for the sample paths to be smooth.
Theorem 5.7.3. Let σ(x, y)=ρ(|xy|).If
1. for some a>3
ρ(h)=1O{|log |h||a}as h0
then there is an equivalent process with continuous sample paths
2. for some a>3and λ2>0,
ρ(h)=1λh2
2+O(h2
|log |h||a)as h0
then there is an equivalent process whose sample paths are continuously differ-
entiable
5.7. GAUSSIAN PROCESS PRIORS 177
Cram´er and Leadbetter [36] remark that a>3 may be replaced by a>1 but the
proof requires lot more work. Here are some examples used in Lenk [125].
(i) ρ(x)=e−|x|=1−|x|+O(x2)asx0;
(ii) ρ(x)=(1−|x|)I|x|≤1=1−|x|as x0;
(iii) ρ(x)=ex2=1x2+O(x4)asx0; and
(iv) ρ(x)= 1
1+x2=1x2+O(x4)asx0.
Cases (i) and (ii) satisfy condition (1) of the theorem and (iii) and (iv) satisfy
condition (2).
Let Ibe a bounded interval and let {Z(x):xR}be a Gaussian process with
mean µand covariance kernel σ. The log-normal process, denoted by LN(µ, σ), is the
process W(x)=exp(Z(x)). We will denote the associated measure on R+by Λ(µ, σ ).
Following is a proposition which will be used later.
Proposition 5.7.1. Fix x1,x
2,...,x
kin Iand constants a1,a
2,...,a
k.
Let µ(x)=µ(x)+
k
1
aiσ(x, xi)
Then
dΛ(µ)
dΛ(µ, σ)=k
1W(xi)ai
Ek
1W(xi)ai
=k
1W(xi)ai
e
x+aσx
2a
Here W(R+)Iand the expectation in the right-hand side is with respect to
Λ(µ, σ);µx=(µ(x1)(x2),...,µ(xk)) and [σx]i,j =σ(xi,x
j),a=a1,a
2,...,a
k.
We will prove the proposition through a series of simple lemmas.
Lemma 5.7.1. Let (Z1,Z
2,...,Z
k)be multivariate normal with mean vector µ=
(µ1
2,...,µ
k)and covariance Σ.Ifµ=(µ
1
2,...,µ
k)=µ+aΣ, where ais the
vector (a1,···,a
k)then
dN(µ,Σ)
dN(µ, Σ) (Z1,Z
2,...,Z
k)=KeiaiZi
where K=1/EeaiZi=1/e+aΣ
2a.
178 5. DENSITY ESTIMATION
Proof. For any µ1and µ2,
((xµ11(xµ1)(xµ21(xµ2))
=2(µ2µ11x+µ1Σ1µ
1µ2Σ1µ
2
Only the first term depends on x. Absorbing the other two terms in the constant
and taking µ1=µand µ2=µthe lemma follows.
Lemma 5.7.2. Let G(µ, σ )stand for the Gaussian measure with mean µand co-
variance σ.Ifµis as in Proposition 5.7.1, then
dG(µ)
dG(µ, σ)(Z)=Kek
1aiZ(xi)(5.26)
Proof. It is enough to show that the finite-dimensional distributions of the measure
defined by (5.26) are those arising from dG(µ). But that is precisely the conclusion
of the lemma 5.7.2.
Next we state a simple measure theoretic lemma whose proof is routine.
Lemma 5.7.3. Suppose P, Q are probability measures on (Ω,A)and Tis a 1-1
measurable function from (Ω,B).IfPQthen PT1QT 1and
dP T 1
dQT 1(ω)=dP
dQ(T1(ω))
Proof. To return to the proposition, it easily follows from Lemma 5.7.2 and by taking
T(Z)=eZin Lemma 5.7.3.
We next add another real parameter ξ, and following Lenk we define a generalized
log-normal process LN (µ, σ, ξ). When ξ= 0 the generalized log-normal process is
defined to be LN (µ, σ), i.e., LN (µ, σ, 0) = LN(µ, σ).
For any real ξ,LN (µ, σ, ξ ) is defined by
dLN(µ, σ, ξ)
dLN(µ, σ, 0)(W)=[IW(x)dx]ξ
C(ξ,µ)(5.27)
where C(ξ,µ)=EIW(x)dx]ξthe expectation being taken under LN(µ, σ, 0). Lenk
shows that this expectation exists for all real ξ.
We are now ready to define the random density.
5.7. GAUSSIAN PROCESS PRIORS 179
Definition 5.7.2. Let {W(x).x R}be a generalized log normal process LN(µ, σ, ξ)
on R+. The distribution of
f(x)= W(x)
IW(x)dx
is called a logistic normal process and denoted by LNS(µ, σ, ξ).
Clearly fis a random density. We next show that if fhas logistic normal distribu-
tion then so does the posterior given X1,X
2,...,X
n.
Theorem 5.7.4. If fLNS(µ, σ, ξ )then the posterior given X1,X
2,...,X
nis
LNS(µ)where µ(x)=µ(x)+n
1σ(x, Xi)and ξ=ξn.
Proof. If WLN(µ, σ, ξ) then by the Bayes theorem (for densities) the posterior Λ
of Wgiven X1,X
2,...,X
nis
dΛ
dΛ(µ, σ, 0)(W)=KI
[W(x)dx]ξn
1W(xi)
[I[W(x)dx]n](5.28)
=KI
[W(x)dx]ξn
n
1
W(xi) (5.29)
and comparison with (5.26) and (5.27) shows that this is LNS(µ). The theorem
follows because the distribution of fis just the posterior distribution of W/ IW(x)dx.
Even though the transformations µ→ µ→ σ, ξ → ξlook simple, any interpre-
tation needs to be tempered. First note that µ, σ, ξ do not identify the prior because
if µ1µ2Cthen both µ1ξ and µ2ξ will lead to the same prior for f. Second µ
and σdo not translate separately to E(f)andcov(f(x),f(y)). A change in either µor
σwill affect both E(f)andcov(f(x),f(y)). As n→∞both µ→∞and ξ→−
indicating that these cannot be used to do simple minded asymptotics.
Since the prior is on densities, the natural tool to study consistency is the Schwartz
theorem and Theorem 4.4.4. When the Gaussian process is a standard Brownian
motion, with some work it can be shown that if the true distribution f0satisfies
log f0is bounded then the Schwartz condition holds at f0.TowardL1-consistency a
natural sieve to consider would be to divide [a, b]intoO(n)intervalsandtolookat
the class of functions that have oscillation less than δin all the intervals. These are
just preliminary observations; more careful study needs to be done.
180 5. DENSITY ESTIMATION
It also appears, that in analogy with Dirichlet mixtures, one should introduce a
window hin the covariance and have ρh(x)=(1/h)ρ(x/h).
In any case a lot of further work is needed to develop this promising method.
It would also be good to have some theoretical or numerical evidence justifying the
numerical calculation of the posterior given in Lenk. For instance, one could compare
Lenk’s algorithms with approximations based on discretization.
6
Inference for Location Parameter
6.1 Introduction
We begin our discussion of semiparametric problems with inference about location
parameters. The related problem of regression is taken up in a later chapter.
Our starting point is an important counterexample of Diaconis and Freedman
[46, 45]. Since the Dirichlet process is a very flexible and popular prior for many
infinite-dimensional examples, it seems natural to use it for estimating a location
parameter. Diaconis and Freedman showed that it leads to posterior inconsistency.
Barron suggests that the pathology is more fundamental. We present some of their re-
sults in Section 2. Doss [50], [51] and [52], showed the existence of similar phenomena
when one wants to estimate a parameter θthat is a median.
A common explanation is that inconsistency is due to the Dirichlet sitting on
discrete distributions. It is indeed true that the semiparametric likelihood is difficult
to handle when a prior sits on discrete distributions. But Diaconis and Freedman [46]
argue in their rejoinder to such comments that they expect the same phenomenon
for Polya tree priors that sit on densities. We take up this problem in Sections 6.3
and 6.4 and show that under certain conditions symmetrized Polya tree priors have a
rich Kullback-Leibler support so that by Schwartz’s theorem, one can show posterior
consistency for the location parameter for a large class of true densities.
182 6. INFERENCE FOR LOCATION PARAMETER
One lesson that emerges from all this is that the tail free property, which is a
natural tool for consistency, is destroyed by the addition of a parameter. Hence the
Schwartz criterion is an appropriate tool for proving consistency. In particular, if one
wants posterior consistency for certain true P0s, then it is desirable to have a prior
whose Kullback-Leibler support contains them.
Another natural prior to consider is the Dirichlet mixture of normals, which has
emerged as the currently most popular prior for Bayesian density estimation. We will
explore its properties in the next chapter and return briefly to the location parameter
in Chapter 7.
Much of this chapter is based on Diaconis and Freedman [46] and Ghosal et.al. [78].
6.2 The Diaconis-Freedman Example
Suppose we have the model
Xi=Yi+θ, i =1,2,...,n
where given Pand θ,Yis are i.i.d. P. Finally Pand θare independent with Dirichlet
process prior Dαfor Pand a prior density µ(θ)forθ. The probability measure ¯αhas
a density g.
Suppose the true value of θis θ0and the true distribution of the YsisP0with den-
sity f0. The densities µ, g, f0are all with respect to Lebesgue measure on appropriate
spaces.
The main interest is in the location parameter θand the behavior of the posterior
for θunder P0. Since the random distributions Pare not symmetrized around 0, the
location parameter has an identifiability problem. For the time being, we ignore this.
We will rectify this later by symmetrizing P.
To calculate the posterior, note that the random distribution Pof Xs is a mixture
of Dirichlet, i.e., given θ,PDαθ,whereαθ(·)=α(Rα(·−θ). Because P0has a
density Xis may be assumed to be distinct. Hence by expression (3.17) the posterior
density Π(θ|X1,X
2,...,X
n) is proportional to
µ(θ)
n
1
g(Xiθ)
As Barron pointed out in his discussion of [46] the Dirichlet is a pathological prior
for a parameter in a semiparametric problem. The posterior is the same as if one
assumed that Xis are i.i.d. with the parametrized density g(Xiθ).
6.2. THE DIACONIS-FREEDMAN EXAMPLE 183
Diaconis and Freedman point out that consequences of choosing gcan be serious.
If gis a normal density, then one gets consistency, but not when gis Cauchy. An
intuitive interpretation of this is that a normal likelihood for θprovides a robust
model. For example, the MLE is ¯
X, which is consistent for E(X)=θeven without
normality. On the other hand, a Cauchy likelihood for θ, unlike a Cauchy prior,
does not provide robustness. In fact, Diaconis and Freedman provide the following
counterexample. They construct an f0, which has compact support, is symmetric
around 0, and infinitely differentiable. Under θ0and P0, nearly half the samples the
posterior concentrates around θ0+δand for nearly another half it concentrates around
θ0δ. The true model P0can be chosen to make δas large as we please. Because
we are now essentially dealing with a misspecified model g, when actually f0is true,
some insight into this phenomenon as well as the argument in [46] can be achieved
by studying the asymptotic behavior of the posterior under misspecified models; see
[17] and Bunke and Milhaud [28].
We now indicate why the same phenomenon holds even if we symmetrize Pto
Ps(A)=(1/2)(P(A)+P(A)).
Given Pwe first generate Z1,Z
2,...,Z
n, i.i.d. P. Then define Yi=|Zi|δi,whereδi
are i.i.d. and δi=±1 with probability 1/2. Then Y1,Y
2,...,Y
nare i.i.d. Ps.Given
Ysandθ;Xi=Yi+θas before. We will provide a heuristic computation of the
posterior distribution of θ.
Assume without loss generality that X1,X
2,...,X
nand (Xi+Xj)/2,1i<jn
are all distinct. The variables (θ, X), (θ, Z), and (θ, Y ) may be related in two ways.
If θ=(Xi+Xj)/2 for all pairs i, j then
Yi=|Zi|δi=Xiθ
are all distinct. Moreover, all the |Zi|s are also distinct. For, if |Zi|=|Zj|, then δiand
δjmust be of opposite sign and θmust be (Xi+Xj)/2, a case we have excluded for
the time being. Hence, given θ, |Z1|,|Z2|,...,|Zn|are ndistinct values in a sample of
size nfrom the distribution P|Z|=Ps,|Z|,wherePis Dαθ. Hence one can write down
the joint density of |Z1|,|Z2|,...,|Zn|by equation (3.17). Finally, δis are independent
given θand |Zi|. Since there is a 1-1 correspondence between Yiand (Zi
i), the
density of Yis given θis
C
n
1g|z|(|yi|)1
2=C
n
1
g(yi)=C
n
1
g(Xiθ) (6.1)
where C={α(R)[n]}1{α(R)}n.
184 6. INFERENCE FOR LOCATION PARAMETER
There is a second way in which the Yis can be related to Xis. Suppose θ=(Xi+
Xj)/2. Then |Zi|=|Zj|and δiand δjare of opposite sign. The remaining |Z|s—all
(n2) of them—are all distinct and different from the common value of |Zi|and |Zj|.
Hence, given θ=(Xi+Xj)/2, the density of Zs (with respect to (n1)-dimensional
Lebesgue measure) is
D
k=i,j
g|Z|(|Yk|)g|Z|(|Yi|)=Cn
1g|Z|(|Yk|)
g|Z|(|Yj|)
where D=C/α(R). Finally, given θ=(Xi+Xj)/2, the density of Y1,Y
2,...,Y
nis
Cn
1g|Z|(|Yk)
g|Z|(|Yj|)
1
2n=g(Xiθ)
2g(XiXj)(6.2)
because |Yi|=|Yj|=|XiXj|and g(|XiXj|)=g(XiXj).
The density (6.1) multiplied by µ(θ) leads to the absolutely continuous part of the
posterior for θ, while (6.2) leads to its discrete part. Formally, the discrete part is
Πd(θ|X1,X
2,...,X
n)=
i<j
µXi+Xj
2g(Xiθ)
2g(XiXj)
and the absolutely continuous part has the density
Πc(θ|X1,X
2,...,X
n)=µ(θ)C
n
1
g(yi)=(θ)
n
1
g(Xiθ)
Hence the posterior is
Π(θ|X1,X
2,...,X
n)=c(θ, X)+d(θ, X )
CN
where CNis the norming constant
CN=Cµc(θ, X)+
θ=(Xi+Xj)/2:i<j
µd(θ, X)
A detailed, rigorous proof appears in lemma 3.1 of and Freedman[45]. The posterior
is still pathological and leads to inconsistency.
Diaconis and Freedman give examples of P0where one of the two terms in the
posterior dominate. In case the first term dominates, the posterior for the symmetrized
Dirichlet is similar to the posterior for the Dirichlet, and the proof for consistency in
that case applies here.
6.3. CONSISTENCY OF THE POSTERIOR 185
6.3 Consistency of the Posterior
When Phas a symmetrized Dirichlet prior distribution and gis log concave, as for
normal, then Diaconis and Freedman [45] show that the posterior is consistent for all θ0
for essentially “all” true P0. On the other hand without such assumptions consistency
fails, as indicated in the previous section. One explanation is the pathological form of
the posterior. A somewhat deeper explanation is the fact that the Dirichlet and the
symmetrized Dirichlet live on discrete distributions.
Diaconis and Freedman reacted to this as follows. They argued that discreteness is
not the main issue. They construct a class of Polya tree priors, supported by densities
and remark “Now consider the location problem; we guess this prior is consistent
when expectation is the normal and and inconsistent when it is Cauchy. The real
mathematical issue, it seems to us, is to find computable Bayes procedures and figure
out when they are consistent.”
We believe that Diaconis and Freedman are correct in thinking that existence of
density for random Pis not enough. What one needs is a stronger notion of support
and a prior that has a support rich enough to contain one’s favorite P0s. the weak
support is not good enough except for tail free priors. Since addition of a parameter
destroys the tail free property, neither tail free priors nor the assumption that P0is
in the weak support of the prior helps in ensuring consistency. Schwartz’s theorem
shows that a sufficient condition for consistency is that P0is in the Kulback-Leibler
support of the prior. Schwartz’s theorem is stated next in the form in which we need
it.
Our parameter space is Θ ×F
swhere Θ is the real line and Fsis the set of
all symmetric densities on R.O×F
s, we consider a prior µ×Pand given
(θ, f ), X1,X
2,...,X
nare independent identically distributed as Pθ,f,wherePθ,f is
the probability measure corresponding to the density f(xθ). We denote by fθ
the density f(xθ). Given X1,X
2,...,X
n, we consider the posterior distribution
(µ×P)(···|X1,X
2,...,X
n)o×Fsgiven by the density fθ(Xi)/fθ(Xi)d(µ×
P)(θ, f ). The posterior (µ×P)(···|X1,X
2,...,X
n)issaidtobeconsistentat(θ0,f
0)
if, as n→∞,(µ×P)(···|X1,X
2,...,X
n) converges weakly to the degenerate measure
δθ0,f0almost surely Pθ0,f0. Clearly, if the posterior is consistent at (θ0,f
0), the marginal
distribution of (µ×P)(···|X1,X
2,...,X
n) on Θ converges to δθ0almost surely Pθ0,f0.
Theorem 6.3.1. If for all δ>0,
(µ×P){(θ, f ):K(fθ0,f
θ)}>0,(6.3)
then the posterior (µ×P)(···|X1,X
2,...,X
n)is consistent at (θ0,f
0).
186 6. INFERENCE FOR LOCATION PARAMETER
A naive way to ensure (6.3) is to require that θ0and f0belong respectively, to the
Euclidean and Kullback-Leibler supports of µand P. The flaw in this argument is
that the Kullback-Leibler divergence is not a metric. So even if θis close to θ0and
K(f0,f) is small, we cannot draw any conclusion about K(f0θ0,f
θ)orK(f,fθ). A
way out is indicated below.
Definition 6.3.1. The map (θ, f )→ fθis said to be KL-continuous at (0,f
0)if
K(f0,f
0)=
−∞
f0(x)log(f0(x)/f0(xθ))dx 0asθ0.
We would then call (0,f
0)aKL-continuity point.
Let f
0be the density defined by f
0(x)=(f0 (x)+f0(x)) /2, the symmetriza-
tion of f0where f0stands for f0(.θ). For later convenience we write Pinstead
of Pfor a prior on Fs.
Assumption A: Support of µis Rand for all θsufficiently small, f
0is in the
K-LsupportofP.
It is easy to check that this condition holds for many common densities, e.g., for
normal or Cauchy. However, it fails for densities like uniform on an interval. For such
cases a different method is discussed later.
Theorem 6.3.2. If µand Psatisfy Assumption A and if (0,f
0)is a KL-continuity
point, then the posterior (µ×P)(···|X1,X
2,...,X
n)is consistent at (0,f
0).
Proof. We first prove it when θ= 0. By Theorem 6.3.1, it is enough to verify that
µ×Psatisfies the Schwartz condition (6.3). For any θ,
K(f0,f
θ)=
−∞
f0log(f0/fθ) (6.4)
=
−∞
f0log f0
−∞
f0log f
Since
−∞
f0log f
0=
−∞
f
0log f
0(6.5)
and
−∞
f0log f=
−∞
f
0log f, (6.6)
6.3. CONSISTENCY OF THE POSTERIOR 187
we have, by the concavity of logx
K(f0,f
θ)=
−∞
f0log(f0/f
0)+
−∞
f
0log(f
0/f )
1
2
−∞
f0log f0
f0+1
2
−∞
f0log f0
f0,θ+K(f
0,f)
=1
2K(f0,f
0,2θ)+K(f
0,f)
(6.7)
By the KL-continuity assumption there is an εsuch that for |θ|, the first term
is less than δ/2. For any θ, by Assumption A, {f:K(f
0,f)/2}has positive P
measure. Thus we have, for each θ[ε, ε],{f:K(f
0,f)/2}is contained in
{f:K(f0,f
θ)}. Since µ[ε, ε]>0 this completes the proof for θ=0.
For a gen e ral θ0,K(f00,f
θ0+θ)=K(f0,f
θ) which by the previous argument is less
than δwith positive probability, if fis chosen as before and θis in [θ0, θ0+].
Assumption A of Theorem 6.3.2 can be verified if Parises as follows. Let Pbe a
symmetrization of Pobtained by one of the following two methods.
Method 1. Let Pbe a prior on F. The map f→ (f(x)+f(x))/2fromFto Fs
induces a measure on Fs.
Method 2. Let Pbe a prior on F(R+)—the space of densities on R+. The map
f→ f, where, f(x)=f(x)=f(x)/2, gives rise to a measure on Fs.
Lemma 6.3.1. Let Pbe a prior on For on F(R+)with a given symmetric f0
in its K-L support. Let Pbe the prior obtained on Fsby Method 1or Method 2.If
f0∈Fs, then
P{f∈Fs:K(f0,f)}>0 (6.8)
Proof. For Method 1, the result follows from Jensen’s inequality; the conclusion is
immediate for method 2 because, setting g0(x)=2f0(x)andg(x)=2f(x)forxin
R+,bothg0,g belong to F(R+)andK(f0,f)=K(g0,g).
The K-L continuity assumptions fails if f0has support in a finite interval. However,
our next result in this section shows that consistency continues to hold even when
f0has support in a finite interval, provided f0is continuous. The proof consists in
approximating f0by an f1satisfying conditions of Theorem 6.3.2. We first need a
lemma to bound a K-L number. It is a slight improvement over a lemma in [78].
Lemma 6.3.2. Let f0and f1be densities so that f0Cf1. Then for any f,
K(f0,f)Clog C+[K(f1,f)+K(f1,f)]
188 6. INFERENCE FOR LOCATION PARAMETER
Proof. First note that C1. Also
K(f0,f)f0[log(f0/f1)]+Cf1[log(Cf1/f)]+
Clog C+Cf1[log(f1/f)]+
(6.9)
But f1[log(f1/f)]+K(f1,f)+f1[log(f1/f)](6.10)
f1[log(f1/f)]=f1[log(f/f1)]+f1f
f11+
=ff1
2K(f1,f)
(6.11)
The last inequality follows from Proposition 1.2.2. Combining (6.9), (6.10) and (6.11),
one gets the lemma.
Theorem 6.3.3. If µand Psatisfy Assumption A, f0is continuous and has
support in a finite interval [a, a], and log α(x)is integrable with respect to N(µ, σ2)
for all (µ, σ), then the posterior P(···|X1,X
2,...,X
n)is consistent at (θ, f0)for all
θ.
Proof. We consider two cases.
Case 1. inf
[a,a]f0(x)=α>0.
Let
f1(x)=
(1 η)f0(x),for a<x<a
(η/2)φa,σ2,for x≤−a
(η/2)φa,σ2,for xa
(6.12)
where φa,σ2and φa,σ2are, respectively, the densities of N(a, σ2)andN(a, σ2)and
σ2is chosen to ensure that f1is continuous at a.
We first show that f1is KL-continuous, i.e.,
lim
θ0
−∞
f1log(f1/f1)=
−∞
lim
θ0f1log(f1/f1) = 0 (6.13)
It is enough to establish that for some ε>0, the family {log(f1/f1):|θ|}is
uniformly integrable with respect to f1. This follows because for any M,
sup
|θ|
sup
|x|<M|log(f1(x)/f1(x))|<C
M
6.4. POLYA TREE PRIORS 189
and when Mis large, for |x|>M,f1(x)=(η/2)(σ2π)1exp[(xaθ)2/(2σ2)]
for all |θ|, implying
sup
|θ||x|>M
f1(x)log(f1(x)/f1 (x))dx 0asM→∞
It now follows from Lemma 6.3.2 that, by setting C=(1η)1and choosing ηclose
to1sothat(C+1)logC<δ/2, we can choose a δsuch that K(f1,f)
implies
K(f0,f); consequently {(θ, f ):K(f1,f
θ)
}⊂{(θ, f ):K(f0,f
θ)}.
Theorem 6.3.2 shows that the set on the left hand side has positive µ×Pmeasure.
Case 2. inf
[a,a]f0(x)=0.
By the continuity of f0,wecan,givenanyη>0, choose a Csuch that a
a(f0C)=
1+η,whereab=max(a, b). Set f1=(1+η)1(f0C). Then f0(1 + η)f1and
using Lemma 6.3.2, we can choose ηand δsmallsuchthat{f:K(f1,f)
}⊂
{f:K(f0,f)}. Since f1is covered by Case 1, the theorem follows.
In the remaining section we concentrate on constructing Polya tree priors which
satisfy conditions of Theorem 6.3.2 for many f0s.
6.4 Polya Tree Priors
The main result in this section is Theorem 6.4.1. It implies that Assumption A is true
if Pis a symmetrization of the Polya tree prior in this theorem and K(f00)<
for all θ0.
We already discussed the basic properties of Polya trees in Chapter 3. They are
recalled below. Let E={0,1}and Embe the m-fold Cartesian product E×···×E
where E0=.Further,setE=
m=0Em.Letπ0={R}and for each m=1,2,...,
let πm={Bε:εEm}be a partition of Rso that sets of πm+1 are obtained from a
binary split of the sets of πmand
m=0πmis a generator for the Borel σ-field on R.
Let Π = {πm:m=0,1,...}.
A random probability measure Pon Ris said to possess a Polya tree distribution
with parameters (Π,A); we write PPT(Π,A), if there exist a collection of non-
negative numbers A={αε:εE}and a collection Y={Yε:εE}of random
variables such that the following hold:
(i) the collection Yconsists of mutually independent random variables;
(ii) for each εE,Yεhas a beta distribution with parameters αε0and αε1;
190 6. INFERENCE FOR LOCATION PARAMETER
(iii) the random probability measure Pis related to Ythrough the relations
P(Bε1···εm)=
m
j=1;εj=0
Yε1···εj1
m
j=1;εj=1
(1 Yε1···εj1)
m=1,2,...,
where the factors are Y0or 1 Y0if j=1.
We restrict ourselves to partitions Π = {πm:m=0,1,...}that are determined by
a strictly positive continuous density αon Rin the following manner: The sets in πm
are intervals of the form {x:(k1)/2m<x
−∞ α(t)dt k/2m},k=1,2,...,2m.We
term the measure (corresponding to) αas the base measure because its role is similar
to the base measure of Dirichlet process.
Our next theorem refines theorem 2 of Lavine [119] by providing an explicit condi-
tion on the parameters.
Theorem 6.4.1. Let f0be a density and Pdenote the prior PT(Π,A), where
αε=rmfor all εEmand
m=1 r1/2
m<. Further assume that K(f0)<.
Then for every δ>0,
P{P:K(f0,f)}>0 (6.14)
Proof. By Theorem 3.3.7, the weaker condition
m=0 r1
m<implies the existence
of a density of the random probability measure. Considering the transformation x→
x
−∞ α(t)dt, assume that fand f0are densities on [0,1].Moreover,Πisthenthe
canonical binary partition. By the martingale convergence theorem, there exists a
collection of numbers {yε:εE}from [0,1] such that, with probability one
f0(x) = lim
m→∞
m
j=1;εj=0
2yε1···εj1
m
j=1;εj=1
2(1 yε1···εj1)
. (6.15)
where the limit is taken through a sequence ε1ε2··· which corresponds to the dyadic
expansion of x. It similarly follows that
f(x) = lim
m→∞
m
j=1;εj=0
2Yε1···εj1
m
j=1;εj=1
2(1 Yε1···εj1)
(6.16)
for almost every realization of f.NowforanyN1,
K(f0,f)=MN+R1NR2N(6.17)
6.4. POLYA TREE PRIORS 191
where
MN=E
log
N
j=1;εj=0 yε1···εj1
Yε1···εj1N
j=1;εj=1 1yε1···εj1
1Yε1···εj1
(6.18)
R1N=E[log(
j=N+1;εj=0
2yε1···εj1
j=N+1;εj=1
2(1 yε1···εj1))] (6.19)
and
R2N=E[log(
j=N+1;εj=0
2Yε1···εj1
j=N+1;εj=1
2(1 Yε1···εj1))] (6.20)
with Estanding for the expectation with respect to the distribution of (ε1
2,...)for
a fixed realization of the Ys. The εs come from the binary expansion of x,andxis
distributed according to the density f0.
By the definition of a Polya tree, MNand R2Nare independent for all N1. To
prove (6.14), we show that for any δ>0,thereissomeN1 such that
P{MN}>0 (6.21)
|R1N|(6.22)
and
P{|R2N|}>0 (6.23)
The set {(Yε:εEm,m =0,...,N 1) : MN}is a nonempty open
set in R2N1; it is open by the continuity of the relevant map and it is nonempty
as (yε:εEm,m =0,...,N 1) belongs to this set. Thus (6.21) follows by
the nonsingularity of the beta distribution. Relation (6.22) follows from lemma 2 of
Barron [6]. To complete the proof, it remains to show (6.23) for some N1. We
actually prove the stronger fact
lim
N→∞ P{|R2N|≥δ}= 0 (6.24)
Let Estand for the expectation with respect to the prior distribution.i.e., the distri-
bution of the YsandE, as before, the expectation with respect to the distribution of
192 6. INFERENCE FOR LOCATION PARAMETER
(ε1
2,...). Now
P{|R2N|≥δ}
δ1E|R2N|
δ1EE[
j=N+1;εj=0 |log(2Yε1···εj1)|+
j=N+1;εj=1 |log(2(1 Yε1···εj1))|]
=δ1E[
j=N+1;εj=0
E|log(2Yε1···εj1)|+
j=N+1;εj=1
E|log(2(1 Yε1···εj1))|](6.25)
δ1E[
j=N+1
max{E|log(2Yε1···εj1)|,E|log(2(1 Yε1···εj1))|]
δ1
j=N+1
max
(ε1···εj1)Ej1max{E|log(2Yε1···εj1)|,E|log(2(1 Yε1···εj1))|]
=δ1
j=N+1
η(rj1)
where η(k)=E|log(2Uk)|with UkBeta(k, k). By Lemma 6.4.1, η(k)=O(k1/2)
as k→∞. Since
m=1 r1/2
m<by assumption, the right-hand side of (6.25) is
the tail of a convergent series. This completes the proof of (6.24) and hence of the
theorem as well.
Remark 6.4.1.Essentially the same proof shows that the Kullback-Leibler neighbor-
hoods would continue to have positive measure when the prior is modified as follows:
Divide Rinto k+ 1 intervals I1,...,I
k+1 and assume that (P(I1),...,P(Ik)) have
a joint density which is positive everywhere on the k-dimensional set {(a1,...,a
k):
ai>0,j =1,...,k,k
j=1 ai<1}.ForeachIj, the conditional distribution given
P(Ij) has a Polya tree prior satisfying the assumptions of the theorem. These priors
are special cases of the priors constructed by Diaconis and Freedman. Moreover, it
follows from theorem 1 of Lavine [119] that such priors can approximate any prior
belief up to any desired degree of accuracy in a strong sense.
Remark 6.4.2.It is not necessary that for each m,αε1···εmbe the same for all
(ε1,...,ε
m)Em. The proof goes through even when only αε1···εm10=αε1···εm11
for all (ε1,...,ε
m1)Em1,m1, and rm:= min{αε1···εm:(ε1,...,ε
m)Em}
satisfies the condition
m=1 r1/2
m<.
6.4. POLYA TREE PRIORS 193
Lemma 6.4.1. If Ukbeta(k, k), then E|log(2Uk)|=O(k1/2)as k→∞.
Proof. The proof uses Laplace’s method with a rigorous control of the error term. Let
ηk=E|log(2Uk)|, i.e.,
ηk=1
B(k, k)1
0|log(2u)|uk1(1 u)k1du (6.26)
=1
B(k, k)1
0|log(2(1 u))|uk1(1 u)k1du (6.27)
Adding (6.26) and (6.27) and observing that log(2u) and log(2(1 u))arealwaysof
the opposite sign,
2ηk=1
B(k, k)1
0|log(u/(1 u))|uk1(1 u)k1du (6.28)
This implies by Jensen’s inequality that
4η2
k1
B(k, k)1
0
(log(u/(1 u)))2uk1(1 u)k1du
=1
B(k, k)1
0{1+(log(u/(1 u)))2}uk1(1 u)k1du 1
(6.29)
We approximate the integral by Laplace’s method. Let
{1+(log(u/(1 u)))2}uk1(1 u)k1=exp(gk(u)) (6.30)
where
gk(u)=(k1) log u+(k1) log(1 u)+h(u)
and
h(u)=log{1+(log(u/(1 u)))2}
Clearly, gk(1/2) = 2(k1) log 2, g
k(1/2) = 0 and g
k(u) is decreasing in uso that
gk(u) has a unique maximum at 1/2. Fix δ>0 and let λ=sup{h(u):|u1/2|}.
Then on u(1/2δ, 1/2+δ), we have
gk(u)≤−2(k1) log 2 (u1
2)2
2(8(k1) λ) (6.31)
194 6. INFERENCE FOR LOCATION PARAMETER
Thus
4η2
k
1
B(k, k)1/2+δ
1/2δ
exp[2(k1) log 2 4(k1) 1λ
8(k1)(u1
2)2]du
+1
B(k, k)|u1
2|{1+(log(u/(1 u)))2}uk1(1 u)k1du 1 (6.32)
Γ(2k)
(Γ(k))222(k1)
−∞
exp[4(k1) 1λ
8(k1)(u1
2)2]du
+1
B(k, k)|u1
2|{1+(log(u/(1 u)))2}uk1(1 u)k1du 1
Since the function u(1 u){1+(log(u/(1 u))2}is bounded on (0,1) by, say, M,the
second term on the right-hand side of (6.32) is dominated by
M
B(k, k)|u1/2|
uk2(1 u)k2du
=M(2k1)(2k2)
(k1)2P{|Uk11
2|}
M(2k1)(2k2)
(k1)2E|Uk11
2|22
=O(k1)
(6.33)
The first term on the right-hand side of (6.32) is
Γ(2k)
(Γ(k))222k+2(2π)1/2(8(k1) λ)1/2(6.34)
which, by an application of Stirling’s inequalities [[171] p. 253], is less than
(2k)2k1/2e2k(2π)1/2exp[(24k)1]
(kk1/2ek(2π)1/2)222k+2(2π)1/2
×23/2(k1)1/21λ
8(k1)1/2
=k
k11/2
exp[(24k)1]1λ
8(k1)1/2
=1+O(k1)
(6.35)
Thus η2
k=O(k1), completing the proof.
6.4. POLYA TREE PRIORS 195
Remark 6.4.3.While we have discussed consistency issues, it would be interesting
to explore how the robustness calculations in Section 4 of Lavine [119] can be made
in the context of a location parameter.
We have argued that the Schwartz theorem is the best available tool for handling
consistency issues in semiparametric problems. We have also exhibited a Polya tree
priors which have a rich K-L support. However, there are caveats. The consistency
theorem notwithstanding, computation of the posterior for θfor a density f0of the
kind used by Diaconis-Freedman shows that convergence for Cauchy base measure is
very slow. Even for n= 500, one notices the tendency to converge to a wrong value,
as in the case of the Dirichlet prior with Cauchy base measure. Rapid convergence
does take place if we replace the Cauchy by the normal.
A second fact is that the condition r1/2
m<implies that the tail of the
random Pis close in some sense to the tail of the prior expected density. This in
turn implies that the posterior for fconverges to δf0rather slowly, which might imply
relatively slow convergence also of the posterior for θ. Both these questions can be
better understood if one can get rates of convergence of the posterior and see how
they depend on the base measure and the rms. These are delicate issues.
What happens if r1/2
m=? We have conjectured earlier that then, the Schwartz
condition would not hold. If so, it seems likely that in all such cases consistency would
depend dramatically on the base measure.
7
Regression Problems
7.1 Introduction
An important semiparametric problem is to make inference about the constants in
the regression equation when the error in the regression model
Yi=α+βxi+i,i=1,2,... (7.1)
has an unknown, symmetric distribution. This is similar to the location parameter
problem, so it is natural to try a symmetrized Polya tree prior for the error distribu-
tion. Another prior that suggests itself is a symmetrized version of Dirichlet mixtures
of normals of Chapter 5. We explore both priors in this chapter with a focus on pos-
terior consistency. The covariate may arise as fixed nonrandom constants or as i.i.d.
observations of a random variable.
Because this is a semiparametric problem, it is natural to try to use Schwartz’s the-
orem. However since the observations are not identically distributed, major changes
are needed. We begin with a variant of Schwartz’s theorem in Section 7.2. In two of
the subsequent sections we discuss how the conditions of the theorem can be verified.
Lack of i.i.d. structure for the Yis necessitates assumptions on the xis to ensure that
the exponentially consistent tests required by Schwartz’s theorem exist in the cur-
rent context. Also certain conditions have to be imposed on f0to verify conditions
relating to K-L support and variance in the Schwartz theorem. Among other things
198 7. REGRESSION PROBLEMS
it is shown that Polya tree priors of the sort considered in the Chapter 6 fulfill the
required conditions on the prior.
We then turn to the Dirichlet mixtures of normal. It turns out that the random
densities are sufficiently well behaved that the proof for results similar to that outlined
in the previous paragraph can be simplified to some extent.
It may be observed that as in the Chapter 6 it may be tempting to use a Dirichlet
prior on F. It is easy to show that the posterior for α, β would be pathological in
exactly the same way, namely, it would be identical with the posterior arising from
assigning a parametric prior on F. The proof is quite similar.
In the literature, the regression problem has been handled by putting a Dirichlet
mixture of normals but without symmetrization. This means that there is an identi-
fiability problem for the constant but not for the regression coefficient β. Of course,
the posterior for αcannot be consistent, but one can show posterior consistency for β.
In many examples, one would want consistency for both αand β, so symmetrization
seems desirable. See , Burr et al.[29] for an interesting application.
The final section discusses binary response regression with nonparametric link func-
tions. This chapter is based heavily on [134] and unpublished work of Messan.
7.2 Schwartz Theorem
Fix f0,α0,β0.Let
fα,β,i =fα+βxi(y)=f(y(α+βxi)) (7.2)
and put f0i=f000,i.
For any two densities fand g,let
K(f,g)=flog f
g,V(f, g)=flog f
g2
(7.3)
and put
Ki(f,α,β)=K(f0i,f
α,β,i),V
i(f,α,β)=V(f0i,f
α,β,i) (7.4)
As mentioned in the introduction, the main tool we use is a variant of Schwartz’s
theorem. The following theorem is an adaptation to the case when the Yis are inde-
pendent but not identically distributed. Here the xis are nonrandom.
Definition 7.2.1. Let W⊂F×R×R. A sequence of test functions Φn(Y1,...,Y
n)
is said to be exponentially consistent for testing
H0:(f,α,β)=(f0
0
0) against H1:(f,α,β)∈W (7.5)
7.2. SCHWARTZ THEOREM 199
if there exist constants C1,C2,C>0 such that
(a) En
1
f0i
ΦnC1enC ,and
(b) inf
(f,α,β)∈W En
1
fα,β,i
n)1C2enC .
Theorem 7.2.1. Suppose ˜
Πis a prior on Fand µis a prior for (α, β).LetW⊂
R×R.If
(i) there is an exponentially consistent sequence of tests for
H0:(f,α,β)=(f0
0
0)against H1:(f,α,β)⊂W
(ii) for all δ>0,
Π(f,α,β):Ki(f,α,β) for all i,
i=1
Vi(f,α,β)
i2<>0
then with
i=1 Pf0iprobability 1, the posterior probability
Π(W|Y1,...,Y
n)= Wn
i=1
fα,β i(Yi)
f0i(Yi)dΠ(f,α,β)
R×Rn
i=1
fα,β i(Yi)
f0i(Yi)dΠ(f,α,β)0 (7.6)
Note that Vi(f,α,β) bounded above in iis sufficient to ensure the summability of
i=1 Vi(f,α,β)/i2.
Proof. The proof is similar to the proof of Schwartz’s theorem. If we write (7.6) as
Π(W|Y1,...,Y
n)=I1n(Y1,...,Y
n)
I2n(Y1,...,Y
n)(7.7)
it can be shown, as in the proof of Schwartz’s theorem (Chapter 4), that condition
(i) implies that “ there exists a d>0 such that endI1n(Y1,...,Y
n)0a.s.”
The denominator can be handled similarly, using Kolomogorov’s strong law of large
numbers for independent but not identically distributed random variables. Yet, with
200 7. REGRESSION PROBLEMS
a later application in mind, we give an argument here with a somewhat weaker as-
sumption than (ii). For any two densities fand g,let
V+(f,g)=flog+
f
g2
(7.8)
and put
V+i(f,α,β)=V+(f0i,f
α,β,i) (7.9)
We will show that “ for all d>0, endI2n(Y1, ..., Yn)→∞a.s.” under the assumption,
(ii)For al l δ>0,
Π(f,α,β):Ki(f,α,β) for all i,
i=1
V+i(f,α,β)
i2<>0
Because V+(f,g)V(f,g) it is easy to see that (ii) implies (ii).
Let Vbe the set
(f,α,β):Ki(f,α,β) for all i,
i=1
V+i(f,α,β)
i2<
and Wi=log
+(f0i/fα,β,i)(Yi). Applying Kolmogorov’s strong law of large numbers
for independent non-identical variables to the sequence WiE(Wi), it follows that
for each f∈V, a.s.
i=1 Pf0i,
lim inf
n→∞ 1
n
n
i=1
log fα,β,i(Yi)
f0i(Yi)
≥−lim sup
n→∞ 1
n
n
i=1
log+
f0i(Yi)
fα,β,i(Yi)
=lim sup
n→∞
1
n
n
i=1
K+
i(f,α,β) (7.10)
≥−lim sup
n→∞ 1
n
n
i=1
Ki(f,α,β)+ 1
n
n
i=1 Ki(f,α,β)/2
≥−lim sup
n→∞
1
n
n
i=1
Ki(f,α,β)+4
5
5
6
1
n
n
i=1
Ki(f,α,β)/2
7.3. EXPONENTIALLY CONSISTENT TESTS 201
Since for f∈V,n1n
i=1 Ki(f,α,β),wehaveforeachf∈V,
lim inf
n→∞
1
n
n
i=1
log fα,β,i(Yi)
f0i(Yi)≥−(δ+δ/2) (7.11)
Choosing Cso that δ+δ/2C/8 and noting that
I2nV
n
i=1
fα,β,i(Yi)
f0i(Yi)dΠ(f,α,β)
it follows from Fatou’s lemma that
enC/4I2n→∞ (7.12)
a.s.
i=1 Pf0i.
Remark 7.2.1.Condition (ii) of the theorem can be weakened. It can be seen from
the proof that if the prior assigns positive probability to the following set
1
n
n
i=1
Ki(f,α,β)for all n,
i=1
Vi(f,α,β)+K2
i(f,α,β)
i2<
then also the posterior is consistent.
7.3 Exponentially Consistent Tests
Our goal is to establish consistency for (f,α,β)orfor(α, β)at(f0
0
0), and thus
the sets Wof interest to us are of the type W=Uc,whereUis a neighborhood of β0
or α0alone or of (f0
0
0). In the first case we write Wof this type as a finite union
of Wis and show that condition (i) of Theorem 7.2.1 holds for each of these Wis.
We begin with a couple of lemmas.
Lemma 7.3.1. For i=1,2, let g0iand gibe densities on R. If for each ithere
exists a function Φi,0Φi1such that
Eg0ii)=αiγi=Egii) (7.13)
and if
lim inf
n→∞
1
n
n
i=1
(γiαi)>0 (7.14)
then there exists a constant C, sets BnRn,n=1,2,..., and n0— all depending
only on (γi
i), such that for n>n
0
202 7. REGRESSION PROBLEMS
[n
i=1 Pg0i](Bn)<e
nC , and
[n
i=1 Pgi](Bn)>1enC .
We refer to [134] for a proof. For a density gand θR,letgθstand for the density
gθ(y)=g(yθ).
Lemma 7.3.2. Let g0be a continuous symmetric density on R, with g0(0) >0.
Let ηbe such that inf|y|g0(y)=C>0.
(i) For any >0,there exists a set Bsuch that
Pg0(B)1
2C(∆ η)
and for any symmetric density g
Pgθ(B)1
2for all θ
(ii) For any <0, there exists a set ˜
Bsuch that
Pg0(˜
B)1
2C(∆ η)
and for any symmetric density g
Pgθ(˜
B)1
2for all θ
Proof. (i) Take B=(,). Since θ∆andgθis symmetric around θ,Pgθ(B)
1
2.Ontheotherhand
Pg0(B)=1
2
0
g0(y)dy 1
2η
0
g0(y)dy 1
2C(∆ η) (7.15)
Similarly ˜
B=(−∞,∆) would satisfy condition (ii).
Remark 7.3.1.By considering IB(yθ0), it is easy to see that Lemma 7.3.2 holds
if we replace g0by g00and require θθ0>∆orθθ0<.
7.3. EXPONENTIALLY CONSISTENT TESTS 203
Assumption A. There exists ε0>0 such that the covariate values xisatisfy
lim inf
n→∞
1
n
n
i=1
I{xi<ε0}>0,lim inf
n→∞
1
n
n
i=1
I{xi
0}>0
Remark 7.3.2.Assumption A forces the covariate xto take both positive and neg-
ative values, i.e., values on both sides of 0. If the condition is satisfied around any
point, then by a simple location shift, we can bring it to the present form.
Proposition 7.3.1. If Assumption A holds, f0is continuous at 0and f0(0) >0,
then there is an exponentially consistent sequence of tests for
H0:(f,α,β)=(f0
0
0)against H1:(f,α,β)∈W
in each of the following cases:
(i) W={(f,α,β): α>α
0β0>};
(ii) W={(f,α,β): α<α
0β0>};
(iii) W={(f,α,β): α>α
0β0<}; and
(iv) W={(f,α,β): α<α
0β0<}.
Proof. (i) Let Kn={i:1in, xi
0}and #Knstand for the cardinality of
Kn. We will construct a test using only those Yis for which the corresponding iis in
Kn.
If iKn,then (α+βxi)(α0+β0xi)>xi, and by Lemma 7.3.2 for each iKn,
there exists a set Aisuch that
αi:= Pf0i(Ai)<1
2C(ηxi)
and
γi:= inf
(f,α,β)∈W Pfα,β,i (Ai)1
2
where “:=” stands for equality by definition.
204 7. REGRESSION PROBLEMS
If inand i/Kn,setAi=R, so that αi=γi=1.Thus
lim inf
n→∞ n1
n
i=1
(γiαi)
lim inf
n→∞ n1
iKn
C(ηxi)(7.16)
C(η) lim inf
n→∞ #Kn/n > 0
With Φi=IAi, the result follows from Lemma 7.3.1.
(ii) In this case we construct tests using Yisuch that iMn:= {1in:xi<
ε0}.IfiMn, then
(α+βxi)(α0+β0xi)<xi<ε0
Now using condition (ii) of Lemma 7.3.2, we get sets ˜
Biand then obtain exponentially
consistent tests using Lemma 7.3.1 as in part (i). The other two cases follow similarly.
The union of the W’s in Proposition 7.3.1 is the set {(f, α,β):|ββ0|>}.
The case for αalone can be proved in exactly the same way. Combining all eight
exponentially consistent tests for αand βone can get an exponentially consistent
test for α=α0=β0.
If random fs are not symmetrized around zero, αis not identifiable. So the posterior
distribution for αwill not be consistent. Consistency for βwill continue to hold under
appropriate conditions. To prove the existence of uniformly consistent tests for βin
the nonsymmetric case, we pair Yis and consider the difference YiYj,whichhas
a density that is symmetric around β(xixj). We can now handle the problem in
essentially the same way as in Proposition 7.3.1 to construct strictly unbiased tests.
The verification of the other conditions in Sections 7.4, 7.5 and 7.6 is along similar
lines.
The next proposition considers neighborhoods of f0to get posterior consistency
for the true density rather than only the parametric part. We need an additional
assumption.
Assumption B. For so m e L,|xi|<Lfor all i.
Proposition 7.3.2. Suppose that Assumption Bholds. Let Ube a weak neighbor-
hood of f0and let W=Uc×{(α, β):|αα0|<,|ββ0|<}. Then there exists
7.3. EXPONENTIALLY CONSISTENT TESTS 205
an exponentially consistent sequence of tests for testing
H0:(f,α,β)=(f0
0
0)against H1:(f,α,β)∈W
Proof. Without loss of generality take
U=f:Φ(y)f(y)Φ(y)f0(y)
(7.17)
where 0 Φ1 and Φ is uniformly continuous.
Since Φ is uniformly continuous, given ε>0, there exists δ>0 such that |y1y2|<
δimplies |Φ(y1)Φ(y2)|/2.
Let ∆ be such that
|(αα0)+(ββ0)xi|
for α, β ∈Wand all xi. Set ˜
Φi(y)=Φ(y(α0+β0xi)). Then
Ef0i˜
Φi=Ef0Φi,E
fα,β,i ˜
Φi=Ef(αα0),(ββ0),i Φ (7.18)
Noting that
Φ(y((αα0)+(ββ0)xi))f(αα0)+(ββ0)xi(y)dy
=Φ(y)f(y)dy
we have
˜
Φi(y)fα,β,i(y)dy
Φ(y)f(y)dy |Φ(y)Φ(y((αα0)+(ββ0)xi))|
×f(αα0)+(ββ0)xi(y)dy
Φ(y)f(y)dy ε
2
in the last step, we used the uniform continuity of Φ. An application of Lemma 7.3.1
completes the proof.
If one is interested in showing posterior probability of fU, |αα0|<,|ββ0|<
δgoesto1a.s.(f0
0
0), then it is necessary to get an exponential sequence of tests
for H0:(f,α,β)=(f0
0
0) against H1:fUcor |αα0|>Aor |ββ0|.For
this, one has only to combine Propositions 7.3.1, its analogoue for α, and Proposition
7.3.2.
206 7. REGRESSION PROBLEMS
7.4 Prior Positivity of Neighborhoods
In this section we develop sufficient conditions to verify condition (ii) of Theorem
7.2.1. A similar problem in the context of location parameter was studied in Chapter
6. There, we managed with Kullback-Leibler continuity of f0at θ0—the true value
of the location parameter, and the requirement that Π{K(f
0,f)}>0 for all θ
in a neighborhood of θ0and where f
0is close to but different from f0. However,
this approach does not carry over to the regression context because, even though
the true parameter remains (α0
0), for each iwe encounter different parameters
θi=α0+β0xi. Here we take a different approach. Since we have no assumptions on
the structure of the random density f, the assumption on f0is somewhat strong. This
condition is weakened in Section 7.7, where we consider Dirichlet mixture of normals.
In that case, the random fis better behaved.
Lemma 7.4.1. Suppose f0∈Fsatisfies the following condition: There exists η>0,
Cηand a symmetric density gηsuch that, for |η|,
f0(yη)<C
ηgη(y)for all y(7.19)
Then
(a) for any f∈Fand |θ|
K(f0,f
θ)Cηlog Cη+K(gη,f)+7K(gη,f)
(b) if, in addition, vargη(log(gη/f)) <, then
sup
|θ|
varf0log+
f0
fθ<
Proof. Part (a) is an immediate consequence of Lemma 6.3.2 and the fact that
K(f0,f)=K(f0,f
θ), which follows from the symmetry of f0and f.
For (b), note that
f0log+
f0
fθ2
=f0log+
f0
f2
Cηgηlog+
Cηgη
f2
(7.20)
A remark here: We work with varf0log+f0/fθrather than varf0(log f0/fθ)be-
cause the condition fθ<C
ηgηdoes not imply [log f0/f ]2Cηgη[log Cηgη/f ]2.
7.4. PRIOR POSITIVITY OF NEIGHBORHOODS 207
We write the assumption of Lemma 7.4.1 as follows:
Assumption C. For η>0, sufficiently small, there is gη∈Fand constant Cη>0
such that for |η|,
f0(yη)<C
ηgη(y) for all y
and
Cη1asη0
Proposition 7.4.1. Suppose Assumptions B and C hold. Let ˜
Πbe a prior for f
and µbe a prior for (α, β).If(α0
0)is in the support of µand if for all ηsufficiently
small and for all δ>0
˜
ΠK(gη,f), vargηlog gη
f<>0 (7.21)
then for all δ>0and some M>0,
(˜
Π×µ){(f,α,β):Ki(f,α,β), V
i(f,α,β)<M for all i}>0 (7.22)
Proof. Choose η,δ0such that (7.21) holds with δ=δ0and
(Cη+1)logCη+Cη δ0+δ0!
Let
V=(α, β):|αα0|<η
2,|ββ0|<η
2L
Note that
Ki(f0)=K(f0,f
(αα0)+(ββ0)xi)
and
Vi(f0)=V(f0,f
(αα0)+(ββ0)xi)
and (α, β)Vimplies that |(αα0)+(ββ0)xi| for all xi. An application of
Lemma 7.19 immediately gives the result.
Theorem 7.4.1. Suppose that
(i) the covariates x1,x
2,... satisfy Assumptions A and B;
(ii) f0is continuous, f0(0) >0, and f0satisfies Assumption C;
208 7. REGRESSION PROBLEMS
(iii) for all sufficiently small ηand for all δ>0,
˜
Π{K(gη,f), V(gη,f)<∞} >0
where gηis as in Assumption C.
Then for any neighborhood Uof f0,
Π{(f,α,β):f∈U,|αα0|,|ββ0||Y1,Y
2,...,Y
n}→1 (7.23)
a.s.
i=1 Pf0i.
In other words, the posterior distribution is weakly consistent at (f0
0
0).
Proof. The proof follows from the remarks after Proposition 7.3.2.
Remark 7.4.1.Assumption (ii) of Theorem 7.4.1 is satisfied if f0is Cauchy or
normal. If f0is Cauchy, then gη=f0satisfies Assumption C. If f0is normal, then
Assumption C holds with gη=fs
0,where
fs
0=1
2{f0(yη)+f0(yη)}(7.24)
Remark 7.4.2.Assumption B is used in two places: Propositions 7.3.2 and 7.4.1.
For specific f0s one may be able to obtain the conclusion of Proposition 7.4.1 without
Assumption B. In such cases one would be able to get consistency at (α0
0) without
having to establish consistency at (f0
0
0).
7.5 Polya Tree Priors
In this section we note that Polya tree priors, with a suitable choice of parameters,
satisfy condition (iii) of Theorem 7.19 and hence the posterior distribution is weakly
consistent. To obtain a prior on symmetric densities, we consider Polya tree priors on
densities fon the positive half-line and then considering the symmetrization fs(y)=
1
2f(|y|).Since K(f,g)=K(fs,g
s)andV(f, g)=V(fs,g
s), this symmetrization
presents no problems.
We briefly recall Polya tree priors from Chapter 3. Let E={0,1},Em={0,1}m
and E=8
m=1 Em.Foreachm,{B:Em}is a partition of R+and for each ,
{B0,B
1}is a partition of B. Further {B:E}generates the Borel σ-algebra.
7.6. DIRICHLET MIXTURE OF NORMALS 209
A random probability measure Pon R+is said to be distributed as a Polya tree
with parameters (Π,A), where Π is a sequence of partitions as described in the last
paragraph, and A={α:E}is a collection of nonnegative numbers, if there
exists a collection {Y:E}of mutually independent random variables such that
(i) each Yhas a beta distribution with parameters α0;andα1
(ii) the random measure Pis given by
P(B1···m)=
m
j=1,
j=0
Y1···j1
m
j=1,
j=1
(1 Y1···j)
We restrict ourselves to partitions Π = {Πm:m=0,1,...}that are determined
by a strictly positive, continuous density αon R+in the following sense: The sets in
Πmare intervals of the form
y:k1
2m<y
−∞
α(t)dt k
2m
Theorem 7.5.1. Let ˜
Πbe a Polya tree prior on densities on R+with α=rmfor
all Em.If
m=1 r1/2
m<, then for any density gsuch that K(g, α)<and
varg(log g)<for all δ>0,
lim
M→∞
˜
Π{f:K(g, f), V(g, f)<M}>0 (7.25)
The proof is along similar lines as that of Theorem 6.4.1. We refer to [134] for
details.
Although Polya trees give rise to naturally interpretable priors on densities and
leads to consistent posterior, sample paths of Polya trees are, however, very rough
and have discontinuities everywhere. Such a drawback can be easily overcome by
considering a mixture of Polya trees. Posterior consistency continues to hold this case,
because by Fubini’s theorem, prior positivity holds under mild uniformity conditions.
Such priors are worth further study.
7.6 Dirichlet Mixture of Normals
In this section, we look at random densities that arise as mixtures of normal densities.
Let φhdenote the normal density with mean 0 and standard deviation h. For any
210 7. REGRESSION PROBLEMS
probability Pon R,fh,P will stand for the density
fh,P (y)=φh(yt)dP (t) (7.26)
Our model consists of prior µfor hand a prior ˜
ΠforP. Consistency issues related
to these priors, in the context of density estimation, based on [74], were discussed in
Chapter 5. Here we look at similar issues when the error density fin the regression
model is endowed with these priors.
To ensure that the prior sits on symmetric densities, we let Pbe a random proba-
bility on R+and set
fh,P (y)=1
2φh(yt)dP (t)+1
2φh(y+t)dP (t) (7.27)
We will denote by ˜
Π both the prior for Pand the prior for fh,P .
The following lemma shows that the random fgenerated by the prior under con-
sideration is more regular than those generated by Polya tree priors, and hence the
conditions on f0are more transparent than those in Section 7.5 or those in Ghosal,
Ghosh, and Ramamoorthi [78].
Lemma 7.6.1. Let f0be a density such that
y2f0(y)dy < and f0(y)logf0(y)dy < (7.28)
If f(y)=φh(yt)dP (t)and t2dP (t)<, then
(i) lim
θ0f0(y)log f0(y)
fθ(y)dy =f0(y)log f0(y)
f(y)dy, and
(ii) lim
θ0f0(y)log f0(y)
fθ(y)2
dy =f0(y)log f0(y)
f(y)2
dy.
Proof. We have
log fθ(y)=logφh(y(t+θ))dP (t)
and hence
|log fθ(y)|≤|log 2πh|+
log e(yθt)2/(2h2)dP (t)
(7.29)
7.6. DIRICHLET MIXTURE OF NORMALS 211
Since log e(yθt)2/(2h2)dP (t)<0, by Jensen’s inequality applied to log x, the last
expression is bounded by
|log 2πh|+(yθt)2
h2dP (t)
Hence
f0(y)log f0(y)
fθ(y)
≤|f0(y)logf0(y)|+f0(y)|log fθ(y)|
≤|f0(y)logf0(y)|+|log 2πh|+f0(y)(yθt)2
h2dP (t)
The dominated Convergence Theorem now yields the result.
We now return to the regression model.
Theorem 7.6.1. Suppose ˜
Πis a normal mixture prior for f.If
(i) Assumptions A and B hold,
(ii) ˜
Π{f:K(f0,f), V(f0,f)<∞} >0for all δ>0,
(iii) Ef0(log f0)2<, and
(iv) t2dP (t)d˜
Π(P)<,
then the posterior Π(·|Y1,...,Y
n)is weakly consistent for (f,α,β)at (f0
0
0)pro-
vided (α0
0)is in the support of the prior for (α, β).
Proof. By condition (iv), P:t2dP (t)<has ˜
Π probability 1. So we may assume
that
˜
Πf:f=fP,(ii) holds, t2dP (t)<>0 (7.30)
Let U=f:f=fP,(ii) holds, t2dP (t)<.
For every f∈U, using Lemma 7.6.1, choose δfsuch that, for θ<δ
f
f0log f0
ff0log f0
fθ
(7.31)
212 7. REGRESSION PROBLEMS
Now choose εfsuch that |αα0+(ββ0)xi|
fwhenever |αα0|
f,|ββ0|<
εf/L.
Clearly, if f∈Uand |αα0|
fand |ββ0|
f/L,wehave
Ki(f,α,β)<2δand Vi(f,α,β)<V(f0,f)+δ(7.32)
Since ˜
Π{(f,α,β):f∈U,|αα0|
f,|ββ0|
f/L}>0 (7.33)
we have
Π(f,α,β):Ki(f0) for all i,
i=1
Vi(f,α,β)
i2<>0 (7.34)
An application of Theorem 7.2.1 completes the proof.
It was shown in Chapter 5 that if f0has compact support or if f0=fPwith P
having compact support, then ˜
Π{f:K(f0,f)}>0 for all δ>0. The argument
given there also shows that in these cases, (ii) of Theorem 7.6.1 holds when ˜
Πis
Dirichlet with base measure γ. In Chapter 5 we also described f0s whose tail behavior
is related to that of γsuch that ˜
Π{f:K(f0,f)}>0. In the case when the prior
is Dirichlet, the double-integral in (iv) is finite if and only if t2(t)<. While
normal f0is covered by these results, the case of Cauchy f0cannot be resolved by
the methods in that chapter. However, Dirichlet mixtures of both location and scale
parameters of normal may be able to handle Cauchy, which is a scale mixture of
normal. Results of Chapter 5 may need to be generalized to prove posterior consistency
for these priors. .
7.7 Binary Response Regression with Unknown Link
One of the most popular models in bioassay involves regression of the probability of
some event on a covariate x. The regression is taken to be linear in logit or probit
scale. In this section we consider the same problem with a nonparametric link func-
tion, instead of a logit or probit model. We indicate, without going into details, how
posterior consistency can be established.
Consider klevels of a drug on a suitable scale, say, x1,...,x
k, with probability of
a response (which may be death or some other specified event) pi,i=1,...,k. The
ith level of the drug is given to nisubjects and the number of responses rinoted.
7.7. BINARY RESPONSE REGRESSION WITH UNKNOWN LINK 213
We thus get kindependent binomial variables B(ni,p
i). The object is often to find x
such that p=0.5. Often, piis modeled as
pi=F(α+βxi)=H(xi) (7.35)
where Fis a response distribution and α+βxiis a linear representation of F1(pi)=
yi. Here pimay be estimated by ri/ni, but if the nis are small, the estimates will
have large variances, so the model provides a way of combining all the data. In a
logit model, Fis taken as a logistic distribution function. In a probit model the link
function is the normal distribution function. The choice of the functional form of
the link function is somewhat arbitrary, and this may substantially affect inference,
particularly at the two ends where data are sparse. In recent years, there has been
a lot of interest in link functions with unknown functional form. In nonparametric
problems of this kind, one puts a prior on For H.Suchanapproachwastakenby
Albert and Chib ([1]) , Chen and Dey ([31]), Basu and Mukhopadhyay ([11, 12])
and some other authors. If one puts a prior on F, one has to put conditions on F
like specifying two values of two quantiles to make (F, α,β ) identifiable. In this case,
one can develop sufficient conditions for posterior consistency at (F0
0
0) using
our variant of Schwartz’s theorem. However, in practice, one often puts a Dirichlet
process or some other prior on Fand independently of this, a prior on (α, β). Due
to the discreteness of Dirichlet selections, many authors actually prefer the use of
other priors such as Dirichlet scale mixtures of normals, see Basu and Mukhopadhyay
([11, 12]) and the references therein. Because of the lack of identifiability, the posterior
for (α, β) is not consistent. On the other hand, a Dirichlet process prior and a prior
on (α, β) provides a prior on Hand one can ask for posterior consistency of H1(1/2)
at, say, H1
0(1/2). This problem can be solved by the methods developed earlier in
this chapter.
Without loss of generality, one may take ni= 1 for all i. To verify condition (ii) of
Theorem 7.2.1, consider
Zi=log(H0(xi))ri(1 H0(xi))1ri
(H(xi))ri(1 H(xi))1ri(7.36)
where riis 1 or 0 with probability H(xi)and1H(xi), respectively, and the true H
is denoted by H0. Then it is easily found that
EH0(Zi)=H0(xi)log H0(xi)
H(xi)+(1H0(xi)) log 1H0(xi)
1H(xi)(7.37)
214 7. REGRESSION PROBLEMS
and
EH0(Z2
i)2H0(xi)log H0(xi)
H(xi)2
+2(1H0(xi)) log 1H0(xi)
1H(xi)2
(7.38)
Assume that the $x_i$s lie in a bounded interval containing $H_0^{-1}(1/2)$, and that the support of $H_0$ contains a bigger interval. Since the range of the $x_i$s is bounded, the sequence of formal empirical distributions $n^{-1}\sum_{i=1}^n\delta_{x_i}$ of $x_1,\dots,x_n$ is relatively compact. Assume that all limits of subsequences are distributions which give positive measure to all nondegenerate intervals, provided these lie in a certain interval containing $H_0^{-1}(1/2)$. Therefore, a positive fraction of the $x_i$s lie in any interval of positive length that is close to the point $H_0^{-1}(1/2)$. Also assume that $H_0$ is continuous and the support of the prior for $H$ contains $H_0$. For example, if the prior is a Dirichlet process with a base measure whose support contains the support of $H_0$, then this condition is satisfied. Mixture priors often have large supports also. For instance, the Dirichlet scale mixture of normals prior used by Basu and Mukhopadhyay ([11, 12]) will have this property if the true link function is also a scale mixture of normal cumulative distribution functions.

If $H_\nu$ is a sequence converging weakly to $H_0$, then by Polya's theorem the convergence is uniform. Note that for $0<p<1$, the functions $p\log(p/q)+(1-p)\log((1-p)/(1-q))$ and $p(\log(p/q))^2+(1-p)(\log((1-p)/(1-q)))^2$ of $q$ converge to 0 as $q\to p$, uniformly in $p$ lying in a compact subinterval of $(0,1)$. Thus given $\delta>0$, we can choose a weak neighborhood $U$ of $H_0$ such that if $H\in U$, then the $E_{H_0}(Z_i)$s and $E_{H_0}(Z_i^2)$s are bounded by $\delta$. By the assumption on the support of the prior, condition (ii) of Theorem 7.2.1 holds.
For existence of exponentially consistent tests in condition (i) of Theorem 7.2.1, consider, without loss of generality, testing $H^{-1}(1/2)=H_0^{-1}(1/2)$ against $H^{-1}(1/2)>H_0^{-1}(1/2)+\epsilon$ for small $\epsilon>0$. Let
$$K_n=\bigl\{i: H_0^{-1}(1/2)+\epsilon/2\le x_i\le H_0^{-1}(1/2)+\epsilon\bigr\}$$
Since
$$E_H(r_i)=H(x_i)\le H\bigl(H_0^{-1}(1/2)+\epsilon\bigr)\le\tfrac12\qquad(7.39)$$
and
$$E_{H_0}(r_i)=H_0(x_i)\ge H_0\bigl(H_0^{-1}(1/2)+\epsilon/2\bigr)>\tfrac12\qquad(7.40)$$
the test
$$\frac{1}{\#K_n}\sum_{i\in K_n}r_i<\frac12+\eta\qquad(7.41)$$
with $\eta=\bigl(H_0(H_0^{-1}(1/2)+\epsilon/2)-1/2\bigr)/2$ is exponentially consistent by Hoeffding's inequality and the fact that $\#K_n/n$ converges to positive limits along subsequences. Therefore Theorem 7.2.1 applies and the posterior distribution of $H^{-1}(1/2)$ is consistent at $H_0^{-1}(1/2)$.
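A minimal simulation sketch of the test (7.41) follows. It is not from the text: the logistic choice of $H_0$ (so that $H_0^{-1}(1/2)=0$) and the uniform design are only illustrative, and the data are generated under the null so the test should not reject.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_link(x):                  # H_0: logistic link, so H_0^{-1}(1/2) = 0
    return 1.0 / (1.0 + np.exp(-x))

n, eps = 5000, 0.5
x = rng.uniform(-2, 2, n)          # bounded design containing H_0^{-1}(1/2) = 0
r = rng.binomial(1, true_link(x))  # binary responses generated under H_0

# the window K_n of (7.41): design points between eps/2 and eps (to the right of 0)
Kn = (x >= eps / 2) & (x <= eps)
eta = (true_link(eps / 2) - 0.5) / 2
mean_r = r[Kn].mean()
print(f"#K_n = {Kn.sum()}, average r_i over K_n = {mean_r:.3f}, "
      f"threshold 1/2 + eta = {0.5 + eta:.3f}, reject H_0: {mean_r < 0.5 + eta}")
```

Under $H_0$ the average of the $r_i$ over $K_n$ concentrates above the threshold, and by Hoeffding's inequality the probability of crossing it decays exponentially in $\#K_n$, which grows linearly in $n$ under the assumptions on the design.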
7.8 Stochastic Regressor

In this section, we consider the case where the independent variable $X$ is stochastic. We assume that the $X$ observations $X_1,X_2,\dots$ are i.i.d. with a probability density function $g(x)$ and are independent of the errors $\epsilon_1,\epsilon_2,\dots$. We will argue that all the results on consistency hold under appropriate conditions.

Let $G(x)=\int_{-\infty}^xg(u)\,du$ denote the cumulative distribution function of $X$. We shall assume that the following condition holds.

Assumption D. The independent variable $X$ is compactly supported and $0<G(0)<1$.

Under these assumptions, results follow from a conditionality argument and the corresponding results for the nonstochastic case, conditioned on a sequence $x_1,x_2,\dots$ such that Assumptions A and B hold. Note that if $g$ satisfies Assumption D, then under $P_g^\infty$, almost all sequences $x_1,x_2,\dots$ satisfy Assumptions A and B. For details see [134]. Thus if $X$ is stochastic and Assumption D replaces Assumptions A and B in Theorems 7.5.1 and 7.6.1, posterior consistency holds.
7.9 Simulations
Additional insight can often be obtained by carrying out simulations. In the mixture
model that we have discussed, one can study the effect on the posterior of βby varying
the ingredients in the mixture model. There is an additional issue of symmetrization.
After fixing the prior, one can generate observations from carefully chosen parameters
and error density and in each case examine the behavior of the posterior. Extensive
simulations of this kind have been done by Charles Messan using WINBUGS, and we
present a few of these.
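The data-generation step of such a study can be sketched as follows (a minimal sketch, not the WINBUGS code used for the figures): it assumes the simple linear model $y_i=\beta x_i+\epsilon_i$ with $\beta=3$, a normal or Cauchy error density, and an arbitrary bounded design, and reports the classical least-squares estimate of $\beta$ for each case.

```python
import numpy as np
rng = np.random.default_rng(0)

def make_data(n, beta=3.0, error="normal"):
    """Generate (x_i, y_i) from y_i = beta * x_i + eps_i."""
    x = rng.uniform(-1, 1, n)
    if error == "normal":
        eps = rng.normal(0, 0.5, n)
    else:                                  # heavy-tailed case (true f_0 Cauchy)
        eps = rng.standard_cauchy(n) * 0.5
    return x, beta * x + eps

for err in ("normal", "cauchy"):
    x, y = make_data(50, error=err)
    bhat = np.sum(x * y) / np.sum(x * x)   # classical least-squares estimate of beta
    print(err, round(bhat, 3))
```

With Cauchy errors the classical estimate is unreliable (its variance is infinite), which is the situation in which the behavior of the nonparametric Bayes posterior is of most interest.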
First we look at two cases for the kernel: normal and Cauchy. The base measure
for the Dirichlet process is N(0,1). Figure 7.1 displays the simulated posterior when
216 7. REGRESSION PROBLEMS
observations were generated from (true f0is) normal. The value of βis 3.0., and
the random densities are not symmetrized. It is clear from the graphs that, in this
case, the posterior behaves well, and in addition to consistency also shows asymptotic
normality.
In figure 7.2, the setup for priors is the same as that just considered, but the
posterior is evaluated when the true f0is Cauchy. Clearly, things do not seem to go
well. Both consistency and asymptotic normality seem to be in doubt.
One could see if the introduction of a hyperparameter for the base measure of the
Dirichlet process would lead to amelioration of the situation. Figures 7.3 and 7.4 show
the result of simulations with a hyperparameter for the base measure. There seems to
be some improvement. The estimates are closer to the true value of β= 3, and there
is a suggestion of asymptotic normality.
Figure 7.4: Sample size $n=50$; true $f_0=$ Cauchy(0, 0.5). Priors: base measure of the Dirichlet process $N(\mu,\sigma)$ with $\mu\,|\,\sigma\sim N(0,2\sigma)$, $\sigma\sim\mathrm{Unif}(0,10)$; bandwidth $h\sim\mathrm{Unif}(0,4)$; Dirichlet hyperparameter $M=100$. Classical estimate of $\beta$: $\hat\beta=2.4641$, $\mathrm{Var}(\hat\beta)$ infinite. MCMC estimates of $\beta$: Dirichlet mixture of Cauchy, $\hat\beta_C=2.898$, $\mathrm{Var}(\hat\beta_C)=0.0053$, skewness $=-0.0753$, kurtosis $=0.2729$; Dirichlet mixture of normal, $\hat\beta_N=2.899$, $\mathrm{Var}(\hat\beta_N)=0.0050$, skewness $=-0.0623$, kurtosis $=0.3620$.
8
Uniform Distribution on Infinite-Dimensional
Spaces
8.1 Introduction
Except for a noninformative choice of the base measure $\alpha$ for a Dirichlet process, very little is known about noninformative priors in nonparametric or infinite-dimensional problems. In this chapter we explore how one may construct a prior that is noninformative, i.e., completely nonsubjective in the sense of Chapter 1, for nonparametric problems. One way of thinking of such a prior is as a uniform distribution over an infinite-dimensional space. Our approach has some similarities with that of Dembski [40], as well as many differences.

Several new approaches to the construction of such a prior are discussed in Section 8.2. The remaining sections attempt some validation. In Section 8.4 we show that one of our methods would lead to the Jeffreys prior for parametric models under regularity conditions. We also briefly discuss what would be reference priors from this point of view. Section 8.5 contains an application of our ideas to a density estimation problem of Wong and Shen [172]. We show that for our hierarchical noninformative prior, the posterior is consistent, a sort of weak frequentist validation. The proof of consistency is interesting in that the Schwartz condition is not assumed. We also show that the rate of convergence of the posterior is optimal. In particular, this implies that the Bayes estimate of the density corresponding to this prior achieves the optimal frequentist rate, a strong frequentist validation. We offer these tentative ideas to be tried out on different problems. Computational or other considerations may require replacing the $\mathcal P_i$ by other sieves, which need not be finite, changing the index $i$ to an $h$ taking values in a continuum, and using distributions on the $\mathcal P_i$ that are not uniform. These relaxations create a very large class of priors that are nonsubjective in some sense and from which it may be convenient to elicit a prior. This approach includes some of the priors in Chapter 5, namely, the random histograms and the Dirichlet mixture of normals with standard deviation $h$. The parameter $h$ can be viewed as indexing a sieve. This chapter is almost entirely based on [73] and [80].
8.2 Towards a Uniform Distribution
8.2.1 The Jeffreys Prior
By way of motivation we begin with a regular parametric model. Let $\Theta\subset R^p$. A uniform distribution on $\Theta$ should be associated with the geometry on $\Theta$ induced by the statistical problem. To do this, let $I(\theta)=[I_{i,j}(\theta)]$ be the $p\times p$ Fisher information (positive definite) matrix. As shown by Rao [2], the matrix induces a Riemannian metric on $\Theta$ through the integration of
$$\rho(d\theta)=\sqrt{\sum_i\sum_jI_{i,j}(\theta)\,d\theta_i\,d\theta_j}$$
over all curves connecting $\theta$ to $\theta'$ and minimizing over curves. The minimizing curve is a geodesic. If the model is $N(\theta,\Sigma)$, then $I_{i,j}=(\Sigma^{-1})_{i,j}$ and we get the famous Mahalanobis distance. Cencov [30] has shown that the Riemannian geometry induced by Rao's metric is the unique Riemannian metric that changes in a natural way under 1-1 smooth transformations of $\Theta$ onto itself. The Jeffreys prior $\{\det I(\theta)\}^{1/2}$ can be motivated as follows.

Fix a $\theta_0$ and consider a 1-1 smooth transformation
$$\theta\mapsto\psi(\theta)=\psi$$
such that the information matrix $I_\psi$ in the new parametrization $\psi$ is the identity at $\psi(\theta_0)$. This implies that the local geometry in the $\psi$-space is Euclidean near $\psi(\theta_0)$ and hence the Lebesgue measure is a suitable uniform distribution near $\psi(\theta_0)$. If we lift this back to the $\theta$-space making use of the Jacobian and the elementary fact
$$\Bigl[\frac{\partial\theta_j}{\partial\psi_i}\Bigr]^T[I_{i,j}(\theta)]\Bigl[\frac{\partial\theta_j}{\partial\psi_i}\Bigr]=I_\psi=I$$
we get the Jeffreys prior in the $\theta$-space, namely,
$$\pi(\theta)\propto\Bigl\{\det\Bigl[\frac{\partial\theta_i}{\partial\psi_j}\Bigr]\Bigr\}^{-1}=\{\det[I_{i,j}(\theta)]\}^{1/2}$$
Another way of deriving the Jeffreys prior in a similar spirit is given in Hartigan ([93], pp. 48, 49). The basic paper for the Jeffreys prior is Jeffreys [106]. These references are relevant for Section 8.4, especially Remark 8.4.1.
8.2.2 Uniform Distribution via Sieves and Packing Numbers
Suppose we have a model $\mathcal P$ which is equipped with a metric $\rho$ and is compact. In applications we use the Hellinger metric. The compactness assumption can be relaxed in at least some $\sigma$-compact cases in a standard way. Our starting point is a sequence $\epsilon_i$ diminishing to zero and sieves $\mathcal P_i$, where $\mathcal P_i$ is a finite set whose elements are separated from each other by at least $\epsilon_i$ and which has cardinality $D(\epsilon_i,\mathcal P)$, the largest $m$ for which there are $P_1,P_2,\dots,P_m\in\mathcal P$ with $\rho(P_j,P_{j'})>\epsilon_i$, $j\ne j'$, $j,j'=1,2,\dots,m$. Clearly, given any $P\in\mathcal P$ there exists $P'\in\mathcal P_i$ such that $\rho(P,P')\le\epsilon_i$. Thus $\mathcal P_i$ approximates $\mathcal P$ within $\epsilon_i$, and no proper subset of it has this property.

In the first method we choose $\epsilon_{i(n)}$ tending to 0 in some suitable way. It is then convenient to think of $\mathcal P_{i(n)}$ as a finite approximation to $\mathcal P$, with the approximation depending on the sample size $n$. The idea is that the approximating finite model is made more and more accurate by increasing its cardinality with the sample size. In the first method our noninformative prior is just the uniform distribution on $\mathcal P_{i(n)}$. This accords well with Basu's [9] recommendation in the parametric case to approximate the parameter space $\Theta$ by a finite set and then put a uniform distribution on it. It is also intuitively plausible that the complexity or richness of a model $\mathcal P_{i(n)}$ may be allowed to depend on the sample size. Since this prior depends on the sample size, we consider two other approaches that are more complicated but do not depend on the sample size.

In the second approach, we consider the sequence of uniform distributions $\Pi_i$ on $\mathcal P_i$ and take any weak limit $\Pi$ of $\{\Pi_i\}$ as a noninformative prior. If $\Pi$ is unique, it is simply the uniform distribution defined and studied by Dembski [40]. In the infinite-dimensional case, evaluation of the limit points may prove to be impossible. However, the first approach may be used, and $\Pi_{i(n)}$ may be treated as an approximation to a limit point $\Pi$.

We now come to the third approach. Here, instead of a limit, we consider the index as a hyperparameter and consider a hierarchical prior which picks the index $i$ with probability $\lambda_i$ and then uses $\Pi_i$.
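The finite sieves themselves can be built greedily. The following sketch (not from the text) constructs an $\epsilon$-dispersed subset of a toy compact family of Beta densities under the Hellinger metric used in this chapter, $H(f,g)=(\int(\sqrt f-\sqrt g)^2)^{1/2}$, and puts the uniform distribution on it. The Beta family, the parameter grid, and the numerical integration are illustrative choices, and a greedy net need not attain the packing number $D(\epsilon,\mathcal P)$ exactly.

```python
import numpy as np
from scipy.stats import beta

# toy compact "model": Beta(a, b) densities with parameters on a finite grid
params = [(a, b) for a in np.linspace(0.5, 5, 10) for b in np.linspace(0.5, 5, 10)]
x = np.linspace(1e-4, 1 - 1e-4, 2000)

def hellinger(p1, p2):
    f1, f2 = beta.pdf(x, *p1), beta.pdf(x, *p2)
    return np.sqrt(np.trapz((np.sqrt(f1) - np.sqrt(f2)) ** 2, x))

def greedy_net(eps):
    """Greedily keep points pairwise more than eps apart (an eps-dispersed set)."""
    net = []
    for p in params:
        if all(hellinger(p, q) > eps for q in net):
            net.append(p)
    return net

for eps in (0.4, 0.2, 0.1):
    net = greedy_net(eps)
    # the noninformative prior of the first/third approaches puts mass 1/|net| on each element
    print(f"eps = {eps}: |P_eps| = {len(net)}, uniform weight = {1/len(net):.4f}")
```

As $\epsilon$ shrinks the net grows, which is the sense in which the approximating finite model becomes richer; the third approach then mixes these uniform distributions with weights $\lambda_i$.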
8.3 Technical Preliminaries
Let $K$ be a compact metric space with a metric $\rho$. A finite subset $S$ of $K$ is called $\epsilon$-dispersed if $\rho(x,y)\ge\epsilon$ for all $x,y\in S$, $x\ne y$. A maximal $\epsilon$-dispersed set is called an $\epsilon$-net, and an $\epsilon$-net with maximum possible cardinality is said to be an $\epsilon$-lattice. The cardinality of an $\epsilon$-lattice is called the packing number (or $\epsilon$-capacity) of $K$ and is denoted by $D(\epsilon,K)=D(\epsilon,K,\rho)$. As $K$ is totally bounded, $D(\epsilon,K)$ is finite. Closely related to packing numbers are covering numbers $N(\epsilon,K,\rho)$, the minimum number of balls of radius $\epsilon$ needed to cover $K$. Clearly,
$$N(\epsilon,K,\rho)\le D(\epsilon,K,\rho)\le N(\epsilon/2,K,\rho)$$
In view of this, our arguments could also be stated in terms of covering numbers.

Define the $\epsilon$-probability $P_\epsilon$ by
$$P_\epsilon(X)=\frac{D(\epsilon,X)}{D(\epsilon,K)},\qquad X\subset K$$
It follows that $0\le P_\epsilon(\cdot)\le1$, $P_\epsilon(\emptyset)=0$, $P_\epsilon(K)=1$, and $P_\epsilon$ is subadditive for $X,Y\subset K$. Because $K$ is compact, subsequences of $P_\epsilon$ have weak limits as $\epsilon\to0$. If all the subsequences have the same limit, then $K$ is called uniformizable and the common limit point is called the uniform probability on $K$.

The following result of Dembski [40] will be used in the next section.

Theorem 8.3.1 (Dembski). Let $(K,\rho)$ be a compact metric space. Then the following assertions hold.

(a) If $K$ is uniformizable with uniform probability $\mu$, then $\lim_{\epsilon\to0}P_\epsilon(X)=\mu(X)$ for all $X\subset K$ with $\mu(\partial X)=0$.

(b) If $\lim_{\epsilon\to0}P_\epsilon(X)$ exists on some convergence-determining class in $K$, then $K$ is uniformizable.

To extend these ideas to noncompact $\sigma$-compact spaces, one can take a sequence of compact sets $K_n\uparrow K$, each having uniform probability $\mu_n$. Any positive Borel measure $\mu$ satisfying
$$\mu(\cdot\cap K_n)=\frac{\mu_n(\cdot\cap K_n)}{\mu_n(K_1)}$$
may be thought of as an (improper) uniform distribution on $K$. Such a measure would be unique up to a multiplicative constant by lemma 2 of Dembski [40].
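The $\epsilon$-probability is easy to visualize in one dimension. The toy sketch below (not from the text) takes $K=[0,1]$ with the Euclidean metric, where packing numbers have a closed form, and shows $P_\epsilon(X)$ converging to Lebesgue measure, the uniform probability on $K$, as $\epsilon\to0$.

```python
import numpy as np

def packing_number(a, b, eps):
    """D(eps, [a, b]) under the Euclidean metric: points at least eps apart."""
    if b < a:
        return 0
    return int(np.floor((b - a) / eps)) + 1

K = (0.0, 1.0)
X = (0.0, 0.3)
for eps in (0.1, 0.01, 0.001, 0.0001):
    P_eps = packing_number(*X, eps) / packing_number(*K, eps)
    print(f"eps = {eps:7}: P_eps(X) = {P_eps:.4f}")   # approaches the Lebesgue measure 0.3
```

Here $K$ is trivially uniformizable; the interesting content of Theorem 8.3.1 is that the same limiting procedure, carried out with the Hellinger metric, recovers the Jeffreys measure in regular parametric families, as shown next.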
8.4 The Jeffreys Prior Revisited
Let the $X_i$s be i.i.d. with density $f(\cdot;\theta)$ (with respect to a $\sigma$-finite measure $\nu$), where $\Theta$ is an open subset of $R^d$. Assume that $\{f(\cdot;\theta):\theta\in\Theta\}$ is a regular parametric family, i.e., there exists $\psi(\cdot;\theta)\in(L_2(\nu))^d$ such that for any compact $K\subset\Theta$,
$$\sup_{\theta\in K}\int\bigl|f^{1/2}(x;\theta+h)-f^{1/2}(x;\theta)-h^T\psi(x;\theta)\bigr|^2\,\nu(dx)=o(\|h\|^2)\qquad(8.1)$$
as $\|h\|\to0$. Define the Fisher information by the relation
$$I(\theta)=4\int\psi(x;\theta)(\psi(x;\theta))^T\,\nu(dx)\qquad(8.2)$$
Assume that $I(\theta)$ is positive definite and the map $\theta\mapsto I(\theta)$ is continuous. Further, assume the following stronger form of identifiability: on every compact set $K\subset\Theta$,
$$\inf\Bigl\{\int|f^{1/2}(x;\theta_1)-f^{1/2}(x;\theta_2)|^2\,\nu(dx):\theta_1,\theta_2\in K,\ \|\theta_1-\theta_2\|\ge\epsilon\Bigr\}>0\quad\text{for all }\epsilon>0$$
For i.i.d. observations equip $\Theta$ with the Hellinger distance, as defined in Chapter 1, namely,
$$H(\theta_1,\theta_2)=\Bigl(\int|f^{1/2}(x;\theta_1)-f^{1/2}(x;\theta_2)|^2\,\nu(dx)\Bigr)^{1/2}\qquad(8.3)$$
The following result is the main theorem of this section.

Theorem 8.4.1. Fix a compact subset $K$ of $\Theta$. Then for all $Q\subset K$ with $\mathrm{vol}(\partial Q)=0$, we have
$$\lim_{\epsilon\to0}\frac{D(\epsilon,Q)}{D(\epsilon,K)}=\frac{\int_Q\sqrt{\det I(\theta)}\,d\theta}{\int_K\sqrt{\det I(\theta)}\,d\theta}\qquad(8.4)$$
By using Theorem 8.3.1 we conclude that the Jeffreys measure $\mu$ on $\Theta$ defined by
$$\mu(Q)\propto\int_Q\sqrt{\det I(\theta)}\,d\theta,\qquad Q\subset\Theta\qquad(8.5)$$
is the (possibly improper) noninformative prior on $\Theta$ in the sense of the second approach described in the introduction.

The idea of the proof is to approximate the packing number of relatively small sets by the Jeffreys prior measure of those sets (see (8.13), (8.14)) and then fit these small sets into a given set $Q$ or $K$. One has to check that the approximation remains good at this higher scale [vide (8.16)].
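A small numerical check of (8.4), not from the text: for the Bernoulli($\theta$) family, $I(\theta)=1/(\theta(1-\theta))$ and the Hellinger distance of (8.3) has a closed form, so the packing-number ratio can be compared with the Jeffreys-measure ratio. The parameter grid and the greedy packing below are rough approximations of $D(\epsilon,\cdot)$, adequate only to illustrate the limit.

```python
import numpy as np

def hellinger(t1, t2):
    """Hellinger distance (8.3) between Bernoulli(t1) and Bernoulli(t2)."""
    return np.sqrt((np.sqrt(t1) - np.sqrt(t2))**2 + (np.sqrt(1-t1) - np.sqrt(1-t2))**2)

def packing(points, eps):
    """Greedy eps-dispersed subset of an ordered grid (approximates D(eps, .))."""
    net = []
    for t in points:
        if all(hellinger(t, s) > eps for s in net):
            net.append(t)
    return len(net)

def jeffreys(a, b):
    # integral of sqrt(det I(theta)) = 1/sqrt(theta(1-theta)) over [a, b]
    return 2 * (np.arcsin(np.sqrt(b)) - np.arcsin(np.sqrt(a)))

K, Q = (0.05, 0.95), (0.05, 0.30)
grid_K = np.linspace(*K, 5001)
grid_Q = grid_K[grid_K <= Q[1]]
print("Jeffreys ratio:", jeffreys(*Q) / jeffreys(*K))
for eps in (0.05, 0.02, 0.01):
    print(f"eps = {eps}: D(eps,Q)/D(eps,K) ~ {packing(grid_Q, eps) / packing(grid_K, eps):.3f}")
```

As $\epsilon$ decreases, the packing-number ratio settles near the Jeffreys ratio, which is the content of Theorem 8.4.1 in this one-dimensional case.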
Proof. Fix $0<\eta<1$. Cover $K$ by $J$ cubes of side length $\eta$. In each cube fix an interior cube of side length $\eta-\eta^2$; the interior cubes will provide an approximation from below. Since, by continuity, the eigenvalues of $I(\theta)$ are uniformly bounded away from zero and infinity on $K$, by standard arguments [see theorem I.7.6 in [102]] it follows from (8.1) that there exist $M>m>0$ such that
$$m\|\theta_1-\theta_2\|\le H(\theta_1,\theta_2)\le M\|\theta_1-\theta_2\|,\qquad\theta_1,\theta_2\in K\qquad(8.6)$$
Given $\eta>0$ choose $\epsilon>0$ so that $\epsilon/(2m)\le\eta^2$. Any two interior cubes are then separated by at least $\eta^2$ in the Euclidean distance and hence by at least $\epsilon/2$ in the Hellinger distance.

For $Q\subset K$, let $Q_j$ be the intersection of $Q$ with the $j$th cube and $Q'_j$ the intersection with the $j$th interior cube, $j=1,2,\dots,J$. Then
$$Q_1\cup Q_2\cup\dots\cup Q_J=Q\supset Q'_1\cup Q'_2\cup\dots\cup Q'_J\qquad(8.7)$$
Hence
$$\sum_{j=1}^JD(\epsilon,Q'_j,H)\le D(\epsilon,Q,H)\le\sum_{j=1}^JD(\epsilon,Q_j,H)\qquad(8.8)$$
In particular, with $Q=K$, we obtain
$$\sum_{j=1}^JD(\epsilon,K'_j,H)\le D(\epsilon,K,H)\le\sum_{j=1}^JD(\epsilon,K_j,H)\qquad(8.9)$$
where $K_j$ and $K'_j$ are defined in the same way.

For the $j$th cube, choose $\theta_j\in K$. By an argument similar to that for (8.6), for all $\theta,\theta'$ in the $j$th cube,
$$\frac{\lambda(\eta)}{2}\sqrt{(\theta-\theta')^TI(\theta_j)(\theta-\theta')}\le H(\theta,\theta')\le\frac{\bar\lambda(\eta)}{2}\sqrt{(\theta-\theta')^TI(\theta_j)(\theta-\theta')}\qquad(8.10)$$
where $\bar\lambda(\eta)$ and $\lambda(\eta)$ tend to 1 as $\eta\to0$. Let
$$H_j(\theta,\theta')=\frac{\lambda(\eta)}{2}\sqrt{(\theta-\theta')^TI(\theta_j)(\theta-\theta')}$$
and
$$\bar H_j(\theta,\theta')=\frac{\bar\lambda(\eta)}{2}\sqrt{(\theta-\theta')^TI(\theta_j)(\theta-\theta')}$$
Then from (8.10),
$$D(\epsilon,Q_j,H)\le D(\epsilon,Q_j,\bar H_j)\qquad(8.11)$$
$$D(\epsilon,Q'_j,H)\ge D(\epsilon,Q'_j,H_j)\qquad(8.12)$$
By the second part of theorem IX of Kolmogorov and Tihomirov [115], for some constants $\tau_j,\tau'_j$ and absolute constants $A_d$ (depending only on the dimension $d$),
$$D(\epsilon,Q_j,\bar H_j)\sim\tau_jA_d\,\mathrm{vol}(Q_j)\sqrt{\det I(\theta_j)}\,(\bar\lambda(\eta))^d\,\epsilon^{-d}\qquad(8.13)$$
and
$$D(\epsilon,Q'_j,H_j)\sim\tau'_jA_d\,\mathrm{vol}(Q'_j)\sqrt{\det I(\theta_j)}\,(\lambda(\eta))^d\,\epsilon^{-d}\qquad(8.14)$$
where the symbol $\sim$ means that the limit of the ratio of the two sides is 1 as $\epsilon\to0$. As all the metrics $H_j$ and $\bar H_j$, $j=1,2,\dots,J$, arise from elliptic norms, it can easily be concluded, by making a suitable linear transformation, that $\tau_j=\tau'_j=\tau$ (say) for all $j=1,2,\dots,J$. Thus we obtain from (8.7)-(8.14) that
$$\limsup_{\epsilon\to0}\frac{D(\epsilon,Q,H)}{D(\epsilon,K,H)}\le\frac{\sum_{j=1}^J\mathrm{vol}(Q_j)\sqrt{\det I(\theta_j)}}{\sum_{j=1}^J\mathrm{vol}(K'_j)\sqrt{\det I(\theta_j)}}\Bigl(\frac{\bar\lambda(\eta)}{\lambda(\eta)}\Bigr)^d\qquad(8.15)$$
and
$$\liminf_{\epsilon\to0}\frac{D(\epsilon,Q,H)}{D(\epsilon,K,H)}\ge\frac{\sum_{j=1}^J\mathrm{vol}(Q'_j)\sqrt{\det I(\theta_j)}}{\sum_{j=1}^J\mathrm{vol}(K_j)\sqrt{\det I(\theta_j)}}\Bigl(\frac{\lambda(\eta)}{\bar\lambda(\eta)}\Bigr)^d\qquad(8.16)$$
Now let $\eta\to0$. The sums $\sum_{j=1}^J\mathrm{vol}(Q_j)\sqrt{\det I(\theta_j)}$ and $\sum_{j=1}^J\mathrm{vol}(Q'_j)\sqrt{\det I(\theta_j)}$ both converge to $\int_Q\sqrt{\det I(\theta)}\,d\theta$, and similarly for the sums involving the $K_j$s and $K'_j$s. Also $\lambda(\eta)\to1$ and $\bar\lambda(\eta)\to1$, so the desired result follows.
Remark 8.4.1. It has been pointed out to us by Prof. Hartigan that Jeffreys had envisaged constructing noninformative priors by approximating $\Theta$ with Kullback-Leibler neighborhoods. He asked us if the construction in this section can be carried out using Kullback-Leibler neighborhoods. Because the Kullback-Leibler divergence is not a metric, there would be obvious difficulties in formalizing the notion of an $\epsilon$-net. However, if the family of densities $\{f_\theta:\theta\in\Theta\}$ has well-behaved tails such that, for any $\theta,\theta'$, $K(\theta,\theta')\le\varphi(H(\theta,\theta'))$, where $\varphi(\epsilon)$ goes to 0 as $\epsilon$ goes to 0, then any $\epsilon$-net $\{\theta_1,\dots,\theta_k\}$ in the Hellinger metric can be thought of as a Kullback-Leibler net in the sense that

1. $K(\theta_i,\theta_j)>\epsilon^2$ for $i\ne j$, $i,j=1,2,\dots,k$; and

2. for any $\theta$ there exists an $i$ such that $K(\theta,\theta_i)\le\varphi(\epsilon)$.

In such situations, the above theorems allow us to view the Jeffreys prior as a limit of uniform distributions arising out of Kullback-Leibler neighborhoods. Wong and Shen [172] show that a suitable tail condition is that for all $\theta,\theta'$,
$$\int_{f_\theta/f_{\theta'}\ge e^{1/\delta}}f_\theta\Bigl(\frac{f_\theta}{f_{\theta'}}\Bigr)^\delta<M$$
We now consider the case when there is a nuisance parameter. Let $\theta$ be the parameter of interest and $\phi$ the nuisance parameter. We can write the information matrix as
$$\begin{pmatrix}I_{11}(\theta,\phi)&I_{12}(\theta,\phi)\\ I_{12}(\theta,\phi)&I_{22}(\theta,\phi)\end{pmatrix}\qquad(8.17)$$
In view of Theorem 8.4.1, and in the spirit of the reference priors of Bernardo [18], the prior for $\phi$ given $\theta$ is specified as $\Pi(\phi\,|\,\theta)\propto\sqrt{I_{22}(\theta,\phi)}$. So it is only necessary to construct a noninformative marginal prior for $\theta$. Assume, as before, that the parameter space is compact. With $n$ i.i.d. observations, the joint density of the observations given $\theta$ only is
$$g(x^n\,|\,\theta)=(c(\theta))^{-1}\int\prod_1^nf(x_i\,|\,\theta,\phi)\sqrt{I_{22}(\theta,\phi)}\,d\phi\qquad(8.18)$$
where $c(\theta)=\int\sqrt{I_{22}(\theta,\phi)}\,d\phi$ is the constant of normalization. Let $I_n(\theta,g)$ denote the information in the family $\{g(x^n\,|\,\theta):\theta\in\Theta\}$. Under appropriate regularity conditions, it can be shown that the information per observation $I_n(\theta,g)/n$ satisfies
$$\lim_{n\to\infty}I_n(\theta,g)/n=(c(\theta))^{-1}\int I_{11.2}(\theta,\phi)\sqrt{I_{22}(\theta,\phi)}\,d\phi=J(\theta)\ \text{(say)}\qquad(8.19)$$
where $I_{11.2}=I_{11}-I_{12}^2/I_{22}$ is the reciprocal of the (1,1) element of the inverse of the information matrix. Let $H_n(\theta,\theta+h)$ be the Hellinger distance between $g(x^n\,|\,\theta)$ and $g(x^n\,|\,\theta+h)$. Locally, as $h\to0$, $H_n^2(\theta,\theta+h)$ behaves like $h^2I_n(\theta,g)$. Hence by Theorem 8.4.1, the noninformative (marginal) prior for $\theta$ would be proportional to $\sqrt{I_n(\theta,g)}$. In view of (8.19), passing to the limit as $n\to\infty$, the (sample size-independent) marginal noninformative prior for $\theta$ should be taken to be proportional to $(J(\theta))^{1/2}$, and so the prior for $(\theta,\phi)$ is proportional to $\sqrt{J(\theta)}\,\pi(\phi\,|\,\theta)$. Generally, for a noncompact parameter space, one can proceed as in Berger and Bernardo [14]. Informally, we can sum up as follows. The prior for $\theta$ based on the current approach is obtained by taking the average of $I_{11.2}(\theta,\phi)$ with respect to $\sqrt{I_{22}(\theta,\phi)}$ and then taking the square root. The reference prior of Berger and Bernardo, or the probability matching prior, takes geometric and harmonic means of other functions of $I_{11}(\theta,\phi)$ and then transforms back. In the examples of Datta and Ghosh [38], we believe that they reduce to the same prior.
8.5 Posterior Consistency for Noninformative Priors for
Infinite-Dimensional Problems
In this section, we show that in a certain class of infinite-dimensional families, the third approach mentioned in the introduction leads to a consistent posterior.

Theorem 8.5.1. Let $\mathcal P$ be a family of densities where $\mathcal P$, metrized by the Hellinger distance, is compact. Let $\epsilon_n$ be a positive sequence satisfying
$$\sum_{n=1}^\infty n^{1/2}\epsilon_n<\infty$$
Let $\mathcal P_n$ be an $\epsilon_n$-net in $\mathcal P$, $\mu_n$ the uniform distribution on $\mathcal P_n$, and $\mu$ the probability on $\mathcal P$ defined by $\mu=\sum_{n=1}^\infty\lambda_n\mu_n$, where the $\lambda_n$s are positive numbers adding up to unity. If for any $\beta>0$,
$$\lim_{n\to\infty}e^{\beta n}\frac{\lambda_n}{D(\epsilon_n,\mathcal P_n)}=\infty\qquad(8.20)$$
then the posterior distribution based on the prior $\mu$ and i.i.d. observations $X_1,X_2,\dots$ is strongly consistent at every $p_0\in\mathcal P$.

Proof. Since $\mathcal P$ is compact under the Hellinger metric, the weak topology and the Hellinger topology coincide on $\mathcal P$. Consequently weak neighborhoods and strong neighborhoods coincide and so do the notions of weak and strong consistency.

To prove consistency, by Remark 4.5.1, it is enough to show that for every $\delta$, if $U_n^\delta=\{P:H(P_0,P)\le\delta/n\}$, then for all $\beta>0$, $e^{\beta n}\Pi(U_n^\delta)\to\infty$. Because $\sum_{n=1}^\infty n^{1/2}\epsilon_n<\infty$, given $\delta$ there is an $n_0$ such that for $n>n_0$, $\epsilon_n\le\delta/n$; so that for $n>n_0$ there is a $P_n\in\mathcal P_n$ such that $H(P_0,P_n)\le\delta/n$.

Since $\Pi\{P_n\}=\lambda_n/D(\epsilon_n,\mathcal P_n)$ and, by assumption, for all $\beta>0$,
$$\lim_{n\to\infty}e^{\beta n}\frac{\lambda_n}{D(\epsilon_n,\mathcal P_n)}=\infty$$
and $\Pi(U_n^\delta)\ge\Pi\{P_n\}$, consistency follows.
Remark 8.5.1. Consistency is obtained in Theorem 8.5.1 by requiring (8.20) for sieves whose width $\epsilon_n$ was chosen carefully. However, it is clear from the proof that consistency would follow for sieves with width $\epsilon_n\downarrow0$ by imposing (8.20) along a carefully chosen subsequence.

Precisely, if $\epsilon_n\downarrow0$, $\mathcal P_n$ is an $\epsilon_n$-net, $\mu$ is the probability on $\mathcal P$ defined by $\mu=\sum_1^\infty\lambda_n\mu_n$, and $\delta_n$ is a positive summable sequence, then by choosing $j(n)$ with
$$\epsilon_{j(n)}^2\le n\delta_n\qquad(8.21)$$
the posterior is consistent if, for all $\beta>0$,
$$e^{n\beta}\,\frac{\lambda_{j(n)}}{D(\epsilon_{j(n)},\mathcal P_{j(n)})}\to\infty\qquad(8.22)$$
A useful case corresponds to
$$D(\epsilon,\mathcal P)\le A\exp[c\,\epsilon^{-\alpha}]\qquad(8.23)$$
where $0<\alpha<2/3$, $A$ and $c$ are positive constants, and $\delta_n=n^{-\gamma}$ for some $\gamma>1$. If in this case $j(n)$ is the smallest integer satisfying (8.21), then (8.22) becomes
$$e^{n\beta-c\,\epsilon_{j(n)}^{-\alpha}}\,\lambda_{j(n)}\to\infty\qquad(8.24)$$
If $\epsilon_n=\epsilon/2^n$ for some $\epsilon>0$ and $\lambda_n$ decays no faster than $n^{-s}$ for some $s>0$, then (8.24) holds. Moreover, the condition $0<\alpha<2$ in (8.23) is enough for posterior consistency in probability.
We can apply this to the following example [see Wong and Shen [172]].

Example 8.5.1. Let
$$\mathcal P=\Bigl\{g^2:g\in C^r[0,1],\ \int_0^1g^2(x)\,dx=1,\ \|g^{(j)}\|_{\sup}\le L_j,\ j=1,2,\dots,r,\ |g^{(r)}(x_1)-g^{(r)}(x_2)|\le L_{r+1}|x_1-x_2|^m\Bigr\}$$
where $r$ is a positive integer, $0\le m\le1$, and the $L$s are fixed constants. By theorem 15 of Kolmogorov and Tihomirov [115], $D(\epsilon,\mathcal P,h)\le\exp[c\,\epsilon^{-1/(r+m)}]$.
8.6 Convergence of Posterior at Optimal Rate
This section is based on Ghosal, Ghosh and van der Vaart ([80]). We present a result concerning the rate of convergence of the posterior relative to the $L_1$, $L_2$, and Hellinger metrics. The two main elements controlling the rate of convergence are the size of the model (measured by packing or covering numbers) and the amount of prior mass given to a shrinking ball around the true measure. It is the latter quantity that is easy to estimate for the hierarchical noninformative priors introduced in Section 8.2 and appearing in Theorem 8.5.1 of the preceding section. See also Shen and Wasserman [150].

Theorem 8.6.1. Suppose that for a sequence $\epsilon_n$ with $\epsilon_n\to0$ and $n\epsilon_n^2\to\infty$, a constant $C>0$, and sets $\mathcal P_n\subset\mathcal P$, we have
$$\log D(\epsilon_n,\mathcal P_n,d)\le n\epsilon_n^2\qquad(8.25)$$
$$\Pi_n(\mathcal P\setminus\mathcal P_n)\le\exp\bigl(-n\epsilon_n^2(C+4)\bigr)\qquad(8.26)$$
$$\Pi_n\Bigl(P:E_0\Bigl(\log\frac{p_0}{p}\Bigr)\le\epsilon_n^2,\ E_0\Bigl(\log\frac{p_0}{p}\Bigr)^2\le\epsilon_n^2\Bigr)\ge\exp(-n\epsilon_n^2C)\qquad(8.27)$$
Then for sufficiently large $M$, we have that
$$\Pi_n\bigl(P:d(P,P_0)\ge M\epsilon_n\,|\,X_1,X_2,\dots,X_n\bigr)\to0\quad\text{in }P_0^n\text{-probability}$$
See [80] for a proof.
Condition (8.25) requires that the "model" $\mathcal P_n$ is not too big, and (8.26) ensures that the complement of $\mathcal P_n$ carries a negligible amount of prior mass. These conditions hold for every $\epsilon'_n\ge\epsilon_n$ as soon as they hold for $\epsilon_n$, and thus can be seen as defining a minimal possible value of $\epsilon_n$. Condition (8.25) ensures the existence of certain tests and could be replaced by a testing condition in the spirit of LeCam [120]. Note that the metric $d$ used here reappears in the assertion of the theorem. Since the total variation metric is bounded above by twice the Hellinger metric, the assertion of the theorem using the Hellinger metric is stronger, but condition (8.25) is then also more restrictive, so that we really have two theorems. In the case that the densities are uniformly bounded, one can have a third theorem, using the $L_2$-distance, which in that case is bounded above by a multiple of the Hellinger distance. If the densities are also uniformly bounded away from zero, then these three distances are equivalent and are also equivalent to the Kullback-Leibler number and the $L_2$-norm appearing in condition (8.27).
A rate $\epsilon_n$ satisfying (8.25) for $\mathcal P=\mathcal P_n$ and $d$ the Hellinger metric is often viewed as giving the "optimal" rate of convergence for estimators of $P$ relative to the Hellinger metric, given the model $\mathcal P$. Under certain conditions, such as likelihood ratios bounded away from zero and infinity, this is proved as a theorem by Birgé [22] and LeCam [122] and [120]. See also Wong and Shen [172]. From Birgé's work it is clear that condition (8.25) is a measure of the complexity of the model.

Condition (8.27) is the other main condition. It requires that the prior measures put a sufficient amount of mass near the true measure $P_0$. Here "near" is measured through a combination of the Kullback-Leibler divergence between $p_0$ and $p$ and the $L_2(P_0)$-norm of $\log(p/p_0)$. Again, this condition is satisfied for $\epsilon'_n\ge\epsilon_n$ if it is satisfied for $\epsilon_n$ and thus is another restriction on a minimal value of $\epsilon_n$.

The assertion of the theorem is an in-probability statement that the posterior mass outside a large ball of radius proportional to $\epsilon_n$ is approximately zero. The in-probability statement can be improved to an almost-sure assertion, but under stronger conditions, as indicated below.
Let $h$ be the Hellinger distance and write $\log_+x$ for $(\log x)\vee0$.

Theorem 8.6.2. Suppose that conditions (8.25) and (8.26) hold as in the preceding theorem, that $\sum_ne^{-Bn\epsilon_n^2}<\infty$ for every $B>0$, and that
$$\Pi_n\Bigl(P:h^2(P,P_0)\,\bigl\|p_0/p\bigr\|_\infty\le\epsilon_n^2\Bigr)\ge e^{-n\epsilon_n^2C}$$
Then for sufficiently large $M$, we have that $\Pi_n(P:d(P,P_0)\ge M\epsilon_n\,|\,X_1,\dots,X_n)\to0$, $P_0^n$-almost surely.

See also theorem 2.3 in [80].
These theorems are not tailored to finite-dimensional models. For such cases and for finite-dimensional sieves, they yield an extra logarithmic factor in addition to the correct rate of $1/\sqrt n$. Suitable refinements of (8.25) and (8.27) addressing this issue are given in [80].

Convergence of the posterior distribution at the rate $\epsilon_n$ implies the existence of point estimators, Bayes in the sense that they are based on the posterior distribution, which converge at least as fast as $\epsilon_n$ in the frequentist sense. One possible construction is to define $\hat P_n$ as the (near) maximizer of
$$Q\mapsto\Pi_n\bigl(P:d(P,Q)<\epsilon_n\,|\,X_1,\dots,X_n\bigr)$$

Theorem 8.6.3. Suppose that $\Pi_n(P:d(P,P_0)\ge\epsilon_n\,|\,X_1,\dots,X_n)$ converges to 0, almost surely (respectively, in probability) under $P_0^n$, and let $\hat P_n$ maximize, up to $o(1)$, the function $Q\mapsto\Pi_n(P:d(P,Q)<\epsilon_n\,|\,X_1,\dots,X_n)$. Then $d(\hat P_n,P_0)\le2\epsilon_n$ eventually, almost surely (respectively, in probability) under $P_0^n$.

Proof. By definition, the $\epsilon_n$-ball around $\hat P_n$ contains at least as much posterior probability as the $\epsilon_n$-ball around $P_0$, both of which, by posterior convergence at rate $\epsilon_n$, have posterior probability close to unity. Therefore, these two balls cannot be disjoint. Now apply the triangle inequality.

The theorem is well known (see, e.g., Le Cam [120] or Le Cam and Yang [121]). If we use the Hellinger or total variation metric (or some other bounded metric whose square is convex), then an alternative is to use the posterior expectation, which typically has a similar property.
In order to state the next theorem we need a strengthening of the notion of entropy. Given two functions $l,u:\mathcal X\to R$, the bracket $[l,u]$ is defined as the set of all functions $f:\mathcal X\to R$ such that $l\le f\le u$ everywhere. The bracket is said to be of size $\epsilon$ relative to the distance $d$ if $d(l,u)<\epsilon$. In the following we use the Hellinger distance $h$ for the distance $d$ and take the brackets to consist of nonnegative functions, integrable with respect to a fixed measure $\mu$. Let $N_{[\,]}(\epsilon,\mathcal P,h)$ be the minimal number of brackets of size $\epsilon$ needed to cover $\mathcal P$. The corresponding bracketing entropy is defined as the logarithm of the bracketing number $N_{[\,]}(\epsilon,\mathcal P,h)$. It is easy to see that $N_{[\,]}(\epsilon/2,\mathcal P,h)$ is bigger than $N(\epsilon/2,\mathcal P,h)$ and hence bigger than $D(\epsilon,\mathcal P,h)$. However, in many examples, bracketing and packing numbers lead to the same values of the entropy up to an additive constant.

In the spirit of Section 8.2.2 we now construct a discrete prior supported on densities constructed from minimal sets of brackets for the Hellinger distance. For a given number $\epsilon_n>0$ let $\Pi_n$ be the uniform discrete measure on the $N_{[\,]}(\epsilon_n,\mathcal P,h)$ densities obtained by covering $\mathcal P$ with a minimal set of $\epsilon_n$-brackets and then renormalizing the upper bounds of the brackets to integrate to one. Thus if $[l_1,u_1],\dots,[l_N,u_N]$ are the $N=N_{[\,]}(\epsilon_n,\mathcal P,h)$ brackets, then $\Pi_n$ is the uniform measure on the $N$ functions $u_j/\int u_j\,d\mu$. Finally, construct the hierarchical prior
$$\Pi=\sum_{n\in\mathbb N}\lambda_n\Pi_n$$
for a given sequence $\lambda_n$ with $\lambda_n\ge0$ and $\sum_n\lambda_n=1$. This is essentially the third approach of Section 8.2.2. As before, the rate at which $\lambda_n\to0$ is important.
Theorem 8.6.4. Suppose that $\epsilon_n$ are numbers decreasing in $n$ such that
$$\log N_{[\,]}(\epsilon_n,\mathcal P,h)\le n\epsilon_n^2$$
for every $n$, and $n\epsilon_n^2/\log n\to\infty$. Construct the prior $\Pi$ as given previously for a sequence $\lambda_n$ such that $\lambda_n>0$ for all $n$ and $\log\lambda_n^{-1}=O(\log n)$. Then the conditions of Theorem 8.6.2 are satisfied for $\epsilon_n$ a sufficiently large multiple of the present $\epsilon_n$, and hence the corresponding posterior converges at the rate $\epsilon_n$ almost surely, for every $P_0\in\mathcal P$, relative to the Hellinger distance.
There are many specific applications. The situation here is similar to that in several recent papers on rates of convergence of (sieved) maximum likelihood estimators, as in Birgé and Massart (1996, 1997), Wong and Shen [172], or chapter 3.4 of van der Vaart and Wellner [161]. We consider again Example 8.5.1 of smooth densities from the previous section.

Example 8.6.1 (Smooth densities). Because upper and lower brackets can be constructed from uniform approximations, the bracketing Hellinger entropies grow like $\epsilon^{-1/r}$, so that we can take $\epsilon_n$ of the order $n^{-r/(2r+1)}$ to satisfy the relation $\log N_{[\,]}(\epsilon_n,\mathcal P,h)\le n\epsilon_n^2$. This rate is known to be the optimal frequentist rate for estimators. From Theorem 8.6.3, we therefore conclude that for the prior constructed earlier, the posterior attains the optimal rate of convergence.
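The balance between entropy and sample size that defines this rate can be checked numerically. The short sketch below (not from the text) solves $\epsilon^{-1/r}=n\epsilon^2$ for $\epsilon$ and compares the solution with $n^{-r/(2r+1)}$; for this particular balance the two agree exactly, which is why the rate is read off directly from the entropy bound.

```python
import numpy as np
from scipy.optimize import brentq

def rate(n, r):
    """Solve eps^(-1/r) = n * eps^2 for eps (the entropy/sample-size balance)."""
    return brentq(lambda e: e ** (-1.0 / r) - n * e ** 2, 1e-12, 1.0)

r = 2
for n in (10**3, 10**5, 10**7):
    print(n, rate(n, r), n ** (-r / (2 * r + 1)))   # the two columns coincide
```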
Since the lower bounds of the brackets are not really needed, the theorem can be generalized by defining $N_{]}(\epsilon,\mathcal P,h)$ as the minimal number of functions $u_1,\dots,u_m$ such that for every $p\in\mathcal P$ there exists a function $u_i$ such that both $p\le u_i$ and $h(u_i,p)<\epsilon$. Next we construct a prior $\Pi$ as before. These upper bracketing numbers are clearly smaller than the bracketing numbers $N_{[\,]}(\epsilon,\mathcal P,h)$, but we do not know any example where this generalization could be useful.

So far, we have implicitly required that the model $\mathcal P$ be totally bounded for the Hellinger metric. A simple modification works for countable unions of totally bounded models, provided that we use a sequence of priors. Suppose that the bracketing numbers of $\mathcal P$ are infinite, but there exist subsets $\mathcal P_n\uparrow\mathcal P$ with finite bracketing numbers. Let $\epsilon_n$ be numbers such that $\log N_{[\,]}(\epsilon_n,\mathcal P_n,h)\le n\epsilon_n^2$, and be such that $n\epsilon_n^2$ is increasing with $n\epsilon_n^2/\log n\to\infty$. Then we construct $\Pi_n$ as before with $\mathcal P$ replaced by $\mathcal P_n$, but we do not mix these uniform distributions. Instead, we consider $\Pi_n$ itself as the sequence of prior distributions. Then the corresponding posteriors achieve the convergence rate $\epsilon_n$.

It is worth observing that we use a condition on the entropies with bracketing, even though we apply Theorem 8.6.2, which demands control over metric entropies only.
This is necessary because the theorem also requires control over the likelihood ratios. If, for instance, the densities are uniformly bounded away from zero and infinity, so that the quotients $p_0/p$ are uniformly bounded, then we can replace the bracketing entropy by ordinary metric entropy. Alternatively, if the set of densities $\mathcal P$ possesses an integrable envelope function, then we can construct priors achieving the rate $\epsilon_n$ determined by the covering numbers, up to logarithmic factors. Here we define $\epsilon_n$ as the minimal solution of the equation $\log N(\epsilon,\mathcal P,h)\le n\epsilon^2$, where $N(\epsilon,\mathcal P,h)$ denotes the Hellinger covering number (without bracketing).

We assume that the set of densities $\mathcal P$ has a $\mu$-integrable envelope function: a measurable function $m$ with $\int m\,d\mu<\infty$ such that $p\le m$ for every $p\in\mathcal P$. Given $\epsilon_n>0$, let $\{s_{1,n},\dots,s_{N_n,n}\}$ be a minimal $\epsilon_n$-net over $\mathcal P$ (hence $N_n=N(\epsilon_n,\mathcal P,h)$) and put
$$g_{j,n}=\bigl(s_{j,n}^{1/2}+\epsilon_nm^{1/2}\bigr)^2/c_{j,n}$$
where $c_{j,n}$ is a constant ensuring that $g_{j,n}$ is a probability density. Finally, let $\Pi_n$ be the uniform discrete measure on $g_{1,n},\dots,g_{N_n,n}$ and let $\Pi=\sum_{n=1}^\infty\lambda_n\Pi_n$ be a convex combination of the $\Pi_n$ as before. This is similar to the construction of sieved MLEs in theorem 6 of Wong and Shen [172]. The following result guarantees an optimal rate of convergence.

Theorem 8.6.5. Suppose that $\epsilon_n$ are numbers decreasing in $n$ such that
$$\log N(\epsilon_n,\mathcal P,h)\le n\epsilon_n^2$$
for every $n$ and $n\epsilon_n^2/\log n\to\infty$. Construct the prior $\Pi=\sum_{n=1}^\infty\lambda_n\Pi_n$ as given previously for a sequence $\lambda_n$ such that $\lambda_n>0$ for all $n$ and $\log\lambda_n^{-1}=O(\log n)$. Assume $m$ is a $\mu$-integrable envelope. Then the corresponding posterior converges at the rate $\epsilon_n\log(1/\epsilon_n)$ in probability, relative to the Hellinger distance.

We omit the proof.
9
Survival Analysis—Dirichlet Priors
9.1 Introduction
In this chapter, our interest is in the distribution of a positive random variable X,
which arises as the time to occurrence of an event. What makes the problem different
from those considered so far is the presence of censoring. Typically, one does not
always get to observe the value of Xbut only obtains some partial information about
X, like Xaor aXb. This loss of information is often modeled through
various kinds of censoring mechanisms: left, right, interval, etc. See Andersen et
al. [3] for a deep development of various censoring models. The earliest frequentist
methods for censored data were in the context of right censored data, and it is this
kind of censoring that we will study in this and in Chapter 10. Bayesian analysis of
other kinds of censored data is still tentative, and much remains to be done.
Let $X$ be a positive random variable with distribution $F$ and let $Y$ be independent of $X$ with distribution $G$. The model studied in this section is: $F\sim\Pi$; given $F$, $X_1,X_2,\dots,X_n$ are i.i.d. $F$; given $G$, $Y_1,Y_2,\dots,Y_n$ are i.i.d. $G$ and are independent of the $X_i$s; the observations are $(Z_1,\delta_1),(Z_2,\delta_2),\dots,(Z_n,\delta_n)$, where $Z_i=X_i\wedge Y_i$ and $\delta_i=I(X_i\le Y_i)$.

Our interest is in the posterior distribution of $F$ given $\{(Z_i,\delta_i):1\le i\le n\}$. Under the assumption that $X$ and $Y$ are independent, the posterior distribution of $F$ given $(Z,\delta)$ is independent of $G$. If $Z_i=z_i$ and $\delta_i=0$, the observation is referred
to as (right) censored at $z_i$, and in this case it is intuitively clear that the information we have about $X_i$ is just that $X_i>z_i$; hence the posterior distribution of $F$ given $(Z_i=z_i,\delta_i=0)$ is $\Pi(\cdot\,|\,X_i>z_i)$. Similarly, the posterior distribution of $F$ given $(Z_i=z_i,\delta_i=1)$ is $\Pi(\cdot\,|\,X_i=z_i)$.

In Section 9.2, we study the case when the underlying prior for $F$ is a Dirichlet process. This model was first studied by Susarla and Van Ryzin [154]. They obtained the Bayes estimate of $F$, and later Blum and Susarla [26] gave a mixture representation for the posterior. Here we develop a different representation for the posterior and show that the posterior is consistent.

In Section 9.3, we briefly discuss the notion of the cumulative hazard function, describe some of its properties, and use it to describe a result of Peterson which shows that, under mild assumptions, both $F$ and $G$ can be recovered from the distribution of $(Z,\delta)$. This result is used in Section 9.4.

In Section 9.4, we start with a Dirichlet prior for the distribution of $(Z,\delta)$ and, through the map discussed in Section 9.3, transfer this to a prior for $F$. The properties discussed in Section 9.3 are used to study these priors.

In the last section, we look at Dirichlet process priors for interval censored data and note that some of the properties analogous to the right censored case do not hold there. Some of the material in this chapter is taken from [81] and [87].
9.2 Dirichlet Prior
Let $\alpha$ be a finite measure on $(0,\infty)$. The model that we consider here is: $F\sim D_\alpha$; given $F$, $X_1,X_2,\dots,X_n$ are i.i.d. $F$; given $G$, $Y_1,Y_2,\dots,Y_n$ are i.i.d. $G$ and are independent of the $X_i$s; the observations are $(Z_1,\delta_1),(Z_2,\delta_2),\dots,(Z_n,\delta_n)$, where $Z_i=X_i\wedge Y_i$ and $\delta_i=I(X_i\le Y_i)$.

Our interest is in the posterior distribution of $F$ given $\{(Z_i,\delta_i):1\le i\le n\}$. Under the independence assumption the distribution $G$ plays no role in the posterior distribution of $F$.

The posterior can be represented in many ways. Susarla and Van Ryzin [154], who first investigated this problem, obtained a Bayes estimate for $F$ and showed that this Bayes estimate converges to the Kaplan-Meier estimate as $\alpha(R^+)\to0$. Blum and Susarla [26] complemented this result by showing that the posterior distribution is a mixture of Dirichlet processes. This mixture representation, while natural, is somewhat cumbersome.

Lavine [118] observed that the posterior can be realized as a Polya tree process. Under this representation computations are more transparent, and this is the representation that we use in this chapter. A more elegant approach comes from viewing a Dirichlet process as a neutral to the right prior. This method is discussed in Chapter 10.

Since a Dirichlet process is also a Polya tree, we begin with a proposition indicating that a Polya tree prior can be easily updated in the presence of partial information. The proof is straightforward and omitted.

Proposition 9.2.1. Let $\mu$ be a $PT(\mathcal T,\alpha)$ prior. Given $P$, let $X_1,X_2,\dots,X_n$ be i.i.d. $P$. The posterior given $I_{B_1}(X_1),I_{B_2}(X_2),\dots,I_{B_n}(X_n)$ is again a Polya tree with respect to $\mathcal T$ and with parameters $\alpha'_B=\alpha_B+\#\{i:B_i\subset B\}$.

Let $Z=(Z_1,Z_2,\dots,Z_n)$, where $Z_1<\dots<Z_n$. Consider the sequence of nested partitions $\{\pi_m(Z)\}_{m\ge1}$ given by
$$\pi_1(Z):\ B_0=(0,Z_1],\quad B_1=(Z_1,\infty)$$
$$\pi_2(Z):\ B_{00},\ B_{01},\ B_{10}=(Z_1,Z_2],\quad B_{11}=(Z_2,\infty)$$
and, for $l\le n-1$,
$$\pi_{l+1}(Z):\ B_{0_l0},\ B_{0_l1},\ \dots,\ B_{1_l0}=(Z_l,Z_{l+1}],\quad B_{1_l1}=(Z_{l+1},\infty)$$
where $1_l$ is a string of 1s of length $l$ and $0_l$ is a string of 0s of length $l$. The remaining $B$s are arbitrarily partitioned into two intervals each, in such a way that $\{\pi_m(Z)\}_{m\ge1}$ forms a sequence of nested partitions generating $\mathcal B(R^+)$.

Let $\alpha_{\epsilon_1,\dots,\epsilon_l}=\alpha(B_{\epsilon_1,\dots,\epsilon_l})$ and $C^n_{\epsilon_1,\dots,\epsilon_l}=\sum_{\delta_i=0}I[(Z_i,\infty)\subset B_{\epsilon_1,\dots,\epsilon_l}]$. Also, let
$$U_i=\#\bigl\{(Z_j,\delta_j):Z_j>Z_{(i)},\ \delta_j=1\bigr\}$$
be the number of uncensored observations strictly larger than $Z_{(i)}$. Similarly denote by $C_i$ the number of censored observations that are greater than or equal to $Z_{(i)}$, i.e.,
$$C_i=\#\bigl\{(Z_j,\delta_j):Z_j\ge Z_{(i)},\ \delta_j=0\bigr\}$$
Here $n_i=C_i+U_{i-1}$ is the number of subjects alive at time $Z_{(i)}$ and $n^+_i=C_i+U_i$ is the number of subjects who survive beyond $Z_{(i)}$. To evaluate the posterior given $(z_1,\delta_1),\dots,(z_n,\delta_n)$, first look at the posterior given all the uncensored observations among $(z_1,\delta_1),\dots,(z_n,\delta_n)$. Since the prior on $M(\mathcal X)$, the space of all distributions for $X$, is $D_\alpha$, the posterior on $M(\mathcal X)$ is Dirichlet with parameter $\alpha+\sum_{i:\delta_i=1}\delta_{Z_i}$.
Because a Dirichlet process is a Polya tree with respect to every partition, it is so with respect to $\mathcal T(Z)$. Proposition 9.2.1 easily leads to the updated parameters $\alpha'_{\epsilon_1,\epsilon_2,\dots,\epsilon_k}$. We summarize these observations in the following theorem.

Theorem 9.2.1. Let $\mu=D_\alpha\times\delta_{G_0}$ be the prior on $M(R^+)\times M(R^+)$. Then the posterior distribution $\mu_1(\cdot\,|\,(z_1,\delta_1),\dots,(z_n,\delta_n))$ is a Polya tree process with partitions $\pi^{(Z,\delta)}_n$ and parameters $\alpha^{(Z,\delta)}_n=\{\acute\alpha_{\epsilon_1,\dots,\epsilon_l}\}$, where $\acute\alpha_{\epsilon_1,\dots,\epsilon_l}=\alpha_{\epsilon_1,\dots,\epsilon_l}+U_{\epsilon_1,\dots,\epsilon_l}+C^n_{\epsilon_1,\dots,\epsilon_l}$, with $U_{\epsilon_1,\dots,\epsilon_l}$ the number of uncensored observations in $B_{\epsilon_1,\dots,\epsilon_l}$.

Remark 9.2.1. Note that if $B_{\epsilon_1,\dots,\epsilon_l}=(Z_k,\infty)$ then
$$\acute\alpha_{\epsilon_1,\dots,\epsilon_l}=\alpha(B_{\epsilon_1,\dots,\epsilon_l})+\text{number of individuals surviving at time }Z_k$$
and for every other $B_{\epsilon_1,\dots,\epsilon_l}$,
$$\acute\alpha_{\epsilon_1,\dots,\epsilon_l}=\alpha(B_{\epsilon_1,\dots,\epsilon_l})+\text{number of uncensored observations in }B_{\epsilon_1,\dots,\epsilon_l}$$

The representation immediately allows us to find the Bayes estimate of the survival function $\bar F=1-F$. Fix $t>0$ and let $Z_{(k)}\le t<Z_{(k+1)}$. Then, with $Z_{(0)}=0$,
$$\bar F(t)=\prod_{i=1}^k\frac{\bar F(Z_{(i)})}{\bar F(Z_{(i-1)})}\cdot\frac{\bar F(t)}{\bar F(Z_{(k)})}\qquad(9.1)$$
A bit of reflection shows that Theorem 9.2.1 continues to hold if we change the partition to include $t$, i.e., partition $B_{1_k}$ into $(Z_{(k)},t]$ and $(t,\infty)$, and then continue as before. Thus the factors in (9.1) are independent beta variables, and $\hat{\bar F}(t)=E(\bar F(t)\,|\,(Z_i,\delta_i):1\le i\le n)$ is seen to be
$$\hat{\bar F}(t)=\prod_{i=1}^k\frac{\alpha(Z_{(i)},\infty)+U_i+C_i}{\alpha(Z_{(i-1)},\infty)+U_{i-1}+C_i}\cdot\frac{\alpha(t,\infty)+U_t+C_t}{\alpha(Z_{(k)},\infty)+U_k+C_t}\qquad(9.2)$$
Rewrite expression (9.2) as
$$\prod_{i=1}^k\frac{\alpha(Z_{(i)},\infty)+U_i+C_i}{\alpha(Z_{(i)},\infty)+U_i+C_{i+1}}\cdot\frac{\alpha(t,\infty)+U_t+C_t}{\alpha(0,\infty)+n}\qquad(9.3)$$
If the censored observations and the uncensored observations are distinct (as would be the case if $F$ and $G$ have no common discontinuity), then at any $Z_{(i)}$ that is an
uncensored value, $C_i=C_{i+1}$ and the corresponding factor in (9.3) is 1. Thus (9.3) can be rewritten as
$$\prod_{Z_{(i)}\le t,\ \delta_{(i)}=0}\frac{\alpha(Z_{(i)},\infty)+U_i+C_i}{\alpha(Z_{(i)},\infty)+U_i+C_{i+1}}\cdot\frac{\alpha(t,\infty)+U_t+C_t}{\alpha(0,\infty)+n}\qquad(9.4)$$
This is the expression obtained by Susarla and Van Ryzin [154]. The expression is a bit misleading because it appears that the estimate, unlike the Kaplan-Meier estimate, is a product over censored values. Keeping in mind that $C_t=C_{k+1}$, it is easy to see that if $t$ is a censored value, then the expression is left-continuous at $t$, and being a survival function it is hence continuous at $t$. Similarly, it can be seen that the expression has jumps at uncensored observations. Thus the expression can be rewritten as a product over censored observations times a continuous function. This form appears in Chapter 10.

As $\alpha(0,\infty)\to0$, (9.2) goes to
$$\prod_{i=1}^k\frac{U_i+C_i}{U_{i-1}+C_i}\cdot\frac{U_t+C_t}{U_k+C_k}\qquad(9.5)$$
If $Z_{(i)}$ is uncensored then $U_i+C_i=n^+_i$ and $U_{i-1}+C_i=n_i$. If $Z_{(i)}$ is censored then $U_i+C_i=U_{i-1}+C_i$, and we get the usual Kaplan-Meier estimate.
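A minimal numerical sketch of the Bayes estimate (9.4) and of its Kaplan-Meier limit follows. It is not from the text: the exponential base measure, the simulated data, and the indexing via the counts $U$ and $C$ are illustrative choices following the reconstruction above, with censored and uncensored observation values assumed distinct.

```python
import numpy as np

def surv_counts(z, d, t):
    """U_t = # uncensored observations > t; C_t = # censored observations >= t."""
    z, d = np.asarray(z, float), np.asarray(d, int)
    return np.sum((d == 1) & (z > t)), np.sum((d == 0) & (z >= t))

def susarla_van_ryzin(z, d, t, alpha_tail, alpha_total):
    """Bayes estimate (9.4) of F-bar(t); alpha_tail(s) = alpha(s, inf), alpha_total = alpha(0, inf)."""
    z, d = np.asarray(z, float), np.asarray(d, int)
    n = len(z)
    Ut, Ct = surv_counts(z, d, t)
    est = (alpha_tail(t) + Ut + Ct) / (alpha_total + n)
    for zi in np.sort(z):
        if zi <= t and d[np.where(z == zi)[0][0]] == 0:     # censored observation at zi
            Ui, Ci = surv_counts(z, d, zi)
            Ci_next = np.sum((d == 0) & (z > zi))            # C_{i+1}: censored strictly beyond zi
            est *= (alpha_tail(zi) + Ui + Ci) / (alpha_tail(zi) + Ui + Ci_next)
    return est

rng = np.random.default_rng(0)
x = rng.exponential(1.0, 20); y = rng.uniform(0, 2, 20)      # lifetimes and censoring times
z = np.minimum(x, y); d = (x <= y).astype(int)
c = 1.0                                                       # total prior mass alpha(0, inf)
alpha_tail = lambda s: c * np.exp(-s)                         # alpha = c * Exp(1) base measure
print(susarla_van_ryzin(z, d, 1.0, alpha_tail, c))            # Bayes estimate of F-bar(1)
print(susarla_van_ryzin(z, d, 1.0, lambda s: 0.0, 0.0))       # c -> 0 limit: Kaplan-Meier value
```

Letting the total prior mass go to zero reproduces the Kaplan-Meier value, in line with the Susarla and Van Ryzin result quoted above.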
We next turn to consistency.
Theorem 9.2.2. Let $F_0$ and $G$ have the same support and no common point of discontinuity. Then for any $t>0$,

(i) $E(\bar F(t)\,|\,(Z_i,\delta_i):1\le i\le n)\to\bar F_0(t)$ a.e. $P_{F_0\times G}$; and

(ii) $V(\bar F(t)\,|\,(Z_i,\delta_i):1\le i\le n)\to0$ a.e. $P_{F_0\times G}$.

Hence the posterior of $F$ is consistent at $F_0$.

Proof. Because $F_0$ and $G$ have the same support and no common point of discontinuity, the censored and uncensored observations are distinct. Note that if $0\le b\le c$ and $a\ge0$, then $(b+a)/(c+a)\ge b/c$. Using this fact, it is easy to see that (9.2) is larger than (9.5), and hence
$$\liminf_{n\to\infty}E(\bar F(t)\,|\,(Z_i,\delta_i):1\le i\le n)\ge\bar F_0(t)\quad\text{a.e. }P_{F_0\times G}$$
On the other hand, writing (9.4) as $A_n(t)B_n(t)$, where
$$A_n(t)=\frac{\alpha(t,\infty)+U_t+C_t}{\alpha(0,\infty)+n},\qquad B_n(t)=\prod_{Z_{(i)}\le t,\ \delta_{(i)}=0}\frac{\alpha(Z_{(i)},\infty)+U_i+C_i}{\alpha(Z_{(i)},\infty)+U_i+C_{i+1}}$$
it is easy to see that $A_n(t)\to\bar F_0(t)\bar G(t)$ and
$$(B_n(t))^{-1}\ge\prod_{Z_{(i)}\le t,\ \delta_{(i)}=0}\frac{U_i+C_{i+1}}{U_i+C_i}$$
The right side of the last expression is the Kaplan-Meier estimate of $\bar G$, and so
$$\liminf_{n\to\infty}(B_n(t))^{-1}\ge\bar G(t)$$
and
$$\limsup_{n\to\infty}B_n(t)\le(\bar G(t))^{-1}$$
so that
$$\limsup_{n\to\infty}A_n(t)B_n(t)\le\bar F_0(t)$$
Since the factors in (9.1) are beta variables, it is easy to write $E(\bar F^2(t)\,|\,(Z_i,\delta_i):1\le i\le n)$. A bit of tedious calculation will show that
$$E(\bar F^2(t)\,|\,(Z_i,\delta_i):1\le i\le n)\to\bar F^2_0(t)$$
We leave the details to the reader.
9.3 Cumulative Hazard Function, Identifiability
Let $F$ be a distribution function on $(0,\infty)$. The survival function $\bar F=1-F$ is then decreasing, right-continuous, and satisfies $\lim_{t\to0}\bar F(t)=1$ and $\lim_{t\to\infty}\bar F(t)=0$. We will often write $F(A)$, $\bar F(A)$ for the probability of a set $A$ under the probability measure corresponding to the distribution function $F$. Thus $F\{t\}=\bar F\{t\}=F(t)-F(t-)=\bar F(t-)-\bar F(t)$ is the probability of $\{t\}$.

A concept of importance in survival analysis is the failure rate and the related cumulative hazard function. For the distribution function $F$ of a discrete probability, a
natural expression for the hazard rate at $s$ is $F\{s\}/\bar F(s-)$. Summing this over $s\le t$ gives a notion of the cumulative hazard function for a discrete $F$ at $t$:
$$H(F)(t)=\sum_{s\le t}\frac{F\{s\}}{\bar F(s-)}=\int_0^{(\cdot)}\frac{dF(s)}{\bar F(s-)}$$
Extending this notion, the cumulative hazard function for a general $F$ is defined by
$$H(F)(\cdot)=\int_0^{(\cdot)}\frac{dF(s)}{\bar F(s-)}$$
More precisely, let $F\in\mathcal F$ and let $T_F=\inf\{t:F(t)=1\}$. Note that $T_F$ may be $\infty$. Set
$$H(F)(t)=H_F(t)=\begin{cases}\displaystyle\int_{(0,t]}\frac{dF(s)}{F[s,\infty)}&\text{for }t\le T_F\\[2mm]H_F(T_F)&\text{for }t>T_F\end{cases}$$

1. Let $\{s_1,s_2,\dots\}$ be a dense subset of $(0,\infty)$. For each $n$, let $s^{(n)}_1<\dots<s^{(n)}_n$ be an ordering of $\{s_1,\dots,s_n\}$. Let $s^{(n)}_0=0$ and define
$$H^n_F(t)=\begin{cases}\displaystyle\sum_{s^{(n)}_i<t}\frac{F(s^{(n)}_i,s^{(n)}_{i+1}]}{F(s^{(n)}_i,\infty)}&\text{for }t\le T_F\\[2mm]H^n_F(T_F)&\text{for }t>T_F\end{cases}$$
Then, for all $t$, $H^n_F(t)\to H_F(t)$ as $n\to\infty$.

2. $H_F$ is nondecreasing and right-continuous. The fact that $H_F$ is nondecreasing follows trivially because $F$ is nondecreasing. To see that $H_F$ is right-continuous, fix a point $t$ and note that if $j=\max\{i\le n:s^{(n)}_i<t\}$, then
$$H_F(t+)-H_F(t)=\lim_{n\to\infty}\frac{F(s^{(n)}_{j+1},s^{(n)}_{j+2}]}{F(s^{(n)}_{j+1},\infty)}$$
where both $\{s^{(n)}_{j+1}\}$ and $\{s^{(n)}_{j+2}\}$ are nonincreasing sequences converging to $t$ from above. Thus $F(s^{(n)}_{j+1},s^{(n)}_{j+2}]\to0$ as $n\to\infty$. If $t<T_F$, then the denominator on the right-hand side of the equation is positive for some $n$, and hence right-continuity follows. For $t\ge T_F$ it follows from the definition.
It is easy to see that $H_F(t)<\infty$ for every $t<T_F$. As with $F$, we will think of $H_F$ simultaneously as a function and a measure. Thus the measure of any interval $(s,t]$ under $H_F$ will be defined as $H_F(s,t]=H_F(t)-H_F(s)$. For $T_F<s<t$, define $H_F(s,t]=0$.

3. For any $t$, $H_F$ has a jump at $t$ iff $F$ has a jump at $t$, i.e., $\{t:H_F\{t\}>0\}=\{t:F\{t\}>0\}$.

4. It follows from the preceding that

(a) $T_F=\inf\{t:H_F(t)=\infty\ \text{or}\ H_F\{t\}=1\}$,

(b) $H_F\{t\}\le1$ for all $t$,

(c) $H_F(T_F)=\infty$ if $T_F$ is a continuity point of $F$, and

(d) $H_F\{T_F\}=1$ if $F\{T_F\}>0$.

These and other properties of $H$, with details, can be found in Gill and Johansen [90].
Let $\mathcal A^*$ be the space of all functions on $[0,\infty)$ that are nondecreasing, right-continuous, may at any finite point be infinite, but have jumps no greater than one, i.e., among such nondecreasing right-continuous functions $B$,
$$\mathcal A^*=\{B\,:\,B\{t\}\le1\ \text{for all }t\}$$
Equip $\mathcal A^*$ with the smallest $\sigma$-algebra under which the maps $\{A\mapsto A(t),\ t\ge0\}$ are measurable. $H$ maps $\mathcal F$ into $\mathcal A^*$ and $H$ is measurable. The actual range of $H$, which we will now describe, is smaller.

For $A\in\mathcal A^*$, let $T_A=\inf\{t:A(t)=\infty\ \text{or}\ A\{t\}=1\}$. Let $\mathcal A$ be the space of all cumulative hazard functions on $[0,\infty)$. Formally define $\mathcal A$ as
$$\mathcal A=\{A\in\mathcal A^*\,|\,A(t)=A(T_A)\ \text{for all }t\ge T_A\}$$
Endow $\mathcal A$ with the $\sigma$-algebra which is the restriction of the $\sigma$-algebra on $\mathcal A^*$ to $\mathcal A$. The map $H$ is a 1-1 measurable map from $\mathcal F$ onto $\mathcal A$ and, in fact, has an inverse [see Gill and Johansen [90]]. We consider this inverse map next and briefly summarize its properties.

Let $A\in\mathcal A^*$. Let $\{s_1,s_2,\dots\}$ be dense in $(0,\infty)$. For each $n$, let $s^{(n)}_1<\dots<s^{(n)}_n$ be as before. Fix $s<t$. If $A(t)<\infty$, define the product integral of $A$ by
$$\prod_{(s,t]}(1-dA)=\lim_{n\to\infty}\prod_{s<s^{(n)}_i\le t}\bigl(1-A(s^{(n)}_{i-1},s^{(n)}_i]\bigr)$$
where $A(a,b]=A(b)-A(a)$ for $a<b$. If $A(t)=\infty$ and $A(s)<\infty$, set $\prod_{(s,t]}(1-dA)=0$. If $A(s)=\infty$, set $\prod_{(s,t]}(1-dA)=1$.

Theorem 9.3.1. Let $A\in\mathcal A$. Then $F$ given by
$$F(t)=1-\prod_{(0,t]}(1-dA)$$
is an element of $\mathcal F$. Further,
$$A(t)=\int_{(0,t]}\frac{dF(s)}{F[s,\infty)}$$

The following properties of the product integral are included to lend the reader a better understanding of the nature of the map $H$ and will be useful later. For details, we again refer to Gill and Johansen [90].

5. Like $H$, the product integral also has an explicit expression:
$$\prod_{(0,t]}(1-dA)=\exp(-A_c(t))\prod_{s\le t}(1-A\{s\})$$
where $A_c$ is the continuous part of $A$. (A small numerical illustration of this map and its inverse is given after this list.)

6. Let $\rho_S$ denote the Skorokhod metric on $D[0,\infty)$ and let $\{H_n\}$ be a sequence in $\mathcal A$. Say that $\rho_S(H_n,A)\to0$ for some $A\in\mathcal A$ as $n\to\infty$ if $\rho_S(H^T_n,A^T)\to0$ for all $T>0$, where $H^T_n$ and $A^T$ are the restrictions of $H_n$ and $A$ to $[0,T]$. It may be shown, following Hjort ([100], Lemma A.2, pp. 1290-91), that if $\{H_n\},A\in\mathcal A$ and $\rho_S(H_n,A)\to0$, then $H^{-1}(H_n)\xrightarrow{w}H^{-1}(A)$. Thus, if $\mathcal A$ is endowed with the Skorokhod metric, then $H^{-1}$ is a continuous map.
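As a quick numerical illustration (not from the text) of the map $H$ and the product integral, the sketch below computes $H_F$ for a small discrete distribution and then recovers $\bar F$ through property 5; for a purely discrete $A$ the continuous part $A_c$ vanishes, so only the product over jumps remains. The particular support and masses are arbitrary.

```python
from fractions import Fraction as Fr

support = [1, 2, 3]
mass = {1: Fr(1, 2), 2: Fr(1, 4), 3: Fr(1, 4)}   # a discrete F on {1, 2, 3}

def surv_left(t):                                 # F̄(t-) = F[t, ∞)
    return sum(m for s, m in mass.items() if s >= t)

# cumulative hazard: H_F(t) = sum_{s <= t} F{s} / F̄(s-)
H = {s: sum(mass[u] / surv_left(u) for u in support if u <= s) for s in support}

# product integral: F̄(t) = prod_{s <= t} (1 - ΔH(s)) since A_c = 0 here
Fbar = {}
for s in support:
    prod = Fr(1)
    for u in support:
        if u <= s:
            prod *= (1 - mass[u] / surv_left(u))
    Fbar[s] = prod

print(H)      # {1: 1/2, 2: 1, 3: 2}: note the jump of size 1 at T_F, as in property 4(d)
print(Fbar)   # recovers F̄(1) = 1/2, F̄(2) = 1/4, F̄(3) = 0
```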
Let $F$ be a distribution function. In the literature
$$A(F)=-\log\bar F$$
is also used to formalize the notion of the "cumulative hazard function of $F$." $A$ arises by defining the hazard rate at $s$ for a continuous random variable as
$$r(s)=\lim_{\Delta s\to0}\frac1{\Delta s}\frac{P(s\le X<s+\Delta s)}{P(X\ge s)}$$
If $X$ has a distribution $F$ with density $f$, then $r(s)=f(s)/\bar F(s)$, and if the cumulative hazard function is defined as $\int_0^{(\cdot)}r(s)\,ds$, then this gives $A(F)=-\log\bar F(\cdot)$. One extends the definition for a discrete $F$ formally to give $A$.

It is easy to see that the two definitions coincide when $F$ is continuous. However, in estimating a survival function or a cumulative hazard function one typically employs a discontinuous estimate. Further, priors like the Dirichlet sit on discrete distributions. The nature of the map, therefore, plays an important role in inference about lifetime distributions and hazard rates. For us, the cumulative hazard function of a distribution will be $H(F)$.

We next turn to identifiability of $(F,G)$ from $(Z,\delta)$. As before, let $X$ and $Y$ be independent with $X\sim F$, $Y\sim G$. Let $T(x,y)=(z,\delta)=(x\wedge y,I(x\le y))$ and denote by $\bar T(F,G)$ the distribution of $T$ when $X\sim F$, $Y\sim G$.

$\bar T(F,G)$ is thus a probability measure on $\mathcal T=(0,\infty)\times\{0,1\}$. Any probability measure $P$ on $\mathcal T$ gives rise to two subsurvival functions,
$$S^0(t)=P\bigl((t,\infty)\times\{0\}\bigr)\quad\text{and}\quad S^1(t)=P\bigl((t,\infty)\times\{1\}\bigr)$$
These satisfy
$$S^0(0+)+S^1(0+)=1,\qquad S^i(t)\ \text{decreasing in }t,\qquad\lim_{t\to\infty}S^i(t)=0\qquad(9.6)$$
Conversely, any pair of subsurvival functions satisfying (9.6) corresponds to a probability on $\mathcal T$. The following proposition, due to Peterson [138], shows that under mild assumptions $F$ and $G$ can be recovered from $\bar T(F,G)$.
Proposition 9.3.1. Assume that $F$ and $G$ have no common points of discontinuity. Let $\bar T(F,G)=(S^0,S^1)$. Then for any $t$ such that $S^i(t)>0$, $i=0,1$:

1.
$$H_F(t)=\int_{(0,t]}\frac{dS^1(s)}{S^0(s-)+S^1(s-)}\qquad(9.7)$$

2.
$$\bar F(t)=e^{-\int_0^t\frac{dS^1_c(s)}{S^0(s-)+S^1(s-)}}\prod_{s\le t,\ S^1\{s\}>0}\Bigl(1-\frac{S^1\{s\}}{S^0(s-)+S^1(s-)}\Bigr)\qquad(9.8)$$

3.
$$\sup_t\bigl[|F_n(t)-F(t)|+|G_n(t)-G(t)|\bigr]\to0\quad\text{iff}\quad\sup_t\bigl[|S^0_n(t)-S^0(t)|+|S^1_n(t)-S^1(t)|\bigr]\to0\qquad(9.9)$$

A similar expression holds for $\bar G$. Thus, if we assume that $F$ and $G$ have no common points of discontinuity and have the same support, then both $F$ and $G$ can be recovered from $\bar T(F,G)$.
9.4 Priors via Distributions of (Z, δ)
It might be argued that in the censoring context, subjective judgments such as exchangeability are to be made on the observables $(Z,\Delta)$ and would hence lead to priors for the distribution of $(Z,\Delta)$. The model of independent censoring can then be used to transfer this prior to the distribution of the lifetime $X$.

Formally, let $M_0\subset M(\mathcal X)\times M(\mathcal Y)$ be the class of all pairs of distribution functions $(F,G)$ such that

1. $F$ and $G$ have no points of discontinuity in common, and

2. for all $t\ge0$, $F(t)<1$ and $G(t)<1$.

Denote by $T$ the function $T(x,y)=(x\wedge y,I_{x\le y})$ and by $\bar T$ the function on $M(\mathcal X\times\mathcal Y)$ defined by $\bar T(P,Q)=(P\times Q)\circ T^{-1}$, i.e., $\bar T(P,Q)$ is the distribution of $T$ under $(P,Q)$. Let $\bar M_0=\bar T(M_0)$. From the last section we know that on $M_0$, $\bar T$ is 1-1. Note that every prior on $M_0$ gives rise to a prior on $\bar M_0$ via $\bar T$, and every prior on $\bar M_0$ induces a prior on $M_0$ through $(\bar T)^{-1}$.

Theorem 9.4.1. Let $\bar\Pi$ be a prior on $\bar M_0$ and let $\Pi$ be the prior on $M_0$ induced through $(\bar T)^{-1}$.

(i) If $\bar\Pi(\cdot\,|\,(Z_i,\delta_i):1\le i\le n)$ on $\bar M_0$ is weakly consistent at $\bar T(P_0,Q_0)$, and $(P_0,Q_0)$ is continuous, then the posterior $\Pi(\cdot\,|\,(Z_i,\delta_i):1\le i\le n)$ on $M_0$ is weakly consistent at $(P_0,Q_0)$.

(ii) If $\bar\Pi(U\,|\,(Z_i,\delta_i):1\le i\le n)\to1$ for $U$ of the form
$$U=\Bigl\{(S^0,S^1):\sup_t\bigl[|S^0(t)-S^0_0(t)|+|S^1(t)-S^1_0(t)|\bigr]<\epsilon\Bigr\}$$
(here $(S^0_0,S^1_0)$ are the subsurvival functions corresponding to $\bar T(P_0,Q_0)$), then the posterior $\Pi(\cdot\,|\,(Z_i,\delta_i):1\le i\le n)$ on $M_0$ is weakly consistent at $P_0$.

Proof. (i) follows immediately from the fact that for continuous distributions the neighborhoods arising from the supremum metric and weak neighborhoods coincide (see Proposition 2.5.3). The second assertion follows from the continuity property described in Proposition 9.3.1 and by noting that $\Pi(\cdot\,|\,(Z_i,\delta_i):1\le i\le n)$ on $M_0$ is just the distribution of $(\bar T)^{-1}$ under $\bar\Pi(\cdot\,|\,(Z_i,\delta_i):1\le i\le n)$.
We have so far not demonstrated any prior on $\bar M_0$. We next argue that it is in fact possible to obtain a Dirichlet prior on $M(\mathcal T)$ that gives mass 1 to $\bar M_0$.

Theorem 9.4.2. Let $\alpha$ be a probability measure on $\mathcal T=(0,\infty)\times\{0,1\}$ and let $(S^0_\alpha,S^1_\alpha)$ be the corresponding subsurvival functions. Assume

(a) $S^0_\alpha$ and $S^1_\alpha$ have the same support and have no common points of discontinuity; and

(b) for $i=0,1$, $H^i(t)=\int_{(0,t]}dS^i_\alpha(s)/\bigl(S^0_\alpha(s)+S^1_\alpha(s)\bigr)$ satisfies
$$\lim_{t\to\infty}H^i(t)=\infty\quad\text{for }i=0,1$$
Then for any $c>0$, $D_{c\alpha}(\bar M_0)=1$.

Proof. We will work with pairs of random subsurvival functions rather than with random probabilities on $\mathcal T$. We will show that with $D_{c\alpha}$ probability 1,

(a) $S^0$ and $S^1$ have the same support and have no common points of discontinuity; and

(b) for $i=0,1$, $\int_{(0,\infty)}dS^i(s)/\bigl(S^0(s)+S^1(s)\bigr)=\infty$.

That (a) holds with probability 1 is immediate from assumption (a). For (b), let $t_1,t_2,\dots$, continuity points of $S^0_\alpha$, be such that
$$\sum_i\frac{S^1_\alpha(t_{i-1},t_i]}{S^0_\alpha(t_{i-1})+S^1_\alpha(t_{i-1})}=\infty$$
Such $t_i$s can be chosen by first choosing $s_i$ with $H^1(s_i)\uparrow\infty$ and then choosing the $t_j$s in $(s_{i-1},s_i]$ with
$$\sum_{t_j\in(s_{i-1},s_i]}\frac{S^1_\alpha(t_{j-1},t_j]}{S^0_\alpha(t_{j-1})+S^1_\alpha(t_{j-1})}\ge H^1(s_i)-H^1(s_{i-1})-2^{-i}$$
Let $Y_i=S^1(t_{i-1},t_i]/\bigl(S^0(t_{i-1})+S^1(t_{i-1})\bigr)$; clearly $\sum_iY_i\le\int dS^1(s)/(S^0(s)+S^1(s))$. Further, the $Y_i$s are bounded by 1 and, under the Dirichlet, are independent. Note that $S^0(t_{i-1})+S^1(t_{i-1})$ and $Y_i$ are independent, and hence
$$E(Y_i)=\frac{S^1_\alpha(t_{i-1},t_i]}{S^0_\alpha(t_{i-1})+S^1_\alpha(t_{i-1})}$$
Assumption (b) guarantees $\sum E(Y_i)=\infty$. This in turn gives $\sum Y_i=\infty$ almost surely [see Loeve [132], p. 248].

In addition to consistency, if the empirical distribution of $(Z,\Delta)$ is a limit of Bayes estimates on $\bar M_0$, then the Kaplan-Meier estimate is correspondingly a limit of the induced Bayes estimates on $M_0$. This method of constructing priors on $M_0$ is appealing and merits further investigation; for instance, the Dirichlet process on $\bar M_0$ arises through a Polya urn scheme, and it would be of interest to see the corresponding process for the induced prior.
9.5 Interval Censored Data
Susarla and Van Ryzin showed that the Kaplan-Meier estimate, which is also the nonparametric MLE, is the limit of Bayes estimates with a $D_\alpha$ prior for the distribution of $X$. The observations in this section show that this result does not carry over to other kinds of censored data.

Here our observation consists of $n$ pairs $(L_i,R_i]$, $1\le i\le n$, where $L_i\le R_i$ and $(L_i,R_i]$ corresponds to the information $X\in(L_i,R_i]$. We assume that the $(L_i,R_i]$, $1\le i\le n$, are independent and that the underlying censoring mechanism is independent of the lifetime $X$, so that the posterior distribution depends only on the $(L_i,R_i]$, $1\le i\le n$.

Let $t_1<t_2<\dots<t_{k+1}$ denote the endpoints of the $(L_i,R_i]$, $1\le i\le n$, arranged in increasing order, and let $I_j=(t_j,t_{j+1}]$. For simplicity we assume that $t_1=\min_iL_i$ and $t_{k+1}=\max_iR_i$.

Our starting point is a Dirichlet prior $D(c\alpha_1,c\alpha_2,\dots,c\alpha_k)$ for $(p_1,p_2,\dots,p_k)$, where $p_j=P\{X\in I_j\}$. Turnbull [159] suggested the use of the nonparametric maximum likelihood estimate obtained from the likelihood function
$$\prod_{i=1}^n\ \sum_{I_j\subset(L_i,R_i]}p_j$$
If $(p_1,p_2,\dots,p_k)$ has a $D(c\alpha_1,c\alpha_2,\dots,c\alpha_k)$ prior, then the posterior distribution of $(p_1,p_2,\dots,p_k)$ given $(L_i,R_i]$, $1\le i\le n$, is a mixture of Dirichlet distributions.
Call a vector $a=(a_1,a_2,\dots,a_n)$, where each $a_i$ is an integer, an imputation of $(L_i,R_i]$, $1\le i\le n$, if $I_{a_i}\subset(L_i,R_i]$. For an imputation $a$, let $n_j(a)$ be the number of observations assigned to the interval $I_j$; formally, $n_j(a)=\#\{i:a_i=j\}$.

Let the order $O(a)$ of an imputation be $\#\{j:n_j(a)>0\}$. Let $\mathcal A$ be the set of all imputations of $(L_i,R_i]$, $1\le i\le n$, and let $m=\min_{a\in\mathcal A}O(a)$. Call an imputation $a$ minimal if $O(a)=m$.

It is not hard to see that the posterior distribution of $(p_1,p_2,\dots,p_k)$ given $(L_i,R_i]$, $1\le i\le n$, is
$$\sum_{a\in\mathcal A}C_a\,D\bigl(c\alpha_1+n_1(a),c\alpha_2+n_2(a),\dots,c\alpha_k+n_k(a)\bigr)$$
where
$$C_a=\frac{\prod_1^k\Gamma(c\alpha_j+n_j(a))}{\sum_{a'\in\mathcal A}\prod_1^k\Gamma(c\alpha_j+n_j(a'))}$$
The Bayes estimate of any $p_j$ is
$$\hat p_j=\sum_{a\in\mathcal A}C_a\,\frac{c\alpha_j+n_j(a)}{c+n}$$
As $c\to0$, $(c\alpha_j+n_j(a))/(c+n)\to n_j(a)/n$. The behavior of $C_a$ is given by the next proposition.

Proposition 9.5.1. $\lim_{c\to0}C_a>0$ iff $a$ is a minimal imputation.

Proof. Suppose $a$ is not minimal. Let $a_0$ be an imputation with $O(a)>O(a_0)$. Then
$$C_a\le\frac{\prod_1^k\Gamma(c\alpha_j+n_j(a))}{\prod_1^k\Gamma(c\alpha_j+n_j(a_0))}=\frac{\prod_{j=1}^k\Gamma(c\alpha_j)}{\prod_{j=1}^k\Gamma(c\alpha_j)}\cdot\frac{\prod_{j:n_j(a)\ne0}\prod_{i=0}^{n_j(a)-1}(c\alpha_j+i)}{\prod_{j:n_j(a_0)\ne0}\prod_{i=0}^{n_j(a_0)-1}(c\alpha_j+i)}$$
Since $O(a)>O(a_0)$, the ratio goes to 0 as $c\to0$. Conversely, if $a$ is minimal, it is easy to see that
$$\frac1{C_a}=\sum_{a'\in\mathcal A}\frac{\prod_1^k\Gamma(c\alpha_j+n_j(a'))}{\prod_1^k\Gamma(c\alpha_j+n_j(a))}$$
converges to a positive limit.
Thus the limiting behavior is determined by minimal imputations. A few examples clarify these notions.

Example 9.4.1. Consider the right censoring case, i.e., for each $i$ either $L_i=R_i$ or $R_i=t_{k+1}$. Any minimal imputation is obtained by assigning compatible observations to the singletons corresponding to uncensored observations, and to $I_k$ if the last (largest) observation is censored.

Example 9.4.2. Consider the case when we have current status, or case I interval censored, data. Here for each $i$, either $L_i=t_1$ or $R_i=t_{k+1}$, so that all we know is whether $X_i$ is to the right of $L_i$ or to the left of $R_i$.

(i) If $\max_iL_i<\min_iR_i$, the minimal imputation is the allocation of all the observations to the interval $(\max_iL_i,\min_iR_i]$.

(ii) In general, the minimal imputations have order 2. For example, a consistent assignment of the data to $(t_1,\min_iR_i]$ and $(\max_iL_i,t_{k+1}]$ would yield a minimal imputation.

A couple of simple numerical examples help clarify the different cases. In the following examples the prior of the distribution is $D_{c\alpha}$, where $\alpha$ is a probability measure. The limit is taken as $c\to0$. Corresponding to any imputation $a$, we will call the set of intervals $I_j$ for which $n_j(a)>0$ an allocation, and an allocation corresponding to a minimal imputation will be called a minimal allocation.

Example (a): This example illustrates that the limit of Bayes estimates could be supported on a much bigger set than the NPMLE. The observed data consist of the four intervals $(1,\infty)$, $(2,\infty)$, $(0,3]$, $(4,\infty)$.

The limit of Bayes estimates in this case turns out to be
$$\tilde F(0,1]=1/22,\quad\tilde F(1,2]=2/22,\quad\tilde F(2,3]=6/22,\quad\tilde F(4,\infty)=13/22,$$
while the NPMLE is given by
$$\hat F(2,3]=1/2\quad\text{and}\quad\hat F(4,\infty)=1/2.$$
In this example, each minimal allocation consists of only two subintervals.

(i) $(0,1]$ and $(4,\infty)$, with the corresponding numbers of $X_i$s in the subintervals being 1 and 3, respectively, represents a minimal allocation.

(ii) $(2,3]$ and $(4,\infty)$, with the corresponding numbers of $X_i$s in the subintervals being 1 and 3, respectively, represents another minimal allocation.

(iii) $(2,3]$ and $(4,\infty)$, with the corresponding numbers of $X_i$s in the subintervals being 2 and 2, respectively, represents yet another minimal allocation.
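The limit in Example (a) can be checked by brute-force enumeration. The following sketch (not from the text) lists all imputations of the four intervals, keeps the minimal ones, and weights each minimal imputation proportionally to $\prod_j\Gamma(n_j(a))$; the base-measure masses of the occupied cells are omitted here for simplicity (equivalently, taken equal), an assumption under which the sketch reproduces the values $1/22$, $2/22$, $6/22$ and $13/22$ stated above.

```python
from itertools import product
from math import gamma

# observed intervals (L_i, R_i]; None stands for +infinity
obs = [(1, None), (2, None), (0, 3), (4, None)]
cells = [(0, 1), (1, 2), (2, 3), (3, 4), (4, None)]   # I_1, ..., I_5 from the endpoints

def compatible(cell, o):
    """Cell (a, b] is a possible location for X when it is contained in (L, R]."""
    (L, R), (a, b) = o, cell
    return a >= L and ((R is None) or (b is not None and b <= R))

imputations = list(product(*[[j for j, c in enumerate(cells) if compatible(c, o)]
                             for o in obs]))
order = lambda a: len(set(a))
m = min(order(a) for a in imputations)
minimal = [a for a in imputations if order(a) == m]

def weight(a):                      # limit (c -> 0) weight, up to normalization
    w = 1.0
    for j in set(a):
        w *= gamma(a.count(j))
    return w

W = sum(weight(a) for a in minimal)
n = len(obs)
for j, c in enumerate(cells):
    pj = sum(weight(a) * a.count(j) / n for a in minimal) / W
    if pj > 0:
        print(c, pj)                # I_1: 1/22, I_2: 2/22, I_3: 6/22, I_5: 13/22
```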
Example (b): This example shows that the limit of Bayes estimates could be supported on a smaller set than the NPMLE. The observed data consist of the intervals $(0,1]$, $(2,\infty)$, $(0,3]$, $(0,4]$, and $(5,\infty)$.

The limit of Bayes estimates in this case turns out to be
$$\tilde F(0,1]=3/5\quad\text{and}\quad\tilde F(5,\infty)=2/5,$$
while the NPMLE is given by
$$\hat F(0,1]=1/2,\quad\hat F(2,3]=1/6,\quad\hat F(5,\infty)=1/3.$$

As $c\to0$, while Dirichlet priors lead to strange estimates for current status data, the case $c=1$ seems to present no problems. Even when $c\to0$, we expect that the limiting behavior will be more reasonable when the data are case II interval censored, in the sense described in [91]. In that case, the tendency to push the observations to the extremes would be less pronounced.

In the current status data case the limit (as $c\to0$) of the posterior itself exhibits degeneracy. The following proposition is easy to establish.

Proposition 9.5.2. Let $R^*=\inf_{i:L_i=0}R_i$ and $L^*=\sup_{i:R_i=t_{k+1}}L_i$.

(i) If $R^*<L^*$, then as $c\to0$ the posterior distribution of $P(R^*,L^*)$ converges to the measure degenerate at 0.

(ii) If $L^*<R^*$, then as $c\to0$ the posterior distribution of $P(L^*,R^*)$ converges to the measure degenerate at 1.
10
Neutral to the Right Priors
10.1 Introduction
In Chapter 3, among other aspects, we looked at two properties of Dirichlet processes-
the tail free property and the neutral to the right property. In this chapter we discuss
priors that generalize Dirichlet processes via the neutral to the right property.
Neutral to the right priors are a class of nonparametric priors that were introduced
by Doksum [48]. Historically, the concept of neutrality is due to Connor and Mosimann
[34] who considered it in the multinomial context. Doksum extended it to distributions
on the real line in the form of neutral to the right priors and showed that if Π is neutral
to the right, then the posterior given nobservations is also neutral to the right. This
result was extended to the case of right-censored data by Ferguson and Phadia [64].
These topics are discussed in Section 10.2.
Doksum and Hjort showed that a prior is neutral to the right iff the cumulative
hazard function has independent increments. Since independent increment processes
are well understood, this connection provides a powerful tool for studying neutral
to the right priors. In particular, independent increment processes have a canonical
structure, the so-called L´evy representation. The associated L´evy measure can be
used to elucidate properties of neutral to the right priors. For instance Hjort provides
an explicit expression for the posterior given nindependent observations in terms of
254 10. NEUTRAL TO THE RIGHT PRIORS
the L´evy representation when the L´evy measure is of a specific form. In Section 10.3
we summarize these results.
In Section 10.4 we discuss beta processes. Hjort [100] and Walker and Muliere [166],
respectively, developed beta processes and beta-Stacy processes, which provide con-
crete and useful classes of neutral to the right priors. These priors are analogous to
the beta prior for the Bernoulli (θ), are analytically tractable, and are flexible enough
to incorporate a wide variety of prior beliefs.
The rest of the chapter is devoted to consistency results for neutral to the right
priors. These results center around an example of Kim and Lee [114] of a neutral to
the right prior that is inconsistent at all continuous distributions.
10.2 Neutral to the Right Priors
For any F∈F,asintheChapter9 ¯
F(·)=1F(·) is the survival function corre-
sponding to F.Let ¯
F(0) = 1. We also continue to denote by F(A) the measure of
the set Aunder the probability measure corresponding to F.
Definition 10.2.1. ApriorΠonFis said to be neutral to the right if, under Π,
for all k1andall0<t
1<...< t
k,
¯
F(t1),¯
F(t2)
¯
F(t1),..., ¯
F(tk)
¯
F(tk1)
are independent.
If Π is neutral to the right, we will also refer to a random distribution function
Fwith distribution Π as being neutral to right. Note that (0/0) is defined here and
throughout to be 1.
For a fixe d F,ifXis a random variable distributed as F,thenforevery0s<t,
¯
F(t)/¯
F(s) is simply the conditional probability F(X>t|X>s). For t>0, ¯
F(t)is
viewed as the conditional probability F(X>t|X>0).
Example 10.2.1. Consider a finite ordered set {t1,...,t
n}of points in (0,). To
construct a neutral to right prior on the set Ft1,...,tnof distribution functions supported
by the points t1,...,t
n, we only need to specify (n1) independently distributed
[0,1]-valued random variables V1,...,V
n1, and then set ¯
F(ti)/¯
F(ti1)=1Vifor
1in1. Finally, set ¯
F(tn)/¯
F(tn1) = 0. Observe that ¯
F(tn) = 0 and, for
10.2. NEUTRAL TO THE RIGHT PRIORS 255
1in1,
¯
F(ti)=
i
j=1
(1 Vj)
Example 10.2.2. In a similar fashion we can construct a neutral to right prior on
the space FTof all distribution functions supported by a countable subset T={t1<
t2<...}of (0,).
Let {Vi}i1be a sequence of independent [0,1]-valued random variables such that,
for some η>0,
i1
P(Vi)=
This happens, for instance, when Vis are identically distributed with P(Vi)>0.
As before, for i1, set ¯
F(ti)/¯
F(ti1)=1Vi. In other words, ¯
F(tk)=k
i=1(1 Vi),
for all k1. By the Borel-Cantelli lemma, we have
P
i1
(1 Vi)=0
=1
This defines a neutral to right prior Π on Fbecause
lim
t→∞
¯
F(t) = lim
k→∞
k
i=1
(1 Vi)=0,a.s. Π
Dirichlet process priors of course provide a ready example of a family of neutral to
the right priors. Other examples are the beta process and beta-Stacy process , to be
discussed later.
As before, we consider the standard Bayesian set-up where Π is a prior and given F,
X1,X
2,... be i.i.d. F.Foreachn1, denote by ΠX1,...,Xna version of the posterior
distribution, i.e. the conditional distribution of Fgiven X1,...,X
n.
Following are some notations:
For n1, define the observation process Nn(.) as follows:
Nn(t)=
in
I(0,t](Xi) for all t>0
For every n1, let Nn(0) 0. Observe that Nn(.) is right-continuous on [0,). Let
Gt1...tk=σ¯
F(t1),¯
F(t2)
¯
F(t1),..., ¯
F(tk)
¯
F(tk1).
256 10. NEUTRAL TO THE RIGHT PRIORS
Thus Gt1...tkdenotes the collection of all sets of the form
D=(¯
F(t1),¯
F(t2)
¯
F(t1),..., ¯
F(tk)
¯
F(tk1))C
where C∈B
k
[0,1].
Theorem 10.2.1 (Doksum). Let Πbe neutral to the right. Then ΠX1,...,Xnis also
neutral to the right.
Proof. Fix k1 and let t1<t
2<···<t
kbe arbitrary points in (0,). Denote by
Qthe set of all rationals in (0,) and let Q=Q∪{t1,...,t
k}.Let{s1,s
2,...}be
an enumeration of Q. Observe that, for large enough m,{t1,...,t
k}⊂{s1,...,s
m}.
For su ch an m, let s(m)
1<···<s
(m)
mbe an ordering of {s1,...,s
m}.LetY(m)
i=
¯
F(s(m)
i)/¯
F(s(m)
i1) and, under Π, let Π(m)
idenote the distribution of Y(m)
i.
Let n1···nm. Then, given {Nn(s(m)
1)=n1,...,N
n(s(m)
m)=nm}, the posterior
density of (Y(m)
1,...,Y(m)
m) is written as
fY(m)
1,...,Y (m)
m(y1,...,y
m)= m
i=1(1 yi)nini1ynni
i
m
i=1(1 yi)nini1ynni
idΠ(m)
i(yi)
=
m
i=1
(1 yi)nini1ynni
i
(1 yi)nini1ynni
idΠ(m)
i(yi)
This shows that (Y(m)
1,...,Y(m)
m) are independent under the posterior given
{Nn(s(m)
1),...,N
n(s(m)
m)}. Hence,
¯
F(ti)
¯
F(ti1)=
ti1<s(m)
jti
¯
F(s(m)
j)
¯
F(s(m)
j1),i=1,...,k
are also independent under the posterior given the same information.
Now, by the right-continuity of Nn(·)wehave,asn→∞,
σ{Nn(sj),j m}↑σ{Nn(t),t0}≡σ(X1,...,X
n)
Hence, for any A∈G
t1...tk, by the martingale Convergence theorem, we have
Π(A|Nn(s(m)
1),...,N
n(s(m)
m)) Π(A|X1,...,X
n) almost surely
Since for each m, the random quantities ¯
F(t1),¯
F(t2)/¯
F(t1)..., ¯
F(tk)/¯
F(k1)are
independent given σ(Nn(s(m)
1),...,N
n(s(m)
m)), independence also holds as m→∞.
10.2. NEUTRAL TO THE RIGHT PRIORS 257
A perusal of the proof given above suggests that for any t1<t
2the posterior
distribution of ¯
F(t2)/¯
F(t1) depends on {Nn(s):t1st2}.Inwords,thepos-
terior depends on the number of observations less than t1, the exact observations
between t1and t2and the number of observations greater than t2. This was observed
by Doksum. The following theorem proved in [42] shows that this property essen-
tially characterizes neutral to the right priors. Walker and Muliere [167] have also
obtained characterizations of neutral to the right priors. Their results are presented
in a different flavor.
Theorem 10.2.2. Let Πbe a prior on Fsuch that Π{0<F(t)<1,for all t}=1.
Then the following are equivalent:
(i) Πis neutral to the right
(ii) for every t
L¯
F(t)|Π(X1,X
2,...,X
n)) = L¯
F(t)|Nn(s):0<s<t
where L(.)stands for the Law of (.).
Thus, if one wants to estimate the probability that a subject survives beyond t
years based on nsamples of which n1fell below t, then a neutral to the right prior
would lead to the same estimate if the remaining nn1observations fell just above
tor far beyond it. This is a property that is also shared by the empirical distribution
function. This suggests that neutral to the right priors are appropriate when the
interest is in all of Fand inappropriate if the interest is in a local neighborhood of a
fixed time point.
Ferguson and Phadia [64] extend Doksum’s result in the case of inclusively and
exclusively right censored observations. Let xbe a real number in (0,). Given a
distribution function F∈F,anobservationXfrom Fis said to be exclusively right
censored if we only know Xxand inclusively right-censored if we know X>x.
We state their result next. The proof is straightforward.
Theorem 10.2.3 (Ferguson and Phadia). Let Fbe a random distribution func-
tion neutral to the right. Let Xbe a sample of size one from F, and let xbe a number
in (0,). Then
(a) the posterior distribution of Fgiven X>xis neutral to the right, and
(b) the posterior distribution of Fgiven Xxis neutral to the right.
258 10. NEUTRAL TO THE RIGHT PRIORS
10.3 Independent Increment Processes
As mentioned in the introduction, neutral to the right priors relate to independent
increment process via the cumulative hazard function. To recall from Chapter 9, the
cumulative hazard function is given by
H(F)(t)=HF(t)=(0,t]
dF (s)
F[s,)for tTF
HF(TF)fort>T
F
and discussed its properties.
The next result establishes the connection between neutral to the right priors and
independent increment processes with nondecreasing paths via the map H.
Theorem 10.3.1. Let Πbe a neutral to the right prior on F. Then, under the
measure Πon Ainduced by the map H,{A(t):t>0}has independent increments.
Conversely, if Πis a probability measure on Asuch that the process {A(t):t>0}
has independent increments, then the measure induced on Fby the map
H1:A→ 1
(0,t]
(1 dA)
is neutral to the right.
Proof. First suppose that Π is neutral to the right on Fand let t1<··· <t
kbe
arbitrary points in (0,). Consider, as before, a dense set {s1,s
2,...}in (0,). Let,
for each n,s(n)
1<···<s
(n)
nbe as before.
Suppose nis large enough that s(n)
ntk. Then, for each 1 ik,wehavewith
An
Fas
An
F(ti)An
F(ti1)=
ti1<s(n)
jti
F(s(n)
j1,s
(n)
j]
F(s(n)
j1,)
=
ti1<s(n)
jti1
¯
F(s(n)
j)
¯
F(s(n)
j1)
Because for each n,¯
F(s(n)
1), ¯
F(s(n)
2)/¯
F(s(n)
1), ..., ¯
F(s(n)
n)/¯
F(s(n)
n1) are independent,
An
F(t1), An
F(t2)An
F(t1), ..., An
F(tk)An
F(tk1) are also independent. Letting n→∞,
we get that AF(t1), AF(t2)AF(t1), ..., AF(tk)AF(tk1) are independent.
10.3. INDEPENDENT INCREMENT PROCESSES 259
For the converse, suppose Πon Asuch that, under Π,{A(t):t>0}is an
independent increment process. Again, let t1<···<t
kbe arbitrary points in (0,).
Then with s(n)
1<···<s
(n)
nas before, let, for 1 ik,
¯
Fn
A(ti)=
s(n)
jti
(1 A(s(n)
j1,s
(n)
j])
If FA=H1(A), then it follows from the definition of the product integral that
¯
F(n)
A(t)¯
FA(t) for all t,asn→∞.Now,observethat,for1ik,
¯
Fn
A(ti)
¯
Fn
A(ti1)=
ti1<s(n)
jti
(1 A(s(n)
j1,s
(n)
j])
Since A(s(n)
j1,s
(n)
j],1jnare independent for each nso are ¯
Fn
A(ti)/¯
Fn
A(ti1),1
ik. Consequently, we have independence in the limit, i.e., ¯
FA(t1), ¯
FA(t2)/¯
FA(t1),
..., ¯
FA(tk)/¯
FA(tk1) are independent.
It is not hard to verify that for a neutral to the right prior Π,
EΠH(F)=H(EΠF)
Since the posterior given X1,X
2,...,X
nis again neutral to the right, the above
property continues to hold for ΠX1,X2,...,Xn. It is shown in Dey etal. [43] that in the
time discrete case the above property characterizes neutral to the right property. We
expect a similar result to hold in general.
Doksum was the first to observe a connection between neutral to the right priors
and independent increment processes. He, however, considered cumulative hazard
function defined by D(F)(t)=DF(t)=log ¯
F(t). The proof of Theorem 10.3.2 is
straightforward.
Theorem 10.3.2 (Doksum). A prior Πon Fis neutral to the right if and only
if ˜
Π=ΠD1is an independent increment process measure such that ˜
Π{H∈H:
limt→∞ H(t)=∞} =1.
The theory of neutral to the right priors owes much of its development and analytic
elegance to its connection with independent increment processes. The principal ex-
amples of general families of neutral to the right priors have been constructed via this
connection. Next, we briefly discuss the relevant theory of these processes in terms of
a representation due to P. L´evy. Following is a brief description of the representation.
260 10. NEUTRAL TO THE RIGHT PRIORS
The following facts are wellknown and can be found in , for example, Ito [104] and
Kallenberg [110].
Definition 10.3.1. A stochastic process {A(t)}t0is said to be an independent
increment process if A(0) = 0 almost surely and if, for every kand every {t0<t
1<
···<t
k}⊂[0,), the family {A(ti)A(ti1)}k
i=1 is independent.
Let Hbe a space of functions defined by
H={H|H:[0,)→ [0,],H(0) = 0,H non-decreasing, right-continuous}
(10.1)
Let B(0,)×[0,]be the Borel σ-algebra on (0,)×[0,].
Theorem 10.3.3. Let Πbe a probability on H. Under Π,{A(t):t>0}is an
independent increment process if and only if the following three conditions hold. There
exists
1 a finite or countable set M={t1,t
2,...}of points in (0,)and, for each tiM,
a positive random variable Yidefined on Hwith density fi;
2 a nonrandom continuous nondecreasing function b; and
3 a measure λon (0,)×[0,],B(0,)×[0,]that for all t>0, satisfies
(a) λ({t[0,]) = 0 and
(b) 
0<st
0u≤∞
u
1+uλ(ds du)<,
such that
A(t)=b(t)+
tit
Yi(A)+ 
0<st
0u≤∞
(ds du, A) (10.2)
where, for each A∈H,µ(·,A)is a measure on (0,)×[0,],B(0,)×[0,]such
that, under Π,µ(·,·)is a Poisson process with parameter λ(·), i.e., for arbitrary dis-
joint Borel subsets E1,...,E
kof (0,)×[0,],µ(E1,·),...,µ(Ek,·)are independent,
and
µ(Ei,·)P oisson(λ(Ei)) for 1ik
Note the following facts about independent increment processes, which will be
useful later and facilitate understanding of the remaining subject matter.
10.3. INDEPENDENT INCREMENT PROCESSES 261
(1) The measure λon (0,)×[0,] is often expressed as a family of measures
{λt:t>0}where λt(A)=λ((0,t]×A) for Borel sets A.
(2) The representation may be expressed equivalently in terms of the moment-generating
function of A(t)as
E(eθA(t))=eb(t)
titE(eθYi)exp

0<st
0u≤∞
(1 eθu)λ(ds du)
(3) The random variables Yioccurring in the decomposition arise from the jumps
of the process at fixed points. Say that tis a fixed jump-point of the process if
Π(A{t}>0) >0. It is known that there are at most countably many such fixed
jump-points, and the set Mis precisely the set of such points and that Yi=A{ti}.
(4) The random measure A→ µ(·,A) also has an explicit description. For any Borel
subset Eof (0,)×[0,],
µ(E,A)=#{(t, A{t})E:A{t}>0}
(5) Let Ac(t)=A(t)b(t)titA{ti}. Then
Ac(t)= 
0<st
0u≤∞
(du ds, A)
(6) The countable set M, the set of densities {fi:i1}, the measure λ, and the
nonrandom function bare known as the four components of the process {A(t):
t>0}, or, equivalently, of the measure Π. The measure λis known as the L´evy
measure of Π.
(7) A L´evy process Πwithout any non-random component, i.e., for which b(t)=0,
for all t>0, has sample paths that increase only in jumps almost surely Π.Most
of the L´evy processes that we encounter here will be of this type.
262 10. NEUTRAL TO THE RIGHT PRIORS
10.4 Basic Properties
Let Π be a neutral to the right prior on F. From what we have seen so far, the maps D
and Hyield independent increment process measures ˜
Π and Π, respectively. Let the
evy measures of ˜
Πan
be denoted ˜
λand λ, respectively. The next proposition
establishes a simple relationship between ˜
λand λ.
Proposition 10.4.1. Suppose ˜
λand λare as earlier. Then
1 for each t,˜
λtis the distribution of x→−log(1 x)under the measure λ
t, and
2 for each t,λ
tis the distribution of x→ 1exunder ˜
λt
Proof. The proposition is an easy consequence of the following easy fact.
If ω→ µ(·)isanM(X)-valued random measure which is a Poisson process with
parameter measure λ, then for any measurable function g:XX, the random
measure ω→ µ(g1(·)) is a Poisson process with parameter measure λg1.
Note that
D(F)(t)D(F)(t)=log F(t, )
F[t, )
=log 1F{t}
F[t, )
=log[1 (H(F)(t)H(F)(t))]
It is of interest to know if we can choose neutral to the right priors with large
support. The next proposition gives a sufficient condition that will ensure that the
support is all of F. Recall that the (topological) support Eof a measure µon a
metric space Xis the smallest closed set Ewith µ(Ec)=0.WeviewFas a metric
space under convergence in distribution.
Proposition 10.4.2. If the support of the L´evy measure λHis all of [0,)×[0,1]
then the support of Πis all of F.
Proof. We need to show that every open set (in the topology given by convergence
in distribution) has positive Π measure. Since the set of continuous distributions is
dense in F, it is enough to show that neighborhoods of continuous distributions have
positive Π measure. We will establish a stronger fact, namely, that every uniform
neighborhood has positive prior probability.
10.4. BASIC PROPERTIES 263
Let F0be a continuous distribution , A0=H(F0) be the hazard function of F0and
let U={F:sup
0<st|F(s)F0(s)|<}. In view of the last section, Ucontains a set
H1(V), where Vis of the form V={A:sup
0<st|A(s)A0(s)|}. We will show
that Π(U)>0 by showing that Π H1(V)>0.
To see this, set δ0=δ/3andchoose0=t0<0<t
1<t
2... < t
k<t
k+1 =tsuch
that for i=1,2,...,(k+1);A0(ti)A0(ti1)
0.
Recall the definition of µ(.;A). Let
W={A:µ(Ei;A)=1,i=1,2,...,k}
where Ei=(ti1,t
i]×(A0(ti)A0(ti1)δ0/k, A0(ti)A0(ti1)+δ0/k).
If ti<sti+1,
|A(s)A0(s)|
i
1|(A0(tj)A0(tj1)) (A(tj)A(tj1))|+|(A0(s)A0(ti))(A(s)A(ti))|
The first term on the right-hand side is less than 0/k and the second term is less
than 2δ0so that for every s(0,t],|A(s)A0(s)|. Hence WV.
Under the measure induced by H1, the random variables µ(Ei;A)=1,i =
0,1,2,...,k1 are independent Poisson random variables with parameters λ(Ei),i =
1,2,...,k. These are positive by assumption and hence Vhas positive Π H1mea-
sure.
Let Abe a right continuous function increasing to . A convenient class of neutral
to the right priors are those with L´evy measure λHof the form
H(x, s)=a(x, s)dA(x)ds 0<x<,0<s<1 (10.3)
with 1
0sa(x, s)ds < for all x. Without loss of generality we assume that for all
x, 1
0sa(x, s)ds = 1. This ensures that the prior expectation of A(t)isA0(t).
Every neutral to the right prior gives rise to a L´evy measure via λH. Is every L´evy
measure on R+×[0,1] obtainable as λHof a neutral to the right prior? The next
proposition answers the question for the class of measures just discussed.
Proposition 10.4.3. Let Abe H(F)for some distribution function Fand
H(x, s)=a(x, s)dA(x)ds 0<x<,0<s<1
264 10. NEUTRAL TO THE RIGHT PRIORS
such that for all x, 1
0sa(x, s)ds =1so that E(A(t)=A(t).
The function A→ (0,t](1 dA(s)) (where (0,t]stands for the product integral)
defines a neutral to the right prior on F.
Proof. It can be easily deduced from the basic properties of the product integral that
the function A→ (0,t](1 dA(s)) induces a probability measure on the set of all
functions which are right continuous and decreasing. In order to show that this is a
prior on Fwe need to verify that if ¯
F(t)=(0,t](1 dA(s)), then with probability 1
limt→∞ ¯
F(t) = 0. This follows because the property of independent increments gives
E
(0,t]
(1 dA(s)) =
(0,t]
(1 dE(A)(s)) = ¯
F
Each ¯
F(t) is decreasing in tand limt→∞ E(¯
F(t) = limt→∞ ¯
F(t)=0.
evy representation plays a central role in the study of posteriors of neutral to
the right priors. When the prior is neutral to the right, since the posterior given
X1,X
2,...,X
nis again neutral to the right, this posterior has a L´evy representation.
An expression for the posterior in terms of λDcan be found in Ferguson [62] and in
terms of λHcan be found in Hjort [100]. There is another proof due to Kim [113].
James [105] has a some what different approach, an approach we believe is promising
and deserves further study. We will give a result from [100] without proof.
Our setup consists of random variables X1,X
2,...,X
nthat are independent iden-
tically distributed Fand Y1,Y
2,...,Y
n, which are independent of the Xisandare
independent identically distributed as G0. The observations are Zi=XiYiand
δi=I(XiYi). Let
Nn(t)=
n
1
I(Zi>t) be the number of observations greater than t
and
Mn(t) be the number of Zisequal to t
Theorem 10.4.1 (Hjort). Let Πbe a neutral to the right prior with L´evy measure
of the form (10.3). When all the uncensored values—the Zis with δi=1—are distinct
among themselves, and from the values of the censored observations, the posterior has
the L´evy representation given by
10.5. BETA PROCESSES 265
1Mn
u: the set of uncensored values are points of fixed jumps. The distribution of the
jump at Zihas the density
(1 s)Nn(Zi)sa(Zi,s)
1
0(1 s)Nn(Zi)sa(Zi,s)ds
2 the L´evy measure of the continuous part has
ˆa(x, s)=(1s)Nn(x)+Mn(x)
Remark 10.4.1.Consequently
E¯
F(t2)
¯
F(t1)|Π(
(Zi
i):in)
=
ZiMn
u:t1<Zit21
0(1 s)Nn(Zi)+1sa(Zi,s)ds
1
0(1 s)Nn(Zi)sa(Zi,s)ds
et2
t11
0(1s)Nn(z)+Mn(z)sa(z,s)dsd ˆ
A(z)
(10.4)
10.5 Beta Processes
Beta processes, introduced by Hjort [100] are continuous analogs of a time-discrete
case where (see Example 10.2.2) the Vis are independent beta random variables. The
continuous case is obtained as a limit of the time-discrete case. However, in order
to ensure that the limit exists, the parameters of the beta random variables have to
be chosen carefully. In addition to introducing beta processes and elucidating their
properties for right censored data, Hjort [100] studied extensions to situations more
general than right censored data. This chapter only deals with a part of [100].
10.5.1 Definition and Construction
Let Abe a hazard function with finitely many jumps. Let t1,...,t
kbe the jump-
points of A.Letc(·) be a piecewise continuous non-negative function on [0,)and
let A,c denote the continuous part of A.LetA(t)<for all t.
Definition 10.5.1. An independent increment process Ais said to be a beta pro-
cess with parameters c(.)andA, written Abeta(c, A), if the following holds: A
has L´evy representation as in Theorem 10.3.3 with
266 10. NEUTRAL TO THE RIGHT PRIORS
1M={t1,...,t
k}and the jump-size at any tjgiven by
YjA{tj}∼beta(c(tj)A{tj},c(tj)(1 A{tj}))
2L´evy measure given by
λ(ds du)=c(s)u1(1 u)c(s)1du dA,c(s)
for 0 s<,0<u<1; and
3b(t)0 for all t>0.
The existence of such a process is guaranteed by Proposition 10.4 but this existence
result does not give any insight into the prior. A better understanding of the prior
comes from the construction of Hjort who obtained these priors as weak limits of
time-discrete processes on Aand showed that the sample paths are almost surely
in A. In a very similar spirit, we construct the prior on Fas a weak limit of priors
sitting on a discrete set of points on (0,).
Let F∈Fand, to begin, assume that it is continuous. Let A=H(F)bethe
cumulative hazard function corresponding to F.
Let Qbe a countable dense set in (0,), enumerated as {s1,s
2,...}.Foreach
n1, let {s(n)
1<···<s
(n)
n}be an ordering of s1,...,s
n. Construct a prior Πnon
Fs1,...,snas in Example 10.2.1 by requiring that, under Πn,
V(n)
ibeta c(s(n)
i1)¯
F(s(n)
i)
¯
F(s(n)
i1),c(s(n)
i1)1¯
F(s(n)
i)
¯
F(s(n)
i1) for 1 in1.(10.5)
Let V(n)
n1 and let Fbe a random distribution function , such that, under Πn,
L(¯
F(t)) = L
s(n)
it
(1 V(n)
i)
for all t>0
Theorem 10.5.1. {Πn}n1converges weakly to a neutral to the right prior Πon
F, which corresponds to a beta process.
10.5. BETA PROCESSES 267
Proof. First observe that, as n→∞,
EΠn(¯
F(t)) =
s(n)
itEΠn(1 V(n)
i)
=
s(n)
it1F(s(n)
i1,s
(n)
i]
F(s(n)
i1,)
(0,t]
(1 dH(F))
=
(0,t]
(1 dA)= ¯
F(t)
for all t0. Thus EΠn(F)=Fn
w
Fas n→∞. Hence, by Theorem 2.5.1, {Πn}is
tight.
We now follow Hjort’s calculations to show that the finite-dimensional distributions
of the process F, under the prior Πn, converges weakly to those under the prior
induced by a beta process with parameters cand A0on H.
Consider, for each n1, an independent increment process Ac
nwith process mea-
sure Π
non Asuch that, for each fixed t>0,
L(Ac
n(t)) = L(
s(n)
it
V(n)
i)
Thus, for each n1, Ac
nis a pure jump-process with fixed jumps at s(n)
1,...,s
(n)
n1
and with random jump sizes given by V(n)
i,...,V(n)
n1at these sites. Clearly, Π
ninduces
the prior Πnon F.
Now, for any fixed t>0, repeating computations as in Hjort [ [100], Theorem 3.1,
pp. 1270-72] with
cn,i =c(s(n)
i1),b
n,i =cn,i
¯
F,c(s(n)
i)
¯
F,c(s(n)
i1)and an,i =cn,i bn,i
one concludes that, for each θ,asn→∞,
E[eθAc
n(t)]exp 1
0t
0
(1 eθu)λ(ds du)
268 10. NEUTRAL TO THE RIGHT PRIORS
and, similarly,
Eexp
m
j=1
θjAc
n(aj1,a
j]exp
m
j=1 1
0aj
aj1
(1 eθju)λ(ds du)
Thus the finite-dimensional distributions of the independent increment processes An
converge to the finite-dimensional distributions of an independent increment process
with L´evy measure as in Definition 10.5.1. If the process measure is denoted by Π
and the corresponding induced measure on Fis denoted by Π, then considering the
Skorokhod topology on Aand by the continuity of H1, we conclude that, for all
a1,...,a
m,
L(¯
F(a1),..., ¯
F(am)|Πn)w
→L(¯
F(a1),..., ¯
F(am)|Π)
Therefore, {Πn}converges weakly to Π, a neutral to the right prior on F.
10.5.2 Properties
The following properties of beta processes are from Hjort [100].
1LetA∈Abe a hazard function with finitely many points of discontinuity and let
cbe a piecewise continuous function on (0,).
If Abeta(c,A) then E(A(t)) = A(t). In other words F=H1(A) follows a
beta(c,F) prior distribution and we have E(F(t)) = F(t)whereF=H1(A).
The function centers the expression for the variance. If M={t1,...,t
k}is the set
of discontinuity points of A0then
V(A(t)) =
tjt
A{tj}(1 A{tj})
c(tj)+1 +t
0
dA,c(s)
c(s)+1
where A,c(t)=A(t)titA{ti}.
2LetAbeta(c,A) where, as before, Ahas discontinuities at points in M.Given
F,letX1,...,X
nbe i.i.d. F. Then the posterior distribution of Fgiven X1,...,X
n
is again a beta process, i.e., the corresponding independent increment process is
again beta.
10.5. BETA PROCESSES 269
To describe the posterior parameters, let Xnbe the set of distinct elements of
{x1,...,x
n}. Define
Yn(t)=
n
i=1
I(Xit)and ¯
Yn(t)=
n
i=1
I(Xi>t)
With Nn(t)asbefore,notethat ¯
Yn(t)=nNn(t)andYn(t)=nNn(t).
Using this notation, the posterior beta process has parameters
cX1...Xn(t)=c(t)+Yn(t)
A
X1...Xn(t)=t
0
c(z)dA(z)+dNn(z)
c(z)+Yn(z)
More explicitly, A
X1...Xnhas discontinuities at points in M=MXn, and for
tM,
A
X1...Xn{t}=c(t).A{t}+Nn{t}
c(t)+Yn(t)
A,c
X1...Xn(t)=t
0
c(z)dA,c(z)
c(z)+Yn(z)
Note that if tM,
A{t}∼beta (c(t)A{t}+Nn{t},c(t)(1 A{t})+Yn(t)Nn{t}).
3 Our interest is in the following special case of 2. Suppose Abeta(c,A)andt
Ais continuous. Then the posterior given X1,...,X
nis again a beta process with
parameters
cX1...Xn(t)=c(t)+Yn(t)
and
AX1...X
n(t)=A,dX1...X
n(t)+A,cX1...X
n(t)
where
A,d
X1...Xn(t)=
tiXn
tit
Nn{ti}
c(ti)+Yn(ti)
and
A,c
X1...Xn(t)=t
0
c(z)dA(z)
c(z)+Yn(z)
270 10. NEUTRAL TO THE RIGHT PRIORS
As a consequence, if tXn, then under the posterior ΠX1,...,Xnwe have
A{t}∼beta(Nn{t},c(t)+ ¯
Yn(t)).
Also note that the Bayes estimates are
EΠX1,...,Xn(A(t)) = AX1...X
n(t)
and
EΠX1,...,Xn(¯
F(t)) =
tiXn
tit1Nn{ti}
c(ti)+Yn(ti)exp t
0
c(z)dA(z)
c(z)+Yn(z)(10.6)
4 A neat expression for the posterior and the Bayes estimate for right censored data
can be easily obtained using Theorem 10.4.1. We leave the details to the reader.
Using these explicit expressions it is not very difficult to show that beta processes
lead to consistent posteriors. However since we take up the consistency issue more
generally in the next section we do not pursue it here.
Like the Dirichlet, any two beta processes tend to be mutually singular. This is
proved in [43].
Walker and Muliere [167] started with a positive function Don (0,) and a distri-
bution function ˆ
Fand constructed a class of priors on Fcalled beta-Stacy processes.
We again consider the simple case when ˆ
Fis continuous. The beta-Stacy process is
the neutral to the right prior with
D(s, x)=D(x)esD(x)¯
F(x)
1esdsd ˆ
Ax;0<x<,0<s<
The beta process prior thus relates to an independent increment process via Hand
the beta -Stacy via D. Viewing the processes as measures on Fprovides a mean to
calibrate the prior information in Hin terms of that in Dand vice versa. Though not
explicitly formulated in the following form, the relationship between the two priors is
already implicit in remark 2 and remark 4 of [167].
Theorem 10.5.2. Πis a Beta Stacy (D, ˆ
F)process iff Πis a Beta (C, ˆ
A)process
prior where C=D¯
ˆ
Fand ˆ
Ais the cumulative hazard function of ˆ
F.
Proof. Because Beta Stacy process has λDgiven above, we can compute its λHusing
Proposition 10.4.1. This immediately yields the assertion.
10.6. POSTERIOR CONSISTENCY 271
10.6 Posterior Consistency
Since neutral to the right priors, like tail free priors, possess nice independence and
conjugacy properties it appeared that they would always yield consistent posteriors.
However, Kim and Lee [114] gave an example of a neutral to the right prior which is
inconsistent. Their elegant example is constructed with a homogeneous L´evy measure
and is inconsistent at every continuous distribution.
Recall from Theorem 4.2.1 that to establish posterior consistency at F0, it is enough
to show that with F
0probability 1, for all t
(i) lim
n→∞ E(F(t)|X1,X
2,...,X
n)=F0(t)and
(ii) lim
n→∞ V(F(t)|X1,X
2,...,X
n)=0.
The next theorem shows that for neutral to the right priors consistency of Bayes
estimates ensures consistency of the posterior.
Theorem 10.6.1. Let Πbe a neutral to the right prior of the form (10.3). If
lim
n→∞ E(F(t)|X1,X
2,...,X
n)=F0(t)
then
lim
n→∞ V(F(t)|X1,X
2,...,X
n)=0
Proof. Let X[1] <X
[2] ...X
[k]be the ordering of the observations X1,X
2,...,X
n
which are less than t. Then, apart from an exponential factor going to 1,
E(¯
F(t)2|X1,X
2,...,X
n)=
k
21
0(1 s)j+2a(s, X[j])ds
1
0(1 s)ja(s, X[j])ds
multiplying each term by 1
0(1 s)j+1a(s, X[j])ds/ 1
0(1 s)j+1a(s, X[j])ds,weget
=
k
21
0(1 s)j+2a(s, X[j])ds
1
0(1 s)j+1a(s, X[j])ds
k
11
0(1 s)j+1a(s, X[j])ds
1
0(1 s)ja(s, X[j])ds (¯
F0(t))2
There is another structural aspect of neutral to the right priors. Consistency for
the censored case follows from consistency for the uncensored case. Following is the
result. For a proof, see Dey et al. [43]
272 10. NEUTRAL TO THE RIGHT PRIORS
Theorem 10.6.2. Suppose Xis a survival time with distribution Fand Yis a
censoring time distributed as G.X1,X
2,..., are given F=F, i.i.d. Fand Y1,Y
2,...,
be i.i.d. G, where Gis continuous and has support all of R+. We also assume that the
Xis and Yis are independent. Let Zi=XiYiand i=I(XiYi).IfΠis a neutral
to the right prior for Fwhose posterior is consistent at all continuous distributions
F0, then the posterior given (Zi,i):i1is also consistent at all continuous F0.
Proof. Fix t1<t
2. since the exponential term in 10.4 goes to 0 as n→∞, our assump-
tion on consistency translates into: for any continuous distribution F,ifX1,X
2,...,X
n
are i.i.d. F, then
lim
n→∞
Xi(t1,t2]0,1s(1 s)Nn(Xi)+1a(Xi,s)ds
0,1s(1 s)Nn(Xi)a(Xi,s)ds =¯
F(t2)
¯
F(t1)
Fix F0continuous. Let X1,X
2,...,X
nbe i.i.d. F0and Y1,...,Y
nbe i.i.d. G, and let
(Zi,i)beasabove.Wewillfirstshowthat
lim
n→∞
ZiM
n(0,t]0,1s(1 s)Nn(Xi)+1a(Xi,s)ds
0,1s(1 s)Nn(Xi)a(Xi,s)ds =¯
F(t)a.s.(F0×G)
where M
n={Zj:∆
j=0}.
With t1<t
2fixed, let φbe an increasing continuous mapping of (t1,)into(t2,)
and define
Z
i=ZiI(∆i=1)+φ(Zi)I(|Deltai=0)
Then Z
iare again i.i.d. with a continuous distribution F
0such that
F
0(t2)
F
0(t1)=¯
J(t1,1) ¯
J(t2,1)
¯
J(t1)
where ¯
J(t)=P(Z>t)and ¯
J(t1=P(Z>t,∆=1).
Now using our assumption, if Nn
(t)=n
i=1 I(Z
i>t) then
lim
n→∞
Z
i(t1,t2]0,1s(1 s)Nn
(Z
i)+1a(Z
i,s)ds
0,1s(1 s)Nn
(Z
ia(Z
i,s)ds =¯
J(t1,1) ¯
J(t2,1)
¯
J(t1)a.s
10.6. POSTERIOR CONSISTENCY 273
Note that the above product is only over the uncensored Zis and that, for each t1<t
2
with ∆i=1,Nn(Zi)Nn
(Zi). Now using the Cauchy-Schwarz inequality we get
1
0
(1 s)n+2sa(x, s)ds1
0
(1 s)nsa(x, s)ds
=1
0
[(1 s)(n+2)/2]2sa(x, s)ds1
0
[(1 s)(n)/2]2sa(x, s)ds
1
0
(1 s)n+1sa(x, s)ds2
and consequently 1
0(1 s)n+1sa(x, s)/1
0(1 s)nsa(x, s)ds is decreasing in n. Hence,
we have
lim
n→∞
ZiM
n(t1,t2]0,1s(1 s)Nn(Zi)+1a(Zi,s)ds
0,1s(1 s)Nn(Zi)a(Zi,s)ds
lim
n→∞
Z
i(t1,t2]0,1s(1 s)Nn
(Z
i)+1a(Z
i,s)ds
0,1s(1 s)Nn
(Z
i)a(Z
i,s)ds
=¯
J(t1,1) ¯
J(t2,1)
¯
J(t1)
Let 0 = t0<t
1<t
2<...< t
k=tbe a partition of (0,t]. Then
lim
n→∞
ZiM
n(0,t]0,1s(1 s)Nn(Zi)+1a(Zi,s)ds
0,1s(1 s)Nn(Zi)a(Zi,s)ds
k
1
¯
J(ti1,1) ¯
J(ti,1)
¯
J(ti1)
As the width of the partition max |titi1goes to 0, the right-hand side converges
to the product integral (0,t](1 J(ds, 1)/¯
J(s)), which from Peterson [138] is equal
to ¯
F(t).
Let ˆ
¯
Fndenote the Bayes estimate of ¯
Fgiven X1,X
2,...,X
nand let ¯
F
ndenote the
Bayes estimate of ¯
Fgiven (Zi
i):1in.wehaveshownthatforallt,
¯
F
n(t)ˆ
¯
Fn(t) and hence lim inf
nF
n¯
F0
Similarly, by considering the “Bayes” estimate for G,withMn
0={(Zj,j:∆
j=
0)},
lim inf
n
Zit:ZiMn
01
0(1 s)Nn(Zi)+1a(Zi,s)ds
1
0(1 s)Nn(Zi)a(Zi,s)ds ¯
G
274 10. NEUTRAL TO THE RIGHT PRIORS
Consider,
Zit:ZiMn
u1
0(1 s)Nn(Zi)+1a(Zi,s)ds
1
0(1 s)Nn(Zi)a(Zi,s)ds
Zit:ZiMn
01
0(1 s)Nn(Zi)+1a(Zi,s)ds
1
0(1 s)Nn(Zi)a(Zi,s)ds (10.7)
but this is equal to
Zit1
0(1 s)Nn(Zi)+1a(Zi,s)ds
1
0(1 s)Nn(Zi)a(Zi,s)ds
But this is just the Bayes estimate based on i.i.d. observations from the continuous
survival distribution ¯
F0(t)¯
G(t) and by assumption (10.7) converges to ¯
F0(t)¯
G(t). The
conclusion follows easily.
Thus, as far as consistency issues are concerned, we only need to study the uncen-
sored case. We begin looking at the simple case when the L´evy measure is homoge-
neous. In the sequel for any a, b > 0, we denote by B(a, b),the usual beta function
given by
B(a, b)=Γ(a)Γ(b)
Γ(a+b)=1
0
(1 s)a1sb1ds
If fis an integrable function on (0,1) we set
K(n, f )=1
0
(1 s)nf(s)ds
We will repeatedly use the fact that
for any p, q; lim
n→∞ nqpΓ(n+p)
Γ(n+q)=1
Lemma 10.6.1. Suppose fis a nonnegative function on (0,1) such that
(a) 0<1
0f(s)ds < and
(b) for some α<1,0<lim
s0sαf(s)=b<.
Then
lim
n→∞
K(n, f )
B(n+1,1α)=b
10.6. POSTERIOR CONSISTENCY 275
Proof. Since
1
(1 s)nf(s)ds (1 )n1
0
f(s)ds =o(n(1α))
and as n→∞,n
1αB(n, 1α)Γ(1 α), we have
lim
n→∞ 1
(1 s)nf(s)ds
B(n+1,1α)= 0 (10.8)
Similarly, because α<1,
1
(1 s)nsαds (1 )n
0
sαds (1 )n11α
1α=o(n(1α))
which in turn yields
lim
n→∞ 1
(1 s)nsαds
B(n+1,1α)= 0 (10.9)
Given δ, use assumption (b) to choose >0 such that for s<
(bδ)sα<f(s)<(b+δ)sα
Then
K(n, f )(b+δ)B(n+1,1α)+1
(1 s)nf(s)ds
and by (10.8) we have
lim
n→∞
K(n, f )
B(n+1,1α)(b+δ)
A similar argument using (10.9) shows that
lim
n→∞
K(n, f )
B(n+1,1α)(bδ)
Since δis arbitrary, the lemma follows.
Theorem 10.6.3. Let Abe a cumulative hazard function which is continuous and
finite for all x. Suppose that a neutral to the right prior with no fixed jumps has the
expected hazard function Aand the L´evy measure
H(x, s)=a(s)dA(x)ds 0<x<,0<s<1
276 10. NEUTRAL TO THE RIGHT PRIORS
such that
for some α<1,0<lim
s0s1+αa(s)=b<(10.10)
If F0is a continuous distribution with F0(t)>0for all t, then with F
0-probability
1, the posterior converges weakly to the measure degenerate at F1α
0. In particular, if
(10.10) holds with α=0then the posterior is consistent at F0.
Proof. Set f(s)=sa(s). We have 1
0f(s)ds = 1. Using (10.4),
E(¯
F(t)|X1,X
2,...,X
n)=
Xit
K(Nn(Xi)+1,f)
K(Nn(Xi),f)eψn(t)(10.11)
where ψn(t)=t
01
0(1 s)Nn(x)+Mn(x)sa(s)dsdA(x).
For any x<t,(1 s)Nn(x)+Mn(x)<(1 s)Nn(t)and hence ψn(t) is bounded above
by (1
0(1 s)Nn(t)ds)A(t). Since Nn(t)→∞as n→∞, it follows that ψn(t)0as
n→∞. Hence the exponential factor goes to 1.
If X(1) <X
(2) ... < X
(nNn(t)) is an ordering of the nNn(t) samples that are
less than t, then, since with F0probability 1 the X1,X
2,...,X
nare all distinct,
Nn(X(1))=n1,Nn(X(2))=n2, and so on. Thus the first term in (10.11) reduces
to
(i=nNn(t))
i=0
K(ni, f )
K((ni1),f)=K(n, f)
K(Nn(t)1,f)
It follows from Lemma 10.6.1 that
lim
n→∞
K(n, f )
K(Nn(t)1,f)= lim
n→∞
B(Nn(t)1,1α)
B(n, 1α)
= lim
n→∞
Γ(Nn(t)α)
Γ(Nn(t)1)
Γ(n)
Γ(n+1α)
= lim
n→∞ Nn(t)
n+11α
=¯
F0(t)1αa.s. F
0
Remark 10.6.1.The Kim-Lee example had the homogeneous L´evy measure given
by a(s)=2s3/2. In this case the conditions of the Theorem 10.6.3 are satisfied with
α=1/2 so that the posterior converges to F1/2
0.
10.6. POSTERIOR CONSISTENCY 277
We next turn to a sufficient condition for consistency in the general case. We begin
with an extension of Lemma 10.6.1.
For ea ch xin a set Xlet f(x, .) be a non negative function on (0,1). Let mnbe a
sequence of integers such that lim
n→∞
mn
n=c, 0<c<1.
Lemma 10.6.2. Suppose
(a) 0<supx1
0f(x, s)ds =I<and
(b) As s0,f(x, s)converges uniformly (in x), to the constant function 1, i.e., as
0,
δ=sup
x
sup
s< |f(s, x)1|→0
Then
lim
n→∞
n
mni+2
i+1
1
0(1 s)i+1f(xi,s)ds
1
0(1 s)if(xi,s)ds =1
and the convergence is uniform in the x
is.
Proof. To avoid unpleasant expressions involving fractions of integrals, set
Ki,x =1
0
(1 s)if(x, s)ds and Li,x =1
0
s(1 s)if(x, s)ds
We will show that for any x,givenδsmall, there is an m0such that, for i>m
0,
i+12δ
i+2 Ki+1,x
Ki,x i+1+2δ
i+2 (10.12)
The bounds in inequality 10.12 do not depend on the xis. Consequently, we have
uniformly in the xis,
12δ
mn+1
nmn
n
mn
i+2
i+1
Ki+1,x
Ki,x 1+ 2δ
mn+1
nmn
For small positive y,e2y<1y<1+y<e
y. Hence, as n→∞, the left-hand
side converges to e4δ(1c)/c and the right side to e2δ(1c)/c. Letting δgoto0wehave
the result.
278 10. NEUTRAL TO THE RIGHT PRIORS
To prove (10.12) note that
Ki+1,x
Ki,x
=1Li,x
Ki,x
For any 0 <=1α<1,
(1 δHi,)Ki,x (1 + δ)Hi, +αiI
and
(1 δJi,)Li,x (1 + δ)Ji, +αiI
where
Hi, =
0
(1 s)ids =1αi+1
i+1
and
Ji, =
0
s(1 s)ids =1αi+1(1 + +i)
(i+1)(i+2)
Now
(i+2)Li,x
Ki,x (i+2)(1 + δ)Ji,
(1 δ)Hi,
+αiI
(1 δ)Hi,
which goes to (1+δ)/(1δ)asi→∞. Further, the right-hand side does not involve
x, and hence this convergence is uniform in x.
On the other hand,
(i+2)Li,x
Ki,x (i+2)(1 δ)Ji,
(1 + δ)(Hi, +αiI)
which goes to (1 δ)/(1 + δ), again uniformly in x.
Because δ0asgoes to 0, given any δ>0, for sufficiently small ,(1δ)/(1+δ)
is larger than (1 δ) and (1 + δ)/(1 δ) is smaller than (1 + δ).
Thus given any δ>0,thereisannsuch that for i>n
,
1δ(i+2)Li,x
Ki,x 1+δ
Using Ki+1,x/Ki,x =1(Li,x/Ki,x), we get
11+δ
i+2 <Ki+1,x
Ki,x
<1+1δ
i+2
and this is (10.12)
10.6. POSTERIOR CONSISTENCY 279
Remark 10.6.2.In the Lemma 10.6.2, assumption (a) can be replaced by:
(a’) 0 <supx1
0(1 s)f(x, s)ds < .
This follows from setting g(s, x)=(1s)f(x, s) and noting that gsatisfies as-
sumptions (a) and (b) and that
1
0(1 s)n+1f(x, s)ds
1
0(1 s)nf(x, s)ds =1
0(1 s)ng(x, s)ds
1
0(1 s)n1g(x, s)ds
and observing that (n+2)/(mn+2)=n
mn[(i+2)/(i+1)and(n+1)/(mn+1)=
n
mn[(i+1)/i both converge to the same limit 1/c.
Theorem 10.6.4. Let Πbe a neutral to the right prior with
H(x, s)=c(x)a(x, s)dA(x)ds 0<x<,0<s<1
If f(x, s)=sa(x, s)satisfies the assumption of the Lemma 10.6.2 (or the remark
following it) then the posterior is consistent at any continuous distribution F0.
Proof. Since the exponential factor in equation (10.4) goes to 1, it follows immediately
from Lemma 10.6.2 that for each twith ¯
F0(t)>0,
E(¯
F(t)|Π()|X1,X
2,...,X
n)¯
F0(t)
Theorem 10.6.5. The posterior of the beta(C, A)prior is consistent at all con-
tinuous distribution F0.
Proof. Since the L´evy measure satisfies the conditions of Remark 10.6.2, this is an
immediate consequence of Theorem 10.6.4.
Remark 10.6.3.Kim and Lee [114] have shown consistency when
1(1s)f(x, s)1and
2asx0,f(x, s) converges uniformly in xto a positive continuous function b(x).
The result is marginally more general than that of Kim and Lee. The methods that
we have used are more elementary.
280 10. NEUTRAL TO THE RIGHT PRIORS
To summarize, neutral to priors are an elegant class of priors that can, in terms of
mathematical tractability, conveniently handle right censored data. We have also seen
that some caution is required if one wants consistent posteriors. As with the Dirichlet,
mixtures of neutral to priors would yield more flexibility in terms of prior opinions
and posteriors that are amenable to simulation. These remain to be explored.
11
Exercises
11.0.1. If two probability measures on RKagree on all sets of the form (a1,b
1]×
(a2,b
2],...×(ak,b
k] then they agree on all Borel sets in Rk.
11.0.2. Let Mtbe the median of Beta(ct, c(1t)) where 0 <t<1. Show that Mt1
2
iff t1
2. [Hint: If x1
2show that xct1(1 x)c(1t)1is increasing in t. Suppose
t1
2and Mt<1
2. Then 1/2
0xct1(1 x)c(1t)1dx 1
2. Make the change of variable
x→ (1 x) to obtain a contradiction]
11.0.3. Suppose αis a finite measure. Define X1,X
2,... by
X1is distributed as ¯α
, for any n1,
P(Xn+1 B|X1,X
2,...,X
n)=α(B)+n
1δXi(B)
α(R+n)
Show that X1,X
2,... form an exchangeable sequence and the corresponding
DeFinneti measure on M(R)isDα
11.0.4. Assume a Dirichlet prior and show that the predictive distribution of Xn+1
given X1,X
2,...,X
n, converges to P0weakly almost surely P0. Examine what hap-
pens when the prior is a mixture of Dirichlet processes.
282 11. EXERCISES
11.0.5. Show that if PMαand Uis a neighborhood of Pin set-wise convergence
then Dα(U)>0. However Mαis not the smallest closed set with this property.
11.0.6. Show that a Polya tree prior is a Dirichlet process iff for any E
i
0+α1=
α.
11.0.7. Let Lµbe the set of all probability measures dominated by a σfinite
measure µ. Verify that, when restricted to Lµall the three σalgebras discussed in
section 2.2 coincide.
11.0.8. Let Ebe a measurable subset of Θ ×X such that θ=θ,E
θEθ=and for
all θ,Pθ(Eθ) = 1. For any two priors Π1,Π2on Θ show that Π1Π2=λ1λ2,
where λiare the respective marginals on X.
Derive the Blackwell- Dubins merging result from Doob’s theorem
11.0.9. Consider fθ=U(0); 0 <θ<1. Show that the Schwartz condition fails at
θ= 1 but posterior consistency holds. Can you use the results in Section 4.3 to prove
consistency?
11.0.10. Suppose X1,X
2,...,X
nare i.i.d. Ber(p), i.e.,
Pr(Xi=1)=p=1Pr(Xi=0)
Apriorforpmay be elicited by asking for a rule for predicting Xn+1. Suppose for all
n1, one is given the rule
Pr(Xn+1 =1|X1,X
2,...,X
n)=a+n
1Xi
a+b+n
Assuming that the prediction loss is squared error, show that there is a unique
prior corresponding to this rule and identify the prior
11.0.11. With Xis as in Exercise11.0.11, consider a conjugate prior and a realization
of the Xis such that ˆp=n
1Xi/n is bounded away from 0 and 1 as n→∞.
Show directly (without using the results established in the text) that as n→∞, the
posterior distribution of npp)/p(1 ˆp) converges weakly to N(0,1)
11.0.12. Let X1,X
2...,X
nbe i.i.d. N(0,1). Consider a Bayesian who does not know
the true density and who uses the model, θN(µ, η) and given θ,X1,X
2...,X
nbe
i.i.d. N(θ, 1). Calculate the posterior of θgiven X1,X
2...,X
nand verify that with
probability 1 under the joint distribution under N(0,1), the density of n(θ¯
X)
converges in L1distance to N(0,1).
283
11.0.13. Consider Xis as in Exercise11.0.11. Consider a beta prior, i.e., a prior with
density
Π(p)=cpα1(1 p)β10
a Discuss why relatively small values of α+βindicate relative lack of prior information
b Consider a sequence of hyperparameters αi
isuch that αi+βi0 but αii
C, 0<C<1. Show that the corresponding sequence of priors converge weakly, and
determine the limiting prior. Would you call this prior noninformative? Reconcile
your answer with the discussion in (a)
11.0.14. (1). For a multinomial with probabilities p1,p
2,...,p
kfor kclasses,calculate
the Jeffreys prior. [Hint: Use the following well known identity (see [144]): Let B
be a positive definite matrix. Let A=B+xxT. Then det A= det B(1+xTB1x)
]
(2). In the above problem calculate the reference prior for (p1,p
2) assuming k=3.
For the next four problems PDαand given P,X1,X
2,...,X
nare i.i.d. P.
11.0.15. Assume
−∞ x2dα < . Calculate the prior variance of the population
meanxdP
11.0.16. Assuming αhas the Cauchy density
1
π
1
1+x2
and xdP =T(P) is well defined for almost all P, show that T(P) has the same
Cauchy distribution.
[Hint: Use Sethuraman’s construction]
11.0.17. For ¯αCauchy, show that xdP =T(P) is well defined for almost all P.
[Hint: If Yiis a sequence of independent random variables such that n
1Yiconverges
in distribution, then
1Yiis finite a.s. Alternatively, use methods of Doss and Selke
[55]]
11.0.18. Let αθ=N(θ, 1) and θN(µ, η). Given X1,X
2,...,X
nare all distinct,
calculate the posterior distribution of θ.
For the next three problems, let PDα,Pa convolution of Pand N(0,h
2)and
hhave the prior density Π(h). Given P,letX1,X
2,...,X
nbe i.i.d. f,wherefis the
density of P
284 11. EXERCISES
11.0.19. Let Cnbe the information that all the Xis are distinct. For any fixed x
calculate E(f(x)|X1,X
2,...,X
n,C
n) assuming the Xis are all distinct.
11.0.20. Let the true density f0be uniform on (0,1). Verify if the Bayes estimate
E(f|X1,X
2,...,X
n,h) is consistent in the L1distance
11.0.21. Let f0be Normal or Cauchy with location and scale parameters chosen by
you but not equal to 0 and 1. Set n=50fromf0, draw a sample of size n,namely,
X1,X
2,...,X
n. Simulate the Bayes estimate of f(x) when the prior is a Dirichlet
mixture of normal and ¯α=N(0,1) or N(µ, σ2)withµand σ2independent, µnormal
and σ2is inverse gamma truncated above.
Plot f0and the Bayes estimate. Discuss whether putting a prior on µ, σ2leads to a
Bayes estimate that is closer to f0than the Bayes estimate under a prior with fixed
values of µand σ2. (Base your comments from 10 simulations on each case).
11.0.22. Let f0be normal or Cauchy. Using the Polya tree prior recommended in
Chapter6 and a normal or Cauchy prior for the location parameter, calculate numeri-
cally the posterior for θ, for various values of nand various choices of X1,X
2,...,X
n.
11.0.23. (a) Assume the regression model discussed in Chapter7 with a prior for the
random density fthat is Dirichlet mixture of Normal or Cauchy . Calculate and
plot the posterior for βfor the different priors listed in Exercise11.0.21.
(b) Do the same but symmetrize faround 0. Discuss whether the behavior of the
posterior for βis similar o that in (a)
11.0.24. Examine Doob’s theorem in the regression set up considered in Chapter 7
11.0.25. Show that the Bayes estimate for survival function under a Dirichlet prior
with censored data has a representation as a product of survival probabilities and
that it converges to the Kaplan-Meier estimate as α(R)0.
11.0.26. Show that the Bayes estimate for the bivariate survival function is incon-
sistent in the following example (due to R.Pruitt):
(T1,T
2)Fand FDαwhere αis the uniform distribution on (0,2) ×(0,2). The
censoring random variable (C1,C
2)takesthevalues(0,2),(2,0) and (2,2) with equal
probability of 1/3. The Bayes estimator for Fis inconsistent when F0is the uniform
distribution on (1,2) ×(0,1).
11.0.27. Show, in the context of Chapter9 that if one starts with a Dirichlet prior
for the distribution of (Z, ∆) (i.e., a prior for probability measures on {0,1R+),
then the induced prior for F-the distribution of the survival time XisaBetaprocess.
References
[1] James H. Albert and Siddhartha Chib. Bayesian analysis of binary and
polychotomous response data. J. Amer. Statist. Assoc., 88(422):669–679, 1993.
[2] S.-I. Amari, O. E. Barndorff-Nielsen, R. E. Kass, S. L. Lauritzen,
and C. R. Rao.Differential geometry in statistical inference. Institute of
Mathematical Statistics, Hayward, CA, 1987.
[3] Per Kragh Andersen, Ørnulf Borgan, Richard D. Gill, and Niels
Keiding.Statistical models based on counting processes. Springer-Verlag, New
York, 1993.
[4] Charles E. Antoniak. Mixtures of Dirichlet processes with applications to
Bayesian nonparametric problems. Ann. Statist., 2:1152–1174, 1974.
[5] Andrew. Barron, Mark J. Schervish, and Larry Wasserman. The
consistency of posterior distributions in nonparametric problems. Ann. Statist.,
27(2):536–561, 1999.
[6] Andrew R. Barron. The strong ergodic theorem for densities: generalized
Shannon-McMillan-Breiman theorem. Ann. Probab., 13(4):1292–1303, 1985.
[7] Andrew R. Barron. Uniformly powerful goodness of fit tests. Ann. Statist.,
17(1):107–124, 1989.
286 References
[8] Andrew R. Barron. Information-theoretic characterization of Bayes perfor-
mance and the choice of priors in parametric and nonparametric problems. In
Bayesian statistics, 6 (Alcoceber , 1998), pages 27–52. Oxford Univ. Press, New
York, 1999.
[9] D. Basu. Statistical information and likelihood. Sankhy¯a Ser. A, 37(1):1–71,
1975. Discussion and correspondance between Barnard and Basu.
[10] D. Basu and R. C. Tiwari. A note on the Dirichlet process. In Statistics
and probability: essays in honor of C. R. Rao, pages 89–103. North-Holland,
Amsterdam, 1982.
[11] S. Basu and S. Mukhopadhyay. Binary response regression with normal
scale mixture links. In Generalized Linear Models- A Bayesian Perspective,
pages 231–241. Marcel-Dekker, New York, 1998.
[12] S. Basu and S. Mukhopadhyay. Bayesian analysis of binary regression
using symmetric and asymmetric links. Sankhy¯a Ser. B, 62:372–387, 2000.
[13] James O. Berger.Statistical decision theory and Bayesian analysis.
Springer-Verlag, New York, 1993. Corrected reprint of the second (1985) edi-
tion.
[14] James O. Berger and Jose-M. Bernardo. Estimating a product of
means: Bayesian analysis with reference priors. J. Amer. Statist. Assoc.,
84(405):200–207, 1989.
[15] James O. Berger and Jos´
e M. Bernardo. On the development of ref-
erence priors. In Bayesian statistics, 4 (Pe˜ıscola, 1991), pages 35–60. Oxford
Univ. Press, New York, 1992.
[16] James O. Berger and Luis R. Pericchi. The intrinsic Bayes factor for
model selection and prediction. J. Amer. Statist. Assoc., 91(433):109–122, 1996.
[17] Robert H. Berk and I. Richard Savage. Dirichlet processes produce
discrete measures: an elementary proof. In Contributions to statistics, pages
25–31. Reidel, Dordrecht, 1979.
[18] Jose-M. Bernardo. Reference posterior distributions for Bayesian inference.
J. Roy. Statist. Soc. Ser. B, 41(2):113–147, 1979. With discussion.
References 287
[19] P. J. Bickel. On adaptive estimation. Ann. Statist., 10(3):647–671, 1982.
[20] P. J. Bickel and J. A. Yahav. Some contributions to the asymptotic theory
of Bayes solutions. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 11:257–
276, 1969.
[21] Patrick Billingsley.Convergence of probability measures. John Wiley &
Sons Inc., New York, second edition, 1999. A Wiley-Interscience Publication.
[22] Lucien Birg´
e. Approximation dans les espaces m´etriques et th´eorie de
l’estimation. Z. Wahrsch. Verw. Gebiete, 65(2):181–237, 1983.
[23] David Blackwell. Discreteness of Ferguson selections. Ann. Statist., 1:356–
358, 1973.
[24] David Blackwell and Lester Dubins. Merging of opinions with increas-
ing information. Ann. Math. Statist., 33:882–886, 1962.
[25] David Blackwell and James B. MacQueen. Ferguson distributions via
olya urn schemes. Ann. Statist., 1:353–355, 1973.
[26] J. Blum and V. Susarla. On the posterior distribution of a Dirichlet pro-
cess given randomly right censored observations. Stochastic Processes Appl.,
5(3):207–211, 1977.
[27] J. Borwanker, G. Kallianpur, and B. L. S. Prakasa Rao. The
Bernstein-von Mises theorem for Markov processes. Ann. Math. Statist.,
42:1241–1253, 1971.
[28] Olaf Bunke and Xavier Milhaud. Asymptotic behavior of Bayes esti-
mates under possibly incorrect models. Ann. Statist., 26(2):617–644, 1998.
[29] Burr, D, Cooke G.E., Doss H. and P.J. Goldschmidt-Clermont.A
meta analysis of studies on the association of the platlet p1a polymorphism of
glycoprotein iiia and risk of coronary heart disease. Technical report, 2002.
[30] N. N. ˇ
Cencov.Statistical decision rules and optimal inference. American
Mathematical Society, Providence, R.I., 1982. Translation from the Russian
edited by Lev J. Leifman.
288 References
[31] Ming-Hui Chen and Dipak K. Dey. Bayesian modeling of correlated binary
responses via scale mixture of multivariate normal link functions. Sankhy¯a Ser.
A, 60(3):322–343, 1998. Bayesian analysis.
[32] Ming-Hui Chen, Qi-Man Shao, and Joseph G. Ibrahim.Monte Carlo
methods in Bayesian computation. Springer-Verlag, New York, 2000.
[33] Bertrand S. Clarke and Andrew R. Barron. Information-theoretic
asymptotics of Bayes methods. IEEE Trans. Inform. Theory, 36(3):453–471,
1990.
[34] Robert J. Connor and James E. Mosimann. Concepts of independence
for proportions with a generalization of the Dirichlet distribution. J. Amer.
Statist. Assoc., 64:194–206, 1969.
[35] Harald Cram´
er.Mathematical Methods of Statistics. Princeton University
Press, Princeton, N. J., 1946.
[36] Harald Cram´
er and M. R. Leadbetter.Stationary and related stochastic
processes. Sample function properties and their applications. John Wiley & Sons
Inc., New York, 1967.
[37] Sarat Dass and Jayeong Lee. A note on the consistency of bayes factors
for testing point null versus nonparametric alternatives.
[38] G. S. Datta and J. K. Ghosh. Noninformative priors for maximal invariant
parameter in group models. Test, 4(1):95–114, 1995.
[39] A. P. Dawid, M. Stone, and J. V. Zidek. Marginalization paradoxes in
Bayesian and structural inference. J. Roy. Statist. Soc. Ser. B, 35:189–233,
1973. With discussion by D. J. Bartholomew, A. D. McLaren, D. V. Lindley,
Bradley Efron, J. Dickey, G. N. Wilkinson, A. P.Dempster, D. V. Hinkley, M.
R. Novick, Seymour Geisser, D. A. S. Fraser and A. Zellner, and a reply by A.
P. Dawid, M. Stone, and J. V. Zidek.
[40] William A. Dembski. Uniform probability. J. Theoret. Probab., 3(4):611–
626, 1990.
[41] Luc Devroye and L´
aszl´
oGy
¨
orfi. No empirical probability measure
can converge in the total variation sense for all distributions. Ann. Statist.,
18(3):1496–1499, 1990.
References 289
[42] J. Dey, L. Drˇ
aghici, and R. V. Ramamoorthi. Characterizations of tail
free and neutral to the right priors. In Advances on methodological and applied
aspects of probability and statistics, pages 305–325. Gordon and Breach science
publishers.
[43] J. Dey, R.V. Erickson, and R.V. Ramamoorthi. Some aspects of neutral
to right priors. submitted . Bayesian analysis.
[44] P. Diaconis and D. Freedman. Partial exchangeability and sufficiency.
In Statistics: applications and new directions (Calcutta, 1981), pages 205–236.
Indian Statist. Inst., Calcutta, 1984.
[45] P. Diaconis and D. Freedman. On inconsistent Bayes estimates of location.
Ann. Statist., 14(1):68–87, 1986.
[46] Persi Diaconis and David Freedman. On the consistency of Bayes esti-
mates. Ann. Statist., 14(1):1–67, 1986. With a discussion and a rejoinder by
the authors.
[47] Persi Diaconis and Donald Ylvisaker. Conjugate priors for exponential
families. Ann. Statist., 7(2):269–281, 1979.
[48] Kjell Doksum. Tailfree and neutral random probabilities and their posterior
distributions. Ann. Probability , 2:183–201, 1974.
[49] J. L. Doob. Application of the theory of martingales. In Le Calcul des
Probabilit´es et ses Applications., pages 23–27. Centre National de la Recherche
Scientifique, Paris, 1949. Colloques Internationaux du Centre National de la
Recherche Scientifique, no. 13,.
[50] Hani Doss. Bayesian estimation in the symmetric location problem. Z.
Wahrsch. Verw. Gebiete, 68(2):127–147, 1984.
[51] Hani Doss. Bayesian nonparametric estimation of the median. I. Computation
of the estimates. Ann. Statist., 13(4):1432–1444, 1985.
[52] Hani Doss. Bayesian nonparametric estimation of the median. II. Asymptotic
properties of the estimates. Ann. Statist., 13(4):1445–1464, 1985.
[53] Hani Doss. Bayesian nonparametric estimation for incomplete data via suc-
cessive substitution sampling. Ann. Statist., 22(4):1763–1786, 1994.
290 References
[54] Hani Doss and B. Narasimhan. Dynamic display of changing posterior
in Bayesian survival analysis. In Practical nonparametric and semiparametric
Bayesian statistics, pages 63–87. Springer, New York, 1998.
[55] Hani Doss and Thomas Sellke. The tails of probabilities chosen from a
Dirichlet prior. Ann. Statist., 10(4):1302–1305, 1982.
[56] L. Drˇ
aghici and R. V. Ramamoorthi. A note on the absolute continuity
and singularity of Polya tree priors and posteriors. Scand. J. Statist., 27(2):299–
303, 2000.
[57] R. M. Dudley. Measures on non-separable metric spaces. Illinois J. Math.,
11:449–453, 1967.
[58] Richard M. Dudley.Real analysis and probability. Wadsworth &
Brooks/Cole Advanced Books & Software, Pacific Grove, CA, 1989.
[59] Michael D. Escobar and Mike West. Bayesian density estimation and
inference using mixtures. J. Amer. Statist. Assoc., 90(430):577–588, 1995.
[60] Michael D. Escobar and Mike West. Computing nonparametric hi-
erarchical models. In Practical nonparametric and semiparametric Bayesian
statistics, pages 1–22. Springer, New York, 1998.
[61] Thomas S. Ferguson. A Bayesian analysis of some nonparametric problems.
Ann. Statist., 1:209–230, 1973.
[62] Thomas S. Ferguson. Prior distributions on spaces of probability measures.
Ann. Statist., 2:615–629, 1974.
[63] Thomas S. Ferguson. Bayesian density estimation by mixtures of normal
distributions. In Recent advances in statistics, pages 287–302. Academic Press,
New York, 1983.
[64] Thomas S. Ferguson and Eswar G. Phadia. Bayesian nonparametric
estimation based on censored data. Ann. Statist., 7(1):163–186, 1979.
[65] Thomas S. Ferguson, Eswar G. Phadia, and Ram C. Tiwari. Bayesian
nonparametric inference. In Current issues in statistical inference: essays in
honor of D. Basu, pages 127–150. Inst. Math. Statist., Hayward, CA, 1992.
[66] J.-P. Florens, M. Mouchart, and J.-M. Rolin. Bayesian analysis of mixtures: some results on exact estimability and identification. In Bayesian statistics, 4 (Peñíscola, 1991), pages 127–145. Oxford Univ. Press, New York, 1992.
[67] Sandra Fortini, Lucia Ladelli, and Eugenio Regazzini. Exchangeability, predictive distributions and parametric models. Sankhyā Ser. A, 62(1):86–109, 2000.
[68] David A. Freedman. Invariants under mixing which generalize de Finetti’s
theorem: Continuous time parameter. Ann. Math. Statist., 34:1194–1216, 1963.
[69] David A. Freedman. On the asymptotic behavior of Bayes’ estimates in the
discrete case. Ann. Math. Statist., 34:1386–1403, 1963.
[70] David A. Freedman. On the asymptotic behavior of Bayes estimates in the
discrete case. II. Ann. Math. Statist., 36:454–456, 1965.
[71] Marie Gaudard and Donald Hadwin. Sigma-algebras on spaces of prob-
ability measures. Scand. J. Statist., 16(2):169–175, 1989.
[72] J. K. Ghorai and H. Rubin. Bayes risk consistency of nonparametric Bayes
density estimates. Austral. J. Statist., 24(1):51–66, 1982.
[73] S. Ghosal, J. K. Ghosh, and R. V. Ramamoorthi. Non-informative priors via sieves and packing numbers. In Advances in statistical decision theory and applications, pages 119–132. Birkhäuser Boston, Boston, MA, 1997.
[74] S. Ghosal, J. K. Ghosh, and R. V. Ramamoorthi. Posterior consistency
of Dirichlet mixtures in density estimation. Ann. Statist., 27(1):143–158, 1999.
[75] Subhashis Ghosal. Normal approximation to the posterior distribution
for generalized linear models with many covariates. Math. Methods Statist.,
6(3):332–348, 1997.
[76] Subhashis Ghosal. Asymptotic normality of posterior distributions in high-
dimensional linear models. Bernoulli, 5(2):315–331, 1999.
[77] Subhashis Ghosal. Asymptotic normality of posterior distributions for expo-
nential families when the number of parameters tends to infinity. J. Multivariate
Anal., 74(1):49–68, 2000.
[78] Subhashis Ghosal, Jayanta K. Ghosh, and R. V. Ramamoorthi.
Consistent semiparametric Bayesian inference about a location parameter. J.
Statist. Plann. Inference, 77(2):181–193, 1999.
[79] Subhashis Ghosal, Jayanta K. Ghosh, and Tapas Samanta. On convergence of posterior distributions. Ann. Statist., 23(6):2145–2152, 1995.
[80] Subhashis Ghosal, Jayanta K. Ghosh, and Aad W. van der Vaart.
Convergence rates of posterior distributions. Ann. Statist., 28(2):500–531, 2000.
[81] J. K. Ghosh, R. V. Ramamoorthi, and K. R. Srikanth. Bayesian anal-
ysis of censored data. Statist. Probab. Lett., 41(3):255–265, 1999. Special issue
in memory of V. Susarla.
[82] J. K. Ghosh, B. K. Sinha, and S. N. Joshi. Expansions for posterior
probability and integrated Bayes risk. In Statistical decision theory and related
topics, III, Vol. 1 (West Lafayette, Ind., 1981), pages 403–456. Academic Press,
New York, 1982.
[83] Jayanta K. Ghosh. Higher Order Asymptotics, volume 4 of NSF-CBMS Regional Conference Series in Probability and Statistics, 1994.
[84] Jayanta K. Ghosh, Subhashis Ghosal, and Tapas Samanta. Stability
and convergence of the posterior in non-regular problems. In Statistical deci-
sion theory and related topics, V (West Lafayette, IN, 1992), pages 183–199.
Springer, New York, 1994.
[85] Jayanta K. Ghosh, Shrikant N. Joshi, and Chiranjit Mukhopadhyay. Asymptotics of a Bayesian approach to estimating change-point in a hazard rate. Comm. Statist. Theory Methods, 25(12):3147–3166, 1996.
[86] Jayanta K. Ghosh and Rahul Mukerjee. Non-informative priors. In Bayesian statistics, 4 (Peñíscola, 1991), pages 195–210. Oxford Univ. Press, New York, 1992.
[87] Jayanta K. Ghosh and R. V. Ramamoorthi. Consistency of Bayesian
inference for survival analysis with or without censoring. In Analysis of cen-
sored data (Pune, 1994/1995), pages 95–103. Inst. Math. Statist., Hayward,
CA, 1995.
[88] Jayanta K. Ghosh and Tapas Samanta. Nonsubjective Bayes testing—an
overview. J. Statist. Plann. Inference, 103(1-2):205–223, 2002. C. R. Rao 80th
birthday felicitation volume, Part I.
[89] J. K. Ghosh. Review of Approximation Theorems of Mathematical Statistics by Serfling. J. Amer. Statist. Assoc., 78(383):732, September 1983.
[90] Richard D. Gill and Søren Johansen. A survey of product-integration
with a view toward application in survival analysis. Ann. Statist., 18(4):1501–
1555, 1990.
[91] Piet Groeneboom. Nonparametric estimators for interval censoring prob-
lems. In Analysis of censored data (Pune, 1994/1995), volume 27 of IMS Lec-
ture Notes Monogr. Ser., pages 105–128. Inst. Math. Statist., Hayward, CA,
1995.
[92] J. Hannan. Consistency of maximum likelihood estimation of discrete distri-
butions. In Contributions to probability and statistics, pages 249–257. Stanford
Univ. Press, Stanford, Calif., 1960.
[93] J. A. Hartigan. Bayes theory. Springer-Verlag, New York, 1983.
[94] J. A. Hartigan. Bayesian histograms. In Bayesian statistics, 5 (Alicante,
1994), pages 211–222. Oxford Univ. Press, New York, 1996.
[95] David Heath and William Sudderth. De Finetti’s theorem on exchange-
able variables. Amer. Statist., 30(4):188–189, 1976.
[96] David Heath and William Sudderth. On finitely additive priors, coher-
ence, and extended admissibility. Ann. Statist., 6(2):333–345, 1978.
[97] David Heath and William Sudderth. Coherent inference from improper
priors and from finitely additive priors. Ann. Statist., 17(2):907–919, 1989.
[98] N. L. Hjort. Bayesian approaches to non- and semiparametric density esti-
mation. In Bayesian statistics, 5 (Alicante, 1994), pages 223–253. Oxford Univ.
Press, New York, 1996.
[99] Nils Lid Hjort. Application of the Dirichlet Process to some nonparametric estimation problems (in Norwegian). Ph.D. thesis, University of Tromsø.
[100] Nils Lid Hjort. Nonparametric Bayes estimators based on beta processes in
models for life history data. Ann. Statist., 18(3):1259–1294, 1990.
[101] N. L. Hjort and D. Pollard. Asymptotics of minimisers of convex processes. Statistical Research Report, Department of Mathematics, University of Oslo, 1994.
[102] I. A. Ibragimov and R. Z. Hasminskiĭ. Statistical estimation. Springer-Verlag, New York, 1981. Asymptotic theory, translated from the Russian by Samuel Kotz.
[103] Hemant Ishwaran. Exponential posterior consistency via generalized Pólya urn schemes in finite semiparametric mixtures. Ann. Statist., 26(6):2157–2178, 1998.
[104] K. Ito. Stochastic processes. Matematisk Institut, Aarhus Universitet, Aarhus, 1969.
[105] Lancelot James. Poisson process partition calculus with applications to exchangeable models and Bayesian nonparametrics.
[106] Harold Jeffreys. An invariant form for the prior probability in estimation
problems. Proc. Roy. Soc. London. Ser. A., 186:453–461, 1946.
[107] R. A. Johnson. An asymptotic expansion for posterior distributions. Ann.
Math. Statist., 38:1899–1906, 1967.
[108] Richard A. Johnson. Asymptotic expansions associated with posterior dis-
tributions. Ann. Math. Statist., 41:851–864, 1970.
[109] Joseph B. Kadane, James M. Dickey, Robert L. Winkler, Wayne S.
Smith, and Stephen C. Peters. Interactive elicitation of opinion for a
normal linear model. J. Amer. Statist. Assoc., 75(372):845–854, 1980.
[110] Olav Kallenberg. Foundations of modern probability. Springer-Verlag, New York, 1997.
[111] Robert E. Kass and Larry Wasserman. The selection of prior distributions by formal rules. J. Amer. Statist. Assoc., 91:1343–1370, 1996.
[112] J. H. B. Kemperman. On the optimum rate of transmitting information.
Ann. Math. Statist., 40:2156–2177, 1969.
[113] Yongdai Kim. Nonparametric Bayesian estimators for counting processes.
Ann. Statist., 27(2):562–588, 1999.
[114] Yongdai Kim and Jaeyong Lee. On posterior consistency of survival mod-
els. Ann. Statist., 29(3):666–686, 2001.
[115] A. N. Kolmogorov and V. M. Tihomirov. ε-entropy and ε-capacity of sets in functional space. Amer. Math. Soc. Transl. (2), 17:277–364, 1961.
[116] Ramesh M. Korwar and Myles Hollander. Contributions to the theory
of Dirichlet processes. Ann. Probability, 1:705–711, 1973.
[117] Steffen L. Lauritzen. Extremal families and systems of sufficient statistics. Springer-Verlag, New York, 1988.
[118] Michael Lavine. Some aspects of Pólya tree distributions for statistical modelling. Ann. Statist., 20(3):1222–1235, 1992.
[119] Michael Lavine. More aspects of Pólya tree distributions for statistical modelling. Ann. Statist., 22(3):1161–1176, 1994.
[120] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer-Verlag, New York, 1986.
[121] Lucien Le Cam and Grace Lo Yang. Asymptotics in statistics. Springer-Verlag, New York, 1990. Some basic concepts.
[122] L. LeCam. Convergence of estimates under dimensionality restrictions. Ann.
Statist., 1:38–53, 1973.
[123] E. L. Lehmann. Testing statistical hypotheses. Springer-Verlag, New York, second edition, 1997.
[124] E. L. Lehmann. Theory of point estimation. Springer-Verlag, New York, 1997. Reprint of the 1983 original.
[125] Peter J. Lenk. The logistic normal distribution for Bayesian, nonparametric,
predictive densities. J. Amer. Statist. Assoc., 83(402):509–516, 1988.
[126] Tom Leonard. A Bayesian approach to some multinomial estimation and
pretesting problems. J. Amer. Statist. Assoc., 72(360, part 1):869–874, 1977.
[127] Tom Leonard. Density estimation, stochastic processes and prior informa-
tion. J. Roy. Statist. Soc. Ser. B, 40(2):113–146, 1978. With discussion.
[128] D. V. Lindley. On a measure of the information provided by an experiment.
Ann. Math. Statist., 27:986–1005, 1956.
[129] D. V. Lindley. The use of prior probability distributions in statistical infer-
ence and decisions. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob.,
Vol. I , pages 453–468. Univ. California Press, Berkeley, Calif., 1961.
[130] Albert Y. Lo. Consistency in the location model: the undominated case.
Ann. Statist., 12(4):1584–1587, 1984.
[131] Albert Y. Lo. On a class of Bayesian nonparametric estimates. I. Density
estimates. Ann. Statist., 12(1):351–357, 1984.
[132] Michel Loève. Probability theory. II. Springer-Verlag, New York, fourth edition, 1978. Graduate Texts in Mathematics, Vol. 46.
[133] R. Daniel Mauldin, William D. Sudderth, and S. C. Williams. Pólya trees and random distributions. Ann. Statist., 20(3):1203–1221, 1992.
[134] Amewou-Atisso Messan, Subhashis Ghosal, Jayanta K. Ghosh, and R. V. Ramamoorthi. Posterior consistency for semiparametric regression problems. Bernoulli, to appear, 2002.
[135] Radford M. Neal. Markov chain sampling methods for Dirichlet process
mixture models. J. Comput. Graph. Statist., 9(2):249–265, 2000.
[136] Michael A. Newton, Claudia Czado, and Rick Chappell. Bayesian
inference for semiparametric binary regression. J. Amer. Statist. Assoc.,
91(433):142–153, 1996.
[137] Michael A. Newton, Fernando A. Quintana, and Yunlei Zhang.
Nonparametric Bayes methods using predictive updating. In Practical non-
parametric and semiparametric Bayesian statistics, pages 45–61. Springer, New
York, 1998.
[138] Arthur V. Peterson, Jr. Expressing the Kaplan-Meier estimator as a
function of empirical subsurvival functions. J. Amer. Statist. Assoc., 72(360,
part 1):854–858, 1977.
[139] David Pollard. Convergence of stochastic processes. Springer-Verlag, New York, 1984.
[140] David Pollard. A user's guide to measure theoretic probability. Cambridge University Press, Cambridge, 2002.
[141] Kathryn Roeder and Larry Wasserman. Practical Bayesian density
estimation using mixtures of normals. J. Amer. Statist. Assoc., 92(439):894–
902, 1997.
[142] Donald B. Rubin. The Bayesian bootstrap. Ann. Statist., 9(1):130–134,
1981.
[143] Gabriella Salinetti. Consistency of statistical estimators: the epigraphi-
cal view. In Stochastic optimization: algorithms and applications (Gainesville,
FL, 2000), volume 54 of Appl. Optim., pages 365–383. Kluwer Acad. Publ.,
Dordrecht, 2001.
[144] Mark J. Schervish. Theory of statistics. Springer-Verlag, New York, 1995.
[145] Lorraine Schwartz. On Bayes procedures. Z. Wahrscheinlichkeitstheorie
und Verw. Gebiete, 4:10–26, 1965.
[146] Gideon Schwarz. Estimating the dimension of a model. Ann. Statist.,
6(2):461–464, 1978.
[147] Robert J. Serfling. Approximation theorems of mathematical statistics. John Wiley & Sons Inc., New York, 1980. Wiley Series in Probability and Mathematical Statistics.
[148] Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statist.
Sinica, 4(2):639–650, 1994.
[149] Jayaram Sethuraman and Ram C. Tiwari. Convergence of Dirichlet
measures and the interpretation of their parameter. In Statistical decision the-
ory and related topics, III, Vol. 2 (West Lafayette, Ind., 1981), pages 305–315.
Academic Press, New York, 1982.
[150] Xiaotong Shen and Larry Wasserman. Rates of convergence of posterior
distributions. Ann. Statist., 29(3):687–714, 2001.
[151] B. W. Silverman. Density estimation for statistics and data analysis. Chapman & Hall, London, 1986.
[152] Richard L. Smith. Nonregular regression. Biometrika, 81(1):173–183, 1994.
[153] S. M. Srivastava. A course on Borel sets. Springer-Verlag, New York, 1998.
[154] V. Susarla and J. Van Ryzin. Nonparametric Bayesian estimation
of survival curves from incomplete observations. J. Amer. Statist. Assoc.,
71(356):897–902, 1976.
[155] V. Susarla and J. Van Ryzin. Large sample theory for a Bayesian non-
parametric survival curve estimator based on censored samples. Ann. Statist.,
6(4):755–768, 1978.
[156] Henry Teicher. Identifiability of finite mixtures. Ann. Math. Statist.,
34:1265–1269, 1963.
[157] Daniel Thorburn. A Bayesian approach to density estimation. Biometrika,
73(1):65–75, 1986.
[158] Luke Tierney and Joseph B. Kadane. Accurate approximations for pos-
terior moments and marginal densities. J. Amer. Statist. Assoc., 81(393):82–86,
1986.
[159] Bruce W. Turnbull. The empirical distribution function with arbitrarily
grouped, censored and truncated data. J. Roy. Statist. Soc. Ser. B, 38(3):290–
295, 1976.
[160] A. W. van der Vaart. Asymptotic statistics. Cambridge University Press, Cambridge, 1998.
[161] Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes. Springer-Verlag, New York, 1996. With applications to statistics.
[162] Richard von Mises. Probability, statistics and truth. Dover Publications Inc., New York, English edition, 1981.
[163] Abraham Wald. Note on the consistency of the maximum likelihood esti-
mate. Ann. Math. Statistics, 20:595–601, 1949.
[164] A. M. Walker. On the asymptotic behaviour of posterior distributions. J.
Roy. Statist. Soc. Ser. B, 31:80–88, 1969.
[165] Stephen Walker and Nils Lid Hjort. On Bayesian consistency. J. R.
Stat. Soc. Ser. B Stat. Methodol., 63(4):811–821, 2001.
[166] Stephen Walker and Pietro Muliere. Beta-Stacy processes and a generalization of the Pólya-urn scheme. Ann. Statist., 25(4):1762–1780, 1997.
[167] Stephen Walker and Pietro Muliere. A characterization of a neutral
to the right prior via an extension of Johnson’s sufficientness postulate. Ann.
Statist., 27(2):589–599, 1999.
[168] Mike West. Modelling with mixtures. In Bayesian statistics, 4 (Peñíscola, 1991), pages 503–524. Oxford Univ. Press, New York, 1992.
[169] Mike West. Approximating posterior distributions by mixtures. J. Roy.
Statist. Soc. Ser. B, 55(2):409–422, 1993.
[170] Mike West, Peter Müller, and Michael D. Escobar. Hierarchical priors and mixture models, with application in regression and density estimation. In Aspects of uncertainty, pages 363–386. Wiley, Chichester, 1994.
[171] E. T. Whittaker and G. N. Watson. A course of modern analysis. Cambridge University Press, Cambridge, 1996. An introduction to the general theory of infinite processes and of analytic functions; with an account of the principal transcendental functions, Reprint of the fourth (1927) edition.
[172] Wing Hung Wong and Xiaotong Shen. Probability inequalities for likeli-
hood ratios and convergence rates of sieve MLEs. Ann. Statist., 23(2):339–362,
1995.
[173] Michael Woodroofe. Very weak expansions for sequentially designed ex-
periments: linear models. Ann. Statist., 17(3):1087–1102, 1989.
Index
affinity, 13
Albert, J.H., 213
amenable group, 52
Andersen, P.K., 237
Antoniak, C.E., 113
Bahadur, R.R., 29
ball, 10
Barron, A., 48, 132, 133, 135–137, 143,
171, 181, 182, 191
Basu, D., 46, 103
Basu, S., 213, 214
Bayes estimates, 122
asymptotic normality, 38
consistency, 122
Berger, J., 46, 47, 50, 51, 88, 229
Berk, R.H., 103
Bernardo, J.M., 47–50, 52, 228, 229
Bernstein, 34
beta distribution, 87
beta process, 254, 265
consistency, 279
construction, 266
definition, 265
properties, 268–270
beta-Stacy process, 254, 270
BIC, 40
Bickel, P., 34
Billingsley, P., 12, 13
Birgé, L., 232, 234
Blackwell, D., 21, 103
Blum, J., 238
Borgan, Ø., 237
Borwanker, J., 35
boundary, 10
bracket, 233
Bunke, O., 183
Burr, D., 198
Cencov, N.N., 222
censored data, 237
consistency, 241, 247
change point, 45
Chen, M., 147, 213
Chib, S., 213
Clarke, B., 48
closed, 10
compact, 10
conjugate prior, 53
Connor, R.J., 253
consistency
L1, 122, 135
strong, 122
consistency of posterior, 17, 26
consistent estimate, 33
Cooke, G.E., 198
Cramér, H., 33, 177
cumulative hazard function, 242, 243,
253, 258
Datta, G., 50–52, 229
Dawid, A.P., 51
De Finetti’s theorem, 64
Dembski, W.A., 221, 223, 224
density estimation, 141
Dey, D.K., 213
Dey, J., 257, 259, 271
Diaconis, P., 21, 22, 31, 53, 55, 86, 113,
181–185, 192, 195
Dirichlet density, 62
Dirichlet distribution, 87, 89
polya urn, 94
properties, 89–94
Bayes estimate, 95
Dirichlet mixtures, 143
normal densities, 144, 161, 197, 198,
209, 222
L1-Consistency, 169, 172
weak consistency, 162, 164, 165
uniform densities, see random his-
tograms
Dirichlet process, 96
convergence properties, 105
discrete support, 102
existence, 96
mixtures of, 113
mutual singularity, 110
neutral to the right, 99
posterior, 96
posterior consistency, 106
predictive distribution, 99
Sethuraman construction, 103
support, 104
tail free, 98
Doksum, K.A., 120, 253, 257, 259
Doss, H., 166, 181, 198
Drăghici, L., 120, 257
Dubins, L., 21
Dudley, R., 16, 81
empirical process, 26
entropy, 47
Erickson, R.V., 259, 271
Escobar, M.D., 142, 146, 147
Ferguson, T., 87, 107, 114, 143, 144,
146, 253, 257, 264
finitely additive prior, 52
Fisher information, 40
Florens, 146
Fortini, S., 86
Freedman, D., 21, 22, 24, 31, 55, 86,
113, 181–185, 192, 195
Gasperini, M., 142, 150, 151
Gaudard, M., 61
Gaussian process priors, 174
sample paths, 175, 176
Ghorai, J.K., 161
Ghosal, S., 18, 35, 43, 45, 187, 198, 202,
231
Ghosh, J.K., 18, 33, 35, 39, 40, 43, 45–
47, 50–52, 187, 198, 202, 229,
231
Gill, R., 237, 244
Glivenko-Cantelli theorem, 59
Goldschmidt-Clermont, P.J., 198
Haar measure
left invariant, 51
right invariant, 51
Hannan, J., 14
Hartigan, J.A., 142, 223
Hasminskiĭ, R.Z., 41
Heath, D., 52, 83
Hellinger distance, 41
Hjort, N.L., 28, 103, 142, 143, 245, 253,
254, 264–268
Hoeffeding’s inequality, 128, 136
Hollander, M., 110
hyperparameter, 113, 146
Ibragimov, I.A., 41
IH conditions, 41
independent increment process, 253, 258–
260
interior, 10
Ito, K., 260
Jeffreys prior, 47, 49, 51, 221, 222, 225,
228
Johansen, S., 244
Johnson, R.A., 35
joint distribution, 16
Joshi, S.N., 35, 39, 40, 45
K-L support, 181
Kadane, J., 35, 54
Kallenberg, O., 260
Kallianpur, G., 35
Kaplan-Meier, 238, 241, 242, 249
Kass, R., 46
Keiding, N., 237
Kemperman, J.H.B., 15
Kim, Y., 254, 279
Kolmogorov, A.N., 64, 227
Kolmogorov strong law, 199, 200
Korwar, R., 110
Kraft, C., 76
Kullback-Leibler
divergence, 14, 126
support, 126, 129, 197
Lévy measure, 253, 261, 263
Lévy representation, 253, 259, 264
Ladelli, L., 86
Laplace, 17, 34
Lauritzen, S.L., 55
Lavine, M., 114, 143, 190, 192, 195
Leadbetter, M.R., 177
LeCam’s inequality, 137
LeCam, L., 34, 137, 231, 232
Lee, J., 254, 279
Lehmann, E.L., 28, 29, 34
Lenk, P., 142, 174, 175, 177, 180
Leonard, T., 142, 174, 175
Lindley, D., 35, 48
link function, 198
Lo, A.Y., 142, 143, 161
location parameter, 181
consistency, 185, 186, 188
Dirichlet prior, 182
consistency, 185
posterior for θ, 183
log concave, 28
logit, 213
Mahalanobis, D., 222
marginal distribution, 16
Massart, 234
Mauldin, R.D., 94, 114, 116, 119
maximum likelihood
asymptotic normality, 33, 34
consistency, 26, 28
estimate, 26, 249
inconsistency, 29
MacQueen, J.B., 103
measure of information, 48
merging, 20
Messan, C.A., 198, 202, 215
metric
L1,13
compact, 11, 24
complete separable, 13, 24, 58, 60
Hellinger, 13, 58
separable space, 11
space, 10
supremum, 59, 81
total variation, 13, 58, 60
metric entropy
L1, 135, 137
bracketing, 135, 137
Milhaud, X., 183
Mosimann, J.E., 253
Mueller, P., 142, 146, 147
Mukerjee, R., 46
Mukhopadhyay, C., 45
Mukhopadhyay, S., 213, 214
Muliere, P., 257, 270
multinomial, 24, 54, 67
Neal, R., 147
neighborhood base, 13
neutral to the right prior, 253
beta Stacy process, 255
characterization, 257
consistency, 271, 272, 275, 279
definition, 254
Dirichlet process, 255
existence, 263
inconsistency, 276
posterior, 256
posterior L´evy measure, 264
support, 262
Newton, M.A., 107, 114, 147
nonergodic, 52
noninformative prior, 46
nonregular, 35, 41, 44
nonseparable, 58, 60
nonsubjective prior, 10, 46, 51–53, 221
open, 10
packing number, 224
Pericchi, L., 46
Peterson, A.V., 238, 246
Phadia, E., 253, 257
Pollard, D., 11, 26, 28
Polya tree, 142, 209
consistency, 118
existence, 116
Kullback-Leibler support, 190
marginal distribution, 117
on densities, 120
posterior, 117
predictive distribution, 118
prior, 73, 198
process, 114
support, 119
posterior distribution, 16
posterior inconsistency, 31
posterior normality, 34, 35, 42
posterior robustness, 18
predictive distribution, 21
probit, 213
product integral, 245, 264
proper centering, 43
Quintana, F. A., 147
Ramamoorthi, R.V., 120, 187, 198, 202,
257, 259, 271
random histograms, 144, 148, 222
L1-consistency, 156, 160
weak consistency, 150, 152
Rao, B.L.S. Prakasa, 35
Rao, C.R., 222
rates of convergence, 141
reference prior, 47, 50, 51
Regazzini, E., 55, 86
regression
coefficient, 198
Schwartz theorem, 197, 198
Rubin, D., 106
Rubin, H, 161
Ryzin, J. Van, 238, 241, 249
Salinetti, G., 122
Samanta, T., 18, 43, 45, 46
Savage, I.R., 103
Schervish, M., 54, 55, 63, 83, 86, 96,
106, 143, 147, 171
Schwartz, 31
Schwarz, G., 40
Sellke, T., 166
Serfling, R.J., 33
Sethuraman, J., 79, 96, 103, 105
setwise convergence, 58–61, 81
Shannon, C.L., 47
Shen, X., 221, 230, 231, 232, 234
Silverman, B.W., 141
Sinha, B.K., 35, 39, 40
Smith, R.L., 41
Srivastava, S.M., 24
Stein, C., 31
Stone, M., 51
strong consistency, 135
Sudderth, W.D., 52, 83, 94, 114, 116,
119
support, 11, 24
topological, 11
survival function, 254
Susarla, V., 238, 241, 249
tail free prior, 71
0-1 laws, 75
consistency, 126
existence, 71
on densities, 76
posterior, 74
Teicher, H., 144
test
exponentially consistent, 127, 129,
203, 204, 214
unbiased, 127, 131
uniformly consistent, 127, 129, 131,
132
theorem
π-λ, 11, 60
Bernstein–von Mises, 33, 42, 44
Borel Isomorphism, 24
De Finetti, 55, 63, 83, 86, 95
Doob, 22, 31
Kolmogorov consistency, 64, 66
portmanteau, 12, 80
Prohorov, 13, 60, 80
Schwartz, 33, 129, 181
Stone-Weierstrass, 21
Thorburn, D., 174
Tierney, L., 35
tight, 13, 79
Tihomirov, V.M., 227
Tiwari, R.C., 79, 103, 105
Turnbull, B., 249
uniform strong law, 24, 26, 27
upper bracketing numbers, 234
Vaart, van der, 12, 26, 141, 231, 234
von Mises, R., 18, 34
Wald’s conditions, 27
Wald, A., 27
Walker, A.M., 34
Walker, S., 143, 257, 270
Wasserman, L., 46, 143, 171, 231
Watson, G.N., 194
weak consistency, 122
weak convergence, 12, 13, 60
Wellner, J., 12, 26, 234
West, M., 142, 146, 147
Whittaker, E.T., 194
Williams, S.C., 94, 114, 116, 119
Wong, W., 221, 230, 232, 234
Woodroofe, M., 35
Yahav, J.A., 34
Ylvisaker, D., 53
Zhang, Y., 147
Zidek, J.V., 51
... A standard approach to BCLTs emphasizes the large-data distributional limit of the Bayesian posterior, e.g., by showing that the total variation distance between the Bayesian posterior and a normal approximation converges to zero in probability. For example, this is the approach taken by Van der Vaart 2000, Chapter 10, Theorem 1.4.2 of Ghosh and Ramamoorthi 2003, and Lehmann and Casella 2006, on which our proof is principally based. Results such as our Theorem 1, which emphasize the frequentist properties of posterior quantities, may seem to be of a different character than distributional approximations to the posterior itself. ...
... For example, one must ensure that there are no other posterior modes very nearly as high as that at θ̂. In the present work we follow Ghosh and Ramamoorthi 2003; Lehmann and Casella 2006 and manage the posterior far from θ̂ with the final condition of Assumption 1, namely that θ̂ is a strict optimum of the empirical likelihood with probability approaching one. This assumption may seem somewhat artificial, since it is stated in terms of the data, rather than of any limiting population quantities. ...
Preprint
The frequentist variability of Bayesian posterior expectations can provide meaningful measures of uncertainty even when models are misspecified. Classical methods to asymptotically approximate the frequentist covariance of Bayesian estimators such as the Laplace approximation and the nonparametric bootstrap can be practically inconvenient, since the Laplace approximation may require an intractable integral to compute the marginal log posterior, and the bootstrap requires computing the posterior for many different bootstrap datasets. We develop and explore the infinitesimal jackknife (IJ), an alternative method for computing asymptotic frequentist covariance of smooth functionals of exchangeable data, which is based on the ``influence function'' of robust statistics. We show that the influence function for posterior expectations has the form of a simple posterior covariance, and that the IJ covariance estimate is, in turn, easily computed from a single set of posterior samples. Under conditions similar to those required for a Bayesian central limit theorem to apply, we prove that the corresponding IJ covariance estimate is asymptotically equivalent to the Laplace approximation and the bootstrap. In the presence of nuisance parameters that may not obey a central limit theorem, we argue heuristically that the IJ covariance can remain a good approximation to the limiting frequentist variance. We demonstrate the accuracy and computational benefits of the IJ covariance estimates with simulated and real-world experiments.
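The abstract notes that the influence function of a posterior expectation is a simple posterior covariance, obtainable from a single set of posterior draws. A small sketch of that idea follows; everything in it (the function name ij_variance, the toy normal model with known unit variance, and the convention that the IJ variance estimate is the sum of squared influence scores) is an illustrative reading of the abstract, not the authors' exact estimator.

```python
import numpy as np

def ij_variance(theta_draws, g, loglik_terms):
    """Infinitesimal-jackknife sketch of the frequentist variance of E[g(theta) | data].

    theta_draws  : (S, p) array of posterior draws.
    g            : maps one draw to the scalar functional of interest.
    loglik_terms : maps one draw to the (n,) vector of per-observation
                   log-likelihood contributions (user-supplied model code).
    """
    g_vals = np.array([g(th) for th in theta_draws])          # (S,)
    ell = np.array([loglik_terms(th) for th in theta_draws])  # (S, n)
    # Influence score of observation i: posterior covariance Cov(g(theta), ell_i(theta)),
    # estimated here by the sample covariance over the posterior draws.
    g_c = g_vals - g_vals.mean()
    ell_c = ell - ell.mean(axis=0)
    scores = g_c @ ell_c / (len(g_vals) - 1)                  # (n,)
    # Assumed IJ convention: frequentist variance estimate = sum of squared scores.
    return np.sum(scores ** 2)

# Toy check: normal mean with known unit variance and a flat prior,
# where the posterior can be sampled directly.
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=200)
post = rng.normal(x.mean(), 1.0 / np.sqrt(len(x)), size=5000)[:, None]
v_ij = ij_variance(post, lambda th: th[0], lambda th: -0.5 * (x - th[0]) ** 2)
print(v_ij, x.var(ddof=1) / len(x))   # the two numbers should be of similar size
```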
... By Proposition 4.4.1 of Ghosh and Ramamoorthi (2003), this implies the existence of exponentially consistent tests. The case where H₁₋ and (β − β₀) ∈ Q_{γ,−} can be obtained in a similar way. ...
... where U_δ is a weak neighbourhood of f₀(y|λ). Therefore, exponentially consistent tests always exist for Theorem 4.4.2 of Ghosh and Ramamoorthi (2003). If Δ₂ and ρ are taken small enough, the union would contain ...
Preprint
In this work, we will investigate a Bayesian approach to estimating the parameters of long memory models. Long memory, characterized by the phenomenon of hyperbolic autocorrelation decay in time series, has garnered significant attention. This is because, in many situations, the assumption of short memory, such as the Markovianity assumption, can be deemed too restrictive. Applications for long memory models can be readily found in fields such as astronomy, finance, and environmental sciences. However, current parametric and semiparametric approaches to modeling long memory present challenges, particularly in the estimation process. In this study, we will introduce various methods applied to this problem from a Bayesian perspective, along with a novel semiparametric approach for deriving the posterior distribution of the long memory parameter. Additionally, we will establish the asymptotic properties of the model. An advantage of this approach is that it allows one to implement state-of-the-art efficient algorithms for nonparametric Bayesian models.
... The parameter takes the form and can be interpreted as the concentration of p. If the elements a_ij for i = 1, …, R and j = 1, …, C have similar and large values, e.g., a_ij = 1000 for all i, j, then the prior distribution is very informative, where the probability of any p_l, l = 1, …, R·C, being large would be equal for all p_l [57]. In contrast, if all a_ij take similar and small values, e.g., a_ij = 0.1 for all i and j, then the resulting distribution p would be noninformative, and the resulting p will resemble the uniform distribution of dimension R·C. ...
Article
Full-text available
The χ² test is among the most widely used statistical hypothesis tests in medical research. Often, the statistical analysis deals with the test of row-column independence in a 2×2 contingency table, and the statistical parameter of interest is the odds ratio. A novel Bayesian analogue to the frequentist χ² test is introduced. The test is based on a Dirichlet-multinomial model under a joint sampling scheme and works with balanced and unbalanced randomization. The test focusses on the quantity of interest in a variety of medical research, the odds ratio in a 2×2 contingency table. A computational implementation of the test is developed and R code is provided to apply the test. To meet the demands of regulatory agencies, a calibration of the Bayesian test is introduced which allows one to calibrate the false-positive rate and power. The latter provides a Bayes-frequentist compromise which ensures control over the long-term error rates of the test. Illustrative examples using clinical trial data and simulations show how to use the test in practice. In contrast to existing Bayesian tests for 2×2 tables, calibration of the acceptance threshold for the hypothesis of interest allows one to achieve a bound on the false-positive rate and minimum power for a prespecified odds ratio of interest. The novel Bayesian test provides an attractive choice for Bayesian biostatisticians who face the demands of regulatory agencies which usually require formal control over false-positive errors and power under the alternative. As such, it constitutes an easy-to-apply addition to the arsenal of already existing Bayesian tests.
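A bare-bones Monte Carlo illustration of the testing idea just described: under a joint multinomial sampling scheme, a Dirichlet prior on the four cell probabilities is conjugate, so the posterior of the odds ratio can be summarized directly from Dirichlet draws. The prior concentration a = 0.1, the toy table, and the summaries printed below are placeholders for illustration, not the calibrated thresholds the article proposes.

```python
import numpy as np

def posterior_odds_ratio(table, a=0.1, draws=100_000, seed=1):
    """Posterior draws of the odds ratio of a 2x2 table under a Dirichlet(a, a, a, a) prior."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(table, dtype=float).ravel()      # (n11, n12, n21, n22)
    post = rng.dirichlet(counts + a, size=draws)         # conjugate Dirichlet posterior
    p11, p12, p21, p22 = post.T
    return (p11 * p22) / (p12 * p21)

# Hypothetical 2x2 table: rows = treatment/control, columns = event/no event.
table = [[18, 32],
         [9, 41]]
or_draws = posterior_odds_ratio(table)
lo, hi = np.percentile(or_draws, [2.5, 97.5])
print(f"posterior median OR = {np.median(or_draws):.2f}, 95% interval = ({lo:.2f}, {hi:.2f})")
print("P(OR > 1 | data) =", np.mean(or_draws > 1))       # posterior evidence of association
```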
... Moreover, we also study the basic theoretical question of posterior consistency in these RKHS models within the proposed Bayesian framework. The concepts of consistency and posterior concentration are a kind of frequentist validation that have arguably been an active point of research in the last few decades, particularly in infinite-dimensional settings (see Amewou-Atisso et al., 2003;Ghosh and Ramamoorthi, 2003;Choi and Ramamoorthi, 2008), and also in the functional regression case (e.g. Lian et al., 2016;Abraham and Grollemund, 2020). ...
Preprint
Full-text available
We propose a novel Bayesian methodology for inference in functional linear and logistic regression models based on the theory of reproducing kernel Hilbert spaces (RKHS's). These models build upon the RKHS associated with the covariance function of the underlying stochastic process, and can be viewed as a finite-dimensional approximation to the classical functional regression paradigm. The corresponding functional model is determined by a function living on a dense subspace of the RKHS of interest, which has a tractable parametric form based on linear combinations of the kernel. By imposing a suitable prior distribution on this functional space, we can naturally perform data-driven inference via standard Bayes methodology, estimating the posterior distribution through Markov chain Monte Carlo (MCMC) methods. In this context, our contribution is twofold. First, we derive a theoretical result that guarantees posterior consistency in these models, based on an application of a classic theorem of Doob to our RKHS setting. Second, we show that several prediction strategies stemming from our Bayesian formulation are competitive against other usual alternatives in both simulations and real data sets, including a Bayesian-motivated variable selection procedure.
... Throughout the paper, we represent the CRE distribution through finite-dimensional mixtures; see Section 3.1. Thus, ϑ is finite-dimensional and the concentration results can be obtained from the literature on the consistency and asymptotic normality of posterior distributions; see Hartigan (1983), van der Vaart (1998), Ghosh and Ramamoorthi (2003), or Ghosal and van der Vaart (2017) for textbook treatments. The only difference to many of the results stated in the literature is that we assume the convergence in probability to occur under the marginal distribution of Y_{1:N,0:T} rather than its distribution conditional on a "true" parameter, which imposes some restrictions on the prior for ϑ. ...
Article
Full-text available
We use a dynamic panel Tobit model with heteroskedasticity to generate forecasts for a large cross‐section of short time series of censored observations. Our fully Bayesian approach allows us to flexibly estimate the cross‐sectional distribution of heterogeneous coefficients and then implicitly use this distribution as prior to construct Bayes forecasts for the individual time series. In addition to density forecasts, we construct set forecasts that explicitly target the average coverage probability for the cross‐section. We present a novel application in which we forecast bank‐level loan charge‐off rates for small banks.
... The proof of this theorem is shown in Appendix C, which is based on Chib et al. (2018), Ghosh and Ramamoorthi (2003) and Van der Vaart (2000). This theorem essentially shows the limiting posterior distribution of ξ concentrates on a √n-ball centered at the true value of ξ₀ with the same variance-covariance matrix as the M-estimator. ...
Article
Full-text available
Frequentist semiparametric theory has been used extensively to develop doubly robust (DR) causal estimation. DR estimation combines outcome regression (OR) and propensity score (PS) models in such a way that correct specification of just one of two models is enough to obtain consistent parameter estimation. An equivalent Bayesian solution, however, is not straightforward as there is no obvious distributional framework to the joint OR and PS model, and the DR approach involves a semiparametric estimating equation framework without a fully specified likelihood. In this paper, we develop a fully semiparametric Bayesian framework for DR causal inference by bridging a nonparametric Bayesian procedure with empirical likelihood theory via semiparametric linear regression. Instead of specifying a fully probabilistic model, this procedure is only realized through relevant moment conditions. Crucially, this allows the posterior distribution of the causal parameter to be simulated via Markov chain Monte Carlo methods. We show that the posterior distribution of the causal estimator satisfies consistency and the Bernstein–von Mises theorem, when either the OR or PS is correctly specified. Simulation studies suggest that our proposed method is doubly robust and can achieve the frequentist coverage. We also apply this Bayesian method to a real data example to assess the impact of speed cameras on car collisions in England.
Article
The rapid development of modeling techniques has brought many opportunities for data-driven discovery and prediction. However, this also leads to the challenge of selecting the most appropriate model for any particular data task. Information criteria, such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC), have been developed as a general class of model selection methods with profound connections with foundational thoughts in statistics and information theory. Many perspectives and theoretical justifications have been developed to understand when and how to use information criteria, which often depend on particular data circumstances. This review article will revisit information criteria by summarizing their key concepts, evaluation metrics, fundamental properties, interconnections, recent advancements, and common misconceptions to enrich the understanding of model selection in general. This article is categorized under: Data: Types and Structure > Traditional Statistical Data; Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods; Statistical and Graphical Methods of Data Analysis > Information Theoretic Methods; Statistical Models > Model Selection. Model selection for many applications, such as selecting important variables, identifying time lags for forecasting, and ranking competing models.
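As a reminder of the definitions this review revolves around, AIC = 2k − 2 log L̂ and BIC = k log n − 2 log L̂, where k is the number of free parameters and L̂ is the maximized likelihood. The comparison of two Gaussian models below is only a made-up illustration of how the two criteria are computed and read, not an example from the article.

```python
import numpy as np
from scipy import stats

def aic_bic(loglik, k, n):
    """AIC = 2k - 2*loglik; BIC = k*log(n) - 2*loglik (lower is preferred)."""
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

rng = np.random.default_rng(0)
x = rng.normal(0.2, 1.0, size=100)    # toy data, true mean slightly away from zero
n = len(x)

# Model 1: N(0, sigma^2), one free parameter.
sigma1 = np.sqrt(np.mean(x ** 2))     # MLE of sigma when the mean is fixed at 0
ll1 = stats.norm(0.0, sigma1).logpdf(x).sum()

# Model 2: N(mu, sigma^2), two free parameters.
mu2, sigma2 = x.mean(), x.std()       # MLEs of mu and sigma
ll2 = stats.norm(mu2, sigma2).logpdf(x).sum()

for name, ll, k in [("N(0, s^2)", ll1, 1), ("N(m, s^2)", ll2, 2)]:
    aic, bic = aic_bic(ll, k, n)
    print(f"{name}: logL = {ll:.2f}, AIC = {aic:.2f}, BIC = {bic:.2f}")
# BIC penalizes the extra parameter more heavily than AIC once log(n) > 2.
```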
Article
Full-text available
In the general setting of predictive inference, when observations are exchangeable and take values in a Polish space, conditions are stated in order that parametric models turn out to be limiting forms of predictive distributions and parameters are limiting forms of suitable predictive sufficient statistics. The treatment is completed by a necessary and sufficient condition in order that a sequence of predictive distributions may be consistent with an exchangeable distribution. Moreover, main properties of predictive sufficiency are revisited in the general setting described above.
Article
We consider generalized linear models and study the asymptotic properties of the posterior distribution where the dimension of the parameter is allowed to grow to infinity with the sample size. Under certain growth restrictions on the dimension, we show that the posterior distribution is consistent and admits a normal approximation. This result can be used to construct procedures with asymptotic Bayesian validity.
Article
We consider using scale mixtures of multivariate normal links (SMMVN) to model binary responses when binary observations are taken from the same individuals or are taken over time in a longitudinal fashion. SMMVN-links are quite rich, which include multivariate probit, Student’s t links, logit, symmetric stable link, and exponential power link. Fully parametric classical approaches to these are intractable and thus Bayesian methods are pursued using a Markov chain Monte Carlo (MCMC) sampling based approach. Necessary theory involved in Bayesian modeling and computation is provided. In particular, we produce a new look at the multivariate logit model, the most popular model in this context. Further, we develop various efficient computational algorithms for this complex simulation problem. Finally, a real data example from the Indonesian Children’s Health Study is used to illustrate the proposed methodology.
Article
Let a random sample of size n be taken from a distribution having a density depending on a real parameter θ, and let θ have an absolutely continuous prior distribution with density π(θ). We give a rigorous proof that, under suitable regularity conditions, the posterior distribution of θ will, when n tends to infinity, be asymptotically normal with mean equal to the maximum‐likelihood estimator and variance equal to the reciprocal of the second derivative of the logarithm of the likelihood function evaluated at the maximum‐likelihood estimator, independently of the form of π(θ).
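For readers skimming this list, the limiting form described in the abstract is the familiar one; written with the usual sign convention (ℓ_n denoting the log-likelihood and θ̂_n its maximizer), it reads:

```latex
% Standard statement of the approximation described above (under regularity conditions):
\[
  \pi(\theta \mid x_1,\ldots,x_n) \;\approx\;
  N\!\Bigl(\hat\theta_n,\; \bigl[-\ell_n''(\hat\theta_n)\bigr]^{-1}\Bigr),
  \qquad
  \ell_n(\theta) \;=\; \sum_{i=1}^{n} \log f(x_i \mid \theta).
\]
```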
Article
Summary: This paper is concerned with the non-parametric estimation of a distribution function F, when the data are incomplete due to grouping, censoring and/or truncation. Using the idea of self-consistency, a simple algorithm is constructed and shown to converge monotonically to yield a maximum likelihood estimate of F. An application to hypothesis testing is indicated.
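For the special case of right-censored data only (no grouping or truncation), the self-consistency idea mentioned in the summary reduces to a short fixed-point iteration whose limit coincides with the Kaplan-Meier estimate. The sketch below handles just that simplified case and is not Turnbull's general algorithm; the function name and the tolerances are arbitrary.

```python
import numpy as np

def self_consistent_cdf(times, event, tol=1e-10, max_iter=1000):
    """Self-consistency iteration for right-censored data (simplified case).

    times : observed times (event or censoring times).
    event : 1 if an actual event was observed, 0 if the observation is right-censored.
    Returns the distinct event times and the estimated CDF at those times.
    """
    times, event = np.asarray(times, float), np.asarray(event, int)
    n = len(times)
    grid = np.unique(times[event == 1])                 # the CDF jumps only at event times
    F = np.linspace(1.0 / len(grid), 1.0, len(grid))    # any increasing starting value

    def cdf_at(F, t):
        idx = np.searchsorted(grid, t, side="right") - 1
        return np.where(idx >= 0, F[np.clip(idx, 0, None)], 0.0)

    for _ in range(max_iter):
        Fc = cdf_at(F, times)                           # current CDF at each observation
        F_new = np.empty_like(F)
        for j, t in enumerate(grid):
            exact = np.sum((event == 1) & (times <= t))
            cens = (event == 0) & (times <= t)
            # A point censored at c <= t contributes P(true time <= t | true time > c).
            contrib = (F[j] - Fc[cens]) / np.maximum(1.0 - Fc[cens], 1e-12)
            F_new[j] = (exact + contrib.sum()) / n
        if np.max(np.abs(F_new - F)) < tol:
            return grid, F_new
        F = F_new
    return grid, F

# Toy data: exponential lifetimes subject to independent exponential censoring.
rng = np.random.default_rng(2)
t_true, c = rng.exponential(1.0, 50), rng.exponential(1.5, 50)
times, event = np.minimum(t_true, c), (t_true <= c).astype(int)
grid, F = self_consistent_cdf(times, event)
print(list(zip(grid.round(2)[:5], F.round(3)[:5])))
```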
Conference Paper
In this paper, we propose methods for the construction of a non-informative prior through the uniform distributions on approximating sieves. In parametric families satisfying regularity conditions, it is shown that Jeffreys’ prior is obtained. The case with nuisance parameters is also considered. In the infinite dimensional situation, we show that such a prior leads to consistent posterior.
Article
Kernel density estimation techniques are used to smooth simulated samples from importance sampling function approximations to posterior distributions, resulting in revised approximations that are mixtures of standard parametric forms, usually multivariate normal or T-distributions. Adaptive refinement of such mixture approximations involves repeating this process to home in successively on the posterior. In fairly low dimensional problems, this provides a general and automatic method of approximating posteriors by mixtures, so that marginal densities and other summaries may be easily computed. This is discussed and illustrated, with comment on variations and extensions suited to sequential Bayesian updating of Monte Carlo approximations, an area in which existing and alternative numerical methods are difficult to apply.
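A one-dimensional sketch of the smoothing step just described, assuming a normal kernel, a simple rule-of-thumb bandwidth, and shrinkage of the kernel locations toward the weighted mean so the mixture's spread stays close to that of the weighted sample. The target, the proposal, and all tuning constants below are invented for illustration; an adaptive scheme would feed the fitted mixture back in as the next importance function.

```python
import numpy as np
from scipy import stats

def mixture_refine(log_target, proposal, n=2000, seed=3):
    """Smooth weighted importance-sampling draws into a normal mixture (1-d sketch)."""
    rng = np.random.default_rng(seed)
    draws = proposal.rvs(size=n, random_state=rng)
    logw = log_target(draws) - proposal.logpdf(draws)      # unnormalized importance weights
    w = np.exp(logw - logw.max())
    w /= w.sum()

    mean = np.sum(w * draws)
    var = np.sum(w * (draws - mean) ** 2)
    h = 1.06 * n ** (-0.2)                                 # rule-of-thumb bandwidth factor
    a = np.sqrt(1.0 - h ** 2)                              # shrink locations toward the mean
    locs = a * draws + (1.0 - a) * mean                    # component means
    scale = h * np.sqrt(var)                               # common component std deviation
    return locs, scale, w

def mixture_pdf(x, locs, scale, w):
    return np.sum(w[:, None] * stats.norm(locs[:, None], scale).pdf(x[None, :]), axis=0)

# Toy target: a skewed Gamma(3, 1) "posterior"; a crude wide normal as importance function.
log_target = stats.gamma(a=3.0, scale=1.0).logpdf
locs, scale, w = mixture_refine(log_target, stats.norm(3.0, 3.0))
print("mixture mean approx.:", round(float(np.sum(w * locs)), 2), "(target mean is 3.0)")
x0 = np.array([2.0])
print("mixture pdf at 2.0:", round(float(mixture_pdf(x0, locs, scale, w)), 3),
      "| true pdf:", round(float(stats.gamma(a=3.0, scale=1.0).pdf(2.0)), 3))
```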
Article
This article presents a nonparametric Bayesian estimator of a survival curve based on incomplete or arbitrarily right-censored data. This estimator, a Bayes estimator under a squared-error loss function assuming a Dirichlet process prior, is shown to be a Bayesian extension of the usual product limit (Kaplan-Meier) nonparametric estimator.
Article
Concepts of independence for nonnegative continuous random variables, X1, …, Xk, subject to the constraint ΣXi = 1 are developed. These concepts provide a means of modeling random vectors of proportions which is useful in analyzing certain kinds of data; and which may be of interest in quantifying prior opinions about multinomial parameters. A generalization of the Dirichlet distribution is given, and its relation to the Dirichlet is simply indicated by means of the concepts.The concepts are used to obtain conclusions of biological interest for data on bone composition in rats and scute growth in turtles.
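The kind of generalized Dirichlet vector described here can be simulated by a stick-breaking construction from independent beta variables: break off a Beta(a_1, b_1) fraction of the total, then a Beta(a_2, b_2) fraction of what remains, and so on, with the last coordinate taking the leftover mass. The shape parameters in the example are arbitrary; with suitably matched shapes the construction reduces to an ordinary Dirichlet vector.

```python
import numpy as np

def generalized_dirichlet(a, b, size=1, seed=0):
    """Stick-breaking draws of a random proportion vector (X_1, ..., X_k) summing to 1.

    a, b : length k-1 shape parameters; Z_i ~ Beta(a_i, b_i) independently,
           X_i = Z_i * prod_{j<i} (1 - Z_j), and X_k takes whatever mass remains.
    """
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    z = rng.beta(a, b, size=(size, len(a)))        # independent "breaks"
    remaining = np.cumprod(1.0 - z, axis=1)        # mass left after each break
    x = np.empty((size, len(a) + 1))
    x[:, 0] = z[:, 0]
    x[:, 1:-1] = z[:, 1:] * remaining[:, :-1]
    x[:, -1] = remaining[:, -1]
    return x

# Example: three proportions built from two independent beta breaks (arbitrary shapes).
samples = generalized_dirichlet(a=[2.0, 3.0], b=[5.0, 4.0], size=5)
print(samples.round(3))
print(samples.sum(axis=1))    # every row sums to 1
```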