Bayesian Nonparametrics

Springer Series in Statistics
Advisors:
P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg,
I. Olkin, N. Wermuth, S. Zeger
J.K. Ghosh R.V. Ramamoorthi
Bayesian Nonparametrics
With 49 Illustrations
J.K. Ghosh
Statistics-Mathematics Division
Indian Statistical Institute
203 Barrackpore Trunk Road
Kolkata 70035
India

R.V. Ramamoorthi
Statistics and Probability
Michigan State University
A431 Wells Hall
East Lansing, MI 48824
USA
Library of Congress Cataloging-in-Publication Data
Ghosh, J.K.
Bayesian nonparametrics / J.K. Ghosh, R.V. Ramamoorthi.
p. cm. — (Springer series in statistics)
Includes bibliographical references and index.
ISBN 0-387-95537-2 (alk. paper)
1. Bayesian statistical decision theory. 2. Nonparametric statistics. I. Ramamoorthi, R.V.
II. Title. III. Series.
QA279.5 .G48 2002
519.542—dc21 2002026665
ISBN 0-387-95537-2 Printed on acid-free paper.
©2003 Springer-Verlag New York, Inc.
All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York,
NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use
in connection with any form of information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or not
they are subject to proprietary rights.
Printed in the United States of America.
987654321 SPIN 10884896
Typesetting: Pages created by the authors using a Springer TEX macro package.
www.springer-ny.com
Springer-Verlag New York Berlin Heidelberg
A member of BertelsmannSpringer Science+Business Media GmbH
to our wives
Ira and Deepa
Preface
This book has grown out of several courses that we have given over the years at
Purdue University, Michigan State University and the Indian Statistical Institute on
Bayesian nonparametrics and Bayesian asymptotics. These topics seemed sufficiently
rich and useful that a book-length treatment seemed desirable.
Through the writing of this book we have received support from many people
and we would like to gratefully acknowledge them. Our early interest in the topic
came from discussions with Jim Berger, Persi Diaconis and Larry Wasserman. We
have received encouragement in our effort from Mike Lavine, Steve McEachern, Susie
Bayarri, Mary Ellen Bock, J. Sethuraman and Shanti Gupta, who alas is no longer
with us.
We have enjoyed many years of collaboration with Subhashis Ghosal, and much of
our joint work finds a place in this book. He also looked over an earlier version
of the manuscript and gave very useful comments. The book also includes joint work
with Jyotirmoy Dey, Roy Erickson, Liliana Dragichi, Charles Messan, Tapas Samanta
and K.R. Srikanth. They have helped us with the proofs, as have others. In particular,
Tapas Samanta played an invaluable role in helping us communicate electronically
and Charles Messan with computations.
Brendan Murphy, then a graduate student at Yale, gave us very useful feedback
on an earlier version of Chapter 1. We also benefited from many suggestions and
criticisms from Jim Hannan on the same chapter. We would like to thank Nils Hjort
both for his interest in the book and for his comments.
Dipak Dey made Sethuraman’s unpublished notes available to us and these notes
helped us considerably with Chapter 3.
When we first thought of writing a book, it seemed that we would be able to cover
most, if not all, of what was known in Bayesian nonparametrics. However the last few
years have seen an explosion of new work and our goals have turned more modest.
We view this book as an introduction to the theoretical aspects of the topic at the
graduate level. There is no coverage of the important aspect of computations but
given the interest in this area we expect that a book on computations will emerge
before long.
Our appreciation goes to Vince Melfi for his advice in matters related to LaTeX. Despite
his help, our limitations with LaTeX and our typing skills will be apparent, and we seek
the readers' indulgence.
Contents
Introduction: Why Bayesian Nonparametrics—An Overview and Summary

1 Preliminaries and the Finite Dimensional Case
1.1 Introduction
1.2 Metric Spaces
1.2.1 Preliminaries
1.2.2 Weak Convergence
1.3 Posterior Distribution and Consistency
1.3.1 Preliminaries
1.3.2 Posterior Consistency and Posterior Robustness
1.3.3 Doob's Theorem
1.3.4 Wald-Type Conditions
1.4 Asymptotic Normality of MLE and Bernstein–von Mises Theorem
1.5 Ibragimov and Hasminskii Conditions
1.6 Nonsubjective Priors
1.6.1 Fully Specified
1.6.2 Discussion
1.7 Conjugate and Hierarchical Priors
1.8 Exchangeability, De Finetti's Theorem, Exponential Families

2 M(X) and Priors on M(X)
2.1 Introduction
2.2 The Space M(X)
2.3 (Prior) Probability Measures on M(X)
2.3.1 X Finite
2.3.2 X = R
2.3.3 Tail Free Priors
2.4 Tail Free Priors and 0-1 Laws
2.5 Space of Probability Measures on M(R)
2.6 De Finetti's Theorem

3 Dirichlet and Polya Tree Process
3.1 Dirichlet and Polya Tree Process
3.1.1 Finite Dimensional Dirichlet Distribution
3.1.2 Dirichlet Distribution via Polya Urn Scheme
3.2 Dirichlet Process on M(R)
3.2.1 Construction and Properties
3.2.2 The Sethuraman Construction
3.2.3 Support of D_α
3.2.4 Convergence Properties of D_α
3.2.5 Elicitation and Some Applications
3.2.6 Mutual Singularity of Dirichlet Priors
3.2.7 Mixtures of Dirichlet Process
3.3 Polya Tree Process
3.3.1 The Finite Case
3.3.2 X = R

4 Consistency Theorems
4.1 Introduction
4.2 Preliminaries
4.3 Finite and Tail Free Case
4.4 Posterior Consistency on Densities
4.4.1 Schwartz Theorem
4.4.2 L1-Consistency
4.5 Consistency via LeCam's Inequality

5 Density Estimation
5.1 Introduction
5.2 Polya Tree Priors
5.3 Mixtures of Kernels
5.4 Hierarchical Mixtures
5.5 Random Histograms
5.5.1 Weak Consistency
5.5.2 L1-Consistency
5.6 Mixtures of Normal Kernel
5.6.1 Dirichlet Mixtures: Weak Consistency
5.6.2 Dirichlet Mixtures: L1-Consistency
5.6.3 Extensions
5.7 Gaussian Process Priors

6 Inference for Location Parameter
6.1 Introduction
6.2 The Diaconis-Freedman Example
6.3 Consistency of the Posterior
6.4 Polya Tree Priors

7 Regression Problems
7.1 Introduction
7.2 Schwartz Theorem
7.3 Exponentially Consistent Tests
7.4 Prior Positivity of Neighborhoods
7.5 Polya Tree Priors
7.6 Dirichlet Mixture of Normals
7.7 Binary Response Regression with Unknown Link
7.8 Stochastic Regressor
7.9 Simulations

8 Uniform Distribution on Infinite-Dimensional Spaces
8.1 Introduction
8.2 Towards a Uniform Distribution
8.2.1 The Jeffreys Prior
8.2.2 Uniform Distribution via Sieves and Packing Numbers
8.3 Technical Preliminaries
8.4 The Jeffreys Prior Revisited
8.5 Posterior Consistency for Noninformative Priors for Infinite-Dimensional Problems
8.6 Convergence of Posterior at Optimal Rate

9 Survival Analysis—Dirichlet Priors
9.1 Introduction
9.2 Dirichlet Prior
9.3 Cumulative Hazard Function, Identifiability
9.4 Priors via Distributions of (Z, δ)
9.5 Interval Censored Data

10 Neutral to the Right Priors
10.1 Introduction
10.2 Neutral to the Right Priors
10.3 Independent Increment Processes
10.4 Basic Properties
10.5 Beta Processes
10.5.1 Definition and Construction
10.5.2 Properties
10.6 Posterior Consistency

11 Exercises

References

Index
Introduction: Why Bayesian Nonparametrics—An Overview and Summary
Bayesians believe that all inference and more is Bayesian territory. So it is natural that
a Bayesian should explore nonparametrics and other infinite-dimensional problems.
However, putting a prior, which is always a delicate and difficult exercise in Bayesian
analysis, poses special conceptual, mathematical, and practical difficulties in infinite-
dimensional problems. Can one really have a subjective prior based on knowledge and
belief, in an infinite-dimensional space? Even if one settles for a largely non-subjective
prior, it is mathematically difficult to construct prior distributions on such sets as the
space of all distribution functions or the space of all probability density functions
and ensure that they have large support, which is a minimum requirement because
a largely nonsubjective prior should not put too much mass on a small set. Finally,
there are formidable practical difficulties in the calculation of the posterior, which is
the single most important object in the output of any Bayesian analysis.
Nonetheless, a major breakthrough came with Ferguson’s [61] paper on Dirichlet
process priors. The hyperparameters α(R) and α(·) of these priors are easy to elicit, it
is easy to ensure a large support, and the posterior is analytically tractable. More flex-
ibility was added by forming mixtures of Dirichlet processes, introduced by Antoniak
[4].
Mixtures of Dirichlet have been very popular in Bayesian nonparametrics, espe-
cially in analyzing right censored survival data. In these problems one can combine
analytical work with Markov Chain Monte Carlo (MCMC) to calculate and display
various posterior quantities in real time. By choosing α(·) equal to the exponential
distribution and by tuning the parameter α(R), one can make the analysis close to
classical analysis based on a parametric exponential or close to classical nonparamet-
rics. However, the whole range of α(R) offers a whole continuum of options that are
not available in classical statistics, where typically one either does a model-based
parametric analysis or uses fully nonparametric methods. An interesting example in
survival analysis is presented by Doss [53, 54]. Huber’s pioneering work in classical
statistics on a robust via media between these two extremes has been too technically
demanding to yield a flexible set of methods that pass continuously from one extreme
to the other. These ideas are discussed further in Chapter 3 on Dirichlet priors.
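To make the role of α(R) concrete, here is a minimal numerical sketch (not part of the original text) that draws samples from the marginal of a Dirichlet process prior via the Polya urn scheme, with the base measure α(·) taken, as above, proportional to an Exp(1) distribution; the function names and the particular values of α(R) are illustrative assumptions. Small α(R) produces heavy ties, while large α(R) makes the draws behave almost like i.i.d. exponential observations, which is the continuum referred to above.

```python
import numpy as np

rng = np.random.default_rng(0)

def polya_urn_sample(n, alpha_mass, base_sampler):
    """Draw X_1, ..., X_n from the marginal of a Dirichlet process prior
    via the Polya urn scheme: a fresh value from the normalized base
    measure with probability alpha_mass / (alpha_mass + i), otherwise
    a uniformly chosen previous observation."""
    x = np.empty(n)
    for i in range(n):
        if rng.random() < alpha_mass / (alpha_mass + i):
            x[i] = base_sampler()          # fresh draw from alpha(.)/alpha(R)
        else:
            x[i] = x[rng.integers(i)]      # repeat an earlier value
    return x

base = lambda: rng.exponential(1.0)        # base measure proportional to Exp(1)

for mass in (0.1, 10.0, 1000.0):
    x = polya_urn_sample(200, mass, base)
    print(f"alpha(R)={mass:7.1f}: {len(np.unique(x)):3d} distinct values, mean={x.mean():.2f}")
```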
Similarly one can analyze generalized linear models with a nonparametric Bayesian
choice of link functions. Bayesian nonparametrics is known to be a powerful, robust
alternative to regression analysis based on probit or logit models. References are
available in Chapter 7. There is some evidence of gaining an advantage in using
Bayesian nonparametrics to model random effects in linear models for longitudinal
data.
Sometimes things can go wrong if one uses a Dirichlet process prior inappropriately.
Such a prior cannot be used for density estimation without some smoothing, but
smoothing leads to formidable difficulties in calculating the posterior or the Bayes
estimate of the density function. Solution of this computational problem by MCMC
is fairly recent; see Chapter 5 for references and discussion. A major advantage of the
Bayesian method is that choice of the smoothing parameter h, which is still a hard
problem in classical density estimation, is relatively automatic. The Bayesian version
of varying the smoothing parameter over different parts of the data is also relatively
easy to implement. These are some of the major advances in Bayesian nonparametrics
in recent years.
A major theoretical advance has occurred recently in Bayesian semiparametrics.
One has the same advantages of flexibility here as discussed earlier, but unfortu-
nately this is also an area where the Dirichlet process is inappropriate without some
smoothing. Instead one can use Polya tree priors that sit on densities and satisfy some
extra conditions. For details and references see Chapter 6.
A difficulty in Bayesian nonparametrics is that not much was known until recently
about the asymptotic behavior of the posterior and various forms of frequentist vali-
dation. One method of frequentist validation of Bayesian analysis is to see if one can
learn about the unknown true P_0 with vanishingly small error by examining where the
posterior puts most of its mass. This idea and the first result of this sort are due to
Laplace. A precise statement of this property leads to the notion of consistency of the
posterior at P0, due to Freedman [69]. In the case of finite-dimensional parameters,
the posterior is usually consistent, and the data wash away the prior. For an infinite-
dimensional parameter, this is an exception rather than the rule; see, for instance,
examples of Freedman [69] and his theorem: For a multinomial with infinitely many
classes, the set of (P_0, Π) for which the posterior for the prior Π is consistent at P_0 is
topologically small, i.e., of the first category. Freedman had also introduced the notion
of tail free priors for which there is posterior consistency at P_0. A striking example of
inconsistency was shown by Diaconis and Freedman [46] when a Dirichlet process is
used for estimating a location parameter. In his discussion of [46], Barron points out
that the use of a Dirichlet process prior in a location problem leads to a pathological
behavior of the posterior for the location parameter. It is clear that inconsistency is a
consequence of this pathology. Diaconis and Freedman [46] also suggested that such
examples would occur even if one uses a prior on densities, e.g., a Polya tree prior
sitting on densities.
Chapter 4 is devoted to general questions of consistency of the posterior and positive
results. Applications appear in many other chapters and in fact run through the whole
book. These results, as well as somewhat stronger results, like rates of convergence,
are fairly recent and due to many authors, including ourselves.
To sum up, Bayesian nonparametrics is sufficiently well developed to take care
of many problems. Computation of the posterior is numerically feasible for several
classes of priors. We now know a fair amount about the asymptotic behavior of posteriors
for different priors, enough to ensure consistency at plausible P_0's. Most important, Bayesian
nonparametrics provides more flexibility than classical nonparametrics and a more
robust analysis than both classical and Bayesian parametric inference. It deserves to
be an important part of the Bayesian paradigm.
This monograph provides a systematic, theoretical development of the subject. A
chapterwise summary follows:
1. After introducing some preliminaries, Chapter 1 discusses some fundamental
aspects of Bayesian analysis in the relatively simple context of finite dimensional
parameter space with dimension fixed for all sample sizes. Because this subject is
treated well in many standard textbooks, the focus is on aspects such as nonsubjective
priors, also called objective priors, posterior consistency and exchangeability. These
are topics that usually do not receive much coverage in textbooks but are important
for our monograph.
Because elicitation of subjective priors or quantification of expert knowledge is still
not easy, most priors used in practice, especially in nonparametrics, are nonsubjective.
We discuss the standard ways of generating such priors and how to modify them
when some subjective or expert judgment is available (Sections 1.6 and 1.7). We also briefly
discuss common criticisms of nonsubjective Bayesian analysis and answers to them (Section 1.6.2).
Posterior consistency is introduced, and the classical theorem of Doob is proved
with all details. Then, in the spirit of classical maximum likelihood theory, posterior
consistency is established under regularity conditions using the uniform strong law
of large numbers. Posterior consistency provides a frequentist validation that is es-
pecially important for inference on infinite- or high-dimensional parameters because
even with a massive amount of data, any inadequacy in the prior can still influence
the posterior a lot. Posterior normality (Section 1.4) is a sharpening of posterior
consistency that is related to Laplace approximation and plays an important role in
the construction of reference and probability matching priors. Convergence of poste-
rior distributions is usually studied under regularity conditions. A general approach
that also works for nonregular problems is presented in Section 1.5. Exchangeability
appears in the last section of Chapter 1.
In Chapter 2 we examine basic measure-theoretic questions that arise when we try
to check measurability of a set or function or put a prior on such a large space as
the set P of all probability measures on R. The Kolmogorov construction based on
consistent finite-dimensional distributions does not meet this requirement because the
Kolmogorov sigma-field is too small to ensure measurability of important subsets like
the set of all discrete distributions on R or the set of all P with a density with respect
to the Lebesgue measure. Questions of measurability and convergence are discussed
in Section 2.2.
An interesting fact proved here is that the set of discrete measures and the set of ab-
solutely continuous probability measures are measurable. The main results in the
chapter are the basic construction theorems 2.3.2 through 2.3.4. Tail free priors, in-
cluding the Dirichlet process prior, may be constructed this way. The most important
type of convergence, namely weak convergence, is discussed in detail in Section 2.5.
The main result is a characterization of tightness in the spirit of Sethuraman and
Tiwari (1982). Section 2.4 contains 0-1 laws for tail free priors as well as a theorem
due to Kraft that can be used to construct a tail free prior for densities.
De Finetti’s theorem appears in the last section.
The reader not interested in measure-theoretic issues may read this chapter quickly
to understand the main results and get a flavor of some of the proofs. A reader with
more measure-theoretic interest will gain a solid theoretical framework for handling
priors for nonparametric problems and will also be rewarded with several measure-
theoretic subtleties that are interesting.
The most important prior in Bayesian nonparametrics is the Dirichlet process prior,
which plays a central role here, much as the normal does in finite-dimensional problems. Most of
Chapter 3 is devoted to this prior. The last section is on Polya tree priors.
We introduce a Dirichlet prior (3.1) first in the case of a finite sample space X and
then for X = R to help develop intuition for the main results regarding the latter. The
Dirichlet prior D for X = R is usually called the Dirichlet process prior. Section 3.2
contains calculation and justification of a formula for the posterior and special properties.
It also contains Sethuraman’s clever and elegant construction, which applies to all X
and suggests how one can simulate from this prior. Other results of interest include a
characterization of support and convergence properties (Section 3.2) and the question
of singularity of two Dirichlet process priors with respect to each other. Part of the
reason why Dirichlet process priors have been so popular is the multitude of interesting
properties mentioned earlier, of which the most important are the ease of calculating the
posterior and the fact that the support is as rich as it should be for a prior for
nonparametric problems.
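The construction referred to can be sketched numerically. The following hedged example (an illustration under assumed choices, not the book's own code) draws a truncated random measure from D_α by Sethuraman's stick-breaking representation, with a standard normal base measure and a truncation level chosen only for convenience.

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking_draw(alpha_mass, base_sampler, trunc=500):
    """One (truncated) draw from D_alpha via Sethuraman's construction:
    weights w_k = V_k * prod_{j<k} (1 - V_j) with V_k ~ Beta(1, alpha(R)),
    atoms Y_k i.i.d. from the normalized base measure."""
    v = rng.beta(1.0, alpha_mass, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    atoms = np.array([base_sampler() for _ in range(trunc)])
    return atoms, w

atoms, w = stick_breaking_draw(5.0, lambda: rng.normal(0.0, 1.0))
print("total mass kept by truncation:", w.sum())             # close to 1
print("P((-inf, 0]) for this random P:", w[atoms <= 0].sum())
```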
A second and equally important reason for popularity is the flexibility, at least for
mixtures of Dirichlet, and the relative ease with which one can elicit the hyperparam-
eters. These issues are discussed in Section 3.2.7.
The last section extends most of this discussion to Polya tree priors which form
a much richer class. Though not as mathematically tractable as D, they are still
relatively easy to handle and one can use convenient, partly elicited hyperparameters.
As we have argued before, posterior consistency is a useful validation for a par-
ticular prior, especially in nonparametric problems. Chapter 4 deals with essentially
three approaches to posterior consistency for three kinds of problems, namely, purely
nonparametric problems of estimating a distribution function or its weakly contin-
uous functionals, semiparametrics, and density estimation. The Dirichlet and, more
generally, tail free priors have good consistency properties for the first class of prob-
lems. Posterior consistency for tail free priors is discussed in the first few pages of the
chapter.
In Bayesian semiparametrics, for example estimation of a location parameter (Chap-
ter 6) or the regression coefficient (Chapter 7), addition of Euclidean parameters de-
stroys the tail free property of common priors like Dirichlet process and Polya tree.
Indeed, the use of Dirichlet leads to a pathological posterior. Posterior consistency in
this case is based on a theorem of Schwartz for a prior on densities. The two crucial
conditions are that the true probability measure lie in the Kullback-Leibler support of
the prior and that there exist uniformly exponentially consistent tests for H_0 : f = f_0
versus H_1 : f ∈ V^c, where V is a neighborhood whose posterior probability is being
claimed to converge to one. This is presented in Section 4.2.
The Schwartz theorem is well suited for semiparametrics but not for density es-
timation because the second condition in the theorem does not hold for V equal
to an L_1-neighborhood of f_0. Barron (unpublished) has suggested a weakening of
one of these conditions, suitably compensated by a condition on the prior. His con-
ditions are necessary and sufficient for a certain form of exponential convergence of
the posterior probability of Vto one. Ghosal, Ghosh and Ramamoorthi (1999) make
use of this theorem and some ideas of Barron, Schervish and Wasserman (1999) to
modify Schwartz’s result to make it suitable for showing posterior consistency with
L1-neighborhoods for a prior sitting on densities. All these results appear in Section
4.2.
Finally, Section 4.3 is devoted to another approach based on an inequality of
LeCam, which bypasses the verification of the first condition of Schwartz.
Applications of these results are made in Chapters 5 through 8. Somewhat different
but direct calculations leading to posterior consistency appear in Chapters 9 and 10.
Chapter 5 focuses on three kinds of priors for density estimation: Dirichlet mix-
tures of uniform, Dirichlet mixtures of normal, and Gaussian process priors. Dirichlet
mixtures of normal are the most popular and the most studied. The Gaussian pro-
cess priors seem very promising but have not been studied well. Dirichlet mixtures of
uniform are essentially Bayesian histograms and have a relatively simple theory.
The chapter begins with fairly general construction of priors on densities in sections
5.2 and 5.3 and then specializes to Bayesian histograms and their consistency in
Sections 5.4, 5.4.1, and 5.4.2. Dirichlet mixtures of normals are studied in Sections
5.6 and 5.7. The L1-consistency of the posterior applies to the prior of Escobar and
West in [168]. The final section contains an introduction to what is known about
Gaussian process priors.
An interesting issue that emerges from this rather technical chapter is that checking the
Kullback-Leibler support condition is especially hard for densities with R as support,
whereas densities with bounded support are much easier to handle. A second source of
technical difficulty is the need for efficient calculation of packing or covering numbers,
also called Kolmogorov’s metric entropy. These numbers play a basic role in Chapters
4, 5, and 8.
Chapter 6 begins with the famous Diaconis-Freedman (1986) example where a
Dirichlet process prior and a Euclidean location parameter lead to posterior inconsis-
tency. Barron (1986) has pointed out that there is a pathology in this case which is
even worse than inconsistency. We argue, as suggested in Chapter 4, that the main
problem leading to posterior inconsistency is that the tail free property does not hold.
It is good to have a density, but that does not seem to be enough. However, no
counterexample is produced.
The main contribution of the chapter is to suggest in Section 6.3 a strategy for
proving posterior consistency for the location parameter in a semiparametric setting
and to provide in Section 6.4 a class of Polya tree priors which satisfy the conditions
of Section 6.3 for a rich class of true densities. A major assumption needed in Section
6.3 holds only for densities with R as support. Later in the section we show how to
extend these results to densities with bounded support. Whereas in density estimation
bounded support helps, the converse seems to be true when one has to estimate a
location parameter.
The discussion of Bayesian semiparametrics is continued in Chapter 7. We assume
a standard regression model
Y = α + βx + ε
with the error ε having a nonparametric density f. The main object is to estimate
the regression coefficient β, but one may also wish to estimate the intercept α as well as
the true density of ε. The classical counterpart of this is Bickel [19].
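As a small illustration of the setup (an assumed example, not taken from the text), the following sketch generates data from Y = α + βx + ε with a bimodal, mean-zero error density standing in for the unknown f; the particular distributions and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_regression(n, alpha=1.0, beta=2.0):
    """Data from Y = alpha + beta * x + eps, where eps has an unknown
    (here: bimodal mixture) density f, chosen to have mean zero so the
    intercept stays identifiable."""
    x = rng.uniform(-1.0, 1.0, size=n)
    eps = np.where(rng.random(n) < 0.5,
                   rng.normal(-1.0, 0.3, size=n),
                   rng.normal(+1.0, 0.3, size=n))    # mean-zero, non-normal error
    return x, alpha + beta * x + eps

x, y = simulate_regression(500)
# least squares is still consistent for beta here, but a Bayesian
# semiparametric analysis also recovers the error density f
beta_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
print("least-squares slope:", round(beta_hat, 3))
```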
Because Y’s are no longer i.i.d, the Schwartz theorem of Chapter 6 does not apply.
In Section 7.2 - we prove a generalization that is valid for n independent but not
necessarily identically distributed random variables.
The theorem needs two conditions which are exact analogues of the two conditions
in Schwartz’s theorem and one additional condition on the second moment of a log
likelihood ratio. Verification of these conditions is discussed in Section 7.4.
In Section 7.3 we discuss sufficient conditions for the existence of uniformly consis-
tent tests for β alone or for (α, β) or (α, β, f).
Finally, in Sections 7.5 and 7.6 we verify the remaining two conditions for Polya tree priors
and Dirichlet mixtures of normals. Verification of these conditions requires methods that are
substantially different from those in Chapter 5.
Chapter 8 deals with three different but related topics, namely, three methods of
construction of nonsubjective priors in infinite-dimensional problems involving densi-
ties, consistency proofs for such priors using LeCam's inequality, and rates of convergence
for such and other priors. They are discussed in Sections 8.2, 8.5, and 8.6, respectively.
In several examples it is shown that the rates of convergence are the best possible.
However, for most commonly used priors getting rates of convergence is still a very
hard open problem.
Chapters 9 and 10 deal with right censored data. Here, the object of interest is the
distribution of a positive random variable X, viewed as a survival time. What we have
are observations of Z = X ∧ Y, ∆ = I(X ≤ Y), where Y is a censoring random
variable, independent of X.
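The observation scheme is easy to simulate; the following sketch (with arbitrary exponential survival and censoring distributions, chosen only for illustration) produces right-censored pairs (Z, ∆).

```python
import numpy as np

rng = np.random.default_rng(3)

def right_censor(n, survival_sampler, censor_sampler):
    """Return (Z, Delta) with Z = min(X, Y) and Delta = 1{X <= Y},
    where X is the survival time and Y an independent censoring time."""
    x = survival_sampler(n)
    y = censor_sampler(n)
    return np.minimum(x, y), (x <= y).astype(int)

z, delta = right_censor(1000,
                        lambda n: rng.exponential(2.0, n),   # survival times
                        lambda n: rng.exponential(3.0, n))   # censoring times
print("observed events:", delta.sum(), "of", len(delta))
```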
Chapter 9 begins with a model studied by Susarla and Van Ryzin [155] where the
distribution of X is given a Dirichlet process prior. We give a representation of the
posterior and establish its consistency. Section 2 is a quick review of the notion of the
cumulative hazard function and identifiability of the distribution of X from that of
(Z, ∆). This is then used in the next section, where we start with a Dirichlet prior
for the distribution of (Z, ∆) and use the identifiability result to transfer it to a prior
for the distribution of X. We expect that this method will be useful in constructing
priors for other kinds of censored data. Section 9.4 is a preliminary study of Dirichlet
priors for interval censored data. We show that, unlike the right censored case, letting
α(R) → 0 does not give the nonparametric maximum likelihood estimate.
Chapter 10 deals with neutral to the right priors. These priors were introduced by
Doksum in 1974 [48] and, after some initial work by Ferguson and Phadia [64], remained
dormant. There has been renewed interest in these priors since the introduction of
Beta processes by Hjort [100]. Neutral to the right priors, via the cumulative hazard
function, give rise to independent increment processes, which in turn are described
by their Lévy representations. In Section 10.1, after giving the definition and basic
properties of neutral to the right priors, we move on to Section 10.2, where we briefly review
the connection to independent increment processes and Lévy representations. Section
10.3 describes some properties of the prior in terms of the Lévy measure, and Section
10.4 is devoted to Beta processes. The remaining part of the chapter is devoted to
posterior consistency and is partly driven by a surprising example of inconsistency
due to Kim and Lee [114].
Chapter 11 contains some exercises. These were not systematically developed. How-
ever, we have included them in the hope that going through them will give the reader some
additional insight into the material.
Most work on Bayesian nonparametrics concentrates on estimation. This mono-
graph is no exception. However, there is interesting new work on Bayes factors and
their consistency [13], [37]. Even in the context of estimation with censored data,
not much has been done beyond the independent right censored model.
There certainly is a lot more to be done.
1 Preliminaries and the Finite Dimensional Case
1.1 Introduction
The basic Bayesian model consists of a parameter θ and a prior distribution Π for θ
that reflects the investigator's belief regarding θ. This prior is updated by observing
X_1, X_2, ..., X_n, which are modeled as i.i.d. P_θ given θ. The updating mechanism is
Bayes theorem, which results in changing Π to the posterior Π(·|X_1, X_2, ..., X_n).
The posterior reflects the investigator's belief as revised in the light of the data
X_1, X_2, ..., X_n. One may also report the predictive distribution of the future ob-
servations or summary measures like the posterior mean or variance. If there is a
decision problem with a specified loss function, one can choose the decision that min-
imizes the expected loss, with the associated loss calculated under the posterior. This
decision is the Bayes solution, or the Bayes rule. Ideally, a prior should be chosen
subjectively to express personal or expert knowledge and belief. Such evaluations and
quantifications are not easy, especially in high- or infinite-dimensional problems. In
practice, mathematically tractable priors, for example, conjugate priors, are often
used as convenient and partly nonsubjective models of knowledge and belief. Certain
aspects of these priors are chosen subjectively.
Finally, there are completely nonsubjective priors, the choice of which also leads to
useful posteriors. For the finite-dimensional case a brief account appears in Section
1.6. For a moderate amount of data, i.e., for a moderate n, the effect of the prior on the
posterior is often negligible. In such cases the posterior arising from a nonsubjective
prior may be considered a good approximation for the posterior that one would have
gotten from a subjective prior.
The posterior, like the prior, is a probability measure on the parameter space Θ,
except that it depends on X_1, X_2, ..., X_n, and the study of the posterior as n → ∞ is
naturally connected to the theory of convergence of probability measures. In Section
1.2.1, we present a brief survey of weak convergence of probability measures as well
as relations between various metrics and divergence measures.
A recurring theme throughout this monograph is posterior consistency, which helps
validate Bayesian analysis. Section 1.3 contains a formalization and brief discussion of
posterior consistency for a separable metric space Θ. In Sections 1.3 and 1.4 we study
in some detail the case when Θ is finite-dimensional and θ ↦ P_θ is smooth. This is the
framework of conventional parametric theory. Most of the results and asymptotics are
classical, but some are relatively new. While the main emphasis of this monograph is
on the nonparametric, and hence infinite-dimensional, situation, we hope that Sections
1.3 and 1.4 will serve to clarify the points of contact and points of difference with the
finite-dimensional case.
1.2 Metric Spaces
1.2.1 Preliminaries
Let (S, ρ) be a metric space, so that ρ satisfies (i) ρ(s_1, s_2) = ρ(s_2, s_1); (ii) ρ(s_1, s_2) ≥ 0
and ρ(s_1, s_2) = 0 iff s_1 = s_2; and (iii) ρ(s_1, s_3) ≤ ρ(s_1, s_2) + ρ(s_2, s_3).
Some basic properties of metric spaces are summarized here.
A sequence s_n in S converges to s iff ρ(s_n, s) → 0. The ball with center s_0 and
radius δ is the set B(s_0, δ) = {s : ρ(s_0, s) < δ}. A set U is open if every s in U has a
ball B(s, δ) contained in U. A set V is closed if its complement V^c is open. A useful
characterization of a closed set is: V is closed iff s_n ∈ V and s_n → s implies s ∈ V.
The intersection of closed sets is a closed set. For any set A ⊂ S, the smallest closed
set containing A, which is the intersection of all closed sets containing A, is called
the closure of A and will be denoted by Ā. Similarly A°, the union of all open sets
contained in A, is called the interior of A. The boundary ∂A of the set A is defined as
∂A = Ā ∩ (A^c)¯, the intersection of Ā with the closure of A^c.
A subset A of S is compact if every open cover of A has a finite subcover, i.e.,
if {U_α : α ∈ Λ} are open sets and A ⊂ ∪_{α∈Λ} U_α, then there exist α_1, α_2, ..., α_n
such that A ⊂ ∪_{i=1}^n U_{α_i}. A set A is compact iff every sequence in A has a convergent
subsequence with limit in A.
The metric space S is separable if it has a countable dense subset, i.e., if there is
a countable set S_0 whose closure is S. Most of the sets that we consider are separable. In
particular, if S is compact metric it is separable. Let S be separable and let S_0 be
a countable dense set. Consider the countable collection {B(s_i, 1/n) : s_i ∈ S_0; n =
1, 2, ...}. If U is an open set and if s ∈ U, then for some n > 1 there is a ball
B(s, 1/n) ⊂ U. Let s_i ∈ S_0 with ρ(s_i, s) < 1/2n. Then s is in B(s_i, 1/2n) and
B(s_i, 1/2n) ⊂ B(s, 1/n) ⊂ U. This shows that in a separable space every open set is
a countable union of balls. This fact fails to hold when S is not separable.
The Borel σ-algebra on S is the σ-algebra generated by all open sets and will
be denoted by B(S). The remarks in the last paragraph show that if S is separable
then B(S) is the same as the σ-algebra generated by open balls. In the absence of
separability these two σ-algebras will be different.
It will sometimes be necessary to check that a given class of sets C is the Borel
σ-algebra. A useful device for doing this is the π-λ theorem given below. See Pollard
[[140], Section 2.10] for a proof and some discussion.
Theorem 1.2.1. [π-λ theorem] A class D of subsets of S is a π-system if it is
closed under finite intersection, i.e., if A, B are in D then A ∩ B ∈ D. A class C of
subsets of S is a λ-system if
(i) S is in C;
(ii) A_n ∈ C and A_n ↑ A, then A ∈ C;
(iii) A, B ∈ C and A ⊂ B, then B − A ∈ C.
If C is a λ-system that contains a π-system D, then C contains the σ-algebra generated
by D.
Remark 1.2.1. An easy application of the π-λ theorem shows that if two probability
measures on S agree on all closed sets then they agree on B(S).
Remark 1.2.2. If two probability measures on R^k agree on all sets of the form
(a_1, b_1] × (a_2, b_2] × ... × (a_k, b_k], then they agree on all Borel sets in R^k.
Definition 1.2.1. Let P be a probability measure on (S, B(S)). The smallest closed
set of P-measure 1 is called the support, or more precisely the topological support, of
P.
When S is separable the support of P always exists. To see this, let 𝒰_0 = {U :
U open, P(U) = 0}, and let U_0 = ∪_{U∈𝒰_0} U; then U_0 is open. Because U_0 is a countable union of
balls in 𝒰_0, P(U_0) = 0. It follows easily that F = U_0^c is the support of P. The support
can be equivalently defined as a closed set F with P(F) = 1 and such that if s ∈ F
then P(U) > 0 for every neighborhood U of s. If S is not separable then the support
of P may not exist.
1.2.2 Weak Convergence
We need elements of the theory of weak convergence of probability measures. The
details of the material discussed below can be found, for instance, in Billingsley [[21],
Chapter 1].
Let S be a metric space and B(S) be the Borel σ-algebra on S. Denote by C(S)
the set of all bounded continuous functions on S. Note that every function in C(S) is
B(S) measurable.
Definition 1.2.2. A sequence {P_n} of probability measures on S is said to converge
weakly to a probability measure P, written as {P_n} → P weakly, if
∫ f dP_n → ∫ f dP for all f ∈ C(S)
The following "Portmanteau" theorem gives most of what we need.
Theorem 1.2.2. The following are equivalent:
1. {P_n} → P weakly;
2. ∫ f dP_n → ∫ f dP for all f bounded and uniformly continuous;
3. lim sup P_n(F) ≤ P(F) for all F closed;
4. lim inf P_n(U) ≥ P(U) for all U open;
5. lim P_n(B) = P(B) for all B ∈ B(S) with P(∂B) = 0.
In applications, the P_n's are often distributions on S induced by random variables X_n
taking values in S. If S is not separable, then P_n is defined on a σ-algebra much smaller
than B(S). In this case, to avoid measurability problems, inner and outer probabilities
have to be used. For a version of Theorem 1.2.2 in this more general setting see van
der Vaart and Wellner [[161], 1.3.4]. The other useful result is Prohorov's theorem.
Theorem 1.2.3. [Prohorov] If S is a complete separable metric space, then every
subsequence of P_n has a weakly convergent subsequence iff {P_n} is tight, i.e., for every
ε > 0, there exists a compact set K_ε with P_n(K_ε) > 1 − ε for all n.
When S is a complete separable metric space, the space M(S), the space of probability
measures on S, is also metrizable, complete, and separable under weak convergence.
In this case if ∫ f dP_n → ∫ f dP for f in a countable dense set in C(S), then P_n → P
weakly. We note that sets in M(S) of the form
{Q : |∫ f_i dP − ∫ f_i dQ| < ε, i = 1, 2, ..., k; f_i ∈ C(S)}
constitute a base for the neighborhoods at P, i.e., any open set is a union of a family
of sets of the form displayed above. The space M(S) and the space of probability
measures on M(S) are of considerable interest to us. We will return to a detailed
analysis of these spaces later; here are a few preliminary facts used later in this
chapter.
The space M(S) has many natural metrics.
Weak convergence. As discussed earlier, M(S) is metrizable, i.e., there is a metric ρ
on M(S) such that ρ(P_n, P) → 0 iff P_n → P weakly [see Section 6 in Billingsley
[21]]. The exact form of this metric is not of interest to us.
Total variation or L_1. The total variation distance between P and Q is given by
‖P − Q‖_1 = 2 sup_B |P(B) − Q(B)|. If p and q are densities of P and Q with
respect to some measure µ, then ‖P − Q‖_1 is the L_1-distance ∫ |p − q| dµ between
p and q. Sometimes, when there can be no confusion with other metrics, we will
omit the subscript 1 and denote the L_1 distance by just ‖P − Q‖ or in terms of
densities as ‖p − q‖.
Hellinger metric. If p and q are densities of P and Q with respect to some σ-finite
measure µ, the Hellinger distance between P and Q is defined by H(P, Q) =
[∫ (√p − √q)² dµ]^{1/2}. This distance is convenient in the i.i.d. context because
A(P^n, Q^n) = A^n(P, Q), where A(P, Q) = ∫ √(pq) dµ is called the affinity
between P and Q, and
H²(P^n, Q^n) = 2(1 − (A(P, Q))^n)
The Hellinger metric is equivalent to the L_1-metric. The next proposition shows
this.
Proposition 1.2.1.
‖P − Q‖_1² ≤ H²(P, Q) · 2(1 + A(P, Q)) ≤ ‖P − Q‖_1 · 2(1 + A(P, Q))
Proof. Let µ dominate P and Q and let p, q be densities of P and Q with respect to
µ. Then
(∫ |p − q| dµ)² = (∫ |√p − √q| |√p + √q| dµ)² ≤ ∫ (√p − √q)² dµ · ∫ (√p + √q)² dµ,
which is the first inequality, since ∫ (√p + √q)² dµ = 2(1 + A(P, Q)). Also H²(P, Q) ≤ ‖P − Q‖_1 because
(√p − √q)² ≤ p + q − 2 min(p, q) = |p − q|
As a corollary to the above proposition, we have the following.
Corollary 1.2.1. Replacing A(P, Q) by its upper bound 1 gives
‖P − Q‖_1² ≤ 4H²(P, Q) ≤ 4‖P − Q‖_1
Writing H²(P, Q) = 2(1 − A(P, Q)) in the first inequality, a bit of algebra gives
A(P, Q) ≤ (1 − ‖P − Q‖_1²/4)^{1/2}
Note that none of the three quantities discussed, the L_1 metric, the Hellinger metric,
or the affinity A(P, Q), depends on the dominating measure µ. The same holds for the
Kullback-Leibler divergence (K-L divergence), which is considered next.
Kullback-Leibler divergence. The Kullback-Leibler divergence between two prob-
ability measures, though not a metric, has played a central role in the classical
theory of testing and estimation and will play an important role in the later
chapters of this text. Let P and Q be two probability measures and let p, q be
their densities with respect to some measure µ. Then
K(P, Q) = ∫ p log(p/q) dµ ≥ ∫ (1 − q/p) dP = 0
and K(P, Q) = 0 iff P = Q. Here is a useful refinement due to Hannan [92].
Proposition 1.2.2.
K(P, Q) ≥ ‖P − Q‖_1² / 4
Proof.
∫ p log(p/q) dµ = 2 ∫ (−log √(q/p)) p dµ ≥ 2 ∫ (1 − √(q/p)) p dµ = 2(1 − A(P, Q)) = H²(P, Q)
The corollary to the previous proposition yields the conclusion.
Kemperman [112] has shown that K(P, Q) ≥ ‖P − Q‖_1²/2 and that this inequality
is sharp.
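The metrics and divergences above, and the inequalities relating them, can be checked numerically on a finite sample space. The following sketch (an illustration with an arbitrary pair of discrete distributions, not from the text) computes ‖P − Q‖_1, H²(P, Q), A(P, Q), and K(P, Q) and verifies the bounds of Proposition 1.2.1, Proposition 1.2.2, and Kemperman's inequality in this example.

```python
import numpy as np

def distances(p, q):
    """L1 distance, squared Hellinger distance, affinity, and Kullback-Leibler
    divergence for two probability vectors p, q on a common finite space."""
    l1 = np.abs(p - q).sum()
    aff = np.sqrt(p * q).sum()
    hell2 = ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()      # equals 2 * (1 - aff)
    kl = (p * np.log(p / q)).sum()                       # assumes p, q > 0 everywhere
    return l1, hell2, aff, kl

# two discrete distributions with common support
p = np.array([0.5, 0.3, 0.15, 0.05])
q = np.array([0.25, 0.25, 0.25, 0.25])
l1, hell2, aff, kl = distances(p, q)

print("||P-Q||_1^2 <= 2(1+A) H^2 :", l1**2 <= 2 * (1 + aff) * hell2)
print("H^2 <= ||P-Q||_1          :", hell2 <= l1)
print("K(P,Q) >= H^2             :", kl >= hell2)
print("K(P,Q) >= ||P-Q||_1^2 / 2 :", kl >= l1**2 / 2)    # Kemperman's bound
```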
Much of our study involves the convergence of sequences of functions of the form
T_n(X_1, X_2, ..., X_n) : Ω → M(Θ), where Ω = (X^∞, A^∞) with a measure P_0^∞. The
different metrics on M(Θ) provide ways of formalizing the convergence of T_n to T.
Thus
(i) T_n → T weakly, almost surely P_0, if
P_0^∞{ω : T_n(ω) → T(ω) weakly} = 1;
(ii) T_n → T weakly, in P_0-probability, if
P_0^∞{ω : ρ(T_n(ω), T(ω)) > ε} → 0,
where ρ is a metric that generates weak convergence.
T_n → T in L_1, almost surely P_0 or in P_0-probability, can be defined similarly.
1.3 Posterior Distribution and Consistency
1.3.1 Preliminaries
We begin by formalizing the setup. Let Θ be the parameter space. We assume that
Θ is a complete separable metric space endowed with its Borel σ-algebra B(Θ). For
each θ ∈ Θ, P_θ is a probability measure on a measurable space (X, A) such that, for
each A ∈ A, θ ↦ P_θ(A) is B(Θ) measurable.
X_1, X_2, ... is a sequence of X-valued random variables that are, for each θ ∈ Θ,
independent and identically distributed as P_θ. It is convenient to think of X_1, X_2, ...
as the coordinate random variables defined on Ω = (X^∞, A^∞) and P_θ^∞ as the i.i.d.
product measure defined on Ω. We will denote by Ω_n the space (X^n, A^n) and by P_θ^n
the n-fold product of P_θ. When convenient we will also abbreviate X_1, X_2, ..., X_n by
X^n.
Suppose that Π is a prior, i.e., a probability measure on (Θ, B(Θ)). For each n, Π
and the P_θ's together define a joint distribution of θ and X^n, namely, the probability
measure λ_{n,Π} on Θ × Ω_n given by
λ_{n,Π}(B × A) = ∫_B P_θ^n(A) dΠ(θ)
The marginal distribution λ_n of X_1, X_2, ..., X_n is
λ_n(A) = λ_{n,Π}(Θ × A)
These notions also extend to the infinite sequence X_1, X_2, .... We denote by λ_Π
the joint distribution of θ, X_1, X_2, ... and by λ the marginal distribution on Ω.
Any version of the conditional distribution of θ given X_1, X_2, ..., X_n is called a
posterior distribution given X_1, X_2, ..., X_n. Formally, a function Π(·|·) : B(Θ) × Ω_n →
[0, 1] is called a posterior given X_1, X_2, ..., X_n if
(a) for each ω ∈ Ω_n, Π(·|ω) is a probability measure on B(Θ);
(b) for each B ∈ B(Θ), Π(B|·) is A^n measurable; and
(c) for each B ∈ B(Θ) and A ∈ A^n,
λ_{n,Π}(B × A) = ∫_A Π(B|ω) dλ_n(ω)
In the case that we consider, namely, when the underlying spaces are complete and
separable, a version of the posterior always exists [Dudley [58], 10.2]. By condition
(b), Π(·|ω) is a function of X_1, X_2, ..., X_n and hence we will write the posterior
conveniently as Π(·|X_1, X_2, ..., X_n) or as Π(·|X^n).
Typically, a candidate for the posterior can be guessed or computed heuristically
from the context. What is then required is to verify that it satisfies the three conditions
listed earlier. When the P_θ's are all dominated by a σ-finite measure µ, it is easy to
see that, if p_θ = dP_θ/dµ, then
Π(A|X^n) = ∫_A ∏_{i=1}^n p_θ(X_i) dΠ(θ) / ∫_Θ ∏_{i=1}^n p_θ(X_i) dΠ(θ)
Thus in the dominated case, ∏_{i=1}^n p_θ(X_i) / ∫ ∏_{i=1}^n p_θ(X_i) dΠ(θ) is a version of the den-
sity with respect to Π of Π(·|X^n).
In the last expression the posterior given X_1, X_2, ..., X_n is the same as that given a
permutation X_{π(1)}, X_{π(2)}, ..., X_{π(n)}. Said differently, the posterior depends only on the
empirical measure (1/n) Σ_{i=1}^n δ_{X_i}, where for any x, δ_x denotes the measure degenerate
at x. This property holds also in the undominated case. A simple sufficiency argument
shows that there is a version of the posterior given X_1, X_2, ..., X_n that is a function
of the empirical measure.
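In the dominated case the displayed formula can be evaluated directly when the prior is supported on a finite grid. The following sketch (assuming, for illustration only, a N(θ, 1) model and a uniform prior on a grid, neither of which comes from the text) computes the posterior by normalizing the product of the likelihood and the prior.

```python
import numpy as np

rng = np.random.default_rng(4)

def grid_posterior(x, theta_grid, prior_weights, loglik_fn):
    """Posterior Pi(. | X^n) on a finite grid of theta values, computed by
    normalizing prod_i p_theta(X_i) * Pi(theta); log-sums avoid underflow."""
    loglik = np.array([loglik_fn(x, t) for t in theta_grid])
    logpost = loglik + np.log(prior_weights)
    logpost -= logpost.max()
    post = np.exp(logpost)
    return post / post.sum()

# assumed example: X_i ~ N(theta, 1), uniform prior on a grid
normal_loglik = lambda x, t: -0.5 * np.sum((x - t) ** 2)   # up to a constant in theta
theta_grid = np.linspace(-3, 3, 121)
prior = np.full(theta_grid.size, 1.0 / theta_grid.size)

x = rng.normal(0.7, 1.0, size=100)                          # true theta_0 = 0.7
post = grid_posterior(x, theta_grid, prior, normal_loglik)
print("posterior mass within 0.2 of theta_0:",
      round(post[np.abs(theta_grid - 0.7) < 0.2].sum(), 3))
```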
Definition 1.3.1. For each n, let Π(·|X^n) be a posterior given X_1, X_2, ..., X_n.
The sequence {Π(·|X^n)} is said to be consistent at θ_0 if there is an Ω_0 ⊂ Ω with
P_{θ_0}^∞(Ω_0) = 1 such that if ω is in Ω_0, then for every neighborhood U of θ_0,
Π(U|X^n(ω)) → 1
Remark 1.3.1. When Θ is a metric space, the collection {θ : ρ(θ, θ_0) < 1/n}, n ≥ 1, forms a base
for the neighborhoods of θ_0, and hence one can allow the set of measure 1 to depend
on U. In other words, it is enough to show that for each neighborhood U of θ_0,
Π(U|X^n(ω)) → 1 a.e. P_{θ_0}^∞
Further, when Θ is a separable metric space it follows from the Portmanteau theo-
rem that consistency of the sequence {Π(·|X^n)} at θ_0 is equivalent to requiring that
{Π(·|X^n)} → δ_{θ_0} weakly, a.e. P_{θ_0}.
Thus the posterior is consistent at θ_0 if, with P_{θ_0} probability 1, as n gets large, the
posterior concentrates around θ_0.
Why should one require consistency at a particular θ_0? A Bayesian may think of
θ_0 as a plausible value and question what would happen if θ_0 were indeed the true
value and the sample size n increases. Ideally the posterior would learn from the data
and put more and more mass near θ_0. The definition of consistency captures this
requirement.
The idea goes back to Laplace, who had shown the following. If X_1, X_2, ..., X_n are
i.i.d. Bernoulli with P_θ(X = 1) = θ and π(θ) is a prior density that is continuous and
positive on (0, 1), then the posterior is consistent at all θ_0 in (0, 1). Von Mises [162]
calls this the second fundamental law of large numbers, the first being Bernoulli's
weak law of large numbers.
An elementary proof of Laplace's result for a beta prior may be of some interest.
Let the prior density with respect to Lebesgue measure on (0, 1) be
π(θ) = [Γ(α+β) / (Γ(α)Γ(β))] θ^{α−1} (1−θ)^{β−1}
Then the posterior density given X_1, X_2, ..., X_n is
[Γ(α+β+n) / (Γ(α+r)Γ(β+(n−r)))] θ^{α+r−1} (1−θ)^{β+(n−r)−1}
where r is the number of X_i's equal to 1. An easy calculation shows that the posterior
mean is
E(θ|X_1, X_2, ..., X_n) = [(α+β)/(α+β+n)] · [α/(α+β)] + [n/(α+β+n)] · (r/n),
which is a weighted combination of the consistent estimate r/n of the true value θ_0
and the prior mean α/(α+β). Because the weight of r/n goes to 1,
E(θ|X_1, X_2, ..., X_n) → θ_0 a.e. P_{θ_0}
A similar easy calculation shows that the posterior variance
Var(θ|X_1, X_2, ..., X_n) = (α+r)(β+(n−r)) / [(α+β+n)²(α+β+n+1)]
goes to 0 with probability 1 under θ_0. An application of Chebyshev's inequality com-
pletes the proof.
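Laplace's calculation is easy to reproduce numerically. The sketch below (with an arbitrary Beta(2, 3) prior and θ_0 = 0.6, chosen only for illustration) evaluates the posterior mean and variance formulas above and shows the posterior concentrating at θ_0 as n grows.

```python
import numpy as np

rng = np.random.default_rng(5)

alpha, beta, theta0 = 2.0, 3.0, 0.6          # Beta(alpha, beta) prior, true theta_0
x = rng.random(100000) < theta0              # Bernoulli(theta_0) data

for n in (10, 100, 1000, 100000):
    r = x[:n].sum()
    post_mean = (alpha + r) / (alpha + beta + n)
    post_var = ((alpha + r) * (beta + n - r)
                / ((alpha + beta + n) ** 2 * (alpha + beta + n + 1)))
    print(f"n={n:6d}  E(theta|data)={post_mean:.4f}  Var(theta|data)={post_var:.2e}")
```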
1.3.2 Posterior Consistency and Posterior Robustness
Posterior consistency is also connected with posterior robustness. A simple result is
presented next [84].
Theorem 1.3.1. Assume that the family {P_θ : θ ∈ Θ} is dominated by a σ-finite
measure µ and let p_θ denote the density of P_θ. Let θ_0 be an interior point of Θ
and π_1, π_2 be two prior densities with respect to a measure ν, which are positive and
continuous at θ_0. Let π_i(θ|X^n), i = 1, 2, denote the posterior densities of θ given X^n.
If Π_i(·|X^n), i = 1, 2, are both consistent at θ_0, then
lim_{n→∞} ∫ |π_1(θ|X^n) − π_2(θ|X^n)| dν(θ) = 0 a.s. P_{θ_0}
Proof. We will show that with P_{θ_0}^∞ probability 1,
∫_Θ π_2(θ|X^n) |1 − π_1(θ|X^n)/π_2(θ|X^n)| dν(θ) → 0
Fix δ > 0, η > 0, and ε > 0 and use the continuity at θ_0 to obtain a neighborhood
U of θ_0 such that for all θ ∈ U
|π_1(θ)/π_2(θ) − π_1(θ_0)/π_2(θ_0)| < δ
and |π_j(θ_0) − π_j(θ)| < δ for j = 1, 2.
By consistency there exists Ω_0, P_{θ_0}^∞(Ω_0) = 1, such that for ω ∈ Ω_0,
Π_j(U|X^n(ω)) = ∫_U ∏_{i=1}^n p_θ(X_i(ω)) π_j(θ) dν(θ) / ∫_Θ ∏_{i=1}^n p_θ(X_i(ω)) π_j(θ) dν(θ) → 1
Fix ω ∈ Ω_0 and choose n_0 such that, for n > n_0,
Π_j(U|X^n(ω)) ≥ 1 − η for j = 1, 2
Note that
π_1(θ|X^n)/π_2(θ|X^n) = [π_1(θ)/π_2(θ)] · [∫_Θ ∏_{i=1}^n p_θ(X_i) π_2(θ) dν(θ) / ∫_Θ ∏_{i=1}^n p_θ(X_i) π_1(θ) dν(θ)]
Hence for n > n_0 and θ ∈ U, after some easy manipulation, we have
(π_1(θ_0)/π_2(θ_0) − δ)(1 − η) [∫_U ∏_{i=1}^n p_θ(X_i(ω)) π_2(θ) dν(θ) / ∫_U ∏_{i=1}^n p_θ(X_i(ω)) π_1(θ) dν(θ)]
≤ π_1(θ|X^n(ω)) / π_2(θ|X^n(ω))
≤ (π_1(θ_0)/π_2(θ_0) + δ)(1 − η)^{-1} [∫_U ∏_{i=1}^n p_θ(X_i(ω)) π_2(θ) dν(θ) / ∫_U ∏_{i=1}^n p_θ(X_i(ω)) π_1(θ) dν(θ)]
and by the choice of U,
(π_j(θ_0) − δ) ∫_U ∏_{i=1}^n p_θ(X_i(ω)) dν(θ) ≤ ∫_U ∏_{i=1}^n p_θ(X_i(ω)) π_j(θ) dν(θ)
≤ (π_j(θ_0) + δ) ∫_U ∏_{i=1}^n p_θ(X_i(ω)) dν(θ)   (1.1)
Using (1.1) we have, again for θ ∈ U,
(π_1(θ_0)/π_2(θ_0) − δ)(1 − η) (π_2(θ_0) − δ)/(π_1(θ_0) + δ)
≤ π_1(θ|X^n(ω)) / π_2(θ|X^n(ω))
≤ (π_1(θ_0)/π_2(θ_0) + δ)(1 − η)^{-1} (π_2(θ_0) + δ)/(π_1(θ_0) − δ),
so that for δ, η small
|π_1(θ|X^n(ω)) / π_2(θ|X^n(ω)) − 1| < ε
Hence, for n > n_0,
∫ |π_1(θ|X^n(ω)) − π_2(θ|X^n(ω))| dν(θ) ≤ ∫_U π_2(θ|X^n(ω)) |1 − π_1(θ|X^n(ω))/π_2(θ|X^n(ω))| dν(θ) + 2η
≤ ε + 2η
This completes the proof.
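Theorem 1.3.1 can be illustrated in the Bernoulli case. In the sketch below (an assumed example, not from the text), two quite different prior densities, both positive and continuous at θ_0, are updated with the same data, and the L_1 distance between the two posterior densities, computed by numerical integration on a grid, shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(6)

theta0 = 0.6
x = rng.random(20000) < theta0
grid = np.linspace(1e-4, 1 - 1e-4, 4000)
dtheta = grid[1] - grid[0]

def posterior_density(prior_logpdf, r, n):
    """Posterior density on the grid for r successes in n Bernoulli trials,
    obtained by normalizing theta^r (1-theta)^(n-r) * prior(theta) numerically."""
    logpost = r * np.log(grid) + (n - r) * np.log(1 - grid) + prior_logpdf(grid)
    logpost -= logpost.max()
    dens = np.exp(logpost)
    return dens / (dens.sum() * dtheta)

flat_prior = lambda t: np.zeros_like(t)                      # Beta(1, 1)
skew_prior = lambda t: 4.0 * np.log(t) + np.log(1.0 - t)     # Beta(5, 2), up to a constant

for n in (10, 100, 1000, 20000):
    r = int(x[:n].sum())
    pi1 = posterior_density(flat_prior, r, n)
    pi2 = posterior_density(skew_prior, r, n)
    print(f"n={n:6d}  integral |pi_1 - pi_2| d(theta) ~= "
          f"{np.abs(pi1 - pi2).sum() * dtheta:.4f}")
```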
Another notion related to Theorem 1.3.1 is that of merging where, instead of the
posterior, one looks at the predictive distribution of X_{n+1}, X_{n+2}, ... given X_1, ..., X_n.
Here the attempt is to formalize the idea that two Bayesians starting with different
priors Π_1 and Π_2 would eventually agree in their prediction of the distribution of
future observations.
For a prior Π, if we define, for any measurable subset C of Ω,
λ_Π(C|X^n) = ∫_Θ P_θ^∞(C) dΠ(θ|X^n)
then λ_Π(·|X^n) is a version of the predictive distribution of X_{n+1}, X_{n+2}, ... given
X_1, X_2, ..., X_n. Note that given X^n, the predictive distribution is a probability mea-
sure on Ω = R^∞.
Let λ_{Π_1}(·|X^n) and λ_{Π_2}(·|X^n) be two predictive distributions, corresponding to
priors Π_1 and Π_2.
An early result in merging is due to Blackwell and Dubins [24]. They showed that
if Π_2 is absolutely continuous with respect to Π_1, then for θ in a set of Π_2 probability
1, the total variation distance between λ_{Π_1}(·|X^n) and λ_{Π_2}(·|X^n) goes to 0 almost
surely P_θ^∞.
The connection with consistency was observed by Diaconis and Freedman [46].
Towards this, say that the predictive distributions merge weakly with respect to P_{θ_0} if
there exists Ω_0 ⊂ Ω with P_{θ_0}^∞(Ω_0) = 1, such that for each ω ∈ Ω_0,
∫ φ dλ_{Π_1}(·|X^n(ω)) − ∫ φ dλ_{Π_2}(·|X^n(ω)) → 0
for all bounded continuous functions φ on Ω.
Proposition 1.3.1. Assume that θ ↦ P_θ is 1-1 and continuous with respect to
weak convergence. Also assume that there is a compact set K such that P_θ(K) = 1
for all θ.
If Π_1 and Π_2 are two priors such that the posteriors Π_1(·|X^n) and Π_2(·|X^n) are
consistent at θ_0, then the predictive distributions λ_{Π_1}(·|X^n) and λ_{Π_2}(·|X^n) merge
weakly with respect to P_{θ_0}.
Proof. Let G be the class of all functions on Ω that are finite linear combinations of
functions of the form
φ(ω) = ∏_{i=1}^k f_i(ω_i)
where f_1, f_2, ..., f_k are continuous functions on K. It is easy to see that if φ ∈ G then
θ ↦ ∫ φ(ω) dP_θ^∞(ω) is continuous. Further, by the Stone-Weierstrass theorem, G is
dense in the space of all continuous functions on K^∞.
From the definition of λ_{Π_1}(·|X^n) and λ_{Π_2}(·|X^n), if Ω_0 is the set where the posterior
converges to δ_{θ_0}, then for ω ∈ Ω_0 and φ ∈ G,
∫ φ dλ_{Π_i}(·|X^n(ω)) = ∫_Θ [∫ φ(ω′) dP_θ^∞(ω′)] dΠ_i(θ|X^n(ω))
The inside integral gives rise to a bounded continuous function of θ. Hence by weak
consistency at θ_0, for both i = 1, 2 the right-hand side converges to ∫ φ(ω′) dP_{θ_0}^∞(ω′).
This yields the conclusion.
Further connections between merging and posterior consistency are explored in Di-
aconis and Freedman [46].
Note a few technical remarks: According to the definition, posterior consistency is
a property that is specific to the fixed version Π(·|X^n). Measure theoretically, the
posterior is unique only up to λ_n null sets. So the posterior is uniquely defined up to
P_{θ_0}^n null sets if P_{θ_0}^n is dominated by λ_n. Without this condition it is easy to construct examples
of two versions {Π_1(·|X^n)} and {Π_2(·|X^n)} such that one is consistent and the other
is not. It is easy to show that if {P_θ : θ ∈ Θ} are all mutually absolutely continuous and
{Π_1(·|X^n)} and {Π_2(·|X^n)} are two versions of the posterior, then {Π_1(·|X^n)} is
consistent iff {Π_2(·|X^n)} is.
1.3.3 Doob’s Theorem
An early result on consistency is the following theorem of Doob [49].
Theorem 1.3.2. Suppose that Θ and X are both complete separable metric spaces
endowed with their respective Borel σ-algebras B(Θ) and A, and let θ ↦ P_θ be 1-1.
Let Π be a prior and {Π(·|X^n)} be a posterior. Then there exists a Θ_0 ⊂ Θ, with
Π(Θ_0) = 1, such that {Π(·|X^n)}_{n≥1} is consistent at every θ ∈ Θ_0.
Proof. The basic idea of the proof is simple. On the one hand, because for each θ
the empirical distribution converges a.s. P_θ^∞ to P_θ, given any sequence of x_i's we can
pinpoint the true θ. On the other hand, any version of the posterior distributions
Π(·|X^n), via the martingale convergence theorem, converges a.s. with respect to the
marginal λ_Π, to the posterior given the entire sequence. One then equates these two
versions to get the result. A formal proof of these observations needs subtle measure
theory.
As before, let Ω = X^N, let B be the product σ-algebra on Ω, and let λ_Π denote both the joint
distribution of θ and X_1, X_2, ... and the marginal distribution of X_1, X_2, .... Let C
be a subset of Θ; then by the martingale convergence theorem, as n → ∞,
Π(C|X_1, X_2, ..., X_n) → E(I_C|X_1, X_2, ...) = f a.e. λ_Π
We point out that the functions considered above are, formally, functions of two
variables (θ, ω). I_C is to be interpreted as I_{C×Ω} and f is to be thought of as f(θ, ω) =
f(ω), and so on.
We shall show that there exists a set Θ_0 with Π(Θ_0) = 1 such that
for θ ∈ Θ_0 ∩ C, f = 1 a.e. P_θ^∞   (1.2)
This would establish the theorem. To see this, take U = {U_1, U_2, ...}, a base for the
open sets of Θ. Take C = U_i in the above step and obtain the corresponding Θ_{0i} ⊂ Θ
satisfying (1.2). If we set Θ_0 = ∩_i Θ_{0i}, then (1.2) translates into "the posterior is
consistent at all θ ∈ Θ_0".
To establish (1.2), let A_0 be a countable algebra generating A. Let
E = {(θ, ω) : lim_{n→∞} (1/n) Σ_{i=1}^n δ_{X_i(ω)}(A) = P_θ(A) for all A ∈ A_0}
The set E, since it arises from the limit of a sequence of measurable functions, is
a measurable set, and further, by the law of large numbers, for each θ the sections E_θ
satisfy
(i) for all θ, P_θ^∞(E_θ) = 1;
(ii) if θ ≠ θ′, E_θ ∩ E_{θ′} = ∅.
Define
f*(ω) = 1 if ω ∈ ∪_{θ∈C} E_θ, and 0 otherwise.
It is a consequence of a deep result in set theory that ∪_{θ∈C} E_θ is measurable, from
which it follows that f* is measurable.
From its definition, f* satisfies:
1. for all θ ∈ C, f* = 1 a.e. P_θ^∞;
2. for all θ not in C, f* = 0 a.e. P_θ^∞.
In other words, for all θ, f* = I_C(θ) a.e. P_θ^∞.
We claim that f* is a version of E(I_C|X_1, X_2, ...). For any measurable set B ∈ B,
∫ I_B f* dλ_Π = ∫ [∫ I_B I_C(θ) dP_θ^∞] dΠ(θ) = ∫ I_C(θ) P_θ^∞(B) dΠ(θ) = λ_Π(C × B)
Since f and f* are both versions of E(I_C|X_1, X_2, ...), we have
f = f* a.e. λ_Π
By Fubini's theorem, there exists a set Θ_0 with Π(Θ_0) = 1, such that for θ in Θ_0,
f = f* a.e. P_θ^∞
(1.2) follows easily from properties 1 and 2 of f* mentioned earlier.
This completes the proof.
Remark 1.3.2. A well known result in set theory, the Borel Isomorphism theorem,
states that any two uncountable Borel sets of complete separable metric spaces are
isomorphic [[153], Theorem 3.3.13]. The result that we used from set theory is a
version of this theorem which states that if S and T are Borel subsets of complete
metric spaces and if φ is a 1-1 measurable function from S into T, then the range of
φ is a measurable set and φ^{-1} is also measurable. To get the result that we used, just
set S = E, T = Ω and φ(θ, ω) = ω.
Remark 1.3.3. Another consequence of the Borel Isomorphism theorem is that
Doob's theorem holds even when Θ and X are just Borel subsets of a complete
separable metric space.
Many Bayesians are satisfied with Doob’s theorem, which provides a sort of internal
consistency but fails to answer the question of consistency at a specific θ0of interest
to a Bayesian. Moreover in the infinite-dimensional case, the set of θ0values where
consistency holds may be a very small set topologically [70] and may exclude infinitely
many θ0s of interest. Disturbing examples and general results of this kind appear in
Freedman [69] in the context of an infinite-cell multinomial.
If θ0is not in the support of the prior Π then there exists an open set Usuch
that Π(U) = 0. This implies that Π(U|Xn)=0a.sλn. Hence,it is not reasonable to
expect consistency outside the support of Π. Ideally, one might hope for consistency
at all θ0in the support of Π. This is often true for a finite-dimensional Θ. However,
for an infinite-dimensional Θ this turns out to be too strong a requirement. We will
often prove consistency for a large set of θ0s . A Bayesian can then decide whether it
includes all or most of the θ0sofinterest.
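The finite-dimensional picture can be illustrated numerically. The sketch below is ours, not from the text: it simulates Bernoulli($\theta_0$) data with a Beta prior, whose support is all of $[0,1]$, and tracks the posterior probability of a small neighborhood of $\theta_0$; consistency shows up as this probability tending to 1. The prior parameters and the neighborhood radius are illustrative choices.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
theta0, a0, b0, eps = 0.3, 2.0, 2.0, 0.05   # true value, Beta(a0,b0) prior, neighborhood radius

x = rng.binomial(1, theta0, size=10_000)
for n in [10, 100, 1000, 10_000]:
    s = x[:n].sum()
    # posterior is Beta(a0 + s, b0 + n - s); mass of the eps-neighborhood of theta0
    post = beta(a0 + s, b0 + n - s)
    mass = post.cdf(theta0 + eps) - post.cdf(theta0 - eps)
    print(f"n = {n:6d}   posterior prob of neighborhood = {mass:.4f}")
```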
1.3.4 Wald-Type Conditions
We begin with a uniform strong law.
Theorem 1.3.3. Suppose that $K$ is a compact subset of a separable metric space. Let $T(\cdot,\cdot)$ be a real-valued function on $\Theta\times\mathbb{R}$ such that

(i) for each $x$, $T(\cdot, x)$ is continuous in $\theta$, and

(ii) for each $\theta$, $T(\theta, \cdot)$ is measurable.

Let $X_1, X_2, \ldots$ be i.i.d. random variables defined on $(\Omega, \mathcal{A}, P)$ with $E(T(\theta, X_1)) = \mu(\theta)$, and assume further that
$$E\Big(\sup_{\theta\in K}|T(\theta, X_i)|\Big) < \infty$$
Then, as $n \to \infty$,
$$\sup_{\theta\in K}\Big|\frac{1}{n}\sum_1^n T(\theta, X_i) - \mu(\theta)\Big| \to 0 \quad \text{a.s. } P$$
Proof. Continuity of $T(\cdot, x)$ and separability ensure that $\sup_{\theta\in K}|T(\theta, X_i)|$ is measurable. It follows from the dominated convergence theorem that $\theta\mapsto\mu(\theta)$ is continuous. Another application of the dominated convergence theorem shows that for any $\theta_0 \in K$,
$$\lim_{\delta\to 0} E\Big(\sup_{|\theta - \theta_0|<\delta}|T(\theta, X_1) - \mu(\theta) - T(\theta_0, X_1) + \mu(\theta_0)|\Big) = 0$$
Let $Z_{j,i} = \sup_{\rho(\theta, \theta_i)<\delta_i}|T(\theta, X_j) - \mu(\theta) - T(\theta_i, X_j) + \mu(\theta_i)|$. By compactness of $K$, there exist $\theta_1, \theta_2, \ldots, \theta_k$ and $\delta_1, \delta_2, \ldots, \delta_k$ such that $K = \cup_1^k\{\theta: \rho(\theta, \theta_i) < \delta_i\}$ and $EZ_{1,i} < \epsilon$ for $i = 1, 2, \ldots, k$.

By the strong law of large numbers, since $E(Z_{1,i}) < \infty$ for $i = 1, 2, \ldots, k$, there is a set $\Omega_0$ with $P(\Omega_0) = 1$ such that for $\omega \in \Omega_0$ and $n > n(\omega)$, for $i = 1, 2, \ldots, k$,
$$\frac{1}{n}\sum_1^n Z_{j,i} < 2\epsilon$$
and
$$\Big|\frac{1}{n}\sum_{j=1}^n T(\theta_i, X_j) - \mu(\theta_i)\Big| < \epsilon$$
Now if $\theta \in \{\theta: \rho(\theta, \theta_i) < \delta_i\}$,
$$\Big|\frac{1}{n}\sum T(\theta, X_j(\omega)) - \mu(\theta)\Big| \le \frac{1}{n}\sum Z_{j,i}(\omega) + \Big|\frac{1}{n}\sum T(\theta_i, X_j(\omega)) - \mu(\theta_i)\Big| \le 3\epsilon$$
Hence $\sup_{\theta\in K}|\frac{1}{n}\sum T(\theta, X_j(\omega)) - \mu(\theta)| < 3\epsilon$.
Remark 1.3.4. A very powerful approach to uniform strong laws is through empirical processes. One considers a sequence of i.i.d. random variables $X_i$ and studies uniformity over a family of functions $\mathcal{F}$ with an integrable envelope function $\phi$, i.e., $E(\phi) < \infty$ and $|f(x)| \le |\phi(x)|$, $f \in \mathcal{F}$. Good references are Pollard [[139], II.2] and van der Vaart and Wellner [[161], 2.4].
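As a numerical illustration of Theorem 1.3.3 (our sketch, not part of the text), take $T(\theta, x) = \log\{p_\theta(x)/p_{\theta_0}(x)\}$ for a $N(\theta, 1)$ model on the compact set $K = [-2, 2]$, so that $\mu(\theta) = -(\theta - \theta_0)^2/2$; the supremum over a fine grid of $K$ of $|\frac1n\sum T(\theta, X_i) - \mu(\theta)|$ should shrink as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 0.0
grid = np.linspace(-2.0, 2.0, 401)           # fine grid approximating the compact set K

def T(theta, x):
    # log likelihood ratio for the N(theta,1) model: log p_theta(x) - log p_theta0(x)
    return -(x - theta) ** 2 / 2 + (x - theta0) ** 2 / 2

mu = -(grid - theta0) ** 2 / 2               # E_{theta0} T(theta, X) for this model
x = rng.normal(theta0, 1.0, size=100_000)
for n in [100, 1000, 10_000, 100_000]:
    avg = np.array([T(t, x[:n]).mean() for t in grid])
    print(f"n = {n:6d}   sup over K of |mean - mu| = {np.abs(avg - mu).max():.4f}")
```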
Here is an easy consequence of the last theorem. First a definition: Let $\Theta$ be a space endowed with a $\sigma$-algebra and $\theta\mapsto P_\theta$ be 1-1. For each $\theta$ in $\Theta$, let $X_1, X_2, \ldots$ be i.i.d. $P_\theta$. Assume that the $P_\theta$s are dominated by a $\sigma$-finite measure $\mu$ and $p_\theta = dP_\theta/d\mu$.

Definition 1.3.2. A measurable function $\hat\theta_n(X_1, X_2, \ldots, X_n)$ taking values in $\Theta$ is called a maximum likelihood estimate (MLE) if the likelihood function at $X_1, X_2, \ldots, X_n$ attains its maximum at $\hat\theta_n(X_1, X_2, \ldots, X_n)$, or formally,
$$\prod_1^n p_{\hat\theta_n(X_1, X_2, \ldots, X_n)}(X_i) = \sup_\theta\prod_1^n p_\theta(X_i)$$

Theorem 1.3.4. Let $\Theta$ be compact metric. For a fixed $\theta_0$, let
$$T(\theta, x) = \log\big(p_\theta(x)/p_{\theta_0}(x)\big)$$
If the $T(\theta, X_i)$ satisfy the assumptions of Theorem 1.3.3 with $P = P_{\theta_0}$, then

1. any MLE $\hat\theta_n$ is consistent at $\theta_0$;

2. if $\Pi$ is a prior on $\Theta$ and if $\theta_0$ is in the support of $\Pi$, then the posterior defined by the density (with respect to $\Pi$) $\prod_1^n p_\theta(X_i)\big/\int\prod_1^n p_\theta(X_i)\, d\Pi(\theta)$ is consistent at $\theta_0$.
Proof. (i) Take any open neighborhood $U$ of $\theta_0$ and let $K = U^c$. Note that $\mu(\theta) = E_{\theta_0}(T(\theta, X_i)) = -K(\theta_0, \theta) < 0$ for all $\theta \ne \theta_0$, where $K(\theta_0,\theta)$ is the Kullback-Leibler divergence, and hence, by the continuity of $\mu(\cdot)$, $\sup_{\theta\in K}\mu(\theta) < 0$.

On the one hand, by Theorem 1.3.3, given $0 < \epsilon < |\sup_{\theta\in K}\mu(\theta)|$, there exists $n(\omega)$ such that for $n > n(\omega)$,
$$\sup_{\theta\in K}\Big|\frac{1}{n}\sum T(\theta, X_i) - \mu(\theta)\Big| < \epsilon$$
On the other hand, $(1/n)\sum T(\hat\theta_n, X_i) \ge 0$. So $\hat\theta_n \notin K$ and hence $\hat\theta_n \in U$.

As a curiosity, we note that we have not used the measurability assumption on $\hat\theta_n$. We have shown that the samples where the MLE is consistent contain a measurable set of $P_{\theta_0}^\infty$-measure 1.
(ii) Let $U$ be a neighborhood of $\theta_0$. We shall show that $\Pi(U|X_1, X_2, \ldots, X_n) \to 1$ a.s. $P_{\theta_0}$. As before, let $K = U^c$, $T(\theta, X_i) = \log(p_\theta(X_i)/p_{\theta_0}(X_i))$ and $U_\delta = \{\theta: \rho(\theta, \theta_0) < \delta\}$. Let
$$A_1 = \inf_{\theta\in\bar U_\delta}\mu(\theta) \quad\text{and}\quad A_2 = \sup_{\theta\in K}\mu(\theta)$$
Clearly $A_1 < 0$, $A_2 < 0$. Choose $\delta$ small enough so that $U_\delta \subset U$ and $|A_1| < |A_2|$. This can be done because $\mu(\theta)$ is continuous and, as $\delta \to 0$, $\inf_{\theta\in\bar U_\delta}\mu(\theta) \to 0$.

Choose $\epsilon > 0$ such that $A_1 - \epsilon > A_2 + \epsilon$. By applying the uniform strong law of large numbers to $K$ and $\bar U_\delta$, for $\omega$ in a set of $P_{\theta_0}$-measure 1, there exists $n(\omega)$ such that for $n > n(\omega)$,
$$\Big|\frac{1}{n}\sum T(\theta, X_i) - \mu(\theta)\Big| < \epsilon \quad\text{for all } \theta \in K\cup\bar U_\delta$$
Now
$$\Pi(U|X_1, X_2, \ldots, X_n) = \frac{\int_U e^{\sum_1^n T(\theta, X_i)}\, d\Pi(\theta)}{\int_U e^{\sum_1^n T(\theta, X_i)}\, d\Pi(\theta) + \int_{U^c} e^{\sum_1^n T(\theta, X_i)}\, d\Pi(\theta)}$$
$$\ge 1\bigg/\left(1 + \frac{\int_K e^{\sum_1^n T(\theta, X_i)}\, d\Pi(\theta)}{\int_{U_\delta} e^{\sum_1^n T(\theta, X_i)}\, d\Pi(\theta)}\right) \ge 1\bigg/\left(1 + \frac{\Pi(K)\, e^{n(A_2+\epsilon)}}{\Pi(U_\delta)\, e^{n(A_1-\epsilon)}}\right)$$
Since $A_2 - A_1 + 2\epsilon < 0$ and $\Pi(U_\delta) > 0$, the last term converges to 1 as $n \to \infty$.
Remark 1.3.5. Theorem 1.3.4 is related to Wald's paper [163]. His conditions and proofs are similar, but he handles the noncompact case by assumptions of the kind given next, which ensure that the MLE $\hat\theta_n$ is inside a compact set eventually, almost surely. Here are two assumptions; we will refer to them as Wald's conditions:

1. Let $\Theta = \cup K_i$ where the $K_i$s are compact and $K_1 \subset K_2 \subset \cdots$. For any sequence $\theta_i \in K_{i-1}^c\cap K_i$, $\lim_i p(x, \theta_i) = 0$.

2. Let $\phi_i(x) = \sup_{\theta\in K_{i-1}^c}\big(\log p(x, \theta)/p(x, \theta_0)\big)$. Then $E_{\theta_0}\phi_i^+(X_1) < \infty$ for some $i$.
Assumption (1) implies that $\lim_{i\to\infty}\phi_i(x) = -\infty$. Using Assumption (2), the monotone convergence theorem and the dominated convergence theorem, one can show
$$\lim_{i\to\infty}E_{\theta_0}\phi_i(X_1) = -\infty$$
Thus, given any $A_3 < 0$, we can choose a compact set $K_j$ such that
$$E_{\theta_0}\phi_j = E_{\theta_0}\Big(\sup_{\theta\in K_{j-1}^c}\log\frac{p(X_i, \theta)}{p(X_i, \theta_0)}\Big) < A_3$$
Using
$$\frac{1}{n}\sup_{\theta\in K_j^c}\sum_1^n\log\frac{p(X_i, \theta)}{p(X_i, \theta_0)} \le \frac{1}{n}\sum_1^n\sup_{\theta\in K_j^c}\log\frac{p(X_i, \theta)}{p(X_i, \theta_0)}$$
and applying the usual SLLN to $(1/n)\sum_{i=1}^n\phi_j(X_i)$, it can be concluded that the left-hand side is eventually $< 0$ a.s. $P_{\theta_0}$. This implies that eventually $\hat\theta_n \in K_j$ a.s. $P_{\theta_0}$. This result for the compact case can now be used to establish consistency of $\hat\theta_n$.
Remark 1.3.6. Suppose $\Theta$ is a convex open subset of $\mathbb{R}^p$ and, for $\theta \in \Theta$,
$$\log f_\theta(x) = A(\theta) + \sum_1^p\theta_j x_j + \psi(x)$$
and $\partial\log f_\theta/\partial\theta$, $\partial^2\log f_\theta/\partial\theta^2$ exist. Then by Lehmann [123]
$$I(\theta) = E_\theta\Big(\frac{\partial\log f_\theta}{\partial\theta}\Big)^2 = -E_\theta\frac{\partial^2\log f_\theta}{\partial\theta^2} = -\frac{d^2A(\theta)}{d\theta^2} > 0$$
Thus the likelihood is log concave. In this case also the MLE is consistent without compactness, by a simple direct argument using Theorem 1.3.4. Start with a bounded open rectangle around $\theta_0$ and let $K$ be its closure. Because $K$ is compact, the MLE $\hat\theta_K$, with $K$ as the parameter space, exists, and given any open neighborhood $V \subset K$ of $\theta_0$, $\hat\theta_K$ lies in $V$ with probability tending to 1. If $\hat\theta_K \in V$ it must be a local maximum and hence a global maximum because of log concavity. This completes the proof. In the log concave situation more detailed and general results are available in Hjort and Pollard [101].

Remark 1.3.7. Under the assumptions of either of the last two remarks it can be verified that the posterior is consistent.
The next two examples show that even in the finite-dimensional case consistency of the MLE and the posterior do not always occur together.

Example. This example is due to Bahadur. Our presentation follows Lehmann [124]. Here $\Theta = \{1, 2, \ldots\}$. For each $\theta$, we define a density $f_\theta$ on $[0,1]$ as follows:

Let $h(x) = e^{1/x^2}$. Define $a_0 = 1$ and $a_n$ by $\int_{a_n}^{a_{n-1}}(h(x) - C)\, dx = 1 - C$, where $0 < C < 1$. Because $\int_0^1 e^{1/x^2}\, dx = \infty$, it is easy to show that the $a_n$s are unique and tend to 0 as $n \to \infty$.

Define $f_k(x)$ on $[0,1]$ by
$$f_k(x) = \begin{cases}h(x) & \text{if } a_k < x < a_{k-1}\\ C & \text{otherwise}\end{cases}$$
For each $k$, let $X_1, X_2, \ldots, X_n$ be i.i.d. $f_k$. Denoting $\min(X_1, X_2, \ldots, X_n)$ by $X^{(n)}_{(1)}$, we can write the likelihood function as
$$L_{X_1, X_2, \ldots, X_n}(k) = \begin{cases}C^n & \text{if } a_{k-1} < X^{(n)}_{(1)}\\ \prod_i d_i & \text{if } a_{k-1} > X^{(n)}_{(1)}\end{cases}$$
where $d_i = I_{A_k}(X_i)\,h(X_i) + I_{A_k^c}(X_i)\,C$ and $A_k = (a_k, a_{k-1}]$.

Because $h(x) > 1$, the likelihood function attains its maximum in the finite set $\{k: a_{k-1} > X^{(n)}_{(1)}\}$, and hence an MLE exists.
Fix $j \in \Theta$. We shall show that any MLE $\hat\theta_n$ fails to be consistent at $j$ by showing
$$P_j\Big(\sum_1^n\log\frac{f_{\hat\theta_n}(X_i)}{f_j(X_i)} > 1\Big) \to 1$$
Actually, we show more, namely, for each $j$, $\hat\theta_n$ converges in $P_j$-probability to $\infty$. Fix $m$ and consider the set $\Theta_1 = \{1, 2, \ldots, m\}\subset\Theta$. It is enough to show that, as $n \to \infty$,
$$P_j\{\hat\theta_n \notin \Theta_1\} \to 1$$
Define $k^*_n = k^*(X_1, X_2, \ldots, X_n)$ to be $k$ if $X^{(n)}_{(1)} \in (a_k, a_{k-1})$. Because the likelihood function at $\hat\theta_n$ is larger than that at $k^*_n$, it suffices to show that
$$\sum_1^n\log\frac{f_{k^*_n}(X_i)}{f_j(X_i)} \to \infty \quad\text{in } P_j\text{-probability}$$
Toward this, first note that for any $k$ and $j$,
$$\sum_1^n\log\frac{f_k(X_i)}{f_j(X_i)} = \sum^{(k)}\log\frac{h(X_i)}{C} - \sum^{(j)}\log\frac{h(X_i)}{C}$$
where $\sum^{(k)}$ is the sum over all $i$ such that $X_i \in (a_k, a_{k-1})$. With $k^*_n$ in place of $k$, we have
$$\sum_1^n\log\frac{f_{k^*_n}(X_i)}{f_j(X_i)} = \sum^{(*)}\log\frac{h(X_i)}{C} - \sum^{(j)}\log\frac{h(X_i)}{C}$$
where $\sum^{(*)}$ is the sum over all $i$ such that $X_i \in (a_{k^*_n}, a_{k^*_n-1})$.

Because for each $x$, $h(x)/C > 1$, the first sum on the right-hand side is larger than $\log(h(X^{(n)}_{(1)})/C)$, one of the terms appearing in the sum. Formally,
$$\sum^{(*)}\log\frac{h(X_i)}{C} \ge \log\frac{h(X^{(n)}_{(1)})}{C}$$
On the other hand, because $h$ is decreasing,
$$\sum^{(j)}\log\frac{h(X_i)}{C} \le \nu_{k,n}\log\frac{h(a_j)}{C}$$
where $\nu_{k,n}$ is the number of $X_i$s in $(a_j, a_{j-1})$.

Thus
$$\frac{1}{n}\sum_1^n\log\frac{f_{k^*_n}(X_i)}{f_j(X_i)} \ge \frac{1}{n}\log\frac{h(X^{(n)}_{(1)})}{C} - \frac{1}{n}\nu_{k,n}\log\frac{h(a_j)}{C}$$
Because $(1/n)\nu_{k,n} \to P_j(a_j, a_{j-1})$, the second term converges to a finite constant. We complete the proof by showing
$$\frac{1}{n}\log h(X^{(n)}_{(1)}) = \frac{1}{n}\,\frac{1}{(X^{(n)}_{(1)})^2} \to \infty$$
in $P_j$-probability.

Toward this, consider $X \sim P_j$ and $Y \sim U(0, 1/C)$. Then for all $x$,
$$P(X > x) \le P(Y > x)$$
To see this, $P(Y > x) = 1 - Cx$, and for $P(X > x)$ note that

if $x > a_{j-1}$ then $P(X > x) = C(1 - x) < 1 - Cx$;

if $x \in (a_j, a_{j-1})$ then, since $h > C$ on $(a_j, x)$, $P(X > x) = 1 - Ca_j - \int_{a_j}^x h(t)\, dt \le 1 - Ca_j - C(x - a_j) = 1 - Cx$;

and if $x < a_j$, then $P(X > x) = 1 - Cx$.

Consequently $X^{(n)}_{(1)}$ is stochastically smaller than $Y^{(n)}_{(1)}$ and, because $h$ is decreasing, $P\{h(X^{(n)}_{(1)}) > x\} \ge P\{h(Y^{(n)}_{(1)}) > x\}$.

Therefore, to show that $(1/n)\log h(X^{(n)}_{(1)}) \to \infty$ in $P_j$-probability, it is enough to show that $(1/n)\log h(Y^{(n)}_{(1)}) \to \infty$ in $U(0, 1/C)$-probability. This follows because
$$\frac{1}{n}\log h(Y^{(n)}_{(1)}) = \frac{1}{n}\,\frac{1}{(Y^{(n)}_{(1)})^2}$$
and an easy computation shows that $nY^{(n)}_{(1)}$ has a limiting distribution and is hence bounded in probability, while $Y^{(n)}_{(1)} \to 0$ a.s.
On the other hand, $\Theta$ being countable, Doob's theorem assures consistency of the posterior at all $j \in \Theta$. This result also follows from Schwartz's theorem, which provides more insight into the behavior of the posterior.

Intuitively, a Bayesian with a proper prior is better off in such situations because a proper prior assigns small probability to large values of $k$, which cause problems for $\hat\theta_n$. For an illuminating discussion of integrating rather than maximizing the likelihood, see the discussion of a counterexample due to Stein in [9].
Example. This is an example where the posterior fails to be consistent at a $\theta_0$ in the support of $\Pi$. It is modeled after an example of Schwartz [145], but is much simpler. Here $\Theta$ is finite-dimensional. In the infinite-dimensional case there are many such examples due to Freedman [69] and Diaconis and Freedman [46], [45].

Let $\Theta = (0,1)\cup(2,3)$ and let $X_1, X_2, \ldots, X_n$ be i.i.d. $U(0, \theta)$. Let $\theta_0 = 1$. $\Pi$ is a prior with density $\pi$, which is positive and continuous on $\Theta$ with $\pi(\theta) = e^{-1/(\theta-\theta_0)^2}$ on $(0,1)$. Because $\int_0^1\pi(\theta)\, d\theta < 1$, there exists such a prior density $\pi$, which is also positive on $(2,3)$.
We will argue that the posterior fails to be consistent at $\theta_0$ by showing that the posterior probability of $(2,3)$ goes to one in $P_{\theta_0}$-probability. The proof rests on the following facts, both of which are easy to verify:

Let $X_{(n)}$ denote the maximum of $X_1, X_2, \ldots, X_n$. Then under $P_{\theta_0}$, i.e., under $U(0,1)$, $n(X_{(n)} - \theta_0) = O_P(1)$. In fact, $n(X_{(n)} - \theta_0)$ converges to an exponential distribution.

The second fact is that $(1/n)\log(1 - X_{(n)}^{n-1}) \to 0$ in $P_{\theta_0}$-probability, because by direct calculation the distribution of $(1 - X_{(n)}^{n-1})$ converges weakly to $U(0,1)$.

Now the posterior probability of $(2,3)$ is given by
$$\frac{\int_2^3\frac{1}{\theta^n}I_{(0,\theta)}(X_{(n)})\,\pi(\theta)\, d\theta}{\int_0^1\frac{1}{\theta^n}I_{(0,\theta)}(X_{(n)})\,\pi(\theta)\, d\theta + \int_2^3\frac{1}{\theta^n}I_{(0,\theta)}(X_{(n)})\,\pi(\theta)\, d\theta}$$
Because $0 \le X_{(n)} \le 1$ a.e. $P_{\theta_0}$, the numerator is equal to $\int_2^3\theta^{-n}\pi(\theta)\, d\theta$ and the first integral in the denominator is $\int_{X_{(n)}}^1\theta^{-n}\pi(\theta)\, d\theta$. So the posterior probability of $(2,3)$ reduces to
$$\frac{1}{1 + \dfrac{\int_{X_{(n)}}^1\theta^{-n}\pi(\theta)\, d\theta}{\int_2^3\theta^{-n}\pi(\theta)\, d\theta}} = \frac{1}{1 + I_1/I_2}$$
Now
$$I_1 \le \pi(X_{(n)})\int_{X_{(n)}}^1\theta^{-n}\, d\theta = \frac{\pi(X_{(n)})}{n-1}\cdot\frac{1 - X_{(n)}^{n-1}}{X_{(n)}^{n-1}}$$
and $(1/n)\log I_1$ is less than
$$-\frac{n-1}{n}\log X_{(n)} - \frac{\log(n-1)}{n} + \frac{1}{n}\log(1 - X_{(n)}^{n-1}) + \frac{1}{n}\log\pi(X_{(n)})$$
As $n \to \infty$ the first two terms on the right side go to 0. The third goes to 0 by the second fact. The last term, using the explicit form of $\pi$ on $(0,1)$, goes to $-\infty$ in $P_{\theta_0}$-probability. Thus $(1/n)\log I_1 \to -\infty$ in $P_{\theta_0}$-probability.

On the other hand,
$$\frac{1}{3^n}\Pi(2,3) < \int_2^3\frac{1}{\theta^n}\pi(\theta)\, d\theta < \frac{1}{2^n}\Pi(2,3)$$
Hence $(1/n)\log I_2$ is bounded between $-\log 3 + \frac{1}{n}\log\Pi(2,3)$ and $-\log 2 + \frac{1}{n}\log\Pi(2,3)$,
and thus $\log(I_1/I_2) \to -\infty$ in $P_{\theta_0}$-probability. Equivalently, $I_1/I_2 \to 0$ in $P_{\theta_0}$-probability.

In this example, the MLE is consistent. We could have taken the parameter space to be $[\epsilon, 1]\cup[2,3]$ and ensured compactness. What goes wrong here, as we shall later recognize, is the lack of continuity of the Kullback-Leibler information and, of course, the behavior of $\Pi$ in the neighborhood of $\theta_0$. If a prior $\Pi$ satisfies $\Pi(\theta_0, \theta_0 + h) > 0$ for all $h > 0$, then similar calculations, or an application of the Schwartz theorem to be proved later, show that the posterior is consistent.
Remark 1.3.8. We have seen that consistency of the MLE neither implies nor is implied by consistency of the posterior. The following condition implies both. Let $V$ be any open set containing $\theta_0$. Then the condition is
$$\sup_{\theta\in V^c}\prod_1^n f_\theta(X_i)/f_{\theta_0}(X_i) \to 0 \quad\text{a.s. } P_{\theta_0}$$
The hypotheses of Theorem 1.3.4 imply this stronger condition.
1.4 Asymptotic Normality of MLE and
Bernstein–von Mises Theorem
A standard result in the asymptotic theory of maximum likelihood estimates is its asymptotic normality. In this section we briefly review this and its Bayesian parallel, the Bernstein–von Mises theorem, on the asymptotic normality of the posterior distribution. A word about the asymptotic normality of the MLE: this is really a result about the consistent roots of the likelihood equation $\partial\log f_\theta/\partial\theta = 0$. If a global MLE $\hat\theta_n$ exists and is consistent, then under a differentiability assumption it is easy to see that for each $P_{\theta_0}$, $\hat\theta_n$ is a consistent solution of the likelihood equation almost surely $P_{\theta_0}$. On the other hand, if $f_\theta$ is differentiable in $\theta$, then for each $\theta_0$ it is possible to construct [Serfling [147] 33.3; Cramér [35]] a sequence $T_n$ that is a solution of the likelihood equation and that converges to $\theta_0$. The problem, of course, is that $T_n$ depends on $\theta_0$ and so will not qualify as an estimator. If there exists a consistent estimate $\theta_n^*$, then a consistent sequence that is also a solution of the likelihood equation can be constructed by picking $\hat\theta_n$ to be the solution closest to $\theta_n^*$. For a sketch of this argument, see Ghosh [89].

As before, let $X_1, X_2, \ldots, X_n$ be i.i.d. $f_\theta$, where $f_\theta$ is a density with respect to some dominating measure $\mu$ and $\theta \in \Theta$, and $\Theta$ is an open subset of $\mathbb{R}$. We make the following regularity assumptions on $f_\theta$:
(i) $\{x: f_\theta(x) > 0\}$ is the same for all $\theta \in \Theta$;

(ii) $L(\theta, x) = \log f_\theta(x)$ is thrice differentiable with respect to $\theta$ in a neighborhood $(\theta_0 - \delta, \theta_0 + \delta)$. If $\dot L$, $\ddot L$ and $\dddot L$ stand for the first, second, and third derivatives, then $E_{\theta_0}\dot L(\theta_0)$ and $E_{\theta_0}\ddot L(\theta_0)$ are both finite and
$$\sup_{\theta\in(\theta_0-\delta,\theta_0+\delta)}|\dddot L(\theta, x)| < M(x) \quad\text{and}\quad E_{\theta_0}M < \infty;$$

(iii) interchange of the order of expectation with respect to $\theta_0$ and differentiation at $\theta_0$ is justified, so that
$$E_{\theta_0}\dot L(\theta_0) = 0, \qquad E_{\theta_0}\ddot L(\theta_0) = -E_{\theta_0}\big(\dot L(\theta_0)\big)^2;$$

(iv) $I(\theta_0) \doteq E_{\theta_0}\big(\dot L(\theta_0)\big)^2 > 0$.

Theorem 1.4.1. If $\{f_\theta: \theta\in\Theta\}$ satisfies conditions (i)–(iv) and if $\hat\theta_n$ is a consistent solution of the likelihood equation, then $\sqrt n(\hat\theta_n - \theta_0)\xrightarrow{D}N(0, 1/I(\theta_0))$.
Proof. Let $L_n(\theta) = \sum_1^n L(\theta, X_i)$. By Taylor expansion,
$$0 = \dot L_n(\hat\theta_n) = \dot L_n(\theta_0) + (\hat\theta_n - \theta_0)\ddot L_n(\theta_0) + \frac{(\hat\theta_n - \theta_0)^2}{2}\dddot L_n(\theta')$$
where $\theta'$ lies between $\theta_0$ and $\hat\theta_n$. Thus,
$$\sqrt n(\hat\theta_n - \theta_0) = \frac{\frac{1}{\sqrt n}\dot L_n(\theta_0)}{-\frac{1}{n}\ddot L_n(\theta_0) - \frac{1}{2}(\hat\theta_n - \theta_0)\frac{1}{n}\dddot L_n(\theta')}$$
By the central limit theorem, the numerator converges in distribution to $N(0, I(\theta_0))$; the first term in the denominator goes to $I(\theta_0)$ by the SLLN; the second term is $o_P(1)$ by the assumptions on $\hat\theta_n$ and $\dddot L$.
We next turn to asymptotic normality of the posterior. We wish to prove that if $\hat\theta_n$ is a consistent solution of the likelihood equation, then the posterior distribution of $\sqrt n(\theta - \hat\theta_n)$ is approximately $N(0, 1/I(\theta_0))$. Early forms of this theorem go back to Laplace, Bernstein, and von Mises [see [46] for references]. A version of this theorem appears in Lehmann [124]; condition (v) in Theorem 1.4.2 is taken from there. Other related references are Bickel and Yahav [20], Walker [164], LeCam [121], [120] and Borwanker et al. [27]. Ghosal [75, 76, 77] has developed posterior normality results in cases where the dimension of the parameter space is increasing. Further refinements developing asymptotic expansions appear in Johnson [107], [108], Kadane and Tierney [158] and Woodroofe [173]. Lindley [129], Johnson [108] and Ghosh et al. [82] provide expansions of the posterior that refine posterior normality. See the next section for an alternative unified treatment of regular and nonregular cases.
Theorem 1.4.2. Suppose $\{f_\theta: \theta\in\Theta\}$ satisfies assumptions (i)–(iv) of Theorem 1.4.1 and $\hat\theta_n$ is a consistent solution of the likelihood equation. Further, suppose

(v) for any $\delta > 0$, there exists an $\epsilon > 0$ such that
$$P_{\theta_0}\Big(\sup_{|\theta-\theta_0|>\delta}\frac{1}{n}\big(L_n(\theta) - L_n(\theta_0)\big) \le -\epsilon\Big) \to 1;$$

(vi) the prior has a density $\pi(\theta)$ with respect to Lebesgue measure, which is continuous and positive at $\theta_0$.

Let $X^n$ stand for $X_1, X_2, \ldots, X_n$ and $f_\theta(X^n)$ for its joint density. Denote by $\pi^*(s|X^n)$ the posterior density of $s = \sqrt n(\theta - \hat\theta_n(X^n))$. Then as $n \to \infty$,
$$\int_{\mathbb{R}}\Big|\pi^*(s|X^n) - \sqrt{\tfrac{I(\theta_0)}{2\pi}}\,e^{-s^2I(\theta_0)/2}\Big|\, ds \xrightarrow{P_{\theta_0}} 0 \tag{1.3}$$

Proof. Because $s = \sqrt n(\theta - \hat\theta_n)$,
$$\pi^*(s|X^n) = \frac{\pi(\hat\theta_n + s/\sqrt n)\,f_{\hat\theta_n+s/\sqrt n}(X^n)}{\int_{\mathbb{R}}\pi(\hat\theta_n + t/\sqrt n)\,f_{\hat\theta_n+t/\sqrt n}(X^n)\, dt}$$
To avoid notational mess, we suppress the $X^n$ and rewrite the last expression as
$$\frac{\pi(\hat\theta_n + s/\sqrt n)\,e^{L_n(\hat\theta_n+s/\sqrt n)-L_n(\hat\theta_n)}}{\int_{\mathbb{R}}\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)}\, dt}$$
Thus we need to show
$$\int_{\mathbb{R}}\Bigg|\frac{\pi(\hat\theta_n + s/\sqrt n)\,e^{L_n(\hat\theta_n+s/\sqrt n)-L_n(\hat\theta_n)}}{\int_{\mathbb{R}}\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)}\, dt} - \sqrt{\tfrac{I(\theta_0)}{2\pi}}\,e^{-s^2I(\theta_0)/2}\Bigg|\, ds \xrightarrow{P_{\theta_0}} 0 \tag{1.4}$$
It is enough to show that
$$\int_{\mathbb{R}}\Big|\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)} - \pi(\theta_0)\,e^{-t^2I(\theta_0)/2}\Big|\, dt \xrightarrow{P_{\theta_0}} 0 \tag{1.5}$$
To see this, note that, writing $C_n$ for $\int_{\mathbb{R}}\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)}\, dt$, (1.4) is
$$C_n^{-1}\int_{\mathbb{R}}\Big|\pi(\hat\theta_n + s/\sqrt n)\,e^{L_n(\hat\theta_n+s/\sqrt n)-L_n(\hat\theta_n)} - C_n\sqrt{\tfrac{I(\theta_0)}{2\pi}}\,e^{-s^2I(\theta_0)/2}\Big|\, ds \xrightarrow{P_{\theta_0}} 0$$
Because (1.5) implies that $C_n \to \pi(\theta_0)\sqrt{2\pi/I(\theta_0)}$, it is enough to show that the integral inside the brackets goes to 0 in probability, and this term is less than $I_1 + I_2$, where
$$I_1 = \int_{\mathbb{R}}\Big|\pi(\hat\theta_n + s/\sqrt n)\,e^{L_n(\hat\theta_n+s/\sqrt n)-L_n(\hat\theta_n)} - \pi(\theta_0)\,e^{-s^2I(\theta_0)/2}\Big|\, ds$$
and
$$I_2 = \int_{\mathbb{R}}\Big|\pi(\theta_0)\,e^{-s^2I(\theta_0)/2} - C_n\sqrt{\tfrac{I(\theta_0)}{2\pi}}\,e^{-s^2I(\theta_0)/2}\Big|\, ds$$
Now $I_1$ goes to 0 by (1.5), and $I_2$ is equal to
$$\Big|\pi(\theta_0) - C_n\sqrt{\tfrac{I(\theta_0)}{2\pi}}\Big|\int_{\mathbb{R}}e^{-s^2I(\theta_0)/2}\, ds$$
which goes to 0 because $C_n \to \pi(\theta_0)\sqrt{2\pi/I(\theta_0)}$.

To achieve a further reduction, set
$$h_n = -\frac{1}{n}\sum_1^n\ddot L(\hat\theta_n, X_i) = -\frac{1}{n}\ddot L_n(\hat\theta_n)$$
Because, as $n \to \infty$, $h_n \to I(\theta_0)$ a.s. $P_{\theta_0}$, to verify (1.5) it is enough if we show that
$$\int_{\mathbb{R}}\Big|\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)} - \pi(\hat\theta_n)\,e^{-t^2h_n/2}\Big|\, dt \xrightarrow{P_{\theta_0}} 0 \tag{1.6}$$
To show (1.6), given any $\delta, c > 0$, we break $\mathbb{R}$ into three regions:
$$A_1 = \{t: |t| < c\sqrt{\log n}\},\qquad A_2 = \{t: c\sqrt{\log n} < |t| < \delta\sqrt n\},\qquad A_3 = \{t: |t| > \delta\sqrt n\}.$$
We begin with $A_3$:
$$\int_{A_3}\Big|\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)} - \pi(\hat\theta_n)\,e^{-t^2h_n/2}\Big|\, dt$$
$$\le \int_{A_3}\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)}\, dt + \int_{A_3}\pi(\hat\theta_n)\,e^{-t^2h_n/2}\, dt$$
The first integral goes to 0 by assumption (v). The second is seen to go to 0 by the usual tail estimates for a normal.

Because $\hat\theta_n \to \theta_0$, by Taylor expansion, for large $n$,
$$L_n(\hat\theta_n + t/\sqrt n) - L_n(\hat\theta_n) = \frac{t^2}{2n}\ddot L_n(\hat\theta_n) + \frac{1}{6}\Big(\frac{t}{\sqrt n}\Big)^3\dddot L_n(\theta') = -\frac{t^2h_n}{2} + R_n$$
for some $\theta'$ between $\hat\theta_n$ and $\hat\theta_n + t/\sqrt n$. Now consider
$$\int_{A_1}\Big|\pi(\hat\theta_n + t/\sqrt n)\,e^{-t^2h_n/2+R_n} - \pi(\hat\theta_n)\,e^{-t^2h_n/2}\Big|\, dt$$
$$\le \int_{A_1}\pi(\hat\theta_n + t/\sqrt n)\Big|e^{-t^2h_n/2+R_n} - e^{-t^2h_n/2}\Big|\, dt + \int_{A_1}\Big|\pi(\hat\theta_n + t/\sqrt n) - \pi(\hat\theta_n)\Big|e^{-t^2h_n/2}\, dt$$
Because $\pi$ is continuous at $\theta_0$, the second integral goes to 0 in $P_{\theta_0}$-probability. The first integral equals
$$\int_{A_1}\pi(\hat\theta_n + t/\sqrt n)\,e^{-t^2h_n/2}\big|e^{R_n}-1\big|\, dt \le \int_{A_1}\pi(\hat\theta_n + t/\sqrt n)\,e^{-t^2h_n/2}\,e^{|R_n|}|R_n|\, dt \tag{1.7}$$
Now,
$$\sup_{t\in A_1}|R_n| = \sup_{t\in A_1}\Big|\frac{1}{6}\Big(\frac{t}{\sqrt n}\Big)^3\dddot L_n(\theta')\Big| \le \frac{c^3(\log n)^{3/2}}{\sqrt n}\,O_P(1) = o_P(1)$$
and hence (1.7) is
$$\le \sup_{t\in A_1}\pi(\hat\theta_n + t/\sqrt n)\int_{A_1}e^{-t^2h_n/2}\,e^{|R_n|}|R_n|\, dt = o_P(1)$$
Next consider
$$\int_{A_2}\Big|\pi(\hat\theta_n + t/\sqrt n)\,e^{-t^2h_n/2+R_n} - \pi(\hat\theta_n)\,e^{-t^2h_n/2}\Big|\, dt$$
$$\le \int_{A_2}\pi(\hat\theta_n + t/\sqrt n)\,e^{-t^2h_n/2+R_n}\, dt + \int_{A_2}\pi(\hat\theta_n)\,e^{-t^2h_n/2}\, dt$$
The second integral is
$$\le 2\,\pi(\hat\theta_n)\,e^{-h_nc^2\log n/2}\,\big[\delta\sqrt n - c\sqrt{\log n}\,\big] \le 2\delta\,\pi(\hat\theta_n)\,\sqrt n\;n^{-c^2h_n/2}$$
so that by choosing $c$ large, the integral goes to 0 in $P_{\theta_0}$-probability.

For the first integral, because $t \in A_2$ and $c\sqrt{\log n} < |t| < \delta\sqrt n$, we have $|t|/\sqrt n < \delta$. Thus
$$|R_n| = \Big(\frac{|t|}{\sqrt n}\Big)^3\frac{1}{6}\big|\dddot L_n(\theta')\big| \le \frac{\delta t^2}{6}\,\frac{1}{n}\big|\dddot L_n(\theta')\big|$$
Because $\sup_{\theta\in(\theta_0-\delta,\theta_0+\delta)}(1/n)\big|\dddot L_n(\theta)\big|$ is $O_P(1)$, by choosing $\delta$ small we can ensure that
$$P_{\theta_0}\Big(|R_n| < \frac{t^2h_n}{4}\ \text{for all } t\in A_2\Big) > 1 - \epsilon \quad\text{for } n > n_0 \tag{1.8}$$
or
$$P_{\theta_0}\Big(-\frac{t^2h_n}{2} + R_n < -\frac{t^2h_n}{4}\ \text{for all } t\in A_2\Big) > 1 - \epsilon \tag{1.9}$$
Hence, with probability greater than $1 - \epsilon$,
$$\int_{A_2}\pi(\hat\theta_n + t/\sqrt n)\,e^{-t^2h_n/2+R_n}\, dt \le \sup_{t\in A_2}\pi(\hat\theta_n + t/\sqrt n)\int_{A_2}e^{-t^2h_n/4}\, dt \to 0 \quad\text{as } n\to\infty$$
Finally, the three steps can be put together, first by choosing a $\delta$ to ensure (1.8) and then by working with this $\delta$ in steps 1 and 3.
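Here is a small numerical check of Theorem 1.4.2 (our sketch with an assumed Bernoulli model and Beta prior, not from the text): the exact posterior is Beta, so the $L_1$-distance in (1.3) between the posterior density of $s = \sqrt n(\theta - \hat\theta_n)$ and the $N(0, 1/I(\theta_0))$ density can be computed by quadrature and watched shrink.

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(4)
theta0, a0, b0 = 0.3, 2.0, 3.0               # true value and Beta(a0, b0) prior (illustrative)
I0 = 1.0 / (theta0 * (1.0 - theta0))         # Fisher information for Bernoulli at theta0

x = rng.binomial(1, theta0, size=20_000)
s_grid = np.linspace(-8.0, 8.0, 4001)
for n in [50, 500, 5000, 20_000]:
    k = x[:n].sum()
    mle = k / n                               # consistent root of the likelihood equation
    post = beta(a0 + k, b0 + n - k)
    # posterior density of s = sqrt(n)(theta - mle), by change of variables
    dens_s = post.pdf(mle + s_grid / np.sqrt(n)) / np.sqrt(n)
    l1 = np.trapz(np.abs(dens_s - norm.pdf(s_grid, scale=1 / np.sqrt(I0))), s_grid)
    print(f"n = {n:6d}   L1 distance = {l1:.4f}")
```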
An asymptotic normality result also holds for Bayes estimates.

Theorem 1.4.3. In addition to the assumptions of Theorem 1.4.2, assume that $\int|\theta|\pi(\theta)\, d\theta < \infty$. Let $\theta_n^* = \int_{\mathbb{R}}\theta\,\Pi(d\theta|X_1, X_2, \ldots, X_n)$ be the Bayes estimate with respect to squared error loss. Then

(i) $\sqrt n(\hat\theta_n - \theta_n^*) \to 0$ in $P_{\theta_0}$-probability;

(ii) $\sqrt n(\theta_n^* - \theta_0)$ converges in distribution to $N(0, 1/I(\theta_0))$.
Proof. The assumption of a finite moment for $\pi$ and a slight refinement of detail in the proof of Theorem 1.4.2 strengthen the assertion to
$$\int_{\mathbb{R}}(1+|t|)\Big|\pi(\hat\theta_n + t/\sqrt n)\,e^{L_n(\hat\theta_n+t/\sqrt n)-L_n(\hat\theta_n)} - \pi(\hat\theta_n)\,e^{-t^2h_n/2}\Big|\, dt \xrightarrow{P_{\theta_0}} 0 \tag{1.10}$$
Consequently
$$\int_{\mathbb{R}}(1+|t|)\Big|\pi^*(t|X^n) - \sqrt{\tfrac{I(\theta_0)}{2\pi}}\,e^{-t^2I(\theta_0)/2}\Big|\, dt \xrightarrow{P_{\theta_0}} 0$$
and hence $\int_{\mathbb{R}}t\Big(\pi^*(t|X^n) - \sqrt{I(\theta_0)/2\pi}\,e^{-t^2I(\theta_0)/2}\Big)\, dt \xrightarrow{P_{\theta_0}} 0$. Note that because
$$\sqrt{\tfrac{I(\theta_0)}{2\pi}}\int_{\mathbb{R}}t\,e^{-t^2I(\theta_0)/2}\, dt = 0$$
we have $\int_{\mathbb{R}}t\,\pi^*(dt|X^n) \to 0$.

To relate these observations to the theorem, note that
$$\theta_n^* = \int_{\mathbb{R}}\theta\,\Pi(d\theta|X_1, X_2, \ldots, X_n) = \int_{\mathbb{R}}\Big(\hat\theta_n + \frac{t}{\sqrt n}\Big)\pi^*(dt|X^n)$$
and hence $\sqrt n(\theta_n^* - \hat\theta_n) = \int_{\mathbb{R}}t\,\pi^*(dt|X^n)$.

Assertion (ii) follows from (i) and the asymptotic normality of $\hat\theta_n$ discussed earlier.
Remark 1.4.1. This theorem shows that the posterior mean of $\theta$ can be approximated by $\hat\theta_n$ up to an error of $o_P(n^{-1/2})$. Actually, under stronger assumptions one can show [82] that the error is of the order of $n^{-1}$. A result of this type also holds for the posterior variance.

Remark 1.4.2. With a stronger version of assumption (v), namely, for any $\delta$,
$$\sup_{|\theta-\theta_0|>\delta}\frac{1}{n}\big[L_n(\theta) - L_n(\theta_0)\big] \le -\epsilon \quad\text{eventually a.e. } P_{\theta_0}$$
and $\hat\theta_n \to \theta_0$ a.s., we can have the $L_1$-distance in (1.3) go to 0 a.s. $P_{\theta_0}$.
Remark 1.4.3. If we have almost sure convergence at each $\theta_0$, then by Fubini, the $L_1$-distance evaluated with respect to the joint distribution of $\theta, X_1, X_2, \ldots, X_n$ goes to 0. For refinements of such results see [82].

Remark 1.4.4. Multiparameter extensions follow in a similar way.

Remark 1.4.5. It follows immediately from (1.5) that
$$\log\int_{\mathbb{R}}\prod_1^n f_\theta(X_i)\,\pi(\theta)\, d\theta = L_n(\hat\theta_n) + \log C_n - \frac{1}{2}\log n$$
$$= L_n(\hat\theta_n) - \frac{1}{2}\log n + \frac{1}{2}\log 2\pi - \frac{1}{2}\log I(\theta_0) + \log\pi(\theta_0) + o_P(1)$$
In the multiparameter case with a $p$-dimensional parameter, this would become
$$\log\int_{\mathbb{R}^p}\prod_1^n f_\theta(X_i)\,\pi(\theta)\, d\theta = L_n(\hat\theta_n) - \frac{p}{2}\log n + \frac{p}{2}\log 2\pi - \frac{1}{2}\log\|I(\theta_0)\| + \log\pi(\theta_0) + o_P(1)$$
where $\|I(\theta_0)\|$ stands for the determinant of the Fisher information matrix.

This is identical to the approximation of Schwarz [146] needed for developing his BIC (Bayes information criterion) for selecting from $K$ given models. Schwarz recommends the use of the penalized likelihood under model $j$ with a $p_j$-dimensional parameter, namely,
$$L_n(\hat\theta_n) - \frac{p_j}{2}\log n$$
to evaluate the $j$th model. One chooses the model with the highest value of this criterion. The proof suggested here does not assume exponential families as in Schwarz [146] but assumes that the true density $f_0$ is in the model being considered. To have a similar approximation when $f_0$ is not in the model, one assumes that
$$\inf_\theta\int f_0\log\frac{f_0}{f_\theta}$$
is attained at $\theta_0$. We use this $\theta_0$ in the assumptions of this section.
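The Schwarz approximation is simple to use in practice. A minimal illustration (ours; the two models and the data-generating choices are arbitrary) compares a one-parameter $N(\mu, 1)$ model against a two-parameter $N(\mu, \sigma^2)$ model by $L_n(\hat\theta_n) - (p_j/2)\log n$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(1.0, 1.0, size=500)           # data actually from N(1, 1)
n = len(x)

# Model 1: N(mu, 1), one free parameter
mu1 = x.mean()
ll1 = norm.logpdf(x, mu1, 1.0).sum()
# Model 2: N(mu, sigma^2), two free parameters (MLEs: sample mean and sd with ddof=0)
mu2, sd2 = x.mean(), x.std()
ll2 = norm.logpdf(x, mu2, sd2).sum()

crit1 = ll1 - 0.5 * 1 * np.log(n)
crit2 = ll2 - 0.5 * 2 * np.log(n)
print("penalized log likelihoods:", round(crit1, 2), round(crit2, 2))
print("Schwarz criterion picks model", 1 if crit1 > crit2 else 2)
```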
Remark 1.4.6. The main theorem in this section remains true if we replace the normal distribution $N(0, 1/I(\theta_0))$ by $N(0, 1/\hat a)$, where $\hat a = -(1/n)(d^2\log L/d\theta^2)\big|_{\hat\theta_n}$ is the observed Fisher information per unit observation. To a Bayesian, this form of the theorem is more appealing because it does not involve a true (but unknown) value $\theta_0$. The proof requires very little change.
1.5 Ibragimov and Hasminskiĭ Conditions
Ibragimov and Hasminskiĭ, henceforth referred to as IH, in their text [102] used a very general framework for parametric models that includes both the regular model treated in the last section and nonregular problems like $U(0, \theta)$. In fact, IH verify their conditions for various classes of nonregular problems and some stochastic processes. Within their framework we will provide a necessary and sufficient condition for a suitably normed posterior to have a limit in probability. This theorem includes Theorem 1.4.2 on posterior normality under slightly different conditions and covers results on nonregular cases. It also answers some questions on nonregular problems raised by Smith [152].

We begin with notations and conditions appropriate for this section. Let $\Theta$ be an open set in $\mathbb{R}^k$. For simplicity we take $k$ to be 1.

The joint probability distribution of $X_1, X_2, \ldots, X_n$ is denoted by $P_\theta^n$ and its density with respect to Lebesgue measure (or any other $\sigma$-finite measure) by $p(X^n, \theta)$. Let $\phi_n$ be a sequence of positive constants converging to 0. If $k > 1$ then $\phi_n$ would be a $k$-dimensional vector of such constants. In the so-called regular case treated in the last section, $\phi_n = 1/\sqrt n$. In the nonregular cases, typically $\phi_n \to 0$ at a faster rate.

Consider the map $U$ defined by $U(\theta) = \phi_n^{-1}(\theta - \theta_0)$, where $\theta_0$ is the true value. Let $U_n$ be the range of this map, i.e., $U_n = \{U(\theta): \theta\in\Theta\}$. The variable $u$ is a suitably scaled deviation of $\theta$ from $\theta_0$. The likelihood ratio process is defined as
$$Z_n(u, X^n) = \frac{p(X^n, \theta_0 + \phi_n u)}{p(X^n, \theta_0)}$$
The IH conditions can be thought of as two conditions on the Hellinger distance and one on weak convergence of finite-dimensional distributions of $Z_n$.

IH conditions

1. For some $M > 0$, $m_1 \ge 0$, $\alpha > 0$, $n_0 \ge 1$,
$$E_{\theta_0}\Big(Z_n^{1/2}(u_1) - Z_n^{1/2}(u_2)\Big)^2 \le M(1 + A^{m_1})\,|u_1 - u_2|^\alpha$$
for all $u_1, u_2 \in U_n$ with $|u_1| \le A$, $|u_2| \le A$, and for all $n \ge n_0$.

Note that the left-hand side is the square of the Hellinger distance between $p(X^n, \theta_0 + \phi_n u_1)$ and $p(X^n, \theta_0 + \phi_n u_2)$. The condition is like a Lipschitz condition in the rescaled parameter space, but uniformly in $n$.
2. For all $u \in U_n$ and $n \ge n_0$,
$$E_{\theta_0}Z_n^{1/2}(u) \le e^{-g_n(|u|)}$$
where $g_n$ is a sequence of real-valued functions satisfying the following conditions:

(a) for each $n \ge 1$, $g_n(y)\uparrow\infty$ as $y\to\infty$;

(b) for any $N > 0$,
$$\lim_{y\to\infty,\,n\to\infty} y^N e^{-g_n(y)} = 0$$

3. The finite-dimensional distributions of $\{Z_n(u): u\in U_n\}$ converge to those of a stochastic process $\{Z(u): u\in\mathbb{R}\}$.

For i.i.d. $X_1, X_2, \ldots, X_n$ with compact $\Theta$, condition 2 will hold if $\phi_n^{-1}$ is bounded by a power of $n$, as is usually the case. This may be seen as follows: Note that
$$E_{\theta_0}Z_n^{1/2}(u) = \big[A(\theta_0, \theta_0 + \phi_n u)\big]^n$$
where $A(\theta_0, \theta_0 + \phi_n u)$ is the affinity between $p_{\theta_0}$ and $p_{\theta_0+\phi_n u}$, given by $\int\sqrt{p_{\theta_0}\,p_{\theta_0+\phi_n u}}\, dx$. Define
$$g_n(y) = \begin{cases}-n\log A(\theta_0, \theta_0 + \phi_n y) & \text{if } y\in U_n\\ \infty & \text{otherwise}\end{cases}$$
Conditions 2(a) and 2(b) follow trivially. For noncompact cases the condition is similar to the Wald conditions. The following result appears in IH (Theorem I.10.2).
Theorem 1.5.1. Let $\Pi$ be a prior with continuous positive density at $\theta_0$ with respect to the Lebesgue measure. Under the IH conditions and with squared error loss, the normalized Bayes estimate $\phi_n^{-1}(\tilde\theta_n - \theta_0)$ converges in distribution to $\int uZ(u)\, du\big/\int Z(u)\, du$.

A similar result holds for other loss functions. This result of IH is similar to the result that was derived as a corollary to the Bernstein–von Mises theorem on posterior normality. So it is natural to ask if such a limit, not necessarily normal, exists for the posterior under the conditions of IH.

We begin with a fact that immediately follows from the Hewitt-Savage 0-1 law.
Proposition 1.5.1. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. and $\Pi$ is a prior. Let $\hat\theta(X_1, X_2, \ldots, X_n)$ be a symmetric function of $X_1, X_2, \ldots, X_n$. Let
$$t = \phi_n^{-1}\big(\theta - \hat\theta(X_1, X_2, \ldots, X_n)\big)$$
and let $A$ be a Borel set. Suppose
$$\Pi(t\in A|X_1, X_2, \ldots, X_n) \xrightarrow{P_{\theta_0}} Y_A$$
Then $Y_A$ is constant a.e. $P_{\theta_0}$.

In view of this, the following definition of convergence of the posterior seems appropriate, at least in the i.i.d. case.

Definition 1.5.1. For some symmetric function $\hat\theta(X_1, X_2, \ldots, X_n)$, the posterior distribution of $t = \phi_n^{-1}\big(\theta - \hat\theta(X_1, X_2, \ldots, X_n)\big)$ has a limit $Q$ if
$$\sup_A\big\{|\Pi(t\in A|X_1, X_2, \ldots, X_n) - Q(A)|\big\} \xrightarrow{P_{\theta_0}} 0$$
In this case, $\hat\theta(X_1, X_2, \ldots, X_n)$ is called a proper centering.
We now state our main result.

Theorem 1.5.2. Suppose the IH conditions hold and $\Pi$ is a prior with continuous positive density at $\theta_0$ with respect to the Lebesgue measure. If a proper centering $\hat\theta(X_1, X_2, \ldots, X_n)$ exists, then there exists a random variable $W$ such that

(a) $\phi_n^{-1}\big(\theta_0 - \hat\theta(X_1, X_2, \ldots, X_n)\big)$ converges in distribution to $W$;

(b) for almost all $\eta\in\mathbb{R}$ with respect to the Lebesgue measure, $\xi(\eta - W) = q(\eta)$ is nonrandom, where $\xi(u) = Z(u)\big/\int_{\mathbb{R}}Z(u)\, du$, $u\in\mathbb{R}$.

Conversely, if (b) holds for some random variable $W$, then the posterior mean given $X_1, X_2, \ldots, X_n$ is a proper centering with $Q(A) = \int_A q(t)\, dt$.

Remark 1.5.1. Under the IH conditions it can be shown that the posterior mean given $X_1, X_2, \ldots, X_n$ exists. (See the proof of IH Theorem 10.2.)

Remark 1.5.2. It is proved in Ghosal et al. [79] that under the IH conditions the posterior with centering at $\theta_0$ converges weakly to $\xi(\cdot)$ a.s. $P_{\theta_0}$. Theorem 1.5.2 shows that if weak convergence is to be strengthened to convergence in probability by centering at a suitable $\hat\theta(X_1, X_2, \ldots, X_n)$, then conditions (a) and (b) are needed.
Example 1.5.1. We sketch how the current theorem leads to (a version of) the Bernstein–von Mises theorem. Assume that the $X_i$s are i.i.d., that conditions 1 and 2 of IH hold, and that the following stochastic expansion, used earlier in this chapter, is valid:
$$\log Z_n(u) = \frac{u}{\sqrt n}\sum_1^n\frac{\partial\log p(X_i, \theta)}{\partial\theta}\Big|_{\theta_0} - \frac{u^2}{2}I(\theta_0) + o_P(1)$$
Then
$$\log Z_n(u) \xrightarrow{D} uV - \frac{u^2}{2}I(\theta_0), \quad\text{where } V \text{ is a } N(0, I(\theta_0)) \text{ random variable.}$$
Let $\log Z(u) = uV - (u^2/2)I(\theta_0)$. This implies that
$$\big(\log Z_n(u_1), \log Z_n(u_2), \ldots, \log Z_n(u_m)\big)$$
converges in distribution to
$$\big(\log Z(u_1), \log Z(u_2), \ldots, \log Z(u_m)\big)$$
i.e., condition 3 of IH holds. An elementary calculation now shows that $W = V/I(\theta_0)$ and $q(\eta)$ is the normal density at $\eta$ with mean 0 and variance $I^{-1}(\theta_0)$.

Some feeling about condition 1 in the regular case may be obtained as follows: an easy calculation shows
$$E_{\theta_0}\big(Z_n^{1/2}(u_1)\,Z_n^{1/2}(u_2)\big) = \big[A(u_1, u_2)\big]^n$$
If we expand $\big(p_{\theta_0+u/\sqrt n}\big)^{1/2}$ up to the quadratic term and integrate, we get the approximation
$$\Big\{1 - \frac{C(u_1 - u_2)^2}{n} + R_n\Big\}$$
Because
$$E_{\theta_0}\big(Z_n^{1/2}(u_1) - Z_n^{1/2}(u_2)\big)^2 = 2 - 2\big[A(u_1, u_2)\big]^n$$
it can be bounded as required in condition 1 under appropriate conditions on the negligibility of the remainder term $R_n$. A useful sufficient condition is provided in Lemma 1.1 of IH.
Example. The following is a nonregular case where the posterior converges to a limit:
$$p(x, \theta) = \begin{cases}e^{-(x-\theta)} & x > \theta\\ 0 & \text{otherwise}\end{cases}$$
The norming constant $\phi_n$ is $n^{-1}$ and a convenient centering is $\hat\theta(X_1, X_2, \ldots, X_n) = \min(X_1, X_2, \ldots, X_n)$. Conditions 1 and 2 of IH are verified in Chapter 5 of IH under very general assumptions that cover the current example. We shall verify the easy condition 3 and the necessary and sufficient condition of Theorem 1.5.2. Let $V_n = n\big(\theta - \hat\theta(X_1, X_2, \ldots, X_n)\big)$ and let $W$ be a random variable exponentially distributed on $(-\infty, 0)$ with mean $-1$. Then $V_n$ and $W$ have the same distribution for all $n$. Also
$$Z_n(u) = \begin{cases}e^u & \text{if } u + V_n < 0\\ 0 & \text{otherwise}\end{cases}$$
Define $Z(u)$ similarly with $W$ replacing $V_n$. Because $W$ and $V_n$ have the same distribution, the finite-dimensional distributions of $Z_n$ and $Z$ are the same. Moreover,
$$\xi(u) = \begin{cases}e^{u+W} & \text{if } u + W < 0\\ 0 & \text{otherwise}\end{cases}$$
and so $q(\eta) = e^\eta$ if $\eta < 0$ and 0 otherwise. The case when $P_\theta$ is uniform can be reduced to this case by a suitable transformation of $X$ and $\theta$.
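A quick simulation of this nonregular example (our sketch, not from the text): with the shifted exponential model and $\hat\theta = \min_i X_i$, the scaled quantity $n(\hat\theta - \theta_0)$ is exactly Exponential(1) for every $n$, which matches the exponential limit process described above.

```python
import numpy as np

rng = np.random.default_rng(6)
theta0, n, reps = 1.0, 1000, 20_000

v = np.empty(reps)
for r in range(reps):
    x = theta0 + rng.exponential(1.0, size=n)    # shifted exponential, shift theta0
    v[r] = n * (x.min() - theta0)                # phi_n^{-1}(thetahat - theta0) with phi_n = 1/n

# n(min - theta0) is Exponential(1) for every n, matching the limit in this example
print("mean     :", round(v.mean(), 3), " (Exp(1) mean = 1)")
print("P(V > 1) :", round((v > 1).mean(), 3), " (exact e^{-1} =", round(np.exp(-1), 3), ")")
```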
Example. This example deals with the hazard rate change point problem. Consider $X_1, X_2, \ldots, X_n$ i.i.d. with hazard rate
$$\frac{f_\theta(x)}{1 - F_\theta(x)} = \begin{cases}a & \text{if } 0 < x < \theta\\ b & \text{if } x > \theta\end{cases}$$
Typically $a$ is much bigger than $b$. This density has been used to model electronic components with an initial high hazard rate and cancer relapse times. For details see Ghosh et al. [85].

Let $\hat\theta(X_1, X_2, \ldots, X_n)$ be the MLE of $\theta$. It can be shown that $\phi_n = n^{-1}$ is the right norming constant and that the IH conditions hold. But the necessary condition that $\xi(\eta - W)$ is nonrandom fails. On the other hand, if $a, b$ are also unknown, it can be shown that the posterior distribution of $\big(\sqrt n(a - \hat a), \sqrt n(b - \hat b)\big)$ has a limit in the sense of Theorem 1.5.2. For details see [85] and [79].
Remark 1.5.3. Ghosal et al. [84] show that typically in nonregular examples the necessary condition of Theorem 1.5.2 fails.

Remark 1.5.4. Theorems 2.2 and 2.3 of [84] imply consistency of the posterior under the conditions of IH.

Remark 1.5.5. If $\sum\phi_n^s < \infty$ for some $s > 0$, then posterior consistency holds in the a.s. sense.
1.6 Nonsubjective Priors
This section contains a brief discussion of nonsubjective priors. This term has been
generally used in the literature for the so-called noninformative priors. In this section
we use it as a generic description of all priors that are not elicited in a fully subjective
manner.
1.6.1 Fully Specified
Fully specified nonsubjective priors try to quantify low information in one sense or
another. Because there is no completely satisfactory definition of information, many
choices are available. Only the most common are discussed. A comprehensive survey is
by Kass and Wasserman [111]. A quick overview is available in Ghosh and Mukherjee
[86] and Ghosh [83]. In particular, we use this term to describe conjugate priors and
their mixtures.
For convenience we take $\Theta = \mathbb{R}^p$. The use of the uniform distribution, namely the Lebesgue measure, as a prior goes back to Bayes and Laplace. It has been criticized as being improper (i.e., the total measure is not finite), a property that applies to all the priors considered in this section and is a consequence of $\Theta$ being unbounded. An improper prior may be used only if it leads to a proper posterior for all samples. This posterior may then be used to calculate Bayes estimates and so on. However, even then there arise problems with testing hypotheses and model selection. Because we will not consider testing for infinite-dimensional $\Theta$ we will not pursue this. For finite-dimensional $\Theta$, attractive possibilities are available. See, for example, Berger and Pericchi [16] and Ghosh and Samanta [88].

As pointed out by Fisher, the choice of a uniform distribution is not invariant in the following sense. Take a smooth 1-1 function $\eta(\theta)$ of $\theta$. Argue that if one has no information about $\theta$ then the same is true of $\eta(\theta)$, and hence one can quantify this belief by a uniform distribution for $\eta$. Going back to $\theta$ one gets a nonuniform prior $\pi$ for $\theta$ satisfying
$$\pi(\theta) = \Big|\frac{d\eta}{d\theta}\Big|$$
where $|d\eta/d\theta|$ is the Jacobian, i.e., the determinant of the $p\times p$ matrix $[\partial\eta_i/\partial\theta_j]$.

It appears that Fisher's criticism led to the decline of Bayesian methods based on uniform priors. This also helped the growth of methods based on maximizing the likelihood. However, Basu [9] makes a strong case for a uniform distribution after a suitable finite discrete approximation to $\Theta$. This idea will be taken up in Chapter 8.
A natural Bayesian answer to Fisher's criticism is to look for a method that produces priors $\pi_1(\theta)$, $\pi_2(\eta)$ for $\theta$ and $\eta$ such that one can pass from one to the other by the usual Jacobian formula
$$\pi_1(\theta) = \pi_2(\eta(\theta))\Big|\frac{d\eta}{d\theta}\Big| \tag{1.11}$$
Suppose the likelihood satisfies regularity conditions and the $p\times p$ Fisher information matrix
$$I(\theta) = E_\theta\Big(\frac{\partial\log f_\theta}{\partial\theta_i}\cdot\frac{\partial\log f_\theta}{\partial\theta_j}\Big)$$
is positive definite. Then Jeffreys suggested the use of
$$\pi_1(\theta) = \{\det I(\theta)\}^{1/2}$$
This is known as the Jeffreys prior. It is easily verified that (1.11) is satisfied if we set
$$\pi_2(\eta) = \Big\{\det E_\theta\Big(\frac{\partial\log f_\theta}{\partial\eta_i}\cdot\frac{\partial\log f_\theta}{\partial\eta_j}\Big)\Big\}^{1/2}$$
using the Fisher information matrix in the $\eta$-space. One apparently unpleasant aspect is the dependence of the prior on the experiment. This is examined in the next subsection.
The Jeffreys prior was the most popular nonsubjective prior until the introduction of reference priors by Bernardo [18]. The algorithm described next is due to Berger and Bernardo [14], [15]. We follow the treatment given in Ghosh [83].

For a discrete random variable or vector $W$ with probability function $p(w)$, the Shannon entropy is
$$S(p) = S(W) = -E_p(\log p(W))$$
It can be axiomatically developed and is a basic quantity in information and communication theory. Maximization of entropy, which is equivalent to minimizing information, leads to a discrete uniform distribution, provided $W$ assumes only finitely many values.

Unfortunately, no such universally accepted measure exists if $W$ is not discrete. In the general case we may still define
$$S(p) = S(W) = -E_p(\log p(W))$$
where $p$ is the density with respect to some $\sigma$-finite measure $\mu$. Unfortunately, this $S(p)$ depends on $\mu$ and is rarely used directly in information or communication theory. Further, if one maximizes $S(p)$ one gets $p = $ constant, i.e., one gets essentially $\mu$.
A different measure, also due to Shannon, was used by Lindley [128] and Bernardo [18]. Consider two random vectors $V, W$ with joint density $p$. Then
$$S(p) \equiv S(V, W) = S(V) + S_V(W)$$
where
$$S_V(W) = E(I(W|V)), \qquad I(W|V) = -E\{\log p(W|V)\,|\,V\}$$
Here $S_V(W)$ is the part of the entropy of $W$ that can be explained by its dependence on $V$. The residual entropy is
$$S(W) - S_V(W) = E\Big(E\Big(\log\frac{p(W|V)}{p(W)}\,\Big|\,V\Big)\Big) \ge 0$$
Because
$$\int p(w|v)\log\frac{p(w|v)}{p(w)}\,\mu(dw) \ge 0,$$
this quantity is taken as a measure of entropy in the construction of reference priors.

Let $X = (X_1, X_2, \ldots, X_n)$ have density $p(x|\theta)$, and let the prior be $p(\theta)$ and the posterior density be $p(\theta|x)$. Lindley's measure of information in $X$ is
$$S(X, p(\theta)) = E\Big(E\Big(\log\frac{p(\theta|X)}{p(\theta)}\,\Big|\,X\Big)\Big) \tag{1.12}$$
So it is a measure of how close the prior is to the posterior. If the prior is most informative, i.e., degenerate at a point, then the quantity is 0. Maximizing the quantity should therefore make the prior as noninformative as possible, provided $S(X, p(\theta))$ is the correct measure of entropy.

Bernardo [18] recommended taking a limit first as $n \to \infty$ and then maximizing. Taking a limit seems to introduce some stability and removes dependence on $n$. Subsequent research has shown that maximizing for a fixed $n$ may lead to discrete priors, which are unacceptable as noninformative.

To ensure that a limit exists, one assumes i.i.d. observations with enough regularity conditions for posterior normality in a sufficiently strong sense. Details are available in Clarke and Barron [33].

Suppose $K_i$ is an increasing sequence of compact sets whose union is the whole parameter space $\Theta$. To avoid confusion with the density $p$, the dimension of $\theta$ is taken as $d$. Then, using posterior normality,
$$S(X, p) = -E(\log p(\theta)) + E(\log p(\theta|X)) = -E(\log p(\theta)) + E\log N(\theta) + o(1)$$
where $N$ is the normal density with mean $\hat\theta$ and dispersion matrix $I^{-1}(\hat\theta)/n$.

The second term on the right equals
$$-E\Big(\frac{n\sum(\theta_i - \hat\theta_i)(\theta_j - \hat\theta_j)I_{ij}(\hat\theta)}{2}\Big) + E\log\{\det I(\hat\theta)\}^{1/2} + \frac{d}{2}\log\frac{n}{2\pi}$$
If we approximate $I(\hat\theta)$ by $I(\theta)$ and $E(\theta_i - \hat\theta_i)(\theta_j - \hat\theta_j)$ by the $(i,j)$th element of $I^{-1}(\theta)/n$, then $S(X, p)$ simplifies to
$$\frac{d}{2}\log\frac{n}{2\pi e} + \int_{K_i}p(\theta)\log\{\det I(\theta)\}^{1/2}\, d\theta - \int_{K_i}p(\theta)\log p(\theta)\, d\theta + o(1) \tag{1.13}$$
Thus, as $n \to \infty$, $S(X, p)$ is decomposed into a term that does not depend on $p(\theta)$ and
$$J(p, K_i) = \int_{K_i}p(\theta)\log\frac{\{\det I(\theta)\}^{1/2}}{p(\theta)}\, d\theta$$
which is maximized at
$$p_i(\theta) = \begin{cases}\text{const.}\,\{\det I(\theta)\}^{1/2} & \text{if } \theta\in K_i\\ 0 & \text{otherwise}\end{cases}$$
If one lets $i \to \infty$, the $p_i$s may be regarded as converging to the Jeffreys prior. This is a rederivation of the Jeffreys prior from an information-theoretic point of view by Bernardo [18]. To get a reference prior, one writes $\theta = (\theta_1, \theta_2)$, where $\theta_1$ is the parameter of interest and $\theta_2$ is a nuisance parameter. Let $d_i$ be the dimension of $\theta_i$, and for convenience take $\Theta = \Theta_1\times\Theta_2$.

For a fixed $\theta_1$, let $p(\theta_2|\theta_1)$ be a conditional prior for $\theta_2$ given $\theta_1$. By integrating out $\theta_2$, one is left with $\theta_1$ and $X$. Then one finds the marginal prior $p(\theta_1)$ as described earlier. This depends on the choice of $p(\theta_2|\theta_1)$. Bernardo [18] recommended use of the Jeffreys prior const$\,\cdot\,\det\{I_{22}(\theta)\}^{1/2}$, treating $\theta_2$ as variable with $\theta_1$ held constant. Here $I_{22}(\theta) = [I_{ij}(\theta),\ i, j = d_1+1, \ldots, d_1+d_2]$.

Fix compact sets $K_{i1}, K_{i2}$ of $\Theta_1$ and $\Theta_2$. Consider priors concentrating on $K_{i1}\times K_{i2}$. Let $p_i(\theta_2|\theta_1)$ be a given conditional prior. Our first object is to maximize the entropy in $\theta_1$ and find the marginal $p(\theta_1)$.

Let
$$S(X, p_i(\theta_1)) = E\Big(\log\frac{p_i(\theta_1|X)}{p_i(\theta_1)}\Big) = S(X, p_i(\theta_1, \theta_2)) - \int_{K_{i1}}p_i(\theta_1)\,S(X, p_i(\theta_2|\theta_1))\, d\theta_1 \tag{1.14}$$
Assuming that one can interchange integration with respect to $\theta_1$, and using the asymptotic form (1.13) of $S(X, p(\theta_1, \theta_2))$,
$$S(X, p_i(\theta_1)) = \frac{d_1}{2}\log\frac{n}{2\pi e} + \int_{K_{i1}}p_i(\theta_1)\log\frac{\psi_i(\theta_1)}{p_i(\theta_1)}\, d\theta_1 + o(1)$$
where
$$\psi_i(\theta_1) = \exp\Big\{\int_{K_{i2}}p_i(\theta_2|\theta_1)\log\Big(\frac{\det I(\theta)}{\det I_{22}(\theta)}\Big)^{1/2}\, d\theta_2\Big\}$$
Maximizing $S(X, p_i(\theta_1))$ asymptotically,
$$p_i(\theta_1) = \text{const}\;\psi_i(\theta_1)\quad\text{on } K_{i1}$$
where the constant is for normalization. Then
$$p_i(\theta_1, \theta_2) = \begin{cases}\text{const}\;\psi_i(\theta_1)\,p_i(\theta_2|\theta_1) & \text{on } K_{i1}\times K_{i2}\\ 0 & \text{elsewhere}\end{cases}$$
Finally take
$$p_i(\theta_2|\theta_1) = \begin{cases}c_i(\theta_1)\{\det I_{22}(\theta)\}^{1/2} & \text{on } K_{i2}\\ 0 & \text{otherwise}\end{cases}$$
To choose a limit in some sense, fix $\theta^0 = (\theta_{10}, \theta_{20})$ and assume
$$\lim_i p_i(\theta_1, \theta_2)/p_i(\theta_{10}, \theta_{20}) = p(\theta_1, \theta_2)$$
exists for all $\theta\in\Theta$. Then $p(\theta_1, \theta_2)$ is the reference prior when $\theta_1$ is more important than $\theta_2$. If the convergence to $p(\theta_1, \theta_2)$ is uniform on compacts, then for any pair of sets $B_1, B_2$ contained in a fixed $K_{i_01}\times K_{i_02}$,
$$\lim_i\frac{\int_{B_1}p_i(\theta_1, \theta_2)}{\int_{B_2}p_i(\theta_1, \theta_2)} = \frac{\int_{B_1}p(\theta_1, \theta_2)}{\int_{B_2}p(\theta_1, \theta_2)}$$
Berger and Bernardo [15] recommend a $d$-dimensional breakup of $\theta$ as $(\theta_1, \theta_2, \ldots, \theta_d)$ and a $d$-step algorithm starting with
$$p(\theta_d|\theta_1, \ldots, \theta_{d-1}) = c(\theta_1, \theta_2, \ldots, \theta_{d-1})\,\{I_{dd}(\theta)\}^{1/2}\quad\text{on } K_{id}$$
Some justification for this is provided in Datta and Ghosh [38].
There is still another class of nonsubjective priors, obtained by matching what a frequentist might do (because, presumably, that is how a Bayesian without prior information would act). Technically, this amounts to matching posterior and frequentist probabilities up to a certain order of approximation. This leads to a differential equation involving the prior. For $d = 1$ the Jeffreys prior is the unique solution. For $d > 1$, reference priors are often a solution of the matching equation. More details are given in Ghosh [83].

Finally, there is one class of problems in which there is some sort of consensus on what nonsubjective prior to use. These are problems where a nice group $G$ of transformations leaves the problem invariant and either acts transitively on $\Theta$, i.e., $\{g(\theta_0): g\in G\} = \Theta$, or reduces $\Theta$ to a one-dimensional maximal invariant parameter. See, for example, Berger [13]. In the next example $G$ acts transitively. In such problems the right invariant Haar measure is a common choice and is a reference prior. The Jeffreys prior is a left invariant Haar measure, which causes problems [see, e.g., Dawid, Stone, and Zidek [39]]. For examples involving one-dimensional maximal invariants, see Datta and Ghosh [38]. Here also reference priors do well.
Example 1.6.1. The $X_i$s are i.i.d. normal with mean $\theta_2$ and variance $\theta_1$; $\theta_1$ is the parameter of importance. The information matrix is
$$I(\theta) = \begin{pmatrix}\dfrac{1}{2\theta_1^2} & 0\\[4pt] 0 & \dfrac{1}{\theta_1}\end{pmatrix}$$
and so the reference prior may be obtained through the following steps:
$$p_i(\theta_2|\theta_1) = d_i \quad\text{on } K_{i2}$$
$$\psi_i(\theta_1) = \exp\Big\{\int_{K_{i2}}d_i\log\frac{1}{\sqrt 2\,\theta_1}\, d\theta_2\Big\} \propto \frac{1}{\theta_1}$$
$$p_i(\theta_1, \theta_2) = c_i\,\frac{1}{\theta_1}\quad\text{on } K_{i1}\times K_{i2}$$
$$p(\theta_1, \theta_2) = \theta_1^{-1}$$
which is also known to arise from the right invariant Haar measure for $(\mu, \sigma)$. The Jeffreys prior is proportional to $\theta_1^{-3/2}$, which corresponds to the left invariant Haar measure.

If the mean is taken to be $\theta_1$ and the variance $\theta_2$, then the reference prior is proportional to $\theta_2^{-1}$. But, in general, a reference prior depends on how the components are ordered.
1.6.2 Discussion
Nonsubjective priors are best thought of as providing a tool for calculating posteriors. Theorems like posterior normality indicate that the effect of the prior washes away as the sample size increases. Hence a posterior obtained from a nonsubjective prior may be thought of as an approximation to a posterior obtained from a subjective prior. Though there is no unique choice for a nonsubjective prior, the posteriors obtained from different nonsubjective priors will usually be close to each other, even for moderate values of $n$. Thus the lack of uniqueness may not matter very much.

It is true that a nonsubjective prior usually depends on the experiment, e.g., through the information matrix $I(\theta)$. This would not seem paradoxical if one remembers that nonsubjective priors have low information, and it seems that information cannot be defined except in the context of an experiment. The measure of information used by Bernardo [18] clarifies this.

Nonsubjective priors are typically improper, but some justification comes from the work of Heath and Sudderth [97], [96]. They show that, at least for amenable groups, the posterior obtained from a right invariant measure can be obtained from a proper, finitely additive prior.

For improper priors one has to verify that the posteriors are proper. In many cases this is not easy. Some Bayesians use an improper prior and restrict it to a large compact set. In general, this is not advisable. It is a remarkable fact that for the Jeffreys or reference priors the posteriors are often proper, but there exist simple counterexamples; see, for example, [38]. If the likelihood shows marked inhomogeneities asymptotically, as in the so-called nonergodic cases, one must take these into account through suitable conditioning.
1.7 Conjugate and Hierarchical Priors
Let the $X_i$s be i.i.d. Consider exponential densities with a special parametrization
$$f_\theta(x) = \exp\Big\{A(\theta) + \sum_1^p\theta_j T_j(x) + \psi(x)\Big\}$$
Given $X_1, X_2, \ldots, X_n$, the sufficient statistic is $\big(\sum_1^n T_1(x_i), \ldots, \sum_1^n T_p(x_i)\big)$. Assume $\Theta$ is an open $p$-dimensional rectangle. Because
$$E_\theta\Big(\frac{\partial\log f_\theta}{\partial\theta_j}\Big) = 0$$
one has
$$-\frac{\partial A(\theta)}{\partial\theta_j} = E_\theta(T_j) = \eta_j(\theta)$$
$\eta = (\eta_1, \eta_2, \ldots, \eta_p)$ provides another natural parametrization. Note that the MLE is $\hat\eta = \bar T$, the vector of sample means of the sufficient statistics.

A class of priors $\mathcal{C}$ is said to be a conjugate family if, given a prior in $\mathcal{C}$, the posterior for all $n$ belongs to $\mathcal{C}$. One can generate such families by choosing a $\sigma$-finite measure $\nu$ on $\Theta$ and defining elements of $\mathcal{C}$ by
$$p(\theta|m, t_1, t_2, \ldots, t_p) = \text{const.}\exp\Big\{mA(\theta) + \sum_1^p\theta_j t_j\Big\} \tag{1.15}$$
where $m$ is a positive integer and $t_1, t_2, \ldots, t_p$ are elements in the sample space of $T_1, T_2, \ldots, T_p$. The constants $m, t_1, t_2, \ldots, t_p$ are parameters of the prior distribution, chosen such that the prior is proper.

Usually, $\nu$ is a nonsubjective prior. Then the prior displayed in (1.15) can be interpreted as a posterior when the prior is $\nu$ and one has a conceptual sample of size $m$ yielding values of the sufficient statistics $T = (t_1, t_2, \ldots, t_p)$; i.e., compared with $\nu$ it represents prior information equivalent to a sample of size $m$.

The case when $\nu$ is the Lebesgue measure deserves special attention. Under certain conditions, one can prove the following by an argument involving integration by parts:
$$E(\eta|X_1, X_2, \ldots, X_n) = \frac{mE(\eta) + n\hat\eta}{m + n} \tag{1.16}$$
which shows that the posterior mean is a convex combination of the prior mean and a suitable frequentist estimate. The relation strengthens the interpretation of $m$ as a measure of information in the prior. The elements of $\mathcal{C}$ corresponding to the Lebesgue measure are usually called conjugate priors. Diaconis and Ylvisaker [47] have shown that these are the only priors that satisfy (1.16). One can elicit the values of $t_1, t_2, \ldots, t_p$ by eliciting the prior mean, and $m$ by comparing prior information with information from a sample. This makes these priors relatively easy to elicit, but because one is only eliciting some aspects of the prior, a conjugate prior is a nonsubjective prior with some parameters reflecting prior belief.
Example. $f_\theta$ is the normal density with mean $\mu$ and standard deviation $\sigma$. Here $\theta_1 = \mu/\sigma^2$, $\theta_2 = -1/2\sigma^2$, $A(\theta) = -(\mu^2/2\sigma^2) - \log\sigma$, and $T_1(x) = x$, $T_2(x) = x^2$. A conjugate prior is of the form
$$p(\theta) = \text{const.}\;e^{mA(\theta) + t_1\theta_1 + t_2\theta_2}$$
which can be displayed as the product of a normal and an inverse gamma.

Example. $f_\theta$ is Bernoulli with parameter $\theta$. Conjugate priors are beta distributions.

Example. $f_\theta$ is multinomial with parameters $\theta_1, \theta_2, \ldots, \theta_p$, where $\theta_i \ge 0$, $\sum\theta_i = 1$. Conjugate priors are Dirichlet distributions, discussed in the next chapter.
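The convex-combination identity (1.16) is easy to see numerically in the Bernoulli/beta case (our sketch, not from the text): with a Beta($\alpha, \beta$) prior, $m = \alpha + \beta$ plays the role of the prior sample size in (1.15)–(1.16), and the posterior mean of $\eta = E_\theta X = \theta$ interpolates between the prior mean and $\hat\eta = \bar X$.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta_, theta_true = 3.0, 7.0, 0.6
m = alpha + beta_                      # "prior sample size" in (1.15)/(1.16)
prior_mean = alpha / m

n = 50
x = rng.binomial(1, theta_true, size=n)
eta_hat = x.mean()                     # frequentist estimate of eta = E_theta(X) = theta

posterior_mean = (alpha + x.sum()) / (m + n)                     # mean of Beta(alpha + s, beta + n - s)
convex_combination = (m * prior_mean + n * eta_hat) / (m + n)    # right-hand side of (1.16)
print(posterior_mean, convex_combination)                        # identical, illustrating (1.16)
```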
Conjugate priors have been criticized on two grounds. The relation (1.16) may not be reasonable if there is conflict between the prior and the data. For example, if $p = 1$, the prior mean is 0 and $\hat\eta$ is 20, should one believe the data or the prior? A convex combination of two incompatible estimates is unreasonable.

For $N(\mu, \sigma^2)$, a $t$-prior for $\mu$ and a nonsubjective prior for $\sigma$ ensure that in cases like this the posterior mean shifts more toward the data, i.e., a choice of such a prior means that, in cases of conflict, one trusts the data. The $t$-prior is a scale mixture of normals. In general, it seems that mixtures of conjugate priors will possess this kind of property, but we have not seen any general investigation in the literature.

The other criticism of conjugate priors is that only one parameter $m$ is left to model the prior belief on uncertainty. Once again, a mixture of conjugate priors offers more flexibility.

These mixtures may be thought of as modeling prior belief in a hierarchy of stages, called hierarchical priors. The reason for their current popularity in Bayesian analysis is that they are flexible and posterior quantities can be calculated by Markov chain Monte Carlo. A good source is Schervish [144].
1.8 Exchangeability, De Finetti’s Theorem,
Exponential Families
Subjective priors can be elicited in special simple cases; a relatively recent treatment is Kadane et al. [109]. However, there is one class of problems where subjective judgments can be made relatively easily and can lead to both a model and a prior.

Suppose $\{X_i\}$ is a sequence of random variables. This sequence is said to be exchangeable if for any $n$ distinct $i_1, i_2, \ldots, i_n$,
$$P\{X_{i_1}\in B_1, X_{i_2}\in B_2, \ldots, X_{i_n}\in B_n\} = P\{X_1\in B_1, X_2\in B_2, \ldots, X_n\in B_n\} \tag{1.17}$$
Suppose the $\{X_i\}$ take values in $\{0,1\}$. One may be able to judge whether the $\{X_i\}$s are exchangeable. In some sense, such judgments are fundamental to science, where one makes inductions about the future based on past experience. The next theorem of De Finetti shows that this subjective judgment leads to a model and affirms the existence of a prior.
Theorem 1.8.1. If a sequence of random variables $\{X_i\}$ is exchangeable and if each $X_i$ takes values in $\{0,1\}$, then there exists a distribution $\Pi$ such that
$$P\{X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\} = \int_0^1\theta^r(1-\theta)^{n-r}\, d\Pi(\theta)$$
with $r = \sum_1^n x_i$.

The theorem implies that one has a Bernoulli model and a prior $\Pi$. To specify a prior, one needs additional subjective judgments. For example, if given $X_1, X_2, \ldots, X_n$ one predicts $X_{n+1}$ by $P(X_{n+1}=1|X_1, \ldots, X_n) = (\alpha + \sum x_i)/(\alpha + \beta + n)$, then $\Pi$ must be a beta prior.
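The predictive rule just mentioned is exactly the Pólya urn. A short simulation (ours, not from the text) generates an exchangeable binary sequence by that rule and recovers, across replications, the Beta($\alpha, \beta$) mixing distribution of the limiting relative frequency promised by De Finetti's theorem.

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, beta_, n, reps = 2.0, 5.0, 500, 2000

freqs = np.empty(reps)
for r in range(reps):
    s = 0.0
    for i in range(n):
        p = (alpha + s) / (alpha + beta_ + i)   # predictive P(X_{i+1} = 1 | past)
        s += rng.random() < p
    freqs[r] = s / n                            # relative frequency, approximately Beta(alpha, beta)

print("mean of frequencies:", round(freqs.mean(), 3), " Beta mean =", round(alpha / (alpha + beta_), 3))
print("var  of frequencies:", round(freqs.var(), 4),
      " Beta var  =", round(alpha * beta_ / ((alpha + beta_) ** 2 * (alpha + beta_ + 1)), 4))
```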
Regazzini [67] has shown that judgments of exchangeability, along with certain judgments on the predictive distributions of $X_{n+1}$ given $X_1, X_2, \ldots, X_n$, lead to a similar representation theorem, which yields an exponential model along with a mixing distribution that may be interpreted as a prior. Earlier Bayesian derivations of exponential families are due to Lauritzen [117] and Diaconis and Freedman [44]. A good treatment is in Schervish [144], where partial exchangeability and its modeling through hierarchical priors are also discussed.
2
M(X) and Priors on M(X)
2.1 Introduction
As mentioned in Chapter 1, in the nonparametric case the parameter space $\Theta$ is typically the set of all probability measures on $\mathcal{X}$. We denote the set of all probability measures on $\mathcal{X}$ by $M(\mathcal{X})$. The cases of interest to us are when $\mathcal{X}$ is a finite set and when $\mathcal{X} = \mathbb{R}$. The Bayesian aspect requires prior distributions on $M(\mathcal{X})$, in other words, probabilities on the space of probabilities. In this chapter we develop some measure-theoretic and topological features of the space $M(\mathcal{X})$ and discuss various notions of convergence on the space of prior distributions.

The results in this chapter, except for the last section, are mainly used to assert the existence of the priors discussed later. Thus, for a reader who is prepared to accept the existence theorems mentioned later, a cursory reading of this chapter would be adequate. On the other hand, for those who are interested in measure-theoretic aspects, a careful reading of this chapter will provide a working familiarity with the measure-theoretic subtleties involved. The last section, where formal definitions of consistency are discussed, can be read independently. While we generally consider the case $\mathcal{X} = \mathbb{R}$, most of the arguments would go through when $\mathcal{X}$ is a complete separable metric space.
2.2 The Space M(X)
As before, let $\mathcal{X}$ be a complete separable metric space with $\mathcal{B}$ the corresponding Borel $\sigma$-algebra on $\mathcal{X}$. Denote by $M(\mathcal{X})$ the space of all probability measures on $(\mathcal{X}, \mathcal{B})$.

As seen in Chapter 1, there are many reasonable notions of convergence on the space $M(\mathcal{X})$, but they are not all equally convenient for our purpose. We begin with a brief discussion of these.

Total Variation Metric. Recall that the total variation metric is defined by
$$\|P - Q\| = 2\sup_B|P(B) - Q(B)|$$
If $p$ and $q$ are densities of $P$ and $Q$ with respect to some $\sigma$-finite measure $\mu$, then $\|P - Q\|$ is just the $L_1$-distance $\int|p - q|\, d\mu$ between $p$ and $q$. The total variation metric is a strong metric. If $x\in\mathcal{X}$ and $\delta_x$ is the probability degenerate at $x$, then $U_x = \{P: \|P - \delta_x\| < \epsilon\} = \{P: P\{x\} > 1 - \epsilon/2\}$ is a neighborhood of $\delta_x$. Further, if $x\ne x'$ then $U_x\cap U_{x'} = \emptyset$ (for $\epsilon < 1$). Thus, when $\mathcal{X}$ is uncountable, $\{U_x: x\in\mathcal{X}\}$ is an uncountable collection of disjoint open sets, the existence of which renders $M(\mathcal{X})$ nonseparable. Further, no sequence of discrete measures can converge to a continuous measure and vice versa. These properties make the total variation metric uninteresting when considered on all of $M(\mathcal{X})$.

The total variation metric when restricted to sets of the form $L_\mu$ (all probability measures dominated by a $\sigma$-finite measure $\mu$) is extremely useful and interesting. In this context we will refer to the total variation as the $L_1$-metric. It is a standard result that $L_\mu$ with the $L_1$-metric is complete and separable.

Hellinger Metric. This metric was also discussed in Chapter 1. Briefly, the Hellinger distance between $P$ and $Q$ is defined by
$$H(P, Q) = \Big(\int(\sqrt p - \sqrt q)^2\, d\mu\Big)^{1/2}$$
where $p$ and $q$ are densities with respect to $\mu$. The Hellinger metric is equivalent to the $L_1$-metric. Associated with the Hellinger metric is a useful quantity $A(P, Q)$ called affinity, defined as $A(P, Q) = \int\sqrt{pq}\, d\mu$. The relation $H^2(P^n, Q^n) = 2 - 2(A(P, Q))^n$, where $P^n, Q^n$ are $n$-fold product measures, makes the Hellinger metric convenient in the i.i.d. context.
Setwise Convergence. The metrics defined above provide corresponding notions of convergence. Another natural way of saying that $P_n$ converges to $P$ is to require that $P_n(B)\to P(B)$ for all Borel sets $B$. A way of formalizing this topology is as follows. Let $F$ be the class of functions $\{P\mapsto P(B): B\in\mathcal{B}\}$. On $M(\mathcal{X})$ give the smallest topology that makes the functions in $F$ continuous. It is easy to see that under this topology, if $f$ is a bounded measurable function, then $P\mapsto\int f\, dP$ is continuous. Sets of the form $\{P: |P(B_i) - P_0(B_i)| < \epsilon,\ i = 1, \ldots, k;\ B_1, B_2, \ldots, B_k\in\mathcal{B}\}$ give a neighborhood base at $P_0$.

Setwise convergence is an intuitively appealing notion, but it has awkward topological properties that stem from the fact that convergence of $P_n(B)$ to $P(B)$ for sets in an algebra does not ensure the convergence for all Borel sets. We summarize some additional facts as a proposition.

Proposition 2.2.1. Under setwise convergence:

(i) $M(\mathcal{X})$ is not separable;

(ii) if $P_0$ is a continuous measure then $P_0$ does not have a countable neighborhood base, and hence the topology of setwise convergence is not metrizable.

Proof. (i) $U_x = \{P: P\{x\} > 1 - \epsilon\}$ is a neighborhood of $\delta_x$, and these, as $x$ varies, form an uncountable collection of disjoint open sets.

(ii) Suppose that there is a countable base for the neighborhoods at $P_0$. Let $\mathcal{B}_0$ be a countable family of sets such that sets of the type
$$U = \{P: |P(B_i) - P_0(B_i)| < \epsilon,\ i = 1, \ldots, k;\ B_1, B_2, \ldots, B_k\in\mathcal{B}_0\}$$
form a neighborhood base at $P_0$. It then follows that $P_n(B)\to P_0(B)$ for all Borel sets $B$ iff $P_n(B)\to P_0(B)$ for all sets in $\mathcal{B}_0$.

Let $\mathcal{B}_n = \sigma(B_1, B_2, \ldots, B_n)$, where $B_1, B_2, \ldots$ is an enumeration of $\mathcal{B}_0$. Denote by $B_{n1}, B_{n2}, \ldots, B_{nk(n)}$ the atoms of $\mathcal{B}_n$. Define $P_n$ to be the discrete measure that gives mass $P_0(B_{ni})$ to $x_{ni}$, where $x_{ni}$ is a point in $B_{ni}$. Clearly $P_n(B_{mj})\to P_0(B_{mj})$ for all $m, j$. On the other hand, $P_n(\cup_{i,m}\{x_{mi}\}) = 1$ for all $n$, but $P_0(\cup_{i,m}\{x_{mi}\}) = 0$.

These shortcomings persist even when we restrict attention to subsets of $M(\mathcal{X})$ of the form $L_\mu$.
Supremum Metric. When $\mathcal{X}$ is $\mathbb{R}$, the Glivenko-Cantelli theorem on convergence of the empirical distribution suggests another useful metric, which we call the supremum metric. This metric is defined by
$$d_K(P, Q) = \sup_t\big|P(-\infty, t] - Q(-\infty, t]\big|$$
Under this metric $M(\mathcal{X})$ is complete but not separable.

Weak Convergence. In many ways weak convergence is the most natural and useful topology on $M(\mathcal{X})$. Say that $P_n\to P$ weakly, or $P_n\xrightarrow{\text{weakly}}P$, if
$$\int f\, dP_n\to\int f\, dP$$
for all bounded continuous functions $f$ on $\mathcal{X}$. For any $P_0$ a neighborhood base consists of sets of the form $\cap_1^k\{P: |\int f_i\, dP_0 - \int f_i\, dP| < \epsilon\}$, where $f_i$, $i = 1, 2, \ldots, k$, are bounded continuous functions on $\mathcal{X}$. One of the things that makes the weak topology so convenient is that under weak convergence $M(\mathcal{X})$ is a complete separable metric space.

The main results that we need with regard to weak convergence are the Portmanteau theorem and Prohorov's theorem given in Chapter 1.

Because $M(\mathcal{X})$ is a complete separable metric space under weak convergence, we define the Borel $\sigma$-algebra $\mathcal{B}_M$ on $M(\mathcal{X})$ to be the smallest $\sigma$-algebra generated by all weakly open sets, equivalently all weakly closed sets. This $\sigma$-algebra has a more convenient description as the smallest $\sigma$-algebra that makes the functions $\{P\mapsto P(B): B\in\mathcal{B}\}$ measurable. Let $\mathcal{B}_0$ be the $\sigma$-algebra generated by all weakly open sets. Consider all $B$ such that $P\mapsto P(B)$ is $\mathcal{B}_0$-measurable. This class contains all closed sets, and from the $\pi$-$\lambda$ theorem (Theorem 1.2.1) it follows easily that the two descriptions of $\mathcal{B}_M$ coincide.

We have discussed two other modes of convergence on $M(\mathcal{X})$: total variation and setwise convergence. It is instructive to pause and investigate the $\sigma$-algebras corresponding to these and their relationship with $\mathcal{B}_M$.

Because these are nonseparable spaces, there is no good acceptable notion of a Borel $\sigma$-algebra. In the case of the total variation metric, the two common $\sigma$-algebras considered are

(i) $\mathcal{B}^o$, the $\sigma$-algebra generated by open sets, and

(ii) $\mathcal{B}^b$, the $\sigma$-algebra generated by open balls.
The σ-algebra B_o generated by open sets is much larger than B_M. To see this, restrict
the σ-algebra to the space of degenerate measures D_X = {δ_x : x ∈ X}. Then each δ_x
is relatively open, and this forces the restriction of B_o to D_X to be the power set.
On the other hand, B_M restricted to D_X is just the inverse image of the Borel σ-algebra on
X under the map δ_x ↦ x.
Because every open ball is in B_M, so is every set in the σ-algebra generated by
these balls. It can be shown that B_b is properly contained in B_M.
Similar statements hold when we consider the σ-algebras for setwise convergence.
The corresponding σ-algebras here would be those generated by open sets and those
generated by basic neighborhoods at a point. A discussion of these different σ-algebras
can be found in [71].
We next discuss measurability issues on M(X). Following are a few elementary
propositions.
Proposition 2.2.2. (i) If B_0 is an algebra generating B, then
σ{P ↦ P(B) : B ∈ B_0} = B_M
(ii) σ{P ↦ ∫ f dP : f bounded measurable} = B_M
Proof. (i) Let B̃ = {B : P ↦ P(B) is B_M measurable}. Then B̃ is a σ-algebra and
contains B_0. The result now follows from Theorem 1.2.1.
(ii) It is enough to show that P ↦ ∫ f dP is B_M measurable. This is immediate for
f simple, and any bounded measurable f is a limit of simple functions.
Proposition 2.2.3. Let f_P(x) be a bounded jointly measurable function of (P, x).
Then P ↦ ∫ f_P(x) dP(x) is B_M measurable.
Proof. Consider
G = {F ⊂ M(X) × X : P ↦ P(F_P) is B_M measurable}
Here F_P is the P-section {x : (P, x) ∈ F} of F. G is a λ-system that contains the
π-class of all sets of the form C × B, C ∈ B_M, B ∈ B, and by Theorem 1.2.1 G is the
product σ-algebra on M(X) × X. This proves the proposition when f_P(x) = I_F(P, x).
The proof is completed by verifying the claim for f_P(x) simple and then passing to limits.
Proposition 2.2.3 can be used to prove the measurability of the set of discrete
probabilities.
Proposition 2.2.4. The set of discrete probabilities is a measurable subset of
M(X).
Proof. If E = {(P, x) : P{x} > 0} is a measurable set, then setting f_P(x) = I_E(P, x),
the set of discrete measures is just {P : ∫ f_P(x) dP = 1} and would be measurable by
Proposition 2.2.3. To see that E = {(P, x) : P{x} > 0} is measurable, we show that
(P, x) ↦ P{x} is jointly measurable in (P, x). Consider the set of all measurable
subsets F of X × X such that (P, x) ↦ P(F_x) is measurable in (P, x). As before,
F_x = {y : (x, y) ∈ F}. This class contains all sets of the form B_1 × B_2, is
a λ-system, and by Theorem 1.2.1 is the Borel σ-algebra on X × X. In particular,
(P, x) ↦ P(F_x) is measurable when F = {(x, x) : x ∈ X} is the diagonal, and
E = {(P, x) : P(F_x) > 0}.
Consider the f_P(x) used in Proposition 2.2.4. Then P is continuous iff ∫ f_P(x) dP = 0.
It follows that the set of continuous measures is a measurable set.
If µ is a σ-finite measure on R, then L_µ is a measurable subset of M(X). To see
this, assume without loss of generality that µ is a probability measure. Let B_n be an
increasing sequence of algebras, each with finitely many atoms, whose union generates B.
Denote the atoms of B_n by B_{n1}, B_{n2}, ..., B_{nk(n)}, and for any probability measure P,
set f_P(x) = lim_n Σ_{i=1}^{k(n)} [P(B_{ni})/µ(B_{ni})] I_{B_{ni}}(x) when the limit exists and 0 otherwise. To complete the
argument note that L_µ = {P : ∫ f_P(x) dµ = 1}; a small numerical illustration of this approximation is given below.
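The following short Python sketch is our own illustration (not part of the original text) of the approximating densities f_P^n just described, under simple assumptions: X = [0, 1), µ = Lebesgue measure, B_n the dyadic algebras, and a P with density 2x.

```python
# A numerical sketch of the approximation f_P^n(x) = sum_i [P(B_ni)/mu(B_ni)] 1_{B_ni}(x)
# on [0,1) with mu = Lebesgue measure and dyadic algebras B_n.  When P << mu,
# f_P^n(x) approaches the density dP/dmu(x) as n grows.
import numpy as np

def f_n(p_cdf, n, x):
    """Value of f_P^n at x for the dyadic partition of [0,1) into 2**n cells."""
    i = min(int(x * 2**n), 2**n - 1)             # index of the cell containing x
    lo, hi = i / 2**n, (i + 1) / 2**n
    return (p_cdf(hi) - p_cdf(lo)) / (hi - lo)   # P(B_ni) / mu(B_ni)

p_cdf = lambda t: t**2                           # a P with density 2x on [0,1)
for n in (2, 5, 10, 15):
    print(n, f_n(p_cdf, n, x=0.3))               # approaches 2 * 0.3 = 0.6
```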
2.3 (Prior) Probability Measures on M(X)
2.3.1 X Finite
Suppose X = {1, 2, ..., k}. In this case M(X) can be identified with the (k−1)-
dimensional probability simplex S_k = {(p_1, p_2, ..., p_k) : 0 ≤ p_i ≤ 1, Σ p_i = 1}. One
way of defining a prior on M(X) is by defining a measure on S_k. Any such measure
defines the joint distribution of {P(A) : A ⊂ X}, because for any A, P(A) = Σ_{i∈A} p_i,
where p_k = 1 − Σ_1^{k−1} p_i.
An example of a prior distribution on S_k is the uniform distribution—the normal-
ized Lebesgue measure on {(p_1, p_2, ..., p_{k−1}) : 0 ≤ p_i ≤ 1, Σ p_i ≤ 1}. Another example
is the Dirichlet density, which is given by
Π(p_1, p_2, ..., p_{k−1}) = [Γ(Σ_1^k α_i) / (Γ(α_1) ⋯ Γ(α_k))] p_1^{α_1−1} p_2^{α_2−1} ⋯ p_{k−1}^{α_{k−1}−1} (1 − Σ_1^{k−1} p_i)^{α_k−1}
where α_1, α_2, ..., α_k are positive real numbers. This density will be studied in greater
detail later.
A different parametrization of M(X) yields another method of constructing a prior
on M(X). Assume for ease of exposition that X contains 2^k elements {x_1, x_2, ..., x_{2^k}}.
Let
B_0 = {x_1, x_2, ..., x_{2^{k−1}}} and B_1 = {x_{2^{k−1}+1}, x_{2^{k−1}+2}, ..., x_{2^k}}
be a partition of X into two sets. Let B_00, B_01 be a partition of B_0 into two halves
and B_10, B_11 be a similar partition of B_1. Proceeding this way we obtain partitions
B_{ε_1ε_2...ε_i0}, B_{ε_1ε_2...ε_i1} of B_{ε_1ε_2...ε_i}, where each ε_j is 0 or 1 and i < k. Clearly, this partitioning
stops at i = k.
We next note that the partitions can be used to identify X with E_k = {0,1}^k.
Any x ∈ X corresponds to a sequence ε_1(x)ε_2(x)...ε_k(x), where ε_i(x) = 0 if x is in
B_{ε_1(x)ε_2(x)...ε_{i−1}(x)0} and 1 if x is in B_{ε_1(x)ε_2(x)...ε_{i−1}(x)1}. Conversely, any sequence ε_1ε_2...ε_k
corresponds to the point ∩_{i=1}^k B_{ε_1ε_2...ε_i}. Thus there is a correspondence—depending on
the partition—between the set M(X) of probability measures on X and the set
M(E_k) of probability measures on E_k.
Any probability measure on E_k is determined by quantities of the form
y_{ε_1ε_2...ε_i} = P(ε_{i+1} = 0 | ε_1, ε_2, ..., ε_i)
Specifically, let E*_k be the set of all sequences of 0s and 1s of length less than k, including
the empty sequence ∅. If 0 ≤ y_ε ≤ 1 is given for all ε ∈ E*_k, then a probability
on E_k is defined by
P(ε_1ε_2...ε_k) = ∏_{i: ε_i=0} y_{ε_1ε_2...ε_{i−1}} ∏_{i: ε_i=1} (1 − y_{ε_1ε_2...ε_{i−1}})
where i = 1 corresponds to the empty sequence ∅. Hence construction of a prior on
E_k amounts to a specification of the joint distribution of {y_ε : ε ∈ E*_k}.
A little reflection will show that all we have done is to reparametrize a probability
P on X by
P(B_0), P(B_00|B_0), P(B_10|B_1), ..., P(B_{ε_1ε_2...ε_{k−1}0}|B_{ε_1ε_2...ε_{k−1}})
Of interest to us is the case where the Y_εs, equivalently the P(B_{ε0}|B_ε)s, are all indepen-
dent. The case when these are independent beta random variables—the Polya tree
processes—will be studied in Chapter 3. A small sampling sketch of this reparametrization is given below.
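The following Python sketch is an illustration of ours rather than a construction from the book: it samples a random probability on a finite set with 2^k points by drawing the conditional split probabilities P(B_{ε0}|B_ε) as independent Beta variables; the Beta(a, a) parameters are an arbitrary illustrative choice.

```python
# Sampling a random pmf on 2**k points via independent Beta splits along the
# binary partition tree described above (illustrative Beta(a, a) splits).
import numpy as np

rng = np.random.default_rng(0)

def sample_random_pmf(k, a=1.0):
    """Return a random pmf on 2**k points built from independent Beta splits."""
    masses = np.array([1.0])                    # mass of the single cell at level 0
    for _ in range(k):
        y = rng.beta(a, a, size=masses.size)    # Y_e = P(B_{e0} | B_e), one per current cell
        left, right = masses * y, masses * (1.0 - y)
        masses = np.column_stack([left, right]).ravel()   # children in binary order
    return masses

p = sample_random_pmf(k=3)
print(p, p.sum())    # a random point of the simplex S_8; the coordinates sum to 1
```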
Yet another method of obtaining a prior distribution on M(X) is via De Finetti's
theorem. De Finetti's theorem plays a fundamental role in Bayesian inference, and
we refer the reader to [144] for an extensive discussion.
Let X_1, X_2, ..., X_n be X-valued random variables. X_1, X_2, ..., X_n are said to be ex-
changeable if X_1, X_2, ..., X_n and X_{π(1)}, X_{π(2)}, ..., X_{π(n)} have the same distribution for
every permutation π of {1, 2, ..., n}. A sequence X_1, X_2, ... is said to be exchangeable
if X_1, X_2, ..., X_n is exchangeable for every n.
Theorem 2.3.1. [De Finetti] A sequence of X-valued random variables is ex-
changeable iff there is a unique measure Π on M(X) such that for all n,
∫_{M(X)} ∏_{i=1}^n p(x_i) dΠ(p) = Pr{X_1 = x_1, X_2 = x_2, ..., X_n = x_n}
In general it is not easy to construct Π from the distribution of the X_is. Typically,
we will have a natural candidate for Π. By uniqueness, it is enough to verify the
preceding equation. On the other hand, given Π, the behavior of X_1, X_2, ... often
gives insight into the structure of Π.
As an example, let X = {x_1, x_2, ..., x_k}. Let α_1, α_2, ..., α_k be positive integers and let
ᾱ(i) = α_i/Σ_j α_j. Consider the following urn scheme: suppose a box contains balls of
k colors, with α_i balls of color i. Choose a ball at random, so that P(X_1 = i) = ᾱ(i).
Replace the ball and add one more of the same color. Clearly, P(X_2 = j | X_1 = i) =
(α_j + δ_i(j))/(Σ_l α_l + 1), where δ_i(j) = 1 if i = j and 0 otherwise. Repeat this process
to obtain X_3, X_4, .... Then, as illustrated by the simulation sketch below,
(i) X_1, X_2, ... are exchangeable; and
(ii) the prior Π for this case is the Dirichlet density on S_k.
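A small simulation sketch of this urn scheme (our own illustration, with arbitrary initial counts): the draws are exchangeable, and the limiting color frequencies of a single run are one realization from the Dirichlet mixing measure.

```python
# Polya urn simulation: draw balls with replacement, adding one ball of the
# drawn color each time; the long-run color frequencies are Dirichlet distributed.
import numpy as np

rng = np.random.default_rng(1)

def polya_urn(alpha, n):
    """Generate X_1, ..., X_n from the urn with initial counts alpha."""
    counts = np.asarray(alpha, dtype=float).copy()
    draws = []
    for _ in range(n):
        x = rng.choice(len(counts), p=counts / counts.sum())
        draws.append(x)
        counts[x] += 1.0          # replace and add one more ball of the same color
    return np.array(draws)

alpha = [2.0, 1.0, 1.0]
x = polya_urn(alpha, n=5000)
print(np.bincount(x, minlength=3) / len(x))   # one realization of the limiting frequencies
```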
2.3.2 X=R
We next turn to construction of measures on M(X). Because the elements of M(X)
are functions on B, M(X) can be viewed as a subset of [0,1]^B, where the product
space [0,1]^B is equipped with the canonical product σ-algebra, which makes all the
coordinate functions measurable. Note that the restriction of the product σ-algebra
to M(X) is just B_M. A natural attempt to construct measures on M(X) would be to
use Kolmogorov's consistency theorem to construct a probability measure on [0,1]^B,
which could then be restricted to M(X). However M(X) is not measurable as a
subset of [0,1]^B, and that makes this approach somewhat inconvenient. To see that
M(X) is not measurable, note that singletons are measurable subsets of M(X) but
not so in the product space.
When X=R, distribution functions turn out to be a useful crutch to construct
priors on M(R). To elaborate:
(i) Let Q be a dense subset of R and let F_Q be the set of all real-valued functions F on Q such
that
(a) F is right-continuous on Q,
(b) F is nondecreasing, and
(c) lim_{t→∞} F(t) = 1, lim_{t→−∞} F(t) = 0.
(ii) Let F be the set of all real-valued functions F on R such that
(a) F is right-continuous on R,
(b) F is nondecreasing, and
(c) lim_{t→∞} F(t) = 1, lim_{t→−∞} F(t) = 0.
(iii) M(R) = {P : P is a probability measure on R}.
There is a natural 1-1 correspondence between these three sets: let φ_1 : M(R) →
F be the function that takes a probability measure P to its distribution function
F_P(t) = P(−∞, t], and let φ_2 : F → F_Q be the function that maps a distribution
function to its restriction to Q. These maps are 1-1, onto, and bi-measurable. Thus
any probability measure on F_Q can be transferred to a probability on F and then
to M(R). A prior on F_Q only involves the distributions of
(F(t_1), F(t_2) − F(t_1), ..., F(t_k) − F(t_{k−1}))
for t_is in Q. However, because any F(t) is a limit of F(t_n), t_n ∈ Q, the distributions of
quantities like (F(t_1), F(t_2) − F(t_1), ..., F(t_k) − F(t_{k−1})) for t_i real can be recovered,
at least as limits. On the other hand, since a general Borel set B has no simple
description in terms of intervals, one can assert the existence of a distribution for
P(B) that is compatible with the prior on F_Q, but it may not be possible to arrive
at anything resembling an explicit description of this distribution.
It is convenient to use the notation L(·|Π) to stand for the distribution or law of a
quantity under the distribution Π.
Theorem 2.3.2. Let Q be a countable dense subset of R. Suppose for every k and
every collection t_1 < t_2 < ... < t_k with {t_1, t_2, ..., t_k} ⊂ Q, Π_{t_1,t_2,...,t_k} is a probability
measure on [0,1]^k which is a specification of a distribution of (F(t_1), F(t_2), ..., F(t_k))
such that
(i) if {t_1, t_2, ..., t_k} ⊂ {s_1, s_2, ..., s_l} then the marginal distribution on (t_1, t_2, ..., t_k)
obtained from Π_{s_1,s_2,...,s_l} is Π_{t_1,t_2,...,t_k};
(ii) if t_1 < t_2 then Π_{t_1,t_2}{F(t_1) ≤ F(t_2)} = 1;
(iii) if (t_{1n}, t_{2n}, ..., t_{kn}) ↓ (t_1, t_2, ..., t_k) then Π_{t_{1n},t_{2n},...,t_{kn}} converges in distribution
to Π_{t_1,t_2,...,t_k}; and
(iv) if t_n ↓ −∞ then Π_{t_n} converges in distribution to the point mass at 0, and if t_n ↑ ∞ then Π_{t_n}
converges in distribution to the point mass at 1;
then there exists a probability measure Π on M(R) such that for every t_1 < t_2 < ... <
t_k with {t_1, t_2, ..., t_k} ⊂ Q,
L((F(t_1), F(t_2), ..., F(t_k)) | Π) = Π_{t_1,t_2,...,t_k}.
Proof. By the Kolmogorov consistency theorem, (i) ensures the existence of a proba-
bility measure Π on [0,1]^Q with the Π_{t_1,t_2,...,t_k} as marginals. We will argue that Π(F_Q) = 1.
Let F_1 = ∩_{t_i<t_j} {F ∈ [0,1]^Q : F(t_i) ≤ F(t_j)}. Because Q is countable, by (ii),
Π(F_1) = 1.
Next, fix t ∈ Q and a sequence t_n in Q decreasing to t. On F_1, F(t_n) as a function of F
is decreasing in n and hence has a limit. If F^+(t) = lim_n F(t_n) then F^+(t) ≥ F(t), and
by assumption (iii) E_Π F^+(t) = E_Π F(t), so that F^+(t) = F(t) a.e. Consequently
Π{F ∈ F_1 : F is right-continuous at t} = 1
and the countability of Q yields
Π{F : F is monotone and F is right-continuous at all t ∈ Q} = 1
A similar argument shows that with Π probability 1, for F in F_1, lim_{t→∞} F(t) =
1 and lim_{t→−∞} F(t) = 0. This shows that Π(F_Q) = 1.
Thus we have established the existence of a probability measure on F_Q. Using the
discussion preceding the theorem this prior can be lifted to all of M(R).
The assumptions of Theorem 2.3.2 require specification of finite-dimensional dis-
tributions only for t_is in Q, and the conclusion also involves only the finite-dimensional
distributions for t_is in Q. It is easy to see that if one starts with Π_{t_1,t_2,...,t_k} with t_is
real and satisfying the conditions of Theorem 2.3.2, then one would get a Π for which
the marginals are Π_{t_1,t_2,...,t_k} for t_is real.
A convenient way of specifying the distribution of (F(t_1), F(t_2), ..., F(t_k)) for t_1 <
t_2 < ... < t_k is by specifying the distribution, say Π'_{t_1,t_2,...,t_k}, of
(F(t_1), F(t_2) − F(t_1), ..., F(t_k) − F(t_{k−1}))
The convenience arises from the fact that (−∞, t_1], (t_1, t_2], ..., (t_k, ∞) can be thought
of as k + 1 cells and (p_1, p_2, ..., p_{k+1}) as the corresponding multinomial probabili-
ties. Note that Π'_{t_1,t_2,...,t_k} is a probability measure on S_k = {(p_1, p_2, ..., p_k) : p_i ≥
0, Σ_1^k p_i ≤ 1}. If the specifications of the collection Π'_{t_1,t_2,...,t_k} satisfy assumptions
(ii), (iii), and (iv) of Theorem 2.3.2, then so would the collection Π_{t_1,t_2,...,t_k} = L((p_1, p_1 +
p_2, ..., Σ_1^k p_i) | Π'_{t_1,t_2,...,t_k}). These observations give the following easy variant of Theo-
rem 2.3.2.
Theorem 2.3.3. Suppose that for every k and every collection t_1 < t_2 < ... < t_k
with {t_1, t_2, ..., t_k} ⊂ R, Π'_{t_1,t_2,...,t_k} is a probability measure on S_k = {(p_1, p_2, ..., p_k) :
p_i ≥ 0, Σ_1^k p_i ≤ 1} such that
(i) if {t_1, t_2, ..., t_k} ⊂ {s_1, s_2, ..., s_l} then the marginal distribution on (t_1, t_2, ..., t_k)
obtained from Π'_{s_1,s_2,...,s_l} is Π'_{t_1,t_2,...,t_k};
(ii) if (t_{1n}, t_{2n}, ..., t_{kn}) ↓ (t_1, t_2, ..., t_k) then Π'_{t_{1n},t_{2n},...,t_{kn}} converges in distribution
to Π'_{t_1,t_2,...,t_k}; and
(iii) if t_n ↓ −∞ then Π'_{t_n} converges in distribution to the point mass at 0, and if t_n ↑ ∞ then Π'_{t_n}
converges in distribution to the point mass at 1;
then there exists a probability measure Π on F (equivalently on M(R)) such that for
every t_1 < t_2 < ... < t_k with {t_1, t_2, ..., t_k} ⊂ R,
L((F(t_1), F(t_2) − F(t_1), ..., F(t_k) − F(t_{k−1})) | Π) = Π'_{t_1,t_2,...,t_k}
Suppose (B_1, B_2, ..., B_k) is a collection of disjoint subsets of R; the next theorem
shows that if the distributions of (P(B_1), P(B_2), ..., P(B_k)) are themselves prescribed
consistently then the prior Π will have the prescribed marginal distribution for
(P(B_1), P(B_2), ..., P(B_k)).
Theorem 2.3.4. Suppose for each collection of disjoint Borel sets (B_1, B_2, ..., B_k)
a distribution Π_{B_1,B_2,...,B_k} is assigned for (P(B_1), P(B_2), ..., P(B_k)) such that
(i) Π_{B_1,B_2,...,B_k} is a probability measure on the k-dimensional probability simplex S_k, and
if A_1, A_2, ..., A_l is another collection of disjoint Borel sets whose elements are
unions of sets from (B_1, B_2, ..., B_k), then
Π_{A_1,A_2,...,A_l} = distribution of (Σ_{B_i⊂A_1} P(B_i), Σ_{B_i⊂A_2} P(B_i), ..., Σ_{B_i⊂A_l} P(B_i));
(ii) if B_n ↓ ∅ then Π_{B_n} → 0 in distribution; and
(iii) P(R) = 1.
Then there exists a probability measure Π on M(R) such that for any collection of
disjoint Borel sets (B_1, B_2, ..., B_k), the marginal distribution of (P(B_1), ..., P(B_k))
under Π is Π_{B_1,B_2,...,B_k}.
Remark 2.3.1. Given Π_{B_1,B_2,...,B_k} as earlier, we can extend the definition to obtain
Π_{A_1,A_2,...,A_m} for any collection (not necessarily disjoint) of Borel sets A_1, A_2, ..., A_m.
Toward this, let B_1 = A_1, B_i = A_i − ∪_{j<i} A_j, and define Π_{A_1,A_2,...,A_m} as the distribution
of (P(B_1), P(B_1) + P(B_2), ..., Σ_1^m P(B_j)) under Π_{B_1,B_2,...,B_m}. The following proof
shows that the marginal distribution under Π of (P(A_1), P(A_2), ..., P(A_k)) for any
collection of Borel sets is Π_{A_1,A_2,...,A_k}.
Proof. As in Theorem 2.3.3, start with partitions of the form B_i = (t_{i−1}, t_i] for
i = 1, 2, ..., k, and let Π be the measure obtained on F. Let φ_2 be the map from F to
M(R) defined by φ_2(F) = P_F, where P_F is the probability measure corresponding to
F. It is easy to see that this map is 1-1 and measurable. We will continue to denote
by Π the induced measure on M(R).
Π by construction sits on M(R). What we then need to show is that the marginal
distribution of (P(B_1), P(B_2), ..., P(B_k)) under Π is Π_{B_1,B_2,...,B_k}.
Step 1. Assumption (ii) implies that if (B_{1n}, B_{2n}, ..., B_{kn}) ↓ (B_1, B_2, ..., B_k) then
(P(B_{1n}), P(B_{2n}), ..., P(B_{kn})) → (P(B_1), P(B_2), ..., P(B_k)) in distribution.
To see this, write
(P(B_{1n}), P(B_{2n}), ..., P(B_{kn}))
= (P(B_1) + (P(B_{1n}) − P(B_1)), P(B_2) + (P(B_{2n}) − P(B_2)), ...,
P(B_k) + (P(B_{kn}) − P(B_k)))
and for each i, (B_{in} − B_i) ↓ ∅, so (P(B_{in}) − P(B_i)) goes to 0 in distribution
and hence in probability. As a result, the whole vector
((P(B_{1n}) − P(B_1)), (P(B_{2n}) − P(B_2)), ..., (P(B_{kn}) − P(B_k))) → 0 in probability.
Step 2. Denote by B_0 the algebra generated by intervals of the form (a, b]. For
any B_1, B_2, ..., B_k, let L(P(B_1), P(B_2), ..., P(B_k) | Π) denote the distribution of the
vector (P(B_1), P(B_2), ..., P(B_k)) under Π. Fix k. Let C_i = (a_i, b_i], i = 2, ..., k.
Consider
B̂ = {B_1 : L(P(B_1), P(C_2), ..., P(C_k) | Π) = Π_{B_1,C_2,...,C_k}}
Then B̂ contains all sets of the form (a, b], is closed under disjoint unions of such
sets, and hence contains B_0. In addition, by Step 1 it is a monotone class. So B̂ is
B.
Step 3. Now consider
{B_2 : L(P(B_1), P(B_2), P(C_3), ..., P(C_k) | Π) = Π_{B_1,B_2,C_3,...,C_k}}
From Step 2, this class contains all sets of the form (a, b] and their finite disjoint
unions, and hence contains B_0. Further, it is a monotone class and so is B. Continuing
similarly, it follows that for any Borel sets B_1, B_2, ..., B_k,
L(P(B_1), P(B_2), ..., P(B_k) | Π) = Π_{B_1,B_2,...,B_k}.
Example 2.3.1. Let α be a finite measure on R. For any partition (B_1, B_2, ..., B_k),
let Π_{B_1,B_2,...,B_k} on S_k be the Dirichlet distribution with parameter (α(B_1), α(B_2), ..., α(B_k)). We will show in Chap-
ter 3 that this assignment satisfies the conditions of Theorem 2.3.4; a small numerical sketch appears below.
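The following Python sketch is our own illustration of Example 2.3.1 under an assumed base measure α = c·N(0,1): the marginal of P(B) under the Dirichlet assignment is beta(α(B), α(R) − α(B)), so as a set B_n shrinks to the empty set its mass concentrates at 0, in line with continuity condition (ii) of Theorem 2.3.4.

```python
# Marginal law of P(B_n) for shrinking sets B_n = (0, eps] under the Dirichlet
# assignment of Example 2.3.1 with base measure alpha = c * N(0,1) (illustrative).
import math
import numpy as np

rng = np.random.default_rng(2)
c = 5.0
Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))   # standard normal cdf

for eps in (1.0, 0.1, 0.01):
    a_B = c * (Phi(eps) - Phi(0.0))              # alpha(B_n) for B_n = (0, eps]
    draws = rng.beta(a_B, c - a_B, size=5000)    # marginal of P(B_n): beta(alpha(B_n), c - alpha(B_n))
    print(eps, draws.mean(), np.quantile(draws, 0.95))   # concentrates near 0 as eps shrinks
```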
Remark 2.3.2. Theorem 2.3.4 on constructing a measure Π on F through finite-
dimensional distributions can be viewed from a different angle. Toward this, for each
n divide the interval [−2^n, 2^n] into intervals of length 2^{−n} and let −2^n = t_{n1} <
t_{n2} < ... < t_{nk(n)} = 2^n denote the endpoints of the intervals. These define a partition
of R into k(n) + 1 cells in an obvious way. Any probability (p_1, p_2, ..., p_{k(n)+1}) on
these k(n) + 1 cells corresponds to a distribution function on R which is constant on
each interval, and thus any probability Π_{t_{n1},t_{n2},...,t_{nk(n)}} on S_{k(n)+1} defines a probability
measure µ_n on F_n, the set of all distribution functions that are constant on the intervals
(t_{ni}, t_{n,i+1}]. The consistency assumption on Π_{t_{n1},t_{n2},...,t_{nk(n)}} shows that the marginal
distribution on F_n obtained from µ_{n+1} is just µ_n. Now it can be shown that
1. {µ_n}_{n≥1} is tight as a sequence of probability measures on F. To see this, let
ε_i ↓ 0 and let K_i be a sequence of compact subsets of R. Then
{P : P(K_i) ≥ 1 − ε_i for all i}
is a compact subset of M(R). What is needed to show tightness is that given δ,
there is a set of the form given earlier with µ_n measure greater than 1 − δ for all n.
Use assumptions (i) and (iii) of Theorem 2.3.4 and show that for each i one can
get an n_i such that for all n, µ_n{F : F(−2^{n_i}) > ε_i or 1 − F(2^{n_i}) > ε_i} ≤ δ/2^i;
2. {µ_n} converges to a measure Π; and
3. Π satisfies the conclusions of Theorem 2.3.4.
2.3.3 Tail Free Priors
When X is finite, we have seen that by partitioning X into
{B_0, B_1}, {B_00, B_01, B_10, B_11}, ...
and reparametrizing a probability by P(B_0), P(B_00 | B_0), ..., we can identify probability measures
on X with probability measures on E_k—the set of sequences of 0s and 1s of length k. Tail free priors arise
when these conditional probabilities are independent. In this section we extend this
method to the case X = R.
Let E be the set of all infinite sequences of 0s and 1s, i.e., E = {0,1}^N. Denote by E_k all
sequences ε_1, ε_2, ..., ε_k of 0s and 1s of length k, and let E* = ∪_k E_k be the set of all sequences
of 0s and 1s of finite length. We will denote elements of E* by ε.
Start with a partition
T_0 = {B_0, B_1}
of X into two sets. Let
T_1 = {B_00, B_01, B_10, B_11}
where B_00, B_01 is a partition of B_0 and B_10, B_11 is a partition of B_1. Proceeding this
way, let T_n be a partition consisting of sets of the form B_ε, where ε ∈ E_n, and further
B_{ε0}, B_{ε1} is a partition of B_ε.
We assume that we are given a sequence of partitions T = {T_n}_{n≥1} constructed as
in the last paragraph such that the sets {B_ε : ε ∈ E*} generate the Borel σ-algebra.
Definition 2.3.1. A prior Π on M(R) is said to be tail free with respect to
T = {T_n}_{n≥1} if the rows in
{P(B_0)}
{P(B_00|B_0), P(B_10|B_1)}
{P(B_000|B_00), P(B_010|B_01), P(B_100|B_10), P(B_110|B_11)}
·········
are independent.
To turn to the construction of tail free priors on M(R), start with a dense set of
numbers Q, like the binary rationals in (0,1), write it as Q = {a_ε : ε ∈ E*},
and construct the following sequence of partitions of R:
T_0 = {B_0, B_1} is a partition of R into two intervals, say
B_0 = (−∞, a_0], B_1 = (a_0, ∞)
Let T_1 = {B_00, B_01, B_10, B_11}, where
B_00 = (−∞, a_{00}], B_01 = (a_{00}, a_0]
and
B_10 = (a_0, a_{10}], B_11 = (a_{10}, ∞)
Proceeding this way, let T_n be a partition consisting of sets of the form B_{ε_1ε_2...ε_n},
where ε_1, ε_2, ..., ε_n are 0 or 1, and further B_{ε_1ε_2...ε_n0}, B_{ε_1ε_2...ε_n1} is a partition of
B_{ε_1ε_2...ε_n}.
The assumption that Q is dense is equivalent to the statement that the sequence
of partitions T = {T_n}_{n≥1} constructed as in the last paragraph is such that the sets
{B_ε : ε ∈ E*} generate the Borel σ-algebra.
For each ε ∈ E*, let Y_ε be a random variable taking values in [0,1]. If we set
Y_ε = P(B_{ε0}|B_ε), then for each k, {Y_ε : ε ∈ ∪_{i≤k} E_i} define a joint distribution for
{P(B_ε) : ε ∈ E_k}. By construction, these are consistent. In order for these to define a
prior on M(R) we need to ensure that the continuity condition (iii) of Theorem 2.3.2
holds.
Theorem 2.3.5. If Y_ε = P(B_{ε0}|B_ε), where {Y_ε : ε ∈ E*} is a family of [0,1]-valued
random variables such that
(i)
Y_∅ ⊥ {Y_0, Y_1} ⊥ {Y_00, Y_01, Y_10, Y_11} ⊥ ...
(ii) for each ε ∈ E*,
Y_{ε0} Y_{ε00} Y_{ε000} ⋯ = 0 and Y_{ε1} Y_{ε11} ⋯ = 0    (2.1)
then there exists a tail free prior Π on M(R) (with respect to the partition under
consideration) such that Y_ε = P(B_{ε0}|B_ε).
Proof. As noted earlier we need to verify the continuity condition (iii) of Theorem 2.3.2. In the
current situation it amounts to showing that if ε_0 = ε^0_1 ε^0_2 ... ε^0_k and, as n → ∞, a_n
decreases to a_{ε_0}, then the distribution of F(a_n) converges to that of F(a_{ε_0}). Because any
sequence of a_εs decreasing to a_{ε_0} is a subsequence of a_{ε_01}, a_{ε_010}, a_{ε_0100}, ···,
F(a_{ε_010...0}) = F(a_{ε_0}) + P(B_{ε_010...0})
and
P(B_{ε_010...0}) = P(B_{ε_0})(1 − Y_{ε_0}) Y_{ε_01} Y_{ε_010} ⋯
the result follows from (ii).
These discussions can be usefully and elegantly viewed by identifying R with the
space of sequences of 0s and 1s.
As before, let E be {0,1}^N. Any probability on E gives rise to the collection of
numbers {y_ε : ε ∈ E*}, where y_{ε_1ε_2...ε_n} = P(ε_{n+1} = 0 | ε_1ε_2...ε_n). Conversely, setting
y_{ε_1ε_2...ε_n} = P(ε_{n+1} = 0 | ε_1ε_2...ε_n), any set of numbers {y_ε : ε ∈ E*} with 0 ≤ y_ε ≤ 1
determines a probability on E. In other words,
P(ε_1ε_2...ε_k) = ∏_{i: ε_i=0} y_{ε_1ε_2...ε_{i−1}} ∏_{i: ε_i=1} (1 − y_{ε_1ε_2...ε_{i−1}})    (2.2)
Hence, to define a prior on M(E), we need to specify a joint distribution for {Y_ε :
ε ∈ E*}, where each Y_ε is between 0 and 1.
As in the finite case, we want to use the partitions T = {T_n}_{n≥1} to identify R
with sequences of 0s and 1s. Let x ∈ R. φ(x) is the function that sends x to the
sequence ε_1(x)ε_2(x)... in E, where
ε_1(x) = 0 if x ∈ B_0, ε_1(x) = 1 if x ∈ B_1;
ε_i(x) = 0 if x ∈ B_{ε_1ε_2...ε_{i−1}0}, ε_i(x) = 1 if x ∈ B_{ε_1ε_2...ε_{i−1}1}
Because each T_n is a partition of R, φ defines a function from R into E. φ is 1-
1 and measurable but not onto E. The range of φ will not contain sequences that are
eventually 0. This is another way of saying that with binary expansions we consider
the expansion with 1s in the tails rather than 0s. If D = {ε ∈ E : ε_i = 0 for all i ≥
n for some n} ∪ {ε : ε_i = 1 for all i}, then φ is 1-1 and measurable from R onto D^c ⊂ E.
Further, φ^{−1} is measurable on D^c ⊂ E. Thus, as before, the set of probability measures
M(R) can be identified with M_0(E)—the set of probability measures on E that give
mass 0 to D. This reduces the task of defining a prior on M(R) to one of defining a
prior on M_0(E).
The condition P(D) = 0 translates to
y_{ε0} y_{ε00} ⋯ = 0 for all ε ∈ E*, and y_1 y_{11} ⋯ = 0    (2.3)
As before, defining a prior on M(R), equivalently on M_0(E), amounts to defining
{Y_ε : ε ∈ E*} such that (2.3) is satisfied almost surely. Satisfying (2.3) almost surely
corresponds to condition (ii) in Theorem 2.3.5.
A useful way to specify a prior on M(E) is by having the Y_εs for ε of different lengths
be mutually independent, which yields tail free priors. In Chapter 3 we return to this
construction to develop Polya tree priors.
Tail free priors are conjugate in the sense that if the prior is tail free, then so is the
posterior. To avoid getting lost in a notational mess we first state an easy lemma.
Lemma 2.3.1. Let ξ_1, ξ_2, ..., ξ_k be independent random vectors (not necessarily
of the same dimension) with joint distribution µ = ⊗_1^k µ_i. Let J be a subset of
{1, 2, ..., k} and let µ* be the probability with
dµ* = C ∏_{j∈J} ξ_j dµ
Then ξ_1, ξ_2, ..., ξ_k are independent under µ*.
Proof. Clearly C = ∏_{j∈J} [∫ ξ_j dµ_j]^{−1}. Further,
µ*(ξ_i ∈ B_i : 1 ≤ i ≤ k) = ∫_{(ξ_i∈B_i: 1≤i≤k)} C ∏_{j∈J} ξ_j dµ
= ∏_{i∉J} µ_i(B_i) · ∏_{j∈J} [∫_{B_j} ξ_j dµ_j / ∫ ξ_j dµ_j]
Theorem 2.3.6. Suppose Π is a tail free prior on M(R) with respect to the sequence
of partitions {T_k}_{k≥1}. Given P, let X_1, X_2, ..., X_n be i.i.d. P; then the posterior is
also tail free with respect to {T_k}_{k≥1}.
Proof. We will prove the result for n = 1; the general case follows by iteration.
Consider the posterior distribution given T_k. Because {B_ε : ε ∈ E_k} are the atoms of
T_k, it is enough to find the posterior distribution given X ∈ B_ε for each ε ∈ E_k.
Let ε = ε_1ε_2...ε_k. Then the likelihood of P(B_ε) is
∏_{j=1}^k P(B_{ε_1ε_2...ε_j} | B_{ε_1ε_2...ε_{j−1}})
so that the posterior density of {P(B_{ε'0} | B_{ε'})} with respect to Π is
C ∏_{i: ε_i=0} P(B_{ε_1...ε_{i−1}0} | B_{ε_1...ε_{i−1}}) ∏_{i: ε_i=1} (1 − P(B_{ε_1...ε_{i−1}0} | B_{ε_1...ε_{i−1}}))
From Lemma 2.3.1,
{P(B_{ε1}|B_ε) : ε ∈ E_1}, {P(B_{ε1}|B_ε) : ε ∈ E_2}, ..., {P(B_{ε1}|B_ε) : ε ∈ E_{k−1}}
are independent under the posterior.
In particular, if m < k, independence holds for
{P(B_{ε1}|B_ε) : ε ∈ E_1}, {P(B_{ε1}|B_ε) : ε ∈ E_2}, ..., {P(B_{ε1}|B_ε) : ε ∈ E_{m−1}}.
Letting k → ∞, an application of the martingale convergence theorem gives the
conclusion for the posterior given X_1.
In this section we have discussed two general methods of constructing priors on
M(R). There are several other techniques for obtaining nonparametric priors. There
are priors that arise from stochastic processes: if f is the sample path of a stochastic
process then f̂ = k(f)^{−1} e^f yields a random density when k(f) = ∫ e^{f(x)} dx is finite. We
will study a method of this kind in the context of density estimation. Or one can
look at expansions of a density using some orthogonal basis and put a prior on the
coefficients. A class of priors called neutral to the right priors, somewhat like tail free
priors, will be studied in Chapter 10 on survival analysis.
2.4 Tail Free Priors and 0-1 Laws
Suppose Π is a prior on M(R) and {B_ε : ε ∈ E*} is a set of partitions as described
in the last section. To repeat, for each n, T_n = {B_ε : ε ∈ E_n} is a partition of R and
B_{ε0}, B_{ε1} is a partition of B_ε. Further, B = σ{B_ε : ε ∈ E*}. Unlike the last section, it
is not required that the B_ε be intervals. The choice of intervals as sets in the partition
played a crucial role in the construction of a probability measure on M(R). Given a
probability measure on M(R), the following notions are meaningful even if the B_ε
are not intervals.
For notational convenience, as before, denote by Y_ε = P(B_{ε0}|B_ε). Formally, Y_ε is a
random variable defined on M(R) with Y_ε(P) = P(B_{ε0}|B_ε). Recall that Π is said to
be tail free with respect to the partitions T = {T_n}_{n≥1} if
Y_∅ ⊥ {Y_0, Y_1} ⊥ {Y_00, Y_01, Y_10, Y_11} ⊥ ...
Theorem 2.4.1. Let λ be any finite measure on R, with λ(B_ε) > 0 for all ε. If
0 < Y_ε < 1 for all ε then
Π{P : P << λ} = 0 or 1
Proof. Assume without loss of generality that λ is a probability measure.
Let Z_0 = Y_∅, Z_1 = {Y_0, Y_1}, Z_2 = {Y_00, Y_01, Y_10, Y_11}, .... By assumption, Z_1, Z_2, ...
are independent random vectors. The basic idea of the proof is to show that L(λ) =
{P : P << λ} is a tail set with respect to the Z_is. The Kolmogorov 0-1 law
then yields the conclusion. In the next two lemmas it is shown that for each n, L(λ)
depends only on Z_n, Z_{n+1}, ... and is hence a tail set.
Lemma 2.4.1. When P(B_ε) > 0, define P(·|B_ε) to be the probability P(A|B_ε) =
P(A ∩ B_ε)/P(B_ε). Define λ(·|B_ε) similarly. Fix n; then
L(λ) = {P : P(·|B_ε) << λ(·|B_ε) for all ε ∈ E_n such that P(B_ε) > 0}
Proof. Because
P(A) = Σ_{ε∈E_n} P(A|B_ε) P(B_ε) and λ(A) = Σ_{ε∈E_n} λ(A|B_ε) λ(B_ε)
the result follows immediately.
Lemma 2.4.2. Let Y = {Y_ε(P) : ε ∈ E*, P ∈ M(R)}. The elements y of Y are
thus a collection of conditional probabilities arising from a probability. Conversely,
any element y of Y gives rise to a probability, which we denote by P_y. Then for each
ε ∈ E_n, for all A ∈ B, and for every y in Y,
P_y(A|B_ε) depends only on Z_n, Z_{n+1}, ...
Proof. Let
B_0 = {A : for all y, P_y(A|B_ε) depends only on Z_n, Z_{n+1}, ...}
Because 0 < Y_ε < 1 for all ε ∈ E*, P_y(B_ε) > 0 for all ε ∈ E*. Hence B_0 contains the
algebra of finite disjoint unions of elements in {B_ε : ε ∈ ∪_{m>n} E_m} and is a monotone
class. Hence B_0 = B.
Remark 2.4.1. Let Π be tail free with respect to T = {T_n}_{n≥1} such that 0 < Y_ε <
1 for all ε ∈ E*. Argue that P is discrete iff P(·|B_ε) is discrete for all ε ∈ E_n. Now
use the Kolmogorov 0-1 law to conclude that Π{P : P is discrete} = 0 or 1.
The next theorem, due to Kraft, is useful in constructing priors concentrated on
sets like L(λ).
Let Π, {B_ε : ε ∈ E*}, {Y_ε : ε ∈ E*} be as in Theorem 2.4.1, and, as before,
given any realization y = {y_ε : ε ∈ E*}, let P_y denote the corresponding probability
measure on R.
Theorem 2.4.2. Let λ be a probability measure on R such that λ(B_ε) > 0 for all
ε ∈ E*. Suppose
f^n_y(x) = Σ_{ε∈E_n} [P_y(B_ε)/λ(B_ε)] I_{B_ε}(x)
= Σ_{ε∈E_n} [∏_{i:ε_i=0} y_{ε_1...ε_{i−1}} ∏_{i:ε_i=1} (1 − y_{ε_1...ε_{i−1}}) / λ(B_ε)] I_{B_ε}(x)
If sup_n E_Π [f^n_y(x)]^2 ≤ K for all x, then Π{P : P << λ} = 1.
Proof. For each y ∈ Y, by the martingale convergence theorem f^n_y converges almost
surely [λ] to a function f_y. Consider the measure Π × λ, which is the joint distribution
of y and x, on Y × R.
Because for each y, f^n_y → f_y a.s. [λ], we have f^n_y → f_y a.s. [Π × λ]. Further, under
our assumption {f^n_y(x) : n ≥ 1} is uniformly integrable with respect to Π × λ and
hence E_{Π×λ} |f^n_y(x) − f_y(x)| → 0. Now for each y, by Fatou's lemma, E_λ f_y ≤ 1.
On the other hand, E_{Π×λ} f^n_y(x) = 1 for all n, and by the L_1-convergence mentioned
earlier, E_{Π×λ} f_y(x) = 1. Thus E_λ f_y = 1 a.e. [Π] and this shows Π{L(λ)} = 1.
The next theorem is an application of the last theorem. It shows how, given a
probability measure λ, by suitably choosing both the partitions and the parameters
of the Y_εs, we can obtain a prior that concentrates on L(λ). A numerical sketch follows the proof.
Theorem 2.4.3. Let λ be a continuous probability distribution on R. Denote by
F the distribution function of λ and construct a partition as follows:
B_0 = F^{−1}(0, 1/2], B_1 = F^{−1}(1/2, 1]
B_00 = F^{−1}(0, 1/4], B_01 = F^{−1}(1/4, 1/2], B_10 = F^{−1}(1/2, 3/4], B_11 = F^{−1}(3/4, 1]
and in general
B_{ε_1ε_2...ε_n} = F^{−1}( (Σ_1^n ε_i/2^i, Σ_1^n ε_i/2^i + 1/2^n] )
Suppose E(Y_ε) = 1/2 for all ε ∈ E* and sup_{ε∈E_n} V(Y_ε) ≤ b_n, with Σ b_n < ∞. Then
the resulting prior satisfies Π(L(λ)) = 1.
Proof. λ(B_ε) > 0, because λ(B_{ε0}|B_ε) = 1/2 for all B_ε. Fix x. If x ∈ B_{ε_1ε_2...ε_n}, then
f^n_Y(x) = ∏_{i=1}^n [Y_{ε_1ε_2...ε_{i−1}}^{1−ε_i} (1 − Y_{ε_1ε_2...ε_{i−1}})^{ε_i} / (1/2)]
and
E[f^n_Y(x)]^2 = ∏_{i=1}^n 4 E[ (Y^2_{ε_1ε_2...ε_{i−1}})^{1−ε_i} ((1 − Y_{ε_1ε_2...ε_{i−1}})^2)^{ε_i} ] ≤ ∏_{i=1}^n 4 a_i
where a_i = max{ E Y^2_{ε_1ε_2...ε_{i−1}}, E(1 − Y_{ε_1ε_2...ε_{i−1}})^2 }. Now
E Y^2_{ε_1ε_2...ε_{i−1}} = V(Y_{ε_1ε_2...ε_{i−1}}) + (1/2)^2 ≤ b_i + 1/4
and
E(1 − Y_{ε_1ε_2...ε_{i−1}})^2 ≤ b_i + 1/4
Thus ∏_1^n 4 a_i ≤ ∏_1^n (1 + 4 b_i) converges, because Σ b_n < ∞.
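The following Python sketch is our own illustration of the construction in Theorem 2.4.3, under the assumed choices λ = N(0,1) and Y_ε ~ Beta(a_m, a_m) at level m with a_m = m^2, so that E(Y_ε) = 1/2 and V(Y_ε) = 1/(4(2m^2 + 1)) =: b_m with Σ b_m < ∞. It tracks the martingale f^n_Y(x) along one realization and shows it settling down, as the theorem predicts.

```python
# One realization of the martingale f^n_Y(x) = P_Y(B_e(x)) / lambda(B_e(x)) for
# the quantile partitions of lambda = N(0,1) and independent Beta(m**2, m**2) splits.
import math
import numpy as np

rng = np.random.default_rng(4)

def f_path(x, n_max):
    """Running values f^1_Y(x), ..., f^{n_max}_Y(x) along one realization Y."""
    u = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))   # F(x) locates the dyadic path
    vals, val = [], 1.0
    for m in range(1, n_max + 1):
        y = rng.beta(m**2, m**2)     # Y at level m: mean 1/2, variance 1/(4(2 m**2 + 1))
        u *= 2.0
        if u < 1.0:                  # x lies in the "0" child: factor Y / (1/2)
            val *= 2.0 * y
        else:                        # "1" child: factor (1 - Y) / (1/2)
            val *= 2.0 * (1.0 - y)
            u -= 1.0
        vals.append(val)
    return vals

print([round(v, 3) for v in f_path(0.5, 20)])   # stabilizes as n grows
```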
2.5 Space of Probability Measures on M(R)
We next turn to a discussion of probability measures on M(R). To get a feeling for
what goes on we begin by asking when are two probability measures Π1and Π2
equal?
Clearly Π_1 = Π_2 if for any finite collection B_1, B_2, ..., B_k of Borel sets,
(P(B_1), P(B_2), ..., P(B_k))
has the same distribution under both Π_1 and Π_2. This is an immediate consequence
of the definition of B_M.
Next suppose that (C_1, C_2, ..., C_k) are Borel sets. Consider all intersections of the
form
C_1^{ε_1} ∩ C_2^{ε_2} ∩ ··· ∩ C_k^{ε_k}
where ε_i = 0, 1, C_i^1 = C_i and C_i^0 = C_i^c. These intersections give rise to a
partition of X, and since every C_i can be written as a union of elements of this
partition, the distribution of (P(C_1), P(C_2), ..., P(C_k)) is determined by the joint
distribution of the probabilities of the elements of this partition. In other words, if the
distributions of (P(B_1), P(B_2), ..., P(B_k)) under Π_1 and Π_2 are the same for every
finite disjoint collection of Borel sets, then Π_1 = Π_2. Following is another useful
proposition.
Proposition 2.5.1. Let B_0 = {B_i : i ∈ I} be a family of sets closed under finite
intersection that generates the Borel σ-algebra B on X. If for every B_1, B_2, ..., B_k
in B_0, (P(B_1), P(B_2), ..., P(B_k)) has the same distribution under Π_1 and Π_2, then
Π_1 = Π_2.
Proof. Let B^0_M = {E ∈ B_M : Π_1(E) = Π_2(E)}. Then B^0_M is a λ-system. For any finite
subset J of I, by our assumption Π_1 and Π_2 coincide on the σ-algebra B^J_M—the
σ-algebra generated by {P(B_j) : j ∈ J}—and hence B^J_M ⊂ B^0_M. Further, the union of the
B^J_M over all finite subsets of I forms a π-system. Because these also generate B_M,
B^0_M = B_M.
Remark 2.5.1. A convenient choice of B_0 is the collection of all open balls, or all closed
balls, etc. When X = R a very useful choice is the collection {(−∞, a] : a ∈ Q}, where
Q is a dense set in R.
As noted earlier M(R) when equipped with weak convergence becomes a complete
separable metric space with BMas the Borel σ-algebra. Thus a natural topology
on M(R) is the weak topology arising from this metric space structure of M(R).
Formally, we have the following definitions.
Definition 2.5.1. A sequence of probability measures {Π_n} on M(R) is said to
converge weakly to a probability measure Π if
∫ φ(P) dΠ_n → ∫ φ(P) dΠ
for all bounded continuous functions φ on M(R).
Note that continuity of φ is with respect to the weak topology on M(R). If f is a
bounded continuous function on R then φ(P) = ∫ f dP is bounded and continuous on
M(R). However, in general there is no clear description of all the bounded continuous
functions on M(R). If X is compact metric, then the following description is available.
If X is compact metric then, by Prohorov's theorem, so is M(X) under weak
convergence. It follows from the Stone-Weierstrass theorem that finite linear combinations of functions
of the form
∏_{j=1}^{k_i} φ_{f_{i,j}}
where φ_{f_{i,j}}(P) = ∫ f_{i,j}(x) dP(x) with f_{i,j}(x) continuous on X, are dense in the space of
all continuous functions on M(X).
The following result is an extension of a similar result in Sethuraman and Tiwari
[149].
Theorem 2.5.1. A family of probability measures {Π_t : t ∈ T} on M(R) is tight
with respect to weak convergence on M(R) iff the family of expectations {E_{Π_t} : t ∈ T},
where E_{Π_t}(B) = ∫ P(B) dΠ_t(P), is a tight family of probability measures on R.
Proof. Let µ_t = E_{Π_t}. Fix δ > 0. By the tightness of {µ_t : t ∈ T}, for every positive
integer d there exists a compact set K_d in R such that sup_t µ_t(K^c_d) ≤
6δ/(d^3 π^2).
For d = 1, 2, ..., let M_d = {P ∈ M(R) : P(K^c_d) ≤ 1/d}, and let M = ∩_d M_d. Then,
by the portmanteau and Prohorov theorems, M is a compact subset of M(R) in the
weak topology. Further, by Markov's inequality,
Π_t(M^c_d) ≤ d E_{Π_t}(P(K^c_d)) = d µ_t(K^c_d) ≤ 6δ/(d^2 π^2)
Hence, for any t ∈ T, Π_t(M^c) ≤ Σ_d 6δ/(d^2 π^2) = δ. This proves that {Π_t}_{t∈T} is
tight. The converse is easy.
Theorem 2.5.2. Suppose Π, Π_n, n ≥ 1 are probability measures on M(R). If any of
the following holds then Π_n converges weakly to Π.
(i) For any collection (B_1, B_2, ..., B_k) of Borel sets,
L_{Π_n}(P(B_1), P(B_2), ..., P(B_k)) → L_Π(P(B_1), P(B_2), ..., P(B_k))
(ii) For any disjoint collection (B_1, B_2, ..., B_k) of Borel sets,
L_{Π_n}(P(B_1), P(B_2), ..., P(B_k)) → L_Π(P(B_1), P(B_2), ..., P(B_k))
(iii) For any (B_1, B_2, ..., B_k) where, for i = 1, 2, ..., k, B_i = (a_i, b_i],
L_{Π_n}(P(B_1), P(B_2), ..., P(B_k)) → L_Π(P(B_1), P(B_2), ..., P(B_k))
(iv) For any (B_1, B_2, ..., B_k) where, for i = 1, 2, ..., k, B_i = (a_i, b_i] with a_i, b_i rational,
L_{Π_n}(P(B_1), P(B_2), ..., P(B_k)) → L_Π(P(B_1), P(B_2), ..., P(B_k))
(v) For any (B_1, B_2, ..., B_k) where, for i = 1, 2, ..., k, B_i = (−∞, t_i],
L_{Π_n}(P(B_1), P(B_2), ..., P(B_k)) → L_Π(P(B_1), P(B_2), ..., P(B_k))
(vi) For any (B_1, B_2, ..., B_k) where, for i = 1, 2, ..., k, B_i = (−∞, t_i] with t_i rational,
L_{Π_n}(P(B_1), P(B_2), ..., P(B_k)) → L_Π(P(B_1), P(B_2), ..., P(B_k))
Proof. Because (vi) is the weakest, we will show that (vi) implies Π_n → Π weakly. Note
that for all rationals t, E_{Π_n}(P(−∞, t]) → E_Π(P(−∞, t]) and hence E_{Π_n} converges
weakly to E_Π. By Theorem 2.5.1 this shows that {Π_n} is tight. If Π* is the limit
of any subsequence of {Π_n}, then it follows, using Proposition 2.5.1, that Π* = Π.
Remark 2.5.2. Note that Π_n → Π weakly does not imply any of the preceding. The
modifications are easy, however. For example, (i) would be changed to "For any
(B_1, B_2, ..., B_k) of Borel sets such that P ↦ (P(B_1), P(B_2), ..., P(B_k)) is continuous a.e.
Π."
We have considered other topologies on M(R) namely, total variation, setwise con-
vergence and the supremum metric. It is tempting to consider the weak topologies on
probabilities on M(R) induced by these topologies. But as we have observed, these
topologies possess properties that make the notion of weak convergence awkward to
define and work with. Besides, the σ-algebras generated by these topologies, via either
open sets or open balls do not coincide with BM[57]. Our interests do not demand
such a general theory. Our chief interest is when the limit measure Π is degenerate
at P_0, and in this case we can formalize convergence via weak neighborhoods of P_0.
When Π = δ_{P_0}, Π_n → δ_{P_0} weakly iff Π_n(U) → Π(U) for every open neighborhood U of δ_{P_0}.
Because weak neighborhoods of P_0 are of the form U = {P : |∫ f_i dP − ∫ f_i dP_0| < ε, i = 1, ..., k},
weak convergence to a degenerate measure δ_{P_0} can be described in terms of continuous
functions on R rather than those on M(R) and can be verified more easily. The next
proposition is often useful when we work with weak neighborhoods of a probability
P_0 on R.
Proposition 2.5.2. Let Q be a countable dense subset of R. Given any weak neigh-
borhood U of P_0 there exist a_1 < a_2 < ... < a_n in Q and δ > 0 such that
{P : |P[a_i, a_{i+1}) − P_0[a_i, a_{i+1})| < δ for 1 ≤ i ≤ n−1} ⊂ U
Proof. Suppose U = {P : |∫ f dP − ∫ f dP_0| < ε}, where f is continuous with compact
support. Because Q is dense in R, given δ there exist a_1 < a_2 < ... < a_n in Q such that
f(x) = 0 for x ≤ a_1 and x ≥ a_n, and |f(x) − f(y)| < δ for x, y ∈ [a_i, a_{i+1}], 1 ≤ i ≤ n−1.
Then the function f* defined by
f*(x) = f(a_i) for x ∈ [a_i, a_{i+1}), i = 1, 2, ..., n−1
satisfies sup_x |f(x) − f*(x)| < δ.
For any P, ∫ f* dP = Σ_i f(a_i) P[a_i, a_{i+1}), and if |P[a_i, a_{i+1}) − P_0[a_i, a_{i+1})| < δ for all i, then
|∫ f* dP − ∫ f* dP_0| < c k δ, where c = sup_x |f(x)|
and consequently
|∫ f dP − ∫ f dP_0| < 2δ + c k δ
Thus, with B_i = [a_i, a_{i+1}), for small enough δ, {P : |P(B_i) − P_0(B_i)| < δ} is contained
in U. The preceding argument is easily extended if U is of the form
{P : |∫ f_i dP − ∫ f_i dP_0| ≤ ε_i, 1 ≤ i ≤ k, f_i continuous with compact support}
Following is another useful proposition.
Proposition 2.5.3. Let U = {F : sup_{−∞<x<∞} |F_0(x) − F(x)| < ε} be a supre-
mum neighborhood of a continuous distribution function F_0. Then U contains a weak
neighborhood of F_0.
Proof. Choose −∞ = x_0 < x_1 < x_2 < ... < x_k = ∞ such that F_0(x_{i+1}) − F_0(x_i) < ε/4
for i = 0, 1, ..., k−1. Consider
W = {F : |F(x_i) − F_0(x_i)| < ε/4, i = 1, 2, ..., k}
If x ∈ (x_{i−1}, x_i),
|F(x) − F_0(x)| ≤ |F(x_{i−1}) − F_0(x_i)| ∨ |F(x_i) − F_0(x_{i−1})|
≤ |F(x_{i−1}) − F_0(x_{i−1})| + |F_0(x_{i−1}) − F_0(x_i)|
+ |F(x_i) − F_0(x_i)| + |F_0(x_{i−1}) − F_0(x_i)|
which is less than ε if F ∈ W.
2.6 De Finetti’s Theorem
Much of classical statistics has centered around the conceptually simplest setting of
independent and identically distributed observations. In this case, X1,X
2,... are
a sequence of i.i.d. random variables with an unknown common distribution P.In
the parametric case, Pwould be constrained to lie in a parametric family, and in
the general nonparametric situation Pcould be any element of M(R). The Bayesian
framework in this case consists of a prior Π on the parameter set M(R); given P
the X1,X
2,... is modeled as i.i.d. P. In a remarkable theorem, De Finetti showed
that a minimal judgment of exchangeability of the observation sequence leads to the
Bayesian formulation discussed earlier.
In this section we briefly discuss De Finetti’s theorem. A detailed exposition of
the theorem and related topics can be found in Schervish [144] in the section on De
Finetti’s theorem and the section on Extreme models.
As before, let X_1, X_2, ... be a sequence of X-valued random variables defined on
Ω = R^∞.
Definition 2.6.1. Let µ be a probability measure on Ω. The sequence X_1, X_2, ...
is said to be exchangeable if, for each n and for every permutation g of {1, ..., n}, the
distribution of X_1, X_2, ..., X_n is the same as that of X_{g(1)}, X_{g(2)}, ..., X_{g(n)}.
Theorem 2.6.1 (De Finetti). Let µ be a probability measure on Ω. Then
X_1, X_2, ... is exchangeable iff there is a unique probability measure Π on M(R) such
that for all n and for any Borel sets B_1, B_2, ..., B_n,
µ{X_1 ∈ B_1, X_2 ∈ B_2, ..., X_n ∈ B_n} = ∫_{M(R)} ∏_{i=1}^n P(B_i) dΠ(P)    (2.4)
Proof. We begin by proving the theorem when all the X_is take values in a finite set
X = {1, 2, ..., k}. This proof follows Heath and Sudderth [95].
So let X = {1, 2, ..., k} and µ be a probability measure such that X_1, X_2, ...
is exchangeable. For each n, let T_n(X_1, X_2, ..., X_n) = (r_1, r_2, ..., r_k), where r_j =
Σ_{i=1}^n I_{{j}}(X_i) is the number of occurrences of j in X_1, X_2, ..., X_n. Let µ_n denote the
distribution of T_n/n = (r_1/n, r_2/n, ..., r_k/n) under µ. µ_n is then a discrete probability
measure on M(X) supported by points of the form (r_1/n, r_2/n, ..., r_k/n), where for
j = 1, 2, ..., k, r_j ≥ 0 is an integer and Σ r_j = n. Because M(X) is compact, there
is a subsequence {n_i} along which µ_{n_i} converges to a probability measure Π on M(X). We will
argue that Π satisfies (2.4).
Because X_1, X_2, ..., X_n is exchangeable, it is easy to see that the conditional distri-
bution of X_1, X_2, ..., X_n given T_n is also exchangeable. In particular, the conditional
probability given T_n(X_1, X_2, ..., X_n) = (r_1, r_2, ..., r_k) is just the uniform distribution
on T_n^{−1}(r_1, r_2, ..., r_k). In other words, the conditional distribution of X_1, X_2, ..., X_n
given T_n = (r_1, r_2, ..., r_k) is the same as the distribution of n successive draws, without replacement, from
an urn containing n balls with r_i of color i, for i = 1, 2, ..., k.
Fix m and n > m. Then, given T_n(X_1, X_2, ..., X_n) = (r_1, r_2, ..., r_k), the conditional
probability that
X_1 = 1, ..., X_{s_1} = 1, X_{s_1+1} = 2, ..., X_{s_1+s_2} = 2, ..., X_{m−s_k+1} = k, ..., X_m = k
is (r_1)_{s_1}(r_2)_{s_2}...(r_k)_{s_k}/(n)_m, where for any real a and integer b, (a)_b = ∏_{i=0}^{b−1}(a − i).
Because
Σ_{(r_1,...,r_k): Σ r_j = n} [(r_1)_{s_1}(r_2)_{s_2}...(r_k)_{s_k}/(n)_m] µ{T_n/n = (r_1/n, r_2/n, ..., r_k/n)}
= ∫_{M(X)} [(p_1 n)_{s_1}(p_2 n)_{s_2}...(p_k n)_{s_k}/(n)_m] dµ_n(p_1, p_2, ..., p_k)
and as n → ∞ the sequence of functions
(p_1 n)_{s_1}(p_2 n)_{s_2}...(p_k n)_{s_k}/(n)_m
converges uniformly on M(X) to ∏_j p_j^{s_j}, taking the limit through the sub-
sequence {n_i} shows that the probability of
(X_i = 1, 1 ≤ i ≤ s_1; X_i = 2, s_1 + 1 ≤ i ≤ s_1 + s_2; ...; X_i = k, m − s_k + 1 ≤ i ≤ m)
is
∫_{M(X)} ∏_j p_j^{s_j} dΠ(p_1, p_2, ..., p_k)    (2.5)
Uniqueness is immediate because if Π_1, Π_2 are two probability measures on M(X)
satisfying (2.5) then they have the same moments.
To move on to the general case X = R, let B_1, B_2, ..., B_k be any collection of
disjoint Borel sets in R. Set B_0 = (∪_1^k B_i)^c.
To move on to the general case X=R,letB1,B
2,...,B
kbe any collection of
disjoint Borel sets in R. Set B0=k
1Bic.
2.6. DE FINETTI’S THEOREM 85
Define Y_1, Y_2, ... by Y_i = j if X_i ∈ B_j. Because X_1, X_2, ... is exchangeable, so are
Y_1, Y_2, .... Since each Y_i takes only finitely many values, we can use what we have just
proved; writing X_i ∈ B_j for Y_i = j, there is a probability measure Π_{B_1,B_2,...,B_k} on
{(p_1, p_2, ..., p_k) : p_j ≥ 0, Σ p_j ≤ 1} such that for any m,
µ(X_1 ∈ B_{i_1}, X_2 ∈ B_{i_2}, ..., X_m ∈ B_{i_m}) = ∫ ∏_{j=1}^m P(B_{i_j}) dΠ_{B_1,B_2,...,B_k}(P)    (2.6)
where i_1, i_2, ..., i_m are elements of {0, 1, 2, ..., k} and P(B_0) = 1 − Σ_1^k P(B_i).
We will argue that these Π_{B_1,B_2,...,B_k}s satisfy the conditions of Theorem 2.3.4.
If A_1, A_2, ..., A_l is a collection of disjoint Borel sets such that the B_i are unions of sets
from A_1, A_2, ..., A_l, then the distribution of (P(B_1), P(B_2), ..., P(B_k)) obtained from
(P(A_1), P(A_2), ..., P(A_l)) under Π_{A_1,A_2,...,A_l} and the distribution Π_{B_1,B_2,...,B_k} both satisfy (2.5). Uniqueness then
shows that both distributions are the same.
If (B_{1n}, B_{2n}, ..., B_{kn}) ↓ (B_1, B_2, ..., B_k), then (2.6) again shows that the moments of
Π_{B_{1n},B_{2n},...,B_{kn}} converge to the corresponding moments of Π_{B_1,B_2,...,B_k}.
It is easy to verify the other conditions of Theorem 2.3.4. Hence there exists a Π
with the Π_{B_1,B_2,...,B_k}s as marginals. It is easy to verify that Π satisfies (2.4).
De Finetti’s theorem can be viewed from a somewhat general perspective. Let Gn
be the group of permutations on {1,2,...,n}and let G=∪Gn.Everyg∈Ginduces
in a natural way a transformation on Ω = Xthrough the map, if, say gin Gn, then
(x1,...,x
n,...)→ (xg(1),...,x
g(n),...). It is easy to see that the set of exchangeable
probability measures is the same as the set of probability measures on Ω that are
invariant under G. This set is a convex set, and De Finetti’s theorem asserts that the
set of extreme points of this convex set is {P:PM(X)}and that every invariant
measure is representable as an average over the set of extreme points. This view of
exchangeable measures suggests that by suitably enlarging Git would be possible
to obtain priors that are supported by interesting subsets of M(X) . Following is a
simple, trivial example.
Example 2.6.1. Let H={h, e},whereh(x)=xand e(x)=x. Set H=
Hn.If(h1,h
2,...,h
n)) Hn, then the action on Ω is defined by (x1,x
2,...,x
n)→
(h(x1),h(x2),...,h(xn). Then an exchangeable probability measure µis Hinvariant
iff it is a mixture of symmetric i.i.d. probability measures. To see this by De Finetti’s
theorem
µ(A)=P(A)dΠ(P)
86 2. M(X)AND PRIORS ON M(X)
Because by Hinvariance µ(X1A, X2∈−A)=µ(X1A, X2A), it is not hard
to see that EΠ(P(A)P(A))2= 0. Letting Arun through a countable algebra
generating the σ-algebra on X, we have the result.
More nontrivial examples are in Freedman [68].
Sufficiency provides another frame through which De Finetti's theorem can be use-
fully viewed. The ideas leading to such a view and the proofs involve many measure-
theoretic details. Most of the interesting examples involve invariance and sufficiency
in some form. We do not discuss these aspects here but refer the reader to the excel-
lent survey in Schervish [144], the paper by Diaconis and Freedman [44], and Fortini,
Ladelli, and Regazzini [67].
To use De Finetti's theorem to construct a specific prior on M(R), we need to know
what to expect from the prior in terms of the observables X_1, X_2, ..., X_n. Although
this method of assigning a prior is attractive from a philosophical point of view, it
is not easy either to describe explicitly an exchangeable sequence or to identify a prior
given such a sequence. We will not pursue this aspect here.
3
Dirichlet and Polya tree process
3.1 Dirichlet and Polya tree process
In this chapter we develop and study a very useful family of prior distributions on
M(R) introduced by Ferguson [61]. Ferguson introduced the Dirichlet processes, un-
covered many of their basic properties, and applied them to a variety of nonparametric
estimation problems, thus providing for the first time a Bayesian interpretation for
some of the commonly used nonparametric procedures. These priors are relatively
easy to elicit. They can be chosen to have large support and thus capture the non-
parametric aspect. In addition they have tractable posterior and nice consistency
properties. These processes are not an answer to all Bayesian nonparametric or semi-
parametric problems but they are important as both a large class of interpretable
priors and a point of departure for more complex prior distributions.
The Dirichlet process arises naturally as an infinite-dimensional analogue of the
finite-dimensional Dirichlet prior, which in turn has its roots in the one-dimensional
beta distribution . We will begin with a review of the finite-dimensional case.
3.1.1 Finite Dimensional Dirichlet Distribution
In this section we summarize some basic properties of the Dirichlet distribution,
especially those that arise when the Dirichlet is viewed as a prior on M(X), the set of
probability measures on X. Details are available in many standard texts, for example
Berger [13].
First consider the simple case when X = {1, 2}. Then
M(X) = {p = (p_1, p_2) : p_1 ≥ 0, p_2 ≥ 0, p_1 + p_2 = 1}
Because p_2 = 1 − p_1 and 0 ≤ p_1 ≤ 1, any probability measure on [0,1] defines
a prior distribution on M(X). In particular, say that p has a beta(α_1, α_2) prior if
α_1 > 0, α_2 > 0 and if the prior has the density
Π(p_1) = [Γ(α_1 + α_2)/(Γ(α_1)Γ(α_2))] p_1^{α_1−1} (1 − p_1)^{α_2−1}, 0 ≤ p_1 ≤ 1
It is easy to see that
E(p_1) = α_1/(α_1 + α_2)
V(p_1) = α_1(α_1+1)/[(α_1+α_2)(α_1+α_2+1)] − [α_1/(α_1+α_2)]^2 = α_1 α_2 / [(α_1+α_2)^2 (α_1+α_2+1)]
We adopt the convention of setting the beta prior to be degenerate at p_1 = 0 if
α_1 = 0 and degenerate at p_2 = 0 if α_2 = 0. Note that the convention goes well with
the expression for E(p_1). In fact, the following proposition provides more justification
for this convention.
Proposition 3.1.1. If α_{1n} → 0 and α_{2n} → c, 0 < c < ∞, then beta(α_{1n}, α_{2n})
converges weakly to δ_0.
Proof. If p_n is distributed as beta(α_{1n}, α_{2n}), then E p_n → 0, V(p_n) → 0 and hence
p_n → 0 in probability.
The following representation of the beta is useful and well known. Let Z_1, Z_2 be
independent gamma random variables with parameters α_1, α_2 > 0, i.e., with density
f(z_i) = [1/Γ(α_i)] e^{−z_i} z_i^{α_i−1}, z_i > 0
Then Z_1/(Z_1 + Z_2) is independent of Z_1 + Z_2 and is distributed as beta(α_1, α_2).
If we define a gamma distribution with α = 0 to be the measure degenerate at 0,
then the representation of beta random variables remains valid for all α_1 ≥ 0, α_2 ≥ 0,
as long as one of them is strictly positive.
Suppose X_1, X_2, ..., X_n are X-valued i.i.d. random variables distributed as p; then
beta priors are conjugate in the sense that if p has a beta(α_1, α_2) prior distribution
then the posterior distribution is also a beta, with parameters α_1 + Σ δ_{X_i}(1) and
α_2 + Σ δ_{X_i}(2), where δ_x stands for the degenerate measure with δ_x{x} = 1. Moreover, the
marginal distribution of X_1, X_2, ..., X_n is exchangeable with marginal probability
λ(X_1 = i) = α_i/(α_1 + α_2).
Next we move on to the case where X = {1, 2, ..., k}. The set M(X) of probability
measures on X is now in 1-1 correspondence with the simplex
S_k = {p = (p_1, p_2, ..., p_{k−1}) : p_i ≥ 0 for i = 1, 2, ..., k−1, Σ p_i ≤ 1}
and as before we set p_k = 1 − Σ_1^{k−1} p_i. A prior is specified by specifying a probability
distribution for (p_1, p_2, ..., p_{k−1}). This distribution determines the joint distribution of
the 2^k variables {P(A) : A ⊂ X} through P(A) = Σ_{i∈A} p_i. The k-dimensional Dirichlet
distribution is a natural extension of the beta distribution.
Definition 3.1.1. Let α = (α_1, α_2, ..., α_k) with α_i > 0 for i = 1, 2, ..., k. p =
(p_1, p_2, ..., p_k) is said to have the Dirichlet distribution with parameter (α_1, α_2, ..., α_k)
if the density of (p_1, p_2, ..., p_{k−1}) is
Π(p_1, p_2, ..., p_{k−1}) = [Γ(Σ_1^k α_i) / (Γ(α_1)Γ(α_2) ⋯ Γ(α_k))] p_1^{α_1−1} p_2^{α_2−1} ⋯ p_{k−1}^{α_{k−1}−1} (1 − Σ_1^{k−1} p_i)^{α_k−1}
for (p_1, p_2, ..., p_{k−1}) in S_k.    (3.1)
Convention. If any α_i = 0, we still define a Dirichlet by setting the corresponding
p_i = 0 and interpreting the density (3.1) as a density on a lower-dimensional set.
The Dirichlet distribution with the vector (α_1, α_2, ..., α_k) as parameter will be
denoted by D(α_1, α_2, ..., α_k). So we have a Dirichlet distribution defined for all
(α_1, α_2, ..., α_k) as long as Σ_i α_i > 0. Following are some properties of the Dirichlet
distribution.
Properties.
1. Like the beta distribution, Dirichlet distributions admit a useful representation
in terms of gamma variables. If Z_1, Z_2, ..., Z_k are independent gamma random
variables with parameters α_i ≥ 0, then
(a) (Z_1/Σ_1^k Z_i, Z_2/Σ_1^k Z_i, ..., Z_k/Σ_1^k Z_i)    (3.2)
is distributed as D(α_1, α_2, ..., α_k);
(b) (Z_1/Σ_1^k Z_i, Z_2/Σ_1^k Z_i, ..., Z_k/Σ_1^k Z_i) is independent of Σ_1^k Z_i;    (3.3)
and
(c) if p = (p_1, p_2, ..., p_k) is distributed as D(α_1, α_2, ..., α_k), then for any
partition A_1, A_2, ..., A_m of X, the vector
(P(A_1), P(A_2), ..., P(A_m)) = (Σ_{i∈A_1} p_i, Σ_{i∈A_2} p_i, ..., Σ_{i∈A_m} p_i)
is distributed as D(α'_1, α'_2, ..., α'_m), where α'_j = Σ_{i∈A_j} α_i. In particular, the marginal distribution of p_i is beta with
parameters (α_i, Σ_{j≠i} α_j).
This property suggests that it is convenient to view the parameter
(α_1, α_2, ..., α_k) as a measure α(A) = Σ_{i∈A} α_i. Thus every nonzero measure α on
X defines a Dirichlet distribution, and the last property takes the form
(P(A_1), P(A_2), ..., P(A_m)) is D(α(A_1), α(A_2), ..., α(A_m))
A small simulation sketch of this representation and the aggregation property is given below.
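The following Python sketch is our own illustration (with arbitrary parameter values) of the gamma representation (3.2) and the aggregation property in 1(c): normalized independent gammas are Dirichlet, and summing coordinates over a partition gives a Dirichlet with the summed parameters.

```python
# Gamma representation and aggregation property of the Dirichlet (Monte Carlo check).
import numpy as np

rng = np.random.default_rng(5)
alpha = np.array([1.0, 2.0, 0.5, 3.0])

z = rng.gamma(shape=alpha, size=(10000, 4))    # rows of independent Gamma(alpha_i) variables
p = z / z.sum(axis=1, keepdims=True)           # each row is a Dirichlet(alpha) draw

# aggregate over the partition A1 = {1, 2}, A2 = {3, 4}
q = np.column_stack([p[:, :2].sum(axis=1), p[:, 2:].sum(axis=1)])
print(q.mean(axis=0))      # approx (alpha(A1), alpha(A2)) / alpha(X) = (3/6.5, 3.5/6.5)
```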
2. (Tail free property) Let M_1, M_2, ..., M_k be a partition of X. For i = 1, 2, ..., k
with α(M_i) > 0, let P(·|M_i) be the conditional probability given M_i defined by
P(j|M_i) = P(j)/P(M_i) for j ∈ M_i
If α(M_i) = 0 then take P(·|M_i) to be an arbitrary fixed probability, the same for all P.
If P, the random probability on X, is D(α), then
(i) (P(M_1), P(M_2), ..., P(M_k)), P(·|M_1), P(·|M_2), ..., P(·|M_k) are indepen-
dent;
(ii) if α(M_i) > 0 then P(·|M_i) is D(α_{M_i}), where α_{M_i} is the restriction of α to
M_i; and
(iii) (P(M_1), P(M_2), ..., P(M_k)) is Dirichlet with parameter
(α(M_1), α(M_2), ..., α(M_k)).
To see this, let X = {1, 2, ..., n} and let {Y_i : 1 ≤ i ≤ n} be independent
gamma random variables with parameters α({i}). The gamma representation of
the Dirichlet immediately shows that
P(·|M_1), P(·|M_2), ..., P(·|M_k)    (3.4)
are independent. Further, if Z_j = Σ_{i∈M_j} Y_i, then
Z_1, Z_2, ..., Z_k
are independent, and using (3.4) it is easy to see that (Z_1, Z_2, ..., Z_k), and hence
Σ_j Z_j, is independent of
P(·|M_1), P(·|M_2), ..., P(·|M_k)
Because P(M_j) = Z_j/Σ_j Z_j, the result follows.
3. (Neutral to the right property) Let B_1 ⊃ B_2 ⊃ ... ⊃ B_k. Then we have the
independence relations
P(B_1) ⊥ P(B_2|B_1) ⊥ ... ⊥ P(B_k|B_{k−1})
This follows from the tail free property by successively considering the partitions
{B_1, B_1^c}; {B_1^c, B_2, B_1 ∩ B_2^c}; ...
4. Let α_1, α_2 be two measures on X and P_1, P_2 be two independent k-dimensional
Dirichlet random vectors with parameters α_1, α_2. If Y, independent of (P_1, P_2), is
distributed as beta(α_1(X), α_2(X)), then Y P_1 + (1 − Y) P_2 is D(α_1 + α_2).
To see this, let Z_1, Z_2, ..., Z_k be independent random variables with Z_i ~
gamma(α_1{i}). Similarly, for i = 1, 2, ..., k let Z_{k+i} ~ gamma(α_2{i}) be inde-
pendent gamma random variables. Then
(Σ_1^k Z_i / Σ_1^{2k} Z_i)(Z_1/Σ_1^k Z_i, ..., Z_k/Σ_1^k Z_i) + (Σ_{k+1}^{2k} Z_i / Σ_1^{2k} Z_i)(Z_{k+1}/Σ_{k+1}^{2k} Z_i, ..., Z_{2k}/Σ_{k+1}^{2k} Z_i)
has the same distribution as Y P_1 + (1 − Y) P_2. But then the last expression is
equal to
((Z_1 + Z_{k+1})/Σ_1^{2k} Z_i, ..., (Z_k + Z_{2k})/Σ_1^{2k} Z_i)
which is distributed as D(α_1 + α_2). Note that the assertion remains valid even
if some of the α_1{i}, α_2{j} are zero. An interesting consequence is: if P is D(α)
and Y is independent of P and distributed as beta(c, α(X)), then
Y(1, 0, ..., 0) + (1 − Y) P ~ D(α{1} + c, α{2}, ..., α{k})
This follows if we think of δ_{(1,0,...,0)} as Dirichlet with parameter (c, 0, ..., 0). A
corresponding statement holds if (1, 0, ..., 0) is replaced by any vector with a 1
at one coordinate and 0 at the other coordinates.
5. For each p in M(X), let X_1, X_2, ..., X_n be i.i.d. P and let P itself be D(α).
Then the posterior density is proportional to
∏_1^k p_i^{α_i−1+n_i}
where n_i = #{j : X_j = i}. Hence the posterior distribution of P given
X_1, X_2, ..., X_n can be conveniently written as D(α + Σ δ_{X_i}).
6. The marginal distribution of each X_i is ᾱ, where ᾱ(i) = α(i)/α(X), and also
E(P) = ᾱ. To see this, note that for each A ⊂ X, P(A) is beta(α(A), α(A^c))
and hence E(P(A)) = α(A)/(α(A) + α(A^c)).
7.
D(α)(PC)=
k
1
α(i)
α(X)D(α+δi)(C)
3.1 DIRICHLET DISTRIBUTION 93
This follows from D(α)(PC)=E(E(PC|X1)); E(PC|X1)isbyprop-
erty 5, D(α+δX1)(C), and the marginal of X1is ¯α.
8. Let Pbe distributed as D(α)andXindependent of Pbe distributed as ¯α.
Let Ybe independent of Xand Pbe a beta(1(X)) random variable. Then
X+(1Y)Pis again a D(α) random probability.
This follows from properties 4 and 7 by conditioning on x=i, interpreting δi
as a D(δi) distribution, and then using properties 4 and 7.
9. The predictive distribution of Xn+1 given X1,X
2,...,X
nis
α+n
1δXi
α(X)+n
10. α1=α2implies D(α1)=D(α2), except when α1
2are degenerate and put all
their masses at the same point.
This can be verified by choosing an isuch that α1(i)=α2(i). Then P(i)hasa
nondegenerate beta distribution under at least one of α1
2. Next use the fact
that a beta distribution is determined by its first two moments.
11. It is often convenient to write a finite measure αon Xas α=c¯α,where ¯αis
a probability measure. Let αn=cn¯αnbe a sequence of measures on X. Then
D(cn¯αn) is a sequence of probability measures on the compact set Skand hence
has limit points. The following convergence results are useful.
(a) If ¯αn¯αand cnc, 0<c<, then D(cn¯αn)D(c¯α)weakly.
If ¯α{i}>0 for all i, then the density of D(cn¯αn) converges to that of
D(c¯α). If ¯α{i}=0forsomeoftheis, then the result can be verified by
showing that the moments of D(cn¯αn) converge to the moments of D(c¯α).
(b) Suppose that ¯αn¯αand cn0. Then D(cn¯αn) converges weakly to the
discrete measure µwhich gives mass ¯αito the probability degenerate at i.
To see this note that ED(cn¯αn)piαn{i}→¯α{i}, and it follows from
simple calculations that ED(cn¯αn)p2
ialso converges to ¯α{i}.Thuseachpiis
0 or 1 almost surely with respect to any limit point of D(cn¯αn). In other
words, any limit point of D(cn¯αn) is a measure concentrated on the set of
degenerate probabilities on X. It is easy to see that any two limit points
have the same expected value and this together with the fact that they are
both concentrated on degenerate measures shows that D(cn¯αn)converges.
94 3. DIRICHLET AND POLYA TREE PROCESS
(c) ¯αn¯αand cn→∞. In this case also, ED(cn¯αn)piconverges to ¯α{i}.
However Var
D(cn¯αn)pi0, and hence D(cn¯αn) converges to the measure
degenerate at ¯α.
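The following Python sketch is a quick empirical look of ours at property 11 (with ᾱ fixed and illustrative values of c): for small c the Dirichlet spreads toward the vertices of the simplex, and for large c it concentrates at ᾱ.

```python
# Behaviour of D(c * abar) as c varies: mean stays near abar, the spread does not.
import numpy as np

rng = np.random.default_rng(6)
abar = np.array([0.5, 0.3, 0.2])

for c in (0.5, 5.0, 500.0):
    draws = rng.dirichlet(c * abar, size=5000)
    print(c, draws.mean(axis=0), draws.var(axis=0))
    # means remain close to abar; variances are large for small c, tiny for large c
```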
3.1.2 Dirichlet Distribution via Polya Urn Scheme
The following alternative view of the Dirichlet process is both interesting and a pow-
erful tool. For a recent use of this approach, see Mauldin et al.[133].
Consider a Polya urn with α(X) balls of which α(i) are of color i, i = 1, 2, ..., k. [For
the moment assume that the α(i) are whole numbers or 0.] Draw balls at random from
the urn, replacing each ball drawn by two balls of the same color. Let X_i = j if the
ith ball is of color j. Then
P(X_1 = j) = α(j)/α(X)    (3.5)
P(X_2 = j | X_1) = (α(j) + δ_{X_1}(j)) / (α(X) + 1)    (3.6)
and in general
P(X_{n+1} = j | X_1, X_2, ..., X_n) = (α(j) + Σ_1^n δ_{X_i}(j)) / (α(X) + n)    (3.7)
Thus we are reproducing the joint distribution of X_1, X_2, ... that would be ob-
tained from property 9 in the last section. The joint distribution of X_1, X_2, ... is
exchangeable. In fact, if λ_α denotes the joint distribution, then
λ_α(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = [α(x_1)/α(X)] ∏_{i=1}^{n−1} [(α + Σ_1^i δ_{x_j})(x_{i+1}) / (α(X) + i)]
which, setting n_i = #{j : X_j = i}, equals
{α(1)(α(1)+1)...(α(1)+n_1−1)} {α(2)(α(2)+1)...(α(2)+n_2−1)} ... {α(k)(α(k)+1)...(α(k)+n_k−1)}
/ [α(X)(α(X)+1)...(α(X)+n−1)]
= [α(1)]^{[n_1]} ... [α(k)]^{[n_k]} / [α(X)]^{[n]}    (3.9)
where m^{[n]} is the ascending factorial given by m^{[n]} = m(m+1)...(m+n−1).
It is clear that (3.5) defines successive conditional distributions even when α{i} is
not an integer but only ≥ 0. The scheme (3.5) thus leads to a sequence of exchangeable
random variables, and the corresponding mixing measure Π coming out of De Finetti's
theorem is precisely D_α. What we need to show is that if D_α is the prior on M(X) and
if, given P, X_1, X_2, ... are i.i.d. P, then the sequence X_1, X_2, ... has the distribution
given in (3.9). In fact, (3.9) is equal to
∫_{M(X)} [P(1)]^{n_1} ... [P(k)]^{n_k} D_α(dP)
which is equal to
∫_{M(X)} [P(1)]^{n_1} ... [P(k)]^{n_k} Π(dP)
Since the finite-dimensional Dirichlet is determined by its moments, this shows Π =
D_α. A small numerical check of (3.9) is given below.
The posterior given X_1, X_2, ..., X_n can also be recovered from this approach. For
a given X_1, (3.5) defines a scheme of conditional distributions with α replaced by
α + δ_{X_1}. Once again De Finetti's theorem leads to the prior D(α + δ_{X_1}); this is also
the posterior given X_1.
We end this section with the question of interpretation and elicitation of α. From
property 6, ¯α=α(·)(X)=E(P). So ¯αis the prior guess about the expected P.
If we rewrite property 10 in terms of the Bayes estimate E(pi|X1,X
2,...,X
n)ofpi
given X1,X
2,...,X
n
E(pi|X1,X
2,...,X
n)= α(X)
α(X)+n¯α(i)+ n
α(X)+n(ni
n)
which shows the Bayes estimate can be viewed as a convex combination of the “prior
guess” and the empirical proportion. Because the weight of the “prior guess” is de-
termined by α(X), this suggests interpreting α(X) as a measure of strength of the
prior belief. This ease in interpretation and elicitation is a consequence of the fact
that Dirichlet is a conjugate prior for i.i.d. sampling from X. We will show that all
these properties hold when X=R. The fact that variability of Pis determined by a
single parameter α(X) can be a problem when k>2.
96 3. DIRICHLET AND POLYA TREE PROCESS
3.2 Dirichlet Process on M(R)
3.2.1 Construction and Properties
Dirichlet process priors are a natural generalization to M(R) of the finite-dimensional
distributions considered in the last section. Let (R,B) be the real line with the Borel
σ-algebra Band let M(R) be the set of probability measures on R, equipped with
the σ-algebra BM.
The next theorem asserts the existence of a Dirichlet process and also serves as a
definition of the process Dα.
Theorem 3.2.1. Let αbe a finite measure on (R,B). Then there exists a unique
probability measure Dαon M(R)called the Dirichlet process with parameter αsat-
isfying
For every partition B1,B
2,...,B
kof Rby Borel sets
(P(B1),P(B2)...,P(Bk)) is D(α(B1)(B2)...,α(Bk))
Proof. The consistency requirement in Theorem 2.3.4 follows from property 2 in the
last section. Continuity requirement 3 follows from the fact that if BnBthen
α(Bn)α(B) and from property 11 of the last section.
Note that finite additivity of αis enough to ensure the consistency requirements.
The countable additivity is required for the continuity condition.
Assured of the existence of the Dirichlet process, we next turn to its properties.
These properties motivate other constructions of Dαvia De Finetti’s theorem and an
elegant construction due to Sethuraman. These constructions are not natural unless
one knows what to expect from a Dirichlet process prior.
If PD(α), then it follows easily that E(P(A)) = ¯α(A)=α(A)(R). Thus one
might write E(P)=¯αas the prior expectation of P.
Theorem 3.2.2. For each Pin M(R), let X1,X
2,...,X
nbe i.i.d. Pand let P
itself be distributed as Dα, where αis finite measure. (A version of) the posterior
distribution of Pgiven X1,X
2,...,X
nis Dα+n
1δXi.
Proof. We prove the assertion when n= 1; the general case follows by repeated
application. A similar proof appears in Schervish[144].
To sho w that Dα+δXis a version of the posterior given X, we need to verify that
for each B∈Band Ca measurable subset of M(R),
B
Dα+δx(Cα(dx)=C
P(B)Dα(dP )
3.2. DIRICHLET PROCESS ON M(R)97
As Cvaries each side of this expression defines a measure on M(R), and we shall argue
that these two measures are the same. It is enough to verify the equality on σ-algebras
generated by functions P→ (P(B1),P(B2)...,P(Bk)), where B1,B
2,...,B
kis a
measurable partition of R. We do this by showing that the moments of the vector
(P(B1),P(B2)...,P(Bk)) are same under both measures.
First suppose that α(Bi)>0fori=1,2,...,k. For any nonnegative r1,r
2,...,r
n,
look at
Bk
1
[P(Bi)]riDα+δx(dP )¯α(dx) (3.10)
If we denote by Dα+δiand Dαthe k-variate Dirichlet distributions with parameters
(α(B1),...,α(Bi)+1,...,α(Bk)) and (α(B1),...,α(Bi),...,α(Bk)),then (3.10) is
equal to
k
1
α(BBi)
α(B)yr1
1...y
ri
i...y
rn
kDα+δi(dy1...dy
k1).
whichinturnisequalto
=
k
1
α(BBi)
α(B)yr1
1...y
ri+1
i...y
rn
kDα(dy1...dy
k1).
On the other hand because P(B)=P(BBi),
k
1
[P(Bi)]riP(B)Dα(dP )
=
k
1k
1
[P(Bi)]riP(BBi)Dα(dP )
=
k
1P(B1)r1...P(Bi)ri+1 ...P(Bk)rk...P(BBi)
P(Bi)Dα(dP )
Since P(BBi)
P(Bi)is a Beta random variable and independent of (P(B1),P(B2)...,P(Bk)) ,
the preceding equals
k
1
α(Bi)B
α(B)P(B1)r1...P(Bi)ri+1 ...P(Bk)rk...D
α(dP )
98 3. DIRICHLET AND POLYA TREE PROCESS
which is equal to the expression obtained earlier. To take care of the case when some
of the α(Bi) may be 0, consider the simple case when, say α(B1)=0,r
1>0andthe
rest of the α(Bi) are positive. In this case
Bk
1
[P(Bi)]riDα+δx(dP )¯α(dx)=0
Because in k
1(α(BBi)(B)) yr1
1...y
ri
i...y
rn
kDα+δi(dy1...dy
k1), α(BB1)=
0 and for i=1,y1=0a.e.Dα+δi,
yr1
1...y
ri
i...y
rn
kDα+δi(dy1...dy
k1)=0
A Similar argument applies when α(Bi)is0formorethanonei.
Remark 3.2.1 (Tail Free Property). Fix a partition B1,B
2,...,B
kof X. Consider a
sequence {T}n:n1of nested partitions with T1={B1,B
2,...,B
k}and σ{{T}n:n1}=
B. Then Dαis tail free with respect to this partition. And we leave it to the reader
to verify that with Dirichlet as the prior and with given P,XP,
(P(B1),P(B2)...,P(Bk)) and X
are conditionally independent given {IBi(X); 1 ik}. Consequently, the condi-
tional distribution of the vector (P(B1),P(B2)...,P(Bk)) given Tnisthesamefor
all nand is equal to the marginal distribution of
(P(B1),P(B2)...,P(Bk))
under the measure Dα+δX.
The last remark provides an alternative and more natural approach to demonstrate
that Dα+δXis indeed the posterior given X. For, by the martingale convergence
theorem, the conditional distribution of (P(B1),P(B2)...,P(Bk)) given Tnconverges
to the conditional distribution of (P(B1),P(B2)...,P(Bk)) given X, and this limit
is the marginal distribution of the vector (P(B1),P(B2)...,P(Bk)) arising out of
Dα+δX. This is true for any partition B1,B
2,...,B
kand since a measure on M(R)
is determined by the distribution of finite partitions, we can conclude that Dα+δXis
indeed the posterior.
3.2. DIRICHLET PROCESS ON M(R)99
Remark 3.2.2 (Neutral to the Right property). Another useful independence prop-
erty follows immediately from Property 4 of the last section. If t1<t
2,... < t
k,
then
(1 F(t1)),1F(t2)
1F(t1),..., 1F(tk)
1F(tk1)
are independent.
Many of the properties of the Dirichlet process on M(R) either easily follow from,
or are suggested by the corresponding property for the finite-dimensional Dirichlet
distribution. One major difference is that in the case of M(R) the measure αcan be
continuous. This leads to some interesting consequences, some of which are explored
next.
Denote by λαthe joint distribution of P, X1,X
2,.... Suppose PD(α) and given
P,X1,X
2,... are i.i.d. P. From Theorem 3.2.2 it immediately follows that the
predictive distribution of Xn+1 given X1,X
2,...,X
nis
α+n
1δXi
α(R)+n
and hence that
X1is distributed as ¯α
Conditional distribution of X2given X1is α+δX1
α(R)+1
Conditional distribution of X3given X1,X
2is α+δX1+δX2
α(R)+2
Conditional distribution of Xn+1 given X1,X
2,...,X
nisα+n
1δXi
α(R)+n,etc.
Suppose that αis a discrete measure and let X0be the countable subset of Rsuch
that α(X0)=α(R)andα{x}>0 for all x∈X
0.Dαcan then be viewed as a prior
on M(X0). Further the joint distribution of X1,X
2,...,X
ncan be written explicitly.
For ea ch (x1,x
2,...,x
n) and for each x∈X
0,letn(x) be the number of issuch
that xi=x.Notethatn(x) is nonzero for at most nmany xs. If αndenotes the joint
distribution of X1,X
2,...,X
n, then
αn(x1,x
2,...,x
n)=
x∈X0
α(x)[n(x)] (3.11)
where a[b]=a(a+1)...(a+b1).
The case when αis continuous is a bit more involved. Even if αhas density with
respect to Lebesgue measure, for n2, because P{X1=X2} =0,α2is no longer
100 3. DIRICHLET AND POLYA TREE PROCESS
absolutely continuous with respect to the two-dimensional Lebesgue measure. To see
this formally, note that
α2{X1=X2}=(α+δx1)
α(R)+1{x1}d¯α(x1)= 1
α(R)+1
On the other hand the Lebesgue measure of {(x, x):xR}is 0.
While αnis not dominated by the n-dimensional Lebesgue measure, it is dominated
by a measure λ
ncomposed of Lebesgue measure in lower-dimensional spaces, and with
respect to this measure, it is possible to obtain a fairly explicit form of the density
of αn. We will look at the case n= 3 in some detail and then extend these ideas to
general n.
We will begin by calculating αn(A×B×C) when αis a continuous measure. Let
R1,2,3={(x1,x
2,x
3):x1,x
2,x
3are all distinct }
Then
α3((A×B×C)R1,2,3)
=α3{X1A, X2B−{X1},X
3C−{X1,X
2}}
=α(A)
α(R)
α(B)
(α(R)+1)
α(C)
(α(R)+2)
where the last equality follows from the fact that for each x1,bycontinuityofα,
α(B−{x1})=α(B)andδx1(B−{x1}) = 0. Consequently
Pr{X2B−{x1}=[α+δx1]
α(R)+1
(B−{x1})= α(B)
α(R)+1
Similarly for Pr{X3C−{x1,x
2}.
Next, let
R12,3={(x, x, x3):x=x3}
Then
α3((A×B×C)R12,3)
=α3{X1A, X2={X1},X
3C−{X1}}
3.2. DIRICHLET PROCESS ON M(R) 101
Because Pr{X2=x|X1=x}=[α+δx]/(α(R)+1)({x})=1/(α(R) + 1), again by
continuity of α, we have the preceding is equal to
α(AB)
α(R)
1
(α(R)+1)
α(C)
(α(R)+2)
Similarly, if
R13,2={(x, x2,x):x=x2}
then by exchangeability
αn(A×B×CR13,2)=αn(A×C×BR12,3)
=α(AC)
α(R)
1
(α(R)+1)
α(B)
(α(R)+2)
A similar expression holds for R1,23.
Let R123 ={(x, x, x)}. Then A×B×CR123 ={(x, x, x)xABC}.We
then have
αn(A×B×CR123)=2α(ABC)
α(R)
1
(α(R)+1)
α(B)
(α(R)+2)
where the factor 2 in the numerator arises from P(X3=x|X1=X2=x)=(δx+
δx)α(B)/(α(R)(α(R)+1)(α(R) + 2))(x).
Suppose that αhas a density ˜αwith respect to Lebesgue measure. Define a measure
λ
3as follows:
λ
3restricted to R1,2,3is the three-dimensional Lebesgue measure
λ
3restricted to R12,3is the two-dimensional Lebesgue measure obtained from R2via
the map (x, y)→ (x, x, y).
Define the restriction on R1,23 and R13,2similarly.
λ
3restricted to R12,3is the one-dimensional Lebesgue measure obtained from x→
(x, x, x).
Note that the function on R1,2,3defined by
˜α3(x1,x
2,x
3)= ˜α(x1α(x2α(x3)
α(R)(α(R)+1)(α(R)+2)
when viewed as a density with respect to λ
3restricted to R1,2,3gives, for any (A×
B×C), αn(A×B×CR1,2,3). Similarly the function on R12,3defined by
˜α3(x1,x
1,x
3)= ˜α(x1α(x3)
α(R)(α(R)+1)(α(R)+2)
102 3. DIRICHLET AND POLYA TREE PROCESS
corresponds to the density of α3with respect to λ
3restricted to R12,3and
˜α3(x1,x
1,x
1)= α(x1)
α(R)(α(R)+1)(α(R)+2)
corresponds to the density of α3with respect to λ
3restricted to R123.
The general case is similar but notationally cumbersome. For a partition {C1,...,C
k}
of {1,2,...,n},let
RC1,C2,...,Ck={(x1,x
2,...,x
n):xi=xjiff i, j Cmfor some m, 1mk}
The measure λ
nis defined by setting its restriction on RC1,C2,...,Ckto be the k-
dimensional Lebesgue measure. As before if we set I1=1and
Ij=1 if,x
j∈{x1,x
2,...,x
n}
0 otherwise.
the density of αnwith respect to λ
non RC1,C2,...,Ckis given by
˜αn(x1,x
2,...,x
n)=j˜α(xj)Ij(ej1)!
(α(R))[n](3.12)
where ej=#cj.
The verification follows essentially the same ideas, for example
αn(A1×A2×...×AnRC1,C2,...,Ck)= α(B1)α(B2)...α(Bk)
(α(R))[n]
where Bj=iCjAi.
Theorem 3.2.3. Dα{P:Pis discrete }=1.
Proof. Let ˜
E={(P, x):P{x}>0}. Note that Pis a discrete probability measure if
{x:(P,x)˜
E}P(x) = 1. We saw in the last chapter that ˜
Eis a measurable set. Let
˜
Ex={P:P{x}>0}˜
EP={x:P{x}>0}
Then
λα(˜
E)=Eλαλα(˜
E|X1
=Eλαλα(˜
EX1|X1=EλαDα+δX1(˜
EX1
=1
3.2. DIRICHLET PROCESS ON M(R) 103
Because P{x1}is beta with positive parameter α{x1}+1, P{x1}>0 with probability
1. Now
λα(˜
E)=Eλαλα(˜
E|P=EλαP(˜
EP)=1
so P(˜
EP) = 1 almost everywhere Dα.
The preceding proof is based on a presentation in Basu and Tiwari[10] . A variety of
proof for this interesting fact is available. See Blackwell & Mcqueen [25], and Blackwell
[23], Berk and Savage [17]. Another nice proof is due to Hjort [99]
3.2.2 The Sethuraman Construction
Sethuraman [148] introduced and elaborated on a useful and clever construction of
Dα, which provides insight into these processes and helps in simulation of the process.
As before let αbe a finite measure and ¯α=α/α(R).Let Ω be a probability space
with a probability µsuch that
θ1
2,... defined on Ω are i.i.d. beta(1(R))
Y1,Y
2,... are also defined on Ω such that they are i.i.d. ¯αand independent of the θis
Set p1=θ1and for n2, let pn=θnn1
1(1 θi). Easy computation shows that
1
pn= 1 almost surely. Now define an M(R) valued random variable on Ω by
P(ω, A)=
1
pn(ω)δYn(ω)(A) (3.13)
Because
1
pn= 1, the function ω→ P(ω, ·) takes values in M(X). It is not
hard to see that this map is also measurable. This random measure is a discrete
measure that puts weight pion Yi. Sethuraman showed that this random measure is
distributed as Dα. Formally, if Π is the distribution of ω→ P(ω,·) then Π = Dα.We
will establish this by showing that for every partition B1,B
2,...,B
kof Rby Borel
sets (P(ω, B1),P(ω,B2),...,P(ω, Bk)) is distributed as D(α(B1),...,α(Bk)).
Denote by δk
Yithe element of Skgiven by (IB1(Yi),I
B2(Yi),...,I
Bk(Yi)). Then for
each ω,(P(ω, B1),P(ω, B2),...,P(ω, Bk)) can be written as
1pi(ω)δk
Yi(ω).
Let Pbe an Skvalued random variable, independent of the Ysandθs, and dis-
tributed as D(α(B1),...,α(Bk)).
104 3. DIRICHLET AND POLYA TREE PROCESS
Consider the Skvalued random variable
P1=p1δk
Y1+(1p1)P
where Y1Bi,δk
Yiis the vector with a 1 in the ith coordinate and 0 elsewhere. Hence
by property 4 from Section 3.1, given Y1Bi,P1is distributed as a Dirichlet with
parameter (α(B1),...,α(Bi)+1,...,α(Bk)). Since µ(Y1Bi)=α(Bi), by property
8inSection3.1,P1is distributed as D(α(B1),...,α(Bk)).
It follows by easy induction that for all n,1n
1pi=n
1(1 θi). Using this fact,
a bit of algebra gives
n
1
piδk
Yi+(1
n
1
pi)P
=
n1
1
piδk
Yi+(1
n1
1
pi)(θnδk
Yn+(1θn)P)
Because our earlier argument showed that θnδk
Yn+(1θn)Phas the same distribution
as P, a simple induction argument shows that, for all n,
n
1
piδk
Yi+(1
n
1
pi)P
is distributed as D(α(B1),...,α(Bk)).Letting n→∞and observing that (1n
1pi)
goes to 0, we get the result.
Note that we have not assumed the existence of a Dαprior. Because P(ω, ·)is
M(X) valued, the argument also shows the existence of the Dirichlet prior.
3.2.3 Support of Dα
We begin by recalling that M(R) under the weak topology is a complete separable
metric space, and hence for any probability measure Π on M(R) the support—the
smallest closed set of measure 1— exists. Note that support is not meaningful if we
consider the total variation metric or setwise convergence.
Theorem 3.2.4. Let αbe a finite measure on Rand let Ebe the support of α.Then
Mα={P:support of PE}
is the weak support of Dα
3.2. DIRICHLET PROCESS ON M(R) 105
Proof. Mαis a closed set by the Portmanteau theorem, since Eis closed and if PnP
then P(E)lim supnPn(E). Further, because P(E)isbeta(α(R),0), Dα(Mα)=1.
Let P0belong to Mαand let Ube a neighborhood of P0. Our theorem will be
proved if we show that Dα(U)>0.
Choose points a0<a
1< ... < a
T1<a
Tand let Wj=(aj,a
j+1]Eand J=
{j:α(Wj)>0}.Then depending on whether α(jJWj)=α(R)orα(jJWj)<
α(R), (P(Wj):jJ)orP(Wj):jJ, 1jJP(Wj)has a finite-dimensional
Dirichlet distribution with all parameters positive. And in either case, for any η>0,
Dα{PM(R):|P(Wj)P0(Wj)|:jJ}>0
By Propositon 2.5.2 for small enough δ,Ucontains a set of the above form. Hence
Dα(U)>0.
3.2.4 Convergence Properties of Dα
Many of the theorems in this section are adapted from Sethuraman and Tiwari [149].
Because under Dα,E(P)=¯α, Theorem 2.5.1 in Chapter 2 immediately yields the
following.
Theorem 3.2.5. Let {αt:tT}be a family of finite measures on R. Then the
family {Dαt:tT}is tight iff {¯αt:tT}is tight.
Theorem 3.2.6. Suppose {αm}are finite measures on Rsuch that ¯αm¯α
weakly.
(i) If αm(R)α(R)where 0(R)<, then DαmDαweakly.
(ii) If αm(R)0. Then Dαmconverges weakly to D, where
D{P:Pis degenerate}=1
(iii) If α(R)→∞then Dαmconverges weakly to δα.
Proof. By Theorem 3.2.5, {Dαm}is tight and hence any subsequence has a further
subsequence that converges to, say, D.
(i) We will argue that the limit Dis Dαand is the same for all subsequences. By
(iii) of Theorem 2.5.2 and (a) of property 11 of the finite-dimensional Dirichlet
(see Section 3.1) it follows that D=Dα.
106 3. DIRICHLET AND POLYA TREE PROCESS
(ii) From property 11 for any ¯αcontinuity set A,D{P:P(A)=0,or 1}=1.
By using a countable collection of ¯αcontinuity sets that generate the Borel
σ-algebra, the result follows.
(iii) (iii) Recall that E(P(A)) = ¯α(A). Because αn(R)→∞,Var(P(A)) 0for
all A. Hence P(A) converges in probability to ¯α(A). This holds for any finite
collection of sets. The result now follows as in the preceding case.
As a consequence of the theorem we have the following results.
Theorem 3.2.7. (i) Let αbe a finite measure. Then for each P0the posterior
Dα+n
1δXiδP0weakly, almost surely P0.
(ii) As α(R)goes to 0, the posterior converges weakly to Dn
1δXi.
Proof. Because a.e. P0,α+n
1δXi=αnsatisfies ¯αnP0and αn→∞, (iii) of
Theorem 3.2.6 yields the result.
Remark 3.2.3.Note that posterior consistency holds for all P0, not necessarily in
theweaksupportofDα. This is possible because the version of the posterior chosen
behaves very nicely. This version is not unique even for P0in the weak support of Dα.
One sufficient condition for uniqueness up to P0null sets is that P0be dominated by
α.
Remark 3.2.4.Assertion (ii) has been taken as a justification of the use of Dn
1δXi
as a noninformative (completely nonsubjective in the terminology of Chapter 1) pos-
terior. Note that Theorem 3.2.6 shows that the corresponding prior is far from a
noninformative prior.
The posterior Dn
1δXihas been considered as a sort of Bayesian bootstrap by Rubin
[142]. For an interesting discussion of the Bayesian bootstrap and Efron’s bootstrap,
see Schervish [144].
We would like to remark that all the theorems in this section go through if Ris
replaced by any complete separable metric space. The existence aspect of the Dirichlet
process can be handled via the famous Borel isomorphism theorem, which says that
there is a 1-1, bimeasurable function form Ronto X. The proofs of other results
require only trivial modifications.
3.2. DIRICHLET PROCESS ON M(R) 107
3.2.5 Elicitation and Some Applications
We have seen that with a Dαprior the posterior given X1,X
2,...,X
nis Dα+δXi.
As α(R)goesto0,(α+δXi)/(α(R)+n) converges to δXi/n, the empirical
distribution, further α(R)+nconverges to n. Hence as observed in the last section
Dα+δXiconverges weakly to DδXi. In particular if the X1,X
2,...,X
nare distinct
then DδXiis just the uniform distribution on the n-dimensional probability simplex
S
n. This phenomenon suggests an interpretation of α(R) goes to 0, as leading to
a “noninformative”prior. In this section we investigate a few examples, all taken
from Ferguson [61], where as α(R) goes to 0, the Bayes procedure converges to the
corresponding frequentist nonparametric method.
While these examples corroborate the feeling that α(R) goes to 0 leads to a non-
informative prior, (ii) of Theorem 3.2.6 points out the need to be careful with such
an interpretation. As α(R) goes to 0 the posterior leads to an intuitive noninforma-
tive limit. However the corresponding prior cannot be considered noninformative. We
believe these applications are justified in the completely non-parametric context of
making inference about Pbecause the Dirichlet is conjugate in that setting. Similar
assessments of conjugate prior in finite-dimensional problems is well known.
However, the Dirichlet is often used in problems where it is not a conjugate prior.
In such problems the interpretation of α(R) as a sort of sample size or a measure of
prior variability is of doubtful validity. See Newton et al. [136] in this connection.
Estimation of F . Suppose that we want to estimate the unknown distribution
function under the loss L(F, G)=(F(t)G(t))2dt. If Π is a prior on M(R),
equivalently on the space of distribution functions Fon R,itiswellknownthat
the no-sample Bayes estimate is given by ˆ
FΠ(t)=F(t)dΠ(F). If Π is Dαthen
because the posterior is Dα+δXi, the Bayes estimate of Fgiven X1,X
2,...,X
nis
(α+δXi)(−∞,t]/(α(R)+n).Setting Fnas the empirical distribution, we rewrite
this as
α(R)
α(R)+n¯α(−∞,t]+ n
α(R)+nFn
which is a convex combination of the prior guess and a frequentist nonparametric
estimate.
This property makes it clear how αis to be chosen. If the prior guess of the distri-
bution of Xis, say, N(0,1) then that is ¯α. The value of α(R) determines how certain
one feels about the prior guess. This interpretation of α(R) as a measure of one’s faith
in a prior guess is endorsed by the fact that if α(R)→∞then the prior goes to δ¯α.
108 3. DIRICHLET AND POLYA TREE PROCESS
If α(R)0 the Bayes estimate of Pconverges to the empirical distribution and
the posterior converges weakly to DnFn. Since the prior has no role any more, DnFn
is called a noninformative posterior and Fnthe corresponding noninformative Bayes
estimate. These intuitive ideas are helpful in calibrating α(R) as a cost of sample size
and α(R) = 1 is sometimes taken as a prior with low information.
Estimation of mean of F. The problem here is to estimate the mean µFof the
unknown distribution function F, the loss function being the usual squared error
loss, i.e., L(F, a)=(µFa)2. If Π is a prior on Fsuch that ˆ
FΠhas finite mean, then
the Bayes estimate ˆµis µFdΠ(F) and with probability 1 this is the same as the
mean of ˆ
FΠ. This follows because
xdF Π(dF )
= lim xI[0,n]dF Π(dF )
=xd ˆ
FΠ(x)=xd ˆ
FΠ(x)<
Thus if αhas finite mean then
Dα{F:Fhas finite mean}=1
and given X1,X
2,...,X
n, the Bayes estimate of µFis the mean of α+δXi. This
is easily seen to be a convex combination of the mean of ¯αand ¯
Xand goes to ¯
Xas
α(R)0.
Estimation of median of F. We next turn to the estimation of the median of the
unknown distribution F. For any F∈F,tis a median if
F(t)1
2F(t)
If αhas support [K1,K
2],−∞ ≤ K1<K
2≤∞then with Dαprobability 1, F
has unique median. If t1<t
2are both medians of F, then for any rational a, b;t1<
a<b<t
2we have F(a)=F(b).On the other hand Dα{F:F(a)=F(b)}=0.By
considering all rationals a, b in the interval (K1,K
2) we have the result.
In the context of estimating the median the absolute deviation loss is more natural
and convenient than the squared error loss. Formally, L(F, m)=|mFm|. If Π is a
prior on Fthen the “no-sample” Bayes estimate is just the median of the distribution
of mF.
3.2. DIRICHLET PROCESS ON M(R) 109
If the prior is Dαthen any median of mFis also a median of ¯α. This may be seen
as follows: tis a median of mFiff
Dα{mF<t}≤1
2Dα{mFt}
Now mFtiff F(t)1/2. Because F(t)isbeta(α(−∞,t](t, ), Dα{F(t)
1/2}≥1/2iα(t, )(R)1/2 (see exercise 11.0.2 ). On the other hand mF<t
iff F(t)>1/2 . This yields α(−∞,t)(R)1/2andsuchatis a median of ¯α.
Consequently, the Bayes estimate of the median given X1,X
2,...,X
nis a median of
(α+δXi)/(α(R)+n)).If ¯αis continuous then the median of (α+δXi)/(α(R)+n))
is unique. As α(R) goes to 0 the limit points of the Bayes estimates of mFare medians
of the empirical distribution.
Testing for median of F. Consider the problem of testing the hypotheses that the
median of Fis less than or equal to 0 against the alternative that the median is
greater than 0. If we view this as a decision problem with 0-1 loss, for a Dαprior on
Fthe Bayes rule is
decide median is 0ifDα{F(0) >1
2}>1
2
Because Dα{F(0) >1/2}=1/2 iff the two parameters are equal this reduces to
“accept the hypotheses that the median is 0 iff
α(−∞,0]
α(R)>1
2

Given X1,X
2,...,X
nthis condition becomes “accept the hypotheses that the me-
dian is 0 iff
Wn>1
2n+α(R)1
2¯α(−∞,0)
where Wnis the number Xi0.
Estimation of P(XY). Suppose that X1,...,X
nare i.i.d. Fand Y1,...,Y
m
are independent of the Xis and are i.i.d G. We want to estimate P(X1Y1)=
F(t)dG(t) under squared error loss. Suppose that the prior for (F, G)isofthe
form Π1×Π2. The Bayes estimate is then ˆ
FΠ1(t)dˆ
FΠ2(dt), where for i=1,2,
ˆ
FΠi(t) is the distribution function F(t)dΠi(t).
If the prior is Dαthen the Bayes estimate given X1,X
2,...,X
nbecomes
(α1+δXi)
α1(R)+n(−∞,t]dα2+δYi
α2(R)+n(dt)
110 3. DIRICHLET AND POLYA TREE PROCESS
This can be written as
p1,np2,m ¯α1(−∞,t)d¯α2(t)+p1,n(1 p2,m)1
n
m
1
¯α1(−∞,Y
j]
+(1p1,n)p2,m )1
m
n
1
(1 ¯α2(−∞,X
i)) + (1 p1,n)(1 p2,m )1
mnU
where p1,n =α1(R)/(α1(R)+n), p2,m =α2(R)/(α2(R)+m)andU, is the number of
pairs for which XiYj, i.e.,
U=
n
1
m
1
I(,Yj](Xi).
As α1(R)andα2(R) go to 0, the nonparametric estimate converges to (mn)1U,
which is the familiar Mann-Whitney statistic.
3.2.6 Mutual Singularity of Dirichlet Priors
Asbefore,wehaveaDαprior on M(R), given P,X1,X
2,...,X
nis i.i.d. P,andλα
is the joint distribution of Pand X1,X
2,... . The main result in this section is ‘ If
α1and α2are two nonatomic measures on R, then λα1and λα2are mutually singular
and hence so are Dα1and Dα2’. Mutual singularity of all priors in a family being used
is undesirable. It shows that the family is too small to be flexible enough to represent
prior opinion, which is based on information and judgment and is independent of
the data. To clarify, consider a simple example of this sort. Let X1,X
2,...,X
nbe
i.i.d. N(θ, 1) and suppose we are allowed only N(µ, 1) priors and the only values of µ
allowed are finite and widely separated as 0 and 10. Then for a large nif we get ¯
X,
it is clear that with high probability the data can be reconciled with only one prior
in the family. The result proved next is of this kind but stronger. It follows from a
curious result of Korwar and Hollander [116], who show that the prior Dαcan be
estimated consistently from X1,X
2,... . We begin with their result.
Lemma 3.2.1. Define τ1
2,... and Y1,Y
2,... by τ1=1and τn=kif the number
of distinct elements in {X1,X
2,...,X
k}is n and the number of distinct elements in
{X1,X
2,...,X
k1}is n1. In other words, τnis the number of observations needed
to get ndistinct elements.
3.2. DIRICHLET PROCESS ON M(R) 111
Set Yn=Xτnand set
Dn=1if Xn∈{X1,X
2,...,X
n1}
0otherwise
Note that n
1Diis the number of distinct units in the first nobservations. If αis
nonatomic then
(i) for any Borel set U, 1/n n
1δYn(U)¯α(U)a.e. λα;
(ii) 1/log nn
1(DiE(Di)) 0a.e. λα; and
(iii) 1/log nn
1E(Di)α(X).
Proof. Note that τi<a.e.
To prove (i) it is enough to show that Y1,Y
2,... are i.i.d. ¯α.
We start with a finer conditioning than Y1,...,Y
n1. Consider for t1<t
2,... <
tn1,t
n,,
PrYnA|X1,X
2,...X
tn1
n1=tn1
n=tn
=PrYn∈|X1,...X
tn1
n1=tn1
ntn
Prτn=tn|X1,...X
tn1
n1=tn1
ntn(3.14)
After cancelling out α(X)+tn1 from the numerator and denominator this becomes
α+tn1
1δXi(A−{Y1,...,Y
n})
α+tn1
1δXi(X−{Y1,...,Y
n})
and by nonatomicity this reduces to ¯α.ThusY1,Y
2,... are i.i.d and (i) follows.
For the second assertion, it is easy to see that the Dnare independent with λα(Dn=
1) = α(R)/(α(R)+n1).
By Kolomogorov’s SLLN for independent random variables
1
log n
n
1
(DiE(Di)) 0 a.s. λαif
1
V(Di)
(log i)2<
Here V(Di)=α(R)(i1)/((α(R)+i1)2) and the preceding condition holds.
112 3. DIRICHLET AND POLYA TREE PROCESS
Moreover
1
log n
n
1
E(Di)= 1
log n
n
1
α(R)
α(R)+i1α(R)
because n
2
α(R)
α(R)+i1=
n
2
α(R)
i1α(R)
n
2
α(R)
α(R)+i1
1
i1
and as n→∞, the second term on the right converges, so that
n
2
α(R)
α(R)+i1=α(R) [log n+O(1)]
Theorem 3.2.8. If α1and α2are two nonatomic measures on R,α1=α2, then
λα1and λα2are mutually singular and hence so are Dα1and Dα2.
Proof. Let Ube a Borel set such that α1(U)=α2(U), and set
E=ω:1
n
n
1
δYi(U)¯α1(U)and 1
log n
n
1
Diα1(R)
By Lemma 3.2.1, λα1(E)=1andλα2(E)=0.
Further, because ER, we also have
λα1(E)=P(E)Dα1(dP )=1
so that, Dα1{P:P(E)=1}= 1. Similarly Dα2{P:P(E)=1}=0.
Remark 3.2.5.To handle the general case, consider the decomposition of α1
2into
αi=αi1+αi2,whereαi1is the nonatomic part of αiand αi2is the discrete part.
Let M1,M
2be the support of α12 and α22 . Then if α11 =α21 but M1=M2, then
also λα1and λα2are singular.
If α11 =α21 and M1=M2;λα1and λα2may not be orthogonal. Sethuraman gives
necessary and sufficient condition for the orthogonality using Kakutani’s well-known
criteria based on Hellinger distance.
3.2. DIRICHLET PROCESS ON M(R) 113
Remark 3.2.6.The Theorem 3.2.8 shows that Dirichlet process used as priors dis-
play a curious behavior. Suppose αis a continuous measure, then for every sample
sequence X1,X
2,... the continuous part of the successive posterior base measures
changes from α/α(X)+nto α/(α(X)+n+1) and hence the sequence of the posteriors
are mutually singular.
3.2.7 Mixtures of Dirichlet Process
Dirichlet process requires specification of the base measure α, which itself can be
viewed as consisting of the prior expectation ¯αand the strength of the prior belief
α(R). In order to achieve greater flexibility Antoniak [4] proposed mixtures of Dirich-
let process which arise by considering a family αθof base measures indexed by a
hyperparameter θand a prior for the parameter θ.
Because the Dirichlet processes sit on discrete measures, so does any mixture of
these and hence they are unsuitable as priors for densities. For the same reason, it is
also inappropriate for the parametric part of a semiparametric problem. For example,
Diaconis and Freedman [46] show that the Dirichlet prior in a location parameter
problem can lead to pathologies as well as inconsistency of the posterior for even
reasonable “true” densities.
Usually one will not have as a prior a completely specified ¯αbut an αθ—like
N(η, σ2)—with θ=(η, σ2) unknown but having a prior µso that the distribution
of Pgiven θis Dαθ. Suppose that X1,X
2,...,X
nare—given P—i.i.d P. Because
given θ,X1,X
2,...,X
n;Pis distributed as Dαθ+δXi, the distribution of Pgiven
X1,X
2,...,X
nis obtained by integrating Dαθ+δXiwith the conditional distribution
of θgiven X1,X
2,...,X
n.
For simplicity let Θ= Rlet µbe the prior on Θ with density ˜µ;foreachθ,αθ
is a finite measure on Rwith density ˜αθwith respect to Lebesgue measure. Using
equation (3.12) the joint density of θand X1,X
2,...,X
nis
˜µ(θ)j˜αθ(xj)Ij(ej1)!
(αθ(R))[n](3.15)
The conditional density of θgiven X1,X
2,...,X
nis thus
C(x1,x
2,...,x
nµ(θ)j˜αθ(xj)Ij(ej1)!
(αθ(R))[n](3.16)
114 3. DIRICHLET AND POLYA TREE PROCESS
If the “true” distribution P0is continuous then with probability 1, the Ijsareall
equal to 1 and the conditional density becomes
C(x1,x
2,...,x
nµ(θ)n
1˜α(θ)(xj)
(α(θ)(R))[n](3.17)
Newton et al. [137] provides an interesting heuristic approximation to Bayes esti-
mates in this context.
3.3 Polya Tree Process
Polya tree process are a large class of priors that include Dirichlet processes and
provide a flexible framework for Bayesian analysis of nonparametric problems. Like
the Dirichlet, Polya tree priors form a conjugate class with a tractable expression for
the posterior. However they differ from Dirichlet process in two important aspects.
The Polya tree process are determined by a large collection of parameters and thus
provide means to incorporate a wide range of beliefs. Further, by suitably choosing the
parameters, the Polya tree priors can be made to sit on continuous, even absolutely
continuous, distributions.
Polya tree priors were explicitly constructed by Ferguson [62] as a special case of tail
free processes discussed in the Chapter 2. A formal mathematical development using
De Finetti’s theorem is given in Mauldin et al. [133], Lavine [118, 119], indicates
the construction and discusses the choice of various components that go into the
construction of Polya tree priors. Here we briefly explore the properties of Polya tree
priors. The basic references for these are Ferguson [62], Mauldin et al. [133] and
Lavine[118, 119].
3.3.1 The Finite Case
The construction in this section is a special case of the discussion in Section 2.3.1.
To briefly recall, let X={x1,x
2,...,x
2k}.LetB0={x1,x
2,...,x
2k1}and B1=
{x2k1,...,x
2k}be a partition of X. For any jlet Ejstand for all sequences of 0s and
1s of length jand E
j=ijEi.Foreachjk, we consider a partition {B:Ej}
of Xsuch that B0,B1is a partition of B.IfEk, clearly Bis a singleton.
Definition 3.3.1. ApriorΠonM(X)issaidtobeaPolya tree prior with pa-
rameter α={α:E
k}if α0and
(i) P(B0|B):E
k1are all independent and
3.3. POLYA TREE PROCESS 115
(ii) P(B0|B)isabeta(α0
1) random variable
when =take, P(B0|B)tobeP(B0). (i) and (ii) uniquely determine a Π, for
if x=B1,2,...,k, then
P(x)=P(B12...k)=
i:i=0
PB12...i10|B12...i1
i:i=1 PB12...i11|B12,...i1
(3.18)
Because
PB1,2,...,i11|B1,2,...,i1=1PB1,2,...,i10|B1,2,...,i1
P(B0|B):E
k1determines the distribution of P(x).
Suppose Π is a Polya tree prior on M(X) and given P,Xis distributed as P.For
any xlet 1(x)=0ifxB0and 1 otherwise, and let k(x)=0ifxB1(x)...k1(x)0
and 1 otherwise. The joint density of Pand Xis given, up to a constant, by
E
k
[P(B0|B)]α01[1 P(B0|B)]α11
i:i(x)=0
P(B0|B)
i:i(x)=1
(1 P(B0|B))
=
E
k
[P(B0|B)]α
01[1 P(B0|B)]α
11
where
α
=1+αif xB
αotherwise
We summarize this discussion as the following theorem
Theorem 3.3.1. If the prior on M(X)is PT(α)where α={α:E
k}and if
given P,X1,X
2,...,X
nare i.i.d. P, then
(i) the posterior distribution on M(X)given X1,X
2,...,X
nis a Polya tree with
parameters {α,X1,X2,...,Xn:E
k}where
α,X1,X2,...,Xn=α+
n
1
IB(Xi)
(ii) the marginal distribution of X1is given by
Pr{X=x}=
k
1
α1(x)2(x)...i(x)
α1(x)2(x)...i1(x)0 +α1(x)2(x)...i1(x)1
116 3. DIRICHLET AND POLYA TREE PROCESS
and the predictive distribution of Xn+1 given X1,X
2,...,X
nis of the same form
with αreplaced by α,X1,X2,...,Xn.
To prove (ii), Pr(X=x)=P(x)dΠ(P) and is the integral of the terms in (3.18).
The components in the product are independent beta random variables and a direct
computation yields the result.
The distribution of X1,X
2,...,X
ndefined via Theorem 3.3.1 can be thought of as
a Polya urn scheme, though not as easy to describe as that for a Dirichlet. This is
done in Mauldin et al. (92) and we refer the interested reader to their paper.
Remark 3.3.1.The assumption that Xcontains 2kelements and that partitions
are into two halves is not really necessary. All we need is Ti={B:Ei}for
i=1,2,...,k be a nested sequence of partitions. The equal halves can be relaxed by
allowing empty sets to be in the partition and setting the corresponding parameter
to be 0.
Remark 3.3.2.The form of the posterior distribution shows that Xand the vector
{P(B0|B):E
i}are conditionally independent given {IB:Ei}.
3.3.2 X=R
Motivated by the Xis finite case, we define a Polya tree prior on M(R) as follows:
Recall that Ejis the set of all sequences of 0s and 1s of length jand E=jEjis
all sequences of 0s and 1s of finite length. Also Eis the set of all infinite sequences
of 0s and 1s.
Definition 3.3.2. Fo r each n,letTn={B:En}be a partition of Rsuch that
for all in E,B0,B1is a partition of B.
Let α={α:E}be a set of nonnegative real numbers.
ApriorΠonM(R) is said to be a Polya tree (with respect to the partition T=
{Tn}n1) with parameter α, denoted by PT(α), if under Π
1. {P(B0|B):E}are a set of independent random variables
2. for all E,P(B0|B)isbeta(α0
1).
The first question, of course, is do such priors exist? We have already discussed this
in Chapter 2.
Theorem 3.3.2. A Polya tree with parameter α={α:E}exists if for all
Eα0
α0+α1 α00
α00 +α01  α000
α000 +α001 ...= 0 (3.19)
3.3. POLYA TREE PROCESS 117
and α10
α10 +α11  α110
α110 +α111 ...=0
Proof. This is an immediate consequence of Theorem 2.3.5 . We noted there that if
we set Y=P(B0),Y
=P(B0|B) then {Y:E}induces a measure on M(R)- if
it satisfies the continuity condition
YY0Y00 ...= 0 almost surely
Because n
1Y0...0is decreasing in nand bounded by 0 and 1, this happens iff
E(n
1Y0...0)0. The Yare independent beta random variables and the condition
translates precisely to (3.19).
Marginal distribution of X Let PPT(α) and given P,Xbe distributed as Pand
let mbe the marginal distribution of X. Because the finite union of sets in
nTnis an algebra it is enough to calculate m(B) for all in E.
If =12...
k,
m(XB12...,k)=E
ik1
PB12...i1i|B12...i1
=
{i:i=0,ik1}
Y12...i1
{i:i=1,ik1}
(1 Y12...i1) (3.20)
The factors inside the expectation are independent beta random variables, and
hence we have
=
k
1
α12...i
α12...i0+α12...i1
Theorem 3.3.3. Suppose that Xis distributed as Pand Pitself has a PT(α)prior.
Then the posterior distribution of Pgiven Xis PT(αX), where αX=α+IB(X).
Based on the corresponding result for the finite case, it is reasonable to expect the
posterior to be PT(αX). In fact the posterior distribution of {P(B):En}given X
is same as the posterior of {P(B):En}given {IB(X):En}. The calculation
in the finite case done in the last section shows that this posterior distribution is a
Polya tree with parameters {α,X =α+IB(X):∈∪
n
1Ei}. The proof is completed
118 3. DIRICHLET AND POLYA TREE PROCESS
by recognizing that posterior distributions of {P(B):En:n1}determine the
posterior distribution Π(·|X).
Repeatedly applying the last theorem we get the following.
Theorem 3.3.4. If PT(α)is the prior on M(R)and given P;ifX1,X
2,...,X
n
are i.i.d. P, then the posterior distribution of Pgiven X1,X
2,...,X
nis a polya tree
with parameter αX1,X2,...,Xnwhere
α,X1,X2,...,Xn=α+
n
1
IB(Xi)
Predictive distribution and Bayes estimate
It is immediate from the last two properties that if X1,X
2,...,X
nare i.i.d. Pgiven P,
and Pis has PT(α) prior, then the predictive distribution of Xn+1 given X1,X
2,...,X
n
is
P{Xn+1 B12...k}
α1+n
1IB1(Xi)
α0+α1+n
α12+n
1IB12(Xi)
α10+α11+n1...
α1...k+n
1IB1...k(Xi)
α1...k10+α1...k11+n1...k1
where nis the number of Xis falling in B.
In view of the calculations done so far, ˆ
P=E(P|X1,X
2,...,X
n) is the measure
satisfying
ˆ
P(B12...k)=
k
1
α1...j+n
1IB1...j(Xi)
α1...j0+α1...j1+n1...j1
Like the Dirichlet, here also the posterior is consistent. Formally, we have the
following theorem.
Theorem 3.3.5. Let Pbe distributed as PT(α)and given P, let X1,X
2,...,X
n
be i.i.d. P. Then for any P0,asn→∞, the posterior
PT(αX1,X2,...,Xn)δP0weakly a.s P0
.
3.3. POLYA TREE PROCESS 119
The result would follow as a particular case of a more general theorem proved later
for tail free priors. However one can give proof along the same lines as that for the
Dirichlet process and follows from the following lemmas.
Lemma 3.3.1. Let ¯αm=Eαm(P), where E¯αmis the expectation taken under PT(αm).
If {¯αm:mM}is tight, then so is {PT(αm):mM}.
Proof. Easily follows from corollary to Theorem 2.5.1
Lemma 3.3.2. If ¯αmP0and if for all E,
L(P(B)|PT(αm)) →L(P(B)|δP0))
then PT(αm)converges weakly to δP0.
Proof. The tightness of αmensures that PT(αm) has a limit point. This limit point
can be identified as δP0using calculations similar to Theorem 3.2.6.
To prove the theorem, let Ω = {ω:1/n n
1IB(Xi)P0(B) for all E}.
P0(Ω) = 1, and further for each ωΩitiseasilyveriedthat¯αm=α,X1,X2,...,Xn(ω)
satisfies the assumptions of the Lemma 3.3.2.
Support of PT(α)
Our next theorem is on the topological support of PT(α). Recall that the support is
the smallest closed set of PT(α) measure 1. Here we assume that {a:E}is a
dense set of numbers and induce a nested sequence of partitions.
Theorem 3.3.6. PT(α)has all of M(R)as support iff α>0for all E.
Proof. The proof follows along the same lines as for the Dirichlet (see Theorem 3.2.4).
Mauldin et al. [133] show that, unlike the Dirichlet, we can find αwhich will ensure
that PT(α) sits on the space of continuous measures. Because Polya tree priors are tail
free, we can use Theorem 2.4.3 to show that by suitably choosing the partitions and
parameters the Polya tree can be made to sit on, not just continuous distributions but
even absolutely continuous distributions. The theorem is an application of Theorem
2.4.3 to Polya tree processes. The proof is just a verification of the conditions of
Theorem 2.4.3.
120 3. DIRICHLET AND POLYA TREE PROCESS
Theorem 3.3.7. Let λbe a continuous probability measure on Rwith distribution
function λ. Define B12...i=λ1(i
2i,i
2i+1
2i).Ifα12...i=ai, and a1
i<
then PT(α)(L(λ)) = 1.
In particular when α12...i=i2, the polya tree gives mass 1 to probabilities that are
absolutely continuous with respect to λ.
A few concluding remarks about Polya tree priors: The Polya tree prior depends on
the underlying partition T={Tn}n1and is tail free with respect to this partition.
In fact a prior which is tail-free with respect to every sequence of partitions is, except
for trivial cases, a Dirichlet process [ Doksum [48]].
We have seen that Polya tree priors, unlike the Dirichlet, can be made to sit on
densities. One unpleasant feature of this construction is that absolute continuity of P
is ensured by controlling the variability of Paround the chosen absolutely continuous
λ. We have seen that for the Dirichlet the prior and posterior are mutually singular.
Dragichi and Ramamoorthi [56] have shown that if the parameters are as in the
Theorem 3.3.7, then the posterior given distinct observations is absolutely continuous
with respect to the prior.
Lavine suggests that, if the prior expectation is F, then the partitions of the form
F1(i/2i,i/2i+1/2i) would be appropriate. For then the ratios
α12...i10/(α12...i10+α12...i11)=1/2
and this would ensure that the marginal of Xis F, which may then be treated as
a “prior guess” of the “mean” of the random P. As to the magnitude of the αs
(as distinct from their ratios), their role is somewhat similar to that of α(R)forthe
Dirichlet, except that the availability of more parameters introduces more flexibility.
It is expected that for moderate ka choice of the magnitude would be on the basis
of prior belief and for higher k, a conventional choice would be made. A conventional
choice might be to ensure that the prior sits on densities. For example, one may take
α1,...,k=1/k2. Lavine [118] has expressed well what the main considerations are; we
refer the reader to his paper.
4
Consistency Theorems
4.1 Introduction
We briefly discussed consistency of the posterior in Chapters 1, 2 and 3. To recall,
our setup consists of:
a (unknown) parameter θthat lies in a parameter space Θ;
a prior distribution Π for θ, equivalently, a probability measure on Θ; and
X1,X
2,...,X
n, which are given θ, i.i.d. with common distribution Pθ.
Our interest centers on the consistency of the posterior distribution, and as dis-
cussed in Chapter 1, this is a requirement that if indeed θ0is the “true ” distribution
of X1,X
2,...,X
nthen the posterior should converge to δθ0almost surely. In other
words, as n→∞, the posterior probability of every neighborhood of θ0should go to
1withPθ0probability 1.
We noted that posterior consistency can be viewed as
a sort of frequentist validation of the Bayesian method;
merging of posteriors arising from two different priors; and
as an expression of “data eventually swamps the prior”.
In Chapter 1 we saw that when Θ is a subset of a finite-dimensional Euclidean space
and if θ→ Pθis smooth, then for smooth priors the posterior is consistent in the
122 4. CONSISTENCY THEOREMS
support of the prior. In Chapter 1 we also saw an example showing that inconsistency
cannot be ruled even when Θ = R.
The example in Chapter 1 may be dismissed as a technical pathology, but in the
nonparametric case inconsistency can scarcely be called pathological. This has led
some to question the role of consistency in Bayesian inference. The argument is that it
is well known that the prior and the posterior given by Bayes theorem are imperatives
arising out of axioms of rational behavior–and since we are already rational why
worry about one more criteria? In other words inconsistency does not warrant the
abandonment of a prior. We would argue that in the nonparametric context typically
one would have many priors that would be consistent with one’s prior beliefs, and it
does make sense to choose among these priors that are consistent at a large number
of parameter values, among which we expect the true parameter to lie.
In the nonparametric context Θ is M(R) or large subset of it. M(R) has various
kinds of convergence, namely, total variation, setwise , weak, etc. Each of these leads to
a corresponding notion of consistency. The issue of consistency has been approached
from different point of view by [143]. We begin with a formal definition of these.
4.2 Preliminaries
Definition 4.2.1. {Π(·|X1,X
2,...,X
n)}is said to be strongly or L1-consistent at
P0ifthereisaΩ
0Ω such that P
0(Ω0) = 1 and for ω0
Π(U|X1,X
2,...,X
n)1
for all total variation neighborhoods of P0.
Definition 4.2.2. {Π(·|X1,X
2,...,X
n)}is said to be weakly consistent at P0if
thereisaΩ
0Ω such that P
0(Ω0) = 1 and for ω0
Π(U|X1,X
2,...,X
n)1
for all weak neighborhoods of P0.
Before we proceed to the study of consistency, we note that Bayes estimates inherit
the convergence property of the posterior. Recall that we denote X1,X
2,...,X
nby
Xn.
Proposition 4.2.1. Define the Bayes estimate ˆ
Pn(·|Xn)to be the probability mea-
sure ˆ
Pn(A|Xn)=P(A(dP |X1,X
2,...,X
n)=E(P(A)|Xn).Then
4.2. PRELIMINARIES 123
1. if {Π(·|X1,X
2,...,X
n)}is strongly consistent at P0, then || ˆ
PnP0|| → 0, almost
surely P0.
2. If {Π(·|X1,X
2,...,X
n)}is weakly consistent at P0, then ˆ
PnP0weakly, almost
surely P0.
Proof. By Jensen’s inequality
|| ˆ
PnP0|| ≤ ||PP0|| Π(dP |Xn)
=U||PP0|| Π(dP |Xn)+Uc||PP0|| Π(dP |Xn)
and if U={P:||PP0|| <}then
Π(U|Xn)+Π(Uc|Xn)+(1)
as n→∞.
A similar argument works for assertion (ii) by considering |fd
ˆ
PnfdP
0|,f
bounded continuous.
It is worth pointing out that the conventional Bayes estimate considered earlier is
a Bayes estimate only for the squared error loss for P(A). The Bayes estimate ˜
P(A)
for, say, the absolute deviation loss, will be the posterior median. Unfortunately, the
˜
P(·) so obtained will not be a probability measure.
As far as the prior is on M(R), weak consistency is intimately related to the con-
sistency of the Bayes estimates of F(t).
Theorem 4.2.1. Suppose Πis a prior on F, the space of distribution functions on
R, and X1,X
2,...,X
nbe given F; i.i.d. F. Then the posterior is weakly consistent
at F0, iff there is a dense subset Qof Rsuch that for tin Q
(i) lim
n→∞ E(F(t)|Xn)=F0(t); and
(ii) lim
n→∞ V(F(t)|Xn)=0.
Proof. If (i) and (ii) hold, it follows from a simple use of Chebychev’s inequality that
for every tin Q,
Π((F(t)F0(t)|)|Xn)1a.sF0
124 4. CONSISTENCY THEOREMS
and hence it follows that
Π((F(ti)F0(ti)||Xn)for1ik)1a.sF0
By Proposition 2.5.2, any weak neighborhood Uof F0contains a set of the above
form for a suitable δ. Hence Π(U|Xn)1 a.e. F0.
On the other hand, if the posterior is weakly consistent, then it is easy to see that
(i) and (ii) hold for any tthat is a continuity point of F0.
Since strong consistency is desirable, it is natural to seek a prior Π for which the
posterior would be strongly consistent at all Pin M(R). Such a prior can be thought
of as more diffuse than priors that do not have this property. However such a prior
does not exist. If it did, the corresponding Bayes estimates, by the last Proposition
4.2.1, would give a sequence of estimates of Pthat is consistent in the total variation
metric and such estimates do not exist [41]. On the other hand the Dirichlet priors
considered earlier provide an example of a prior that is weakly consistent at all P.
We note that Doob’s theorem is applicable also to strong consistency.
If Uis a neighborhood of P0with prior probability 0, then any reasonable version
of the posterior will assign mass 0 to Uand consequently it is unreasonable to expect
consistency at such a P0. Thus it is appropriate to confine the search for points of
consistency to the (topological) support of the prior.
4.3 Finite and Tail free case
When Xis a finite set, M(X) is a subset of the Euclidean space, and all the topologies
coincide on M(X) , and we have the following pleasing theorem. This theorem can
also be proved from Theorem 1.3.4. Here is a direct proof that in a way is related to
the Schwartz theorem discussed later in this chapter.
Theorem 4.3.1. Let Πbe a prior on M(X), where X={1,2,...,k}. Then the
posterior is consistent at all points in the support of Π.
Proof. Let
V={P:PP0}
be a neighborhood of P0.
Π(Vc|Xn)=Vcenk
1(ni/n)log(P0(i)/P (i)) dΠ(p)
Xenk
1(ni/n)log(P0(i)/P (i)) dΠ(p)
4.3. FINITE AND TAIL FREE CASE 125
where niis the number of Xnequal to i. Writing it as
I1(Xn)
I2(Xn)
we will show that
(i) for all β>0, lim inf n→∞ eI2(Xn)=a.s P0;and
(ii) there exists a β0>0 such that e0I1(Xn)0a.sP0.
condition (i) follows from the strong law of large numbers.
As for (ii)
k
1
ni
nlog P0(i)
P(i)=
k
1
ni
nlog ni/n
P(i)+
k
1
ni
nlog P0(i)
ni/n
which gives
lim
n→∞
k
1
ni
nlog P0(i)
P(i)= lim
n→∞
k
1
ni
nlog ni/n
P(i)
If Fnstands for the empirical distribution
k
i
ni
nlog ni/n
P(i)=K(Fn,P)
and by Proposition 1.2.2
K(Fn,P)||FnP||2
4=(||PP0||−||FnP0||)2
4
If PVcand nis large so that ||FnP0|| /2, we have
K(Fn,P)(δδ/2)2
4=δ0
In other words,
inf
PVcK(Fn,P)
0a.s P0
Consequently
lim
n→∞ eI1(Xn)lim
n→∞ eVc
enK(Fn,P )dΠ(p)en(βδ0)
which goes to 0 if β<δ
0. The proof of the theorem is easily completed by taking
β0
0.
126 4. CONSISTENCY THEOREMS
When Xis infinite, even weak consistency can fail to occur in the weak support
of Π. Freedman [69] provided dramatic examples when X={1,2,3,...,}.Another
elegant example, due to Ferguson, is described in [65].
Theorem 4.3.2. For k=1,2,..., let Tk={B:Ek}be a partition of Rinto
intervals. Further assume that {T k:k1}are nested. If Πis a prior on M(R),
tail free with respect to {T k:k1}and with support all of M(R)then (there exits
a version of) the posterior which is weakly consistent at every P0.
Proof. By Theorem 2.5.2, enough to show that for each nthe posterior distribution
of {P(B):En}given Xnconverges a.e. P0to {P0(B):En}. Proposition
2.3.6 ensures that the posterior distribution of {P(B):En}given X1,X
2,...,X
n
isthesameasthatgiven{n:En},wherenis the number of X1,X
2,...,X
n
in B. A little reflection will show that we are now in the same situation as Theorem
4.3.1.
4.4 Posterior Consistency on Densities
4.4.1 Schwartz Theorem
In the last section we looked at priors on M(R). An important special case is when the
prior is concentrated on Lµ, the space of densities with respect to a σ-finite measure
µon R. This case is important because of its practical relevance. In addition this
is a situation when one has a natural posterior given by the Bayes theorem. Our
(conventional) Bayes estimate is the expectation of fwith respect to the posterior.
We begin the discussion with a theorem of Schwartz [145]. Our later applications
will show that Schwartz’s theorem is a powerful tool in establishing posterior consis-
tency. Barron [8] provides insight into the role of Schwartz’s theorem in consistency.
Our setup, then, is Lµ=f:fis measurable,f 0,fdµ=1
.We tacitly iden-
tify the µequivalence classes in Lµand equip Lµwith the total variation or L1-metric
||fg|| =|fg|.Everyfin Lµcorresponds to a probability measure Pf, and it
is easy to see that the Borel σ-algebra generated by the L1-metric and the σ-algebra
BMLµare the same.
Let Π be a prior on Lµ. Recall that K(f,g) stands for the Kullback-Leibler diver-
gence flog(f/g).K(f) will stand for the neighborhood {g:K(f,g)<}.
Definition 4.4.1. Let f0be in Lµ.f0is said to be in the K-L support of the prior
Π, if for all >0, Π(K(f0)) >0.
4.4. POSTERIOR CONSISTENCY ON DENSITIES 127
As before, X1,X
2,... are given f, i.i.d. Pf.Pn
fwill stand for the joint distribution
of X1,X
2,...,X
nand P
ffor the joint distribution of the entire sequence X1,X
2,...
. We will, when needed, view P
fas a measure on Ω = R.
Let Ube a set containing f0. In order for the posterior probability of Ugiven Xn
to go to 1, it is necessary that f0and Uccan be separated. This idea of separation
is conveniently formalized through the existence of appropriate tests for testing H0:
f=f0versus H1:fUc. Recall that a test function is a nonnegative measurable
function bounded by 1.
Let {φn(Xn):n1}be a sequence of test functions.
Definition 4.4.2. {φn(Xn):n1}is uniformly consistent for testing H0:f=f0
versus H1:fUc,ifasn→∞,
Ef0(φn(Xn)) 0
inf
fUcEf(φn(Xn)) 1
Definition 4.4.3. A test φ(Xn)isstrictly unbiased for H0:f=f0versus H1:
fUc,if
Ef0(φn(Xn)) <inf
fUcEf(φn(Xn))
Definition 4.4.4. {φn(Xn):n1}is uniformly exponentially consistent for test-
ing H0:f=f0versus H1:fUc, if there exist C, β positive such that for all
n,
Ef0(φn(Xn)) Ce
and
inf
fUcEf(φn(Xn)) 1Ce
The next proposition relates these three definitions. The proposition is itself inter-
esting, and the ideas involved in the proof surface again in later arguments.
Proposition 4.4.1. The following are equivalent
(i) There exists a uniformly consistent sequence of tests for testing H0:f=f0
versus H1:fUc.
(ii) for some n1, there exists a strictly unbiased test φ(Xn)for H0:f=f0
versus H1:fUc.
(iii) There exists a uniformly exponentially consistent sequence of test functions for
testing H0:f=f0versus H1:fUc.
128 4. CONSISTENCY THEOREMS
Proof. Clearly, (i) implies (ii) and (iii) implies(i). So all that needs to be established
is that (ii) implies (iii).
Consider first the simple case when m= 1, i.e., there exists φ(X) such that Ef0φ=
α< inf
fUcEfφ=γ.
Let
Ak=(x1,x
2,...,x
k):1
kφ(Xi)>(α+γ)
2
Then Pk
f0(Ak)=Pk
f0(φ(Xi)kEf0φ>k(γα)/2), and by Hoeffeding’s inequal-
ity,
Pk
f0φ(Xi)kEf0φ>k(γα)
2ek2(γα)2
4k=ek(γα)2
4
On the other hand, for fUc
Pk
f(Ak)Pk
fφ(Xi)kEfφ>k(αγ)
2
Because αγ<0, by applying Hoeffeding’s inequality to φ,weget
Pf(Ak)1ek(γα)2
4
and thus φk=IAkprovides the required sequence of tests.
To move on to the general case, suppose
Ef0φm(X1,X
2,...,X
m)=α< inf
fUcEfφm(X1,X
2,...,X
m)=γ
From what we have just seen, if n=km, then there is a set Akwith Pn
f0(Ak)
en(γα)2/4m.Ifkm < n (k+1)m, then
Pn
f0(Ak)enkm(γα)2
n4m
enk(γα)2
(k+1)4men(γα)2
8m
Thus, setting β=(γα)2/8m, we have the exponential bound for φn=IAkwith
respect to Pf0. A similar argument yields the corresponding inequality for inf
fUcPf(Ak).
Corollary 4.4.1. Let νbe any probability measure on Uc. When there is a φn(Xn)
such that Efn
0φn(Xn)Ceand inffUcEfφn(Xn)1Ce , we have ||f0
fnν(df )|| ≥ 2(1 2Ce), where fnis the n-fold product density n
1f(xi).
4.4. POSTERIOR CONSISTENCY ON DENSITIES 129
Theorem 4.4.1 (Schwartz). Let Πbe a prior on Lµ.Iff0Lµ,and Usatisfy
(i) f0is in the K-L support of Πand
(ii) there exists a uniformly consistent sequence of tests for testing H0:f=f0
versus H1:fUc,
then Π(U|X1,X
2,...,X
n)1a.s P
f0
Proof. Because
Π(Uc|X1,X
2,...,X
n)=Ucn
1f(Xi(df )
Lµn
1f(Xi(df )=Ucn
1
f(Xi)
f0(XiΠ(df )
Lµn
1
f(Xi)
f0(Xi)Π(df )
it is enough to show that the last term in this expression goes to 0 a.s. P
f0.
We will show in Lemma 4.4.1 that condition (i) implies
for every β>0,lim inf
n→∞ eLµ
n
1
f(Xi)
f0(Xi)Π(df )=a.e.P
fo(4.1)
By Proposition 4.4.1, there exist exponentially consistent tests for testing f0against
Uc. Using these we invoke Lemma 4.4.2, by taking Vn=Ucfor all nto show that
for some β0>0,lim
n→∞ e0Uc
n
1
f(Xi)
f0(Xi)Π(df ) = 0 a.e.P
fo(4.2)
By taking β=β0in (4.1) it easily follows that the ratio in (4.4.1) goes to 0 a.e.
Lemma 4.4.1. If f0is in the Kullback-Leibler support of Πthen
for every β>0,lim inf
n→∞ eLµ
n
1
f(Xi)
f0(Xi)Π(df )=a.e.P
fo
Proof.
Lµ
n
1
f(Xi)
f0(Xi)Π(df )K(f0)
en
1log f0
f(Xi)
For ea ch fin K(f0), by the law of large numbers
1
nlog f0
f(Xi)→−K(f0,f)>a.s P
f0
130 4. CONSISTENCY THEOREMS
Equivalently, for each fin K(f0),
en(21
nlog f0
f(Xi))→∞a.s P
f0(4.3)
Hence by Fubini, there is a Ω0ΩofP
f0measure 1 such that, for each ω0,for
all fin K(f0), outside a set of Π measure 0, (4.3) holds. Using Fatou’s lemma,
lim inf en2Lµ
n
1
f(Xi)
f0(Xi)Π(df )lim inf en2K(f0)
n
1
f(Xi)
f0(Xi)Π(df )
K(f0)
en(21
nlog f0
f(Xi)(ω))Π(df )→∞
We will state the next lemma in a form slightly stronger than what we need.
Lemma 4.4.2. If there exist tests φn(Xn)and sets Vnwith lim infnΠ(Vn)>0,
such that for some β>0,
Ef0φn(Xn)Ce
and
inf
fVn
Efφn(Xn)1Ce
then
for some β0>0,lim
n→∞ e0Vn
n
1
f(Xi)
f0(Xi)Π(df )=0a.e. P
fo
Proof. Set qn(x1,x
2,...,x
n)=(1/Π(Vn)Vnn
1f(Xi(df ). Denoting by A(fn
0,q
n)=
f0(xi)qn(xi), by Corollaries 4.4.1 and 1.2.1 , there is 0 <r<1 such that
A(fn
0,q
n)(1 ||PQ||2
4)2Cenr
Thus
Pn
f0qn(Xn)
f0(Xi)enr=Pn
f0(qn(Xn)
f0(Xi)enr
22Cenr
2enr
An application of Borel-Cantelli yields
qn(Xn)
f0(Xi)enr a.s P
f0
4.4. POSTERIOR CONSISTENCY ON DENSITIES 131
and we have 1
Π(Vn)enr
2Vnn
1f(Xi)
n
1f0(Xi)Π(df )0a.sP
f0
Since lim inf Π(Vn)>0, we have the conclusion.
Remark 4.4.1.The role of the assumption that f0is in the Kullback-Leibler support
is to ensure that (4.1) holds. Sometimes it might be possible to verify it by direct
calculation without invoking the K-L support assumption. We will see an example of
this kind in the next chapter.
Let f0be in the K-L support of Π. In order to apply the Schwartz theorem, we
need to identify neighborhoods of f0for which there exists a uniformly consistent test
for H0:f=f0vs H1:fUc.
Let Ube a weak neighborhood of the form
U=fdP fdP0<,f bounded continuous (4.4)
Because fis bounded, by adding a constant we make it nonnegative and multiplying
by a positive constant we can make 0 f1. Then Uhas the same expression in
terms of this transformed f, with perhaps a different .Nowfis a test function and
which separates P0and Uc. Thus for neighborhoods of the form displayed we have an
unbiased test and consequently a uniformly consistent sequence of tests for
H0:P=P0H1:PUc
For any test function f,|fdP fdP0|<iff fdP fdP0<and
(1 f)dP (1 f)dP0<.InotherwordsU={P:|fdP fdP0|<}can
be expressed as intersections of sets of the type in (4.4).
Theorem 4.4.2. Let Πbe a prior on Lµ.Iff0is in the K-L support of Π, then
the posterior is weakly consistent at f0.
Proof. If U={P:|fidP fidP0|<
i:1ik}then
U=k
1{P:|fidP fidP0|<
i}
Hence it is enough to show that the posterior probability of each of the sets in the
intersection goes to 1 a.s f0. By the discussion preceding the theorem, {P:|fidP
132 4. CONSISTENCY THEOREMS
fidP0|<
i}is an intersection of two sets of the type displayed in (4.4). Since the
Schwartz condition is satisfied for these sets
Π(U|X1,X
2,...,X
n)1a.sP
f0.
Further, using a countable base for weak neighborhoods, we can ensure that almost
surely P
f0, for all U(U|X1,X
2,...,X
n)1.
If we have a tail free prior on densities, like a suitable Polya tree prior, then we do
not need a condition like “f0is in the K-L support of Π” to prove weak consistency of
the posterior. On the other hand, consistency is proved for a tail free prior by using
a Schwartz like argument for finite-dimensional multinomials, which tacitly uses the
condition of f0being in the K-L support. See also the result in the next section that
establishes posterior consistency without invoking Schwartz’s condition.
Applications of Schwartz’s theorem appear in Chapters 5, 6 and 7.
4.4.2 L1-Consistency
What if Uis a total variation neighborhood of f0? LeCam [122] and Barron [7] show
that in this case, if f0is nonatomic, then a uniformly consistent test for H0:f=f0
versus H1:fUcwill not exist.
Barron investigated the connection between posterior consistency and existence of
uniformly consistent tests. The next two results are adapted from an unpublished
technical report of Barron. Some of these appear in [8].
Proposition 4.4.2. Suppose for some β0>0,Π(Wn)<Ce
0.Iff0is in the
K-L support of Πthen
Π(Wn|Xn)0a.s.P
f0
Proof. By the Markov inequality
Pf0Wn
n
1
f
f0
(Xi(df )>e
eRnWn
n
1
f
f0
(Xi(df )
n
1
f0(Xi)µn(dx1,dx
2,...,dx
n)
=eWn
Π(df )
eCe0
4.4. POSTERIOR CONSISTENCY ON DENSITIES 133
and if β<β
0
P
f0Wn
n
1
f
f0
(Xi(df )>e
i.o =0
By Lemma 4.4.1, for all β>0,
eLµ
n
1
f(Xi)
f0(Xi)Π(df )→∞a.s P
f0.
The argument is now easily completed.
Theorem 4.4.3 (Barron). Let Πbe a prior on Lµ,f0in Lµand Ube a neigh-
borhood of f0. Assume that Π(K(f0)) >0for all >0. Then the following are
equivalent.
(i) There exists a β0such that
Pf0{Π(Uc|X1,X
2,...,X
n)>e
0infinitely often}=0
(ii) There exist subsets Vn,W
nof Lµ, positive numbers c1,c
2
1
2and a sequence
of tests {φn(Xn)}such that
(a) UcVnWn,
(b) Π(Wn)C1e1, and
(c) Pf0{φn(Xn)>0infinitely often}=0and
inffVnEfφn1c2e2.
Proof. (i)=(ii): Set Sn=(x1,x
2,...,x
n):Π(Uc|x1,x
2,...,x
n)>e
0and
φn=ISn.Letβ<β
0
Vn=f:Pf(Sn)>1e
Wn=f:Pf(Sc
n)eUc
By assumption P
f0{φn= 1 infinitely often }= 0 and by construction
inf
fVn
Efφn>1e
134 4. CONSISTENCY THEOREMS
Now,
Π(Wn)=Πf:Pf(Sc
n)>e
Uc
eUc
Pf(Sc
n(df )
and by Fubini
=eSc
n
π(Uc|xn)n(xn)
ee0=en(β0β)
where λnis the marginal distribution of Xn.
(ii)=(i):
Π(Uc|Xn)=Π(UcVn|Xn)+Π(UcWn|Xn)
Since Wnhas exponentially small prior probability, by Proposition 4.4.2
Π(Wn|Xn)0a.sP
f0
The proof actually shows that for some β0>0, writing i.o. for ”infinitely often”
P
f0Π(Wn|Xn)>e
0i.o =0
Because Π(UcVn|Xn)Π(Vn|Xn), it is enough to show that, for some β>0,
P
f0Π(Vn|Xn)>e
i.o =0
Now,
Π(Vn|Xn)
=φn(Xn)Π(Vn|Xn)+(1φn(Xn))Π(Vn|Xn)
Since P
f0{φn>0 i.o. }= 0, for any β>0, P
f0{φnΠ(Vn|Xn)>0 i.o. }=0.
4.4. POSTERIOR CONSISTENCY ON DENSITIES 135
For any βan application of Markov’s inequality and Borel-Cantelli lemma shows
that
Pf0Vn
n
1
f
f0
(xi(df )(1 φn(xn)) >e
eRnVn
n
1
f
f0
(xi)(1 φn(xn)) Π(df )
n
1
f0(xi)µn(dxn)
=eVn
Ef(1 φn(df )
eC2e2
and if β<β
2
Pf0Vn
n
1
f
f0
(xi(df )(1 φn(xn)) >e
i.o =0.
As before by Lemma 4.4.1 for any β,
eLµ
n
1
f(Xi)
f0(Xi)Π(df )→∞a.s P
f0.
The argument is now easily completed.
This last theorem can be used to develop sufficient conditions for posterior con-
sistency on L1-neighborhoods. Barron, Schervish and Wasserman [5] provide such a
condition using bracketing metric entropy. Motivated by their result, we prove the
following.
Definition 4.4.5. Let G⊂Lµ.Forδ>0, the L1-metric entropy J(δ, G) is defined
as the logarithm of the minimum of all nsuch that there exist f1,f
2,...,f
nin Lµ
with the property G⊂∪
n
1{f:ffi}.
Theorem 4.4.4. Let Πbe a prior on Lµ. Suppose f0Lµand Π(K(f0)) >0for
all >0. If for each >0, there is a δ<,c1,c
2>0,β<
2/2, and FnLµsuch
that, for all nlarge,
1. Π(Fc
n)<C
1e1,
2. J(δ, Fn)<nβ,
136 4. CONSISTENCY THEOREMS
then the posterior is strongly consistent at f0.
Proof. Let U={f:ff0<},Vn=FnUc,and Wn=Fc
n. We will argue that the
pair (Vn,W
n) satisfy (ii) of Theorem 4.4.3. Here UcVnWnand Π(Wn)<c
1e1.
Let g1,g
2,...,g
kin Lµbe such that Vn⊂∪
k
1Giwhere Gi={f:fgi}.
Let fiVnGi.Thenforeachi=1,2,...,k,f0fi>and if fGi, then
f0f>δ. Consequently for each i=1,2,...,k, there exists a set Aisuch that
Pf0(Ai)=αand Pfi(Ai)=γ>α+
Hence if fGi, then Pf(Ai)δ>α+δ.
Let
Bi=(x1,x
2,...,x
n): 1
n
n
j=1
IAi(xj)(γ+α)/2
A straightforward application of Hoeffeding’s inequality shows that
Pf0(Bi)exp[n2/2]
On the other hand, if fGi,
Pf(Bi)Pf1
n
n
j=1
IAi(xj)Pf(Ai)(αγ)
2+δ
Pfn1
n
j=1
IAi(xj)Pf(Ai)
2+δ(4.5)
Applying Hoeffeding’s inequality to n1n
j=1 IAi(xj), the preceding probability
is greater than or equal to
1exp[(n/2)(/2δ)2]
If we set
φn(X1,X
2,...,X
n)= max
1ikIBi(X1,X
2,...,X
n)
then
Ef0φnkexp[n2/2]
and
inf
fVn
Efφn1exp[(n/2)(/2δ)2]
4.5. CONSISTENCY VIA LECAM’S INEQUALITY 137
By choosing log kJ(δ, Fn)<nβ,wehaveEf0φnexp[n(2/2β)]. Since
β<
2/2, all that is left to show is
Pf0{φn>0 infinitely often}=0
This follows easily from an application of the Borel Cantelli lemma and from the fact
that φntakes only values 0 or 1.
This last theorem is very much in the spirit of Barron et al. [5]. Their theorem is in
terms of bracketing entropy. If G⊂Lµ,forδ>0, the L1-bracketing entropy J1(δ, G)
is defined as (here we use a weaker notion that suffices for our purpose) the logarithm
of the minimum of all nsuch that there exist g1,g
2,...,g
nsatisfying
1. gi1+δ,
2. for every g∈Gthereexistsanisuch that ggi.
We feel that in many examples the L1entropy is easier to apply than bracketing
entropy.
4.5 Consistency via LeCam’s inequality
It is of technical interest that one can prove posterior consistency without assuming
that the prior is tail free or satisfies the condition of f0being in the K-L support. An
inequality of LeCam [121] is useful to do this.
Let Π be a prior on M(X). For any measurable subset Uof M(X), let λUbe the
probability measure on Xgiven by
λU(B)= 1
Π(U)U
P(B)dΠ(P)
We will let λstand for the marginal on X.
If given P,XP, and Π(U|Xn) is the posterior probability of U, then
Π(U)=Π(U)U
(·)= Π(U)U
Π(U)U(Uc)Uc
(·)
Π(U)
Π(V)
U
V
(·)ifVUc
138 4. CONSISTENCY THEOREMS
Also recall that the L1-distance satisfies
PQ=2sup
B|P(B)Q(B)|=2 sup
0f1fdP fdQ
where of course Bsandfs are measurable.
Lemma 4.5.1 (LeCam). Let U, V be disjoint subsets of X. For any P0and any
test function φ
Π(V|x)dP0(x)≤P0λU+φdP0+Π(V)
Π(U)(1 φ)V(4.6)
Proof.
Π(V|x)dP0(x)=φ(x)Π(V|x)dP0(x)+(1 φ(x))Π(V|x)dP0(x)
adding and subtracting (1 φ(x))Π(V|x)U(x)
φ(x)dP0(x)+(1 φ(x))Π(V|x)dP0(x)(1 φ(x))Π(V|x)U(x)
+(1 φ(x))Π(V|x)U(x)
φ(x)Π(V|x)dP0(x)+P0λU+Π(V)
Π(U)(1 φ)V
where the first term comes from observing
0Π(V|x)1
and the second from
0(1 φ)(x)Π(V|x)1
The third term follows by noting that
Π(V|x)(Π(V)/Π(U))(V/dλU)
4.5. CONSISTENCY VIA LECAM’S INEQUALITY 139
Our interest is when Vis the complement of a neighborhood of P0and we have
X1,X
2,...,X
nwhich are given P, i.i.d. P.IfUnV=and φnare test functions,
then we can write LeCam’s inequality as
Π(V|Xn)≤Pn
0λn
Un+φndP n
0+Π(V)
Π(Un)(1 φn)V
where of course Pnis the n-fold product of Pand λn
U=(
UPndΠ(P))/Π(U).
Theorem 4.5.1. Let Uδ
n={P:P0P/n}. If for every δ,{Π(Uδ
n):n1}
is not exponentially small, i.e.,
for all β>0,e
Π(Uδ
n)→∞ (4.7)
then the posterior is weakly consistent at P0
Proof. It is not hard to see that
P0P/n⇒Pn
0Pn
Consequently the first term goes to δ. Since for any weak neighborhood we can choose
an exponentially consistent test φnfor testing H0:f=f0against H1:fVc
n,and
by assumption for all β>0,e
Π(Uδ
n)→∞, it is not hard to see that the third term
goes to 0. Because δis arbitrary, the result follows.
Remark 4.5.1.By Proposition 1.2.1, PQ≤2H(P, Q). Hence Theorem 4.5.1
holds if we take Uδ
n={P:H(P0,P)/n}
Suppose (4.7) holds and Vnare sets such that for some β0>0,Π(Vn)e00; then
choosing φn0 it follows easily that Π(Vn|X1,X
2,...,X
n)0. In other words, we
have an analog of Proposition 4.4.2. Consequently, we also have an analog of Theorem
4.4.4.
Theorem 4.5.2. Let Πbe a prior on Lµ. If for each >0, there is a δ<,
c1,c
2>0,β<
2/2, and FnLµsuch that for all nlarge,
1. Π(Fc
n)<C
1e1and
2. J(δ, Fn)<nβ
Further if with Uδ
n={P:P0P/n},
for every δ, for all β>0,e
Π(Uδ
n)→∞
then the posterior is strongly consistent at f0.
5
Density Estimation
5.1 Introduction
As the name suggests, density estimation is the problem of estimating the density of a
random variable Xusing observations of X. In this chapter we discuss some Bayesian
approaches to density estimation.
Density estimation has been extensively studied from the non-Bayesian point of
view. These include many methods of estimation starting from simple histogram
estimates to more sophisticated kernel estimates, estimates through Fourier series
expansions, and more recently wavelet-based methods. In addition, the asymptotics
of many of these methods, including minimax rates of convergence are available. There
are many good references; Silverman [151] and Van der Vaart [160] provide a good
starting point.
Consider the simple case when the density is to be estimated through a histogram.
Important features of the histogram are number of bins, their location and their width.
In order to reflect the true density, these features of the histogram estimate need to be
dependent not just on the number of observations but on the observations themselves.
The need for such a dynamic choice has been recognized and there have been many
reasonable, ad hoc, prescriptions. This issue persists in one form or another with the
other methods of estimation such as kernel estimates. The Bayesian approach, via
the posterior provides a rational method for choosing these features.
142 5. DENSITY ESTIMATION
In this chapter we discuss histogram priors of Gasperini and mixtures of nor-
mal densities which were introduced by Lo [130] and further developed by Escobar,
Mueller and West [ [168],[59] and [170]]. Gaussian process priors developed by Leonard
[[126],[127]] and studied by Lenk [125] are some what different in sprit and are also
discussed. See also Hjort [98] and Hartigan [94].
Consistency is dealt with at some length for the histogram and the mixture of nor-
mal kernel priors. These partly demonstrate different techniques to show consistency.
For the priors on histograms direct calculation is easier than invoking the Schwartz
theorem whereas for the mixture of normal kernels Schwartz’s theorem is a conve-
nient tool. This chapter is beset with long computations. To an extent they are both
natural and necessary.
5.2 Polya Tree Priors
A prerequisite for Bayesian density estimation is, of course, a prior on densities.
Since the Dirichlet process and their mixtures sit on discrete measures, these are
clearly unsuitable. On the other hand we have saw in Chapter 3 that by choosing
the parameters appropriately we can get Polya tree priors that are supported by
densities. Since the posterior for these priors involves simple updating rules, it is
natural to consider Polya trees as a candidate in density estimation.
Recall that if we have a Polya tree with partitions {B:Ej:j1}and pa-
rameters {α:E
k}:k1}, the predictive density at xis given by
α(x) = lim
k→∞
k
1
1
λ(B1(x)2(x)...i(x))
α1(x)2(x)...i(x)
α1(x)2(x)...i(x)0 +α1(x)2(x)...i(x)1
where i(x)=1ifxB1(x)2(x)...i(x)and 0 otherwise.
If X1=x1is observed and x1B
1,
2,...
kfor a sequence (
1,
2,...)of0sand1s,
and if and differ for the first time at the (j+ 1)th coordinate, then the predictive
density α(x|X1=x1)is
α(x|X1=x1)=
j
1
1
λ(B1(x)2(x)...i(x))
α1(x)2(x)...i(x)+1
α1(x)2(x)...i(x)0 +α1(x)2(x)...i(x)1
j+1
1
λ(B1(x)2(x)...i(x))
α1(x)2(x)...i(x)
α1(x)2(x)...i(x)0 +α1(x)2(x)...i(x)1
5.3. MIXTURES OF KERNELS 143
As is to be expected the predictive density depends on the partition. While a
general expression for the predictive density given X1,X
2,...,X
nis cumbersome to
write down, it is clear that sequential updating is possible.
The density estimates from Polya tree priors have no obvious relation with classical
density estimates. Further, the priors lead to estimates that lack smoothness at the
endpoints of the defining partition. Lavine [118] observes that this disadvantage can
be overcome by considering a mixture of {PT(θ)(θ))}processes, where the par-
titions themselves depend on the hyperparameter θ. One advantage of the Polya tree
priors is the relative ease with which one can conduct robustness studies; see Lavine
[119].
If we have a prior on densities, as discussed in Chapter 4 the consistency of interest
is L1-consistency. It is shown in Barron et al. [5] that if αn=8
n, the posterior is L1-
consistent. Such a high value of αnimplies that the random Ps are highly concentrated
around the prior guess E(P), so that posterior consistency will be an extremely slow
process. Hjort and Walker [165] have used a some what curious argument and show
that with αn=n2+δthe Bayes estimate is L1-consistent.
5.3 Mixtures of Kernels
While Polya tree priors can be made to sit on densities, it is not possible to constrain
the support to have smoothness properties. Much before Polya tree priors became
popular, Lo [131] had developed a useful construction of priors on densities. Much of
this section is based on Lo [131] and Ferguson [63].
Let Θ be a parameter set, typically Ror R2.LetK(x, τ ) be a kernel, i.e.,for each
τ,K(·) is a probability density on Xwith respect to some σ-finite measure. For any
probability Pon Θ, let
K(x, P )=K(x, τ )dP (τ)
For ea ch P,K(·,P) is a density on Xand Lo’s method consists of choosing a
mixture K(·,P) at random by choosing Paccording to a Dirichlet process. These
would be referred to as Dirichlet mixtures of K(·,P).
Formally the model consists of PDα, given P;X1,X
2,...,X
nare i.i.d. K(·,P).
If α=M¯α,wherαis a probability measure, then the prior expected density is
f0=K(·,P)Dα(dP )=K(·α( )
144 5. DENSITY ESTIMATION
It is convenient to view the X1,X
2,...,X
nas arising in the following way: P
Dαgiven P;τ1
2,...,τ
nare i.i.d Pand given P,τ1
2,...,τ
n;X1,X
2,...,X
nare
independent with XiK(·
i).
The latent variables τ1
2,...,τ
nalthough unobservable, provide insight into the
structure of the posterior and are useful in describing and simulating the posterior.
A simple kernel would be to take τ=(i, h):h>0
K(x, (i, h)) = I(ih,(i+1)h]
h(x)
With this kernel one gets random histograms.
Another very useful kernel is the normal kernel. Here τ=(θ, σ)andK(x, θ, σ)=
(1)φ((xθ))whereφis the standard normal density. In this case the prior picks
a random density that is a mixture of normal densities. The weak closure of such
mixtures is all of M(R).
The prior is a probability measure on the space of densities {K(·,P):PM(R)}
and so is the posterior given X1,X
2,...,X
n. For the normal kernel Pis in general
not identifiable. It is known from [156] that if P1and P2are discrete measures with
finite support, then K(·,P
1)=K(·,P
2)iP1=P2.ItiseasytoseethatifP1=
N(0,1) ×δ(00)and P2=δ(0,1+σ2
0), then K(·,P
1)=K(·,P
2)=N(0,1+σ2
0).
Thus in general, Pis not identifiable. Identifiability of Pwhen restricted to discrete
measures is still unresolved [63].
If we denote by Π(·|X1,X
2,...,X
n) the posterior distribution of Pgiven X1,...,X
n
and by H(·|X1,X
2,...,X
n) the posterior distribution of τ1,...,τ
ngiven X1,...,X
n
then
Π(·|X1,X
2,...,X
n)=Π(·|(τ1,X
1),...,(τn,X
n))H(|X1,X
2,...,X
n)
Since Pand X1,X
2,...,X
nare conditionally independent given τ1
2,...,τ
n,
Π(·|(τ1,X
1),...,(τn,X
n)) = Π(·|(τ1
2,...,τ
n)) = Dα+δτi
and
Π(·|X1,X
2,...,X
n)=Dα+δτiH(|X1,X
2,...,X
n)
The evaluation of these quantities depend on H(·|X1,X
2,...,X
n). If αhas a den-
sity, the joint density ˜α(τ1
2,...,τ
n) is discussed in Chapter 3 (see equation 3.15).
5.3. MIXTURES OF KERNELS 145
Recall that if C1,C
2,...,C
N(P)is a partition of {1,2,...,n}then the density (with
respect to the Lebesgue measure on Rk)at
τ=(τ1
2,...,τ
n):τi=τi,i,i
Cj,j =1,2,...N(P)
is N(P)
1
α(τj)(ej1)!
n
1(M+i)(5.1)
where ej=#Cjand hence the joint density of the xsandτsat
τ=(τ1
2,...,τ
n):τi=τi,i,i
Cj,j =1,2,...N(P)
is N(P)
1
α(τj)(ej1)! lCjK(xl
j)
n
1(M+i)
Consequently, the posterior density of τis
N(P)
1α(τj)(ej1)! lCjK(xl
j)
PN(P)
1α(τj)(ej1)! lCjK(xl
j)d(τj)
Thus
1
nK(x, τi)H(|X1,X
2,...,X
n) (5.2)
=1
n
PN(P)
1(ej1)! K(x, τj)lCiK(xl
j)α(τj)j
PN(P)
1(ej1)! lCiK(xl
j)α(τj)j
(5.3)
Since the Bayes estimate ˆ
fof fis, by 5.2, this reduces to
M
M+nf0(x)+ n
M+nK(x, τi)H(|X1,X
2,...,X
n)
Hence, we have that the Bayes estimate of fis
M
M+nK(x, τ α()
+n
M+n
P
W(P)ei
nK(x, τ )lCiK(xl)α(τ)
lCiK(xl)α(τ)(5.4)
146 5. DENSITY ESTIMATION
where P={C1,C
2,...,C
N(P)}is a partition of {1,2,...,n},eiis the number of
elements in Ci,and
W(P)= Φ(P)
Φ(P),Φ(P)=
N(P)
1{(ei1)!
lCi
K(xl)α(τ)}
The Bayes estimate is thus composed of a part attributable to the prior and a
part attributable to the observations. Since for the Dirichlet, M0 corresponds to
removing the influence of the prior, it is tempting to consider the estimate
1
nK(x, τi)H(|X1,X
2,...,X
n)
as a partially Bayesian estimate with the influence of the prior removed. Unfortu-
nately, this interpretation is quite misleading. As M0 the Bayes estimate (5.4)
goes to K(x, τ1α(τ1)n
1K(xi
1)1
˜α(τ1)n
1K(xi
1)1
(5.5)
corresponding to a partition in which all τiare equal to τ1. All other terms have a
power of Mand tend to 0. The term (5.5) corresponds to assuming that all the Xis
came from a single parametrized population with density K(x, τ ) and so is highly
parametrized.
The apparent paradox is resolved by the fact that role of the hyperparameters
depends on the context. Here Mdecides the likelihood of different clusters and in fact
relatively large values of Mhelp bring the Bayes estimate close to a data-dependent
kernel density estimate. For a penetrating discussion of the role of M, see discussion
by Escobar [66] and West et al. [170].
Clearly to calculate quantities like K(x, τ)α( ) it would be convenient if αis
conjugate to K(., .). Thus if Kis the normal kernel a convenient choice for ¯αis a
prior conjugate to N(τ, σ). Hence an appropriate choice for ¯αis the inverse normal-
gamma prior, i.e., the precision ρ=12has a gamma distribution and given ρ,τis
N(µ, 1). Ferguson [63] has interesting guidelines for choosing the parameters of ¯α
and M.
The expression for the Bayes estimate, even though it has an explicit expression, in-
volves enormous computation. The posterior for Dirichlet mixtures of normal densities
is amenable to MCMC methods. Gibbs methods are based on successive simulations
from one-dimensional conditional distributions of τigiven τj,j =i, X1,X
2,...,X
n.
5.4. HIERARCHICAL MIXTURES 147
For a good exposition see Schervish [144] and Chen et al. [32]. The MCMC methods
were developed in the present context by Escobar, Mueller and West ([59], [169],[170]).
A good survey of the issues underlying MCMC issues is given by Escobar and West
in [60].
To implement MCMC one essentially works with the conditional distributions of
τigiven τj,j =i, X1,X
2,...,X
n, which may be written explicitly from the posterior
distribution of the τs given earlier or directly [32]. In practice, αhas a location and
scale parameter (µ, σ), which leads to some complications. In the joint distribution
of τs one replaces ˜αby αµ,σ and multiplies by the prior Π(µ, σ). Starting from this,
one can calculate all the relevant posterior distributions needed in MCMC. See also
Neal [135].
Since no explicit expressions are available for the Bayes estimate of f(x), it would
be worth exploring whether approximations like Newton [137] can be developed.
The next issue would be to do the asymptotics. In Section 5.4 we do this for a
slightly modified version of the mixture model. While formal asymptotics is yet to be
done for the priors discussed in this section, we expect that the results and techniques
of the next section will go through with minor modifications.
5.4 Hierarchical Mixtures
This method is a slight variation of the method discussed in the last section.
Let K(x) be a density on R.Foreachh>0 consider the kernel Kh(x, θ)=
(1/h)K((xθ)/h). For any PM(R), let
Kh,P =Kh(x, θ)dP (θ)
Note that Kh,P is just the convolution KhP.IfPDα,thenwegetaprioron
Fh={Kh,P :PM(R)}
We now view has the smoothing “window” and think of has a hyperparameter
and put a prior µfor h. The calculations are very similar to those of the last section
except that we need to incorporate the hyperparameter h.
As before, the observations can be thought of as arising from: hµ,given
h;PDα;givenh, P ;θ1
2,...,θ
nare i.i.d. Pand given h, P ,and θ1
2,...,θ
n;
X1,X
2,...,X
nare independent with XiKh(·
i).
148 5. DENSITY ESTIMATION
The posterior distribution of Pgiven X1,X
2,...,X
nis
Π(·|X1,X
2,...,X
n)
=Π(·|(h, θ1
2,...,θ
n,X
1,...,X
n))H(d(h, θ)|X1,X
2,...,X
n) (5.6)
Because Pand X1,X
2,...,X
nare conditionally independent given h, θ1
2,...,θ
n,
Π(·|(h, θ1
2,...,θ
n,X
1,...,X
n)) = Dα+δθi
and
Π(·|X1,X
2,...,X
n)=Dα+δθiH(d(h, θ)|X1,X
2,...,X
n)
As before, if µand αhare densities with respect to Lebesgue measure then the
posterior density of (h, θ1
2,...,θ
n)isgivenby
µ(hα(θ1
2,...,θ
n)n
1Kh(Xiθi)
µ(hα(θ1
2,...,θ
n)n
1Kh(Xiθi)dhdθ
where ˜αis given by 3.15.
An expression analogous to (5.4) for the Bayes estimate can be written. In the
next two sections we look at consistency problems in the case when Kgivesriseto
histograms and when Kis the standard normal density.
Ishwaran [103] has used a general polya urn scheme to model θis and used these to
construct measures analogous to a prior and established consistency of the posterior.
These are then applied to a variety of interesting problems.
5.5 Random Histograms
In this section we consider priors that choose at random first a bin of width hand
then a histogram with bins (ih, (i+1)h:h∈N)whereN={0±1±2...}. Formally,
in the hierarchical model we take Θ = Nand the kernel K(x)=I(0,1](x).
Thus the model consists of, hµ;givenh;choosePon integers with PDαh
and X1,X
2,...,X
nare, given h, P , i.i.d. fh,P where
fh,P (x)=
i=−∞
P{i}
hI(ih,(i+1)h](x)
5.5. RANDOM HISTOGRAMS 149
One could introduce intermediate latent variables θ1
2,...,θ
nwhich are given h, P ;
i.i.d. P. However, they are not of much use here because Xicompletely determines
θi,namely,θi=jiff Xi(jh,(j+1)h].
For ea ch h,letnjh be the number of Xis in the bin (jh,(j+1)h]andJh={j:
njh >0}.
A bit of reflection shows that the posterior distribution of Pgiven h, X1,X
2,...,X
n
is Dαh+njhδj,whereδjis the point mass at j.
If µis a density on (0,) then the joint density of hand X1,X
2,...,X
nis
µ(h)
1[αh(i)][nhi1]hn
M[n]
h
where Mh=αh(N) for any positive real xand positive integer k,x[k]=x(x+
1) ...(x+k1). Hence the posterior density Π(h|X1,X
2,...,X
n)is
µ(h)
1[αh(i)][nhi1]hn
0µ(h)
1[αh(i)][nhi1]hndh (5.7)
Thus the posterior is of the same form as the prior, with µupdated to (5.7) and
αhupdated to αh+nhj δj.
Since each Dαhleads to the expected density
f¯αh(x)=¯αh(j)
hI(jh,(j+1)h](x)
the prior expectation is given by
f0(x)=f¯αh(x)µ(h)dh
Using the conjugacy of the prior, an expression for the Bayes estimate given the
sample can be written.
A choice of µwhich is positive in a neighborhood of 0 will allow for wide variability
in the choice of histograms and will ensure that the prior has all densities as its
support. If the prior belief leads to the density f0then an appropriate choice of ¯αh
would be
¯αh(j)=(j+1)h
jh
f0(x)dx
Of course, this choice would lead to a prior expected density, which may not be
equal to f0, but it can be viewed as an approximation to f0.
150 5. DENSITY ESTIMATION
5.5.1 Weak Consistency
Gasperini introduced these priors in his thesis and under some assumptions on αh
showed that if the true f0is not constant on any interval then under the posterior
distribution given X1,X
2,...,X
n,hgoesto0,asn→∞. Thus the posterior stays
away from densities that are far from f0. Under additional assumptions on f0, he also
showed that the Bayes estimate of fconverges in L1to f0. In the spirit of Chapter
4 we investigate the consistency properties of the posterior. We confine ourselves to
the case when the random histograms all have support on (0,], that is, the case
when Pis a probability on N+={0,1,2,...}. This restriction is not required but
simplifies the proof of Lemma 5.5.2. Some of the following calculations are taken from
Gasperini’s thesis, but the main ideas of the proof and the main results are different.
The consistency results in this chapter typically describe a large class of densities
where consistency obtains. We saw in Chapter 4 that when we have a prior Π on
densities, the Schwartz condition Π(Kf0()) >0 for all >0 (recall Kf0()isthe
Kullback-Leibler neighborhood of f0) ensures weak consistency at f0. Thus it seems
appropriate, in the context of histogram priors, that we should attempt to describe f0s
which would satisfy Schwartz’s condition. This would entail relating the tail behavior
of f0to the tail behavior of αhs. This is to be expected but leads to somewhat
cumbrous and restrictive conditions. It turns out that histogram priors are amenable
to direct calculations that lead to consistency results.
To be more specific, recall that Schwartz’s condition (Lemma 4.4.1) was used to
show that for all β>0,
eFf(xi)
f0(xi)dΠ(f)→∞a.s. P
f0
Under some assumptions we will establish this result directly. The following propo-
sition indicates the steps involved.
Proposition 5.5.1. Let Fbe a family of densities. For each hH,Πhisaprior
on F;µis a prior on H, i.e., hµ; given h;fΠhand given h, f ;X1,X
2,...,X
n
are i.i.d. f. If for a density f0,
for every β>0
µh:eFf(xi)
f0(xi)dΠh(f)→∞a.s. P
f0>0 (5.8)
then the posterior is weakly consistent at f0.
5.5. RANDOM HISTOGRAMS 151
Proof. Let Ube a weak neighborhood of f0and let Π be the prior on the space of
densities induced by µ, Πh. Since we have exponentially consistent tests for testing f0
against Uc, it follows from Lemma 4.4.2 that for some β0
e0Uc
n
1
f(xi)
f0(xi)dΠ(f)0 a.s. P
f0
To establish consistency it is enough to show that
lim inf
n→∞ e0F
n
1
f(xi)
f0(xi)dΠ(f) = lim inf
n→∞ e0F
n
1
f(xi)
f0(xi)dΠh(f)(h)
→∞ a.s. P
f0
Consider
(h, x):xR,hH:e0F
n
1
f(xi)
f0(xi)dΠh(f)→∞
By assumption for hin a set of positive µmeasure, the h– section of Ehas measure
1 under P
f0. By Fubini there is a FR,P
f0(F) = 1 and for xF,thexsection
of Ehas positive µmeasure and for each xFby Fatou
lim inf
n→∞ He0F
n
1
f(xi)
f0(xi)dΠh(f)(h)=
Assumptions on the Prior (Gasperini)
(i) µis a prior for hwith support (0,).
(ii) For each h,αhis a probability measure on N+, and for all h,αh(1) >0.
(iii) For each h, there is a constant Kh>0 such that
αh(j)
αh(j+1) <K
hfor j=0,1,2...
152 5. DENSITY ESTIMATION
Theorem 5.5.1. Suppose that the prior satisfies the assumptions just listed. If f0
is a density such that
(a) x2f0(x)dx < and
(b) limh0f0log(f0,h/f0)=0,
then the posterior is weakly consistent at f0.
Proof. Let Inh =Fhn
1(f(xi)/f0(xi))Dαh(df )
To apply the last proposition it is enough to show that for any β>0 there exists
h0such that for each hin (0,h
0),
exp[n(β+log Inh
n)] →∞a.s. P
f0(5.9)
and this follows if for any >0,there exists h0such that for h(0,h
0),
lim
n
log Inh
n>a.s. P
f0
Then by taking =β/2, (5.9) would be achieved.
log Inh
n=1
nlog Fh
n
1
f(xi)
f0h(xi)Dαh(df )+ 1
n
n
1
log f0h(xi)
f0(xi)
where f0h(x)=(1/h)ih (i+1)hf0(y)dy for x(ih, (i+1)h].
By assumption b and SLLN for some h0, whenever h<h
0,
lim
n
1
n
n
1
log f0h(xi)
f0(xi)>
2a.s. P
f0
Note that whenever f∈F
h,fis a constant on (ih, (i+1)h]:i0. Consequently
for f∈F
h,n
1
f(xi)
f0h(xi)=
iJh
(f
h(i))nih
(f
0h(i))nih
where nih =#{xi(ih, (i+1)h]},Jh={i:nih >0}, and for any density f,f
h
denotes the probability on Ngiven by f
h(j)=(j+1)h
jh f(x)dx. Also let fhdenote the
histogram fh(x)=f(i)/h for x(ih, (i+1)h].
5.5. RANDOM HISTOGRAMS 153
Since Dαhis Dirichlet and αh(N)=1,
1
nFh
iJh
(f
h(i))nih
hnDαh(df )= 1
n
1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i))
Therefore
1
nlog Fh
n
1
f(xi)
f0h(xi)Dαh(df )= 1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i)) log
iJh
f
0h(i)
hn
It is shown in Lemma 5.5.2 that
1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i))
iJh
nih log nih
n0 a.s.P
f0(5.10)
Using (5.10) we have
lim
n→∞
1
nFh
iJh
(f
h(i))nih
hnDαh(df )
= lim
n→∞ 1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i)) log h
iJh
log f
0h(i) + log h
= lim
n→∞
iJh
nih
nlog nih
nlog h1
nlog
iJh
(f
0h(i))nih
hn
=
iJh
nih
nlog nih
nlog h
iJh
nih
nlog f
0h(i) + log h
0 a.s. P
f0(5.11)
Lemma 5.5.1. Under the assumptions of the theorem,
max
iJh
i
n0a.s P
f0
Consequently
#Jh
n
max
iJh
i
n0a.s P
f0
154 5. DENSITY ESTIMATION
Proof.
max
iJh
i≤{max(X1,X
2,...,X
n)
h}+1
Now max(X1,X
2,...,X
n)/n0. This follows from: If Y1,Y
2,...,Y
nare i.i.d.
(X2
i=Yiin our case) then max(Y1,Y
2,...,Y
n)/n 0iEY1<. Recall as-
sumption (a) of Theorem 5.5.1.
Lemma 5.5.2. Under the assumptions of the theorem
1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i))
iJh
nih log nih
n0a.s. P
f0(5.12)
Proof. Let ln(h) stand for the first term on the left-hand side. Then
ln(h)= 1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i))
ln(h)=
1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i))
1
n
iJh
log Γ(αh(i)+nih)
1
n
iJh
log Γ(αh(i)) log h1
nlog Γ(n+1)
We first show that
1
n
iJh
log Γ(αh(i)) 0 a.s. P
f0
Since Γ(x)1/x for 0 x1, for h<,
01
n
iJh
log Γ(αh(i)) 1
n
n
1
log 1
αh(i)
5.5. RANDOM HISTOGRAMS 155
By using a telescoping argument, the right-hand side of the expression becomes
1
n
N
i=2
k
j=2 log 1
αh(i)log 1
αh(i1)+N
nlog 1αh(1)
=1
n
N
2
(Nj+1)logαh(j1)
αh(j)+N
nlog 1αh(1)
(N+1)(N+2)
2nKh+N
nlog 1
αh(1) 0 a.s. P
f0(5.13)
By Stirling’s approximation for all x1,
log Γ(x)=(x1
2)logxx+log2π+R(x)0<R(x)<1
and we can write
1
nlog 1
Γ(n+1)
iJh
1
hn
Γ(αh(i)+nih)
Γ(αh(i))
1
n
iJh
log Γ(αh(i)+nih)1
n
iJh
log Γ(αh(i))
=1
n
iJh{(αh(i)+nih 1
2)log(αh(i)+nih )}
1
n
iJh αh(i)nih log 2π+R(αh(i)+nih)!log h
1
n{(n+1
2)log(n+1)(n+ 1) + log 2π+R(n)}(5.14)
Since iJhnih =nand
1
n
iJh αh(i) + log 2π+R(αh(i)) + nih log h!
(maxiJhi)(2+log2π)
n0 a.s. P
f0(5.15)
156 5. DENSITY ESTIMATION
we get
lim
n→∞ |ln(h)
iJh
nih
nh log nih
nh |
1
n
iJh{(αh(i)+nih 1
2)log(αh(i)+nih )}
iJh
nih
nh log nih +logn+logh(5.16)
By adding and subtracting 1/n iJhnih 1/2lognih we have
|1
n
iJh{(αh(i)+nih 1
2)log(αh(i)+nih )
iJh
nih
nh log nih|
≤|1
n
iJh
αh(i)log(αh(i)+nih )|
+1
n
iJh
(nih 1
2)log(1 + αh(i)
nih |+1
n
iJh
1
2log nih (5.17)
Using log(1 + x)x
log(n+1)
n+1
n+log n
2n#Jh
The last term in this expression goes to 0 by Lemma 5.5.2.
The condition α(j1)(j)<Kessentially requires that the prior does not vanish
too rapidly in the tails. If our prior expectation f0is unimodal then it is easy to see
that the condition holds with K=m+h
mhf0(x)ds,wheremisthemodeoff0.
5.5.2 L1-Consistency
We next turn to L1-consistency. We will use Theorem 4.4.4. Recall that Theorem 4.4.4
required two sets of conditions—one being the Schwartz condition and the other was
construction of a sieve Fnwith metric entropy andsuchtha(Fc
n) is exponen-
tially small. A look at the proof of Theorem 4.4.4 shows that the Schwartz condition
can be replaced by
for all β>0,lim inf
n→∞ en
1
f(Xi)
f0(Xi)Π(df )=a.s P
f0
5.5. RANDOM HISTOGRAMS 157
Since we have already discussed this aspect in the last section, here we shall con-
centrate on the construction of a sieve.
To look ahead our sieve will be Fn=h>hnFan,h where Fan,h is the set of histograms
with support [an,a
n]. We will compute the metric entropy of Fnand show that for
a suitable choice of hn,a
nit is of the order . What is then left is to ensure that the
prior gives exponentially small mass to Fc
n
Proposition 5.5.2.
Let Pδ
k={(P1,P
2,...,P
k):Pi0,
k
1
Pi1δ}
Then
J(Pδ
k,2δ)(k
δ+1
2)log(1+δ)+klog(1 + δ)1
2log K+1
Proof. Let Kbe the largest integer less than or equal to k/δ and consider
P={PPδ
k:Pi=jδ
kfor some integer j}
We will show that given any PPδ
kthere is PPwith PP<2δ. The
logarithm of the cardinality of Pthen gives an upper bound for J(Pδ
k,2δ).
Let PPδ
k. Then since
|Pi
PjPi|=Pi
Pj
(1 Pj)Pi
Pj
δ,
we have (Pi/Pj)Pi.
Given PPδ
kwith Pi=1,letPbe such that
P
i=jδ
kfor some integer jand PiP
i<δ
k
Then P=(P
1,P
2,...,P
k)Pand also PP.Thuswehaveshownthat
Pis a 2δnet in Pδ
k.
To compute the number of elements in P, consider kpoints a1,a
2,...,a
k,each
endowed with a weight of δ/k.Ifweplace(k1) sticks among these points, then these
divide a1,a
2,...,a
kinto kparts, those to the left of the first stick, those between
the first and second, and so on, the last part being all those a
isto the right of the
last stick. Adding the weight of each of these parts gives a (P
1,P
2,...,P
k)Pand
158 5. DENSITY ESTIMATION
any element of Pcorresponds to a kpartition of a1,a
2,...,a
k. The number of ways
of partitioning kelements into kparts (some may be empty) is k+k1
k1.
Recall Stirling’s approximation
x!=2πxx+1
2ex+θ
12x0<θ<1
so that
k+k1
k1=(k+k1)!
(k1)!k!
(k+k)!
k!k!
2π(k+k)!(k+k)!+ 1
2e(k+k)!+ θ
12(k+k)!
2πkk+1
2ek+θ
12k2π(k)k+1
2ek+θ
12k
and therefore
log k+k1
k1log (k+k)k+1
2
(k)k+1
2
+log(k+k)k
kk+1
2
+
where
=log 1
2π+θ
12(k+k)θ
kθ
k<1
so that,
J(Pδ
k,2δ)(k+1
2)log(1+ k
k)+klog(1 + k
k)
1
2log k+1
substituting kk/δ we get the proposition.
Lemma 5.5.3. Suppose
PPδ
k={(P1,P
2,...,P
k):1Pi0,
k
1
Pi1δ}
δ<1,h
0<h<1and hh0=<δh
0/2(K+1).Iffhis the histogram fh(x)=
(Pi/h)I(ih,(i+1)h](x)and fh0is the histogram fh0(x)=(Pi)h)I(ih0,(i+1)h0](x), then
fhfh0<3δ.
5.5. RANDOM HISTOGRAMS 159
Proof. Let
I1=(0,h],I
2=(h, 2h],...I
k=((k1)h, kh]
and
J1=(0,h
0],J
2=(h0,2h0],...J
k=((k1)h0,kh
0]
Because k < h,fori<k,
Ii=(IiJi)(IiJi+1
Further,
IiJi+1 =((i+1)h0,(i+1)h)
Since fh=fh0on IiJi,wehave
Ii|fhfh0|dx =|Pi
hP(i+1)
h|(i+1)(hh0)
and because Pi1andh<h
0,
kh
0|fhfh0|dx =
k1
1|PiP(i+1)|(i+1)(hh0)
h+Pk
(i+1)(hh0)
h
k
1
Pi
(i+1)(hh0)
h
2(k+1)
h0δ
(5.18)
A bit of notational clarification: For every h,an/h will not be an integer and hence
when we write Fan,h what we mean is the set of all histograms from 0 to [an/h]where
[an/h]is the largest integer less than or equal to an/h. In our calculations, to avoid
notational mess, we pretend that an/h is an integer.
Lemma 5.5.4. For a>0, let Fa,h be all histograms from [0,a]with bin width h.
Then
h>h0Fa,h =2h0>h>h0Fa,h
160 5. DENSITY ESTIMATION
Proof. For any h>h
0, for some integer m,(h/m)(h0,2h0).The conclusion follows
because any histogram with bin width hcan also be viewed as a histogram with bin
width h/m.
We put all the previous steps together in the next proposition Let Fδ
a,h be all
histograms fhin Fa,h such that Pfh[0,a]>1δ.
Proposition 5.5.3.
Jh>hFδ
a,h,5δlog(2a
h+1)+( a
log(1 + δ)+ a
nlog(1 + 1
δ)+1
Proof. By Lemma 5.5.4
h>hFa,h=2h>h>h Fa,h
Set k=2a/h and =δh2/(2a+1)
Let N=[h]+1 where for any a,[a] is the largest integer less than or equal to a,and
hi=h+i, i =1,2,···,N. Then by Proposition 5.5.2, given any f∈∪
2h>h>hFa,h,
thereissomehisuch that ffhi<3δ. Use of Proposition 5.5.1 at each of Fa,hi,
and a bit of algebra gives the result.
Theorem 5.5.2. Let µbe a probability measure on (0,)such that 0 is in the
support of µ.αis a probability measure on R. Our setup is hµ, the prior on Fhis
Dαhwhere αh(i)=α(ih, (i+1)h].Letan→∞,h
n0such that (an/nhn)0.
If
(i) for some β0
1,C
1,C
2>0,
α(an,a
n]>1C1e0
(ii) µ(0,h
n)<C
2e1
then the posterior is strongly consistent at any f0satisfying (5.8).
Proof. If an
nhn0, it follows from Proposition 5.5.3 that J(Fn)<nβfor large
enough n. An easy application of Markov inequality with condition (i), and using (ii)
gives Π(Fc
n)<Ce
for some Cand γ. Theorem 4.4.4 gives the conclusion.
Thus if an=naand hn=nbthen what we need is a+b<1. For example if
αis normal then one can take an=n1/2. The condition would then be satisfied if
hn=nbwith b<1/2.
5.6. MIXTURES OF NORMAL KERNEL 161
5.6 Mixtures of Normal Kernel
Another case of special interest is when Kis the normal These priors were introduced
by Lo [131], (see also Ghorai and Rubin[72] and West [168] who obtained expressions
for the resulting posterior and predictive distributions. These can be further general-
ized by eliciting the base measure α=0of the Dirichlet up to some parameters
and then considering hierarchical priors for these hyperparameters.
5.6.1 Dirichlet Mixtures: Weak Consistency
Returning to the mixture model, let φand φhdenote, respectively the standard normal
density and the normal density with mean 0 and standard deviation h.LetΘ=R
and Mbe the set of probability measures on Θ. If Pis in M, then fh,P will stand
for the density
fh,P (x)=φh(xθ)dP (θ)
Note that fh,P is just the convolution φhP.
To get a feeling for the developments, we first look at the case where h=h0is
fixed and our model is PΠandgivenP,X1,X
2,...,X
nare i.i.d. fp.Inthiscase,
the induced prior is supported by Fh0={fh0,P :P∈M}, and the following facts are
easy to establish from Scheffe’s theorem:
(i) The map P→ fh0,P is one-to-one, onto Fh0. Further PnP0weakly if and
only if fh0,Pnfh0,P →0.
(ii) Fh0is a closed subset of F.
Fact (ii) shows that Fh0is the support of Π, and hence consistency is to be sought
only for densities of the form fh0,P . Theorem 5.6.1 implies consistency for such densi-
ties. Fact (i) shows that if the interest is in the posterior distribution of P, then weak
consistency at P0is equivalent to strong consistency of the posterior of the density
at fh0,P .
In order to establish weak consistency of the posterior distribution of fwe need to
verify the Schwartz condition. Following is a proposition that though not useful when
ΠisDαis useful in other contexts.
Proposition 5.6.1.
K(fP,f
Q)K(P, Q)
162 5. DENSITY ESTIMATION
Proof. A bit of change of variables and order of integration would show that
K(fP,f
Q)=K(Pxφ(x)dx, Qxφ(x)dx)
where Pxis the measure Pshifted by x. Using the convexity of the K-L divergence
and observing K(Px,Q
x)=K(P, Q) for all x,wehave
K(fP,fQ)=K(Pxφ(x)dx, Qxφ(x)dx)K(Px,Q
x)φ(x)dx =K(P, Q)
Thus if we have a prior Π such that every Pis in K-L support then the posterior
is weakly consistent at fP. In fact the earlier remark shows that we have weak con-
sistency at Pand hence strong consistency at fP. The Dirichlet does not have this
property. However, we will show in Chapter 6 that for a suitable choice of parameters
the Polya tree satisfies this property. Fixing hseverely restricts the class of densities
and is thus not of much interest.
We turn next to the model with a prior for h. Our model consists of a prior µfor h
and a prior Π on M. The prior µ×Π through the map (h, P )→ fh,P induces a prior
on F. We continue to denote this prior also by Π. Thus (h, P )µ×Πandgiven
(h, P ), X1,X
2,...,X
nare i.i.d. fh,P . This section describes a class of densities in the
K-L support of Π. By Schwartz’s theorem the posterior will be weakly consistent at
these densities. The results in this section are largely from [74]. The next two results
look at two simple cases and hold for general priors, but Theorem 5.6.3 makes use of
special properties of the Dirichlet.
Theorem 5.6.1. Let the true density f0be of the form f0(x)=fh0,P0(x)=
φh0(xθ)dP0(θ).IfP0is compactly supported and belongs to the support of Π,
and h0is in the support of µ, then Π(K(f0)) >0for all >0.
Proof. Suppose P0[k, k] = 1. Since P0is in the weak support of Π, it follows that
Π{P:P[k, k]>1/2}>0. It is easy to see that f0has moments of all orders.
For η>0, choose ksuch that |x|>kmax(1,|x|)f0(x)dx < η.Forh>0, we write
−∞ f0log (fh,P0/fh,P )asthesum
k
−∞
f0log fh,P0
fh,P
+k
k
f0log fh,P0
fh,P
+
k
f0log fh,P0
fh,P
(5.19)
5.6. MIXTURES OF NORMAL KERNEL 163
Now
k
−∞
f0(x)logfh,P0(x)
fh,P (x)dx
k
−∞
f0(x)logk
kφh(xθ)dP0(θ)
k
kφh(xθ)dP (θ)dx
k
−∞
f0(x)logφh(x+k)
φh(xk)P[k, k]dx
=k
−∞
f0(x)2k|x|
h2dx log(P[k, k]) k
−∞
f0(x)dx
<2k
h2+log2
η
provided P[k, k]>1/2. Similarly, we get a bound for the third term in (5.19).
Clearly,
c:= inf
|x|≤kinf
|θ|≤kφh(xθ)>0
The family of functions {φh(xθ):x[k,k
]}, viewed as a set of functions of θ
in [k, k], is uniformly equicontinuous. By the Arzela-Ascoli theorem, given δ>0,
there exist finitely many points x1,x
2,...,x
msuch that for any x[k,k
],there
exists an iwith
sup
θ[k,k]|φh(xθ)φh(xiθ)|<cδ (5.20)
Let
E=P:φh(xiθ)dP0(θ)φh(xiθ)dP (θ)
<cδ;i=1,2,...,m
Since Eis a weak neighborhood of P0(E)>0. Let PE.Thenforany
x[k,k
], choosing the appropriate xifrom (5.20), using a simple triangulation
argument we get
φh(xθ)dP (θ)
φh(xθ)dP0(θ)1
<3δ
and so φh(xθ)dP0(θ)
φh(xθ)dP (θ)1
<3δ
13δ
164 5. DENSITY ESTIMATION
(provided δ<1/3).
Thus for any fixed h>0, for Pin a set of positive Π-probability, we have
f0log (fh,P0/fh,P )<22k
h2+log2
η+3δ
13δ(5.21)
Now for any h,
f0log (f0/fh,P )=f0log (f0/fh,P0)+f0log (fh,P0/fh,P ) (5.22)
The first term on the right-hand side of (5.22) converges to 0 as hh0. To see this,
observe that φh0(xθ)dP0(θ)
φh(xθ)dP0(θ)sup
|θ|≤k
φh0(xθ)
φh(xθ)
The rest follows by an application of the dominated convergence theorem.
Given any >0, choose a neighborhood Nof h0(not containing 0) such that if
hN, the first term on the right-hand side of (5.22) is less than /2. Next choose η
and δso that for any hN, the right-hand side of (5.21) is less than /2. Because
h0is in the support of µ, the result follows.
Remark 5.6.1.In Theorem 5.6.1, the true density is a compact location mixture of
normals with a fixed scale. It is also possible to obtain consistency at true densities
which are (compact) location-scale mixtures of the normal, provided we use a mixture
prior for has well. More precisely, if we modify the prior so that (θ, h)P(a
probability on R×(0,)) and PΠ, then consistency holds at f0=φh(x
θ)P0(dθ, dh) provided P0has compact support and belongs to the support of Π. The
proof is similar to that of Theorem 3.
Theorem 5.6.1 covers the case when the true density is normal or a mixture of
normal over a compact set of locations. This theorem, however, does not cover the
case when the true density itself has compact support, like, say, the uniform. The
next theorem takes care of such densities.
Theorem 5.6.2. Let 0be in the support of µand f0be a density in the support of
Π.Letf0,h=φhf0.If
1. lim
h0f0log(f0/f0,h)=0,
2. f0has compact support,
5.6. MIXTURES OF NORMAL KERNEL 165
then Π(K(f0)) >0for all >0.
Proof. Note that, for each h,
f0log(f0/fh,P )=f0log(f0/f0,h)+f0log(f0,h /fh,P )
Choose h0such that for h<h
0,f0log(f0/f0,h)</2 so all that is required is to
show that for all h>0,
ΠP:f0log (f0,h/fh,P )</2>0
If f0has support in [k, k]. Then
f0log(f0,h/fh,P )k
k
f0(x)logk
kφh(xθ)f0(θ)
k
kφh(xθ)dP (θ)dx
The rest of the argument proceeds in the same lines as in Theorem 5.6.1.
While the last two theorems are valid for general priors on M, the next theorem
makes strong use of the properties of the Dirichlet process. For any Pin M,set
P(x)=P(x, )andP(x)=P(−∞,x).
Theorem 5.6.3. Let Dαbe a Dirichlet process on M.Letl1,l
2,u
1,u
2be functions
such that for some k>0for all Pin a set of Dα-probability 1, there exists x0
(depending on P) such that
P(x)l1(x),¯
P(x+klog x)u1(x)x>x
0
and
P(x)l2(x),P(xklog |x|)u2(x)x<x0
(5.23)
For any h>0, define
Lh(x)=φh(klog x)(l1(x)u1(x)),if x>0
φh(klog |x|)(l2(x)u2(x)),if x<0
and assume that Lh(x)is positive for sufficiently large |x|.Letf0be the “true” density
and f0,h =φhf0. Assume that 0is in the support of the prior on h.Iff0is in the
support of Dα(equivalently, supp(f0)supp(α)) and satisfies
166 5. DENSITY ESTIMATION
1. lim
h0f0log(f0/f0,h)=0;,
2. for all h,lim
a↑∞
−∞
f0(x)logf0,h(x)
a
aφh(xθ)f0(θ)dx =0; and
3. for all h,lim
M→∞ |x|>M
f0(x)logf0,h(x)
Lh(x)dx =0,
then Π(K(f0)) >0for all >0.
Remark 5.6.2.It follows from Doss and Sellke [55] that if α=0,whereα0is a
probability measure, then
l1(x)=exp[2log|log α0(x)|0(x)]
l2(x)=exp[2log|log α0(x)|0(x)]
u1(x) = exp 1
α0(x+klog x)|log α0(xklog x)|2
u2(x) = exp 1
α0(xklog |x|)|log α0(xklog |x|)|2
satisfy the requirements of (5.23). For example, when α0is double exponential, we
may choose any k>2 and the requirements of the theorem are satisfied if f0has
finite moment-generating function in an open interval containing [1,1].
Remark 5.6.3.The following argument provides a method for the verification of
Condition 1 of Theorems 5.6.1 and 5.6.2 for many densities. Suppose that f0is con-
tinuous a.e., f0log f0<, and further assume that, as for unimodal densities,
there exists an interval [a, b] such that inf{f(x):x[a, b]}=c>0andf0is increas-
ing in (−∞,a) and is decreasing in (b, ). Note that {x:f0(x)c}is an interval
containing [a, b]. Replacing the original [a, b] by this new interval, we may assume
that f0(x)coutside [a, b]. Choose h0such that N(0,h
0) gives probability 1/3to
(0,ba). Let h<h
0. Let Φ denote the cumulative distribution function of N(0,1).
If x[a, b] then
f0,h(θ)b
a
f0(θ)φh(xθ)c(Φ((bx)/h)+Φ((xa)/h)c/3
If x>bthen
f0,h(θ)x
a
f0(θ)φh(xθ)f0(x)1
2((ba)/h)1f0(x)/3
5.6. MIXTURES OF NORMAL KERNEL 167
Using a similar argument when x<a, we have that the function
g(x)=log (3f0(x)/c),if x[a, b]
log 3,otherwise
dominates log(f0/f0,h)forh<h
0and is Pf0-integrable. Since f0(x)/f0,h(x)1as
h0 whenever xis a continuity point of f0and f0log(f0/f0,h)0, an application
of (a version of) Fatou’s lemma shows that f0log(f0/f0,h)0ash0.
Proof. Let >0 be given and δ>0, to be chosen later. First find h0so that
f0log(f0/f0,h)</2 for all h<h
0. Fix h<h
0. Choose k1such that
−∞
f0(x)logf0,h(x)
k1
k1φh(xθ)f0(θ)dx < δ
Let p=P[k1,k
1] and let p0denote the corresponding value under P0.Wemay
assume that p0>0. Let Pdenote the conditional probability under Pgiven [k1,k
1],
i.e., P(A)=P(A[k1,k
1])/p (if p>0) and P
0denoting the corresponding objects
for P0.LetEbe the event {P:|p/p01|}. Because P0is in the support of Dα,
Dα(E)>0. Now choose x0>k
1such that
(i) |x|>x0
f0(x)log(f0,h(x)/Lh(x)) dx < δ
(ii) Dα(EF)>0, where
F=
P:
P(x)l1(x), P (x+klog x)u1(x)x>x
0
and
P(x)l2(x),P(xklog |x|)u2(x)x<x0
By Egoroff’s theorem, it is indeed possible to meet condition (ii).
Consider the event
G=P:sup
x0<x<x0
log k1
k1φh(xθ)dP
0(θ)
k1
k1φh(xθ)dP (θ)<2δ.
We shall argue that Dα(EFG)>0 and if P(EFG) then f0log(f0/fh,P )<
for a suitable choice of δ.
168 5. DENSITY ESTIMATION
The events EFand Gare independent under Dα, and hence, to prove the first
statement, it is enough to show that Dα(G)>0. By intersecting Gwith Eand
using the fact that {φh(xθ):x0xx0}is uniformly equicontinuous when
θ[k1,k
1], we can conclude that Dα(G)Dα(GE)>0 (see the proof of
Theorem 5.6.1).
Now,
f0log(f0/fh,P )
−∞
f0(x)log(f0(x)/f0,h (x))dx
+|x|≤x0
f0(x)logf0,h(x)
k1
k1φh(xθ)f0(θ)dx
+|x|≤x0
f0(x)logk1
k1φh(xθ)f0(θ)
k1
k1φh(xθ)dP (θ)dx
+|x|>x0
f0(x)logf0,h(x)
φh(xθ)dP (θ)dx
If PEFG,thenforx>x
0,
−∞
φh(xθ)dP (θ)x+klog x
x
φh(xθ)dP (θ)
φh(klog x)[P(x)P(x+klog x)]
and because PF, the expression is further greater than or equal to
φh(klog x)[l1(x)u1(x)] = Lh(x)
Using a similar argument for x<x0,weget
|x|>x0
f0(x)logf0,h(x)
fh,P (x)dx |x|>x0
f0(x)logf0,h(x)
Lh(x)dx < δ
Since PEG,foreachxin [x0,x
0],
log k1
k1φh(xθ)f0(θ)
k1
k1φh(xθ)dP (θ)=logp0
pk1
k1φh(xθ)dP
0(θ)
k1
k1φh(xθ)dP (θ)<3δ
All these imply that if δis sufficiently small, then PEFGimplies that
f0log(f0,h/fh,P )<.
5.6. MIXTURES OF NORMAL KERNEL 169
5.6.2 Dirichlet Mixtures: L1-Consistency
As before, we consider the prior which picks a random density φhP,wherehis
distributed according to µand Pis chosen independently of haccording to Dα. Since
we view has corresponding to window length, it is only the small values of hthat are
relevant, and hence we assume that the support of µis [0,M] for some finite M.
In this model the prior is concentrated on
F=0<h<M Fh
where Fh={φhP:PM}.
In order to apply Theorem 4.4.4, given U={f:ff0<},forsomeδ</4,
we need to construct sieves {Fn:n1}such that J(δ, Fn)and Fc
nhas
exponentially small prior probability. Because, as an→∞,Dα{P:P[an,a
n]>
1δ}→1, a natural candidate for Fnis
Fn=hn<h<M Fan
h
where hn0,anincreases, and Fan
h={φhP:P[an,a
n]>1δ}. What is then
needed is an estimate of J(δ, Fn). The next theorem provides such an estimate.
The next lemma shows that the restriction h<M simplifies things a bit.
Lemma 5.6.1. Let M>0and let FM
h,a,δ =h<h<M Fh,a,δ.Ifa>M/
δ, then
FM
h,a,δ ⊂F
h,2a,2δ.
Proof. By Chebyshev’s inequality, if h<M then the probability of (a, a] under
N(0,h
) is greater than 1 δ.Iff=φhP, then since φh=φhφh,whereh<M,
f=φhφhPand (φhP)(a, a]>12δ.
Theorem 5.6.4. Let FM
h,a,δ =h<h<M {fh,P :P[a, a]1δ}. Then
J(δ, FM
h,a,δ)Ka
h,
where Kis a constant that does depend on δand M, but not on aor h.
We prove Theorem 5.6.4 through a sequence of lemmas. Let Fh,a ={fh,P :P(a, a]=
1}. Without loss of generality, we shall assume that a1
Lemma 5.6.2. J(2δ, Fh,a)8
π
a
+1
1+log1+δ
δ.
170 5. DENSITY ESTIMATION
Proof. For any θ1
2,
φθ1,h φθ2,h
=1
2πh x>(θ1+θ2)/2
exp[(xθ2)2/(2h2)]dx
1
2πh x>(θ1+θ2)/2
exp[(xθ1)2)/(2h2)]dx
+1
2πh x<(θ1+θ2)/2
exp[(xθ1)2/(2h2)]dx
1
2πh x<(θ1+θ2)/2
exp[(xθ2)2/(2h2)]dx
=4 1
2π(θ2θ1)/(2h)
0
exp[x2/2]dx
2
π
(θ2θ1)
h
Given δ,letNbe the smallest integer greater than 8a/(πhδ). Divide (a, a]
into Nintervals. Let
Ei=a+2a(i1)
N,a+2ai
N:i=1,2,...,N
and let θibe the midpoint of Ei. Note that if θ, θEi, then |θθ|<2a/N,and
consequently φθ,h φθ,h.
Let PN={(P1,P
2,...,P
N):Pi0,N
i=1 Pi=1}be the N-dimensional prob-
ability simplex and let P
Nbe a δ-net in PN, i.e., given P∈P
N, there is P=
(P
1,P
2,...,P
N)∈P
Nsuch that N
i=1 |PiP
i|.
Let F={N
i=1 P
iφθi,h :P∈P
N}. We shall show that Fis a 2δnet in Fh,a.If
fh,P =φhP∈F
h,a,setPi=P(Ei) and let P∈P
Nbe such that N
i=1 |PiP
i|.
5.6. MIXTURES OF NORMAL KERNEL 171
Then
/
/
/
/
/φθ,hdP (θ)
N
i=1
P
iφθi,h/
/
/
/
/
/
/
/
/
/φθ,hdP (θ)
N
i=1 IEi(θ)φθi,hdP (θ)/
/
/
/
/
+/
/
/
/
/
N
i=1
Piφθi,h
N
i=1
P
iφθi,h/
/
/
/
/
N
i=1
IEi(θ)φθ,h φθi,hdP (θ)+
N
i=1 |PiP
i|
2δ
This shows that J(2δ, Fh,a)J(δ, PN), and we calculate J(δ, PN) along the lines
of Barron, Schervish and Wasserman as follows: Since |PiP
i|/N for all i
implies that N
i=1 |PiP
i|, an upper bound for the cardinality of the minimal
δ-net of PNis given by
number of cubes of length δ/N covering [0,1]N
×volume of (P1,P
2,...,P
N):Pi0,
N
i=1
Pi1+δ
=(N/δ)N(1 + δ)N1
N!
So,
J(δ, PN)Nlog NNlog δ+Nlog(1 + δ)log N!
Nlog NNlog δ+Nlog(1 + δ)Nlog N+N
=N1+log1+δ
δ
8
π
a
+1
1+log1+δ
δ
Lemma 5.6.3. Let Fh,a,δ ={fh,P :P(a, a]1δ}. Then J(3δ, Fh,a,δ)
J(δ, Fh,a).
172 5. DENSITY ESTIMATION
Proof. Let f=φhP∈F
h,a,δ. Consider the probability measure Pdefined by
P(A)=P(A(a, a])/P (a, a]. Then the density f=φhPclearly belongs to
Fh,a and further satisfies ff<2δ.
Proof. Putting Lemmas 5.6.2 , 5.6.3 and 5.6.1 together, we have Theorem 5.6.4.
The next theorem formulates the result in terms of strong consistency for Dirichlet-
normal mixtures.
Theorem 5.6.5. Suppose that the prior µhas support in [0,M]. If for each δ>0,
β>0, there exist sequences an,hn0and constants β0
1(all depending on δ,βand
M) such that
1. for some β0,Dα{P:P[an,a
n]<1δ}<e
0,
2. µ{h<h
n}≤e1, and
3. an/hn<nβ
then f0is in the K-L support of the prior implies that the posterior is strongly con-
sistent at f0.
Remark 5.6.4.What was involved in the preceding is a balance between anand hn.
Since δand Mare fixed, the constant Kobtained in Theorem 5.6.4 does not play
any role. If αhas compact support, say [a, a], then we may trivially choose an=a
and so hnmay be allowed to take values of the order of n1or larger. If αis chosen as
a normal distribution and h2is given a (right truncated) inverse gamma prior, then
the conditions of the theorem are satisfied if anis of the order nand hn=C/n
for a suitable (large) C(depending on δand β).
5.6.3 Extensions
The methods developed in this chapter toward the simple mixture models can be used
to study many of the variations used in practice. Some of these are discussed in this
section.
1. It is often sensible to let the prior depend on the sample size; see for instance
Roeder and Wasserman [141]. A case in point, in our context would be when
the precision parameter M=α(R) is allowed to depend on the sample size.
If Πnis the prior at stage n, then the results goes through if the assumption
Π(K(f0)) >0 is replaced by lim infn→∞ Πn(K(f0)) >0. This follows from the
5.6. MIXTURES OF NORMAL KERNEL 173
fact the Barron’s Theorem (see Chapter 4) goes through with a similar change.
The only stage that needs some care is an argument which involves Fubini, but
it can be handled easily.
2. Another way the Dirichlet mixtures can be extended is by including a further
mixing. Formally, Let X1,X
2,... be observations from a density fwhere f=
φhP,PDατ,hπ,τis a finite-dimensional mixing parameter, which is
also endowed with some prior ρ.Letf0be the true density. We are interested
in verifying the Schwartz condition at f0and conditions for strong consistency.
By Fubini’s theorem, Schwartz’s condition is satisfied for the mixture if
ρ{τ: Schwartz condition is satisfied with ατ}>0 (5.24)
(a) In particular, if f0has compact support, then (5.24) reduces to
ρ{τ: supp(f0)supp(ατ)}>0 (5.25)
(b) Suppose f0is not of compact support and τ=(µ, σ) gives a location-scale
mixture. So we have to seek the condition so that the Schwartz condition
holds with the base measure α((·−µ)). We report results only for α0=
α/α(R) double exponential or normal.
When α0is double exponential, a sufficient condition is that f0(µ+σx)has
finite moment-generating function on an open interval containing [1,1].
When αis normal, we need the integrability of xlog |x|exp[x2/2] with re-
spect to the density f0(µ+σx). For example, if the true density is N(µ0
0),
then the required condition will be σ<σ
0, so we need
ρ{(µ, σ):σ<σ
0}>0
We omit proof of these statements. Simulation shows inclusion of location,
and scale parameters in the base measure improves convergence of the the
Bayes estimates to f0.
(c) For strong consistency, we further assume that the support of the prior ρ
(for (µ, σ)) is compact. For each (µ, τ ), find the corresponding an(µ, σ)of
Theorem 5.6.5, i.e., satisfying
Dα(µ,τ){P:P[an(µ, τ),a
n(µ, τ )] <1δ}<e
0
for some β0>0. Now choose an=sup
µ,σ an(µ, σ). The order of anwill
then be the same as the individual an(µ, σ)s.
174 5. DENSITY ESTIMATION
(d) In some special cases, it is also possible to allow unbounded location mix-
tures. For example, when the base measure is normal, a normal prior for
the location parameter is both natural and convenient. Strong consistency
continues to hold in this case as long as σhas a compactly supported
prior. To see this, observe that ρ{|µ|>n}is exponentially small and
sup|µ|≤n,σ an(µ, σ) is again of the order of n.
(e) West et al. put a random prior Pon h, independent of Pand a Dirichlet
prior for P. This allows different amounts of smoothing near different
sets of Xis. Our methods should apply here also. Such techniques, i.e.,
dependence of hon Xisoronxin the range of Xis have been introduced in
the frequentist literature recently and are also known to improve estimates.
5.7 Gaussian Process Priors
Consider the probabilities p1,p
2,...p
kassociated with a multinomial with k- cells.
Often, for example, when the cells correspond to the bins of a histogram, it would
be evident that a priori that the probabilities of adjacent cells would be highly pos-
itively correlated and the correlation would drop off for cells are farther apart. The
Dirichlet prior for p1,p
2,...p
kresults in negative covariance whereas we want pos-
itive covariance. It is thus necessary to model other covariance structures. The dif-
ficulty is one of specifying covariances which would ensure that the prior sits on
Sk={(p1,p
2,...p
k),p
i0pi=1}. Leonard([126],[127]) suggested choosing real
variables Y1,Y
2,...Y
kand setting pi=exp(Yi)/exp(Yi). This ensures that pi0
and pi= 1. Further if the distribution of Y1,Y
2,...Y
kis tractable, say N(µ, Σ),
then Leonard shows that one can obtain tractable approximations to the posterior.
The situation is even more striking in the case of smooth random densities where
smoothness already implies that the value of the density at two points x, y would be
close if xand yare close. If we use the method of Section 5.5 calculations indicate
that one gets positive covariance (for fixed h) only for very small values of h.Inthe
spirit of Leonard one could choose a stochastic process {Y(x):xR}with smooth
sample paths and for any sample path define f=exp(y)/((exp y(t))dt). Leonard
[127] suggested using a Gaussian process {Y(x):xR}. In this section we present
these Gaussian process priors along the lines of Lenk [125]. Lenk considers a larger
class of priors which gives a unified appearance to the results. An alternative method
is to consider f=expYconditioned on exp Y(t)dt = 1. Thorburn[157] has taken
5.7. GAUSSIAN PROCESS PRIORS 175
this approach. While this method is not discussed here, it would be interesting to see
how this method relates to those developed by Leonard and Lenk.
Let µ:R→ Rand σ:R×R→ R+be a symmetric function. σis said to be
positive definite if for any x1,x
2,...,x
k,thek×kmatrix with σ(xi,x
j) as its entries
is positive definite.
Definition 5.7.1. Let µ:R→ Rand σbe a positive definite function on R×R.A
process {Y(x):xR}is said to be a Gaussian process with mean µand covariance
kernel σif for any x1,x
2,...,x
k,Y(x1),Y(x2),...,Y(xk)hasak-dimensional normal
distribution with mean µ(x1)(x2),...,µ(xk) and covariance matrix whose (i, j)th
entry is σ(xi,x
j).
The smoothness of the sample paths of a stochastic process is governed by moment
conditions. Extensive results of this kind can be found in [36]. Following are a few
that we use.
Theorem 5.7.1. Let {ξ(x):xR}be a stochastic process. Suppose that for
positive constants pr,
E|ξ(t+h)ξ(t)|pK|h|1+rfor all t, h
Let 0<a<r/p. Then there is a process {η(x):xR}equivalent to {ξ(x):xR}
(i.e. a process with the same finite-dimensional distributions as {ξ(x):xR}) such
that
|η(t+h)η(t)|≤A|h|awhenever |h|
As an example consider the standard Brownian motion. A Gaussian process with
µ=0andσ(x, y )=xy.Leth>0 then
E|ξ(t+h)ξ(t)|4=3{Var(ξ(t+h)ξ(t))}2=3h2
So we can take p=4,r = 1 to conclude that the sample paths are Lipschitz of order
at least a,where0<a<1/4.
More generally, since ξ(t+h)ξ(t)
his N(0,1),
E|ξ(t+h)ξ(t)|2k=Akhk
and we can choose p=2k, r =k1,0<a<(k1)/2k. Letting k→∞,weseethat
the sample functions are Lipshitz of order afor any 0 <a<1/2.
176 5. DENSITY ESTIMATION
Theorem 5.7.2. If for positive constants p<rand K,
E|ξ(t+h)ξ(t)|pK|h|
|log |h||1+r
and
E|ξ(t+h)+ξ(th)2ξ(t)|pK|h|1+p
|log |h||1+r
Then there is a process η(t)equivalent to ξ(t)such that η(t)exists and is continuous
almost surely.
To return to Lenk, we consider a Gaussian process Y(x) with mean µand covariance
kernel σ. Lenk appears to assume that
(i) µis continuous;
(ii) σis continuous on R×Rand positive definite; and
(iii) there exist positive constants c, β,  and nonnegative integer rsuch that
E|Y(x)Y(y)|β=C|xy|1+r+
Condition (iii) guarantees that if r1 then with probability 1, the sample paths
are rtimes continuously differentiable. A useful case is when σis of the form σ(x, y)=
ρ(|xy|) for some function ρon R. In this case, the process is stationary, and easier
sufficient conditions are available for the sample paths to be smooth.
Theorem 5.7.3. Let σ(x, y)=ρ(|xy|).If
1. for some a>3
ρ(h)=1O{|log |h||a}as h0
then there is an equivalent process with continuous sample paths
2. for some a>3and λ2>0,
ρ(h)=1λh2
2+O(h2
|log |h||a)as h0
then there is an equivalent process whose sample paths are continuously differ-
entiable
5.7. GAUSSIAN PROCESS PRIORS 177
Cram´er and Leadbetter [36] remark that a>3 may be replaced by a>1 but the
proof requires lot more work. Here are some examples used in Lenk [125].
(i) ρ(x)=e−|x|=1−|x|+O(x2)asx0;
(ii) ρ(x)=(1−|x|)I|x|≤1=1−|x|as x0;
(iii) ρ(x)=ex2=1x2+O(x4)asx0; and
(iv) ρ(x)= 1
1+x2=1x2+O(x4)asx0.
Cases (i) and (ii) satisfy condition (1) of the theorem and (iii) and (iv) satisfy
condition (2).
Let Ibe a bounded interval and let {Z(x):xR}be a Gaussian process with
mean µand covariance kernel σ. The log-normal process, denoted by LN(µ, σ), is the
process W(x)=exp(Z(x)). We will denote the associated measure on R+by Λ(µ, σ ).
Following is a proposition which will be used later.
Proposition 5.7.1. Fix x1,x
2,...,x
kin Iand constants a1,a
2,...,a
k.
Let µ(x)=µ(x)+
k
1
aiσ(x, xi)
Then
dΛ(µ)
dΛ(µ, σ)=k
1W(xi)ai
Ek
1W(xi)ai
=k
1W(xi)ai
e
x+aσx
2a
Here W(R+)Iand the expectation in the right-hand side is with respect to
Λ(µ, σ);µx=(µ(x1)(x2),...,µ(xk)) and [σx]i,j =σ(xi,x
j),a=a1,a
2,...,a
k.
We will prove the proposition through a series of simple lemmas.
Lemma 5.7.1. Let (Z1,Z
2,...,Z
k)be multivariate normal with mean vector µ=
(µ1
2,...,µ
k)and covariance Σ.Ifµ=(µ
1
2,...,µ
k)=µ+aΣ, where ais the
vector (a1,···,a
k)then
dN(µ,Σ)
dN(µ, Σ) (Z1,Z
2,...,Z
k)=KeiaiZi
where K=1/EeaiZi=1/e+aΣ
2a.
178 5. DENSITY ESTIMATION
Proof. For any µ1and µ2,
((xµ11(xµ1)(xµ21(xµ2))
=2(µ2µ11x+µ1Σ1µ
1µ2Σ1µ
2
Only the first term depends on x. Absorbing the other two terms in the constant
and taking µ1=µand µ2=µthe lemma follows.
Lemma 5.7.2. Let G(µ, σ )stand for the Gaussian measure with mean µand co-
variance σ.Ifµis as in Proposition 5.7.1, then
dG(µ)
dG(µ, σ)(Z)=Kek
1aiZ(xi)(5.26)
Proof. It is enough to show that the finite-dimensional distributions of the measure
defined by (5.26) are those arising from dG(µ). But that is precisely the conclusion
of the lemma 5.7.2.
Next we state a simple measure theoretic lemma whose proof is routine.
Lemma 5.7.3. Suppose P, Q are probability measures on (Ω,A)and Tis a 1-1
measurable function from (Ω,B).IfPQthen PT1QT 1and
dP T 1
dQT 1(ω)=dP
dQ(T1(ω))
Proof. To return to the proposition, it easily follows from Lemma 5.7.2 and by taking
T(Z)=eZin Lemma 5.7.3.
We next add another real parameter ξ, and following Lenk we define a generalized
log-normal process LN (µ, σ, ξ). When ξ= 0 the generalized log-normal process is
defined to be LN (µ, σ), i.e., LN (µ, σ, 0) = LN(µ, σ).
For any real ξ,LN (µ, σ, ξ ) is defined by
dLN(µ, σ, ξ)
dLN(µ, σ, 0)(W)=[IW(x)dx]ξ
C(ξ,µ)(5.27)
where C(ξ,µ)=EIW(x)dx]ξthe expectation being taken under LN(µ, σ, 0). Lenk
shows that this expectation exists for all real ξ.
We are now ready to define the random density.
5.7. GAUSSIAN PROCESS PRIORS 179
Definition 5.7.2. Let {W(x).x R}be a generalized log normal process LN(µ, σ, ξ)
on R+. The distribution of
f(x)= W(x)
IW(x)dx
is called a logistic normal process and denoted by LNS(µ, σ, ξ).
Clearly fis a random density. We next show that if fhas logistic normal distribu-
tion then so does the posterior given X1,X
2,...,X
n.
Theorem 5.7.4. If fLNS(µ, σ, ξ )then the posterior given X1,X
2,...,X
nis
LNS(µ)where µ(x)=µ(x)+n
1σ(x, Xi)and ξ=ξn.
Proof. If WLN(µ, σ, ξ) then by the Bayes theorem (for densities) the posterior Λ
of Wgiven X1,X
2,...,X
nis
dΛ
dΛ(µ, σ, 0)(W)=KI
[W(x)dx]ξn
1W(xi)
[I[W(x)dx]n](5.28)
=KI
[W(x)dx]ξn
n
1
W(xi) (5.29)
and comparison with (5.26) and (5.27) shows that this is LNS(µ). The theorem
follows because the distribution of fis just the posterior distribution of W/ IW(x)dx.
Even though the transformations µ→ µ→ σ, ξ → ξlook simple, any interpre-
tation needs to be tempered. First note that µ, σ, ξ do not identify the prior because
if µ1µ2Cthen both µ1ξ and µ2ξ will lead to the same prior for f. Second µ
and σdo not translate separately to E(f)andcov(f(x),f(y)). A change in either µor
σwill affect both E(f)andcov(f(x),f(y)). As n→∞both µ→∞and ξ→−
indicating that these cannot be used to do simple minded asymptotics.
Since the prior is on densities, the natural tool to study consistency is the Schwartz
theorem and Theorem 4.4.4. When the Gaussian process is a standard Brownian
motion, with some work it can be shown that if the true distribution f0satisfies
log f0is bounded then the Schwartz condition holds at f0.TowardL1-consistency a
natural sieve to consider would be to divide [a, b]intoO(n)intervalsandtolookat
the class of functions that have oscillation less than δin all the intervals. These are
just preliminary observations; more careful study needs to be done.
180 5. DENSITY ESTIMATION
It also appears, that in analogy with Dirichlet mixtures, one should introduce a
window hin the covariance and have ρh(x)=(1/h)ρ(x/h).
In any case a lot of further work is needed to develop this promising method.
It would also be good to have some theoretical or numerical evidence justifying the
numerical calculation of the posterior given in Lenk. For instance, one could compare
Lenk’s algorithms with approximations based on discretization.
6
Inference for Location Parameter
6.1 Introduction
We begin our discussion of semiparametric problems with inference about location
parameters. The related problem of regression is taken up in a later chapter.
Our starting point is an important counterexample of Diaconis and Freedman
[46, 45]. Since the Dirichlet process is a very flexible and popular prior for many
infinite-dimensional examples, it seems natural to use it for estimating a location
parameter. Diaconis and Freedman showed that it leads to posterior inconsistency.
Barron suggests that the pathology is more fundamental. We present some of their re-
sults in Section 2. Doss [50], [51] and [52], showed the existence of similar phenomena
when one wants to estimate a parameter θthat is a median.
A common explanation is that inconsistency is due to the Dirichlet sitting on
discrete distributions. It is indeed true that the semiparametric likelihood is difficult
to handle when a prior sits on discrete distributions. But Diaconis and Freedman [46]
argue in their rejoinder to such comments that they expect the same phenomenon
for Polya tree priors that sit on densities. We take up this problem in Sections 6.3
and 6.4 and show that under certain conditions symmetrized Polya tree priors have a
rich Kullback-Leibler support so that by Schwartz’s theorem, one can show posterior
consistency for the location parameter for a large class of true densities.
182 6. INFERENCE FOR LOCATION PARAMETER
One lesson that emerges from all this is that the tail free property, which is a
natural tool for consistency, is destroyed by the addition of a parameter. Hence the
Schwartz criterion is an appropriate tool for proving consistency. In particular, if one
wants posterior consistency for certain true P0s, then it is desirable to have a prior
whose Kullback-Leibler support contains them.
Another natural prior to consider is the Dirichlet mixture of normals, which has
emerged as the currently most popular prior for Bayesian density estimation. We will
explore its properties in the next chapter and return briefly to the location parameter
in Chapter 7.
Much of this chapter is based on Diaconis and Freedman [46] and Ghosal et.al. [78].
6.2 The Diaconis-Freedman Example
Suppose we have the model
Xi=Yi+θ, i =1,2,...,n
where given Pand θ,Yis are i.i.d. P. Finally Pand θare independent with Dirichlet
process prior Dαfor Pand a prior density µ(θ)forθ. The probability measure ¯αhas
a density g.
Suppose the true value of θis θ0and the true distribution of the YsisP0with den-
sity f0. The densities µ, g, f0are all with respect to Lebesgue measure on appropriate
spaces.
The main interest is in the location parameter θand the behavior of the posterior
for θunder P0. Since the random distributions Pare not symmetrized around 0, the
location parameter has an identifiability problem. For the time being, we ignore this.
We will rectify this later by symmetrizing P.
To calculate the posterior, note that the random distribution Pof Xs is a mixture
of Dirichlet, i.e., given θ,PDαθ,whereαθ(·)=α(Rα(·−θ). Because P0has a
density Xis may be assumed to be distinct. Hence by expression (3.17) the posterior
density Π(θ|X1,X
2,...,X
n) is proportional to
µ(θ)
n
1
g(Xiθ)
As Barron pointed out in his discussion of [46] the Dirichlet is a pathological prior
for a parameter in a semiparametric problem. The posterior is the same as if one
assumed that Xis are i.i.d. with the parametrized density g(Xiθ).
6.2. THE DIACONIS-FREEDMAN EXAMPLE 183
Diaconis and Freedman point out that consequences of choosing gcan be serious.
If gis a normal density, then one gets consistency, but not when gis Cauchy. An
intuitive interpretation of this is that a normal likelihood for θprovides a robust
model. For example, the MLE is ¯
X, which is consistent for E(X)=θeven without
normality. On the other hand, a Cauchy likelihood for θ, unlike a Cauchy prior,
does not provide robustness. In fact, Diaconis and Freedman provide the following
counterexample. They construct an f0, which has compact support, is symmetric
around 0, and infinitely differentiable. Under θ0and P0, nearly half the samples the
posterior concentrates around θ0+δand for nearly another half it concentrates around
θ0δ. The true model P0can be chosen to make δas large as we please. Because
we are now essentially dealing with a misspecified model g, when actually f0is true,
some insight into this phenomenon as well as the argument in [46] can be achieved
by studying the asymptotic behavior of the posterior under misspecified models; see
[17] and Bunke and Milhaud [28].
We now indicate why the same phenomenon holds even if we symmetrize Pto
Ps(A)=(1/2)(P(A)+P(A)).
Given Pwe first generate Z1,Z
2,...,Z
n, i.i.d. P. Then define Yi=|Zi|δi,whereδi
are i.i.d. and δi=±1 with probability 1/2. Then Y1,Y
2,...,Y
nare i.i.d. Ps.Given
Ysandθ;Xi=Yi+θas before. We will provide a heuristic computation of the
posterior distribution of θ.
Assume without loss generality that X1,X
2,...,X
nand (Xi+Xj)/2,1i<jn
are all distinct. The variables (θ, X), (θ, Z), and (θ, Y ) may be related in two ways.
If θ=(Xi+Xj)/2 for all pairs i, j then
Yi=|Zi|δi=Xiθ
are all distinct. Moreover, all the |Zi|s are also distinct. For, if |Zi|=|Zj|, then δiand
δjmust be of opposite sign and θmust be (Xi+Xj)/2, a case we have excluded for
the time being. Hence, given θ, |Z1|,|Z2|,...,|Zn|are ndistinct values in a sample of
size nfrom the distribution P|Z|=Ps,|Z|,wherePis Dαθ. Hence one can write down
the joint density of |Z1|,|Z2|,...,|Zn|by equation (3.17). Finally, δis are independent
given θand |Zi|. Since there is a 1-1 correspondence between Yiand (Zi
i), the
density of Yis given θis
C
n
1g|z|(|yi|)1
2=C
n
1
g(yi)=C
n
1
g(Xiθ) (6.1)
where C={α(R)[n]}1{α(R)}n.
184 6. INFERENCE FOR LOCATION PARAMETER
There is a second way in which the Yis can be related to Xis. Suppose θ=(Xi+
Xj)/2. Then |Zi|=|Zj|and δiand δjare of opposite sign. The remaining |Z|s—all
(n2) of them—are all distinct and different from the common value of |Zi|and |Zj|.
Hence, given θ=(Xi+Xj)/2, the density of Zs (with respect to (n1)-dimensional
Lebesgue measure) is
D
k=i,j
g|Z|(|Yk|)g|Z|(|Yi|)=Cn
1g|Z|(|Yk|)
g|Z|(|Yj|)
where D=C/α(R). Finally, given θ=(Xi+Xj)/2, the density of Y1,Y
2,...,Y
nis
Cn
1g|Z|(|Yk)
g|Z|(|Yj|)
1
2n=g(Xiθ)
2g(XiXj)(6.2)
because |Yi|=|Yj|=|XiXj|and g(|XiXj|)=g(XiXj).
The density (6.1) multiplied by µ(θ) leads to the absolutely continuous part of the
posterior for θ, while (6.2) leads to its discrete part. Formally, the discrete part is
Πd(θ|X1,X
2,...,X
n)=
i<j
µXi+Xj
2g(Xiθ)
2g(XiXj)
and the absolutely continuous part has the density
Πc(θ|X1,X
2,...,X
n)=µ(θ)C
n
1
g(yi)=(θ)
n
1
g(Xiθ)
Hence the posterior is
Π(θ|X1,X
2,...,X
n)=c(θ, X)+d(θ, X )
CN
where CNis the norming constant
CN=Cµc(θ, X)+
θ=(Xi+Xj)/2:i<j
µd(θ, X)
A detailed, rigorous proof appears in lemma 3.1 of and Freedman[45]. The posterior
is still pathological and leads to inconsistency.
Diaconis and Freedman give examples of P0where one of the two terms in the
posterior dominate. In case the first term dominates, the posterior for the symmetrized
Dirichlet is similar to the posterior for the Dirichlet, and the proof for consistency in
that case applies here.
6.3. CONSISTENCY OF THE POSTERIOR 185
6.3 Consistency of the Posterior
When Phas a symmetrized Dirichlet prior distribution and gis log concave, as for
normal, then Diaconis and Freedman [45] show that the posterior is consistent for all θ0
for essentially “all” true P0. On the other hand without such assumptions consistency
fails, as indicated in the previous section. One explanation is the pathological form of
the posterior. A somewhat deeper explanation is the fact that the Dirichlet and the
symmetrized Dirichlet live on discrete distributions.
Diaconis and Freedman reacted to this as follows. They argued that discreteness is
not the main issue. They construct a class of Polya tree priors, supported by densities
and remark “Now consider the location problem; we guess this prior is consistent
when expectation is the normal and and inconsistent when it is Cauchy. The real
mathematical issue, it seems to us, is to find computable Bayes procedures and figure
out when they are consistent.”
We believe that Diaconis and Freedman are correct in thinking that existence of
density for random Pis not enough. What one needs is a stronger notion of support
and a prior that has a support rich enough to contain one’s favorite P0s. the weak
support is not good enough except for tail free priors. Since addition of a parameter
destroys the tail free property, neither tail free priors nor the assumption that P0is
in the weak support of the prior helps in ensuring consistency. Schwartz’s theorem
shows that a sufficient condition for consistency is that P0is in the Kulback-Leibler
support of the prior. Schwartz’s theorem is stated next in the form in which we need
it.
Our parameter space is Θ ×F
swhere Θ is the real line and Fsis the set of
all symmetric densities on R.O×F
s, we consider a prior µ×Pand given
(θ, f ), X1,X
2,...,X
nare independent identically distributed as Pθ,f,wherePθ,f is
the probability measure corresponding to the density f(xθ). We denote by fθ
the density f(xθ). Given X1,X
2,...,X
n, we consider the posterior distribution
(µ×P)(···|X1,X
2,...,X
n)o×Fsgiven by the density fθ(Xi)/fθ(Xi)d(µ×
P)(θ, f ). The posterior (µ×P)(···|X1,X
2,...,X
n)issaidtobeconsistentat(θ0,f
0)
if, as n→∞,(µ×P)(···|X1,X
2,...,X
n) converges weakly to the degenerate measure
δθ0,f0almost surely Pθ0,f0. Clearly, if the posterior is consistent at (θ0,f
0), the marginal
distribution of (µ×P)(···|X1,X
2,...,X
n) on Θ converges to δθ0almost surely Pθ0,f0.
Theorem 6.3.1. If for all δ>0,
(µ×P){(θ, f ):K(fθ0,f
θ)}>0,(6.3)
then the posterior (µ×P)(···|X1,X
2,...,X
n)is consistent at (θ0,f
0).
186 6. INFERENCE FOR LOCATION PARAMETER
A naive way to ensure (6.3) is to require that θ0and f0belong respectively, to the
Euclidean and Kullback-Leibler supports of µand P. The flaw in this argument is
that the Kullback-Leibler divergence is not a metric. So even if θis close to θ0and
K(f0,f) is small, we cannot draw any conclusion about K(f0θ0,f
θ)orK(f,fθ). A
way out is indicated below.
Definition 6.3.1. The map (θ, f )→ fθis said to be KL-continuous at (0,f
0)if
K(f0,f
0)=
−∞
f0(x)log(f0(x)/f0(xθ))dx 0asθ0.
We would then call (0,f
0)aKL-continuity point.
Let f
0be the density defined by f
0(x)=(f0 (x)+f0(x)) /2, the symmetriza-
tion of f0where f0stands for f0(.θ). For later convenience we write Pinstead
of Pfor a prior on Fs.
Assumption A: Support of µis Rand for all θsufficiently small, f
0is in the
K-LsupportofP.
It is easy to check that this condition holds for many common densities, e.g., for
normal or Cauchy. However, it fails for densities like uniform on an interval. For such
cases a different method is discussed later.
Theorem 6.3.2. If µand Psatisfy Assumption A and if (0,f
0)is a KL-continuity
point, then the posterior (µ×P)(···|X1,X
2,...,X
n)is consistent at (0,f
0).
Proof. We first prove it when θ= 0. By Theorem 6.3.1, it is enough to verify that
µ×Psatisfies the Schwartz condition (6.3). For any θ,
K(f0,f
θ)=
−∞
f0log(f0/fθ) (6.4)
=
−∞
f0log f0
−∞
f0log f
Since
−∞
f0log f
0=
−∞
f
0log f
0(6.5)
and
−∞
f0log f=
−∞
f
0log f, (6.6)
6.3. CONSISTENCY OF THE POSTERIOR 187
we have, by the concavity of logx
K(f0,f
θ)=
−∞
f0log(f0/f
0)+
−∞
f
0log(f
0/f )
1
2
−∞
f0log f0
f0+1
2
−∞
f0log f0
f0,θ+K(f
0,f)
=1
2K(f0,f
0,2θ)+K(f
0,f)
(6.7)
By the KL-continuity assumption there is an εsuch that for |θ|, the first term
is less than δ/2. For any θ, by Assumption A, {f:K(f
0,f)/2}has positive P
measure. Thus we have, for each θ[ε, ε],{f:K(f
0,f)/2}is contained in
{f:K(f0,f
θ)}. Since µ[ε, ε]>0 this completes the proof for θ=0.
For a gen e ral θ0,K(f00,f
θ0+θ)=K(f0,f
θ) which by the previous argument is less
than δwith positive probability, if fis chosen as before and θis in [θ0, θ0+].
Assumption A of Theorem 6.3.2 can be verified if Parises as follows. Let Pbe a
symmetrization of Pobtained by one of the following two methods.
Method 1. Let Pbe a prior on F. The map f→ (f(x)+f(x))/2fromFto Fs
induces a measure on Fs.
Method 2. Let Pbe a prior on F(R+)—the space of densities on R+. The map
f→ f, where, f(x)=f(x)=f(x)/2, gives rise to a measure on Fs.
Lemma 6.3.1. Let Pbe a prior on For on F(R+)with a given symmetric f0
in its K-L support. Let Pbe the prior obtained on Fsby Method 1or Method 2.If
f0∈Fs, then
P{f∈Fs:K(f0,f)}>0 (6.8)
Proof. For Method 1, the result follows from Jensen’s inequality; the conclusion is
immediate for method 2 because, setting g0(x)=2f0(x)andg(x)=2f(x)forxin
R+,bothg0,g belong to F(R+)andK(f0,f)=K(g0,g).
The K-L continuity assumptions fails if f0has support in a finite interval. However,
our next result in this section shows that consistency continues to hold even when
f0has support in a finite interval, provided f0is continuous. The proof consists in
approximating f0by an f1satisfying conditions of Theorem 6.3.2. We first need a
lemma to bound a K-L number. It is a slight improvement over a lemma in [78].
Lemma 6.3.2. Let f0and f1be densities so that f0Cf1. Then for any f,
K(f0,f)Clog C+[K(f1,f)+K(f1,f)]
188 6. INFERENCE FOR LOCATION PARAMETER
Proof. First note that C1. Also
K(f0,f)f0[log(f0/f1)]+Cf1[log(Cf1/f)]+
Clog C+Cf1[log(f1/f)]+
(6.9)
But f1[log(f1/f)]+K(f1,f)+f1[log(f1/f)](6.10)
f1[log(f1/f)]=f1[log(f/f1)]+f1f
f11+
=ff1
2K(f1,f)
(6.11)
The last inequality follows from Proposition 1.2.2. Combining (6.9), (6.10) and (6.11),
one gets the lemma.
Theorem 6.3.3. If µand Psatisfy Assumption A, f0is continuous and has
support in a finite interval [a, a], and log α(x)is integrable with respect to N(µ, σ2)
for all (µ, σ), then the posterior P(···|X1,X
2,...,X
n)is consistent at (θ, f0)for all
θ.
Proof. We consider two cases.
Case 1. inf
[a,a]f0(x)=α>0.
Let
f1(x)=
(1 η)f0(x),for a<x<a
(η/2)φa,σ2,for x≤−a
(η/2)φa,σ2,for xa
(6.12)
where φa,σ2and φa,σ2are, respectively, the densities of N(a, σ2)andN(a, σ2)and
σ2is chosen to ensure that f1is continuous at a.
We first show that f1is KL-continuous, i.e.,
lim
θ0
−∞
f1log(f1/f1)=
−∞
lim
θ0f1log(f1/f1) = 0 (6.13)
It is enough to establish that for some ε>0, the family {log(f1/f1):|θ|}is
uniformly integrable with respect to f1. This follows because for any M,
sup
|θ|
sup
|x|<M|log(f1(x)/f1(x))|<C
M
6.4. POLYA TREE PRIORS 189
and when Mis large, for |x|>M,f1(x)=(η/2)(σ2π)1exp[(xaθ)2/(2σ2)]
for all |θ|, implying
sup
|θ||x|>M
f1(x)log(f1(x)/f1 (x))dx 0asM→∞
It now follows from Lemma 6.3.2 that, by setting C=(1η)1and choosing ηclose
to1sothat(C+1)logC<δ/2, we can choose a δsuch that K(f1,f)
implies
K(f0,f); consequently {(θ, f ):K(f1,f
θ)
}⊂{(θ, f ):K(f0,f
θ)}.
Theorem 6.3.2 shows that the set on the left hand side has positive µ×Pmeasure.
Case 2. inf
[a,a]f0(x)=0.
By the continuity of f0,wecan,givenanyη>0, choose a Csuch that a
a(f0C)=
1+η,whereab=max(a, b). Set f1=(1+η)1(f0C). Then f0(1 + η)f1and
using Lemma 6.3.2, we can choose ηand δsmallsuchthat{f:K(f1,f)
}⊂
{f:K(f0,f)}. Since f1is covered by Case 1, the theorem follows.
In the remaining section we concentrate on constructing Polya tree priors which
satisfy conditions of Theorem 6.3.2 for many f0s.
6.4 Polya Tree Priors
The main result in this section is Theorem 6.4.1. It implies that Assumption A is true
if Pis a symmetrization of the Polya tree prior in this theorem and K(f00)<
for all θ0.
We already discussed the basic properties of Polya trees in Chapter 3. They are
recalled below. Let E={0,1}and Embe the m-fold Cartesian product E×···×E
where E0=.Further,setE=
m=0Em.Letπ0={R}and for each m=1,2,...,
let πm={Bε:εEm}be a partition of Rso that sets of πm+1 are obtained from a
binary split of the sets of πmand
m=0πmis a generator for the Borel σ-field on R.
Let Π = {πm:m=0,1,...}.
A random probability measure Pon Ris said to possess a Polya tree distribution
with parameters (Π,A); we write PPT(Π,A), if there exist a collection of non-
negative numbers A={αε:εE}and a collection Y={Yε:εE}of random
variables such that the following hold:
(i) the collection Yconsists of mutually independent random variables;
(ii) for each εE,Yεhas a beta distribution with parameters αε0and αε1;
190 6. INFERENCE FOR LOCATION PARAMETER
(iii) the random probability measure Pis related to Ythrough the relations
P(Bε1···εm)=
m
j=1;εj=0
Yε1···εj1
m
j=1;εj=1
(1 Yε1···εj1)
m=1,2,...,
where the factors are Y0or 1 Y0if j=1.
We restrict ourselves to partitions Π = {πm:m=0,1,...}that are determined by
a strictly positive continuous density αon Rin the following manner: The sets in πm
are intervals of the form {x:(k1)/2m<x
−∞ α(t)dt k/2m},k=1,2,...,2m.We
term the measure (corresponding to) αas the base measure because its role is similar
to the base measure of Dirichlet process.
Our next theorem refines theorem 2 of Lavine [119] by providing an explicit condi-
tion on the parameters.
Theorem 6.4.1. Let f0be a density and Pdenote the prior PT(Π,A), where
αε=rmfor all εEmand
m=1 r1/2
m<. Further assume that K(f0)<.
Then for every δ>0,
P{P:K(f0,f)}>0 (6.14)
Proof. By Theorem 3.3.7, the weaker condition
m=0 r1
m<implies the existence
of a density of the random probability measure. Considering the transformation x→
x
−∞ α(t)dt, assume that fand f0are densities on [0,1].Moreover,Πisthenthe
canonical binary partition. By the martingale convergence theorem, there exists a
collection of numbers {yε:εE}from [0,1] such that, with probability one
f0(x) = lim
m→∞
m
j=1;εj=0
2yε1···εj1
m
j=1;εj=1
2(1 yε1···εj1)
. (6.15)
where the limit is taken through a sequence ε1ε2··· which corresponds to the dyadic
expansion of x. It similarly follows that
f(x) = lim
m→∞
m
j=1;εj=0
2Yε1···εj1
m
j=1;εj=1
2(1 Yε1···εj1)
(6.16)
for almost every realization of f.NowforanyN1,
K(f0,f)=MN+R1NR2N(6.17)
6.4. POLYA TREE PRIORS 191
where
MN=E
log
N
j=1;εj=0 yε1···εj1
Yε1···εj1N
j=1;εj=1 1yε1···εj1
1Yε1···εj1
(6.18)
R1N=E[log(
j=N+1;εj=0
2yε1···εj1
j=N+1;εj=1
2(1 yε1···εj1))] (6.19)
and
R2N=E[log(
j=N+1;εj=0
2Yε1···εj1
j=N+1;εj=1
2(1 Yε1···εj1))] (6.20)
with Estanding for the expectation with respect to the distribution of (ε1
2,...)for
a fixed realization of the Ys. The εs come from the binary expansion of x,andxis
distributed according to the density f0.
By the definition of a Polya tree, MNand R2Nare independent for all N1. To
prove (6.14), we show that for any δ>0,thereissomeN1 such that
P{MN}>0 (6.21)
|R1N|(6.22)
and
P{|R2N|}>0 (6.23)
The set {(Yε:εEm,m =0,...,N 1) : MN}is a nonempty open
set in R2N1; it is open by the continuity of the relevant map and it is nonempty
as (yε:εEm,m =0,...,N 1) belongs to this set. Thus (6.21) follows by
the nonsingularity of the beta distribution. Relation (6.22) follows from lemma 2 of
Barron [6]. To complete the proof, it remains to show (6.23) for some N1. We
actually prove the stronger fact
lim
N→∞ P{|R2N|≥δ}= 0 (6.24)
Let Estand for the expectation with respect to the prior distribution.i.e., the distri-
bution of the YsandE, as before, the expectation with respect to the distribution of
192 6. INFERENCE FOR LOCATION PARAMETER
(ε1
2,...). Now
P{|R2N|≥δ}
δ1E|R2N|
δ1EE[
j=N+1;εj=0 |log(2Yε1···εj1)|+
j=N+1;εj=1 |log(2(1 Yε1···εj1))|]
=δ1E[
j=N+1;εj=0
E|log(2Yε1···εj1)|+
j=N+1;εj=1
E|log(2(1 Yε1···εj1))|](6.25)
δ1E[
j=N+1
max{E|log(2Yε1···εj1)|,E|log(2(1 Yε1···εj1))|]
δ1
j=N+1
max
(ε1···εj1)Ej1max{E|log(2Yε1···εj1)|,E|log(2(1 Yε1···εj1))|]
=δ1
j=N+1
η(rj1)
where η(k)=E|log(2Uk)|with UkBeta(k, k). By Lemma 6.4.1, η(k)=O(k1/2)
as k→∞. Since
m=1 r1/2
m<by assumption, the right-hand side of (6.25) is
the tail of a convergent series. This completes the proof of (6.24) and hence of the
theorem as well.
Remark 6.4.1.Essentially the same proof shows that the Kullback-Leibler neighbor-
hoods would continue to have positive measure when the prior is modified as follows:
Divide Rinto k+ 1 intervals I1,...,I
k+1 and assume that (P(I1),...,P(Ik)) have
a joint density which is positive everywhere on the k-dimensional set {(a1,...,a
k):
ai>0,j =1,...,k,k
j=1 ai<1}.ForeachIj, the conditional distribution given
P(Ij) has a Polya tree prior satisfying the assumptions of the theorem. These priors
are special cases of the priors constructed by Diaconis and Freedman. Moreover, it
follows from theorem 1 of Lavine [119] that such priors can approximate any prior
belief up to any desired degree of accuracy in a strong sense.
Remark 6.4.2.It is not necessary that for each m,αε1···εmbe the same for all
(ε1,...,ε
m)Em. The proof goes through even when only αε1···εm10=αε1···εm11
for all (ε1,...,ε
m1)Em1,m1, and rm:= min{αε1···εm:(ε1,...,ε
m)Em}
satisfies the condition
m=1 r1/2
m<.
6.4. POLYA TREE PRIORS 193
Lemma 6.4.1. If Ukbeta(k, k), then E|log(2Uk)|=O(k1/2)as k→∞.
Proof. The proof uses Laplace’s method with a rigorous control of the error term. Let
ηk=E|log(2Uk)|, i.e.,
ηk=1
B(k, k)1
0|log(2u)|uk1(1 u)k1du (6.26)
=1
B(k, k)1
0|log(2(1 u))|uk1(1 u)k1du (6.27)
Adding (6.26) and (6.27) and observing that log(2u) and log(2(1 u))arealwaysof
the opposite sign,
2ηk=1
B(k, k)1
0|log(u/(1 u))|uk1(1 u)k1du (6.28)
This implies by Jensen’s inequality that
4η2
k1
B(k, k)1
0
(log(u/(1 u)))2uk1(1 u)k1du
=1
B(k, k)1
0{1+(log(u/(1 u)))2}uk1(1 u)k1du 1
(6.29)
We approximate the integral by Laplace’s method. Let
{1+(log(u/(1 u)))2}uk1(1 u)k1=exp(gk(u)) (6.30)
where
gk(u)=(k1) log u+(k1) log(1 u)+h(u)
and
h(u)=log{1+(log(u/(1 u)))2}
Clearly, gk(1/2) = 2(k1) log 2, g
k(1/2) = 0 and g
k(u) is decreasing in uso that
gk(u) has a unique maximum at 1/2. Fix δ>0 and let λ=sup{h(u):|u1/2|}.
Then on u(1/2δ, 1/2+δ), we have
gk(u)≤−2(k1) log 2 (u1
2)2
2(8(k1) λ) (6.31)
194 6. INFERENCE FOR LOCATION PARAMETER
Thus
4η2
k
1
B(k, k)1/2+δ
1/2δ
exp[2(k1) log 2 4(k1) 1λ
8(k1)(u1
2)2]du
+1
B(k, k)|u1
2|{1+(log(u/(1 u)))2}uk1(1 u)k1du 1 (6.32)
Γ(2k)
(Γ(k))222(k1)
−∞
exp[4(k1) 1λ
8(k1)(u1
2)2]du
+1
B(k, k)|u1
2|{1+(log(u/(1 u)))2}uk1(1 u)k1du 1
Since the function u(1 u){1+(log(u/(1 u))2}is bounded on (0,1) by, say, M,the
second term on the right-hand side of (6.32) is dominated by
M
B(k, k)|u1/2|
uk2(1 u)k2du
=M(2k1)(2k2)
(k1)2P{|Uk11
2|}
M(2k1)(2k2)
(k1)2E|Uk11
2|22
=O(k1)
(6.33)
The first term on the right-hand side of (6.32) is
Γ(2k)
(Γ(k))222k+2(2π)1/2(8(k1) λ)1/2(6.34)
which, by an application of Stirling’s inequalities [[171] p. 253], is less than
(2k)2k1/2e2k(2π)1/2exp[(24k)1]
(kk1/2ek(2π)1/2)222k+2(2π)1/2
×23/2(k1)1/21λ
8(k1)1/2
=k
k11/2
exp[(24k)1]1λ
8(k1)1/2
=1+O(k1)
(6.35)
Thus η2
k=O(k1), completing the proof.
6.4. POLYA TREE PRIORS 195
Remark 6.4.3.While we have discussed consistency issues, it would be interesting
to explore how the robustness calculations in Section 4 of Lavine [119] can be made
in the context of a location parameter.
We have argued that the Schwartz theorem is the best available tool for handling
consistency issues in semiparametric problems. We have also exhibited a Polya tree
priors which have a rich K-L support. However, there are caveats. The consistency
theorem notwithstanding, computation of the posterior for θfor a density f0of the
kind used by Diaconis-Freedman shows that convergence for Cauchy base measure is
very slow. Even for n= 500, one notices the tendency to converge to a wrong value,
as in the case of the Dirichlet prior with Cauchy base measure. Rapid convergence
does take place if we replace the Cauchy by the normal.
A second fact is that the condition r1/2
m<implies that the tail of the
random Pis close in some sense to the tail of the prior expected density. This in
turn implies that the posterior for fconverges to δf0rather slowly, which might imply
relatively slow convergence also of the posterior for θ. Both these questions can be
better understood if one can get rates of convergence of the posterior and see how
they depend on the base measure and the rms. These are delicate issues.
What happens if r1/2
m=? We have conjectured earlier that then, the Schwartz
condition would not hold. If so, it seems likely that in all such cases consistency would
depend dramatically on the base measure.
7
Regression Problems
7.1 Introduction
An important semiparametric problem is to make inference about the constants in
the regression equation when the error in the regression model
Yi=α+βxi+i,i=1,2,... (7.1)
has an unknown, symmetric distribution. This is similar to the location parameter
problem, so it is natural to try a symmetrized Polya tree prior for the error distribu-
tion. Another prior that suggests itself is a symmetrized version of Dirichlet mixtures
of normals of Chapter 5. We explore both priors in this chapter with a focus on pos-
terior consistency. The covariate may arise as fixed nonrandom constants or as i.i.d.
observations of a random variable.
Because this is a semiparametric problem, it is natural to try to use Schwartz’s the-
orem. However since the observations are not identically distributed, major changes
are needed. We begin with a variant of Schwartz’s theorem in Section 7.2. In two of
the subsequent sections we discuss how the conditions of the theorem can be verified.
Lack of i.i.d. structure for the Yis necessitates assumptions on the xis to ensure that
the exponentially consistent tests required by Schwartz’s theorem exist in the cur-
rent context. Also certain conditions have to be imposed on f0to verify conditions
relating to K-L support and variance in the Schwartz theorem. Among other things
198 7. REGRESSION PROBLEMS
it is shown that Polya tree priors of the sort considered in the Chapter 6 fulfill the
required conditions on the prior.
We then turn to the Dirichlet mixtures of normal. It turns out that the random
densities are sufficiently well behaved that the proof for results similar to that outlined
in the previous paragraph can be simplified to some extent.
It may be observed that as in the Chapter 6 it may be tempting to use a Dirichlet
prior on F. It is easy to show that the posterior for α, β would be pathological in
exactly the same way, namely, it would be identical with the posterior arising from
assigning a parametric prior on F. The proof is quite similar.
In the literature, the regression problem has been handled by putting a Dirichlet
mixture of normals but without symmetrization. This means that there is an identi-
fiability problem for the constant but not for the regression coefficient β. Of course,
the posterior for αcannot be consistent, but one can show posterior consistency for β.
In many examples, one would want consistency for both αand β, so symmetrization
seems desirable. See , Burr et al.[29] for an interesting application.
The final section discusses binary response regression with nonparametric link func-
tions. This chapter is based heavily on [134] and unpublished work of Messan.
7.2 Schwartz Theorem
Fix f0,α0,β0.Let
fα,β,i =fα+βxi(y)=f(y(α+βxi)) (7.2)
and put f0i=f000,i.
For any two densities fand g,let
K(f,g)=flog f
g,V(f, g)=flog f
g2
(7.3)
and put
Ki(f,α,β)=K(f0i,f
α,β,i),V
i(f,α,β)=V(f0i,f
α,β,i) (7.4)
As mentioned in the introduction, the main tool we use is a variant of Schwartz’s
theorem. The following theorem is an adaptation to the case when the Yis are inde-
pendent but not identically distributed. Here the xis are nonrandom.
Definition 7.2.1. Let W⊂F×R×R. A sequence of test functions Φn(Y1,...,Y
n)
is said to be exponentially consistent for testing
H0:(f,α,β)=(f0
0
0) against H1:(f,α,β)∈W (7.5)
7.2. SCHWARTZ THEOREM 199
if there exist constants C1,C2,C>0 such that
(a) En
1
f0i
ΦnC1enC ,and
(b) inf
(f,α,β)∈W En
1
fα,β,i
n)1C2enC .
Theorem 7.2.1. Suppose ˜
Πis a prior on Fand µis a prior for (α, β).LetW⊂
R×R.If
(i) there is an exponentially consistent sequence of tests for
H0:(f,α,β)=(f0
0
0)against H1:(f,α,β)⊂W
(ii) for all δ>0,
Π(f,α,β):Ki(f,α,β) for all i,
i=1
Vi(f,α,β)
i2<>0
then with
i=1 Pf0iprobability 1, the posterior probability
Π(W|Y1,...,Y
n)= Wn
i=1
fα,β i(Yi)
f0i(Yi)dΠ(f,α,β)
R×Rn
i=1
fα,β i(Yi)
f0i(Yi)dΠ(f,α,β)0 (7.6)
Note that Vi(f,α,β) bounded above in iis sufficient to ensure the summability of
i=1 Vi(f,α,β)/i2.
Proof. The proof is similar to the proof of Schwartz’s theorem. If we write (7.6) as
Π(W|Y1,...,Y
n)=I1n(Y1,...,Y
n)
I2n(Y1,...,Y
n)(7.7)
it can be shown, as in the proof of Schwartz’s theorem (Chapter 4), that condition
(i) implies that “ there exists a d>0 such that endI1n(Y1,...,Y
n)0a.s.”
The denominator can be handled similarly, using Kolomogorov’s strong law of large
numbers for independent but not identically distributed random variables. Yet, with
200 7. REGRESSION PROBLEMS
a later application in mind, we give an argument here with a somewhat weaker as-
sumption than (ii). For any two densities fand g,let
V+(f,g)=flog+
f
g2
(7.8)
and put
V+i(f,α,β)=V+(f0i,f
α,β,i) (7.9)
We will show that “ for all d>0, endI2n(Y1, ..., Yn)→∞a.s.” under the assumption,
(ii)For al l δ>0,
Π(f,α,β):Ki(f,α,β) for all i,
i=1
V+i(f,α,β)
i2<>0
Because V+(f,g)V(f,g) it is easy to see that (ii) implies (ii).
Let Vbe the set
(f,α,β):Ki(f,α,β) for all i,
i=1
V+i(f,α,β)
i2<
and Wi=log
+(f0i/fα,β,i)(Yi). Applying Kolmogorov’s strong law of large numbers
for independent non-identical variables to the sequence WiE(Wi), it follows that
for each f∈V, a.s.
i=1 Pf0i,
lim inf
n→∞ 1
n
n
i=1
log fα,β,i(Yi)
f0i(Yi)
≥−lim sup
n→∞ 1
n
n
i=1
log+
f0i(Yi)
fα,β,i(Yi)
=lim sup
n→∞
1
n
n
i=1
K+
i(f,α,β) (7.10)
≥−lim sup
n→∞ 1
n
n
i=1
Ki(f,α,β)+ 1
n
n
i=1 Ki(f,α,β)/2
≥−lim sup
n→∞
1
n
n
i=1
Ki(f,α,β)+4
5
5
6
1
n
n
i=1
Ki(f,α,β)/2
7.3. EXPONENTIALLY CONSISTENT TESTS 201
Since for f∈V,n1n
i=1 Ki(f,α,β),wehaveforeachf∈V,
lim inf
n→∞
1
n
n
i=1
log fα,β,i(Yi)
f0i(Yi)≥−(δ+δ/2) (7.11)
Choosing Cso that δ+δ/2C/8 and noting that
I2nV
n
i=1
fα,β,i(Yi)
f0i(Yi)dΠ(f,α,β)
it follows from Fatou’s lemma that
enC/4I2n→∞ (7.12)
a.s.
i=1 Pf0i.
Remark 7.2.1.Condition (ii) of the theorem can be weakened. It can be seen from
the proof that if the prior assigns positive probability to the following set
1
n
n
i=1
Ki(f,α,β)for all n,
i=1
Vi(f,α,β)+K2
i(f,α,β)
i2<
then also the posterior is consistent.
7.3 Exponentially Consistent Tests
Our goal is to establish consistency for (f,α,β)orfor(α, β)at(f0
0
0), and thus
the sets Wof interest to us are of the type W=Uc,whereUis a neighborhood of β0
or α0alone or of (f0
0
0). In the first case we write Wof this type as a finite union
of Wis and show that condition (i) of Theorem 7.2.1 holds for each of these Wis.
We begin with a couple of lemmas.
Lemma 7.3.1. For i=1,2, let g0iand gibe densities on R. If for each ithere
exists a function Φi,0Φi1such that
Eg0ii)=αiγi=Egii) (7.13)
and if
lim inf
n→∞
1
n
n
i=1
(γiαi)>0 (7.14)
then there exists a constant C, sets BnRn,n=1,2,..., and n0— all depending
only on (γi
i), such that for n>n
0
202 7. REGRESSION PROBLEMS
[n
i=1 Pg0i](Bn)<e
nC , and
[n
i=1 Pgi](Bn)>1enC .
We refer to [134] for a proof. For a density gand θR,letgθstand for the density
gθ(y)=g(yθ).
Lemma 7.3.2. Let g0be a continuous symmetric density on R, with g0(0) >0.
Let ηbe such that inf|y|g0(y)=C>0.
(i) For any >0,there exists a set Bsuch that
Pg0(B)1
2C(∆ η)
and for any symmetric density g
Pgθ(B)1
2for all θ
(ii) For any <0, there exists a set ˜
Bsuch that
Pg0(˜
B)1
2C(∆ η)
and for any symmetric density g
Pgθ(˜
B)1
2for all θ
Proof. (i) Take B=(,). Since θ∆andgθis symmetric around θ,Pgθ(B)
1
2.Ontheotherhand
Pg0(B)=1
2
0
g0(y)dy 1
2η
0
g0(y)dy 1
2C(∆ η) (7.15)
Similarly ˜
B=(−∞,∆) would satisfy condition (ii).
Remark 7.3.1.By considering IB(yθ0), it is easy to see that Lemma 7.3.2 holds
if we replace g0by g00and require θθ0>∆orθθ0<.
7.3. EXPONENTIALLY CONSISTENT TESTS 203
Assumption A. There exists ε0>0 such that the covariate values xisatisfy
lim inf
n→∞
1
n
n
i=1
I{xi<ε0}>0,lim inf
n→∞
1
n
n
i=1
I{xi
0}>0
Remark 7.3.2.Assumption A forces the covariate xto take both positive and neg-
ative values, i.e., values on both sides of 0. If the condition is satisfied around any
point, then by a simple location shift, we can bring it to the present form.
Proposition 7.3.1. If Assumption A holds, f0is continuous at 0and f0(0) >0,
then there is an exponentially consistent sequence of tests for
H0:(f,α,β)=(f0
0
0)against H1:(f,α,β)∈W
in each of the following cases:
(i) W={(f,α,β): α>α
0β0>};
(ii) W={(f,α,β): α<α
0β0>};
(iii) W={(f,α,β): α>α
0β0<}; and
(iv) W={(f,α,β): α<α
0β0<}.
Proof. (i) Let Kn={i:1in, xi
0}and #Knstand for the cardinality of
Kn. We will construct a test using only those Yis for which the corresponding iis in
Kn.
If iKn,then (α+βxi)(α0+β0xi)>xi, and by Lemma 7.3.2 for each iKn,
there exists a set Aisuch that
αi:= Pf0i(Ai)<1
2C(ηxi)
and
γi:= inf
(f,α,β)∈W Pfα,β,i (Ai)1
2
where “:=” stands for equality by definition.
204 7. REGRESSION PROBLEMS
If inand i/Kn,setAi=R, so that αi=γi=1.Thus
lim inf
n→∞ n1
n
i=1
(γiαi)
lim inf
n→∞ n1
iKn
C(ηxi)(7.16)
C(η) lim inf
n→∞ #Kn/n > 0
With Φi=IAi, the result follows from Lemma 7.3.1.
(ii) In this case we construct tests using Yisuch that iMn:= {1in:xi<
ε0}.IfiMn, then
(α+βxi)(α0+β0xi)<xi<ε0
Now using condition (ii) of Lemma 7.3.2, we get sets ˜
Biand then obtain exponentially
consistent tests using Lemma 7.3.1 as in part (i). The other two cases follow similarly.
The union of the W’s in Proposition 7.3.1 is the set {(f, α,β):|ββ0|>}.
The case for αalone can be proved in exactly the same way. Combining all eight
exponentially consistent tests for αand βone can get an exponentially consistent
test for α=α0=β0.
If random fs are not symmetrized around zero, αis not identifiable. So the posterior
distribution for αwill not be consistent. Consistency for βwill continue to hold under
appropriate conditions. To prove the existence of uniformly consistent tests for βin
the nonsymmetric case, we pair Yis and consider the difference YiYj,whichhas
a density that is symmetric around β(xixj). We can now handle the problem in
essentially the same way as in Proposition 7.3.1 to construct strictly unbiased tests.
The verification of the other conditions in Sections 7.4, 7.5 and 7.6 is along similar
lines.
The next proposition considers neighborhoods of f0to get posterior consistency
for the true density rather than only the parametric part. We need an additional
assumption.
Assumption B. For so m e L,|xi|<Lfor all i.
Proposition 7.3.2. Suppose that Assumption Bholds. Let Ube a weak neighbor-
hood of f0and let W=Uc×{(α, β):|αα0|<,|ββ0|<}. Then there exists
7.3. EXPONENTIALLY CONSISTENT TESTS 205
an exponentially consistent sequence of tests for testing
H0:(f,α,β)=(f0
0
0)against H1:(f,α,β)∈W
Proof. Without loss of generality take
U=f:Φ(y)f(y)Φ(y)f0(y)
(7.17)
where 0 Φ1 and Φ is uniformly continuous.
Since Φ is uniformly continuous, given ε>0, there exists δ>0 such that |y1y2|<
δimplies |Φ(y1)Φ(y2)|/2.
Let ∆ be such that
|(αα0)+(ββ0)xi|
for α, β ∈Wand all xi. Set ˜
Φi(y)=Φ(y(α0+β0xi)). Then
Ef0i˜
Φi=Ef0Φi,E
fα,β,i ˜
Φi=Ef(αα0),(ββ0),i Φ (7.18)
Noting that
Φ(y((αα0)+(ββ0)xi))f(αα0)+(ββ0)xi(y)dy
=Φ(y)f(y)dy
we have
˜
Φi(y)fα,β,i(y)dy
Φ(y)f(y)dy |Φ(y)Φ(y((αα0)+(ββ0)xi))|
×f(αα0)+(ββ0)xi(y)dy
Φ(y)f(y)dy ε
2
in the last step, we used the uniform continuity of Φ. An application of Lemma 7.3.1
completes the proof.
If one is interested in showing posterior probability of fU, |αα0|<,|ββ0|<
δgoesto1a.s.(f0
0
0), then it is necessary to get an exponential sequence of tests
for H0:(f,α,β)=(f0
0
0) against H1:fUcor |αα0|>Aor |ββ0|.For
this, one has only to combine Propositions 7.3.1, its analogoue for α, and Proposition
7.3.2.
206 7. REGRESSION PROBLEMS
7.4 Prior Positivity of Neighborhoods
In this section we develop sufficient conditions to verify condition (ii) of Theorem
7.2.1. A similar problem in the context of location parameter was studied in Chapter
6. There, we managed with Kullback-Leibler continuity of f0at θ0—the true value
of the location parameter, and the requirement that Π{K(f
0,f)}>0 for all θ
in a neighborhood of θ0and where f
0is close to but different from f0. However,
this approach does not carry over to the regression context because, even though
the true parameter remains (α0
0), for each iwe encounter different parameters
θi=α0+β0xi. Here we take a different approach. Since we have no assumptions on
the structure of the random density f, the assumption on f0is somewhat strong. This
condition is weakened in Section 7.7, where we consider Dirichlet mixture of normals.
In that case, the random fis better behaved.
Lemma 7.4.1. Suppose f0∈Fsatisfies the following condition: There exists η>0,
Cηand a symmetric density gηsuch that, for |η|,
f0(yη)<C
ηgη(y)for all y(7.19)
Then
(a) for any f∈Fand |θ|
K(f0,f
θ)Cηlog Cη+K(gη,f)+7K(gη,f)
(b) if, in addition, vargη(log(gη/f)) <, then
sup
|θ|
varf0log+
f0
fθ<
Proof. Part (a) is an immediate consequence of Lemma 6.3.2 and the fact that
K(f0,f)=K(f0,f
θ), which follows from the symmetry of f0and f.
For (b), note that
f0log+
f0
fθ2
=f0log+
f0
f2
Cηgηlog+
Cηgη
f2
(7.20)
A remark here: We work with varf0log+f0/fθrather than varf0(log f0/fθ)be-
cause the condition fθ<C
ηgηdoes not imply [log f0/f ]2Cηgη[log Cηgη/f ]2.
7.4. PRIOR POSITIVITY OF NEIGHBORHOODS 207
We write the assumption of Lemma 7.4.1 as follows:
Assumption C. For η>0, sufficiently small, there is gη∈Fand constant Cη>0
such that for |η|,
f0(yη)<C
ηgη(y) for all y
and
Cη1asη0
Proposition 7.4.1. Suppose Assumptions B and C hold. Let ˜
Πbe a prior for f
and µbe a prior for (α, β).If(α0
0)is in the support of µand if for all ηsufficiently
small and for all δ>0
˜
ΠK(gη,f), vargηlog gη
f<>0 (7.21)
then for all δ>0and some M>0,
(˜
Π×µ){(f,α,β):Ki(f,α,β), V
i(f,α,β)<M for all i}>0 (7.22)
Proof. Choose η,δ0such that (7.21) holds with δ=δ0and
(Cη+1)logCη+Cη δ0+δ0!
Let
V=(α, β):|αα0|<η
2,|ββ0|<η
2L
Note that
Ki(f0)=K(f0,f
(αα0)+(ββ0)xi)
and
Vi(f0)=V(f0,f
(αα0)+(ββ0)xi)
and (α, β)Vimplies that |(αα0)+(ββ0)xi| for all xi. An application of
Lemma 7.19 immediately gives the result.
Theorem 7.4.1. Suppose that
(i) the covariates x1,x
2,... satisfy Assumptions A and B;
(ii) f0is continuous, f0(0) >0, and f0satisfies Assumption C;
208 7. REGRESSION PROBLEMS
(iii) for all sufficiently small ηand for all δ>0,
˜
Π{K(gη,f), V(gη,f)<∞} >0
where gηis as in Assumption C.
Then for any neighborhood Uof f0,
Π{(f,α,β):f∈U,|αα0|,|ββ0||Y1,Y
2,...,Y
n}→1 (7.23)
a.s.
i=1 Pf0i.
In other words, the posterior distribution is weakly consistent at (f0
0
0).
Proof. The proof follows from the remarks after Proposition 7.3.2.
Remark 7.4.1.Assumption (ii) of Theorem 7.4.1 is satisfied if f0is Cauchy or
normal. If f0is Cauchy, then gη=f0satisfies Assumption C. If f0is normal, then
Assumption C holds with gη=fs
0,where
fs
0=1
2{f0(yη)+f0(yη)}(7.24)
Remark 7.4.2.Assumption B is used in two places: Propositions 7.3.2 and 7.4.1.
For specific f0s one may be able to obtain the conclusion of Proposition 7.4.1 without
Assumption B. In such cases one would be able to get consistency at (α0
0) without
having to establish consistency at (f0
0
0).
7.5 Polya Tree Priors
In this section we note that Polya tree priors, with a suitable choice of parameters,
satisfy condition (iii) of Theorem 7.19 and hence the posterior distribution is weakly
consistent. To obtain a prior on symmetric densities, we consider Polya tree priors on
densities fon the positive half-line and then considering the symmetrization fs(y)=
1
2f(|y|).Since K(f,g)=K(fs,g
s)andV(f, g)=V(fs,g
s), this symmetrization
presents no problems.
We briefly recall Polya tree priors from Chapter 3. Let E={0,1},Em={0,1}m
and E=8
m=1 Em.Foreachm,{B:Em}is a partition of R+and for each ,
{B0,B
1}is a partition of B. Further {B:E}generates the Borel σ-algebra.
7.6. DIRICHLET MIXTURE OF NORMALS 209
A random probability measure Pon R+is said to be distributed as a Polya tree
with parameters (Π,A), where Π is a sequence of partitions as described in the last
paragraph, and A={α:E}is a collection of nonnegative numbers, if there
exists a collection {Y:E}of mutually independent random variables such that
(i) each Yhas a beta distribution with parameters α0;andα1
(ii) the random measure Pis given by
P(B1···m)=
m
j=1,
j=0
Y1···j1
m
j=1,
j=1
(1 Y1···j)
We restrict ourselves to partitions Π = {Πm:m=0,1,...}that are determined
by a strictly positive, continuous density αon R+in the following sense: The sets in
Πmare intervals of the form
y:k1
2m<y
−∞
α(t)dt k
2m
Theorem 7.5.1. Let ˜
Πbe a Polya tree prior on densities on R+with α=rmfor
all Em.If
m=1 r1/2
m<, then for any density gsuch that K(g, α)<and
varg(log g)<for all δ>0,
lim
M→∞
˜
Π{f:K(g, f), V(g, f)<M}>0 (7.25)
The proof is along similar lines as that of Theorem 6.4.1. We refer to [134] for
details.
Although Polya trees give rise to naturally interpretable priors on densities and
leads to consistent posterior, sample paths of Polya trees are, however, very rough
and have discontinuities everywhere. Such a drawback can be easily overcome by
considering a mixture of Polya trees. Posterior consistency continues to hold this case,
because by Fubini’s theorem, prior positivity holds under mild uniformity conditions.
Such priors are worth further study.
7.6 Dirichlet Mixture of Normals
In this section, we look at random densities that arise as mixtures of normal densities.
Let φhdenote the normal density with mean 0 and standard deviation h. For any
210 7. REGRESSION PROBLEMS
probability Pon R,fh,P will stand for the density
fh,P (y)=φh(yt)dP (t) (7.26)
Our model consists of prior µfor hand a prior ˜
ΠforP. Consistency issues related
to these priors, in the context of density estimation, based on [74], were discussed in
Chapter 5. Here we look at similar issues when the error density fin the regression
model is endowed with these priors.
To ensure that the prior sits on symmetric densities, we let Pbe a random proba-
bility on R+and set
fh,P (y)=1
2φh(yt)dP (t)+1
2φh(y+t)dP (t) (7.27)
We will denote by ˜
Π both the prior for Pand the prior for fh,P .
The following lemma shows that the random fgenerated by the prior under con-
sideration is more regular than those generated by Polya tree priors, and hence the
conditions on f0are more transparent than those in Section 7.5 or those in Ghosal,
Ghosh, and Ramamoorthi [78].
Lemma 7.6.1. Let f0be a density such that
y2f0(y)dy < and f0(y)logf0(y)dy < (7.28)
If f(y)=φh(yt)dP (t)and t2dP (t)<, then
(i) lim
θ0f0(y)log f0(y)
fθ(y)dy =f0(y)log f0(y)
f(y)dy, and
(ii) lim
θ0f0(y)log f0(y)
fθ(y)2
dy =f0(y)log f0(y)
f(y)2
dy.
Proof. We have
log fθ(y)=logφh(y(t+θ))dP (t)
and hence
|log fθ(y)|≤|log 2πh|+
log e(yθt)2/(2h2)dP (t)
(7.29)
7.6. DIRICHLET MIXTURE OF NORMALS 211
Since log e(yθt)2/(2h2)dP (t)<0, by Jensen’s inequality applied to log x, the last
expression is bounded by
|log 2πh|+(yθt)2
h2dP (t)
Hence
f0(y)log f0(y)
fθ(y)
≤|f0(y)logf0(y)|+f0(y)|log fθ(y)|
≤|f0(y)logf0(y)|+|log 2πh|+f0(y)(yθt)2
h2dP (t)
The dominated Convergence Theorem now yields the result.
We now return to the regression model.
Theorem 7.6.1. Suppose ˜
Πis a normal mixture prior for f.If
(i) Assumptions A and B hold,
(ii) ˜
Π{f:K(f0,f), V(f0,f)<∞} >0for all δ>0,
(iii) Ef0(log f0)2<, and
(iv) t2dP (t)d˜
Π(P)<,
then the posterior Π(·|Y1,...,Y
n)is weakly consistent for (f,α,β)at (f0
0
0)pro-
vided (α0
0)is in the support of the prior for (α, β).
Proof. By condition (iv), P:t2dP (t)<has ˜
Π probability 1. So we may assume
that
˜
Πf:f=fP,(ii) holds, t2dP (t)<>0 (7.30)
Let U=f:f=fP,(ii) holds, t2dP (t)<.
For every f∈U, using Lemma 7.6.1, choose δfsuch that, for θ<δ
f
f0log f0
ff0log f0
fθ
(7.31)
212 7. REGRESSION PROBLEMS
Now choose εfsuch that |αα0+(ββ0)xi|
fwhenever |αα0|
f,|ββ0|<
εf/L.
Clearly, if f∈Uand |αα0|
fand |ββ0|
f/L,wehave
Ki(f,α,β)<2δand Vi(f,α,β)<V(f0,f)+δ(7.32)
Since ˜
Π{(f,α,β):f∈U,|αα0|
f,|ββ0|
f/L}>0 (7.33)
we have
Π(f,α,β):Ki(f0) for all i,
i=1
Vi(f,α,β)
i2<>0 (7.34)
An application of Theorem 7.2.1 completes the proof.
It was shown in Chapter 5 that if f0has compact support or if f0=fPwith P
having compact support, then ˜
Π{f:K(f0,f)}>0 for all δ>0. The argument
given there also shows that in these cases, (ii) of Theorem 7.6.1 holds when ˜
Πis
Dirichlet with base measure γ. In Chapter 5 we also described f0s whose tail behavior
is related to that of γsuch that ˜
Π{f:K(f0,f)}>0. In the case when the prior
is Dirichlet, the double-integral in (iv) is finite if and only if t2(t)<. While
normal f0is covered by these results, the case of Cauchy f0cannot be resolved by
the methods in that chapter. However, Dirichlet mixtures of both location and scale
parameters of normal may be able to handle Cauchy, which is a scale mixture of
normal. Results of Chapter 5 may need to be generalized to prove posterior consistency
for these priors. .
7.7 Binary Response Regression with Unknown Link
One of the most popular models in bioassay involves regression of the probability of
some event on a covariate x. The regression is taken to be linear in logit or probit
scale. In this section we consider the same problem with a nonparametric link func-
tion, instead of a logit or probit model. We indicate, without going into details, how
posterior consistency can be established.
Consider klevels of a drug on a suitable scale, say, x1,...,x
k, with probability of
a response (which may be death or some other specified event) pi,i=1,...,k. The
ith level of the drug is given to nisubjects and the number of responses rinoted.
7.7. BINARY RESPONSE REGRESSION WITH UNKNOWN LINK 213
We thus get kindependent binomial variables B(ni,p
i). The object is often to find x
such that p=0.5. Often, piis modeled as
pi=F(α+βxi)=H(xi) (7.35)
where Fis a response distribution and α+βxiis a linear representation of F1(pi)=
yi. Here pimay be estimated by ri/ni, but if the nis are small, the estimates will
have large variances, so the model provides a way of combining all the data. In a
logit model, Fis taken as a logistic distribution function. In a probit model the link
function is the normal distribution function. The choice of the functional form of
the link function is somewhat arbitrary, and this may substantially affect inference,
particularly at the two ends where data are sparse. In recent years, there has been
a lot of interest in link functions with unknown functional form. In nonparametric
problems of this kind, one puts a prior on For H.Suchanapproachwastakenby
Albert and Chib ([1]) , Chen and Dey ([31]), Basu and Mukhopadhyay ([11, 12])
and some other authors. If one puts a prior on F, one has to put conditions on F
like specifying two values of two quantiles to make (F, α,β ) identifiable. In this case,
one can develop sufficient conditions for posterior consistency at (F0
0
0) using
our variant of Schwartz’s theorem. However, in practice, one often puts a Dirichlet
process or some other prior on Fand independently of this, a prior on (α, β). Due
to the discreteness of Dirichlet selections, many authors actually prefer the use of
other priors such as Dirichlet scale mixtures of normals, see Basu and Mukhopadhyay
([11, 12]) and the references therein. Because of the lack of identifiability, the posterior
for (α, β) is not consistent. On the other hand, a Dirichlet process prior and a prior
on (α, β) provides a prior on Hand one can ask for posterior consistency of H1(1/2)
at, say, H1
0(1/2). This problem can be solved by the methods developed earlier in
this chapter.
Without loss of generality, one may take ni= 1 for all i. To verify condition (ii) of
Theorem 7.2.1, consider
Zi=log(H0(xi))ri(1 H0(xi))1ri
(H(xi))ri(1 H(xi))1ri(7.36)
where riis 1 or 0 with probability H(xi)and1H(xi), respectively, and the true H
is denoted by H0. Then it is easily found that
EH0(Zi)=H0(xi)log H0(xi)
H(xi)+(1H0(xi)) log 1H0(xi)
1H(xi)(7.37)
214 7. REGRESSION PROBLEMS
and
EH0(Z2
i)2H0(xi)log H0(xi)
H(xi)2
+2(1H0(xi)) log 1H0(xi)
1H(xi)2
(7.38)
Assume that the $x_i$s lie in a bounded interval containing $H_0^{-1}(1/2)$, and that the support of $H_0$ contains a bigger interval. Since the range of the $x_i$s is bounded, the sequence of formal empirical distributions $n^{-1}\sum_{i=1}^n\delta_{x_i}$ of $x_1,\dots,x_n$ is relatively compact. Assume that all limits of subsequences are distributions which give positive measure to all nondegenerate intervals, provided these lie in a certain interval containing $H_0^{-1}(1/2)$. Therefore, a positive fraction of the $x_i$s lie in any interval of positive length that is close to the point $H_0^{-1}(1/2)$. Also assume that $H_0$ is continuous and the support of the prior for $H$ contains $H_0$. For example, if the prior is a Dirichlet process with a base measure whose support contains the support of $H_0$, then this condition is satisfied. Mixture priors often have large supports also. For instance, the Dirichlet scale mixture of normals prior used by Basu and Mukhopadhyay ([11, 12]) will have this property if the true link function is also a scale mixture of normal cumulative distribution functions.

If $H_\nu$ is a sequence converging weakly to $H_0$, then by Polya's theorem the convergence is uniform. Note that for $0<p<1$, the functions $p\log(p/q)+(1-p)\log((1-p)/(1-q))$ and $p(\log(p/q))^2+(1-p)(\log((1-p)/(1-q)))^2$ of $q$ converge to 0 as $q\to p$, uniformly in $p$ lying in a compact subinterval of $(0,1)$. Thus given $\delta>0$, we can choose a weak neighborhood $U$ of $H_0$ such that if $H\in U$, then the $E_{H_0}(Z_i)$s and $E_{H_0}(Z_i^2)$s are bounded by $\delta$. By the assumption on the support of the prior, condition (ii) of Theorem 7.2.1 holds.
For existence of exponentially consistent tests in condition (i) of Theorem 7.2.1, consider, without loss of generality, testing $H^{-1}(1/2)=H_0^{-1}(1/2)$ against $H^{-1}(1/2)>H_0^{-1}(1/2)+\epsilon$ for small $\epsilon>0$. Let
$$K_n=\bigl\{i: H_0^{-1}(1/2)+\epsilon/2\le x_i\le H_0^{-1}(1/2)+\epsilon\bigr\}$$
Since
$$E_H(r_i)=H(x_i)\le H\bigl(H_0^{-1}(1/2)+\epsilon\bigr)\le\tfrac12\qquad(7.39)$$
and
$$E_{H_0}(r_i)=H_0(x_i)\ge H_0\bigl(H_0^{-1}(1/2)+\epsilon/2\bigr)>\tfrac12\qquad(7.40)$$
the test
$$\frac{1}{\#K_n}\sum_{i\in K_n}r_i<\frac12+\eta\qquad(7.41)$$
with $\eta=\bigl(H_0(H_0^{-1}(1/2)+\epsilon/2)-1/2\bigr)/2$ is exponentially consistent by Hoeffding's inequality and the fact that $\#K_n/n$ converges to positive limits along subsequences. Therefore Theorem 7.2.1 applies and the posterior distribution of $H^{-1}(1/2)$ is consistent at $H_0^{-1}(1/2)$.
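A minimal simulation sketch of the test (7.41) follows. It is not from the text: the logistic choice of $H_0$ (so that $H_0^{-1}(1/2)=0$) and the uniform design are only illustrative, and the data are generated under the null so the test should not reject.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_link(x):                  # H_0: logistic link, so H_0^{-1}(1/2) = 0
    return 1.0 / (1.0 + np.exp(-x))

n, eps = 5000, 0.5
x = rng.uniform(-2, 2, n)          # bounded design containing H_0^{-1}(1/2) = 0
r = rng.binomial(1, true_link(x))  # binary responses generated under H_0

# the window K_n of (7.41): design points between eps/2 and eps (to the right of 0)
Kn = (x >= eps / 2) & (x <= eps)
eta = (true_link(eps / 2) - 0.5) / 2
mean_r = r[Kn].mean()
print(f"#K_n = {Kn.sum()}, average r_i over K_n = {mean_r:.3f}, "
      f"threshold 1/2 + eta = {0.5 + eta:.3f}, reject H_0: {mean_r < 0.5 + eta}")
```

Under $H_0$ the average of the $r_i$ over $K_n$ concentrates above the threshold, and by Hoeffding's inequality the probability of crossing it decays exponentially in $\#K_n$, which grows linearly in $n$ under the assumptions on the design.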
7.8 Stochastic Regressor

In this section, we consider the case where the independent variable $X$ is stochastic. We assume that the $X$ observations $X_1,X_2,\dots$ are i.i.d. with a probability density function $g(x)$ and are independent of the errors $\epsilon_1,\epsilon_2,\dots$. We will argue that all the results on consistency hold under appropriate conditions.

Let $G(x)=\int_{-\infty}^xg(u)\,du$ denote the cumulative distribution function of $X$. We shall assume that the following condition holds.

Assumption D. The independent variable $X$ is compactly supported and $0<G(0)<1$.

Under these assumptions, results follow from a conditionality argument and the corresponding results for the nonstochastic case, conditioned on a sequence $x_1,x_2,\dots$ such that Assumptions A and B hold. Note that if $g$ satisfies Assumption D, then under $P_g^\infty$, almost all sequences $x_1,x_2,\dots$ satisfy Assumptions A and B. For details see [134]. Thus if $X$ is stochastic and Assumption D replaces Assumptions A and B in Theorems 7.5.1 and 7.6.1, posterior consistency holds.
7.9 Simulations
Additional insight can often be obtained by carrying out simulations. In the mixture
model that we have discussed, one can study the effect on the posterior of βby varying
the ingredients in the mixture model. There is an additional issue of symmetrization.
After fixing the prior, one can generate observations from carefully chosen parameters
and error density and in each case examine the behavior of the posterior. Extensive
simulations of this kind have been done by Charles Messan using WINBUGS, and we
present a few of these.
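The data-generation step of such a study can be sketched as follows (a minimal sketch, not the WINBUGS code used for the figures): it assumes the simple linear model $y_i=\beta x_i+\epsilon_i$ with $\beta=3$, a normal or Cauchy error density, and an arbitrary bounded design, and reports the classical least-squares estimate of $\beta$ for each case.

```python
import numpy as np
rng = np.random.default_rng(0)

def make_data(n, beta=3.0, error="normal"):
    """Generate (x_i, y_i) from y_i = beta * x_i + eps_i."""
    x = rng.uniform(-1, 1, n)
    if error == "normal":
        eps = rng.normal(0, 0.5, n)
    else:                                  # heavy-tailed case (true f_0 Cauchy)
        eps = rng.standard_cauchy(n) * 0.5
    return x, beta * x + eps

for err in ("normal", "cauchy"):
    x, y = make_data(50, error=err)
    bhat = np.sum(x * y) / np.sum(x * x)   # classical least-squares estimate of beta
    print(err, round(bhat, 3))
```

With Cauchy errors the classical estimate is unreliable (its variance is infinite), which is the situation in which the behavior of the nonparametric Bayes posterior is of most interest.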
First we look at two cases for the kernel: normal and Cauchy. The base measure
for the Dirichlet process is N(0,1). Figure 7.1 displays the simulated posterior when
216 7. REGRESSION PROBLEMS
observations were generated from (true f0is) normal. The value of βis 3.0., and
the random densities are not symmetrized. It is clear from the graphs that, in this
case, the posterior behaves well, and in addition to consistency also shows asymptotic
normality.
In figure 7.2, the setup for priors is the same as that just considered, but the
posterior is evaluated when the true f0is Cauchy. Clearly, things do not seem to go
well. Both consistency and asymptotic normality seem to be in doubt.
One could see if the introduction of a hyperparameter for the base measure of the
Dirichlet process would lead to amelioration of the situation. Figures 7.3 and 7.4 show
the result of simulations with a hyperparameter for the base measure. There seems to
be some improvement. The estimates are closer to the true value of β= 3, and there
is a suggestion of asymptotic normality.
Figure 7.4: Sample size $n=50$; true $f_0=$ Cauchy(0, 0.5). Priors: base measure of the Dirichlet process $N(\mu,\sigma)$ with $\mu\,|\,\sigma\sim N(0,2\sigma)$, $\sigma\sim\mathrm{Unif}(0,10)$; bandwidth $h\sim\mathrm{Unif}(0,4)$; Dirichlet hyperparameter $M=100$. Classical estimate of $\beta$: $\hat\beta=2.4641$, $\mathrm{Var}(\hat\beta)$ infinite. MCMC estimates of $\beta$: Dirichlet mixture of Cauchy, $\hat\beta_C=2.898$, $\mathrm{Var}(\hat\beta_C)=0.0053$, skewness $=-0.0753$, kurtosis $=0.2729$; Dirichlet mixture of normal, $\hat\beta_N=2.899$, $\mathrm{Var}(\hat\beta_N)=0.0050$, skewness $=-0.0623$, kurtosis $=0.3620$.
8
Uniform Distribution on Infinite-Dimensional
Spaces
8.1 Introduction
Except for a noninformative choice of the base measure $\alpha$ for a Dirichlet process, very little is known about noninformative priors in nonparametric or infinite-dimensional problems. In this chapter we explore how one may construct a prior that is noninformative, i.e., completely nonsubjective in the sense of Chapter 1, for nonparametric problems. One way of thinking of such a prior is as a uniform distribution over an infinite-dimensional space. Our approach has some similarities with that of Dembski [40], as well as many differences.

Several new approaches to the construction of such a prior are discussed in Section 8.2. The remaining sections attempt some validation. In Section 8.4 we show that one of our methods would lead to the Jeffreys prior for parametric models under regularity conditions. We also briefly discuss what would be reference priors from this point of view. Section 8.5 contains an application of our ideas to a density estimation problem of Wong and Shen [172]. We show that for our hierarchical noninformative prior, the posterior is consistent, a sort of weak frequentist validation. The proof of consistency is interesting in that the Schwartz condition is not assumed. We also show that the rate of convergence of the posterior is optimal. In particular, this implies that the Bayes estimate of the density corresponding to this prior achieves the optimal frequentist rate, a strong frequentist validation. We offer these tentative ideas to be tried out on different problems. Computational or other considerations may require replacing the $\mathcal P_i$ by other sieves, which need not be finite, changing the index $i$ to an $h$ taking values in a continuum, and using distributions on the $\mathcal P_i$ that are not uniform. These relaxations create a very large class of priors that are nonsubjective in some sense and from which it may be convenient to elicit a prior. This approach includes some of the priors in Chapter 5, namely, the random histograms and the Dirichlet mixture of normals with standard deviation $h$. The parameter $h$ can be viewed as indexing a sieve. This chapter is almost entirely based on [73] and [80].
8.2 Towards a Uniform Distribution
8.2.1 The Jeffreys Prior
By way of motivation we begin with a regular parametric model. Let $\Theta\subset R^p$. A uniform distribution on $\Theta$ should be associated with the geometry on $\Theta$ induced by the statistical problem. To do this, let $I(\theta)=[I_{i,j}(\theta)]$ be the $p\times p$ Fisher information (positive definite) matrix. As shown by Rao [2], the matrix induces a Riemannian metric on $\Theta$ through the integration of
$$\rho(d\theta)=\sqrt{\sum_i\sum_jI_{i,j}(\theta)\,d\theta_i\,d\theta_j}$$
over all curves connecting $\theta$ to $\theta'$ and minimizing over curves. The minimizing curve is a geodesic. If the model is $N(\theta,\Sigma)$, then $I_{i,j}=(\Sigma^{-1})_{i,j}$ and we get the famous Mahalanobis distance. Cencov [30] has shown that the Riemannian geometry induced by Rao's metric is the unique Riemannian metric that changes in a natural way under 1-1 smooth transformations of $\Theta$ onto itself. The Jeffreys prior $\{\det I(\theta)\}^{1/2}$ can be motivated as follows.

Fix a $\theta_0$ and consider a 1-1 smooth transformation
$$\theta\mapsto\psi(\theta)=\psi$$
such that the information matrix $I_\psi$ in the new parametrization $\psi$ is the identity at $\psi(\theta_0)$. This implies that the local geometry in the $\psi$-space is Euclidean near $\psi(\theta_0)$ and hence the Lebesgue measure is a suitable uniform distribution near $\psi(\theta_0)$. If we lift this back to the $\theta$-space making use of the Jacobian and the elementary fact
$$\Bigl[\frac{\partial\theta_j}{\partial\psi_i}\Bigr]^T[I_{i,j}(\theta)]\Bigl[\frac{\partial\theta_j}{\partial\psi_i}\Bigr]=I_\psi=I$$
we get the Jeffreys prior in the $\theta$-space, namely,
$$\pi(\theta)\propto\Bigl\{\det\Bigl[\frac{\partial\theta_i}{\partial\psi_j}\Bigr]\Bigr\}^{-1}=\{\det[I_{i,j}(\theta)]\}^{1/2}$$
Another way of deriving the Jeffreys prior in a similar spirit is given in Hartigan ([93], pp. 48, 49). The basic paper for the Jeffreys prior is Jeffreys [106]. These references are relevant for Section 8.4, especially Remark 8.4.1.
8.2.2 Uniform Distribution via Sieves and Packing Numbers
Suppose we have a model $\mathcal P$ which is equipped with a metric $\rho$ and is compact. In applications we use the Hellinger metric. The compactness assumption can be relaxed in at least some $\sigma$-compact cases in a standard way. Our starting point is a sequence $\epsilon_i$ diminishing to zero and sieves $\mathcal P_i$, where $\mathcal P_i$ is a finite set whose elements are separated from each other by at least $\epsilon_i$ and which has cardinality $D(\epsilon_i,\mathcal P)$, the largest $m$ for which there are $P_1,P_2,\dots,P_m\in\mathcal P$ with $\rho(P_j,P_{j'})>\epsilon_i$, $j\ne j'$, $j,j'=1,2,\dots,m$. Clearly, given any $P\in\mathcal P$ there exists $P'\in\mathcal P_i$ such that $\rho(P,P')\le\epsilon_i$. Thus $\mathcal P_i$ approximates $\mathcal P$ within $\epsilon_i$, and no proper subset of it has this property.

In the first method we choose $\epsilon_{i(n)}$ tending to 0 in some suitable way. It is then convenient to think of $\mathcal P_{i(n)}$ as a finite approximation to $\mathcal P$, with the approximation depending on the sample size $n$. The idea is that the approximating finite model is made more and more accurate by increasing its cardinality with the sample size. In the first method our noninformative prior is just the uniform distribution on $\mathcal P_{i(n)}$. This accords well with Basu's [9] recommendation in the parametric case to approximate the parameter space $\Theta$ by a finite set and then put a uniform distribution on it. It is also intuitively plausible that the complexity or richness of a model $\mathcal P_{i(n)}$ may be allowed to depend on the sample size. Since this prior depends on the sample size, we consider two other approaches that are more complicated but do not depend on the sample size.

In the second approach, we consider the sequence of uniform distributions $\Pi_i$ on $\mathcal P_i$ and take any weak limit $\Pi$ of $\{\Pi_i\}$ as a noninformative prior. If $\Pi$ is unique, it is simply the uniform distribution defined and studied by Dembski [40]. In the infinite-dimensional case, evaluation of the limit points may prove to be impossible. However, the first approach may be used, and $\Pi_{i(n)}$ may be treated as an approximation to a limit point $\Pi$.

We now come to the third approach. Here, instead of a limit, we consider the index as a hyperparameter and consider a hierarchical prior which picks the index $i$ with probability $\lambda_i$ and then uses $\Pi_i$.
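The finite sieves themselves can be built greedily. The following sketch (not from the text) constructs an $\epsilon$-dispersed subset of a toy compact family of Beta densities under the Hellinger metric used in this chapter, $H(f,g)=(\int(\sqrt f-\sqrt g)^2)^{1/2}$, and puts the uniform distribution on it. The Beta family, the parameter grid, and the numerical integration are illustrative choices, and a greedy net need not attain the packing number $D(\epsilon,\mathcal P)$ exactly.

```python
import numpy as np
from scipy.stats import beta

# toy compact "model": Beta(a, b) densities with parameters on a finite grid
params = [(a, b) for a in np.linspace(0.5, 5, 10) for b in np.linspace(0.5, 5, 10)]
x = np.linspace(1e-4, 1 - 1e-4, 2000)

def hellinger(p1, p2):
    f1, f2 = beta.pdf(x, *p1), beta.pdf(x, *p2)
    return np.sqrt(np.trapz((np.sqrt(f1) - np.sqrt(f2)) ** 2, x))

def greedy_net(eps):
    """Greedily keep points pairwise more than eps apart (an eps-dispersed set)."""
    net = []
    for p in params:
        if all(hellinger(p, q) > eps for q in net):
            net.append(p)
    return net

for eps in (0.4, 0.2, 0.1):
    net = greedy_net(eps)
    # the noninformative prior of the first/third approaches puts mass 1/|net| on each element
    print(f"eps = {eps}: |P_eps| = {len(net)}, uniform weight = {1/len(net):.4f}")
```

As $\epsilon$ shrinks the net grows, which is the sense in which the approximating finite model becomes richer; the third approach then mixes these uniform distributions with weights $\lambda_i$.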
8.3 Technical Preliminaries
Let $K$ be a compact metric space with a metric $\rho$. A finite subset $S$ of $K$ is called $\epsilon$-dispersed if $\rho(x,y)\ge\epsilon$ for all $x,y\in S$, $x\ne y$. A maximal $\epsilon$-dispersed set is called an $\epsilon$-net, and an $\epsilon$-net with maximum possible cardinality is said to be an $\epsilon$-lattice. The cardinality of an $\epsilon$-lattice is called the packing number (or $\epsilon$-capacity) of $K$ and is denoted by $D(\epsilon,K)=D(\epsilon,K,\rho)$. As $K$ is totally bounded, $D(\epsilon,K)$ is finite. Closely related to packing numbers are covering numbers $N(\epsilon,K,\rho)$, the minimum number of balls of radius $\epsilon$ needed to cover $K$. Clearly,
$$N(\epsilon,K,\rho)\le D(\epsilon,K,\rho)\le N(\epsilon/2,K,\rho)$$
In view of this, our arguments could also be stated in terms of covering numbers.

Define the $\epsilon$-probability $P_\epsilon$ by
$$P_\epsilon(X)=\frac{D(\epsilon,X)}{D(\epsilon,K)},\qquad X\subset K$$
It follows that $0\le P_\epsilon(\cdot)\le1$, $P_\epsilon(\emptyset)=0$, $P_\epsilon(K)=1$, and $P_\epsilon$ is subadditive for $X,Y\subset K$. Because $K$ is compact, subsequences of $P_\epsilon$ have weak limits as $\epsilon\to0$. If all the subsequences have the same limit, then $K$ is called uniformizable and the common limit point is called the uniform probability on $K$.

The following result of Dembski [40] will be used in the next section.

Theorem 8.3.1 (Dembski). Let $(K,\rho)$ be a compact metric space. Then the following assertions hold.

(a) If $K$ is uniformizable with uniform probability $\mu$, then $\lim_{\epsilon\to0}P_\epsilon(X)=\mu(X)$ for all $X\subset K$ with $\mu(\partial X)=0$.

(b) If $\lim_{\epsilon\to0}P_\epsilon(X)$ exists on some convergence-determining class in $K$, then $K$ is uniformizable.

To extend these ideas to noncompact $\sigma$-compact spaces, one can take a sequence of compact sets $K_n\uparrow K$, each having uniform probability $\mu_n$. Any positive Borel measure $\mu$ satisfying
$$\mu(\cdot\cap K_n)=\frac{\mu_n(\cdot\cap K_n)}{\mu_n(K_1)}$$
may be thought of as an (improper) uniform distribution on $K$. Such a measure would be unique up to a multiplicative constant by lemma 2 of Dembski [40].
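The $\epsilon$-probability is easy to visualize in one dimension. The toy sketch below (not from the text) takes $K=[0,1]$ with the Euclidean metric, where packing numbers have a closed form, and shows $P_\epsilon(X)$ converging to Lebesgue measure, the uniform probability on $K$, as $\epsilon\to0$.

```python
import numpy as np

def packing_number(a, b, eps):
    """D(eps, [a, b]) under the Euclidean metric: points at least eps apart."""
    if b < a:
        return 0
    return int(np.floor((b - a) / eps)) + 1

K = (0.0, 1.0)
X = (0.0, 0.3)
for eps in (0.1, 0.01, 0.001, 0.0001):
    P_eps = packing_number(*X, eps) / packing_number(*K, eps)
    print(f"eps = {eps:7}: P_eps(X) = {P_eps:.4f}")   # approaches the Lebesgue measure 0.3
```

Here $K$ is trivially uniformizable; the interesting content of Theorem 8.3.1 is that the same limiting procedure, carried out with the Hellinger metric, recovers the Jeffreys measure in regular parametric families, as shown next.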
8.4 The Jeffreys Prior Revisited
Let the $X_i$s be i.i.d. with density $f(\cdot;\theta)$ (with respect to a $\sigma$-finite measure $\nu$), where $\Theta$ is an open subset of $R^d$. Assume that $\{f(\cdot;\theta):\theta\in\Theta\}$ is a regular parametric family, i.e., there exists $\psi(\cdot;\theta)\in(L_2(\nu))^d$ such that for any compact $K\subset\Theta$,
$$\sup_{\theta\in K}\int\bigl|f^{1/2}(x;\theta+h)-f^{1/2}(x;\theta)-h^T\psi(x;\theta)\bigr|^2\,\nu(dx)=o(\|h\|^2)\qquad(8.1)$$
as $\|h\|\to0$. Define the Fisher information by the relation
$$I(\theta)=4\int\psi(x;\theta)(\psi(x;\theta))^T\,\nu(dx)\qquad(8.2)$$
Assume that $I(\theta)$ is positive definite and the map $\theta\mapsto I(\theta)$ is continuous. Further, assume the following stronger form of identifiability: on every compact set $K\subset\Theta$,
$$\inf\Bigl\{\int|f^{1/2}(x;\theta_1)-f^{1/2}(x;\theta_2)|^2\,\nu(dx):\theta_1,\theta_2\in K,\ \|\theta_1-\theta_2\|\ge\epsilon\Bigr\}>0\quad\text{for all }\epsilon>0$$
For i.i.d. observations equip $\Theta$ with the Hellinger distance, as defined in Chapter 1, namely,
$$H(\theta_1,\theta_2)=\Bigl(\int|f^{1/2}(x;\theta_1)-f^{1/2}(x;\theta_2)|^2\,\nu(dx)\Bigr)^{1/2}\qquad(8.3)$$
The following result is the main theorem of this section.

Theorem 8.4.1. Fix a compact subset $K$ of $\Theta$. Then for all $Q\subset K$ with $\mathrm{vol}(\partial Q)=0$, we have
$$\lim_{\epsilon\to0}\frac{D(\epsilon,Q)}{D(\epsilon,K)}=\frac{\int_Q\sqrt{\det I(\theta)}\,d\theta}{\int_K\sqrt{\det I(\theta)}\,d\theta}\qquad(8.4)$$
By using Theorem 8.3.1 we conclude that the Jeffreys measure $\mu$ on $\Theta$ defined by
$$\mu(Q)\propto\int_Q\sqrt{\det I(\theta)}\,d\theta,\qquad Q\subset\Theta\qquad(8.5)$$
is the (possibly improper) noninformative prior on $\Theta$ in the sense of the second approach described in the introduction.

The idea of the proof is to approximate the packing number of relatively small sets by the Jeffreys prior measure of those sets (see (8.13), (8.14)) and then fit these small sets into a given set $Q$ or $K$. One has to check that the approximation remains good at this higher scale [vide (8.16)].
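A small numerical check of (8.4), not from the text: for the Bernoulli($\theta$) family, $I(\theta)=1/(\theta(1-\theta))$ and the Hellinger distance of (8.3) has a closed form, so the packing-number ratio can be compared with the Jeffreys-measure ratio. The parameter grid and the greedy packing below are rough approximations of $D(\epsilon,\cdot)$, adequate only to illustrate the limit.

```python
import numpy as np

def hellinger(t1, t2):
    """Hellinger distance (8.3) between Bernoulli(t1) and Bernoulli(t2)."""
    return np.sqrt((np.sqrt(t1) - np.sqrt(t2))**2 + (np.sqrt(1-t1) - np.sqrt(1-t2))**2)

def packing(points, eps):
    """Greedy eps-dispersed subset of an ordered grid (approximates D(eps, .))."""
    net = []
    for t in points:
        if all(hellinger(t, s) > eps for s in net):
            net.append(t)
    return len(net)

def jeffreys(a, b):
    # integral of sqrt(det I(theta)) = 1/sqrt(theta(1-theta)) over [a, b]
    return 2 * (np.arcsin(np.sqrt(b)) - np.arcsin(np.sqrt(a)))

K, Q = (0.05, 0.95), (0.05, 0.30)
grid_K = np.linspace(*K, 5001)
grid_Q = grid_K[grid_K <= Q[1]]
print("Jeffreys ratio:", jeffreys(*Q) / jeffreys(*K))
for eps in (0.05, 0.02, 0.01):
    print(f"eps = {eps}: D(eps,Q)/D(eps,K) ~ {packing(grid_Q, eps) / packing(grid_K, eps):.3f}")
```

As $\epsilon$ decreases, the packing-number ratio settles near the Jeffreys ratio, which is the content of Theorem 8.4.1 in this one-dimensional case.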
Proof. Fix $0<\eta<1$. Cover $K$ by $J$ cubes of side length $\eta$. In each cube fix an interior cube of side length $\eta-\eta^2$; the interior cubes will provide an approximation from below. Since, by continuity, the eigenvalues of $I(\theta)$ are uniformly bounded away from zero and infinity on $K$, by standard arguments [see theorem I.7.6 in [102]] it follows from (8.1) that there exist $M>m>0$ such that
$$m\|\theta_1-\theta_2\|\le H(\theta_1,\theta_2)\le M\|\theta_1-\theta_2\|,\qquad\theta_1,\theta_2\in K\qquad(8.6)$$
Given $\eta>0$ choose $\epsilon>0$ so that $\epsilon/(2m)\le\eta^2$. Any two interior cubes are then separated by at least $\eta^2$ in the Euclidean distance and hence by at least $\epsilon/2$ in the Hellinger distance.

For $Q\subset K$, let $Q_j$ be the intersection of $Q$ with the $j$th cube and $Q'_j$ the intersection with the $j$th interior cube, $j=1,2,\dots,J$. Then
$$Q_1\cup Q_2\cup\dots\cup Q_J=Q\supset Q'_1\cup Q'_2\cup\dots\cup Q'_J\qquad(8.7)$$
Hence
$$\sum_{j=1}^JD(\epsilon,Q'_j,H)\le D(\epsilon,Q,H)\le\sum_{j=1}^JD(\epsilon,Q_j,H)\qquad(8.8)$$
In particular, with $Q=K$, we obtain
$$\sum_{j=1}^JD(\epsilon,K'_j,H)\le D(\epsilon,K,H)\le\sum_{j=1}^JD(\epsilon,K_j,H)\qquad(8.9)$$
where $K_j$ and $K'_j$ are defined in the same way.

For the $j$th cube, choose $\theta_j\in K$. By an argument similar to that for (8.6), for all $\theta,\theta'$ in the $j$th cube,
$$\frac{\lambda(\eta)}{2}\sqrt{(\theta-\theta')^TI(\theta_j)(\theta-\theta')}\le H(\theta,\theta')\le\frac{\bar\lambda(\eta)}{2}\sqrt{(\theta-\theta')^TI(\theta_j)(\theta-\theta')}\qquad(8.10)$$
where $\bar\lambda(\eta)$ and $\lambda(\eta)$ tend to 1 as $\eta\to0$. Let
$$H_j(\theta,\theta')=\frac{\lambda(\eta)}{2}\sqrt{(\theta-\theta')^TI(\theta_j)(\theta-\theta')}$$
and
$$\bar H_j(\theta,\theta')=\frac{\bar\lambda(\eta)}{2}\sqrt{(\theta-\theta')^TI(\theta_j)(\theta-\theta')}$$
Then from (8.10),
$$D(\epsilon,Q_j,H)\le D(\epsilon,Q_j,\bar H_j)\qquad(8.11)$$
$$D(\epsilon,Q'_j,H)\ge D(\epsilon,Q'_j,H_j)\qquad(8.12)$$
By the second part of theorem IX of Kolmogorov and Tihomirov [115], for some constants $\tau_j,\tau'_j$ and absolute constants $A_d$ (depending only on the dimension $d$),
$$D(\epsilon,Q_j,\bar H_j)\sim\tau_jA_d\,\mathrm{vol}(Q_j)\sqrt{\det I(\theta_j)}\,(\bar\lambda(\eta))^d\,\epsilon^{-d}\qquad(8.13)$$
and
$$D(\epsilon,Q'_j,H_j)\sim\tau'_jA_d\,\mathrm{vol}(Q'_j)\sqrt{\det I(\theta_j)}\,(\lambda(\eta))^d\,\epsilon^{-d}\qquad(8.14)$$
where the symbol $\sim$ means that the limit of the ratio of the two sides is 1 as $\epsilon\to0$. As all the metrics $H_j$ and $\bar H_j$, $j=1,2,\dots,J$, arise from elliptic norms, it can easily be concluded, by making a suitable linear transformation, that $\tau_j=\tau'_j=\tau$ (say) for all $j=1,2,\dots,J$. Thus we obtain from (8.7)-(8.14) that
$$\limsup_{\epsilon\to0}\frac{D(\epsilon,Q,H)}{D(\epsilon,K,H)}\le\frac{\sum_{j=1}^J\mathrm{vol}(Q_j)\sqrt{\det I(\theta_j)}}{\sum_{j=1}^J\mathrm{vol}(K'_j)\sqrt{\det I(\theta_j)}}\Bigl(\frac{\bar\lambda(\eta)}{\lambda(\eta)}\Bigr)^d\qquad(8.15)$$
and
$$\liminf_{\epsilon\to0}\frac{D(\epsilon,Q,H)}{D(\epsilon,K,H)}\ge\frac{\sum_{j=1}^J\mathrm{vol}(Q'_j)\sqrt{\det I(\theta_j)}}{\sum_{j=1}^J\mathrm{vol}(K_j)\sqrt{\det I(\theta_j)}}\Bigl(\frac{\lambda(\eta)}{\bar\lambda(\eta)}\Bigr)^d\qquad(8.16)$$
Now let $\eta\to0$. The sums $\sum_{j=1}^J\mathrm{vol}(Q_j)\sqrt{\det I(\theta_j)}$ and $\sum_{j=1}^J\mathrm{vol}(Q'_j)\sqrt{\det I(\theta_j)}$ both converge to $\int_Q\sqrt{\det I(\theta)}\,d\theta$, and similarly for the sums involving the $K_j$s and $K'_j$s. Also $\lambda(\eta)\to1$ and $\bar\lambda(\eta)\to1$, so the desired result follows.
Remark 8.4.1. It has been pointed out to us by Prof. Hartigan that Jeffreys had envisaged constructing noninformative priors by approximating $\Theta$ with Kullback-Leibler neighborhoods. He asked us if the construction in this section can be carried out using Kullback-Leibler neighborhoods. Because the Kullback-Leibler divergence is not a metric, there would be obvious difficulties in formalizing the notion of an $\epsilon$-net. However, if the family of densities $\{f_\theta:\theta\in\Theta\}$ has well-behaved tails such that, for any $\theta,\theta'$, $K(\theta,\theta')\le\varphi(H(\theta,\theta'))$, where $\varphi(\epsilon)$ goes to 0 as $\epsilon$ goes to 0, then any $\epsilon$-net $\{\theta_1,\dots,\theta_k\}$ in the Hellinger metric can be thought of as a Kullback-Leibler net in the sense that

1. $K(\theta_i,\theta_j)>\epsilon^2$ for $i\ne j$, $i,j=1,2,\dots,k$; and

2. for any $\theta$ there exists an $i$ such that $K(\theta,\theta_i)\le\varphi(\epsilon)$.

In such situations, the above theorems allow us to view the Jeffreys prior as a limit of uniform distributions arising out of Kullback-Leibler neighborhoods. Wong and Shen [172] show that a suitable tail condition is that for all $\theta,\theta'$,
$$\int_{f_\theta/f_{\theta'}\ge e^{1/\delta}}f_\theta\Bigl(\frac{f_\theta}{f_{\theta'}}\Bigr)^\delta<M$$
We now consider the case when there is a nuisance parameter. Let $\theta$ be the parameter of interest and $\phi$ the nuisance parameter. We can write the information matrix as
$$\begin{pmatrix}I_{11}(\theta,\phi)&I_{12}(\theta,\phi)\\ I_{12}(\theta,\phi)&I_{22}(\theta,\phi)\end{pmatrix}\qquad(8.17)$$
In view of Theorem 8.4.1, and in the spirit of the reference priors of Bernardo [18], the prior for $\phi$ given $\theta$ is specified as $\Pi(\phi\,|\,\theta)\propto\sqrt{I_{22}(\theta,\phi)}$. So it is only necessary to construct a noninformative marginal prior for $\theta$. Assume, as before, that the parameter space is compact. With $n$ i.i.d. observations, the joint density of the observations given $\theta$ only is
$$g(x^n\,|\,\theta)=(c(\theta))^{-1}\int\prod_1^nf(x_i\,|\,\theta,\phi)\sqrt{I_{22}(\theta,\phi)}\,d\phi\qquad(8.18)$$
where $c(\theta)=\int\sqrt{I_{22}(\theta,\phi)}\,d\phi$ is the constant of normalization. Let $I_n(\theta,g)$ denote the information in the family $\{g(x^n\,|\,\theta):\theta\in\Theta\}$. Under appropriate regularity conditions, it can be shown that the information per observation $I_n(\theta,g)/n$ satisfies
$$\lim_{n\to\infty}I_n(\theta,g)/n=(c(\theta))^{-1}\int I_{11.2}(\theta,\phi)\sqrt{I_{22}(\theta,\phi)}\,d\phi=J(\theta)\ \text{(say)}\qquad(8.19)$$
where $I_{11.2}=I_{11}-I_{12}^2/I_{22}$ is the reciprocal of the (1,1) element of the inverse of the information matrix. Let $H_n(\theta,\theta+h)$ be the Hellinger distance between $g(x^n\,|\,\theta)$ and $g(x^n\,|\,\theta+h)$. Locally, as $h\to0$, $H_n^2(\theta,\theta+h)$ behaves like $h^2I_n(\theta,g)$. Hence by Theorem 8.4.1, the noninformative (marginal) prior for $\theta$ would be proportional to $\sqrt{I_n(\theta,g)}$. In view of (8.19), passing to the limit as $n\to\infty$, the (sample size-independent) marginal noninformative prior for $\theta$ should be taken to be proportional to $(J(\theta))^{1/2}$, and so the prior for $(\theta,\phi)$ is proportional to $\sqrt{J(\theta)}\,\pi(\phi\,|\,\theta)$. Generally, for a noncompact parameter space, one can proceed as in Berger and Bernardo [14]. Informally, we can sum up as follows. The prior for $\theta$ based on the current approach is obtained by taking the average of $I_{11.2}(\theta,\phi)$ with respect to $\sqrt{I_{22}(\theta,\phi)}$ and then taking the square root. The reference prior of Berger and Bernardo, or the probability matching prior, takes geometric and harmonic means of other functions of $I_{11}(\theta,\phi)$ and then transforms back. In the examples of Datta and Ghosh [38], we believe that they reduce to the same prior.
8.5 Posterior Consistency for Noninformative Priors for
Infinite-Dimensional Problems
In this section, we show that in a certain class of infinite-dimensional families, the third approach mentioned in the introduction leads to a consistent posterior.

Theorem 8.5.1. Let $\mathcal P$ be a family of densities where $\mathcal P$, metrized by the Hellinger distance, is compact. Let $\epsilon_n$ be a positive sequence satisfying
$$\sum_{n=1}^\infty n^{1/2}\epsilon_n<\infty$$
Let $\mathcal P_n$ be an $\epsilon_n$-net in $\mathcal P$, $\mu_n$ the uniform distribution on $\mathcal P_n$, and $\mu$ the probability on $\mathcal P$ defined by $\mu=\sum_{n=1}^\infty\lambda_n\mu_n$, where the $\lambda_n$s are positive numbers adding up to unity. If for any $\beta>0$,
$$\lim_{n\to\infty}e^{\beta n}\frac{\lambda_n}{D(\epsilon_n,\mathcal P_n)}=\infty\qquad(8.20)$$
then the posterior distribution based on the prior $\mu$ and i.i.d. observations $X_1,X_2,\dots$ is strongly consistent at every $p_0\in\mathcal P$.

Proof. Since $\mathcal P$ is compact under the Hellinger metric, the weak topology and the Hellinger topology coincide on $\mathcal P$. Consequently weak neighborhoods and strong neighborhoods coincide and so do the notions of weak and strong consistency.

To prove consistency, by Remark 4.5.1, it is enough to show that for every $\delta$, if $U_n^\delta=\{P:H(P_0,P)\le\delta/n\}$, then for all $\beta>0$, $e^{\beta n}\Pi(U_n^\delta)\to\infty$. Because $\sum_{n=1}^\infty n^{1/2}\epsilon_n<\infty$, given $\delta$ there is an $n_0$ such that for $n>n_0$, $\epsilon_n\le\delta/n$; so that for $n>n_0$ there is a $P_n\in\mathcal P_n$ such that $H(P_0,P_n)\le\delta/n$.

Since $\Pi\{P_n\}=\lambda_n/D(\epsilon_n,\mathcal P_n)$ and, by assumption, for all $\beta>0$,
$$\lim_{n\to\infty}e^{\beta n}\frac{\lambda_n}{D(\epsilon_n,\mathcal P_n)}=\infty$$
and $\Pi(U_n^\delta)\ge\Pi\{P_n\}$, consistency follows.
Remark 8.5.1. Consistency is obtained in Theorem 8.5.1 by requiring (8.20) for sieves whose width $\epsilon_n$ was chosen carefully. However, it is clear from the proof that consistency would follow for sieves with width $\epsilon_n\downarrow0$ by imposing (8.20) along a carefully chosen subsequence.

Precisely, if $\epsilon_n\downarrow0$, $\mathcal P_n$ is an $\epsilon_n$-net, $\mu$ is the probability on $\mathcal P$ defined by $\mu=\sum_1^\infty\lambda_n\mu_n$, and $\delta_n$ is a positive summable sequence, then by choosing $j(n)$ with
$$\epsilon_{j(n)}^2\le n\delta_n\qquad(8.21)$$
the posterior is consistent if, for all $\beta>0$,
$$e^{n\beta}\,\frac{\lambda_{j(n)}}{D(\epsilon_{j(n)},\mathcal P_{j(n)})}\to\infty\qquad(8.22)$$
A useful case corresponds to
$$D(\epsilon,\mathcal P)\le A\exp[c\,\epsilon^{-\alpha}]\qquad(8.23)$$
where $0<\alpha<2/3$, $A$ and $c$ are positive constants, and $\delta_n=n^{-\gamma}$ for some $\gamma>1$. If in this case $j(n)$ is the smallest integer satisfying (8.21), then (8.22) becomes
$$e^{n\beta-c\,\epsilon_{j(n)}^{-\alpha}}\,\lambda_{j(n)}\to\infty\qquad(8.24)$$
If $\epsilon_n=\epsilon/2^n$ for some $\epsilon>0$ and $\lambda_n$ decays no faster than $n^{-s}$ for some $s>0$, then (8.24) holds. Moreover, the condition $0<\alpha<2$ in (8.23) is enough for posterior consistency in probability.
We can apply this to the following example [see Wong and Shen [172]].

Example 8.5.1. Let
$$\mathcal P=\Bigl\{g^2:g\in C^r[0,1],\ \int_0^1g^2(x)\,dx=1,\ \|g^{(j)}\|_{\sup}\le L_j,\ j=1,2,\dots,r,\ |g^{(r)}(x_1)-g^{(r)}(x_2)|\le L_{r+1}|x_1-x_2|^m\Bigr\}$$
where $r$ is a positive integer, $0\le m\le1$, and the $L$s are fixed constants. By theorem 15 of Kolmogorov and Tihomirov [115], $D(\epsilon,\mathcal P,h)\le\exp[c\,\epsilon^{-1/(r+m)}]$.
8.6 Convergence of Posterior at Optimal Rate
This section is based on Ghosal, Ghosh and van der Vaart ([80]). We present a result concerning the rate of convergence of the posterior relative to the $L_1$, $L_2$, and Hellinger metrics. The two main elements controlling the rate of convergence are the size of the model (measured by packing or covering numbers) and the amount of prior mass given to a shrinking ball around the true measure. It is the latter quantity that is easy to estimate for the hierarchical noninformative priors introduced in Section 8.2 and appearing in Theorem 8.5.1 of the preceding section. See also Shen and Wasserman [150].

Theorem 8.6.1. Suppose that for a sequence $\epsilon_n$ with $\epsilon_n\to0$ and $n\epsilon_n^2\to\infty$, a constant $C>0$, and sets $\mathcal P_n\subset\mathcal P$, we have
$$\log D(\epsilon_n,\mathcal P_n,d)\le n\epsilon_n^2\qquad(8.25)$$
$$\Pi_n(\mathcal P\setminus\mathcal P_n)\le\exp\bigl(-n\epsilon_n^2(C+4)\bigr)\qquad(8.26)$$
$$\Pi_n\Bigl(P:E_0\Bigl(\log\frac{p_0}{p}\Bigr)\le\epsilon_n^2,\ E_0\Bigl(\log\frac{p_0}{p}\Bigr)^2\le\epsilon_n^2\Bigr)\ge\exp(-n\epsilon_n^2C)\qquad(8.27)$$
Then for sufficiently large $M$, we have that
$$\Pi_n\bigl(P:d(P,P_0)\ge M\epsilon_n\,|\,X_1,X_2,\dots,X_n\bigr)\to0\quad\text{in }P_0^n\text{-probability}$$
See [80] for a proof.
Condition (8.25) requires that the "model" $\mathcal P_n$ is not too big, and (8.26) ensures that the complement of $\mathcal P_n$ carries a negligible amount of prior mass. These conditions hold for every $\epsilon'_n\ge\epsilon_n$ as soon as they hold for $\epsilon_n$, and thus can be seen as defining a minimal possible value of $\epsilon_n$. Condition (8.25) ensures the existence of certain tests and could be replaced by a testing condition in the spirit of LeCam [120]. Note that the metric $d$ used here reappears in the assertion of the theorem. Since the total variation metric is bounded above by twice the Hellinger metric, the assertion of the theorem using the Hellinger metric is stronger, but condition (8.25) is then also more restrictive, so that we really have two theorems. In the case that the densities are uniformly bounded, one can have a third theorem, using the $L_2$-distance, which in that case is bounded above by a multiple of the Hellinger distance. If the densities are also uniformly bounded away from zero, then these three distances are equivalent and are also equivalent to the Kullback-Leibler number and the $L_2$-norm appearing in condition (8.27).
A rate $\epsilon_n$ satisfying (8.25) for $\mathcal P=\mathcal P_n$ and $d$ the Hellinger metric is often viewed as giving the "optimal" rate of convergence for estimators of $P$ relative to the Hellinger metric, given the model $\mathcal P$. Under certain conditions, such as likelihood ratios bounded away from zero and infinity, this is proved as a theorem by Birgé [22] and LeCam [122] and [120]. See also Wong and Shen [172]. From Birgé's work it is clear that condition (8.25) is a measure of the complexity of the model.

Condition (8.27) is the other main condition. It requires that the prior measures put a sufficient amount of mass near the true measure $P_0$. Here "near" is measured through a combination of the Kullback-Leibler divergence between $p_0$ and $p$ and the $L_2(P_0)$-norm of $\log(p/p_0)$. Again, this condition is satisfied for $\epsilon'_n\ge\epsilon_n$ if it is satisfied for $\epsilon_n$ and thus is another restriction on a minimal value of $\epsilon_n$.

The assertion of the theorem is an in-probability statement that the posterior mass outside a large ball of radius proportional to $\epsilon_n$ is approximately zero. The in-probability statement can be improved to an almost-sure assertion, but under stronger conditions, as indicated below.
Let $h$ be the Hellinger distance and write $\log_+x$ for $(\log x)\vee0$.

Theorem 8.6.2. Suppose that conditions (8.25) and (8.26) hold as in the preceding theorem, that $\sum_ne^{-Bn\epsilon_n^2}<\infty$ for every $B>0$, and that
$$\Pi_n\Bigl(P:h^2(P,P_0)\,\bigl\|p_0/p\bigr\|_\infty\le\epsilon_n^2\Bigr)\ge e^{-n\epsilon_n^2C}$$
Then for sufficiently large $M$, we have that $\Pi_n(P:d(P,P_0)\ge M\epsilon_n\,|\,X_1,\dots,X_n)\to0$, $P_0^n$-almost surely.

See also theorem 2.3 in [80].
These theorems are not tailored to finite-dimensional models. For such cases and for finite-dimensional sieves, they yield an extra logarithmic factor in addition to the correct rate of $1/\sqrt n$. Suitable refinements of (8.25) and (8.27) addressing this issue are given in [80].

Convergence of the posterior distribution at the rate $\epsilon_n$ implies the existence of point estimators, Bayes in the sense that they are based on the posterior distribution, which converge at least as fast as $\epsilon_n$ in the frequentist sense. One possible construction is to define $\hat P_n$ as the (near) maximizer of
$$Q\mapsto\Pi_n\bigl(P:d(P,Q)<\epsilon_n\,|\,X_1,\dots,X_n\bigr)$$

Theorem 8.6.3. Suppose that $\Pi_n(P:d(P,P_0)\ge\epsilon_n\,|\,X_1,\dots,X_n)$ converges to 0, almost surely (respectively, in probability) under $P_0^n$, and let $\hat P_n$ maximize, up to $o(1)$, the function $Q\mapsto\Pi_n(P:d(P,Q)<\epsilon_n\,|\,X_1,\dots,X_n)$. Then $d(\hat P_n,P_0)\le2\epsilon_n$ eventually, almost surely (respectively, in probability) under $P_0^n$.

Proof. By definition, the $\epsilon_n$-ball around $\hat P_n$ contains at least as much posterior probability as the $\epsilon_n$-ball around $P_0$, both of which, by posterior convergence at rate $\epsilon_n$, have posterior probability close to unity. Therefore, these two balls cannot be disjoint. Now apply the triangle inequality.

The theorem is well known (see, e.g., Le Cam [120] or Le Cam and Yang [121]). If we use the Hellinger or total variation metric (or some other bounded metric whose square is convex), then an alternative is to use the posterior expectation, which typically has a similar property.
In order to state the next theorem we need a strengthening of the notion of entropy. Given two functions $l,u:\mathcal X\to R$, the bracket $[l,u]$ is defined as the set of all functions $f:\mathcal X\to R$ such that $l\le f\le u$ everywhere. The bracket is said to be of size $\epsilon$ relative to the distance $d$ if $d(l,u)<\epsilon$. In the following we use the Hellinger distance $h$ for the distance $d$ and take the brackets to consist of nonnegative functions, integrable with respect to a fixed measure $\mu$. Let $N_{[\,]}(\epsilon,\mathcal P,h)$ be the minimal number of brackets of size $\epsilon$ needed to cover $\mathcal P$. The corresponding bracketing entropy is defined as the logarithm of the bracketing number $N_{[\,]}(\epsilon,\mathcal P,h)$. It is easy to see that $N_{[\,]}(\epsilon/2,\mathcal P,h)$ is bigger than $N(\epsilon/2,\mathcal P,h)$ and hence bigger than $D(\epsilon,\mathcal P,h)$. However, in many examples, bracketing and packing numbers lead to the same values of the entropy up to an additive constant.

In the spirit of Section 8.2.2 we now construct a discrete prior supported on densities constructed from minimal sets of brackets for the Hellinger distance. For a given number $\epsilon_n>0$ let $\Pi_n$ be the uniform discrete measure on the $N_{[\,]}(\epsilon_n,\mathcal P,h)$ densities obtained by covering $\mathcal P$ with a minimal set of $\epsilon_n$-brackets and then renormalizing the upper bounds of the brackets to integrate to one. Thus if $[l_1,u_1],\dots,[l_N,u_N]$ are the $N=N_{[\,]}(\epsilon_n,\mathcal P,h)$ brackets, then $\Pi_n$ is the uniform measure on the $N$ functions $u_j/\int u_j\,d\mu$. Finally, construct the hierarchical prior
$$\Pi=\sum_{n\in\mathbb N}\lambda_n\Pi_n$$
for a given sequence $\lambda_n$ with $\lambda_n\ge0$ and $\sum_n\lambda_n=1$. This is essentially the third approach of Section 8.2.2. As before, the rate at which $\lambda_n\to0$ is important.
Theorem 8.6.4. Suppose that $\epsilon_n$ are numbers decreasing in $n$ such that
$$\log N_{[\,]}(\epsilon_n,\mathcal P,h)\le n\epsilon_n^2$$
for every $n$, and $n\epsilon_n^2/\log n\to\infty$. Construct the prior $\Pi$ as given previously for a sequence $\lambda_n$ such that $\lambda_n>0$ for all $n$ and $\log\lambda_n^{-1}=O(\log n)$. Then the conditions of Theorem 8.6.2 are satisfied for $\epsilon_n$ a sufficiently large multiple of the present $\epsilon_n$, and hence the corresponding posterior converges at the rate $\epsilon_n$ almost surely, for every $P_0\in\mathcal P$, relative to the Hellinger distance.
There are many specific applications. The situation here is similar to that in several recent papers on rates of convergence of (sieved) maximum likelihood estimators, as in Birgé and Massart (1996, 1997), Wong and Shen [172], or chapter 3.4 of van der Vaart and Wellner [161]. We consider again Example 8.5.1 of smooth densities from the previous section.

Example 8.6.1 (Smooth densities). Because upper and lower brackets can be constructed from uniform approximations, the bracketing Hellinger entropies grow like $\epsilon^{-1/r}$, so that we can take $\epsilon_n$ of the order $n^{-r/(2r+1)}$ to satisfy the relation $\log N_{[\,]}(\epsilon_n,\mathcal P,h)\le n\epsilon_n^2$. This rate is known to be the optimal frequentist rate for estimators. From Theorem 8.6.3, we therefore conclude that for the prior constructed earlier, the posterior attains the optimal rate of convergence.
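The balance between entropy and sample size that defines this rate can be checked numerically. The short sketch below (not from the text) solves $\epsilon^{-1/r}=n\epsilon^2$ for $\epsilon$ and compares the solution with $n^{-r/(2r+1)}$; for this particular balance the two agree exactly, which is why the rate is read off directly from the entropy bound.

```python
import numpy as np
from scipy.optimize import brentq

def rate(n, r):
    """Solve eps^(-1/r) = n * eps^2 for eps (the entropy/sample-size balance)."""
    return brentq(lambda e: e ** (-1.0 / r) - n * e ** 2, 1e-12, 1.0)

r = 2
for n in (10**3, 10**5, 10**7):
    print(n, rate(n, r), n ** (-r / (2 * r + 1)))   # the two columns coincide
```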
Since the lower bounds of the brackets are not really needed, the theorem can be generalized by defining $N_{]}(\epsilon,\mathcal P,h)$ as the minimal number of functions $u_1,\dots,u_m$ such that for every $p\in\mathcal P$ there exists a function $u_i$ such that both $p\le u_i$ and $h(u_i,p)<\epsilon$. Next we construct a prior $\Pi$ as before. These upper bracketing numbers are clearly smaller than the bracketing numbers $N_{[\,]}(\epsilon,\mathcal P,h)$, but we do not know any example where this generalization could be useful.

So far, we have implicitly required that the model $\mathcal P$ be totally bounded for the Hellinger metric. A simple modification works for countable unions of totally bounded models, provided that we use a sequence of priors. Suppose that the bracketing numbers of $\mathcal P$ are infinite, but there exist subsets $\mathcal P_n\uparrow\mathcal P$ with finite bracketing numbers. Let $\epsilon_n$ be numbers such that $\log N_{[\,]}(\epsilon_n,\mathcal P_n,h)\le n\epsilon_n^2$, and be such that $n\epsilon_n^2$ is increasing with $n\epsilon_n^2/\log n\to\infty$. Then we construct $\Pi_n$ as before with $\mathcal P$ replaced by $\mathcal P_n$, but we do not mix these uniform distributions. Instead, we consider $\Pi_n$ itself as the sequence of prior distributions. Then the corresponding posteriors achieve the convergence rate $\epsilon_n$.

It is worth observing that we use a condition on the entropies with bracketing, even though we apply Theorem 8.6.2, which demands control over metric entropies only.
This is necessary because the theorem also requires control over the likelihood ratios. If, for instance, the densities are uniformly bounded away from zero and infinity, so that the quotients $p_0/p$ are uniformly bounded, then we can replace the bracketing entropy by ordinary metric entropy. Alternatively, if the set of densities $\mathcal P$ possesses an integrable envelope function, then we can construct priors achieving the rate $\epsilon_n$ determined by the covering numbers, up to logarithmic factors. Here we define $\epsilon_n$ as the minimal solution of the equation $\log N(\epsilon,\mathcal P,h)\le n\epsilon^2$, where $N(\epsilon,\mathcal P,h)$ denotes the Hellinger covering number (without bracketing).

We assume that the set of densities $\mathcal P$ has a $\mu$-integrable envelope function: a measurable function $m$ with $\int m\,d\mu<\infty$ such that $p\le m$ for every $p\in\mathcal P$. Given $\epsilon_n>0$, let $\{s_{1,n},\dots,s_{N_n,n}\}$ be a minimal $\epsilon_n$-net over $\mathcal P$ (hence $N_n=N(\epsilon_n,\mathcal P,h)$) and put
$$g_{j,n}=\bigl(s_{j,n}^{1/2}+\epsilon_nm^{1/2}\bigr)^2/c_{j,n}$$
where $c_{j,n}$ is a constant ensuring that $g_{j,n}$ is a probability density. Finally, let $\Pi_n$ be the uniform discrete measure on $g_{1,n},\dots,g_{N_n,n}$ and let $\Pi=\sum_{n=1}^\infty\lambda_n\Pi_n$ be a convex combination of the $\Pi_n$ as before. This is similar to the construction of sieved MLEs in theorem 6 of Wong and Shen [172]. The following result guarantees an optimal rate of convergence.

Theorem 8.6.5. Suppose that $\epsilon_n$ are numbers decreasing in $n$ such that
$$\log N(\epsilon_n,\mathcal P,h)\le n\epsilon_n^2$$
for every $n$ and $n\epsilon_n^2/\log n\to\infty$. Construct the prior $\Pi=\sum_{n=1}^\infty\lambda_n\Pi_n$ as given previously for a sequence $\lambda_n$ such that $\lambda_n>0$ for all $n$ and $\log\lambda_n^{-1}=O(\log n)$. Assume $m$ is a $\mu$-integrable envelope. Then the corresponding posterior converges at the rate $\epsilon_n\log(1/\epsilon_n)$ in probability, relative to the Hellinger distance.

We omit the proof.
9
Survival Analysis—Dirichlet Priors
9.1 Introduction
In this chapter, our interest is in the distribution of a positive random variable X,
which arises as the time to occurrence of an event. What makes the problem different
from those considered so far is the presence of censoring. Typically, one does not
always get to observe the value of Xbut only obtains some partial information about
X, like Xaor aXb. This loss of information is often modeled through
various kinds of censoring mechanisms: left, right, interval, etc. See Andersen et
al. [3] for a deep development of various censoring models. The earliest frequentist
methods for censored data were in the context of right censored data, and it is this
kind of censoring that we will study in this and in Chapter 10. Bayesian analysis of
other kinds of censored data is still tentative, and much remains to be done.
Let $X$ be a positive random variable with distribution $F$ and let $Y$ be independent of $X$ with distribution $G$. The model studied in this section is: $F\sim\Pi$; given $F$, $X_1,X_2,\dots,X_n$ are i.i.d. $F$; given $G$, $Y_1,Y_2,\dots,Y_n$ are i.i.d. $G$ and are independent of the $X_i$s; the observations are $(Z_1,\delta_1),(Z_2,\delta_2),\dots,(Z_n,\delta_n)$, where $Z_i=X_i\wedge Y_i$ and $\delta_i=I(X_i\le Y_i)$.

Our interest is in the posterior distribution of $F$ given $\{(Z_i,\delta_i):1\le i\le n\}$. Under the assumption that $X$ and $Y$ are independent, the posterior distribution of $F$ given $(Z,\delta)$ is independent of $G$. If $Z_i=z_i$ and $\delta_i=0$, the observation is referred
to as (right) censored at $z_i$, and in this case it is intuitively clear that the information we have about $X_i$ is just that $X_i>z_i$; hence the posterior distribution of $F$ given $(Z_i=z_i,\delta_i=0)$ is $\Pi(\cdot\,|\,X_i>z_i)$. Similarly, the posterior distribution of $F$ given $(Z_i=z_i,\delta_i=1)$ is $\Pi(\cdot\,|\,X_i=z_i)$.

In Section 9.2, we study the case when the underlying prior for $F$ is a Dirichlet process. This model was first studied by Susarla and Van Ryzin [154]. They obtained the Bayes estimate of $F$, and later Blum and Susarla [26] gave a mixture representation for the posterior. Here we develop a different representation for the posterior and show that the posterior is consistent.

In Section 9.3, we briefly discuss the notion of the cumulative hazard function, describe some of its properties, and use it to describe a result of Peterson which shows that, under mild assumptions, both $F$ and $G$ can be recovered from the distribution of $(Z,\delta)$. This result is used in Section 9.4.

In Section 9.4, we start with a Dirichlet prior for the distribution of $(Z,\delta)$ and, through the map discussed in Section 9.3, transfer this to a prior for $F$. The properties discussed in Section 9.3 are used to study these priors.

In the last section, we look at Dirichlet process priors for interval censored data and note that some of the properties analogous to the right censored case do not hold there. Some of the material in this chapter is taken from [81] and [87].
9.2 Dirichlet Prior
Let $\alpha$ be a finite measure on $(0,\infty)$. The model that we consider here is: $F\sim D_\alpha$; given $F$, $X_1,X_2,\dots,X_n$ are i.i.d. $F$; given $G$, $Y_1,Y_2,\dots,Y_n$ are i.i.d. $G$ and are independent of the $X_i$s; the observations are $(Z_1,\delta_1),(Z_2,\delta_2),\dots,(Z_n,\delta_n)$, where $Z_i=X_i\wedge Y_i$ and $\delta_i=I(X_i\le Y_i)$.

Our interest is in the posterior distribution of $F$ given $\{(Z_i,\delta_i):1\le i\le n\}$. Under the independence assumption the distribution $G$ plays no role in the posterior distribution of $F$.

The posterior can be represented in many ways. Susarla and Van Ryzin [154], who first investigated this problem, obtained a Bayes estimate for $F$ and showed that this Bayes estimate converges to the Kaplan-Meier estimate as $\alpha(R^+)\to0$. Blum and Susarla [26] complemented this result by showing that the posterior distribution is a mixture of Dirichlet processes. This mixture representation, while natural, is somewhat cumbersome.

Lavine [118] observed that the posterior can be realized as a Polya tree process. Under this representation computations are more transparent, and this is the representation that we use in this chapter. A more elegant approach comes from viewing a Dirichlet process as a neutral to the right prior. This method is discussed in Chapter 10.

Since a Dirichlet process is also a Polya tree, we begin with a proposition indicating that a Polya tree prior can be easily updated in the presence of partial information. The proof is straightforward and omitted.

Proposition 9.2.1. Let $\mu$ be a $PT(\mathcal T,\alpha)$ prior. Given $P$, let $X_1,X_2,\dots,X_n$ be i.i.d. $P$. The posterior given $I_{B_1}(X_1),I_{B_2}(X_2),\dots,I_{B_n}(X_n)$ is again a Polya tree with respect to $\mathcal T$ and with parameters $\alpha'_B=\alpha_B+\#\{i:B_i\subset B\}$.

Let $Z=(Z_1,Z_2,\dots,Z_n)$, where $Z_1<\dots<Z_n$. Consider the sequence of nested partitions $\{\pi_m(Z)\}_{m\ge1}$ given by
$$\pi_1(Z):\ B_0=(0,Z_1],\quad B_1=(Z_1,\infty)$$
$$\pi_2(Z):\ B_{00},\ B_{01},\ B_{10}=(Z_1,Z_2],\quad B_{11}=(Z_2,\infty)$$
and, for $l\le n-1$,
$$\pi_{l+1}(Z):\ B_{0_l0},\ B_{0_l1},\ \dots,\ B_{1_l0}=(Z_l,Z_{l+1}],\quad B_{1_l1}=(Z_{l+1},\infty)$$
where $1_l$ is a string of 1s of length $l$ and $0_l$ is a string of 0s of length $l$. The remaining $B$s are arbitrarily partitioned into two intervals each, in such a way that $\{\pi_m(Z)\}_{m\ge1}$ forms a sequence of nested partitions generating $\mathcal B(R^+)$.

Let $\alpha_{\epsilon_1,\dots,\epsilon_l}=\alpha(B_{\epsilon_1,\dots,\epsilon_l})$ and $C^n_{\epsilon_1,\dots,\epsilon_l}=\sum_{\delta_i=0}I[(Z_i,\infty)\subset B_{\epsilon_1,\dots,\epsilon_l}]$. Also, let
$$U_i=\#\bigl\{(Z_j,\delta_j):Z_j>Z_{(i)},\ \delta_j=1\bigr\}$$
be the number of uncensored observations strictly larger than $Z_{(i)}$. Similarly denote by $C_i$ the number of censored observations that are greater than or equal to $Z_{(i)}$, i.e.,
$$C_i=\#\bigl\{(Z_j,\delta_j):Z_j\ge Z_{(i)},\ \delta_j=0\bigr\}$$
Here $n_i=C_i+U_{i-1}$ is the number of subjects alive at time $Z_{(i)}$ and $n^+_i=C_i+U_i$ is the number of subjects who survive beyond $Z_{(i)}$. To evaluate the posterior given $(z_1,\delta_1),\dots,(z_n,\delta_n)$, first look at the posterior given all the uncensored observations among $(z_1,\delta_1),\dots,(z_n,\delta_n)$. Since the prior on $M(\mathcal X)$, the space of all distributions for $X$, is $D_\alpha$, the posterior on $M(\mathcal X)$ is Dirichlet with parameter $\alpha+\sum_{i:\delta_i=1}\delta_{Z_i}$.
Because a Dirichlet process is a Polya tree with respect to every partition, it is so with respect to $\mathcal T(Z)$. Proposition 9.2.1 easily leads to the updated parameters $\alpha'_{\epsilon_1,\epsilon_2,\dots,\epsilon_k}$. We summarize these observations in the following theorem.

Theorem 9.2.1. Let $\mu=D_\alpha\times\delta_{G_0}$ be the prior on $M(R^+)\times M(R^+)$. Then the posterior distribution $\mu_1(\cdot\,|\,(z_1,\delta_1),\dots,(z_n,\delta_n))$ is a Polya tree process with partitions $\pi^{(Z,\delta)}_n$ and parameters $\alpha^{(Z,\delta)}_n=\{\acute\alpha_{\epsilon_1,\dots,\epsilon_l}\}$, where $\acute\alpha_{\epsilon_1,\dots,\epsilon_l}=\alpha_{\epsilon_1,\dots,\epsilon_l}+U_{\epsilon_1,\dots,\epsilon_l}+C^n_{\epsilon_1,\dots,\epsilon_l}$, with $U_{\epsilon_1,\dots,\epsilon_l}$ the number of uncensored observations in $B_{\epsilon_1,\dots,\epsilon_l}$.

Remark 9.2.1. Note that if $B_{\epsilon_1,\dots,\epsilon_l}=(Z_k,\infty)$ then
$$\acute\alpha_{\epsilon_1,\dots,\epsilon_l}=\alpha(B_{\epsilon_1,\dots,\epsilon_l})+\text{number of individuals surviving at time }Z_k$$
and for every other $B_{\epsilon_1,\dots,\epsilon_l}$,
$$\acute\alpha_{\epsilon_1,\dots,\epsilon_l}=\alpha(B_{\epsilon_1,\dots,\epsilon_l})+\text{number of uncensored observations in }B_{\epsilon_1,\dots,\epsilon_l}$$

The representation immediately allows us to find the Bayes estimate of the survival function $\bar F=1-F$. Fix $t>0$ and let $Z_{(k)}\le t<Z_{(k+1)}$. Then, with $Z_{(0)}=0$,
$$\bar F(t)=\prod_{i=1}^k\frac{\bar F(Z_{(i)})}{\bar F(Z_{(i-1)})}\cdot\frac{\bar F(t)}{\bar F(Z_{(k)})}\qquad(9.1)$$
A bit of reflection shows that Theorem 9.2.1 continues to hold if we change the partition to include $t$, i.e., partition $B_{1_k}$ into $(Z_{(k)},t]$ and $(t,\infty)$, and then continue as before. Thus the factors in (9.1) are independent beta variables, and $\hat{\bar F}(t)=E(\bar F(t)\,|\,(Z_i,\delta_i):1\le i\le n)$ is seen to be
$$\hat{\bar F}(t)=\prod_{i=1}^k\frac{\alpha(Z_{(i)},\infty)+U_i+C_i}{\alpha(Z_{(i-1)},\infty)+U_{i-1}+C_i}\cdot\frac{\alpha(t,\infty)+U_t+C_t}{\alpha(Z_{(k)},\infty)+U_k+C_t}\qquad(9.2)$$
Rewrite expression (9.2) as
$$\prod_{i=1}^k\frac{\alpha(Z_{(i)},\infty)+U_i+C_i}{\alpha(Z_{(i)},\infty)+U_i+C_{i+1}}\cdot\frac{\alpha(t,\infty)+U_t+C_t}{\alpha(0,\infty)+n}\qquad(9.3)$$
If the censored observations and the uncensored observations are distinct (as would be the case if $F$ and $G$ have no common discontinuity), then at any $Z_{(i)}$ that is an
uncensored value, $C_i=C_{i+1}$ and the corresponding factor in (9.3) is 1. Thus (9.3) can be rewritten as
$$\prod_{Z_{(i)}\le t,\ \delta_{(i)}=0}\frac{\alpha(Z_{(i)},\infty)+U_i+C_i}{\alpha(Z_{(i)},\infty)+U_i+C_{i+1}}\cdot\frac{\alpha(t,\infty)+U_t+C_t}{\alpha(0,\infty)+n}\qquad(9.4)$$
This is the expression obtained by Susarla and Van Ryzin [154]. The expression is a bit misleading because it appears that the estimate, unlike the Kaplan-Meier estimate, is a product over censored values. Keeping in mind that $C_t=C_{k+1}$, it is easy to see that if $t$ is a censored value, then the expression is left-continuous at $t$, and being a survival function it is hence continuous at $t$. Similarly, it can be seen that the expression has jumps at uncensored observations. Thus the expression can be rewritten as a product over censored observations times a continuous function. This form appears in Chapter 10.

As $\alpha(0,\infty)\to0$, (9.2) goes to
$$\prod_{i=1}^k\frac{U_i+C_i}{U_{i-1}+C_i}\cdot\frac{U_t+C_t}{U_k+C_k}\qquad(9.5)$$
If $Z_{(i)}$ is uncensored then $U_i+C_i=n^+_i$ and $U_{i-1}+C_i=n_i$. If $Z_{(i)}$ is censored then $U_i+C_i=U_{i-1}+C_i$, and we get the usual Kaplan-Meier estimate.
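A minimal numerical sketch of the Bayes estimate (9.4) and of its Kaplan-Meier limit follows. It is not from the text: the exponential base measure, the simulated data, and the indexing via the counts $U$ and $C$ are illustrative choices following the reconstruction above, with censored and uncensored observation values assumed distinct.

```python
import numpy as np

def surv_counts(z, d, t):
    """U_t = # uncensored observations > t; C_t = # censored observations >= t."""
    z, d = np.asarray(z, float), np.asarray(d, int)
    return np.sum((d == 1) & (z > t)), np.sum((d == 0) & (z >= t))

def susarla_van_ryzin(z, d, t, alpha_tail, alpha_total):
    """Bayes estimate (9.4) of F-bar(t); alpha_tail(s) = alpha(s, inf), alpha_total = alpha(0, inf)."""
    z, d = np.asarray(z, float), np.asarray(d, int)
    n = len(z)
    Ut, Ct = surv_counts(z, d, t)
    est = (alpha_tail(t) + Ut + Ct) / (alpha_total + n)
    for zi in np.sort(z):
        if zi <= t and d[np.where(z == zi)[0][0]] == 0:     # censored observation at zi
            Ui, Ci = surv_counts(z, d, zi)
            Ci_next = np.sum((d == 0) & (z > zi))            # C_{i+1}: censored strictly beyond zi
            est *= (alpha_tail(zi) + Ui + Ci) / (alpha_tail(zi) + Ui + Ci_next)
    return est

rng = np.random.default_rng(0)
x = rng.exponential(1.0, 20); y = rng.uniform(0, 2, 20)      # lifetimes and censoring times
z = np.minimum(x, y); d = (x <= y).astype(int)
c = 1.0                                                       # total prior mass alpha(0, inf)
alpha_tail = lambda s: c * np.exp(-s)                         # alpha = c * Exp(1) base measure
print(susarla_van_ryzin(z, d, 1.0, alpha_tail, c))            # Bayes estimate of F-bar(1)
print(susarla_van_ryzin(z, d, 1.0, lambda s: 0.0, 0.0))       # c -> 0 limit: Kaplan-Meier value
```

Letting the total prior mass go to zero reproduces the Kaplan-Meier value, in line with the Susarla and Van Ryzin result quoted above.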
We next turn to consistency.
Theorem 9.2.2. Let $F_0$ and $G$ have the same support and no common point of discontinuity. Then for any $t>0$,

(i) $E(\bar F(t)\,|\,(Z_i,\delta_i):1\le i\le n)\to\bar F_0(t)$ a.e. $P_{F_0\times G}$; and

(ii) $V(\bar F(t)\,|\,(Z_i,\delta_i):1\le i\le n)\to0$ a.e. $P_{F_0\times G}$.

Hence the posterior of $F$ is consistent at $F_0$.

Proof. Because $F_0$ and $G$ have the same support and no common point of discontinuity, the censored and uncensored observations are distinct. Note that if $0\le b\le c$ and $a\ge0$, then $(b+a)/(c+a)\ge b/c$. Using this fact, it is easy to see that (9.2) is larger than (9.5), and hence
$$\liminf_{n\to\infty}E(\bar F(t)\,|\,(Z_i,\delta_i):1\le i\le n)\ge\bar F_0(t)\quad\text{a.e. }P_{F_0\times G}$$
On the other hand, writing (9.4) as $A_n(t)B_n(t)$, where
$$A_n(t)=\frac{\alpha(t,\infty)+U_t+C_t}{\alpha(0,\infty)+n},\qquad B_n(t)=\prod_{Z_{(i)}\le t,\ \delta_{(i)}=0}\frac{\alpha(Z_{(i)},\infty)+U_i+C_i}{\alpha(Z_{(i)},\infty)+U_i+C_{i+1}}$$
it is easy to see that $A_n(t)\to\bar F_0(t)\bar G(t)$ and
$$(B_n(t))^{-1}\ge\prod_{Z_{(i)}\le t,\ \delta_{(i)}=0}\frac{U_i+C_{i+1}}{U_i+C_i}$$
The right side of the last expression is the Kaplan-Meier estimate of $\bar G$, and so
$$\liminf_{n\to\infty}(B_n(t))^{-1}\ge\bar G(t)$$
and
$$\limsup_{n\to\infty}B_n(t)\le(\bar G(t))^{-1}$$
so that
$$\limsup_{n\to\infty}A_n(t)B_n(t)\le\bar F_0(t)$$
Since the factors in (9.1) are beta variables, it is easy to write $E(\bar F^2(t)\,|\,(Z_i,\delta_i):1\le i\le n)$. A bit of tedious calculation will show that
$$E(\bar F^2(t)\,|\,(Z_i,\delta_i):1\le i\le n)\to\bar F^2_0(t)$$
We leave the details to the reader.
9.3 Cumulative Hazard Function, Identifiability
Let $F$ be a distribution function on $(0,\infty)$. The survival function $\bar F=1-F$ is then decreasing, right-continuous, and satisfies $\lim_{t\to0}\bar F(t)=1$ and $\lim_{t\to\infty}\bar F(t)=0$. We will often write $F(A)$, $\bar F(A)$ for the probability of a set $A$ under the probability measure corresponding to the distribution function $F$. Thus $F\{t\}=\bar F\{t\}=F(t)-F(t-)=\bar F(t-)-\bar F(t)$ is the probability of $\{t\}$.

A concept of importance in survival analysis is the failure rate and the related cumulative hazard function. For the distribution function $F$ of a discrete probability, a
natural expression for the hazard rate at $s$ is $F\{s\}/\bar F(s-)$. Summing this over $s\le t$ gives a notion of the cumulative hazard function for a discrete $F$ at $t$:
$$H(F)(t)=\sum_{s\le t}\frac{F\{s\}}{\bar F(s-)}=\int_0^{(\cdot)}\frac{dF(s)}{\bar F(s-)}$$
Extending this notion, the cumulative hazard function for a general $F$ is defined by
$$H(F)(\cdot)=\int_0^{(\cdot)}\frac{dF(s)}{\bar F(s-)}$$
More precisely, let $F\in\mathcal F$ and let $T_F=\inf\{t:F(t)=1\}$. Note that $T_F$ may be $\infty$. Set
$$H(F)(t)=H_F(t)=\begin{cases}\displaystyle\int_{(0,t]}\frac{dF(s)}{F[s,\infty)}&\text{for }t\le T_F\\[2mm]H_F(T_F)&\text{for }t>T_F\end{cases}$$

1. Let $\{s_1,s_2,\dots\}$ be a dense subset of $(0,\infty)$. For each $n$, let $s^{(n)}_1<\dots<s^{(n)}_n$ be an ordering of $\{s_1,\dots,s_n\}$. Let $s^{(n)}_0=0$ and define
$$H^n_F(t)=\begin{cases}\displaystyle\sum_{s^{(n)}_i<t}\frac{F(s^{(n)}_i,s^{(n)}_{i+1}]}{F(s^{(n)}_i,\infty)}&\text{for }t\le T_F\\[2mm]H^n_F(T_F)&\text{for }t>T_F\end{cases}$$
Then, for all $t$, $H^n_F(t)\to H_F(t)$ as $n\to\infty$.

2. $H_F$ is nondecreasing and right-continuous. The fact that $H_F$ is nondecreasing follows trivially because $F$ is nondecreasing. To see that $H_F$ is right-continuous, fix a point $t$ and note that if $j=\max\{i\le n:s^{(n)}_i<t\}$, then
$$H_F(t+)-H_F(t)=\lim_{n\to\infty}\frac{F(s^{(n)}_{j+1},s^{(n)}_{j+2}]}{F(s^{(n)}_{j+1},\infty)}$$
where both $\{s^{(n)}_{j+1}\}$ and $\{s^{(n)}_{j+2}\}$ are nonincreasing sequences converging to $t$ from above. Thus $F(s^{(n)}_{j+1},s^{(n)}_{j+2}]\to0$ as $n\to\infty$. If $t<T_F$, then the denominator on the right-hand side of the equation is positive for some $n$, and hence right-continuity follows. For $t\ge T_F$ it follows from the definition.
It is easy to see that $H_F(t)<\infty$ for every $t<T_F$. As with $F$, we will think of $H_F$ simultaneously as a function and a measure. Thus the measure of any interval $(s,t]$ under $H_F$ will be defined as $H_F(s,t]=H_F(t)-H_F(s)$. For $T_F<s<t$, define $H_F(s,t]=0$.

3. For any $t$, $H_F$ has a jump at $t$ iff $F$ has a jump at $t$, i.e., $\{t:H_F\{t\}>0\}=\{t:F\{t\}>0\}$.

4. It follows from the preceding that

(a) $T_F=\inf\{t:H_F(t)=\infty\ \text{or}\ H_F\{t\}=1\}$,

(b) $H_F\{t\}\le1$ for all $t$,

(c) $H_F(T_F)=\infty$ if $T_F$ is a continuity point of $F$, and

(d) $H_F\{T_F\}=1$ if $F\{T_F\}>0$.

These and other properties of $H$, with details, can be found in Gill and Johansen [90].
Let $\mathcal A^*$ be the space of all functions on $[0,\infty)$ that are nondecreasing, right-continuous, may at any finite point be infinite, but have jumps no greater than one, i.e., among such nondecreasing right-continuous functions $B$,
$$\mathcal A^*=\{B\,:\,B\{t\}\le1\ \text{for all }t\}$$
Equip $\mathcal A^*$ with the smallest $\sigma$-algebra under which the maps $\{A\mapsto A(t),\ t\ge0\}$ are measurable. $H$ maps $\mathcal F$ into $\mathcal A^*$ and $H$ is measurable. The actual range of $H$, which we will now describe, is smaller.

For $A\in\mathcal A^*$, let $T_A=\inf\{t:A(t)=\infty\ \text{or}\ A\{t\}=1\}$. Let $\mathcal A$ be the space of all cumulative hazard functions on $[0,\infty)$. Formally define $\mathcal A$ as
$$\mathcal A=\{A\in\mathcal A^*\,|\,A(t)=A(T_A)\ \text{for all }t\ge T_A\}$$
Endow $\mathcal A$ with the $\sigma$-algebra which is the restriction of the $\sigma$-algebra on $\mathcal A^*$ to $\mathcal A$. The map $H$ is a 1-1 measurable map from $\mathcal F$ onto $\mathcal A$ and, in fact, has an inverse [see Gill and Johansen [90]]. We consider this inverse map next and briefly summarize its properties.

Let $A\in\mathcal A^*$. Let $\{s_1,s_2,\dots\}$ be dense in $(0,\infty)$. For each $n$, let $s^{(n)}_1<\dots<s^{(n)}_n$ be as before. Fix $s<t$. If $A(t)<\infty$, define the product integral of $A$ by
$$\prod_{(s,t]}(1-dA)=\lim_{n\to\infty}\prod_{s<s^{(n)}_i\le t}\bigl(1-A(s^{(n)}_{i-1},s^{(n)}_i]\bigr)$$
where $A(a,b]=A(b)-A(a)$ for $a<b$. If $A(t)=\infty$ and $A(s)<\infty$, set $\prod_{(s,t]}(1-dA)=0$. If $A(s)=\infty$, set $\prod_{(s,t]}(1-dA)=1$.

Theorem 9.3.1. Let $A\in\mathcal A$. Then $F$ given by
$$F(t)=1-\prod_{(0,t]}(1-dA)$$
is an element of $\mathcal F$. Further,
$$A(t)=\int_{(0,t]}\frac{dF(s)}{F[s,\infty)}$$

The following properties of the product integral are included to lend the reader a better understanding of the nature of the map $H$ and will be useful later. For details, we again refer to Gill and Johansen [90].

5. Like $H$, the product integral also has an explicit expression:
$$\prod_{(0,t]}(1-dA)=\exp(-A_c(t))\prod_{s\le t}(1-A\{s\})$$
where $A_c$ is the continuous part of $A$. (A small numerical illustration of this map and its inverse is given after this list.)

6. Let $\rho_S$ denote the Skorokhod metric on $D[0,\infty)$ and let $\{H_n\}$ be a sequence in $\mathcal A$. Say that $\rho_S(H_n,A)\to0$ for some $A\in\mathcal A$ as $n\to\infty$ if $\rho_S(H^T_n,A^T)\to0$ for all $T>0$, where $H^T_n$ and $A^T$ are the restrictions of $H_n$ and $A$ to $[0,T]$. It may be shown, following Hjort ([100], Lemma A.2, pp. 1290-91), that if $\{H_n\},A\in\mathcal A$ and $\rho_S(H_n,A)\to0$, then $H^{-1}(H_n)\xrightarrow{w}H^{-1}(A)$. Thus, if $\mathcal A$ is endowed with the Skorokhod metric, then $H^{-1}$ is a continuous map.
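As a quick numerical illustration (not from the text) of the map $H$ and the product integral, the sketch below computes $H_F$ for a small discrete distribution and then recovers $\bar F$ through property 5; for a purely discrete $A$ the continuous part $A_c$ vanishes, so only the product over jumps remains. The particular support and masses are arbitrary.

```python
from fractions import Fraction as Fr

support = [1, 2, 3]
mass = {1: Fr(1, 2), 2: Fr(1, 4), 3: Fr(1, 4)}   # a discrete F on {1, 2, 3}

def surv_left(t):                                 # F̄(t-) = F[t, ∞)
    return sum(m for s, m in mass.items() if s >= t)

# cumulative hazard: H_F(t) = sum_{s <= t} F{s} / F̄(s-)
H = {s: sum(mass[u] / surv_left(u) for u in support if u <= s) for s in support}

# product integral: F̄(t) = prod_{s <= t} (1 - ΔH(s)) since A_c = 0 here
Fbar = {}
for s in support:
    prod = Fr(1)
    for u in support:
        if u <= s:
            prod *= (1 - mass[u] / surv_left(u))
    Fbar[s] = prod

print(H)      # {1: 1/2, 2: 1, 3: 2}: note the jump of size 1 at T_F, as in property 4(d)
print(Fbar)   # recovers F̄(1) = 1/2, F̄(2) = 1/4, F̄(3) = 0
```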
Let $F$ be a distribution function. In the literature
$$A(F)=-\log\bar F$$
is also used to formalize the notion of the "cumulative hazard function of $F$." $A$ arises by defining the hazard rate at $s$ for a continuous random variable as
$$r(s)=\lim_{\Delta s\to0}\frac1{\Delta s}\frac{P(s\le X<s+\Delta s)}{P(X\ge s)}$$
If $X$ has a distribution $F$ with density $f$, then $r(s)=f(s)/\bar F(s)$, and if the cumulative hazard function is defined as $\int_0^{(\cdot)}r(s)\,ds$, then this gives $A(F)=-\log\bar F(\cdot)$. One extends the definition for a discrete $F$ formally to give $A$.

It is easy to see that the two definitions coincide when $F$ is continuous. However, in estimating a survival function or a cumulative hazard function one typically employs a discontinuous estimate. Further, priors like the Dirichlet sit on discrete distributions. The nature of the map, therefore, plays an important role in inference about lifetime distributions and hazard rates. For us, the cumulative hazard function of a distribution will be $H(F)$.

We next turn to identifiability of $(F,G)$ from $(Z,\delta)$. As before, let $X$ and $Y$ be independent with $X\sim F$, $Y\sim G$. Let $T(x,y)=(z,\delta)=(x\wedge y,I(x\le y))$ and denote by $\bar T(F,G)$ the distribution of $T$ when $X\sim F$, $Y\sim G$.

$\bar T(F,G)$ is thus a probability measure on $\mathcal T=(0,\infty)\times\{0,1\}$. Any probability measure $P$ on $\mathcal T$ gives rise to two subsurvival functions,
$$S^0(t)=P\bigl((t,\infty)\times\{0\}\bigr)\quad\text{and}\quad S^1(t)=P\bigl((t,\infty)\times\{1\}\bigr)$$
These satisfy
$$S^0(0+)+S^1(0+)=1,\qquad S^i(t)\ \text{decreasing in }t,\qquad\lim_{t\to\infty}S^i(t)=0\qquad(9.6)$$
Conversely, any pair of subsurvival functions satisfying (9.6) corresponds to a probability on $\mathcal T$. The following proposition, due to Peterson [138], shows that under mild assumptions $F$ and $G$ can be recovered from $\bar T(F,G)$.
Proposition 9.3.1. Assume that $F$ and $G$ have no common points of discontinuity. Let $\bar T(F,G)=(S^0,S^1)$. Then for any $t$ such that $S^i(t)>0$, $i=0,1$:

1.
$$H_F(t)=\int_{(0,t]}\frac{dS^1(s)}{S^0(s-)+S^1(s-)}\qquad(9.7)$$

2.
$$\bar F(t)=e^{-\int_0^t\frac{dS^1_c(s)}{S^0(s-)+S^1(s-)}}\prod_{s\le t,\ S^1\{s\}>0}\Bigl(1-\frac{S^1\{s\}}{S^0(s-)+S^1(s-)}\Bigr)\qquad(9.8)$$

3.
$$\sup_t\bigl[|F_n(t)-F(t)|+|G_n(t)-G(t)|\bigr]\to0\quad\text{iff}\quad\sup_t\bigl[|S^0_n(t)-S^0(t)|+|S^1_n(t)-S^1(t)|\bigr]\to0\qquad(9.9)$$

A similar expression holds for $\bar G$. Thus, if we assume that $F$ and $G$ have no common points of discontinuity and have the same support, then both $F$ and $G$ can be recovered from $\bar T(F,G)$.
9.4 Priors via Distributions of (Z, δ)
It might be argued that in the censoring context, subjective judgments such as exchangeability are to be made on the observables $(Z,\Delta)$ and would hence lead to priors for the distribution of $(Z,\Delta)$. The model of independent censoring can then be used to transfer this prior to the distribution of the lifetime $X$.

Formally, let $M_0\subset M(\mathcal X)\times M(\mathcal Y)$ be the class of all pairs of distribution functions $(F,G)$ such that

1. $F$ and $G$ have no points of discontinuity in common, and

2. for all $t\ge0$, $F(t)<1$ and $G(t)<1$.

Denote by $T$ the function $T(x,y)=(x\wedge y,I_{x\le y})$ and by $\bar T$ the function on $M(\mathcal X\times\mathcal Y)$ defined by $\bar T(P,Q)=(P\times Q)\circ T^{-1}$, i.e., $\bar T(P,Q)$ is the distribution of $T$ under $(P,Q)$. Let $\bar M_0=\bar T(M_0)$. From the last section we know that on $M_0$, $\bar T$ is 1-1. Note that every prior on $M_0$ gives rise to a prior on $\bar M_0$ via $\bar T$, and every prior on $\bar M_0$ induces a prior on $M_0$ through $(\bar T)^{-1}$.

Theorem 9.4.1. Let $\bar\Pi$ be a prior on $\bar M_0$ and let $\Pi$ be the prior on $M_0$ induced through $(\bar T)^{-1}$.

(i) If $\bar\Pi(\cdot\,|\,(Z_i,\delta_i):1\le i\le n)$ on $\bar M_0$ is weakly consistent at $\bar T(P_0,Q_0)$, and $(P_0,Q_0)$ is continuous, then the posterior $\Pi(\cdot\,|\,(Z_i,\delta_i):1\le i\le n)$ on $M_0$ is weakly consistent at $(P_0,Q_0)$.

(ii) If $\bar\Pi(U\,|\,(Z_i,\delta_i):1\le i\le n)\to1$ for $U$ of the form
$$U=\Bigl\{(S^0,S^1):\sup_t\bigl[|S^0(t)-S^0_0(t)|+|S^1(t)-S^1_0(t)|\bigr]<\epsilon\Bigr\}$$
(here $(S^0_0,S^1_0)$ are the subsurvival functions corresponding to $\bar T(P_0,Q_0)$), then the posterior $\Pi(\cdot\,|\,(Z_i,\delta_i):1\le i\le n)$ on $M_0$ is weakly consistent at $P_0$.

Proof. (i) follows immediately from the fact that for continuous distributions the neighborhoods arising from the supremum metric and weak neighborhoods coincide (see Proposition 2.5.3). The second assertion follows from the continuity property described in Proposition 9.3.1 and by noting that $\Pi(\cdot\,|\,(Z_i,\delta_i):1\le i\le n)$ on $M_0$ is just the distribution of $(\bar T)^{-1}$ under $\bar\Pi(\cdot\,|\,(Z_i,\delta_i):1\le i\le n)$.
We have so far not demonstrated any prior on $\bar M_0$. We next argue that it is in fact possible to obtain a Dirichlet prior on $M(\mathcal T)$ that gives mass 1 to $\bar M_0$.

Theorem 9.4.2. Let $\alpha$ be a probability measure on $\mathcal T=(0,\infty)\times\{0,1\}$ and let $(S^0_\alpha,S^1_\alpha)$ be the corresponding subsurvival functions. Assume

(a) $S^0_\alpha$ and $S^1_\alpha$ have the same support and have no common points of discontinuity; and

(b) for $i=0,1$, $H^i(t)=\int_{(0,t]}dS^i_\alpha(s)/\bigl(S^0_\alpha(s)+S^1_\alpha(s)\bigr)$ satisfies
$$\lim_{t\to\infty}H^i(t)=\infty\quad\text{for }i=0,1$$
Then for any $c>0$, $D_{c\alpha}(\bar M_0)=1$.

Proof. We will work with pairs of random subsurvival functions rather than with random probabilities on $\mathcal T$. We will show that with $D_{c\alpha}$ probability 1,

(a) $S^0$ and $S^1$ have the same support and have no common points of discontinuity; and

(b) for $i=0,1$, $\int_{(0,\infty)}dS^i(s)/\bigl(S^0(s)+S^1(s)\bigr)=\infty$.

That (a) holds with probability 1 is immediate from assumption (a). For (b), let $t_1,t_2,\dots$, continuity points of $S^0_\alpha$, be such that
$$\sum_i\frac{S^1_\alpha(t_{i-1},t_i]}{S^0_\alpha(t_{i-1})+S^1_\alpha(t_{i-1})}=\infty$$
Such $t_i$s can be chosen by first choosing $s_i$ with $H^1(s_i)\uparrow\infty$ and then choosing the $t_j$s in $(s_{i-1},s_i]$ with
$$\sum_{t_j\in(s_{i-1},s_i]}\frac{S^1_\alpha(t_{j-1},t_j]}{S^0_\alpha(t_{j-1})+S^1_\alpha(t_{j-1})}\ge H^1(s_i)-H^1(s_{i-1})-2^{-i}$$
Let $Y_i=S^1(t_{i-1},t_i]/\bigl(S^0(t_{i-1})+S^1(t_{i-1})\bigr)$; clearly $\sum_iY_i\le\int dS^1(s)/(S^0(s)+S^1(s))$. Further, the $Y_i$s are bounded by 1 and, under the Dirichlet, are independent. Note that $S^0(t_{i-1})+S^1(t_{i-1})$ and $Y_i$ are independent, and hence
$$E(Y_i)=\frac{S^1_\alpha(t_{i-1},t_i]}{S^0_\alpha(t_{i-1})+S^1_\alpha(t_{i-1})}$$
Assumption (b) guarantees $\sum E(Y_i)=\infty$. This in turn gives $\sum Y_i=\infty$ almost surely [see Loeve [132], p. 248].

In addition to consistency, if the empirical distribution of $(Z,\Delta)$ is a limit of Bayes estimates on $\bar M_0$, then the Kaplan-Meier estimate is correspondingly a limit of the induced Bayes estimates on $M_0$. This method of constructing priors on $M_0$ is appealing and merits further investigation; for instance, the Dirichlet process on $\bar M_0$ arises through a Polya urn scheme, and it would be of interest to see the corresponding process for the induced prior.
9.5 Interval Censored Data
Susarla and Van Ryzin showed that the Kaplan-Meier estimate, which is also the nonparametric MLE, is the limit of Bayes estimates with a $D_\alpha$ prior for the distribution of $X$. The observations in this section show that this result does not carry over to other kinds of censored data.

Here our observation consists of $n$ pairs $(L_i,R_i]$, $1\le i\le n$, where $L_i\le R_i$ and $(L_i,R_i]$ corresponds to the information $X\in(L_i,R_i]$. We assume that the $(L_i,R_i]$, $1\le i\le n$, are independent and that the underlying censoring mechanism is independent of the lifetime $X$, so that the posterior distribution depends only on the $(L_i,R_i]$, $1\le i\le n$.

Let $t_1<t_2<\dots<t_{k+1}$ denote the endpoints of the $(L_i,R_i]$, $1\le i\le n$, arranged in increasing order, and let $I_j=(t_j,t_{j+1}]$. For simplicity we assume that $t_1=\min_iL_i$ and $t_{k+1}=\max_iR_i$.

Our starting point is a Dirichlet prior $D(c\alpha_1,c\alpha_2,\dots,c\alpha_k)$ for $(p_1,p_2,\dots,p_k)$, where $p_j=P\{X\in I_j\}$. Turnbull [159] suggested the use of the nonparametric maximum likelihood estimate obtained from the likelihood function
$$\prod_{i=1}^n\ \sum_{I_j\subset(L_i,R_i]}p_j$$
If $(p_1,p_2,\dots,p_k)$ has a $D(c\alpha_1,c\alpha_2,\dots,c\alpha_k)$ prior, then the posterior distribution of $(p_1,p_2,\dots,p_k)$ given $(L_i,R_i]$, $1\le i\le n$, is a mixture of Dirichlet distributions.
Call a vector $a=(a_1,a_2,\dots,a_n)$, where each $a_i$ is an integer, an imputation of $(L_i,R_i]$, $1\le i\le n$, if $I_{a_i}\subset(L_i,R_i]$. For an imputation $a$, let $n_j(a)$ be the number of observations assigned to the interval $I_j$; formally, $n_j(a)=\#\{i:a_i=j\}$.

Let the order $O(a)$ of an imputation be $\#\{j:n_j(a)>0\}$. Let $\mathcal A$ be the set of all imputations of $(L_i,R_i]$, $1\le i\le n$, and let $m=\min_{a\in\mathcal A}O(a)$. Call an imputation $a$ minimal if $O(a)=m$.

It is not hard to see that the posterior distribution of $(p_1,p_2,\dots,p_k)$ given $(L_i,R_i]$, $1\le i\le n$, is
$$\sum_{a\in\mathcal A}C_a\,D\bigl(c\alpha_1+n_1(a),c\alpha_2+n_2(a),\dots,c\alpha_k+n_k(a)\bigr)$$
where
$$C_a=\frac{\prod_1^k\Gamma(c\alpha_j+n_j(a))}{\sum_{a'\in\mathcal A}\prod_1^k\Gamma(c\alpha_j+n_j(a'))}$$
The Bayes estimate of any $p_j$ is
$$\hat p_j=\sum_{a\in\mathcal A}C_a\,\frac{c\alpha_j+n_j(a)}{c+n}$$
As $c\to0$, $(c\alpha_j+n_j(a))/(c+n)\to n_j(a)/n$. The behavior of $C_a$ is given by the next proposition.

Proposition 9.5.1. $\lim_{c\to0}C_a>0$ iff $a$ is a minimal imputation.

Proof. Suppose $a$ is not minimal. Let $a_0$ be an imputation with $O(a)>O(a_0)$. Then
$$C_a\le\frac{\prod_1^k\Gamma(c\alpha_j+n_j(a))}{\prod_1^k\Gamma(c\alpha_j+n_j(a_0))}=\frac{\prod_{j=1}^k\Gamma(c\alpha_j)}{\prod_{j=1}^k\Gamma(c\alpha_j)}\cdot\frac{\prod_{j:n_j(a)\ne0}\prod_{i=0}^{n_j(a)-1}(c\alpha_j+i)}{\prod_{j:n_j(a_0)\ne0}\prod_{i=0}^{n_j(a_0)-1}(c\alpha_j+i)}$$
Since $O(a)>O(a_0)$, the ratio goes to 0 as $c\to0$. Conversely, if $a$ is minimal, it is easy to see that
$$\frac1{C_a}=\sum_{a'\in\mathcal A}\frac{\prod_1^k\Gamma(c\alpha_j+n_j(a'))}{\prod_1^k\Gamma(c\alpha_j+n_j(a))}$$
converges to a positive limit.
Thus the limiting behavior is determined by minimal imputations. A few examples clarify these notions.

Example 9.4.1. Consider the right censoring case, i.e., for each $i$ either $L_i=R_i$ or $R_i=t_{k+1}$. Any minimal imputation is obtained by assigning compatible observations to the singletons corresponding to uncensored observations, and to $I_k$ if the last (largest) observation is censored.

Example 9.4.2. Consider the case when we have current status, or case I interval censored, data. Here for each $i$, either $L_i=t_1$ or $R_i=t_{k+1}$, so that all we know is whether $X_i$ is to the right of $L_i$ or to the left of $R_i$.

(i) If $\max_iL_i<\min_iR_i$, the minimal imputation is the allocation of all the observations to the interval $(\max_iL_i,\min_iR_i]$.

(ii) In general, the minimal imputations have order 2. For example, a consistent assignment of the data to $(t_1,\min_iR_i]$ and $(\max_iL_i,t_{k+1}]$ would yield a minimal imputation.

A couple of simple numerical examples help clarify the different cases. In the following examples the prior of the distribution is $D_{c\alpha}$, where $\alpha$ is a probability measure. The limit is taken as $c\to0$. Corresponding to any imputation $a$, we will call the set of intervals $I_j$ for which $n_j(a)>0$ an allocation, and an allocation corresponding to a minimal imputation will be called a minimal allocation.

Example (a): This example illustrates that the limit of Bayes estimates could be supported on a much bigger set than the NPMLE. The observed data consist of the four intervals $(1,\infty)$, $(2,\infty)$, $(0,3]$, $(4,\infty)$.

The limit of Bayes estimates in this case turns out to be
$$\tilde F(0,1]=1/22,\quad\tilde F(1,2]=2/22,\quad\tilde F(2,3]=6/22,\quad\tilde F(4,\infty)=13/22,$$
while the NPMLE is given by
$$\hat F(2,3]=1/2\quad\text{and}\quad\hat F(4,\infty)=1/2.$$
In this example, each minimal allocation consists of only two subintervals.

(i) $(0,1]$ and $(4,\infty)$, with the corresponding numbers of $X_i$s in the subintervals being 1 and 3, respectively, represents a minimal allocation.

(ii) $(2,3]$ and $(4,\infty)$, with the corresponding numbers of $X_i$s in the subintervals being 1 and 3, respectively, represents another minimal allocation.

(iii) $(2,3]$ and $(4,\infty)$, with the corresponding numbers of $X_i$s in the subintervals being 2 and 2, respectively, represents yet another minimal allocation.
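The limit in Example (a) can be checked by brute-force enumeration. The following sketch (not from the text) lists all imputations of the four intervals, keeps the minimal ones, and weights each minimal imputation proportionally to $\prod_j\Gamma(n_j(a))$; the base-measure masses of the occupied cells are omitted here for simplicity (equivalently, taken equal), an assumption under which the sketch reproduces the values $1/22$, $2/22$, $6/22$ and $13/22$ stated above.

```python
from itertools import product
from math import gamma

# observed intervals (L_i, R_i]; None stands for +infinity
obs = [(1, None), (2, None), (0, 3), (4, None)]
cells = [(0, 1), (1, 2), (2, 3), (3, 4), (4, None)]   # I_1, ..., I_5 from the endpoints

def compatible(cell, o):
    """Cell (a, b] is a possible location for X when it is contained in (L, R]."""
    (L, R), (a, b) = o, cell
    return a >= L and ((R is None) or (b is not None and b <= R))

imputations = list(product(*[[j for j, c in enumerate(cells) if compatible(c, o)]
                             for o in obs]))
order = lambda a: len(set(a))
m = min(order(a) for a in imputations)
minimal = [a for a in imputations if order(a) == m]

def weight(a):                      # limit (c -> 0) weight, up to normalization
    w = 1.0
    for j in set(a):
        w *= gamma(a.count(j))
    return w

W = sum(weight(a) for a in minimal)
n = len(obs)
for j, c in enumerate(cells):
    pj = sum(weight(a) * a.count(j) / n for a in minimal) / W
    if pj > 0:
        print(c, pj)                # I_1: 1/22, I_2: 2/22, I_3: 6/22, I_5: 13/22
```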
Example (b): This example shows that the limit of Bayes estimates could be supported on a smaller set than the NPMLE. The observed data consist of the intervals $(0,1]$, $(2,\infty)$, $(0,3]$, $(0,4]$, and $(5,\infty)$.

The limit of Bayes estimates in this case turns out to be
$$\tilde F(0,1]=3/5\quad\text{and}\quad\tilde F(5,\infty)=2/5,$$
while the NPMLE is given by
$$\hat F(0,1]=1/2,\quad\hat F(2,3]=1/6,\quad\hat F(5,\infty)=1/3.$$

As $c\to0$, while Dirichlet priors lead to strange estimates for current status data, the case $c=1$ seems to present no problems. Even when $c\to0$, we expect that the limiting behavior will be more reasonable when the data are case II interval censored, in the sense described in [91]. In that case, the tendency to push the observations to the extremes would be less pronounced.

In the current status data case the limit (as $c\to0$) of the posterior itself exhibits degeneracy. The following proposition is easy to establish.

Proposition 9.5.2. Let $R^*=\inf_{i:L_i=0}R_i$ and $L^*=\sup_{i:R_i=t_{k+1}}L_i$.

(i) If $R^*<L^*$, then as $c\to0$ the posterior distribution of $P(R^*,L^*)$ converges to the measure degenerate at 0.

(ii) If $L^*<R^*$, then as $c\to0$ the posterior distribution of $P(L^*,R^*)$ converges to the measure degenerate at 1.
10
Neutral to the Right Priors
10.1 Introduction
In Chapter 3, among other aspects, we looked at two properties of Dirichlet processes-
the tail free property and the neutral to the right property. In this chapter we discuss
priors that generalize Dirichlet processes via the neutral to the right property.
Neutral to the right priors are a class of nonparametric priors that were introduced
by Doksum [48]. Historically, the concept of neutrality is due to Connor and Mosimann
[34] who considered it in the multinomial context. Doksum extended it to distributions
on the real line in the form of neutral to the right priors and showed that if Π is neutral
to the right, then the posterior given nobservations is also neutral to the right. This
result was extended to the case of right-censored data by Ferguson and Phadia [64].
These topics are discussed in Section 10.2.
Doksum and Hjort showed that a prior is neutral to the right iff the cumulative
hazard function has independent increments. Since independent increment processes
are well understood, this connection provides a powerful tool for studying neutral
to the right priors. In particular, independent increment processes have a canonical
structure, the so-called L´evy representation. The associated L´evy measure can be
used to elucidate properties of neutral to the right priors. For instance Hjort provides
an explicit expression for the posterior given nindependent observations in terms of
254 10. NEUTRAL TO THE RIGHT PRIORS
the L´evy representation when the L´evy measure is of a specific form. In Section 10.3
we summarize these results.
In Section 10.4 we discuss beta processes. Hjort [100] and Walker and Muliere [166],
respectively, developed beta processes and beta-Stacy processes, which provide con-
crete and useful classes of neutral to the right priors. These priors are analogous to
the beta prior for the Bernoulli (θ), are analytically tractable, and are flexible enough
to incorporate a wide variety of prior beliefs.
The rest of the chapter is devoted to consistency results for neutral to the right
priors. These results center around an example of Kim and Lee [114] of a neutral to
the right prior that is inconsistent at all continuous distributions.
10.2 Neutral to the Right Priors
For any F∈F,asintheChapter9 ¯
F(·)=1F(·) is the survival function corre-
sponding to F.Let ¯
F(0) = 1. We also continue to denote by F(A) the measure of
the set Aunder the probability measure corresponding to F.
Definition 10.2.1. ApriorΠonFis said to be neutral to the right if, under Π,
for all k1andall0<t
1<...< t
k,
¯
F(t1),¯
F(t2)
¯
F(t1),..., ¯
F(tk)
¯
F(tk1)
are independent.
If Π is neutral to the right, we will also refer to a random distribution function
Fwith distribution Π as being neutral to right. Note that (0/0) is defined here and
throughout to be 1.
For a fixe d F,ifXis a random variable distributed as F,thenforevery0s<t,
¯
F(t)/¯
F(s) is simply the conditional probability F(X>t|X>s). For t>0, ¯
F(t)is
viewed as the conditional probability F(X>t|X>0).
Example 10.2.1. Consider a finite ordered set {t1,...,t
n}of points in (0,). To
construct a neutral to right prior on the set Ft1,...,tnof distribution functions supported
by the points t1,...,t
n, we only need to specify (n1) independently distributed
[0,1]-valued random variables V1,...,V
n1, and then set ¯
F(ti)/¯
F(ti1)=1Vifor
1in1. Finally, set ¯
F(tn)/¯
F(tn1) = 0. Observe that ¯
F(tn) = 0 and, for
10.2. NEUTRAL TO THE RIGHT PRIORS 255
1in1,
¯
F(ti)=
i
j=1
(1 Vj)
Example 10.2.2. In a similar fashion we can construct a neutral to right prior on
the space FTof all distribution functions supported by a countable subset T={t1<
t2<...}of (0,).
Let {Vi}i1be a sequence of independent [0,1]-valued random variables such that,
for some η>0,
i1
P(Vi)=
This happens, for instance, when Vis are identically distributed with P(Vi)>0.
As before, for i1, set ¯
F(ti)/¯
F(ti1)=1Vi. In other words, ¯
F(tk)=k
i=1(1 Vi),
for all k1. By the Borel-Cantelli lemma, we have
P
i1
(1 Vi)=0
=1
This defines a neutral to right prior Π on Fbecause
lim
t→∞
¯
F(t) = lim
k→∞
k
i=1
(1 Vi)=0,a.s. Π
Dirichlet process priors of course provide a ready example of a family of neutral to
the right priors. Other examples are the beta process and beta-Stacy process , to be
discussed later.
As before, we consider the standard Bayesian set-up where Π is a prior and given F,
X1,X
2,... be i.i.d. F.Foreachn1, denote by ΠX1,...,Xna version of the posterior
distribution, i.e. the conditional distribution of Fgiven X1,...,X
n.
Following are some notations:
For n1, define the observation process Nn(.) as follows:
Nn(t)=
in
I(0,t](Xi) for all t>0
For every n1, let Nn(0) 0. Observe that Nn(.) is right-continuous on [0,). Let
Gt1...tk=σ¯
F(t1),¯
F(t2)
¯
F(t1),..., ¯
F(tk)
¯
F(tk1).
256 10. NEUTRAL TO THE RIGHT PRIORS
Thus Gt1...tkdenotes the collection of all sets of the form
D=(¯
F(t1),¯
F(t2)
¯
F(t1),..., ¯
F(tk)
¯
F(tk1))C
where C∈B
k
[0,1].
Theorem 10.2.1 (Doksum). Let Πbe neutral to the right. Then ΠX1,...,Xnis also
neutral to the right.
Proof. Fix k1 and let t1<t
2<···<t
kbe arbitrary points in (0,). Denote by
Qthe set of all rationals in (0,) and let Q=Q∪{t1,...,t
k}.Let{s1,s
2,...}be
an enumeration of Q. Observe that, for large enough m,{t1,...,t
k}⊂{s1,...,s
m}.
For su ch an m, let s(m)
1<···<s
(m)
mbe an ordering of {s1,...,s
m}.LetY(m)
i=
¯
F(s(m)
i)/¯
F(s(m)
i1) and, under Π, let Π(m)
idenote the distribution of Y(m)
i.
Let n1···nm. Then, given {Nn(s(m)
1)=n1,...,N
n(s(m)
m)=nm}, the posterior
density of (Y(m)
1,...,Y(m)
m) is written as
fY(m)
1,...,Y (m)
m(y1,...,y
m)= m
i=1(1 yi)nini1ynni
i
m
i=1(1 yi)nini1ynni
idΠ(m)
i(yi)
=
m
i=1
(1 yi)nini1ynni
i
(1 yi)nini1ynni
idΠ(m)
i(yi)
This shows that (Y(m)
1,...,Y(m)
m) are independent under the posterior given
{Nn(s(m)
1),...,N
n(s(m)
m)}. Hence,
¯
F(ti)
¯
F(ti1)=
ti1<s(m)
jti
¯
F(s(m)
j)
¯
F(s(m)
j1),i=1,...,k
are also independent under the posterior given the same information.
Now, by the right-continuity of Nn(·)wehave,asn→∞,
σ{Nn(sj),j m}↑σ{Nn(t),t0}≡σ(X1,...,X
n)
Hence, for any A∈G
t1...tk, by the martingale Convergence theorem, we have
Π(A|Nn(s(m)
1),...,N
n(s(m)
m)) Π(A|X1,...,X
n) almost surely
Since for each m, the random quantities ¯
F(t1),¯
F(t2)/¯
F(t1)..., ¯
F(tk)/¯
F(k1)are
independent given σ(Nn(s(m)
1),...,N
n(s(m)
m)), independence also holds as m→∞.
10.2. NEUTRAL TO THE RIGHT PRIORS 257
A perusal of the proof given above suggests that for any t1<t
2the posterior
distribution of ¯
F(t2)/¯
F(t1) depends on {Nn(s):t1st2}.Inwords,thepos-
terior depends on the number of observations less than t1, the exact observations
between t1and t2and the number of observations greater than t2. This was observed
by Doksum. The following theorem proved in [42] shows that this property essen-
tially characterizes neutral to the right priors. Walker and Muliere [167] have also
obtained characterizations of neutral to the right priors. Their results are presented
in a different flavor.
Theorem 10.2.2. Let Πbe a prior on Fsuch that Π{0<F(t)<1,for all t}=1.
Then the following are equivalent:
(i) Πis neutral to the right
(ii) for every t
L¯
F(t)|Π(X1,X
2,...,X
n)) = L¯
F(t)|Nn(s):0<s<t
where L(.)stands for the Law of (.).
Thus, if one wants to estimate the probability that a subject survives beyond t
years based on nsamples of which n1fell below t, then a neutral to the right prior
would lead to the same estimate if the remaining nn1observations fell just above
tor far beyond it. This is a property that is also shared by the empirical distribution
function. This suggests that neutral to the right priors are appropriate when the
interest is in all of Fand inappropriate if the interest is in a local neighborhood of a
fixed time point.
Ferguson and Phadia [64] extend Doksum’s result in the case of inclusively and
exclusively right censored observations. Let xbe a real number in (0,). Given a
distribution function F∈F,anobservationXfrom Fis said to be exclusively right
censored if we only know Xxand inclusively right-censored if we know X>x.
We state their result next. The proof is straightforward.
Theorem 10.2.3 (Ferguson and Phadia). Let Fbe a random distribution func-
tion neutral to the right. Let Xbe a sample of size one from F, and let xbe a number
in (0,). Then
(a) the posterior distribution of Fgiven X>xis neutral to the right, and
(b) the posterior distribution of Fgiven Xxis neutral to the right.
258 10. NEUTRAL TO THE RIGHT PRIORS
10.3 Independent Increment Processes
As mentioned in the introduction, neutral to the right priors relate to independent
increment process via the cumulative hazard function. To recall from Chapter 9, the
cumulative hazard function is given by
H(F)(t)=HF(t)=(0,t]
dF (s)
F[s,)for tTF
HF(TF)fort>T
F
and discussed its properties.
The next result establishes the connection between neutral to the right priors and
independent increment processes with nondecreasing paths via the map H.
Theorem 10.3.1. Let Πbe a neutral to the right prior on F. Then, under the
measure Πon Ainduced by the map H,{A(t):t>0}has independent increments.
Conversely, if Πis a probability measure on Asuch that the process {A(t):t>0}
has independent increments, then the measure induced on Fby the map
H1:A→ 1
(0,t]
(1 dA)
is neutral to the right.
Proof. First suppose that Π is neutral to the right on Fand let t1<··· <t
kbe
arbitrary points in (0,). Consider, as before, a dense set {s1,s
2,...}in (0,). Let,
for each n,s(n)
1<···<s
(n)
nbe as before.
Suppose nis large enough that s(n)
ntk. Then, for each 1 ik,wehavewith
An
Fas
An
F(ti)An
F(ti1)=
ti1<s(n)
jti
F(s(n)
j1,s
(n)
j]
F(s(n)
j1,)
=
ti1<s(n)
jti1
¯
F(s(n)
j)
¯
F(s(n)
j1)
Because for each n,¯
F(s(n)
1), ¯
F(s(n)
2)/¯
F(s(n)
1), ..., ¯
F(s(n)
n)/¯
F(s(n)
n1) are independent,
An
F(t1), An
F(t2)An
F(t1), ..., An
F(tk)An
F(tk1) are also independent. Letting n→∞,
we get that AF(t1), AF(t2)AF(t1), ..., AF(tk)AF(tk1) are independent.
10.3. INDEPENDENT INCREMENT PROCESSES 259
For the converse, suppose Πon Asuch that, under Π,{A(t):t>0}is an
independent increment process. Again, let t1<···<t
kbe arbitrary points in (0,).
Then with s(n)
1<···<s
(n)
nas before, let, for 1 ik,
¯
Fn
A(ti)=
s(n)
jti
(1 A(s(n)
j1,s
(n)
j])
If FA=H1(A), then it follows from the definition of the product integral that
¯
F(n)
A(t)¯
FA(t) for all t,asn→∞.Now,observethat,for1ik,
¯
Fn
A(ti)
¯
Fn
A(ti1)=
ti1<s(n)
jti
(1 A(s(n)
j1,s
(n)
j])
Since A(s(n)
j1,s
(n)
j],1jnare independent for each nso are ¯
Fn
A(ti)/¯
Fn
A(ti1),1
ik. Consequently, we have independence in the limit, i.e., ¯
FA(t1), ¯
FA(t2)/¯
FA(t1),
..., ¯
FA(tk)/¯
FA(tk1) are independent.
It is not hard to verify that for a neutral to the right prior Π,
EΠH(F)=H(EΠF)
Since the posterior given X1,X
2,...,X
nis again neutral to the right, the above
property continues to hold for ΠX1,X2,...,Xn. It is shown in Dey etal. [43] that in the
time discrete case the above property characterizes neutral to the right property. We
expect a similar result to hold in general.
Doksum was the first to observe a connection between neutral to the right priors
and independent increment processes. He, however, considered cumulative hazard
function defined by D(F)(t)=DF(t)=log ¯
F(t). The proof of Theorem 10.3.2 is
straightforward.
Theorem 10.3.2 (Doksum). A prior Πon Fis neutral to the right if and only
if ˜
Π=ΠD1is an independent increment process measure such that ˜
Π{H∈H:
limt→∞ H(t)=∞} =1.
The theory of neutral to the right priors owes much of its development and analytic
elegance to its connection with independent increment processes. The principal ex-
amples of general families of neutral to the right priors have been constructed via this
connection. Next, we briefly discuss the relevant theory of these processes in terms of
a representation due to P. L´evy. Following is a brief description of the representation.
260 10. NEUTRAL TO THE RIGHT PRIORS
The following facts are wellknown and can be found in , for example, Ito [104] and
Kallenberg [110].
Definition 10.3.1. A stochastic process {A(t)}t0is said to be an independent
increment process if A(0) = 0 almost surely and if, for every kand every {t0<t
1<
···<t
k}⊂[0,), the family {A(ti)A(ti1)}k
i=1 is independent.
Let Hbe a space of functions defined by
H={H|H:[0,)→ [0,],H(0) = 0,H non-decreasing, right-continuous}
(10.1)
Let B(0,)×[0,]be the Borel σ-algebra on (0,)×[0,].
Theorem 10.3.3. Let Πbe a probability on H. Under Π,{A(t):t>0}is an
independent increment process if and only if the following three conditions hold. There
exists
1 a finite or countable set M={t1,t
2,...}of points in (0,)and, for each tiM,
a positive random variable Yidefined on Hwith density fi;
2 a nonrandom continuous nondecreasing function b; and
3 a measure λon (0,)×[0,],B(0,)×[0,]that for all t>0, satisfies
(a) λ({t[0,]) = 0 and
(b) 
0<st
0u≤∞
u
1+uλ(ds du)<,
such that
A(t)=b(t)+
tit
Yi(A)+ 
0<st
0u≤∞
(ds du, A) (10.2)
where, for each A∈H,µ(·,A)is a measure on (0,)×[0,],B(0,)×[0,]such
that, under Π,µ(·,·)is a Poisson process with parameter λ(·), i.e., for arbitrary dis-
joint Borel subsets E1,...,E
kof (0,)×[0,],µ(E1,·),...,µ(Ek,·)are independent,
and
µ(Ei,·)P oisson(λ(Ei)) for 1ik
Note the following facts about independent increment processes, which will be
useful later and facilitate understanding of the remaining subject matter.
10.3. INDEPENDENT INCREMENT PROCESSES 261
(1) The measure λon (0,)×[0,] is often expressed as a family of measures
{λt:t>0}where λt(A)=λ((0,t]×A) for Borel sets A.
(2) The representation may be expressed equivalently in terms of the moment-generating
function of A(t)as
E(eθA(t))=eb(t)
titE(eθYi)exp

0<st
0u≤∞
(1 eθu)λ(ds du)
(3) The random variables Yioccurring in the decomposition arise from the jumps
of the process at fixed points. Say that tis a fixed jump-point of the process if
Π(A{t}>0) >0. It is known that there are at most countably many such fixed
jump-points, and the set Mis precisely the set of such points and that Yi=A{ti}.
(4) The random measure A→ µ(·,A) also has an explicit description. For any Borel
subset Eof (0,)×[0,],
µ(E,A)=#{(t, A{t})E:A{t}>0}
(5) Let Ac(t)=A(t)b(t)titA{ti}. Then
Ac(t)= 
0<st
0u≤∞
(du ds, A)
(6) The countable set M, the set of densities {fi:i1}, the measure λ, and the
nonrandom function bare known as the four components of the process {A(t):
t>0}, or, equivalently, of the measure Π. The measure λis known as the L´evy
measure of Π.
(7) A L´evy process Πwithout any non-random component, i.e., for which b(t)=0,
for all t>0, has sample paths that increase only in jumps almost surely Π.Most
of the L´evy processes that we encounter here will be of this type.
262 10. NEUTRAL TO THE RIGHT PRIORS
10.4 Basic Properties
Let Π be a neutral to the right prior on F. From what we have seen so far, the maps D
and Hyield independent increment process measures ˜
Π and Π, respectively. Let the
evy measures of ˜
Πan
be denoted ˜
λand λ, respectively. The next proposition
establishes a simple relationship between ˜
λand λ.
Proposition 10.4.1. Suppose ˜
λand λare as earlier. Then
1 for each t,˜
λtis the distribution of x→−log(1 x)under the measure λ
t, and
2 for each t,λ
tis the distribution of x→ 1exunder ˜
λt
Proof. The proposition is an easy consequence of the following easy fact.
If ω→ µ(·)isanM(X)-valued random measure which is a Poisson process with
parameter measure λ, then for any measurable function g:XX, the random
measure ω→ µ(g1(·)) is a Poisson process with parameter measure λg1.
Note that
D(F)(t)D(F)(t)=log F(t, )
F[t, )
=log 1F{t}
F[t, )
=log[1 (H(F)(t)H(F)(t))]
It is of interest to know if we can choose neutral to the right priors with large
support. The next proposition gives a sufficient condition that will ensure that the
support is all of F. Recall that the (topological) support Eof a measure µon a
metric space Xis the smallest closed set Ewith µ(Ec)=0.WeviewFas a metric
space under convergence in distribution.
Proposition 10.4.2. If the support of the L´evy measure λHis all of [0,)×[0,1]
then the support of Πis all of F.
Proof. We need to show that every open set (in the topology given by convergence
in distribution) has positive Π measure. Since the set of continuous distributions is
dense in F, it is enough to show that neighborhoods of continuous distributions have
positive Π measure. We will establish a stronger fact, namely, that every uniform
neighborhood has positive prior probability.
10.4. BASIC PROPERTIES 263
Let F0be a continuous distribution , A0=H(F0) be the hazard function of F0and
let U={F:sup
0<st|F(s)F0(s)|<}. In view of the last section, Ucontains a set
H1(V), where Vis of the form V={A:sup
0<st|A(s)A0(s)|}. We will show
that Π(U)>0 by showing that Π H1(V)>0.
To see this, set δ0=δ/3andchoose0=t0<0<t
1<t
2... < t
k<t
k+1 =tsuch
that for i=1,2,...,(k+1);A0(ti)A0(ti1)
0.
Recall the definition of µ(.;A). Let
W={A:µ(Ei;A)=1,i=1,2,...,k}
where Ei=(ti1,t
i]×(A0(ti)A0(ti1)δ0/k, A0(ti)A0(ti1)+δ0/k).
If ti<sti+1,
|A(s)A0(s)|
i
1|(A0(tj)A0(tj1)) (A(tj)A(tj1))|+|(A0(s)A0(ti))(A(s)A(ti))|
The first term on the right-hand side is less than 0/k and the second term is less
than 2δ0so that for every s(0,t],|A(s)A0(s)|. Hence WV.
Under the measure induced by H1, the random variables µ(Ei;A)=1,i =
0,1,2,...,k1 are independent Poisson random variables with parameters λ(Ei),i =
1,2,...,k. These are positive by assumption and hence Vhas positive Π H1mea-
sure.
Let Abe a right continuous function increasing to . A convenient class of neutral
to the right priors are those with L´evy measure λHof the form
H(x, s)=a(x, s)dA(x)ds 0<x<,0<s<1 (10.3)
with 1
0sa(x, s)ds < for all x. Without loss of generality we assume that for all
x, 1
0sa(x, s)ds = 1. This ensures that the prior expectation of A(t)isA0(t).
Every neutral to the right prior gives rise to a L´evy measure via λH. Is every L´evy
measure on R+×[0,1] obtainable as λHof a neutral to the right prior? The next
proposition answers the question for the class of measures just discussed.
Proposition 10.4.3. Let Abe H(F)for some distribution function Fand
H(x, s)=a(x, s)dA(x)ds 0<x<,0<s<1
264 10. NEUTRAL TO THE RIGHT PRIORS
such that for all x, 1
0sa(x, s)ds =1so that E(A(t)=A(t).
The function A→ (0,t](1 dA(s)) (where (0,t]stands for the product integral)
defines a neutral to the right prior on F.
Proof. It can be easily deduced from the basic properties of the product integral that
the function A→ (0,t](1 dA(s)) induces a probability measure on the set of all
functions which are right continuous and decreasing. In order to show that this is a
prior on Fwe need to verify that if ¯
F(t)=(0,t](1 dA(s)), then with probability 1
limt→∞ ¯
F(t) = 0. This follows because the property of independent increments gives
E
(0,t]
(1 dA(s)) =
(0,t]
(1 dE(A)(s)) = ¯
F
Each ¯
F(t) is decreasing in tand limt→∞ E(¯
F(t) = limt→∞ ¯
F(t)=0.
evy representation plays a central role in the study of posteriors of neutral to
the right priors. When the prior is neutral to the right, since the posterior given
X1,X
2,...,X
nis again neutral to the right, this posterior has a L´evy representation.
An expression for the posterior in terms of λDcan be found in Ferguson [62] and in
terms of λHcan be found in Hjort [100]. There is another proof due to Kim [113].
James [105] has a some what different approach, an approach we believe is promising
and deserves further study. We will give a result from [100] without proof.
Our setup consists of random variables X1,X
2,...,X
nthat are independent iden-
tically distributed Fand Y1,Y
2,...,Y
n, which are independent of the Xisandare
independent identically distributed as G0. The observations are Zi=XiYiand
δi=I(XiYi). Let
Nn(t)=
n
1
I(Zi>t) be the number of observations greater than t
and
Mn(t) be the number of Zisequal to t
Theorem 10.4.1 (Hjort). Let Πbe a neutral to the right prior with L´evy measure
of the form (10.3). When all the uncensored values—the Zis with δi=1—are distinct
among themselves, and from the values of the censored observations, the posterior has
the L´evy representation given by
10.5. BETA PROCESSES 265
1Mn
u: the set of uncensored values are points of fixed jumps. The distribution of the
jump at Zihas the density
(1 s)Nn(Zi)sa(Zi,s)
1
0(1 s)Nn(Zi)sa(Zi,s)ds
2 the L´evy measure of the continuous part has
ˆa(x, s)=(1s)Nn(x)+Mn(x)
Remark 10.4.1.Consequently
E¯
F(t2)
¯
F(t1)|Π(
(Zi
i):in)
=
ZiMn
u:t1<Zit21
0(1 s)Nn(Zi)+1sa(Zi,s)ds
1
0(1 s)Nn(Zi)sa(Zi,s)ds
et2
t11
0(1s)Nn(z)+Mn(z)sa(z,s)dsd ˆ
A(z)
(10.4)
10.5 Beta Processes
Beta processes, introduced by Hjort [100] are continuous analogs of a time-discrete
case where (see Example 10.2.2) the Vis are independent beta random variables. The
continuous case is obtained as a limit of the time-discrete case. However, in order
to ensure that the limit exists, the parameters of the beta random variables have to
be chosen carefully. In addition to introducing beta processes and elucidating their
properties for right censored data, Hjort [100] studied extensions to situations more
general than right censored data. This chapter only deals with a part of [100].
10.5.1 Definition and Construction
Let Abe a hazard function with finitely many jumps. Let t1,...,t
kbe the jump-
points of A.Letc(·) be a piecewise continuous non-negative function on [0,)and
let A,c denote the continuous part of A.LetA(t)<for all t.
Definition 10.5.1. An independent increment process Ais said to be a beta pro-
cess with parameters c(.)andA, written Abeta(c, A), if the following holds: A
has L´evy representation as in Theorem 10.3.3 with
266 10. NEUTRAL TO THE RIGHT PRIORS
1M={t1,...,t
k}and the jump-size at any tjgiven by
YjA{tj}∼beta(c(tj)A{tj},c(tj)(1 A{tj}))
2L´evy measure given by
λ(ds du)=c(s)u1(1 u)c(s)1du dA,c(s)
for 0 s<,0<u<1; and
3b(t)0 for all t>0.
The existence of such a process is guaranteed by Proposition 10.4 but this existence
result does not give any insight into the prior. A better understanding of the prior
comes from the construction of Hjort who obtained these priors as weak limits of
time-discrete processes on Aand showed that the sample paths are almost surely
in A. In a very similar spirit, we construct the prior on Fas a weak limit of priors
sitting on a discrete set of points on (0,).
Let F∈Fand, to begin, assume that it is continuous. Let A=H(F)bethe
cumulative hazard function corresponding to F.
Let Qbe a countable dense set in (0,), enumerated as {s1,s
2,...}.Foreach
n1, let {s(n)
1<···<s
(n)
n}be an ordering of s1,...,s
n. Construct a prior Πnon
Fs1,...,snas in Example 10.2.1 by requiring that, under Πn,
V(n)
ibeta c(s(n)
i1)¯
F(s(n)
i)
¯
F(s(n)
i1),c(s(n)
i1)1¯
F(s(n)
i)
¯
F(s(n)
i1) for 1 in1.(10.5)
Let V(n)
n1 and let Fbe a random distribution function , such that, under Πn,
L(¯
F(t)) = L
s(n)
it
(1 V(n)
i)
for all t>0
Theorem 10.5.1. {Πn}n1converges weakly to a neutral to the right prior Πon
F, which corresponds to a beta process.
10.5. BETA PROCESSES 267
Proof. First observe that, as n→∞,
EΠn(¯
F(t)) =
s(n)
itEΠn(1 V(n)
i)
=
s(n)
it1F(s(n)
i1,s
(n)
i]
F(s(n)
i1,)
(0,t]
(1 dH(F))
=
(0,t]
(1 dA)= ¯
F(t)
for all t0. Thus EΠn(F)=Fn
w
Fas n→∞. Hence, by Theorem 2.5.1, {Πn}is
tight.
We now follow Hjort’s calculations to show that the finite-dimensional distributions
of the process F, under the prior Πn, converges weakly to those under the prior
induced by a beta process with parameters cand A0on H.
Consider, for each n1, an independent increment process Ac
nwith process mea-
sure Π
non Asuch that, for each fixed t>0,
L(Ac
n(t)) = L(
s(n)
it
V(n)
i)
Thus, for each n1, Ac
nis a pure jump-process with fixed jumps at s(n)
1,...,s
(n)
n1
and with random jump sizes given by V(n)
i,...,V(n)
n1at these sites. Clearly, Π
ninduces
the prior Πnon F.
Now, for any fixed t>0, repeating computations as in Hjort [ [100], Theorem 3.1,
pp. 1270-72] with
cn,i =c(s(n)
i1),b
n,i =cn,i
¯
F,c(s(n)
i)
¯
F,c(s(n)
i1)and an,i =cn,i bn,i
one concludes that, for each θ,asn→∞,
E[eθAc
n(t)]exp 1
0t
0
(1 eθu)λ(ds du)
268 10. NEUTRAL TO THE RIGHT PRIORS
and, similarly,
Eexp
m
j=1
θjAc
n(aj1,a
j]exp
m
j=1 1
0aj
aj1
(1 eθju)λ(ds du)
Thus the finite-dimensional distributions of the independent increment processes An
converge to the finite-dimensional distributions of an independent increment process
with L´evy measure as in Definition 10.5.1. If the process measure is denoted by Π
and the corresponding induced measure on Fis denoted by Π, then considering the
Skorokhod topology on Aand by the continuity of H1, we conclude that, for all
a1,...,a
m,
L(¯
F(a1),..., ¯
F(am)|Πn)w
→L(¯
F(a1),..., ¯
F(am)|Π)
Therefore, {Πn}converges weakly to Π, a neutral to the right prior on F.
10.5.2 Properties
The following properties of beta processes are from Hjort [100].
1LetA∈Abe a hazard function with finitely many points of discontinuity and let
cbe a piecewise continuous function on (0,).
If Abeta(c,A) then E(A(t)) = A(t). In other words F=H1(A) follows a
beta(c,F) prior distribution and we have E(F(t)) = F(t)whereF=H1(A).
The function centers the expression for the variance. If M={t1,...,t
k}is the set
of discontinuity points of A0then
V(A(t)) =
tjt
A{tj}(1 A{tj})
c(tj)+1 +t
0
dA,c(s)
c(s)+1
where A,c(t)=A(t)titA{ti}.
2LetAbeta(c,A) where, as before, Ahas discontinuities at points in M.Given
F,letX1,...,X
nbe i.i.d. F. Then the posterior distribution of Fgiven X1,...,X
n
is again a beta process, i.e., the corresponding independent increment process is
again beta.
10.5. BETA PROCESSES 269
To describe the posterior parameters, let Xnbe the set of distinct elements of
{x1,...,x
n}. Define
Yn(t)=
n
i=1
I(Xit)and ¯
Yn(t)=
n
i=1
I(Xi>t)
With Nn(t)asbefore,notethat ¯
Yn(t)=nNn(t)andYn(t)=nNn(t).
Using this notation, the posterior beta process has parameters
cX1...Xn(t)=c(t)+Yn(t)
A
X1...Xn(t)=t
0
c(z)dA(z)+dNn(z)
c(z)+Yn(z)
More explicitly, A
X1...Xnhas discontinuities at points in M=MXn, and for
tM,
A
X1...Xn{t}=c(t).A{t}+Nn{t}
c(t)+Yn(t)
A,c
X1...Xn(t)=t
0
c(z)dA,c(z)
c(z)+Yn(z)
Note that if tM,
A{t}∼beta (c(t)A{t}+Nn{t},c(t)(1 A{t})+Yn(t)Nn{t}).
3 Our interest is in the following special case of 2. Suppose Abeta(c,A)andt
Ais continuous. Then the posterior given X1,...,X
nis again a beta process with
parameters
cX1...Xn(t)=c(t)+Yn(t)
and
AX1...X
n(t)=A,dX1...X
n(t)+A,cX1...X
n(t)
where
A,d
X1...Xn(t)=
tiXn
tit
Nn{ti}
c(ti)+Yn(ti)
and
A,c
X1...Xn(t)=t
0
c(z)dA(z)
c(z)+Yn(z)
270 10. NEUTRAL TO THE RIGHT PRIORS
As a consequence, if tXn, then under the posterior ΠX1,...,Xnwe have
A{t}∼beta(Nn{t},c(t)+ ¯
Yn(t)).
Also note that the Bayes estimates are
EΠX1,...,Xn(A(t)) = AX1...X
n(t)
and
EΠX1,...,Xn(¯
F(t)) =
tiXn
tit1Nn{ti}
c(ti)+Yn(ti)exp t
0
c(z)dA(z)
c(z)+Yn(z)(10.6)
4 A neat expression for the posterior and the Bayes estimate for right censored data
can be easily obtained using Theorem 10.4.1. We leave the details to the reader.
Using these explicit expressions it is not very difficult to show that beta processes
lead to consistent posteriors. However since we take up the consistency issue more
generally in the next section we do not pursue it here.
Like the Dirichlet, any two beta processes tend to be mutually singular. This is
proved in [43].
Walker and Muliere [167] started with a positive function Don (0,) and a distri-
bution function ˆ
Fand constructed a class of priors on Fcalled beta-Stacy processes.
We again consider the simple case when ˆ
Fis continuous. The beta-Stacy process is
the neutral to the right prior with
D(s, x)=D(x)esD(x)¯
F(x)
1esdsd ˆ
Ax;0<x<,0<s<
The beta process prior thus relates to an independent increment process via Hand
the beta -Stacy via D. Viewing the processes as measures on Fprovides a mean to
calibrate the prior information in Hin terms of that in Dand vice versa. Though not
explicitly formulated in the following form, the relationship between the two priors is
already implicit in remark 2 and remark 4 of [167].
Theorem 10.5.2. Πis a Beta Stacy (D, ˆ
F)process iff Πis a Beta (C, ˆ
A)process
prior where C=D¯
ˆ
Fand ˆ
Ais the cumulative hazard function of ˆ
F.
Proof. Because Beta Stacy process has λDgiven above, we can compute its λHusing
Proposition 10.4.1. This immediately yields the assertion.
10.6. POSTERIOR CONSISTENCY 271
10.6 Posterior Consistency
Since neutral to the right priors, like tail free priors, possess nice independence and
conjugacy properties it appeared that they would always yield consistent posteriors.
However, Kim and Lee [114] gave an example of a neutral to the right prior which is
inconsistent. Their elegant example is constructed with a homogeneous L´evy measure
and is inconsistent at every continuous distribution.
Recall from Theorem 4.2.1 that to establish posterior consistency at F0, it is enough
to show that with F
0probability 1, for all t
(i) lim
n→∞ E(F(t)|X1,X
2,...,X
n)=F0(t)and
(ii) lim
n→∞ V(F(t)|X1,X
2,...,X
n)=0.
The next theorem shows that for neutral to the right priors consistency of Bayes
estimates ensures consistency of the posterior.
Theorem 10.6.1. Let Πbe a neutral to the right prior of the form (10.3). If
lim
n→∞ E(F(t)|X1,X
2,...,X
n)=F0(t)
then
lim
n→∞ V(F(t)|X1,X
2,...,X
n)=0
Proof. Let X[1] <X
[2] ...X
[k]be the ordering of the observations X1,X
2,...,X
n
which are less than t. Then, apart from an exponential factor going to 1,
E(¯
F(t)2|X1,X
2,...,X
n)=
k
21
0(1 s)j+2a(s, X[j])ds
1
0(1 s)ja(s, X[j])ds
multiplying each term by 1
0(1 s)j+1a(s, X[j])ds/ 1
0(1 s)j+1a(s, X[j])ds,weget
=
k
21
0(1 s)j+2a(s, X[j])ds
1
0(1 s)j+1a(s, X[j])ds
k
11
0(1 s)j+1a(s, X[j])ds
1
0(1 s)ja(s, X[j])ds (¯
F0(t))2
There is another structural aspect of neutral to the right priors. Consistency for
the censored case follows from consistency for the uncensored case. Following is the
result. For a proof, see Dey et al. [43]
272 10. NEUTRAL TO THE RIGHT PRIORS
Theorem 10.6.2. Suppose Xis a survival time with distribution Fand Yis a
censoring time distributed as G.X1,X
2,..., are given F=F, i.i.d. Fand Y1,Y
2,...,
be i.i.d. G, where Gis continuous and has support all of R+. We also assume that the
Xis and Yis are independent. Let Zi=XiYiand i=I(XiYi).IfΠis a neutral
to the right prior for Fwhose posterior is consistent at all continuous distributions
F0, then the posterior given (Zi,i):i1is also consistent at all continuous F0.
Proof. Fix t1<t
2. since the exponential term in 10.4 goes to 0 as n→∞, our assump-
tion on consistency translates into: for any continuous distribution F,ifX1,X
2,...,X
n
are i.i.d. F, then
lim
n→∞
Xi(t1,t2]0,1s(1 s)Nn(Xi)+1a(Xi,s)ds
0,1s(1 s)Nn(Xi)a(Xi,s)ds =¯
F(t2)
¯
F(t1)
Fix F0continuous. Let X1,X
2,...,X
nbe i.i.d. F0and Y1,...,Y
nbe i.i.d. G, and let
(Zi,i)beasabove.Wewillfirstshowthat
lim
n→∞
ZiM
n(0,t]0,1s(1 s)Nn(Xi)+1a(Xi,s)ds
0,1s(1 s)Nn(Xi)a(Xi,s)ds =¯
F(t)a.s.(F0×G)
where M
n={Zj:∆
j=0}.
With t1<t
2fixed, let φbe an increasing continuous mapping of (t1,)into(t2,)
and define
Z
i=ZiI(∆i=1)+φ(Zi)I(|Deltai=0)
Then Z
iare again i.i.d. with a continuous distribution F
0such that
F
0(t2)
F
0(t1)=¯
J(t1,1) ¯
J(t2,1)
¯
J(t1)
where ¯
J(t)=P(Z>t)and ¯
J(t1=P(Z>t,∆=1).
Now using our assumption, if Nn
(t)=n
i=1 I(Z
i>t) then
lim
n→∞
Z
i(t1,t2]0,1s(1 s)Nn
(Z
i)+1a(Z
i,s)ds
0,1s(1 s)Nn
(Z
ia(Z
i,s)ds =¯
J(t1,1) ¯
J(t2,1)
¯
J(t1)a.s
10.6. POSTERIOR CONSISTENCY 273
Note that the above product is only over the uncensored Zis and that, for each t1<t
2
with ∆i=1,Nn(Zi)Nn
(Zi). Now using the Cauchy-Schwarz inequality we get
1
0
(1 s)n+2sa(x, s)ds1
0
(1 s)nsa(x, s)ds
=1
0
[(1 s)(n+2)/2]2sa(x, s)ds1
0
[(1 s)(n)/2]2sa(x, s)ds
1
0
(1 s)n+1sa(x, s)ds2
and consequently 1
0(1 s)n+1sa(x, s)/1
0(1 s)nsa(x, s)ds is decreasing in n. Hence,
we have
lim
n→∞
ZiM
n(t1,t2]0,1s(1 s)Nn(Zi)+1a(Zi,s)ds
0,1s(1 s)Nn(Zi)a(Zi,s)ds
lim
n→∞
Z
i(t1,t2]0,1s(1 s)Nn
(Z
i)+1a(Z
i,s)ds
0,1s(1 s)Nn
(Z
i)a(Z
i,s)ds
=¯
J(t1,1) ¯
J(t2,1)
¯
J(t1)
Let 0 = t0<t
1<t
2<...< t
k=tbe a partition of (0,t]. Then
lim
n→∞
ZiM
n(0,t]0,1s(1 s)Nn(Zi)+1a(Zi,s)ds
0,1s(1 s)Nn(Zi)a(Zi,s)ds
k
1
¯
J(ti1,1) ¯
J(ti,1)
¯
J(ti1)
As the width of the partition max |titi1goes to 0, the right-hand side converges
to the product integral (0,t](1 J(ds, 1)/¯
J(s)), which from Peterson [138] is equal
to ¯
F(t).
Let ˆ
¯
Fndenote the Bayes estimate of ¯
Fgiven X1,X
2,...,X
nand let ¯
F
ndenote the
Bayes estimate of ¯
Fgiven (Zi
i):1in.wehaveshownthatforallt,
¯
F
n(t)ˆ
¯
Fn(t) and hence lim inf
nF
n¯
F0
Similarly, by considering the “Bayes” estimate for G,withMn
0={(Zj,j:∆
j=
0)},
lim inf
n
Zit:ZiMn
01
0(1 s)Nn(Zi)+1a(Zi,s)ds
1
0(1 s)Nn(Zi)a(Zi,s)ds ¯
G
274 10. NEUTRAL TO THE RIGHT PRIORS
Consider,
Zit:ZiMn
u1
0(1 s)Nn(Zi)+1a(Zi,s)ds
1
0(1 s)Nn(Zi)a(Zi,s)ds
Zit:ZiMn
01
0(1 s)Nn(Zi)+1a(Zi,s)ds
1
0(1 s)Nn(Zi)a(Zi,s)ds (10.7)
but this is equal to
Zit1
0(1 s)Nn(Zi)+1a(Zi,s)ds
1
0(1 s)Nn(Zi)a(Zi,s)ds
But this is just the Bayes estimate based on i.i.d. observations from the continuous
survival distribution ¯
F0(t)¯
G(t) and by assumption (10.7) converges to ¯
F0(t)¯
G(t). The
conclusion follows easily.
Thus, as far as consistency issues are concerned, we only need to study the uncen-
sored case. We begin looking at the simple case when the L´evy measure is homoge-
neous. In the sequel for any a, b > 0, we denote by B(a, b),the usual beta function
given by
B(a, b)=Γ(a)Γ(b)
Γ(a+b)=1
0
(1 s)a1sb1ds
If fis an integrable function on (0,1) we set
K(n, f )=1
0
(1 s)nf(s)ds
We will repeatedly use the fact that
for any p, q; lim
n→∞ nqpΓ(n+p)
Γ(n+q)=1
Lemma 10.6.1. Suppose fis a nonnegative function on (0,1) such that
(a) 0<1
0f(s)ds < and
(b) for some α<1,0<lim
s0sαf(s)=b<.
Then
lim
n→∞
K(n, f )
B(n+1,1α)=b
10.6. POSTERIOR CONSISTENCY 275
Proof. Since
1
(1 s)nf(s)ds (1 )n1
0
f(s)ds =o(n(1α))
and as n→∞,n
1αB(n, 1α)Γ(1 α), we have
lim
n→∞ 1
(1 s)nf(s)ds
B(n+1,1α)= 0 (10.8)
Similarly, because α<1,
1
(1 s)nsαds (1 )n
0
sαds (1 )n11α
1α=o(n(1α))
which in turn yields
lim
n→∞ 1
(1 s)nsαds
B(n+1,1α)= 0 (10.9)
Given δ, use assumption (b) to choose >0 such that for s<
(bδ)sα<f(s)<(b+δ)sα
Then
K(n, f )(b+δ)B(n+1,1α)+1
(1 s)nf(s)ds
and by (10.8) we have
lim
n→∞
K(n, f )
B(n+1,1α)(b+δ)
A similar argument using (10.9) shows that
lim
n→∞
K(n, f )
B(n+1,1α)(bδ)
Since δis arbitrary, the lemma follows.
Theorem 10.6.3. Let Abe a cumulative hazard function which is continuous and
finite for all x. Suppose that a neutral to the right prior with no fixed jumps has the
expected hazard function Aand the L´evy measure
H(x, s)=a(s)dA(x)ds 0<x<,0<s<1
276 10. NEUTRAL TO THE RIGHT PRIORS
such that
for some α<1,0<lim
s0s1+αa(s)=b<(10.10)
If F0is a continuous distribution with F0(t)>0for all t, then with F
0-probability
1, the posterior converges weakly to the measure degenerate at F1α
0. In particular, if
(10.10) holds with α=0then the posterior is consistent at F0.
Proof. Set f(s)=sa(s). We have 1
0f(s)ds = 1. Using (10.4),
E(¯
F(t)|X1,X
2,...,X
n)=
Xit
K(Nn(Xi)+1,f)
K(Nn(Xi),f)eψn(t)(10.11)
where ψn(t)=t
01
0(1 s)Nn(x)+Mn(x)sa(s)dsdA(x).
For any x<t,(1 s)Nn(x)+Mn(x)<(1 s)Nn(t)and hence ψn(t) is bounded above
by (1
0(1 s)Nn(t)ds)A(t). Since Nn(t)→∞as n→∞, it follows that ψn(t)0as
n→∞. Hence the exponential factor goes to 1.
If X(1) <X
(2) ... < X
(nNn(t)) is an ordering of the nNn(t) samples that are
less than t, then, since with F0probability 1 the X1,X
2,...,X
nare all distinct,
Nn(X(1))=n1,Nn(X(2))=n2, and so on. Thus the first term in (10.11) reduces
to
(i=nNn(t))
i=0
K(ni, f )
K((ni1),f)=K(n, f)
K(Nn(t)1,f)
It follows from Lemma 10.6.1 that
lim
n→∞
K(n, f )
K(Nn(t)1,f)= lim
n→∞
B(Nn(t)1,1α)
B(n, 1α)
= lim
n→∞
Γ(Nn(t)α)
Γ(Nn(t)1)
Γ(n)
Γ(n+1α)
= lim
n→∞ Nn(t)
n+11α
=¯
F0(t)1αa.s. F
0
Remark 10.6.1.The Kim-Lee example had the homogeneous L´evy measure given
by a(s)=2s3/2. In this case the conditions of the Theorem 10.6.3 are satisfied with
α=1/2 so that the posterior converges to F1/2
0.
10.6. POSTERIOR CONSISTENCY 277
We next turn to a sufficient condition for consistency in the general case. We begin
with an extension of Lemma 10.6.1.
For ea ch xin a set Xlet f(x, .) be a non negative function on (0,1). Let mnbe a
sequence of integers such that lim
n→∞
mn
n=c, 0<c<1.
Lemma 10.6.2. Suppose
(a) 0<supx1
0f(x, s)ds =I<and
(b) As s0,f(x, s)converges uniformly (in x), to the constant function 1, i.e., as
0,
δ=sup
x
sup
s< |f(s, x)1|→0
Then
lim
n→∞
n
mni+2
i+1
1
0(1 s)i+1f(xi,s)ds
1
0(1 s)if(xi,s)ds =1
and the convergence is uniform in the x
is.
Proof. To avoid unpleasant expressions involving fractions of integrals, set
Ki,x =1
0
(1 s)if(x, s)ds and Li,x =1
0
s(1 s)if(x, s)ds
We will show that for any x,givenδsmall, there is an m0such that, for i>m
0,
i+12δ
i+2 Ki+1,x
Ki,x i+1+2δ
i+2 (10.12)
The bounds in inequality 10.12 do not depend on the xis. Consequently, we have
uniformly in the xis,
12δ
mn+1
nmn
n
mn
i+2
i+1
Ki+1,x
Ki,x 1+ 2δ
mn+1
nmn
For small positive y,e2y<1y<1+y<e
y. Hence, as n→∞, the left-hand
side converges to e4δ(1c)/c and the right side to e2δ(1c)/c. Letting δgoto0wehave
the result.
278 10. NEUTRAL TO THE RIGHT PRIORS
To prove (10.12) note that
Ki+1,x
Ki,x
=1Li,x
Ki,x
For any 0 <=1α<1,
(1 δHi,)Ki,x (1 + δ)Hi, +αiI
and
(1 δJi,)Li,x (1 + δ)Ji, +αiI
where
Hi, =
0
(1 s)ids =1αi+1
i+1
and
Ji, =
0
s(1 s)ids =1αi+1(1 + +i)
(i+1)(i+2)
Now
(i+2)Li,x
Ki,x (i+2)(1 + δ)Ji,
(1 δ)Hi,
+αiI
(1 δ)Hi,
which goes to (1+δ)/(1δ)asi→∞. Further, the right-hand side does not involve
x, and hence this convergence is uniform in x.
On the other hand,
(i+2)Li,x
Ki,x (i+2)(1 δ)Ji,
(1 + δ)(Hi, +αiI)
which goes to (1 δ)/(1 + δ), again uniformly in x.
Because δ0asgoes to 0, given any δ>0, for sufficiently small ,(1δ)/(1+δ)
is larger than (1 δ) and (1 + δ)/(1 δ) is smaller than (1 + δ).
Thus given any δ>0,thereisannsuch that for i>n
,
1δ(i+2)Li,x
Ki,x 1+δ
Using Ki+1,x/Ki,x =1(Li,x/Ki,x), we get
11+δ
i+2 <Ki+1,x
Ki,x
<1+1δ
i+2
and this is (10.12)
10.6. POSTERIOR CONSISTENCY 279
Remark 10.6.2.In the Lemma 10.6.2, assumption (a) can be replaced by:
(a’) 0 <supx1
0(1 s)f(x, s)ds < .
This follows from setting g(s, x)=(1s)f(x, s) and noting that gsatisfies as-
sumptions (a) and (b) and that
1
0(1 s)n+1f(x, s)ds
1
0(1 s)nf(x, s)ds =1
0(1 s)ng(x, s)ds
1
0(1 s)n1g(x, s)ds
and observing that (n+2)/(mn+2)=n
mn[(i+2)/(i+1)and(n+1)/(mn+1)=
n
mn[(i+1)/i both converge to the same limit 1/c.
Theorem 10.6.4. Let Πbe a neutral to the right prior with
H(x, s)=c(x)a(x, s)dA(x)ds 0<x<,0<s<1
If f(x, s)=sa(x, s)satisfies the assumption of the Lemma 10.6.2 (or the remark
following it) then the posterior is consistent at any continuous distribution F0.
Proof. Since the exponential factor in equation (10.4) goes to 1, it follows immediately
from Lemma 10.6.2 that for each twith ¯
F0(t)>0,
E(¯
F(t)|Π()|X1,X
2,...,X
n)¯
F0(t)
Theorem 10.6.5. The posterior of the beta(C, A)prior is consistent at all con-
tinuous distribution F0.
Proof. Since the L´evy measure satisfies the conditions of Remark 10.6.2, this is an
immediate consequence of Theorem 10.6.4.
Remark 10.6.3.Kim and Lee [114] have shown consistency when
1(1s)f(x, s)1and
2asx0,f(x, s) converges uniformly in xto a positive continuous function b(x).
The result is marginally more general than that of Kim and Lee. The methods that
we have used are more elementary.
280 10. NEUTRAL TO THE RIGHT PRIORS
To summarize, neutral to priors are an elegant class of priors that can, in terms of
mathematical tractability, conveniently handle right censored data. We have also seen
that some caution is required if one wants consistent posteriors. As with the Dirichlet,
mixtures of neutral to priors would yield more flexibility in terms of prior opinions
and posteriors that are amenable to simulation. These remain to be explored.
11
Exercises
11.0.1. If two probability measures on RKagree on all sets of the form (a1,b
1]×
(a2,b
2],...×(ak,b
k] then they agree on all Borel sets in Rk.
11.0.2. Let Mtbe the median of Beta(ct, c(1t)) where 0 <t<1. Show that Mt1
2
iff t1
2. [Hint: If x1
2show that xct1(1 x)c(1t)1is increasing in t. Suppose
t1
2and Mt<1
2. Then 1/2
0xct1(1 x)c(1t)1dx 1
2. Make the change of variable
x→ (1 x) to obtain a contradiction]
11.0.3. Suppose αis a finite measure. Define X1,X
2,... by
X1is distributed as ¯α
, for any n1,
P(Xn+1 B|X1,X
2,...,X
n)=α(B)+n
1δXi(B)
α(R+n)
Show that X1,X
2,... form an exchangeable sequence and the corresponding
DeFinneti measure on M(R)isDα
11.0.4. Assume a Dirichlet prior and show that the predictive distribution of Xn+1
given X1,X
2,...,X
n, converges to P0weakly almost surely P0. Examine what hap-
pens when the prior is a mixture of Dirichlet processes.
282 11. EXERCISES
11.0.5. Show that if PMαand Uis a neighborhood of Pin set-wise convergence
then Dα(U)>0. However Mαis not the smallest closed set with this property.
11.0.6. Show that a Polya tree prior is a Dirichlet process iff for any E
i
0+α1=
α.
11.0.7. Let Lµbe the set of all probability measures dominated by a σfinite
measure µ. Verify that, when restricted to Lµall the three σalgebras discussed in
section 2.2 coincide.
11.0.8. Let Ebe a measurable subset of Θ ×X such that θ=θ,E
θEθ=and for
all θ,Pθ(Eθ) = 1. For any two priors Π1,Π2on Θ show that Π1Π2=λ1λ2,
where λiare the respective marginals on X.
Derive the Blackwell- Dubins merging result from Doob’s theorem
11.0.9. Consider fθ=U(0); 0 <θ<1. Show that the Schwartz condition fails at
θ= 1 but posterior consistency holds. Can you use the results in Section 4.3 to prove
consistency?
11.0.10. Suppose X1,X
2,...,X
nare i.i.d. Ber(p), i.e.,
Pr(Xi=1)=p=1Pr(Xi=0)
Apriorforpmay be elicited by asking for a rule for predicting Xn+1. Suppose for all
n1, one is given the rule
Pr(Xn+1 =1|X1,X
2,...,X
n)=a+n
1Xi
a+b+n
Assuming that the prediction loss is squared error, show that there is a unique
prior corresponding to this rule and identify the prior
11.0.11. With Xis as in Exercise11.0.11, consider a conjugate prior and a realization
of the Xis such that ˆp=n
1Xi/n is bounded away from 0 and 1 as n→∞.
Show directly (without using the results established in the text) that as n→∞, the
posterior distribution of npp)/p(1 ˆp) converges weakly to N(0,1)
11.0.12. Let X1,X
2...,X
nbe i.i.d. N(0,1). Consider a Bayesian who does not know
the true density and who uses the model, θN(µ, η) and given θ,X1,X
2...,X
nbe
i.i.d. N(θ, 1). Calculate the posterior of θgiven X1,X
2...,X
nand verify that with
probability 1 under the joint distribution under N(0,1), the density of n(θ¯
X)
converges in L1distance to N(0,1).
283
11.0.13. Consider Xis as in Exercise11.0.11. Consider a beta prior, i.e., a prior with
density
Π(p)=cpα1(1 p)β10
a Discuss why relatively small values of α+βindicate relative lack of prior information
b Consider a sequence of hyperparameters αi
isuch that αi+βi0 but αii
C, 0<C<1. Show that the corresponding sequence of priors converge weakly, and
determine the limiting prior. Would you call this prior noninformative? Reconcile
your answer with the discussion in (a)
11.0.14. (1). For a multinomial with probabilities p1,p
2,...,p
kfor kclasses,calculate
the Jeffreys prior. [Hint: Use the following well known identity (see [144]): Let B
be a positive definite matrix. Let A=B+xxT. Then det A= det B(1+xTB1x)
]
(2). In the above problem calculate the reference prior for (p1,p
2) assuming k=3.
For the next four problems PDαand given P,X1,X
2,...,X
nare i.i.d. P.
11.0.15. Assume
−∞ x2dα < . Calculate the prior variance of the population
meanxdP
11.0.16. Assuming αhas the Cauchy density
1
π
1
1+x2
and xdP =T(P) is well defined for almost all P, show that T(P) has the same
Cauchy distribution.
[Hint: Use Sethuraman’s construction]
11.0.17. For ¯αCauchy, show that xdP =T(P) is well defined for almost all P.
[Hint: If Yiis a sequence of independent random variables such that n
1Yiconverges
in distribution, then
1Yiis finite a.s. Alternatively, use methods of Doss and Selke
[55]]
11.0.18. Let αθ=N(θ, 1) and θN(µ, η). Given X1,X
2,...,X
nare all distinct,
calculate the posterior distribution of θ.
For the next three problems, let PDα,Pa convolution of Pand N(0,h
2)and
hhave the prior density Π(h). Given P,letX1,X
2,...,X
nbe i.i.d. f,wherefis the
density of P
284 11. EXERCISES
11.0.19. Let Cnbe the information that all the Xis are distinct. For any fixed x
calculate E(f(x)|X1,X
2,...,X
n,C
n) assuming the Xis are all distinct.
11.0.20. Let the true density f0be uniform on (0,1). Verify if the Bayes estimate
E(f|X1,X
2,...,X
n,h) is consistent in the L1distance
11.0.21. Let f0be Normal or Cauchy with location and scale parameters chosen by
you but not equal to 0 and 1. Set n=50fromf0, draw a sample of size n,namely,
X1,X
2,...,X
n. Simulate the Bayes estimate of f(x) when the prior is a Dirichlet
mixture of normal and ¯α=N(0,1) or N(µ, σ2)withµand σ2independent, µnormal
and σ2is inverse gamma truncated above.
Plot f0and the Bayes estimate. Discuss whether putting a prior on µ, σ2leads to a
Bayes estimate that is closer to f0than the Bayes estimate under a prior with fixed
values of µand σ2. (Base your comments from 10 simulations on each case).
11.0.22. Let f0be normal or Cauchy. Using the Polya tree prior recommended in
Chapter6 and a normal or Cauchy prior for the location parameter, calculate numeri-
cally the posterior for θ, for various values of nand various choices of X1,X
2,...,X
n.
11.0.23. (a) Assume the regression model discussed in Chapter7 with a prior for the
random density fthat is Dirichlet mixture of Normal or Cauchy . Calculate and
plot the posterior for βfor the different priors listed in Exercise11.0.21.
(b) Do the same but symmetrize faround 0. Discuss whether the behavior of the
posterior for βis similar o that in (a)
11.0.24. Examine Doob’s theorem in the regression set up considered in Chapter 7
11.0.25. Show that the Bayes estimate for survival function under a Dirichlet prior
with censored data has a representation as a product of survival probabilities and
that it converges to the Kaplan-Meier estimate as α(R)0.
11.0.26. Show that the Bayes estimate for the bivariate survival function is incon-
sistent in the following example (due to R.Pruitt):
(T1,T
2)Fand FDαwhere αis the uniform distribution on (0,2) ×(0,2). The
censoring random variable (C1,C
2)takesthevalues(0,2),(2,0) and (2,2) with equal
probability of 1/3. The Bayes estimator for Fis inconsistent when F0is the uniform
distribution on (1,2) ×(0,1).
11.0.27. Show, in the context of Chapter9 that if one starts with a Dirichlet prior
for the distribution of (Z, ∆) (i.e., a prior for probability measures on {0,1R+),
then the induced prior for F-the distribution of the survival time XisaBetaprocess.
References
[1] James H. Albert and Siddhartha Chib. Bayesian analysis of binary and
polychotomous response data. J. Amer. Statist. Assoc., 88(422):669–679, 1993.
[2] S.-I. Amari, O. E. Barndorff-Nielsen, R. E. Kass, S. L. Lauritzen,
and C. R. Rao.Differential geometry in statistical inference. Institute of
Mathematical Statistics, Hayward, CA, 1987.
[3] Per Kragh Andersen, Ørnulf Borgan, Richard D. Gill, and Niels
Keiding.Statistical models based on counting processes. Springer-Verlag, New
York, 1993.
[4] Charles E. Antoniak. Mixtures of Dirichlet processes with applications to
Bayesian nonparametric problems. Ann. Statist., 2:1152–1174, 1974.
[5] Andrew. Barron, Mark J. Schervish, and Larry Wasserman. The
consistency of posterior distributions in nonparametric problems. Ann. Statist.,
27(2):536–561, 1999.
[6] Andrew R. Barron. The strong ergodic theorem for densities: generalized
Shannon-McMillan-Breiman theorem. Ann. Probab., 13(4):1292–1303, 1985.
[7] Andrew R. Barron. Uniformly powerful goodness of fit tests. Ann. Statist.,
17(1):107–124, 1989.
286 References
[8] Andrew R. Barron. Information-theoretic characterization of Bayes perfor-
mance and the choice of priors in parametric and nonparametric problems. In
Bayesian statistics, 6 (Alcoceber , 1998), pages 27–52. Oxford Univ. Press, New
York, 1999.
[9] D. Basu. Statistical information and likelihood. Sankhy¯a Ser. A, 37(1):1–71,
1975. Discussion and correspondance between Barnard and Basu.
[10] D. Basu and R. C. Tiwari. A note on the Dirichlet process. In Statistics
and probability: essays in honor of C. R. Rao, pages 89–103. North-Holland,
Amsterdam, 1982.
[11] S. Basu and S. Mukhopadhyay. Binary response regression with normal
scale mixture links. In Generalized Linear Models- A Bayesian Perspective,
pages 231–241. Marcel-Dekker, New York, 1998.
[12] S. Basu and S. Mukhopadhyay. Bayesian analysis of binary regression
using symmetric and asymmetric links. Sankhy¯a Ser. B, 62:372–387, 2000.
[13] James O. Berger.Statistical decision theory and Bayesian analysis.
Springer-Verlag, New York, 1993. Corrected reprint of the second (1985) edi-
tion.
[14] James O. Berger and Jose-M. Bernardo. Estimating a product of
means: Bayesian analysis with reference priors. J. Amer. Statist. Assoc.,
84(405):200–207, 1989.
[15] James O. Berger and Jos´
e M. Bernardo. On the development of ref-
erence priors. In Bayesian statistics, 4 (Pe˜ıscola, 1991), pages 35–60. Oxford
Univ. Press, New York, 1992.
[16] James O. Berger and Luis R. Pericchi. The intrinsic Bayes factor for
model selection and prediction. J. Amer. Statist. Assoc., 91(433):109–122, 1996.
[17] Robert H. Berk and I. Richard Savage. Dirichlet processes produce
discrete measures: an elementary proof. In Contributions to statistics, pages
25–31. Reidel, Dordrecht, 1979.
[18] Jose-M. Bernardo. Reference posterior distributions for Bayesian inference.
J. Roy. Statist. Soc. Ser. B, 41(2):113–147, 1979. With discussion.
References 287
[19] P. J. Bickel. On adaptive estimation. Ann. Statist., 10(3):647–671, 1982.
[20] P. J. Bickel and J. A. Yahav. Some contributions to the asymptotic theory
of Bayes solutions. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 11:257–
276, 1969.
[21] Patrick Billingsley.Convergence of probability measures. John Wiley &
Sons Inc., New York, second edition, 1999. A Wiley-Interscience Publication.
[22] Lucien Birg´
e. Approximation dans les espaces m´etriques et th´eorie de
l’estimation. Z. Wahrsch. Verw. Gebiete, 65(2):181–237, 1983.
[23] David Blackwell. Discreteness of Ferguson selections. Ann. Statist., 1:356–
358, 1973.
[24] David Blackwell and Lester Dubins. Merging of opinions with increas-
ing information. Ann. Math. Statist., 33:882–886, 1962.
[25] David Blackwell and James B. MacQueen. Ferguson distributions via
olya urn schemes. Ann. Statist., 1:353–355, 1973.
[26] J. Blum and V. Susarla. On the posterior distribution of a Dirichlet pro-
cess given randomly right censored observations. Stochastic Processes Appl.,
5(3):207–211, 1977.
[27] J. Borwanker, G. Kallianpur, and B. L. S. Prakasa Rao. The
Bernstein-von Mises theorem for Markov processes. Ann. Math. Statist.,
42:1241–1253, 1971.
[28] Olaf Bunke and Xavier Milhaud. Asymptotic behavior of Bayes esti-
mates under possibly incorrect models. Ann. Statist., 26(2):617–644, 1998.
[29] Burr, D, Cooke G.E., Doss H. and P.J. Goldschmidt-Clermont.A
meta analysis of studies on the association of the platlet p1a polymorphism of
glycoprotein iiia and risk of coronary heart disease. Technical report, 2002.
[30] N. N. ˇ
Cencov.Statistical decision rules and optimal inference. American
Mathematical Society, Providence, R.I., 1982. Translation from the Russian
edited by Lev J. Leifman.
288 References
[31] Ming-Hui Chen and Dipak K. Dey. Bayesian modeling of correlated binary
responses via scale mixture of multivariate normal link functions. Sankhy¯a Ser.
A, 60(3):322–343, 1998. Bayesian analysis.
[32] Ming-Hui Chen, Qi-Man Shao, and Joseph G. Ibrahim.Monte Carlo
methods in Bayesian computation. Springer-Verlag, New York, 2000.
[33] Bertrand S. Clarke and Andrew R. Barron. Information-theoretic
asymptotics of Bayes methods. IEEE Trans. Inform. Theory, 36(3):453–471,
1990.
[34] Robert J. Connor and James E. Mosimann. Concepts of independence
for proportions with a generalization of the Dirichlet distribution. J. Amer.
Statist. Assoc., 64:194–206, 1969.
[35] Harald Cram´
er.Mathematical Methods of Statistics. Princeton University
Press, Princeton, N. J., 1946.
[36] Harald Cram´
er and M. R. Leadbetter.Stationary and related stochastic
processes. Sample function properties and their applications. John Wiley & Sons
Inc., New York, 1967.
[37] Sarat Dass and Jayeong Lee. A note on the consistency of bayes factors
for testing point null versus nonparametric alternatives.
[38] G. S. Datta and J. K. Ghosh. Noninformative priors for maximal invariant
parameter in group models. Test, 4(1):95–114, 1995.
[39] A. P. Dawid, M. Stone, and J. V. Zidek. Marginalization paradoxes in
Bayesian and structural inference. J. Roy. Statist. Soc. Ser. B, 35:189–233,
1973. With discussion by D. J. Bartholomew, A. D. McLaren, D. V. Lindley,
Bradley Efron, J. Dickey, G. N. Wilkinson, A. P.Dempster, D. V. Hinkley, M.
R. Novick, Seymour Geisser, D. A. S. Fraser and A. Zellner, and a reply by A.
P. Dawid, M. Stone, and J. V. Zidek.
[40] William A. Dembski. Uniform probability. J. Theoret. Probab., 3(4):611–
626, 1990.
[41] Luc Devroye and L´
aszl´
oGy
¨
orfi. No empirical probability measure
can converge in the total variation sense for all distributions. Ann. Statist.,
18(3):1496–1499, 1990.
References 289
[42] J. Dey, L. Drˇ
aghici, and R. V. Ramamoorthi. Characterizations of tail
free and neutral to the right priors. In Advances on methodological and applied
aspects of probability and statistics, pages 305–325. Gordon and Breach science
publishers.
[43] J. Dey, R.V. Erickson, and R.V. Ramamoorthi. Some aspects of neutral
to right priors. submitted . Bayesian analysis.
[44] P. Diaconis and D. Freedman. Partial exchangeability and sufficiency.
In Statistics: applications and new directions (Calcutta, 1981), pages 205–236.
Indian Statist. Inst., Calcutta, 1984.
[45] P. Diaconis and D. Freedman. On inconsistent Bayes estimates of location.
Ann. Statist., 14(1):68–87, 1986.
[46] Persi Diaconis and David Freedman. On the consistency of Bayes esti-
mates. Ann. Statist., 14(1):1–67, 1986. With a discussion and a rejoinder by
the authors.
[47] Persi Diaconis and Donald Ylvisaker. Conjugate priors for exponential
families. Ann. Statist., 7(2):269–281, 1979.
[48] Kjell Doksum. Tailfree and neutral random probabilities and their posterior
distributions. Ann. Probability , 2:183–201, 1974.
[49] J. L. Doob. Application of the theory of martingales. In Le Calcul des
Probabilit´es et ses Applications., pages 23–27. Centre National de la Recherche
Scientifique, Paris, 1949. Colloques Internationaux du Centre National de la
Recherche Scientifique, no. 13,.
[50] Hani Doss. Bayesian estimation in the symmetric location problem. Z.
Wahrsch. Verw. Gebiete, 68(2):127–147, 1984.
[51] Hani Doss. Bayesian nonparametric estimation of the median. I. Computation
of the estimates. Ann. Statist., 13(4):1432–1444, 1985.
[52] Hani Doss. Bayesian nonparametric estimation of the median. II. Asymptotic
properties of the estimates. Ann. Statist., 13(4):1445–1464, 1985.
[53] Hani Doss. Bayesian nonparametric estimation for incomplete data via suc-
cessive substitution sampling. Ann. Statist., 22(4):1763–1786, 1994.
290 References
[54] Hani Doss and B. Narasimhan. Dynamic display of changing posterior
in Bayesian survival analysis. In Practical nonparametric and semiparametric
Bayesian statistics, pages 63–87. Springer, New York, 1998.
[55] Hani Doss and Thomas Sellke. The tails of probabilities chosen from a
Dirichlet prior. Ann. Statist., 10(4):1302–1305, 1982.
[56] L. Drˇ
aghici and R. V. Ramamoorthi. A note on the absolute continuity
and singularity of Polya tree priors and posteriors. Scand. J. Statist., 27(2):299–
303, 2000.
[57] R. M. Dudley. Measures on non-separable metric spaces. Illinois J. Math.,
11:449–453, 1967.
[58] Richard M. Dudley.Real analysis and probability. Wadsworth &
Brooks/Cole Advanced Books & Software, Pacific Grove, CA, 1989.
[59] Michael D. Escobar and Mike West. Bayesian density estimation and
inference using mixtures. J. Amer. Statist. Assoc., 90(430):577–588, 1995.
[60] Michael D. Escobar and Mike West. Computing nonparametric hi-
erarchical models. In Practical nonparametric and semiparametric Bayesian
statistics, pages 1–22. Springer, New York, 1998.
[61] Thomas S. Ferguson. A Bayesian analysis of some nonparametric problems.
Ann. Statist., 1:209–230, 1973.
[62] Thomas S. Ferguson. Prior distributions on spaces of probability measures.
Ann. Statist., 2:615–629, 1974.
[63] Thomas S. Ferguson. Bayesian density estimation by mixtures of normal
distributions. In Recent advances in statistics, pages 287–302. Academic Press,
New York, 1983.
[64] Thomas S. Ferguson and Eswar G. Phadia. Bayesian nonparametric
estimation based on censored data. Ann. Statist., 7(1):163–186, 1979.
[65] Thomas S. Ferguson, Eswar G. Phadia, and Ram C. Tiwari. Bayesian
nonparametric inference. In Current issues in statistical inference: essays in
honor of D. Basu, pages 127–150. Inst. Math. Statist., Hayward, CA, 1992.
[66] J.-P. Florens, M. Mouchart, and J.-M. Rolin. Bayesian analysis of mixtures: some results on exact estimability and identification. In Bayesian statistics, 4 (Peñíscola, 1991), pages 127–145. Oxford Univ. Press, New York, 1992.
[67] Sandra Fortini, Lucia Ladelli, and Eugenio Regazzini. Exchangeability, predictive distributions and parametric models. Sankhyā Ser. A, 62(1):86–109, 2000.
[68] David A. Freedman. Invariants under mixing which generalize de Finetti’s
theorem: Continuous time parameter. Ann. Math. Statist., 34:1194–1216, 1963.
[69] David A. Freedman. On the asymptotic behavior of Bayes’ estimates in the
discrete case. Ann. Math. Statist., 34:1386–1403, 1963.
[70] David A. Freedman. On the asymptotic behavior of Bayes estimates in the
discrete case. II. Ann. Math. Statist., 36:454–456, 1965.
[71] Marie Gaudard and Donald Hadwin. Sigma-algebras on spaces of prob-
ability measures. Scand. J. Statist., 16(2):169–175, 1989.
[72] J. K. Ghorai and H. Rubin. Bayes risk consistency of nonparametric Bayes
density estimates. Austral. J. Statist., 24(1):51–66, 1982.
[73] S. Ghosal, J. K. Ghosh, and R. V. Ramamoorthi. Non-informative priors via sieves and packing numbers. In Advances in statistical decision theory and applications, pages 119–132. Birkhäuser Boston, Boston, MA, 1997.
[74] S. Ghosal, J. K. Ghosh, and R. V. Ramamoorthi. Posterior consistency
of Dirichlet mixtures in density estimation. Ann. Statist., 27(1):143–158, 1999.
[75] Subhashis Ghosal. Normal approximation to the posterior distribution
for generalized linear models with many covariates. Math. Methods Statist.,
6(3):332–348, 1997.
[76] Subhashis Ghosal. Asymptotic normality of posterior distributions in high-
dimensional linear models. Bernoulli, 5(2):315–331, 1999.
[77] Subhashis Ghosal. Asymptotic normality of posterior distributions for expo-
nential families when the number of parameters tends to infinity. J. Multivariate
Anal., 74(1):49–68, 2000.
[78] Subhashis Ghosal, Jayanta K. Ghosh, and R. V. Ramamoorthi.
Consistent semiparametric Bayesian inference about a location parameter. J.
Statist. Plann. Inference, 77(2):181–193, 1999.
[79] Subhashis Ghosal, Jayanta K. Ghosh, and Tapas Samanta. On convergence of posterior distributions. Ann. Statist., 23(6):2145–2152, 1995.
[80] Subhashis Ghosal, Jayanta K. Ghosh, and Aad W. van der Vaart.
Convergence rates of posterior distributions. Ann. Statist., 28(2):500–531, 2000.
[81] J. K. Ghosh, R. V. Ramamoorthi, and K. R. Srikanth. Bayesian anal-
ysis of censored data. Statist. Probab. Lett., 41(3):255–265, 1999. Special issue
in memory of V. Susarla.
[82] J. K. Ghosh, B. K. Sinha, and S. N. Joshi. Expansions for posterior
probability and integrated Bayes risk. In Statistical decision theory and related
topics, III, Vol. 1 (West Lafayette, Ind., 1981), pages 403–456. Academic Press,
New York, 1982.
[83] Jayanta K. Ghosh. Higher Order Asymptotics, volume 4 of NSF-CBMS Regional Conference Series in Probability and Statistics, 1994.
[84] Jayanta K. Ghosh, Subhashis Ghosal, and Tapas Samanta. Stability
and convergence of the posterior in non-regular problems. In Statistical deci-
sion theory and related topics, V (West Lafayette, IN, 1992), pages 183–199.
Springer, New York, 1994.
[85] Jayanta K. Ghosh, Shrikant N. Joshi, and Chiranjit Mukhopadhyay. Asymptotics of a Bayesian approach to estimating change-point in a hazard rate. Comm. Statist. Theory Methods, 25(12):3147–3166, 1996.
[86] Jayanta K. Ghosh and Rahul Mukerjee. Non-informative priors. In Bayesian statistics, 4 (Peñíscola, 1991), pages 195–210. Oxford Univ. Press, New York, 1992.
[87] Jayanta K. Ghosh and R. V. Ramamoorthi. Consistency of Bayesian
inference for survival analysis with or without censoring. In Analysis of cen-
sored data (Pune, 1994/1995), pages 95–103. Inst. Math. Statist., Hayward,
CA, 1995.
[88] Jayanta K. Ghosh and Tapas Samanta. Nonsubjective Bayes testing—an
overview. J. Statist. Plann. Inference, 103(1-2):205–223, 2002. C. R. Rao 80th
birthday felicitation volume, Part I.
[89] J. K. Ghosh. Review of Approximation Theorems of Mathematical Statistics by Serfling. J. Amer. Statist. Assoc., 78(383):732, September 1983.
[90] Richard D. Gill and Søren Johansen. A survey of product-integration
with a view toward application in survival analysis. Ann. Statist., 18(4):1501–
1555, 1990.
[91] Piet Groeneboom. Nonparametric estimators for interval censoring prob-
lems. In Analysis of censored data (Pune, 1994/1995), volume 27 of IMS Lec-
ture Notes Monogr. Ser., pages 105–128. Inst. Math. Statist., Hayward, CA,
1995.
[92] J. Hannan. Consistency of maximum likelihood estimation of discrete distri-
butions. In Contributions to probability and statistics, pages 249–257. Stanford
Univ. Press, Stanford, Calif., 1960.
[93] J. A. Hartigan. Bayes theory. Springer-Verlag, New York, 1983.
[94] J. A. Hartigan. Bayesian histograms. In Bayesian statistics, 5 (Alicante,
1994), pages 211–222. Oxford Univ. Press, New York, 1996.
[95] David Heath and William Sudderth. De Finetti’s theorem on exchange-
able variables. Amer. Statist., 30(4):188–189, 1976.
[96] David Heath and William Sudderth. On finitely additive priors, coher-
ence, and extended admissibility. Ann. Statist., 6(2):333–345, 1978.
[97] David Heath and William Sudderth. Coherent inference from improper
priors and from finitely additive priors. Ann. Statist., 17(2):907–919, 1989.
[98] N. L. Hjort. Bayesian approaches to non- and semiparametric density esti-
mation. In Bayesian statistics, 5 (Alicante, 1994), pages 223–253. Oxford Univ.
Press, New York, 1996.
[99] Nils Lid Hjort. Application of the Dirichlet Process to some nonparametric estimation problems (in Norwegian). Ph.D. thesis, University of Tromsø.
[100] Nils Lid Hjort. Nonparametric Bayes estimators based on beta processes in
models for life history data. Ann. Statist., 18(3):1259–1294, 1990.
[101] N. L. Hjort and D. Pollard. Asymptotics of minimisers of convex processes. Statistical Research Report, Department of Mathematics, University of Oslo, 1994.
[102] I. A. Ibragimov and R. Z. Hasminskiĭ. Statistical estimation. Springer-Verlag, New York, 1981. Asymptotic theory, translated from the Russian by Samuel Kotz.
[103] Hemant Ishwaran. Exponential posterior consistency via generalized Pólya urn schemes in finite semiparametric mixtures. Ann. Statist., 26(6):2157–2178, 1998.
[104] K. Ito. Stochastic processes. Matematisk Institut, Aarhus Universitet, Aarhus, 1969.
[105] Lancelot James. Poisson process partition calculus with applications to exchangeable models and Bayesian nonparametrics.
[106] Harold Jeffreys. An invariant form for the prior probability in estimation
problems. Proc. Roy. Soc. London. Ser. A., 186:453–461, 1946.
[107] R. A. Johnson. An asymptotic expansion for posterior distributions. Ann.
Math. Statist., 38:1899–1906, 1967.
[108] Richard A. Johnson. Asymptotic expansions associated with posterior dis-
tributions. Ann. Math. Statist., 41:851–864, 1970.
[109] Joseph B. Kadane, James M. Dickey, Robert L. Winkler, Wayne S.
Smith, and Stephen C. Peters. Interactive elicitation of opinion for a
normal linear model. J. Amer. Statist. Assoc., 75(372):845–854, 1980.
[110] Olav Kallenberg. Foundations of modern probability. Springer-Verlag, New York, 1997.
[111] Robert E. Kass and Larry Wasserman. The selection of prior distributions by formal rules. J. Amer. Statist. Assoc., 91:1343–1370, 1996.
[112] J. H. B. Kemperman. On the optimum rate of transmitting information.
Ann. Math. Statist., 40:2156–2177, 1969.
[113] Yongdai Kim. Nonparametric Bayesian estimators for counting processes.
Ann. Statist., 27(2):562–588, 1999.
[114] Yongdai Kim and Jaeyong Lee. On posterior consistency of survival mod-
els. Ann. Statist., 29(3):666–686, 2001.
[115] A. N. Kolmogorov and V. M. Tihomirov. ε-entropy and ε-capacity of sets in functional space. Amer. Math. Soc. Transl. (2), 17:277–364, 1961.
[116] Ramesh M. Korwar and Myles Hollander. Contributions to the theory
of Dirichlet processes. Ann. Probability, 1:705–711, 1973.
[117] Steffen L. Lauritzen. Extremal families and systems of sufficient statistics. Springer-Verlag, New York, 1988.
[118] Michael Lavine. Some aspects of Pólya tree distributions for statistical modelling. Ann. Statist., 20(3):1222–1235, 1992.
[119] Michael Lavine. More aspects of Pólya tree distributions for statistical modelling. Ann. Statist., 22(3):1161–1176, 1994.
[120] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer-Verlag, New York, 1986.
[121] Lucien Le Cam and Grace Lo Yang. Asymptotics in statistics. Springer-Verlag, New York, 1990. Some basic concepts.
[122] L. LeCam. Convergence of estimates under dimensionality restrictions. Ann.
Statist., 1:38–53, 1973.
[123] E. L. Lehmann. Testing statistical hypotheses. Springer-Verlag, New York, second edition, 1997.
[124] E. L. Lehmann. Theory of point estimation. Springer-Verlag, New York, 1997. Reprint of the 1983 original.
[125] Peter J. Lenk. The logistic normal distribution for Bayesian, nonparametric,
predictive densities. J. Amer. Statist. Assoc., 83(402):509–516, 1988.
[126] Tom Leonard. A Bayesian approach to some multinomial estimation and
pretesting problems. J. Amer. Statist. Assoc., 72(360, part 1):869–874, 1977.
[127] Tom Leonard. Density estimation, stochastic processes and prior informa-
tion. J. Roy. Statist. Soc. Ser. B, 40(2):113–146, 1978. With discussion.
[128] D. V. Lindley. On a measure of the information provided by an experiment.
Ann. Math. Statist., 27:986–1005, 1956.
[129] D. V. Lindley. The use of prior probability distributions in statistical infer-
ence and decisions. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob.,
Vol. I , pages 453–468. Univ. California Press, Berkeley, Calif., 1961.
[130] Albert Y. Lo. Consistency in the location model: the undominated case.
Ann. Statist., 12(4):1584–1587, 1984.
[131] Albert Y. Lo. On a class of Bayesian nonparametric estimates. I. Density
estimates. Ann. Statist., 12(1):351–357, 1984.
[132] Michel Loève. Probability theory. II. Springer-Verlag, New York, fourth edition, 1978. Graduate Texts in Mathematics, Vol. 46.
[133] R. Daniel Mauldin, William D. Sudderth, and S. C. Williams. Pólya trees and random distributions. Ann. Statist., 20(3):1203–1221, 1992.
[134] Amewou-Atisso Messan, Subhashis Ghosal, Jayanta K. Ghosh, and R. V. Ramamoorthi. Posterior consistency for semiparametric regression problems. Bernoulli, to appear, 2002.
[135] Radford M. Neal. Markov chain sampling methods for Dirichlet process
mixture models. J. Comput. Graph. Statist., 9(2):249–265, 2000.
[136] Michael A. Newton, Claudia Czado, and Rick Chappell. Bayesian
inference for semiparametric binary regression. J. Amer. Statist. Assoc.,
91(433):142–153, 1996.
[137] Michael A. Newton, Fernando A. Quintana, and Yunlei Zhang.
Nonparametric Bayes methods using predictive updating. In Practical non-
parametric and semiparametric Bayesian statistics, pages 45–61. Springer, New
York, 1998.
[138] Arthur V. Peterson, Jr. Expressing the Kaplan-Meier estimator as a
function of empirical subsurvival functions. J. Amer. Statist. Assoc., 72(360,
part 1):854–858, 1977.
[139] David Pollard. Convergence of stochastic processes. Springer-Verlag, New York, 1984.
[140] David Pollard. A user's guide to measure theoretic probability. Cambridge University Press, Cambridge, 2002.
[141] Kathryn Roeder and Larry Wasserman. Practical Bayesian density
estimation using mixtures of normals. J. Amer. Statist. Assoc., 92(439):894–
902, 1997.
[142] Donald B. Rubin. The Bayesian bootstrap. Ann. Statist., 9(1):130–134,
1981.
[143] Gabriella Salinetti. Consistency of statistical estimators: the epigraphi-
cal view. In Stochastic optimization: algorithms and applications (Gainesville,
FL, 2000), volume 54 of Appl. Optim., pages 365–383. Kluwer Acad. Publ.,
Dordrecht, 2001.
[144] Mark J. Schervish. Theory of statistics. Springer-Verlag, New York, 1995.
[145] Lorraine Schwartz. On Bayes procedures. Z. Wahrscheinlichkeitstheorie
und Verw. Gebiete, 4:10–26, 1965.
[146] Gideon Schwarz. Estimating the dimension of a model. Ann. Statist.,
6(2):461–464, 1978.
[147] Robert J. Serfling. Approximation theorems of mathematical statistics. John Wiley & Sons Inc., New York, 1980. Wiley Series in Probability and Mathematical Statistics.
[148] Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statist.
Sinica, 4(2):639–650, 1994.
[149] Jayaram Sethuraman and Ram C. Tiwari. Convergence of Dirichlet
measures and the interpretation of their parameter. In Statistical decision the-
ory and related topics, III, Vol. 2 (West Lafayette, Ind., 1981), pages 305–315.
Academic Press, New York, 1982.
[150] Xiaotong Shen and Larry Wasserman. Rates of convergence of posterior
distributions. Ann. Statist., 29(3):687–714, 2001.
[151] B. W. Silverman. Density estimation for statistics and data analysis. Chapman & Hall, London, 1986.
[152] Richard L. Smith. Nonregular regression. Biometrika, 81(1):173–183, 1994.
[153] S. M. Srivastava. A course on Borel sets. Springer-Verlag, New York, 1998.
[154] V. Susarla and J. Van Ryzin. Nonparametric Bayesian estimation
of survival curves from incomplete observations. J. Amer. Statist. Assoc.,
71(356):897–902, 1976.
[155] V. Susarla and J. Van Ryzin. Large sample theory for a Bayesian non-
parametric survival curve estimator based on censored samples. Ann. Statist.,
6(4):755–768, 1978.
[156] Henry Teicher. Identifiability of finite mixtures. Ann. Math. Statist.,
34:1265–1269, 1963.
[157] Daniel Thorburn. A Bayesian approach to density estimation. Biometrika,
73(1):65–75, 1986.
[158] Luke Tierney and Joseph B. Kadane. Accurate approximations for pos-
terior moments and marginal densities. J. Amer. Statist. Assoc., 81(393):82–86,
1986.
[159] Bruce W. Turnbull. The empirical distribution function with arbitrarily
grouped, censored and truncated data. J. Roy. Statist. Soc. Ser. B, 38(3):290–
295, 1976.
[160] A. W. van der Vaart. Asymptotic statistics. Cambridge University Press, Cambridge, 1998.
[161] Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes. Springer-Verlag, New York, 1996. With applications to statistics.
[162] Richard von Mises. Probability, statistics and truth. Dover Publications Inc., New York, English edition, 1981.
[163] Abraham Wald. Note on the consistency of the maximum likelihood esti-
mate. Ann. Math. Statistics, 20:595–601, 1949.
[164] A. M. Walker. On the asymptotic behaviour of posterior distributions. J.
Roy. Statist. Soc. Ser. B, 31:80–88, 1969.
[165] Stephen Walker and Nils Lid Hjort. On Bayesian consistency. J. R.
Stat. Soc. Ser. B Stat. Methodol., 63(4):811–821, 2001.
[166] Stephen Walker and Pietro Muliere. Beta-Stacy processes and a generalization of the Pólya-urn scheme. Ann. Statist., 25(4):1762–1780, 1997.
[167] Stephen Walker and Pietro Muliere. A characterization of a neutral
to the right prior via an extension of Johnson’s sufficientness postulate. Ann.
Statist., 27(2):589–599, 1999.
[168] Mike West. Modelling with mixtures. In Bayesian statistics, 4 (Peñíscola, 1991), pages 503–524. Oxford Univ. Press, New York, 1992.
[169] Mike West. Approximating posterior distributions by mixtures. J. Roy.
Statist. Soc. Ser. B, 55(2):409–422, 1993.
[170] Mike West, Peter Müller, and Michael D. Escobar. Hierarchical priors and mixture models, with application in regression and density estimation. In Aspects of uncertainty, pages 363–386. Wiley, Chichester, 1994.
[171] E. T. Whittaker and G. N. Watson. A course of modern analysis. Cambridge University Press, Cambridge, 1996. An introduction to the general theory of infinite processes and of analytic functions; with an account of the principal transcendental functions, Reprint of the fourth (1927) edition.
[172] Wing Hung Wong and Xiaotong Shen. Probability inequalities for likeli-
hood ratios and convergence rates of sieve MLEs. Ann. Statist., 23(2):339–362,
1995.
[173] Michael Woodroofe. Very weak expansions for sequentially designed ex-
periments: linear models. Ann. Statist., 17(3):1087–1102, 1989.
Index
affinity, 13
Albert, J.H., 213
amenable group, 52
Andersen, P.K., 237
Antoniak, C.E., 113
Bahadur, R.R., 29
ball, 10
Barron, A., 48, 132, 133, 135–137, 143,
171, 181, 182, 191
Basu, D., 46, 103
Basu, S., 213, 214
Bayes estimates, 122
asymptotic normality, 38
consistency, 122
Berger, J., 46, 47, 50, 51, 88, 229
Berk, R.H., 103
Bernardo, J.M., 47–50, 52, 228, 229
Bernstein, 34
beta distribution, 87
beta process, 254, 265
consistency, 279
construction, 266
definition, 265
properties, 268–270
beta-Stacy process, 254, 270
BIC, 40
Bickel, P., 34
Billingsley, P., 12, 13
Birgé, L., 232, 234
Blackwell, D., 21, 103
Blum, J., 238
Borgan, Ø., 237
Borwanker, J., 35
boundary, 10
bracket, 233
Bunke, O., 183
Burr, D., 198
Cencov, N.N., 222
censored data, 237
consistency, 241, 247
change point, 45
Chen, M., 147, 213
Chib, S., 213
Clarke, B., 48
closed, 10
compact, 10
conjugate prior, 53
Connor, R.J., 253
consistency
L1, 122, 135
strong, 122
consistency of posterior, 17, 26
consistent estimate, 33
Cooke, G.E., 198
Cramér, H., 33, 177
cumulative hazard function, 242, 243,
253, 258
Datta, G., 50–52, 229
Dawid, A.P., 51
De Finetti’s theorem, 64
Dembski, W.A., 221, 223, 224
density estimation, 141
Dey, D.K., 213
Dey, J., 257, 259, 271
Diaconis, P., 21, 22, 31, 53, 55, 86, 113,
181–185, 192, 195
Dirichlet density, 62
Dirichlet distribution, 87, 89
polya urn, 94
properties, 89–94
Bayes estimate, 95
Dirichlet mixtures, 143
normal densities, 144, 161, 197, 198,
209, 222
L1-Consistency, 169, 172
weak consistency, 162, 164, 165
uniform densities, see random his-
tograms
Dirichlet process, 96
convergence properties, 105
discrete support, 102
existence, 96
mixtures of, 113
mutual singularity, 110
neutral to the right, 99
posterior, 96
posterior consistency, 106
predictive distribution, 99
Sethuraman construction, 103
support, 104
tail free, 98
Doksum, K.A., 120, 253, 257, 259
Doss, H., 166, 181, 198
Drăghici, L., 120, 257
Dubins, L., 21
Dudley, R., 16, 81
empirical process, 26
entropy, 47
Erickson, R.V., 259, 271
Escobar, M.D., 142, 146, 147
Ferguson, T., 87, 107, 114, 143, 144,
146, 253, 257, 264
finitely additive prior, 52
Fisher information, 40
Florens, 146
Fortini, S., 86
Freedman, D., 21, 22, 24, 31, 55, 86,
113, 181–185, 192, 195
Gasperini, M., 142, 150, 151
Gaudard, M., 61
Gaussian process priors, 174
sample paths, 175, 176
Ghorai, J.K., 161
Ghosal, S., 18, 35, 43, 45, 187, 198, 202,
231
Ghosh, J.K., 18, 33, 35, 39, 40, 43, 45–
47, 50–52, 187, 198, 202, 229,
231
Gill, R., 237, 244
Glivenko-Cantelli theorem, 59
Goldschmidt-Clermont, P.J., 198
Haar measure
left invariant, 51
right invariant, 51
Hannan, J., 14
Hartigan, J.A., 142, 223
Hasminskiĭ, R.Z., 41
Heath, D., 52, 83
Hellinger distance, 41
Hjort, N.L., 28, 103, 142, 143, 245, 253,
254, 264–268
Hoeffeding’s inequality, 128, 136
Hollander, M., 110
hyperparameter, 113, 146
Ibragimov, I.A., 41
IH conditions, 41
independent increment process, 253, 258–
260
interior, 10
Ito, K., 260
Jeffreys prior, 47, 49, 51, 221, 222, 225,
228
Johansen, S., 244
Johnson, R.A., 35
joint distribution, 16
Joshi, S.N., 35, 39, 40, 45
K-L support, 181
Kadane, J., 35, 54
Kallenberg, O., 260
Kallianpur, G., 35
Kaplan-Meier, 238, 241, 242, 249
Kass, R., 46
Keiding, N., 237
Kemperman, J.H.B., 15
Kim, Y., 254, 279
Kolmogorov, A.N., 64, 227
Kolmogorov strong law, 199, 200
Korwar, R., 110
Kraft, C., 76
Kullback-Leibler
divergence, 14, 126
support, 126, 129, 197
Lévy measure, 253, 261, 263
Lévy representation, 253, 259, 264
Ladelli, L., 86
Laplace, 17, 34
Lauritzen, S.L., 55
Lavine, M., 114, 143, 190, 192, 195
Leadbetter, M.R., 177
LeCam’s inequality, 137
LeCam, L., 34, 137, 231, 232
Lee, J., 254, 279
Lehmann, E.L., 28, 29, 34
Lenk, P., 142, 174, 175, 177, 180
Leonard, T., 142, 174, 175
Lindley, D., 35, 48
link function, 198
Lo, A.Y., 142, 143, 161
location parameter, 181
consistency, 185, 186, 188
Dirichlet prior, 182
consistency, 185
posterior for θ, 183
log concave, 28
logit, 213
Mahalanobis, D., 222
marginal distribution, 16
Massart, 234
Mauldin, R.D., 94, 114, 116, 119
maximum likelihood
asymptotic normality, 33, 34
consistency, 26, 28
estimate, 26, 249
inconsistency, 29
MacQueen, J.B., 103
measure of information, 48
merging, 20
Messan, C.A., 198, 202, 215
metric
L1,13
compact, 11, 24
complete separable, 13, 24, 58, 60
Hellinger, 13, 58
separable space, 11
space, 10
supremum, 59, 81
total variation, 13, 58, 60
metric entropy
L1, 135, 137
bracketing, 135, 137
Milhaud, X., 183
Mosimann, J.E., 253
Mueller, P., 142, 146, 147
Mukerjee, R., 46
Mukhopadhyay, C., 45
Mukhopadhyay, S., 213, 214
Muliere, P., 257, 270
multinomial, 24, 54, 67
Neal, R., 147
neighborhood base, 13
neutral to the right prior, 253
beta Stacy process, 255
characterization, 257
consistency, 271, 272, 275, 279
definition, 254
Dirichlet process, 255
existence, 263
inconsistency, 276
posterior, 256
posterior L´evy measure, 264
support, 262
Newton, M.A., 107, 114, 147
nonergodic, 52
noninformative prior, 46
nonregular, 35, 41, 44
nonseparable, 58, 60
nonsubjective prior, 10, 46, 51–53, 221
open, 10
packing number, 224
Pericchi, L., 46
Peterson, A.V., 238, 246
Phadia, E., 253, 257
Pollard, D., 11, 26, 28
Polya tree, 142, 209
consistency, 118
existence, 116
Kullback-Leibler support, 190
marginal distribution, 117
on densities, 120
posterior, 117
predictive distribution, 118
prior, 73, 198
process, 114
support, 119
posterior distribution, 16
posterior inconsistency, 31
posterior normality, 34, 35, 42
posterior robustness, 18
predictive distribution, 21
probit, 213
product integral, 245, 264
proper centering, 43
Quintana, F. A., 147
Ramamoorthi, R.V., 120, 187, 198, 202,
257, 259, 271
random histograms, 144, 148, 222
L1-consistency, 156, 160
weak consistency, 150, 152
Rao, B.L.S. Prakasa, 35
Rao, C.R., 222
rates of convergence, 141
reference prior, 47, 50, 51
Regazzini, E., 55, 86
regression
coefficient, 198
Schwartz theorem, 197, 198
Rubin, D., 106
Rubin, H, 161
Ryzin, J. Van, 238, 241, 249
Salinetti, G., 122
Samanta, T., 18, 43, 45, 46
Savage, I.R., 103
Schervish, M., 54, 55, 63, 83, 86, 96,
106, 143, 147, 171
Schwartz, 31
Schwarz, G., 40
Sellke, T., 166
Serfling, R.J., 33
Sethuraman, J., 79, 96, 103, 105
setwise convergence, 58–61, 81
Shannon, C.L., 47
Shen, X., 221, 230, 231, 232, 234
Silverman, B.W., 141
Sinha, B.K., 35, 39, 40
Smith, R.L., 41
Srivastava, S.M., 24
Stein, C., 31
Stone, M., 51
strong consistency, 135
Sudderth, W.D., 52, 83, 94, 114, 116,
119
support, 11, 24
topological, 11
survival function, 254
Susarla, V., 238, 241, 249
tail free prior, 71
0-1 laws, 75
consistency, 126
existence, 71
on densities, 76
posterior, 74
Teicher, H., 144
test
exponentially consistent, 127, 129,
203, 204, 214
unbiased, 127, 131
uniformly consistent, 127, 129, 131,
132
theorem
π-λ, 11, 60
Bernstein–von Mises, 33, 42, 44
Borel Isomorphism, 24
De Finetti, 55, 63, 83, 86, 95
Doob, 22, 31
Kolmogorov consistency, 64, 66
portmanteau, 12, 80
Prohorov, 13, 60, 80
Schwartz, 33, 129, 181
Stone-Weierstrass, 21
Thorburn, D., 174
Tierney, L., 35
tight, 13, 79
Tihomirov, V.M., 227
Tiwari, R.C., 79, 103, 105
Turnbull, B., 249
uniform strong law, 24, 26, 27
upper bracketing numbers, 234
Vaart, van der, 12, 26, 141, 231, 234
von Mises, R., 18, 34
Wald’s conditions, 27
Wald, A., 27
Walker, A.M., 34
Walker, S., 143, 257, 270
Wasserman, L., 46, 143, 171, 231
Watson, G.N., 194
weak consistency, 122
weak convergence, 12, 13, 60
Wellner, J., 12, 26, 234
West, M., 142, 146, 147
Whittaker, E.T., 194
Williams, S.C., 94, 114, 116, 119
Wong, W., 221, 230, 232, 234
Woodroofe, M., 35
Yahav, J.A., 34
Ylvisaker, D., 53
Zhang, Y., 147
Zidek, J.V., 51
... A standard approach to BCLTs emphasizes the large-data distributional limit of the Bayesian posterior, e.g., by showing that the total variation distance between the Bayesian posterior and a normal approximation converges to zero in probability. For example, this is the approach taken by Van der Vaart 2000, Chapter 10, Theorem 1.4.2 of Ghosh and Ramamoorthi 2003, and Lehmann and Casella 2006, on which our proof is principally based. Results such as our Theorem 1, which emphasize the frequentist properties of posterior quantities, may seem to be of a different character than distributional approximations to the posterior itself. ...
... For example, one must ensure that there are no other posterior modes very nearly as high as that at θ̂. In the present work we follow Ghosh and Ramamoorthi 2003; Lehmann and Casella 2006 and manage the posterior far from θ̂ with the final condition of Assumption 1, namely that θ̂ is a strict optimum of the empirical likelihood with probability approaching one. This assumption may seem somewhat artificial, since it is stated in terms of the data, rather than of any limiting population quantities. ...
Preprint
The frequentist variability of Bayesian posterior expectations can provide meaningful measures of uncertainty even when models are misspecified. Classical methods to asymptotically approximate the frequentist covariance of Bayesian estimators such as the Laplace approximation and the nonparametric bootstrap can be practically inconvenient, since the Laplace approximation may require an intractable integral to compute the marginal log posterior, and the bootstrap requires computing the posterior for many different bootstrap datasets. We develop and explore the infinitesimal jackknife (IJ), an alternative method for computing asymptotic frequentist covariance of smooth functionals of exchangeable data, which is based on the ``influence function'' of robust statistics. We show that the influence function for posterior expectations has the form of a simple posterior covariance, and that the IJ covariance estimate is, in turn, easily computed from a single set of posterior samples. Under conditions similar to those required for a Bayesian central limit theorem to apply, we prove that the corresponding IJ covariance estimate is asymptotically equivalent to the Laplace approximation and the bootstrap. In the presence of nuisance parameters that may not obey a central limit theorem, we argue heuristically that the IJ covariance can remain a good approximation to the limiting frequentist variance. We demonstrate the accuracy and computational benefits of the IJ covariance estimates with simulated and real-world experiments.
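The abstract notes that the influence function of a posterior expectation is a simple posterior covariance, obtainable from a single set of posterior draws. A small sketch of that idea follows; everything in it (the function name ij_variance, the toy normal model with known unit variance, and the convention that the IJ variance estimate is the sum of squared influence scores) is an illustrative reading of the abstract, not the authors' exact estimator.

```python
import numpy as np

def ij_variance(theta_draws, g, loglik_terms):
    """Infinitesimal-jackknife sketch of the frequentist variance of E[g(theta) | data].

    theta_draws  : (S, p) array of posterior draws.
    g            : maps one draw to the scalar functional of interest.
    loglik_terms : maps one draw to the (n,) vector of per-observation
                   log-likelihood contributions (user-supplied model code).
    """
    g_vals = np.array([g(th) for th in theta_draws])          # (S,)
    ell = np.array([loglik_terms(th) for th in theta_draws])  # (S, n)
    # Influence score of observation i: posterior covariance Cov(g(theta), ell_i(theta)),
    # estimated here by the sample covariance over the posterior draws.
    g_c = g_vals - g_vals.mean()
    ell_c = ell - ell.mean(axis=0)
    scores = g_c @ ell_c / (len(g_vals) - 1)                  # (n,)
    # Assumed IJ convention: frequentist variance estimate = sum of squared scores.
    return np.sum(scores ** 2)

# Toy check: normal mean with known unit variance and a flat prior,
# where the posterior can be sampled directly.
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=200)
post = rng.normal(x.mean(), 1.0 / np.sqrt(len(x)), size=5000)[:, None]
v_ij = ij_variance(post, lambda th: th[0], lambda th: -0.5 * (x - th[0]) ** 2)
print(v_ij, x.var(ddof=1) / len(x))   # the two numbers should be of similar size
```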
... By Proposition 4.4.1 of Ghosh and Ramamoorthi (2003), this implies the existence of exponentially consistent tests. The case where H₁₋ and (β − β₀) ∈ Q_{γ,−} can be obtained in a similar way. ...
... where U_δ is a weak neighbourhood of f₀(y|λ). Therefore, exponentially consistent tests always exist for Theorem 4.4.2 of Ghosh and Ramamoorthi (2003). If Δ₂ and ρ are taken small enough, the union would contain ...
Preprint
In this work, we will investigate a Bayesian approach to estimating the parameters of long memory models. Long memory, characterized by the phenomenon of hyperbolic autocorrelation decay in time series, has garnered significant attention. This is because, in many situations, the assumption of short memory, such as the Markovianity assumption, can be deemed too restrictive. Applications for long memory models can be readily found in fields such as astronomy, finance, and environmental sciences. However, current parametric and semiparametric approaches to modeling long memory present challenges, particularly in the estimation process. In this study, we will introduce various methods applied to this problem from a Bayesian perspective, along with a novel semiparametric approach for deriving the posterior distribution of the long memory parameter. Additionally, we will establish the asymptotic properties of the model. An advantage of this approach is that it allows one to implement state-of-the-art efficient algorithms for nonparametric Bayesian models.
... The parameter takes the form and can be interpreted as the concentration of p. If the elements a_ij for i = 1, …, R and j = 1, …, C have similar and large values, e.g., a_ij = 1000 for all i, j, then the prior distribution is very informative, where the probability of any p_l, l = 1, …, R·C, being large would be equal for all p_l [57]. In contrast, if all a_ij take similar and small values, e.g., a_ij = 0.1 for all i and j, then the resulting distribution p would be noninformative, and the resulting p will resemble the uniform distribution of dimension R·C. ...
Article
Full-text available
The χ² test is among the most widely used statistical hypothesis tests in medical research. Often, the statistical analysis deals with the test of row-column independence in a 2×2 contingency table, and the statistical parameter of interest is the odds ratio. A novel Bayesian analogue to the frequentist χ² test is introduced. The test is based on a Dirichlet-multinomial model under a joint sampling scheme and works with balanced and unbalanced randomization. The test focusses on the quantity of interest in a variety of medical research, the odds ratio in a 2×2 contingency table. A computational implementation of the test is developed and R code is provided to apply the test. To meet the demands of regulatory agencies, a calibration of the Bayesian test is introduced which allows one to calibrate the false-positive rate and power. The latter provides a Bayes-frequentist compromise which ensures control over the long-term error rates of the test. Illustrative examples using clinical trial data and simulations show how to use the test in practice. In contrast to existing Bayesian tests for 2×2 tables, calibration of the acceptance threshold for the hypothesis of interest allows one to achieve a bound on the false-positive rate and minimum power for a prespecified odds ratio of interest. The novel Bayesian test provides an attractive choice for Bayesian biostatisticians who face the demands of regulatory agencies which usually require formal control over false-positive errors and power under the alternative. As such, it constitutes an easy-to-apply addition to the arsenal of already existing Bayesian tests.
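A bare-bones Monte Carlo illustration of the testing idea just described: under a joint multinomial sampling scheme, a Dirichlet prior on the four cell probabilities is conjugate, so the posterior of the odds ratio can be summarized directly from Dirichlet draws. The prior concentration a = 0.1, the toy table, and the summaries printed below are placeholders for illustration, not the calibrated thresholds the article proposes.

```python
import numpy as np

def posterior_odds_ratio(table, a=0.1, draws=100_000, seed=1):
    """Posterior draws of the odds ratio of a 2x2 table under a Dirichlet(a, a, a, a) prior."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(table, dtype=float).ravel()      # (n11, n12, n21, n22)
    post = rng.dirichlet(counts + a, size=draws)         # conjugate Dirichlet posterior
    p11, p12, p21, p22 = post.T
    return (p11 * p22) / (p12 * p21)

# Hypothetical 2x2 table: rows = treatment/control, columns = event/no event.
table = [[18, 32],
         [9, 41]]
or_draws = posterior_odds_ratio(table)
lo, hi = np.percentile(or_draws, [2.5, 97.5])
print(f"posterior median OR = {np.median(or_draws):.2f}, 95% interval = ({lo:.2f}, {hi:.2f})")
print("P(OR > 1 | data) =", np.mean(or_draws > 1))       # posterior evidence of association
```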
... Moreover, we also study the basic theoretical question of posterior consistency in these RKHS models within the proposed Bayesian framework. The concepts of consistency and posterior concentration are a kind of frequentist validation that have arguably been an active point of research in the last few decades, particularly in infinite-dimensional settings (see Amewou-Atisso et al., 2003;Ghosh and Ramamoorthi, 2003;Choi and Ramamoorthi, 2008), and also in the functional regression case (e.g. Lian et al., 2016;Abraham and Grollemund, 2020). ...
Preprint
Full-text available
We propose a novel Bayesian methodology for inference in functional linear and logistic regression models based on the theory of reproducing kernel Hilbert spaces (RKHS's). These models build upon the RKHS associated with the covariance function of the underlying stochastic process, and can be viewed as a finite-dimensional approximation to the classical functional regression paradigm. The corresponding functional model is determined by a function living on a dense subspace of the RKHS of interest, which has a tractable parametric form based on linear combinations of the kernel. By imposing a suitable prior distribution on this functional space, we can naturally perform data-driven inference via standard Bayes methodology, estimating the posterior distribution through Markov chain Monte Carlo (MCMC) methods. In this context, our contribution is twofold. First, we derive a theoretical result that guarantees posterior consistency in these models, based on an application of a classic theorem of Doob to our RKHS setting. Second, we show that several prediction strategies stemming from our Bayesian formulation are competitive against other usual alternatives in both simulations and real data sets, including a Bayesian-motivated variable selection procedure.
... Throughout the paper, we represent the CRE distribution through finite-dimensional mixtures; see Section 3.1. Thus, ϑ is finite-dimensional and the concentration results can be obtained from the literature on the consistency and asymptotic normality of posterior distributions; see Hartigan (1983), van der Vaart (1998), Ghosh and Ramamoorthi (2003), or Ghosal and van der Vaart (2017) for textbook treatments. The only difference to many of the results stated in the literature is that we assume the convergence in probability to occur under the marginal distribution of Y_{1:N,0:T} rather than its distribution conditional on a "true" parameter, which imposes some restrictions on the prior for ϑ. ...
Article
Full-text available
We use a dynamic panel Tobit model with heteroskedasticity to generate forecasts for a large cross‐section of short time series of censored observations. Our fully Bayesian approach allows us to flexibly estimate the cross‐sectional distribution of heterogeneous coefficients and then implicitly use this distribution as prior to construct Bayes forecasts for the individual time series. In addition to density forecasts, we construct set forecasts that explicitly target the average coverage probability for the cross‐section. We present a novel application in which we forecast bank‐level loan charge‐off rates for small banks.
... The proof of this theorem is shown in Appendix C, which is based on Chib et al. (2018), Ghosh and Ramamoorthi (2003) and Van der Vaart (2000). This theorem essentially shows the limiting posterior distribution of ξ concentrates on a √n-ball centered at the true value of ξ₀ with the same variance-covariance matrix as the M-estimator. ...
Article
Full-text available
Frequentist semiparametric theory has been used extensively to develop doubly robust (DR) causal estimation. DR estimation combines outcome regression (OR) and propensity score (PS) models in such a way that correct specification of just one of two models is enough to obtain consistent parameter estimation. An equivalent Bayesian solution, however, is not straightforward as there is no obvious distributional framework to the joint OR and PS model, and the DR approach involves a semiparametric estimating equation framework without a fully specified likelihood. In this paper, we develop a fully semiparametric Bayesian framework for DR causal inference by bridging a nonparametric Bayesian procedure with empirical likelihood theory via semiparametric linear regression. Instead of specifying a fully probabilistic model, this procedure is only realized through relevant moment conditions. Crucially, this allows the posterior distribution of the causal parameter to be simulated via Markov chain Monte Carlo methods. We show that the posterior distribution of the causal estimator satisfies consistency and the Bernstein–von Mises theorem, when either the OR or PS is correctly specified. Simulation studies suggest that our proposed method is doubly robust and can achieve the frequentist coverage. We also apply this Bayesian method to a real data example to assess the impact of speed cameras on car collisions in England.
Article
The rapid development of modeling techniques has brought many opportunities for data-driven discovery and prediction. However, this also leads to the challenge of selecting the most appropriate model for any particular data task. Information criteria, such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC), have been developed as a general class of model selection methods with profound connections with foundational thoughts in statistics and information theory. Many perspectives and theoretical justifications have been developed to understand when and how to use information criteria, which often depend on particular data circumstances. This review article will revisit information criteria by summarizing their key concepts, evaluation metrics, fundamental properties, interconnections, recent advancements, and common misconceptions to enrich the understanding of model selection in general. This article is categorized under: Data: Types and Structure > Traditional Statistical Data; Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods; Statistical and Graphical Methods of Data Analysis > Information Theoretic Methods; Statistical Models > Model Selection. Model selection for many applications, such as selecting important variables, identifying time lags for forecasting, and ranking competing models.
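As a reminder of the definitions this review revolves around, AIC = 2k − 2 log L̂ and BIC = k log n − 2 log L̂, where k is the number of free parameters and L̂ is the maximized likelihood. The comparison of two Gaussian models below is only a made-up illustration of how the two criteria are computed and read, not an example from the article.

```python
import numpy as np
from scipy import stats

def aic_bic(loglik, k, n):
    """AIC = 2k - 2*loglik; BIC = k*log(n) - 2*loglik (lower is preferred)."""
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

rng = np.random.default_rng(0)
x = rng.normal(0.2, 1.0, size=100)    # toy data, true mean slightly away from zero
n = len(x)

# Model 1: N(0, sigma^2), one free parameter.
sigma1 = np.sqrt(np.mean(x ** 2))     # MLE of sigma when the mean is fixed at 0
ll1 = stats.norm(0.0, sigma1).logpdf(x).sum()

# Model 2: N(mu, sigma^2), two free parameters.
mu2, sigma2 = x.mean(), x.std()       # MLEs of mu and sigma
ll2 = stats.norm(mu2, sigma2).logpdf(x).sum()

for name, ll, k in [("N(0, s^2)", ll1, 1), ("N(m, s^2)", ll2, 2)]:
    aic, bic = aic_bic(ll, k, n)
    print(f"{name}: logL = {ll:.2f}, AIC = {aic:.2f}, BIC = {bic:.2f}")
# BIC penalizes the extra parameter more heavily than AIC once log(n) > 2.
```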
Article
Full-text available
In the general setting of predictive inference, when observations are exchangeable and take values in a Polish space, conditions are stated in order that parametric models turn out to be limiting forms of predictive distributions and parameters are limiting forms of suitable predictive sufficient statistics. The treatment is completed by a necessary and sufficient condition in order that a sequence of predictive distributions may be consistent with an exchangeable distribution. Moreover, main properties of predictive sufficiency are revisited in the general setting described above.
Article
We consider generalized linear models and study the asymptotic properties of the posterior distribution where the dimension of the parameter is allowed to grow to infinity with the sample size. Under certain growth restrictions on the dimension, we show that the posterior distribution is consistent and admits a normal approximation. This result can be used to construct procedures with asymptotic Bayesian validity.
Article
We consider using scale mixtures of multivariate normal links (SMMVN) to model binary responses when binary observations are taken from the same individuals or are taken over time in a longitudinal fashion. SMMVN-links are quite rich, which include multivariate probit, Student’s t links, logit, symmetric stable link, and exponential power link. Fully parametric classical approaches to these are intractable and thus Bayesian methods are pursued using a Markov chain Monte Carlo (MCMC) sampling based approach. Necessary theory involved in Bayesian modeling and computation is provided. In particular, we produce a new look at the multivariate logit model, the most popular model in this context. Further, we develop various efficient computational algorithms for this complex simulation problem. Finally, a real data example from the Indonesian Children’s Health Study is used to illustrate the proposed methodology.
Article
Let a random sample of size n be taken from a distribution having a density depending on a real parameter θ, and let θ have an absolutely continuous prior distribution with density π(θ). We give a rigorous proof that, under suitable regularity conditions, the posterior distribution of θ will, when n tends to infinity, be asymptotically normal with mean equal to the maximum‐likelihood estimator and variance equal to the reciprocal of the second derivative of the logarithm of the likelihood function evaluated at the maximum‐likelihood estimator, independently of the form of π(θ).
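For readers skimming this list, the limiting form described in the abstract is the familiar one; written with the usual sign convention (ℓ_n denoting the log-likelihood and θ̂_n its maximizer), it reads:

```latex
% Standard statement of the approximation described above (under regularity conditions):
\[
  \pi(\theta \mid x_1,\ldots,x_n) \;\approx\;
  N\!\Bigl(\hat\theta_n,\; \bigl[-\ell_n''(\hat\theta_n)\bigr]^{-1}\Bigr),
  \qquad
  \ell_n(\theta) \;=\; \sum_{i=1}^{n} \log f(x_i \mid \theta).
\]
```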
Article
Summary: This paper is concerned with the non-parametric estimation of a distribution function F, when the data are incomplete due to grouping, censoring and/or truncation. Using the idea of self-consistency, a simple algorithm is constructed and shown to converge monotonically to yield a maximum likelihood estimate of F. An application to hypothesis testing is indicated.
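For the special case of right-censored data only (no grouping or truncation), the self-consistency idea mentioned in the summary reduces to a short fixed-point iteration whose limit coincides with the Kaplan-Meier estimate. The sketch below handles just that simplified case and is not Turnbull's general algorithm; the function name and the tolerances are arbitrary.

```python
import numpy as np

def self_consistent_cdf(times, event, tol=1e-10, max_iter=1000):
    """Self-consistency iteration for right-censored data (simplified case).

    times : observed times (event or censoring times).
    event : 1 if an actual event was observed, 0 if the observation is right-censored.
    Returns the distinct event times and the estimated CDF at those times.
    """
    times, event = np.asarray(times, float), np.asarray(event, int)
    n = len(times)
    grid = np.unique(times[event == 1])                 # the CDF jumps only at event times
    F = np.linspace(1.0 / len(grid), 1.0, len(grid))    # any increasing starting value

    def cdf_at(F, t):
        idx = np.searchsorted(grid, t, side="right") - 1
        return np.where(idx >= 0, F[np.clip(idx, 0, None)], 0.0)

    for _ in range(max_iter):
        Fc = cdf_at(F, times)                           # current CDF at each observation
        F_new = np.empty_like(F)
        for j, t in enumerate(grid):
            exact = np.sum((event == 1) & (times <= t))
            cens = (event == 0) & (times <= t)
            # A point censored at c <= t contributes P(true time <= t | true time > c).
            contrib = (F[j] - Fc[cens]) / np.maximum(1.0 - Fc[cens], 1e-12)
            F_new[j] = (exact + contrib.sum()) / n
        if np.max(np.abs(F_new - F)) < tol:
            return grid, F_new
        F = F_new
    return grid, F

# Toy data: exponential lifetimes subject to independent exponential censoring.
rng = np.random.default_rng(2)
t_true, c = rng.exponential(1.0, 50), rng.exponential(1.5, 50)
times, event = np.minimum(t_true, c), (t_true <= c).astype(int)
grid, F = self_consistent_cdf(times, event)
print(list(zip(grid.round(2)[:5], F.round(3)[:5])))
```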
Conference Paper
In this paper, we propose methods for the construction of a non-informative prior through the uniform distributions on approximating sieves. In parametric families satisfying regularity conditions, it is shown that Jeffreys’ prior is obtained. The case with nuisance parameters is also considered. In the infinite dimensional situation, we show that such a prior leads to consistent posterior.
Article
Kernel density estimation techniques are used to smooth simulated samples from importance sampling function approximations to posterior distributions, resulting in revised approximations that are mixtures of standard parametric forms, usually multivariate normal or T-distributions. Adaptive refinement of such mixture approximations involves repeating this process to home in successively on the posterior. In fairly low dimensional problems, this provides a general and automatic method of approximating posteriors by mixtures, so that marginal densities and other summaries may be easily computed. This is discussed and illustrated, with comment on variations and extensions suited to sequential Bayesian updating of Monte Carlo approximations, an area in which existing and alternative numerical methods are difficult to apply.
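A one-dimensional sketch of the smoothing step just described, assuming a normal kernel, a simple rule-of-thumb bandwidth, and shrinkage of the kernel locations toward the weighted mean so the mixture's spread stays close to that of the weighted sample. The target, the proposal, and all tuning constants below are invented for illustration; an adaptive scheme would feed the fitted mixture back in as the next importance function.

```python
import numpy as np
from scipy import stats

def mixture_refine(log_target, proposal, n=2000, seed=3):
    """Smooth weighted importance-sampling draws into a normal mixture (1-d sketch)."""
    rng = np.random.default_rng(seed)
    draws = proposal.rvs(size=n, random_state=rng)
    logw = log_target(draws) - proposal.logpdf(draws)      # unnormalized importance weights
    w = np.exp(logw - logw.max())
    w /= w.sum()

    mean = np.sum(w * draws)
    var = np.sum(w * (draws - mean) ** 2)
    h = 1.06 * n ** (-0.2)                                 # rule-of-thumb bandwidth factor
    a = np.sqrt(1.0 - h ** 2)                              # shrink locations toward the mean
    locs = a * draws + (1.0 - a) * mean                    # component means
    scale = h * np.sqrt(var)                               # common component std deviation
    return locs, scale, w

def mixture_pdf(x, locs, scale, w):
    return np.sum(w[:, None] * stats.norm(locs[:, None], scale).pdf(x[None, :]), axis=0)

# Toy target: a skewed Gamma(3, 1) "posterior"; a crude wide normal as importance function.
log_target = stats.gamma(a=3.0, scale=1.0).logpdf
locs, scale, w = mixture_refine(log_target, stats.norm(3.0, 3.0))
print("mixture mean approx.:", round(float(np.sum(w * locs)), 2), "(target mean is 3.0)")
x0 = np.array([2.0])
print("mixture pdf at 2.0:", round(float(mixture_pdf(x0, locs, scale, w)), 3),
      "| true pdf:", round(float(stats.gamma(a=3.0, scale=1.0).pdf(2.0)), 3))
```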
Article
This article presents a nonparametric Bayesian estimator of a survival curve based on incomplete or arbitrarily right-censored data. This estimator, a Bayes estimator under a squared-error loss function assuming a Dirichlet process prior, is shown to be a Bayesian extension of the usual product limit (Kaplan-Meier) nonparametric estimator.
Article
Concepts of independence for nonnegative continuous random variables, X1, …, Xk, subject to the constraint ΣXi = 1 are developed. These concepts provide a means of modeling random vectors of proportions which is useful in analyzing certain kinds of data; and which may be of interest in quantifying prior opinions about multinomial parameters. A generalization of the Dirichlet distribution is given, and its relation to the Dirichlet is simply indicated by means of the concepts.The concepts are used to obtain conclusions of biological interest for data on bone composition in rats and scute growth in turtles.
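The kind of generalized Dirichlet vector described here can be simulated by a stick-breaking construction from independent beta variables: break off a Beta(a_1, b_1) fraction of the total, then a Beta(a_2, b_2) fraction of what remains, and so on, with the last coordinate taking the leftover mass. The shape parameters in the example are arbitrary; with suitably matched shapes the construction reduces to an ordinary Dirichlet vector.

```python
import numpy as np

def generalized_dirichlet(a, b, size=1, seed=0):
    """Stick-breaking draws of a random proportion vector (X_1, ..., X_k) summing to 1.

    a, b : length k-1 shape parameters; Z_i ~ Beta(a_i, b_i) independently,
           X_i = Z_i * prod_{j<i} (1 - Z_j), and X_k takes whatever mass remains.
    """
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    z = rng.beta(a, b, size=(size, len(a)))        # independent "breaks"
    remaining = np.cumprod(1.0 - z, axis=1)        # mass left after each break
    x = np.empty((size, len(a) + 1))
    x[:, 0] = z[:, 0]
    x[:, 1:-1] = z[:, 1:] * remaining[:, :-1]
    x[:, -1] = remaining[:, -1]
    return x

# Example: three proportions built from two independent beta breaks (arbitrary shapes).
samples = generalized_dirichlet(a=[2.0, 3.0], b=[5.0, 4.0], size=5)
print(samples.round(3))
print(samples.sum(axis=1))    # every row sums to 1
```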