
Multivariate Receptor Modeling for Temporally Correlated Data by Using MCMC

Taylor & Francis
Journal of the American Statistical Association

Abstract and Figures

Multivariate receptor modeling aims to estimate pollution source profiles and the amounts of pollution based on a series of ambient concentrations of multiple chemical species over time. Air pollution data often show temporal dependence due to meteorology and/or background sources. Previous approaches to receptor modeling do not incorporate this dependence. We model dependence in the data using a time series approach so that we can incorporate extra sources of variability in parameter estimation and uncertainty estimation. We estimate parameters using the Markov chain Monte Carlo method, which makes simultaneous estimation of parameters and uncertainties possible. The methods are applied to simulated data and 1990 Atlanta air pollution data. The results show promise towards the goal of accounting for the dependence in the data.
Eun Sug Park Peter Guttorp Ronald C. Henry
NRCSE
Technical Report Series
NRCSE-TRS No. 043
March 9, 2000
The NRCSE was established in 1996 through a cooperative agreement with the United States
Environmental Protection Agency which provides the Center's primary funding.
Eun Sug Park (1), Peter Guttorp (1), and Ronald C. Henry (2)
(1) National Research Center for Statistics and the Environment, University of Washington,
Seattle, WA 98195
(2) Civil and Environmental Engineering, University of Southern California,
Los Angeles, CA 90089
Author’s Footnote
Eun Sug Park is Research Associate, National Research Center for Statistics and the
Environment, University of Washington, Seattle, WA 98195. Peter Guttorp is Professor
of Statistics and Director of the National Research Center for Statistics and the
Environment, University of Washington, Seattle, WA 98195. Ronald C. Henry is
Associate Professor of Civil and Environmental Engineering, University of Southern
California, Los Angeles, CA 90089. Although the research described in this article has
been funded by the United States Environmental Protection Agency through agreement
CR825173-01-0 to the University of Washington, it has not been subjected to the
Agency’s required peer and policy review and therefore does not necessarily reflect the
views of the Agency and no official endorsement should be inferred.
KEY WORDS: Dynamic models; Kalman filter; Gibbs sampler; Metropolis-Hastings
algorithm; Compositions; Air pollution.
1. INTRODUCTION
The goal of receptor modeling is to identify the pollution sources and assess the amounts of
pollution based on observations collected at a particular site, and from that information to
develop an effective air quality management plan. The basic mathematical model can be
written as follows based on chemical mass balance assumptions (see, e.g., Hopke, 1985,
1991, 1997; Gleser 1997):
$$ y_t = \sum_{k=1}^{q} \alpha_{tk} P_k + \varepsilon_t, \qquad t = 1, \ldots, n, \tag{1} $$
where $y_t = (y_{t1}, y_{t2}, \ldots, y_{tp})$ is the tth observation, q is the number of sources,
$P_k = (p_{k1}, p_{k2}, \ldots, p_{kp})$ is the kth source composition (consisting of the fractional
amount of each species in the emissions from the kth source), $\alpha_{tk}$ is the contribution from
the kth source on the tth day, and $\varepsilon_t = (\varepsilon_{t1}, \varepsilon_{t2}, \ldots, \varepsilon_{tp})$
is the measurement error associated with the tth observation. In matrix terms, the model (1) can be written as
$$ Y = AP + E \tag{2} $$
where A is the n×q source contribution matrix, P is the q×p source composition matrix, and E is
the n×p error matrix. The model (1) may be viewed as a factor analysis model in the sense that
Y is the only observable quantity while q, P, and A are all unknown quantities that need to
be estimated (or predicted). Early approaches to multivariate receptor modeling include
exploratory factor analysis, principal component analysis, target transformation factor
analysis, and others (see, e.g., Henry 1991). It is well known that, without imposing
additional constraints on the parameters, the factor analysis model is not identifiable even
with known number of sources, q. There have been several attempts to avoid this problem
by imposing more restrictive constraints on either the P or the A matrix (see Henry and Kim
1990; Henry, Lewis, and Collins 1994; Yang 1994; Park 1997). As a matter of fact, there
could be many different sets of identifiability conditions, each making sense in its own
context. Park, Spiegelman, and Henry (1999) discuss identifiability conditions that are
meaningful in receptor models.
The assumption of independence among the observations $y_t$ has been made either
implicitly or explicitly in all previous approaches to multivariate receptor modeling, see, for
instance, Hopke (1991), Henry (1991), Yang (1994), Gleser (1997), Park (1997), and Park
et al. (1999). Air pollution data, however, are usually obtained as a series of measurements
on concentrations of aerosols over time, and meteorology often induces some degree of
dependence in the data. Observations closer in time tend to be more correlated than
observations farther apart in time (e.g., Figure 1).
{Insert Figure 1}
In some cases the assumption of independence may not be grossly wrong because
environmental data usually contains many missing values or erroneous observations, and
after initial screening of the data, time separation between any pair of measurements may
become large enough so that serial correlation can be ignored in the screened data. This, of
course, is not always the case. The research in this paper was motivated by a 1990 Atlanta
air pollution composition data set consisting of hourly measurements of volatile
hydrocarbon (VHC) species. This data set was used in Henry et al. (1994) to derive
vehicle-related hydrocarbon source compositions from the ambient data. In that study, three
types of measured source profiles specific to Atlanta in the summertime of 1990 were also
available: roadway emissions, whole gasoline, and gasoline headspace (see Henry et al.
1994). The compositions of those three sources for nine selected vehicle-related species are
provided in Table 1.
{Insert Table 1}
It is worthwhile to mention that those direct source measurements were obtained, under
rather restricted conditions, independently of the data (e.g., roadway compositions were
obtained as highway tunnel measurements during morning rush hour). Thus it is not
unlikely that the measured source compositions could be different from the true source
compositions $P_0$ for the data due to pollutant transport (between source and receptor) and
reactions (and also to measurement errors, variations in source compositions, and the
contribution of minor sources). Nonetheless, the measured source compositions may serve
as a guideline for the true source compositions.
Assuming that the measured compositions in Table 1 are the true source
compositions, i.e., P is known in model (2), A can be estimated easily, for instance, as an
ordinary least squares (OLS) solution, $\hat{A}_{OLS} = Y P'(P P')^{-1}$, if we ignore dependence
structure in the data (and vice versa, i.e., $\hat{P}_{OLS} = (A'A)^{-1}A'Y$ if A is known or estimated first.
This was done in almost all previous works without checking the independence
assumption). Figures 1 and 2 show the autocorrelation function (ACF) plots of the raw data
Y and of the residuals calculated as $Y - \hat{A}_{OLS} P$ for each of nine species, respectively.
{Figure 2 about here}
Figure 3 shows ACF plots of OLS estimates of source contributions, $\hat{A}_{OLS}$.
{Figure 3 about here}
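The OLS step and the autocorrelation check described above can be sketched as follows. This is a minimal illustration on synthetic numbers, not the Atlanta data; the function names `ols_contributions` and `sample_acf` and the toy matrices are assumptions introduced here.

```python
import numpy as np

def ols_contributions(Y, P):
    """OLS estimate A_hat = Y P' (P P')^{-1} for Y (n x p) and a known P (q x p)."""
    # Solve (P P') X = P Y' and transpose, rather than forming an explicit inverse.
    return np.linalg.solve(P @ P.T, P @ Y.T).T

def sample_acf(x, max_lag):
    """Sample autocorrelation of a 1-D series at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - h], x[h:]) / denom
                     for h in range(max_lag + 1)])

# Toy illustration with synthetic numbers (hypothetical, not the Atlanta data).
rng = np.random.default_rng(0)
P = np.array([[0.6, 0.3, 0.1, 0.0],
              [0.0, 0.2, 0.3, 0.5]])
A = np.abs(rng.normal(5.0, 1.0, size=(300, 2)))
Y = A @ P                                  # noise-free, so OLS recovers A
A_hat = ols_contributions(Y, P)
resid = Y - A_hat @ P                      # residuals whose ACF would be inspected
acf_first_species = sample_acf(Y[:, 0], max_lag=5)
```

With real data, the same `sample_acf` call applied to each column of `resid` and of `A_hat` would give the quantities plotted in Figures 2 and 3.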
All three plots reveal significant serial correlation in the data. It is well known in the time
series literature that, when the residuals are correlated, the standard error of the OLS
estimate of the trend (which may be regarded as P in our model) computed without adjusting
for that correlation is often grossly wrong. Although the correct standard error of the OLS
estimate may be obtained by adjusting for the correlation, OLS is still not the best
estimate, since the generalized least squares estimate, which takes the correlation into
account in the estimation procedure, has a smaller standard error. The goal of this article is to extend
receptor models to account for temporal dependence in the data so that we can incorporate
that source of variability in estimation of parameters and uncertainties. In Section 2, we
introduce models accounting for time dependence in the observations. Estimation of
parameters is discussed in Section 3. Sections 4 and 5 contain examples from simulated
data and the Atlanta air pollution data, respectively. Finally, concluding remarks are made in
Section 6.
2. MODEL
Assume that the $y_t$ in (1) are dependent. We first need to decide how to model this
dependence. It seems reasonable to assume that the source contribution at time t depends
on the past source contributions (as Figure 3 indicates). Also, it is often the case that
$\varepsilon$ contains not only pure measurement error but also all the remaining sources of
variability that are not explained by the systematic part of our model, such as background
sources (unmodeled minor sources) and meteorology. Then it is likely that the $\varepsilon_t$ are also
correlated in time due to the effect of meteorology and unmodeled sources (see Figure 2).
We may decompose $\varepsilon_t$ into two terms, $\varepsilon_t = \eta_t + \delta_t$, where $\eta_t$ represents variability
correlated in time due to meteorology or background sources, and $\delta_t$ represents residual,
unpredictable variability due to pure measurement error, independent over time.
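We consider the model
$$ y_t = \alpha_t P + \eta_t + \delta_t $$
where $\alpha_t = (\alpha_{t1}, \alpha_{t2}, \ldots, \alpha_{tq})$ is a stationary vector AR(1) process centered at
$\xi = (\xi_1, \xi_2, \ldots, \xi_q)$, $\eta_t = (\eta_{t1}, \eta_{t2}, \ldots, \eta_{tp})$ is a stationary vector AR(1) process
centered at 0, and $\delta_t = (\delta_{t1}, \delta_{t2}, \ldots, \delta_{tp}) \sim N_p(0, \Sigma)$ where
$\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2)$. We use '$N_k(\cdot, \cdot)$' to
denote the k-dimensional multivariate normal distribution throughout the paper. This model
may be written in Dynamic Linear Model (DLM) form (West and Harrison, 1997) as
$$ \begin{aligned}
\text{Observation equation:} \quad & y_t = \alpha_t P + \eta_t + \delta_t, \quad \delta_t \sim N_p(0, \Sigma) \\
\text{Evolution equation:} \quad & \alpha_t = \xi + (\alpha_{t-1} - \xi)\Phi + u_t, \quad u_t \sim N_q(0, U) \\
& \eta_t = \eta_{t-1}\Theta + \upsilon_t, \quad \upsilon_t \sim N_p(0, V)
\end{aligned} \tag{3} $$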
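where $u_t = (u_{t1}, u_{t2}, \ldots, u_{tq})$, $\Phi = \mathrm{diag}(\phi_1, \phi_2, \ldots, \phi_q)$, $\phi_k$ is an AR
coefficient for the kth source contribution, $\upsilon_t = (\upsilon_{t1}, \upsilon_{t2}, \ldots, \upsilon_{tp})$,
$\Theta = \mathrm{diag}(\theta_1, \theta_2, \ldots, \theta_p)$, and $\theta_j$ is an AR coefficient for the jth element of $\eta_t$.
Note that the marginal distribution for each $\alpha_t$ is
$$ \alpha_t \sim N_q(\xi, W), \qquad W = \Phi W \Phi + U \tag{4} $$
and for each $\eta_t$ is
$$ \eta_t \sim N_p(0, M), \qquad M = \Theta M \Theta + V. \tag{5} $$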
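Because Φ and Θ are diagonal, the fixed-point equations in (4) and (5) can be solved entry by entry. The short derivation below is added for illustration; the indices k, l and i, j are introduced here, and stationarity is taken to mean that every AR coefficient is below 1 in absolute value.

```latex
% Entry (k,l) of W = \Phi W \Phi + U with \Phi = diag(\phi_1, ..., \phi_q):
W_{kl} = \phi_k \phi_l W_{kl} + U_{kl}
\quad\Longrightarrow\quad
W_{kl} = \frac{U_{kl}}{1 - \phi_k \phi_l},
\qquad
M_{ij} = \frac{V_{ij}}{1 - \theta_i \theta_j}.
% For diagonal U and V this reduces to the familiar AR(1) variances
W_{kk} = \frac{U_{kk}}{1 - \phi_k^2},
\qquad
M_{jj} = \frac{V_{jj}}{1 - \theta_j^2}.
```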
3. ESTIMATION
As the model gets complicated by inclusion of more parameters, Markov chain Monte Carlo
(MCMC) simulation (Tierney 1994; Chib and Greenberg 1995; Besag, Green, Higdon, and
Mengersen 1995; Gilks, Richardson, and Spiegelhalter 1996) seems to be an attractive
approach for parameter estimation. Note also that the parameters of the models (1) or (3)
are all unknown, and the problem of parameter estimation is essentially nonlinear, but the
Markov chain Monte Carlo method makes the problem linear by use of conditional
distributions. We introduce a Bayesian framework to employ an MCMC method
(constraints and identifiability conditions can be used as a part of the prior distribution). As
mentioned in Section 1, the receptor model can be viewed as a special type of a factor
analysis model (with the constraints that the elements of factor loading matrix should be all
nonnegative). For identifiability of the model we borrow conditions from the confirmatory
factor analysis model (Anderson 1984).
C1. There are at least q − 1 zero elements in each row of P,
C2. The rank of $P^{(k)}$ is q − 1, where $P^{(k)}$ is the matrix composed of the columns
containing the assigned 0's in the kth row with those assigned 0's deleted.
Under the above conditions the source profiles, P, are identified up to normalization, which
is enough for the purpose of receptor model. (As long as the relative amount of each
species in a source is determined, a source can be identified.) The conditions C1 and C2
(and nonnegativity constraints on the elements of P) are absorbed into prior distribution for
P.
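Conditions C1 and C2 can be checked mechanically for a proposed pattern of pre-assigned zeros. The sketch below is one reading of the two conditions; the function name `check_identifiability`, the boolean `zero_mask` argument and the toy matrix are hypothetical and do not come from the paper.

```python
import numpy as np

def check_identifiability(P, zero_mask):
    """Check C1 and C2 for a q x p matrix P with pre-assigned zeros.

    zero_mask[k, j] is True where P[k, j] is an assigned zero.
    """
    q, _ = P.shape
    for k in range(q):
        cols = np.flatnonzero(zero_mask[k])
        # C1: at least q - 1 assigned zeros in row k.
        if cols.size < q - 1:
            return False
        # C2: take the columns holding the assigned zeros of row k, delete
        # row k itself, and require the remaining block to have rank q - 1.
        block = np.delete(P[:, cols], k, axis=0)
        if np.linalg.matrix_rank(block) != q - 1:
            return False
    return True

# Toy pattern with two assigned zeros per source (q = 3, p = 6); hypothetical values.
P_toy = np.array([[0.0, 0.0, 0.3, 0.3, 0.2, 0.2],
                  [0.4, 0.1, 0.0, 0.0, 0.3, 0.2],
                  [0.2, 0.5, 0.0, 0.1, 0.0, 0.2]])
mask = np.zeros_like(P_toy, dtype=bool)
mask[0, [0, 1]] = True
mask[1, [2, 3]] = True
mask[2, [2, 4]] = True
pattern_ok = check_identifiability(P_toy, mask)
```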
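Under the normal error assumption on $\delta$, the likelihood $f(Y \mid \cdots)$ is written as
$$ f(Y \mid \cdots) = (2\pi)^{-np/2}\, |\Sigma|^{-n/2} \exp\left\{ -\tfrac{1}{2}\, \mathrm{tr}\left[ \Sigma^{-1} (Y - AP - \eta)'(Y - AP - \eta) \right] \right\} \tag{6} $$
where $\eta$ is the n×p matrix whose rows are $\eta_t$, $t = 1, \ldots, n$. We use '$\mid \cdots$' to denote
conditioning on all other variables. For a prior distribution $p(\cdot)$, we assume that
$$ \begin{aligned}
& p(P, \Sigma, \Phi, U, \Theta, V, \alpha_1, \ldots, \alpha_n, \eta_1, \ldots, \eta_n) \\
& \quad = p(P)\, p(\Sigma)\, p(\Phi)\, p(U)\, p(\alpha_1, \ldots, \alpha_n \mid \xi_0, \Phi, U)\, p(\Theta)\, p(V)\, p(\eta_1, \ldots, \eta_n \mid \Theta, V).
\end{aligned} $$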
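For the sake of brevity, $\xi$ is assumed known to be $\xi = \xi_0$. Note that (3) implies
$$ p(\alpha_1, \ldots, \alpha_n \mid \xi_0, \Phi, U) = (2\pi)^{-nq/2}\, |W|^{-1/2} \exp\left\{ -\tfrac{1}{2} \gamma_1 W^{-1} \gamma_1' \right\} |U|^{-(n-1)/2} \exp\left\{ -\tfrac{1}{2}\, \mathrm{tr}\left[ U^{-1} \sum_{t=2}^{n} (\gamma_t - \gamma_{t-1}\Phi)'(\gamma_t - \gamma_{t-1}\Phi) \right] \right\} $$
where $\gamma_t = \alpha_t - \xi_0$, and
$$ p(\eta_1, \ldots, \eta_n \mid \Theta, V) = (2\pi)^{-np/2}\, |M|^{-1/2} \exp\left\{ -\tfrac{1}{2} \eta_1 M^{-1} \eta_1' \right\} |V|^{-(n-1)/2} \exp\left\{ -\tfrac{1}{2}\, \mathrm{tr}\left[ V^{-1} \sum_{t=2}^{n} (\eta_t - \eta_{t-1}\Theta)'(\eta_t - \eta_{t-1}\Theta) \right] \right\}. $$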
Based on a series of observations $y_1, \ldots, y_n$, we are interested in sampling the full
posterior $\pi(P, \Sigma, \Phi, U, \Theta, V, \alpha_1, \ldots, \alpha_n, \eta_1, \ldots, \eta_n \mid Y)$. We use the
“block-at-a-time” Metropolis-Hastings algorithm (Chib and Greenberg, 1995). We shall make use of
seven move types in implementing MCMC:
(a) updating P
(b) updating Σ
(c) updating Φ
(d) updating U
(e) updating Θ
(f) updating V
(g) updating $\alpha$ and $\eta$.
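Letting $\tilde{P} = (A'A)^{-1}A'(Y - \eta)$ and $S = (Y - A\tilde{P} - \eta)'(Y - A\tilde{P} - \eta)$, and using the
orthogonality properties associated with $\tilde{P}$ (see Press 1982), (6) can be written as
$$ \begin{aligned}
& (2\pi)^{-np/2}\, |\Sigma|^{-n/2} \exp\left\{ -\tfrac{1}{2}\, \mathrm{tr}\left( \Sigma^{-1} S \right) \right\} \exp\left\{ -\tfrac{1}{2}\, \mathrm{tr}\left[ \Sigma^{-1} (P - \tilde{P})' A'A (P - \tilde{P}) \right] \right\} \\
& \quad \propto \exp\left\{ -\tfrac{1}{2} (\mathrm{vec}P - \mathrm{vec}\tilde{P})' \left( \Sigma^{-1} \otimes A'A \right) (\mathrm{vec}P - \mathrm{vec}\tilde{P}) \right\}.
\end{aligned} $$
Let the prior distribution for P be
$$ p(P) = p(\mathrm{vec}P) \sim N_{pq}(m_0, C_0) \cdot I(P_{kj} \ge 0,\ k = 1, \ldots, q,\ j = 1, \ldots, p) $$
where $m_0$ is a pq-dimensional vector and $C_0$ is a pq×pq-dimensional diagonal matrix.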
Enforcing the constraints C1-C2 is equivalent to using a degenerate point prior for some of
the elements of P. We set $q \times (q - 1)$ elements of $m_0$ and the corresponding elements of
$C_0$ to be zero, which makes the prior distribution for P a truncated singular normal
distribution (though still proper). Then the resulting full conditional posterior distribution
$\pi(P \mid \cdots)$ is again a truncated singular normal distribution, which can be written as
$$ \mathrm{vec}P \mid \cdots \sim N_{pq}(m, C) \cdot I(P_{kj} \ge 0,\ k = 1, \ldots, q,\ j = 1, \ldots, p) $$
where $m = C\left\{ (\Sigma^{-1} \otimes A')\, \mathrm{vec}(Y - \eta) + C_0^{-} m_0 \right\}$, $C = \left( \Sigma^{-1} \otimes A'A + C_0^{-} \right)^{-1}$, and $C_0^{-}$ is a
generalized inverse of $C_0$. Since both Σ and $C_0$ are diagonal, for each column $P_j$ of P with
no zero elements, we have
$$ P_j \mid \cdots \sim N_q(m_j, C_j) \cdot I(P_{kj} \ge 0,\ k = 1, \ldots, q) $$
where $m_j = C_j\left\{ \sigma_j^{-2} A'(y_j - \eta_j) + C_{j0}^{-1} m_{j0} \right\}$, $C_j = \left( \sigma_j^{-2} A'A + C_{j0}^{-1} \right)^{-1}$, $m_{j0}$ is a
q-dimensional prior mean vector of $P_j$, $C_{j0}$ is the corresponding submatrix of $C_0$, $y_j$ is the
jth column of Y, and $\eta_j$ is the jth column of $\eta$. For the columns of P containing zero
elements, let $q^*$ be the number of nonzero elements for that column and $P_j^*$ be a column vector
consisting of those $q^*$ elements. Then
$$ P_j^* \mid \cdots \sim N_{q^*}(m_j^*, C_j^*) \cdot I(P_{kj} \ge 0,\ k = 1, \ldots, q) $$
where $m_j^* = C_j^*\left\{ \sigma_j^{-2} A^{*\prime}(y_j - \eta_j) + C_{j0}^{*-1} m_{j0}^* \right\}$, $C_j^* = \left( \sigma_j^{-2} A^{*\prime}A^* + C_{j0}^{*-1} \right)^{-1}$, $m_{j0}^*$ is a
$q^*$-dimensional prior mean vector of the nonzero elements of $P_j$, $C_{j0}^*$ is the corresponding
submatrix of $C_0$, and $A^*$ consists of the columns of A corresponding to the nonzero elements of $P_j$.
If there is no prior information about the source compositions other than the zero elements, we
may use a noninformative prior
$$ p(P) = \prod_{k=1}^{q} \prod_{j=1}^{p} I(P_{kj} \ge 0)\; I\left( P_{kj} = 0 \text{ for } (k, j) \in J_0 \right) $$
where $J_0$ is the index set for which $P_{kj} = 0$, which takes into account the conditions C1-C2 and
nonnegativity only. Under this prior, we have, for the columns of P with no zero element,
$$ P_j \mid \cdots \sim N_q(m_j, C_j) \cdot I(P_{kj} \ge 0,\ k = 1, \ldots, q) $$
where $m_j = (A'A)^{-1}A'(y_j - \eta_j)$ and $C_j = \sigma_j^2 (A'A)^{-1}$. For the columns of P containing zero
elements, we get
$$ P_j^* \mid \cdots \sim N_{q^*}(m_j^*, C_j^*) \cdot I(P_{kj} \ge 0,\ k = 1, \ldots, q) $$
where $m_j^* = (A^{*\prime}A^*)^{-1}A^{*\prime}(y_j - \eta_j)$ and $C_j^* = \sigma_j^2 (A^{*\prime}A^*)^{-1}$.
Hence move (a) can be performed using either a Gibbs sampler or a simple Metropolis-
Hastings algorithm.
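One way to carry out move (a) under the noninformative prior is a coordinate-wise Gibbs sweep through the truncated normal for a single column of P. The sketch below is an illustration under that assumption rather than the authors' implementation; the function names are hypothetical, and the truncated draw uses simple inverse-CDF sampling, which is adequate only when the truncation point is not far in the tail. For a column with assigned zeros, the same function would be called with the sub-matrix of A and the nonzero elements only.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()

def draw_truncated_normal(loc, scale, rng):
    """One draw from N(loc, scale^2) restricted to [0, inf) by inverse-CDF sampling."""
    lower = _STD.cdf((0.0 - loc) / scale)        # probability mass below zero
    u = rng.uniform(lower, 1.0)
    u = min(max(u, 1e-12), 1.0 - 1e-12)          # keep inv_cdf away from 0 and 1
    return max(loc + scale * _STD.inv_cdf(u), 0.0)

def gibbs_sweep_P_column(P_j, A, r_j, sigma2_j, rng):
    """One coordinate-wise Gibbs sweep for a column of P under the flat prior.

    r_j = y_j - eta_j. The target is N(m_j, C_j) restricted to P_kj >= 0, with
    m_j = (A'A)^{-1} A' r_j and C_j = sigma2_j (A'A)^{-1}.
    """
    prec = (A.T @ A) / sigma2_j                  # C_j^{-1}
    m = np.linalg.solve(A.T @ A, A.T @ r_j)      # m_j
    P_j = np.array(P_j, dtype=float)
    for k in range(P_j.size):
        others = np.arange(P_j.size) != k
        cond_var = 1.0 / prec[k, k]
        # Conditional mean of coordinate k given the current other coordinates.
        cond_mean = m[k] - cond_var * prec[k, others] @ (P_j[others] - m[others])
        P_j[k] = draw_truncated_normal(cond_mean, np.sqrt(cond_var), rng)
    return P_j
```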
Under a usual inverse gamma prior distribution for $\sigma_j^2$, $\sigma_j^{-2} \sim \Gamma(\alpha, \beta)$, $j = 1, \ldots, p$,
with the parameterization in which the mean and variance are $\alpha/\beta$ and $\alpha/\beta^2$, respectively,
the full conditionals for $\{\sigma_j^2\}$ are
$$ \sigma_j^{-2} \mid \cdots \sim \Gamma\left( \alpha + \tfrac{1}{2} n,\ \beta + \tfrac{1}{2} d_j \right) $$
where $d_j = (y_j - AP_j - \eta_j)'(y_j - AP_j - \eta_j)$. This can be easily sampled using a Gibbs
sampler.
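Move (b) amounts to one gamma draw per species. A minimal sketch, assuming the rate parameterization stated above (mean α/β for the precision); the function name `draw_sigma2` is introduced here for illustration.

```python
import numpy as np

def draw_sigma2(Y, A, P, eta, alpha, beta, rng):
    """Draw each sigma_j^2 from its inverse gamma full conditional."""
    n = Y.shape[0]
    resid = Y - A @ P - eta
    d = np.sum(resid ** 2, axis=0)               # d_j, one value per species
    # Precision 1/sigma_j^2 ~ Gamma(shape = alpha + n/2, rate = beta + d_j/2).
    # NumPy's gamma takes a scale argument, which is 1/rate.
    precision = rng.gamma(shape=alpha + 0.5 * n, scale=1.0 / (beta + 0.5 * d))
    return 1.0 / precision
```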
Moves (c) - (g) require Metropolis-Hastings steps. We use the same strategies as those
given in Chib and Greenberg (1995) and West and Harrison (1997) to update Φ and U,
respectively. Let $\gamma_t = \alpha_t - \xi_0$. Under uniform priors for $\phi_k$, writing $\phi = (\phi_1, \ldots, \phi_q)$ for
the diagonal of Φ, and $D_t = \mathrm{diag}(\gamma_{t-1})$, the full conditional posterior density for Φ, $\pi(\phi \mid \cdots)$,
is proportional to
$$ c(\Phi) \times f_{nor}(\phi \mid b, B)\; I(0 < \phi < 1) $$
where $f_{nor}$ is the q-variate normal density function,
$$ B^{-1} = \sum_{t=2}^{n} D_t U^{-1} D_t, \qquad b = B \sum_{t=2}^{n} D_t U^{-1} \gamma_t', \qquad c(\Phi) = |W|^{-1/2} \exp\left( -\tfrac{1}{2} \gamma_1 W^{-1} \gamma_1' \right), $$
$W = \Phi W \Phi + U$, and $I(0 < \phi < 1) = \prod_{k=1}^{q} I(0 < \phi_k < 1)$. We use $N_q(b, B)$ as a proposal
distribution for $\phi$ (independent proposal). That is, we sample a candidate $\phi^*$ from $N_q(b, B)$,
compute the corresponding diagonal matrix $\Phi^*$ and variance matrix $W^*$ such that
$W^* = \Phi^* W^* \Phi^* + U$, and accept the new $\phi$ vector with probability
$$ \min\left\{ \frac{c(\Phi^*)\, I(0 < \phi^* < 1)}{c(\Phi)\, I(0 < \phi < 1)},\ 1 \right\}. $$
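The update of Φ can be sketched as an independence Metropolis-Hastings step. The code below follows the formulas for B, b and c(Φ) as reconstructed from the text, with the stationary variance W obtained entrywise for diagonal Φ; the function names are hypothetical and this is an illustrative sketch, not the authors' code.

```python
import numpy as np

def stationary_cov(phi, U):
    """Solve W = Phi W Phi + U for diagonal Phi: W_kl = U_kl / (1 - phi_k phi_l)."""
    return U / (1.0 - np.outer(phi, phi))

def log_c(phi, U, gamma1):
    """log c(Phi) = -0.5 log|W| - 0.5 gamma_1 W^{-1} gamma_1'."""
    W = stationary_cov(phi, U)
    _, logdet = np.linalg.slogdet(W)
    return -0.5 * logdet - 0.5 * gamma1 @ np.linalg.solve(W, gamma1)

def update_phi(phi, gamma, U, rng):
    """One independence M-H step for phi; gamma is the n x q matrix of alpha_t - xi_0."""
    U_inv = np.linalg.inv(U)
    prev, curr = gamma[:-1], gamma[1:]
    # D_t = diag(gamma_{t-1}), so sum_t D_t U^{-1} D_t is the elementwise
    # product of U^{-1} with sum_t gamma_{t-1}' gamma_{t-1}.
    B_inv = U_inv * (prev.T @ prev)
    b = np.linalg.solve(B_inv, np.sum(prev * (curr @ U_inv), axis=0))
    cand = rng.multivariate_normal(b, np.linalg.inv(B_inv))
    if np.any(cand <= 0.0) or np.any(cand >= 1.0):
        return phi                               # indicator I(0 < phi < 1) is zero
    log_ratio = log_c(cand, U, gamma[0]) - log_c(phi, U, gamma[0])
    return cand if np.log(rng.uniform()) < log_ratio else phi
```

The current value `phi` is assumed to lie inside (0, 1), so the indicator in the denominator of the acceptance ratio equals one.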
The full conditional posterior for U, $\pi(U \mid \cdots)$, is proportional to
$$ p(U)\, a(U)\, |U|^{-(n-1)/2} \exp\left[ -\tfrac{1}{2}\, \mathrm{trace}\left( U^{-1} G \right) \right] $$
where $G = \sum_{t=2}^{n} (\gamma_t - \gamma_{t-1}\Phi)'(\gamma_t - \gamma_{t-1}\Phi)$ and $a(U) = |W|^{-1/2} \exp\left( -\tfrac{1}{2} \gamma_1 W^{-1} \gamma_1' \right)$. Note that G
follows a Wishart distribution with parameters U and n − 1, i.e., $G \sim W_q(U, n - 1)$, where
$$ f(G) = \frac{ |G|^{(n-q-2)/2} \exp\left[ -\tfrac{1}{2}\, \mathrm{trace}\left( U^{-1} G \right) \right] }{ 2^{(n-1)q/2}\, |U|^{(n-1)/2}\, \Gamma_q\left( \tfrac{n-1}{2} \right) }. $$
Under an inverse Wishart prior $U \sim W_q^{-1}(\Psi_0, m_0)$, where the density is given by
$$ p(U) = \frac{ |\Psi_0|^{m_0/2}\, |U|^{-(m_0+q+1)/2} \exp\left[ -\tfrac{1}{2}\, \mathrm{trace}\left( U^{-1} \Psi_0 \right) \right] }{ 2^{m_0 q/2}\, \Gamma_q\left( \tfrac{m_0}{2} \right) }, $$
the conditional distribution of U given G is $U \mid G \sim W_q^{-1}(G + \Psi_0,\ m_0 + n - 1)$, and so the full
conditional posterior for U is proportional to
$$ a(U)\, f_{Wishart^{-1}}(U \mid G + \Psi_0,\ m_0 + n - 1), $$
where $f_{Wishart^{-1}}$ is the inverse Wishart density function. We use this inverse Wishart
distribution $W_q^{-1}(G + \Psi_0,\ m_0 + n - 1)$ as a proposal distribution for U. The acceptance
probability in this case is given by
$$ \min\left\{ \frac{a(U^*)}{a(U)},\ 1 \right\} $$
where $W^* = \Phi W^* \Phi + U^*$.
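The update of U can be sketched in the same way as the update of Φ. The code below assumes integer degrees of freedom, so that an inverse Wishart draw can be obtained by inverting a sum of outer products of normal vectors; the function names are hypothetical and the block is an illustration, not the authors' implementation.

```python
import numpy as np

def stationary_cov(phi, U):
    """Solve W = Phi W Phi + U for diagonal Phi: W_kl = U_kl / (1 - phi_k phi_l)."""
    return U / (1.0 - np.outer(phi, phi))

def log_a(U, phi, gamma1):
    """log a(U) = -0.5 log|W| - 0.5 gamma_1 W^{-1} gamma_1'."""
    W = stationary_cov(phi, U)
    _, logdet = np.linalg.slogdet(W)
    return -0.5 * logdet - 0.5 * gamma1 @ np.linalg.solve(W, gamma1)

def draw_inverse_wishart(scale, df, rng):
    """Draw from the inverse Wishart W^{-1}(scale, df), integer df >= dimension.

    If X has df rows i.i.d. N(0, scale^{-1}), then X'X is Wishart(scale^{-1}, df)
    and its inverse is inverse Wishart(scale, df).
    """
    q = scale.shape[0]
    X = rng.multivariate_normal(np.zeros(q), np.linalg.inv(scale), size=df)
    return np.linalg.inv(X.T @ X)

def update_U(U, phi, gamma, Psi0, m0, rng):
    """One M-H step for U with the inverse Wishart proposal described in the text."""
    n = gamma.shape[0]
    innov = gamma[1:] - gamma[:-1] * phi         # gamma_t - gamma_{t-1} Phi
    G = innov.T @ innov
    cand = draw_inverse_wishart(G + Psi0, m0 + n - 1, rng)
    log_ratio = log_a(cand, phi, gamma[0]) - log_a(U, phi, gamma[0])
    return cand if np.log(rng.uniform()) < log_ratio else U
```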
Move types (e)-(f) are essentially the same as move types (c)-(d) with substitution of Θ,
V, M, and $\eta$ for Φ, U, W, and $\gamma$, respectively.
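Move (g), updating $\alpha$ (equivalently, updating $\gamma_t = \alpha_t - \xi_0$) and $\eta$, can be implemented
by the forward-filtering, backward-sampling algorithm (West and Harrison 1997) applied to
$y_t - \mu_0$ where $\mu_0 = E(y_t)$. Note that the assumption that $\mu_0$ is known is not a strong
assumption. Model (3) can be rewritten as
$$ y_t - \mu_0 = \lambda_t F + \delta_t \quad \text{and} \quad \lambda_t = \lambda_{t-1} G + \rho_t, \tag{7} $$
where $\lambda_t = [\gamma_t\ \ \eta_t]$ is the state vector at time t,
$F = \begin{bmatrix} P \\ I_{p \times p} \end{bmatrix}$, G is the (q+p)×(q+p) matrix
$G = \begin{bmatrix} \Phi & 0 \\ 0 & \Theta \end{bmatrix}$, and $\rho_t = [u_t\ \ \upsilon_t]$ with variance matrix
$\Omega = \begin{bmatrix} U & 0 \\ 0 & V \end{bmatrix}$. To sample from the full conditional posterior
$\pi(\lambda_1, \lambda_2, \ldots, \lambda_n \mid \cdots)$, we sequentially simulate the individual vectors
$\lambda_n, \lambda_{n-1}, \ldots, \lambda_1$ as follows:
1) Sample $\lambda_n$ from $N_{q+p}(m_n, C_n)$ where $m_n$ and $C_n$ are obtained from the Kalman filtering
recurrences
$$ \begin{aligned}
m_{t+1} &= m_t G + e_{t+1} K_{t+1}, \\
e_{t+1} &= y_{t+1} - \mu_0 - m_t G F, \\
K_{t+1} &= \left( \Sigma + F' R_{t+1} F \right)^{-1} F' R_{t+1}, \\
C_{t+1} &= R_{t+1} - R_{t+1} F K_{t+1}, \\
R_{t+1} &= G' C_t G + \Omega.
\end{aligned} $$
2) For each $t = n-1, n-2, \ldots, 1$, sample $\lambda_t$ from $N_{q+p}(h_t, H_t)$ where
$h_t = m_t + (\lambda_{t+1} - a_{t+1}) B_t$, $H_t = C_t - B_t' R_{t+1} B_t$, $B_t = R_{t+1}^{-1} G' C_t$, $a_{t+1} = m_t G$, and
$\lambda_{t+1}$ is the value just sampled.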
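The two steps above can be sketched as a single forward-filtering, backward-sampling routine in the row-vector convention of (7). The starting moments `m0` and `C0` for the state before the first observation are an assumption made here (the text does not state them); a natural choice is a zero mean and the stationary covariance built from W and M. Function and argument names are hypothetical.

```python
import numpy as np

def _sym(M):
    """Symmetrize a covariance matrix to suppress round-off asymmetry."""
    return 0.5 * (M + M.T)

def ffbs(Yc, F, G, Sigma, Omega, m0, C0, rng):
    """Forward filtering, backward sampling for the row-vector model (7).

    Yc : n x p centred observations y_t - mu_0
    F  : (q+p) x p, G : (q+p) x (q+p) diagonal AR matrix
    m0, C0 : assumed mean and covariance of the state before the first observation
    Returns an n x (q+p) array holding one joint draw of lambda_1, ..., lambda_n.
    """
    n, d = Yc.shape[0], G.shape[0]
    m, C, R = np.zeros((n, d)), np.zeros((n, d, d)), np.zeros((n, d, d))
    m_prev, C_prev = m0, C0
    for t in range(n):                             # forward (Kalman) pass
        a = m_prev @ G
        R[t] = G.T @ C_prev @ G + Omega
        Q = Sigma + F.T @ R[t] @ F                 # one-step forecast covariance
        K = np.linalg.solve(Q, F.T @ R[t])         # p x (q+p) gain
        e = Yc[t] - a @ F
        m[t] = a + e @ K
        C[t] = R[t] - R[t] @ F @ K
        m_prev, C_prev = m[t], C[t]
    lam = np.zeros((n, d))
    lam[-1] = rng.multivariate_normal(m[-1], _sym(C[-1]))
    for t in range(n - 2, -1, -1):                 # backward sampling pass
        B = np.linalg.solve(R[t + 1], G.T @ C[t])  # B_t = R_{t+1}^{-1} G' C_t
        h = m[t] + (lam[t + 1] - m[t] @ G) @ B
        H = C[t] - B.T @ R[t + 1] @ B
        lam[t] = rng.multivariate_normal(h, _sym(H))
    return lam
```

The first q columns of the returned array give a draw of the centred contributions (add ξ_0 to recover α), and the remaining p columns give a draw of η.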
Note that the likelihood (6) is invariant with respect to changes in scale of A or P (even
after identifiability conditions C1-C2 are taken into account), and the parameters A (and so
$\xi$ and U) and P are identified except for multiplication by a diagonal matrix (consisting of
scale constants), i.e., we would estimate $AD^{-1}$ ($D^{-1}\xi$, $D^{-1}UD^{-1}$) and $DP$ unless we use a
very precise informative prior. As already mentioned, knowing (estimating) P up to a
normalizing constant fulfills the objective of receptor modeling. It can also be shown that a
scale constant matrix D (although it is unknown and depends on the initial value of the
parameters) does not vary from iteration to iteration within an MCMC run. In this sense
our MCMC scheme is self-consistent, and so the adjustment for the scale constant matrix
does not need to be made at each step. If the scale constant (the matrix D) is ever known
(e.g., the total mass of pollutant particle is known), the adjustment can be directly applied to
the posterior summaries simply by multiplying (or dividing) by D. Care must be taken
though in specifying the initial values for the parameters or hyperparameters for the prior
distributions to ensure that at least they are approximately on the same scale or in a
consistent fashion (e.g., $\xi$, hyperparameters for U, and initial value for A or P).
Finally, the posterior probability statements can directly be made on the identifiable
quantities such as the normalized P or the scaled matrix of U (i.e., the correlation matrix of
A) as discussed in Besag et al. (1995).
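The scale invariance can be verified numerically: rescaling A and P by a diagonal matrix D leaves the fitted product AP and the normalized profiles unchanged. The values below are arbitrary toy numbers chosen for illustration, and `normalize_rows` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.abs(rng.normal(5.0, 1.0, size=(50, 3)))      # toy contributions, n x q
P = np.abs(rng.normal(1.0, 0.3, size=(3, 6)))       # toy unnormalized profiles, q x p
D = np.diag([2.0, 0.5, 10.0])                       # arbitrary positive scale constants

A_scaled, P_scaled = A @ np.linalg.inv(D), D @ P
same_fit = np.allclose(A @ P, A_scaled @ P_scaled)  # the product AP is unchanged

def normalize_rows(P):
    """Rescale each source profile (row of P) to sum to 1."""
    return P / P.sum(axis=1, keepdims=True)

same_profiles = np.allclose(normalize_rows(P), normalize_rows(P_scaled))
```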
Remark 1. When $\alpha_t$ and $\varepsilon_t$ are assumed to be independent, it can easily be shown that
under a normal prior distribution $\alpha_t \sim N_q(\xi_0, \Xi_0)$, the full conditional distribution for
$\alpha_t$, $\pi(\alpha_t \mid \cdots)$, is a normal distribution through conjugacy, i.e.,
$$ \alpha_t \mid \cdots \sim N_q\left( \left( y_t \Sigma_\varepsilon^{-1} P' + \xi_0 \Xi_0^{-1} \right) \left( P \Sigma_\varepsilon^{-1} P' + \Xi_0^{-1} \right)^{-1},\ \left( P \Sigma_\varepsilon^{-1} P' + \Xi_0^{-1} \right)^{-1} \right) $$
where $\Sigma_\varepsilon = \mathrm{cov}(\varepsilon_t) = \mathrm{diag}(\sigma_{\varepsilon 1}^2, \ldots, \sigma_{\varepsilon p}^2)$. This can be updated using a Gibbs sampler,
and with moves (a) and (b) where $y_j - \eta_j$ and $\sigma_j^2$ are replaced by $y_j$ and $\sigma_{\varepsilon j}^2$, respectively,
it completes one cycle of MCMC when the observations are treated as independent. In
Section 4, this approach is also compared to our time series approach when the observations
are actually dependent.
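Because the posterior covariance in Remark 1 does not depend on t, all n contribution vectors can be drawn in one vectorized step. A minimal sketch under the formula as reconstructed above; the function name `draw_alpha_independent` is hypothetical.

```python
import numpy as np

def draw_alpha_independent(Y, P, sigma2_eps, xi0, Xi0, rng):
    """Gibbs draw of every alpha_t when observations are treated as independent.

    Y : n x p data, P : q x p profiles, sigma2_eps : length-p error variances,
    xi0, Xi0 : prior mean (length q) and covariance (q x q) of alpha_t.
    """
    n, q = Y.shape[0], P.shape[0]
    Xi0_inv = np.linalg.inv(Xi0)
    PS = P / sigma2_eps                            # P Sigma_eps^{-1}, q x p
    cov = np.linalg.inv(PS @ P.T + Xi0_inv)        # shared posterior covariance
    means = (Y @ PS.T + xi0 @ Xi0_inv) @ cov       # row t is the mean for alpha_t
    L = np.linalg.cholesky(cov)
    return means + rng.standard_normal((n, q)) @ L.T
```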
4. SIMULATION
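The data are generated by the model (3) with p = 7, n = 200, q = 3, $\sigma_1^2 = \cdots = \sigma_7^2 = 3$,
$\phi_1 = \phi_2 = \phi_3 = 0.8$, $\xi_0 = (10, 12, 14)$, $U = \sigma_u^2 \cdot I_{3 \times 3}$ where $\sigma_u^2 = 3$,
$\theta_1 = \cdots = \theta_7 = 0.7$, and $V = \sigma_\upsilon^2 \cdot I_{7 \times 7}$ where $\sigma_\upsilon^2 = 3$. The initial values of
$\alpha$ and $\eta$ are given by
$$ \alpha_{1k} = \xi_{0k} + \sqrt{\frac{\sigma_u^2}{1 - \phi_k^2}}\; Z_k, \quad Z_k \sim N(0, 1),\ k = 1, 2, 3, \qquad \eta_{1j} = \sqrt{\frac{\sigma_\upsilon^2}{1 - \theta_j^2}}\; Z_j, \quad j = 1, \ldots, 7, $$
respectively. The true source composition matrix $P_0$ (normalized to sum to 1) is given in
Table 2. It follows from (4) and (5) that $W = 8.333 \cdot I_{3 \times 3}$ and $M = 5.882 \cdot I_{7 \times 7}$.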
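The data-generating setup above can be sketched as follows. Table 2 is not reproduced in this text, so the profile matrix `P_demo` below is a random placeholder and not the paper's $P_0$; the function name and argument names are likewise assumptions for illustration.

```python
import numpy as np

def simulate_receptor_series(P, xi, phi, U_var, theta, V_var, sigma2, n, rng):
    """Simulate Y from model (3) with diagonal U = U_var * I and V = V_var * I."""
    q, p = P.shape
    # Start both AR(1) processes from their stationary distributions.
    gamma = np.sqrt(U_var / (1.0 - phi ** 2)) * rng.standard_normal(q)
    eta = np.sqrt(V_var / (1.0 - theta ** 2)) * rng.standard_normal(p)
    A, Y = np.zeros((n, q)), np.zeros((n, p))
    for t in range(n):
        if t > 0:
            gamma = gamma * phi + np.sqrt(U_var) * rng.standard_normal(q)
            eta = eta * theta + np.sqrt(V_var) * rng.standard_normal(p)
        A[t] = xi + gamma
        Y[t] = A[t] @ P + eta + np.sqrt(sigma2) * rng.standard_normal(p)
    return Y, A

phi, theta = np.full(3, 0.8), np.full(7, 0.7)
W_diag = 3.0 / (1.0 - phi ** 2)          # stationary variances of alpha, from (4)
M_diag = 3.0 / (1.0 - theta ** 2)        # stationary variances of eta, from (5)

rng = np.random.default_rng(0)
P_demo = rng.dirichlet(np.ones(7), size=3)   # placeholder profiles, NOT Table 2
Y, A = simulate_receptor_series(P_demo, np.array([10.0, 12.0, 14.0]),
                                phi, 3.0, theta, 3.0, 3.0, 200, rng)
```

`W_diag` and `M_diag` evaluate to 3/0.36 and 3/0.51, which agree with the values 8.333 and 5.882 quoted in the text.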
In implementing MCMC, we take $\alpha = 3$ and $\beta = 8$ for the prior on $\sigma_j^2$, $j = 1, \ldots, 7$
(yielding the prior mean 4), $m_0 = 7$ and $\Psi_0 = 9 \cdot I_{3 \times 3}$ for the prior on U (yielding the prior
mean $3 I_{3 \times 3}$), and set the scale matrix for the prior on V equal to $9 I_{7 \times 7}$ and the degrees of
freedom equal to 11 (yielding the prior mean $3 I_{7 \times 7}$), each ensuring a proper but relatively
diffuse prior. We use a noninformative prior distribution for the nonzero elements of P
throughout simulation.
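The quoted prior means follow from the standard mean formulas for the inverse gamma and inverse Wishart distributions (valid for shape above 1 and degrees of freedom above dimension plus 1, respectively). The arithmetic is spelled out below as a check; the symbol $m_V$ for the degrees of freedom of the prior on V is introduced here.

```latex
% Inverse gamma: \sigma_j^{-2} \sim \Gamma(\alpha, \beta) with rate \beta
E(\sigma_j^2) = \frac{\beta}{\alpha - 1} = \frac{8}{3 - 1} = 4,
% Inverse Wishart of dimension q with scale \Psi_0 and m_0 degrees of freedom
E(U) = \frac{\Psi_0}{m_0 - q - 1} = \frac{9\, I_{3 \times 3}}{7 - 3 - 1} = 3\, I_{3 \times 3},
\qquad
E(V) = \frac{9\, I_{7 \times 7}}{m_V - p - 1} = \frac{9\, I_{7 \times 7}}{11 - 7 - 1} = 3\, I_{7 \times 7}.
```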
The posterior summaries for the model parameters, P, Σ, Φ, U , Θ, and V, based on
2,000 values subsampled from 20,000 iterations following a 20,000 burn-in period are
reported in Tables 3-5. For the source composition matrix P and the variance matrix U,
those summaries are obtained in terms of normalized P (sum to 1) and the scaled variance
matrix $R_U$ (the correlation matrix) since they are identified only up to a constant multiplier.
{Tables 3-5 about here}
We also report the posterior summaries obtained from the approach for independent
observations (see Remark 1) in Table 6. Since this approach does not decompose the error
variances into Σ and M, we treat the estimates of the error variances as the estimates for
$\Sigma_\varepsilon = \mathrm{diag}(\sigma_{\varepsilon 1}^2, \ldots, \sigma_{\varepsilon p}^2) = \Sigma + M$. The prior mean and the covariance matrix of $\alpha_t$ are
set to be $\xi_0 = (10, 12, 14)$ and $\Xi_0 = 100 \cdot I_{3 \times 3}$, respectively, and the hyperparameters of
the priors on $\sigma_{\varepsilon j}^2$ $(j = 1, \ldots, 7)$ are taken as $\alpha = 4$ and $\beta_j = 27$, $j = 1, \ldots, 7$ (yielding the
prior mean 9). The results are based on a posterior sample of size 2,000 obtained by
subsampling every 10th from 20,000 values following a 20,000 burn-in period.
{Table 6 about here}
By comparing Table 3 and Table 6, it can be noted that the approach accounting for
dependence in the data yields much better results in terms of posterior inferences than the
approach not accounting for dependence. In Table 3 only 2 of the 15 (nonzero) elements of
$P_0$ lie outside the 95% credible intervals (all are within the 99% credible intervals though we
do not report them in the table) whereas in Table 6 ten elements of $P_0$ fall outside the 95%
credible intervals (9 of them are not captured even by the 99% credible intervals).
Simultaneous credible regions for the whole matrix $P_0$ can also be constructed using the
method (based on order statistics) suggested in Besag et al. (1995). Table 3 includes the
80% credible regions and these contain all elements of $P_0$ (the same holds for the 70%
credible regions). In Table 6, nine elements of $P_0$ are still outside the 80% credible regions
(7 of them are not captured even by the 90% credible regions). This is a natural
consequence of not taking the correlation in the errors into account when calculating the
standard errors (posterior standard deviations here). In fact, the posterior standard
deviations in Table 6 are much smaller than they should have been. Figure 4 shows the
side-by-side barplots of the true source compositions ($P_0$) and the posterior mean of P
from two different approaches, the time series approach ($\hat{P}_{ts}$) and the approach ignoring
dependence ($\hat{P}_{indep}$), with $R^2$ values between $P_0$ and the estimates. Again it can be seen that
$\hat{P}_{ts}$ gives a much better approximation to the true source composition matrix $P_0$ than
$\hat{P}_{indep}$ does.
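A simultaneous credible region based on order statistics can be computed directly from the MCMC output. The sketch below is one reading of the construction attributed to Besag et al. (1995), written from the general idea (per-component ranks, then the smallest symmetric pair of order statistics that contains the required fraction of draws in every component at once); the function name and details are assumptions and should be checked against that paper before use.

```python
import numpy as np

def simultaneous_credible_region(samples, level):
    """Order-statistics simultaneous region for an N x m array of posterior draws.

    Returns (lower, upper), each of length m, such that at least
    ceil(level * N) of the draws lie inside the box in every component.
    """
    N, _ = samples.shape
    order = np.sort(samples, axis=0)               # per-component order statistics
    # Ranks 1..N within each component (ties, if any, are broken arbitrarily).
    ranks = np.argsort(np.argsort(samples, axis=0), axis=0) + 1
    # For each draw, the smallest t such that the box between order statistics
    # N + 1 - t and t contains the draw in every component.
    need = np.maximum(ranks.max(axis=1), N + 1 - ranks.min(axis=1))
    k = int(np.ceil(level * N))
    t_star = np.sort(need)[k - 1]
    return order[N - t_star], order[t_star - 1]
```

Applied to the 2,000 retained draws of the nonzero elements of the normalized P (flattened to one row per draw), `level=0.8` would give a box comparable in spirit to the 80% regions reported in Table 3.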
5. APPLICATION TO ATLANTA DATA
The 1990 Atlanta data described in Section 1 has two types of temporal dependence
structure, correlation in $\alpha$ and correlation in $\varepsilon$ (see Figures 2 and 3). We use model (3)
with q = 3 to analyze this data set consisting of 538 measurements on 9 chemical species.
For identifiability conditions, zeros are preassigned for CyHx+2MHx (cyclohexane+2-methylhexane)
and 2,3-DMP (2,3-dimethylpentane) of source 1 (Roadway), acetylene and
propene of source 2 (Gasoline), and acetylene and 2,2,4-TMP (2,2,4-trimethylpentane) of source
3 (Headspace), since the relative concentrations of those species in each source are observed
to be very low in Table 1. An OLS estimate $\hat{A}_{OLS} = Y P_{measured}' \left( P_{measured} P_{measured}' \right)^{-1}$, where
$P_{measured}$ is the matrix of measured source compositions (with zeros preassigned and each row
normalized to sum to 100), was used as an initial value for A. The mean source contribution
was set to $\xi_0 = (3.7, 1.4, 0.3)$, which is the arithmetic mean of $\hat{A}_{OLS}$. Note that the
specification of the value of $\xi_0$ is somewhat arbitrary due to the scale invariance property
mentioned in Section 3. We only need to ensure that $\xi_0$ and the initial value of A are on the
same scale. Since the measured source compositions ($P_{measured}$) can be regarded as prior
information, we use as a prior distribution for P a truncated singular normal distribution
with the mean $P_{measured}$ and the variance 900 for the nonzero elements of P, which ensures a
fairly vague prior (the elements of $P_{measured}$ have values between 0 and 100). The scale
matrix for an inverse Wishart distribution for U was set to $\Psi_0 = 16 \cdot \mathrm{diag}(1, 0.7, 0.08)$
with the degrees of freedom $m_0 = 20$, yielding the prior mean of
$\Psi_0 / 16 = \mathrm{diag}(1, 0.7, 0.08)$. This choice of the hyperparameter values was made to
ensure that the prior distribution is moderately informative but flexible enough to cover the
range of possible values of U. For the hyperparameters of the priors on $\sigma_j^2$, $j = 1, \ldots, 9$, we
take $\alpha = 5$ and $\beta_j = 48$ (the prior mean 12), and for the hyperparameters of the prior on V we
set the scale matrix equal to $27 I_p$ and the degrees of freedom equal to 13 (so that the prior
mean is $9 I_p$), ensuring a proper but relatively diffuse prior. For each parameter, a
posterior sample of size 1,000 was obtained by subsampling every 10th from 10,000 values
following a 10,000 burn-in period. Tables 7-9 contain posterior summaries for some model
parameters.
{Tables 7 and 8 about here}
The AR coefficients $\phi_k$ are estimated to be $\hat{\phi}_1 = .78$, $\hat{\phi}_2 = .68$, and $\hat{\phi}_3 = .48$, respectively,
suggesting that there is substantial autocorrelation in roadway contribution and moderate
autocorrelation in gasoline contribution and headspace contribution.
The side-by-side barplots of the measured source compositions (in Table 1) and
estimated compositions are given in Figure 5 with $R^2$ values between measured and
estimated compositions. In general, there seems to be good agreement between them.
{Figure 5 about here}
As mentioned in Section 1, the measured compositions are not the true source
compositions in the sense of Section 4 for the data though they are expected to be generally
close to the true compositions. For the Headspace composition profile (for which the
measured and the estimated compositions show the best agreement), all but one (2MPentan)
of the measured values fall in the 99% credible intervals. The 80% simultaneous credible
regions (constructed by the method of Besag et al. 1995) are also reported in Table 7 and
these capture all of the measured Headspace composition.
6. CONCLUSIONS AND DISCUSSION
In this article we develop a time series extension of multivariate receptor modeling in order
to capture in the estimation process extra variability due to temporal dependence in air
pollution data. Recent developments in MCMC methodology make estimation of
parameters of complex models possible. By modeling the dependence structure, we can get
more reliable estimates for the source compositions and their uncertainties, which are of our
primary interest. As a by-product we can assess the amount of variability and
autocorrelation in the source contributions and the errors. It also makes it possible to
forecast the level of pollutants ($y_{t+k}$) and the amount of pollution ($\alpha_{t+k}$); the inability to
produce such forecasts has been regarded as one of the model limitations in previous
receptor modeling approaches (see the
EPA discussion at http://www.epa.gov/oar/oaqps/pams/analysis/receptor/rectxtsac.html).
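As an illustration of the forecasting remark, the AR(1) structure of model (3) gives k-step-ahead conditional means in closed form. The expressions below are a sketch derived here from (3) under the assumption that the parameters and the current states are known; in practice they would be averaged over posterior draws.

```latex
% k-step-ahead conditional means implied by the evolution equations in (3)
E(\alpha_{t+k} \mid \alpha_t) = \xi + (\alpha_t - \xi)\, \Phi^{k},
\qquad
E(\eta_{t+k} \mid \eta_t) = \eta_t\, \Theta^{k},
% and hence, from the observation equation,
E(y_{t+k} \mid \alpha_t, \eta_t) = \left\{ \xi + (\alpha_t - \xi)\, \Phi^{k} \right\} P + \eta_t\, \Theta^{k}.
```

As k grows, both correction terms vanish because the AR coefficients are below 1 in absolute value, and the forecast reverts to the stationary mean $\xi P$.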
Throughout the article we assume that the errors are normally distributed.
Environmental data often contain many outliers, and it is sometimes more appropriate to use
the lognormal distribution to describe the data. The usual transformation technique does
not help especially in the context of receptor modeling. By log-transforming the data the
chemical mass balance equation of the model no longer applies directly, and we need to deal
with model identifiability using different conditions. Alternatively, we may consider a
multivariate t-distribution or a mixture of normal distributions to describe the error
distribution. In the application to Atlanta data, the histogram of the residuals for each
species looks in general bell shaped, but shows a few outliers for some of the species. This
might suggest a use of heavy-tailed distribution for errors though it was not pursued further
in this article. Non-normal dynamic modeling is still an active research area (see West and
Harrison 1997), and we expect that multivariate receptor modeling can be extended further
using non-normal dynamic models.
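One standard way to see how the two alternatives mentioned above are related is that a multivariate t-distribution can be written as a scale mixture of normal distributions, which is also what makes it convenient inside a Gibbs sampler. The Python sketch below only demonstrates this representation by simulation; the function name, the degrees of freedom, and the scale matrix are assumptions for illustration and do not come from this article.

```python
import numpy as np

def sample_multivariate_t(mean, scale, df, size, rng):
    """Draw from a multivariate t as a scale mixture of normals.

    If z ~ N(0, scale) and g ~ Gamma(df/2, rate=df/2) independently,
    then mean + z / sqrt(g) follows a multivariate t with df degrees of freedom.
    """
    mean = np.asarray(mean, dtype=float)
    z = rng.multivariate_normal(np.zeros(mean.size), scale, size=size)
    g = rng.gamma(shape=df / 2.0, scale=2.0 / df, size=size)  # chi-square(df) / df
    return mean + z / np.sqrt(g)[:, None]

rng = np.random.default_rng(0)
scale = np.eye(2)
errors = sample_multivariate_t([0.0, 0.0], scale, df=5, size=200_000, rng=rng)

# For df > 2 the covariance is df / (df - 2) times the scale matrix.
print(np.round(np.cov(errors.T), 2))
```

Conditional on the mixing variables `g`, the errors are normal, so the conditional updates of a normal-error sampler could in principle be reused with observation-specific variances.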
Another assumption we have made is that the errors have mean 0. To be more realistic,
it would be preferable to generalize the model to allow errors with unknown nonzero means,
corresponding to unknown sources. This again involves the development of new
identifiability conditions.
Finally, air pollution data are often obtained from multiple receptors. How to incorporate
spatial variability as well as temporal variability in modeling when multiple species are
measured is a challenging problem. Even in the case of no temporal dependence, this
problem remains open.
REFERENCES
Anderson, T.W. (1984), An Introduction to Multivariate Statistical Analysis (2nd ed.), New
York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), “Bayesian Computation and
Stochastic Systems,” Statistical Science, 10, 3-41.
Chib, S., and Greenberg, E. (1995), “Understanding the Metropolis-Hastings Algorithm,”
American Statistician, 49, 331-335.
Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. (1996), Markov Chain Monte Carlo in
Practice, London: Chapman & Hall.
Gleser, L.J. (1997), “Some Thoughts on Chemical Mass Balance Models,” Chemometrics
and Intelligent Laboratory Systems, 37, 15-22.
Henry, R.C. (1991), “Multivariate Receptor Models,” in Receptor Modeling for Air
Quality Management (ed. P. Hopke), pp.117-147. Amsterdam: Elsevier.
Henry, R.C. (1997), “History and Fundamentals of Multivariate Air Quality Receptor
Models,” Chemometrics and Intelligent Laboratory Systems, 37, 37-42.
Henry, R.C., and Kim, B.M. (1990), “Extension of Self-Modeling Curve Resolution to
Mixtures of More than Three Components. Part 1. Finding the Basic Feasible
Region,” Chemometrics and Intelligent Laboratory Systems, 8, 205-216.
Henry, R.C., Lewis, C.W., and Collins, J.F. (1994), “Vehicle-Related Hydrocarbon Source
Compositions from Ambient Data: The GRACE/SAFER Method,” Environmental Science
and Technology, 28, 823-832.
Henry, R.C., Lewis, C.W., and Hopke, P.K. (1984), “Review of Receptor Model
Fundamentals,” Atmospheric Environment, 18, 1507-1515.
Hopke, P.K. (1985), Receptor Modeling in Environmental Chemistry, New York: Wiley.
Hopke, P.K. (1991), “An Introduction to Receptor Modeling,” Chemometrics and
Intelligent Laboratory Systems, 10, 21-43.
Hopke, P.K. (1997), “The Chemical Mass Balance as a Multivariate Calibration Problem,”
Chemometrics and Intelligent Laboratory Systems, 37, 5-14.
Park, E. S. (1997), “Multivariate Receptor Modeling from a Statistical Science Viewpoint,”
unpublished Ph.D. dissertation, Texas A&M University, Dept. of Statistics.
Park, E. S., Spiegelman, C. H., and Henry, R. C. (1999), “Bilinear Estimation of Pollution
Source Profiles in Receptor Models,” Technical Report 006, University of
Washington, National Research Center for Statistics and the Environment.
Press, S. J. (1982), Applied Multivariate Analysis: Using Bayesian and Frequentist
Methods of Inference (2nd edition). New York: Krieger.
Tierney, L. (1994), “Markov Chains for Exploring Posterior Distributions,” Annals of
Statistics, 22, 1701-1762.
West, M., and Harrison, J. (1997), Bayesian Forecasting and Dynamic Models (2nd ed.), New
York: Springer-Verlag.
Yang, H. (1994), “Confirmatory Factor Analysis and its Application to Receptor
Modeling,” unpublished Ph.D. dissertation, University of Pittsburgh, Dept. of
Mathematics and Statistics.
TABLES
Table 1. Measured source composition profiles
Source     acetylene  propene  nButane  2MPentan  3MPentan  benzene  CyHx+2MHx  2,3-DMP  2,2,4-TMP
roadway    0.181      0.094    0.197    0.116     0.069     0.132    0.049      0.043    0.120
gasoline   0          0.002    0.197    0.221     0.138     0.108    0.116      0.067    0.152
headspace  0          0.007    0.685    0.144     0.075     0.034    0.021      0.014    0.021
Note: Each source profile is normalized to sum to one.
Table 2. True source composition profiles ($P_0$)
Species   1      2      3      4      5      6      7
Source 1  0      0.248  0      0.102  0.306  0.128  0.216
Source 2  0.242  0      0.266  0      0.009  0.044  0.440
Source 3  0.311  0.250  0.039  0.302  0      0.099  0
Note: Each source profile is normalized to sum to one.
Table 3. Summaries of the posterior distribution for $P$ when the data are generated by model (3)
and the approach accounting for dependence is used
Param.    Statistic  j=1     j=2    j=3    j=4    j=5     j=6    j=7
$P_{1j}$  Mean       0       0.234  0      0.087  0.339*  0.124  0.216
          SD         0       0.018  0      0.023  0.016   0.013  0.033
          LSCR       0       0.191  0      0.025  0.299   0.088  0.137
          LCI        0       0.205  0      0.049  0.313   0.101  0.160
          UCI        0       0.262  0      0.124  0.306   0.147  0.269
          USCR       0       0.279  0      0.145  0.378   0.158  0.293
$P_{2j}$  Mean       0.204*  0      0.253  0      0.044   0.043  0.456
          SD         0.026   0      0.017  0      0.029   0.013  0.016
          LSCR       0.137   0      0.214  0      0.001   0.009  0.416
          LCI        0.157   0      0.225  0      0.004   0.021  0.430
          UCI        0.241   0      0.282  0      0.100   0.065  0.484
          USCR       0.256   0      0.295  0      0.127   0.075  0.502
$P_{3j}$  Mean       0.298   0.264  0.029  0.304  0       0.106  0
          SD         0.009   0.010  0.011  0.009  0       0.008  0
          LSCR       0.278   0.237  0.003  0.284  0       0.085  0
          LCI        0.284   0.247  0.011  0.290  0       0.093  0
          UCI        0.313   0.279  0.046  0.319  0       0.118  0
          USCR       0.320   0.288  0.056  0.328  0       0.126  0
Note: 1. SD stands for the posterior standard deviation; 2. LCI and UCI stand for the lower limit and upper limit of the 95%
credible interval; 3. Asterisk (*) indicates that the true parameter value is not captured by the 95% credible interval;
4. LSCR and USCR stand for the lower limit and upper limit of the 80% simultaneous credible region.
Table 4. Posterior means and standard deviations of $\Phi$ and $R_U$ (correlation matrix corresponding to $U$)
when the data are generated by model (3) and the approach accounting for dependence is used
       $\phi_k$        Correlations in $R_U$
k = 1  0.826 (0.044)   1
k = 2  0.834 (0.042)   0.010 (0.133)    1
k = 3  0.817 (0.040)   0.245 (0.108)*   -0.141 (0.102)   1
Note: 1. Posterior standard deviation is given in parentheses; 2. Asterisk (*) indicates
that the true parameter value is not captured by the 95% credible interval.
Table 5. Posterior means and standard deviations of $\Theta$, $V$, and $\Sigma$ when the data are generated by model (3)
and the approach accounting for dependence is used
       $\theta_j$       Diagonal elements of $V$   $\sigma_j^2$
j = 1  0.379 (0.194)*   2.463 (1.295)              3.823 (1.238)
j = 2  0.628 (0.178)    2.777 (1.304)              2.908 (1.002)
j = 3  0.836 (0.100)    2.030 (0.924)              4.368 (1.010)
j = 4  0.801 (0.102)    2.470 (1.127)              4.072 (1.077)
j = 5  0.539 (0.207)    2.634 (1.431)              4.252 (1.509)
j = 6  0.609 (0.121)    2.485 (0.950)              3.279 (0.921)
j = 7  0.650 (0.191)    2.496 (1.457)              2.547 (1.029)
Note: 1. Posterior standard deviation is given in parentheses; 2. Asterisk (*) indicates that the true
parameter value is not captured by the 95% credible interval.
Table 6. Summaries of the posterior distribution for the parameters $P$ and $\Sigma_\varepsilon$ when the data are
generated by model (3) but the approach ignoring dependence (given in Remark 1) is used
Param.                               Statistic  j=1     j=2     j=3     j=4     j=5     j=6     j=7
$P_{1j}$                             Mean       0       0.214*  0       0.084   0.339*  0.125   0.239
                                     SD         0       0.014   0       0.014   0.011   0.008   0.022
                                     LSCR       0       0.180   0       0.050   0.314   0.105   0.189
                                     LCI        0       0.190   0       0.060   0.322   0.112   0.205
                                     UCI        0       0.236   0       0.106   0.357   0.137   0.277
                                     USCR       0       0.246   0       0.115   0.365   0.144   0.297
$P_{2j}$                             Mean       0.123*  0       0.201*  0       0.154*  0.063*  0.459*
                                     SD         0.012   0       0.008   0       0.011   0.007   0.009
                                     LSCR       0.096   0       0.182   0       0.125   0.045   0.439
                                     LCI        0.104   0       0.187   0       0.136   0.051   0.445
                                     UCI        0.142   0       0.214   0       0.172   0.074   0.474
                                     USCR       0.154   0       0.221   0       0.179   0.080   0.482
$P_{3j}$                             Mean       0.292*  0.282*  0.036   0.286*  0       0.103   0
                                     SD         0.005   0.005   0.007   0.004   0       0.005   0
                                     LSCR       0.281   0.269   0.021   0.276   0       0.092   0
                                     LCI        0.284   0.274   0.026   0.278   0       0.096   0
                                     UCI        0.300   0.291   0.047   0.293   0       0.111   0
                                     USCR       0.304   0.296   0.054   0.297   0       0.115   0
$\sigma_{\varepsilon j}^2 = 8.882$   Mean       5.565*  8.648   10.415  11.375  8.275   7.873   7.255
                                     SD         1.453   1.853   1.403   1.621   2.246   0.840   2.768
Note: 1. SD stands for the posterior standard deviation; 2. LCI and UCI stand for the lower limit and upper limit of the 95%
credible interval; 3. Asterisk (*) indicates that the true parameter value is not captured by the 95% credible interval; 4. LSCR
and USCR stand for the lower limit and upper limit of the 80% simultaneous credible region.
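The asterisks in Tables 3 and 6 can be checked directly from the reported interval limits. As a small worked example, the Python sketch below does this for the free elements of the second source profile, using the true values from Table 2 and the SD, LCI, and UCI entries from Tables 3 and 6. The variable and function names are chosen for this illustration only.

```python
import numpy as np

# Free (nonzero) elements of Source 2, species j = 1, 3, 5, 6, 7 (Table 2).
true_p2 = np.array([0.242, 0.266, 0.009, 0.044, 0.440])

# Table 3: approach accounting for dependence.
dep = {
    "sd":  np.array([0.026, 0.017, 0.029, 0.013, 0.016]),
    "lci": np.array([0.157, 0.225, 0.004, 0.021, 0.430]),
    "uci": np.array([0.241, 0.282, 0.100, 0.065, 0.484]),
}

# Table 6: approach ignoring dependence.
ind = {
    "sd":  np.array([0.012, 0.008, 0.011, 0.007, 0.009]),
    "lci": np.array([0.104, 0.187, 0.136, 0.051, 0.445]),
    "uci": np.array([0.142, 0.214, 0.172, 0.074, 0.474]),
}

def missed(summary, truth):
    """True where the 95% credible interval does not contain the true value."""
    return (truth < summary["lci"]) | (truth > summary["uci"])

print("missed, accounting for dependence:", missed(dep, true_p2))
print("missed, ignoring dependence:      ", missed(ind, true_p2))
print("SD ratio (ignoring / accounting): ", np.round(ind["sd"] / dep["sd"], 2))
```

For this profile the intervals from the approach ignoring dependence miss every free element, while the posterior standard deviations are roughly half of those obtained when dependence is modeled.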
Table 7. Summaries of the posterior distribution for $P$ for the Atlanta data
Param.     Species    acetylene  propene  nButane  2MPentan  3MPentan  benzene  CyHx+2MHx  2,3-DMP  2,2,4-TMP
           j          1          2        3        4         5         6        7          8        9
roadway    Mean       0.275      0.115    0.279    0.086     0.049     0.126    0          0        0.069
           SD         0.008      0.004    0.013    0.004     0.003     0.004    0          0        0.005
           LSCR       0.257      0.107    0.247    0.076     0.042     0.117    0          0        0.057
           LCI        0.257      0.107    0.248    0.076     0.043     0.118    0          0        0.057
           UCI        0.295      0.124    0.305    0.095     0.056     0.135    0          0        0.081
           USCR       0.297      0.125    0.307    0.096     0.056     0.136    0          0        0.081
gasoline   Mean       0          0        0.172    0.191     0.113     0.088    0.123      0.098    0.217
           SD         0          0        0.019    0.005     0.003     0.004    0.005      0.004    0.008
           LSCR       0          0        0.127    0.179     0.104     0.077    0.112      0.089    0.200
           LCI        0          0        0.128    0.180     0.105     0.078    0.112      0.090    0.201
           UCI        0          0        0.214    0.202     0.121     0.097    0.134      0.107    0.236
           USCR       0          0        0.217    0.204     0.122     0.099    0.135      0.107    0.238
headspace  Mean       0          0.009    0.693    0.116     0.063     0.052    0.021      0        0.046
           SD         0          0.007    0.035    0.011     0.007     0.010    0.009      0        0.017
           LSCR       0          0.000    0.606    0.083     0.042     0.028    0.001      0        0.007
           LCI        0          0.001    0.609    0.087     0.045     0.029    0.002      0        0.008
           UCI        0          0.029    0.773    0.142     0.080     0.074    0.044      0        0.088
           USCR       0          0.034    0.776    0.145     0.081     0.076    0.046      0        0.093
Note: 1. SD stands for the posterior standard deviation; 2. LCI and UCI stand for the lower limit and upper limit of the 99% credible interval; 3. LSCR and
USCR stand for the lower limit and upper limit of the 80% simultaneous credible region.
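The agreement between the measured profiles in Table 1 and the posterior means in Table 7 can be summarized by a single number per source. The Python sketch below computes the squared Pearson correlation between the two; that the $R^2$ values shown with Figure 5 are defined this way is an assumption, since the definition is not stated in this section, although the values obtained here are close to them. The dictionary and function names are chosen for this illustration only.

```python
import numpy as np

# Measured source compositions (Table 1), nine species per source.
measured = {
    "roadway":   [0.181, 0.094, 0.197, 0.116, 0.069, 0.132, 0.049, 0.043, 0.120],
    "gasoline":  [0.0,   0.002, 0.197, 0.221, 0.138, 0.108, 0.116, 0.067, 0.152],
    "headspace": [0.0,   0.007, 0.685, 0.144, 0.075, 0.034, 0.021, 0.014, 0.021],
}

# Posterior means of the source compositions (Table 7).
estimated = {
    "roadway":   [0.275, 0.115, 0.279, 0.086, 0.049, 0.126, 0.0,   0.0,   0.069],
    "gasoline":  [0.0,   0.0,   0.172, 0.191, 0.113, 0.088, 0.123, 0.098, 0.217],
    "headspace": [0.0,   0.009, 0.693, 0.116, 0.063, 0.052, 0.021, 0.0,   0.046],
}

def squared_correlation(x, y):
    """Squared Pearson correlation between two profiles."""
    r = np.corrcoef(x, y)[0, 1]
    return float(r**2)

r2 = {s: squared_correlation(measured[s], estimated[s]) for s in measured}
for source, value in r2.items():
    print(f"{source}: {value:.3f}")
```

By this measure the headspace profile is recovered most closely and the gasoline profile least closely.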
Table 8. Posterior means and standard deviations of $\Phi$ and $R_U$ (correlation matrix corresponding to $U$)
for the Atlanta data
       $\phi_k$        Correlations in $R_U$
k = 1  0.775 (0.036)   1
k = 2  0.677 (0.062)   0.207 (0.045)    1
k = 3  0.476 (0.114)   -0.069 (0.051)   -0.049 (0.047)   1
Note: Posterior standard deviation is given in parentheses.
Table 9. Posterior means and standard deviations of $\Theta$, diagonal elements of $V$, and $\Sigma$
for the Atlanta data
Species    $\theta_j$      Diagonal elements of $V$   $\sigma_j^2$
Acetylene  0.512 (0.110)   1.039 (0.243)              1.148 (0.127)
Propene    0.550 (0.066)   0.405 (0.058)              0.506 (0.042)
nButane    0.400 (0.201)   2.929 (1.339)              3.683 (0.751)
2MPentan   0.221 (0.086)   0.520 (0.102)              0.534 (0.045)
3MPentan   0.162 (0.073)   0.280 (0.040)              0.349 (0.026)
Benzene    0.360 (0.092)   0.379 (0.055)              0.501 (0.040)
CyHx+2MHx  0.237 (0.088)   0.341 (0.048)              0.448 (0.036)
2,3-DMP    0.269 (0.086)   0.261 (0.033)              0.360 (0.027)
2,2,4-TMP  0.643 (0.062)   0.681 (0.138)              0.758 (0.070)
Note: Posterior standard deviation is given in parentheses.
Figure Titles and Legends
Figure 1. Autocorrelation function (ACF) plots of $Y$ for Atlanta data
Figure 2. Autocorrelation function (ACF) plots of the residuals for Atlanta data:
$Y - \hat{A}_{OLS} P$, where $P$ is the measured source compositions in Table 1
Figure 3. Autocorrelation function (ACF) plots of source contributions ($\hat{A}_{OLS}$) for Atlanta
data
Figure 4. Side-by-side barplots of the true source compositions ($P_0$) and the estimated
compositions obtained from two different approaches, time series approach and
approach ignoring dependence
Figure 5. Side-by-side barplots of the measured source compositions and the estimated
compositions for the Atlanta data
[Figure 1: nine panels of ACF (vertical axis, -0.5 to 1) against lag (0 to 30), one per species: acetylene, propene, nButane, 2MPentan, 3MPentan, benzene, CyHx+2MHx, 2,3-DMP, 2,2,4-TMP.]
[Figure 2: nine panels of ACF (vertical axis, -0.5 to 1) against lag (0 to 30), one per species: acetylene, propene, nButane, 2MPentan, 3MPentan, benzene, CyHx+2MHx, 2,3-DMP, 2,2,4-TMP.]
[Figure 3: three panels of ACF (vertical axis, -0.5 to 1) against lag (0 to 30): roadway contribution, gasoline contribution, headspace contribution.]
[Figure 4: barplots of estimated and true compositions for species 1 to 7, with panels for Source 1, Source 2, and Source 3 under each approach. Time series approach: $R^2$ = .99, .98, .99. Approach ignoring dependence: $R^2$ = .97, .78, .99.]
[Figure 5: barplots of estimated and measured compositions for species 1 to 9, with panels for Roadway ($R^2$ = .91), Gasoline ($R^2$ = .84), and Headspace ($R^2$ = .99).]