All content in this area was uploaded by Ronald C Henry on Feb 24, 2016
Multivariate Receptor Modeling for Temporally
Correlated Data by Using MCMC
Eun Sug Park Peter Guttorp Ronald C. Henry
NRCSE
Technical Report Series
NRCSE-TRS No. 043
March 9, 2000
The NRCSE was established in 1996 through a cooperative agreement with the United States
Environmental Protection Agency which provides the Center's primary funding.
Multivariate Receptor Modeling for Temporally Correlated Data
by Using MCMC
Eun Sug Park$^1$, Peter Guttorp$^1$, and Ronald C. Henry$^2$
$^1$ National Research Center for Statistics and the Environment
University of Washington
Seattle, WA 98195
$^2$ Civil and Environmental Engineering
University of Southern California
Los Angeles, CA 90089.
Author’s Footnote
Eun Sug Park is Research Associate, National Research Center for Statistics and the
Environment, University of Washington, Seattle, WA 98195. Peter Guttorp is Professor
of Statistics and Director of the National Research Center for Statistics and the
Environment, University of Washington, Seattle, WA 98195. Ronald C. Henry is
Associate Professor of Civil and Environmental Engineering, University of Southern
California, Los Angeles, CA 90089. Although the research described in this article has
been funded by the United States Environmental Protection Agency through agreement
CR825173-01-0 to the University of Washington, it has not been subjected to the
Agency’s required peer and policy review and therefore does not necessarily reflect the
views of the Agency and no official endorsement should be inferred.
Abstract and Key Words
Multivariate receptor modeling aims to estimate pollution source profiles and the
amounts of pollution based on a series of ambient concentrations of multiple chemical
species over time. Air pollution data often show temporal dependence due to
meteorology and/or background sources. Previous approaches to receptor modeling do
not incorporate this dependence. We model dependence in the data using a time series
approach so that we can incorporate extra sources of variability in parameter estimation
and uncertainty estimation. We estimate parameters using the Markov chain Monte Carlo
method, which makes simultaneous estimation of parameters and uncertainties possible.
The methods are applied to simulated data and 1990 Atlanta air pollution data. The
results show promise towards the goal of accounting for the dependence in the data.
KEY WORDS: Dynamic models; Kalman filter; Gibbs sampler; Metropolis-Hastings
algorithm; Compositions; Air pollution.
1. INTRODUCTION
The goal of receptor modeling is to identify the pollution sources and assess the amounts of
pollution based on observations collected at a particular site, and from that information to
develop an effective air quality management plan. The basic mathematical model can be
written as follows based on chemical mass balance assumptions (see, e.g., Hopke, 1985,
1991, 1997; Gleser 1997):
$$y_t = \sum_{k=1}^{q} \alpha_{tk} P_k + \varepsilon_t, \qquad t = 1, \ldots, n \qquad (1)$$
where $y_t = (y_{t1}, y_{t2}, \ldots, y_{tp})$ is the tth observation, q is the number of sources,
$P_k = (p_{k1}, p_{k2}, \ldots, p_{kp})$ is the kth source composition (consisting of the fractional amount of
each species in the emissions from the kth source), $\alpha_{tk}$ is the contribution from the kth
source on the tth day, and $\varepsilon_t = (\varepsilon_{t1}, \varepsilon_{t2}, \ldots, \varepsilon_{tp})$ is the measurement error associated with the
tth observation. In matrix terms, the model (1) can be written as
$$Y = AP + E \qquad (2)$$
where A is the n × q source contribution matrix, P is the q × p source composition matrix, and E is
the n × p error matrix. The model (1) may be viewed as a factor analysis model in the sense that
Y is the only observable quantity while q, P, and A are all unknown quantities that need to
be estimated (or predicted). Early approaches to multivariate receptor modeling include
exploratory factor analysis, principal component analysis, target transformation factor
analysis, and others (see, e.g., Henry 1991). It is well known that, without imposing
additional constraints on the parameters, the factor analysis model is not identifiable even
with known number of sources, q. There have been several attempts to avoid this problem
by imposing more restrictive constraints on either the P or the A matrix (see Henry and Kim
1990; Henry, Lewis, and Collins 1994; Yang 1994; Park 1997). As a matter of fact, there
could be many different sets of identifiability conditions, each making sense in its own
context. Park, Spiegelman, and Henry (1999) discuss identifiability conditions that are
meaningful in receptor models.
The assumption of independence among the observations $y_t$ has been made either
implicitly or explicitly in all previous approaches to multivariate receptor modeling, see, for
instance, Hopke (1991), Henry (1991), Yang (1994), Gleser (1997), Park (1997), and Park
et al. (1999). Air pollution data, however, are usually obtained as a series of measurements
on concentrations of aerosols over time, and meteorology often induces some degree of
dependence in the data. Observations closer in time tend to be more correlated than
observations farther apart in time (e.g., Figure 1).
{Insert Figure 1}
In some cases the assumption of independence may not be grossly wrong because
environmental data usually contain many missing values or erroneous observations, and
after initial screening of the data, time separation between any pair of measurements may
become large enough so that serial correlation can be ignored in the screened data. This, of
course, is not always the case. The research in this paper was motivated by a 1990 Atlanta
air pollution composition data set consisting of hourly measurements of volatile
hydrocarbon (VHC) species. This data set was used in Henry et al. (1994) to derive
vehicle-related hydrocarbon source compositions from the ambient data. In that study, three
types of measured source profiles specific to Atlanta in the summertime of 1990 were also
available: roadway emissions, whole gasoline, and gasoline headspace (see Henry et al.
1994). The compositions of those three sources for nine selected vehicle-related species are
provided in Table 1.
{Insert Table 1}
It is worthwhile to mention that those direct source measurements were obtained, under
rather restricted conditions, independently of the data (e.g., roadway compositions were
obtained as highway tunnel measurements during morning rush hour). Thus it is not
unlikely that the measured source compositions could be different from the true source
compositions $P_0$ for the data due to pollutant transport (between source and receptor) and
reactions (and also to measurement errors, variations in source compositions, and the
contribution of minor sources). Nonetheless, the measured source compositions may serve
as a guideline for the true source compositions.
Assuming that the measured compositions in Table 1 are the true source
compositions, i.e., P is known in model (2), A can be estimated easily, for instance, as an
ordinary least squares (OLS) solution, $\hat{A}_{OLS} = Y P'(P P')^{-1}$, if we ignore dependence
structure in the data (and vice versa, i.e., $\hat{P}_{OLS} = (A'A)^{-1} A'Y$ if A is known or estimated first.
This was done in almost all previous works without checking the independence
assumption). Figures 1 and 2 show the autocorrelation function (ACF) plot of the raw data
Y and residuals calculated as $Y - \hat{A}_{OLS} P$ for each of nine species, respectively.
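As a rough illustration of this diagnostic, the following sketch computes the OLS contributions for a known P and the sample ACF of a series. It assumes NumPy; the function names `ols_contributions` and `acf`, the 2 × 4 profile matrix, and the simulated contributions are hypothetical and are not taken from the Atlanta data.

```python
import numpy as np

def ols_contributions(Y, P):
    """Return A_hat = Y P' (P P')^{-1} for data Y (n x p) and known profiles P (q x p)."""
    # Solve (P P') X = P Y' and transpose, instead of forming the inverse explicitly.
    return np.linalg.solve(P @ P.T, P @ Y.T).T

def acf(x, max_lag):
    """Sample autocorrelation of a 1-D series at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - h], x[h:]) / denom for h in range(max_lag + 1)])

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.3, 0.2, 0.0],
              [0.0, 0.2, 0.3, 0.5]])            # hypothetical source profiles
A = rng.uniform(1.0, 5.0, size=(200, 2))        # hypothetical contributions
Y = A @ P + 0.1 * rng.standard_normal((200, 4)) # data with independent noise

A_hat = ols_contributions(Y, P)
residuals = Y - A_hat @ P
resid_acf = acf(residuals[:, 0], max_lag=10)    # ACF of the residuals for species 1
```

With real data, slowly decaying values in `resid_acf` would be the kind of evidence against the independence assumption that Figures 1 and 2 display.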
{Figure 2 about here}
Figure 3 shows ACF plots of OLS estimates of source contributions, $\hat{A}_{OLS}$.
{Figure 3 about here}
All three plots reveal significant serial correlation in the data. It is well known in time series
literature that in the presence of the correlated residuals, the standard error (not adjusting for
the correlation in the residuals) of OLS estimate of the trend (which may be regarded as P
in our model) in the regression is often grossly wrong. Although the correct standard error
of OLS estimate may be obtained by adjusting for the correlation, it is still not the best
estimate since the generalized least squares estimate, taking the correlation into account in
the estimation procedure, has smaller standard error. The goal of this article is to extend
receptor models to account for temporal dependence in the data so that we can incorporate
that source of variability in estimation of parameters and uncertainties. In Section 2, we
introduce models accounting for time dependence in the observations. Estimation of
parameters is discussed in Section 3. Sections 4 and 5 contain examples from simulated
data and the Atlanta air pollution data, respectively. Finally, concluding remarks are made in
Section 6.
2. MODEL
Assume that the $y_t$ in (1) are dependent. We first need to decide how to model this
dependence. It seems reasonable to assume that the source contribution at time t depends
on the past source contributions (as Figure 3 indicates). Also, it is often the case that $\varepsilon$
contains not only pure measurement error but also all the remaining sources of variability
that are not explained by the systematic part of our model, such as background sources
(unmodeled minor sources) and meteorology. Then it is likely that the $\varepsilon_t$ are also
correlated in time due to the effect of meteorology and unmodeled sources (see Figure 2).
We may decompose $\varepsilon_t$ into two terms, $\varepsilon_t = \eta_t + \delta_t$, where $\eta_t$ represents variability
correlated in time due to meteorology or background sources, and $\delta_t$ represents residual,
unpredictable variability due to pure measurement error, independent over time.
We consider the model
$$y_t = \alpha_t P + \eta_t + \delta_t$$
where $\alpha_t = (\alpha_{t1}, \alpha_{t2}, \ldots, \alpha_{tq})$ is a stationary vector AR(1) process centered at
$\xi = (\xi_1, \xi_2, \ldots, \xi_q)$, $\eta_t = (\eta_{t1}, \eta_{t2}, \ldots, \eta_{tp})$ is a stationary vector AR(1) process centered at 0,
and $\delta_t = (\delta_{t1}, \delta_{t2}, \ldots, \delta_{tp}) \sim N_p(0, \Sigma)$ where $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2)$. We use ‘$N_k(\cdot, \cdot)$’ to
denote the k-dimensional multivariate normal distribution throughout the paper. This model
may be written in Dynamic Linear Model (DLM) form (West and Harrison, 1997) as
Observation equation: $y_t = \alpha_t P + \eta_t + \delta_t, \quad \delta_t \sim N_p(0, \Sigma)$
Evolution equation: $\alpha_t = \xi + (\alpha_{t-1} - \xi)\Phi + u_t, \quad u_t \sim N_q(0, U)$
$\eta_t = \eta_{t-1}\Theta + \upsilon_t, \quad \upsilon_t \sim N_p(0, V)$ (3)
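where $u_t = (u_{t1}, u_{t2}, \ldots, u_{tq})$, $\Phi = \mathrm{diag}(\phi_1, \phi_2, \ldots, \phi_q)$, $\phi_k$ is an AR coefficient for the kth
source contribution, $\upsilon_t = (\upsilon_{t1}, \upsilon_{t2}, \ldots, \upsilon_{tp})$, $\Theta = \mathrm{diag}(\theta_1, \theta_2, \ldots, \theta_p)$, and $\theta_j$ is an AR
coefficient for the jth element of $\eta_t$. Note that the marginal distribution for each $\alpha_t$ is
$$\alpha_t \sim N_q(\xi, W), \qquad W = \Phi W \Phi + U \qquad (4)$$
and for each $\eta_t$ is
$$\eta_t \sim N_p(0, M), \qquad M = \Theta M \Theta + V. \qquad (5)$$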
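The following sketch shows one way to solve the stationarity equations (4) and (5) and to simulate from model (3). It assumes NumPy and the row-vector convention used above; the helper names `stationary_cov` and `simulate` and the profile matrix `P0` are hypothetical (the paper's Table 2 is not reproduced here), while the numerical settings mirror those quoted in Section 4.

```python
import numpy as np

def stationary_cov(ar_coefs, innov_cov):
    """Solve W = Phi W Phi + U for diagonal Phi: W_ij = U_ij / (1 - phi_i * phi_j)."""
    phi = np.asarray(ar_coefs, dtype=float)
    return np.asarray(innov_cov, dtype=float) / (1.0 - np.outer(phi, phi))

def simulate(P, xi, phi, U, theta, V, sigma2, n, rng):
    """Draw Y (n x p), alpha (n x q) and eta (n x p) from model (3)."""
    q, p = P.shape
    W, M = stationary_cov(phi, U), stationary_cov(theta, V)
    alpha, eta = np.empty((n, q)), np.empty((n, p))
    alpha[0] = rng.multivariate_normal(xi, W)            # stationary start
    eta[0] = rng.multivariate_normal(np.zeros(p), M)
    for t in range(1, n):
        # Elementwise products equal right-multiplication by the diagonal Phi, Theta.
        alpha[t] = xi + (alpha[t - 1] - xi) * phi + rng.multivariate_normal(np.zeros(q), U)
        eta[t] = eta[t - 1] * theta + rng.multivariate_normal(np.zeros(p), V)
    delta = rng.normal(0.0, np.sqrt(sigma2), size=(n, p))  # independent measurement error
    return alpha @ P + eta + delta, alpha, eta

rng = np.random.default_rng(0)
P0 = np.array([[0.0, 0.0, 0.1, 0.2, 0.3, 0.2, 0.2],      # hypothetical profiles,
               [0.3, 0.2, 0.0, 0.0, 0.1, 0.2, 0.2],      # rows sum to 1
               [0.2, 0.1, 0.3, 0.2, 0.0, 0.0, 0.2]])
phi, theta = np.full(3, 0.8), np.full(7, 0.7)
U, V = 3.0 * np.eye(3), 3.0 * np.eye(7)
W, M = stationary_cov(phi, U), stationary_cov(theta, V)
Y, alpha, eta = simulate(P0, np.array([10.0, 12.0, 14.0]), phi, U, theta, V,
                         np.full(7, 3.0), 200, rng)
```

For these settings the diagonal entries of `W` and `M` are 3/(1 − 0.64) ≈ 8.333 and 3/(1 − 0.49) ≈ 5.882.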
3. ESTIMATION
As the model gets complicated by inclusion of more parameters, Markov chain Monte Carlo
(MCMC) simulation (Tierney 1994; Chib and Greenberg 1995; Besag, Green, Higdon, and
Mengersen 1995; Gilks, Richardson, and Spiegelhalter 1996) seems to be an attractive
approach for parameter estimation. Note also that the parameters of the models (1) or (3)
are all unknown, and the problem of parameter estimation is essentially nonlinear, but the
Markov chain Monte Carlo method makes the problem linear by use of conditional
distributions. We introduce a Bayesian framework to employ an MCMC method
(constraints and identifiability conditions can be used as a part of the prior distribution). As
mentioned in Section 1, the receptor model can be viewed as a special type of a factor
analysis model (with the constraints that the elements of factor loading matrix should be all
nonnegative). For identifiability of the model we borrow conditions from the confirmatory
factor analysis model (Anderson 1984).
C1. There are at least q −1 zero elements in each row of P,
C2. The rank of $P^{(k)}$ is q − 1, where $P^{(k)}$ is the matrix composed of the columns
containing the assigned 0’s in the kth row with those assigned 0’s deleted.
Under the above conditions the source profiles, P, are identified up to normalization, which
is enough for the purpose of receptor modeling. (As long as the relative amount of each
species in a source is determined, a source can be identified.) The conditions C1 and C2
(and nonnegativity constraints on the elements of P) are absorbed into the prior distribution for
P.
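Conditions C1 and C2 can be checked mechanically for a candidate zero pattern. The sketch below assumes NumPy; `check_c1_c2` and the two example matrices are hypothetical illustrations, and the rank is evaluated at the supplied numerical values of P.

```python
import numpy as np

def check_c1_c2(P, zero_mask):
    """Return True if the q x p matrix P with preassigned zeros satisfies C1 and C2."""
    q = P.shape[0]
    for k in range(q):
        cols = np.flatnonzero(zero_mask[k])          # columns with assigned zeros in row k
        if cols.size < q - 1:                        # C1: at least q - 1 zeros per row
            return False
        P_k = np.delete(P[:, cols], k, axis=0)       # P^(k): those columns, row k deleted
        if np.linalg.matrix_rank(P_k) != q - 1:      # C2: rank of P^(k) equals q - 1
            return False
    return True

P_good = np.array([[0.0, 0.0, 0.2, 0.3, 0.4, 0.1],
                   [0.3, 0.1, 0.0, 0.0, 0.2, 0.4],
                   [0.1, 0.4, 0.3, 0.2, 0.0, 0.0]])
# Rows 1 and 2 share the same zero columns, so P^(1) has a zero row and rank 1.
P_bad = np.array([[0.0, 0.0, 0.2, 0.3, 0.5],
                  [0.0, 0.0, 0.4, 0.3, 0.3],
                  [0.1, 0.4, 0.0, 0.0, 0.5]])

ok_good = check_c1_c2(P_good, P_good == 0.0)
ok_bad = check_c1_c2(P_bad, P_bad == 0.0)
```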
Under the normal error assumption on $\delta$, the likelihood $f(Y \mid \cdots)$ is written as
$$f(Y \mid \cdots) = |2\pi\Sigma|^{-n/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\left[\Sigma^{-1}(Y - AP - \eta)'(Y - AP - \eta)\right]\right\} \qquad (6)$$
where $\eta$ is the n × p matrix of which rows are $\eta_t$, $t = 1, \ldots, n$. We use ‘$\mid \cdots$’ to denote
conditioning on all other variables. For a prior distribution $p(\cdot)$, we assume that
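$$p(P, \Sigma, \Phi, U, \alpha_1, \ldots, \alpha_n, \Theta, V, \eta_1, \ldots, \eta_n) = p(P)\,p(\Sigma)\,p(\Phi)\,p(U)\,p(\alpha_1, \ldots, \alpha_n \mid \xi_0, \Phi, U)\,p(\Theta)\,p(V)\,p(\eta_1, \ldots, \eta_n \mid \Theta, V).$$
For the sake of brevity, $\xi$ is assumed known to be $\xi = \xi_0$. Note that (3) implies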
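$$p(\alpha_1, \ldots, \alpha_n \mid \xi_0, \Phi, U) = (2\pi)^{-nq/2}\,|W|^{-1/2} \exp\left(-\tfrac{1}{2}\gamma_1 W^{-1}\gamma_1'\right) |U|^{-(n-1)/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\left[U^{-1}\sum_{t=2}^{n}(\gamma_t - \gamma_{t-1}\Phi)'(\gamma_t - \gamma_{t-1}\Phi)\right]\right\}$$
where $\gamma_t = \alpha_t - \xi_0$ and
$$p(\eta_1, \ldots, \eta_n \mid \Theta, V) = (2\pi)^{-np/2}\,|M|^{-1/2} \exp\left(-\tfrac{1}{2}\eta_1 M^{-1}\eta_1'\right) |V|^{-(n-1)/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\left[V^{-1}\sum_{t=2}^{n}(\eta_t - \eta_{t-1}\Theta)'(\eta_t - \eta_{t-1}\Theta)\right]\right\}.$$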
Based on a series of observations $y_1, \ldots, y_n$, we are interested in sampling the full
posterior $\pi(P, \Sigma, \Phi, U, \alpha_1, \ldots, \alpha_n, \Theta, V, \eta_1, \ldots, \eta_n \mid Y)$. We use the “block-at-a-time” Metropolis-
Hastings algorithm (Chib and Greenberg, 1995). We shall make use of seven move types
in implementing MCMC:
(a) updating P
(b) updating Σ
(c) updating Φ
(d) updating U
(e) updating Θ
(f) updating V
(g) updating $\alpha$ and $\eta$.
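Letting $\tilde{P} = (A'A)^{-1}A'(Y - \eta)$ and $S = (Y - A\tilde{P} - \eta)'(Y - A\tilde{P} - \eta)$, and using the
orthogonality properties associated with $\tilde{P}$ (see Press 1982), (6) can be written as
$$|2\pi\Sigma|^{-n/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\,\Sigma^{-1}S\right\} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\,\Sigma^{-1}(P - \tilde{P})'A'A(P - \tilde{P})\right\}$$
$$\propto \exp\left\{-\tfrac{1}{2}(\mathrm{vec}P - \mathrm{vec}\tilde{P})'(\Sigma^{-1} \otimes A'A)(\mathrm{vec}P - \mathrm{vec}\tilde{P})\right\}.$$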
Let the prior distribution for P be
$$p(P) = p(\mathrm{vec}P) \sim N(m_0, C_0)\, I(P_{kj} \ge 0,\ k = 1, \ldots, q,\ j = 1, \ldots, p)$$
where $m_0$ is a pq-dimensional vector and $C_0$ is a pq × pq-dimensional diagonal matrix.
Enforcing the constraints C1-C2 is equivalent to using a degenerate point prior for some of
the elements of P. We set q × (q − 1) elements of $m_0$ and the corresponding elements of
$C_0$ to be zero, which makes the prior distribution for P a truncated singular normal
distribution (though still proper). Then the resulting full conditional posterior distribution
$\pi(P \mid \cdots)$ is again a truncated singular normal distribution, which can be written as
$$\mathrm{vec}P \mid \cdots \sim N_{pq}(m, C)\, I(P_{kj} \ge 0,\ k = 1, \ldots, q,\ j = 1, \ldots, p)$$
where $m = C\{(\Sigma^{-1} \otimes A')\,\mathrm{vec}(Y - \eta) + C_0^{-} m_0\}$, $C = (\Sigma^{-1} \otimes A'A + C_0^{-})^{-1}$, and $C_0^{-}$ is a
generalized inverse of $C_0$. Since both Σ and $C_0$ are diagonal, for the columns of P with
no zero elements, we have
$$P_j \mid \cdots \sim N_q(m_j, C_j)\, I(P_{kj} \ge 0,\ k = 1, \ldots, q)$$
where $m_j = C_j\{\sigma_j^{-2} A'(y^{j} - \eta^{j}) + C_{j0}^{-1} m_{j0}\}$, $C_j = (\sigma_j^{-2} A'A + C_{j0}^{-1})^{-1}$, $m_{j0}$ is a q-dimensional
prior mean vector of $P_j$, $C_{j0}$ is a corresponding submatrix of $C_0$, $y^{j}$ is the jth column of Y,
and $\eta^{j}$ is the jth column of $\eta$. For the columns of P containing zero elements, let $q^{*}$ be the
number of nonzero elements for that column and $P_j^{*}$ be a column vector consisting of those
$q^{*}$ elements. Then
$$P_j^{*} \mid \cdots \sim N_{q^{*}}(m_j^{*}, C_j^{*})\, I(P_{kj}^{*} \ge 0,\ k = 1, \ldots, q^{*})$$
where $m_j^{*} = C_j^{*}\{\sigma_j^{-2} A^{*\prime}(y^{j} - \eta^{j}) + C_{j0}^{*-1} m_{j0}^{*}\}$, $C_j^{*} = (\sigma_j^{-2} A^{*\prime}A^{*} + C_{j0}^{*-1})^{-1}$, $m_{j0}^{*}$ is a $q^{*}$-
dimensional prior mean vector of nonzero elements of $P_j$, $C_{j0}^{*}$ is a corresponding submatrix
of $C_0$, and $A^{*}$ consists of the columns of A corresponding to nonzero elements of $P_j$.
If there is no prior information about the source compositions other than the zero elements, we
may use a noninformative prior
$$p(P) = \prod_{k=1}^{q}\prod_{j=1}^{p} I(P_{kj} \ge 0)\, I(P_{kj} = 0,\ j \in J_0)$$
where $J_0$ is the index set for which $P_{kj} = 0$, which takes into account the conditions C1-C2 and
nonnegativity only. Under this prior, we have, for the columns of P with no zero element,
$$P_j \mid \cdots \sim N_q(m_j, C_j)\, I(P_{kj} \ge 0,\ k = 1, \ldots, q)$$
where $m_j = (A'A)^{-1}A'(y^{j} - \eta^{j})$, $C_j = \sigma_j^{2}(A'A)^{-1}$. For the columns of P containing zero
elements, we get
$$P_j^{*} \mid \cdots \sim N_{q^{*}}(m_j^{*}, C_j^{*})\, I(P_{kj}^{*} \ge 0,\ k = 1, \ldots, q^{*})$$
where $m_j^{*} = (A^{*\prime}A^{*})^{-1}A^{*\prime}(y^{j} - \eta^{j})$, $C_j^{*} = \sigma_j^{2}(A^{*\prime}A^{*})^{-1}$.
Hence move (a) can be performed using either a Gibbs sampler or a simple Metropolis-
Hastings algorithm.
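A minimal sketch of move (a) for a single column under the noninformative prior follows. It assumes NumPy and uses plain rejection sampling to enforce nonnegativity, which is one simple choice and not necessarily the sampler used in the paper; the names `column_posterior` and `sample_nonnegative_normal` and all numerical inputs are hypothetical.

```python
import numpy as np

def column_posterior(A, y_col, eta_col, sigma2_j, nonzero_rows):
    """Mean and covariance of the untruncated normal for the nonzero elements of column j."""
    A_star = A[:, nonzero_rows]                       # columns of A for nonzero elements
    gram = A_star.T @ A_star
    m = np.linalg.solve(gram, A_star.T @ (y_col - eta_col))
    C = sigma2_j * np.linalg.inv(gram)
    return m, C

def sample_nonnegative_normal(m, C, rng, max_tries=10_000):
    """Draw from N(m, C) restricted to the nonnegative orthant by rejection."""
    for _ in range(max_tries):
        draw = rng.multivariate_normal(m, C)
        if np.all(draw >= 0.0):
            return draw
    raise RuntimeError("acceptance region has very small probability; use another sampler")

rng = np.random.default_rng(0)
A = rng.uniform(1.0, 5.0, size=(100, 3))              # hypothetical contributions
true_nonzero = np.array([0.3, 0.5])                   # column j has a zero in row 2
nonzero_rows = [0, 2]
y_col = A[:, nonzero_rows] @ true_nonzero             # noise-free, for illustration
m_star, C_star = column_posterior(A, y_col, np.zeros(100), 0.01, nonzero_rows)
P_col_draw = sample_nonnegative_normal(m_star, C_star, rng)
```

Rejection sampling becomes inefficient when the untruncated mean lies far outside the nonnegative region; in that case a coordinate-wise truncated normal Gibbs step would be a natural alternative.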
Under a usual inverse gamma prior distribution for $\sigma_j^{2}$, $\sigma_j^{-2} \sim \Gamma(\alpha, \beta)$, $j = 1, \ldots, p$,
with the parameterization in which the mean and variance are $\alpha/\beta$ and $\alpha/\beta^{2}$, respectively,
the full conditionals for $\{\sigma_j^{2}\}$ are
$$\sigma_j^{-2} \mid \cdots \sim \Gamma\left(\alpha + \tfrac{1}{2}n,\ \beta + \tfrac{1}{2}d_j\right)$$
where $d_j = (y^{j} - AP_j - \eta^{j})'(y^{j} - AP_j - \eta^{j})$. This can be easily sampled using a Gibbs
sampler.
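Move (b) can be sketched as follows, assuming NumPy. Because the Γ(α, β) above is parameterized by its rate β, the NumPy `scale` argument is the reciprocal of the rate. The function name `sample_sigma2` and the simulated inputs are hypothetical.

```python
import numpy as np

def sample_sigma2(Y, A, P, eta, alpha, beta, rng):
    """One Gibbs draw of (sigma_1^2, ..., sigma_p^2) from the full conditionals."""
    n = Y.shape[0]
    resid = Y - A @ P - eta
    d = np.sum(resid ** 2, axis=0)                    # d_j for each species j
    precision = rng.gamma(shape=alpha + 0.5 * n, scale=1.0 / (beta + 0.5 * d))
    return 1.0 / precision

rng = np.random.default_rng(0)
n, q, p = 20000, 2, 4
A = rng.uniform(1.0, 5.0, size=(n, q))                # hypothetical contributions
P = np.array([[0.5, 0.3, 0.2, 0.0],
              [0.0, 0.2, 0.3, 0.5]])                  # hypothetical profiles
eta = np.zeros((n, p))
Y = A @ P + np.sqrt(3.0) * rng.standard_normal((n, p))  # true sigma_j^2 = 3
sigma2_draw = sample_sigma2(Y, A, P, eta, alpha=3.0, beta=8.0, rng=rng)
```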
Moves (c) - (g) require Metropolis-Hastings steps. We use the same strategies as those
given in Chib and Greenberg (1995) and West and Harrison (1997) to update Φ and U,
respectively. Let $\gamma_t = \alpha_t - \xi_0$. Under uniform priors for $\phi_k$, writing $\phi = (\phi_1, \ldots, \phi_q)$ for
the diagonal of Φ, and $D_t = \mathrm{diag}(\gamma_{t-1})$, the full conditional posterior density for Φ, $\pi(\phi \mid \cdots)$,
is proportional to
$$c(\Phi)\, f_{nor}(\phi \mid b, B)\, I(0 < \phi < 1)$$
where $f_{nor}$ is the q-variate normal density function, $B^{-1} = \sum_{t=2}^{n} D_t U^{-1} D_t'$, $b = B\sum_{t=2}^{n} D_t U^{-1}\gamma_t'$,
$c(\Phi) = |W|^{-1/2}\exp\left(-\tfrac{1}{2}\gamma_1 W^{-1}\gamma_1'\right)$, $W = \Phi W \Phi + U$ and $I(0 < \phi < 1) = \prod_{k=1}^{q} I(0 < \phi_k < 1)$. We use
$N_q(b, B)$ as a proposal distribution for $\phi$ (independent proposal). That is, we sample a
candidate $\phi^{*}$ from $N_q(b, B)$, compute the corresponding diagonal matrix $\Phi^{*}$ and variance
matrix $W^{*}$ such that $W^{*} = \Phi^{*}W^{*}\Phi^{*} + U$, and accept the new $\phi$ vector with probability
$$\min\left\{\frac{c(\Phi^{*})\, I(0 < \phi^{*} < 1)}{c(\Phi)\, I(0 < \phi < 1)},\ 1\right\}.$$
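The independence Metropolis-Hastings step for φ can be sketched as below. The code assumes NumPy, treats b as the mean of the normal proposal, and uses the closed form $W_{ij} = U_{ij}/(1 - \phi_i\phi_j)$ that solves $W = \Phi W \Phi + U$ for diagonal Φ. All function names and the simulated series are hypothetical.

```python
import numpy as np

def stationary_cov(phi, U):
    """Solve W = Phi W Phi + U for diagonal Phi."""
    return U / (1.0 - np.outer(phi, phi))

def log_c(phi, U, gamma1):
    """log c(Phi) = -0.5 log|W| - 0.5 gamma_1 W^{-1} gamma_1'."""
    W = stationary_cov(phi, U)
    _, logdet = np.linalg.slogdet(W)
    return -0.5 * logdet - 0.5 * gamma1 @ np.linalg.solve(W, gamma1)

def proposal_moments(gamma, U):
    """Return (b, B) with B^{-1} = sum_t D_t U^{-1} D_t and b = B sum_t D_t U^{-1} gamma_t'."""
    U_inv = np.linalg.inv(U)
    prev, curr = gamma[:-1], gamma[1:]
    B_inv = U_inv * (prev.T @ prev)                   # sum of D_t U^{-1} D_t, D_t diagonal
    rhs = np.sum(prev * (curr @ U_inv), axis=0)       # sum of D_t U^{-1} gamma_t'
    B = np.linalg.inv(B_inv)
    return B @ rhs, B

def update_phi(phi, gamma, U, rng):
    """One independence Metropolis-Hastings update of the AR coefficients."""
    b, B = proposal_moments(gamma, U)
    cand = rng.multivariate_normal(b, B)
    if np.any(cand <= 0.0) or np.any(cand >= 1.0):
        return phi                                    # indicator is zero: reject
    log_ratio = log_c(cand, U, gamma[0]) - log_c(phi, U, gamma[0])
    return cand if np.log(rng.uniform()) < log_ratio else phi

rng = np.random.default_rng(0)
n, q = 2000, 2
gamma = np.zeros((n, q))
gamma[0] = rng.standard_normal(q) / np.sqrt(1.0 - 0.64)
for t in range(1, n):
    gamma[t] = 0.8 * gamma[t - 1] + rng.standard_normal(q)   # true phi_k = 0.8, U = I
b, B = proposal_moments(gamma, np.eye(q))
phi_new = update_phi(np.array([0.5, 0.5]), gamma, np.eye(q), rng)
```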
The full conditional posterior for U, $\pi(U \mid \cdots)$, is proportional to
$$p(U)\, a(U)\, |U|^{-(n-1)/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{trace}\left[U^{-1}G\right]\right\}$$
where $G = \sum_{t=2}^{n}(\gamma_t - \gamma_{t-1}\Phi)'(\gamma_t - \gamma_{t-1}\Phi)$ and $a(U) = |W|^{-1/2}\exp\left(-\tfrac{1}{2}\gamma_1 W^{-1}\gamma_1'\right)$. Note that G
follows a Wishart distribution with parameters U and n − 1, i.e., $G \sim W_q(U, n-1)$
where
$$f(G) = \frac{|G|^{(n-1-k-1)/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{trace}\left[U^{-1}G\right]\right\}}{2^{(n-1)k/2}\, |U|^{(n-1)/2}\, \Gamma_k\left(\tfrac{n-1}{2}\right)}$$
and k denotes the dimension of the matrix (here k = q). Under an inverse Wishart prior
$U \sim W_q^{-1}(\Psi_0, m_0)$, where the density is given by
$$p(U) = \frac{|\Psi_0|^{m_0/2}\, |U|^{-(m_0+k+1)/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{trace}\left[\Psi_0 U^{-1}\right]\right\}}{2^{m_0 k/2}\, \Gamma_k\left(\tfrac{m_0}{2}\right)},$$
the conditional distribution of U given G is $U \mid G \sim W_q^{-1}(G + \Psi_0,\ m_0 + n - 1)$, and so the full
conditional posterior for U is proportional to
$$a(U)\, f_{Wishart^{-1}}(U \mid G + \Psi_0,\ m_0 + n - 1),$$
where $f_{Wishart^{-1}}$ is the inverse Wishart density function. We use this inverse Wishart
distribution $W_q^{-1}(G + \Psi_0,\ m_0 + n - 1)$ as a proposal distribution for U. The acceptance
probability in this case is given by
$$\min\left\{\frac{a(U^{*})}{a(U)},\ 1\right\}$$
where $W^{*} = \Phi W^{*}\Phi + U^{*}$.
Move types (e)-(f) are essentially the same as move types (c)-(d) with substitution of Θ,
V, M and $\eta$ for Φ, U, W, and $\gamma$, respectively.
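The update of U (and, by the substitution just described, of V) can be sketched as follows. The code assumes NumPy and integer degrees of freedom, and it draws the inverse Wishart proposal by inverting a sum of outer products of normal vectors; this is one standard construction and an assumption here, not a detail given in the paper. Function names and inputs are hypothetical.

```python
import numpy as np

def sample_inverse_wishart(scale, df, rng):
    """Draw from the inverse Wishart with scale matrix `scale` and integer df."""
    k = scale.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(scale))
    X = rng.standard_normal((df, k)) @ L.T            # rows ~ N(0, scale^{-1})
    return np.linalg.inv(X.T @ X)                     # inverse of a Wishart(scale^{-1}, df) draw

def log_a(U, phi, gamma1):
    """log a(U) = -0.5 log|W| - 0.5 gamma_1 W^{-1} gamma_1', with W = Phi W Phi + U."""
    W = U / (1.0 - np.outer(phi, phi))
    _, logdet = np.linalg.slogdet(W)
    return -0.5 * logdet - 0.5 * gamma1 @ np.linalg.solve(W, gamma1)

def update_U(U, phi, gamma, Psi0, m0, rng):
    """One Metropolis-Hastings update of U with the inverse Wishart proposal."""
    innov = gamma[1:] - gamma[:-1] * phi              # gamma_t - gamma_{t-1} Phi
    G = innov.T @ innov
    n = gamma.shape[0]
    cand = sample_inverse_wishart(G + Psi0, m0 + n - 1, rng)
    log_ratio = log_a(cand, phi, gamma[0]) - log_a(U, phi, gamma[0])
    return cand if np.log(rng.uniform()) < log_ratio else U

rng = np.random.default_rng(0)
phi = np.array([0.8, 0.8])
gamma = np.zeros((500, 2))
for t in range(1, 500):
    gamma[t] = gamma[t - 1] * phi + rng.standard_normal(2)   # hypothetical series, U = I
U_new = update_U(np.eye(2), phi, gamma, Psi0=9.0 * np.eye(2), m0=7, rng=rng)
```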
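Move (g), updating $\alpha$ (equivalently, updating $\gamma_t = \alpha_t - \xi_0$) and $\eta$, can be implemented
by the forward-filtering, backward-sampling algorithm (West and Harrison 1997) applied to
$y_t - \mu_0$ where $\mu_0 = E(y_t)$. Note that the assumption that $\mu_0$ is known is not a strong
assumption. Model (3) can be rewritten as
$$y_t - \mu_0 = \lambda_t F + \delta_t \quad \text{and} \quad \lambda_t = \lambda_{t-1} G + \rho_t, \qquad (7)$$
where $\lambda_t = [\gamma_t\ \ \eta_t]$ is the state vector at time t, $F = \begin{bmatrix} P \\ I_{p \times p} \end{bmatrix}$, G is the (q+p) × (q+p) matrix
$G = \begin{bmatrix} \Phi & 0 \\ 0 & \Theta \end{bmatrix}$, and $\rho_t = [u_t\ \ \upsilon_t]$ with variance matrix $\Omega = \begin{bmatrix} U & 0 \\ 0 & V \end{bmatrix}$. To sample from the
full conditional posterior $\pi(\lambda_1, \lambda_2, \ldots, \lambda_n \mid \cdots)$, we sequentially simulate the individual vectors
$\lambda_n, \lambda_{n-1}, \ldots, \lambda_1$ as follows: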
1) Sample $\lambda_n$ from $N_{q+p}(m_n, C_n)$ where $m_n$ and $C_n$ are obtained from the Kalman filtering
recurrences
$m_{t+1} = m_t G + e_{t+1} K_{t+1}$,
$e_{t+1} = y_{t+1} - \mu_0 - m_t G F$,
$K_{t+1} = (\Sigma + F' R_{t+1} F)^{-1} F' R_{t+1}$,
$C_{t+1} = R_{t+1} - R_{t+1} F K_{t+1}$,
$R_{t+1} = G' C_t G + \Omega$.
2) For each $t = n-1, n-2, \ldots, 1$, sample $\lambda_t$ from $N_{q+p}(h_t, H_t)$ where $h_t = m_t + (\lambda_{t+1} - a_{t+1}) B_t$,
$H_t = C_t - B_t' R_{t+1} B_t$, $B_t = R_{t+1}^{-1} G' C_t$, $a_{t+1} = m_t G$, and $\lambda_{t+1}$ is the value just sampled.
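Steps 1) and 2) can be sketched in code as follows, assuming NumPy and the row-vector convention of (7). The initial moments `m0` and `C0` for the state before the first observation are an assumption of this sketch, since they are not specified above; the function name `ffbs` and the small example are hypothetical.

```python
import numpy as np

def ffbs(Yc, F, G, Sigma, Omega, m0, C0, rng):
    """Draw one state path (n x s) given centered data Yc = Y - mu_0 (n x p)."""
    n, s = Yc.shape[0], F.shape[0]
    ms, Cs, Rs = np.empty((n, s)), np.empty((n, s, s)), np.empty((n, s, s))
    m, C = m0, C0
    for t in range(n):                                # forward (Kalman) filtering
        R = G.T @ C @ G + Omega                       # R_{t+1} = G' C_t G + Omega
        a = m @ G                                     # a_{t+1} = m_t G
        e = Yc[t] - a @ F                             # one-step forecast error
        K = np.linalg.solve(Sigma + F.T @ R @ F, F.T @ R)
        m = a + e @ K
        C = R - R @ F @ K
        C = 0.5 * (C + C.T)                           # keep symmetric numerically
        ms[t], Cs[t], Rs[t] = m, C, R
    lam = np.empty((n, s))
    lam[-1] = rng.multivariate_normal(ms[-1], Cs[-1], check_valid="ignore")
    for t in range(n - 2, -1, -1):                    # backward sampling
        B = np.linalg.solve(Rs[t + 1], G.T @ Cs[t])   # B_t = R_{t+1}^{-1} G' C_t
        h = ms[t] + (lam[t + 1] - ms[t] @ G) @ B
        H = Cs[t] - B.T @ Rs[t + 1] @ B
        lam[t] = rng.multivariate_normal(h, 0.5 * (H + H.T), check_valid="ignore")
    return lam

# Hypothetical small example: q = 2 sources, p = 3 species.
rng = np.random.default_rng(1)
P = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.5, 0.5]])
q, p = P.shape
F = np.vstack([P, np.eye(p)])                         # F stacks P on top of I_p
G = np.diag([0.8, 0.8, 0.7, 0.7, 0.7])                # blockdiag(Phi, Theta)
Yc = rng.standard_normal((30, p))
path = ffbs(Yc, F, G, np.eye(p), np.eye(q + p), np.zeros(q + p), np.eye(q + p), rng)
gamma_draw, eta_draw = path[:, :q], path[:, q:]
```

The first q columns of the sampled path give $\gamma_t$ (so $\alpha_t = \gamma_t + \xi_0$) and the remaining p columns give $\eta_t$.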
Note that the likelihood (6) is invariant with respect to changes in scale of A or P (even
after identifiability conditions C1-C2 are taken into account), and the parameters A (and so
$\xi$ and U) and P are identified except for multiplication by a diagonal matrix (consisting of
scale constants), i.e., we would estimate $AD^{-1}$ ($D^{-1}\xi$, $D^{-1}UD^{-1}$) and $DP$ unless we use a
very precise informative prior. As already mentioned, knowing (estimating) P up to a
normalizing constant fulfills the objective of receptor modeling. It can also be shown that a
scale constant matrix D (although it is unknown and depends on the initial value of the
parameters) does not vary from iteration to iteration within an MCMC run. In this sense
our MCMC scheme is self-consistent, and so the adjustment for the scale constant matrix
does not need to be made at each step. If the scale constant (the matrix D) is ever known
(e.g., the total mass of pollutant particles is known), the adjustment can be directly applied to
the posterior summaries simply by multiplying (or dividing) by D. Care must be taken,
though, in specifying the initial values for the parameters or hyperparameters for the prior
distributions to ensure that they are at least approximately on the same scale or set in a
consistent fashion (e.g., $\xi$, hyperparameters for U, and initial value for A or P).
Finally, the posterior probability statements can directly be made on the identifiable
quantities such as the normalized P or the scaled matrix of U (i.e., the correlation matrix of
A) as discussed in Besag et al. (1995).
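The scale invariance can be checked numerically. The sketch below, assuming NumPy, shows that rescaling by a positive diagonal D leaves AP, the row-normalized P, and the correlation matrix derived from U unchanged; all matrices and helper names are hypothetical.

```python
import numpy as np

def normalize_rows(P):
    """Scale each source profile (row) to sum to 1."""
    return P / P.sum(axis=1, keepdims=True)

def cov_to_corr(U):
    """Convert a covariance matrix to the corresponding correlation matrix."""
    sd = np.sqrt(np.diag(U))
    return U / np.outer(sd, sd)

rng = np.random.default_rng(0)
A = rng.uniform(1.0, 5.0, size=(10, 3))               # hypothetical contributions
P = rng.uniform(0.1, 1.0, size=(3, 5))                # hypothetical profiles
U = np.array([[3.0, 0.5, 0.2],
              [0.5, 2.0, 0.1],
              [0.2, 0.1, 1.0]])                       # hypothetical innovation covariance
D = np.diag([2.0, 0.5, 4.0])                          # arbitrary positive scale constants
D_inv = np.linalg.inv(D)

A_scaled, P_scaled, U_scaled = A @ D_inv, D @ P, D_inv @ U @ D_inv
same_fit = np.allclose(A @ P, A_scaled @ P_scaled)
same_profiles = np.allclose(normalize_rows(P), normalize_rows(P_scaled))
same_corr = np.allclose(cov_to_corr(U), cov_to_corr(U_scaled))
```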
Remark 1. When $\alpha_t$ and $\varepsilon_t$ are assumed to be independent, it can easily be shown that
under a normal prior distribution $\alpha_t \sim N_q(\xi_0, \Xi_0)$, the full conditional distribution for $\alpha_t$,
$\pi(\alpha_t \mid \cdots)$, is a normal distribution through conjugacy, i.e.,
$$\alpha_t \mid \cdots \sim N_q\left(\left(y_t \Sigma_\varepsilon^{-1} P' + \xi_0 \Xi_0^{-1}\right)\left(P \Sigma_\varepsilon^{-1} P' + \Xi_0^{-1}\right)^{-1},\ \left(P \Sigma_\varepsilon^{-1} P' + \Xi_0^{-1}\right)^{-1}\right)$$
where $\Sigma_\varepsilon = \mathrm{cov}(\varepsilon_t) = \mathrm{diag}(\sigma_{\varepsilon 1}^{2}, \ldots, \sigma_{\varepsilon p}^{2})$. This can be updated using a Gibbs sampler,
and with moves (a) and (b) where $y^{j} - \eta^{j}$ and $\sigma_j^{2}$ are replaced by $y^{j}$ and $\sigma_{\varepsilon j}^{2}$, respectively,
it completes one cycle of MCMC when the observations are treated as independent. In
Section 4, this approach is also compared to our time series approach when the observations
are actually dependent.
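The conjugate update in Remark 1 can be sketched for all t at once, assuming NumPy. The function names `alpha_posterior` and `sample_alpha` and the example inputs are hypothetical.

```python
import numpy as np

def alpha_posterior(Y, P, sigma2_eps, xi0, Xi0):
    """Posterior means (n x q) and common covariance (q x q) of alpha_t under independence."""
    Xi0_inv = np.linalg.inv(Xi0)
    P_w = P / sigma2_eps                              # P Sigma_eps^{-1} (diagonal Sigma_eps)
    cov = np.linalg.inv(P_w @ P.T + Xi0_inv)
    mean = (Y @ P_w.T + xi0 @ Xi0_inv) @ cov
    return mean, cov

def sample_alpha(mean, cov, rng):
    """Draw every alpha_t from its normal full conditional."""
    L = np.linalg.cholesky(cov)
    return mean + rng.standard_normal(mean.shape) @ L.T

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.3, 0.2, 0.0],
              [0.0, 0.2, 0.3, 0.5]])                  # hypothetical profiles
A = rng.uniform(1.0, 5.0, size=(50, 2))               # hypothetical contributions
Y = A @ P                                             # noise-free, for illustration
mean, cov = alpha_posterior(Y, P, np.full(4, 1.0), np.array([3.0, 3.0]), 1e8 * np.eye(2))
alpha_draw = sample_alpha(mean, cov, rng)
```

With a very diffuse prior, as in this example, the posterior mean reduces to the generalized least squares estimate of the contributions.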
4. SIMULATION
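The data are generated by the model (3) with p = 7, n = 200, q = 3, $\sigma_1^{2} = \cdots = \sigma_7^{2} = 3$,
$\phi_1 = \phi_2 = \phi_3 = 0.8$, $\xi_0 = (10, 12, 14)$, $U = \sigma_u^{2} \cdot I_{q \times q}$ where $\sigma_u^{2} = 3$, $\theta_1 = \cdots = \theta_7 = 0.7$,
$V = \sigma_\upsilon^{2} \cdot I_{7 \times 7}$ where $\sigma_\upsilon^{2} = 3$. The initial values of $\alpha$ and $\eta$ are given by
$\alpha_{1k} = \xi_{0k} + \sqrt{\sigma_u^{2}/(1 - \phi_k^{2})}\; Z_k$, where $Z_k \sim N(0, 1)$, $k = 1, 2, 3$, and
$\eta_{1j} = \sqrt{\sigma_\upsilon^{2}/(1 - \theta_j^{2})}\; Z_j$, $j = 1, \ldots, 7$,
respectively. The true source composition matrix $P_0$ (normalized to sum to 1) is given in
Table 2. It follows from (4) and (5) that $W = 8.333 \cdot I_{3 \times 3}$ and $M = 5.882 \cdot I_{7 \times 7}$.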
In implementing MCMC, we take $\alpha = 3$ and $\beta = 8$ for the prior on $\sigma_j^{2}$, $j = 1, \ldots, 7$
(yielding the prior mean 4), $m_0 = 7$ and $\Psi_0 = 9 \cdot I_{3 \times 3}$ for the prior on U (yielding the prior
mean $3 \cdot I_{3 \times 3}$), and set the scale matrix for the prior on V equal to $9 \cdot I_{7 \times 7}$ and the degrees of
freedom equal to 11 (yielding the prior mean $3 \cdot I_{7 \times 7}$), each ensuring a proper but relatively
diffuse prior. We use a noninformative prior distribution for the nonzero elements of P
throughout simulation.
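The quoted prior means can be verified with the standard formulas β/(α − 1) for the inverse gamma and Ψ/(m − k − 1) for the k-dimensional inverse Wishart. The short sketch below assumes NumPy; the helper names are hypothetical.

```python
import numpy as np

def inv_gamma_mean(alpha, beta):
    """Mean of sigma^2 when 1/sigma^2 ~ Gamma(alpha, rate=beta); requires alpha > 1."""
    return beta / (alpha - 1.0)

def inv_wishart_mean(scale, df):
    """Mean of a k-dimensional inverse Wishart; requires df > k + 1."""
    k = scale.shape[0]
    return scale / (df - k - 1.0)

sigma2_prior_mean = inv_gamma_mean(3.0, 8.0)             # prior on sigma_j^2
U_prior_mean = inv_wishart_mean(9.0 * np.eye(3), 7)      # prior on U
V_prior_mean = inv_wishart_mean(9.0 * np.eye(7), 11)     # prior on V
```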
The posterior summaries for the model parameters, P, Σ, Φ, U , Θ, and V, based on
2,000 values subsampled from 20,000 iterations following a 20,000 burn-in period are
reported in Tables 3-5. For the source composition matrix P and the variance matrix U,
those summaries are obtained in terms of normalized P (sum to 1) and the scaled variance
matrix $R_U$ (the correlation matrix) since they are identified only up to a constant multiplier.
{Tables 3-5 about here}
We also report the posterior summaries obtained from the approach for independent
observations (see Remark 1) in Table 6. Since this approach does not decompose the error
variances into Σ and M, we treat the estimates of the error variances as the estimates for
$\Sigma_\varepsilon = \mathrm{diag}(\sigma_{\varepsilon 1}^{2}, \ldots, \sigma_{\varepsilon p}^{2}) = \Sigma + M$. The prior mean and the covariance matrix of $\alpha_t$ are
set to be $\xi_0 = (10, 12, 14)$ and $\Xi_0 = 100 \cdot I_{3 \times 3}$, respectively, and the hyperparameters of
the priors on $\sigma_{\varepsilon j}^{2}$ $(j = 1, \ldots, 7)$ are taken as $\alpha = 4$ and $\beta_j = 27$, $j = 1, \ldots, 7$ (yielding the
prior mean 9). The results are based on a posterior sample of size 2,000 obtained by
subsampling every 10th from 20,000 values following a 20,000 burn-in period.
{Table 6 about here}
By comparing Table 3 and Table 6, it can be noted that the approach accounting for
dependence in the data yields much better results in terms of posterior inferences than the
approach not accounting for dependence. In Table 3 only 2 of the 15 (nonzero) elements of
$P_0$ lie outside the 95% credible intervals (all are within the 99% credible intervals though we
do not report them in the table) whereas in Table 6 ten elements of $P_0$ fall outside the 95%
credible intervals (9 of them are not captured even by the 99% credible intervals).
Simultaneous credible regions for the whole matrix $P_0$ can also be constructed using the
method (based on order statistics) suggested in Besag et al. (1995). Table 3 includes the
80% credible regions and these contain all elements of $P_0$ (the same holds for the 70%
credible regions). In Table 6, nine elements of $P_0$ are still outside the 80% credible regions
(7 of them are not captured even by the 90% credible regions). This is a natural
consequence of not taking the correlation in the errors into account in the calculation of
standard errors (posterior standard deviations here). In fact, the posterior standard
deviations in Table 6 are much smaller than they should have been. Figure 4 shows the
side-by-side barplots of the true source compositions ($P_0$) and the posterior mean of P
from two different approaches, the time series approach ($\hat{P}_{ts}$) and the approach ignoring
dependence ($\hat{P}_{indep}$), with $R^{2}$ values between $P_0$ and the estimates. Again it can be seen that
$\hat{P}_{ts}$ gives a much better approximation to the true source composition matrix $P_0$ than
$\hat{P}_{indep}$ does.
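A count of true elements falling outside elementwise credible intervals, as reported above, can be computed from posterior draws as in the following sketch. It assumes NumPy and produces pointwise equal-tailed intervals only; it does not implement the simultaneous regions of Besag et al. (1995). The posterior draws here are synthetic and the function names are hypothetical.

```python
import numpy as np

def credible_intervals(samples, level=0.95):
    """Equal-tailed pointwise intervals from draws with shape (n_draws, q, p)."""
    tail = (1.0 - level) / 2.0
    return np.quantile(samples, tail, axis=0), np.quantile(samples, 1.0 - tail, axis=0)

def count_outside(truth, lower, upper, mask):
    """Number of masked (nonzero) elements of `truth` outside [lower, upper]."""
    outside = (truth < lower) | (truth > upper)
    return int(np.sum(outside & mask))

rng = np.random.default_rng(0)
P_true = np.array([[0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])                  # hypothetical true profiles
draws = P_true + 0.01 * rng.standard_normal((2000, 2, 3))   # synthetic posterior draws
lower, upper = credible_intervals(draws, level=0.95)
nonzero = P_true > 0.0
misses_centered = count_outside(P_true, lower, upper, nonzero)
misses_shifted = count_outside(P_true + 1.0, lower, upper, nonzero)
```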
5. APPLICATION TO ATLANTA DATA
The 1990 Atlanta data described in Section 1 has two types of temporal dependence
structure, correlation in $\alpha$ and correlation in $\varepsilon$ (see Figures 2 and 3). We use model (3)
with q = 3 to analyze this data set consisting of 538 measurements on 9 chemical species.
For identifiability conditions, zeros are preassigned for CyHx+2MHx (cyclohexane+2-
methylhexane) and 2,3-DMP (2,3-dimethylpentane) of source 1 (Roadway), acetylene and
propene of source 2 (Gasoline), acetylene and 2,2,4-TMP (2,2,4-trimethylpentane) of source
3 (Headspace) since the relative concentrations of those species in each source are observed
to be very low from Table 1. An OLS estimate $\hat{A}^{*}_{OLS} = Y P_{measured}'(P_{measured} P_{measured}')^{-1}$, where
$P_{measured}$ is the matrix of measured source compositions (with zeros preassigned and each row
normalized to sum to 100), was used as an initial value for A. The mean source contribution
was set to $\xi_0 = (3.7, 1.4, 0.3)$, which is the arithmetic mean of $\hat{A}^{*}_{OLS}$. Note that the
specification of the value of $\xi_0$ is somewhat arbitrary due to the scale invariance property
mentioned in Section 3. We only need to ensure that $\xi_0$ and the initial value of A are on the
same scale. Since the measured source compositions ($P_{measured}$) can be regarded as prior
information, we use as a prior distribution for P a truncated singular normal distribution
with the mean $P_{measured}$ and the variance 900 for the nonzero elements of P, which ensures a
fairly vague prior (the elements of $P_{measured}$ have values between 0 and 100). The scale
matrix for an inverse Wishart distribution for U was set to $\Psi_0 = 16 \cdot \mathrm{diag}(1, 0.7, 0.08)$
with the degrees of freedom $m_0 = 20$, yielding the prior mean of
$\Psi_0/16 = \mathrm{diag}(1, 0.7, 0.08)$. This choice of the hyperparameter values was made to
ensure that the prior distribution is moderately informative but flexible enough to cover the
range of possible values of U. For the hyperparameters of the priors on $\sigma_j^{2}$, $j = 1, \ldots, 9$, we
take $\alpha = 5$ and $\beta_j = 48$ (the prior mean 12), and for the hyperparameters of the prior on V we
set the scale matrix equal to $27 \cdot I_p$ and the degrees of freedom equal to 13 (so that the prior
mean is $9 \cdot I_p$), ensuring a proper but relatively diffuse prior. For each parameter, a
posterior sample of size 1,000 was obtained by subsampling every 10th from 10,000 values
following a 10,000 burn-in period. Tables 7-9 contain posterior summaries for some model
parameters.
{Tables 7 and 8 about here}
The AR coefficients $\phi_k$ are estimated to be $\hat{\phi}_1 = 0.78$, $\hat{\phi}_2 = 0.68$, and $\hat{\phi}_3 = 0.48$, respectively,
suggesting that there is substantial autocorrelation in roadway contribution and moderate
autocorrelation in gasoline contribution and headspace contribution.
The side-by-side barplots of the measured source compositions (in Table 1) and
estimated compositions are given in Figure 5 with $R^{2}$ values between measured and
estimated compositions. In general, there seems to be good agreement between them.
{Figure 5 about here}
As mentioned in Section 1, the measured compositions are not the true source
compositions in the sense of Section 4 for the data though they are expected to be generally
close to the true compositions. For the Headspace composition profile (for which the
measured and the estimated compositions show the best agreement), all but one (2MPentan)
of the measured values fall in the 99% credible intervals. The 80% simultaneous credible
regions (constructed by the method of Besag et al. 1995) are also reported in Table 7 and
these capture all of the measured Headspace composition.
6. CONCLUSIONS AND DISCUSSION
In this article we develop a time series extension of multivariate receptor modeling in order
to capture in the estimation process extra variability due to temporal dependence in air
pollution data. Recent developments in MCMC methodology make estimation of
parameters of complex models possible. By modeling the dependence structure, we can get
more reliable estimates for the source compositions and their uncertainties, which are of our
primary interest. As a by-product we can assess the amount of variability and
autocorrelation in the source contributions and the errors. It also makes it possible to
forecast the level of pollutants ($y_{t+k}$) and the amount of pollution ($\alpha_{t+k}$), which has been
regarded as one of the model limitations in previous receptor modeling approaches (see the
EPA discussion at http://www.epa.gov/oar/oaqps/pams/analysis/receptor/rectxtsac.html).
Throughout the article we assume that the errors are normally distributed.
Environmental data often contain many outliers, and it is sometimes more appropriate to use
the lognormal distribution to describe the data. The usual transformation technique does
not help especially in the context of receptor modeling. By log-transforming the data the
chemical mass balance equation of the model no longer applies directly, and we need to deal
with model identifiability using different conditions. Alternatively, we may consider a
multivariate T-distribution or a mixture of normal distributions to describe the error
distribution. In the application to Atlanta data, the histogram of the residuals for each
species looks in general bell shaped, but shows a few outliers for some of the species. This
might suggest a use of heavy-tailed distribution for errors though it was not pursued further
in this article. Non-normal dynamic modeling is still an active research area (see West and
Harrison 1997), and we expect that multivariate receptor modeling can be extended further
using non-normal dynamic models.
Another assumption we have made is that the errors have mean 0. To be more realistic,
it would be preferable to generalize this to include the unknown non-zero mean errors,
corresponding to unknown sources. This again involves the development of new
identifiability conditions.
Finally, air pollution data is often obtained from multiple receptors. How to incorporate
spatial variability as well as temporal variability in modeling when multiple species are
measured is a challenging problem. Even in the case of no temporal dependence, this
problem remains open.
REFERENCES
Anderson, T.W. (1984), An Introduction to Multivariate Statistical Analysis (2nd ed.), New
York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), “Bayesian Computation and
Stochastic Systems,” Statistical Science, 10, 3-41.
Chib, S., and Greenberg, E. (1995), “Understanding the Metropolis-Hastings Algorithm,”
American Statistician, 49, 331-335.
Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. (1996), Markov Chain Monte Carlo in
Practice, London: Chapman & Hall.
Gleser, L.J. (1997), “Some Thoughts on Chemical Mass Balance Models,” Chemometrics
and Intelligent Laboratory Systems, 37, 15-22.
Henry, R.C. (1991), “Multivariate Receptor Models,” in Receptor Modeling for Air
Quality Management (ed. P. Hopke), pp.117-147. Amsterdam: Elsevier.
Henry, R.C. (1997), “History and Fundamentals of Multivariate Air Quality Receptor
Models,” Chemometrics and Intelligent Laboratory Systems, 37, 37-42.
Henry, R.C., and Kim, B.M. (1990), “Extension of Self-Modeling Curve Resolution to
Mixtures of More than Three Components. part 1. Finding the Basic Feasible
Region,” Chemometrics and Intelligent Laboratory Systems, 8, 205-216.
Henry, R.C., Lewis, C.W., and Collins, J.F. (1994), “Vehicle-Related Hydrocarbon Source
Compositions from Ambient Data: the Grace/Safer Method,” Environmental Science
and Technology, 28, 823-832.
Henry, R.C., Lewis, C.W., and Hopke, P.K. (1984), “Review of Receptor Model
Fundamentals,” Atmospheric Environment, 18, 1507-1515.
Hopke, P.K. (1985), Receptor Modeling in Environmental Chemistry, New York: Wiley.
Hopke, P.K. (1991), “An Introduction to Receptor Modeling,” Chemometrics and
Intelligent Laboratory Systems, 10, 21-43.
Hopke, P.K. (1997), “The Chemical Mass Balance as a Multivariate Calibration Problem,”
Chemometrics and Intelligent Laboratory Systems, 37, 5-14.
Park, E. S. (1997), “Multivariate Receptor Modeling from a Statistical Science Viewpoint,”
unpublished Ph.D. dissertation, Texas A&M University, Dept. of Statistics.
Park, E. S., Spiegelman, C. H., and Henry, R. C. (1999), “Bilinear Estimation of Pollution
Source Profiles in Receptor Models,” Technical Report 006, University of
Washington, National Research Center for Statistics and the Environment.
Press, S. J. (1982), Applied Multivariate Analysis: Using Bayesian and Frequentist
Methods of Inference (2nd edition). New York: Krieger.
Tierney, L. (1994), “Markov Chains for Exploring Posterior Distributions,” Annals of
Statistics, 22, 1701-1762.
West, M., and Harrison, J. (1997), Dynamic Linear Models, New York: Springer-Verlag.
Yang, H. (1994), “Confirmatory Factor Analysis and its Application to Receptor
Modeling,” unpublished Ph.D. dissertation, University of Pittsburgh, Dept. of
Mathematics and Statistics.
TABLES
Table 1. Measured source composition profiles

Source     acetylene  propene    nButane    2MPentan   3MPentan   benzene    CyHx+2MHx  2,3-DMP    2,2,4-TMP
roadway    0.181      0.094      0.197      0.116      0.069      0.132      0.049      0.043      0.120
gasoline   0          0.002      0.197      0.221      0.138      0.108      0.116      0.067      0.152
headspace  0          0.007      0.685      0.144      0.075      0.034      0.021      0.014      0.021

Note: Each source profile is normalized to sum to one.
Table 2. True source composition profiles (P_0)

Species j  1        2        3        4        5        6        7
Source 1   0        0.248    0        0.102    0.306    0.128    0.216
Source 2   0.242    0        0.266    0        0.009    0.044    0.440
Source 3   0.311    0.250    0.039    0.302    0        0.099    0

Note: Each source profile is normalized to sum to one.
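The normalization stated in the table notes can be checked directly. The following is a minimal sketch using the Table 2 values; the helper name `normalize_rows` and the tolerance are assumptions made here, and the small departures from one reflect rounding of the tabulated values to three decimals.

```python
# True source composition profiles from Table 2 (rows: sources, columns: species 1-7).
P0 = [
    [0.0, 0.248, 0.0, 0.102, 0.306, 0.128, 0.216],
    [0.242, 0.0, 0.266, 0.0, 0.009, 0.044, 0.440],
    [0.311, 0.250, 0.039, 0.302, 0.0, 0.099, 0.0],
]


def normalize_rows(P):
    """Rescale each source profile (row) so that its entries sum to one."""
    normalized = []
    for row in P:
        total = sum(row)
        normalized.append([value / total for value in row])
    return normalized


row_sums = [sum(row) for row in P0]
print(row_sums)  # each sum is close to one, up to rounding in the table

P0_normalized = normalize_rows(P0)
```

Because every row is rescaled by its own total, structural zeros in a profile remain exactly zero after normalization.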
Table 3. Summaries of the posterior distribution for P when the data is generated by model (3)
and the approach accounting for dependence is used

Param.  j:      1        2        3        4        5        6        7
P_{1j}  Mean    0        0.234    0        0.087    0.339*   0.124    0.216
        SD      0        0.018    0        0.023    0.016    0.013    0.033
        LSCR    0        0.191    0        0.025    0.299    0.088    0.137
        LCI     0        0.205    0        0.049    0.313    0.101    0.160
        UCI     0        0.262    0        0.124    0.306    0.147    0.269
        USCR    0        0.279    0        0.145    0.378    0.158    0.293
P_{2j}  Mean    0.204*   0        0.253    0        0.044    0.043    0.456
        SD      0.026    0        0.017    0        0.029    0.013    0.016
        LSCR    0.137    0        0.214    0        0.001    0.009    0.416
        LCI     0.157    0        0.225    0        0.004    0.021    0.430
        UCI     0.241    0        0.282    0        0.100    0.065    0.484
        USCR    0.256    0        0.295    0        0.127    0.075    0.502
P_{3j}  Mean    0.298    0.264    0.029    0.304    0        0.106    0
        SD      0.009    0.010    0.011    0.009    0        0.008    0
        LSCR    0.278    0.237    0.003    0.284    0        0.085    0
        LCI     0.284    0.247    0.011    0.290    0        0.093    0
        UCI     0.313    0.279    0.046    0.319    0        0.118    0
        USCR    0.320    0.288    0.056    0.328    0        0.126    0

Note: 1. SD stands for the posterior standard deviation; 2. LCI and UCI stand for the lower limit and upper limit of the 95%
credible interval; 3. Asterisk (*) indicates that the true parameter value is not captured by the 95% credible interval;
4. LSCR and USCR stand for the lower limit and upper limit of the 80% simultaneous credible region.
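The LCI, UCI, and asterisk columns can be reproduced from posterior draws along the following lines. This is a sketch under stated assumptions: it uses an equal-tail interval computed from empirical quantiles, which may differ from the interval construction used for the table, and the draws are synthetic normal variates centered at the Table 3 posterior mean and SD of P_{12}, not the actual MCMC output. The function names are hypothetical.

```python
import random


def quantile(sorted_vals, q):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def equal_tail_interval(draws, level=0.95):
    """Equal-tail credible interval from a list of posterior draws."""
    s = sorted(draws)
    tail = (1.0 - level) / 2.0
    return quantile(s, tail), quantile(s, 1.0 - tail)


def coverage_flag(true_value, draws, level=0.95):
    """Return '*' when the interval misses the true value, else an empty string."""
    lo, hi = equal_tail_interval(draws, level)
    return "" if lo <= true_value <= hi else "*"


# Synthetic stand-in for MCMC draws of P_{12} (illustration only).
random.seed(0)
draws = [random.gauss(0.234, 0.018) for _ in range(5000)]

lci, uci = equal_tail_interval(draws)
flag = coverage_flag(0.248, draws)  # 0.248 is the true P_{12} from Table 2
print(round(lci, 3), round(uci, 3), repr(flag))
```

The simultaneous credible region (LSCR, USCR) requires a joint construction over all free elements of P and is not covered by this per-parameter sketch.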
Table 4. Posterior means and standard deviations of Φ and R_U (correlation matrix corresponding to U)
when the data is generated by model (3) and the approach accounting for dependence is used

        φ_k              Correlations in R_U
k = 1   0.826 (0.044)    1
k = 2   0.834 (0.042)    0.010 (0.133)     1
k = 3   0.817 (0.040)    0.245 (0.108)*    -0.141 (0.102)    1

Note: 1. Posterior standard deviation is given in parentheses; 2. Asterisk (*) indicates
that the true parameter value is not captured by the 95% credible interval.
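Table 4 reports R_U, the correlation matrix corresponding to the covariance matrix U. The conversion is the standard one, dividing each covariance by the product of the two standard deviations; a minimal sketch follows. The matrix `U_example` is hypothetical and chosen only to make the arithmetic easy to follow; it is not an estimate from the report.

```python
import math


def cov_to_corr(U):
    """Convert a covariance matrix (list of lists) to its correlation matrix."""
    n = len(U)
    sd = [math.sqrt(U[i][i]) for i in range(n)]
    return [[U[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]


# Hypothetical 3 x 3 covariance matrix for three source contributions.
U_example = [
    [4.0, 1.0, 0.0],
    [1.0, 9.0, -1.5],
    [0.0, -1.5, 1.0],
]

R_U = cov_to_corr(U_example)
for row in R_U:
    print([round(v, 3) for v in row])
```

In an MCMC setting this conversion would be applied to each posterior draw of U, and the means and standard deviations of the resulting correlations would then be summarized as in the table.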
Table 5. Posterior means and standard deviations of Θ, V, and Σ when the data is generated by model (3)
and the approach accounting for dependence is used

        θ_j               Diagonal elements of V   σ_j^2
j = 1   0.379 (0.194)*    2.463 (1.295)            3.823 (1.238)
j = 2   0.628 (0.178)     2.777 (1.304)            2.908 (1.002)
j = 3   0.836 (0.100)     2.030 (0.924)            4.368 (1.010)
j = 4   0.801 (0.102)     2.470 (1.127)            4.072 (1.077)
j = 5   0.539 (0.207)     2.634 (1.431)            4.252 (1.509)
j = 6   0.609 (0.121)     2.485 (0.950)            3.279 (0.921)
j = 7   0.650 (0.191)     2.496 (1.457)            2.547 (1.029)

Note: 1. Posterior standard deviation is given in parentheses; 2. Asterisk (*) indicates that the true
parameter value is not captured by the 95% credible interval.
Table 6. Summaries of the posterior distribution for the parameters P and Σ_ε when the data is generated by model (3)
but the approach ignoring dependence (given in Remark 1) is used

Param.           j:      1        2        3        4        5        6        7
P_{1j}           Mean    0        0.214*   0        0.084    0.339*   0.125    0.239
                 SD      0        0.014    0        0.014    0.011    0.008    0.022
                 LSCR    0        0.180    0        0.050    0.314    0.105    0.189
                 LCI     0        0.190    0        0.060    0.322    0.112    0.205
                 UCI     0        0.236    0        0.106    0.357    0.137    0.277
                 USCR    0        0.246    0        0.115    0.365    0.144    0.297
P_{2j}           Mean    0.123*   0        0.201*   0        0.154*   0.063*   0.459*
                 SD      0.012    0        0.008    0        0.011    0.007    0.009
                 LSCR    0.096    0        0.182    0        0.125    0.045    0.439
                 LCI     0.104    0        0.187    0        0.136    0.051    0.445
                 UCI     0.142    0        0.214    0        0.172    0.074    0.474
                 USCR    0.154    0        0.221    0        0.179    0.080    0.482
P_{3j}           Mean    0.292*   0.282*   0.036    0.286*   0        0.103    0
                 SD      0.005    0.005    0.007    0.004    0        0.005    0
                 LSCR    0.281    0.269    0.021    0.276    0        0.092    0
                 LCI     0.284    0.274    0.026    0.278    0        0.096    0
                 UCI     0.300    0.291    0.047    0.293    0        0.111    0
                 USCR    0.304    0.296    0.054    0.297    0        0.115    0
σ_{εj}^2 =       Mean    5.565*   8.648    10.415   11.375   8.275    7.873    7.255
8.882            SD      1.453    1.853    1.403    1.621    2.246    0.840    2.768

Note: 1. SD stands for the posterior standard deviation; 2. LCI and UCI stand for the lower limit and upper limit of the 95%
credible interval; 3. Asterisk (*) indicates that the true parameter value is not captured by the 95% credible interval; 4. LSCR
and USCR stand for the lower limit and upper limit of the 80% simultaneous credible region.
Table 7. Summaries of the posterior distribution for P for the Atlanta data

Param.     Stat.  acetylene  propene    nButane    2MPentan   3MPentan   benzene    CyHx+2MHx  2,3-DMP    2,2,4-TMP
           j:     1          2          3          4          5          6          7          8          9
roadway    Mean   0.275      0.115      0.279      0.086      0.049      0.126      0          0          0.069
           SD     0.008      0.004      0.013      0.004      0.003      0.004      0          0          0.005
           LSCR   0.257      0.107      0.247      0.076      0.042      0.117      0          0          0.057
           LCI    0.257      0.107      0.248      0.076      0.043      0.118      0          0          0.057
           UCI    0.295      0.124      0.305      0.095      0.056      0.135      0          0          0.081
           USCR   0.297      0.125      0.307      0.096      0.056      0.136      0          0          0.081
gasoline   Mean   0          0          0.172      0.191      0.113      0.088      0.123      0.098      0.217
           SD     0          0          0.019      0.005      0.003      0.004      0.005      0.004      0.008
           LSCR   0          0          0.127      0.179      0.104      0.077      0.112      0.089      0.200
           LCI    0          0          0.128      0.180      0.105      0.078      0.112      0.090      0.201
           UCI    0          0          0.214      0.202      0.121      0.097      0.134      0.107      0.236
           USCR   0          0          0.217      0.204      0.122      0.099      0.135      0.107      0.238
headspace  Mean   0          0.009      0.693      0.116      0.063      0.052      0.021      0          0.046
           SD     0          0.007      0.035      0.011      0.007      0.010      0.009      0          0.017
           LSCR   0          0.000      0.606      0.083      0.042      0.028      0.001      0          0.007
           LCI    0          0.001      0.609      0.087      0.045      0.029      0.002      0          0.008
           UCI    0          0.029      0.773      0.142      0.080      0.074      0.044      0          0.088
           USCR   0          0.034      0.776      0.145      0.081      0.076      0.046      0          0.093

Note: 1. SD stands for the posterior standard deviation; 2. LCI and UCI stand for lower limit and upper limit of the 99% credible interval; 3. LSCR and
USCR stand for lower limit and upper limit of the 80% simultaneous credible region.
Table 8. Posterior means and standard deviations of Φ and R_U (correlation matrix corresponding to U) for the Atlanta data

        φ_k              Correlations in R_U
k = 1   0.775 (0.036)    1
k = 2   0.677 (0.062)    0.207 (0.045)     1
k = 3   0.476 (0.114)    -0.069 (0.051)    -0.049 (0.047)    1

Note: Posterior standard deviation is given in parentheses.
Table 9. Posterior means and standard deviations of Θ, diagonal elements of V, and Σ for the Atlanta data

Species     θ_j              Diagonal elements of V   σ_j^2
Acetylene   0.512 (0.110)    1.039 (0.243)            1.148 (0.127)
Propene     0.550 (0.066)    0.405 (0.058)            0.506 (0.042)
nButane     0.400 (0.201)    2.929 (1.339)            3.683 (0.751)
2Mpentan    0.221 (0.086)    0.520 (0.102)            0.534 (0.045)
3Mpentan    0.162 (0.073)    0.280 (0.040)            0.349 (0.026)
Benzene     0.360 (0.092)    0.379 (0.055)            0.501 (0.040)
CyHx+2Mhx   0.237 (0.088)    0.341 (0.048)            0.448 (0.036)
2,3-DMP     0.269 (0.086)    0.261 (0.033)            0.360 (0.027)
2,2,4-TMP   0.643 (0.062)    0.681 (0.138)            0.758 (0.070)

Note: Posterior standard deviation is given in parentheses.
Figure Titles and Legends
Figure 1. Autocorrelation function (ACF) plots of Y for Atlanta data
Figure 2. Autocorrelation function (ACF) plots of the residuals for Atlanta data:
Y - \hat{A}_{OLS} P, where P is the measured source compositions in Table 1
Figure 3. Autocorrelation function (ACF) plots of source contributions (\hat{A}_{OLS}) for Atlanta
data
Figure 4. Side-by-side barplots of the true source compositions (P_0) and the estimated
compositions obtained from two different approaches, time series approach and
approach ignoring dependence
Figure 5. Side-by-side barplots of the measured source compositions and the estimated
compositions for the Atlanta data
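The ACF plots in Figures 1 to 3 display sample autocorrelations up to lag 30. A minimal sketch of the usual sample ACF estimator is given below. The Atlanta series are not reproduced in this section, so the sketch is exercised on a simulated AR(1) series; the coefficient 0.8, the series length, and the seed are arbitrary choices made here for illustration.

```python
import random


def sample_acf(x, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of a univariate series."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares
    return [
        sum(dev[t] * dev[t + h] for t in range(n - h)) / c0
        for h in range(max_lag + 1)
    ]


# Simulated AR(1) series used only to exercise the function.
random.seed(1)
series = [0.0]
for _ in range(1999):
    series.append(0.8 * series[-1] + random.gauss(0.0, 1.0))

acf = sample_acf(series, max_lag=30)
print([round(r, 2) for r in acf[:5]])
```

For a temporally independent series the sample autocorrelations at positive lags would hover near zero, whereas a slowly decaying pattern like the one produced here indicates the kind of temporal dependence the time series approach is meant to capture.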
[Figure 1: nine ACF panels (horizontal axis: lag 0 to 30; vertical axis: ACF from -0.5 to 1), one per species: Acetylene, propene, nButane, 2MPentan, 3MPentan, benzene, CyHx+2MHx, 2,3-DMP, 2,2,4-TMP.]
[Figure 2: nine ACF panels (horizontal axis: lag 0 to 30; vertical axis: ACF from -0.5 to 1), one per species: Acetylene, propene, nButane, 2MPentan, 3MPentan, benzene, CyHx+2MHx, 2,3-DMP, 2,2,4-TMP.]
[Figure 3: three ACF panels (horizontal axis: lag 0 to 30; vertical axis: ACF from -0.5 to 1): Roadway contribution, Gasoline contribution, Headspace contribution.]
[Figure 4: barplots of estimated versus true source compositions over species 1 to 7. Time series approach: Source 1 (R^2 = .99), Source 2 (R^2 = .98), Source 3 (R^2 = .99). Approach ignoring dependence: Source 1 (R^2 = .97), Source 2 (R^2 = .78), Source 3 (R^2 = .99).]
[Figure 5: barplots of estimated versus measured source compositions over species 1 to 9: Roadway (R^2 = .91), Gasoline (R^2 = .84), Headspace (R^2 = .99).]
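The R^2 values attached to the barplots summarize the agreement between two profiles. This section does not define R^2; one plausible reading, assumed here, is the squared Pearson correlation between the measured profile (Table 1) and the posterior mean profile (Table 7). The sketch below computes that quantity for the Atlanta data; the function name is hypothetical.

```python
import math


def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Measured source profiles (Table 1), species 1 to 9.
measured = {
    "roadway": [0.181, 0.094, 0.197, 0.116, 0.069, 0.132, 0.049, 0.043, 0.120],
    "gasoline": [0.0, 0.002, 0.197, 0.221, 0.138, 0.108, 0.116, 0.067, 0.152],
    "headspace": [0.0, 0.007, 0.685, 0.144, 0.075, 0.034, 0.021, 0.014, 0.021],
}

# Posterior means of the estimated profiles (Table 7), species 1 to 9.
estimated = {
    "roadway": [0.275, 0.115, 0.279, 0.086, 0.049, 0.126, 0.0, 0.0, 0.069],
    "gasoline": [0.0, 0.0, 0.172, 0.191, 0.113, 0.088, 0.123, 0.098, 0.217],
    "headspace": [0.0, 0.009, 0.693, 0.116, 0.063, 0.052, 0.021, 0.0, 0.046],
}

r2 = {name: squared_correlation(measured[name], estimated[name]) for name in measured}
for name, value in r2.items():
    print(name, round(value, 2))
```

If the computed values differ from those shown in the figure, the figure may be using a different definition (for example, a regression through the origin), so the comparison should be read as indicative only.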