
Multivariate Receptor Modeling for Temporally Correlated Data by Using MCMC

Journal of the American Statistical Association

February 2001
96(December):1171-1183

DOI:10.1198/016214501753381823

Source
RePEc

Authors:

Ronald C Henry

University of Southern California

Multivariate receptor modeling aims to estimate pollution source profiles and the amounts of pollution based on a series of ambient concentrations of multiple chemical species over time. Air pollution data often show temporal dependence due to meteorology and/or background sources. Previous approaches to receptor modeling do not incorporate this dependence. We model dependence in the data using a time series approach so that we can incorporate extra sources of variability in parameter estimation and uncertainty estimation. We estimate parameters using the Markov chain Monte Carlo method, which makes simultaneous estimation of parameters and uncertainties possible. The methods are applied to simulated data and 1990 Atlanta air pollution data. The results show promise towards the goal of accounting for the dependence in the data.


Multivariate Receptor Modeling for Temporally

Correlated Data by Using MCMC

Eun Sug Park Peter Guttorp Ronald C. Henry

NRCSE

Technical Report Series

NRCSE-TRS No. 043

March 9, 2000

The NRCSE was established in 1996 through a cooperative agreement with the United States

Environmental Protection Agency which provides the Center's primary funding.


National Research Center for Statistics and the Environment

University of Washington

Seattle, WA 98195

Civil and Environmental Engineering

University of Southern California

Los Angeles, CA 90089.

Author’s Footnote

Eun Sug Park is Research Associate, National Research Center for Statistics and the

Environment, University of Washington, Seattle, WA 98195. Peter Guttorp is Professor

of Statistics and Director of the National Research Center for Statistics and the

Environment, University of Washington, Seattle, WA 98195. Ronald C. Henry is

Associate Professor of Civil and Environmental Engineering, University of Southern

California, Los Angeles, CA 90089. Although the research described in this article has

been funded by the United States Environmental Protection Agency through agreement

CR825173-01-0 to the University of Washington, it has not been subjected to the

Agency’s required peer and policy review and therefore does not necessarily reflect the

views of the Agency and no official endorsement should be inferred.

Abstract and Key Words

Multivariate receptor modeling aims to estimate pollution source profiles and the

amounts of pollution based on a series of ambient concentrations of multiple chemical

species over time. Air pollution data often show temporal dependence due to

meteorology and/or background sources. Previous approaches to receptor modeling do

not incorporate this dependence. We model dependence in the data using a time series

approach so that we can incorporate extra sources of variability in parameter estimation

and uncertainty estimation. We estimate parameters using the Markov chain Monte Carlo

method, which makes simultaneous estimation of parameters and uncertainties possible.

The methods are applied to simulated data and 1990 Atlanta air pollution data. The

results show promise towards the goal of accounting for the dependence in the data.

KEY WORDS: Dynamic models; Kalman filter; Gibbs sampler; Metropolis-Hastings

algorithm; Compositions; Air pollution.

1. INTRODUCTION

The goal of receptor modeling is to identify the pollution sources and assess the amounts of

pollution based on observations collected at a particular site, and from that information to

develop an effective air quality management plan. The basic mathematical model can be

written as follows based on chemical mass balance assumptions (see, e.g., Hopke, 1985,

1991, 1997; Gleser 1997):

$$y_t = \sum_{k=1}^{q} \alpha_{tk} P_k + \varepsilon_t, \quad t = 1, \ldots, n, \qquad (1)$$

where $y_t = (y_{t1}, \ldots, y_{tp})$ is the tth observation, q is the number of sources, $P_k = (p_{k1}, \ldots, p_{kp})$ is the kth source composition (consisting of the fractional amount of each species in the emissions from the kth source), $\alpha_{tk}$ is the contribution from the kth source on the tth day, and $\varepsilon_t = (\varepsilon_{t1}, \ldots, \varepsilon_{tp})$ is the measurement error associated with the tth observation. In matrix terms, the model (1) can be written as

$$Y = AP + E \qquad (2)$$

where A is the n×q source contribution matrix, P is the q×p source composition matrix, and E is the n×p error matrix. The model (1) may be viewed as a factor analysis model in the sense that

Y is the only observable quantity while q, P, and A are all unknown quantities that need to

be estimated (or predicted). Early approaches to multivariate receptor modeling include

exploratory factor analysis, principal component analysis, target transformation factor

analysis, and others (see, e.g., Henry 1991). It is well known that, without imposing

additional constraints on the parameters, the factor analysis model is not identifiable even

with known number of sources, q. There have been several attempts to avoid this problem

by imposing more restrictive constraints on either the P or the A matrix (see Henry and Kim

1990; Henry, Lewis, and Collins 1994; Yang 1994; Park 1997). As a matter of fact, there

could be many different sets of identifiability conditions, each making sense in its own

context. Park, Spiegelman, and Henry (1999) discuss identifiability conditions that are

meaningful in receptor models.

The assumption of independence among the observations $y_t$ has been made either

implicitly or explicitly in all previous approaches to multivariate receptor modeling, see, for

instance, Hopke (1991), Henry (1991), Yang (1994), Gleser (1997), Park (1997), and Park

et al. (1999). Air pollution data, however, are usually obtained as a series of measurements

on concentrations of aerosols over time, and meteorology often induces some degree of

dependence in the data. Observations closer in time tend to be more correlated than

observations farther apart in time (e.g., Figure 1).

{Insert Figure 1}

In some cases the assumption of independence may not be grossly wrong because

environmental data usually contains many missing values or erroneous observations, and

after initial screening of the data, time separation between any pair of measurements may

become large enough so that serial correlation can be ignored in the screened data. This, of

course, is not always the case. The research in this paper was motivated by a 1990 Atlanta

air pollution composition data set consisting of hourly measurements of volatile

hydrocarbon (VHC) species. This data set was used in Henry et al. (1994) to derive

vehicle-related hydrocarbon source compositions from the ambient data. In that study, three

types of measured source profiles specific to Atlanta in the summertime of 1990 were also

available: roadway emissions, whole gasoline, and gasoline headspace (see Henry et al.

1994). The compositions of those three sources for nine selected vehicle-related species are

provided in Table 1.

{Insert Table 1}

It is worthwhile to mention that those direct source measurements were obtained, under

rather restricted conditions, independently of the data (e.g., roadway compositions were

obtained as highway tunnel measurements during morning rush hour). Thus it is not

unlikely that the measured source compositions could be different from the true source

compositions P

for the data due to pollutant transport (between source and receptor) and

reactions (and also to measurement errors, variations in source compositions, and the

contribution of minor sources). Nonetheless, the measured source compositions may serve

as a guideline for the true source compositions.

Assuming that the measured compositions in Table 1 are the true source

compositions, i.e., P is known in model (2), A can be estimated easily, for instance, as an

ordinary least squares (OLS) solution,

$\hat{A}_{OLS} = Y P' (P P')^{-1}$, if we ignore the dependence structure in the data (and vice versa, i.e., $\hat{P}_{OLS} = (A'A)^{-1} A' Y$ if A is known or estimated first; this was done in almost all previous works without checking the independence assumption). Figures 1 and 2 show the autocorrelation function (ACF) plots of the raw data Y and of the residuals calculated as $Y - \hat{A}_{OLS} P$ for each of the nine species, respectively.

{Figure 2 about here}

Figure 3 shows ACF plots of the OLS estimates of source contributions, $\hat{A}_{OLS}$.

{Figure 3 about here}

All three plots reveal significant serial correlation in the data. It is well known in time series

literature that in the presence of the correlated residuals, the standard error (not adjusting for

the correlation in the residuals) of OLS estimate of the trend (which may be regarded as P

in our model) in the regression is often grossly wrong. Although the correct standard error

of OLS estimate may be obtained by adjusting for the correlation, it is still not the best

estimate since the generalized least squares estimate, taking the correlation into account in

the estimation procedure, has smaller standard error. The goal of this article is to extend

receptor models to account for temporal dependence in the data so that we can incorporate

that source of variability in estimation of parameters and uncertainties. In Section 2, we

introduce models accounting for time dependence in the observations. Estimation of

parameters is discussed in Section 3. Sections 4 and 5 contain examples from simulated

data and the Atlanta air pollution data, respectively. Finally, concluding remarks are made in

Section 6.
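The OLS estimate and the ACF check described above can be sketched in a few lines. This is only an illustration on toy data: the profile matrix, the AR(1) coefficient 0.8, the mean level 10, and the helper names `ols_contributions` and `sample_acf` are assumptions made for the example and are not taken from the Atlanta analysis.

```python
import numpy as np

def ols_contributions(Y, P):
    """OLS estimate of A in Y = A P + E for known P: A = Y P' (P P')^{-1}."""
    return Y @ P.T @ np.linalg.inv(P @ P.T)

def sample_acf(x, max_lag):
    """Sample autocorrelation of a 1-D series at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - h], x[h:]) / denom
                     for h in range(max_lag + 1)])

# Toy data (hypothetical): q = 2 sources, p = 4 species, AR(1) contributions.
rng = np.random.default_rng(0)
n, q, p = 300, 2, 4
P = np.array([[0.5, 0.3, 0.2, 0.0],
              [0.0, 0.2, 0.3, 0.5]])
A = np.empty((n, q))
A[0] = 10.0
for t in range(1, n):
    A[t] = 10.0 + 0.8 * (A[t - 1] - 10.0) + rng.normal(size=q)
Y = A @ P + 0.1 * rng.normal(size=(n, p))

A_ols = ols_contributions(Y, P)          # estimated source contributions
residuals = Y - A_ols @ P                # OLS residuals per species
acf_contrib = sample_acf(A_ols[:, 0], max_lag=5)
print(np.round(acf_contrib, 2))
```

A slowly decaying ACF for the estimated contributions, as in this toy series, is the kind of pattern that motivates modeling the dependence rather than ignoring it.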

2. MODEL

Assume that the $y_t$ in (1) are dependent. We first need to decide how to model this dependence. It seems reasonable to assume that the source contribution at time t depends on the past source contributions (as Figure 3 indicates). Also, it is often the case that $\varepsilon_t$ contains not only pure measurement error but also all the remaining sources of variability that is not explained by the systematic part of our model, such as background sources (unmodeled minor sources), meteorology, etc. Then it is likely that the $\varepsilon_t$ are also correlated in time due to the effect of meteorology and unmodeled sources (see Figure 2). We may decompose $\varepsilon_t$ into two terms, $\varepsilon_t = \eta_t + \delta_t$, where $\eta_t$ represents variability correlated in time due to meteorology or background sources, and $\delta_t$ represents residual, unpredictable variability due to pure measurement error, independent over time.

We consider the model

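$$y_t = \alpha_t P + \eta_t + \delta_t$$

where $\alpha_t = (\alpha_{t1}, \ldots, \alpha_{tq})$ is a stationary vector AR(1) process centered at $\xi = (\xi_1, \ldots, \xi_q)$, $\eta_t = (\eta_{t1}, \ldots, \eta_{tp})$ is a stationary vector AR(1) process centered at 0, and $\delta_t = (\delta_{t1}, \ldots, \delta_{tp}) \sim N_p(0, \Sigma)$ where $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$. We use ‘$N_k(\cdot, \cdot)$’ to denote the k-dimensional multivariate normal distribution throughout the paper. This model may be written in Dynamic Linear Model (DLM) form (West and Harrison, 1997) as

Observation equation: $y_t = \alpha_t P + \eta_t + \delta_t, \quad \delta_t \sim N_p(0, \Sigma)$

Evolution equation: $\alpha_t = \xi + (\alpha_{t-1} - \xi)\Phi + u_t, \quad u_t \sim N_q(0, U)$

$\eta_t = \eta_{t-1}\Theta + \upsilon_t, \quad \upsilon_t \sim N_p(0, V) \qquad (3)$

where $u_t = (u_{t1}, \ldots, u_{tq})$, $\Phi = \mathrm{diag}(\phi_1, \ldots, \phi_q)$, $\phi_k$ is an AR coefficient for the kth source contribution, $\upsilon_t = (\upsilon_{t1}, \ldots, \upsilon_{tp})$, $\Theta = \mathrm{diag}(\theta_1, \ldots, \theta_p)$, and $\theta_j$ is an AR coefficient for the jth element of $\eta_t$. Note that the marginal distribution for each $\alpha_t$ is

$$\alpha_t \sim N_q(\xi, W), \quad W = \Phi W \Phi + U \qquad (4)$$

and for each $\eta_t$ is

$$\eta_t \sim N_p(0, M), \quad M = \Theta M \Theta + V. \qquad (5)$$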
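A minimal simulation sketch of model (3) follows, together with the stationary variances implied by (4) and (5). Because Φ and Θ are diagonal, the fixed-point equation W = ΦWΦ + U can be solved elementwise. The function names and the profile matrix `P_demo` are hypothetical; the AR coefficients, variances, and ξ are the values used for the simulated data in Section 4.

```python
import numpy as np

def stationary_cov(coef, innov_cov):
    """Solve W = Phi W Phi + U for diagonal Phi = diag(coef).

    Elementwise, W[i, j] = U[i, j] / (1 - coef[i] * coef[j]).
    """
    coef = np.asarray(coef, dtype=float)
    return np.asarray(innov_cov, dtype=float) / (1.0 - np.outer(coef, coef))

def simulate_model3(n, P, xi, phi, U, theta, V, sigma2, rng):
    """Draw Y (n x p) from model (3), starting from the stationary laws."""
    phi, theta = np.asarray(phi, float), np.asarray(theta, float)
    q, p = P.shape
    gamma = rng.multivariate_normal(np.zeros(q), stationary_cov(phi, U))
    eta = rng.multivariate_normal(np.zeros(p), stationary_cov(theta, V))
    Y = np.empty((n, p))
    for t in range(n):
        if t > 0:
            # gamma_t = gamma_{t-1} Phi + u_t and eta_t = eta_{t-1} Theta + v_t
            gamma = gamma * phi + rng.multivariate_normal(np.zeros(q), U)
            eta = eta * theta + rng.multivariate_normal(np.zeros(p), V)
        delta = rng.normal(scale=np.sqrt(sigma2), size=p)
        Y[t] = (xi + gamma) @ P + eta + delta
    return Y

# Hypothetical 3 x 7 profile matrix (rows sum to 1, two zeros per row).
P_demo = np.array([[0.0, 0.0, 0.3, 0.2, 0.1, 0.2, 0.2],
                   [0.2, 0.3, 0.0, 0.0, 0.1, 0.3, 0.1],
                   [0.1, 0.2, 0.3, 0.1, 0.0, 0.0, 0.3]])
phi, theta = np.full(3, 0.8), np.full(7, 0.7)
U, V = 3.0 * np.eye(3), 3.0 * np.eye(7)
W = stationary_cov(phi, U)
M = stationary_cov(theta, V)
Y = simulate_model3(200, P_demo, np.array([10.0, 12.0, 14.0]),
                    phi, U, theta, V, 3.0, np.random.default_rng(0))
print(round(W[0, 0], 3), round(M[0, 0], 3))
```

With these settings the diagonal entries of W and M come out as 3/(1 − 0.64) and 3/(1 − 0.49), which is a quick consistency check on (4) and (5).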

3. ESTIMATION

As the model gets complicated by inclusion of more parameters, Markov chain Monte Carlo

(MCMC) simulation (Tierney 1994; Chib and Greenberg 1995; Besag, Green, Higdon, and

Mengersen 1995; Gilks, Richardson, and Spiegelhalter 1996) seems to be an attractive

approach for parameter estimation. Note also that the parameters of the models (1) or (3)

are all unknown, and the problem of parameter estimation is essentially nonlinear, but the

Markov chain Monte Carlo method makes the problem linear by use of conditional

distributions. We introduce a Bayesian framework to employ an MCMC method

(constraints and identifiability conditions can be used as a part of the prior distribution). As

mentioned in Section 1, the receptor model can be viewed as a special type of a factor

analysis model (with the constraints that the elements of factor loading matrix should be all

nonnegative). For identifiability of the model we borrow conditions from the confirmatory

factor analysis model (Anderson 1984).

C1. There are at least q −1 zero elements in each row of P,

C2. The rank of $P^{(k)}$ is q −1, where $P^{(k)}$ is the matrix composed of the columns
containing the assigned 0’s in the kth row with those assigned 0’s deleted.
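As an illustration, conditions C1 and C2 can be checked numerically for a candidate pattern of zeros. The sketch below assumes that every exact zero in P is an assigned zero, which is a simplification; the function name `check_identifiability` and the example matrices are hypothetical.

```python
import numpy as np

def check_identifiability(P, tol=0.0):
    """Return True if P (q x p) satisfies C1 and C2.

    Assumption: every entry with |P[k, j]| <= tol counts as an assigned zero.
    """
    P = np.asarray(P, dtype=float)
    q = P.shape[0]
    zero = np.abs(P) <= tol
    for k in range(q):
        cols = np.flatnonzero(zero[k])
        if cols.size < q - 1:                       # C1: at least q-1 zeros in row k
            return False
        # P^(k): columns with zeros in row k, with row k itself deleted.
        P_k = np.delete(P[:, cols], k, axis=0)
        if np.linalg.matrix_rank(P_k) != q - 1:     # C2: rank q-1
            return False
    return True

P_ok = np.array([[0.0, 0.0, 0.2, 0.3, 0.1, 0.4],
                 [0.3, 0.2, 0.0, 0.0, 0.4, 0.1],
                 [0.1, 0.5, 0.2, 0.2, 0.0, 0.0]])
P_no_zeros = np.full((3, 6), 1.0 / 6.0)
P_rank_deficient = np.array([[0.0, 0.0, 0.2, 0.3, 0.1, 0.4],
                             [0.2, 0.2, 0.0, 0.0, 0.4, 0.2],
                             [0.1, 0.1, 0.4, 0.4, 0.0, 0.0]])
print(check_identifiability(P_ok),
      check_identifiability(P_no_zeros),
      check_identifiability(P_rank_deficient))
```

The third matrix has enough zeros for C1 but fails C2, because the two remaining rows are proportional in the columns where row 1 is zero.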

Under the above conditions the source profiles, P, are identified up to normalization, which

is enough for the purpose of receptor model. (As long as the relative amount of each

species in a source is determined, a source can be identified.) The conditions C1 and C2

(and nonnegativity constraints on the elements of P) are absorbed into the prior distribution for P. Under the normal error assumption on $\delta_t$, the likelihood $f(Y \mid \cdots)$ is written as

$$f(Y \mid \cdots) = (2\pi)^{-np/2} |\Sigma|^{-n/2} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left[ \Sigma^{-1} (Y - AP - \eta)'(Y - AP - \eta) \right] \right\} \qquad (6)$$

where $\eta$ is the n×p matrix whose rows are $\eta_t$, $t = 1, \ldots, n$. We use ‘$\mid \cdots$’ to denote conditioning on all other variables. For a prior distribution $p(\cdot)$, we assume that

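$$p(P, \Sigma, \Phi, U, \alpha_1, \ldots, \alpha_n, \Theta, V, \eta_1, \ldots, \eta_n) = p(P)\, p(\Sigma)\, p(\Phi)\, p(U)\, p(\alpha_1, \ldots, \alpha_n \mid \xi, \Phi, U)\, p(\Theta)\, p(V)\, p(\eta_1, \ldots, \eta_n \mid \Theta, V).$$

For the sake of brevity, $\xi$ is assumed known, $\xi = \xi_0$. Note that (3) implies

$$p(\alpha_1, \ldots, \alpha_n \mid \xi, \Phi, U) = (2\pi)^{-nq/2} |W|^{-1/2} \exp\left\{ -\tfrac{1}{2} \gamma_1 W^{-1} \gamma_1' \right\} |U|^{-(n-1)/2} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left[ U^{-1} \sum_{t=2}^{n} (\gamma_t - \gamma_{t-1}\Phi)'(\gamma_t - \gamma_{t-1}\Phi) \right] \right\}$$

where $\gamma_t = \alpha_t - \xi$, and

$$p(\eta_1, \ldots, \eta_n \mid \Theta, V) = (2\pi)^{-np/2} |M|^{-1/2} \exp\left\{ -\tfrac{1}{2} \eta_1 M^{-1} \eta_1' \right\} |V|^{-(n-1)/2} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left[ V^{-1} \sum_{t=2}^{n} (\eta_t - \eta_{t-1}\Theta)'(\eta_t - \eta_{t-1}\Theta) \right] \right\}.$$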

Based on a series of observations $y_1, \ldots, y_n$, we are interested in sampling from the full posterior $\pi(P, \Sigma, \alpha_1, \ldots, \alpha_n, \Phi, U, \eta_1, \ldots, \eta_n, \Theta, V \mid Y)$. We use the “block-at-a-time” Metropolis-Hastings algorithm (Chib and Greenberg, 1995). We shall make use of seven move types in implementing MCMC:

(a) updating P

(b) updating Σ

(c) updating Φ

(d) updating U

(e) updating Θ

(f) updating V

(g) updating $\alpha_t$ and $\eta_t$

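Letting $\tilde{P} = (A'A)^{-1} A'(Y - \eta)$ and $S = (Y - A\tilde{P} - \eta)'(Y - A\tilde{P} - \eta)$, and using the orthogonality properties associated with $\tilde{P}$ (see Press 1982), (6) can be written as

$$f(Y \mid \cdots) \propto |\Sigma|^{-n/2} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left[ \Sigma^{-1} S \right] \right\} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}\left[ \Sigma^{-1} (P - \tilde{P})' A'A (P - \tilde{P}) \right] \right\}$$

$$\propto \exp\left\{ -\tfrac{1}{2} (\mathrm{vec}P - \mathrm{vec}\tilde{P})' (\Sigma^{-1} \otimes A'A) (\mathrm{vec}P - \mathrm{vec}\tilde{P}) \right\},$$

where the last line is viewed as a function of P.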

Let the prior distribution for P be

$$p(P) = p(\mathrm{vec}P) \sim N_{pq}(m_0, C_0)\, I(P_{kj} \ge 0,\ k = 1, \ldots, q,\ j = 1, \ldots, p)$$

where $m_0$ is a pq-dimensional vector and $C_0$ is a pq×pq-dimensional diagonal matrix. Enforcing the constraints C1-C2 is equivalent to using a degenerate point prior for some of the elements of P. We set $q \times (q-1)$ elements of $m_0$ and the corresponding elements of $C_0$ to be zero, which makes the prior distribution for P a truncated singular normal distribution (though still proper). Then the resulting full conditional posterior distribution $\pi(P \mid \cdots)$ is again a truncated singular normal distribution, which can be written as

$$\mathrm{vec}P \mid \cdots \sim N_{pq}(m, C)\, I(P_{kj} \ge 0,\ k = 1, \ldots, q,\ j = 1, \ldots, p)$$

where $m = C\{(\Sigma^{-1} \otimes A')\,\mathrm{vec}(Y - \eta) + C_0^{-} m_0\}$ and $C = (\Sigma^{-1} \otimes A'A + C_0^{-})^{-1}$, where $C_0^{-}$ is a generalized inverse of $C_0$. Since both Σ and $C_0$ are diagonal, for the columns of P with no zero elements, we have

$$P_j \mid \cdots \sim N_q(m_j, C_j)\, I(P_{kj} \ge 0,\ k = 1, \ldots, q)$$

where $m_j = C_j\{\sigma_j^{-2} A'(y_j - \eta_j) + C_{0j}^{-1} m_{0j}\}$, $C_j = (\sigma_j^{-2} A'A + C_{0j}^{-1})^{-1}$, $m_{0j}$ is a q-dimensional prior mean vector of $P_j$, $C_{0j}$ is the corresponding submatrix of $C_0$, $y_j$ is the jth column of Y, and $\eta_j$ is the jth column of $\eta$. For the columns of P containing zero elements, let $q^*$ be the number of nonzero elements for that column and $P_j^*$ be a column vector consisting of those $q^*$ elements. Then

$$P_j^* \mid \cdots \sim N_{q^*}(m_j^*, C_j^*)\, I(P_{kj}^* \ge 0,\ k = 1, \ldots, q^*)$$

where $m_j^* = C_j^*\{\sigma_j^{-2} A^{*\prime}(y_j - \eta_j) + C_{0j}^{*-1} m_{0j}^*\}$, $C_j^* = (\sigma_j^{-2} A^{*\prime}A^* + C_{0j}^{*-1})^{-1}$, $m_{0j}^*$ is a $q^*$-dimensional prior mean vector of the nonzero elements of $P_j$, $C_{0j}^*$ is the corresponding submatrix of $C_0$, and $A^*$ consists of the columns of A corresponding to the nonzero elements of $P_j$.

If there is no prior information about the source compositions other than the zero elements, we may use a noninformative prior

$$p(P) = \prod_{k=1}^{q} \left[ \prod_{j \notin J_k} I(P_{kj} \ge 0) \prod_{j \in J_k} I(P_{kj} = 0) \right]$$

where $J_k$ is the index set for which $P_{kj} = 0$, which takes into account the conditions C1-C2 and nonnegativity only. Under this prior, we have, for the columns of P with no zero element,

$$P_j \mid \cdots \sim N_q(m_j, C_j)\, I(P_{kj} \ge 0,\ k = 1, \ldots, q)$$

where $m_j = (A'A)^{-1} A'(y_j - \eta_j)$ and $C_j = \sigma_j^2 (A'A)^{-1}$. For the columns of P containing zero elements, we get

$$P_j^* \mid \cdots \sim N_{q^*}(m_j^*, C_j^*)\, I(P_{kj}^* \ge 0,\ k = 1, \ldots, q^*)$$

where $m_j^* = (A^{*\prime}A^*)^{-1} A^{*\prime}(y_j - \eta_j)$ and $C_j^* = \sigma_j^2 (A^{*\prime}A^*)^{-1}$.

Hence move (a) can be performed using either a Gibbs sampler or a simple Metropolis-

Hastings algorithm.

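Under a usual inverse gamma prior distribution for $\sigma_j^2$, $\sigma_j^{-2} \sim \Gamma(\alpha_j, \beta_j)$, $j = 1, \ldots, p$, with the parameterization in which the mean and variance are $\alpha_j / \beta_j$ and $\alpha_j / \beta_j^2$, respectively, the full conditionals for $\{\sigma_j^2\}$ are

$$\sigma_j^{-2} \mid \cdots \sim \Gamma\left(\alpha_j + \tfrac{n}{2},\ \beta_j + \tfrac{d_j}{2}\right)$$

where $d_j = (y_j - AP_j - \eta_j)'(y_j - AP_j - \eta_j)$. This can be easily sampled using a Gibbs sampler.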
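The Gibbs draw for the error variances can be sketched as follows. The sketch assumes the rate parameterization of the gamma distribution described above, under which the full conditional of each precision is a gamma with shape α + n/2 and rate β + d_j/2; NumPy's generator takes a scale argument, so the rate is inverted. The function name `sample_sigma2` and the toy data are hypothetical.

```python
import numpy as np

def sample_sigma2(Y, A, P, eta, alpha, beta, rng):
    """One Gibbs draw of (sigma_1^2, ..., sigma_p^2).

    Assumes 1/sigma_j^2 ~ Gamma(alpha, rate=beta) a priori, so that
    1/sigma_j^2 | rest ~ Gamma(alpha + n/2, rate = beta + d_j/2).
    """
    n = Y.shape[0]
    resid = Y - A @ P - eta
    d = np.sum(resid ** 2, axis=0)              # d_j, one value per species
    shape = alpha + 0.5 * n
    rate = beta + 0.5 * d
    precision = rng.gamma(shape, 1.0 / rate)    # NumPy uses scale = 1 / rate
    return 1.0 / precision

# Toy check (hypothetical data): true error variance 3 for every species.
rng = np.random.default_rng(3)
n, q, p = 5000, 2, 3
A = rng.uniform(1.0, 5.0, size=(n, q))
P = np.array([[0.6, 0.4, 0.0], [0.0, 0.5, 0.5]])
eta = np.zeros((n, p))
Y = A @ P + eta + rng.normal(scale=np.sqrt(3.0), size=(n, p))
sigma2 = sample_sigma2(Y, A, P, eta, alpha=3.0, beta=8.0, rng=rng)
print(np.round(sigma2, 2))
```

With this many observations the draw concentrates near the true variance, so the prior hyperparameters have little influence in the toy example.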

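Moves (c) - (g) require Metropolis-Hastings steps. We use the same strategy as those given in Chib and Greenberg (1995) and West and Harrison (1997) to update Φ and U, respectively. Let $\gamma_t = \alpha_t - \xi$. Under uniform priors for $\phi_k$, writing $\phi = (\phi_1, \ldots, \phi_q)'$ for the diagonal of Φ, and $D_t = \mathrm{diag}(\gamma_{t1}, \ldots, \gamma_{tq})$, the full conditional posterior density for Φ, $\pi(\phi \mid \cdots)$, is proportional to

$$c(\Phi)\, f_{nor}(\phi \mid b, B)\, I_{(0,1)}(\phi)$$

where $f_{nor}$ is the q-variate normal density function, $B^{-1} = \sum_{t=2}^{n} D_{t-1} U^{-1} D_{t-1}'$, $b = B \sum_{t=2}^{n} D_{t-1} U^{-1} \gamma_t'$, $c(\Phi) = |W|^{-1/2} \exp\left(-\tfrac{1}{2} \gamma_1 W^{-1} \gamma_1'\right)$, $W = \Phi W \Phi + U$, and $I_{(0,1)}(\phi) = \prod_{k=1}^{q} I(0 < \phi_k < 1)$. We use $N_q(b, B)$ as a proposal distribution for $\phi$ (independent proposal). That is, we sample a candidate $\phi^*$ from $N_q(b, B)$, compute the corresponding diagonal matrix $\Phi^*$ and variance matrix $W^*$ such that $W^* = \Phi^* W^* \Phi^* + U$, and accept the new $\phi$ vector with probability

$$\min\left\{ \frac{c(\Phi^*)\, I_{(0,1)}(\phi^*)}{c(\Phi)},\ 1 \right\}.$$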
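An independence Metropolis-Hastings step of this kind can be sketched as below. The sketch assumes that the first row of `gamma` plays the role of the initial state entering c(Φ), and it rejects any candidate outside (0, 1) outright. The function names and the simulated input are hypothetical and only serve to exercise the step.

```python
import numpy as np

def stationary_cov(coef, innov_cov):
    """W solving W = Phi W Phi + U for diagonal Phi = diag(coef)."""
    coef = np.asarray(coef, dtype=float)
    return np.asarray(innov_cov, dtype=float) / (1.0 - np.outer(coef, coef))

def log_c(phi, gamma_first, U):
    """log c(Phi) = -0.5 log|W| - 0.5 gamma_1 W^{-1} gamma_1'."""
    W = stationary_cov(phi, U)
    _, logdet = np.linalg.slogdet(W)
    return -0.5 * logdet - 0.5 * gamma_first @ np.linalg.solve(W, gamma_first)

def update_phi(phi, gamma, U, rng):
    """One independence MH update of phi given centered contributions gamma (n x q)."""
    U_inv = np.linalg.inv(U)
    prev, curr = gamma[:-1], gamma[1:]
    B_inv = U_inv * (prev.T @ prev)                 # sum_t D_{t-1} U^{-1} D_{t-1}
    rhs = np.sum(prev * (curr @ U_inv), axis=0)     # sum_t D_{t-1} U^{-1} gamma_t'
    B = np.linalg.inv(B_inv)
    b = B @ rhs
    cand = rng.multivariate_normal(b, B)
    if np.any(cand <= 0.0) or np.any(cand >= 1.0):  # indicator I_(0,1) is zero
        return phi
    log_ratio = log_c(cand, gamma[0], U) - log_c(phi, gamma[0], U)
    return cand if np.log(rng.uniform()) < log_ratio else phi

# Toy input (hypothetical): q = 2 centered AR(1) series with coefficient 0.8.
rng = np.random.default_rng(4)
n, q = 200, 2
U = np.eye(q)
gamma = np.zeros((n, q))
for t in range(1, n):
    gamma[t] = 0.8 * gamma[t - 1] + rng.normal(size=q)
phi = np.full(q, 0.5)
for _ in range(50):
    phi = update_phi(phi, gamma, U, rng)
print(np.round(phi, 2))
```

Because the proposal already matches the bulk of the full conditional, the acceptance ratio involves only the initial-state factor c, which is why the step tends to accept most candidates.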





















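The full conditional posterior for U, $\pi(U \mid \cdots)$, is proportional to

$$p(U)\, a(U)\, |U|^{-(n-1)/2} \exp\left\{ -\tfrac{1}{2} \mathrm{trace}\left[ U^{-1} G \right] \right\}$$

where $G = \sum_{t=2}^{n} (\gamma_t - \gamma_{t-1}\Phi)'(\gamma_t - \gamma_{t-1}\Phi)$ and $a(U) = |W|^{-1/2} \exp\left(-\tfrac{1}{2} \gamma_1 W^{-1} \gamma_1'\right)$. Note that G follows a Wishart distribution with parameters U and n −1, i.e., $G \sim W(U, n-1)$, where

$$f(G) \propto |U|^{-(n-1)/2}\, |G|^{(n-q-2)/2} \exp\left\{ -\tfrac{1}{2} \mathrm{trace}\left[ U^{-1} G \right] \right\}.$$

Under an inverse Wishart prior $U \sim W^{-1}(\Psi, m)$, where the density is given by

$$p(U) \propto |\Psi|^{m/2}\, |U|^{-(m+q+1)/2} \exp\left\{ -\tfrac{1}{2} \mathrm{trace}\left[ \Psi U^{-1} \right] \right\},$$

the conditional distribution of U given G is $U \mid G \sim W^{-1}(G + \Psi,\ m + n - 1)$, and so the full conditional posterior for U is proportional to

$$a(U)\, f_{Wishart^{-1}}(U \mid G + \Psi,\ m + n - 1),$$

where $f_{Wishart^{-1}}$ is the inverse Wishart density function. We use this inverse Wishart distribution $W^{-1}(G + \Psi,\ m + n - 1)$ as a proposal distribution for U. The acceptance probability in this case is given by

$$\min\left\{ \frac{a(U^*)}{a(U)},\ 1 \right\}$$

where $a(U^*)$ is evaluated at $W^*$ satisfying $W^* = \Phi W^* \Phi + U^*$.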

Move types (e)-(f) are essentially the same as move types (c)-(d) with substitution of Θ, V, M, and $\eta_t$ for Φ, U, W, and $\gamma_t$, respectively.

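Move (g), updating $\alpha_t$ (equivalently, updating $\gamma_t = \alpha_t - \xi$) and $\eta_t$, can be implemented by the forward-filtering, backward-sampling algorithm (West and Harrison 1997) applied to $y_t - \mu$, where $\mu = E(y_t) = \xi P$. Note that the assumption that $\xi$ is known is not a strong assumption. Model (3) can be rewritten as

$$y_t - \mu = \lambda_t F + \delta_t \quad \text{and} \quad \lambda_t = \lambda_{t-1} G + \rho_t, \qquad (7)$$

where $\lambda_t = [\gamma_t \ \ \eta_t]$ is the state vector at time t, $F = \begin{bmatrix} P \\ I_{p \times p} \end{bmatrix}$, G is the (q+p)×(q+p) matrix $G = \begin{bmatrix} \Phi & 0 \\ 0 & \Theta \end{bmatrix}$, and $\rho_t = [u_t \ \ \upsilon_t]$ with variance matrix $\Omega = \begin{bmatrix} U & 0 \\ 0 & V \end{bmatrix}$. To sample from the full conditional posterior $\pi(\lambda_1, \ldots, \lambda_n \mid \cdots)$, we sequentially simulate the individual vectors $\lambda_n, \lambda_{n-1}, \ldots, \lambda_1$ as follows:

1) Sample $\lambda_n$ from $N_{q+p}(m_n, C_n)$, where $m_n$ and $C_n$ are obtained from the Kalman filtering recurrences

$m_{t+1} = m_t G + e_{t+1} K_{t+1}$,

$e_{t+1} = y_{t+1} - \mu - m_t G F$,

$K_{t+1} = (\Sigma + F' R_{t+1} F)^{-1} F' R_{t+1}$,

$C_{t+1} = R_{t+1} - R_{t+1} F K_{t+1}$,

$R_{t+1} = G' C_t G + \Omega$.

2) For each $t = n-1, n-2, \ldots, 1$, sample $\lambda_t$ from $N_{q+p}(h_t, H_t)$, where $h_t = m_t + (\lambda_{t+1} - a_{t+1}) B_t$, $H_t = C_t - B_t' R_{t+1} B_t$, $B_t = R_{t+1}^{-1} G' C_t$, $a_{t+1} = m_t G$, and $\lambda_{t+1}$ is the value just sampled.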
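The two steps above can be sketched as a single routine in the row-vector convention of (7). This is a minimal illustration rather than the authors' implementation: the prior moments `m0` and `C0` for the state before the first observation are an assumption (the text does not specify them), and the function name `ffbs` and the toy inputs are hypothetical.

```python
import numpy as np

def ffbs(Z, F, G, Sigma, Omega, m0, C0, rng):
    """Forward filtering, backward sampling for the row-vector DLM
        z_t   = lam_t F + delta_t,      delta_t ~ N(0, Sigma)
        lam_t = lam_{t-1} G + rho_t,    rho_t   ~ N(0, Omega)
    Z is n x p with rows z_t = y_t - mu; returns one n x d draw of the states.
    m0, C0: assumed prior mean and covariance of the state before time 1.
    """
    n, d = Z.shape[0], G.shape[0]
    m = np.empty((n, d)); C = np.empty((n, d, d))
    a = np.empty((n, d)); R = np.empty((n, d, d))
    m_prev, C_prev = m0, C0
    for t in range(n):                                # Kalman filter pass
        a[t] = m_prev @ G                             # one-step state mean
        R[t] = G.T @ C_prev @ G + Omega               # one-step state covariance
        e = Z[t] - a[t] @ F                           # forecast error
        K = np.linalg.solve(Sigma + F.T @ R[t] @ F, F.T @ R[t])
        m[t] = a[t] + e @ K
        C[t] = R[t] - R[t] @ F @ K
        C[t] = 0.5 * (C[t] + C[t].T)                  # keep symmetric numerically
        m_prev, C_prev = m[t], C[t]
    lam = np.empty((n, d))
    lam[-1] = rng.multivariate_normal(m[-1], C[-1])
    for t in range(n - 2, -1, -1):                    # backward sampling pass
        B = np.linalg.solve(R[t + 1], G.T @ C[t])     # B_t = R_{t+1}^{-1} G' C_t
        h = m[t] + (lam[t + 1] - a[t + 1]) @ B
        H = C[t] - B.T @ R[t + 1] @ B
        lam[t] = rng.multivariate_normal(h, 0.5 * (H + H.T))
    return lam

# Toy usage (hypothetical values): q = 1 source, p = 2 species.
rng = np.random.default_rng(2)
P = np.array([[0.6, 0.4]])
F = np.vstack([P, np.eye(2)])                         # (q + p) x p
G = np.diag([0.8, 0.7, 0.7])
Omega = 3.0 * np.eye(3)
Sigma = 0.01 * np.eye(2)
Z = 2.0 * rng.normal(size=(50, 2))
draw = ffbs(Z, F, G, Sigma, Omega, np.zeros(3), 10.0 * np.eye(3), rng)
print(draw.shape)
```

In the toy usage the measurement variance is small, so the sampled states reproduce the centered observations closely through `draw @ F`, which gives a simple sanity check on the recursions.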

Note that the likelihood (6) is invariant with respect to changes in scale of A or P (even

after identifiability conditions C1-C2 are taken into account), and the parameters A (and so $\xi$ and U) and P are identified except for multiplication by a diagonal matrix (consisting of scale constants), i.e., we would estimate $AD^{-1}$ (and so $\xi D^{-1}$, $D^{-1}UD^{-1}$) and $DP$ unless we use a

very precise informative prior. As already mentioned, knowing (estimating) P up to a

normalizing constant fulfills the objective of receptor modeling. It can also be shown that a

scale constant matrix D (although it is unknown and depends on the initial value of the

parameters) does not vary from iteration to iteration within an MCMC run. In this sense

our MCMC scheme is self-consistent, and so the adjustment for the scale constant matrix

does not need to be made at each step. If the scale constant (the matrix D) is ever known

(e.g., the total mass of pollutant particle is known), the adjustment can be directly applied to

the posterior summaries simply by multiplying (or dividing) by D. Care must be taken

though in specifying the initial values for the parameters or hyperparameters for the prior

distributions to ensure that at least they are approximately on the same scale or in a

consistent fashion (e.g., $\xi$, hyperparameters for U, and initial value for A or P).

Finally, the posterior probability statements can directly be made on the identifiable

quantities such as the normalized P or the scaled matrix of U (i.e., the correlation matrix of

A) as discussed in Besag et al. (1995).
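The scale invariance described above is easy to verify numerically. The sketch below uses hypothetical values for A, P, U, and the diagonal scale matrix D; it shows that rescaling leaves the fitted product unchanged, and that the row-normalized profiles and the correlation matrix derived from U do not depend on D.

```python
import numpy as np

def normalize_rows(P):
    """Scale each source profile (row of P) to sum to 1."""
    return P / P.sum(axis=1, keepdims=True)

def cov_to_corr(U):
    """Convert a covariance matrix to the corresponding correlation matrix."""
    s = np.sqrt(np.diag(U))
    return U / np.outer(s, s)

rng = np.random.default_rng(1)
A = rng.uniform(1.0, 5.0, size=(6, 2))          # hypothetical contributions
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.25, 0.75]])               # hypothetical profiles
U = np.array([[3.0, 1.0],
              [1.0, 2.0]])
D = np.diag([2.0, 0.5])                         # arbitrary positive scale constants
D_inv = np.linalg.inv(D)

same_fit = np.allclose((A @ D_inv) @ (D @ P), A @ P)
same_profiles = np.allclose(normalize_rows(D @ P), normalize_rows(P))
same_corr = np.allclose(cov_to_corr(D_inv @ U @ D_inv), cov_to_corr(U))
print(same_fit, same_profiles, same_corr)
```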

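Remark 1. When $\alpha_t$ and $\varepsilon_t$ are assumed to be independent, it can easily be shown that under a normal prior distribution $\alpha_t \sim N_q(\xi, \Xi)$, the full conditional distribution for $\alpha_t$, $\pi(\alpha_t \mid \cdots)$, is a normal distribution through conjugacy, i.e.,

$$\alpha_t \mid \cdots \sim N_q\left( (y_t \Sigma_\varepsilon^{-1} P' + \xi \Xi^{-1})(P \Sigma_\varepsilon^{-1} P' + \Xi^{-1})^{-1},\ (P \Sigma_\varepsilon^{-1} P' + \Xi^{-1})^{-1} \right)$$

where $\Sigma_\varepsilon = \mathrm{cov}(\varepsilon_t) = \mathrm{diag}(\sigma_{\varepsilon 1}^2, \ldots, \sigma_{\varepsilon p}^2)$. This can be updated using a Gibbs sampler, and with moves (a) and (b) where $y_t - \eta_t$ and Σ are replaced by $y_t$ and $\Sigma_\varepsilon$, respectively, it completes one cycle of MCMC when the observations are treated as independent. In Section 4, this approach is also compared to our time series approach when the observations are actually dependent.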
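The conjugate update in Remark 1 amounts to a standard Bayesian linear regression step for each time point. The following sketch computes the full conditional mean and covariance of one contribution vector; the function name `alpha_full_conditional` and the scalar toy values are hypothetical.

```python
import numpy as np

def alpha_full_conditional(y_t, P, sigma2_eps, xi, Xi):
    """Mean and covariance of alpha_t | rest when observations are independent.

    y_t: length-p observation, P: q x p profiles, sigma2_eps: length-p error
    variances, xi and Xi: prior mean (length q) and prior covariance (q x q).
    """
    S_inv = np.diag(1.0 / np.asarray(sigma2_eps, dtype=float))
    Xi_inv = np.linalg.inv(Xi)
    cov = np.linalg.inv(P @ S_inv @ P.T + Xi_inv)
    mean = (y_t @ S_inv @ P.T + xi @ Xi_inv) @ cov
    return mean, cov

# Scalar toy case: prior N(0, 1), one observation y = 2 with unit error variance.
mean, cov = alpha_full_conditional(np.array([2.0]), np.array([[1.0]]),
                                   [1.0], np.array([0.0]), np.array([[1.0]]))
print(mean, cov)
# A Gibbs draw would then be rng.multivariate_normal(mean, cov).
```

In the scalar case the result is the familiar precision-weighted average of the prior mean and the observation.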

4. SIMULATION

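The data are generated by the model (3) with p = 7, n = 200, q = 3, $\sigma_1^2 = \cdots = \sigma_7^2 = 3$, $\phi_1 = \phi_2 = \phi_3 = 0.8$, $\xi = (10, 12, 14)$, $U = 3 \cdot I_3$, $\theta_1 = \cdots = \theta_7 = 0.7$, and $V = 3 \cdot I_7$. The initial values of $\alpha_t$ and $\eta_t$ are given by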

αξ

Z=+

−

, where ZN

~(,)01, k =123,, and

−

j =17,,L ,

respectively. The true source composition matrix $P_0$ (normalized to sum to 1) is given in Table 2. It follows from (4) and (5) that $W = 8.333 \cdot I_3$ and $M = 5.882 \cdot I_7$.

In implementing MCMC, we take $\alpha_j = 3$ and $\beta_j = 8$ for the inverse gamma prior on $\sigma_j^2$, $j = 1, \ldots, 7$ (yielding the prior mean 4), $m = 7$ and $\Psi = 9 \cdot I_3$ for the prior on U (yielding the prior mean $3 \cdot I_3$), and set the scale matrix for the prior on V equal to $9 \cdot I_7$ and the degrees of freedom equal to 11 (yielding the prior mean $3 \cdot I_7$), each ensuring a proper but relatively diffuse prior. We use a noninformative prior distribution for the nonzero elements of P throughout the simulation.
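The prior means quoted above follow from the standard formulas for the inverse gamma and inverse Wishart distributions, and can be checked with a short calculation. The helper names are hypothetical; the formulas assume the rate parameterization of the gamma prior on the precision and an inverse Wishart with scale matrix Ψ and m degrees of freedom whose mean is Ψ/(m − dim − 1).

```python
def inv_gamma_mean(alpha, beta):
    """Mean of sigma^2 when 1/sigma^2 ~ Gamma(alpha, rate=beta); needs alpha > 1."""
    return beta / (alpha - 1.0)

def inv_wishart_mean_scale(scale, df, dim):
    """Multiplier c such that an inverse Wishart prior with scale matrix
    scale * I and df degrees of freedom has mean c * I; needs df > dim + 1."""
    return scale / (df - dim - 1.0)

print(inv_gamma_mean(3, 8))               # prior mean of each sigma_j^2: 4.0
print(inv_wishart_mean_scale(9, 7, 3))    # prior mean of U is 3.0 * I
print(inv_wishart_mean_scale(9, 11, 7))   # prior mean of V is 3.0 * I
```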

The posterior summaries for the model parameters, P, Σ, Φ, U , Θ, and V, based on

2,000 values subsampled from 20,000 iterations following a 20,000 burn-in period are

reported in Tables 3-5. For the source composition matrix P and the variance matrix U,

those summaries are obtained in terms of normalized P (sum to 1) and the scaled variance

matrix R

(the correlation matrix) since they are identified only up to a constant multiplier.

{Tables 3-5 about here}

We also report the posterior summaries obtained from the approach for independent

observations (see Remark 1) in Table 6. Since this approach does not decompose the error

variances into Σ and M, we treat the estimates of the error variances as the estimates for $\Sigma_\varepsilon = \mathrm{diag}(\sigma_{\varepsilon 1}^2, \ldots, \sigma_{\varepsilon p}^2) = \Sigma + M$. The prior mean and the covariance matrix of $\alpha_t$ are set to be $\xi = (10, 12, 14)$ and $\Xi = 100 \cdot I_3$, respectively, and the hyperparameters of the priors on $\sigma_{\varepsilon j}^2$ ($j = 1, \ldots, 7$) are taken as $\alpha_j = 4$ and $\beta_j = 27$ (yielding the prior mean 9). The results are based on a posterior sample of size 2,000 obtained by subsampling every 10th value from 20,000 values following a 20,000 burn-in period.

{Table 6 about here}

By comparing Table 3 and Table 6, it can be noted that the approach accounting for

dependence in the data yields much better result in terms of posterior inferences than the

approach not accounting for dependence. In Table 3 only 2 of the 15 (nonzero) elements of the true composition matrix $P_0$ lie outside the 95% credible intervals (all are within the 99% credible intervals, though we do not report them in the table), whereas in Table 6 ten elements of $P_0$ fall outside the 95% credible intervals (9 of them are not captured even by the 99% credible intervals). Simultaneous credible regions for the whole matrix $P_0$ can also be constructed using the method (based on order statistics) suggested in Besag et al. (1995). Table 3 includes the 80% credible regions and these contain all elements of $P_0$ (the same holds for the 70% credible regions). In Table 6, nine elements of $P_0$ are still outside the 80% credible regions (7 of them are not captured even by the 90% credible regions). This is a natural consequence of not taking the correlation in the errors into account in the calculation of standard errors (posterior standard deviations here). In fact, the posterior standard deviations in Table 6 are much smaller than they should have been. Figure 4 shows the side-by-side barplots of the true source compositions ($P_0$) and the posterior means of P from the two approaches, the time series approach ($\hat{P}$) and the approach ignoring dependence ($\hat{P}_{indep}$), with $R^2$ values between $P_0$ and the estimates. Again it can be seen that $\hat{P}$ gives a much better approximation to the true source composition matrix $P_0$ than $\hat{P}_{indep}$ does.

5. APPLICATION TO ATLANTA DATA

The 1990 Atlanta data described in Section 1 have two types of temporal dependence structure, correlation in $\alpha_t$ and correlation in $\varepsilon_t$ (see Figures 2 and 3). We use model (3)

with q = 3 to analyze this data set consisting of 538 measurements on 9 chemical species.

For identifiability conditions, zeros are preassigned for CyHx+2MHx (cyclohexane+2-

methylhexane) and 2,3-DMP (2,3-dimethylpentane) of source 1 (Roadway), acetylene and

propene of source 2 (Gasoline), acetylene and 2,2,4-TMP (2,2,4-trimethylpentane) of source

3 (Headspace) since the relative concentrations of those species in each source are observed

to be very low from Table 1. An OLS estimate $\hat{A}^{*}_{OLS} = Y P_{measured}' (P_{measured} P_{measured}')^{-1}$, where $P_{measured}$ denotes the measured source compositions (with zeros preassigned and each row normalized to sum to 100), was used as an initial value for A. The mean source contribution was set to $\xi = (3.7, 1.4, 0.3)$, which is the arithmetic mean of $\hat{A}^{*}_{OLS}$. Note that the specification of the value of $\xi$ is somewhat arbitrary due to the scale invariance property mentioned in Section 3. We only need to ensure that $\xi$ and the initial value of A are on the

same scale. Since the measured source compositions ($P_{measured}$) can be regarded as prior information, we use as a prior distribution for P a truncated singular normal distribution with mean $P_{measured}$ and variance 900 for the nonzero elements of P, which ensures a fairly vague prior (the elements of $P_{measured}$ take values between 0 and 100). The scale matrix of the inverse Wishart prior for U was set to $\Psi = 16 \cdot \mathrm{diag}(1, 0.7, 0.08)$ with degrees of freedom $m = 20$, yielding the prior mean $\Psi / 16 = \mathrm{diag}(1, 0.7, 0.08)$. This choice of the hyperparameter values was made to ensure that the prior distribution is moderately informative but flexible enough to cover the range of possible values of U. For the hyperparameters of the priors on $\sigma_j^2$, $j = 1, \ldots, 9$, we take $\alpha_j = 5$ and $\beta_j = 48$ (the prior mean 12), and for the hyperparameters of the prior on V we set the scale matrix equal to $27 \cdot I_9$ and the degrees of freedom equal to 13 (so that the prior mean is $9 \cdot I_9$), ensuring a proper but relatively diffuse prior. For each parameter, a

posterior sample of size 1,000 was obtained by subsampling every 10th from 10,000 values

following a 10,000 burn-in period. Tables 7-9 contain posterior summaries for some model

parameters.

{Tables 7 and 8 about here}

The AR coefficients $\phi_k$ are estimated to be $\hat{\phi}_1 = 0.78$, $\hat{\phi}_2 = 0.68$, and $\hat{\phi}_3 = 0.48$, respectively,

suggesting that there is substantial autocorrelation in roadway contribution and moderate

autocorrelation in gasoline contribution and headspace contribution.

The side-by-side barplots of the measured source compositions (in Table 1) and

estimated compositions are given in Figure 5 with $R^2$ values between measured and

estimated compositions. In general, there seems to be good agreement between them.

{Figure 5 about here}

As mentioned in Section 1, the measured compositions are not the true source

compositions in the sense of Section 4 for the data though they are expected to be generally

close to the true compositions. For the Headspace composition profile (for which the

measured and the estimated compositions show the best agreement), all but one (2MPentan)

of the measured values fall in the 99% credible intervals. The 80% simultaneous credible

regions (constructed by the method of Besag et al. 1995) are also reported in Table 7 and

these capture all of the measured Headspace composition.

6. CONCLUSIONS AND DISCUSSION

In this article we develop a time series extension of multivariate receptor modeling in order

to capture in the estimation process extra variability due to temporal dependence in air

pollution data. Recent developments in MCMC methodology make estimation of

parameters of complex models possible. By modeling the dependence structure, we can get

more reliable estimates for the source compositions and their uncertainties, which are of our

primary interest. As a by-product we can assess the amount of variability and

autocorrelation in the source contributions and the errors. It also makes it possible to

forecast the level of pollutants ($y_{t+k}$) and the amount of pollution ($\alpha_{t+k}$), which has been

regarded as one of the model limitations in previous receptor modeling approaches (see the

EPA discussion at http://www.epa.gov/oar/oaqps/pams/analysis/receptor/rectxtsac.html).

Throughout the article we assume that the errors are normally distributed.

Environmental data often contain many outliers, and it is sometimes more appropriate to use

the lognormal distribution to describe the data. The usual transformation technique does

not help especially in the context of receptor modeling. By log-transforming the data the

chemical mass balance equation of the model no longer applies directly, and we need to deal

with model identifiability using different conditions. Alternatively, we may consider a

multivariate t-distribution or a mixture of normal distributions to describe the error

distribution. In the application to Atlanta data, the histogram of the residuals for each

species looks in general bell shaped, but shows a few outliers for some of the species. This

might suggest the use of a heavy-tailed distribution for the errors, though this was not pursued further

in this article. Non-normal dynamic modeling is still an active research area (see West and

Harrison 1997), and we expect that multivariate receptor modeling can be extended further

using non-normal dynamic models.
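As an illustration of the heavy-tailed alternative, the sketch below draws multivariate t errors through their scale-mixture-of-normals form and compares the tail frequency with that of normal errors. It assumes a diagonal scale matrix and 3 degrees of freedom; the per-species scales are hypothetical, and the helper names `draw_t_errors` and `tail_fraction` are introduced here only for illustration.

```python
import math
import random


def draw_t_errors(scales, df, rng):
    """One draw of multivariate-t errors with a diagonal scale matrix.

    Uses the scale-mixture form: normal errors divided by sqrt(chi2_df / df),
    with a single mixing variable shared by all species.
    """
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    w = math.sqrt(df / chi2)
    return [w * rng.gauss(0.0, s) for s in scales]


rng = random.Random(1)
scales = [1.0, 0.5, 2.0]          # hypothetical per-species error scales
n = 20000
t_draws = [draw_t_errors(scales, df=3, rng=rng) for _ in range(n)]
normal_draws = [[rng.gauss(0.0, s) for s in scales] for _ in range(n)]


def tail_fraction(samples, scales, c=3.0):
    """Share of errors for the first species beyond c scale units."""
    return sum(abs(e[0]) > c * scales[0] for e in samples) / len(samples)


t_tail = tail_fraction(t_draws, scales)
normal_tail = tail_fraction(normal_draws, scales)
print(t_tail, normal_tail)
```

The mixture form is convenient in an MCMC setting because, conditional on the mixing variable, the errors are again normal.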

Another assumption we have made is that the errors have mean 0. To be more realistic,

it would be preferable to generalize this to allow errors with unknown non-zero means,

corresponding to unknown sources. This again involves the development of new

identifiability conditions.

Finally, air pollution data are often obtained from multiple receptors. How to incorporate

spatial variability as well as temporal variability in modeling when multiple species are

measured is a challenging problem. Even in the case of no temporal dependence, this

problem remains open.

REFERENCES

Anderson, T.W. (1984), An Introduction to Multivariate Statistical Analysis (2nd ed.), New

York: Wiley.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), “Bayesian Computation and

Stochastic Systems,” Statistical Science, 10, 3-41.

Chib, S., and Greenberg, E. (1995), “Understanding the Metropolis-Hastings Algorithm,”

American Statistician, 49, 331-335.

Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. (1996), Markov Chain Monte Carlo in

Practice, London: Chapman & Hall.

Gleser, L.J. (1997), “Some Thoughts on Chemical Mass Balance Models,” Chemometrics

and Intelligent Laboratory Systems, 37, 15-22.

Henry, R.C. (1991), “Multivariate Receptor Models,” in Receptor Modeling for Air

Quality Management (ed. P. Hopke), pp.117-147. Amsterdam: Elsevier.

Henry, R.C. (1997), “History and Fundamentals of Multivariate Air Quality Receptor

Models,” Chemometrics and Intelligent Laboratory Systems, 37, 37-42.

Henry, R.C., and Kim, B.M. (1990), “Extension of Self-Modeling Curve Resolution to

Mixtures of More than Three Components. Part 1. Finding the Basic Feasible

Region,” Chemometrics and Intelligent Laboratory Systems, 8, 205-216.

Henry, R.C., Lewis, C.W., and Collins, J.F. (1994), “Vehicle-Related Hydrocarbon Source

Compositions from Ambient Data: the Grace/Safer Method,” Environmental Science

and Technology, 28, 823-832.

Henry, R.C., Lewis, C.W., and Hopke, P.K. (1984), “Review of Receptor Model

Fundamentals,” Atmospheric Environment, 18, 1507-1515.

Hopke, P.K. (1985), Receptor Modeling in Environmental Chemistry, New York: Wiley.

Hopke, P.K. (1991), “An Introduction to Receptor Modeling,” Chemometrics and

Intelligent Laboratory Systems, 10, 21-43.

Hopke, P.K. (1997), “The Chemical Mass Balance as a Multivariate Calibration Problem,”

Chemometrics and Intelligent Laboratory Systems, 37, 5-14.

Park, E. S. (1997), “Multivariate Receptor Modeling from a Statistical Science Viewpoint,”

unpublished Ph.D. dissertation, Texas A&M University, Dept. of Statistics.

Park, E. S., Spiegelman, C. H., and Henry, R. C. (1999), “Bilinear Estimation of Pollution

Source Profiles in Receptor Models,” Technical Report 006, University of

Washington, National Research Center for Statistics and the Environment.

Press, S. J. (1982), Applied Multivariate Analysis: Using Bayesian and Frequentist

Methods of Inference (2nd edition). New York: Krieger.

Tierney, L. (1994), “Markov Chains for Exploring Posterior Distributions,” Annals of

Statistics, 22, 1701-1762.

West, M., and Harrison, J. (1997), Bayesian Forecasting and Dynamic Models (2nd ed.), New York: Springer-Verlag.

Yang, H. (1994), “Confirmatory Factor Analysis and its Application to Receptor

Modeling,” unpublished Ph.D. dissertation, University of Pittsburgh, Dept. of

Mathematics and Statistics.

TABLES

Table 1. Measured source composition profiles

Source     acetylene  propene  nButane  2MPentan  3MPentan  benzene  CyHx+2MHx  2,3-DMP  2,2,4-TMP
roadway    0.181      0.094    0.197    0.116     0.069     0.132    0.049      0.043    0.120
gasoline   0          0.002    0.197    0.221     0.138     0.108    0.116      0.067    0.152
headspace  0          0.007    0.685    0.144     0.075     0.034    0.021      0.014    0.021

Note: Each source profile is normalized to sum to one.

Table 2. True source composition profiles ($P_0$)

Source    1      2      3      4      5      6      7
Source 1  0      0.248  0      0.102  0.306  0.128  0.216
Source 2  0.242  0      0.266  0      0.009  0.044  0.440
Source 3  0.311  0.250  0.039  0.302  0      0.099  0

Note: Each source profile is normalized to sum to one.

Table 3. Summaries of the posterior distribution for P when the data is generated by model (3) and the approach accounting for dependence is used

Param.  Source  Species  Mean    SD     LSCR   LCI    UCI    USCR
P       1       2        0.234   0.018  0.191  0.205  0.262  0.279
P       1       4        0.087   0.023  0.025  0.049  0.124  0.145
P       1       5        0.339*  0.016  0.299  0.313  0.306  0.378
P       1       6        0.124   0.013  0.088  0.101  0.147  0.158
P       1       7        0.216   0.033  0.137  0.160  0.269  0.293
P       2       1        0.204*  0.026  0.137  0.157  0.241  0.256
P       2       3        0.253   0.017  0.214  0.225  0.282  0.295
P       2       5        0.044   0.029  0.001  0.004  0.100  0.127
P       2       6        0.043   0.013  0.009  0.021  0.065  0.075
P       2       7        0.456   0.016  0.416  0.430  0.484  0.502
P       3       1        0.298   0.009  0.278  0.284  0.313  0.320
P       3       2        0.264   0.010  0.237  0.247  0.279  0.288
P       3       3        0.029   0.011  0.003  0.011  0.046  0.056
P       3       4        0.304   0.009  0.284  0.290  0.319  0.328
P       3       6        0.106   0.008  0.085  0.093  0.118  0.126

Note: 1. SD stands for the posterior standard deviation; 2. LCI and UCI stand for the lower limit and upper limit of the 95% credible interval; 3. Asterisk (*) indicates that the true parameter value is not captured by the 95% credible interval; 4. LSCR and USCR stand for the lower limit and upper limit of the 80% simultaneous credible region.

Table 4.

Posterior means and standard deviations of

and

(correlation matrix corresponding to

)

when the data is generated by model (3) and the approach accounting for dependence is used

Correlations in

0.826 (0.044) 1

0.834 (0.042) 0.010 (0.133) 1

0.817 (0.040) 0.245 (0.108)* -0.141 (0.102) 1

Note: 1. Posterior standard deviation is given in parentheses; 2. Asterisk (*) indicates

that the true parameter value is not captured by the 95% credible interval.

Table 5.

Posterior means and standard deviations of

, and

when the data is generated by model (3)

and the approach accounting for dependence is used

Diagonal elements of

0.379 (0.194)* 2.463 (1.295) 3.823 (1.238)

0.628 (0.178) 2.777 (1.304) 2.908 (1.002)

0.836 (0.100) 2.030 (0.924) 4.368 (1.010)

0.801 (0.102) 2.470 (1.127) 4.072 (1.077)

0.539 (0.207) 2.634 (1.431) 4.252 (1.509)

0.609 (0.121) 2.485 (0.950) 3.279 (0.921)

0.650 (0.191) 2.496 (1.457) 2.547 (1.029)

Note: 1. Posterior standard deviation is given in parentheses; 2. Asterisk (*) indicates that the true

parameter value is not captured by the 95% credible interval.

Table 6. Summaries of the posterior distribution for the parameters P and Σ_ε when the data is generated by model (3) but the approach ignoring dependence (given in Remark 1) is used

Param.  Source  Species  Mean    SD     LSCR   LCI    UCI    USCR
P       1       2        0.214*  0.014  0.180  0.190  0.236  0.246
P       1       4        0.084   0.014  0.050  0.060  0.106  0.115
P       1       5        0.339*  0.011  0.314  0.322  0.357  0.365
P       1       6        0.125   0.008  0.105  0.112  0.137  0.144
P       1       7        0.239   0.022  0.189  0.205  0.277  0.297
P       2       1        0.123*  0.012  0.096  0.104  0.142  0.154
P       2       3        0.201*  0.008  0.182  0.187  0.214  0.221
P       2       5        0.154*  0.011  0.125  0.136  0.172  0.179
P       2       6        0.063*  0.007  0.045  0.051  0.074  0.080
P       2       7        0.459*  0.009  0.439  0.445  0.474  0.482
P       3       1        0.292*  0.005  0.281  0.284  0.300  0.304
P       3       2        0.282*  0.005  0.269  0.274  0.291  0.296
P       3       3        0.036   0.007  0.021  0.026  0.047  0.054
P       3       4        0.286*  0.004  0.276  0.278  0.293  0.297
P       3       6        0.103   0.005  0.092  0.096  0.111  0.115

8.882

Mean

5.565*

1.453

8.648

1.853

10.415

1.403

11.375

1.621

8.275

2.246

7.873

0.840

7.255

2.768

Note: 1. SD stands for the posterior standard deviation; 2. LCI and UCI stand for the lower limit and upper limit of the 95%

credible interval; 3. Asterisk (*) indicates that the true parameter value is not captured by the 95% credible interval; 4. LSCR

and USCR stand for the lower limit and upper limit of the 80% simultaneous credible region.

Table 7. Summaries of the posterior distribution for P for the Atlanta data

Species: acetylene, propene, nButane, 2MPentan, 3MPentan, benzene, CyHx+2MHx, 2,3-DMP, 2,2,4-TMP

Param. (rows for each source): Mean, LSCR, LCI, UCI, USCR

roadway: 0.275 0.008 0.257 0.295 0.297 0.115 0.004 0.107 0.124 0.125 0.279 0.013 0.247 0.248 0.305 0.307 0.086 0.004 0.076 0.095 0.096 0.049 0.003 0.042 0.043 0.056 0.126 0.004 0.117 0.118 0.135 0.136 0.069 0.005 0.057 0.081

gasoline: 0.172 0.019 0.127 0.128 0.214 0.217 0.191 0.005 0.179 0.180 0.202 0.204 0.113 0.003 0.104 0.105 0.121 0.122 0.088 0.004 0.077 0.078 0.097 0.099 0.123 0.005 0.112 0.134 0.135 0.098 0.004 0.089 0.090 0.107 0.217 0.008 0.200 0.201 0.236 0.238

headspace: 0.009 0.007 0.000 0.001 0.029 0.034 0.693 0.035 0.606 0.609 0.773 0.776 0.116 0.011 0.083 0.087 0.142 0.145 0.063 0.007 0.042 0.045 0.080 0.081 0.052 0.010 0.028 0.029 0.074 0.076 0.021 0.009 0.001 0.002 0.044 0.046 0.017 0.007 0.008 0.088 0.093

Note: 1. SD stands for the posterior standard deviation; 2. LCI and UCI stand for the lower limit and upper limit of the 99% credible interval; 3. LSCR and USCR stand for the lower limit and upper limit of the 80% simultaneous credible region.

Table 8.

Posterior means and standard deviations of

and

(correlation matrix corresponding to

) for the Atlanta data

Correlations in

0.775 (0.036) 1

0.677 (0.062) 0.207 (0.045) 1

0.476 (0.114) -0.069 (0.051) -0.049 (0.047) 1

Note: Posterior standard deviation is given in parentheses.

Table 9.

Posterior means and standard deviations of

, diagonal elements of

, and

for the Atlanta data

Species

Diagonal elements of

Acetylene 0.512 (0.110) 1.039 (0.243) 1.148 (0.127)

Propene 0.550 (0.066) 0.405 (0.058) 0.506 (0.042)

nButane 0.400 (0.201) 2.929 (1.339) 3.683 (0.751)

2MPentan 0.221 (0.086) 0.520 (0.102) 0.534 (0.045)

3MPentan 0.162 (0.073) 0.280 (0.040) 0.349 (0.026)

Benzene 0.360 (0.092) 0.379 (0.055) 0.501 (0.040)

CyHx+2MHx 0.237 (0.088) 0.341 (0.048) 0.448 (0.036)

2,3-DMP 0.269 (0.086) 0.261 (0.033) 0.360 (0.027)

2,2,4-TMP 0.643 (0.062) 0.681 (0.138) 0.758 (0.070)

Note: Posterior standard deviation is given in parentheses.

Figure Titles and Legends

Figure 1. Autocorrelation function (ACF) plots of Y for Atlanta data

Figure 2. Autocorrelation function (ACF) plots of the residuals for Atlanta data, computed as Y minus the fit from the OLS estimate of the source contributions, where P is the measured source compositions in Table 1

Figure 3. Autocorrelation function (ACF) plots of the OLS estimates of the source contributions for Atlanta data

Figure 4. Side-by-side barplots of the true source compositions ($P_0$) and the estimated compositions obtained from two different approaches: the time series approach and the approach ignoring dependence

Figure 5. Side-by-side barplots of the measured source compositions and the estimated compositions for the Atlanta data

Figure 1. ACF panels, one per species: Acetylene, propene, nButane, 2MPentan, 3MPentan, benzene, CyHx+2MHx, 2,3-DMP, 2,2,4-TMP (horizontal axis 0 to 30; vertical axis ACF).

Figure 2. ACF panels, one per species: Acetylene, propene, nButane, 2MPentan, 3MPentan, benzene, CyHx+2MHx, 2,3-DMP, 2,2,4-TMP (horizontal axis 0 to 30; vertical axis ACF).

Figure 3. ACF panels: Roadway contribution, Gasoline contribution, Headspace contribution (horizontal axis 0 to 30; vertical axis ACF).

Figure 4. Barplots over species 1 to 7, estimated versus true, with panels Source 1, Source 2, and Source 3 for each approach. Time series approach: panel annotations =.99, =.98, =.99. Approach ignoring dependence: panel annotations =.97, =.78, =.99.

Figure 5. Barplots over species 1 to 9, estimated versus measured, with panels Roadway (annotation =.91), Gasoline (=.84), and Headspace (=.99).
