
Biostatistics (2014), 15, 3, pp. 484–497
doi:10.1093/biostatistics/kxu004
Advance Access publication on March 11, 2014
Assessment of source-specific health effects associated
with an unknown number of major sources of multiple
air pollutants: a unified Bayesian approach
EUN SUG PARK∗
Texas A&M Transportation Institute, College Station, TX 77843, USA
[email protected]
PHILIP K. HOPKE
Center for Air Resources Engineering and Science, Clarkson University, Potsdam, NY 13699, USA
MAN-SUK OH
Department of Statistics, Ewha Women’s University, Seoul 120-750, Korea
ELAINE SYMANSKI
University of Texas School of Public Health, Houston, TX 77030, USA
DAIKWON HAN
Department of Epidemiology & Biostatistics, Texas A&M University, College Station, TX 77843, USA
CLIFFORD H. SPIEGELMAN
Department of Statistics, Texas A&M University, College Station, TX 77843, USA
SUMMARY
There has been increasing interest in assessing health effects associated with multiple air pollutants emitted by specific sources. A major difficulty with achieving this goal is that the pollution source profiles are
unknown and source-specific exposures cannot be measured directly; rather, they need to be estimated by
decomposing ambient measurements of multiple air pollutants. This estimation process, called multivariate
receptor modeling, is challenging because of the unknown number of sources and unknown identifiability
conditions (model uncertainty). The uncertainty in source-specific exposures (source contributions) as well
as uncertainty in the number of major pollution sources and identifiability conditions have been largely
ignored in previous studies. A multipollutant approach that can deal with model uncertainty in multivariate
receptor models while simultaneously accounting for parameter uncertainty in estimated source-specific
exposures in assessment of source-specific health effects is presented in this paper. The methods are
applied to daily ambient air measurements of the chemical composition of fine particulate matter (PM2.5),
weather data, and counts of cardiovascular deaths from 1995 to 1997 for Phoenix, AZ, USA. Our approach
for evaluating source-specific health effects yields not only estimates of source contributions along with
their uncertainties and associated health effects estimates but also estimates of model uncertainty (posterior model probabilities) that have been ignored in previous studies. The results from our methods agreed
in general with those from the previously conducted workshop/studies on the source apportionment of
PM health effects in terms of the number of major contributing sources, estimated source profiles, and contributions. However, some of the adverse source-specific health effects identified in the previous studies
were not statistically significant in our analysis, probably because we incorporated into the estimation of the health effects parameters the parameter uncertainty in estimated source contributions, which previous studies ignored.
∗ To whom correspondence should be addressed.
© The Author 2014. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected].
Keywords: Cardiovascular mortality; Model uncertainty; Multipollutant approach; Multivariate receptor models;
PM health effects; Source apportionment.
1. INTRODUCTION
There has been growing interest in assessing health effects of air pollution based on multiple pollutants
(Dominici and others, 2010). There often exist high correlations among multiple pollutants measured in
ambient air (due to common sources and meteorology), and these high correlations lead to an estimation
problem such as collinearity when multiple pollutants are included as covariates in the health effect regression model. This issue makes a straightforward extension of the existing single-pollutant health effect(s)
model(s) inappropriate.
Considering “source-specific exposures” to quantify exposure to multiple air pollutants addresses
the aforementioned problem. Air pollution is generated from several sources and each source emits a
combination of various pollutants. While high inter-pollutant correlations are problematic and lead to
a collinearity problem in the multivariate regression models of pollutant-specific health effects, they are not a
problem in estimating source-specific health effects. As a matter of fact, high inter-pollutant correlations
make it possible to effectively characterize complex air pollution mixtures by a few common underlying
source types using multivariate receptor models (Park and others, 2001; Park, Spiegelman, and others,
2002; Park, Oh, and others, 2002). Multivariate receptor models resolve the measured mixture of pollutants into the contributions from the individual source types (see, e.g. Hopke, 2010 for a review). Source
contributions (i.e. the amount of pollution from each source) can be viewed as source-specific exposures.
While many studies have evaluated increased risks of health effects associated with individual pollutants, there has been limited epidemiologic research on the source-specific health effects
(see, e.g. Laden and others, 2000; Ito and others, 2006; Mar and others, 2006; Lall and others, 2011;
Ostro and others, 2011). In most studies, the estimated source contributions were used as if they were true
source-specific exposures (thus ignoring the parameter uncertainty associated with estimated source contributions) in the health effect models. As is well known in the measurement error model literature, ignoring parameter uncertainty in exposure estimation results in a bias in the estimated health effect regression
coefficients. Notable exceptions are work by Nikolov and others (2006, 2007) who proposed a structural
equation framework to assess source-specific health effects by fitting a receptor model and the health outcome model jointly to account for the parameter uncertainty associated with the estimated source contributions in the health effects estimates. More importantly, however, the number of major pollution sources that
drives the number of regression terms in the health effect model was assumed known (or treated as known
once it was estimated) in all of the previous studies. The same is also true for model identifiability conditions. While identifiability conditions that are useful in multivariate receptor modeling have been proposed
(Park and others, 2001; Park, Spiegelman, and others, 2002) and also utilized in recent source-specific
health effect studies (Nikolov and others, 2007), those conditions were assumed to be known. The model
uncertainty due to the unknown number of sources and identifiability conditions has never been taken into
account in assessment of source-specific health effects. In this paper, we propose a method that can deal
with model uncertainty while simultaneously incorporating parameter uncertainty for estimated source
contributions into assessment of source-specific health effects. In Section 2, we introduce a basic modeling framework for evaluation of source-specific health effects. Section 3 presents the generalizations of
the approach developed by Park, Oh, and others (2002) computing marginal likelihoods/posterior model
probabilities to estimate model uncertainty while simultaneously accounting for parameter uncertainty in
the estimation of source-specific health effects. Section 4 contains the simulation study. In Section 5, the
suggested method is applied to speciated ambient air monitoring data for fine particulate matter (PM2.5)
and daily cardiovascular mortality count data for Phoenix, AZ, USA over a 2-year period from 1995 to
1997. Finally, concluding remarks are made in Section 6.
2. BASIC MODELING FRAMEWORK
We employ a Bayesian hierarchical modeling framework to incorporate multiple data sources (ambient air
pollution data and health outcome data) into a single coherent statistical model. Our model consists of two
main parts: the receptor and health models. An additional hierarchical model on latent source contributions
and distributional assumptions on errors can also be added. A basic model form can be given as
\[
\text{Receptor model: } X_t = A_t P + E_t, \quad t = 1, \ldots, n, \tag{2.1}
\]
\[
\text{Health model: } g(E(y_t)) = \lambda + A_t \beta + Z_t \eta = \lambda + \sum_{k=1}^{q} \beta_k A_{tk} + \sum_{i=1}^{I} \eta_i Z_{ti}, \tag{2.2}
\]
where $X_t = (X_{t1}, X_{t2}, \ldots, X_{tJ})$: concentrations of $J$ pollutants (chemical species) measured in time $t$
at a receptor, $n$ = # of observations (# of days), $q$ = # of major pollution sources (unknown), $P$: $q \times J$
source composition matrix of which rows are the source composition profiles ($P_k$, $k = 1, \ldots, q$), $P_k =
(p_{k1}, p_{k2}, \ldots, p_{kJ})$: $k$th source composition profile consisting of the fractional amount of each chemical
species in the emissions from the $k$th source, $A_t = (A_{t1}, A_{t2}, \ldots, A_{tq})$: source contribution vector in time $t$
where $A_{tk}$ is the contribution from the $k$th source, $E_t = (E_{t1}, E_{t2}, \ldots, E_{tJ})$: measurement error in pollutant
concentrations at time $t$, $y_t$: health outcome at time $t$, $\lambda$: overall baseline risk of death, $\beta = (\beta_1, \ldots, \beta_q)'$:
parameter describing the influence of each source-specific exposure on mortality rate, $Z_t = (Z_{t1}, \ldots, Z_{tI})$:
(transformations of) confounding variables such as temperature, humidity, the day of the week, etc., and
$\eta = (\eta_1, \ldots, \eta_I)'$: parameter describing the influence of confounding variables on mortality. Note that
(2.2) represents an individual-lag model. Without loss of generality, we present the model (and the method)
using lag 0 contributions. The link function g can be changed depending on the type of the health outcome
variable. For example, it can be the identity function for a continuous health outcome variable such as lung
function, or the log function for a discrete health outcome variable such as daily mortality or morbidity
count. We will assume that the measured pollutant concentrations and the health outcomes are conditionally
independent given the unobserved source contributions and other covariates in the model, which seems to
be a reasonable assumption.
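As a rough illustration of the two-part structure in (2.1) and (2.2), the following sketch simulates pollutant concentrations and daily counts from shared latent source contributions. Everything numeric here is hypothetical (the dimensions, the profile matrix `P`, the gamma distribution for `A`, and the values of `lam`, `beta`, and `eta`); a log link with Poisson counts is assumed as one of the options mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, J, I = 200, 2, 6, 2  # days, sources, pollutants, confounders (all hypothetical)

# Hypothetical source composition profiles P (q x J); each row sums to 1.
P = np.array([[0.50, 0.30, 0.15, 0.05, 0.00, 0.00],
              [0.00, 0.00, 0.10, 0.20, 0.30, 0.40]])

# Non-negative source contributions A (n x q) and measurement errors E (n x J).
A = rng.gamma(shape=4.0, scale=0.5, size=(n, q))
E = rng.normal(scale=0.05, size=(n, J))
X = A @ P + E  # receptor model (2.1)

# Health model (2.2) with a log link: log E(y_t) = lambda + A_t beta + Z_t eta.
lam = 2.0
beta = np.array([0.10, 0.00])  # hypothetical: only source 1 affects the outcome
eta = np.array([0.05, -0.05])
Z = rng.normal(size=(n, I))
y = rng.poisson(np.exp(lam + A @ beta + Z @ eta))

# Pollutants emitted by the same source are strongly correlated, which is
# what makes pollutant-by-pollutant regression collinear.
corr = np.corrcoef(X, rowvar=False)
print(f"corr(pollutant 1, pollutant 2) = {corr[0, 1]:.2f}")
```

Given `A`, the rows of `X` and the counts `y` are generated independently, which mirrors the conditional independence assumption stated above.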
Our main goal is to estimate parameters, $A$ ($n \times q$ source contribution matrix of which rows are
$A_t$, $t = 1, \ldots, n$), $P$, and $\beta$ ($\lambda$ and $\eta$ are nuisance parameters), along with their parameter uncertainties
and model uncertainty. The parameters $\beta = (\beta_1, \ldots, \beta_q)'$ quantify the $q$ source-specific health effects.
A major source of model uncertainty in (2.1) is the unknown number of major pollution sources, q, and
identifiability conditions. It is well known in multivariate receptor modeling as well as in factor analysis that (2.1) is not identifiable as it is. The parameters A and P in (2.1) are not uniquely defined even
under the assumption that $q$ is known because $AP = ARR^{-1}P$ for any $q \times q$ non-singular matrix $R$. This
is a well-known non-identifiability problem in factor analysis and multivariate receptor models, which is
often referred to as “factor indeterminacy”. Fortunately, under some constraints (called identifiability conditions) on either A or P, the parameters are uniquely defined (see Park, Spiegelman, and others, 2002).
There could be many different identifiability conditions, but not all make sense in the context of receptor
modeling or source apportionment (see Park, Spiegelman, and others, 2002; Park, Oh, and others, 2002).
Because identifiability conditions are additional assumptions on the parameters, it is important to select
conditions that are physically meaningful in the given context of the problem (though there could be
many other purely mathematical identifiability conditions). We restrict the type of identifiability conditions to be compared to those that are reasonable and often used in the receptor modeling context. One
such type of identifiability conditions is pre-specification of zero elements in the source composition
matrix P:
C1. There are at least $q - 1$ zero elements in each row of $P$.
C2. For each $k = 1, \ldots, q$, the rank of $P^{(k)}$ is $q - 1$, where $P^{(k)}$ is the matrix composed of the columns
containing the assigned zeros in the $k$th row with those assigned zeros deleted.
C3. $P_{kj} = 1$ for some $j$ ($j = 1, \ldots, J$) for each $k = 1, \ldots, q$.
These conditions imply that some pollutants are not contributed by a particular source type. The position of
zeros in P can be specified by some prior knowledge on likely sources. In practice, the pre-assigned zero
elements are rarely actually zeros but they are small enough (i.e. minor compounds). As demonstrated in
Nikolov and others (2007) by a simulation study designed to investigate the impact of choosing incorrect
identifiability constraints, the results are not sensitive to errors of assuming a zero where the truth is nonzero as long as the pre-assigned zero element is not a major constituent of a source profile. For example,
when Road Dust is considered as one of the likely sources for PM speciation data, we may pre-assign zero
for sulfur (S) in the candidate profile using the prior knowledge that S is not a major constituent of the
Road Dust profile. Alternatively, we may consider pre-assigning zeros in the source contribution matrix
A, which implies that each source is missing on some days (Park, Spiegelman, and others, 2002), or the
combination of pre-specification of zeros in P and A. If we do not have any a priori information on the
position of zeros in P or A, then we may start with several candidate positions for zeros proposed by
exploratory data analysis (PMF or UNMIX can also be utilized in this step) and select the one giving the
highest posterior probability.
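A minimal sketch of how a candidate zero pattern could be screened against C1 and C2 is given below. It reads "with those assigned zeros deleted" as dropping row $k$ from the selected columns, which is an interpretation rather than something the text spells out. The function name `check_zero_pattern` and the numeric entries of `P` are hypothetical; the zero positions are those listed for Model 1 in Table 1. The last lines illustrate the factor indeterminacy $AP = ARR^{-1}P$ numerically.

```python
import numpy as np

def check_zero_pattern(P, zero_mask):
    """Check C1 and C2 for a q x J profile matrix P.

    zero_mask[k, j] is True where P[k, j] is pre-assigned to zero.
    """
    q = P.shape[0]
    c1 = bool(np.all(zero_mask.sum(axis=1) >= q - 1))
    c2 = True
    for k in range(q):
        cols = np.flatnonzero(zero_mask[k])
        if cols.size == 0:
            c2 = False
            continue
        # P^(k): columns holding the assigned zeros of row k, with row k
        # (the zeros themselves) removed.
        P_k = np.delete(P[:, cols], k, axis=0)
        if np.linalg.matrix_rank(P_k) != q - 1:
            c2 = False
    return c1, c2

species = ["Na", "Al", "Si", "S", "Cl", "K", "Ca", "Mn", "Fe",
           "Cu", "Zn", "Br", "Pb", "OC", "EC"]
zeros = [["Al", "S"], ["Cl", "Fe"], ["OC", "EC"]]  # Model 1 in Table 1

q, J = len(zeros), len(species)
mask = np.zeros((q, J), dtype=bool)
for k, names in enumerate(zeros):
    mask[k, [species.index(s) for s in names]] = True

rng = np.random.default_rng(1)
P = rng.uniform(0.1, 1.0, size=(q, J))  # hypothetical profile values
P[mask] = 0.0
print(check_zero_pattern(P, mask))

# Without constraints, (A, P) and (A R, R^{-1} P) fit the data equally well.
A = rng.uniform(size=(5, q))
R = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.2], [0.3, 0.0, 1.0]])
same_fit = np.allclose(A @ P, (A @ R) @ (np.linalg.inv(R) @ P))
print(same_fit)
```

A pattern in which two sources share exactly the same zero positions passes C1 but fails C2 under this reading, because the submatrix for either source then contains an all-zero row.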
As mentioned earlier, in most of the previous approaches to evaluating source-specific health effects,
the estimated source contributions were used as if they were true source-specific exposures or, at least,
the number of major pollution sources (that drives the number of regression terms in the health effects
model) and identifiability conditions were assumed known. As a matter of fact, estimation of parameters A, P, and β heavily depends on q and also on identifiability conditions employed (e.g. where to
pre-assign zeros in P), and these could be a major source of uncertainty in the estimated health effects.
In some cases, we may have some prior knowledge on this, i.e. the number of sources and the position
of zeros can be assumed known (see, e.g. Park and others, 2001). More frequently, that information is
lacking, and it becomes a main source of model uncertainty. In this paper, we aim to quantify the uncertainty in q and identifiability conditions along with other parameter uncertainties so that the inherent
variability of the source apportionment can be taken into account in the assessment of source-specific
health effects.
To develop a method that can cope with model uncertainty in the assessment of source-specific health
effects, we build on the Bayesian method developed in Park, Oh, and others (2002) that computes marginal
likelihoods/posterior model probabilities for a range of plausible models (with different q and identifiability conditions) by Markov chain Monte Carlo (MCMC). In this paper, we extend the method of
Park, Oh, and others (2002) in two aspects: (i) by adapting enhanced multivariate receptor models explicitly incorporating correlated source contributions and (ii) by including health models.
3. ESTIMATION OF PARAMETER UNCERTAINTIES AND MODEL UNCERTAINTIES
The method of Park, Oh, and others (2002) was developed under the assumption of orthogonal factor
models, i.e. assuming a priori uncorrelated centered source contributions,
\[
\gamma_t = A_t - \xi \sim N_q(0, I_q), \tag{3.1}
\]
where $\xi$ is the mean of $A_t$, $I_q$ is the $q \times q$ identity matrix, and $N_q(\cdot, \cdot)$ represents a $q$-variate normal
distribution. A normal distribution was chosen as a prior for $\gamma_t$ for mathematical convenience because $\gamma_t$
can be considered as latent variables, and as mentioned in Bartholomew and Knott (1999), the form of the
prior distribution of the latent variables is essentially arbitrary and largely a matter of convention. Note
that (2.1) can be reparameterized as
\[
X_t = \mu + \gamma_t P + E_t, \quad t = 1, \ldots, n, \tag{3.2}
\]
where $\mu = \xi P$. Although the method of Park, Oh, and others (2002) was shown to be robust to violation
of the prior assumption on the correlation structure of source contributions, in this paper, the method is
generalized to formally account for correlated source contributions, i.e. assuming oblique factor models
as a prior distribution for source contributions:
\[
\gamma_t \sim N_q(0, \Omega) \tag{3.3}
\]
or
\[
A_t \sim N_q(\xi, \Omega)\, I(A_t > 0), \tag{3.4}
\]
where $\Omega$ is a general covariance matrix. (Note that (2.1) and (2.2) can be reparameterized in terms of
centered source contributions γt without loss of generality. Especially, our key parameter β is not affected
whether or not the source contributions are centered.) Originally, we explored both priors (3.3) and (3.4),
but it turned out that, for the purpose of computing marginal likelihoods, reparameterization of (2.1) and
(2.2) using centered source contributions γt is more convenient because we can more effectively cope with
the issue of scale invariance of factors by a constant multiplication.
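The claim that centering leaves $\beta$ unchanged can be checked by substituting $A_t = \xi + \gamma_t$ into (2.1) and (2.2). The short derivation below is a worked restatement under that substitution only; no additional assumptions are introduced.

```latex
\begin{align*}
X_t &= A_t P + E_t = (\xi + \gamma_t) P + E_t = \xi P + \gamma_t P + E_t, \\
g(E(y_t)) &= \lambda + A_t \beta + Z_t \eta
           = (\lambda + \xi \beta) + \gamma_t \beta + Z_t \eta .
\end{align*}
```

Only the intercepts change ($\xi P$ in the receptor model and $\lambda + \xi\beta$ in the health model); the coefficient $\beta$ multiplying the source contributions is the same in both parameterizations.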
When yt is a continuous health outcome variable (e.g. peak ST-segment elevation in Nikolov and others,
2007) or the daily mortality (or morbidity) count with a large enough mean, yt may be assumed to follow
a normal distribution and the link function g in (2.2) becomes the identity function. Using the centered
source contribution and the identity link function for the health model, the model for source-specific health
effects can be written as follows:
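\[
\text{Receptor model: } X_t = \mu + \gamma_t P + E_t, \quad t = 1, \ldots, n, \tag{3.5}
\]
\[
\text{Health model: } y_t = \alpha + \gamma_t \beta + Z_t \eta + \varepsilon_t = \alpha + \sum_{k=1}^{q} \beta_k \gamma_{tk} + \sum_{i=1}^{I} \eta_i Z_{ti} + \varepsilon_t, \tag{3.6}
\]
where $\mu = \xi P$, $\alpha = \lambda + \xi\beta$. We make the following assumptions on errors $E_t$ and $\varepsilon_t$:
\[
E_t \sim N_J(0, \Sigma), \quad \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_J^2), \quad \varepsilon_t \sim N(0, \sigma_y^2).
\]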
To complete a Bayesian model specification, the prior distributions for the unknown parameters,
$\Gamma = \{\gamma_t, t = 1, \ldots, n\}$, $\Omega$, $P$, $\Sigma$, $\mu$, $\alpha$, $\beta$, $\eta$, and $\sigma_y^2$ are required. We assume independence between
$\{\gamma_t, t = 1, \ldots, n\}$ given the hyperparameter $\Omega$ and the parameters $\Gamma$, $\Omega$, $P$, $\Sigma$, $\alpha$, $\mu$, $\beta$, $\eta$, and $\sigma_y^2$ as
follows:
\[
p(\Gamma, P, \Sigma, \Omega, \mu, \alpha, \beta, \eta, \sigma_y^2) = p(\Gamma \mid \Omega)\, p(P)\, p(\Sigma)\, p(\Omega)\, p(\mu)\, p(\alpha)\, p(\beta)\, p(\eta)\, p(\sigma_y^2).
\]
As noted earlier, the prior distribution for the centered source contribution $\{\gamma_t\}$ is assumed to be $\gamma_t \sim
N_q(0, \Omega)$. For a prior distribution for $P$, we assume a point mass at zero for $q(q - 1)$ elements of $P$
preselected for identifiability conditions. For the free elements of $P$, we use the truncated normal distribution, $\mathrm{vec}P^{+} \sim N_{Jq - q(q-1)}(c_0, C_0)\, I(\mathrm{vec}P^{+} \geq 0)$, where $\mathrm{vec}P^{+}$ denotes the $Jq - q(q - 1)$-dimensional
vector of free elements of $P$ stacked columnwise, to incorporate the non-negativity constraints (which are
critical for the estimated source profiles to make physical sense) while facilitating computation. Other
choices for prior distributions for $P$ that have been used in the literature are discussed in Appendix
A1 of supplementary material available at Biostatistics online. For the prior distributions of $\Sigma$, $\alpha$,
$\mu$, $\Omega$, $\beta$, $\eta$, and $\sigma_y^2$, we assume $\sigma_j^{-2} \sim \mathrm{Gamma}(a_0, b_{0j})$, $j = 1, \ldots, J$, $\mu \sim N_J(m_0, M_0)$, $\alpha \sim N(\alpha_0, U_0)$,
$\beta \sim N_q(\beta_0, B_0)$, $\eta \sim N_I(\eta_0, \Psi_0)$, $\sigma_y^{-2} \sim \mathrm{Gamma}(a_0^y, b_0^y)$, and $\Omega \sim IW(R_0, r_0)$. Because of the complexity of the joint posterior distribution, MCMC methods are employed for estimation of parameters.
In the MCMC sampling algorithm employed here, one sweep consists of nine updating procedures: (i)
updating $\Gamma$, (ii) updating $P$, (iii) updating $\Sigma$, (iv) updating $\Omega$, (v) updating $\mu$, (vi) updating $\alpha$, (vii)
updating $\beta$, (viii) updating $\eta$, and (ix) updating $\sigma_y^2$. The full conditional distributions as well as the
joint posterior distribution are given in Appendix A1 of supplementary material available at Biostatistics
online.
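To give a sense of what one of the nine updating steps might look like, the sketch below draws $\beta$ from its full conditional under the identity-link health model (3.6) with the normal prior $\beta \sim N_q(\beta_0, B_0)$. It uses the standard conjugate normal result and is not copied from the supplementary appendix; the function name `draw_beta`, the toy data, and the seed are hypothetical, and the remaining eight updates are not shown.

```python
import numpy as np

def draw_beta(y, gamma, Z, alpha, eta, sigma2_y, beta0, B0, rng):
    """Draw beta from its full conditional under model (3.6).

    Standard conjugate result for a normal likelihood and normal prior:
      V = (B0^{-1} + gamma' gamma / sigma2_y)^{-1}
      m = V (B0^{-1} beta0 + gamma' r / sigma2_y),  r = y - alpha - Z eta
    """
    r = y - alpha - Z @ eta
    B0_inv = np.linalg.inv(B0)
    V = np.linalg.inv(B0_inv + gamma.T @ gamma / sigma2_y)
    V = (V + V.T) / 2.0  # remove round-off asymmetry before sampling
    m = V @ (B0_inv @ beta0 + gamma.T @ r / sigma2_y)
    return rng.multivariate_normal(m, V), m, V

# Toy data (hypothetical) in which the other parameters are held at known values.
rng = np.random.default_rng(4)
n, q = 2000, 4
beta_true = np.array([0.5, 0.0, 1.0, 0.5])
eta_true = np.array([1.0, 0.5])
gamma = rng.normal(size=(n, q))
Z = rng.normal(size=(n, 2))
y = 3.0 + gamma @ beta_true + Z @ eta_true + rng.normal(size=n)

draw, m, V = draw_beta(y, gamma, Z, alpha=3.0, eta=eta_true, sigma2_y=1.0,
                       beta0=np.zeros(q), B0=100.0 * np.eye(q), rng=rng)
print(np.round(m, 2))
```

In a full sweep, `gamma`, `alpha`, `eta`, and `sigma2_y` would be the current MCMC values rather than known constants, so the uncertainty in the source contributions propagates into the draws of $\beta$.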
As discussed in Section 2, each combination of q and identifiability conditions (here, position of prespecified zeros) leads to a different model. Assume that there are G candidate models associated with
different q and/or identifiability conditions, M1 , . . . , MG . Typical Bayesian model comparison is based
on posterior model probabilities
\[
p(M_g \mid X, y) = \frac{p(M_g)\, l(X, y \mid M_g)}{\sum_{k=1}^{G} p(M_k)\, l(X, y \mid M_k)}, \tag{3.7}
\]
where $p(M_g)$ is the prior model probability and $l(X, y \mid M_g)$ is the marginal likelihood for model $M_g$,
respectively. Note that under the indifference prior model probabilities, the posterior model probability is
proportional to the marginal likelihood and (3.7) becomes
\[
p(M_g \mid X, y) = \frac{l(X, y \mid M_g)}{\sum_{k=1}^{G} l(X, y \mid M_k)}. \tag{3.8}
\]
Thus, we only need to calculate the marginal likelihood of each model for model comparison. Note that
$l(X, y \mid M_g)$ can be estimated by
\[
\hat{l}(X, y \mid M_g) = \frac{l(X, y \mid \theta_g^c, M_g)\, p(\theta_g^c \mid M_g)}{\hat{\pi}(\theta_g^c \mid X, y, M_g)},
\]
where $l(X, y \mid \theta_g, M_g)$ is the likelihood of $\theta_g$ under model $M_g$, $p(\theta_g \mid M_g)$ is the prior of $\theta_g$ under
model $M_g$, $\theta_g^c$ is a single point of $\theta_g = (\Gamma_g, P_g, \Sigma_g, \Omega_g, \mu_g, \alpha_g, \beta_g, \eta_g, \sigma_{y,g}^2)$ under model $M_g$, and
$\hat{\pi}(\theta_g^c \mid X, y, M_g)$ is the estimated posterior density function of $\pi(\theta_g^c \mid X, y, M_g)$. For simplicity of
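notation, we suppress the index $g$ for the rest of the section. Using the same algorithm of Oh (1999),
we have
\[
\begin{aligned}
\pi(\theta^c \mid X, y, M) = E\big[ & \pi(\Gamma^c \mid P^c, \Sigma^c, \Omega^c, \mu^c, \alpha^c, \beta^c, \eta^c, \sigma_y^{2c}) \times \pi(P^c \mid \Gamma, \Sigma^c, \Omega^c, \mu^c, \alpha^c, \beta^c, \eta^c, \sigma_y^{2c}) \\
& \times \pi(\Sigma^c \mid \Gamma, P, \Omega^c, \mu^c, \alpha^c, \beta^c, \eta^c, \sigma_y^{2c}) \times \pi(\Omega^c \mid \Gamma, P, \Sigma, \mu^c, \alpha^c, \beta^c, \eta^c, \sigma_y^{2c}) \\
& \times \pi(\mu^c \mid \Gamma, P, \Sigma, \Omega, \alpha^c, \beta^c, \eta^c, \sigma_y^{2c}) \times \pi(\alpha^c \mid \Gamma, P, \Sigma, \Omega, \mu, \beta^c, \eta^c, \sigma_y^{2c}) \\
& \times \pi(\beta^c \mid \Gamma, P, \Sigma, \Omega, \mu, \alpha, \eta^c, \sigma_y^{2c}) \times \pi(\eta^c \mid \Gamma, P, \Sigma, \Omega, \mu, \alpha, \beta, \sigma_y^{2c}) \\
& \times \pi(\sigma_y^{2c} \mid \Gamma, P, \Sigma, \Omega, \mu, \alpha, \beta, \eta) \big].
\end{aligned} \tag{3.9}
\]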
Appendix A2 of supplementary material available at Biostatistics online contains a description of the
algorithm of Oh (1999) and additional explanation of (3.9). Because the full conditional posterior density functions are known, $\pi(\theta^c \mid X, y, M)$ can be estimated as the sample average of the product of the
full conditional posterior density functions using the posterior sample of $\theta$ under model $M$. Although in
theory $\theta^c$ can be an arbitrary point in the parameter space, for efficiency it needs to be chosen from the
region with high posterior density. An approximate posterior mode of $\theta$, based on a preliminary MCMC
run, would be a reasonable choice for θ c .
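The identity behind the marginal likelihood estimator (likelihood times prior divided by posterior density, all evaluated at a single point) and the conversion in (3.8) can be illustrated on a toy problem. The sketch below is not the receptor/health model: it assumes $y_i \sim N(\theta, 1)$ with prior $\theta \sim N(0, \tau^2)$, where the posterior density is available in closed form, so no MCMC averaging of full conditionals is needed. The candidate "models" differ only in the hypothetical prior variance `tau2`, and all function names are illustrative.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_marginal(y, tau2):
    """log l(y | M) = log l(y | theta_c) + log p(theta_c) - log pi(theta_c | y)."""
    n = y.size
    post_var = 1.0 / (n + 1.0 / tau2)
    post_mean = post_var * y.sum()
    theta_c = post_mean  # posterior mode, a high-density evaluation point
    log_lik = log_normal_pdf(y, theta_c, 1.0).sum()
    log_prior = log_normal_pdf(theta_c, 0.0, tau2)
    log_post = log_normal_pdf(theta_c, post_mean, post_var)
    return log_lik + log_prior - log_post

def posterior_model_probs(log_md):
    """Equation (3.8), evaluated stably by subtracting the maximum log value."""
    log_md = np.asarray(log_md, dtype=float)
    w = np.exp(log_md - log_md.max())
    return w / w.sum()

rng = np.random.default_rng(2)
y = rng.normal(loc=0.3, scale=1.0, size=50)
log_md = [log_marginal(y, tau2) for tau2 in (0.1, 1.0, 100.0)]
probs = posterior_model_probs(log_md)
print(np.round(probs, 4))
```

Because (3.8) exponentiates differences of log marginal likelihoods, a gap of even a few tens of log units between two models pushes the posterior probability of the better one to essentially 1.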
4. SIMULATION
We conducted a simulation study to assess the performance of the new method that incorporates parameter uncertainty in source contributions into estimation of source-specific health effects while coping with
uncertainty in both the number of sources and identifiability conditions in multivariate receptor models.
The elemental concentration data were generated from (3.5) with $\mu = (7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7)'$,
and using the same parameter values for $P$, $\Sigma$, and $\Omega$ used in the simulation study of Nikolov and others
(2006). The health outcome data were generated from (3.6), incorporating the weather variables with the
following parameter values: $\beta = (0.5, 0, 1, 0.5)'$, $\eta = (1, 0.5)'$, $\alpha = 3$, $\sigma_y^2 = 1$. The weather data were
generated from the lognormal distribution, $\log(Z_{ti}) \sim N(0, 0.5)$, $t = 1, \ldots, n$, $i = 1, 2$. The sample size
n was taken to be 100. Appendix B of supplementary material available at Biostatistics online provides
further details on simulation. As opposed to assuming the known number of sources (q0 = 4) and identifiability conditions, we defined the candidate models by varying the number of sources q from 1 to 5 with
pre-specified $q \times (q - 1)$ zeros for each $q$, and estimated the parameters under each $q$-source model as
well as computed marginal likelihoods. Although the diagonal covariance matrix was used for $\Omega$ to generate the simulated data (as in Nikolov and others, 2006), we treated $\Omega$ as an unknown general covariance
matrix in estimation, allowing estimation of correlations among the source contributions. The simulation
was repeated 200 times. The estimated marginal likelihoods for each q-source model are reported (on a
log scale as logMD) in Table B1 of Appendix B of supplementary material available at Biostatistics online.
The selected model is the one having the maximum logMD. The true model (with q = 4) was selected for
199 out of 200 simulations, i.e. for 99.5% of times. (For only one simulation out of 200, a model with
$q = 5$ was selected.) We also monitored the $R^2$ values between the true source composition profiles and
the estimated source composition profiles as well as the true source contributions and the estimated source
contributions for $q = 4$. Throughout the simulation, $R^2$ values were all greater than 0.94, which indicates
that the estimated source profiles and contributions agree well with the true source profiles and contributions. Figure B1 in Appendix B of supplementary material available at Biostatistics online presents the time
series plots of the true centered source contributions and estimated centered source contributions (based
on one of the simulated datasets), which again shows that the estimated source contributions are very close
to the true source contributions. The estimated source-specific health effect parameter β was very close to
the true value throughout the simulation. The 95% posterior intervals were computed in each simulation.
Overall, the posterior interval for each element of β contains the true value approximately 96% of times
(there were 32 instances out of 800 that the posterior interval for an element of β does not contain the true
value). The average widths of posterior intervals for beta parameters for each of the four sources over 200
simulations were 0.31, 0.38, 0.42, and 0.39, respectively.
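The health-outcome part of this simulation design can be sketched as follows, using the stated values of $\beta$, $\eta$, $\alpha$, $\sigma_y^2$, and $n$. Two points are assumptions rather than details from the text: the centered source contributions are drawn from $N_4(0, I)$ (the paper takes the covariance from Nikolov and others, 2006), and the 0.5 in $N(0, 0.5)$ is read as a variance. The least-squares fit treats the contributions as known, so it is only an oracle sanity check and not the proposed joint method.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
beta = np.array([0.5, 0.0, 1.0, 0.5])
eta = np.array([1.0, 0.5])
alpha, sigma2_y = 3.0, 1.0

# Assumption: gamma_t ~ N_4(0, I); the covariance used in the paper is not given here.
gamma = rng.normal(size=(n, 4))
# Assumption: 0.5 is the variance of log(Z_ti).
Z = np.exp(rng.normal(scale=np.sqrt(0.5), size=(n, 2)))
y = alpha + gamma @ beta + Z @ eta + rng.normal(scale=np.sqrt(sigma2_y), size=n)

# Oracle least-squares fit with the true gamma plugged in.
D = np.column_stack([np.ones(n), gamma, Z])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
beta_hat = coef[1:5]

# Coverage bookkeeping from the text: 32 misses among 200 x 4 posterior intervals.
coverage = 1 - 32 / (200 * 4)
print(np.round(beta_hat, 2), coverage)
```

The coverage arithmetic reproduces the approximately 96% figure quoted above.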
5. APPLICATION TO PHOENIX DATA
The proposed method was applied to the daily 24-h PM2.5 speciation data along with temperature and relative humidity data and daily cardiovascular mortality count data for Phoenix (Hopke and others, 2006;
Mar and others, 2006). The original PM2.5 speciation data consist of 981 samples with measured concentrations for 46 chemical elements collected from March 11, 1995 through June 30, 1998. The Phoenix
mortality data consist of daily numbers of deaths due to cardiovascular causes for February 9, 1995 to
December 31, 1997 for residents ≥65 years of age at the time of death. More details on the data are provided in Appendix C of supplementary material available at Biostatistics online. The first important step in
multivariate receptor modeling is to select an appropriate subset of species for the analysis. The rationale
for selection of species is also given in Appendix C of supplementary material available at Biostatistics
online. The final selection of species used in the model fitting included the following 15 species: Na, Al,
Si, S, Cl, K, Ca, Mn, Fe, Cu, Zn, Br, Pb, OC, and EC. There are actually 1027 days between March 11, 1995
and December 31, 1997. While there were no missing values for the mortality data for that time period, the
PM2.5 speciation data were available only for 868 days at most. We imputed missing values in the data
to enable exploration of a different lag structure of the source-specific effects on mortality other than a lag
0 effect. Further details on imputation are available in Appendix C of supplementary material available at
Biostatistics online. To provide a consistent basis for comparison with other results from Mar and others
(2006), we used the same mortality model as that used in Mar and others (2006), controlling for confounding by including an indicator variable for extreme temperatures, a day of week variable, and smoothing
terms for time trends, temperature, and relative humidity in our analysis (namely, natural spline smoothers
with 12 degrees of freedom (df) for the smoothing of time trend, 5 df for the smoothing of temperature with 2 days lag, and 2 df for the smoothing of relative humidity with 0 days lag). We constructed a
range of different receptor models (resulting from each combination of different number of sources and
identifiability conditions) to be compared for the Phoenix data. Based on several previous studies on the
Phoenix PM2.5 data (Ramadan and others, 2003; Lewis and others, 2003; Hopke and others, 2006) and
the NUMFACT procedure (Henry and others, 1999), we presumed that the number of major sources is
between 3 and 8. For candidate positions of zeros in P under each q-source model, we also use the information on the major sources from previous studies. For example, we use the information that the element
Al is not a major constituent of Motor Vehicles and pre-assign a zero to it. Note that we use this type of
information from previous studies only to find out the plausible sets of identifiability conditions (positions of zeros) under each q-source model. Other than that, the candidate models do not depend on the
results from those previous studies. Ten candidate models with different number of sources (q = 3, 4,
5, 6, 7, 8) and different pre-specification of identifiability conditions (zeros in P) that we compare are
given in Table 1.
Table 1. Marginal likelihoods for candidate models for Phoenix PM2.5 speciation data and cardiovascular mortality at lag 0 days

Model number  q  Pre-specified position of zeros in P   LogMD             PostP
1             3  Source 1: Al, S                        −1.5761 × 10^−4   0.0000
                 Source 2: Cl, Fe
                 Source 3: OC, EC
2             4  Source 1: Al, Si, K                    −1.5580 × 10^−4   0.0000
                 Source 2: Al, Cl, Fe
                 Source 3: Cl, OC, EC
                 Source 4: Al, Si, Ca
3             5  Source 1: Al, Si, S, K                 −1.5598 × 10^−4   0.0000
                 Source 2: Al, Si, Cl, Fe
                 Source 3: Cl, Cu, OC, EC
                 Source 4: K, Ca, Br, EC
                 Source 5: Al, Si, OC, EC
4             5  Source 1: Al, Si, S, K                 −1.5219 × 10^−4   0.0000
                 Source 2: Al, Si, Cl, Fe
                 Source 3: Cl, Cu, OC, EC
                 Source 4: Al, Si, Ca, Fe
                 Source 5: K, Ca, Br, EC
5             5  Source 1: Al, S, Cl, Fe                −1.5549 × 10^−4   0.0000
                 Source 2: Cl, Fe, Cu, Zn
                 Source 3: Cl, Cu, OC, EC
                 Source 4: Al, Si, Ca, Cu
                 Source 5: Al, Si, K, Ca
6             6  Source 1: Al, Si, S, Cl, Ca            −1.5392 × 10^−4   0.0000
                 Source 2: Al, Si, Cl, K, Fe
                 Source 3: Cl, Cu, Pb, OC, EC
                 Source 4: Al, Si, Cl, Ca, Fe
                 Source 5: Al, Si, Cl, K, Ca
                 Source 6: Al, Cl, K, Ca, EC
7             6  Source 1: Al, Si, S, Cl, K             −1.5153 × 10^−4   1.0000
                 Source 2: Cl, Ca, Mn, Br, EC
                 Source 3: Cl, Cu, Pb, OC, EC
                 Source 4: Al, Si, Cl, Ca, Fe
                 Source 5: Cl, Fe, Cu, Zn, Pb
                 Source 6: Al, K, Pb, OC, EC
8             6  Source 1: Al, Si, S, Cl, K             −1.5316 × 10^−4   0.0000
                 Source 2: Al, Si, Cl, K, Fe
                 Source 3: Cl, Cu, Pb, OC, EC
                 Source 4: Al, Si, Cl, Ca, Fe
                 Source 5: Al, Cl, Mn, Br, EC
                 Source 6: Al, K, Pb, OC, EC
9             7  Source 1: Al, Si, S, Cl, K, Fe         −1.5440 × 10^−4   0.0000
                 Source 2: Al, Cl, Ca, Mn, Br, EC
                 Source 3: Cl, Cu, Br, Pb, OC, EC
                 Source 4: Al, Si, Cl, Ca, Fe, Zn
                 Source 5: Na, Al, Si, Cl, K, Ca
                 Source 6: Na, Cl, Fe, Cu, Zn, Pb
                 Source 7: Al, K, Cu, Pb, OC, EC
10            8  Source 1: Al, S, Ca, Mn, Zn, Br, Pb    −1.5849 × 10^−4   0.0000
                 Source 2: Al, Cl, Mn, Fe, Cu, Zn, Pb
                 Source 3: Cl, Mn, Cu, Br, Pb, OC, EC
                 Source 4: Al, Si, Cl, Ca, Mn, Fe, Zn
                 Source 5: Al, Si, Cl, K, Ca, Mn, Br
                 Source 6: Al, Cl, Mn, Zn, Br, Pb, EC
                 Source 7: Al, Cu, Zn, Br, Pb, OC, EC
                 Source 8: S, K, Fe, Cu, Pb, OC, EC

LogMD and PostP denote the log of marginal likelihood and posterior model probability, respectively.

The PM2.5 data and cardiovascular mortality data were simultaneously fitted to estimate source composition profiles, contributions, health effect parameters as well as marginal likelihood under each model
in Table 1 at lag 0–5 days. Because concentrations of PM2.5 species differed by two or three orders
of magnitude, each element was scaled by its sample standard deviation before running MCMC. It is
known that convergence problems are common when elemental concentrations are on widely different
scales (Nikolov and others, 2007). However, after the run, the individual elements of the estimated source
profiles were multiplied by the corresponding sample standard deviations to bring them back to the
original scale so that the relative amounts of species in each profile are physically interpretable. Then,
the source composition profiles in the original scale were normalized so that the sum of elements of
each source composition profile is 1. The estimated source contributions were likewise rescaled by multiplying them by the corresponding normalizing constant for each source (i.e. the sum of the elements of the corresponding source composition profile in the original scale). Note that although the PM data
were scaled by their standard deviations at the outset, this scaling does not affect the estimation of P
or A; it only changes the scale of the source composition matrix during the MCMC implementation.
Rescaling back to the original scale at the end preserves the relative amounts of species in each source
profile. The following hyperparameter values were used for generating MCMC samples:
$a_0 = 0.01$, $b_{0j} = 0.01$ ($j = 1, \ldots, 15$), $c_0 = 0.5$, $C_0 = 100$, $m_0 = \bar{X}$, $M_0 = 100$, $\alpha_0 = 0$, $U_0 = 100$, $\beta_0 = 0_q$, $B_0 = 100 \times I_q$, $\eta_0 = 0_I$ with prior covariance matrix $100 \times I_I$, $a_0^y = 0.01$, and $b_0^y = 0.01$. Also, an orthogonal factor model, which assumes a priori that the covariance matrix of $\gamma_t$ is $I_q$, was employed for $\gamma_t$. Note that, from a Bayesian standpoint, this covariance matrix can be viewed as a
hyperparameter of the prior distribution for $\gamma$, and, as shown in Park, Oh, and others (2002), the correlation structure in $\gamma$ can still be uncovered by the sample correlations of the estimated $\gamma$'s even in the case
where the identity assumption is misspecified a priori. For each model, an approximate posterior mode is obtained from a
preliminary MCMC run, and this is used for $\theta_g^c$, whose components include $P_g^c$, $\mu_g^c$, $\alpha_g^c$, $\beta_g^c$, $\eta_g^c$, and $\sigma_{y,g}^{2c}$, the point at which the marginal
likelihood is calculated. The approximate posterior mode is obtained by evaluating the joint posterior density over 100 000 iterations after the first 100 000 draws are discarded. A main MCMC run is then started
from $\theta_g^c$, and the samples are collected for 200 000 iterations, subsampling
every 10th value (resulting in 20 000 samples), without additional burn-in. The marginal likelihood
for each model can be computed during sample generation without storing the samples. Table 1 gives
the estimated log marginal likelihood for each model, jointly modeling the PM2.5 and cardiovascular
mortality data at lag 0 days, together with the posterior probability of each model under the indifference prior.
Model 7, with six sources, is selected as the best model because its posterior probability
is almost 1. For the other lags (1–5 days), Model 7 likewise had the highest posterior model
probability, again very close to 1. This is consistent with the observation of Mar and others (2006), who
noted that there are six source components most consistently reported for the Phoenix data by the various
investigators/methods. The estimated source profiles under Model 7 are given in Table C1 of Appendix
C of supplementary material available at Biostatistics online. The estimated source profiles and contributions based on the PM2.5 and cardiovascular mortality data with other lags are materially the same as
those in Table C1 of supplementary material available at Biostatistics online. The estimated source profiles
in Table C1 of supplementary material available at Biostatistics online were labeled as Traffic, Smelter,
Soil/Crustal, Biomass/Wood combustion, Secondary sulfate, and Sea salt, respectively. The reasoning for
such labeling is also provided in Appendix C of supplementary material available at Biostatistics online.
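
The step from log marginal likelihoods to posterior model probabilities under the indifference prior, as reported in Table 1, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `posterior_model_probs` and the three log marginal likelihood values are hypothetical. The log-sum-exp shift is a standard numerical device assumed here, since large negative log marginal likelihoods underflow when exponentiated directly.

```python
import math


def posterior_model_probs(log_marginals, log_priors=None):
    """Posterior model probabilities from log marginal likelihoods.

    With log_priors=None an indifference (uniform) prior over the
    candidate models is assumed.
    """
    n = len(log_marginals)
    if log_priors is None:
        # Indifference prior: each candidate model has prior mass 1/n.
        log_priors = [-math.log(n)] * n
    log_joint = [lm + lp for lm, lp in zip(log_marginals, log_priors)]
    # Log-sum-exp trick: subtract the maximum before exponentiating
    # so that the largest term becomes exp(0) = 1.
    shift = max(log_joint)
    weights = [math.exp(v - shift) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical log marginal likelihoods for three candidate models.
example_log_md = [-1200.0, -1150.0, -1100.0]
for model, prob in enumerate(posterior_model_probs(example_log_md), start=1):
    print(f"Model {model}: posterior probability {prob:.4f}")
```

When the log marginal likelihoods differ by tens of units, essentially all posterior mass falls on the best model, which is the pattern seen in the PostP column of Table 1.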
To obtain the corresponding source contributions that are scaled appropriately by the normalizing constants
of the source profiles, S and OC in Table C1 of supplementary material available at Biostatistics online

Table 2. Rescaled source composition profiles along with contributions under Model 7 and source-specific health effects on cardiovascular mortality

                                     Source 1   Source 2   Source 3        Source 4        Source 5     Source 6
Species #  Species                   (Traffic)  (Smelter)  (Soil/Crustal)  (Biomass/Wood   (Secondary   (Sea salt)
                                                                           combustion)     sulfate)

Source compositions
1          Na                        0.00       0.01       1.05            0.02            3.56         33.48
2          Al                        0          0.07       16.11           0               0.18         0
3          Si                        0          0.82       42.17           0               0.20         8.24
4          (NH4)2SO4                 0          84.49      7.23            0.60            54.22        5.37
5          Cl                        0          0          0               0               0            39.68
6          K                         0          0.50       6.33            2.27            1.07         0
7          Ca                        0.40       0          14.42           0               0.06         6.07
8          Mn                        0.08       0          0.32            0.00            0.00         0.31
9          Fe                        1.95       0.28       12.03           0               0            5.39
10         Cu                        0.08       0.36       0               0.03            0            0.06
11         Zn                        0.28       0.46       0.27            0.04            0            1.21
12         Br                        0.03       0          0.06            0.03            0.07         0.19
13         Pb                        0.12       0.55       0               0.09            0            0
14         OM                        75.71      12.44      0               82.65           39.93        0
15         EC                        21.34      0          0               14.27           0.71         0

Source contributions (µg/m3)
           Mean                      4.37       0.31       0.83            1.92            2.34         0.03
           Standard deviation        3.62       0.65       0.64            1.85            0.97         0.10
           Fifth-to-95th increment   11.81      1.92       1.91            4.97            2.97         0.14

Source-specific health effects
           β (lag 0)                 −0.37      0.46*      0.08            0.13            0.28         −0.12
           β (lag 1)                 0.29       −0.02      −0.17           0.24            0.07         −0.14
           β (lag 2)                 0.00       0.08       −0.22           0.19            0.08         −0.03
           β (lag 3)                 0.08       −0.03      −0.34           0.26            0.16         0.18
           β (lag 4)                 0.30       0.15       −0.45           −0.21           0.45         0.12
           β (lag 5)                 −0.25      0.01       −0.03           0.10            −0.27        0.39*

(i) Source profiles are normalized to sum to 100%. (ii) Entries shown as 0 (rather than 0.00) mark the positions of pre-assigned zeros. (iii) The β coefficient of PM2.5 contributions from each source type represents the estimated average increase in daily mortality counts per 5th-to-95th percentile increment of estimated PM2.5 source contribution (in µg/m3) while controlling for other variables in the model.
(iv) Significant health effects are marked with an asterisk.

needed to be renormalized. All of the S present will be in the form of sulfate, which has
three times the mass of S, and OC includes only the carbon in organic compounds, not the
H, O, and N that are also in the organic species but are not measured. Since ammonium is not included
in the profile, S was multiplied by 4.125 to convert it to (NH4)2SO4. Likewise, OC was multiplied by 1.5 to
convert it to OM, which includes H, O, and N. The renormalized source composition profiles, along with
the means, standard deviations, and 5th-to-95th percentile increments of the source contributions, are
presented in Table 2.
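
The renormalization described above can be sketched for a single source as follows. The conversion factors 4.125 and 1.5 are those stated in the text; everything else (the function name `renormalize_profile`, the four-species profile, and the two daily contributions) is hypothetical and chosen only for illustration. Scaling the contributions by the sum of the converted profile follows the normalizing-constant rule described earlier in this section, and is an assumption about how the two steps combine.

```python
S_TO_AMMONIUM_SULFATE = 4.125  # S -> (NH4)2SO4, factor stated in the text
OC_TO_OM = 1.5                 # OC -> OM, factor stated in the text


def renormalize_profile(profile, contributions):
    """Convert S and OC, then renormalize one source profile.

    profile: dict mapping species name to its amount in the original scale.
    contributions: list of daily contributions for the same source.
    Returns the profile in percent (summing to 100) and the contributions
    multiplied by the normalizing constant (sum of the converted profile).
    """
    converted = {}
    for species, amount in profile.items():
        if species == "S":
            converted["(NH4)2SO4"] = amount * S_TO_AMMONIUM_SULFATE
        elif species == "OC":
            converted["OM"] = amount * OC_TO_OM
        else:
            converted[species] = amount
    normalizing_constant = sum(converted.values())
    percent = {sp: 100.0 * amt / normalizing_constant
               for sp, amt in converted.items()}
    scaled = [c * normalizing_constant for c in contributions]
    return percent, scaled


# Hypothetical profile and contributions for one source.
example_profile = {"S": 0.2, "OC": 0.4, "EC": 0.1, "Fe": 0.075}
example_contributions = [2.0, 3.0]
percent, scaled = renormalize_profile(example_profile, example_contributions)
print(percent)
print(scaled)
```

Because the profile is divided by the same constant that multiplies the contributions, the product of contribution and (fractional) profile is unchanged by the rescaling.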
Figure 1 contains the time series plots of the estimated source contributions (in µg/m3 ) for 1027 days
(March 11, 1995 to December 31, 1997). In general, the daily patterns of estimated source contributions of
Figure 1 are similar to those of Figure 1 in Mar and others (2006) and those of Ramadan and others (2003,
Figure 2). The plots of predicted versus measured concentrations for species used in model fitting as well
as the plot of the sum of estimated source contributions versus measured total PM2.5 mass concentration
(which was not used in model fitting) are also provided in Appendix D of supplementary material available

Fig. 1. Time series plots of the estimated source contributions (in µg/m3) for 1027 days (March 11, 1995 to December 31, 1997). Panels: Traffic, Smelter, Soil/Crustal, Biomass/Wood combustion, Secondary sulfate, and Sea salt; horizontal axis: Day; vertical axis: Mass Contribution.

at Biostatistics online for this paper. The $R^2$ values between the measured and predicted values were greater
than 0.7 for all but two minor species, Zn and Br. The $R^2$ value between the sum of the estimated source
contributions and measured total PM2.5 mass concentration was 0.93.
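
A fit check of the kind reported above can be sketched as follows. The text does not say how $R^2$ was computed; this sketch assumes the squared Pearson correlation between measured and predicted values. The function name `r_squared` and all numbers are hypothetical.

```python
def r_squared(measured, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(measured) != len(predicted):
        raise ValueError("sequences must have the same length")
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    s_mp = sum((m - mean_m) * (p - mean_p)
               for m, p in zip(measured, predicted))
    s_mm = sum((m - mean_m) ** 2 for m in measured)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    return s_mp ** 2 / (s_mm * s_pp)


# Hypothetical daily contributions (rows: days, columns: sources) and
# hypothetical measured total PM2.5 mass for the same days.
daily_contributions = [
    [4.0, 0.3, 0.8, 2.0],
    [6.5, 0.2, 1.1, 1.4],
    [2.1, 0.6, 0.5, 3.2],
    [8.0, 0.1, 0.9, 2.5],
]
measured_mass = [7.5, 9.0, 6.9, 11.8]

# Sum the estimated source contributions for each day, then compare
# the daily sums with the measured total mass.
predicted_mass = [sum(day) for day in daily_contributions]
print(round(r_squared(measured_mass, predicted_mass), 3))
```

The same function can be applied species by species to the measured and predicted concentrations.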
Table 2 also presents source-specific health effects on cardiovascular mortality at lags 0–5 days. Only
the health effects due to Source 2 (that appears to be Smelter) at lag 0 days and Source 6 (that appears
to be Sea salt) at lag 5 days were statistically significant (i.e. a 95% credible interval does not contain 0).
In Mar and others (2006), the effects of Secondary sulfate (lag 0), Traffic (lag 1), Smelter (lag 0), and Sea
salt (lag 5) on cardiovascular mortality were found to be statistically significant. The effects of the fine
particle soil and biomass burning factors were not significant at any lags in Mar and others (2006) as well
as in our analysis. Overall, the health effects of Smelter, Sea salt, Soil/Crustal, and Biomass/Wood combustion appeared to be consistent between Mar and others (2006) and our analysis. However, Secondary sulfate
at lag 0 days and Traffic at lag 1 day, which were statistically significant in Mar and others (2006), were not
statistically significant in our analysis. Recall that the uncertainties in the estimated source contributions
were not accounted for in the estimation of the health effects parameters in Mar and others (2006), which
may have introduced bias, as noted in Mar and others (2006). Our approach, by contrast,
does account for the uncertainty in the estimated source contributions when estimating the health effects
parameters. The statistically insignificant estimates for Secondary sulfate (lag 0) and Traffic (lag 1) might therefore
be a consequence of incorporating uncertainty that had previously been ignored.
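
The significance criterion used above (a 95% credible interval that does not contain 0) can be sketched from posterior draws of a health effect coefficient as follows. The text does not state which kind of credible interval was used; an equal-tailed interval is assumed here. The function names and the deterministic stand-in "draws" are hypothetical.

```python
def equal_tailed_interval(draws, level=0.95):
    """Equal-tailed credible interval from posterior draws.

    Quantiles are obtained by linear interpolation between order statistics.
    """
    ordered = sorted(draws)
    last = len(ordered) - 1

    def quantile(p):
        pos = p * last
        lower = int(pos)
        upper = min(lower + 1, last)
        return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])

    tail = (1.0 - level) / 2.0
    return quantile(tail), quantile(1.0 - tail)


def is_significant(draws, level=0.95):
    """True when the credible interval lies entirely on one side of 0."""
    low, high = equal_tailed_interval(draws, level)
    return low > 0.0 or high < 0.0


# Hypothetical stand-ins for MCMC draws of two coefficients:
# an evenly spaced grid is used so the example is reproducible.
beta_draws_a = [0.10 + 0.01 * i for i in range(101)]   # 0.10 to 1.10
beta_draws_b = [-0.50 + 0.01 * i for i in range(101)]  # -0.50 to 0.50

print(equal_tailed_interval(beta_draws_a), is_significant(beta_draws_a))
print(equal_tailed_interval(beta_draws_b), is_significant(beta_draws_b))
```

When the draws of the coefficient come from a joint model in which the source contributions are themselves sampled, the interval widens to reflect that extra uncertainty, which is the mechanism suggested above for the loss of significance of some effects.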
6. DISCUSSION
We presented a new statistical approach to the evaluation of source-specific health effects associated with
an unknown number of major sources of multiple air pollutants. The proposed method effectively deals
with model uncertainty in source apportionment while accounting for parameter uncertainty that has been
largely ignored in the previous assessments of source-specific health effects. The new approach was illustrated with PM2.5 speciation data and cardiovascular mortality data from Phoenix. The results from our
methods agreed in general with those from the previously conducted workshop/studies on PM source
apportionment and health effects for the Phoenix data in terms of the number of major contributing sources
as well as estimated source profiles and contributions. For the health effects of specific sources, there were
similarities and dissimilarities. The health effects of Soil/Crustal and Biomass/Wood combustion were
statistically insignificant in both Mar and others (2006) and our analysis. However, while Mar and others
(2006) identified adverse health effects for four source types (Secondary sulfate at lag 0, Traffic at lag 1, Smelter
at lag 0, and Sea salt at lag 5), our analysis identified only two (Smelter at lag 0 and Sea salt at lag 5) to
be statistically significant, which seems to be a natural consequence of incorporating uncertainty in the
estimated source contributions into the health effects parameter estimation.
SUPPLEMENTARY MATERIAL
Supplementary Material is available at http://biostatistics.oxfordjournals.org.
ACKNOWLEDGMENTS
The authors thank Dr Therese Mar for providing Phoenix mortality data and two anonymous reviewers for
helpful comments. Conflict of Interest: None declared.
FUNDING
Research described in this article was conducted under contract to the Health Effects Institute (HEI),
an organization jointly funded by the United States Environmental Protection Agency (EPA) (Assistance
Award No. R-82811201) and certain motor vehicle and engine manufacturers. The contents of this article
do not necessarily reflect the views of HEI, or its sponsors, nor do they necessarily reflect the views and
policies of the EPA or motor vehicle and engine manufacturers.
REFERENCES
BARTHOLOMEW, D. J. AND KNOTT, M. (1999). Latent Variable Models and Factor Analysis, 2nd edition. New York:
Oxford University Press Inc.
DOMINICI, F., PENG, R. D., BARR, C. D. AND BELL, M. L. (2010). Protecting human health from air pollution: shifting
from a single-pollutant to a multipollutant approach. Epidemiology 21, 187–194.
HENRY, R. C., PARK, E. S. AND SPIEGELMAN, C. H. (1999). Comparing a new algorithm with the classic methods for
estimating the number of factors. Chemometrics and Intelligent Laboratory Systems 48, 91–97.
HOPKE, P. K. (2010). The application of receptor modeling to air quality data. Pollution Atmosphérique, Special Issue,
91–109. http://www.appa.asso.fr/national/Pages/article.php?art=487
HOPKE, P. K., ITO, K., MAR, T., CHRISTENSEN, W. F., EATOUGH, D. J., HENRY, R. C., KIM, E., LADEN, F., LALL, R.,
LARSON, T.V., AND OTHERS. (2006). PM source apportionment and health effects: 1. Intercomparison of source
apportionment results. Journal of Exposure Science and Environmental Epidemiology 16(3), 275–286.
ITO, K., CHRISTENSEN, W. F., EATOUGH, D. J., HENRY, R. C., KIM, E., LADEN, F., LALL, R., LARSON, T.V., NEAS,
L., HOPKE, P. K., AND THURSTON, G.D. (2006). PM source apportionment and health effects: 2. An investigation
of intermethod variability in associations between source-apportioned fine particle mass and daily mortality in
Washington, DC. Journal of Exposure Science and Environmental Epidemiology 16, 300–310.
LADEN, F., NEAS, L. M., DOCKERY, D. W. AND SCHWARTZ, J. (2000). Association of fine particulate matter from
different sources with daily mortality in six U.S. cities. Environmental Health Perspectives 108, 941–947.
LALL, R., ITO, K. AND THURSTON, G. D. (2011). Distributed lag analyses of daily hospital admissions and source-apportioned
fine particle air pollution. Environmental Health Perspectives 119, 455–460.
LEWIS, C. W., NORRIS, G. A., CONNER, T. L. AND HENRY, R. C. (2003). Source apportionment of Phoenix PM2.5
aerosol with the Unmix receptor model. Journal of the Air and Waste Management Association 53, 325–338.
MAR, T. F., ITO, K., KOENIG, J. Q., LARSON, T. V., EATOUGH, D. J., HENRY, R. C., KIM, E., LADEN, F., LALL, R., NEAS,
L., AND OTHERS (2006). PM source apportionment and health effects. 3. Investigation of inter-method variations
in associations between estimated source contributions of PM(2.5) and daily mortality in Phoenix, AZ. Journal of
Exposure Science and Environmental Epidemiology 16(4), 311–320.
NIKOLOV, M. C., COULL, B. A., CATALANO, P. J. AND GODLESKI, J. J. (2006). An informative Bayesian structural
equation model to assess source-specific health effects of air pollution. Harvard University Biostatistics Working
Paper Series, 46.
NIKOLOV, M. C., COULL, B. A., CATALANO, P. J. AND GODLESKI, J. J. (2007). An informative Bayesian structural
equation model to assess source specific health effects of air pollution. Biostatistics 8, 609–624.
OH, M. S. (1999). Estimation of posterior density functions from a posterior sample. Computational Statistics and
Data Analysis 29, 411–427.
OSTRO, B., TOBIAS, A., QUEROL, X., ALASTUEY, A., AMATO, F., PEY, J., PÉREZ, N. AND SUNYER, J. (2011). The effects
of particulate matter sources on daily mortality: a case-crossover study of Barcelona, Spain. Environmental Health
Perspectives 119, 1781–1787.
PARK, E. S., GUTTORP, P. AND HENRY, R. C. (2001). Multivariate receptor modeling for temporally correlated data by
using MCMC. Journal of the American Statistical Association 96, 1171–1183.
PARK, E. S., SPIEGELMAN, C. H. AND HENRY, R. C. (2002). Bilinear estimation of pollution source profiles and amounts
by using multivariate receptor models. Environmetrics 13, 775–798.
PARK, E. S., OH, M. S. AND GUTTORP, P. (2002). Multivariate receptor models and model uncertainty. Chemometrics
and Intelligent Laboratory Systems 60, 49–67.
RAMADAN, Z., EICKHOUT, B., SONG, X. H., BUYDENS, L. M. C. AND HOPKE, P. K. (2003). Comparison of positive
matrix factorization and multilinear engine for the source apportionment of particulate pollutants. Chemometrics
and Intelligent Laboratory Systems 66, 15–28.
[Received January 24, 2013; revised January 19, 2014; accepted for publication January 20, 2014]