Biostatistics (2014), 15, 3, pp. 484–497
doi:10.1093/biostatistics/kxu004
Advance Access publication on March 11, 2014

Assessment of source-specific health effects associated with an unknown number of major sources of multiple air pollutants: a unified Bayesian approach

EUN SUG PARK∗
Texas A&M Transportation Institute, College Station, TX 77843, USA

PHILIP K. HOPKE
Center for Air Resources Engineering and Science, Clarkson University, Potsdam, NY 13699, USA

MAN-SUK OH
Department of Statistics, Ewha Women’s University, Seoul 120-750, Korea

ELAINE SYMANSKI
University of Texas School of Public Health, Houston, TX 77030, USA

DAIKWON HAN
Department of Epidemiology & Biostatistics, Texas A&M University, College Station, TX 77843, USA

CLIFFORD H. SPIEGELMAN
Department of Statistics, Texas A&M University, College Station, TX 77843, USA

∗ To whom correspondence should be addressed.
© The Author 2014. Published by Oxford University Press. All rights reserved.

SUMMARY

There has been increasing interest in assessing health effects associated with multiple air pollutants emitted by specific sources. A major difficulty with achieving this goal is that the pollution source profiles are unknown and source-specific exposures cannot be measured directly; rather, they need to be estimated by decomposing ambient measurements of multiple air pollutants. This estimation process, called multivariate receptor modeling, is challenging because of the unknown number of sources and unknown identifiability conditions (model uncertainty). The uncertainty in source-specific exposures (source contributions) as well as uncertainty in the number of major pollution sources and identifiability conditions have been largely ignored in previous studies. A multipollutant approach that can deal with model uncertainty in multivariate receptor models while simultaneously accounting for parameter uncertainty in estimated source-specific exposures in assessment of source-specific health effects is presented in this paper. The methods are applied to daily ambient air measurements of the chemical composition of fine particulate matter (PM2.5), weather data, and counts of cardiovascular deaths from 1995 to 1997 for Phoenix, AZ, USA. Our approach for evaluating source-specific health effects yields not only estimates of source contributions along with their uncertainties and associated health effects estimates but also estimates of model uncertainty (posterior model probabilities) that have been ignored in previous studies. The results from our methods agreed in general with those from the previously conducted workshop/studies on the source apportionment of PM health effects in terms of number of major contributing sources, estimated source profiles, and contributions. However, some of the adverse source-specific health effects identified in the previous studies were not statistically significant in our analysis, probably because we incorporated the parameter uncertainty in estimated source contributions, which was ignored in the previous studies, into the estimation of health effects parameters.

Keywords: Cardiovascular mortality; Model uncertainty; Multipollutant approach; Multivariate receptor models; PM health effects; Source apportionment.

1. INTRODUCTION

There has been growing interest in assessing health effects of air pollution based on multiple pollutants (Dominici and others, 2010). There often exist high correlations among multiple pollutants measured in ambient air (due to common sources and meteorology), and these high correlations lead to estimation problems such as collinearity when multiple pollutants are included as covariates in the health effect regression model.
This issue makes a straightforward extension of the existing single-pollutant health effect(s) model(s) inappropriate. Considering “source-specific exposures” to quantify exposure to multiple air pollutants addresses the aforementioned problem. Air pollution is generated from several sources and each source emits a combination of various pollutants. While high inter-pollutant correlations are problematic and lead to a collinearity problem in the multivariate regression models of pollutant-specific health effects, it is not a problem in estimating source-specific health effects. As a matter of fact, high inter-pollutant correlations make it possible to effectively characterize complex air pollution mixtures by a few common underlying source types using multivariate receptor models (Park and others, 2001; Park, Spiegelman, and others, 2002; Park, Oh, and others, 2002). Multivariate receptor models resolve the measured mixture of pollutants into the contributions from the individual source types (see, e.g. Hopke, 2010 for a review). Source contributions (i.e. the amount of pollution from each source) can be viewed as source-specific exposures. While many studies have evaluated increased risks of health effects associated with individual pollutants, there has been limited epidemiologic research on the source-specific health effects (see, e.g. Laden and others, 2000; Ito and others, 2006; Mar and others, 2006; Lall and others, 2011; Ostro and others, 2011). In most studies, the estimated source contributions were used as if they were true source-specific exposures (thus ignoring the parameter uncertainty associated with estimated source contributions) in the health effect models. As is well known in the measurement error model literature, ignoring parameter uncertainty in exposure estimation results in a bias in the estimated health effect regression coefficients. 
Notable exceptions are the works of Nikolov and others (2006, 2007), who proposed a structural equation framework to assess source-specific health effects by fitting a receptor model and the health outcome model jointly to account for the parameter uncertainty associated with the estimated source contributions in the health effects estimates. More importantly, however, the number of major pollution sources that drives the number of regression terms in the health effect model was assumed known (or treated as known once it was estimated) in all of the previous studies. The same is also true for model identifiability conditions. While identifiability conditions that are useful in multivariate receptor modeling have been proposed (Park and others, 2001; Park, Spiegelman, and others, 2002) and also utilized in recent source-specific health effect studies (Nikolov and others, 2007), those conditions were assumed to be known. The model uncertainty due to the unknown number of sources and identifiability conditions has never been taken into account in assessment of source-specific health effects.

In this paper, we propose a method that can deal with model uncertainty while simultaneously incorporating parameter uncertainty for estimated source contributions into assessment of source-specific health effects. In Section 2, we introduce a basic modeling framework for evaluation of source-specific health effects. Section 3 presents the generalizations of the approach developed by Park, Oh, and others (2002) computing marginal likelihoods/posterior model probabilities to estimate model uncertainty while simultaneously accounting for parameter uncertainty in the estimation of source-specific health effects. Section 4 contains the simulation study.
In Section 5, the suggested method is applied to speciated ambient air monitoring data for fine particulate matter (PM2.5) and daily cardiovascular mortality count data for Phoenix, AZ, USA over a 2-year period from 1995 to 1997. Finally, concluding remarks are made in Section 6.

2. BASIC MODELING FRAMEWORK

We employ a Bayesian hierarchical modeling framework to incorporate multiple data sources (ambient air pollution data and health outcome data) into a single coherent statistical model. Our model consists of two main parts: the receptor and health models. An additional hierarchical model on latent source contributions and distributional assumptions on errors can also be added. A basic model form can be given as

Receptor model:
$$X_t = A_t P + E_t, \quad t = 1, \ldots, n, \tag{2.1}$$

Health model:
$$g(E(y_t)) = \lambda + A_t \beta + Z_t \eta = \lambda + \sum_{k=1}^{q} \beta_k A_{tk} + \sum_{i=1}^{I} \eta_i Z_{ti}, \tag{2.2}$$

where

$X_t = (X_{t1}, X_{t2}, \ldots, X_{tJ})$: concentrations of $J$ pollutants (chemical species) measured at time $t$ at a receptor,
$n$: number of observations (number of days),
$q$: number of major pollution sources (unknown),
$P$: $q \times J$ source composition matrix whose rows are the source composition profiles ($P_k$, $k = 1, \ldots, q$),
$P_k = (p_{k1}, p_{k2}, \ldots, p_{kJ})$: $k$th source composition profile consisting of the fractional amount of each chemical species in the emissions from the $k$th source,
$A_t = (A_{t1}, A_{t2}, \ldots, A_{tq})$: source contribution vector at time $t$, where $A_{tk}$ is the contribution from the $k$th source,
$E_t = (E_{t1}, E_{t2}, \ldots, E_{tJ})$: measurement error in pollutant concentrations at time $t$,
$y_t$: health outcome at time $t$,
$\lambda$: overall baseline risk of death,
$\beta = (\beta_1, \ldots, \beta_q)^{\top}$: parameter describing the influence of each source-specific exposure on mortality rate,
$Z_t = (Z_{t1}, \ldots, Z_{tI})$: (transformations of) confounding variables such as temperature, humidity, the day of the week, etc., and
$\eta = (\eta_1, \ldots, \eta_I)^{\top}$: parameter describing the influence of confounding variables on mortality.
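As an illustration of how the two parts of the model fit together, the following is a minimal sketch that generates synthetic data from (2.1) and (2.2) with a log link and Poisson counts. All numerical values, the gamma distribution used to draw non-negative source contributions, and the function name `simulate_receptor_health` are assumptions made only for this example; they are not taken from the paper.

```python
import numpy as np

def simulate_receptor_health(n=200, seed=0):
    """Draw synthetic data from the receptor model (2.1) and health model (2.2).

    Every numerical value below is a hypothetical choice for illustration.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical q x J source composition matrix (q = 2 sources, J = 4 species);
    # each row is a profile of fractional amounts summing to 1.
    P = np.array([
        [0.6, 0.3, 0.1, 0.0],
        [0.0, 0.2, 0.3, 0.5],
    ])
    q, J = P.shape
    # Non-negative source contributions A_t (gamma draws are an assumption).
    A = rng.gamma(shape=2.0, scale=1.0, size=(n, q))
    # Measurement error E_t.
    E = rng.normal(0.0, 0.05, size=(n, J))
    # Receptor model (2.1): X_t = A_t P + E_t.
    X = A @ P + E
    # One confounder Z_t and hypothetical health-model parameters.
    Z = rng.normal(size=(n, 1))
    lam, beta, eta = 1.0, np.array([0.05, 0.10]), np.array([0.2])
    # Health model (2.2) with g = log: log E(y_t) = lambda + A_t beta + Z_t eta.
    mean_y = np.exp(lam + A @ beta + Z @ eta)
    y = rng.poisson(mean_y)
    return X, A, P, Z, y

X, A, P, Z, y = simulate_receptor_health()
```

In this sketch only `X`, `Z`, and `y` would be observed in practice; `A` and `P` play the role of the unknown quantities that the receptor model has to recover.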
Note that (2.2) represents an individual-lag model. Without loss of generality, we present the model (and the method) using lag 0 contributions. The link function $g$ can be changed depending on the type of the health outcome variable. For example, it can be the identity function for a continuous health outcome variable such as lung function, or the log function for a discrete health outcome variable such as daily mortality or morbidity count. We will assume that the measured pollutant concentrations and the health outcomes are conditionally independent given the unobserved source contributions and other covariates in the model, which seems to be a reasonable assumption.

Our main goal is to estimate the parameters $A$ (the $n \times q$ source contribution matrix whose rows are $A_t$, $t = 1, \ldots, n$), $P$, and $\beta$ ($\lambda$ and $\eta$ are nuisance parameters), along with their parameter uncertainties and model uncertainty. The parameters $\beta = (\beta_1, \ldots, \beta_q)^{\top}$ quantify the $q$ source-specific health effects.

A major source of model uncertainty in (2.1) is the unknown number of major pollution sources, $q$, and identifiability conditions. It is well known in multivariate receptor modeling as well as in factor analysis that (2.1) is not identifiable as it is. The parameters $A$ and $P$ in (2.1) are not uniquely defined even under the assumption that $q$ is known because $AP = A R R^{-1} P$ for any $q \times q$ non-singular matrix $R$. This is a well-known non-identifiability problem in factor analysis and multivariate receptor models, which is often referred to as “factor indeterminacy”. Fortunately, under some constraints (called identifiability conditions) on either $A$ or $P$, the parameters are uniquely defined (see Park, Spiegelman, and others, 2002).
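The factor indeterminacy described above can be checked numerically. The sketch below, with arbitrary illustrative matrices (none of the values come from the paper), shows that $(AR)(R^{-1}P)$ reproduces $AP$ exactly even though the transformed contributions differ from the original ones.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, J = 5, 2, 4

# Arbitrary illustrative contributions A (n x q) and profiles P (q x J).
A = rng.uniform(0.5, 2.0, size=(n, q))
P = rng.uniform(0.0, 1.0, size=(q, J))

# Any non-singular q x q matrix R yields an alternative pair (A R, R^{-1} P).
R = np.array([[1.0, 0.3],
              [0.2, 1.0]])
A_star = A @ R
P_star = np.linalg.inv(R) @ P

# Both parameterizations give the same fitted values A P ...
same_fit = np.allclose(A @ P, A_star @ P_star)
# ... although the source contributions themselves are different.
different_params = not np.allclose(A, A_star)
```

Without extra constraints, the data cannot distinguish `(A, P)` from `(A_star, P_star)`, which is why identifiability conditions are needed.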
There could be many different identifiability conditions, but not all make sense in the context of receptor modeling or source apportionment (see Park, Spiegelman, and others, 2002; Park, Oh, and others, 2002). Because identifiability conditions are additional assumptions on the parameters, it is important to select conditions that are physically meaningful in the given context of the problem (though there could be many other purely mathematical identifiability conditions). We restrict the type of identifiability conditions to be compared to those that are reasonable and often used in the receptor modeling context. One such type of identifiability conditions is pre-specification of zero elements in the source composition matrix $P$:

C1. There are at least $q - 1$ zero elements in each row of $P$.
C2. For each $k = 1, \ldots, q$, the rank of $P^{(k)}$ is $q - 1$, where $P^{(k)}$ is the matrix composed of the columns containing the assigned zeros in the $k$th row with those assigned zeros deleted.
C3. $P_{kj} = 1$ for some $j$ ($j = 1, \ldots, J$) for each $k = 1, \ldots, q$.

These conditions imply that some pollutants are not contributed by a particular source type. The position of zeros in $P$ can be specified by some prior knowledge on likely sources. In practice, the pre-assigned zero elements are rarely actually zeros but they are small enough (i.e. minor compounds). As demonstrated in Nikolov and others (2007) by a simulation study designed to investigate the impact of choosing incorrect identifiability constraints, the results are not sensitive to errors of assuming a zero where the truth is nonzero as long as the pre-assigned zero element is not a major constituent of a source profile. For example, when Road Dust is considered as one of the likely sources for PM speciation data, we may pre-assign zero for sulfur (S) in the candidate profile using the prior knowledge that S is not a major constituent of the Road Dust profile.
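Conditions C1 and C2 can be verified mechanically for a candidate pattern of pre-assigned zeros. The following sketch is one possible reading of the conditions as stated above: the helper name `check_zero_conditions` and the two example matrices are hypothetical, and C3 (a scaling condition) is not checked.

```python
import numpy as np

def check_zero_conditions(P, zero_mask):
    """Return (C1 holds, C2 holds) for a q x J matrix P and a boolean mask
    marking the pre-assigned zero positions. Illustrative helper only."""
    P = np.asarray(P, dtype=float)
    zero_mask = np.asarray(zero_mask, dtype=bool)
    q = P.shape[0]
    # C1: at least q - 1 pre-assigned zeros in every row of P.
    c1 = bool(np.all(zero_mask.sum(axis=1) >= q - 1))
    # C2: for each k, keep the columns where row k has assigned zeros,
    # delete row k itself, and require the remaining matrix to have rank q - 1.
    c2 = True
    for k in range(q):
        cols = np.flatnonzero(zero_mask[k])
        P_k = np.delete(P[:, cols], k, axis=0)
        rank = 0 if P_k.size == 0 else np.linalg.matrix_rank(P_k)
        if rank != q - 1:
            c2 = False
    return c1, c2

# Hypothetical 3-source, 5-species profile matrix with two zeros per row.
P_good = np.array([
    [0.0, 0.0, 0.5, 0.3, 0.2],
    [0.4, 0.0, 0.0, 0.4, 0.2],
    [0.1, 0.6, 0.3, 0.0, 0.0],
])
good = check_zero_conditions(P_good, P_good == 0)

# Zeros placed in the same two columns for every source: C1 holds, C2 fails.
P_bad = np.array([
    [0.0, 0.0, 0.5, 0.3, 0.2],
    [0.0, 0.0, 0.3, 0.3, 0.4],
    [0.0, 0.0, 0.2, 0.2, 0.6],
])
bad = check_zero_conditions(P_bad, P_bad == 0)
```

The second example illustrates why counting zeros alone is not enough: when every source has its zeros in the same columns, those columns carry no information for separating the sources.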
Alternatively, we may consider pre-assigning zeros in the source contribution matrix A, which implies that each source is missing on some days (Park, Spiegelman, and others, 2002), or the combination of pre-specification of zeros in P and A. If we do not have any a priori information on the position of zeros in P or A, then we may start with several candidate positions for zeros proposed by exploratory data analysis (PMF or UNMIX can also be utilized in this step) and select the one giving the highest posterior probability. As mentioned earlier, in most of the previous approaches to evaluating source-specific health effects, the estimated source contributions were used as if they were true source-specific exposures or, at least, the number of major pollution sources (that drives the number of regression terms in the health effects model) and identifiability conditions were assumed known. As a matter of fact, estimation of parameters A, P, and β heavily depends on q and also on identifiability conditions employed (e.g. where to pre-assign zeros in P), and these could be a major source of uncertainty in the estimated health effects. In some cases, we may have some prior knowledge on this, i.e. the number of sources and the position of zeros can be assumed known (see, e.g. Park and others, 2001). More frequently, that information is lacking, and it becomes a main source of model uncertainty. In this paper, we aim to quantify the uncertainty in q and identifiability conditions along with other parameter uncertainties so that the inherent variability of the source apportionment can be taken into account in the assessment of source-specific health effects. 
To develop a method that can cope with model uncertainty in the assessment of source-specific health effects, we build on the Bayesian method developed in Park, Oh, and others (2002) that computes marginal likelihoods/posterior model probabilities for a range of plausible models (with different $q$ and identifiability conditions) by Markov chain Monte Carlo (MCMC). In this paper, we extend the method of Park, Oh, and others (2002) in two aspects: (i) by adapting enhanced multivariate receptor models explicitly incorporating correlated source contributions and (ii) by including health models.

3. ESTIMATION OF PARAMETER UNCERTAINTIES AND MODEL UNCERTAINTIES

The method of Park, Oh, and others (2002) was developed under the assumption of orthogonal factor models, i.e. assuming a priori uncorrelated centered source contributions,

$$\gamma_t = A_t - \xi \sim N_q(0, I_q), \tag{3.1}$$

where $\xi$ is the mean of $A_t$, $I_q$ is the $q \times q$ identity matrix, and $N_q(\cdot, \cdot)$ denotes a $q$-variate normal distribution. A normal distribution was chosen as a prior for $\gamma_t$ for mathematical convenience because $\gamma_t$ can be considered as latent variables, and as mentioned in Bartholomew and Knott (1999), the form of the prior distribution of the latent variables is essentially arbitrary and largely a matter of convention. Note that (2.1) can be reparameterized as

$$X_t = \mu + \gamma_t P + E_t, \quad t = 1, \ldots, n, \tag{3.2}$$

where $\mu = \xi P$. Although the method of Park, Oh, and others (2002) was shown to be robust to violation of the prior assumption on the correlation structure of source contributions, in this paper, the method is generalized to formally account for correlated source contributions, i.e. assuming oblique factor models as a prior distribution for source contributions:

$$\gamma_t \sim N_q(0, \Omega) \tag{3.3}$$

or

$$A_t \sim N_q(\xi, \Omega)\, I(A_t > 0), \tag{3.4}$$

where $\Omega$ is a general covariance matrix. (Note that (2.1) and (2.2) can be reparameterized in terms of centered source contributions $\gamma_t$ without loss of generality.
In particular, our key parameter $\beta$ is not affected by whether or not the source contributions are centered.) Originally, we explored both priors (3.3) and (3.4), but it turned out that, for the purpose of computing marginal likelihoods, reparameterization of (2.1) and (2.2) using centered source contributions $\gamma_t$ is more convenient because we can more effectively cope with the issue of scale invariance of factors by a constant multiplication.

When $y_t$ is a continuous health outcome variable (e.g. peak ST-segment elevation in Nikolov and others, 2007) or the daily mortality (or morbidity) count with a large enough mean, $y_t$ may be assumed to follow a normal distribution and the link function $g$ in (2.2) becomes the identity function. Using the centered source contribution and the identity link function for the health model, the model for source-specific health effects can be written as follows:

Receptor model:
$$X_t = \mu + \gamma_t P + E_t, \quad t = 1, \ldots, n, \tag{3.5}$$

Health model:
$$y_t = \alpha + \gamma_t \beta + Z_t \eta + \varepsilon_t = \alpha + \sum_{k=1}^{q} \beta_k \gamma_{tk} + \sum_{i=1}^{I} \eta_i Z_{ti} + \varepsilon_t, \tag{3.6}$$

where $\mu = \xi P$ and $\alpha = \lambda + \xi \beta$. We make the following assumptions on the errors $E_t$ and $\varepsilon_t$:

$$E_t \sim N_J(0, \Sigma), \quad \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_J^2), \quad \varepsilon_t \sim N(0, \sigma_y^2).$$

To complete a Bayesian model specification, the prior distributions for the unknown parameters, $\Gamma = \{\gamma_t, t = 1, \ldots, n\}$, $\Omega$, $P$, $\Sigma$, $\mu$, $\alpha$, $\beta$, $\eta$, and $\sigma_y^2$, are required. We assume independence between $\Gamma = \{\gamma_t, t = 1, \ldots, n\}$, given the hyperparameter $\Omega$, and the parameters $P$, $\Sigma$, $\Omega$, $\alpha$, $\mu$, $\beta$, $\eta$, and $\sigma_y^2$ as follows:

$$p(\Gamma, P, \Sigma, \Omega, \mu, \alpha, \beta, \eta, \sigma_y^2) = p(\Gamma \mid \Omega)\, p(P)\, p(\Sigma)\, p(\Omega)\, p(\mu)\, p(\alpha)\, p(\beta)\, p(\eta)\, p(\sigma_y^2).$$

As noted earlier, the prior distribution for the centered source contribution $\{\gamma_t\}$ is assumed to be $\gamma_t \sim N_q(0, \Omega)$. For a prior distribution for $P$, we assume a point mass at zero for the $q(q - 1)$ elements of $P$ preselected for identifiability conditions.
For the free elements of $P$, we use the truncated normal distribution,

$$\mathrm{vec}P^{+} \sim N_{Jq - q(q-1)}(c_0, C_0)\, I(\mathrm{vec}P^{+} \geq 0),$$

where $\mathrm{vec}P^{+}$ denotes the $Jq - q(q - 1)$-dimensional vector of free elements of $P$ stacked columnwise, to incorporate the non-negativity constraints (which are critical for the estimated source profiles to make physical sense) while facilitating computation. Other choices for prior distributions for $P$ that have been used in the literature are discussed in Appendix A1 of supplementary material available at Biostatistics online. For the prior distributions of $\Sigma$, $\alpha$, $\mu$, $\Omega$, $\beta$, $\eta$, and $\sigma_y^2$, we assume

$$\sigma_j^{-2} \sim \mathrm{Gamma}(a_0, b_{0j}),\ j = 1, \ldots, J, \quad \mu \sim N_J(m_0, M_0), \quad \alpha \sim N(\alpha_0, U_0), \quad \beta \sim N_q(\beta_0, B_0),$$
$$\eta \sim N_I(\eta_0, \Psi_0), \quad \sigma_y^{-2} \sim \mathrm{Gamma}(a_0^y, b_0^y), \quad \text{and} \quad \Omega \sim IW(R_0, r_0).$$

Because of the complexity of the joint posterior distribution, MCMC methods are employed for estimation of parameters. In the MCMC sampling algorithm employed here, one sweep consists of nine updating procedures: (i) updating $\Gamma$, (ii) updating $P$, (iii) updating $\Sigma$, (iv) updating $\Omega$, (v) updating $\mu$, (vi) updating $\alpha$, (vii) updating $\beta$, (viii) updating $\eta$, and (ix) updating $\sigma_y^2$. The full conditional distributions as well as the joint posterior distribution are given in Appendix A1 of supplementary material available at Biostatistics online.

As discussed in Section 2, each combination of $q$ and identifiability conditions (here, position of prespecified zeros) leads to a different model. Assume that there are $G$ candidate models associated with different $q$ and/or identifiability conditions, $M_1, \ldots, M_G$. Typical Bayesian model comparison is based on posterior model probabilities

$$p(M_g \mid X, y) = \frac{p(M_g)\, l(X, y \mid M_g)}{\sum_{k=1}^{G} \left[ p(M_k)\, l(X, y \mid M_k) \right]}, \tag{3.7}$$

where $p(M_g)$ is the prior model probability and $l(X, y \mid M_g)$ is the marginal likelihood for model $M_g$.
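Since marginal likelihoods are typically available only on the log scale and can differ by many orders of magnitude, evaluating (3.7) directly may underflow. The sketch below shows one numerically stable way to compute posterior model probabilities from log marginal likelihoods; the function name `posterior_model_probs` and the example values are hypothetical and are not part of the paper.

```python
import numpy as np

def posterior_model_probs(log_marginal_liks, prior_probs=None):
    """Evaluate (3.7) from log marginal likelihoods.

    If prior_probs is None, equal prior model probabilities are used.
    """
    log_ml = np.asarray(log_marginal_liks, dtype=float)
    if prior_probs is None:
        prior_probs = np.full(log_ml.shape, 1.0 / log_ml.size)
    # log of p(M_k) * l(X, y | M_k) for each candidate model.
    log_w = log_ml + np.log(np.asarray(prior_probs, dtype=float))
    # Subtract the maximum before exponentiating to avoid underflow.
    log_w -= log_w.max()
    w = np.exp(log_w)
    return w / w.sum()

# Hypothetical log marginal likelihoods for three candidate models.
probs = posterior_model_probs([-1000.0, -1010.0, -1003.0])
```

With these illustrative inputs the first model receives most of the posterior mass, because a difference of a few units on the log scale already corresponds to a large ratio of marginal likelihoods.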
Note that under the indifference prior model probabilities, the posterior model probability is proportional to the marginal likelihood and (3.7) becomes

$$p(M_g \mid X, y) = \frac{l(X, y \mid M_g)}{\sum_{k=1}^{G} l(X, y \mid M_k)}. \tag{3.8}$$

Thus, we only need to calculate the marginal likelihood of each model for model comparison. Note that $l(X, y \mid M_g)$ can be estimated by

$$\hat{l}(X, y \mid M_g) = \frac{l(X, y \mid \theta_g^c, M_g)\, p(\theta_g^c \mid M_g)}{\hat{\pi}(\theta_g^c \mid X, y, M_g)},$$

where $l(X, y \mid \theta_g, M_g)$ is the likelihood of $\theta_g$ under model $M_g$, $p(\theta_g \mid M_g)$ is the prior of $\theta_g$ under model $M_g$, $\theta_g^c$ is a single point of $\theta_g = (\Gamma_g, P_g, \Sigma_g, \Omega_g, \mu_g, \alpha_g, \beta_g, \eta_g, \sigma_{y,g}^2)$ under model $M_g$, and $\hat{\pi}(\theta_g^c \mid X, y, M_g)$ is the estimated posterior density function of $\pi(\theta_g^c \mid X, y, M_g)$. For simplicity of notation, we suppress the index $g$ for the rest of the section. Using the same algorithm of Oh (1999), we have

$$
\begin{aligned}
\pi(\theta^c \mid X, y, M) = E\big[\, & \pi(\Gamma^c \mid P^c, \Sigma^c, \Omega^c, \mu^c, \alpha^c, \beta^c, \eta^c, \sigma_y^{2c}) \\
& \times \pi(P^c \mid \Gamma, \Sigma^c, \Omega^c, \mu^c, \alpha^c, \beta^c, \eta^c, \sigma_y^{2c}) \\
& \times \pi(\Sigma^c \mid \Gamma, P, \Omega^c, \mu^c, \alpha^c, \beta^c, \eta^c, \sigma_y^{2c}) \\
& \times \pi(\Omega^c \mid \Gamma, P, \Sigma, \mu^c, \alpha^c, \beta^c, \eta^c, \sigma_y^{2c}) \\
& \times \pi(\mu^c \mid \Gamma, P, \Sigma, \Omega, \alpha^c, \beta^c, \eta^c, \sigma_y^{2c}) \\
& \times \pi(\alpha^c \mid \Gamma, P, \Sigma, \Omega, \mu, \beta^c, \eta^c, \sigma_y^{2c}) \\
& \times \pi(\beta^c \mid \Gamma, P, \Sigma, \Omega, \mu, \alpha, \eta^c, \sigma_y^{2c}) \\
& \times \pi(\eta^c \mid \Gamma, P, \Sigma, \Omega, \mu, \alpha, \beta, \sigma_y^{2c}) \\
& \times \pi(\sigma_y^{2c} \mid \Gamma, P, \Sigma, \Omega, \mu, \alpha, \beta, \eta) \,\big].
\end{aligned}
\tag{3.9}
$$

Appendix A2 of supplementary material available at Biostatistics online contains a description of the algorithm of Oh (1999) and additional explanation of (3.9). Because the full conditional posterior density functions are known, $\pi(\theta^c \mid X, y, M)$ can be estimated as the sample average of the product of the full conditional posterior density functions using the posterior sample of $\theta$ under model $M$. Although in theory $\theta^c$ can be an arbitrary point in the parameter space, for efficiency it needs to be chosen from the region with high posterior density. An approximate posterior mode of $\theta$, based on a preliminary MCMC run, would be a reasonable choice for $\theta^c$.

4. SIMULATION

We conducted a simulation study to assess the performance of the new method that incorporates parameter uncertainty in source contributions into estimation of source-specific health effects while coping with uncertainty in both the number of sources and identifiability conditions in multivariate receptor models. The elemental concentration data were generated from (3.5) with $\mu = (7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7)$, and using the same parameter values for $P$, $\Sigma$, and $\Omega$ used in the simulation study of Nikolov and others (2006). The health outcome data were generated from (3.6), incorporating the weather variables, with the following parameter values: $\beta = (0.5, 0, 1, 0.5)^{\top}$, $\eta = (1, 0.5)^{\top}$, $\alpha = 3$, $\sigma_y^2 = 1$. The weather data were generated from the lognormal distribution, $\log(Z_{ti}) \sim N(0, 0.5)$, $t = 1, \ldots, n$, $i = 1, 2$. The sample size $n$ was taken to be 100. Appendix B of supplementary material available at Biostatistics online provides further details on simulation.

As opposed to assuming the known number of sources ($q_0 = 4$) and identifiability conditions, we defined the candidate models by varying the number of sources $q$ from 1 to 5 with pre-specified $q \times (q - 1)$ zeros for each $q$, and estimated the parameters under each $q$-source model as well as computed marginal likelihoods. Although the diagonal covariance matrix was used for $\Omega$ to generate the simulated data (as in Nikolov and others, 2006), we treated $\Omega$ as an unknown general covariance matrix in estimation, allowing estimation of correlations among the source contributions. The simulation was repeated 200 times.

The estimated marginal likelihoods for each $q$-source model are reported (on a log scale as logMD) in Table B1 of Appendix B of supplementary material available at Biostatistics online. The selected model is the one having the maximum logMD. The true model (with $q = 4$) was selected for 199 out of 200 simulations, i.e. 99.5% of the time. (For only one simulation out of 200, a model with $q = 5$ was selected.)
We also monitored the $R^2$ values between the true source composition profiles and the estimated source composition profiles as well as the true source contributions and the estimated source contributions for $q = 4$. Throughout the simulation, $R^2$ values were all greater than 0.94, which indicates that the estimated source profiles and contributions agree well with the true source profiles and contributions. Figure B1 in Appendix B of supplementary material available at Biostatistics online presents the time series plots of the true centered source contributions and estimated centered source contributions (based on one of the simulated datasets), which again shows that the estimated source contributions are very close to the true source contributions. The estimated source-specific health effect parameter $\beta$ was very close to the true value throughout the simulation. The 95% posterior intervals were computed in each simulation. Overall, the posterior interval for each element of $\beta$ contains the true value approximately 96% of the time (there were 32 instances out of 800 in which the posterior interval for an element of $\beta$ did not contain the true value). The average widths of posterior intervals for the $\beta$ parameters for each of the four sources over 200 simulations were 0.31, 0.38, 0.42, and 0.39, respectively.

5. APPLICATION TO PHOENIX DATA

The proposed method was applied to the daily 24-h PM2.5 speciation data along with temperature and relative humidity data and daily cardiovascular mortality count data for Phoenix (Hopke and others, 2006; Mar and others, 2006). The original PM2.5 speciation data consist of 981 samples with measured concentrations for 46 chemical elements collected from March 11, 1995 through June 30, 1998.
The Phoenix mortality data consist of daily numbers of deaths due to cardiovascular causes for February 9, 1995 to December 31, 1997 for residents ≥65 years of age at the time of death. More details on the data are provided in Appendix C of supplementary material available at Biostatistics online.

The first important step in multivariate receptor modeling is to select an appropriate subset of species for the analysis. The rationale for selection of species is also given in Appendix C of supplementary material available at Biostatistics online. The final selection of species used in the model fitting included the following 15 species: Na, Al, Si, S, Cl, K, Ca, Mn, Fe, Cu, Zn, Br, Pb, OC, and EC.

There are actually 1027 days between March 11, 1995 and December 31, 1997. While there were no missing values for the mortality data for that time period, the PM2.5 speciation data were available only for 868 days at most. We imputed missing values in the data to enable exploration of a different lag structure of the source-specific effects on mortality other than a lag 0 effect. Further details on imputation are available in Appendix C of supplementary material available at Biostatistics online.

To provide a consistent basis for comparison with other results from Mar and others (2006), we used the same mortality model as that used in Mar and others (2006), controlling for confounding by including an indicator variable for extreme temperatures, a day of week variable, and smoothing terms for time trends, temperature, and relative humidity in our analysis (namely, natural spline smoothers with 12 degrees of freedom (df) for the smoothing of time trend, 5 df for the smoothing of temperature with 2 days lag, and 2 df for the smoothing of relative humidity with 0 days lag).

We constructed a range of different receptor models (resulting from each combination of different number of sources and identifiability conditions) to be compared for the Phoenix data.
Based on several previous studies on the Phoenix PM2.5 data (Ramadan and others, 2003; Lewis and others, 2003; Hopke and others, 2006) and the NUMFACT procedure (Henry and others, 1999), we presumed that the number of major sources is between 3 and 8. For candidate positions of zeros in $P$ under each $q$-source model, we also use the information on the major sources from previous studies. For example, we use the information that the element Al is not a major constituent of Motor Vehicles and pre-assign a zero to it. Note that we use this type of information from previous studies only to find out the plausible sets of identifiability conditions (positions of zeros) under each $q$-source model. Other than that, the candidate models do not depend on the results from those previous studies. Ten candidate models with different number of sources ($q$ = 3, 4, 5, 6, 7, 8) and different pre-specification of identifiability conditions (zeros in $P$) that we compare are given in Table 1.

Table 1. Marginal likelihoods for candidate models for Phoenix PM2.5 speciation data and cardiovascular mortality at lag 0 days

Model   q   Pre-specified position of zeros in P          LogMD               PostP
1       3   Source 1: Al, S                               −1.5761 × 10^−4     0.0000
            Source 2: Cl, Fe
            Source 3: OC, EC
2       4   Source 1: Al, Si, K                           −1.5580 × 10^−4     0.0000
            Source 2: Al, Cl, Fe
            Source 3: Cl, OC, EC
            Source 4: Al, Si, Ca
3       5   Source 1: Al, Si, S, K                        −1.5598 × 10^−4     0.0000
            Source 2: Al, Si, Cl, Fe
            Source 3: Cl, Cu, OC, EC
            Source 4: K, Ca, Br, EC
            Source 5: Al, Si, OC, EC
4       5   Source 1: Al, Si, S, K                        −1.5219 × 10^−4     0.0000
            Source 2: Al, Si, Cl, Fe
            Source 3: Cl, Cu, OC, EC
            Source 4: Al, Si, Ca, Fe
            Source 5: K, Ca, Br, EC
5       5   Source 1: Al, S, Cl, Fe                       −1.5549 × 10^−4     0.0000
            Source 2: Cl, Fe, Cu, Zn
            Source 3: Cl, Cu, OC, EC
            Source 4: Al, Si, Ca, Cu
            Source 5: Al, Si, K, Ca
6       6   Source 1: Al, Si, S, Cl, Ca                   −1.5392 × 10^−4     0.0000
            Source 2: Al, Si, Cl, K, Fe
            Source 3: Cl, Cu, Pb, OC, EC
            Source 4: Al, Si, Cl, Ca, Fe
            Source 5: Al, Si, Cl, K, Ca
            Source 6: Al, Cl, K, Ca, EC
7       6   Source 1: Al, Si, S, Cl, K                    −1.5153 × 10^−4     1.0000
            Source 2: Cl, Ca, Mn, Br, EC
            Source 3: Cl, Cu, Pb, OC, EC
            Source 4: Al, Si, Cl, Ca, Fe
            Source 5: Cl, Fe, Cu, Zn, Pb
            Source 6: Al, K, Pb, OC, EC
8       6   Source 1: Al, Si, S, Cl, K                    −1.5316 × 10^−4     0.0000
            Source 2: Al, Si, Cl, K, Fe
            Source 3: Cl, Cu, Pb, OC, EC
            Source 4: Al, Si, Cl, Ca, Fe
            Source 5: Al, Cl, Mn, Br, EC
            Source 6: Al, K, Pb, OC, EC
9       7   Source 1: Al, Si, S, Cl, K, Fe                −1.5440 × 10^−4     0.0000
            Source 2: Al, Cl, Ca, Mn, Br, EC
            Source 3: Cl, Cu, Br, Pb, OC, EC
            Source 4: Al, Si, Cl, Ca, Fe, Zn
            Source 5: Na, Al, Si, Cl, K, Ca
            Source 6: Na, Cl, Fe, Cu, Zn, Pb
            Source 7: Al, K, Cu, Pb, OC, EC
10      8   Source 1: Al, S, Ca, Mn, Zn, Br, Pb           −1.5849 × 10^−4     0.0000
            Source 2: Al, Cl, Mn, Fe, Cu, Zn, Pb
            Source 3: Cl, Mn, Cu, Br, Pb, OC, EC
            Source 4: Al, Si, Cl, Ca, Mn, Fe, Zn
            Source 5: Al, Si, Cl, K, Ca, Mn, Br
            Source 6: Al, Cl, Mn, Zn, Br, Pb, EC
            Source 7: Al, Cu, Zn, Br, Pb, OC, EC
            Source 8: S, K, Fe, Cu, Pb, OC, EC

LogMD and PostP denote the log of marginal likelihood and posterior model probability, respectively.

The PM2.5 data and cardiovascular mortality data were simultaneously fitted to estimate source composition profiles, contributions, health effect parameters as well as marginal likelihood under each model in Table 1 at lag 0–5 days. Because concentrations of PM2.5 species differed by two or three orders of magnitude, each element was scaled by its sample standard deviation before running MCMC. It is known that convergence problems are common when elemental concentrations are on widely different scales (Nikolov and others, 2007). After the run, however, the individual elements of the estimated source profiles were multiplied by the corresponding sample standard deviations to bring them back to the original scale so that the relative amounts of species in each profile are physically interpretable. Then, the source composition profiles in the original scale were normalized so that the sum of elements of each source composition profile is 1. The estimated source contributions were also normalized by multiplying the corresponding normalizing constant for each source (i.e. the sum of elements of the corresponding source composition profile in the original scale). It needs to be noted that although the PM data were scaled by the standard deviations at the beginning, this does not actually affect the estimation of $P$ or $A$. It only changes the scales in the source composition matrix during the MCMC implementation. By rescaling back to the original scale at the end, however, the relative amounts of species in each source profile are preserved.

The following hyperparameter values were used for generating MCMC samples: $a_0 = 0.01$, $b_{0j} = 0.01$ ($j = 1, \ldots, 15$), $c_0 = 0.5$, $C_0 = 100$, $m_0 = \bar{X}$, $M_0 = 100$, $\alpha_0 = 0$, $U_0 = 100$, $\beta_0 = 0_q$, $B_0 = 100 \times I_q$, $\eta_0 = 0_I$, $\Psi_0 = 100 \times I_I$, $a_0^y = 0.01$, and $b_0^y = 0.01$. Also, an orthogonal factor model assuming $\Omega = I_q$ a priori was employed for $\gamma_t$.
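The rescaling and normalization steps described in this section can be sketched as follows. This is a minimal illustration with hypothetical inputs (the function name `rescale_and_normalize` and all numbers are assumptions); it shows that moving the row sums of the profiles into the contributions leaves the fitted concentrations unchanged.

```python
import numpy as np

def rescale_and_normalize(P_scaled, A, sd):
    """Bring profiles back to the original scale and normalize each to sum to 1.

    P_scaled: q x J profiles estimated on standardized data,
    A: n x q source contributions, sd: length-J sample standard deviations.
    Illustrative helper only.
    """
    # Multiply each species (column) by its sample standard deviation.
    P_orig = P_scaled * sd
    # Normalizing constant for each source: sum of its profile elements.
    row_sums = P_orig.sum(axis=1)
    # Profiles now hold fractional amounts that sum to 1 for each source.
    P_norm = P_orig / row_sums[:, None]
    # Contributions are multiplied by the same constants.
    A_norm = A * row_sums
    return P_norm, A_norm

# Hypothetical example with q = 2 sources, J = 3 species, n = 4 days.
P_scaled = np.array([[0.2, 0.0, 1.0],
                     [0.5, 0.8, 0.0]])
A = np.array([[1.0, 2.0],
              [0.5, 1.5],
              [2.0, 0.1],
              [1.2, 0.7]])
sd = np.array([10.0, 0.5, 2.0])

P_norm, A_norm = rescale_and_normalize(P_scaled, A, sd)
```

Because each source's profile is divided by the same constant that multiplies its contributions, the product of contributions and profiles is the same before and after normalization.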
Note that, from a Bayesian standpoint, the matrix set to $I_q$ a priori can be viewed as a hyperparameter of the prior distribution for $\gamma$, and, as shown in Park, Oh, and others (2002), the correlation structure in $\gamma$ can still be uncovered by the sample correlations of the estimated $\gamma$'s even in the case where the identity assumption is misspecified a priori.

For each model, an approximate posterior mode is obtained from a preliminary MCMC run, and this is used for $\theta_g^c$, the point (comprising $P_g^c$, $\mu_g^c$, $\alpha_g^c$, $\beta_g^c$, $\eta_g^c$, $\sigma_{y,g}^{2c}$, and the remaining model parameters) at which the marginal likelihood is calculated. An approximate posterior mode is obtained by evaluating the joint posterior density for 100 000 iterations after the first 100 000 draws are discarded. A main MCMC run is then started from $\theta_g^c$, and the samples are collected for 200 000 iterations, subsampling every 10th value (resulting in 20 000 samples), without additional burn-in. The marginal likelihood for each model can be computed during sample generation without storing the samples.

Table 1 also gives the estimated marginal likelihood (in log) for each model, jointly modeling the PM2.5 and cardiovascular mortality data at lag 0 days. The posterior probability of each model under the indifference prior is also provided in Table 1. Model 7 with six sources is selected as the best model because the posterior probability for Model 7 is almost 1. For the other lags (1–5 days), Model 7 also led to the highest posterior model probability, again very close to 1. This is consistent with the observation from Mar and others (2006), who noted that six source components are most consistently reported for the Phoenix data by the various investigators/methods. The estimated source profiles under Model 7 are given in Table C1 of Appendix C of supplementary material available at Biostatistics online.
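To make the step from log marginal likelihoods to posterior model probabilities concrete, the sketch below normalizes them in a numerically stable way. It assumes that the indifference prior means equal prior probability for every candidate model. The function name and the demonstration values are hypothetical and are not the Table 1 entries.

```python
import numpy as np

def posterior_model_probs(log_md, log_prior=None):
    """Posterior model probabilities from log marginal likelihoods.

    With log_prior=None every model gets the same prior weight, so the
    prior cancels in the normalization.
    """
    log_md = np.asarray(log_md, dtype=float)
    if log_prior is None:
        log_prior = np.zeros_like(log_md)
    w = log_md + log_prior
    # Subtract the maximum before exponentiating to avoid underflow.
    w -= w.max()
    p = np.exp(w)
    return p / p.sum()

# Hypothetical log marginal likelihoods for three candidate models.
log_md_demo = [-1000.0, -990.0, -1005.0]
probs = posterior_model_probs(log_md_demo)
best_model = int(np.argmax(probs))
```

Because the probabilities depend on differences in log marginal likelihood, a gap of even a few tens of log units is enough for one model to take essentially all of the posterior mass, as happens for Model 7 in Table 1.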
The estimated source profiles and contributions based on the PM2.5 and cardiovascular mortality data with other lags are materially the same as those in Table C1 of supplementary material available at Biostatistics online. The estimated source profiles in Table C1 of supplementary material available at Biostatistics online were labeled as Traffic, Smelter, Soil/Crustal, Biomass/Wood combustion, Secondary sulfate, and Sea salt, respectively. The reasoning for such labeling is also provided in Appendix C of supplementary material available at Biostatistics online. To obtain the corresponding source contributions that are scaled appropriately by the normalizing constants of the source profiles, S and OC in Table C1 of supplementary material available at Biostatistics online needed to be renormalized because all of the S that will be present will be present as sulfate which has three times the mass of S, and OC only includes the carbon in organic compounds and does not include the H, O, and N that will also be in the organic species but are not measured. Since ammonium is not included in the profile, S was multiplied by 4.125 to be converted to (NH4)2SO4. Also, OC was multiplied by 1.5 to be converted to OM that includes H, O, and N. The renormalized source composition profiles along with the estimates of the mean, standard deviations, and the 5th-to-95th percentiles of source contributions are presented in Table 2.

Table 2. Rescaled source composition profiles along with contributions under Model 7 and source-specific health effects on cardiovascular mortality

Source 1 = Traffic; Source 2 = Smelter; Source 3 = Soil/Crustal; Source 4 = Biomass/Wood combustion; Source 5 = Secondary sulfate; Source 6 = Sea salt.

Source compositions
Species (#)               Source 1  Source 2  Source 3  Source 4  Source 5  Source 6
1  Na                         0.00      0.01      1.05      0.02      3.56     33.48
2  Al                            0      0.07     16.11         0      0.18         0
3  Si                            0      0.82     42.17         0      0.20      8.24
4  (NH4)2SO4                     0     84.49      7.23      0.60     54.22      5.37
5  Cl                            0         0         0         0         0     39.68
6  K                             0      0.50      6.33      2.27      1.07         0
7  Ca                         0.40         0     14.42         0      0.06      6.07
8  Mn                         0.08         0      0.32      0.00      0.00      0.31
9  Fe                         1.95      0.28     12.03         0         0      5.39
10 Cu                         0.08      0.36         0      0.03         0      0.06
11 Zn                         0.28      0.46      0.27      0.04         0      1.21
12 Br                         0.03         0      0.06      0.03      0.07      0.19
13 Pb                         0.12      0.55         0      0.09         0         0
14 OM                        75.71     12.44         0     82.65     39.93         0
15 EC                        21.34         0         0     14.27      0.71         0

Source contributions (µg/m3)
                          Source 1  Source 2  Source 3  Source 4  Source 5  Source 6
Mean                          4.37      0.31      0.83      1.92      2.34      0.03
Standard deviation            3.62      0.65      0.64      1.85      0.97      0.10
Fifth-to-95th increment      11.81      1.92      1.91      4.97      2.97      0.14

Source-specific health effects
                          Source 1  Source 2  Source 3  Source 4  Source 5  Source 6
β (lag 0)                    −0.37     0.46*      0.08      0.13      0.28     −0.12
β (lag 1)                     0.29     −0.02     −0.17      0.24      0.07     −0.14
β (lag 2)                     0.00      0.08     −0.22      0.19      0.08     −0.03
β (lag 3)                     0.08     −0.03     −0.34      0.26      0.16      0.18
β (lag 4)                     0.30      0.15     −0.45     −0.21      0.45      0.12
β (lag 5)                    −0.25      0.01     −0.03      0.10     −0.27     0.39*

(i) Source profiles are normalized to sum to 100%. (ii) Entries shown as 0 (without decimals) give the position of pre-assigned zeros. (iii) The β coefficient of PM2.5 contributions from each source type represents the estimated average increase in daily mortality counts per 5th-to-95th percentile increment of estimated PM2.5 source contribution (in µg/m3) while controlling for other variables in the model. (iv) Significant health effects are marked with an asterisk.

Figure 1 contains the time series plots of the estimated source contributions (in µg/m3) for 1027 days (March 11, 1995 to December 31, 1997). In general, the daily patterns of estimated source contributions of Figure 1 are similar to those of Figure 1 in Mar and others (2006) and those of Ramadan and others (2003, Figure 2).
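The conversion of S to (NH4)2SO4 and of OC to OM, followed by renormalization, can be illustrated with a short sketch. The factor 4.125 matches the ratio of the mass of (NH4)2SO4 to that of S when integer atomic masses are used, and the factor of three for sulfate follows in the same way. The demonstration profile below is hypothetical (Table C1 is not reproduced here), and rescaling the contribution by the new normalizing constant is an assumption about how the step was carried out.

```python
import numpy as np

SPECIES = ["Na", "Al", "Si", "S", "Cl", "K", "Ca", "Mn",
           "Fe", "Cu", "Zn", "Br", "Pb", "OC", "EC"]

# Integer atomic masses reproduce the factors quoted in the text.
MASS = {"N": 14, "H": 1, "S": 32, "O": 16}
ammonium_sulfate = 2 * (MASS["N"] + 4 * MASS["H"]) + MASS["S"] + 4 * MASS["O"]
s_to_ammonium_sulfate = ammonium_sulfate / MASS["S"]        # 132 / 32
sulfate_to_s = (MASS["S"] + 4 * MASS["O"]) / MASS["S"]      # 96 / 32

def renormalize_profile(profile, s_factor=4.125, oc_factor=1.5):
    """Convert S and OC, then renormalize the profile to sum to 100%.

    Returns the profile in percent and the new normalizing constant.
    """
    p = np.asarray(profile, dtype=float).copy()
    p[SPECIES.index("S")] *= s_factor     # S -> (NH4)2SO4
    p[SPECIES.index("OC")] *= oc_factor   # OC -> OM
    const = p.sum()
    return 100.0 * p / const, const

# Hypothetical profile that sums to 1 before conversion.
profile_demo = [0.02, 0.01, 0.03, 0.30, 0.0, 0.02, 0.01, 0.0,
                0.01, 0.0, 0.0, 0.0, 0.0, 0.55, 0.05]
profile_pct, const = renormalize_profile(profile_demo)

# Assumed step: the contribution is multiplied by the same constant.
contribution_demo = 2.0                  # hypothetical value in µg/m3
contribution_rescaled = contribution_demo * const
```

Under these assumptions, a source rich in S or OC receives a larger rescaled contribution, because the converted species carry mass that the measured elements alone do not.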
The plots of predicted versus measured concentrations for species used in model fitting as well as the plot of the sum of estimated source contributions versus measured total PM2.5 mass concentration (which was not used in model fitting) are also provided in Appendix D of supplementary material available at Biostatistics online for this paper. The R² values between the measured and predicted values were greater than 0.7 for all but two minor species, Zn and Br. The R² value between the sum of the estimated source contributions and measured total PM2.5 mass concentration was 0.93.

Fig. 1. Time series plots of the estimated source contributions (in µg/m3) for 1027 days (March 11, 1995 to December 31, 1997). Panels: Traffic, Smelter, Soil/Crustal, Biomass/Wood combustion, Secondary sulfate, and Sea salt. Horizontal axis: Day; vertical axis: Mass Contribution.

Table 2 also presents source-specific health effects on cardiovascular mortality at lags 0–5 days. Only the health effects due to Source 2 (which appears to be Smelter) at lag 0 days and Source 6 (which appears to be Sea salt) at lag 5 days were statistically significant (i.e. the 95% credible interval does not contain 0). In Mar and others (2006), the effects of Secondary sulfate (lag 0), Traffic (lag 1), Smelter (lag 0), and Sea salt (lag 5) on cardiovascular mortality were found to be statistically significant. The effects of the fine particle soil and biomass burning factors were not significant at any lag in Mar and others (2006) or in our analysis.
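The significance criterion used above (a 95% credible interval that does not contain 0) can be checked directly from posterior draws of a health effect coefficient. The sketch below assumes an equal-tailed interval, which the text does not specify. The function names and the simulated draws are hypothetical stand-ins for MCMC output.

```python
import numpy as np

def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from posterior draws."""
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(draws, [tail, 1.0 - tail])
    return float(lo), float(hi)

def excludes_zero(draws, level=0.95):
    """True when the credible interval lies entirely on one side of 0."""
    lo, hi = credible_interval(draws, level)
    return bool(lo > 0 or hi < 0)

# Simulated draws standing in for posterior samples of two coefficients.
rng = np.random.default_rng(0)
beta_tight = rng.normal(0.5, 0.1, size=20000)   # concentrated away from 0
beta_wide = rng.normal(0.1, 0.2, size=20000)    # wide relative to its mean

tight_is_significant = excludes_zero(beta_tight)
wide_is_significant = excludes_zero(beta_wide)
```

The second set of draws shows how a positive point estimate can fail this criterion once its posterior spread is wide enough, which is the situation the paper describes for Secondary sulfate (lag 0) and Traffic (lag 1).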
Overall, the health effects of Smelter, Sea salt, Soil/Crustal, and Biomass/Wood combustion appeared to be consistent between Mar and others (2006) and our analysis. However, Secondary sulfate at lag 0 days and Traffic at lag 1 day, which were statistically significant in Mar and others (2006), were not statistically significant in our analysis. Recall that the uncertainties in the estimated source contributions were not accounted for in the estimation of the health effects parameters in Mar and others (2006), which may have introduced bias, as noted in Mar and others (2006). Our approach, on the other hand, does account for the uncertainty in the estimated source contributions in estimation of the health effects parameters. The statistically insignificant estimates for Secondary sulfate (lag 0) and Traffic (lag 1) might be a consequence of incorporating the uncertainty that was previously ignored.

6. DISCUSSION

We presented a new statistical approach to the evaluation of source-specific health effects associated with an unknown number of major sources of multiple air pollutants. The proposed method deals with model uncertainty in source apportionment while accounting for parameter uncertainty that has been largely ignored in the previous assessments of source-specific health effects. The new approach was illustrated with PM2.5 speciation data and cardiovascular mortality data from Phoenix. The results from our methods agreed in general with those from the previously conducted workshop/studies on PM source apportionment and health effects for the Phoenix data in terms of the number of major contributing sources as well as estimated source profiles and contributions.

For the health effects of specific sources, there were similarities and dissimilarities. The health effects of Soil/Crustal and Biomass/Wood combustion were statistically insignificant in both Mar and others (2006) and our analysis.
However, while Mar and others (2006) identified adverse health effects for four source types (Sulfate at lag 0, Traffic at lag 1, Smelter at lag 0, and Sea salt at lag 5), our analysis identified only two (Smelter at lag 0 and Sea salt at lag 5) to be statistically significant, which seems to be a natural consequence of incorporating uncertainty in the estimated source contributions into the health effects parameter estimation.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at http://biostatistics.oxfordjournals.org.

ACKNOWLEDGMENTS

The authors thank Dr Therese Mar for providing Phoenix mortality data and two anonymous reviewers for helpful comments. Conflict of Interest: None declared.

FUNDING

Research described in this article was conducted under contract to the Health Effects Institute (HEI), an organization jointly funded by the United States Environmental Protection Agency (EPA) (Assistance Award No. R-82811201) and certain motor vehicle and engine manufacturers. The contents of this article do not necessarily reflect the views of HEI, or its sponsors, nor do they necessarily reflect the views and policies of the EPA or motor vehicle and engine manufacturers.

REFERENCES

BARTHOLOMEW, D. J. AND KNOTT, M. (1999). Latent Variable Models and Factor Analysis, 2nd edition. New York: Oxford University Press Inc.

DOMINICI, F., PENG, R. D., BARR, C. D. AND BELLE, M. L. (2010). Protecting human health from air pollution: shifting from a single-pollutant to a multipollutant approach. Epidemiology 21, 187–194.

HENRY, R. C., PARK, E. S. AND SPIEGELMAN, C. H. (1999). Comparing a new algorithm with the classic methods for estimating the number of factors. Chemometrics and Intelligent Laboratory Systems 48, 91–97.

HOPKE, P. K. (2010). The application of receptor modeling to air quality data. Pollution Atmosphérique, Special Issue, 91–109. http://www.appa.asso.fr/national/Pages/article.php?art=487

HOPKE, P. K., ITO, K., MAR, T., CHRISTENSEN, W. F., EATOUGH, D. J., HENRY, R. C., KIM, E., LADEN, F., LALL, R., LARSON, T. V., AND OTHERS. (2006). PM source apportionment and health effects: 1. Intercomparison of source apportionment results. Journal of Exposure Science and Environmental Epidemiology 16(3), 275–286.

ITO, K., CHRISTENSEN, W. F., EATOUGH, D. J., HENRY, R. C., KIM, E., LADEN, F., LALL, R., LARSON, T. V., NEAS, L., HOPKE, P. K. AND THURSTON, G. D. (2006). PM source apportionment and health effects: 2. An investigation of intermethod variability in associations between source-apportioned fine particle mass and daily mortality in Washington, DC. Journal of Exposure Science and Environmental Epidemiology 16, 300–310.

LADEN, F., NEAS, L. M., DOCKERY, D. W. AND SCHWARTZ, J. (2000). Association of fine particulate matter from different sources with daily mortality in six U.S. cities. Environmental Health Perspectives 108, 941–947.

LALL, R., ITO, K. AND THURSTON, G. D. (2011). Distributed lag analyses of daily hospital admissions and source-apportioned fine particle air pollution. Environmental Health Perspectives 119, 455–460.

LEWIS, C. W., NORRIS, G. A., CONNER, T. L. AND HENRY, R. C. (2003). Source apportionment of Phoenix PM2.5 aerosol with the Unmix receptor model. Journal of the Air and Waste Management Association 53, 325–338.

MAR, T. F., ITO, K., KOENIG, J. Q., LARSON, T. V., EATOUGH, D. J., HENRY, R. C., KIM, E., LADEN, F., LALL, R., NEAS, L., AND OTHERS. (2006). PM source apportionment and health effects. 3. Investigation of inter-method variations in associations between estimated source contributions of PM(2.5) and daily mortality in Phoenix, AZ. Journal of Exposure Science and Environmental Epidemiology 16(4), 311–320.

NIKOLOV, M. C., COULL, B. A., CATALANO, P. J. AND GODLESKI, J. J. (2006). An informative Bayesian structural equation model to assess source-specific health effects of air pollution. Harvard University Biostatistics Working Paper Series, 46.

NIKOLOV, M. C., COULL, B. A., CATALANO, P. J. AND GODLESKI, J. J. (2007). An informative Bayesian structural equation model to assess source specific health effects of air pollution. Biostatistics 8, 609–624.

OH, M. S. (1999). Estimation of posterior density functions from a posterior sample. Computational Statistics and Data Analysis 29, 411–427.

OSTRO, B., TOBIAS, A., QUEROL, X., ALASTUEY, A., AMATO, F., PEY, J., PÉREZ, N. AND SUNYER, J. (2011). The effects of particulate matter sources on daily mortality: a case-crossover study of Barcelona, Spain. Environmental Health Perspectives 119, 1781–1787.

PARK, E. S., GUTTORP, P. AND HENRY, R. C. (2001). Multivariate receptor modeling for temporally correlated data by using MCMC. Journal of the American Statistical Association 96, 1171–1183.

PARK, E. S., SPIEGELMAN, C. H. AND HENRY, R. C. (2002). Bilinear estimation of pollution source profiles and amounts by using multivariate receptor models. Environmetrics 13, 775–798.

PARK, E. S., OH, M. S. AND GUTTORP, P. (2002). Multivariate receptor models and model uncertainty. Chemometrics and Intelligent Laboratory Systems 60, 49–67.

RAMADAN, Z., EICKHOUT, B., SONG, X. H., BUYDENS, L. M. C. AND HOPKE, P. K. (2003). Comparison of positive matrix factorization and multilinear engine for the source apportionment of particulate pollutants. Chemometrics and Intelligent Laboratory Systems 66, 15–28.

[Received January 24, 2013; revised January 19, 2014; accepted for publication January 20, 2014]