Nested Dirichlet Process Models for Household Data Synthesis
Monika Jingchen Hu & Jerry Reiter
Department of Statistical Science, Duke University
Work supported by NSF
October 28, 2013
Outline
Challenges in household dataset synthesis
Review of existing approaches for categorical data imputation
and synthesis
The nested Dirichlet process model
Simulation results on a CPS dataset
Summary and future work
Features of a typical dataset on households
A subset of Current Population Survey (CPS) March 2011 dataset,
available on IPUMS-USA.
Challenges:
1. Multivariate nominal/unordered variables
2. Dependencies among different family members in the same
household
3. Both household level variables and individual level variables
Existing approaches for categorical data imputation and
synthesis
Typical models for imputation and synthesis:
Sequential regression modeling (MICE)
Log-linear models
However, such models face difficulties with model selection and
estimation, especially in high dimensions.
Nonparametric Bayes Modeling of Multivariate Categorical
Data (Dunson and Xing 2009)
The prior is a Dirichlet process mixture of multinomial distributions
The prior has full support on the space of distributions for multiple
unordered categorical variables
A multiple imputation application outperforms MICE (Si and
Reiter 2013)
Intuition and literature of nested modeling
Try to put households with similar features together
(household types)
Cluster households first; cluster household members next.
Literature on the nested Dirichlet process (Rodriguez, Dunson
and Gelfand 2008): DP(βDP(αH))
1. Replace the random atoms with random probability measures
drawn from a DP.
2. Households drawn from the same DP component (DP(αH))
are automatically clustered together.
nDP model on household dataset
The model can be expressed in the following hierarchical form:
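\begin{align*}
X_{ijk} \mid \eta_i, \psi_{ij}, \phi &\sim \text{Multinomial}\big(\{1,\dots,d_k\},\ \phi^{(k)}_{\eta_i \psi_{ij} 1},\dots,\phi^{(k)}_{\eta_i \psi_{ij} d_k}\big) && \text{for all } i, j, k \\
X_{ik'} \mid \eta_i, \lambda &\sim \text{Multinomial}\big(\{1,\dots,d_{k'}\},\ \lambda^{(k')}_{\eta_i 1},\dots,\lambda^{(k')}_{\eta_i d_{k'}}\big) && \text{for all } i, k' \\
\eta_i \mid \pi &\sim \text{Multinomial}(\pi_1,\dots,\pi_\infty) \\
\psi_{ij} \mid \eta_i, w &\sim \text{Multinomial}(w_{\eta_i 1},\dots,w_{\eta_i \infty}) && \text{for all } i, j \\
\pi_f &= u_f \prod_{l<f} (1 - u_l) && \text{for } f = 1,\dots,\infty \\
w_{fs} &= v_{fs} \prod_{l<s} (1 - v_{fl}) && \text{for } s = 1,\dots,\infty \\
u_f &\sim \text{Beta}(1, \alpha), \quad v_{fs} \sim \text{Beta}(1, \beta) \\
\alpha &\sim \text{Gamma}(a_\alpha, b_\alpha), \quad \beta \sim \text{Gamma}(a_\beta, b_\beta) \\
\phi^{(k)}_{fs} = (\phi^{(k)}_{fs1},\dots,\phi^{(k)}_{fsd_k}) &\sim \text{Dirichlet}(g_{k1},\dots,g_{kd_k}) \\
\lambda^{(k')}_{f} = (\lambda^{(k')}_{f1},\dots,\lambda^{(k')}_{fd_{k'}}) &\sim \text{Dirichlet}(h_{k'1},\dots,h_{k'd_{k'}})
\end{align*}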
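To make the hierarchy concrete, the following is a minimal Python sketch that simulates households forward from a truncated version of the model. It is an illustration only, not the authors' code: the function names, the truncation defaults, the flat Dirichlet(1, ..., 1) hyperparameters, and the choice to fix α and β (rather than draw them from their Gamma priors) are all assumptions made for brevity.

```python
import random

def stick_breaking(fractions):
    """Map fractions v_1..v_T to weights w_s = v_s * prod_{l<s} (1 - v_l).
    The last fraction is set to 1 so the truncated weights sum to one."""
    weights, remaining = [], 1.0
    for s, v in enumerate(fractions):
        if s == len(fractions) - 1:
            v = 1.0
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

def dirichlet(rng, concentration):
    """Draw from a Dirichlet by normalising independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [g / total for g in draws]

def simulate_households(sizes, d_ind, d_hh, F=25, S=15, alpha=1.0, beta=1.0, seed=0):
    """Forward-simulate households; alpha and beta are held fixed (assumption)."""
    rng = random.Random(seed)
    # Household-level weights pi_f and member-level weights w_fs.
    pi = stick_breaking([rng.betavariate(1.0, alpha) for _ in range(F)])
    w = [stick_breaking([rng.betavariate(1.0, beta) for _ in range(S)])
         for _ in range(F)]
    # phi[k][f][s] and lam[k][f] are category probabilities (flat priors assumed).
    phi = [[[dirichlet(rng, [1.0] * d) for _ in range(S)] for _ in range(F)]
           for d in d_ind]
    lam = [[dirichlet(rng, [1.0] * d) for _ in range(F)] for d in d_hh]
    data = []
    for n_members in sizes:
        eta = rng.choices(range(F), weights=pi)[0]           # household cluster
        hh = [rng.choices(range(1, len(l[eta]) + 1), weights=l[eta])[0]
              for l in lam]
        members = []
        for _ in range(n_members):
            psi = rng.choices(range(S), weights=w[eta])[0]   # member cluster within eta
            members.append([
                rng.choices(range(1, len(p[eta][psi]) + 1), weights=p[eta][psi])[0]
                for p in phi
            ])
        data.append({"eta": eta, "household": hh, "members": members})
    return data
```

For example, `simulate_households([2, 3, 1], d_ind=[12, 2, 5, 6, 19], d_hh=[3])` returns three households whose members share the household cluster `eta` but each have their own member cluster.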
Computation
Posterior computation uses the blocked Gibbs sampler of
Ishwaran and James (2001).
It truncates the infinite stick-breaking representations at
large finite numbers of components, F (household level) and
S (individual level), which keeps computation fast.
The sampler is implemented in Matlab and C++.
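The slides do not list the individual sampling steps. As a rough illustration of one of them, the sketch below shows the standard conditional update for truncated stick-breaking weights in a blocked Gibbs sampler: given the number of units currently assigned to each component, each stick fraction is drawn from Beta(1 + n_f, concentration + number of units in later components), and the last fraction is fixed at 1. The function name and arguments are hypothetical, and this is not the authors' Matlab/C++ code.

```python
import random

def update_stick_weights(counts, concentration, rng):
    """One blocked-Gibbs draw of truncated stick-breaking weights.

    counts[f] is the number of units currently assigned to component f;
    concentration is the DP concentration parameter for this level.
    """
    T = len(counts)
    weights, remaining = [], 1.0
    tail = sum(counts)
    for f, n_f in enumerate(counts):
        tail -= n_f  # units assigned to components after f
        if f == T - 1:
            v = 1.0  # truncation: the last stick takes the remaining mass
        else:
            v = rng.betavariate(1.0 + n_f, concentration + tail)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

rng = random.Random(0)
pi = update_stick_weights([40, 10, 0, 5], concentration=1.0, rng=rng)
```

Under this sketch the same function would apply to the household-level weights (with household counts and α) and, within each household cluster, to the member-level weights (with member counts and β).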
CPS data simulation
We applied the nested Dirichlet process model to the March
2011 Current Population Survey household dataset.
Five individual-level categorical variables (number of
categories in parentheses): relate (12), sex (2), race (5),
marital.status (6) and age (19).
One household-level categorical variable: ownership (3).
10000 households and 26661 individuals in the sample.
CPS data simulation cont’d
We set the hyperparameters to $a_\alpha = b_\alpha = 1$, $a_\beta = b_\beta = 1$ and
$a_{k1} = \dots = a_{kd_k} = 1$.
We initialize the multinomial probabilities φ and λ at the
maximum likelihood estimates from the original dataset.
Truncation levels are F = 25 and S = 15; the chain runs for 10000
iterations with a burn-in of 5000 and thinning of 10.
We compare marginal and joint distributions of the variables, and
some within-household relationships, between the original dataset
and a synthetic dataset (the one generated at the last iteration of
the MCMC chain).
CPS data simulation results nDP
Table: Sex distributions
sex          original    synthetic
1            12830       12949
2            13831       13712
Table: Ownership distributions
ownership    original    synthetic
1            6478        6508
2            144         138
3            3378        3354
Figure: Histograms of age and race (frequency), original vs. synthetic.
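The slides compare these counts by eye. One simple numeric summary, which is not used in the slides and is offered here only as an illustration, is the total variation distance between the original and synthetic marginal distributions. The sketch below applies it to the sex and ownership counts reported on this slide; the function name is hypothetical.

```python
def total_variation(original_counts, synthetic_counts):
    """Half the L1 distance between two empirical category distributions."""
    n_orig = sum(original_counts)
    n_syn = sum(synthetic_counts)
    return 0.5 * sum(
        abs(o / n_orig - s / n_syn)
        for o, s in zip(original_counts, synthetic_counts)
    )

# Counts from the sex and ownership tables (original, synthetic).
sex_tv = total_variation([12830, 13831], [12949, 13712])
ownership_tv = total_variation([6478, 144, 3378], [6508, 138, 3354])
print(sex_tv, ownership_tv)
```

Both distances come out below 0.005, which matches the visual impression that the synthetic marginals track the original ones closely.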
CPS data simulation results nDP
Figure: Joint distributions of age and sex, and of age and marst, original vs. synthetic.
Figure: Percentages of (marital status, race) and (sex, race) combinations, orig vs. syn.
Within household relationships - nDP
Within-household race in all households of size 3 in both datasets.
Table: Within household race for 1841 households of size 3
     original    synthetic
0    1744        1692
1    97          149
Within household relationships - nDP cont’d
Some measures on households of size 2:
Figure: Percentage of households by age difference, original (o) vs. synthetic (−).
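The slides do not define "age difference" precisely. Assuming it means the absolute difference between the two members' age categories, a within-household summary of this kind could be computed as in the following sketch. The input layout (one list of member age categories per household) and the function name are hypothetical.

```python
from collections import Counter

def age_difference_distribution(households):
    """Share of two-person households at each absolute age-category difference.

    households: iterable of lists, each holding the members' age categories.
    Households that do not have exactly two members are skipped.
    """
    diffs = Counter(abs(h[0] - h[1]) for h in households if len(h) == 2)
    total = sum(diffs.values())
    return {d: n / total for d, n in sorted(diffs.items())}

# Toy input: three two-person households and one three-person household.
toy = [[10, 12], [5, 5], [7, 9], [3, 4, 5]]
dist = age_difference_distribution(toy)
```

Running the same function on the original and on the synthetic dataset gives the two curves that a plot of this kind compares.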
Within household relationships - nDP cont’d
Some measures on age difference in households of size 3 and 6:
Figure: Percentage of households by age difference (two panels), original (o) vs. synthetic (−).
To deal with structural zeros
Define a new combined variable:
(age, marst)
Eliminate impossible combinations of "younger than 15 years
old" with "married/separated/divorced/widowed"
6 × 18 − 15 = 93 categories
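A minimal sketch of how such a combined variable could be built is shown below. The category codes are assumptions chosen only to reproduce the count on the slide: 18 age levels of which the first three are taken to be under 15, and 6 marital status levels of which five are taken to be ever-married states. The actual CPS codes may differ.

```python
from itertools import product

N_AGE, N_MARST = 18, 6
UNDER_15_AGE_LEVELS = {1, 2, 3}          # assumption: three youngest age levels
EVER_MARRIED_LEVELS = {1, 2, 3, 4, 5}    # assumption: five ever-married levels

def is_structural_zero(age, marst):
    """True for combinations that cannot occur in the population."""
    return age in UNDER_15_AGE_LEVELS and marst in EVER_MARRIED_LEVELS

# Enumerate the feasible (age, marst) pairs and give each one a single code.
feasible = [
    (age, marst)
    for age, marst in product(range(1, N_AGE + 1), range(1, N_MARST + 1))
    if not is_structural_zero(age, marst)
]
code_of = {pair: code for code, pair in enumerate(feasible, start=1)}

print(len(feasible))
```

Because the impossible pairs never receive a code, a multinomial model on the combined variable cannot generate them.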
CPS data simulation results with structural zeros
Table: Sex distributions
sex          original    synthetic
1            12830       12979
2            13831       13682
Table: Ownership distributions
ownership    original    synthetic
1            6478        6415
2            144         167
3            3378        3418
Figure: Histograms of age and race (frequency), original vs. synthetic.
CPS data simulation results with structural zeros
Figure: Joint distribution of age and marst, original vs. synthetic.
Figure: Percentages of (sex, race) combinations, orig vs. syn.
Within household relationships with structural zeros
Within-household race in all households of size 3 in both datasets.
Table: Within household race for 1841 households of size 3
     original    synthetic
0    1744        1669
1    97          172
Summary and future work
Proposed a fully Bayesian, joint modeling approach for
categorical household data, based on the nested Dirichlet
process
Dealing with relational structural zeros in the Decennial Census
household dataset
Developing risk measures for the generated synthetic
household dataset
Extensions to both categorical and continuous variables
References
Dunson, D. B., and Xing, C. (2009), “Nonparametric Bayes Modeling of Multivariate Categorical Data”, Journal of the American Statistical Association, 104(487), 1042-1051.
Ishwaran, H., and James, L. F. (2001), “Gibbs sampling for stick-breaking priors”, Journal of the American Statistical Association, 96, 161-173.
Manrique-Vallier, D., and Reiter, J. (2013), “Bayesian Estimation of Discrete Multivariate Latent Structure Models with Structural Zeros”, Journal of Computational and Graphical Statistics (accepted for publication).
Reiter, J. P. (2011), “Data confidentiality”, Wiley Interdisciplinary Reviews: Computational Statistics, 3, 450-456.
Rodriguez, A., Dunson, D. B., and Gelfand, A. E. (2008), “The nested Dirichlet process”, Journal of the American Statistical Association, 103, 1131-1154.
Sethuraman, J. (1994), “A constructive definition of Dirichlet priors”, Statistica Sinica, 4, 639-650.
Si, Y., and Reiter, J. P. (forthcoming), “Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys”, Journal of Educational and Behavioral Statistics.
Thank you! Questions?
Mixing of the MCMC chain - with structural zeros
Traceplots of the concentration parameters α and β in the
two-level DP priors (second half of the chain, thinned by 10).
The traceplots suggest that α and β have converged.
Figure: Traceplot of α; 2.2
Figure: Traceplot of β; 0.37
Cluster assignments at different iterations
Parameter estimates in a given cluster across different
iterations
Risk measure
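\begin{align*}
\Pr(\text{Risk of matching}) &= \Pr(Y_i = y_i \mid Z^{(1)},\dots,Z^{(m)}, M, Y_{-i}) \\
&= c \cdot P(Z^{(1)},\dots,Z^{(m)} \mid Y_{-i}, Y_i = y_i, M) \cdot P(Y_i = y_i \mid M) \\
&= c \cdot \left( \int P(Z^{(1)},\dots,Z^{(m)} \mid Y_{-i}, Y_i = y_i, \theta) \cdot P(\theta \mid Y_i = y_i, M, Y_{-i}) \, d\theta \right) \cdot P(Y_i = y_i \mid M)
\end{align*}
θ includes π, w, φ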
Importance sampling