Estimation of Joint Distribution from Marginal Distributions

2003 Joint Statistical Meetings - Section on Survey Research Methods
ESTIMATION OF JOINT DISTRIBUTION FROM MARGINAL
DISTRIBUTIONS
Yogendra P. Chaubey, Fassil Nebebe, Concordia University
Krzystof Dzicielowski, Bell Canada
Debaraj Sen,Concordia University
Yogendra P. Chaubey, Department of Mathematics and Statistics,
Concordia University, Montréal, Québec, H4B 1R6 Canada
([email protected]) ∗
November 14, 2003
Key Words: Bayesian Prediction, Contingency Ta- methods, because they require multidimensional nuble, Dirichlet Prior.
merical integration. However, the Bayesian method
is preferred as it readily presents an estimate of the
standard error derived from the posterior variance.
The proposal in this paper is to reduce the continAbstract
gency table into several 2×2 contingency tables, estiIn many consumer surveys, demographic and other mate the target cell frequency along with its standard
data are used as covariates for predictions of con- error and combine the results obtained from various
sumer preferences and other decision variables. How- tables in an appropriate manner. We highlight here
ever, to protect the confidentiality of the consumers, that computations in the case of 2 × 2 table can be
the data are only available in marginal frequency dis- easily performed. For example, suppose, we have a
tribution format. This creates a problem in predict- 2 × 2 × 2 contingency table, with unknown cell proing joint frequencies required for further decision- portions written as xijk ; i = 1, 2, j = 1, 2, k = 1, 2.
making. In this paper we consider the dependence For, estimating x111 , we can consider the following 3,
structure in formulating the joint distribution given 2 × 2 contingency tables:
in the form of a Dirichlet prior. The computations in
Table 1: Three 2 × 2 Contingency Tables for
the case of 2 × 2 table are extended to multiway conEstimating x111 from a 2 × 2 × 2 Table
tingency tables and shown to provide similar results
as obtained using the Monte Carlo methods.
k=1
1
Introduction
In this paper we consider estimating the joint distribution of several demographic variable when only information about their marginal distributions is available. Putler, Kalyanam and Hodges (1996) considered a Bayesian approach using Dirichlet prior for
the cell proportions. They also compared iterative
proprtional fitting (IPF) method (see Bishop, Feinberg, and Holland 1975).They presented the case of
2 × 2 table in detail. However, the higher contingency tables have to be analysed using Monte Carlo
∗ This research is partially supported by a research grant
from Bell University Labs to Yogendra P. Chaubey and Fassil
Nebebe.
883
i=1
i=2
1
2
j=1
j=2
x111
x211
x121
x221
i=1
k=1 k=2
j=1
j=2
1
2
x111
x121
x112
x122
j=1
k=1 k=2
i=1
i=2
1
2
x111
x211
x112
x212
2003 Joint Statistical Meetings - Section on Survey Research Methods
(l)
Let, x̂111 and se2(l) (x̂111 ) denote the posterior where αij refer to the prior information about pij ,
mean and posterior variance based on the lth table often available from some bench-mark data or a larger
comprising of rl observation, the final estimates of survey, so that
x111 and its proposed estimator of error is given by;
X
αij ∈ [0, 1],
αij = 1,
P3
(l)
ij
r
x̂
l=1 l 111
(1)
x̂111 =
P
3
and m refers to the weight assigned to the prior inl=1 rl
P3
formation. The prior implicitly assumes that in m
2(l)
(x̂111 )
l=1 rl se
se2 (x111 ) =
(2) observations of prior data, the total in the (i, j)−cell
P3
l=1 rl
is mαij .
With this set-up, because of the constraints on
In section 2, we present the Bayesian analysis of a
the
cell-proprtions, only one of the cell-totals is in2 × 2 contingency table as given in Putler, Kalyanam
dependent.
We choose arbitrarily the (2, 2) − cell,
and Hodges (1996). Section 3 outlines the comand
define
the
random variable z = nx22 . The posputational details along with an example. Section
terior
inference
about z may be based on the density
4 presents computational summary of the examples
g(z|data),
given
by
considered in Putler, Kalyanam and Hodges (1996)
and some additional examples from a marketing consumer survey data in Montreal.
2
The Bayesian Approach in 2×
2 Case
g(z) ∝ Γ(n1. − n.2 + z + mα11 )Γ(n.2 − z + mα12 )
Γ(n2. − z + mα21 )Γ(z + mα22 )/
{Γ(n1. − n.2 + z + 1)Γ(n.2 − z + 1)
Γ(n2. − z + 1)Γ(z + 1)} ,
(3)
Consider as if we have the observed data in the form where ni. = nxi. , n.j = nx.j ; i, j = 1, 2. Putler,
Kalyanam and Hodges (1996) suggest that the posteof a 2 × 2 contingency table:
rior mean can be computed in a two step procedure.
Table 2: A Simple 2 × 2 Contingency Table
The first step consists of computing the constant k
of proportionality, through a trapezoidal rule over a
Column Factor Levels
large number of intervals. The second step consists
1
2
Total
of evaluating the integral
Row
Z zmax
Factor
1
nx11
nx12
nx1.
ẑ = E(Z|data) =
zg(z)dz,
2
nx21
nx22
nx2.
Levels
zmin
Total nx.1
nx.2
nx.. = n
again using the trapezoidal rule, where
where
X
zmin = max(0, n.2 − n1. ), zmax = min(n.2 , n2. ).
xij ∈ [0, 1],
xij = 1,
ij
An estimate of the standard error of the estimate
and n is the sample size. The set-up we are con- is obtained from the posterior variance given by
cerned with is that xij are unknown, however the
Z zmax
row-marginals and column-marginals are known. Of2
2
se
(Z)
=
E[(Z
−
ẑ)
|data]
=
(z − ẑ)2 g(z)dz
fcourse, under some given assumptions such as indezmin
pendence assumption, the known marginals can be
is computed using similar approach
used to estimate the joint probabilities
pij = Pr[X1 = i, X2 = j],
3
Computational Aspects
where X1 refers to the row attribute and X2 refers
to the column attribute. Putler, Kalyanam and For the general case, Monte Carlo (MC) methods are
Hodges (1996) consider a joint Dirichlet prior on proposed for estimation of proportions pi1 i2 ... , such
as Importance Sampling and Gibbs Sampling. It is
p = (p11 , p12 , p21 , p22 ) given by
clear that MC methods are very computationally exY mα −1
tensive in this context and therefore, we investigate
π(p) ∝
pij ij ,
the possibility of direct computation as suggested in
ij
884
2003 Joint Statistical Meetings - Section on Survey Research Methods
the previous section. In practice, though the compu- The numbers in the parentheses are counts. The
tation of gamma functions arising in Eq. 1 presents prior-probabilities are given in the following table and
problems. In most of the situations, we may alleviate the value of the weight m is given by m = 12843 :
this problem by using the following technique. Let
us denote z0 , the approximate value of z where g(z)
Table 4: Prior Proportions in a Simple 2 × 2
takes its maxima (it is shown in Putler et al. (1996)
Contingency Table
that g(z) is unimodal). We can thus write
Household Status
R zmax
Not
Married Total
z exp(h(z) − h(z0 ))dz
zmin
Married
,
E(Z|data) = R zmax
exp(h(z) − h(z0 ))dz
Row
zmin
Type of Rental
.3072
.0955
.3544
where, h(z) is defined, so that
.1793
.4180
.6456
Housing Owned
Total
.4331
.5669
g(z) = k exp(h(z)).
The value of z0 can be found by plotting h(z)
against z, which may be computable, where as
exp(h(z)) may not be. We first demonstrate this
on the data used in Putler, Kalyanam and Hodges
(1996), and then we use it on a survey data conducted
by a market research firm. For the computation of the
integral we use the R-codes for area function given
in Figure 1.
This function works pretty well, except for the
cases where the function f may take extremely small
values for a wide range of arguments. For example,
if f (x) < eps for x in an interval (c, b), c < (a + b)/2,
and f (a) = f (b) = 0, the algorithm will produce a
small but wrong value of the integral. To avoid such
situations, we evaluate the integral over two intervals, (a, z0 ) and (z0 , b), where z0 is the approximate
value of the argument where the function peaks.
The approximate peak is obtained by taking the
maximum of h(z) evaluated over a grid of z−values.
The R-Codes given in Figure 2 is used to compute
the function h and approximate z0 .
The value of zmin and zmax are
zmin = 7281 − 4552 = 2729, zmax = 7281.
The value of z0 may be obtained by visual inspection of the graph of z vs. h(z). This is done by using
the codes given below.
rowt<-c(4552, 8291)
colt<-c(5562, 7281)
pprob<-matrix(c(0.2670, 0.0874, 0.1661,
0.4795), nr=2,byrow=T)
pcount<- 12843
Example 1. Consider the data (see Table 5 of zseq<-seq(2729,7281,length=20)
Putler et al. (1996) on Stain-Resistent Carpeting Direct Mail Campaign. We consider a collapsed form of hzseq<-sapply(zseq,h.prior,rowt=rowt,
the contingency table into two factors, Type of Housing and Marital Status of the Household Head. The colt=colt, pprob=pprob, pcount=pcount)
actual proportions are given below:
plot(zseq,hzseq)
Table 3: Proportions in a Simple 2 × 2 Contingency
Table
Household Status
Not
Married
Married
Row
Type of
Rental
.2670
.0874
Housing
Owned
.1661
.4795
Total
.4331
(5562)
.5669
(7281)
Total
.3544
(4552)
.6456
(8291)
n=
12843
885
This gives the value of z0 approximately 6000. So
we compute h(z0 ) as
>h.prior(6000,rowt,colt,pprob,pcount)
[1] 110799.4
so, h(z0 ) = 110799.4 and therefore we use the following codes for the integrands in the denominator
2003 Joint Statistical Meetings - Section on Survey Research Methods
Figure 1: R-codes for area Function
area<-function(f,a,b,...,fa=f(a,...),fb=f(b,...),limit=50,eps=1.0e-06)
{
#Program to integrate a function f using recursive simpson’s rule
#eps is the absolute target error #limit is max number of
iterations
h<-b-a
d<-(b+a)/2
fd<-f(d,...)
a1<-((fa+fb)*h)/2
a2<-((fa+4*fd+fb)*h)/6
if ( abs(a1-a2) < eps )
return(a2)
if (limit ==0){
warning(paste("recursion limit reached near x= ",d))
return(a2)
}
Recall(f,a,d,...,fa=fa,fb=fd,limit=limit-1,eps=eps)+
Recall(f,d,b,...,fa=fd,fb=fb,limit=limit-1,eps=eps)
}
obtained are given below in table 5.
g.prior<-function(x, rowt, colt, pprob,
pcount){exp(h.prior(x,rowt, colt,
pprob, pcount) -110799) }
Table 5: Bayes Proportions in a Simple 2 × 2
Contingency Table
Using the area over subintervals (2729, 6000) and
(6000, 7281) as given below
Household Status
Not
Married
Married
> area(g.prior,2729,6000,rowt,colt,
pprob,
pcount)
[1] 10.02941
> area(g.prior,6000,7281,rowt,colt,
pprob,
pcount)
[1] 7382576
Row
Type of
Housing
Rental
Owned
Total
.2670
.1661
.4331
.0874
.4795
.5669
Total
.3544
.6456
We also computed the posterior mean for
the denominator = 7382586.02941. Hence the ex- n = 15023 and m = 12843, and results were almost
pression for the posterior mean can be obtained the same.
through the following codes:
> kernel.mean<-function(x, rowt, colt, pprob,
pcount)x * g.prior(x, rowt, colt, pprob,
pcount)/7382586.03
4
Summary of Numerical Illustrations
> area(kernel.mean,2729,6000,rowt, colt,
4.1 Data from Putler et al. (1996)
pprob,
pcount)
[1] 0.008141925
Here we present the computations for the three exam> area(kernel.mean,6000,7281,rowt, colt, pprob,
ples considered in Tables 5-7 from Putler, Kalyanam
pcount)
and Hodges (1996). The subscripts i, j, k refer to var[1] 6158.561
ious demographic characteristics as follows:
This gives the Bayes estimate of x22 =
6158.5691/12843 = .4795. We can complete the
estimate of other cells in the same way. The results
886
• Stain-Resistant Carpeting Direct Mail Campain(Table 5)
2003 Joint Statistical Meetings - Section on Survey Research Methods
Figure 2: R-codes for h Function and z0
h.prior<-function(x, rowt, colt, pprob, pcount)
{
#log Density-kernel for Estimating Unknown Proportions
#this finds the prior for a cell count, see Eq (6) Putler et al.
# lam is the variable in the (2,2) cell
# pprob is the matrix of prior probabilities
# pcount is the weight in the Dirichlet prior, see Eq (2) of Putler et al.
# rowt is a vector of row totals
# rowc is a vector of column totals
# definition of the probability kernel
x1. <- rowt[1]
x2. <- rowt[2]
x.1 <- colt[1]
x.2 <- colt[2]
n <- x1. + x2.
a11 <- x1. - x.2 + x + pcount * pprob[1,1]
a12 <- x.2 - x + pcount * pprob[1, 2]
b11 <- x2. - x + pcount * pprob[2, 1]
b12 <- x + pcount * pprob[2, 2]
c11 <- x1. - x.2 + x + 1.
c12 <- x.2 - x + 1
d11 <- x2. - x + 1.
d12 <- x + 1.
tnum <- c(a11, a12, b11, b12)
tden <- c(c11, c12, d11, d12)
sum(lgamma(tnum))-sum(lgamma(tden))
}
• Custom-made Golf Clubs Direct Mail Campaign
Data (Table 7)
– Housing: Rent, i = 1.
Own, i = 2.
– Marital Status of Household Head: Married, j = 1.
Not Maried, j = 2.
– Annual Household Income: < 50, 0000, i =
1.
≥ 50, 000, i = 2.
– Household Member Age: < 18, k = 1.
≥ 18, k = 2.
– Sex: Male, j = 1.
Female, j = 2.
– Years of Formal College Education: < 4,
k = 1.
≥ 4, k = 2.
The following tables (Tables 6-8) give the values of
• Discount Home-Improvement Retail Site Locathe estimates of the cell proportions along with their
tion (Table 6)
standard errors following the method outlined above,
contrasting the values obtained by Putler, Kalyanam
– Housing: Rent, i = 1.
and Hodges (1996) using the Monte Carlo method.
Own, i = 2.
It is striking that our method gives almost the same
– Household Income: ≥ 40, 000, j = 1.
estimate as obtained by the Monte Carlo Method.
< 40, 000, j = 2.
The estimates of the standard errors seem a bit in– Household Member Age: ≥ 45, k = 1.
flated in some cases. At this point, we like to note
< 45, k = 2.
that since, there is only one degree of freedom in a
2 × 2 × 2 table, all the standard errors should be the
887
2003 Joint Statistical Meetings - Section on Survey Research Methods
Table 6: Summary for the Results for Table 5 of Putler et al.(1996)
Cell ID
Actual
Putler
et al.
Our
Proposal
Independent
Prior
111
.2252
.2219
(.0053)
.2293
(.0064)
.1955
(.0070)
112
.0418
.0366
(.0037)
.0368
(.0070)
.0315
(.0066)
121
.0413
.0333
(.0037)
.0342
(.0060)
.0691
(.0066)
122
.0461
.0622
(.0044)
.0546
(.0065)
.0544
(.0063)
211
.1378
.1453
(.0052)
.1357
(.0057)
.1476
(.0066)
212
.0283
.0292
(.0035)
.0308
(.0061)
.0541
(.0063)
221
.2276
.2315
(.0054)
.2324
(.0054)
.2179
(.0064)
222
.2520
.2400
(.0058)
.2465
(.0058)
.2292
(.0060)
Table 7: Summary for the Results for Table 6 of Putler et al.(1996)
Cell ID
Actual
Putler
et al.
Our
Proposal
Independent
Prior
111
.0400
.0109
(.0024)
.0106
(.0058)
.0180
(.0063)
112
.0471
.0240
(.0033)
.0151
(.0062)
.0328
(.0068)
121
.1055
.0870
(.0052)
.0919
(.0060)
.1089
(.0065)
same. In such cases, we can average out the standard
errors in an appropriate way (such as waiting by the
total sample size in each partition of sub-tables). In
the above examples, we can combine the posterior
variances obtained from the two sub-tables and then
average them. However, in more complicated set up
this type of combination may not be possible. We
also investigated using the estimates under independence as priors. The resulting estimates are not far
from the original prior, which seem to be much closer
to the actual proportions. We like to think the independence estimates as non-informative prior which
get adjusted by the data through the posterior mean.
122
.1243
.1951
(.0056)
.1912
(.0063)
.1575
(.0069)
211
.0862
.1116
(.0058)
.1096
(.0064)
.1167
(.0068)
212
.1015
.1285
(.0060)
.1372
(.0067)
.1163
(.0072)
221
.2274
.2495
(.0060)
.2471
(.0064)
.2275
(.0069)
222
.2680
.1934
(.0061)
.1911
(.0067)
.2242
(.0071)
We investigated using the estimates under independence as priors. The resulting estimates are not far
from the actual proportions. We like to think the independence estimates as non-informative prior which
get adjusted by the data through the posterior mean.
5
Summary
This paper has put forward for simplifying Bayesian
calculations for multiway contingency tables for estimating the unknown cell proportions for given
marginals. Knowledge of prior for cell proportions in
the form of a Dirichlet distribution is assumed. However, in the absence of such prior, it is illustrated
4.2 Survey Data
that the probabilities obtained using the indepenHere we present the computations for survey data. dence assumption of the factors may be used as nonThe subscripts i, j, k refer to various demographic informative prior. The resulting estimates are often
quite close to the actual observed cell proportions.
characteristics as follows:
The technique is illustrated on several real data.
• Gender: Male, i = 1.
Female, i = 2.
• Age between 1 to 44 : j = 1.
between 45 and above , j = 2.
• Education: less than Graduate, k = 1.
Graduate and above, k = 2.
888
2003 Joint Statistical Meetings - Section on Survey Research Methods
Table 8: Summary for the Results for Table 7 of Putler et al.(1996)
Cell ID
Actual
Putler
et al.
Our
Proposal
Independent
Prior
111
.3379
.3426
(.0034)
.3418
(.0033)
.3401
(.0030)
112
.3742
.3921
(.0036)
.3919
(.0032)
.3885
(.0029)
121
.0464
.0356
(.0025)
.0367
(.0039)
.0411
(.0037)
122
.0513
.0397
(.0024)
.0393
(.0038)
.0427
(.0035)
211
.0793
.0709
(.0030)
.0703
(.0043)
.0733
(.0040)
212
.0879
, .0734
(.0030)
.0751
(.0041)
.0794
(.0038)
221
.0109
.0259
(.0021)
.0257
(.0064)
.0167
(.0062)
222
.0121
.0198
(.0018)
.0195
(.0059)
.0145
(.0057)
221
.0957
.0790
.1178
(.0288)
222
.0858
.1025
.0963
(.0295)
Table 9: Summary for the Results for Survey Data
Cell ID
Actual
Independent
Our
Proposal
111
.1188
.1205
.1119
(.0277)
112
.1749
.1732
.1219
(.0270)
121
.0924
.0907
.1856
(.0275)
References
[1] Bishop, Y. M. M., Feinberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice, Cambridge, MA: MIT
Press.
[2] Putler, D. S., Kalyanam, K. and Hodges, J. S.
(1996), “A Bayesian Approach for Estimating
Target Market Potential with Limited Geodemographic Information,” J ournal of Market Research, 33 134-149.
889
122
.1287
.1304
.1778
(.0269)
211
.1155
.1322
.0904
(.0292)
212
.1882
.1714
.0896
(.0285)