2003 Joint Statistical Meetings - Section on Survey Research Methods ESTIMATION OF JOINT DISTRIBUTION FROM MARGINAL DISTRIBUTIONS Yogendra P. Chaubey, Fassil Nebebe, Concordia University Krzystof Dzicielowski, Bell Canada Debaraj Sen,Concordia University Yogendra P. Chaubey, Department of Mathematics and Statistics, Concordia University, Montréal, Québec, H4B 1R6 Canada ([email protected]) ∗ November 14, 2003 Key Words: Bayesian Prediction, Contingency Ta- methods, because they require multidimensional nuble, Dirichlet Prior. merical integration. However, the Bayesian method is preferred as it readily presents an estimate of the standard error derived from the posterior variance. The proposal in this paper is to reduce the continAbstract gency table into several 2×2 contingency tables, estiIn many consumer surveys, demographic and other mate the target cell frequency along with its standard data are used as covariates for predictions of con- error and combine the results obtained from various sumer preferences and other decision variables. How- tables in an appropriate manner. We highlight here ever, to protect the confidentiality of the consumers, that computations in the case of 2 × 2 table can be the data are only available in marginal frequency dis- easily performed. For example, suppose, we have a tribution format. This creates a problem in predict- 2 × 2 × 2 contingency table, with unknown cell proing joint frequencies required for further decision- portions written as xijk ; i = 1, 2, j = 1, 2, k = 1, 2. making. In this paper we consider the dependence For, estimating x111 , we can consider the following 3, structure in formulating the joint distribution given 2 × 2 contingency tables: in the form of a Dirichlet prior. The computations in Table 1: Three 2 × 2 Contingency Tables for the case of 2 × 2 table are extended to multiway conEstimating x111 from a 2 × 2 × 2 Table tingency tables and shown to provide similar results as obtained using the Monte Carlo methods. k=1 1 Introduction In this paper we consider estimating the joint distribution of several demographic variable when only information about their marginal distributions is available. Putler, Kalyanam and Hodges (1996) considered a Bayesian approach using Dirichlet prior for the cell proportions. They also compared iterative proprtional fitting (IPF) method (see Bishop, Feinberg, and Holland 1975).They presented the case of 2 × 2 table in detail. However, the higher contingency tables have to be analysed using Monte Carlo ∗ This research is partially supported by a research grant from Bell University Labs to Yogendra P. Chaubey and Fassil Nebebe. 883 i=1 i=2 1 2 j=1 j=2 x111 x211 x121 x221 i=1 k=1 k=2 j=1 j=2 1 2 x111 x121 x112 x122 j=1 k=1 k=2 i=1 i=2 1 2 x111 x211 x112 x212 2003 Joint Statistical Meetings - Section on Survey Research Methods (l) Let, x̂111 and se2(l) (x̂111 ) denote the posterior where αij refer to the prior information about pij , mean and posterior variance based on the lth table often available from some bench-mark data or a larger comprising of rl observation, the final estimates of survey, so that x111 and its proposed estimator of error is given by; X αij ∈ [0, 1], αij = 1, P3 (l) ij r x̂ l=1 l 111 (1) x̂111 = P 3 and m refers to the weight assigned to the prior inl=1 rl P3 formation. The prior implicitly assumes that in m 2(l) (x̂111 ) l=1 rl se se2 (x111 ) = (2) observations of prior data, the total in the (i, j)−cell P3 l=1 rl is mαij . With this set-up, because of the constraints on In section 2, we present the Bayesian analysis of a the cell-proprtions, only one of the cell-totals is in2 × 2 contingency table as given in Putler, Kalyanam dependent. We choose arbitrarily the (2, 2) − cell, and Hodges (1996). Section 3 outlines the comand define the random variable z = nx22 . The posputational details along with an example. Section terior inference about z may be based on the density 4 presents computational summary of the examples g(z|data), given by considered in Putler, Kalyanam and Hodges (1996) and some additional examples from a marketing consumer survey data in Montreal. 2 The Bayesian Approach in 2× 2 Case g(z) ∝ Γ(n1. − n.2 + z + mα11 )Γ(n.2 − z + mα12 ) Γ(n2. − z + mα21 )Γ(z + mα22 )/ {Γ(n1. − n.2 + z + 1)Γ(n.2 − z + 1) Γ(n2. − z + 1)Γ(z + 1)} , (3) Consider as if we have the observed data in the form where ni. = nxi. , n.j = nx.j ; i, j = 1, 2. Putler, Kalyanam and Hodges (1996) suggest that the posteof a 2 × 2 contingency table: rior mean can be computed in a two step procedure. Table 2: A Simple 2 × 2 Contingency Table The first step consists of computing the constant k of proportionality, through a trapezoidal rule over a Column Factor Levels large number of intervals. The second step consists 1 2 Total of evaluating the integral Row Z zmax Factor 1 nx11 nx12 nx1. ẑ = E(Z|data) = zg(z)dz, 2 nx21 nx22 nx2. Levels zmin Total nx.1 nx.2 nx.. = n again using the trapezoidal rule, where where X zmin = max(0, n.2 − n1. ), zmax = min(n.2 , n2. ). xij ∈ [0, 1], xij = 1, ij An estimate of the standard error of the estimate and n is the sample size. The set-up we are con- is obtained from the posterior variance given by cerned with is that xij are unknown, however the Z zmax row-marginals and column-marginals are known. Of2 2 se (Z) = E[(Z − ẑ) |data] = (z − ẑ)2 g(z)dz fcourse, under some given assumptions such as indezmin pendence assumption, the known marginals can be is computed using similar approach used to estimate the joint probabilities pij = Pr[X1 = i, X2 = j], 3 Computational Aspects where X1 refers to the row attribute and X2 refers to the column attribute. Putler, Kalyanam and For the general case, Monte Carlo (MC) methods are Hodges (1996) consider a joint Dirichlet prior on proposed for estimation of proportions pi1 i2 ... , such as Importance Sampling and Gibbs Sampling. It is p = (p11 , p12 , p21 , p22 ) given by clear that MC methods are very computationally exY mα −1 tensive in this context and therefore, we investigate π(p) ∝ pij ij , the possibility of direct computation as suggested in ij 884 2003 Joint Statistical Meetings - Section on Survey Research Methods the previous section. In practice, though the compu- The numbers in the parentheses are counts. The tation of gamma functions arising in Eq. 1 presents prior-probabilities are given in the following table and problems. In most of the situations, we may alleviate the value of the weight m is given by m = 12843 : this problem by using the following technique. Let us denote z0 , the approximate value of z where g(z) Table 4: Prior Proportions in a Simple 2 × 2 takes its maxima (it is shown in Putler et al. (1996) Contingency Table that g(z) is unimodal). We can thus write Household Status R zmax Not Married Total z exp(h(z) − h(z0 ))dz zmin Married , E(Z|data) = R zmax exp(h(z) − h(z0 ))dz Row zmin Type of Rental .3072 .0955 .3544 where, h(z) is defined, so that .1793 .4180 .6456 Housing Owned Total .4331 .5669 g(z) = k exp(h(z)). The value of z0 can be found by plotting h(z) against z, which may be computable, where as exp(h(z)) may not be. We first demonstrate this on the data used in Putler, Kalyanam and Hodges (1996), and then we use it on a survey data conducted by a market research firm. For the computation of the integral we use the R-codes for area function given in Figure 1. This function works pretty well, except for the cases where the function f may take extremely small values for a wide range of arguments. For example, if f (x) < eps for x in an interval (c, b), c < (a + b)/2, and f (a) = f (b) = 0, the algorithm will produce a small but wrong value of the integral. To avoid such situations, we evaluate the integral over two intervals, (a, z0 ) and (z0 , b), where z0 is the approximate value of the argument where the function peaks. The approximate peak is obtained by taking the maximum of h(z) evaluated over a grid of z−values. The R-Codes given in Figure 2 is used to compute the function h and approximate z0 . The value of zmin and zmax are zmin = 7281 − 4552 = 2729, zmax = 7281. The value of z0 may be obtained by visual inspection of the graph of z vs. h(z). This is done by using the codes given below. rowt<-c(4552, 8291) colt<-c(5562, 7281) pprob<-matrix(c(0.2670, 0.0874, 0.1661, 0.4795), nr=2,byrow=T) pcount<- 12843 Example 1. Consider the data (see Table 5 of zseq<-seq(2729,7281,length=20) Putler et al. (1996) on Stain-Resistent Carpeting Direct Mail Campaign. We consider a collapsed form of hzseq<-sapply(zseq,h.prior,rowt=rowt, the contingency table into two factors, Type of Housing and Marital Status of the Household Head. The colt=colt, pprob=pprob, pcount=pcount) actual proportions are given below: plot(zseq,hzseq) Table 3: Proportions in a Simple 2 × 2 Contingency Table Household Status Not Married Married Row Type of Rental .2670 .0874 Housing Owned .1661 .4795 Total .4331 (5562) .5669 (7281) Total .3544 (4552) .6456 (8291) n= 12843 885 This gives the value of z0 approximately 6000. So we compute h(z0 ) as >h.prior(6000,rowt,colt,pprob,pcount) [1] 110799.4 so, h(z0 ) = 110799.4 and therefore we use the following codes for the integrands in the denominator 2003 Joint Statistical Meetings - Section on Survey Research Methods Figure 1: R-codes for area Function area<-function(f,a,b,...,fa=f(a,...),fb=f(b,...),limit=50,eps=1.0e-06) { #Program to integrate a function f using recursive simpson’s rule #eps is the absolute target error #limit is max number of iterations h<-b-a d<-(b+a)/2 fd<-f(d,...) a1<-((fa+fb)*h)/2 a2<-((fa+4*fd+fb)*h)/6 if ( abs(a1-a2) < eps ) return(a2) if (limit ==0){ warning(paste("recursion limit reached near x= ",d)) return(a2) } Recall(f,a,d,...,fa=fa,fb=fd,limit=limit-1,eps=eps)+ Recall(f,d,b,...,fa=fd,fb=fb,limit=limit-1,eps=eps) } obtained are given below in table 5. g.prior<-function(x, rowt, colt, pprob, pcount){exp(h.prior(x,rowt, colt, pprob, pcount) -110799) } Table 5: Bayes Proportions in a Simple 2 × 2 Contingency Table Using the area over subintervals (2729, 6000) and (6000, 7281) as given below Household Status Not Married Married > area(g.prior,2729,6000,rowt,colt, pprob, pcount) [1] 10.02941 > area(g.prior,6000,7281,rowt,colt, pprob, pcount) [1] 7382576 Row Type of Housing Rental Owned Total .2670 .1661 .4331 .0874 .4795 .5669 Total .3544 .6456 We also computed the posterior mean for the denominator = 7382586.02941. Hence the ex- n = 15023 and m = 12843, and results were almost pression for the posterior mean can be obtained the same. through the following codes: > kernel.mean<-function(x, rowt, colt, pprob, pcount)x * g.prior(x, rowt, colt, pprob, pcount)/7382586.03 4 Summary of Numerical Illustrations > area(kernel.mean,2729,6000,rowt, colt, 4.1 Data from Putler et al. (1996) pprob, pcount) [1] 0.008141925 Here we present the computations for the three exam> area(kernel.mean,6000,7281,rowt, colt, pprob, ples considered in Tables 5-7 from Putler, Kalyanam pcount) and Hodges (1996). The subscripts i, j, k refer to var[1] 6158.561 ious demographic characteristics as follows: This gives the Bayes estimate of x22 = 6158.5691/12843 = .4795. We can complete the estimate of other cells in the same way. The results 886 • Stain-Resistant Carpeting Direct Mail Campain(Table 5) 2003 Joint Statistical Meetings - Section on Survey Research Methods Figure 2: R-codes for h Function and z0 h.prior<-function(x, rowt, colt, pprob, pcount) { #log Density-kernel for Estimating Unknown Proportions #this finds the prior for a cell count, see Eq (6) Putler et al. # lam is the variable in the (2,2) cell # pprob is the matrix of prior probabilities # pcount is the weight in the Dirichlet prior, see Eq (2) of Putler et al. # rowt is a vector of row totals # rowc is a vector of column totals # definition of the probability kernel x1. <- rowt[1] x2. <- rowt[2] x.1 <- colt[1] x.2 <- colt[2] n <- x1. + x2. a11 <- x1. - x.2 + x + pcount * pprob[1,1] a12 <- x.2 - x + pcount * pprob[1, 2] b11 <- x2. - x + pcount * pprob[2, 1] b12 <- x + pcount * pprob[2, 2] c11 <- x1. - x.2 + x + 1. c12 <- x.2 - x + 1 d11 <- x2. - x + 1. d12 <- x + 1. tnum <- c(a11, a12, b11, b12) tden <- c(c11, c12, d11, d12) sum(lgamma(tnum))-sum(lgamma(tden)) } • Custom-made Golf Clubs Direct Mail Campaign Data (Table 7) – Housing: Rent, i = 1. Own, i = 2. – Marital Status of Household Head: Married, j = 1. Not Maried, j = 2. – Annual Household Income: < 50, 0000, i = 1. ≥ 50, 000, i = 2. – Household Member Age: < 18, k = 1. ≥ 18, k = 2. – Sex: Male, j = 1. Female, j = 2. – Years of Formal College Education: < 4, k = 1. ≥ 4, k = 2. The following tables (Tables 6-8) give the values of • Discount Home-Improvement Retail Site Locathe estimates of the cell proportions along with their tion (Table 6) standard errors following the method outlined above, contrasting the values obtained by Putler, Kalyanam – Housing: Rent, i = 1. and Hodges (1996) using the Monte Carlo method. Own, i = 2. It is striking that our method gives almost the same – Household Income: ≥ 40, 000, j = 1. estimate as obtained by the Monte Carlo Method. < 40, 000, j = 2. The estimates of the standard errors seem a bit in– Household Member Age: ≥ 45, k = 1. flated in some cases. At this point, we like to note < 45, k = 2. that since, there is only one degree of freedom in a 2 × 2 × 2 table, all the standard errors should be the 887 2003 Joint Statistical Meetings - Section on Survey Research Methods Table 6: Summary for the Results for Table 5 of Putler et al.(1996) Cell ID Actual Putler et al. Our Proposal Independent Prior 111 .2252 .2219 (.0053) .2293 (.0064) .1955 (.0070) 112 .0418 .0366 (.0037) .0368 (.0070) .0315 (.0066) 121 .0413 .0333 (.0037) .0342 (.0060) .0691 (.0066) 122 .0461 .0622 (.0044) .0546 (.0065) .0544 (.0063) 211 .1378 .1453 (.0052) .1357 (.0057) .1476 (.0066) 212 .0283 .0292 (.0035) .0308 (.0061) .0541 (.0063) 221 .2276 .2315 (.0054) .2324 (.0054) .2179 (.0064) 222 .2520 .2400 (.0058) .2465 (.0058) .2292 (.0060) Table 7: Summary for the Results for Table 6 of Putler et al.(1996) Cell ID Actual Putler et al. Our Proposal Independent Prior 111 .0400 .0109 (.0024) .0106 (.0058) .0180 (.0063) 112 .0471 .0240 (.0033) .0151 (.0062) .0328 (.0068) 121 .1055 .0870 (.0052) .0919 (.0060) .1089 (.0065) same. In such cases, we can average out the standard errors in an appropriate way (such as waiting by the total sample size in each partition of sub-tables). In the above examples, we can combine the posterior variances obtained from the two sub-tables and then average them. However, in more complicated set up this type of combination may not be possible. We also investigated using the estimates under independence as priors. The resulting estimates are not far from the original prior, which seem to be much closer to the actual proportions. We like to think the independence estimates as non-informative prior which get adjusted by the data through the posterior mean. 122 .1243 .1951 (.0056) .1912 (.0063) .1575 (.0069) 211 .0862 .1116 (.0058) .1096 (.0064) .1167 (.0068) 212 .1015 .1285 (.0060) .1372 (.0067) .1163 (.0072) 221 .2274 .2495 (.0060) .2471 (.0064) .2275 (.0069) 222 .2680 .1934 (.0061) .1911 (.0067) .2242 (.0071) We investigated using the estimates under independence as priors. The resulting estimates are not far from the actual proportions. We like to think the independence estimates as non-informative prior which get adjusted by the data through the posterior mean. 5 Summary This paper has put forward for simplifying Bayesian calculations for multiway contingency tables for estimating the unknown cell proportions for given marginals. Knowledge of prior for cell proportions in the form of a Dirichlet distribution is assumed. However, in the absence of such prior, it is illustrated 4.2 Survey Data that the probabilities obtained using the indepenHere we present the computations for survey data. dence assumption of the factors may be used as nonThe subscripts i, j, k refer to various demographic informative prior. The resulting estimates are often quite close to the actual observed cell proportions. characteristics as follows: The technique is illustrated on several real data. • Gender: Male, i = 1. Female, i = 2. • Age between 1 to 44 : j = 1. between 45 and above , j = 2. • Education: less than Graduate, k = 1. Graduate and above, k = 2. 888 2003 Joint Statistical Meetings - Section on Survey Research Methods Table 8: Summary for the Results for Table 7 of Putler et al.(1996) Cell ID Actual Putler et al. Our Proposal Independent Prior 111 .3379 .3426 (.0034) .3418 (.0033) .3401 (.0030) 112 .3742 .3921 (.0036) .3919 (.0032) .3885 (.0029) 121 .0464 .0356 (.0025) .0367 (.0039) .0411 (.0037) 122 .0513 .0397 (.0024) .0393 (.0038) .0427 (.0035) 211 .0793 .0709 (.0030) .0703 (.0043) .0733 (.0040) 212 .0879 , .0734 (.0030) .0751 (.0041) .0794 (.0038) 221 .0109 .0259 (.0021) .0257 (.0064) .0167 (.0062) 222 .0121 .0198 (.0018) .0195 (.0059) .0145 (.0057) 221 .0957 .0790 .1178 (.0288) 222 .0858 .1025 .0963 (.0295) Table 9: Summary for the Results for Survey Data Cell ID Actual Independent Our Proposal 111 .1188 .1205 .1119 (.0277) 112 .1749 .1732 .1219 (.0270) 121 .0924 .0907 .1856 (.0275) References [1] Bishop, Y. M. M., Feinberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice, Cambridge, MA: MIT Press. [2] Putler, D. S., Kalyanam, K. and Hodges, J. S. (1996), “A Bayesian Approach for Estimating Target Market Potential with Limited Geodemographic Information,” J ournal of Market Research, 33 134-149. 889 122 .1287 .1304 .1778 (.0269) 211 .1155 .1322 .0904 (.0292) 212 .1882 .1714 .0896 (.0285)
© Copyright 2026 Paperzz