Regression method (basic level)

Joze Sambt
NTA Hands-On Workshop
Berkeley, CA
January 14, 2009

Why do we need a regression method?

Education and health care expenditures are usually reported at the household level, but in the NTA context everything has to be assigned to individuals.

Example of the linear relation between two variables

[Figure: Consumption (vertical axis, 0 to 2500) plotted against Income (horizontal axis, 0 to 3000), with fitted line y = 0.4998x + 678.93, R² = 0.8269]

The idea of the regression analysis

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$$

For example:

$$\widehat{CONSUMPTION}_i = 678.93 + 0.4998 \cdot INCOME_i$$

◦ The slope coefficient is about 0.50, suggesting that an increase in real income of 1 dollar leads, on average, to an increase of about 50 cents in real consumption expenditure.
◦ Constant: about 679 dollars is the level of autonomous consumption (i.e. consumption when the person receives no income, so that the value of the independent variable is 0).

Allocating health expenditure

Private health expenditure of household j, CFH_j, is regressed on the number of household members in each age group x, M_j(x):

$$CFH_j = \sum_{x \ge 0} \beta(x) M_j(x)$$

Using broader age groups could be a good idea (because of degrees of freedom: some age groups contain only a small number of observations). Don’t worry, your age profile will most likely not look like stairs because of that.

$$CFH_j = \beta_1 M_{j,0\text{-}4} + \beta_2 M_{j,5\text{-}9} + \dots + \beta_{19} M_{j,90+}$$

STATA code

Grouping into 5-year age groups:

    gen agegrp=age
    recode agegrp (0/4=2.5) (5/9=7.5) (10/14=12.5) (15/19=17.5) … (90/max=90)

Calculating the number of individuals in each age group (by households):

    * by hhid: requires the data to be sorted by hhid
    * (in current Stata, egen's sum() is documented under the name total())
    sort hhid
    by hhid: egen p4=sum(agegrp==2.5)
    by hhid: egen p9=sum(agegrp==7.5)
    by hhid: egen p14=sum(agegrp==12.5)
    by hhid: egen p19=sum(agegrp==17.5)
    by hhid: egen p24=sum(agegrp==22.5)
    …
    by hhid: egen p90=sum(agegrp==90)

Household health expenditures are regressed on the number of individuals in each age group within a household (without an intercept),

    * noconstant: intercept suppressed!
    reg cfhc p4 p9 p14 p19 p24 p29 p34 p39 p44 p49 p54 p59 p64 p69 p74 p79 p84 p89 p90 [w=weight], robust noconstant

… and the coefficients are stored for future use:

    gen bp4=_b[p4]
    gen bp9=_b[p9]
    gen bp14=_b[p14]
    …
    gen bp90=_b[p90]

However,

◦ summing up the obtained values for all members of the household results in a different amount of health expenditure than the one reported in the survey (at the household level).
◦ Therefore, we need a further adjustment in which we use only the relative size of those coefficients between household members, i.e. we treat them as within-household shares (weights).

For example:

Assume a household with three individuals:

◦ child, aged 6 years
◦ mother, aged 33 years
◦ father, aged 36 years

Let’s further assume that the coefficient obtained from the regression is 20 for the age group 5-9 years, 80 for the age group 30-34 years, and 100 for the age group 35-39 years. This sums to 200. However, in the survey the household has reported 300 dollars of health expenditures.
This means we have to rescale those values so that they match 300:

              Regression coefficients   Share   Rescaled values   Share (remains the same)
    Child              20                10%           30                  10%
    Mother             80                40%          120                  40%
    Father            100                50%          150                  50%
    SUM               200                             300

STATA code

Each individual is assigned the coefficient of his or her own age group (summed over the household, this equals the coefficients by age group multiplied by the number of household members in each age group):

    gen hp4=bp4*(agegrp==2.5)
    gen hp9=bp9*(agegrp==7.5)
    gen hp14=bp14*(agegrp==12.5)
    …
    gen hp90=bp90*(agegrp==90)
    * row sum: the coefficient of the individual's own age group
    egen sum=rsum(hp4 hp9 hp14 hp19 hp24 hp29 hp34 hp39 hp44 hp49 hp54 hp59 hp64 hp69 hp74 hp79 hp84 hp89 hp90)

The sum of weights by household is calculated (i.e. the estimated household expenditure on health):

    by hhid: egen total=sum(sum)

The relative shares of individuals' weights in the total sum of household weights are used to distribute the reported health expenditures (CFH_j) among household members:

$$CFH_{ij} = CFH_j \cdot \frac{\beta(x)}{\sum_{a \ge 0} \beta(a) M_j(a)}$$

where x is the age of individual i in household j.

… rewritten as STATA code:

The relative share for each household member is calculated (by dividing the individual's coefficient by the total sum of the coefficients of all household members):

    * the share is missing when total is 0; it is then set to 0
    gen rhp4=hp4/total
    replace rhp4=0 if rhp4==.
    gen rhp9=hp9/total
    replace rhp9=0 if rhp9==.
    …
    gen rhp90=hp90/total
    replace rhp90=0 if rhp90==.
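The share calculation can be checked on the example household (child, mother, father). The following is a minimal sketch, not the workshop's code: it types in the three example coefficients directly instead of estimating them, and the variable names coef, total_coef and share are illustrative assumptions.

```stata
* Toy check of the within-household shares for the example household.
* coef, total_coef and share are hypothetical names, not from the slides.
clear
input hhid age coef
1  6  20
1 33  80
1 36 100
end

* Household sum of the members' coefficients
* (the estimated household expenditure on health)
bysort hhid: egen total_coef = total(coef)

* Within-household share of each member
gen share = coef/total_coef

list age coef total_coef share, noobs
```

The listed shares are 0.1, 0.4 and 0.5, which are the shares shown in the rescaling table for the child, the mother and the father.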
Finally, the relative shares of the members of each household are multiplied by the reported health expenditures of that household to obtain health expenditures by individual:

    gen th4=cfhc*rhp4
    gen th9=cfhc*rhp9
    …
    gen th90=cfhc*rhp90

It has already been said during the workshop: if you have information…

◦ about the subcategories of the expenditures (for example, information on household members using inpatient (IN) or outpatient (OUT) services), use that information:

$$CFH_j = \sum_{x \ge 0} \alpha(x) IN_j(x) + \sum_{x \ge 0} \beta(x) OUT_j(x)$$

◦ about household members being enrolled (E) and not enrolled (NE) in the educational process, use that information:

$$CFE_j = \sum_{x \ge 0} \alpha(x) E_j(x) + \sum_{x \ge 0} \beta(x) NE_j(x)$$

… or if you…

◦ have an external profile of per capita utilization by age (U) and the number of household members by age (M), use that information:

$$CFH_j = \sum_{x \ge 0} \beta(x) U(x) M_j(x)$$

◦ have detailed data available (for example, separately reported expenditures on the primary, secondary and tertiary education levels), use them: limit the analysis only to those age groups for which the expenditures are relevant. This is especially relevant for education expenditures.

Some final details

◦ There is no constant term (i.e. the regression is estimated in homogeneous form), so the household consumption is fully allocated.
◦ Constrain negative values of the coefficients to zero.
◦ In the case of education, do not apply smoothing.
◦ You might want to have two age groups (age 0 and ages 1-4) instead of the 0-4 age group, to capture the higher health expenditures in the first year of life that are reported in countries where such detailed data are available.
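The final details above (no constant term, negative coefficients constrained to zero, household expenditure fully allocated) can be sketched in a compact, loop-based form. This is only a sketch on simulated data, under stated assumptions: the three broad age groups, the data-generating values, and the names noise, coef, total_coef, share, th and check are all hypothetical and not taken from the workshop. As in the slides, the regression is run on the individual-level file.

```stata
* Simulated data: 200 households with 3 members each (assumption)
clear
set seed 12345
set obs 600
gen hhid = ceil(_n/3)
gen age  = floor(80*runiform())

* Three broad age groups, labelled by their lower bound (assumption)
gen agegrp = 0
replace agegrp = 20 if age >= 20
replace agegrp = 60 if age >= 60

* Number of household members in each age group
foreach g in 0 20 60 {
    bysort hhid: egen p`g' = total(agegrp == `g')
}

* Simulated household health expenditure, identical for all members
bysort hhid: gen double noise = rnormal(0, 20) if _n == 1
bysort hhid (noise): replace noise = noise[1]
gen double cfhc = 50*p0 + 100*p20 + 300*p60 + noise

* Regression without an intercept
regress cfhc p0 p20 p60, noconstant robust

* Coefficient of each individual's own age group,
* with negative coefficients constrained to zero
gen double coef = 0
foreach g in 0 20 60 {
    replace coef = max(_b[p`g'], 0) if agegrp == `g'
}

* Within-household shares and allocated individual expenditures
bysort hhid: egen double total_coef = total(coef)
gen double share = cond(total_coef > 0, coef/total_coef, 0)
gen double th = cfhc*share

* Individual amounts add up to the reported household amount
bysort hhid: egen double check = total(th)
assert abs(check - cfhc) < 1e-6 if total_coef > 0
```

Keeping one coefficient variable per individual, rather than one variable per age group, avoids the long lists of hp, rhp and th variables. The condition on total_coef covers households in which every member falls into an age group whose coefficient was set to zero; in that case nothing can be allocated by this rule.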