Integrating data from different sources: Improved spatially-disaggregated livestock measures for Uganda

Carlo Azzarri
International Food Policy Research Institute
2033 K Street, NW, Washington, DC 20007
[email protected]

Elizabeth Cross
US Bureau of Labor Statistics
Postal Square Building, 2 Massachusetts Avenue, NE, Washington, DC 20212
[email protected]

Corresponding author. This work benefited from comments by Alberto Zezza, Gero Carletto, Stanley Wood, Zhe Guo and seminar participants at the 3rd Advisory Committee Meeting of the Livestock Data Innovation Project (FAO, 20-21 February 2012) and at Makerere University in Kampala, whom we would like to thank here. The usual disclaimer applies.

ABSTRACT

Although livestock contributes to household livelihoods in a variety of ways (i.e., by providing cash income, food, manure, traction power, savings and insurance, and collateral for financial services), in this paper we focus on livestock as a source of income and food. Our focus is on Uganda, where agricultural data, including data on livestock, are relatively abundant and the proportion of rural poor holding livestock is high (around 70%). The objective of our study is twofold: first, to complement the analysis of Benson and Mugarura (2013) by using a suite of different methods to assess the spatial density of livestock holdings; second, to show that combining different data sources (the latest Uganda National Panel Survey (UNPS) 2009/10 and the National Livestock Census (NLC) 2008) and applying the Small Area Estimation (SAE) technique of Elbers, Lanjouw, and Lanjouw (2003) can improve the spatial disaggregation of livestock measures that are missing from the Census. First, we combine our livestock number and density mapping results with those from the NLC. Second, we fit an estimation model of livestock income and share on the UNPS and predict the missing information in the NLC, mapping livestock income and share at the local level. Our results suggest that the integrated use of multiple data sources, such as household surveys and censuses, satellite imagery and administrative data, together with spatial analysis techniques such as SAE, can provide reliable, coherent, and location-specific insights to guide policy and investment. This work shows a practical way to conduct reliable spatial livestock analysis whenever sectoral databases offer broad coverage of the population of interest but less detailed information than specialized surveys. The method can be applied in all countries where there is a similar livestock information system and common support between the livestock census and household surveys with detailed agricultural/livestock modules. Cross-validation across data sources provides clearer insights into livestock-related farmer behavior and, in so doing, provides a better springboard for effective poverty-reduction policy action. Beyond policy-decision support, the results of the paper demonstrate how integration of different data sets can greatly enhance spatial analysis.

Keywords: Uganda, livestock, spatial analysis, data integration

1. Introduction

The importance of livestock in the world economy has grown recently due to the “livestock revolution” (an increase in livestock consumption) in developing countries as the population grows, becomes richer, moves to urban areas, and changes its dietary preferences (Fischer, 2003). Despite the ongoing livestock revolution, widespread recognition of the importance of the livestock sector in household livelihoods is yet to be achieved. Given the benefits of livestock ownership and the growth of the livestock sector, there are multiple potential benefits from mapping aspects of livestock ownership.
Mapping can aid in the targeting of livestock-sector policies and recommendations and can help demonstrate the impacts of policies over time. The ability to map variables of interest offers a way to present large amounts of data to the public and policy makers in a visually stimulating way. However, the ability to generate maps is often limited by a lack of information.

Small Area Estimation (SAE) techniques that integrate survey and census data, following Elbers, Lanjouw, and Lanjouw (2003), are one possible means of combining information from different data sources. Surveys are detailed, but their samples are too small to yield statistically reliable estimates at lower levels of disaggregation. Census data, in turn, have a large enough sample size but lack detailed information on income and consumption. By integrating survey and census data, researchers benefit from the detailed information in the survey and the large sample size of the census, and can analyze variables at a finer spatial disaggregation than would be possible with the survey alone, which allows for more spatially disaggregated maps. While SAE has been predominantly used to impute measures of consumption or income into census data using estimates based on survey data, the technique could be especially beneficial in the analysis of livestock numbers and income from livestock activities, enhancing local-level knowledge of livestock owners for policy targeting.

2. Analytical Method

The SAE methodology for dataset-to-dataset prediction (Survey-to-Census in our case) comprises three steps or stages. “Stage 0” (according to Mistiaen, Özler, Razafimanantena, Razafindravonona, 2002) involves the selection of information that is comparable between the Census and the Survey, in terms of both how it was collected and the statistical distribution of variables.
At this stage, means, standard deviations, and frequency distributions at the national and regional levels are compared across Survey and Census in order to check whether variables are equivalent.

In stage one, the dependent variable of interest is modeled as a function of the independent variables selected, using the equation

$$\ln y_{ch} = E(\ln y_{ch} \mid X_{ch}) + \mu_{ch} = X_{ch}'\beta + \mu_{ch}, \tag{1}$$

where $y_{ch}$ is the outcome variable for household $h$ in cluster $c$; $X_{ch}$ is the vector of independent variables available in both the Census and the Survey; and $\mu_{ch}$ is the error term. One of the most important aspects of this stage is the specification of the error term. The model above is first estimated by OLS weighted by Survey sampling weights. Residuals from this regression serve as estimates of the overall disturbance, $\hat{\mu}_{ch}$. A portion of this disturbance is due to location-specific effects common to all households in a given cluster. Since not all clusters in the Census are sampled in the Survey, cluster fixed effects cannot be controlled for in stage one. Location effects must therefore be accounted for in the error term, and the residuals must be decomposed into location (“within-cluster means of overall residuals”) and household (“overall residuals net of location components”) elements (Mistiaen, Özler, Razafimanantena, Razafindravonona, 2002). Incorporating the decomposition of the error term, the linear approximation of the model becomes

$$\ln y_{ch} = X_{ch}'\beta + \eta_c + \varepsilon_{ch}, \tag{2}$$

where $\eta_c$ is the cluster error, common to all households in a cluster, and $\varepsilon_{ch}$ is the household idiosyncratic error term. The two components of the error term are assumed to be uncorrelated with each other and independent of the regressors. The location component of the error term allows for spatial autocorrelation, and the household-specific error component may be heteroskedastic (Simler and Nhate, 2005).
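As a minimal sketch of stage one and the residual decomposition in equation (2), the following Python code fits a weighted OLS and splits the residuals into location and household parts. The function name `fit_stage_one` and its arguments are hypothetical and are not taken from the authors' implementation; the sketch also assumes an unweighted within-cluster mean for the location component, which the paper does not specify.

```python
import numpy as np

def fit_stage_one(X, ln_y, weights, cluster_ids):
    """Weighted OLS of ln(y) on X, then split residuals as in equation (2)."""
    X = np.asarray(X, dtype=float)
    ln_y = np.asarray(ln_y, dtype=float)
    sqrt_w = np.sqrt(np.asarray(weights, dtype=float))

    # Weighted least squares: scale each row by the square root of its weight.
    beta_hat, *_ = np.linalg.lstsq(X * sqrt_w[:, None], ln_y * sqrt_w, rcond=None)

    # Overall residuals: mu_hat_ch = ln y_ch - X_ch' beta_hat.
    mu_hat = ln_y - X @ beta_hat

    # Location component eta_hat_c: within-cluster mean of overall residuals
    # (unweighted here; an assumption of this sketch).
    clusters, inverse = np.unique(np.asarray(cluster_ids), return_inverse=True)
    inverse = inverse.ravel()
    eta_hat = np.bincount(inverse, weights=mu_hat) / np.bincount(inverse)

    # Household component eps_hat_ch: overall residual net of the location part.
    eps_hat = mu_hat - eta_hat[inverse]
    return beta_hat, clusters, eta_hat, eps_hat
```

By construction the household components average to zero within each cluster, so the two returned pieces add back up to the overall residual for every household.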
Additionally, the $\eta_c$ error term (the unexplained location-specific component) is minimized, capturing as much variation as possible through the $X_{ch}$ vector, by incorporating cluster-level means from the Census into the Survey. This is done through estimation of a Generalized Least Squares (GLS) model that takes heteroskedasticity of the household-specific error term into account.

In the final stage, parameter estimates of stage one and error terms of stage two are applied to the Census data. The disturbance term is accounted for by using bootstrapping re-sampling methods and converting from logarithms to levels, according to

$$\hat{y}_{ch} = \exp\left(X_{ch}'\tilde{\beta} + \tilde{\eta}_c + \tilde{\varepsilon}_{ch}\right). \tag{3}$$

In each of the $n$ simulations run (in our case we set $n = 100$), parameter estimates are drawn from a multivariate normal distribution with the estimated variance-covariance matrix, and the two disturbance terms are drawn from the distributions described by the parameters estimated in the first stage.

It should be noted that two sources of error arise from the use of this method. First, there is model error due to the parameter estimates; second, there is idiosyncratic error from the deviation of the actual $y$ from the expected $y$ (Alderman et al., 2002). Crucial assumptions of the model are the presence of high spatial correlation between Enumeration Areas (EAs) and sub-counties, and the homogeneity of households within EAs. Yet there could be unexplained effects on the error term at the sub-county level (for instance, livestock prices) and at the more local EA level (for instance, disease) that are unaccounted for in the $X_{ch}$ vector. It is important to consider that the estimated model is assumed to hold for all levels of disaggregation. These two sources of error could substantially affect the standard errors of the estimates under certain conditions, as shown by Tarozzi and Deaton (2009) through an empirical test using Monte Carlo simulations.
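The two sources of error noted above appear as separate random draws in the simulation stage of equation (3). The following Python sketch is illustrative only: the function name `simulate_predictions` and all of its arguments are hypothetical, and it assumes normally distributed, homoskedastic disturbances, a simplification relative to the heteroskedastic specification described in the text.

```python
import numpy as np

def simulate_predictions(X_census, cluster_index, beta_hat, beta_cov,
                         sigma_eta, sigma_eps, n_sim=100, seed=0):
    """Simulate y_hat_ch = exp(X_ch' beta + eta_c + eps_ch) for Census households."""
    rng = np.random.default_rng(seed)
    X_census = np.asarray(X_census, dtype=float)
    cluster_index = np.asarray(cluster_index)  # integer cluster codes 0..C-1
    n_households = X_census.shape[0]
    n_clusters = int(cluster_index.max()) + 1

    sims = np.empty((n_sim, n_households))
    for r in range(n_sim):
        # Model error: redraw the coefficients around their point estimates.
        beta_r = rng.multivariate_normal(beta_hat, beta_cov)
        # Idiosyncratic error: one draw per cluster and one per household
        # (normality and constant variances are assumptions of this sketch).
        eta_r = rng.normal(0.0, sigma_eta, size=n_clusters)
        eps_r = rng.normal(0.0, sigma_eps, size=n_households)
        # Convert from logarithms to levels, as in equation (3).
        sims[r] = np.exp(X_census @ beta_r + eta_r[cluster_index] + eps_r)

    # Point estimate and simulation standard deviation for each household.
    return sims.mean(axis=0), sims.std(axis=0, ddof=1)
```

Averaging the household-level simulations within a sub-county or district before taking the mean and standard deviation across simulations would give the area-level estimate and an indication of its precision.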
Nevertheless, if one is willing to accept the area homogeneity assumption, and hence that “at least some aspects of the conditional distribution of income be the same in the small area as in the larger area that is used to calibrate the imputation rule”, then the bias in the standard errors calculated by the version of the SAE method (Elbers, Lanjouw, Lanjouw, 2003) used in this paper can be considered negligible, as in most empirical applications.

3. Data

Two datasets are used for this analysis. The 2009/2010 Uganda National Panel Survey (UNPS) is representative at the national level and for the strata of (i) Kampala City, (ii) Other Urban Areas, (iii) Central Rural, (iv) Eastern Rural, (v) Western Rural, and (vi) Northern Rural. It collected information on 2,975 households from 322 Enumeration Areas (EAs), although the sample is narrowed to 2,375 households, as 45 households report incomplete information and 555 households had moved, of which 521 are urban.

The other dataset, the 2008 Uganda National Livestock Census (UNLC), collected data from 964,690 rural holdings in all 80 districts of the country in a single visit. The UNLC is not a full-enumeration Census but a sample-based one, and it is representative at the district level, which is the level at which our results are presented. However, given that the average sample size at the sub-county level is adequately large (around 1,000 households), results are also reported at this lower administrative level. Nonetheless, the limited amount of information collected in the 2008 UNLC constrains the number of explanatory variables in the estimation model.
The predictors used include: land size (separately by agricultural, pasture, and other land); number of livestock heads by type (disaggregated by indigenous and exotic bulls, cows and calves; poultry; small ruminants); average weekly egg and milk production; age and gender of the household head; whether the household hired agricultural labor; and the area covered by each agro-ecological zone and the NDVI [1] at the sub-county level.

[1] The NDVI is a variable assessing the degree of live green vegetation in the observed area. Negative values of NDVI (approaching -1) correspond to water. Values close to zero (-0.1 to 0.1) generally correspond to barren areas of rock, sand, or snow. Lastly, low, positive values represent shrub and grassland (approximately 0.2 to 0.4), while high values indicate temperate and tropical rainforests (values approaching 1).

4. Results

Three models are estimated on the 2009/10 UNPS and then used for prediction into the census. In the first model, the densities of large ruminants at the sub-county level are predicted and then compared to actual values in the census. This model is used to test the reliability of the prediction method. In the second model, the dependent variable is the log of per capita livestock income (expressed in 2005 international Purchasing Power Parity (PPP) dollars). In the third model, the dependent variable is the share of total household income from livestock. The latter two models are the core of the analysis, since they estimate dimensions (livestock income) not captured in the census but collected in the survey.

One of the main results of the analysis is that, by virtue of survey-to-census prediction, it is possible to draw more spatially disaggregated maps than with the survey alone. Figure 1 displays the actual densities (number of livestock per square kilometer) of large ruminants from the survey and census, as well as the predicted density in the census. Some important elements emerge. First, regions that appear homogeneous in the survey become a more detailed and scattered picture once disaggregated to the sub-county level through the census. Second, the density range is wider in the census than in the survey, because in the survey the distribution consists of four values, one for each region, each being the average of sub-county values within that region. Third, and most important, from a policy perspective the census map is more meaningful for targeting purposes.

[Figure 1 panels: Survey (actual); Census (actual); Census (actual); Census (predicted)]
Figure 1. Density of Large Ruminants: Actual from Survey (left), Actual from Census (right), and Predicted from Census (below) at regional and district level

The first model also tests the reliability of the methods used in conducting this analysis. Figure 1 shows that the actual densities of large ruminants from the census are very close to those predicted using the SAE method. This result offers an insight into how SAE can be a viable and reliable method to estimate the spatial distribution of missing information through prediction.

While the density of large ruminants in the census closely resembles the distribution from the Survey, the model fitted on the log of per capita livestock income in PPP is less able to predict missing information into the census. Figure 2 shows maps from the survey and the census for the estimated model.

[Figure 2 panels: Census (district); Census (sub-county)]
Figure 2. Per Capita Livestock Income (PPP): Actual from Survey and Predicted to Census

Finally, the analysis of the predicted income share from livestock at the sub-county level yields surprising results (Figure 3). The predicted spatial distribution looks consistent regardless of the method used (maps not shown here), and this reinforces the argument that the constraining factor in addressing policy at the local level is the lack of timely, reliable, and comprehensive survey and census data rather than the state of spatial methodology.

[Figure 3 panels: Survey; Census]
Figure 3.
Share of Income from Livestock: Actual from Survey (left) and Predicted to Census (right)

5. Conclusions

Our results suggest that the integrated use of multiple data sources, such as household surveys and censuses, satellite imagery and administrative data, together with spatial analysis techniques such as SAE and spatial allocation models, can provide reliable, coherent, and location-specific insights to guide policy and investment. Cross-validation across primary and secondary data sources provides clearer insights into livestock-related farmer behavior and, in so doing, provides a better springboard for effective poverty-reduction policy action. By fitting accurate prediction models, there is a concrete possibility of combining multi-topic household surveys with specialized databases to estimate the contribution of livestock to household livelihoods.

Among the various econometric models tested, the Small Area Estimation technique has been successfully used for targeting poverty programs in many countries worldwide, and the present work has shown that it could be a useful tool for informing livestock policy. Indeed, our work demonstrates that integrating different data sources allows for finer spatial resolution: regional distributions that look homogeneous in the survey mask very diverse sub-county distributions in the census. Our results are internally and externally consistent with the literature, which strengthens their reliability. The novelty of our approach is that it relies on micro-data and census, which is particularly important for policy targeting as it would greatly enhance the local relevance of policy interventions; in fact, survey data need to be complemented with census information to provide more spatially specific findings. In terms of external relevance and viability, our approach can be easily scaled out to other countries with similar statistical data systems (e.g., those included in the LSMS-ISA).
Other possible refinements include using the enhanced livestock module developed in the UNPS 2011/12, and combining environmental and satellite data with household-level characteristics beyond agro-ecological zone and NDVI. Finally, a suggestion to national statistical agencies is to conduct the Livestock Census regularly and with an adequate suite of information so as to allow a more effective integration of different databases. Expanding the topics covered in the census goes in this direction and would greatly refine the results.

REFERENCES

Alderman, H., Babita, M., Demombynes, G., Makhatha, N., and Özler, B. (2002). How Low Can You Go? Combining Census and Survey Data for Mapping Poverty in South Africa, Journal of African Economies, 11(2): 169-200.

Benson, T. and Mugarura, S. (2013). Livestock Development Planning in Uganda: Identification of Areas of Opportunities and Challenge, Land Use Policy, 35: 131-139.

Elbers, C., Lanjouw, J.O., and Lanjouw, P. (2003). Micro-Level Estimation of Poverty and Inequality, Econometrica, 71(1): 355-364.

Fischer. (2003). The Livestock Revolution: A Pathway from Poverty?, in The Livestock Revolution: A Pathway from Poverty, ed. A.G. Brown.

Hentschel, J., Lanjouw, J.O., Lanjouw, P., and Poggi, J. (2000). Combining Census and Survey Data to Trace the Spatial Dimensions of Poverty, The World Bank Economic Review, 14(1): 147-165.

Mistiaen, J.A., Özler, B., Razafimanantena, T., and Razafindravonona, J. (2002). Putting Welfare on the Map in Madagascar, Africa Region Working Paper Series No. 34, The World Bank, Washington, D.C.

Simler, K. and Nhate, V. (2005). Poverty, Inequality, and Geographic Targeting: Evidence from Small-Area Estimates in Mozambique, FCND Discussion Paper 192, IFPRI, Washington, D.C.

Tarozzi, A. and Deaton, A. (2009). Using Census and Survey Data to Estimate Poverty and Inequality for Small Areas, The Review of Economics and Statistics, 91(4): 773-792.