CURRENT ISSUES

Denominator Issues for Personally Generated Data in Population Health Monitoring

Rumi Chunara, PhD,1,2 Lauren E. Wisk, PhD,3,4 Elissa R. Weitzman, ScD, MSc3,4,5

INTRODUCTION

Widespread use of Internet and mobile technologies provides opportunities to gather health-related information to complement data generated through traditional healthcare and public health systems. These personally generated data (PGD) are increasingly viewed as informative of the patient experience of conditions, symptoms, treatments, and side effects.1 Behavior, sentiment, and disease patterns can be discerned from mining unstructured PGD in text, image, or metadata form, and from analyzing PGD collected via structured, opt-in, web-enabled platforms and devices, including wearables.2,3 Models that employ PGD from distributed cohorts are increasingly used to measure public health outcomes; moreover, PGD collection forms the centerpiece of important new federal investments in personalized medicine that seek to energize vast cohorts in donating data via apps and devices.4 PGD offer the opportunity to inform gap areas of health research through high-resolution views into spatial, temporal, or demographic features. However, when PGD are used to answer epidemiologic questions, it is not always clear what constitutes the population at risk (PAR), or denominator, challenging researchers' ability to make inferences, draw comparisons, and evaluate change. Because of this, initial PGD studies have tended toward numerator-only investigations5–7; however, the field is advancing. This report summarizes issues related to specifying the PAR and denominator metrics when using PGD for health research and outlines design and analytic strategies for resolving these issues.

CHALLENGES IN SPECIFYING THE POPULATION AT RISK

Individuals capable of acquiring a disease constitute the PAR (Figure 1). Accurate knowledge of the PAR is a fundamental requirement in preventive medicine and public health, and is necessary for measuring effects and gauging impact. Challenges in assessing the PAR have been deliberated even for traditional healthcare-based data sources8; when the PAR is unobservable, proxy metrics are often used.9 What constitutes a useful and valid definition of the PAR depends on the research focus and is itself a source of difficulty.8,9 All analyses drawing on chart reviews, electronic medical records, and telephone surveys must address the extent to which the sample represents the source population (i.e., generalizability). Even when samples are drawn directly from the PAR, the dynamic nature of human populations and subtleties of selection, differential non-response, and attrition may introduce variability and affect the validity of inferences. Anticipating and mitigating these issues is vital for investigations drawing on traditional healthcare-based data and on PGD alike. As with RCTs, whose findings may not generalize because of strict eligibility criteria and participant self-selection into research,10 PGD may not fully represent patient or community populations. Moreover, once individuals contribute data, bias may be introduced through imperfect methods of observation and engagement. Researchers wishing to use PGD will need to acknowledge and, when possible, measure how characteristics of a sample differ from the PAR (i.e., issues relevant to external validity).
Three challenges to using PGD stand out and are elaborated here: sample representativeness and selection bias, poorly characterized reference populations, and spatiotemporally inconsistent denominators (Table 1 and Appendix Table 1, available online).

From the 1Department of Computer Science and Engineering, New York University Tandon School of Engineering, Brooklyn, New York; 2College of Global Public Health, New York University, New York, New York; 3Division of Adolescent/Young Adult Medicine, Boston Children's Hospital, Boston, Massachusetts; 4Department of Pediatrics, Harvard Medical School, Harvard University, Boston, Massachusetts; and 5Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts. Address correspondence to: Rumi Chunara, PhD, Department of Computer Science and Engineering, New York University Tandon School of Engineering, 2 Metrotech Center, 10th Floor, 10.007, Brooklyn NY 11201. E-mail: [email protected].

Figure 1. An overview of challenges in assessing the population at risk when working with both healthcare-based data and personally generated data.

Sample Representativeness and Selection Bias

Selection biases arise when the sample does not accurately represent an underlying population. They are among the most common threats to validity when making inferences (Appendix Table 1, available online). In healthcare-based studies, settings, recruitment procedures, and inclusion and exclusion criteria must be carefully considered to ensure valid estimates and extrapolation of findings. For example, access bias was found to affect the Behavioral Risk Factor Surveillance System (BRFSS), a U.S. survey that collects data via random-digit-dial telephone interviews: relying only on landline-based sampling excluded important segments of the U.S. population and resulted in measurable biases for many key health indicators.19 Consequently, the BRFSS shifted its sampling approach to additionally incorporate cell phones, yielding more valid, reliable, and representative measures.

Many types of selection bias also apply to PGD (Appendix Table 1, available online). Access bias, for example, is a concern when disparities in Internet access affect the representativeness of PGD. Efforts to close the Internet access gap have succeeded, with only 15% of the U.S. population offline in the past year, yet variability and inequalities persist. For instance, among teens, Internet platform preferences vary by household income.20 Among adults, health literacy is highly correlated with health information seeking on the Internet and with self-rated health.21 Hence, individuals who passively source or actively contribute data via online platforms may vary by income and be more literate and healthy than the general population. Access differences can also reflect group preferences, norms, technology diffusion, and sharing patterns.
For example, in the U.S., 40% of African Americans aged 18–29 years who use the Internet report using Twitter, compared with 28% of their white counterparts.22 Across multiple influenza-focused participatory surveillance systems, women participate at higher levels than men, with peak participation among those aged 30–60 years.23,24 Similarly, in participatory diabetes surveillance within an online diabetes community, early adopters of an app that enabled PGD sharing for population health monitoring reported better diabetes control than later adopters, and preference for openness in data sharing was greater among individuals with better diabetes control.18 Because technology diffusion, preferences, and norms vary by age, income, and gender, observations gleaned passively from processing Internet search or microblogging data (as done for sentiment analysis or outbreak detection) may be affected by differential selection stemming from the composition of groups using one platform or tool versus another.20 Bias may also be present in active surveillance, which relies on the planned donation of information by persons using Internet or mobile tools. For example, in participatory influenza reporting systems, individuals tend to sign up when their symptoms are worse, potentially skewing incidence estimates.25 Technology standards, protections on data use, and controls over secondary data use can also affect representativeness. To address these issues, investigators might preselect a subsample of users that represents the PAR using probability sampling, conduct sensitivity analyses to ascertain how sample composition affects observations, or integrate data gleaned from multiple platforms (Table 1).
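To make the first of these strategies concrete, the following minimal sketch (in Python) post-stratifies a PGD sample so that its age composition matches that of the PAR. The strata, counts, and outcome rates are assumed purely for illustration and are not drawn from any of the studies cited here.

# Minimal post-stratification sketch: reweight a PGD sample so that its age
# composition matches the population at risk (PAR). All strata, counts, and
# rates below are hypothetical and used only for illustration.

par_composition = {"18-29": 0.25, "30-49": 0.35, "50+": 0.40}  # assumed PAR shares (e.g., from census data)

# Hypothetical PGD sample: contributors and crude rate of a self-reported
# outcome (e.g., influenza-like illness) within each age stratum.
sample = {
    "18-29": {"n": 600, "rate": 0.08},
    "30-49": {"n": 300, "rate": 0.05},
    "50+":   {"n": 100, "rate": 0.04},
}

total_n = sum(s["n"] for s in sample.values())

# Unweighted estimate: dominated by strata that over-contribute data.
crude = sum(s["n"] * s["rate"] for s in sample.values()) / total_n

# Post-stratified estimate: each stratum weighted by its share of the PAR.
weighted = sum(par_composition[g] * sample[g]["rate"] for g in sample)

print(f"crude estimate:           {crude:.3f}")
print(f"post-stratified estimate: {weighted:.3f}")

The same weighting logic extends to other stratifying variables (e.g., sex, income, or platform), provided the composition of the PAR with respect to those variables is known or can be reasonably approximated.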
Table 1. Approaches to Resolving Denominator Challenges in PGD

PGD challenge type: Sample representativeness and selection bias
Illustrative case(s): Use of social media data for health monitoring where data are collected via keyword search
Corrective approaches (via design or analytic strategy): Select a subsample that represents the PAR using probability sampling or other matching methods; conduct sensitivity analyses varying the denominator; control for potential confounders at the appropriate scale
Practical examples: Propensity score matching used to evaluate the effect of information on Twitter users11; assess the effect of participatory symptom contribution frequency on population burden estimates12; SES controlled for at the ZIP code level when assessing the relationship between online check-ins and obesity levels13

PGD challenge type: Poorly characterized reference populations
Illustrative case(s): Use of anonymized data as an indicator of a health phenomenon without demographic or sampling details, such as Internet search data from Google Trends
Corrective approaches (via design or analytic strategy): Focus on concurrent validation against criterion standards, with the understanding that criteria are rarely available for comparison; utilize multiple datasets (including a criterion standard, when available) to establish concurrent validity; use Bayesian evidence synthesis to improve model performance; use supervised learning and some ground truth data to infer latent characteristics
Practical examples: Assess flu incidence from Internet sources versus CDC data14; combine multiple flu proxy data sources to improve incidence estimates15; estimate the burden of influenza from hospital-reported case data16; infer demographics of Twitter users17

PGD challenge type: Spatiotemporally inconsistent denominator
Illustrative case(s): Data donation of steps or symptoms by members of an opt-in, distributed, online health community
Corrective approaches (via design or analytic strategy): Engage a cohort for cross-sectional or prospective evaluation where spatial and temporal frames can be identified; partner with PGD sources/platforms to elucidate denominator features; impute missing or outlier data
Practical examples: Estimate influenza burden by examining the contribution of participatory reports12; obtain geographic information on members of an online diabetes social network and compare it with geographic patterns of app uptake to test for bias18; remove spurious spikes in influenza data15

CDC, Centers for Disease Control and Prevention; PAR, population at risk; PGD, personally generated data.

Poorly Characterized Reference Populations

Even if recruitment, access, and other factors are accounted for, there can still be multiple options for the PAR. The complexity of choosing a reference population using healthcare-based data has been discussed, for example, when assessing cohort patterns for cerebral palsy (CP).26 For a congenital disorder such as CP, in which newborns and infants might logically be considered to constitute the PAR, the birth cohort from which a case arises is generally used as the denominator. However, poorly defined or poorly tracked birth cohorts may necessitate the use of other denominators, such as all children aged <2 years residing in a particular area. Yet, if families of children with CP move to and cluster around health centers that are best suited to caring for children with CP, prevalence estimates might skew higher; alternatively, such health centers might be more adept at screening and diagnosis, leading to a diagnostic access bias. Ultimately, statistics calculated with different denominators answer slightly different questions, and researchers must select the most relevant estimate for their objectives.
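The sensitivity of an estimate to the choice of denominator can be shown with a short numeric sketch; the counts below are hypothetical and are intended only to illustrate how the CP example above plays out arithmetically.

# Illustration of how the choice of denominator changes a prevalence estimate.
# All counts are hypothetical and chosen only to show the mechanics.

cp_cases = 24               # children with CP identified in a region (hypothetical)
birth_cohort = 10_000       # live births in the region for the relevant birth years
children_under_2 = 12_500   # resident children aged <2 years; larger here because
                            # families are assumed to move toward specialty centers

rate_per_birth = cp_cases / birth_cohort               # risk per live birth
rate_per_resident_child = cp_cases / children_under_2  # prevalence among resident children <2 years

print(f"per 1,000 live births:       {1000 * rate_per_birth:.1f}")
print(f"per 1,000 resident children: {1000 * rate_per_resident_child:.1f}")

Neither figure is wrong; they answer different questions (risk per live birth versus prevalence among resident young children), which is why the denominator must be chosen and reported in light of the study objective.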
Similarly, when using PGD, there will be multiple options for specifying the reference population. For example, investigators may elect to use all search queries, all geotagged queries, or all queries by week, and these choices may yield different estimates.15,27 Accordingly, researchers will need to select and justify their choice of a reference population based on their aims. Investigators can evaluate their choice of denominator by comparing estimates that reflect different underlying PARs, or they can benchmark PGD measures against criterion standards (assessing concurrent validity), as has been done for influenza surveillance when anonymous Internet search activity data are modeled against healthcare measures.14 Yet criterion standards are not always available, particularly as PGD are often used to fill existing data gaps. Other approaches, including triangulating PGD, may improve estimates. For example, investigators might compare influenza symptom data derived from Twitter and from a participatory surveillance system, where both sources cover the same geographic area and time period. Even where extrapolation to the PAR and formulation of a denominator are not feasible, participatory methods may still be informative when information about a health phenomenon is otherwise missing. Such numerator-only estimates gleaned through voluntary reporting may shed light on emerging phenomena without a practical reference population for benchmarking change (Table 1). In this case, transparency around measurement and caveats on inference are especially important.

Spatiotemporally Inconsistent Denominator

In healthcare-based data, prevalence-incidence bias, diagnostic vogue bias, and noncontemporaneous control bias are all known contributors to a denominator that varies over time or by place (Appendix Table 1, available online). In traditional longitudinal cohort studies, inclusion time is demarcated by directly observed benchmarks (e.g., date of diagnosis or treatment). For PGD, the absence of structured reporting formats or observation periods and the influence of exogenous factors, such as tool availability, create subtle spatiotemporal issues. Fixing a known and appropriate spatiotemporal data frame is vital because the underlying, relevant PAR can be dynamic, which affects interpretability, reproducibility, and prediction. For example, inclusion time is conditioned on the availability of search engines, operating systems, and technology standards. Shifts in app or device popularity and in privacy policies may affect data availability and reporting. Views into these factors may be obscured when data are proprietary, raising transparency concerns such as those observed in pharmaceutical trials.28 Although some statistics on Internet and mobile tool use exist, they are not necessarily reproducible or consistent over time, providing only static views into dynamic patterns. Other exogenous factors can also cause spatiotemporal shifts.
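A minimal numeric sketch illustrates why a fixed, known spatiotemporal frame matters; the weekly report counts and user-base sizes below are hypothetical, and in practice obtaining the active-user series is itself part of the denominator problem (e.g., it may require partnership with the platform).

# Sketch of how a drifting denominator can mimic a change in disease activity.
# Weekly report counts and active-user counts are hypothetical.

weekly_reports = [50, 52, 55, 60, 70]             # symptom reports per week (numerator)
active_users   = [1000, 1100, 1250, 1400, 1600]   # users able to report each week (denominator)

# The raw count rises steadily, but the rate per 1,000 active users is flat to
# declining: most of the apparent increase reflects growth of the user base
# rather than a change in underlying disease activity.
for week, (reports, users) in enumerate(zip(weekly_reports, active_users), start=1):
    rate = 1000 * reports / users
    print(f"week {week}: {reports} reports, {rate:.1f} per 1,000 active users")

Exogenous events can distort the numerator in an analogous way, as discussed next.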
News regarding an outbreak can generate "spurious spikes" in incidence time series that are attributable to increased discussion and awareness rather than to the experience of actual symptoms.5 This can affect the validity and stability of disease incidence estimates generated from close monitoring of Internet search activity patterns, a common indicator of infectious disease activity.5,7 These spikes do not reproducibly occur in every situation; monitoring of malaria trends via Google search queries in Thailand did not reveal spurious spikes, possibly because of low media activity in the area. Variations in annual influenza-related incidence estimates drawn from Google search queries reflect the lack of structure in PGD and underscore the importance of understanding how the dynamic nature of population reporting behaviors may affect the denominator. PGD may also be influenced by the condition of participants, analogous to the healthy participant effect; for example, associations between self-reported health status and willingness to share personal health information from web-based sources have been noted.29

Temporally inconsistent denominators can be addressed by utilizing data only from individuals with acceptable participation levels over a particular time period, as was done to estimate incidence from participatory influenza reports.12 However, efforts must be made to ensure that such restrictions do not introduce unmeasured confounding. Spurious spikes have been distinguished from true increases in illness burden by identifying the effects of media coverage or other exogenous factors unrelated to actual disease dynamics; comparing observed changes with plausible disease growth rates allows investigators to identify excess variation over time.5 Methods to reduce these spikes can be developed after or concurrently with data collection, as has been done in monitoring dengue trends from Internet search query data and influenza-like illness trends from participatory reports.5,24

CONCLUSIONS

Challenges common to both healthcare-based data and PGD pertain to knowledge about the PAR and specification of a denominator. All surveillance data face challenges that may lead to samples that do not represent the population from which they are drawn. Excitement over the potential for PGD to accelerate knowledge and inform prevention can obscure the significance of these challenges: simply put, extra care is needed to define the PAR and a denominator. The commercial nature of many novel data types amplifies these challenges. Ambitious new initiatives, such as NIH's call for creation of a vast volunteer cohort to donate data to advance precision medicine, will require exquisite attention to how PGD are invoked, sampled, compiled, and used.4 Similar attention is vital to optimizing use of PGD distilled from web platforms, search activity, and other sources. Overall, in both healthcare-based data and PGD, there are no "one-size-fits-all" solutions. For PGD, as with other forms of data, concerns about measurement, inference, and extrapolation are mitigated when the PAR is carefully specified, limitations are addressed forthrightly, and confirmatory investigations are undertaken as feasible.

ACKNOWLEDGMENTS

This work is supported by grant R21 AA023901-01 from the National Institute on Alcohol Abuse and Alcoholism at NIH (ERW and RC), and grant IIS-1343968 from the National Science Foundation (RC).
Rumi Chunara, PhD, and Elissa R. Weitzman, ScD, MSc, conceived the paper; Rumi Chunara, PhD, Elissa R. Weitzman, ScD, MSc, and Lauren E. Wisk, PhD, wrote the paper. The authors thank Melanie Kenney for help with assembling relevant literature.

No financial disclosures were reported by the authors of this paper.

SUPPLEMENTAL MATERIAL

Supplemental materials associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.amepre.2016.10.038.

REFERENCES

1. Lavallee DC, Chenok KE, Love RM, et al. Incorporating patient-reported outcomes into health care to engage patients and enhance care. Health Aff (Millwood). 2016;35(4):575–582. http://dx.doi.org/10.1377/hlthaff.2015.1362.
2. Chunara R, Smolinski MS, Brownstein JS. Why we need crowdsourced data in infectious disease surveillance. Curr Infect Dis Rep. 2013;15(4):316–319. http://dx.doi.org/10.1007/s11908-013-0341-5.
3. Ayers JW, Althouse BM, Dredze M. Could behavioral medicine lead the web data revolution? JAMA. 2014;311(14):1399–1400. http://dx.doi.org/10.1001/jama.2014.1505.
4. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015;372(9):793–795. http://dx.doi.org/10.1056/NEJMp1500523.
5. Chan EH, Sahai V, Conrad C, Brownstein JS. Using web search query data to monitor dengue epidemics: a new model for neglected tropical disease surveillance. PLoS Negl Trop Dis. 2011;5(5):e1206. http://dx.doi.org/10.1371/journal.pntd.0001206.
6. Chunara R, Bouton L, Ayers JW, Brownstein JS. Assessing the online social environment for surveillance of obesity prevalence. PLoS One. 2013;8(4):e61373. http://dx.doi.org/10.1371/journal.pone.0061373.
7. Ginsberg J, Mohebbi MH, Patel RS, et al. Detecting influenza epidemics using search engine query data. Nature. 2009;457(7232):1012–1014. http://dx.doi.org/10.1038/nature07634.
8. Krogh-Jensen P. The denominator problem. Scand J Prim Health Care. 1983;1(2):53. http://dx.doi.org/10.3109/02813438309034933.
9. Schlaud M, Brenner M, Hoopmann M, Schwartz F. Approaches to the denominator in practice-based epidemiology: a critical overview. J Epidemiol Community Health. 1998;52(suppl 1):13S–19S.
10. CDC. Principles of Epidemiology in Public Health Practice, Third Edition: An Introduction to Applied Epidemiology and Biostatistics. Atlanta, GA: CDC; 2012.
11. Rehman N, Liu J, Chunara R. Using propensity score matching to understand the relationship between online health information sources and vaccination sentiment. Proceedings of the Association for the Advancement of Artificial Intelligence Spring Symposia; 2016 March 23–25; Stanford University, Stanford, CA. Palo Alto, CA: Association for the Advancement of Artificial Intelligence, 2016.
12. Chunara R, Goldstein E, Patterson-Lomba O, Brownstein JS. Estimating influenza attack rates in the United States using a participatory cohort. Sci Rep. 2015;5:9540. http://dx.doi.org/10.1038/srep09540.
13. Bai H, Chunara R, Varshney LR. Social capital deserts: obesity surveillance using a location-based social network. Proceedings of the Data for Good Exchange (D4GX); September 28, 2015; New York. https://dl.dropboxusercontent.com/u/389406195/web-Papers/PH_Varshney_55.pdf. Accessed November 20, 2016.
14. Santillana M, Nguyen AT, Dredze M, et al. Combining search, social media, and traditional data sources to improve influenza surveillance. PLoS Comput Biol. 2015;11(10):e1004513. http://dx.doi.org/10.1371/journal.pcbi.1004513.
15. Santillana M, Zhang DW, Althouse BM, Ayers JW. What can digital disease detection learn from (an external revision to) Google Flu Trends? Am J Prev Med. 2014;47(3):341–347. http://dx.doi.org/10.1016/j.amepre.2014.05.020.
16. Presanis AM, De Angelis D, Hagy A, et al. The severity of pandemic H1N1 influenza in the United States, from April to July 2009: a Bayesian analysis. PLoS Med. 2009;6(12):e1000207. http://dx.doi.org/10.1371/journal.pmed.1000207.
17. Rao D, Yarowsky D, Shreevats A, Gupta M. Classifying latent user attributes in Twitter. Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents; 2010 Oct 30; Toronto, ON, Canada. New York: Association for Computing Machinery, 2010:37–44. http://dx.doi.org/10.1145/1871985.1871993.
18. Weitzman ER, Adida B, Kelemen S, Mandl KD. Sharing data for public health research by members of an international online diabetes social network. PLoS One. 2011;6(4):e19256. http://dx.doi.org/10.1371/journal.pone.0019256.
19. Hu SS, Balluz L, Battaglia MP, Frankel MR. Improving public health surveillance using a dual-frame survey of landline and cell phone numbers. Am J Epidemiol. 2011;173(6):703–711. http://dx.doi.org/10.1093/aje/kwq442.
20. Pew Research Center. Social media update 2014. www.pewinternet.org/files/2015/01/PI_SocialMediaUpdate20144.pdf. Accessed December 1, 2016.
21. Kutner M, Greenburg E, Jin Y, Paulsen C. The Health Literacy of America's Adults: Results From the 2003 National Assessment of Adult Literacy. NCES 2006-483. Washington, DC: National Center for Education Statistics; 2006.
22. Smith A. African Americans and Technology Use: A Demographic Portrait. Washington, DC: Pew Research Center; 2014.
23. Bajardi P, Vespignani A, Funk S, et al. Determinants of follow-up participation in the internet-based European influenza surveillance platform Influenzanet. J Med Internet Res. 2014;16(3):e78. http://dx.doi.org/10.2196/jmir.3010.
24. Smolinski MS, Crawley AW, Baltrusaitis K, et al. Flu Near You: crowdsourced symptom reporting spanning 2 influenza seasons. Am J Public Health. 2015;105(10):2124–2130. http://dx.doi.org/10.2105/AJPH.2015.302696.
25. Kjelsø C, Galle M, Bang H, Ethelberg S, Krause TG. Influmeter—an online tool for self-reporting of influenza-like illness in Denmark. Infect Dis (Lond). 2016;48(4):322–327. http://dx.doi.org/10.3109/23744235.2015.1122224.
26. Paneth N, Hong T, Korzeniewski S. The descriptive epidemiology of cerebral palsy. Clin Perinatol. 2006;33(2):251–267. http://dx.doi.org/10.1016/j.clp.2006.03.011.
27. Morstatter F, Pfeffer J, Liu H, Carley KM. Is the sample good enough? Comparing data from Twitter's Streaming API with Twitter's Firehose. Proceedings of the Seventh International Association for the Advancement of Artificial Intelligence Conference on Weblogs and Social Media; 2013 July 8–12; Cambridge, MA. Palo Alto, CA: Association for the Advancement of Artificial Intelligence, 2013.
28. Sykes R. Being a modern pharmaceutical company involves making information available on clinical trial programmes. BMJ. 1998;317(7167):1172. http://dx.doi.org/10.1136/bmj.317.7167.1172.
29. Weitzman ER, Kelemen S, Kaci L, Mandl KD. Willingness to share personal health record data for care improvement and public health: a survey of experienced personal health record users. BMC Med Inform Decis Mak. 2012;12(1):39. http://dx.doi.org/10.1186/1472-6947-12-39.