American Journal of Preventive Medicine

CURRENT ISSUES
Denominator Issues for Personally Generated Data in
Population Health Monitoring
Rumi Chunara, PhD,1,2 Lauren E. Wisk, PhD,3,4 Elissa R. Weitzman, ScD, MSc3,4,5
INTRODUCTION
Widespread use of Internet and mobile technologies provides opportunities to gather
health-related information to complement
data generated through traditional healthcare and public
health systems. These personally generated data (PGD)
are increasingly viewed as informative of the patient
experience of conditions, symptoms, treatments, and side
effects.1 Behavior, sentiment, and disease patterns can be
discerned from mining unstructured PGD in text, image,
or metadata form, and from analyzing PGD collected via
structured, opt-in, and web-enabled platforms and devices, including wearables.2,3 Models that employ PGD
from distributed cohorts are being used increasingly to
measure public health outcomes; moreover, PGD collection forms the centerpiece of important new federal
investments into personalized medicine that seek to
energize vast cohorts in donating data via apps and
devices.4 PGD offer the opportunity to inform gap areas
of health research through high-resolution views into
spatial, temporal, or demographic features. However,
when PGD are used to answer epidemiologic questions,
it is not always clear what constitutes the population at
risk (PAR), or the denominator, challenging researchers’
abilities to make inferences, draw comparisons, and
evaluate change. Because of this, initial PGD studies
have tended toward numerator-only investigations5–7;
however, the field is advancing. This report summarizes
issues related to specifying PAR and denominator
metrics when using PGD for health research and outlines
approaches for resolving these issues using design and
analytic strategies.
CHALLENGES IN SPECIFYING THE
POPULATION AT RISK
Individuals capable of acquiring a disease constitute the PAR (Figure 1). Accurate knowledge of
the PAR is a fundamental requirement in preventive
medicine and public health, and is necessary for measuring effects and gauging impact. Challenges in assessing
PAR have been deliberated even for traditional
healthcare-based data sources8; when PAR is unobservable, different proxy metrics are often used.9 The exact
nature of what might constitute a useful and valid choice
for defining PAR depends on the research focus and is a
source of difficulty.8,9 All analyses drawing on chart
reviews, electronic medical records, and telephone surveys must address the extent to which the sample
represents the source population (i.e., generalizability).
Even when samples are drawn directly from the PAR, the
dynamic nature of human populations and subtleties of
selection, differential non-response, and attrition may
introduce variability and impact validity of inferences.
Anticipation and mitigation of these issues are vital for
investigations drawing on traditional healthcare-based
data and PGD. As with RCTs, where findings may not
generalize with the imposition of strict eligibility criteria
and participant self-selection into research,10 PGD may
not fully represent patient or community populations.
Moreover, once individuals contribute data, bias may be
introduced through imperfect methods of observation
and engagement. Researchers wishing to use PGD will
need to acknowledge and, when possible, measure how
characteristics of a sample differ from the PAR (i.e.,
issues relevant to external validity). Three challenges to
using PGD stand out and are elaborated here—sample
representativeness and selection bias, poorly characterized reference populations, and spatiotemporally inconsistent denominators (Table 1 and Appendix Table 1,
available online).
From the 1Department of Computer Science and Engineering, New York University Tandon School of Engineering, Brooklyn, New York; 2College of Global Public Health, New York University, New York, New York; 3Division of Adolescent/Young Adult Medicine, Boston Children’s Hospital, Boston, Massachusetts; 4Department of Pediatrics, Harvard Medical School, Harvard University, Boston, Massachusetts; and 5Computational Health Informatics Program, Boston Children’s Hospital, Boston, Massachusetts

Address correspondence to: Rumi Chunara, PhD, Department of Computer Science and Engineering, New York University Tandon School of Engineering, 2 Metrotech Center, 10th Floor, 10.007, Brooklyn NY 11201. E-mail: [email protected].

0749-3797/$36.00
http://dx.doi.org/10.1016/j.amepre.2016.10.038
© 2016 American Journal of Preventive Medicine. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Am J Prev Med 2017;52(4):549–553
Figure 1. An overview of challenges in assessing the population at risk when working with both healthcare-based data and personally generated data.
Sample Representativeness and Selection Bias
Selection biases arise when the sample does not accurately represent an underlying population. They are
among the most common validity threats when making
inferences (Appendix Table 1, available online). In
healthcare-based studies, different types of settings,
recruitment procedures, and inclusion and exclusion
criteria must be carefully considered to ensure valid
estimates and extrapolation of findings. For example,
access bias was found to affect the Behavioral Risk Factor
Surveillance System, a U.S. survey that collects data via
random-digit-dial telephone interviews: relying only on landline-based sampling excluded important segments of the U.S. population and resulted in measurable biases for many key health indicators.19 Consequently, the Behavioral Risk Factor Surveillance System shifted its sampling approach to additionally
incorporate cell phones, yielding more valid, reliable, and
representative measures.
Many types of selection bias also apply to PGD
(Appendix Table 1, available online). Access bias, for
example, is a concern when disparities in Internet access
affect representativeness of PGD. Efforts to close the
Internet access gap have succeeded, with only 15% of the
U.S. population offline in the past year, yet variability and
inequalities persist. For instance, among teens, Internet
platform preferences vary by household income.20
Among adults, health literacy is highly correlated with
health information seeking on the Internet and self-rated
health.21 Hence, individuals who passively source or
actively contribute data via online platforms may vary
by income and be more literate and healthy than the
general population. Access differences can also reflect
group preferences, norms, technology diffusion, and
sharing patterns. For example, in the U.S., 40% of African
Americans aged 18–29 years who use the Internet say
that they use Twitter compared with 28% among their
white counterparts.22 Across multiple influenza-focused
participatory surveillance systems, women have higher
participation levels than men, with peak levels found
among those aged 30–60 years.23,24 As well, in participatory diabetes surveillance within an online diabetes
community, early adopters of an app that enabled PGD
sharing for population health monitoring reported
greater diabetes control than did later adopters, and
preference toward openness in data sharing was greater
among individuals with better diabetes control.18
Because technology diffusion, preferences, and norms
vary by age, income, and gender, observations gleaned
passively from processing of Internet search or microblogging data (as done for sentiment analysis or outbreak
detection) may be affected by differential selection stemming from the composition of groups using one versus
another platform or tool.20 Bias may also be present in active
surveillance, which relies on the planned donation of
information by persons using Internet or mobile tools. For
example, in participatory influenza reporting systems,
individuals tend to sign up more when their symptoms
are worse, potentially skewing incidence estimates.25 Technology standards, protections on data use, or controls over
secondary data use can also affect representativeness. To
address these issues, investigators might preselect a subsample of users that represents the PAR using probability
sampling, conduct sensitivity analyses to ascertain how
sample composition affects observations, or integrate data
gleaned from multiple platforms (Table 1).
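To make the reweighting idea concrete, the sketch below (Python, with entirely hypothetical records and an assumed set of population margins) illustrates post-stratification weighting, a simple reweighting alternative to probability sampling that aligns PGD contributors with the age distribution of the PAR before estimating prevalence. It is only a sketch of one corrective option; the cited study used propensity score matching.11

```python
# Minimal sketch (hypothetical data): post-stratification weighting of a PGD
# sample so that its age distribution matches the assumed population at risk.
import pandas as pd

# Hypothetical PGD records: one row per contributing user.
pgd = pd.DataFrame({
    "age_group": ["18-29", "18-29", "30-59", "60+", "30-59", "18-29"],
    "reported_ili": [1, 0, 1, 0, 0, 1],  # self-reported influenza-like illness
})

# Assumed census-style margins for the population at risk (PAR).
par_share = {"18-29": 0.25, "30-59": 0.50, "60+": 0.25}

# Weight = PAR share / sample share within each age stratum.
sample_share = pgd["age_group"].value_counts(normalize=True)
pgd["weight"] = pgd["age_group"].map(lambda g: par_share[g] / sample_share[g])

crude = pgd["reported_ili"].mean()
weighted = (pgd["reported_ili"] * pgd["weight"]).sum() / pgd["weight"].sum()
print(f"crude estimate: {crude:.2f}, post-stratified estimate: {weighted:.2f}")
```

In practice the weighting variables, strata, and population margins would need to be justified against the research question, and a sensitivity analysis could repeat the estimate under alternative margins.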
Table 1. Approaches to Resolving Denominator Challenges in PGD

Sample representativeness and selection bias
  Illustrative case(s): Use of social media data for health monitoring where data are collected via keyword search.
  Corrective approaches (via design or analytic strategy): Select out a subsample that represents the PAR using probability sampling or other matching methods; conduct sensitivity analyses varying the denominator; control for potential confounders at an appropriate scale.
  Practical examples: Propensity score matching used to evaluate the effect of information on Twitter users11; assess the effect of participatory symptom contribution frequency on population burden estimates12; SES controlled for at the ZIP code level when assessing the relationship between online check-ins and obesity levels.13

Poorly characterized reference populations
  Illustrative case(s): Use of anonymized data, lacking demographic or sampling details, as an indicator of a health phenomenon, such as internet search data from Google Trends.
  Corrective approaches (via design or analytic strategy): Focus on concurrent validation against criterion standards, with the understanding that criteria are rarely available to compare against; utilize multiple datasets (including a criterion standard, when available) to establish concurrent validity; use Bayesian evidence synthesis to improve model performance; use supervised learning and some ground truth data to infer latent characteristics.
  Practical examples: Assess flu incidence from internet sources versus CDC data14; combine multiple flu proxy data sources to improve incidence estimates15; estimate the burden of influenza from hospital-reported case data16; infer demographics of Twitter users.17

Spatiotemporally inconsistent denominator
  Illustrative case(s): Data donation of steps or symptoms by members of an opt-in, distributed, online health community.
  Corrective approaches (via design or analytic strategy): Engage a cohort for cross-sectional or prospective evaluation where spatial and temporal frames can be identified; partner with PGD sources/platforms to elucidate denominator features; impute missing or outlier data.
  Practical examples: Estimate influenza burden by examining the contribution of participatory reports12; obtain geographic information on members of an online diabetes social network and compare with geographic patterns of app uptake to test for bias18; remove spurious spikes in influenza data.15

CDC, Centers for Disease Control and Prevention; PAR, population at risk; PGD, personally generated data.
Poorly Characterized Reference Populations
Even if recruitment, access, and other factors are
accounted for, there still can be multiple PAR options.
Complexity when choosing a reference population using
healthcare-based data has been discussed when assessing
cohort patterns for cerebral palsy (CP), for example.26
For a congenital disorder such as CP, in which newborns
and infants might logically be considered to constitute
the PAR, the birth cohort from which a case arises is
generally used as the denominator. However, poorly
defined or tracked birth cohorts may necessitate the use
of other denominators, such as all children aged <2
years residing in a particular area. Yet, if families of a
child with CP move and cluster around health centers
that are best suited for caring for children with CP,
prevalence estimates might skew higher; alternatively,
such health centers might be more adept at screening and
diagnosis, leading to a diagnostic access bias. Ultimately,
statistics calculated with different denominators will
answer slightly different questions and researchers must
select the most relevant estimate for their objectives.
Similarly, when using PGD, there will be multiple
options for specifying the reference population. For
example, investigators may elect to use all search queries,
all geotagged queries, or all queries by week, and these
choices may yield different estimates.15,27 Accordingly,
researchers will need to select and justify their choice of a
reference population based on their aims. Investigators
can evaluate their choice of denominator by comparing
estimates that reflect different underlying PARs or they
can benchmark PGD measures against criterion standards (assessing concurrent validity), as has been done for
influenza surveillance when anonymous Internet search
activity data are modeled against healthcare measures.14
Yet, criterion standards are not always available, particularly as PGD are often used to fill existing data gaps.
Other approaches, including triangulating PGD, may
improve estimates. For example, investigators might
compare influenza symptom data derived from Twitter
and a participatory surveillance system, where both
sources derive from the same geographic area and time
period. Participatory methods may still be informative even where extrapolation to the PAR and formulation of a denominator are not feasible, particularly where information about a health phenomenon is otherwise missing. Such numerator-only estimates gleaned through voluntary reporting may shed light on emerging phenomena that lack a practical reference population for benchmarking change (Table 1). In this case,
transparency around measurement and caveats on inference
are especially important.
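As a concrete illustration of benchmarking against a criterion standard, the following sketch (all weekly counts are invented) compares the concurrent validity of a search-based influenza proxy under two candidate denominators, in the spirit of the comparisons against healthcare-based measures cited above.14

```python
# Minimal sketch (hypothetical series): concurrent validity of a PGD signal
# under two denominator choices, benchmarked against a criterion standard.
import numpy as np

flu_queries = np.array([120, 150, 200, 340, 500, 430, 300, 180], float)
all_queries = np.array([10e3, 11e3, 10e3, 12e3, 13e3, 12e3, 11e3, 10e3])
geotagged_queries = np.array([2.0e3, 2.2e3, 1.9e3, 2.5e3, 2.8e3, 2.2e3, 2.1e3, 2.0e3])
criterion_ili = np.array([1.1, 1.4, 1.9, 3.2, 4.6, 4.0, 2.8, 1.7])  # e.g., % ILI visits

for name, denom in [("all queries", all_queries), ("geotagged queries", geotagged_queries)]:
    proxy = flu_queries / denom                   # PGD estimate under this denominator
    r = np.corrcoef(proxy, criterion_ili)[0, 1]   # concurrent validity as correlation
    print(f"denominator = {name}: r = {r:.2f}")
```

Reporting estimates under each candidate denominator, rather than a single preferred one, makes the sensitivity of conclusions to that choice explicit.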
Spatiotemporally Inconsistent Denominator
In healthcare-based data, prevalence-incidence bias, diagnostic vogue bias, and noncontemporaneous control bias are
all known contributors to a denominator that varies over
time or by place (Appendix Table 1, available online). In
traditional longitudinal cohort studies, inclusion time is
demarcated by directly observed benchmarks (e.g., date of
diagnosis or treatment). For PGD, the absence of structured
reporting formats or observation periods and the influence of
exogenous factors, such as tool availability, create subtle
spatiotemporal issues. Fixing a known and appropriate
spatiotemporal data frame is vital because the underlying,
relevant PAR could be dynamic, which affects interpretability, reproducibility, and prediction. For example, inclusion time is conditioned on availability of search engines,
operating systems, and technology standards. Shifts in app or
device popularity and privacy policies may affect data
availability and reporting. Views into these factors may be
obscured when data are proprietary, resulting in transparency concerns, such as those observed in pharmaceutical
trials.28 Although some statistics on Internet and mobile tool
use exist, they are not necessarily reproducible or consistent
over time, providing static views into dynamic patterns.
Other exogenous factors can also cause spatiotemporal
shifts. News regarding an outbreak can generate “spurious spikes” attributed to increased discussion/awareness (rather than experience of actual symptoms) in
incidence time series.5 This can affect the validity and
stability of disease incidence estimates generated from
close monitoring of Internet search activity patterns, a
common indicator of infectious disease activity.5,7 These
spikes may not reproducibly occur in every situation;
monitoring of malaria trends via Google search queries in
Thailand did not reveal “spurious spikes”—possibly due to
low media activity in the area. Variations in annual
influenza-related search incidence estimates drawn from
Google search queries reflect the lack of structure in PGD
and underscore the importance of understanding how the
dynamic nature of population reporting behaviors may affect
the denominator. As well, PGD may be influenced by the
condition of participants, analogous to the healthy participant effect. For example, associations between self-reported health status and willingness to share personal health information have been noted in studies of data sharing from web-based sources.29
Temporally inconsistent denominators can be
addressed by utilizing data from only individuals with
acceptable participation levels over a particular time
period, as done to understand incidence from participatory influenza reports.12 However, efforts must be made
to ensure unmeasured confounding is not introduced.
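A minimal sketch of this participation-based filtering, using invented report data and an assumed eligibility threshold, is shown below; it restricts the denominator to users with a consistent reporting history before computing weekly incidence.

```python
# Minimal sketch (hypothetical reports): restrict the denominator to users with
# an acceptable participation level before computing weekly incidence.
import pandas as pd

reports = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "c", "c", "c", "c"],
    "week": [1, 2, 3, 1, 3, 1, 2, 3, 4],
    "symptomatic": [0, 1, 0, 1, 0, 0, 0, 1, 0],
})

MIN_WEEKS = 3  # assumed participation threshold
weeks_reported = reports.groupby("user")["week"].nunique()
eligible = weeks_reported[weeks_reported >= MIN_WEEKS].index  # stable denominator

cohort = reports[reports["user"].isin(eligible)]
weekly_incidence = cohort.groupby("week")["symptomatic"].mean()
print(weekly_incidence)
```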
Spurious spikes have been distinguished from true
increases in illness burden by identifying the effects of
media or other exogenous factors unrelated to actual
disease dynamics. Comparing with reasonable growth
rates of disease allows investigators to distinguish excess
variation over time.5 Methods to reduce these spikes can
be developed after or simultaneous to data collection, as
has been done in monitoring dengue trends from
Internet search query data and influenza-like illness
trends from participatory reports.5,24
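The following sketch illustrates one simple screen of this kind, with invented counts and an assumed plausible growth bound; it flags and smooths week-over-week increases that exceed the bound. The published approaches5,24 are model-based and more sophisticated; this is only to make the idea concrete.

```python
# Minimal sketch (hypothetical series): flag weeks whose week-over-week growth
# exceeds an assumed plausible bound, as a screen for media-driven spikes.
import numpy as np

signal = np.array([100, 110, 125, 400, 150, 160, 175], float)  # PGD-derived counts
MAX_WEEKLY_GROWTH = 1.6  # assumed upper bound on a plausible weekly growth factor

growth = signal[1:] / signal[:-1]
spikes = np.where(growth > MAX_WEEKLY_GROWTH)[0] + 1
print("weeks flagged as possible spurious spikes:", spikes.tolist())

# One simple correction: replace flagged points by interpolating their neighbors.
cleaned = signal.copy()
for i in spikes:
    left, right = cleaned[i - 1], signal[min(i + 1, len(signal) - 1)]
    cleaned[i] = (left + right) / 2
print("cleaned series:", cleaned.tolist())
```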
CONCLUSIONS
Challenges common to both healthcare-based data and
PGD pertain to knowledge about PAR and specification
of a denominator. All surveillance data have challenges
that may lead to samples that do not represent the
population from which they are drawn. Excitement over
the potential for PGD to accelerate knowledge and
inform prevention can obscure the significance of these
challenges when using PGD—simply put, extra care is
needed to define PAR and a denominator. The commercial nature of many novel data types amplifies these
challenges. Ambitious new initiatives such as NIH’s call
for creation of a vast, volunteer cohort to donate data to
advance precision medicine will require exquisite attention to how PGD are invoked, sampled, compiled, and
used.4 Similar attention is vital to optimizing use of PGD
distilled from web platforms, search activity, and other
sources. Overall, in both healthcare-based data and PGD,
there are no “one-size-fits-all” solutions. For PGD as with
other forms of data, concerns about measurement,
inference, and extrapolation are mitigated when the PAR is carefully specified, limitations are addressed forthrightly, and confirmatory investigations are undertaken where feasible.
ACKNOWLEDGMENTS
This work is supported by grant R21 AA023901-01 from the
National Institute on Alcohol Abuse and Alcoholism at NIH (ERW
and RC), and grant IIS-1343968 from the National Science
Foundation (RC). Rumi Chunara, PhD and Elissa R. Weitzman,
ScD, MSc, conceived the paper and Rumi Chunara, PhD, Elissa
Weitzman, ScD, MSc, and Lauren Wisk, PhD, wrote the paper.
The authors thank Melanie Kenney for help with assembling
relevant literature.
No financial disclosures were reported by the authors of
this paper.
SUPPLEMENTAL MATERIAL
Supplemental materials associated with this article can be
found in the online version at http://dx.doi.org/10.1016/j.
amepre.2016.10.038.
REFERENCES
1. Lavallee DC, Chenok KE, Love RM, et al. Incorporating patient-reported outcomes into health care to engage patients and enhance
care. Health Aff (Millwood). 2016;35(4):575–582. http://dx.doi.org/
10.1377/hlthaff.2015.1362.
2. Chunara R, Smolinski MS, Brownstein JS. Why we need crowdsourced
data in infectious disease surveillance. Curr Infect Dis Rep. 2013;15
(4):316–319. http://dx.doi.org/10.1007/s11908-013-0341-5.
3. Ayers JW, Althouse BM, Dredze M. Could behavioral medicine lead
the web data revolution? JAMA. 2014;311(14):1399–1400. http://dx.
doi.org/10.1001/jama.2014.1505.
4. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J
Med. 2015;372(9):793–795. http://dx.doi.org/10.1056/NEJMp1500523.
5. Chan EH, Sahai V, Conrad C, Brownstein JS. Using web search query
data to monitor dengue epidemics: a new model for neglected tropical
disease surveillance. PLoS Negl Trop Dis. 2011;5(5):e1206. http://dx.
doi.org/10.1371/journal.pntd.0001206.
6. Chunara R, Bouton L, Ayers JW, Brownstein JS. Assessing the online
social environment for surveillance of obesity prevalence. PLoS One.
2013;8(4):e61373. http://dx.doi.org/10.1371/journal.pone.0061373.
7. Ginsberg J, Mohebbi MH, Patel RS, et al. Detecting influenza epidemics
using search engine query data. Nature. 2009;457(7232):1012–1014.
http://dx.doi.org/10.1038/nature07634.
8. Krogh-Jensen P. The denominator problem. Scand J Prim Health Care.
1983;1(2):53. http://dx.doi.org/10.3109/02813438309034933.
9. Schlaud M, Brenner M, Hoopmann M, Schwartz F. Approaches to the
denominator in practice-based epidemiology: a critical overview.
J Epidemiol Community Health. 1998;52(suppl 1):13S–19S.
10. CDC. Principles of Epidemiology in Public Health Practice, Third
Edition: An Introduction to Applied Epidemiology and Biostatistics.
Atlanta, GA: CDC; 2012.
11. Rehman N, Liu J, Chunara R. Using propensity score matching to
understand the relationship between online health information sources
and vaccination sentiment. Proceedings of Association for the
Advancement of Artificial Intelligence Spring Symposia; 2016 March
23–25, Stanford University, Stanford, CA. Palo Alto, CA: Association
for the Advancement of Artificial Intelligence, 2016.
12. Chunara R, Goldstein E, Patterson-Lomba O, Brownstein JS. Estimating influenza attack rates in the United States using a participatory
cohort. Sci Rep. 2015;5:9540. http://dx.doi.org/10.1038/srep09540.
13. Bai H, Chunara R, Varshney LR. Social capital deserts: obesity surveillance
using a location-based social network. Proceedings of the Data for Good
Exchange (D4GX), New York, September 28, 2015. https://dl.dropboxusercontent.com/u/389406195/web-Papers/PH_Varshney_55.pdf. Accessed November 20, 2016.
14. Santillana M, Nguyen AT, Dredze M, et al. Combining search, social
media, and traditional data sources to improve influenza surveillance.
PLoS Comput Biol. 2015;11(10):e1004513. http://dx.doi.org/10.1371/
journal.pcbi.1004513.
15. Santillana M, Zhang DW, Althouse BM, Ayers JW. What can digital
disease detection learn from (an external revision to) Google Flu
Trends? Am J Prev Med. 2014;47(3):341–347. http://dx.doi.org/
10.1016/j.amepre.2014.05.020.
16. Presanis AM, De Angelis D, Hagy A, et al. The severity of pandemic
H1N1 influenza in the United States, from April to July 2009: a
Bayesian analysis. PLoS Med. 2009;6(12):e1000207. http://dx.doi.org/
10.1371/journal.pmed.1000207.
17. Rao D, Yarowsky D, Shreevats A, Gupta M. Classifying latent user
attributes in Twitter. Proceedings of the 2nd International Workshop on
Search and Mining User-Generated Contents; 2010 Oct 30. Toronto,
ON, Canada. New York: Association for Computing Machinery, 2010:
37–44. http://dx.doi.org/10.1145/1871985.1871993.
18. Weitzman ER, Adida B, Kelemen S, Mandl KD. Sharing data for public
health research by members of an international online diabetes social
network. PLoS One. 2011;6(4):e19256. http://dx.doi.org/10.1371/journal.pone.0019256.
19. Hu SS, Balluz L, Battaglia MP, Frankel MR. Improving public health
surveillance using a dual-frame survey of landline and cell phone
numbers. Am J Epidemiol. 2011;173(6):703–711. http://dx.doi.org/
10.1093/aje/kwq442.
20. Pew Research Center. Social media update; 2014. www.pewinternet.org/files/2015/01/PI_SocialMediaUpdate20144.pdf. Accessed December 1, 2016.
21. Kutner M, Greenburg E, Jin Y, Paulsen C. The Health literacy of
America’s adults: results from the 2003 National Assessment of Adult
Literacy. NCES 2006-483. Washington, DC: National Center for
Education Statistics; 2006.
22. Smith A. African Americans and Technology Use: A Demographic
Portrait. Washington, DC: Pew Research Center; 2014.
23. Bajardi P, Vespignani A, Funk S, et al. Determinants of follow-up
participation in the internet-based European influenza surveillance
platform influenzanet. J Med Internet Res. 2014;16(3):e78. http://dx.
doi.org/10.2196/jmir.3010.
24. Smolinski MS, Crawley AW, Baltrusaitis K, et al. Flu near you:
crowdsourced symptom reporting spanning 2 influenza seasons. Am
J Public Health. 2015;105(10):2124–2130. http://dx.doi.org/10.2105/
AJPH.2015.302696.
25. Kjelsø C, Galle M, Bang H, Ethelberg S, Krause TG. Influmeter—an
online tool for self-reporting of influenza-like illness in Denmark. Infect
Dis (Lond). 2016;48(4):322–327. http://dx.doi.org/10.3109/23744235.
2015.1122224.
26. Paneth N, Hong T, Korzeniewski S. The descriptive epidemiology of
cerebral palsy. Clin Perinatol. 2006;33(2):251–267. http://dx.doi.org/
10.1016/j.clp.2006.03.011.
27. Morstatter F, Pfeffer J, Liu H, Carley KM. Is the sample good enough?
Comparing data from Twitter’s streaming API with Twitter’s Firehose.
Proceedings of the Seventh International Association for the Advancement of Artificial Intelligence Conference on Weblogs and Social
Media; 2013 July 8–12; Cambridge, Massachusetts. Palo Alto, CA:
Association for the Advancement of Artificial Intelligence, 2013.
28. Sykes R. Being a modern pharmaceutical company: involves making
information available on clinical trial programmes. BMJ. 1998;317
(7167):1172. http://dx.doi.org/10.1136/bmj.317.7167.1172.
29. Weitzman ER, Kelemen S, Kaci L, Mandl KD. Willingness to share
personal health record data for care improvement and public health: a
survey of experienced personal health record users. BMC Med
Inform Decis Mak. 2012;12(1):39. http://dx.doi.org/10.1186/
1472-6947-12-39.