
JNCI J Natl Cancer Inst (2015) 107(7): djv145
doi: 10.1093/jnci/djv145
First published online May 26, 2015
Editorial
Projecting the Benefits and Harms of Mammography Using
Statistical Models: Proof or Proofiness?
Barnett S. Kramer, Joann G. Elmore
Affiliations of authors: Division of Cancer Prevention, National Cancer Institute, Rockville, MD (BSK); University of Washington School of Medicine, Seattle, WA
(JGE).
Correspondence to: Barnett S. Kramer, MD, MPH, National Cancer Institute, Division of Cancer Prevention, 9609 Medical Center Drive, Room 5E410, Rockville, MD
20852 (e-mail: [email protected]).
Received: April 21, 2015; Accepted: April 23, 2015
Published by Oxford University Press 2015. This work is written by (a) US Government employee(s) and is in the public domain in the US.
On numbers:
“If you want to get people to believe something…just stick a number on it.” (1)
-Charles Seife, Proofiness
On statistical modeling:
“With four parameters I can fit an elephant, and with five I can
make him wiggle his trunk.” (2)
-John von Neumann
Statistical models are often used in medicine and public health
when there are important gaps in a body of empirical evidence
regarding the impact of interventions on health outcomes. The
models generally incorporate multiple parameters and variables
with uncertain values. For example, lacking firm evidence that
stage at diagnosis is a valid surrogate for a health outcome such
as mortality, statistical models produce projections based on
a chain of assumptions. Health outcomes are often projected
beyond available evidence from clinical trials, perhaps years or
decades into the future—a classic “out of sample” problem (3).
Such modeling requires assumptions, many of which are unobserved or even unobservable, such as progression rates of preclinical biological processes.
In this issue of the Journal, a team of very experienced modelers tackles an important question: What are the benefits and
harms of mammography screening after the age of 74 years? (4)
They conclude that the balance of benefits and harms of routine
screening mammography is likely to remain positive until about
age 90 years. To reach this conclusion, the authors employ three
complex statistical microsimulation models, a necessity given
that the well of reliable empirical evidence from randomized trials runs dry beyond the age of 74 years (5). The average reader
will lack the time, patience, or skill to dissect the three models
or their underlying assumptions, and so many will have to take
on faith the model outputs emphasized in the abstract, despite
the recognition by most modelers that identifying and studying
the uncertainties in the assumptions that drive the output is
as important as—and perhaps more important than—the actual
output. As telegraphed in the title of the paper, the methods of
estimating overdiagnosis are major drivers of the models.
A well-worn maxim from statistician George E. P. Box is that
“essentially, all statistical models are wrong, but some are useful” (6). Box’s maxim raises two key questions that are worth asking of any model: 1) How wrong is it? and 2) How useful is it?
How Wrong Are the Models?
Every forecasting model is prone to three major components
of uncertainty, as described by Nate Silver in The Signal and the
Noise: Why So Many Predictions Fail—but Some Don’t (7): 1) uncertainty in the initial condition (eg, variability in baseline risk of
breast cancer, drift in incidence trends), 2) structural uncertainty
(eg, imprecise knowledge about outcome utilities, dynamics of
subclinical disease progression, validity of intermediate endpoints such as tumor size or stage), and 3) scenario uncertainty
(eg, variation in screening mammography sensitivity and specificity among radiologists, drifts in therapy patterns and efficacy). The uncertainty in the third category increases over time
(see Figure 1 adapted from [7]). We have inverted and modified
Figure 1 to draw an analogy with stepping from the terra firma of
observed empirical evidence derived from clinical trials to wading into a figurative lake or pool of estimated data derived from
statistical modeling (Figure 2). As one wades into the water from
the shore of a lake, moving through longer time projections into
the future lives of patients, there is progressively less support
from firm evidence under foot. Suddenly, one loses contact with
the underlying empirical evidence. At that point, the swimmer
can no longer touch bottom and does not even know if the bottom is inches or many feet below.
Figure 1. Sources of model uncertainty. Adapted with permission from Nate Silver, The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t (7).
Figure 2. Wading into deep water.
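As a purely illustrative sketch (ours, not drawn from the models in [4]), the toy Monte Carlo projection below assigns arbitrary distributions to an initial-condition parameter (baseline mortality), a structural parameter (the true mortality reduction from screening), and a scenario parameter (annual drift in screening and therapy performance), and shows how the spread of projected outcomes widens as the horizon lengthens. Every number and distribution in it is an assumption chosen only to make that point.

```python
# Toy illustration (not any of the three models in [4]): how initial-condition,
# structural, and scenario uncertainty widen a projection as the horizon grows.
import random

random.seed(1)

def project_deaths_averted(horizon_years):
    """One random projection of deaths averted per 100,000 screened women,
    under arbitrary toy distributions for each source of uncertainty."""
    baseline_risk = random.gauss(300, 30)             # initial condition: deaths/100,000/year without screening
    mortality_reduction = random.uniform(0.10, 0.30)  # structural: true relative effect of screening
    drift = random.gauss(0.0, 0.01)                   # scenario: annual drift in therapy/screening performance
    averted = 0.0
    for year in range(horizon_years):
        effect = max(0.0, mortality_reduction + drift * year)  # scenario error compounds with time
        averted += baseline_risk * effect
    return averted

for horizon in (5, 15, 25):
    sims = sorted(project_deaths_averted(horizon) for _ in range(5000))
    lo, hi = sims[int(0.025 * len(sims))], sims[int(0.975 * len(sims))]
    print(f"{horizon:2d}-year horizon: 95% of projections span {lo:,.0f} to {hi:,.0f} "
          f"deaths averted per 100,000 (width {hi - lo:,.0f})")
```

Because the drift term compounds with each simulated year, the scenario component dominates the widening of the projection interval at long horizons, which is the pattern sketched in Figure 1.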
Any statistical model of a biological system bumps up
against the concept of chaos theory, in which predictions of outcome are difficult because of hypersensitivity to starting conditions. Models can become chaotic when two criteria hold: 1) the
system is dynamic (ie, there are feedback loops in which factors
influence each other, including tumor-microenvironment interactions); and 2) the processes follow exponential rather than
additive relationships (8). Most would agree that the intertwined
processes involved in breast cancer pathogenesis, progression,
detection, and treatment fulfill both criteria.
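A minimal, standard illustration of that hypersensitivity (a textbook exercise, not a cancer model) is the logistic map, a one-line feedback rule with a multiplicative rather than additive update; two starting values that differ by one part in a million diverge completely within a few dozen steps.

```python
# Sensitivity to starting conditions in a minimal nonlinear feedback system:
# the logistic map x_{t+1} = r * x_t * (1 - x_t). This is a textbook example,
# not a cancer model; it only illustrates the two criteria named above
# (feedback between factors and non-additive dynamics).
def logistic_trajectory(x0, r=3.9, steps=40):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.200000)   # two starting points that differ by
b = logistic_trajectory(0.200001)   # one part in a million

for t in (0, 10, 20, 30, 40):
    print(f"step {t:2d}: |difference| = {abs(a[t] - b[t]):.6f}")
```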
Van Ravesteyn and colleagues (4) tell the reader that all three models have been “validated,” providing evidence in their Figure 2 that they reproduce incidence data from the Surveillance, Epidemiology, and End Results (SEER) program over the period 1975 to 2000. This provides evidence of calibration rather than true validation. Leaving aside the fact that the most recent year of the comparison SEER data dates from about 15 years ago, one of the models consistently predicts lower breast cancer incidence than SEER over the entire period, and one overpredicts for the first 10 years and then underpredicts considerably. It is risky to
assume that even the third model, which approximates SEER
incidence relatively well, is fully “validated.” Even purposely
mis-specified models can be shown to fit existing datasets well
(9). John von Neumann’s quote is apropos here. It is also well
known that statistical models that are “validated” by showing
close correlation to previous economic downturns notoriously
fall short in predicting the next downturn.
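The distinction between calibration and validation can be made concrete with a small, deliberately artificial sketch: a mis-specified trend model (a sixth-degree polynomial, standing in for any over-parameterized model) is tuned to simulated “historical” incidence generated by a different process, fits the calibration period closely, and then misses badly out of sample. The data, the true curve, and the model below are all our own toy assumptions, not SEER data or any of the three models in [4].

```python
# Toy illustration of calibration vs validation: a deliberately mis-specified
# model fits the "historical" period well but extrapolates poorly.
import numpy as np

rng = np.random.default_rng(0)

years = np.arange(1975, 2001)          # "calibration" period
future = np.arange(2001, 2016)         # out-of-sample period

def true_incidence(year):
    # Assumed toy truth: a slow rise toward a plateau (cases per 100,000).
    return 250 + 150 / (1 + np.exp(-(year - 1990) / 5))

observed = true_incidence(years) + rng.normal(0, 5, size=years.size)

# "Calibrate" a deliberately wrong model: a 6th-degree polynomial trend.
x = (years - 1975) / 25.0
xf = (future - 1975) / 25.0
coefs = np.polyfit(x, observed, deg=6)

in_sample_mae = np.mean(np.abs(np.polyval(coefs, x) - true_incidence(years)))
out_sample_mae = np.mean(np.abs(np.polyval(coefs, xf) - true_incidence(future)))

print(f"mean absolute error, 1975-2000 (calibration period): {in_sample_mae:.1f}")
print(f"mean absolute error, 2001-2015 (out of sample):      {out_sample_mae:.1f}")
```

Reproducing the calibration period is necessary but not sufficient; it is the out-of-sample error that genuine validation would have to probe.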
The van Ravesteyn article also reports substantial differences
among the models in estimates of age-specific overdiagnosis
(Figure 3, A-C, and Table 3 of [4]), a major pillar of the models.
There is active debate in the field (9–11) about the appropriate
methods of overdiagnosis estimation using models that attempt
to adjust for lead times, as in the three models used in van
Ravesteyn et al. (4). These methods can lead to underestimation of the frequency of overdiagnosis. Any such underestimation would inflate the projected benefits of breast cancer screening over time in the van Ravesteyn et al. models.
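To see why the assumed natural history matters so much, consider a back-of-the-envelope sketch (ours, not a calculation from [4]): under a simplified view in which a screen-detected cancer counts as overdiagnosed when death from another cause precedes the date the cancer would have surfaced clinically, and with toy exponential distributions for both waiting times, the overdiagnosed fraction reduces to the mean sojourn time divided by the sum of the mean sojourn time and the remaining life expectancy. The unobservable mean sojourn (lead) time therefore moves the estimate substantially; all numbers below are assumptions for illustration.

```python
# Back-of-the-envelope sketch (ours, not from reference 4): under a simplified
# natural-history view, a screen-detected cancer is overdiagnosed if death from
# another cause comes before the date the cancer would have surfaced clinically.
# With toy exponential distributions for both waiting times, the implied
# overdiagnosed fraction is mean_sojourn / (mean_sojourn + remaining_life).
# Non-progressive lesions, which would only add to overdiagnosis, are ignored.

def overdiagnosis_fraction(mean_sojourn_years, remaining_life_years):
    """P(other-cause death before clinical surfacing) for exponential waiting times."""
    return mean_sojourn_years / (mean_sojourn_years + remaining_life_years)

for remaining_life in (12.0, 8.0, 5.0):      # toy remaining life expectancies at older ages
    for mean_sojourn in (1.0, 2.0, 4.0):     # assumed mean preclinical sojourn (lead) time
        frac = overdiagnosis_fraction(mean_sojourn, remaining_life)
        print(f"remaining life {remaining_life:4.1f} y, mean sojourn {mean_sojourn:.1f} y "
              f"-> overdiagnosed fraction of screen-detected cancers ~ {frac:.0%}")
```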
How Useful Are the Models?
Policy makers and clinicians should use models only if they understand what goes on inside the “black box” and the potential limitations of extrapolation beyond observed and observable evidence. This is a tall order, given the issues we raised above.
However, the models are useful in that they afford insights into the important role of informed decision-making in breast cancer screening. The biggest driver of personal decisions regarding cancer screening is likely to be personal values. Van Ravesteyn et al. (4)
recognize this, stating that divergence of individual preferences
from their assumed values is the most important drawback of
their models. Given that challenge, an important area of future
research is learning how best to incorporate patients’ values
into informed decision-making, with or without models.
In summary, the models presented in the van Ravesteyn
study (4) provide important insights and new directions for
research. However, direct application to policy and clinical practice remains a challenge. We need to better gauge the depth of
the lake and to avoid the pseudo-precision that proofiness can
convey.
Notes
Opinions expressed in this manuscript are those of the authors
and do not necessarily represent the opinions or official positions of the US Department of Health and Human Services or
the US National Institutes of Health.
References
1. Seife C. Proofiness: The Dark Arts of Mathematical Deception. New York, NY: Penguin Group; 2010:295.
2. Attributed to John von Neumann by Enrico Fermi, as quoted by F. Dyson. Turning points: A meeting with Enrico Fermi. Nature. 2004;427:297.
3. Silver N. Out of sample, out of mind: a formula for failed prediction, in The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t. New York, NY: Penguin Group; 2012:44.
4. van Ravesteyn NT, Stout NK, Schechter CB, et al. Benefits and harms of mammography screening after age 74 years: model estimates of overdiagnosis. J
Natl Cancer Inst. 2015;107(7):djv103 doi:10.1093/jnci/djv103.
5. Gøtzsche P, Jørgensen K. Screening for breast cancer with mammography (Review). Cochrane Database Syst Rev. 2013;(6):1–81.
6. Box G, Draper N. Empirical Model-Building and Response Surfaces. New York, NY: John Wiley & Sons; 1987.
7. Silver N. A climate of healthy skepticism, in The Signal and the Noise: Why
So Many Predictions Fail—But Some Don’t. New York, NY: Penguin Group;
2012:370–411.
8. Silver N. The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t. New York, NY: Penguin Group; 2012:534.
9. Baker S, Prorok P, Kramer B. Lead time and overdiagnosis. J Natl Cancer Inst.
2014;106(12).
10. Zahl P-H, Jørgensen K, Gøtzsche P. Lead-time models should not be used to estimate overdiagnosis in cancer screening. J Gen Intern Med. 2014;29(9):1283–1286.
11. Zahl P-H, Jørgensen K, Gøtzsche P. Overestimated lead times in cancer screening has led to substantial underestimation of overdiagnosis. Br J Cancer. 2013;109:2014–2019.